feat(prover-node): checkpoint-driven optimistic proving #23002
PhilWindle wants to merge 6 commits into
Conversation
Force-pushed 087e60b to f181313
Force-pushed f181313 to 3694a9b
…pair

Introduces a sub-tree + top-tree orchestrator pair that decomposes the existing single-class proving orchestrator along the natural state-coupling boundary — per-checkpoint block-level work vs. epoch-level top-tree work — while leaving every existing API on the legacy `EpochProver` / `ProvingOrchestrator` / `EpochProvingState` path untouched. The prover-node and e2e tests build unchanged; this PR is purely additive in surface area, with structural refactors on `ProvingOrchestrator` to share scheduling and top-tree drivers with the new `TopTreeOrchestrator`. Split out from #22990 so it can land independently.

## What's new

- **`CheckpointSubTreeOrchestrator`** (`checkpoint-sub-tree-orchestrator.ts`): extends `ProvingOrchestrator`, single-checkpoint by construction. Drives chonk-verifier / base / merge / block-root / block-merge for one checkpoint and resolves a `SubTreeResult` instead of escalating to the checkpoint root — the parent's `checkAndEnqueueCheckpointRootRollup` is overridden to short-circuit. The constructor calls `super.startNewEpoch(epoch, 1, empty challenges)` to set up a single-checkpoint mini-epoch; the count and challenges are never read because the override prevents the parent's finalize / root path from running.
- **`TopTreeOrchestrator`** + **`TopTreeProvingState`**: self-contained driver from checkpoint-root through epoch-root rollup. Takes per-checkpoint block-proof promises and pipelines its hint chain against them. Cancellation surfaces as `TopTreeCancelledError` so callers can distinguish reorg-driven cancel from a genuine proving failure.
- **`EpochProvingContext`** (`epoch-proving-context.ts`): per-epoch shared cache for chonk-verifier proofs. Survives sub-tree cancellation so a tx that gets reorged out and re-appears in a replacement checkpoint reuses the cached proof.
- **`ProvingScheduler`** (`proving-scheduler.ts`): abstract base owning the `SerialQueue` deferred-job lifecycle, the `pendingProvingJobs` controller list, and a unified `deferredProving<S, T>(state, request, callback, isCancelled?)` submit envelope. The minimal `ProvingStateLike` contract is just `verifyState()` + `reject(reason)`.
- **`TopTreeProvingScheduler`** (`top-tree-proving-scheduler.ts`): extends `ProvingScheduler` and holds the checkpoint-merge, padding, and root-rollup drivers (plus tree-walking helpers) shared by both orchestrators. Wraps circuit calls via a `wrapCircuitCall` hook (orchestrator overrides for spans; top-tree leaves identity) and resolves via an `onRootRollupComplete` hook to bridge the two states' differing `resolve` signatures. The per-checkpoint root driver stays subclass-specific because input-building flows differ.
- **`EpochProverFactory` interface on `ProverClient`**: new factory methods `createEpochProvingContext(epochNumber)`, `createCheckpointSubTreeOrchestrator(...)`, and `createTopTreeOrchestrator()`. A single shared `BrokerCircuitProverFacade` is owned by `ProverClient` and shared across every orchestrator.

## What changes in existing code

- `ProvingOrchestrator` extends `TopTreeProvingScheduler`; the inline broker-job submit envelope, queue lifecycle, and the top-tree-section drivers are inherited. `cancel()` delegates the queue-recreate + abort-jobs logic to `resetSchedulerState(this.cancelJobsOnStop)`. Three internal methods (`getOrEnqueueChonkVerifier`, `checkAndEnqueueBaseRollup`, `checkAndEnqueueCheckpointRootRollup`) become `protected` so the sub-tree can override them; `provingState` and `provingPromise` likewise become `protected` so the sub-tree can hook the parent's failure stream onto `subTreeResult`. No public API change on `ProvingOrchestrator`.
- `CheckpointProvingState`: gains two read-only accessors used by the sub-tree's checkpoint-root override — `getSubTreeOutputProofs()` and `getLastArchiveSiblingPath()`. No state changes.
- `ProverClient` keeps `createEpochProver()` exactly as before (each call spawns its own `BrokerCircuitProverFacade`); the new factory methods share a `getFacade()` set up in `start()` and torn down in `stop()`. `EpochProver`, `EpochProverManager`, `ServerEpochProver`, `EpochProvingState`, the integration tests in `orchestrator_*.test.ts`, `bb_prover_full_rollup.test.ts`, and `stdlib/interfaces/*` are all unchanged from `merge-train/spartan` — the prover-node and e2e tests continue to build against the existing `EpochProver` API. Migrating the prover-node onto the new factories (and the deferred-finalize flow that goes with optimistic proving) is the follow-up PR.

## Test plan

- [x] 261 prover-client tests pass (full `yarn workspace @aztec/prover-client test`).
- [x] `yarn build` clean against current merge-train/spartan (modulo the pre-existing `@aztec/sqlite3mc-wasm` issue inherited from baseline).
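To make the scheduler contract concrete, here is a minimal sketch of the `deferredProving` submit envelope and the `ProvingStateLike` interface described under "What's new" above; the `AbortController` bookkeeping is an assumption and the `SerialQueue` wiring is omitted, so this is a sketch rather than the real implementation.

```ts
// Hedged sketch only: field names and the AbortController bookkeeping are
// illustrative; the real class also owns the SerialQueue deferred-job lifecycle.
interface ProvingStateLike {
  /** Returns false once this state has been cancelled or superseded. */
  verifyState(): boolean;
  /** Fails the whole proving run with the given reason. */
  reject(reason: string): void;
}

abstract class ProvingScheduler {
  protected pendingProvingJobs: AbortController[] = [];

  /** Submit a circuit request, skipping or dropping it if the owning state dies. */
  protected deferredProving<S extends ProvingStateLike, T>(
    state: S,
    request: (signal: AbortSignal) => Promise<T>,
    callback: (result: T) => Promise<void> | void,
    isCancelled?: () => boolean,
  ): void {
    if (!state.verifyState() || isCancelled?.()) {
      return; // cancelled before the job was ever enqueued
    }
    const controller = new AbortController();
    this.pendingProvingJobs.push(controller);
    void request(controller.signal)
      .then(result => {
        // Re-check validity: the state may have been cancelled while proving.
        if (state.verifyState() && !isCancelled?.()) {
          return callback(result);
        }
      })
      .catch(err => state.reject(String(err)));
  }
}
```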
Force-pushed da67473 to 7342efc
Force-pushed fd62ad7 to 093fdde
Drives the prover-node onto the split `CheckpointSubTreeOrchestrator` +
`TopTreeOrchestrator` pair, with checkpoint-driven proving that pipelines
sub-trees against tx-gathering and the top-tree against the in-flight
sub-trees.
## What's new
### `prover-node` — `EpochProvingJob` job-model rewrite
`EpochProvingJob` becomes an orchestrator over a `Map<string, CheckpointJob>`
keyed by `${number}:${slot}`. Each `CheckpointJob` owns a single
`CheckpointSubTreeOrchestrator` with its own per-checkpoint context (txs,
attestations, previous-block header, l1ToL2 messages, archive sibling path).
A `TopTreeJob` drives the epoch root rollup once all checkpoint sub-trees
have started block-level proving.
Public API:
- `registerCheckpoint` — synchronous; sets up sub-tree, kicks off chonk-verifier
cache fill, attaches the `blockProofs` promise to the eventual top-tree job.
- `provideTxs` — supplies simulated txs, transitions the checkpoint job from
registered → block-proving.
- `removeCheckpoint` — synchronous, idempotent; drops a single checkpoint by
  `(number, slot)` and fire-and-forget cancels its sub-tree. Tolerates re-adding
  the same checkpoint number under a different slot.
- `removeCheckpointsAfter`, `getCheckpointCount`, `getCheckpointNumbers`,
`cancelPendingCheckpoints`.
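A hedged sketch of that public surface, with stand-in parameter types (`CheckpointId`, `unknown[]` txs); the real signatures live in the prover-node package and may differ:

```ts
// Stand-in types; `CheckpointId` and the exact parameter shapes are assumptions.
type CheckpointId = { number: number; slot: bigint };

interface EpochProvingJobApi {
  /** Synchronous: creates the sub-tree and kicks off the chonk-verifier cache fill. */
  registerCheckpoint(checkpoint: CheckpointId): void;
  /** Supplies simulated txs; moves the checkpoint from registered to block-proving. */
  provideTxs(checkpoint: CheckpointId, txs: unknown[]): void;
  /** Synchronous and idempotent; fire-and-forget cancels the checkpoint's sub-tree. */
  removeCheckpoint(checkpoint: CheckpointId): void;
  /** Drops every registered checkpoint whose number sits above the threshold. */
  removeCheckpointsAfter(threshold: number): void;
  getCheckpointCount(): number;
  getCheckpointNumbers(): number[];
  cancelPendingCheckpoints(): void;
}
```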
### `prover-node` — `L2BlockStream`-driven checkpoint pipeline
The prover-node consumes `chain-checkpointed` / `chain-pruned` events from an
`L2BlockStream` rooted at the first block of the first unproven epoch. On
each `chain-checkpointed`:
1. Resolve the epoch via `getEpochAtSlot`.
2. Get-or-create the per-epoch `EpochProvingJob`.
3. Detached-task gather txs + register the checkpoint with the job.
On `chain-pruned`: call `removeCheckpointsAfter(threshold)` on every job
whose first checkpoint sits at or above the threshold. Pending gather tasks
are cancelled via `AbortSignal`.
Finalization is driven by the union of three signals: epoch-monitor sees
the epoch close on L1, a checkpoint for a strictly later epoch arrives, or
all expected checkpoints (per archiver) are registered while the epoch is
already complete on L1.
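Condensed, the pipeline above amounts to roughly the following control flow. Every helper name (`getOrCreateEpochJob`, `gatherAndRegister`, `abortPendingGatherTasks`) and the event field names are placeholders, and the job type reuses the `EpochProvingJobApi` sketch from earlier:

```ts
// Control-flow sketch only; names and event fields are placeholders.
type ChainEvent =
  | { type: 'chain-checkpointed'; checkpointNumber: number; slot: bigint }
  | { type: 'chain-pruned'; checkpointNumber: number };

interface PipelineDeps {
  getEpochAtSlot(slot: bigint): Promise<number>;
  getOrCreateEpochJob(epochNumber: number): Promise<EpochProvingJobApi>;
  gatherAndRegister(job: EpochProvingJobApi, checkpointNumber: number, signal: AbortSignal): Promise<void>;
  epochJobs: Map<number, EpochProvingJobApi>;
  abortPendingGatherTasks(): void;
}

async function handleChainEvent(event: ChainEvent, deps: PipelineDeps): Promise<void> {
  if (event.type === 'chain-checkpointed') {
    const epochNumber = await deps.getEpochAtSlot(event.slot);   // step 1
    const job = await deps.getOrCreateEpochJob(epochNumber);     // step 2
    // Step 3: detached task that gathers txs, then registers the checkpoint.
    const abort = new AbortController();
    void deps.gatherAndRegister(job, event.checkpointNumber, abort.signal);
  } else {
    // Prune: drop the invalidated suffix from affected jobs and cancel pending gathers.
    for (const job of deps.epochJobs.values()) {
      job.removeCheckpointsAfter(event.checkpointNumber);
    }
    deps.abortPendingGatherTasks();
  }
}
```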
### `prover-node` — reorg-after-finalization restart
When the L2BlockStream emits a prune that retroactively invalidates an
epoch already in finalize, the prover-node aborts the in-flight publish,
clears the job, and restarts proving from the new tip.
### e2e
- New `epochs_optimistic_proving.parallel.test.ts`: full e2e covering the
pipelining, replacement-checkpoint reuse, and reorg-during-proving paths.
- `epochs_proof_fails`, `epochs_upload_failed_proof`, `epochs_long_proving_time`,
`epochs_multi_proof` updated to assert the new in-flight epoch behaviour.
## What's removed
`EpochProver` interface and `ServerEpochProver` are removed: the prover-node
no longer drives a single-class epoch prover, so the legacy API has no
production callers. `ProvingOrchestrator` survives only as a base class for
`CheckpointSubTreeOrchestrator` and as the single-class driver used by
`prover-client`'s integration tests; it no longer implements `EpochProver`.
## Test plan
- `yarn workspace @aztec/prover-client test` — 261 tests pass.
- `yarn workspace @aztec/prover-node test` — 89 tests pass.
- e2e tests covering optimistic proving, reorgs during proving, failed
proof publish, and multi-checkpoint flows are included in this PR.
Force-pushed 093fdde to 81dc825
spalladino left a comment:
Got halfway through it, but looks good so far. Pasting here some observations by Claude that I didn't get to review myself:
- Same-epoch replacement checkpoints are silently dropped after `completeEpoch()`. `prover-node.ts:213-220` skips any checkpoint for an epoch whose job has `isEpochComplete()`. Sequence: (1) all checkpoints registered, `completeEpoch()` fires; (2) prune removes the suffix; (3) the L1 reorg replacement for the same epoch arrives. The replacement is dropped because `existingJob.isEpochComplete()` is true. The finalize loop then restarts against an incomplete prefix of the epoch and eventually publishes a proof for a checkpoint range that does not match the canonical L1 chain. This contradicts the stated "orphan + replacement coexist briefly" model. The check should be conditioned on the surviving checkpoint set, not on the boolean flag.
- A prune that lands after `topTreeJob.start()` but during `publishProof` is not interceptable. `epoch-proving-job.ts:478` clears `this.topTreeJob` immediately after the prove resolves; the publish at `epoch-proving-job.ts:487` operates on the captured `snapshot` array. Meanwhile `removeCheckpointsAfter` (`epoch-proving-job.ts:332-369`) only acts on the live `topTreeJob` — which is now `undefined` — so the publish proceeds with stale data and the (out-of-date) attestations from `snapshot.at(-1)`. Best case L1 rejects the proof (gas burn, retry confusion); worst case it lands against a window the canonical chain has since invalidated. Either re-validate the snapshot's checkpoint set (by id) against the live registry just before submitting, or hold the top-tree-job reference until publish completes and route prune through it.
- EpochMonitor permanently records "processed" on a no-op `tryCompleteEpoch`. `prover-node.ts:145-153` returns `true` from `handleEpochReadyToProve` whenever `tryCompleteEpoch` doesn't throw — even if it bailed at `prover-node.ts:394` (archiver says incomplete) or `prover-node.ts:399-405` (not all checkpoints known to the job). `epoch-monitor.ts:93` then sets `latestEpochNumber = epochToProve`, and `epoch-monitor.ts:76-79` ensures the same epoch is never re-considered. In a healthy network the next checkpoint event for a later epoch reseeds via the tertiary signal, but if the epoch in question is the last produced (network stall, end-of-test) it stays stuck forever. `tryCompleteEpoch` should return whether `completeEpoch()` was actually invoked, and `handleEpochReadyToProve` should propagate that.
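For the last point, the suggested fix could look roughly like this; the `hasAllExpectedCheckpoints` helper and the flattened signatures are assumptions, not the real prover-node code:

```ts
// Sketch of the suggested fix: only report success when completeEpoch() actually ran,
// so the EpochMonitor does not permanently mark a no-op attempt as processed.
declare const archiver: { isEpochComplete(epochNumber: number): Promise<boolean> };
declare const epochJobs: Map<
  number,
  { hasAllExpectedCheckpoints(): boolean; completeEpoch(): Promise<void> }
>;

async function tryCompleteEpoch(epochNumber: number): Promise<boolean> {
  if (!(await archiver.isEpochComplete(epochNumber))) {
    return false; // archiver says the epoch is still open; nothing was done
  }
  const job = epochJobs.get(epochNumber);
  if (!job || !job.hasAllExpectedCheckpoints()) {
    return false; // not every checkpoint is known to the job yet
  }
  await job.completeEpoch();
  return true;
}

async function handleEpochReadyToProve(epochNumber: number): Promise<boolean> {
  // Propagate the result: a false return keeps the epoch eligible for re-consideration.
  return tryCompleteEpoch(epochNumber);
}
```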
// First, update the local tips store so the stream can track our state.
await this.tipsStore.handleBlockStreamEvent(event);
Note that by updating the store first, any error in the handle methods below will not be retried in the next iteration of the L2BlockStream. Not sure if this is intentional.
Yes it should go at the end I think.
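In other words, something like the following ordering, where `handleChainEvent` is a hypothetical stand-in for the handlers that currently run after the store update:

```ts
// Sketch of the agreed ordering: run the handlers first, persist tips last, so a
// handler failure is retried on the next L2BlockStream iteration.
await this.handleChainEvent(event);                 // hypothetical handler name
await this.tipsStore.handleBlockStreamEvent(event); // moved to the end
```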
// (e.g. an early proof submission landed for an earlier epoch that ends mid-way
// through this epoch's slot range) does not prove the entire epoch. We need every
// checkpoint of a partially-proven epoch to feed the orchestrator.
if (await this.isEpochFullyProven(epochNumber, l1Constants)) {
We should check in the archiver and blockstream what happens first: do checkpoints get emitted, or does the proven block number get updated? I'm worried that, when starting a long sync after offline for a while, all the checkpoints are emitted first and then the proven tip is updated, so the prover node would trigger a ton of jobs only to cancel them immediately afterwards.
Ah just saw the compute starting block number for the blockstream. And as long as the prover node waits for the archiver initial sync to be completed before starting, we should be good I guess.
 * to the job for finalization. If checkpoints are still being delivered by the
 * L2BlockStream, the handoff happens later when the last one arrives.
 */
async handleEpochReadyToProve(epochNumber: EpochNumber): Promise<boolean> {
I think we can now replace the entire epoch monitor with a running promise that just calls l2BlockSource.isEpochComplete periodically.
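A rough version of that replacement loop, with placeholder names and poll interval:

```ts
// Placeholder polling loop standing in for the EpochMonitor, per the suggestion above.
declare const l2BlockSource: { isEpochComplete(epochNumber: number): Promise<boolean> };

async function watchEpochCompletion(
  epochNumber: number,
  onComplete: (epochNumber: number) => Promise<void>,
  signal: AbortSignal,
  pollIntervalMs = 12_000,
): Promise<void> {
  while (!signal.aborted) {
    if (await l2BlockSource.isEpochComplete(epochNumber)) {
      await onComplete(epochNumber);
      return;
    }
    await new Promise(resolve => setTimeout(resolve, pollIntervalMs));
  }
}
```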
// Add to an existing job, or create a new one if allowed.
let job = existingJob;
if (!job) {
  const { proverNodeMaxPendingJobs: maxPendingJobs } = this.config;
  if (maxPendingJobs > 0 && this.jobs.size >= maxPendingJobs) {
    this.log.debug(
      `Skipping checkpoint ${checkpoint.number} for epoch ${epochNumber}: max pending jobs ${maxPendingJobs} reached`,
    );
    return;
  }
  job = await this.createEpochJob(epochNumber);
}
Won't this create a hole in a given epoch? I don't see another code path where we patch an epoch job with checkpoint jobs we failed to add due to lack of capacity. I'm worried about a scenario where we don't add a checkpoint to an epoch, but then a prior epoch finishes, so we do add the next, and that epoch job ends up waiting for that missing checkpoint forever.
Speaking of which: is there anything that cleans up very old jobs when the epoch proof submission window closes?
Yes, was going to ask about this. As it stands here there could be problems if this is set. Do you think this is needed or makes sense anymore? Proving is more of a continuous process rather than a set of large jobs.
const taskId = `${checkpoint.number} - ${checkpoint.slot}`;
const task = this.gatherAndAddCheckpoint(job, checkpoint, epochNumber, abortSignal);
this.pendingGatherTasks.set(taskId, task);
Maybe it makes sense to move the responsibility of collecting their data to the checkpoint jobs, to keep the prover node leaner?
I think it does. As a further refactor, I want to break the checkpoint proving out of EpochProvingJob. Then we maintain a collection of checkpoint jobs being proven. EpochProvingJob then becomes a collection of references to a subset of those checkpoints and a top tree. This would enable us to create instances of EpochProvingJob with arbitrary checkpoints (within a valid range, of course), enabling efficient partial epoch proving.
Looks comprehensive. I'd add tests for cross-chain messages as well, those are usually a source of problems.
const wallClockEpoch = Number(test.epochCache.getEpochAndSlotNow().epoch);
const job = proverNode.epochJobs.get(wallClockEpoch);
if (job && job.getCheckpointCount() > 0) {
  observed.add(wallClockEpoch);
}
Heads up this should pass even without optimistic proving: prover would start as soon as the last checkpoint of the epoch was pushed to L1, which could happen a few L1 slots before the actual end of the epoch. Maybe we should offset this a bit so the check is more meaningful, or check for specific checkpoint jobs vs all jobs in the epoch?
// Wait for the 2nd checkpoint within this epoch.
const initialCheckpoint = (await test.monitor.run(true)).checkpointNumber;
const midCheckpoint = CheckpointNumber(initialCheckpoint + 2);
await test.waitUntilCheckpointNumber(midCheckpoint, L2_SLOT_DURATION_IN_S * 6);
const checkpointBeforeReorg = test.monitor.checkpointNumber;
logger.info(`Reached checkpoint ${checkpointBeforeReorg}`);
I'd also add an assertion that the prover actually started the job for the given checkpoint before reorging it out. Otherwise, this test could pass if the prover just takes a bit more time to start assembling the checkpoint job.
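For example, something along these lines before triggering the reorg; `epochOfCheckpoint` is a stand-in for however the test resolves the checkpoint's epoch:

```ts
// Hypothetical assertion: confirm the prover actually picked up the checkpoint
// before it gets reorged out, so the test cannot pass on timing alone.
const jobForEpoch = proverNode.epochJobs.get(epochOfCheckpoint);
expect(jobForEpoch?.getCheckpointNumbers()).toContain(Number(checkpointBeforeReorg));
```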
// Wait for 2 checkpoints mid-epoch.
const initialCheckpoint = (await test.monitor.run(true)).checkpointNumber;
const midCheckpoint = CheckpointNumber(initialCheckpoint + 2);
await test.waitUntilCheckpointNumber(midCheckpoint, L2_SLOT_DURATION_IN_S * 6);
const checkpointBeforeReorg = test.monitor.checkpointNumber;
logger.info(`Reached checkpoint ${checkpointBeforeReorg}`);
This should also assert that both checkpoints landed on the same epoch, which could not be the case if this test started near the end of an epoch.
We should also add txs. And it'd be interesting to see a case where one of the reorgs means that a tx that was mined in a checkpoint is now mined in a different one.
const proverNode = test.proverNodes[0].getProverNode() as TestProverNode;
const proverManager = proverNode.getProver();
const origCreateTopTree = proverManager.createTopTreeOrchestrator.bind(proverManager);
let releaseProvingGate: () => void = () => {};
const provingGate = new Promise<void>(resolve => {
  releaseProvingGate = resolve;
});
proverManager.createTopTreeOrchestrator = () => {
  const topTree = origCreateTopTree();
  const origProve = topTree.prove.bind(topTree);
  topTree.prove = async (...args: Parameters<typeof origProve>) => {
    logger.warn('Top-tree proving gated — waiting for test to release');
    await provingGate;
    logger.warn('Proving gate released');
    return origProve(...args);
  };
  return topTree;
};
I'm wondering if there's a cleaner way of doing this, perhaps reusing the EpochProvingJobHooks? Not sure if worth stressing over it though.
if (snapshot.length === 0) {
  throw new Error(`Cannot finalize epoch ${this.epochNumber}: no surviving checkpoints`);
}
Is there a race condition where a reorg removes all checkpoints from an epoch and then new ones are immediately added again, but we hit this case before the new ones get added? Seems like something that would never happen in prod, but could affect e2e tests that have short epochs.
There shouldn't be. The prover node does this in its prune handler.
if (job.getCheckpointCount() === 0) {
  this.log.info(`Cancelling epoch ${epochNum} job — all checkpoints pruned`);
  await this.cancelAndCleanupJob(epochNum, job);
}
So that epoch job instance is removed and replaced with another when a new checkpoint arrives for the same epoch number.
I will add a test though
// Capture proving data *before* we clear the live job map, so a failure-upload
// attempt that runs after teardown still has the per-checkpoint txs / messages /
// previous-block-header to work with.
if (!this.provingDataSnapshot) {
  try {
    this.provingDataSnapshot = this.buildProvingDataSnapshot();
  } catch {
    // No completed checkpoints — failure-upload will have nothing to upload, which
    // is fine. Snapshot stays undefined and getProvingData will rebuild and rethrow.
  }
How about we move the responsibility of uploading this data to the epoch-proving-job itself, so we don't need to collect this just for the prover-node to maybe read if it's configured for uploads?
/** Stable identifier: `${checkpoint number}:${slot}`. Used as the parent's map key. */
public static idFor(checkpoint: Checkpoint): string {
  return `${checkpoint.number}:${checkpoint.header.slotNumber}`;
I don't think it makes a difference based on how reorgs are handled, but I'd throw in the checkpoint archive as part of the identifier, just in case
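If the archive were folded in, the identifier might become something like this; `checkpoint.archive.root` is a guess at the actual field path:

```ts
// Variant of idFor that includes the checkpoint archive, per the suggestion above.
// `checkpoint.archive.root` is a guess at the real field path.
public static idFor(checkpoint: Checkpoint): string {
  return `${checkpoint.number}:${checkpoint.header.slotNumber}:${checkpoint.archive.root}`;
}
```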
const publicTxs = allTxs.filter(tx => tx?.data.forPublic);
if (publicTxs.length > 0) {
  this.deps.log.verbose(
    `Kicking off ${publicTxs.length} chonk-verifier circuits for checkpoint ${this.checkpoint.number}`,
    {
      checkpointNumber: this.checkpoint.number,
      publicTxCount: publicTxs.length,
    },
  );
  await this.subTree.startChonkVerifierCircuits(publicTxs);
I know this didn't change with this PR, but for private-only txs, who verifies the chonk proof?
The private base rollup. Public txs have a different process. They require a dedicated chonk verifier circuit.
public start(): Promise<TopTreeProof> {
  void this.run();
I take it it's responsibility of the caller to check that all dependencies are ready, right?
Which dependencies are you referring to? At the point this class is constructed, its arguments contain all the information required to start the proving process (e.g. compute blob challenges). If the checkpoint root input proofs aren't ready yet, that is fine; the orchestrator fires the appropriate events as and when they are.
I think this is a great idea! I'd consider extending it to Base/Root parity circuits too because they could benefit from the same caching behaviour across reorged checkpoints
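A generic flavour of that caching, as a sketch only; the types and key derivation are illustrative, and the real `EpochProvingContext` currently caches chonk-verifier proofs:

```ts
// Illustrative per-epoch proof cache keyed by circuit kind + input hash, so a
// reorged checkpoint re-submitting identical inputs reuses the earlier result.
class EpochProofCache<TProof> {
  private readonly cache = new Map<string, Promise<TProof>>();

  getOrProve(circuit: string, inputHash: string, prove: () => Promise<TProof>): Promise<TProof> {
    const key = `${circuit}:${inputHash}`;
    let entry = this.cache.get(key);
    if (!entry) {
      entry = prove();
      this.cache.set(key, entry);
      // Drop failed attempts so a later retry is not poisoned by a transient error.
      entry.catch(() => this.cache.delete(key));
    }
    return entry;
  }
}
```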
Drives the prover-node onto the split `CheckpointSubTreeOrchestrator` + `TopTreeOrchestrator` pair, with checkpoint-driven proving that pipelines sub-trees against tx-gathering and the top-tree against the in-flight sub-trees.

Stacked on #22996 (orchestrator split). The diff above is just the prover-node + e2e + cleanup delta.
## What's new

### `prover-node` — `EpochProvingJob` job-model rewrite

`EpochProvingJob` becomes an orchestrator over a `Map<string, CheckpointJob>` keyed by `${number}:${slot}`. Each `CheckpointJob` owns a single `CheckpointSubTreeOrchestrator` with its own per-checkpoint context (txs, attestations, previous-block header, l1ToL2 messages, archive sibling path). A `TopTreeJob` drives the epoch root rollup once all checkpoint sub-trees have started block-level proving.

Public API:

- `registerCheckpoint` — synchronous; sets up the sub-tree, kicks off the chonk-verifier cache fill, attaches the `blockProofs` promise to the eventual top-tree job.
- `provideTxs` — supplies simulated txs, transitions the checkpoint job from registered → block-proving.
- `removeCheckpointsAfter(threshold)` — sole removal API. Suffix-only: atomically cancels every checkpoint job whose checkpoint number sits strictly above the threshold. Suffix-only by design — middle-removal would leave a non-contiguous survivor set that cannot be proved.
- `getCheckpointCount`, `getCheckpointNumbers`.

The `TopTreeJob` constructor defensively asserts that its snapshot's checkpoint numbers form a contiguous run (prev + 1 === curr). With suffix-only removal this is structurally guaranteed; the assertion is a runtime tripwire against future regressions.
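The contiguity tripwire could be as simple as the following sketch (the real assertion lives in the `TopTreeJob` constructor):

```ts
// Sketch of the contiguity assertion: snapshot checkpoint numbers must form a
// run of consecutive integers, which suffix-only removal guarantees structurally.
function assertContiguous(checkpointNumbers: number[]): void {
  for (let i = 1; i < checkpointNumbers.length; i++) {
    if (checkpointNumbers[i - 1] + 1 !== checkpointNumbers[i]) {
      throw new Error(
        `Non-contiguous checkpoint snapshot: ${checkpointNumbers[i - 1]} followed by ${checkpointNumbers[i]}`,
      );
    }
  }
}
```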
### `prover-node` — `L2BlockStream`-driven checkpoint pipeline

The prover-node consumes `chain-checkpointed` / `chain-pruned` events from an `L2BlockStream` rooted at the first block of the first unproven epoch. On each `chain-checkpointed`:

1. Resolve the epoch via `getEpochAtSlot`.
2. Get-or-create the per-epoch `EpochProvingJob`.
3. Gather txs in a detached task and register the checkpoint with the job.
4. Call `tryCompleteEpoch` for both older epochs (tertiary signal) and the current epoch (covers the case where the EpochMonitor already fired and we were waiting on this last checkpoint to arrive).

On `chain-pruned`: call `removeCheckpointsAfter(threshold)` on every job whose first checkpoint sits at or above the threshold. Pending gather tasks are cancelled via `AbortSignal`.

Finalization is driven by the union of three signals — EpochMonitor (epoch closes on L1), tertiary (a checkpoint for a strictly later epoch arrives), or final-checkpoint registration on an already-monitored epoch. Each signal calls `tryCompleteEpoch`, which queries `l2BlockSource.isEpochComplete` directly: the archiver is the source of truth, so the prover-node carries no in-memory cache of "epochs complete on L1".

### `prover-node` — reorg-after-finalization restart

When the L2BlockStream emits a prune that retroactively invalidates an epoch already in finalize, the prover-node aborts the in-flight publish, clears the job, and restarts proving from the new tip.
### e2e

- New `epochs_optimistic_proving.parallel.test.ts`: full e2e covering the pipelining, replacement-checkpoint reuse, and reorg-during-proving paths.
- `epochs_proof_fails`, `epochs_upload_failed_proof`, `epochs_long_proving_time`, `epochs_multi_proof` updated to assert the new in-flight epoch behaviour.

### Logging

Structured-context logs added across the `CheckpointJob` / `TopTreeJob` / `EpochProvingJob` lifecycles: register / provide-txs / sub-tree-created / per-block-progress / blockProofs-ready / cancel / teardown for checkpoints, and create / cancel / prove-start / prove-success for the top-tree. Every log entry indexes on `epochNumber` / `checkpointNumber` per repo convention.
## What's removed

`EpochProver` interface and `ServerEpochProver` are removed: the prover-node no longer drives a single-class epoch prover, so the legacy API has no production callers. `ProvingOrchestrator` survives only as a base class for `CheckpointSubTreeOrchestrator` and as the single-class driver used by `prover-client`'s integration tests; it no longer implements `EpochProver`.

`EpochProvingJob.removeCheckpoint(checkpointNumber)` and `cancelPendingCheckpoints()` are also removed: middle-removal could leave a non-contiguous survivor set that the `TopTreeJob` constructor would reject. Suffix-only `removeCheckpointsAfter` is the only checkpoint-removal API now.
## Test plan

- `yarn workspace @aztec/prover-client test` — 261 tests pass.
- `yarn workspace @aztec/prover-node test` — 116 tests pass (incl. new dedicated `top-tree-job.test.ts` and `checkpoint-job.test.ts`).