feat(prover-node): checkpoint-driven optimistic proving #23002

Open
PhilWindle wants to merge 6 commits into merge-train/spartan from pw/optimistic-final

Conversation


@PhilWindle PhilWindle commented May 6, 2026

Drives the prover-node onto the split CheckpointSubTreeOrchestrator + TopTreeOrchestrator pair, with checkpoint-driven proving that pipelines sub-trees against tx-gathering and the top-tree against the in-flight sub-trees.

Stacked on #22996 (orchestrator split). Diff above is just the prover-node + e2e + cleanup delta.

What's new

prover-node — EpochProvingJob job-model rewrite

EpochProvingJob becomes an orchestrator over a Map<string, CheckpointJob> keyed by ${number}:${slot}. Each CheckpointJob owns a single CheckpointSubTreeOrchestrator with its own per-checkpoint context (txs, attestations, previous-block header, l1ToL2 messages, archive sibling path). A TopTreeJob drives the epoch root rollup once all checkpoint sub-trees have started block-level proving.
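A minimal sketch of the job map described above — `CheckpointRegistry` and the field shapes are illustrative, not the actual `EpochProvingJob` internals; only the `${number}:${slot}` keying and the count/numbers accessors come from the description:

```typescript
// Hypothetical sketch of the checkpoint-job map keyed by `${number}:${slot}`.
type CheckpointKey = string;

interface CheckpointJobLike {
  readonly number: number;
  readonly slot: bigint;
  cancel(): void;
}

const keyFor = (number: number, slot: bigint): CheckpointKey => `${number}:${slot}`;

class CheckpointRegistry {
  private readonly jobs = new Map<CheckpointKey, CheckpointJobLike>();

  register(job: CheckpointJobLike): void {
    // Same checkpoint number under a different slot gets a distinct key, so an
    // orphan and its post-reorg replacement can coexist briefly.
    this.jobs.set(keyFor(job.number, job.slot), job);
  }

  getCheckpointCount(): number {
    return this.jobs.size;
  }

  getCheckpointNumbers(): number[] {
    return [...this.jobs.values()].map(j => j.number).sort((a, b) => a - b);
  }
}
```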

Public API:

  • registerCheckpoint — synchronous; sets up sub-tree, kicks off chonk-verifier cache fill, attaches the blockProofs promise to the eventual top-tree job.
  • provideTxs — supplies simulated txs, transitions the checkpoint job from registered → block-proving.
  • removeCheckpointsAfter(threshold) — the sole removal API. Atomically cancels every checkpoint job whose number sits strictly above the threshold. Suffix-only by design: middle-removal would leave a non-contiguous survivor set that cannot be proved.
  • getCheckpointCount, getCheckpointNumbers.

TopTreeJob constructor defensively asserts that its snapshot's checkpoint numbers form a contiguous run (prev + 1 === curr). With suffix-only removal this is structurally guaranteed; the assertion is a runtime tripwire against future regressions.
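The contiguity tripwire reduces to a one-pass check. This is an illustrative standalone function, not the actual TopTreeJob constructor code:

```typescript
// Runtime tripwire: checkpoint numbers must form a contiguous run
// (prev + 1 === curr). With suffix-only removal this never fires.
function assertContiguous(checkpointNumbers: number[]): void {
  for (let i = 1; i < checkpointNumbers.length; i++) {
    if (checkpointNumbers[i - 1] + 1 !== checkpointNumbers[i]) {
      throw new Error(
        `Non-contiguous checkpoint run: ${checkpointNumbers[i - 1]} -> ${checkpointNumbers[i]}`,
      );
    }
  }
}
```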

prover-node — L2BlockStream-driven checkpoint pipeline

The prover-node consumes chain-checkpointed / chain-pruned events from an L2BlockStream rooted at the first block of the first unproven epoch. On each chain-checkpointed:

  1. Resolve the epoch via getEpochAtSlot.
  2. Get-or-create the per-epoch EpochProvingJob.
  3. Register the checkpoint synchronously and detach a tx-gathering task.
  4. Call tryCompleteEpoch for both older epochs (tertiary signal) and the current epoch (covers the case where the EpochMonitor already fired and we were waiting on this last checkpoint to arrive).
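The four steps above can be sketched as a single handler; every name here (`deps`, the `Checkpoint` shape, the stub interfaces) is an assumption for illustration, not the real prover-node API:

```typescript
// Hedged sketch of the chain-checkpointed handler: resolve epoch, get-or-create
// the job, register synchronously, detach tx-gathering, then fire completion
// signals for both the older and current epoch.
interface Checkpoint { number: number; slot: bigint; }
interface EpochJobLike { registerCheckpoint(cp: Checkpoint): void; }

async function onChainCheckpointed(
  cp: Checkpoint,
  deps: {
    getEpochAtSlot(slot: bigint): number;
    getOrCreateJob(epoch: number): EpochJobLike;
    gatherTxs(cp: Checkpoint, signal: AbortSignal): Promise<void>;
    tryCompleteEpoch(epoch: number): Promise<void>;
  },
): Promise<void> {
  const epoch = deps.getEpochAtSlot(cp.slot);   // 1. resolve the epoch
  const job = deps.getOrCreateJob(epoch);       // 2. get-or-create the job
  job.registerCheckpoint(cp);                   // 3. synchronous registration
  const abort = new AbortController();
  void deps.gatherTxs(cp, abort.signal);        // 3. detached tx-gathering task
  await deps.tryCompleteEpoch(epoch - 1);       // 4. tertiary signal for older epochs
  await deps.tryCompleteEpoch(epoch);           // 4. last-checkpoint-arrived case
}
```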

On chain-pruned: call removeCheckpointsAfter(threshold) on every job whose first checkpoint sits at or above the threshold. Pending gather tasks are cancelled via AbortSignal.

Finalization is driven by the union of three signals — EpochMonitor (epoch closes on L1), tertiary (a checkpoint for a strictly later epoch arrives), or final-checkpoint registration on an already-monitored epoch. Each signal calls tryCompleteEpoch, which queries l2BlockSource.isEpochComplete directly: the archiver is the source of truth, so the prover-node carries no in-memory cache of "epochs complete on L1".
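A cache-free completion check, per the description, queries the archiver every time. The sketch below uses assumed shapes; returning a boolean for "did we actually complete" is an illustration of the intent, not the shipped signature:

```typescript
// The archiver (l2BlockSource) is the source of truth: no in-memory
// "epochs complete on L1" cache, just a direct query per signal.
interface L2BlockSourceLike {
  isEpochComplete(epoch: number): Promise<boolean>;
}

async function tryCompleteEpoch(
  epoch: number,
  source: L2BlockSourceLike,
  allCheckpointsRegistered: (epoch: number) => boolean,
  completeEpoch: (epoch: number) => void,
): Promise<boolean> {
  if (!(await source.isEpochComplete(epoch))) {
    return false; // L1 has not closed the epoch yet
  }
  if (!allCheckpointsRegistered(epoch)) {
    return false; // still waiting on the L2BlockStream to deliver checkpoints
  }
  completeEpoch(epoch);
  return true; // distinguishes a real completion from a no-op
}
```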

prover-node — reorg-after-finalization restart

When the L2BlockStream emits a prune that retroactively invalidates an epoch already in finalize, the prover-node aborts the in-flight publish, clears the job, and restarts proving from the new tip.

e2e

  • New epochs_optimistic_proving.parallel.test.ts: full e2e covering the pipelining, replacement-checkpoint reuse, and reorg-during-proving paths.
  • The two happy-path tests use a wall-clock-vs-job-epoch sampler to assert that the prover-node has registered checkpoints with the in-flight job for an epoch while still inside that epoch — proves checkpoints aren't deferred until epoch end. The multi-epoch test runs 4 epochs and asserts this for each.
  • epochs_proof_fails, epochs_upload_failed_proof, epochs_long_proving_time, epochs_multi_proof updated to assert the new in-flight epoch behaviour.

Logging

Structured-context logs added across the CheckpointJob / TopTreeJob / EpochProvingJob lifecycles: register/provide-txs/sub-tree-created/per-block-progress/blockProofs-ready/cancel/teardown for checkpoints, and create/cancel/prove-start/prove-success for top-tree. Every log entry indexes on epochNumber / checkpointNumber per repo convention.

What's removed

EpochProver interface and ServerEpochProver are removed: the prover-node no longer drives a single-class epoch prover, so the legacy API has no production callers. ProvingOrchestrator survives only as a base class for CheckpointSubTreeOrchestrator and as the single-class driver used by prover-client's integration tests; it no longer implements EpochProver.

EpochProvingJob.removeCheckpoint(checkpointNumber) and cancelPendingCheckpoints() are also removed: middle-removal could leave a non-contiguous survivor set that the TopTreeJob constructor would reject. Suffix-only removeCheckpointsAfter is the only checkpoint-removal API now.
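Suffix-only removal keeps the survivor set a contiguous prefix by construction. A minimal sketch under assumed shapes (not the real EpochProvingJob method):

```typescript
// Cancel and delete every checkpoint job strictly above the threshold in one
// pass; survivors are exactly the contiguous prefix up to the threshold.
function removeCheckpointsAfter(
  jobs: Map<string, { number: number; cancel(): void }>,
  threshold: number,
): number {
  let removed = 0;
  for (const [key, job] of jobs) {
    if (job.number > threshold) {
      job.cancel();
      jobs.delete(key); // deleting during Map iteration is safe in JS
      removed++;
    }
  }
  return removed;
}
```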

Test plan

  • yarn workspace @aztec/prover-client test — 261 tests pass.
  • yarn workspace @aztec/prover-node test — 116 tests pass (incl. new dedicated top-tree-job.test.ts and checkpoint-job.test.ts).
  • e2e tests covering optimistic proving, reorgs during proving, failed proof publish, and multi-checkpoint flows are included in this PR.

@PhilWindle PhilWindle force-pushed the pw/optimistic-final branch from 087e60b to f181313 Compare May 6, 2026 17:47
@PhilWindle PhilWindle changed the base branch from merge-train/spartan to pw/optimistic-orchestrator May 6, 2026 17:47
@PhilWindle PhilWindle added ci-full Run all master checks. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure labels May 6, 2026
@PhilWindle PhilWindle force-pushed the pw/optimistic-final branch from f181313 to 3694a9b Compare May 6, 2026 17:51
…pair

Introduces a sub-tree + top-tree orchestrator pair that decomposes the existing
single-class proving orchestrator along the natural state-coupling boundary —
per-checkpoint block-level work vs. epoch-level top-tree work — while leaving
every existing API on the legacy `EpochProver` / `ProvingOrchestrator` /
`EpochProvingState` path untouched. The prover-node and e2e tests build
unchanged; this PR is purely additive in surface area, with structural
refactors on `ProvingOrchestrator` to share scheduling and top-tree drivers
with the new `TopTreeOrchestrator`.

Split out from #22990 so it can land independently.

## What's new

- **`CheckpointSubTreeOrchestrator`** (`checkpoint-sub-tree-orchestrator.ts`):
  extends `ProvingOrchestrator`, single-checkpoint by construction. Drives
  chonk-verifier / base / merge / block-root / block-merge for one checkpoint
  and resolves a `SubTreeResult` instead of escalating to the checkpoint root —
  the parent's `checkAndEnqueueCheckpointRootRollup` is overridden to
  short-circuit. The constructor calls `super.startNewEpoch(epoch, 1, empty
  challenges)` to set up a single-checkpoint mini-epoch; the count and
  challenges are never read because the override prevents the parent's
  finalize / root path from running.
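  The override pattern reduces to the sketch below — class names and return
  values are illustrative stand-ins, not the real orchestrator code:

  ```typescript
  // The sub-tree subclass overrides the parent's protected escalation hook so
  // proving stops at the block level instead of running the finalize/root path.
  class Orchestrator {
    protected checkAndEnqueueCheckpointRootRollup(): string {
      return 'escalate-to-checkpoint-root';
    }
    run(): string {
      return this.checkAndEnqueueCheckpointRootRollup();
    }
  }

  class SubTreeOrchestrator extends Orchestrator {
    protected override checkAndEnqueueCheckpointRootRollup(): string {
      // Short-circuit: resolve a sub-tree result, never escalate.
      return 'resolve-sub-tree-result';
    }
  }
  ```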

- **`TopTreeOrchestrator`** + **`TopTreeProvingState`**: self-contained driver
  from checkpoint-root through epoch-root rollup. Takes per-checkpoint
  block-proof promises and pipelines its hint chain against them. Cancellation
  surfaces as `TopTreeCancelledError` so callers can distinguish reorg-driven
  cancel from a genuine proving failure.

- **`EpochProvingContext`** (`epoch-proving-context.ts`): per-epoch shared
  cache for chonk-verifier proofs. Survives sub-tree cancellation so a tx that
  gets reorged out and re-appears in a replacement checkpoint reuses the
  cached proof.
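  The caching idea can be sketched as a promise-deduplicating map (hypothetical
  shapes; the real `EpochProvingContext` surface may differ):

  ```typescript
  // Per-epoch proof cache keyed by tx hash. Because it is epoch-scoped rather
  // than sub-tree-scoped, a tx reorged out of one checkpoint and re-mined in a
  // replacement checkpoint reuses the already-started proof.
  class EpochProofCache<Proof> {
    private readonly proofs = new Map<string, Promise<Proof>>();

    getOrProve(txHash: string, prove: () => Promise<Proof>): Promise<Proof> {
      let entry = this.proofs.get(txHash);
      if (!entry) {
        entry = prove(); // first request kicks off proving
        this.proofs.set(txHash, entry);
      }
      return entry; // later requests reuse the in-flight or finished proof
    }
  }
  ```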

- **`ProvingScheduler`** (`proving-scheduler.ts`): abstract base owning the
  `SerialQueue` deferred-job lifecycle, the `pendingProvingJobs` controller
  list, and a unified `deferredProving<S, T>(state, request, callback,
  isCancelled?)` submit envelope. The minimal `ProvingStateLike` contract is
  just `verifyState()` + `reject(reason)`.
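  A minimal rendering of that contract and envelope — the real `deferredProving`
  signature is richer (queue, controllers, spans); this only shows the
  verify/cancel/reject flow:

  ```typescript
  // ProvingStateLike is the minimal contract; deferredProving drops work on a
  // torn-down or cancelled state and routes failures through reject().
  interface ProvingStateLike {
    verifyState(): boolean;
    reject(reason: string): void;
  }

  async function deferredProving<S extends ProvingStateLike, T>(
    state: S,
    request: () => Promise<T>,
    callback: (result: T) => void,
    isCancelled?: () => boolean,
  ): Promise<void> {
    if (!state.verifyState() || isCancelled?.()) {
      return; // state already invalid or job cancelled: silently drop
    }
    try {
      const result = await request();
      if (!isCancelled?.()) {
        callback(result);
      }
    } catch (err) {
      state.reject(String(err));
    }
  }
  ```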

- **`TopTreeProvingScheduler`** (`top-tree-proving-scheduler.ts`): extends
  `ProvingScheduler` and holds the checkpoint-merge, padding, and root-rollup
  drivers (plus tree-walking helpers) shared by both orchestrators. Wraps
  circuit calls via a `wrapCircuitCall` hook (orchestrator overrides for spans;
  top-tree leaves identity) and resolves via an `onRootRollupComplete` hook to
  bridge the two states' differing `resolve` signatures. The per-checkpoint
  root driver stays subclass-specific because input-building flows differ.

- **`EpochProverFactory` interface on `ProverClient`**: new factory methods
  `createEpochProvingContext(epochNumber)`,
  `createCheckpointSubTreeOrchestrator(...)`, and
  `createTopTreeOrchestrator()`. A single shared `BrokerCircuitProverFacade` is
  owned by `ProverClient` and shared across every orchestrator.

## What changes in existing code

- `ProvingOrchestrator` extends `TopTreeProvingScheduler`; the inline
  broker-job submit envelope, queue lifecycle, and the top-tree-section
  drivers are inherited. `cancel()` delegates the queue-recreate + abort-jobs
  logic to `resetSchedulerState(this.cancelJobsOnStop)`. Three internal
  methods (`getOrEnqueueChonkVerifier`, `checkAndEnqueueBaseRollup`,
  `checkAndEnqueueCheckpointRootRollup`) become `protected` so the sub-tree
  can override them; `provingState` and `provingPromise` likewise become
  `protected` so the sub-tree can hook the parent's failure stream onto
  `subTreeResult`. No public API change on `ProvingOrchestrator`.
- `CheckpointProvingState`: gains two read-only accessors used by the
  sub-tree's checkpoint-root override — `getSubTreeOutputProofs()` and
  `getLastArchiveSiblingPath()`. No state changes.
- `ProverClient` keeps `createEpochProver()` exactly as before (each call
  spawns its own `BrokerCircuitProverFacade`); the new factory methods share
  a `getFacade()` set up in `start()` and torn down in `stop()`.

`EpochProver`, `EpochProverManager`, `ServerEpochProver`, `EpochProvingState`,
the integration tests in `orchestrator_*.test.ts`, `bb_prover_full_rollup.test.ts`,
and `stdlib/interfaces/*` are all unchanged from `merge-train/spartan` — the
prover-node and e2e tests continue to build against the existing `EpochProver`
API. Migrating the prover-node onto the new factories (and the deferred-finalize
flow that goes with optimistic proving) is the follow-up PR.

## Test plan

- [x] 261 prover-client tests pass (full `yarn workspace @aztec/prover-client test`).
- [x] `yarn build` clean against current merge-train/spartan (modulo the
  pre-existing `@aztec/sqlite3mc-wasm` issue inherited from baseline).
@PhilWindle PhilWindle force-pushed the pw/optimistic-orchestrator branch from da67473 to 7342efc Compare May 6, 2026 18:09
@PhilWindle PhilWindle force-pushed the pw/optimistic-final branch 17 times, most recently from fd62ad7 to 093fdde Compare May 8, 2026 08:50
Drives the prover-node onto the split `CheckpointSubTreeOrchestrator` +
`TopTreeOrchestrator` pair, with checkpoint-driven proving that pipelines
sub-trees against tx-gathering and the top-tree against the in-flight
sub-trees.

## What's new

### `prover-node` — `EpochProvingJob` job-model rewrite

`EpochProvingJob` becomes an orchestrator over a `Map<string, CheckpointJob>`
keyed by `${number}:${slot}`. Each `CheckpointJob` owns a single
`CheckpointSubTreeOrchestrator` with its own per-checkpoint context (txs,
attestations, previous-block header, l1ToL2 messages, archive sibling path).
A `TopTreeJob` drives the epoch root rollup once all checkpoint sub-trees
have started block-level proving.

Public API:
- `registerCheckpoint` — synchronous; sets up sub-tree, kicks off chonk-verifier
  cache fill, attaches the `blockProofs` promise to the eventual top-tree job.
- `provideTxs` — supplies simulated txs, transitions the checkpoint job from
  registered → block-proving.
- `removeCheckpoint` — synchronous, idempotent; drops a single checkpoint by
  `(number, slot)` and cancels its sub-tree fire-and-forget. Tolerates re-add
  of the same checkpoint number under a different slot.
- `removeCheckpointsAfter`, `getCheckpointCount`, `getCheckpointNumbers`,
  `cancelPendingCheckpoints`.

### `prover-node` — `L2BlockStream`-driven checkpoint pipeline

The prover-node consumes `chain-checkpointed` / `chain-pruned` events from an
`L2BlockStream` rooted at the first block of the first unproven epoch. On
each `chain-checkpointed`:
1. Resolve the epoch via `getEpochAtSlot`.
2. Get-or-create the per-epoch `EpochProvingJob`.
3. Gather txs in a detached task and register the checkpoint with the job.

On `chain-pruned`: call `removeCheckpointsAfter(threshold)` on every job
whose first checkpoint sits at or above the threshold. Pending gather tasks
are cancelled via `AbortSignal`.

Finalization is driven by the union of three signals: epoch-monitor sees
the epoch close on L1, a checkpoint for a strictly later epoch arrives, or
all expected checkpoints (per archiver) are registered while the epoch is
already complete on L1.

### `prover-node` — reorg-after-finalization restart

When the L2BlockStream emits a prune that retroactively invalidates an
epoch already in finalize, the prover-node aborts the in-flight publish,
clears the job, and restarts proving from the new tip.

### e2e

- New `epochs_optimistic_proving.parallel.test.ts`: full e2e covering the
  pipelining, replacement-checkpoint reuse, and reorg-during-proving paths.
- `epochs_proof_fails`, `epochs_upload_failed_proof`, `epochs_long_proving_time`,
  `epochs_multi_proof` updated to assert the new in-flight epoch behaviour.

## What's removed

`EpochProver` interface and `ServerEpochProver` are removed: the prover-node
no longer drives a single-class epoch prover, so the legacy API has no
production callers. `ProvingOrchestrator` survives only as a base class for
`CheckpointSubTreeOrchestrator` and as the single-class driver used by
`prover-client`'s integration tests; it no longer implements `EpochProver`.

## Test plan

- `yarn workspace @aztec/prover-client test` — 261 tests pass.
- `yarn workspace @aztec/prover-node test` — 89 tests pass.
- e2e tests covering optimistic proving, reorgs during proving, failed
  proof publish, and multi-checkpoint flows are included in this PR.
@PhilWindle PhilWindle force-pushed the pw/optimistic-final branch from 093fdde to 81dc825 Compare May 8, 2026 09:23
Contributor

@spalladino spalladino left a comment


Got halfway through it, but looks good so far. Pasting here some observations by Claude that I didn't get to review myself:

  • Same-epoch replacement checkpoints are silently dropped after completeEpoch(). prover-node.ts:213-220 skips any checkpoint for an epoch whose job has isEpochComplete(). Sequence: (1) all checkpoints registered, completeEpoch() fires; (2) prune removes the suffix; (3) the L1 reorg replacement for the same epoch arrives. The replacement is dropped because existingJob.isEpochComplete() is true. The finalize loop then restarts against an incomplete prefix of the epoch and eventually publishes a proof for a checkpoint range that does not match the canonical L1 chain. This contradicts the stated "orphan + replacement coexist briefly" model. The check should be conditioned on the surviving checkpoint set, not on the boolean flag.

  • A prune that lands after topTreeJob.start() but during publishProof is not interceptable. epoch-proving-job.ts:478 clears this.topTreeJob immediately after the prove resolves; the publish at epoch-proving-job.ts:487 operates on the captured snapshot array. Meanwhile removeCheckpointsAfter (epoch-proving-job.ts:332-369) only acts on the live topTreeJob — which is now undefined — so the publish proceeds with stale data and the (out-of-date) attestations from snapshot.at(-1). Best case L1 rejects the proof (gas burn, retry confusion); worst case it lands against a window the canonical chain has since invalidated. Either re-validate the snapshot's checkpoint set (by id) against the live registry just before submitting, or hold the top-tree-job reference until publish completes and route prune through it.

  • EpochMonitor permanently records "processed" on a no-op tryCompleteEpoch. prover-node.ts:145-153 returns true from handleEpochReadyToProve whenever tryCompleteEpoch doesn't throw — even if it bailed at prover-node.ts:394 (archiver says incomplete) or prover-node.ts:399-405 (not all checkpoints known to the job). epoch-monitor.ts:93 then sets latestEpochNumber = epochToProve, and epoch-monitor.ts:76-79 ensures the same epoch is never re-considered. In a healthy network the next checkpoint event for a later epoch reseeds via the tertiary signal, but if the epoch in question is the last produced (network stall, end-of-test) it stays stuck forever. tryCompleteEpoch should return whether completeEpoch() was actually invoked, and handleEpochReadyToProve should propagate that.

Comment on lines +160 to +161
// First, update the local tips store so the stream can track our state.
await this.tipsStore.handleBlockStreamEvent(event);
Contributor

Note that by updating the store first, any error in the handle methods below will not be retried in the next iteration of the L2BlockStream. Not sure if this is intentional.

Collaborator Author

Yes it should go at the end I think.

// (e.g. an early proof submission landed for an earlier epoch that ends mid-way
// through this epoch's slot range) does not prove the entire epoch. We need every
// checkpoint of a partially-proven epoch to feed the orchestrator.
if (await this.isEpochFullyProven(epochNumber, l1Constants)) {
Contributor

We should check in the archiver and blockstream what happens first: do checkpoints get emitted, or does the proven block number get updated? I'm worried that, when starting a long sync after offline for a while, all the checkpoints are emitted first and then the proven tip is updated, so the prover node would trigger a ton of jobs only to cancel them immediately afterwards.

Contributor

Ah just saw the compute starting block number for the blockstream. And as long as the prover node waits for the archiver initial sync to be completed before starting, we should be good I guess.

* to the job for finalization. If checkpoints are still being delivered by the
* L2BlockStream, the handoff happens later when the last one arrives.
*/
async handleEpochReadyToProve(epochNumber: EpochNumber): Promise<boolean> {
Contributor

I think we can now replace the entire epoch monitor with a running promise that just calls l2BlockSource.isEpochComplete periodically.

Comment on lines +222 to +233
// Add to an existing job, or create a new one if allowed.
let job = existingJob;
if (!job) {
const { proverNodeMaxPendingJobs: maxPendingJobs } = this.config;
if (maxPendingJobs > 0 && this.jobs.size >= maxPendingJobs) {
this.log.debug(
`Skipping checkpoint ${checkpoint.number} for epoch ${epochNumber}: max pending jobs ${maxPendingJobs} reached`,
);
return;
}
job = await this.createEpochJob(epochNumber);
}
Contributor

Won't this create a hole in a given epoch? I don't see another code path where we patch an epoch job with checkpoint jobs we failed to add due to lack of capacity. I'm worried about a scenario where we don't add a checkpoint to an epoch, but then a prior epoch finishes, so we do add the next, and that epoch job ends up waiting for that missing checkpoint forever.

Speaking of which: is there anything that cleans up very old jobs when the epoch proof submission window closes?

Collaborator Author

Yes, was going to ask about this. As it stands here there could be problems if this is set. Do you think this is needed or makes sense anymore? Proving is more of a continuous process rather than a set of large jobs.


const taskId = `${checkpoint.number} - ${checkpoint.slot}`;
const task = this.gatherAndAddCheckpoint(job, checkpoint, epochNumber, abortSignal);
this.pendingGatherTasks.set(taskId, task);
Contributor

Maybe it makes sense to move the responsibility of collecting their data to the checkpoint jobs, to keep the prover node leaner?

Collaborator Author

I think it does. As a further refactor, I want to break the checkpoint proving out of EpochProvingJob. Then we maintain a collection of checkpoint jobs being proven. EpochProvingJob then becomes a collection of references to a subset of those checkpoints and a top tree. This would enable us to create instances of EpochProvingJob with arbitrary checkpoints (of a valid range of course) enabling efficient partial epoch proving.

Contributor

Looks comprehensive. I'd add tests for cross-chain messages as well, those are usually a source of problems.

Collaborator Author

Will do

Comment on lines +73 to +77
const wallClockEpoch = Number(test.epochCache.getEpochAndSlotNow().epoch);
const job = proverNode.epochJobs.get(wallClockEpoch);
if (job && job.getCheckpointCount() > 0) {
observed.add(wallClockEpoch);
}
Contributor

Heads up this should pass even without optimistic proving: prover would start as soon as the last checkpoint of the epoch was pushed to L1, which could happen a few L1 slots before the actual end of the epoch. Maybe we should offset this a bit so the check is more meaningful, or check for specific checkpoint jobs vs all jobs in the epoch?

Comment on lines +172 to +177
// Wait for the 2nd checkpoint within this epoch.
const initialCheckpoint = (await test.monitor.run(true)).checkpointNumber;
const midCheckpoint = CheckpointNumber(initialCheckpoint + 2);
await test.waitUntilCheckpointNumber(midCheckpoint, L2_SLOT_DURATION_IN_S * 6);
const checkpointBeforeReorg = test.monitor.checkpointNumber;
logger.info(`Reached checkpoint ${checkpointBeforeReorg}`);
Contributor

I'd also add an assertion that the prover actually started the job for the given checkpoint before reorging it out. Otherwise, this test could pass if the prover just takes a bit more time to start assembling the checkpoint job.

Comment on lines +239 to +244
// Wait for 2 checkpoints mid-epoch.
const initialCheckpoint = (await test.monitor.run(true)).checkpointNumber;
const midCheckpoint = CheckpointNumber(initialCheckpoint + 2);
await test.waitUntilCheckpointNumber(midCheckpoint, L2_SLOT_DURATION_IN_S * 6);
const checkpointBeforeReorg = test.monitor.checkpointNumber;
logger.info(`Reached checkpoint ${checkpointBeforeReorg}`);
Contributor

This should also assert that both checkpoints landed on the same epoch, which could not be the case if this test started near the end of an epoch.

Contributor

We should also add txs. And it'd be interesting to see a case where one of the reorgs means that a tx that was mined in a checkpoint is now mined in a different one.

Comment on lines +391 to +408
const proverNode = test.proverNodes[0].getProverNode() as TestProverNode;
const proverManager = proverNode.getProver();
const origCreateTopTree = proverManager.createTopTreeOrchestrator.bind(proverManager);
let releaseProvingGate: () => void = () => {};
const provingGate = new Promise<void>(resolve => {
releaseProvingGate = resolve;
});
proverManager.createTopTreeOrchestrator = () => {
const topTree = origCreateTopTree();
const origProve = topTree.prove.bind(topTree);
topTree.prove = async (...args: Parameters<typeof origProve>) => {
logger.warn('Top-tree proving gated — waiting for test to release');
await provingGate;
logger.warn('Proving gate released');
return origProve(...args);
};
return topTree;
};
Contributor

I'm wondering if there's a cleaner way of doing this, perhaps reusing the EpochProvingJobHooks? Not sure if worth stressing over it though.

Comment on lines +435 to +437
if (snapshot.length === 0) {
throw new Error(`Cannot finalize epoch ${this.epochNumber}: no surviving checkpoints`);
}
Contributor

Is there a race condition where a reorg removes all checkpoints from an epoch and then new ones are immediately added again, but we hit this case before the new ones get added? Seems like something that would never happen in prod, but could affect e2e tests that have short epochs.

Collaborator Author

There shouldn't be. The prover node does this in its prune handler:

      if (job.getCheckpointCount() === 0) {
        this.log.info(`Cancelling epoch ${epochNum} job — all checkpoints pruned`);
        await this.cancelAndCleanupJob(epochNum, job);
      }

So that epoch job instance is removed and replaced with another when a new checkpoint arrives for the same epoch number.

Collaborator Author

I will add a test though.

Comment on lines +579 to +588
// Capture proving data *before* we clear the live job map, so a failure-upload
// attempt that runs after teardown still has the per-checkpoint txs / messages /
// previous-block-header to work with.
if (!this.provingDataSnapshot) {
try {
this.provingDataSnapshot = this.buildProvingDataSnapshot();
} catch {
// No completed checkpoints — failure-upload will have nothing to upload, which
// is fine. Snapshot stays undefined and getProvingData will rebuild and rethrow.
}
Contributor

How about we move the responsibility of uploading this data to the epoch-proving-job itself, so we don't need to collect this just for the prover-node to maybe read if it's configured for uploads?


/** Stable identifier: `${checkpoint number}:${slot}`. Used as the parent's map key. */
public static idFor(checkpoint: Checkpoint): string {
return `${checkpoint.number}:${checkpoint.header.slotNumber}`;
Contributor

I don't think it makes a difference based on how reorgs are handled, but I'd throw in the checkpoint archive as part of the identifier, just in case

Comment on lines +237 to +246
const publicTxs = allTxs.filter(tx => tx?.data.forPublic);
if (publicTxs.length > 0) {
this.deps.log.verbose(
`Kicking off ${publicTxs.length} chonk-verifier circuits for checkpoint ${this.checkpoint.number}`,
{
checkpointNumber: this.checkpoint.number,
publicTxCount: publicTxs.length,
},
);
await this.subTree.startChonkVerifierCircuits(publicTxs);
Contributor

I know this didn't change with this PR, but for private-only txs, who verifies the chonk proof?

Collaborator Author

The private base rollup. Public txs have a different process. They require a dedicated chonk verifier circuit.

Comment on lines +123 to +124
public start(): Promise<TopTreeProof> {
void this.run();
Contributor

I take it it's responsibility of the caller to check that all dependencies are ready, right?

Collaborator Author

Which dependencies are you referring to? At the point this class is constructed, its arguments contain all the information required to start the proving process (e.g. compute blob challenges). If the checkpoint root input proofs aren't ready yet, that is fine; the orchestrator fires the appropriate events as and when they are.

Base automatically changed from pw/optimistic-orchestrator to merge-train/spartan May 13, 2026 10:03
Contributor

@alexghr alexghr left a comment

Looks good!

Contributor

I think this is a great idea! I'd consider extending it to Base/Root parity circuits too because they could benefit from the same caching behaviour across reorged checkpoints
