refactor(prover-client): split orchestrator into sub-tree + top-tree pair#22996
Conversation
9af9856 to
3f459a5
Compare
3f459a5 to
da67473
Compare
…pair Introduces a sub-tree + top-tree orchestrator pair that decomposes the existing single-class proving orchestrator along the natural state-coupling boundary — per-checkpoint block-level work vs. epoch-level top-tree work — while leaving every existing API on the legacy `EpochProver` / `ProvingOrchestrator` / `EpochProvingState` path untouched. The prover-node and e2e tests build unchanged; this PR is purely additive in surface area, with structural refactors on `ProvingOrchestrator` to share scheduling and top-tree drivers with the new `TopTreeOrchestrator`. Split out from #22990 so it can land independently. ## What's new - **`CheckpointSubTreeOrchestrator`** (`checkpoint-sub-tree-orchestrator.ts`): extends `ProvingOrchestrator`, single-checkpoint by construction. Drives chonk-verifier / base / merge / block-root / block-merge for one checkpoint and resolves a `SubTreeResult` instead of escalating to the checkpoint root — the parent's `checkAndEnqueueCheckpointRootRollup` is overridden to short-circuit. The constructor calls `super.startNewEpoch(epoch, 1, empty challenges)` to set up a single-checkpoint mini-epoch; the count and challenges are never read because the override prevents the parent's finalize / root path from running. - **`TopTreeOrchestrator`** + **`TopTreeProvingState`**: self-contained driver from checkpoint-root through epoch-root rollup. Takes per-checkpoint block-proof promises and pipelines its hint chain against them. Cancellation surfaces as `TopTreeCancelledError` so callers can distinguish reorg-driven cancel from a genuine proving failure. - **`EpochProvingContext`** (`epoch-proving-context.ts`): per-epoch shared cache for chonk-verifier proofs. Survives sub-tree cancellation so a tx that gets reorged out and re-appears in a replacement checkpoint reuses the cached proof. - **`ProvingScheduler`** (`proving-scheduler.ts`): abstract base owning the `SerialQueue` deferred-job lifecycle, the `pendingProvingJobs` controller list, and a unified `deferredProving<S, T>(state, request, callback, isCancelled?)` submit envelope. The minimal `ProvingStateLike` contract is just `verifyState()` + `reject(reason)`. - **`TopTreeProvingScheduler`** (`top-tree-proving-scheduler.ts`): extends `ProvingScheduler` and holds the checkpoint-merge, padding, and root-rollup drivers (plus tree-walking helpers) shared by both orchestrators. Wraps circuit calls via a `wrapCircuitCall` hook (orchestrator overrides for spans; top-tree leaves identity) and resolves via an `onRootRollupComplete` hook to bridge the two states' differing `resolve` signatures. The per-checkpoint root driver stays subclass-specific because input-building flows differ. - **`EpochProverFactory` interface on `ProverClient`**: new factory methods `createEpochProvingContext(epochNumber)`, `createCheckpointSubTreeOrchestrator(...)`, and `createTopTreeOrchestrator()`. A single shared `BrokerCircuitProverFacade` is owned by `ProverClient` and shared across every orchestrator. ## What changes in existing code - `ProvingOrchestrator` extends `TopTreeProvingScheduler`; the inline broker-job submit envelope, queue lifecycle, and the top-tree-section drivers are inherited. `cancel()` delegates the queue-recreate + abort-jobs logic to `resetSchedulerState(this.cancelJobsOnStop)`. Three internal methods (`getOrEnqueueChonkVerifier`, `checkAndEnqueueBaseRollup`, `checkAndEnqueueCheckpointRootRollup`) become `protected` so the sub-tree can override them; `provingState` and `provingPromise` likewise become `protected` so the sub-tree can hook the parent's failure stream onto `subTreeResult`. No public API change on `ProvingOrchestrator`. - `CheckpointProvingState`: gains two read-only accessors used by the sub-tree's checkpoint-root override — `getSubTreeOutputProofs()` and `getLastArchiveSiblingPath()`. No state changes. - `ProverClient` keeps `createEpochProver()` exactly as before (each call spawns its own `BrokerCircuitProverFacade`); the new factory methods share a `getFacade()` set up in `start()` and torn down in `stop()`. `EpochProver`, `EpochProverManager`, `ServerEpochProver`, `EpochProvingState`, the integration tests in `orchestrator_*.test.ts`, `bb_prover_full_rollup.test.ts`, and `stdlib/interfaces/*` are all unchanged from `merge-train/spartan` — the prover-node and e2e tests continue to build against the existing `EpochProver` API. Migrating the prover-node onto the new factories (and the deferred-finalize flow that goes with optimistic proving) is the follow-up PR. ## Test plan - [x] 261 prover-client tests pass (full `yarn workspace @aztec/prover-client test`). - [x] `yarn build` clean against current merge-train/spartan (modulo the pre-existing `@aztec/sqlite3mc-wasm` issue inherited from baseline).
da67473 to
7342efc
Compare
spalladino
left a comment
There was a problem hiding this comment.
The split looks correct, but it took me a while to grok the whole inheritance chain. We now have the CheckpointSubTreeOrchestrator which extends from ProvingOrchestrator to hide some methods, while the ProvingOrchestrator now extends from the TopTreeProvingScheduler which in turn extends from the ProvingScheduler. I'm worried that maybe composition would have been a better fit than inheritance for reusing orchestrator logic, or at least easier to follow.
All that said, the PR looks good, and given there's another PR stacked on top that makes use of this, I'm fine merging. But let's add a README or something for explaining how things get structured, otherwise this is getting really difficult to follow.
| if (this.facade) { | ||
| try { | ||
| await this.facade.stop(); | ||
| } catch (err) { | ||
| this.log.error('Error stopping shared broker facade', err); | ||
| } | ||
| this.facade = undefined; | ||
| } |
There was a problem hiding this comment.
tryStop is your friend here
There was a problem hiding this comment.
Always focusing on the important stuff, I know, I know
| * Callers (`EpochProvingJob`, or unit tests) construct one context per epoch and pass | ||
| * it into every sub-tree they create. `stop()` aborts every in-flight chonk job. | ||
| */ | ||
| export class EpochProvingContext { |
There was a problem hiding this comment.
I'd rename to something like ChonkProofCache to avoid confusions with EpochProvingState
| export abstract class ProvingScheduler { | ||
| protected pendingProvingJobs: AbortController[] = []; | ||
| protected logger: Logger; | ||
| private deferredJobQueue: SerialQueue; |
There was a problem hiding this comment.
Given all jobs from all orchestrators are funneled to a single prover broker, shouldn't the job queue be shared across all orchestrators for the same epoch? Or handled directly at the prover broker facade?
There was a problem hiding this comment.
Yes, it should. Will consider the most appropriate solution.
| finalBlobBatchingChallenges, | ||
| startBlobAccumulator, | ||
| resolve, | ||
| reason => reject(this.cancelled ? new TopTreeCancelledError(reason) : new Error(reason)), |
There was a problem hiding this comment.
Top-tree masks genuine failures as cancellation. top-tree-orchestrator.ts:149 and top-tree-orchestrator.ts:196-201 both gate on the current value of this.cancelled. A real circuit failure can land first (rejecting completionPromise with the real error), and a cancel() arriving in the same tick sets this.cancelled = true before the outer catch runs — turning the genuine error into TopTreeCancelledError. Track what caused the rejection, not the current flag value (e.g. capture a cancelledReason only when cancel() itself triggered the rejection).
(reported by Claude)
| outHashHint: OutHashHint, | ||
| startBlobAccumulator: BatchedBlobAccumulator, | ||
| ) { | ||
| void this.buildCheckpointRootInputs(blockProofs, cd, outHashHint, startBlobAccumulator).then(inputs => { |
There was a problem hiding this comment.
Checkpoint-root input building can hang prove(). top-tree-orchestrator.ts:238 does void this.buildCheckpointRootInputs(...).then(inputs => ...) with no rejection handler. If buildBlobHints, toProofData, or the private-input constructors throw/reject, state.reject() is never invoked → completionPromise never settles → prove() hangs. Add .catch(err => { if (!this.cancelled) state.reject(\...${err}`); })`.
(reported by Claude)
| // Single-checkpoint mini-epoch by construction. The total/challenges supplied to | ||
| // `super.startNewEpoch` are never read, because the sub-tree overrides | ||
| // `checkAndEnqueueCheckpointRootRollup` to short-circuit before the parent's | ||
| // checkpoint-root / finalize machinery would consume them. | ||
| super.startNewEpoch(epochContext.epochNumber, 1, FinalBlobBatchingChallenges.empty()); |
| public override startNewEpoch( | ||
| _epochNumber: EpochNumber, | ||
| _totalNumCheckpoints: number, | ||
| _finalBlobBatchingChallenges: FinalBlobBatchingChallenges, | ||
| ): void { | ||
| throw new Error('CheckpointSubTreeOrchestrator starts its epoch in the constructor; do not call startNewEpoch.'); | ||
| } |
There was a problem hiding this comment.
This feels hacky as well - maybe the catch is that having this class inheriting from the main orchestrator is not the right abstraction?
There was a problem hiding this comment.
These are hacky. I'm going to incrementally work towards better abstractions with the end goal of getting rid of the monolithic ProvingOrchestrator.
|
The Linear project https://linear.app/aztec-labs/project/optimistic-proving-3c2a5b387385/overview contains issues for all follow up work. |
Introduces a sub-tree + top-tree orchestrator pair that decomposes the existing single-class proving orchestrator along the natural state-coupling boundary — per-checkpoint block-level work vs. epoch-level top-tree work — while leaving every existing API on the legacy
EpochProver/ProvingOrchestrator/EpochProvingStatepath untouched. The prover-node and e2e tests build unchanged; this PR is purely additive in surface area, with structural refactors onProvingOrchestratorto share scheduling and top-tree drivers with the newTopTreeOrchestrator.What's new
CheckpointSubTreeOrchestrator(checkpoint-sub-tree-orchestrator.ts): extendsProvingOrchestrator, single-checkpoint by construction. Drives chonk-verifier / base / merge / block-root / block-merge for one checkpoint and resolves aSubTreeResultinstead of escalating to the checkpoint root — the parent'scheckAndEnqueueCheckpointRootRollupis overridden to short-circuit. The constructor callssuper.startNewEpoch(epoch, 1, empty challenges)to set up a single-checkpoint mini-epoch; the count and challenges are never read because the override prevents the parent's finalize / root path from running.TopTreeOrchestrator+TopTreeProvingState: self-contained driver from checkpoint-root through epoch-root rollup. Takes per-checkpoint block-proof promises and pipelines its hint chain against them. Cancellation surfaces asTopTreeCancelledErrorso callers can distinguish reorg-driven cancel from a genuine proving failure.EpochProvingContext(epoch-proving-context.ts): per-epoch shared cache for chonk-verifier proofs. Survives sub-tree cancellation so a tx that gets reorged out and re-appears in a replacement checkpoint reuses the cached proof.ProvingScheduler(proving-scheduler.ts): abstract base owning theSerialQueuedeferred-job lifecycle, thependingProvingJobscontroller list, and a unifieddeferredProving<S, T>(state, request, callback, isCancelled?)submit envelope. The minimalProvingStateLikecontract is justverifyState()+reject(reason).TopTreeProvingScheduler(top-tree-proving-scheduler.ts): extendsProvingSchedulerand holds the checkpoint-merge, padding, and root-rollup drivers (plus tree-walking helpers) shared by both orchestrators. Wraps circuit calls via awrapCircuitCallhook (orchestrator overrides for spans; top-tree leaves identity) and resolves via anonRootRollupCompletehook to bridge the two states' differingresolvesignatures. The per-checkpoint root driver stays subclass-specific because input-building flows differ.EpochProverFactoryinterface onProverClient: new factory methodscreateEpochProvingContext(epochNumber),createCheckpointSubTreeOrchestrator(...), andcreateTopTreeOrchestrator(). A single sharedBrokerCircuitProverFacadeis owned byProverClientand shared across every orchestrator.What changes in existing code
ProvingOrchestratorextendsTopTreeProvingScheduler; the inline broker-job submit envelope, queue lifecycle, and the top-tree-section drivers are inherited.cancel()delegates the queue-recreate + abort-jobs logic toresetSchedulerState(this.cancelJobsOnStop). Three internal methods (getOrEnqueueChonkVerifier,checkAndEnqueueBaseRollup,checkAndEnqueueCheckpointRootRollup) becomeprotectedso the sub-tree can override them;provingStateandprovingPromiselikewise becomeprotectedso the sub-tree can hook the parent's failure stream ontosubTreeResult. No public API change onProvingOrchestrator.CheckpointProvingState: gains two read-only accessors used by the sub-tree's checkpoint-root override —getSubTreeOutputProofs()andgetLastArchiveSiblingPath(). No state changes.ProverClientkeepscreateEpochProver()exactly as before (each call spawns its ownBrokerCircuitProverFacade); the new factory methods share agetFacade()set up instart()and torn down instop().EpochProver,EpochProverManager,ServerEpochProver,EpochProvingState, the integration tests inorchestrator_*.test.ts,bb_prover_full_rollup.test.ts, andstdlib/interfaces/*are all unchanged frommerge-train/spartan— the prover-node and e2e tests continue to build against the existingEpochProverAPI. Migrating the prover-node onto the new factories (and the deferred-finalize flow that goes with optimistic proving) is the follow-up PR.Test plan
yarn workspace @aztec/prover-client test).yarn buildclean against current merge-train/spartan (modulo the pre-existing@aztec/sqlite3mc-wasmissue inherited from baseline).