Skip to content

refactor(prover-client): split orchestrator into sub-tree + top-tree pair#22996

Merged
PhilWindle merged 1 commit into
merge-train/spartanfrom
pw/optimistic-orchestrator
May 13, 2026
Merged

refactor(prover-client): split orchestrator into sub-tree + top-tree pair#22996
PhilWindle merged 1 commit into
merge-train/spartanfrom
pw/optimistic-orchestrator

Conversation

@PhilWindle
Copy link
Copy Markdown
Collaborator

@PhilWindle PhilWindle commented May 6, 2026

Introduces a sub-tree + top-tree orchestrator pair that decomposes the existing single-class proving orchestrator along the natural state-coupling boundary — per-checkpoint block-level work vs. epoch-level top-tree work — while leaving every existing API on the legacy EpochProver / ProvingOrchestrator / EpochProvingState path untouched. The prover-node and e2e tests build unchanged; this PR is purely additive in surface area, with structural refactors on ProvingOrchestrator to share scheduling and top-tree drivers with the new TopTreeOrchestrator.

What's new

  • CheckpointSubTreeOrchestrator (checkpoint-sub-tree-orchestrator.ts): extends ProvingOrchestrator, single-checkpoint by construction. Drives chonk-verifier / base / merge / block-root / block-merge for one checkpoint and resolves a SubTreeResult instead of escalating to the checkpoint root — the parent's checkAndEnqueueCheckpointRootRollup is overridden to short-circuit. The constructor calls super.startNewEpoch(epoch, 1, empty challenges) to set up a single-checkpoint mini-epoch; the count and challenges are never read because the override prevents the parent's finalize / root path from running.

  • TopTreeOrchestrator + TopTreeProvingState: self-contained driver from checkpoint-root through epoch-root rollup. Takes per-checkpoint block-proof promises and pipelines its hint chain against them. Cancellation surfaces as TopTreeCancelledError so callers can distinguish reorg-driven cancel from a genuine proving failure.

  • EpochProvingContext (epoch-proving-context.ts): per-epoch shared cache for chonk-verifier proofs. Survives sub-tree cancellation so a tx that gets reorged out and re-appears in a replacement checkpoint reuses the cached proof.

  • ProvingScheduler (proving-scheduler.ts): abstract base owning the SerialQueue deferred-job lifecycle, the pendingProvingJobs controller list, and a unified deferredProving<S, T>(state, request, callback, isCancelled?) submit envelope. The minimal ProvingStateLike contract is just verifyState() + reject(reason).

  • TopTreeProvingScheduler (top-tree-proving-scheduler.ts): extends ProvingScheduler and holds the checkpoint-merge, padding, and root-rollup drivers (plus tree-walking helpers) shared by both orchestrators. Wraps circuit calls via a wrapCircuitCall hook (orchestrator overrides for spans; top-tree leaves identity) and resolves via an onRootRollupComplete hook to bridge the two states' differing resolve signatures. The per-checkpoint root driver stays subclass-specific because input-building flows differ.

  • EpochProverFactory interface on ProverClient: new factory methods createEpochProvingContext(epochNumber), createCheckpointSubTreeOrchestrator(...), and createTopTreeOrchestrator(). A single shared BrokerCircuitProverFacade is owned by ProverClient and shared across every orchestrator.

What changes in existing code

  • ProvingOrchestrator extends TopTreeProvingScheduler; the inline broker-job submit envelope, queue lifecycle, and the top-tree-section drivers are inherited. cancel() delegates the queue-recreate + abort-jobs logic to resetSchedulerState(this.cancelJobsOnStop). Three internal methods (getOrEnqueueChonkVerifier, checkAndEnqueueBaseRollup, checkAndEnqueueCheckpointRootRollup) become protected so the sub-tree can override them; provingState and provingPromise likewise become protected so the sub-tree can hook the parent's failure stream onto subTreeResult. No public API change on ProvingOrchestrator.
  • CheckpointProvingState: gains two read-only accessors used by the sub-tree's checkpoint-root override — getSubTreeOutputProofs() and getLastArchiveSiblingPath(). No state changes.
  • ProverClient keeps createEpochProver() exactly as before (each call spawns its own BrokerCircuitProverFacade); the new factory methods share a getFacade() set up in start() and torn down in stop().

EpochProver, EpochProverManager, ServerEpochProver, EpochProvingState, the integration tests in orchestrator_*.test.ts, bb_prover_full_rollup.test.ts, and stdlib/interfaces/* are all unchanged from merge-train/spartan — the prover-node and e2e tests continue to build against the existing EpochProver API. Migrating the prover-node onto the new factories (and the deferred-finalize flow that goes with optimistic proving) is the follow-up PR.

Test plan

  • 261 prover-client tests pass (full yarn workspace @aztec/prover-client test).
  • yarn build clean against current merge-train/spartan (modulo the pre-existing @aztec/sqlite3mc-wasm issue inherited from baseline).

@PhilWindle PhilWindle force-pushed the pw/optimistic-orchestrator branch 2 times, most recently from 9af9856 to 3f459a5 Compare May 6, 2026 16:47
@PhilWindle PhilWindle added ci-full Run all master checks. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure labels May 6, 2026
@PhilWindle PhilWindle force-pushed the pw/optimistic-orchestrator branch from 3f459a5 to da67473 Compare May 6, 2026 17:31
…pair

Introduces a sub-tree + top-tree orchestrator pair that decomposes the existing
single-class proving orchestrator along the natural state-coupling boundary —
per-checkpoint block-level work vs. epoch-level top-tree work — while leaving
every existing API on the legacy `EpochProver` / `ProvingOrchestrator` /
`EpochProvingState` path untouched. The prover-node and e2e tests build
unchanged; this PR is purely additive in surface area, with structural
refactors on `ProvingOrchestrator` to share scheduling and top-tree drivers
with the new `TopTreeOrchestrator`.

Split out from #22990 so it can land independently.

## What's new

- **`CheckpointSubTreeOrchestrator`** (`checkpoint-sub-tree-orchestrator.ts`):
  extends `ProvingOrchestrator`, single-checkpoint by construction. Drives
  chonk-verifier / base / merge / block-root / block-merge for one checkpoint
  and resolves a `SubTreeResult` instead of escalating to the checkpoint root —
  the parent's `checkAndEnqueueCheckpointRootRollup` is overridden to
  short-circuit. The constructor calls `super.startNewEpoch(epoch, 1, empty
  challenges)` to set up a single-checkpoint mini-epoch; the count and
  challenges are never read because the override prevents the parent's
  finalize / root path from running.

- **`TopTreeOrchestrator`** + **`TopTreeProvingState`**: self-contained driver
  from checkpoint-root through epoch-root rollup. Takes per-checkpoint
  block-proof promises and pipelines its hint chain against them. Cancellation
  surfaces as `TopTreeCancelledError` so callers can distinguish reorg-driven
  cancel from a genuine proving failure.

- **`EpochProvingContext`** (`epoch-proving-context.ts`): per-epoch shared
  cache for chonk-verifier proofs. Survives sub-tree cancellation so a tx that
  gets reorged out and re-appears in a replacement checkpoint reuses the
  cached proof.

- **`ProvingScheduler`** (`proving-scheduler.ts`): abstract base owning the
  `SerialQueue` deferred-job lifecycle, the `pendingProvingJobs` controller
  list, and a unified `deferredProving<S, T>(state, request, callback,
  isCancelled?)` submit envelope. The minimal `ProvingStateLike` contract is
  just `verifyState()` + `reject(reason)`.

- **`TopTreeProvingScheduler`** (`top-tree-proving-scheduler.ts`): extends
  `ProvingScheduler` and holds the checkpoint-merge, padding, and root-rollup
  drivers (plus tree-walking helpers) shared by both orchestrators. Wraps
  circuit calls via a `wrapCircuitCall` hook (orchestrator overrides for spans;
  top-tree leaves identity) and resolves via an `onRootRollupComplete` hook to
  bridge the two states' differing `resolve` signatures. The per-checkpoint
  root driver stays subclass-specific because input-building flows differ.

- **`EpochProverFactory` interface on `ProverClient`**: new factory methods
  `createEpochProvingContext(epochNumber)`,
  `createCheckpointSubTreeOrchestrator(...)`, and
  `createTopTreeOrchestrator()`. A single shared `BrokerCircuitProverFacade` is
  owned by `ProverClient` and shared across every orchestrator.

## What changes in existing code

- `ProvingOrchestrator` extends `TopTreeProvingScheduler`; the inline
  broker-job submit envelope, queue lifecycle, and the top-tree-section
  drivers are inherited. `cancel()` delegates the queue-recreate + abort-jobs
  logic to `resetSchedulerState(this.cancelJobsOnStop)`. Three internal
  methods (`getOrEnqueueChonkVerifier`, `checkAndEnqueueBaseRollup`,
  `checkAndEnqueueCheckpointRootRollup`) become `protected` so the sub-tree
  can override them; `provingState` and `provingPromise` likewise become
  `protected` so the sub-tree can hook the parent's failure stream onto
  `subTreeResult`. No public API change on `ProvingOrchestrator`.
- `CheckpointProvingState`: gains two read-only accessors used by the
  sub-tree's checkpoint-root override — `getSubTreeOutputProofs()` and
  `getLastArchiveSiblingPath()`. No state changes.
- `ProverClient` keeps `createEpochProver()` exactly as before (each call
  spawns its own `BrokerCircuitProverFacade`); the new factory methods share
  a `getFacade()` set up in `start()` and torn down in `stop()`.

`EpochProver`, `EpochProverManager`, `ServerEpochProver`, `EpochProvingState`,
the integration tests in `orchestrator_*.test.ts`, `bb_prover_full_rollup.test.ts`,
and `stdlib/interfaces/*` are all unchanged from `merge-train/spartan` — the
prover-node and e2e tests continue to build against the existing `EpochProver`
API. Migrating the prover-node onto the new factories (and the deferred-finalize
flow that goes with optimistic proving) is the follow-up PR.

## Test plan

- [x] 261 prover-client tests pass (full `yarn workspace @aztec/prover-client test`).
- [x] `yarn build` clean against current merge-train/spartan (modulo the
  pre-existing `@aztec/sqlite3mc-wasm` issue inherited from baseline).
@PhilWindle PhilWindle force-pushed the pw/optimistic-orchestrator branch from da67473 to 7342efc Compare May 6, 2026 18:09
Copy link
Copy Markdown
Contributor

@spalladino spalladino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The split looks correct, but it took me a while to grok the whole inheritance chain. We now have the CheckpointSubTreeOrchestrator which extends from ProvingOrchestrator to hide some methods, while the ProvingOrchestrator now extends from the TopTreeProvingScheduler which in turn extends from the ProvingScheduler. I'm worried that maybe composition would have been a better fit than inheritance for reusing orchestrator logic, or at least easier to follow.

All that said, the PR looks good, and given there's another PR stacked on top that makes use of this, I'm fine merging. But let's add a README or something for explaining how things get structured, otherwise this is getting really difficult to follow.

Comment on lines +219 to +226
if (this.facade) {
try {
await this.facade.stop();
} catch (err) {
this.log.error('Error stopping shared broker facade', err);
}
this.facade = undefined;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tryStop is your friend here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always focusing on the important stuff, I know, I know

* Callers (`EpochProvingJob`, or unit tests) construct one context per epoch and pass
* it into every sub-tree they create. `stop()` aborts every in-flight chonk job.
*/
export class EpochProvingContext {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rename to something like ChonkProofCache to avoid confusions with EpochProvingState

export abstract class ProvingScheduler {
protected pendingProvingJobs: AbortController[] = [];
protected logger: Logger;
private deferredJobQueue: SerialQueue;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given all jobs from all orchestrators are funneled to a single prover broker, shouldn't the job queue be shared across all orchestrators for the same epoch? Or handled directly at the prover broker facade?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should. Will consider the most appropriate solution.

finalBlobBatchingChallenges,
startBlobAccumulator,
resolve,
reason => reject(this.cancelled ? new TopTreeCancelledError(reason) : new Error(reason)),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Top-tree masks genuine failures as cancellation. top-tree-orchestrator.ts:149 and top-tree-orchestrator.ts:196-201 both gate on the current value of this.cancelled. A real circuit failure can land first (rejecting completionPromise with the real error), and a cancel() arriving in the same tick sets this.cancelled = true before the outer catch runs — turning the genuine error into TopTreeCancelledError. Track what caused the rejection, not the current flag value (e.g. capture a cancelledReason only when cancel() itself triggered the rejection).

(reported by Claude)

outHashHint: OutHashHint,
startBlobAccumulator: BatchedBlobAccumulator,
) {
void this.buildCheckpointRootInputs(blockProofs, cd, outHashHint, startBlobAccumulator).then(inputs => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checkpoint-root input building can hang prove(). top-tree-orchestrator.ts:238 does void this.buildCheckpointRootInputs(...).then(inputs => ...) with no rejection handler. If buildBlobHints, toProofData, or the private-input constructors throw/reject, state.reject() is never invoked → completionPromise never settles → prove() hangs. Add .catch(err => { if (!this.cancelled) state.reject(\...${err}`); })`.

(reported by Claude)

Comment on lines +82 to +86
// Single-checkpoint mini-epoch by construction. The total/challenges supplied to
// `super.startNewEpoch` are never read, because the sub-tree overrides
// `checkAndEnqueueCheckpointRootRollup` to short-circuit before the parent's
// checkpoint-root / finalize machinery would consume them.
super.startNewEpoch(epochContext.epochNumber, 1, FinalBlobBatchingChallenges.empty());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit hacky

Comment on lines +157 to +163
public override startNewEpoch(
_epochNumber: EpochNumber,
_totalNumCheckpoints: number,
_finalBlobBatchingChallenges: FinalBlobBatchingChallenges,
): void {
throw new Error('CheckpointSubTreeOrchestrator starts its epoch in the constructor; do not call startNewEpoch.');
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels hacky as well - maybe the catch is that having this class inheriting from the main orchestrator is not the right abstraction?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are hacky. I'm going to incrementally work towards better abstractions with the end goal of getting rid of the monolithic ProvingOrchestrator.

@PhilWindle
Copy link
Copy Markdown
Collaborator Author

The Linear project https://linear.app/aztec-labs/project/optimistic-proving-3c2a5b387385/overview contains issues for all follow up work.

@PhilWindle PhilWindle merged commit 19ec16d into merge-train/spartan May 13, 2026
17 checks passed
@PhilWindle PhilWindle deleted the pw/optimistic-orchestrator branch May 13, 2026 10:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-full Run all master checks. ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants