feat: merge-train/spartan-v5 by AztecBot · Pull Request #24148 · AztecProtocol/aztec-packages

AztecBot · 2026-06-17T10:12:23Z

BEGIN_COMMIT_OVERRIDE
fix(test): reliably find target proposer in sentinel_status_slash (A-1217) (#24143)
fix(test): wait for full gossip mesh before committee produces (A-1219) (#24149)
fix: init bb.js sync singleton before subsystems start in createAndSync (#24147)
feat(prover-node): capture checkpoint-level proving metrics (#24051)
fix: stabilize scenario invalidation timing (#24128)
fix: set default inbox lag to 2 (#24127)
test(spartan): wait for proposed instead of checkpointed in performTransfers (#24123)
fix(validator): make block-number guard reorg-aware (A-1218) (#24141)
fix(test): pin AZTEC_INBOX_LAG=1 in sandbox compose envs (#24162)
chore(sequencer): downgrade insufficient-txs block log to verbose (#24164)
END_COMMIT_OVERRIDE

…1217) (#24143)

## Problem

The e2e helper `warpToSlotBeforeTargetProposer` (in `sentinel_status_slash.parallel.test.ts`) intermittently threw `Target proposer ... not found with sufficient buffer within 20 epochs`, surfacing as an `e2e_p2p` flake. It's a deterministic helper bug, not network flakiness.

The search window was derived as `searchStart = currentSlot + minBufferSlots`, `searchEnd = (currentEpoch + 2) * AZTEC_EPOCH_DURATION - 1`. With `AZTEC_EPOCH_DURATION = 2` and `minBufferSlots = 2`, the buffer offset consumes a full epoch, so whenever `currentSlot` is odd the window collapses to a single, always-odd slot. Because the proposer for each slot is a different RANDAO-shuffled committee member, probing only odd slots never examines the even slot of any epoch, where the 1-of-6 target can be the proposer, so the loop exhausts all attempts and throws.

## Log verification

Confirmed against CI run [`cc8c935bb167ed37`](http://ci.aztec-labs.com/cc8c935bb167ed37):

- All 20 attempts probed a 1-slot window, every probed slot odd: `13, 15, 17, … 51`. Zero hit lines.
- The target `0x90f79b…` was on the committee every epoch (RANDAO-shuffled into different positions) and attested at slots 11–12: reachable, just never at a probed (odd) slot.
- Zero `EpochNotStable` reverts during the scan. The search ran clean, just structurally blind to even slots.

## Fix

Extract the proposer search into a shared `findUpcomingProposerSlot` in `e2e_p2p/shared.ts`, parameterized by `minLeadSlots`, and have the sentinel test use it:

- Scans forward **one slot at a time** from `currentSlot + minLeadSlots`, so it examines **both epoch parities** (fixing the odd-only blindness) and finds the RANDAO-shuffled target wherever it proposes.
- Guarantees the returned slot is **at least `minLeadSlots` ahead**, so the sentinel can warp to `targetSlot - minLeadSlots` for its settle buffer with no risk of a backwards warp. The lead comes from the scan start, not from a per-epoch position constraint.
- Handles `EpochNotStable` by warping one epoch forward and continuing, keeping the lead.

The sentinel helper becomes a thin wrapper: `findUpcomingProposerSlot({ minLeadSlots: 2 })` then `advanceToSlot(targetSlot - 2)`.

Kept **separate** from the existing `advanceToEpochBeforeProposer`, which serves a genuinely different pattern: its four callers stay one epoch before the target to start sequencers, then warp to the epoch boundary, and rely on `warmupSlots` for their warm-up margin (load-bearing: without it their proposals serialize past the slot boundary and are rejected as late). That helper is unchanged, so its callers are unaffected.

## Testing

- Build, format, and lint clean; only `shared.ts` and the sentinel test changed.
- The proposer search now examines both parities and guarantees the lead by construction.
- **Not yet run:** the full e2e (`sentinel_status_slash.parallel.test.ts`, ~20 min, real-time-dependent). The real validation is running it repeatedly to confirm the proposer is found and the suite's other two tests (which share the helper) still pass.

Closes A-1217.
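The window arithmetic from the Problem section of #24143 can be reproduced in a few lines. This is a minimal sketch, not the code in `e2e_p2p/shared.ts`: the function names `oldSearchWindow` and `scanForwardForProposer`, the `isTargetProposer` callback, and the `maxSlots` bound are hypothetical, and `EpochNotStable` handling is omitted.

```typescript
// Constants as quoted in the PR description.
const AZTEC_EPOCH_DURATION = 2;
const minBufferSlots = 2;

// Old window derivation (hypothetical reconstruction from the formulas in the description).
function oldSearchWindow(currentSlot: number): number[] {
  const currentEpoch = Math.floor(currentSlot / AZTEC_EPOCH_DURATION);
  const searchStart = currentSlot + minBufferSlots;
  const searchEnd = (currentEpoch + 2) * AZTEC_EPOCH_DURATION - 1;
  const slots: number[] = [];
  for (let slot = searchStart; slot <= searchEnd; slot++) {
    slots.push(slot);
  }
  return slots;
}

// Forward scan in the spirit of the fix: start at currentSlot + minLeadSlots and
// check every slot, so both parities are examined and the lead holds by construction.
// `isTargetProposer` stands in for the real proposer lookup (an assumption).
function scanForwardForProposer(
  currentSlot: number,
  minLeadSlots: number,
  isTargetProposer: (slot: number) => boolean,
  maxSlots = 40,
): number | undefined {
  const start = currentSlot + minLeadSlots;
  for (let slot = start; slot < start + maxSlots; slot++) {
    if (isTargetProposer(slot)) {
      return slot;
    }
  }
  return undefined;
}
```

With an odd `currentSlot` such as 11, `oldSearchWindow` yields only slot 13, which matches the odd-only probe sequence in the CI log. The forward scan from the same starting point visits 13, 14, 15, and so on, and any slot it returns is at least `minLeadSlots` ahead of `currentSlot`.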

…9) (#24149)

## Problem

`e2e_p2p_network › should rollup txs from all peers (and add the validators without cheating)` (in `gossip_network_no_cheat.test.ts`) intermittently fails with `TimeoutError: Timeout awaiting first checkpoint published`: the chain never gets a first checkpoint onto L1 within 120s.

## Log analysis

From CI run [`6d6e74a70fce8826`](http://ci.aztec-labs.com/6d6e74a70fce8826):

- The test did a blind `sleep(8000)` for peer discovery, then waited for the first checkpoint. On the 2-CPU runner the gossipsub **proposal/checkpoint meshes were not fully formed** 8s in.
- The first checkpoint attempt (slot 97, proposer validator-3) reached only **2 of 3** attestations. Validators 1 and 2 never received the slot-97 proposal at all (only validator-4 had a live gossip path). No L1 publish was attempted; the proposer aborted locally on the attestation-collection timeout.
- Because that checkpoint never landed, the L1-confirmed chain stayed at genesis, so every later slot rebuilt a *competing* un-checkpointed block 1 (new archive). The blocks **are** pruned (`archiver:l1-sync: Pruning blocks after block 0 ...`), but the prune lands ~1.5 slots after the block is built, later than the next proposal arrives. So peers still holding a not-yet-pruned block 1 rejected the new proposal with `block_number_already_exists`, never re-executed, never attested, capping every round at 2/3 forever.

Root cause: the gossip mesh wasn't formed when the committee started producing, so the first proposal reached only a subset of the committee. That both starved the first checkpoint of quorum and split the validators onto competing block-1 forks that never re-converge.

## Fix

Replace the blind `sleep(8000)` with `waitForP2PMeshConnectivity` on the `block_proposal`, `checkpoint_proposal`, and `checkpoint_attestation` topics, requiring a **full mesh (N-1 peers per node)** so the first proposal reaches the whole committee. The first checkpoint then reaches quorum and lands, after which the chain advances to block 2 and no competing block 1 is ever built.

Also adds a `minMeshPeerCount` parameter to `waitForP2PMeshConnectivity` (default `1`, preserving existing callers; the helper otherwise only requires a single mesh peer per node, which can leave some committee members unreached at first). Quorum-from-genesis tests pass `N-1` for a full mesh.

This is the test-side fix that addresses the trigger. There is a separate, more fundamental product-robustness gap: a single missed checkpoint at the chain tip is unrecoverable because of the `block_number_already_exists` guard vs. the prune latency. That gap is consensus-sensitive and tracked separately (related to A-1218); it is intentionally **not** addressed here.

## Testing

- Build, format, lint clean; only the test and its helper changed.
- **Not yet run:** the full e2e (`gossip_network_no_cheat.test.ts`, real-time-dependent, ideally under a 2-CPU constraint). The real validation is running it repeatedly and confirming the committee reaches 3/3 and the first checkpoint publishes within the gate.

Closes A-1219.
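The difference between the default `minMeshPeerCount` of `1` and a full mesh of `N-1` in #24149 can be illustrated with a small predicate. This is a sketch under assumptions: the `MeshSnapshot` shape and the `hasMeshConnectivity` function are hypothetical and do not reflect how `waitForP2PMeshConnectivity` actually queries gossipsub; only the topic names and the `minMeshPeerCount` semantics come from the description.

```typescript
// Hypothetical snapshot: node id -> topic -> number of mesh peers on that topic.
type MeshSnapshot = Record<string, Record<string, number>>;

// Returns true only if every node has at least `minMeshPeerCount` mesh peers
// on every listed topic. A missing topic entry counts as zero peers.
function hasMeshConnectivity(
  snapshot: MeshSnapshot,
  topics: string[],
  minMeshPeerCount = 1,
): boolean {
  for (const perTopic of Object.values(snapshot)) {
    for (const topic of topics) {
      const peers = perTopic[topic] ?? 0;
      if (peers < minMeshPeerCount) {
        return false;
      }
    }
  }
  return true;
}

const topics = ["block_proposal", "checkpoint_proposal", "checkpoint_attestation"];

// Four nodes; node "v3" has only one mesh peer on block_proposal.
const partialMesh: MeshSnapshot = {
  v1: { block_proposal: 3, checkpoint_proposal: 3, checkpoint_attestation: 3 },
  v2: { block_proposal: 3, checkpoint_proposal: 3, checkpoint_attestation: 3 },
  v3: { block_proposal: 1, checkpoint_proposal: 3, checkpoint_attestation: 3 },
  v4: { block_proposal: 3, checkpoint_proposal: 3, checkpoint_attestation: 3 },
};

const nodeCount = Object.keys(partialMesh).length;
const passesDefault = hasMeshConnectivity(partialMesh, topics);
const passesFullMesh = hasMeshConnectivity(partialMesh, topics, nodeCount - 1);
```

In this example the partially formed mesh satisfies the default threshold but not the `N-1` threshold, which is the situation the description blames for the first proposal reaching only part of the committee.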

…nc (#24147)

## Problem

A prover node in production crashed an epoch proving job:

```
Error in EpochSession 9598f7a8-...: Error: Sub-tree for checkpoint 1 failed: Error: First call BarretenbergSync.initSingleton() on @aztec/bb.js module.
    at TopTreeProvingState.rejectionCallback (.../prover-client/dest/orchestrator/top-tree-orchestrator.js:68:210)
```

The error comes from `BarretenbergSync.getSingleton()` throwing when the WASM singleton has not been initialised. Epoch proving deserializes compressed Chonk proofs via `ChonkProof.fromBuffer` → `ChonkProof.fromCompressedBytes` → `getSingleton()`.

The prover node starts monitoring epochs as soon as it is created inside `AztecNodeService.createAndSync`. The only init on the start path (`aztec_start_action.ts`) runs later and is guarded on `services.aztec`, so an `EpochSession` can reach the decompress path before the singleton is initialised. This is the race that crashed the job in production.

## Fix

Initialise the singleton at the top of `createAndSync`, before any subsystem runs. This covers the prover node, validator, sequencer, and every other `createAndSync` caller (including e2e tests).

The existing `initSingleton()` call in `aztec_start_action` is left in place. `initSingleton` is idempotent (the singleton promise is cached), so it remains a harmless guard for the RPC schema-parsing path.

Fixes A-1243.
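The idempotent-init pattern that #24147 relies on can be sketched generically. This is not the `@aztec/bb.js` implementation: `WasmSingletonSketch`, its error message, and `createAndSyncSketch` are hypothetical stand-ins that only illustrate a cached init promise, a getter that throws before initialisation, and the ordering of init before subsystem start.

```typescript
// Hypothetical stand-in for a WASM-backed singleton with a cached init promise.
class WasmSingletonSketch {
  private static instance: WasmSingletonSketch | undefined;
  private static initPromise: Promise<WasmSingletonSketch> | undefined;
  static initCount = 0;

  // Idempotent: the first call starts initialisation, later calls reuse the same promise.
  static initSingleton(): Promise<WasmSingletonSketch> {
    if (!WasmSingletonSketch.initPromise) {
      WasmSingletonSketch.initCount++;
      WasmSingletonSketch.initPromise = Promise.resolve().then(() => {
        const created = new WasmSingletonSketch();
        WasmSingletonSketch.instance = created;
        return created;
      });
    }
    return WasmSingletonSketch.initPromise;
  }

  // Synchronous accessor: throws if initialisation has not completed yet.
  static getSingleton(): WasmSingletonSketch {
    if (!WasmSingletonSketch.instance) {
      throw new Error("singleton not initialised; call initSingleton() first");
    }
    return WasmSingletonSketch.instance;
  }
}

// Ordering used by the fix, in sketch form: await init before any subsystem starts,
// so no subsystem can reach getSingleton() ahead of initialisation.
async function createAndSyncSketch(startSubsystems: () => Promise<void>): Promise<void> {
  await WasmSingletonSketch.initSingleton();
  await startSubsystems();
}
```

Because the promise is cached, a second `initSingleton()` call elsewhere on the start path does no extra work, which is why leaving the existing call in place is harmless.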

## Summary

Follow-up to the `CheckpointStore` + `SessionManager` redesign (#23552), which left the prover-node metrics covering only per-epoch-session timing. This adds checkpoint-level visibility and removes a metric that no longer carried distinct data.

## Changes

**New per-checkpoint metrics** (`telemetry-client`, `prover-node/src/metrics.ts`, `checkpoint-prover.ts`):

- `aztec.prover_node.checkpoint_blocks` / `aztec.prover_node.checkpoint_transactions`: histograms of per-checkpoint sizing, recorded alongside the existing `checkpoint_processing.duration`.
- `aztec.prover_node.checkpoint_proving.duration`: spans from the start of checkpoint processing (after tx gathering) through block-proofs-ready. Recorded when the sub-tree result resolves. Complements `checkpoint_processing.duration`, which still measures execution/enqueue only.

**Removed `aztec.prover_node.execution.duration`**:

- In the old `EpochProvingJob`, `executionTime` was snapshotted before finalize/publish, so it was genuinely distinct from `job_duration`. In the new `EpochSession` the only timer available spans top-tree proving + publish, and the call site passed `timer.ms()` for *both* arguments, so the two metrics reported identical values. Execution timing now lives at the checkpoint level. `recordProvingJob` drops its redundant `executionTimeMs` argument.

**Dashboard** (`spartan/metrics/grafana/dashboards/aztec_provers.json`):

- Repointed the former "Execution Duration" heatmap to `checkpoint_processing.duration`.
- Turned the "Epoch proving" chart into a checkpoint-centric view: checkpoint proving duration + txs per checkpoint + agents.
- Added "Active Checkpoints" and "Active Epoch Sessions by Kind" panels for the live-state observable gauges introduced in #23552.

## Test plan

- `telemetry-client` and `prover-node` typecheck clean.
- `yarn workspace @aztec/prover-node test src/job/checkpoint-prover.test.ts src/job/epoch-session.test.ts`: 53 tests pass.
- Dashboard JSON validated.

> Note: removing `aztec.prover_node.execution.duration` is a breaking change for any external query/alert on that metric.

## Summary

- set `AZTEC_INBOX_LAG=2` for next-scenario deployments
- extend the `invalidate_blocks.test.ts` invalidation wait from 4 to 6 slots
- document the extra margin for proposer pipelining

## Testing

- not run locally; scenario e2e requires a deployed scenario environment

## Summary

- set the baked-in `AZTEC_INBOX_LAG` default to 2 in `network-defaults.yml`
- leave generated outputs unchanged; they will be regenerated by the normal generation flow

## Testing

- not run; YAML default only

…ansfers (#24123)

Backport of #23853 for the v5 Spartan train.

Linear: https://linear.app/aztec-labs/issue/A-1233/nightly-scenario-reorg-test-times-out-post-disruption-recovery-too

This keeps the reorg scenario transfer loop from waiting for each round to reach CHECKPOINTED before submitting the next round; PROPOSED is enough to keep the chain loaded.

Testing: `git diff --check` passed. `yarn build` was not run because the temporary worktree does not have Yarn node_modules state installed.

## Problem

A validator can permanently miss an attestation during an L2 reorg because of a prune-vs-proposal race in the block-proposal handler. When a stale block N is being pruned and the rebuilt block N proposal arrives in the small window *before this node has applied the prune*, the handler rejected it with `block_number_already_exists` and never re-processed it. The rebuilt block then never landed locally, so the later checkpoint proposal validation polled for it by archive until the attestation deadline and gave up with `last_block_not_found`, producing no attestation.

The guard at `proposal_handler.ts` keyed on block *number only*: it looked up `getBlockData({ number })` and rejected if any block existed at that number, without comparing the existing block's archive to the proposal.

## Log verification

Confirmed against CI run [`3a45cda231c3747c`](http://ci.aztec-labs.com/3a45cda231c3747c) (failing test `e2e_p2p_multiple_validators_sentinel › collects attestations for all validators on a node`):

- `22:19:05.442`: validator-2 logged `Block number 3 already exists, skipping processing` → `block_number_already_exists` for the rebuilt slot-11 block 3 (archive `0x1cfa1dec…`).
- validator-2 was lagging: peers pruned at `.299–.399`, validator-2 not until `.600` (~158ms after rejecting), so it still held stale block 3 at rejection time.
- `22:20:08.684`: checkpoint validation timed out with `last_block_not_found` on the same archive → missed attestation.
- The slot-11 block never landed on L1 and was permanently lost on validator-2, so an L1-sync-only recovery could never have recovered it. The block must be taken from the proposal in hand.

## Fix

`resolveExistingBlockAtNumber` now compares the existing block's archive to the proposal:

- **Same archive** → genuine duplicate, still skipped (unchanged behaviour).
- **Different archive** → a different, un-checkpointed block occupies this number (e.g. a stale fork being pruned during a reorg). Force L1 sync and wait, bounded by the re-execution deadline, for the local prune to land, then return `undefined` so the proposal is processed from the proposal already in hand and is available to attest.
- **Prune doesn't land before the deadline** → fall back to the safe `block_number_already_exists` rejection.

## Tests

Three new deterministic unit cases in `proposal_handler.test.ts`:

- rejects a genuine duplicate (same archive)
- processes a rebuilt proposal once the stale fork is pruned (regression: fails against the old number-only guard, passes with the fix)
- falls back to `block_number_already_exists` when the stale fork is not pruned before the deadline

Full file 29/29 passing; format, lint, and full TypeScript build clean.

Closes A-1218.
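The three branches of the reorg-aware guard in #24141 can be summarised as a small decision function. This is a simplified sketch, not the code in `proposal_handler.ts`: `resolveExistingBlockSketch`, the `Outcome` type, and the synchronous `pruneLandsBeforeDeadline` callback are hypothetical, with the callback standing in for "force L1 sync and wait, bounded by the re-execution deadline".

```typescript
// Possible outcomes in this sketch. The real function returns `undefined` to signal
// "process the proposal"; a string is used here for readability.
type Outcome = "process" | "skip_duplicate" | "block_number_already_exists";

function resolveExistingBlockSketch(
  existingArchive: string | undefined, // archive of the local block at this number, if any
  proposalArchive: string,             // archive carried by the incoming proposal
  pruneLandsBeforeDeadline: () => boolean, // stand-in for the bounded sync-and-wait
): Outcome {
  // No local block at this number: nothing to resolve.
  if (existingArchive === undefined) {
    return "process";
  }
  // Same archive: a genuine duplicate of a block already held.
  if (existingArchive === proposalArchive) {
    return "skip_duplicate";
  }
  // Different archive: a stale fork occupies this number. Wait for the prune;
  // if it lands in time, process the proposal already in hand.
  if (pruneLandsBeforeDeadline()) {
    return "process";
  }
  // Prune did not land before the deadline: fall back to the safe rejection.
  return "block_number_already_exists";
}
```

The old number-only guard corresponds to rejecting whenever `existingArchive` is defined, which is why the rebuilt block 3 in the CI log was dropped even though its archive differed from the stale one.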

## Problem

The network default inbox lag was changed from 1 to 2 (on `merge-train/spartan-v5`). That regressed every test environment that bridges fee juice and consumes the resulting L1→L2 message within a single checkpoint. With lag 2 the message isn't consumable yet, so they fail with `No L1 to L2 message found` (or, for `e2e_l1_publisher`, no checkpoint ever consumes a real message):

- `cli-wallet/test/flows/*`: create_account_pay_native, private_transfer, profile, public_authwit_transfer, shield_and_transfer, sponsored_create_account_and_mint
- `docs/examples/bootstrap.sh execute`
- `guides/up_quick_start.test.ts`
- `e2e_l1_publisher` block-building suite

## Fix

Pin the inbox lag to 1 for these test environments:

- **Sandbox compose suites** (cli-wallet flows, docs examples, up_quick_start): set `AZTEC_INBOX_LAG: 1` on the `local-network` service in both compose files, `yarn-project/end-to-end/scripts/docker-compose.yml` and `docs/examples/ts/docker-compose.yml`. These run the node via `node ./dest/bin start --local-network`, so they don't go through the TS `setup()` helper; `AZTEC_INBOX_LAG` (`ethereum/src/config.ts`) feeds both the L1 deploy and the node config.
- **`e2e_l1_publisher` block-building suite**: `setup({ inboxLag: 1 })`, since that suite models the single-checkpoint inbox lag by hand.

Together these cover the inboxLag-2 fallout across the affected test environments.

Note: the `e2e_l1_publisher` change here is the same fix as #24158; this PR consolidates it with the sandbox compose-env pins. If this lands, #24158 can be closed (or vice-versa). The unrelated `e2e_p2p/rediscovery` P2P flake is not addressed here.

## Testing

- Compose files validated (YAML parses; `AZTEC_INBOX_LAG=1` confirmed under `local-network.environment`).
- `setup({ inboxLag: 1 })` typechecks (`SetupOptions & Partial<AztecNodeConfig>`).
- **Not yet run locally:** the suites themselves (require the CI docker image / sandbox). Labeled `ci-full` to exercise the full suite.

…4164)

The sequencer logs `Not enough txs to build block ... (got 0 txs but needs 1)` at `warn` level every time a slot lacks enough txs to build a block. On low-traffic networks this fires constantly and spams the logs.

This downgrades the log from `warn` to `verbose`. The structured `logCheckpointEvent('block-build-failed', ...)`, the `block-tx-count-check-failed` event, and the `recordBlockProposalFailed('insufficient_txs')` metric are unchanged, so the condition is still observable, just no longer noisy at warn level.

---

*Created by [claudebox](https://claudebox.work/v2/sessions/381234e80968cc25) · group: `slackbot`*

ludamad

🤖 Auto-approved

AztecBot · 2026-06-17T22:32:45Z

🤖 Auto-merge enabled after 4 hours of inactivity. This PR will be merged automatically once all checks pass.

AztecBot added ci-no-squash ci-full-no-test-cache private-port-next labels Jun 17, 2026

PhilWindle and others added 9 commits June 17, 2026 10:50

fix: set default inbox lag to 2 (#24127)

a819515


Merge branch 'v5-next' into merge-train/spartan-v5

1d05278

PhilWindle requested a review from a team as a code owner June 17, 2026 17:33

ludamad approved these changes Jun 17, 2026


AztecBot added this pull request to the merge queue Jun 17, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 17, 2026

Merge branch 'v5-next' into merge-train/spartan-v5

e3226c8

AztecBot mentioned this pull request Jun 17, 2026

fix(boxes): serialize workspace builds #24171

Draft


AztecBot wants to merge 12 commits into v5-next from merge-train/spartan-v5


3 participants
