feat: merge-train/spartan-v5 #24148
Open
AztecBot wants to merge 12 commits into
…1217) (#24143)
## Problem
The e2e helper `warpToSlotBeforeTargetProposer` (in `sentinel_status_slash.parallel.test.ts`) intermittently threw `Target proposer ... not found with sufficient buffer within 20 epochs`, surfacing as an `e2e_p2p` flake. It's a deterministic helper bug, not network flakiness.
The search window was derived as `searchStart = currentSlot + minBufferSlots`, `searchEnd = (currentEpoch + 2) * AZTEC_EPOCH_DURATION - 1`. With `AZTEC_EPOCH_DURATION = 2` and `minBufferSlots = 2`, the buffer offset consumes a full epoch, so whenever `currentSlot` is odd the window collapses to a single, always-odd slot. Because the proposer for each slot is a different RANDAO-shuffled committee member, probing only odd slots never examines the even slot of any epoch, where the 1-of-6 target can be the proposer, so the loop exhausts all attempts and throws.
## Log verification
Confirmed against CI run [`cc8c935bb167ed37`](http://ci.aztec-labs.com/cc8c935bb167ed37):
- All 20 attempts probed a 1-slot window, every probed slot odd: `13, 15, 17, … 51`. Zero hit lines.
- The target `0x90f79b…` was on the committee every epoch (RANDAO-shuffled into different positions) and attested at slots 11–12: reachable, just never at a probed (odd) slot.
- Zero `EpochNotStable` reverts during the scan; the search ran clean, just structurally blind to even slots.
## Fix
Extract the proposer search into a shared `findUpcomingProposerSlot` in `e2e_p2p/shared.ts`, parameterized by `minLeadSlots`, and have the sentinel test use it:
- Scans forward **one slot at a time** from `currentSlot + minLeadSlots`, so it examines **both epoch parities** (fixing the odd-only blindness) and finds the RANDAO-shuffled target wherever it proposes.
- Guarantees the returned slot is **at least `minLeadSlots` ahead**, so the sentinel can warp to `targetSlot - minLeadSlots` for its settle buffer with no risk of a backwards warp. The lead comes from the scan start, not from a per-epoch position constraint.
- Handles `EpochNotStable` by warping one epoch forward and continuing, keeping the lead.
The sentinel helper becomes a thin wrapper: `findUpcomingProposerSlot({ minLeadSlots: 2 })` then `advanceToSlot(targetSlot - 2)`.
Kept **separate** from the existing `advanceToEpochBeforeProposer`, which serves a genuinely different pattern: its four callers stay one epoch before the target to start sequencers, then warp to the epoch boundary, and rely on `warmupSlots` for their warm-up margin (load-bearing: without it their proposals serialize past the slot boundary and are rejected as late). That helper is unchanged, so its callers are unaffected.
## Testing
- Build, format, and lint clean; only `shared.ts` and the sentinel test changed.
- The proposer search now examines both parities and guarantees the lead by construction.
- **Not yet run:** the full e2e (`sentinel_status_slash.parallel.test.ts`, ~20 min, real-time-dependent). The real validation is running it repeatedly to confirm the proposer is found and the suite's other two tests (which share the helper) still pass.
Closes A-1217.
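The window arithmetic and the forward scan from the sentinel fix can be sketched as below. This is a simplified synchronous model, not the helper in `e2e_p2p/shared.ts`: `oldSearchWindow`, `scanForProposerSlot`, the `proposerAt` callback, and `maxSlots` are hypothetical names, and the real helper additionally warps time and handles `EpochNotStable`.

```typescript
// Hypothetical model of the proposer search; all names are illustrative only.
const AZTEC_EPOCH_DURATION = 2;

// Old derivation: with a 2-slot buffer and 2-slot epochs, the buffer
// offset can consume the entire search window.
function oldSearchWindow(currentSlot: number, minBufferSlots: number): [number, number] {
  const currentEpoch = Math.floor(currentSlot / AZTEC_EPOCH_DURATION);
  const searchStart = currentSlot + minBufferSlots;
  const searchEnd = (currentEpoch + 2) * AZTEC_EPOCH_DURATION - 1;
  return [searchStart, searchEnd];
}

// New approach: scan forward one slot at a time, starting minLeadSlots ahead.
function scanForProposerSlot(
  currentSlot: number,
  minLeadSlots: number,
  maxSlots: number,
  target: string,
  proposerAt: (slot: number) => string, // assumed lookup of the proposer for a slot
): number | undefined {
  const start = currentSlot + minLeadSlots;
  for (let slot = start; slot < start + maxSlots; slot++) {
    if (proposerAt(slot) === target) {
      return slot; // always >= currentSlot + minLeadSlots by construction
    }
  }
  return undefined; // target does not propose within the scanned range
}
```

With `currentSlot = 11`, `oldSearchWindow(11, 2)` yields the single odd slot `[13, 13]`, whereas `currentSlot = 10` yields `[12, 13]`. The forward scan visits 13, 14, 15, 16, … regardless of parity.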
…9) (#24149)
## Problem
`e2e_p2p_network › should rollup txs from all peers (and add the validators without cheating)` (in `gossip_network_no_cheat.test.ts`) intermittently fails with `TimeoutError: Timeout awaiting first checkpoint published`: the chain never gets a first checkpoint onto L1 within 120s.
## Log analysis
From CI run [`6d6e74a70fce8826`](http://ci.aztec-labs.com/6d6e74a70fce8826):
- The test did a blind `sleep(8000)` for peer discovery, then waited for the first checkpoint. On the 2-CPU runner the gossipsub **proposal/checkpoint meshes were not fully formed** 8s in.
- The first checkpoint attempt (slot 97, proposer validator-3) reached only **2 of 3** attestations: validators 1 and 2 never received the slot-97 proposal at all (only validator-4 had a live gossip path). No L1 publish was attempted; the proposer aborted locally on the attestation-collection timeout.
- Because that checkpoint never landed, the L1-confirmed chain stayed at genesis, so every later slot rebuilt a *competing* un-checkpointed block 1 (new archive). The blocks **are** pruned (`archiver:l1-sync: Pruning blocks after block 0 ...`), but the prune lands ~1.5 slots after the block is built, which is later than the next proposal arrives. So peers still holding a not-yet-pruned block 1 rejected the new proposal with `block_number_already_exists`, never re-executed, and never attested, capping every round at 2/3 forever.
Root cause: the gossip mesh wasn't formed when the committee started producing, so the first proposal reached only a subset of the committee. That both starved the first checkpoint of quorum and split the validators onto competing block-1 forks that never re-converge.
## Fix
Replace the blind `sleep(8000)` with `waitForP2PMeshConnectivity` on the `block_proposal`, `checkpoint_proposal`, and `checkpoint_attestation` topics, requiring a **full mesh (N-1 peers per node)** so the first proposal reaches the whole committee. The first checkpoint then reaches quorum and lands, after which the chain advances to block 2 and no competing block 1 is ever built.
Also adds a `minMeshPeerCount` parameter to `waitForP2PMeshConnectivity` (default `1`, preserving existing callers; the helper otherwise only requires a single mesh peer per node, which can leave some committee members unreached at first). Quorum-from-genesis tests pass `N-1` for a full mesh.
This is the test-side fix that addresses the trigger. There is a separate, more fundamental product-robustness gap (a single missed checkpoint at the chain tip is unrecoverable because of the `block_number_already_exists` guard vs. the prune latency), which is consensus-sensitive and tracked separately (related to A-1218); it is intentionally **not** addressed here.
## Testing
- Build, format, lint clean; only the test and its helper changed.
- **Not yet run:** the full e2e (`gossip_network_no_cheat.test.ts`, real-time-dependent, ideally under a 2-CPU constraint). The real validation is running it repeatedly and confirming the committee reaches 3/3 and the first checkpoint publishes within the gate.
Closes A-1219.
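The difference between the default single-peer requirement and the full-mesh (`N-1`) requirement can be illustrated with a small predicate. This is a sketch under assumptions: `meshIsReady`, `waitUntil`, and the `MeshPeerCount` callback are hypothetical, and the actual `waitForP2PMeshConnectivity` signature may differ; only the topic names and the `minMeshPeerCount` default of `1` come from the description above.

```typescript
// Hypothetical sketch; not the real waitForP2PMeshConnectivity helper.
// Assumed callback: number of mesh peers a given node has on a given topic.
type MeshPeerCount = (nodeIndex: number, topic: string) => number;

const TOPICS = ['block_proposal', 'checkpoint_proposal', 'checkpoint_attestation'];

// True only when every node has at least minMeshPeerCount mesh peers on every topic.
function meshIsReady(
  nodeCount: number,
  topics: string[],
  meshPeerCount: MeshPeerCount,
  minMeshPeerCount = 1,
): boolean {
  for (let node = 0; node < nodeCount; node++) {
    for (const topic of topics) {
      if (meshPeerCount(node, topic) < minMeshPeerCount) {
        return false;
      }
    }
  }
  return true;
}

// Bounded polling loop replacing a blind sleep; pause() is injected so the
// caller decides how long to wait between polls.
async function waitUntil(
  check: () => boolean,
  maxPolls: number,
  pause: () => Promise<void>,
): Promise<void> {
  for (let i = 0; i < maxPolls; i++) {
    if (check()) {
      return;
    }
    await pause();
  }
  throw new Error('Timeout waiting for mesh connectivity');
}
```

A quorum-from-genesis test with `N` nodes would call `meshIsReady(N, TOPICS, counts, N - 1)`, while existing callers keep the default of one mesh peer per node.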
…nc (#24147)
## Problem
A prover node in production crashed an epoch proving job:
```
Error in EpochSession 9598f7a8-...: Error: Sub-tree for checkpoint 1 failed: Error: First call BarretenbergSync.initSingleton() on @aztec/bb.js module.
    at TopTreeProvingState.rejectionCallback (.../prover-client/dest/orchestrator/top-tree-orchestrator.js:68:210)
```
The error comes from `BarretenbergSync.getSingleton()` throwing when the WASM singleton has not been initialised. Epoch proving deserializes compressed Chonk proofs via `ChonkProof.fromBuffer` → `ChonkProof.fromCompressedBytes` → `getSingleton()`.
The prover node starts monitoring epochs as soon as it is created inside `AztecNodeService.createAndSync`. The only init on the start path (`aztec_start_action.ts`) runs later and is guarded on `services.aztec`, so an `EpochSession` can reach the decompress path before the singleton is initialised: the race that crashed the job in production.
## Fix
Initialise the singleton at the top of `createAndSync`, before any subsystem runs. This covers the prover node, validator, sequencer, and every other `createAndSync` caller (including e2e tests).
The existing `initSingleton()` call in `aztec_start_action` is left in place; `initSingleton` is idempotent (the singleton promise is cached), so it remains a harmless guard for the RPC schema-parsing path.
Fixes A-1243.
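The cached-promise idempotency and the init-before-subsystems ordering can be sketched as follows. `SyncSingletonSketch` and `createAndSyncSketch` are hypothetical stand-ins written for illustration; they are not the `@aztec/bb.js` or `AztecNodeService` implementations, and the asynchronous WASM load is simulated with a resolved promise.

```typescript
// Hypothetical stand-in for a WASM-backed singleton; not the @aztec/bb.js code.
class SyncSingletonSketch {
  private static instance: SyncSingletonSketch | undefined;
  private static initPromise: Promise<SyncSingletonSketch> | undefined;
  static loadCount = 0; // counts how many times a load was actually started

  // Idempotent: the first call starts the load, later calls reuse the cached promise.
  static initSingleton(): Promise<SyncSingletonSketch> {
    if (!SyncSingletonSketch.initPromise) {
      SyncSingletonSketch.loadCount++;
      SyncSingletonSketch.initPromise = Promise.resolve().then(() => {
        // Stand-in for the asynchronous WASM load.
        const created = new SyncSingletonSketch();
        SyncSingletonSketch.instance = created;
        return created;
      });
    }
    return SyncSingletonSketch.initPromise;
  }

  // Synchronous accessor: throws if initialisation has not completed yet.
  static getSingleton(): SyncSingletonSketch {
    if (!SyncSingletonSketch.instance) {
      throw new Error('First call initSingleton()');
    }
    return SyncSingletonSketch.instance;
  }
}

// Hypothetical entry point: await the init before starting anything that
// might reach getSingleton(), which closes the race described above.
async function createAndSyncSketch(startSubsystems: () => void): Promise<void> {
  await SyncSingletonSketch.initSingleton();
  startSubsystems();
}
```

Because the promise is cached, a second `initSingleton()` call elsewhere on the start path costs nothing and cannot trigger a second load.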
## Summary
Follow-up to the `CheckpointStore` + `SessionManager` redesign (#23552), which left the prover-node metrics covering only per-epoch-session timing. This adds checkpoint-level visibility and removes a metric that no longer carried distinct data.
## Changes
**New per-checkpoint metrics** (`telemetry-client`, `prover-node/src/metrics.ts`, `checkpoint-prover.ts`):
- `aztec.prover_node.checkpoint_blocks` / `aztec.prover_node.checkpoint_transactions`: histograms of per-checkpoint sizing, recorded alongside the existing `checkpoint_processing.duration`.
- `aztec.prover_node.checkpoint_proving.duration`: spans from the start of checkpoint processing (after tx gathering) through block-proofs-ready. Recorded when the sub-tree result resolves. Complements `checkpoint_processing.duration`, which still measures execution/enqueue only.
**Removed `aztec.prover_node.execution.duration`**:
- In the old `EpochProvingJob`, `executionTime` was snapshotted before finalize/publish, so it was genuinely distinct from `job_duration`. In the new `EpochSession` the only timer available spans top-tree proving + publish, and the call site passed `timer.ms()` for *both* arguments, so the two metrics reported identical values. Execution timing now lives at the checkpoint level. `recordProvingJob` drops its redundant `executionTimeMs` argument.
**Dashboard** (`spartan/metrics/grafana/dashboards/aztec_provers.json`):
- Repointed the former "Execution Duration" heatmap to `checkpoint_processing.duration`.
- Turned the "Epoch proving" chart into a checkpoint-centric view: checkpoint proving duration + txs per checkpoint + agents.
- Added "Active Checkpoints" and "Active Epoch Sessions by Kind" panels for the live-state observable gauges introduced in #23552.
## Test plan
- `telemetry-client` and `prover-node` typecheck clean.
- `yarn workspace @aztec/prover-node test src/job/checkpoint-prover.test.ts src/job/epoch-session.test.ts`: 53 tests pass.
- Dashboard JSON validated.
> Note: removing `aztec.prover_node.execution.duration` is a breaking change for any external query/alert on that metric.
## Summary
- set AZTEC_INBOX_LAG=2 for next-scenario deployments
- extend invalidate_blocks.test.ts invalidation wait from 4 to 6 slots
- document the extra margin for proposer pipelining
## Testing
- not run locally; scenario e2e requires a deployed scenario environment
## Summary
- set the baked-in AZTEC_INBOX_LAG default to 2 in network-defaults.yml
- leave generated outputs unchanged; they will be regenerated by the normal generation flow
## Testing
- not run; YAML default only
…ansfers (#24123)
Backport of #23853 for the v5 Spartan train.
Linear: https://linear.app/aztec-labs/issue/A-1233/nightly-scenario-reorg-test-times-out-post-disruption-recovery-too
This keeps the reorg scenario transfer loop from waiting for each round to reach CHECKPOINTED before submitting the next round; PROPOSED is enough to keep the chain loaded.
Testing: `git diff --check` passed. `yarn build` was not run because the temporary worktree does not have Yarn node_modules state installed.
## Problem
A validator can permanently miss an attestation during an L2 reorg
because of a prune-vs-proposal race in the block-proposal handler. When
a stale block N is being pruned and the rebuilt block N proposal arrives
in the small window *before this node has applied the prune*, the
handler rejected it with `block_number_already_exists` and never
re-processed it. The rebuilt block then never landed locally, so the
later checkpoint proposal validation polled for it by archive until the
attestation deadline and gave up with `last_block_not_found` — no
attestation.
The guard in `proposal_handler.ts` keyed on block *number only*: it
looked up `getBlockData({ number })` and rejected if any block existed
at that number, without comparing the existing block's archive to the
proposal.
## Log verification
Confirmed against CI run
[`3a45cda231c3747c`](http://ci.aztec-labs.com/3a45cda231c3747c) (failing
test `e2e_p2p_multiple_validators_sentinel › collects attestations for
all validators on a node`):
- `22:19:05.442` — validator-2 logged `Block number 3 already exists,
skipping processing` → `block_number_already_exists` for the rebuilt
slot-11 block 3 (archive `0x1cfa1dec…`).
- validator-2 was lagging: peers pruned at `.299–.399`, validator-2 not
until `.600` (~158ms after rejecting), so it still held stale block 3 at
rejection time.
- `22:20:08.684` — checkpoint validation timed out with
`last_block_not_found` on the same archive → missed attestation.
- The slot-11 block never landed on L1 and was permanently lost on
validator-2, so an L1-sync-only recovery could never have recovered it —
the block must be taken from the proposal in hand.
## Fix
`resolveExistingBlockAtNumber` now compares the existing block's archive
to the proposal:
- **Same archive** → genuine duplicate, still skipped (unchanged
behaviour).
- **Different archive** → a different, un-checkpointed block occupies
this number (e.g. a stale fork being pruned during a reorg). Force L1
sync and wait, bounded by the re-execution deadline, for the local prune
to land, then return `undefined` so the rebuilt block is processed from
the proposal already in hand and is available to attest to.
- **Prune doesn't land before the deadline** → fall back to the safe
`block_number_already_exists` rejection.
## Tests
Three new deterministic unit cases in `proposal_handler.test.ts`:
- rejects a genuine duplicate (same archive)
- processes a rebuilt proposal once the stale fork is pruned (regression
— fails against the old number-only guard, passes with the fix)
- falls back to `block_number_already_exists` when the stale fork is not
pruned before the deadline
Full file 29/29 passing; format, lint, and full TypeScript build clean.
Closes A-1218.
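The three outcomes of the reorg-aware guard can be sketched as below. This is an illustrative model only: `resolveExistingBlockSketch`, `LocalChainSketch`, the string return values, and the injected `now`/`pause` functions are hypothetical, and the real `resolveExistingBlockAtNumber` returns `undefined` to signal "process the proposal" rather than a string.

```typescript
// Hypothetical sketch of a reorg-aware block-number guard; names are illustrative.
type GuardDecision = 'process' | 'duplicate' | 'block_number_already_exists';

interface LocalChainSketch {
  // Archive root of the local block at this number, or undefined if none exists.
  getArchiveAt(blockNumber: number): string | undefined;
  // Triggers an L1 sync, which may apply a pending prune.
  forceL1Sync(): Promise<void>;
}

async function resolveExistingBlockSketch(
  chain: LocalChainSketch,
  proposal: { blockNumber: number; archive: string },
  deadlineMs: number, // assumed to stand in for the re-execution deadline
  now: () => number,
  pause: () => Promise<void>,
): Promise<GuardDecision> {
  const existing = chain.getArchiveAt(proposal.blockNumber);
  if (existing === undefined) {
    return 'process'; // nothing occupies this number
  }
  if (existing === proposal.archive) {
    return 'duplicate'; // same archive: genuine duplicate, skip as before
  }
  // Different archive: a stale fork may be mid-prune. Sync and wait, bounded by the deadline.
  while (now() < deadlineMs) {
    await chain.forceL1Sync();
    if (chain.getArchiveAt(proposal.blockNumber) === undefined) {
      return 'process'; // prune landed: use the proposal already in hand
    }
    await pause();
  }
  return 'block_number_already_exists'; // prune did not land in time: safe rejection
}
```

Comparing archives rather than numbers is what distinguishes a replayed proposal from a rebuilt block at the same height; the deadline bound keeps the fallback identical to the old behaviour when the prune is slow.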
## Problem
The network default inbox lag was changed from 1 to 2 (on
`merge-train/spartan-v5`). That regressed every test environment that
bridges fee juice and consumes the resulting L1→L2 message within a
single checkpoint — with lag 2 the message isn't consumable yet, so they
fail with `No L1 to L2 message found` (or, for `e2e_l1_publisher`, no
checkpoint ever consumes a real message):
- `cli-wallet/test/flows/*` — create_account_pay_native,
private_transfer, profile, public_authwit_transfer, shield_and_transfer,
sponsored_create_account_and_mint
- `docs/examples/bootstrap.sh execute`
- `guides/up_quick_start.test.ts`
- `e2e_l1_publisher` block-building suite
## Fix
Pin the inbox lag to 1 for these test environments:
- **Sandbox compose suites** (cli-wallet flows, docs examples,
up_quick_start): set `AZTEC_INBOX_LAG: 1` on the `local-network` service
in both compose files —
`yarn-project/end-to-end/scripts/docker-compose.yml` and
`docs/examples/ts/docker-compose.yml`. These run the node via `node
./dest/bin start --local-network`, so they don't go through the TS
`setup()` helper; `AZTEC_INBOX_LAG` (`ethereum/src/config.ts`) feeds
both the L1 deploy and the node config.
- **`e2e_l1_publisher` block-building suite**: `setup({ inboxLag: 1 })`,
since that suite models the single-checkpoint inbox lag by hand.
Together these cover the inboxLag-2 fallout across the affected test
environments.
Note: the `e2e_l1_publisher` change here is the same fix as #24158; this
PR consolidates it with the sandbox compose-env pins. If this lands,
#24158 can be closed (or vice-versa). The unrelated
`e2e_p2p/rediscovery` P2P flake is not addressed here.
## Testing
- Compose files validated (YAML parses; `AZTEC_INBOX_LAG=1` confirmed
under `local-network.environment`).
- `setup({ inboxLag: 1 })` typechecks (`SetupOptions &
Partial<AztecNodeConfig>`).
- **Not yet run locally:** the suites themselves (require the CI docker
image / sandbox). Labeled `ci-full` to exercise the full suite.
…4164)
The sequencer logs `Not enough txs to build block ... (got 0 txs but needs 1)` at `warn` level every time a slot lacks enough txs to build a block. On low-traffic networks this fires constantly and spams the logs.
This downgrades the log from `warn` to `verbose`. The structured `logCheckpointEvent('block-build-failed', ...)`, the `block-tx-count-check-failed` event, and the `recordBlockProposalFailed('insufficient_txs')` metric are unchanged, so the condition is still observable, just no longer noisy at warn level.
---
*Created by [claudebox](https://claudebox.work/v2/sessions/381234e80968cc25) · group: `slackbot`*
🤖 Auto-merge enabled after 4 hours of inactivity. This PR will be merged automatically once all checks pass.
BEGIN_COMMIT_OVERRIDE
fix(test): reliably find target proposer in sentinel_status_slash (A-1217) (#24143)
fix(test): wait for full gossip mesh before committee produces (A-1219) (#24149)
fix: init bb.js sync singleton before subsystems start in createAndSync (#24147)
feat(prover-node): capture checkpoint-level proving metrics (#24051)
fix: stabilize scenario invalidation timing (#24128)
fix: set default inbox lag to 2 (#24127)
test(spartan): wait for proposed instead of checkpointed in performTransfers (#24123)
fix(validator): make block-number guard reorg-aware (A-1218) (#24141)
fix(test): pin AZTEC_INBOX_LAG=1 in sandbox compose envs (#24162)
chore(sequencer): downgrade insufficient-txs block log to verbose (#24164)
END_COMMIT_OVERRIDE