
feat: merge-train/spartan-v5#24148

Open
AztecBot wants to merge 12 commits into v5-next from merge-train/spartan-v5

@AztecBot AztecBot commented Jun 17, 2026

Collaborator

BEGIN_COMMIT_OVERRIDE
fix(test): reliably find target proposer in sentinel_status_slash (A-1217) (#24143)
fix(test): wait for full gossip mesh before committee produces (A-1219) (#24149)
fix: init bb.js sync singleton before subsystems start in createAndSync (#24147)
feat(prover-node): capture checkpoint-level proving metrics (#24051)
fix: stabilize scenario invalidation timing (#24128)
fix: set default inbox lag to 2 (#24127)
test(spartan): wait for proposed instead of checkpointed in performTransfers (#24123)
fix(validator): make block-number guard reorg-aware (A-1218) (#24141)
fix(test): pin AZTEC_INBOX_LAG=1 in sandbox compose envs (#24162)
chore(sequencer): downgrade insufficient-txs block log to verbose (#24164)
END_COMMIT_OVERRIDE

…1217) (#24143)

## Problem

The e2e helper `warpToSlotBeforeTargetProposer` (in
`sentinel_status_slash.parallel.test.ts`) intermittently threw `Target
proposer ... not found with sufficient buffer within 20 epochs`,
surfacing as an `e2e_p2p` flake. It's a deterministic helper bug, not
network flakiness.

The search window was derived as `searchStart = currentSlot +
minBufferSlots`, `searchEnd = (currentEpoch + 2) * AZTEC_EPOCH_DURATION
- 1`. With `AZTEC_EPOCH_DURATION = 2` and `minBufferSlots = 2`, the
buffer offset consumes a full epoch, so whenever `currentSlot` is odd
the window collapses to a single, always-odd slot. Because the proposer
for each slot is a different RANDAO-shuffled committee member, probing
only odd slots never examines the even slot of any epoch — where the
1-of-6 target can be the proposer — so the loop exhausts all attempts
and throws.
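
To make the arithmetic concrete, here is a minimal sketch of the old window computation. It assumes `currentEpoch` is `floor(currentSlot / AZTEC_EPOCH_DURATION)`, which this description does not state explicitly, and `oldSearchWindow` is a hypothetical name used only for illustration.

```typescript
// Values taken from the description above.
const AZTEC_EPOCH_DURATION = 2;
const minBufferSlots = 2;

// Hypothetical reconstruction of the old search window, returned as the
// list of slots that would be probed for a given current slot.
function oldSearchWindow(currentSlot: number): number[] {
  // Assumption: epochs are contiguous runs of AZTEC_EPOCH_DURATION slots.
  const currentEpoch = Math.floor(currentSlot / AZTEC_EPOCH_DURATION);
  const searchStart = currentSlot + minBufferSlots;
  const searchEnd = (currentEpoch + 2) * AZTEC_EPOCH_DURATION - 1;

  const slots: number[] = [];
  for (let slot = searchStart; slot <= searchEnd; slot++) {
    slots.push(slot);
  }
  return slots;
}
```

Under these assumptions, `oldSearchWindow(11)` yields `[13]` (a single odd slot), while `oldSearchWindow(12)` yields `[14, 15]`. An odd `currentSlot` therefore never probes an even slot.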

## Log verification

Confirmed against CI run
[`cc8c935bb167ed37`](http://ci.aztec-labs.com/cc8c935bb167ed37):

- All 20 attempts probed a 1-slot window, every probed slot odd: `13,
15, 17, … 51`. Zero hit lines.
- The target `0x90f79b…` was on the committee every epoch
(RANDAO-shuffled into different positions) and attested at slots 11–12 —
reachable, just never at a probed (odd) slot.
- Zero `EpochNotStable` reverts during the scan — the search ran clean,
just structurally blind to even slots.

## Fix

Extract the proposer search into a shared `findUpcomingProposerSlot` in
`e2e_p2p/shared.ts`, parameterized by `minLeadSlots`, and have the
sentinel test use it:

- Scans forward **one slot at a time** from `currentSlot +
minLeadSlots`, so it examines **both epoch parities** (fixing the
odd-only blindness) and finds the RANDAO-shuffled target wherever it
proposes.
- Guarantees the returned slot is **at least `minLeadSlots` ahead**, so
the sentinel can warp to `targetSlot - minLeadSlots` for its settle
buffer with no risk of a backwards warp — the lead comes from the scan
start, not from a per-epoch position constraint.
- Handles `EpochNotStable` by warping one epoch forward and continuing,
keeping the lead.

The sentinel helper becomes a thin wrapper: `findUpcomingProposerSlot({
minLeadSlots: 2 })` then `advanceToSlot(targetSlot - 2)`.
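
The forward scan can be sketched roughly as follows. This is not the code in `e2e_p2p/shared.ts`: the options shape, the synchronous `getProposerAt` lookup, and the `maxSlotsToScan` bound are assumptions made for illustration, and the `EpochNotStable` handling is left out.

```typescript
// Hypothetical lookup: returns the proposer address for a given slot.
type ProposerLookup = (slot: number) => string;

interface FindProposerOptions {
  currentSlot: number;
  target: string;
  minLeadSlots: number;
  maxSlotsToScan: number; // assumed upper bound so the scan terminates
  getProposerAt: ProposerLookup;
}

// Scans one slot at a time, starting minLeadSlots ahead of the current
// slot, so every returned slot is at least minLeadSlots in the future
// and both slot parities are examined.
function findUpcomingProposerSlotSketch(opts: FindProposerOptions): number {
  const start = opts.currentSlot + opts.minLeadSlots;
  for (let slot = start; slot < start + opts.maxSlotsToScan; slot++) {
    if (opts.getProposerAt(slot) === opts.target) {
      return slot;
    }
  }
  throw new Error(
    `Target proposer ${opts.target} not found within ${opts.maxSlotsToScan} slots`,
  );
}
```

Because the lead is enforced by where the scan starts, a caller can warp to `targetSlot - minLeadSlots` without checking for a backwards warp.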

The new helper is kept **separate** from the existing
`advanceToEpochBeforeProposer`, which serves a genuinely different
pattern: its four callers stay one epoch before the target to start
sequencers, then warp to the epoch boundary, and rely on `warmupSlots`
for their warm-up margin. That margin is load-bearing: without it their
proposals serialize past the slot boundary and are rejected as late.
`advanceToEpochBeforeProposer` is unchanged, so its callers are
unaffected.

## Testing

- Build, format, and lint clean; only `shared.ts` and the sentinel test
changed.
- The proposer search now examines both parities and guarantees the lead
by construction.
- **Not yet run:** the full e2e
(`sentinel_status_slash.parallel.test.ts`, ~20 min,
real-time-dependent). The real validation is running it repeatedly to
confirm the proposer is found and the suite's other two tests (which
share the helper) still pass.

Closes A-1217.
PhilWindle and others added 9 commits June 17, 2026 10:50
…9) (#24149)

## Problem

`e2e_p2p_network › should rollup txs from all peers (and add the
validators without cheating)` (in `gossip_network_no_cheat.test.ts`)
intermittently fails with `TimeoutError: Timeout awaiting first
checkpoint published` — the chain never gets a first checkpoint onto L1
within 120s.

## Log analysis

From CI run
[`6d6e74a70fce8826`](http://ci.aztec-labs.com/6d6e74a70fce8826):

- The test did a blind `sleep(8000)` for peer discovery, then waited for
the first checkpoint. On the 2-CPU runner the gossipsub
**proposal/checkpoint meshes were not fully formed** 8s in.
- The first checkpoint attempt (slot 97, proposer validator-3) reached
only **2 of 3** attestations — validators 1 and 2 never received the
slot-97 proposal at all (only validator-4 had a live gossip path). No L1
publish was attempted; the proposer aborted locally on the
attestation-collection timeout.
- Because that checkpoint never landed, the L1-confirmed chain stayed at
genesis, so every later slot rebuilt a *competing* un-checkpointed block
1 (new archive). The blocks **are** pruned (`archiver:l1-sync: Pruning
blocks after block 0 ...`), but the prune lands ~1.5 slots after the
block is built — later than the next proposal arrives. So peers still
holding a not-yet-pruned block 1 rejected the new proposal with
`block_number_already_exists`, never re-executed, never attested —
capping every round at 2/3 forever.

Root cause: the gossip mesh wasn't formed when the committee started
producing, so the first proposal reached only a subset of the committee.
That both starved the first checkpoint of quorum and split the
validators onto competing block-1 forks that never re-converge.

## Fix

Replace the blind `sleep(8000)` with `waitForP2PMeshConnectivity` on the
`block_proposal`, `checkpoint_proposal`, and `checkpoint_attestation`
topics, requiring a **full mesh (N-1 peers per node)** so the first
proposal reaches the whole committee. The first checkpoint then reaches
quorum and lands — after which the chain advances to block 2 and no
competing block 1 is ever built.

Also adds a `minMeshPeerCount` parameter to `waitForP2PMeshConnectivity`
(default `1`, preserving existing callers — the helper otherwise only
requires a single mesh peer per node, which can leave some committee
members unreached at first). Quorum-from-genesis tests pass `N-1` for a
full mesh.
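
The readiness condition can be illustrated with a small predicate. This is a sketch under assumptions, not the actual `waitForP2PMeshConnectivity` implementation: the `NodeMeshState` shape and the `hasMeshConnectivity` name are hypothetical, and the real helper presumably polls live nodes rather than a static snapshot.

```typescript
// Hypothetical snapshot of one node's gossipsub state:
// topic name -> number of peers currently in that topic's mesh.
interface NodeMeshState {
  nodeId: string;
  meshPeers: Record<string, number>;
}

// True when every node has at least minMeshPeerCount mesh peers on every
// listed topic. The default of 1 mirrors the existing single-peer check.
function hasMeshConnectivity(
  nodes: NodeMeshState[],
  topics: string[],
  minMeshPeerCount = 1,
): boolean {
  return nodes.every(node =>
    topics.every(topic => (node.meshPeers[topic] ?? 0) >= minMeshPeerCount),
  );
}
```

For a committee of `N` nodes, passing `N - 1` as `minMeshPeerCount` only returns `true` once every node has every other node in its mesh for each topic.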

This is the test-side fix that addresses the trigger. There is a
separate, more fundamental product-robustness gap — a single missed
checkpoint at the chain tip is unrecoverable because of the
`block_number_already_exists` guard vs. the prune latency — which is
consensus-sensitive and tracked separately (related to A-1218); it is
intentionally **not** addressed here.

## Testing

- Build, format, lint clean; only the test and its helper changed.
- **Not yet run:** the full e2e (`gossip_network_no_cheat.test.ts`,
real-time-dependent, ideally under a 2-CPU constraint). The real
validation is running it repeatedly and confirming the committee reaches
3/3 and the first checkpoint publishes within the gate.

Closes A-1219.
…nc (#24147)

## Problem

An epoch proving job crashed on a prover node in production:

```
Error in EpochSession 9598f7a8-...: Error: Sub-tree for checkpoint 1 failed: Error: First call BarretenbergSync.initSingleton() on @aztec/bb.js module.
    at TopTreeProvingState.rejectionCallback (.../prover-client/dest/orchestrator/top-tree-orchestrator.js:68:210)
```

The error comes from `BarretenbergSync.getSingleton()` throwing when the
WASM singleton has not been initialised. Epoch proving deserializes
compressed Chonk proofs via `ChonkProof.fromBuffer` →
`ChonkProof.fromCompressedBytes` → `getSingleton()`.

The prover node starts monitoring epochs as soon as it is created inside
`AztecNodeService.createAndSync`. The only init on the start path
(`aztec_start_action.ts`) runs later and is guarded on `services.aztec`,
so an `EpochSession` can reach the decompress path before the singleton
is initialised — the race that crashed the job in production.

## Fix

Initialise the singleton at the top of `createAndSync`, before any
subsystem runs. This covers the prover node, validator, sequencer, and
every other `createAndSync` caller (including e2e tests).

The existing `initSingleton()` call in `aztec_start_action` is left in
place — `initSingleton` is idempotent (the singleton promise is cached),
so it remains a harmless guard for the RPC schema-parsing path.
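
The idempotence argument rests on caching the initialisation promise. A minimal sketch of that pattern follows; it is not the `@aztec/bb.js` code, and the class name, `loadCount` counter, and simulated load step are all hypothetical.

```typescript
// Illustrative stand-in for a WASM-backed singleton with a cached init promise.
class WasmSingletonSketch {
  private static instance: WasmSingletonSketch | undefined;
  private static initPromise: Promise<WasmSingletonSketch> | undefined;
  // Counts how many times the (simulated) expensive load was started.
  static loadCount = 0;

  private constructor() {}

  // Safe to call repeatedly: only the first call starts the load, and
  // every later call returns the same cached promise.
  static initSingleton(): Promise<WasmSingletonSketch> {
    if (!WasmSingletonSketch.initPromise) {
      WasmSingletonSketch.loadCount++;
      WasmSingletonSketch.initPromise = Promise.resolve().then(() => {
        WasmSingletonSketch.instance = new WasmSingletonSketch();
        return WasmSingletonSketch.instance;
      });
    }
    return WasmSingletonSketch.initPromise;
  }

  // Synchronous accessor: throws if initialisation has not completed,
  // which is the failure mode seen in the crash above.
  static getSingleton(): WasmSingletonSketch {
    if (!WasmSingletonSketch.instance) {
      throw new Error("First call initSingleton()");
    }
    return WasmSingletonSketch.instance;
  }
}
```

With this shape, awaiting `initSingleton()` once at the top of a startup routine makes every later `getSingleton()` call safe, and a second `initSingleton()` elsewhere costs nothing.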

Fixes A-1243.
## Summary

Follow-up to the `CheckpointStore` + `SessionManager` redesign (#23552),
which left the prover-node metrics covering only per-epoch-session
timing. This adds checkpoint-level visibility and removes a metric that
no longer carried distinct data.

## Changes

**New per-checkpoint metrics** (`telemetry-client`,
`prover-node/src/metrics.ts`, `checkpoint-prover.ts`):
- `aztec.prover_node.checkpoint_blocks` /
`aztec.prover_node.checkpoint_transactions` — histograms of
per-checkpoint sizing, recorded alongside the existing
`checkpoint_processing.duration`.
- `aztec.prover_node.checkpoint_proving.duration` — spans from the start
of checkpoint processing (after tx gathering) through
block-proofs-ready. Recorded when the sub-tree result resolves.
Complements `checkpoint_processing.duration`, which still measures
execution/enqueue only.

**Removed `aztec.prover_node.execution.duration`**:
- In the old `EpochProvingJob`, `executionTime` was snapshotted before
finalize/publish, so it was genuinely distinct from `job_duration`. In
the new `EpochSession` the only timer available spans top-tree proving +
publish, and the call site passed `timer.ms()` for *both* arguments — so
the two metrics reported identical values. Execution timing now lives at
the checkpoint level. `recordProvingJob` drops its redundant
`executionTimeMs` argument.

**Dashboard** (`spartan/metrics/grafana/dashboards/aztec_provers.json`):
- Repointed the former "Execution Duration" heatmap to
`checkpoint_processing.duration`.
- Turned the "Epoch proving" chart into a checkpoint-centric view:
checkpoint proving duration + txs per checkpoint + agents.
- Added "Active Checkpoints" and "Active Epoch Sessions by Kind" panels
for the live-state observable gauges introduced in #23552.

## Test plan

- `telemetry-client` and `prover-node` typecheck clean.
- `yarn workspace @aztec/prover-node test
src/job/checkpoint-prover.test.ts src/job/epoch-session.test.ts` — 53
tests pass.
- Dashboard JSON validated.

> Note: removing `aztec.prover_node.execution.duration` is a breaking
change for any external query/alert on that metric.
## Summary
- set AZTEC_INBOX_LAG=2 for next-scenario deployments
- extend invalidate_blocks.test.ts invalidation wait from 4 to 6 slots
- document the extra margin for proposer pipelining

## Testing
- not run locally; scenario e2e requires a deployed scenario environment
## Summary
- set the baked-in AZTEC_INBOX_LAG default to 2 in network-defaults.yml
- leave generated outputs unchanged; they will be regenerated by the
normal generation flow

## Testing
- not run; YAML default only
…ansfers (#24123)

Backport of #23853 for the v5 Spartan train.

Linear:
https://linear.app/aztec-labs/issue/A-1233/nightly-scenario-reorg-test-times-out-post-disruption-recovery-too

This keeps the reorg scenario transfer loop from waiting for each round
to reach CHECKPOINTED before submitting the next round; PROPOSED is
enough to keep the chain loaded.

Testing: `git diff --check` passed. `yarn build` was not run because the
temporary worktree does not have Yarn node_modules state installed.
## Problem

A validator can permanently miss an attestation during an L2 reorg
because of a prune-vs-proposal race in the block-proposal handler. When
a stale block N is being pruned and the rebuilt block N proposal arrives
in the small window *before this node has applied the prune*, the
handler rejected it with `block_number_already_exists` and never
re-processed it. The rebuilt block then never landed locally, so the
later checkpoint proposal validation polled for it by archive until the
attestation deadline and gave up with `last_block_not_found` — no
attestation.

The guard at `proposal_handler.ts` keyed on block *number only*: it
looked up `getBlockData({ number })` and rejected if any block existed
at that number, without comparing the existing block's archive to the
proposal.

## Log verification

Confirmed against CI run
[`3a45cda231c3747c`](http://ci.aztec-labs.com/3a45cda231c3747c) (failing
test `e2e_p2p_multiple_validators_sentinel › collects attestations for
all validators on a node`):

- `22:19:05.442` — validator-2 logged `Block number 3 already exists,
skipping processing` → `block_number_already_exists` for the rebuilt
slot-11 block 3 (archive `0x1cfa1dec…`).
- validator-2 was lagging: peers pruned at `.299–.399`, validator-2 not
until `.600` (~158ms after rejecting), so it still held stale block 3 at
rejection time.
- `22:20:08.684` — checkpoint validation timed out with
`last_block_not_found` on the same archive → missed attestation.
- The slot-11 block never landed on L1 and was permanently lost on
validator-2, so an L1-sync-only recovery could never have recovered it —
the block must be taken from the proposal in hand.

## Fix

`resolveExistingBlockAtNumber` now compares the existing block's archive
to the proposal:

- **Same archive** → genuine duplicate, still skipped (unchanged
behaviour).
- **Different archive** → a different, un-checkpointed block occupies
this number (e.g. a stale fork being pruned during a reorg). Force L1
sync and wait, bounded by the re-execution deadline, for the local prune
to land, then return `undefined` so the proposal is processed from the
proposal already in hand and is available to attest.
- **Prune doesn't land before the deadline** → fall back to the safe
`block_number_already_exists` rejection.
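
The three branches can be summarised in a simplified, synchronous sketch. It is an illustration under assumptions rather than the code in `proposal_handler.ts`: the callback names are hypothetical, and a bounded poll count stands in for the real wait against the re-execution deadline.

```typescript
type Resolution =
  | { kind: "process" }
  | { kind: "reject"; reason: "block_number_already_exists" };

// Hypothetical callbacks:
//   getLocalArchive: archive of the block stored at the proposal's number,
//                    or undefined if no block is stored there.
//   forceL1Sync:     triggers a sync that may apply a pending prune.
function resolveExistingBlockSketch(
  proposalArchive: string,
  getLocalArchive: () => string | undefined,
  forceL1Sync: () => void,
  maxPolls: number,
): Resolution {
  const existing = getLocalArchive();
  if (existing === undefined) {
    return { kind: "process" }; // nothing occupies this number
  }
  if (existing === proposalArchive) {
    // Same archive: genuine duplicate, skipped as before.
    return { kind: "reject", reason: "block_number_already_exists" };
  }
  // Different archive: wait (bounded) for the stale block to be pruned.
  for (let poll = 0; poll < maxPolls; poll++) {
    forceL1Sync();
    if (getLocalArchive() === undefined) {
      return { kind: "process" }; // prune landed; use the proposal in hand
    }
  }
  // Prune did not land in time: fall back to the safe rejection.
  return { kind: "reject", reason: "block_number_already_exists" };
}
```

The key difference from a number-only guard is the archive comparison, which lets a rebuilt block through once the stale one is gone.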

## Tests

Three new deterministic unit cases in `proposal_handler.test.ts`:

- rejects a genuine duplicate (same archive)
- processes a rebuilt proposal once the stale fork is pruned (regression
— fails against the old number-only guard, passes with the fix)
- falls back to `block_number_already_exists` when the stale fork is not
pruned before the deadline

Full file 29/29 passing; format, lint, and full TypeScript build clean.

Closes A-1218.
## Problem

The network default inbox lag was changed from 1 to 2 (on
`merge-train/spartan-v5`). That regressed every test environment that
bridges fee juice and consumes the resulting L1→L2 message within a
single checkpoint — with lag 2 the message isn't consumable yet, so they
fail with `No L1 to L2 message found` (or, for `e2e_l1_publisher`, no
checkpoint ever consumes a real message):

- `cli-wallet/test/flows/*` — create_account_pay_native,
private_transfer, profile, public_authwit_transfer, shield_and_transfer,
sponsored_create_account_and_mint
- `docs/examples/bootstrap.sh execute`
- `guides/up_quick_start.test.ts`
- `e2e_l1_publisher` block-building suite

## Fix

Pin the inbox lag to 1 for these test environments:

- **Sandbox compose suites** (cli-wallet flows, docs examples,
up_quick_start): set `AZTEC_INBOX_LAG: 1` on the `local-network` service
in both compose files —
`yarn-project/end-to-end/scripts/docker-compose.yml` and
`docs/examples/ts/docker-compose.yml`. These run the node via `node
./dest/bin start --local-network`, so they don't go through the TS
`setup()` helper; `AZTEC_INBOX_LAG` (`ethereum/src/config.ts`) feeds
both the L1 deploy and the node config.
- **`e2e_l1_publisher` block-building suite**: `setup({ inboxLag: 1 })`,
since that suite models the single-checkpoint inbox lag by hand.

Together these cover the inboxLag-2 fallout across the affected test
environments.

Note: the `e2e_l1_publisher` change here is the same fix as #24158; this
PR consolidates it with the sandbox compose-env pins. If this lands,
#24158 can be closed (or vice-versa). The unrelated
`e2e_p2p/rediscovery` P2P flake is not addressed here.

## Testing

- Compose files validated (YAML parses; `AZTEC_INBOX_LAG=1` confirmed
under `local-network.environment`).
- `setup({ inboxLag: 1 })` typechecks (`SetupOptions &
Partial<AztecNodeConfig>`).
- **Not yet run locally:** the suites themselves (require the CI docker
image / sandbox). Labeled `ci-full` to exercise the full suite.
@PhilWindle PhilWindle requested a review from a team as a code owner June 17, 2026 17:33
…4164)

The sequencer logs `Not enough txs to build block ... (got 0 txs but
needs 1)` at `warn` level every time a slot lacks enough txs to build a
block. On low-traffic networks this fires constantly and spams the logs.

This downgrades the log from `warn` to `verbose`. The structured
`logCheckpointEvent('block-build-failed', ...)`, the
`block-tx-count-check-failed` event, and the
`recordBlockProposalFailed('insufficient_txs')` metric are unchanged, so
the condition is still observable — just no longer noisy at warn level.

---
*Created by
[claudebox](https://claudebox.work/v2/sessions/381234e80968cc25) ·
group: `slackbot`*

@ludamad ludamad left a comment

Collaborator

🤖 Auto-approved

@AztecBot AztecBot added this pull request to the merge queue Jun 17, 2026
@AztecBot

Collaborator Author

🤖 Auto-merge enabled after 4 hours of inactivity. This PR will be merged automatically once all checks pass.

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 17, 2026