feat: merge-train/spartan-v5#23975
Merged
…ted block (#23967)

## Motivation

On a staging HA validator, an archiver orphan prune triggered a storm of thousands of duplicate `chain-checkpointed` events out of p2p's `L2BlockStream`.

The local tips store keeps a block number per cursor and derives the checkpoint number from a `block -> checkpoint` map that is only populated for the last block of each confirmed checkpoint. `handleChainPruned` moved the `checkpointed` and `proposedCheckpoint` cursors to the prune target unconditionally. That target is the new tip of the *proposed* chain and can be an uncheckpointed block with no mapping (in the incident, a block belonging to a not-yet-confirmed checkpoint, sitting ahead of the checkpointed tip). `getCheckpointId` then resolved that cursor to checkpoint zero, the stream computed `nextCheckpointToEmit = 0 + 1 = 1`, and it replayed every checkpoint from 1 up to the source tip.

## Approach

A prune is a rollback, so checkpoint-bearing cursors may only move *backward*. `handleChainPruned` now sets `proposed` unconditionally and clamps `checkpointed`/`proposedCheckpoint`/`proven` to the prune target only when they are ahead of it (generalizing the guard `proven` already had). In the incident the checkpointed cursor is left untouched and keeps resolving to its real checkpoint, so there is nothing to replay.

Surfacing a missing mapping *loudly* (rather than silently reporting checkpoint zero) is intentionally deferred to a stacked follow-up: doing it safely requires per-tip checkpoint ids so the store can fail loudly on genuine corruption without bricking legitimate skipped-history prunes (which would otherwise throw on the next `getL2Tips`). This PR is the minimal, behavior-preserving fix for the storm.

## Changes

- **stdlib**: `handleChainPruned` clamps checkpoint-bearing cursors backward only instead of forcing them onto the (possibly uncheckpointed) prune target.
- **stdlib (tests)**: store-level regression that a prune to an uncheckpointed block ahead of the checkpointed tip leaves the cursor intact and resolving to its real checkpoint; stream-level regression asserting no `chain-checkpointed` replay after such a prune.

Fixes A-1167
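The backward-only clamp from the Approach section can be sketched as follows. This is a minimal illustration, not the stdlib implementation: the `Tips` shape and the `handleChainPrunedSketch` name are hypothetical, and the real store tracks more than a bare block number per cursor.

```typescript
// Hypothetical, simplified tips store: one block number per cursor.
type Tips = {
  proposed: number;
  proposedCheckpoint: number;
  checkpointed: number;
  proven: number;
};

// A prune is a rollback: `proposed` always moves to the prune target,
// while checkpoint-bearing cursors may only move backward.
function handleChainPrunedSketch(tips: Tips, pruneTarget: number): Tips {
  const clampBackward = (current: number): number => Math.min(current, pruneTarget);
  return {
    proposed: pruneTarget,
    proposedCheckpoint: clampBackward(tips.proposedCheckpoint),
    checkpointed: clampBackward(tips.checkpointed),
    proven: clampBackward(tips.proven),
  };
}

// Incident shape: the prune target (10) is an uncheckpointed block ahead of the
// checkpointed tip (8), so the checkpointed cursor must stay where it is.
const afterPrune = handleChainPrunedSketch(
  { proposed: 12, proposedCheckpoint: 8, checkpointed: 8, proven: 4 },
  10,
);
console.log(afterPrune);
```

With the old unconditional assignment, `checkpointed` would have become 10 in this example, a block with no checkpoint mapping.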
…ration (#23821)

## Motivation

The production sequencer kept two legacy escape hatches: `enforceTimeTable=false` (unbounded block building with no deadlines) and `blockDurationMs=undefined` (single-block-per-slot mode). Both existed only to satisfy tests and the sandbox, complicated the timetable with dead branches, and let most e2e tests run under timing that production never uses.

## Approach

The timetable now always enforces sub-slot deadlines with a concrete `blockDurationMs` (required config, default 3000 ms). The only non-enforced path left is the `AutomineSequencer`: the local sandbox switches to it, which makes `AnvilTestWatcher` deletable, since it was already inert across the e2e suite (every e2e path runs anvil in interval mining). The e2e PIPELINING preset flips to enforced real timing at exactly 2 blocks per slot.

## Fee prediction changes

`sequencer-client/src/global_variable_builder/fee_provider.ts` now treats the current L1 fee snapshot as part of the predicted-fee set exposed by the node. `getPredictedMinFees()` returns the current minimum fees first, followed by the future-slot predictions from the fee predictor. This matters for local automine because a freshly mined checkpoint can make the node's current minimum fees higher than the predictor's future samples; including the current value prevents clients from quoting below what tx validation will accept.

`getCurrentMinFees()` also bypasses viem's cached block number by calling `getBlockNumber({ cacheTime: 0 })`, so automine fee snapshots observe newly mined L1 blocks immediately instead of reusing a stale L1 block number.

## Public simulation global variables

`aztec-node/src/aztec-node/server.ts` no longer calls `buildGlobalVariables()` for `simulatePublicCalls`. Instead it computes a simulation target slot from local chain state and calls `buildCheckpointGlobalVariables()` with that slot, then combines those checkpoint globals with the requested simulation block number.

The target slot is the max of:

- the slot corresponding to the next L1 timestamp from the epoch cache,
- the slot after the proposed checkpoint loaded from the block source,
- the latest proposed block slot when it is ahead of the proposed checkpoint.

This keeps public simulation aligned with the same checkpoint-global construction used for block building, while avoiding a rollup-contract lookup on the simulation path.

## Automine sequencer: proving and recovery

The `AutomineSequencer` now drives epoch proving for the sandbox without a prover. `aztecNode.prove(upToCheckpoint?)` synthetically settles each checkpointed epoch (computing its out-hash, writing the outbox root and proven checkpoint to L1 via cheat codes, then calling `markAsProven`), with partial-epoch support: it can settle a prefix up to a requested checkpoint. Because that settlement mines no L1 block, it then mines one empty block so the archiver (which short-circuits its L1 sync while the block hash is unchanged) observes the new proven tip immediately, mirroring a real epoch proof landing an L1 verify tx.

An optional auto-settle loop (`AUTOMINE_ENABLE_PROVE_EPOCH`, on by default for the local network) proves epochs as they close, replacing the standalone `EpochTestSettler` that used to race the build loop.

On a wrong-slot or failed publish the sequencer returns the failed block's txs to the pool and retries the build rather than reorging L1, using a new `archiver.discardProposedCheckpointsAfter` to drop proposed-but-uncheckpointed blocks during recovery.

## Changes

- **stdlib**: `blockDuration` required in the checkpoint timing model, single-block branches removed; `DEFAULT_BLOCK_DURATION_MS = 3000` as single source of truth.
- **sequencer-client**: `SequencerTimetable` loses the `enforce` field; `canStartNextBlock` always returns a concrete deadline; config drops `enforceTimeTable`.
- **automine sequencer**: `prove(upToCheckpoint?)` synthetically settles epochs (partial-epoch capable) and an optional auto-settle loop (`AUTOMINE_ENABLE_PROVE_EPOCH`) advances the proven tip, mining an empty L1 block so the archiver observes it; failed/wrong-slot publishes return txs to the pool and retry instead of reorging L1 (new `archiver.discardProposedCheckpointsAfter`).
- **fees**: node fee predictions now include the current minimum fees as the first entry before future-slot predictions, and current fee snapshots bypass cached L1 block numbers so local automine fee quotes see freshly mined checkpoints.
- **p2p**: `blockDurationMs` required in proposal/attestation validators, the pipelining window, and gossipsub topic scoring.
- **foundation / aztec-node / validator-client**: `SEQ_ENFORCE_TIME_TABLE` env var removed; dead `blockDurationMs === undefined` branches simplified.
- **aztec (sandbox)**: local network runs the `AutomineSequencer` by default, including p2p-enabled local runs; local-network is not a mode for connecting to an existing Aztec network. `AnvilTestWatcher` deleted, and the standalone `EpochTestSettler` is replaced by the AutomineSequencer auto-settle loop.
- **end-to-end (tests)**: PIPELINING preset sets `blockDurationMs: 3000` (2 blocks/slot); ~30 `enforceTimeTable` call sites removed; watcher manual-proving call sites replaced with `cheatCodes.rollup.markAsProven()`; bench given explicit slot headroom; block-building regression test fixed for a min-txs remainder livelock that enforced deadlines exposed.
- **docs**: sequencer-client and gossipsub READMEs updated to the always-enforced model; sandbox/local-network docs updated to describe automine block production and the removal of `SEQ_ENFORCE_TIME_TABLE` for v5.

Breaking: `SEQ_ENFORCE_TIME_TABLE` is removed and `SEQ_BLOCK_DURATION_MS` now defaults to 3000 ms (previously unset, meaning single block per slot). The `SEQ_ENFORCE_TIME_TABLE` wiring in `spartan/` (deploy script, terraform, env files) is removed as well.

Fixes A-1148
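The simulation target slot and the ordering of predicted fees described in this PR can be sketched as below. The input shapes, the `Fees` type, and both function names are assumptions for illustration; the real node reads these values from the epoch cache, the block source, and the fee predictor.

```typescript
// Hypothetical inputs; slot numbers are plain numbers here for simplicity.
interface SlotInputs {
  nextL1TimestampSlot: number; // slot of the next L1 timestamp (epoch cache)
  proposedCheckpointSlot: number; // slot of the proposed checkpoint (block source)
  latestProposedBlockSlot: number; // slot of the latest proposed block
}

// The target slot is the max of the three candidates. Taking the latest
// proposed block slot unconditionally is equivalent to the "when it is ahead
// of the proposed checkpoint" rule: if it is not ahead, the slot after the
// proposed checkpoint already dominates it.
function simulationTargetSlot(inputs: SlotInputs): number {
  return Math.max(
    inputs.nextL1TimestampSlot,
    inputs.proposedCheckpointSlot + 1,
    inputs.latestProposedBlockSlot,
  );
}

// Hypothetical fee shape.
interface Fees {
  feePerDaGas: number;
  feePerL2Gas: number;
}

// Current minimum fees come first, followed by future-slot predictions.
function predictedMinFees(current: Fees, future: Fees[]): Fees[] {
  return [current, ...future];
}

console.log(
  simulationTargetSlot({ nextL1TimestampSlot: 10, proposedCheckpointSlot: 11, latestProposedBlockSlot: 9 }),
);
```

Putting the current fees first means a client that takes the maximum over the returned set can never quote below what tx validation currently accepts.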
PhilWindle
approved these changes
Jun 9, 2026
…g test (#23976)

## Problem

CI on `merge-train/spartan-v5` (commit 609014a, [log](http://ci.aztec-labs.com/1781038207169152)) failed in the `yarn-project` build at the `yarn tsgo -b --emitDeclarationOnly` step:

```
end-to-end/src/e2e_epochs/epochs_optimistic_proving.parallel.test.ts(222,9): error TS2353: Object literal may only specify known properties, and 'enforceTimeTable' does not exist in type 'EpochsTestOpts'.
```

(also at lines 366, 473, 558, 646, 780)

## Root cause

PR #23821 (*always enforce timetable with concrete block duration*) made timetable enforcement unconditional and removed the `enforceTimeTable` option from `EpochsTestOpts`/`SetupOptions`, deleting ~30 `enforceTimeTable: true` call sites. `epochs_optimistic_proving.parallel.test.ts` landed on the v5 line separately and still passed `enforceTimeTable: true` at six sites, so it no longer type-checks.

## Fix

- Remove the six now-invalid `enforceTimeTable: true` properties. Each call site already sets a concrete `blockDurationMs: 8000`, so the change is behavior-preserving (the same deletion the PR applied to every other e2e test). Verified in CI: `yarn-project` now type-checks and `epochs_optimistic_proving.parallel.test.ts` passes.
- Temporarily `it.skip` the HA test `should distribute work across multiple HA nodes` in `composed/ha/e2e_ha_full.test.ts`, which fails under the always-enforced timetable (sequencer misses slots: `BlockOrCheckpointSlotExpiredError` / `no_blocks_built` / `Fork not found`). Skipped at Santiago's request, to be re-enabled after the HA block-building interaction with #23821 is fixed.
Fixes A-1157. Addresses security advisory GHSA-h4vv-85x5-6hmh.
## Problem
Peer scores decay toward zero (~0.9/minute). A peer whose score crossed
the ban threshold (`MIN_SCORE_BEFORE_BAN = -100`) recovered to a healthy
score within approximately 1 hour.
## Fix
Record a ban when a peer's score drops below the ban threshold and hold
it for a configurable duration (default 24h). Bans are kept **in memory
only** and are cleared on restart — a restarted node re-learns bad peers
from their behaviour rather than carrying stale bans across runs.
- **`PeerScoring`** records `{ score, expiry }` in an in-memory
`bannedPeers` map, so `getScore`/`getScoreState` stay **synchronous**
(required by the peer-manager hot paths, including a `.sort()`
comparator).
- While banned, `getScore` returns the **ban score** regardless of
decay, so a peer cannot recover its way out of the ban early — even
after `decayAllScores` cleans up the decayed live-score entry. Once the
ban expires it is lifted and the live (decayed) score takes over,
letting the peer recover.
- Expired bans are lifted lazily on the next score query
(`getActiveBanScore`) and swept proactively each heartbeat via
`pruneExpiredBans()`, so a banned peer that disconnects and is never
queried again does not linger in the map.
## Configuration
New `P2P_PEER_BAN_DURATION_SECONDS` (config field
`peerBanDurationSeconds`), default `86400` (24h). Registered in
`foundation` env vars and the P2P config mappings.
## Tests
`peer_scoring.test.ts` covers the full lifecycle, asserting both score
**values** and states:
- ban floor held through banned → recovered-live-score → expiry
transitions;
- `peerBanDurationSeconds` drives the window (60s case);
- the advisory regression: after decay cleans up the live-score entry,
`getScore` still returns the `-150` ban score (not `0`), keeping the
peer Banned;
- a peer whose previous ban has expired can be re-banned;
- `pruneExpiredBans` removes expired bans but keeps active ones.
Existing `peer_manager` and `peer_scoring` suites pass; the previously
existing "returns to Healthy after improving score" assertion was
updated to reflect the new intended behaviour (a banned peer stays
banned for the full window).
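The ban lifecycle described above (record on threshold crossing, hold the ban score regardless of decay, lift lazily on query, sweep on heartbeat) can be sketched as follows. This is an illustration only: the class name, the injected clock, and `dropLiveScore` (a stand-in for the decay cleanup) are hypothetical, and the real `PeerScoring` has more state.

```typescript
const MIN_SCORE_BEFORE_BAN = -100; // threshold quoted in the Problem section

interface Ban {
  score: number; // score at the time of the ban, reported while banned
  expiry: number; // ms timestamp after which the ban is lifted
}

class PeerBanSketch {
  private liveScores = new Map<string, number>();
  private bannedPeers = new Map<string, Ban>(); // in memory only
  private banDurationMs: number;
  private now: () => number;

  constructor(banDurationMs: number, now: () => number = () => Date.now()) {
    this.banDurationMs = banDurationMs;
    this.now = now;
  }

  updateScore(peerId: string, delta: number): void {
    const score = (this.liveScores.get(peerId) ?? 0) + delta;
    this.liveScores.set(peerId, score);
    if (score < MIN_SCORE_BEFORE_BAN && this.getActiveBanScore(peerId) === undefined) {
      this.bannedPeers.set(peerId, { score, expiry: this.now() + this.banDurationMs });
    }
  }

  // Stand-in for decay removing a live-score entry that has decayed to ~0.
  dropLiveScore(peerId: string): void {
    this.liveScores.delete(peerId);
  }

  // Lazy lift: an expired ban is removed on the next query.
  private getActiveBanScore(peerId: string): number | undefined {
    const ban = this.bannedPeers.get(peerId);
    if (ban === undefined) return undefined;
    if (this.now() >= ban.expiry) {
      this.bannedPeers.delete(peerId);
      return undefined;
    }
    return ban.score;
  }

  // Synchronous: the ban score wins while the ban is active.
  getScore(peerId: string): number {
    return this.getActiveBanScore(peerId) ?? this.liveScores.get(peerId) ?? 0;
  }

  // Proactive sweep, intended to run once per heartbeat.
  pruneExpiredBans(): void {
    const now = this.now();
    this.bannedPeers.forEach((ban, peerId) => {
      if (now >= ban.expiry) this.bannedPeers.delete(peerId);
    });
  }

  bannedCount(): number {
    return this.bannedPeers.size;
  }
}
```

Keeping the ban in a separate map from the live score is what makes the advisory regression case work: the live entry can disappear through decay without taking the ban with it.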
## Summary

- Rename JSON-RPC namespaces from `node_*` / `nodeAdmin_*` / `nodeDebug_*` to `aztec_*`, `aztecAdmin_*`, and `aztecDebug_*` on both client schemas and `aztec start` server registration.
- Stop mounting the standalone `p2p_*` namespace; add `getPeers` and `getCheckpointAttestationsForSlot` to `AztecNode` and delegate from the node server.
- Update node API reference generation, regenerated operator docs, e2e forward-compatibility config, and a migration note under TBD.

Fixes [A-1010](https://linear.app/aztec-labs/issue/A-1010/archiver-flag-silently-ignored-when-combined-with-node-archiver-rpc)
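For callers that build JSON-RPC method names by hand, the namespace rename amounts to a prefix substitution. The helper below is a hypothetical client-side sketch, not part of the PR; the method names in the usage line are illustrative placeholders.

```typescript
// Old namespace -> new namespace, as listed in the summary above.
const NAMESPACE_RENAMES: Record<string, string> = {
  node: "aztec",
  nodeAdmin: "aztecAdmin",
  nodeDebug: "aztecDebug",
};

// Rewrites "<namespace>_<method>" when the namespace was renamed.
// `p2p_*` is deliberately left alone: that namespace is no longer mounted,
// and its replacement methods live on the node API rather than behind a
// simple prefix swap.
function migrateMethodName(method: string): string {
  const separator = method.indexOf("_");
  if (separator < 0) return method;
  const namespace = method.slice(0, separator);
  if (!Object.prototype.hasOwnProperty.call(NAMESPACE_RENAMES, namespace)) return method;
  return NAMESPACE_RENAMES[namespace] + method.slice(separator);
}

console.log(migrateMethodName("node_exampleMethod")); // "aztec_exampleMethod"
```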
…all clock (#23978)

The prepare-for-slot loop in the p2p client was **not** synced with the `L2BlockStream` events, meaning the `unprotect` call could trigger before the `blocks-added` handler flagged the txs as mined. One solution could have been to add a new `slot-synced` event to the block stream, but it is easier to remove the polling and unprotect slots when a block proposal that protected the txs fails. As a safeguard, we still call unprotect based on the slot numbers of mined blocks.

## Problem

The tx pool **protects** txs referenced by an in-flight block proposal: on gossip receipt, the proposal's txs are keyed to its slot and removed from the pending indices, so the local builder cannot re-select them and eviction cannot drop them while the proposal may still land. `prepareForSlot(S)` releases protections from slots before `S`, revalidates the txs, and returns them to pending.

Release was driven by a wall-clock slot monitor polling the epoch cache every tick. Three problems:

- **Race against mined-marking.** The monitor can fire after a proposal's checkpoint lands on L1 but before the block stream delivers `blocks-added`. The just-landed txs are unprotected into pending, where eviction or nullifier-conflict resolution can delete them; when the block then syncs there is nothing left to mark mined, and after a later reorg `handlePrunedBlocks` has nothing to restore, so the tx is lost to the pool.
- **Clock dependency.** The epoch cache is wall-clock derived and explicitly depends on system clock sync; unprotection correctness should depend on observed chain state instead.
- **Pipelining blind spot.** Gossiped proposals carry future target slots during proposer pipelining, so wall-clock release frees them late. (The old target-slot branch that tried to address this read `proposedCheckpoint` from the local tips store, where it can never lead the checkpointed tip; it was removed in #23968 as dead code.)

## Change

The protection lifecycle becomes fully event-driven and the slot monitor is deleted:

- **Protect** on gossip receipt of a block proposal (unchanged).
- **Release on local validation failure**: a proposal that fails validation immediately releases the protections it created. Only entries still keyed to that proposal's slot are released, so a tx also referenced by a live proposal at another slot stays protected.
- **Resolve via chain events** (unchanged): `blocks-added` marks txs mined, superseding protection; `chain-pruned` un-mines them back to pending. Proposals that landed as proposed-but-unconfirmed checkpoints and are later orphaned are fully handled by this existing lifecycle.
- **Collect silent deaths via synced block slots**: `prepareForSlot` now runs inside the `blocks-added` handler with the slot of the last synced block, after mined-marking. Any block landing at slot S releases protections from all earlier slots, covering proposals that never reached L1 at all (no quorum, proposer crash, dropped L1 tx), for which no chain event can ever fire. Because it is ordered after mined-marking in the same handler, the unprotect-before-mined race is impossible by construction.
- **Proposers are unaffected**: the sequencer already calls `prepareForSlot(targetSlot)` directly before building, which remains the one legitimate ahead-of-chain preparation.

## Trade-off accepted

During a multi-slot stall with no blocks landing anywhere, non-proposer pools retain protections until the first block lands (the wall clock used to release them mid-stall). There is no user-visible cost, since there is no chain to include txs in during a stall, and the memory held is bounded and self-healing on the first `blocks-added`. Proposers self-serve via the direct sequencer call. Protections are in-memory only, so a restart clears them.

Fixes A-1173
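The event-driven lifecycle can be sketched with a toy pool. Everything here is a simplified assumption (string tx hashes, a single slot per protected tx, no revalidation on release, hypothetical class and method names apart from `prepareForSlot`); it is meant only to show why ordering mined-marking before `prepareForSlot` closes the race.

```typescript
class ProtectionSketch {
  pending = new Set<string>();
  mined = new Set<string>();
  private protectedBySlot = new Map<string, number>(); // txHash -> slot

  // Gossip receipt of a proposal: key its txs to the slot, hide them from pending.
  protect(txs: string[], slot: number): void {
    txs.forEach(tx => {
      this.pending.delete(tx);
      this.protectedBySlot.set(tx, slot);
    });
  }

  // Failed local validation: release only entries still keyed to that slot.
  releaseForFailedProposal(txs: string[], slot: number): void {
    txs.forEach(tx => {
      if (this.protectedBySlot.get(tx) === slot) {
        this.protectedBySlot.delete(tx);
        this.pending.add(tx);
      }
    });
  }

  // blocks-added: mark mined FIRST, then release stale protections.
  handleBlocksAdded(minedTxs: string[], lastBlockSlot: number): void {
    minedTxs.forEach(tx => {
      this.protectedBySlot.delete(tx);
      this.pending.delete(tx);
      this.mined.add(tx);
    });
    this.prepareForSlot(lastBlockSlot);
  }

  // Release protections from slots before `slot` back to pending.
  prepareForSlot(slot: number): void {
    this.protectedBySlot.forEach((protectedSlot, tx) => {
      if (protectedSlot < slot) {
        this.protectedBySlot.delete(tx);
        this.pending.add(tx);
      }
    });
  }

  isProtected(tx: string): boolean {
    return this.protectedBySlot.has(tx);
  }
}

const pool = new ProtectionSketch();
pool.protect(["a", "b"], 5); // proposal at slot 5 references a and b
pool.handleBlocksAdded(["a"], 6); // a lands in a block at slot 6; b's proposal died silently
console.log(pool.mined.has("a"), pool.pending.has("b")); // true true
```

Because the mined txs are removed from the protection map before `prepareForSlot` runs, a just-landed tx can never be returned to pending by the release step.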
## Motivation

#23967 stopped the checkpoint-replay storm at its source (a prune must not advance checkpoint-bearing cursors onto an uncheckpointed block). This PR makes the local tips store never *silently* report checkpoint zero for a real block, removes the machinery and degenerate fields that made the silent path possible, and closes the gaps found in review: a store-upgrade path that would brick p2p nodes, a transient proven-tip overshoot on prune, and world-state fabricating checkpoint ids it never had.

Fixes A-1174

## Commits

**1. `fix(p2p): resolve checkpoint tips from stored ids and fail loudly on corruption`**

- `chain-proven`/`chain-finalized` carry their `CheckpointId`; the store records a checkpoint id per cursor. `getCheckpointId` resolves genesis → stored id → (back-compat) `block→checkpoint` mapping → **throw**, so a real-block cursor with no resolvable checkpoint fails loudly instead of degrading to a replay.
- A `proven`/`finalized` cursor can legitimately lead the locally-checkpointed frontier (batch lag / `startingBlock`); carrying the id lets it resolve without a local mapping, so the throw only fires on genuine corruption (never on a legitimate skipped-history prune).

**2. `refactor(p2p): drop the block→checkpoint mapping and checkpoint-object store`**

- With per-cursor stored ids, the `block→checkpoint` mapping and the checkpoint-object store are dead weight (they existed only to feed `getCheckpointId`). Both backing maps are removed from the KV and memory stores.
- `chain-pruned` now carries `checkpointed: L2TipId` (the source's confirmed checkpointed tip) instead of a bare `CheckpointId`. The prune handler clamps any checkpoint-bearing cursor that leads that tip down to it, always landing on a block with a recorded id: no genesis-clamp, no mapping lookup. `handleChainFinalized` collapses to pruning block hashes below the lowest live tip.

**3. `refactor(stdlib): drop proposedCheckpoint from the local L2 tips provider`**

- `proposedCheckpoint` is degenerate in the local stores (always equal to `checkpointed`) and no consumer reads it: the one reader, `p2p_client`'s `maybeCallPrepareForSlot`, was a dead branch (always false). `L2TipsProvider.getL2Tips` now returns `LocalL2Tips = Omit<L2Tips, 'proposedCheckpoint'>`; the local stores stop maintaining the cursor; the dead p2p branch is removed.
- `L2BlockSource.getL2Tips` is a separate interface and keeps the full `L2Tips` with `proposedCheckpoint`, which the sequencer and node still read from the archiver.

**4. `fix(p2p): bump p2p store schema version for the per-tip checkpoint id layout`**

- An upgraded p2p store would keep its old tips with an empty per-cursor id map, making `getL2Tips` throw on every read with no way to self-heal, failing `P2PClient.start()` outright. Bumping the store schema version (3 → 4) resets the store on upgrade instead.

**5. `fix(p2p): clamp the proven tip to the source proven tip on prune`**

- `chain-pruned` carried only the checkpointed tip, so a prune that rolled back the proven chain clamped the local proven cursor onto the (higher) checkpointed tip, transiently reporting unproven blocks as proven until the corrective `chain-proven` event landed at the end of the same sync iteration. The event now also carries `proven: L2TipId` and each cursor clamps to its own source tip.

**6. `refactor(world-state): stop fabricating checkpoint ids in the world-state tips provider`**

- The stream's local-data-provider contract demanded full checkpoint-bearing tips, forcing world-state to hardcode genesis checkpoint ids and a `checkpointed` tip at block zero (violating `finalized ≤ checkpointed`) that nothing ever read. The provider contract is narrowed to `LocalChainTips` (the tips the stream actually consumes, with `checkpointed` required only when emitting checkpoint events), and the stream fails loudly if checkpoint emission is enabled without one. World-state now reports only the proposed/proven/finalized blocks it genuinely tracks; `L2TipsProvider` keeps the full `LocalL2Tips` shape for the p2p/pxe tips stores.

**7. `fix(prover-node): consume the reshaped chain-pruned and checkpoint-bearing events`**

- The CheckpointStore redesign landed a prover-node consumer of `chain-pruned` (written against the old event shape) on the merge train while this branch was in flight. The prune handler now reads `event.checkpointed.checkpoint` (same semantics as the old field) and the test event constructors are updated to the new shapes.

**8. `refactor(world-state): report unresolvable tip hashes as undefined instead of fabricating them`**

- World-state's `getL2Tips` fabricated values for hashes it could not resolve: an empty string for the proven/finalized tips and a non-null assertion for the proposed tip. The honest missing case is real (a proven tip ahead of the synced range has no archive leaf to resolve from), so `LocalChainTips` now carries block ids with an optional hash and world-state returns resolved hashes as-is. Local tips stores are unaffected (`LocalL2Tips` with required hashes remains assignable), and the stream reads only block numbers from local tips.

## Breaking / operational notes

- `PXE_DATA_SCHEMA_VERSION` bumped: existing PXE DBs are wiped and resync on first open.
- p2p store schema version bumped: existing p2p data dirs (including the tx pool) are wiped and resync on upgrade.
- `L2BlockStreamEvent` shape changes (internal API): `chain-proven`/`chain-finalized` gain `checkpoint`; `chain-pruned` carries `checkpointed`/`proven` tips instead of a bare `checkpoint`.

## Deferred

`maybeCallPrepareForSlot`'s target-slot preparation (prepare the target slot when a proposed checkpoint exists) never worked, because the local store cannot represent a proposed checkpoint ahead of the checkpointed tip. This is handled in #23978.
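The fail-loud resolution order for a cursor's checkpoint (after commit 2 removes the back-compat mapping) can be sketched as: genesis, then the stored per-cursor id, then throw. The types, the function name, and the assumption that genesis is block 0 with checkpoint 0 are illustrative, not taken from the store's code.

```typescript
type Cursor = "checkpointed" | "proven" | "finalized";

// Hypothetical: block number per cursor plus the checkpoint id recorded for it.
interface TipsStoreSketch {
  blocks: Record<Cursor, number>;
  checkpointIds: Partial<Record<Cursor, number>>;
}

const GENESIS_BLOCK = 0; // assumption for this sketch
const GENESIS_CHECKPOINT = 0; // assumption for this sketch

function getCheckpointIdSketch(store: TipsStoreSketch, cursor: Cursor): number {
  // 1. A cursor still at genesis resolves to the genesis checkpoint.
  if (store.blocks[cursor] === GENESIS_BLOCK) return GENESIS_CHECKPOINT;
  // 2. Otherwise use the id recorded alongside the cursor.
  const stored = store.checkpointIds[cursor];
  if (stored !== undefined) return stored;
  // 3. A real block with no recorded id is corruption: fail loudly instead of
  //    silently reporting checkpoint zero (which is what caused the replay).
  throw new Error(`No checkpoint id recorded for ${cursor} cursor at block ${store.blocks[cursor]}`);
}

const store: TipsStoreSketch = {
  blocks: { checkpointed: 8, proven: 4, finalized: 0 },
  checkpointIds: { checkpointed: 3 },
};
console.log(getCheckpointIdSketch(store, "checkpointed")); // 3
console.log(getCheckpointIdSketch(store, "finalized")); // 0
```

In this example, resolving `proven` throws because block 4 has no recorded id, which is also the situation commit 4 avoids on upgrade by resetting the store rather than reading old tips with an empty id map.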
…g anvil (#23979)

Fixes the flaky HA full suite (`e2e_ha_full`) seen in http://ci.aztec-labs.com/8e1e980c4886df0d, where "should distribute work across multiple HA nodes" timed out awaiting a trigger tx. Also re-enables the suite, which #23976 had skipped.

## Root cause

The HA compose suite was the only block-building suite running against an L1 with no self-advancing clock. Its anvil container ran in automine with no `--block-time`, and being external, it was excluded from the `TestDateProvider` sync that locally-spawned anvils get. L1 chain time only moved when something mined, while the shared sequencer clock free-ran. #23821 removed the `AnvilTestWatcher` that used to couple the two clocks in this mode and replaced it with per-iteration nudges in the test (clock warp + blind `mine(8)`).

Two consequences, both visible in the failed run's logs:

- The `mine(8)` overshoot put L1 ~1.5 slots ahead of the test clock, so each iteration's first propose raced its slot boundary and was silently dropped, followed by a prune that destroyed the pipelined builders' forks (`Fork not found` on all surviving nodes). This race was lost in passing runs too.
- Recovery then required the proposers' archiver-sync gate to clear, but the gate's deadline runs on the free-running test clock while nothing mines L1 during the test's `waitForTx`: `Archiver did not sync L1 past slot 109 before slot 110 expired, discarding pipelined work`, repeated until the jest timeout. Whether a run passed or failed came down to seconds of margin on this gate.

## Fix

Stop emulating L1 time in the test and run the suite in the same regime as every other block-building e2e (e.g. `e2e_epochs`):

- Drop the anvil container and `ETHEREUM_HOSTS` from the HA compose file. With no external L1 configured, `setup()` spawns anvil in-proc with interval mining (`--block-time = ethereumSlotDuration`) and keeps the `TestDateProvider` snapped to L1 block timestamps via the existing stdout listener. The sibling web3signer compose suite already works this way.
- Add `automineL1Setup: true` so L1 contract deployment runs under temporary automine before interval mining starts.
- Delete all time scaffolding from the test (clock warps, cheat-mining heartbeats, archiver sync nudges). Tests submit a tx and wait, in real time. No assertions change.

No production code changes: with a self-advancing L1, the sequencer and publisher behave exactly as on a real network.

## Parallelization

The suite file is renamed to `e2e_ha_full.parallel.test.ts`, so CI runs each of its 8 tests as an isolated job in its own compose stack instead of one 15+ minute serial job:

- `bootstrap.sh` expands the HA suite per test name (same mechanism as the existing `.parallel` simple tests).
- `run_test.sh` forwards the test name into the compose stack and namespaces the docker compose project per test so concurrent jobs on one host don't collide.
- `sendTriggerTx` now starts the HA sequencers idempotently, since under per-test isolation the governance/reload/distribute tests run without the first test (previously the only caller of `startHASequencers`).
- Three clock-skew test titles contained parentheses, which jest's `--testNamePattern` interprets as regex groups (the filter would silently match nothing); they are retitled.

## Teardown fix (follow-up to the first CI round)

The first CI round passed every test body but three jobs (produce-blocks, governance, reload) hung in `afterAll` until the job timeout. Two compounding causes, both fixed here:

- `afterAll` reset the shared `TestDateProvider` *before* stopping nodes. The reset rewinds the clock from chain time to wall time (minutes apart after the automine deploy burst), so vote submissions armed against the rewound clock pushed sequencer stops out by that gap. The old 30s abandon-race then gave up, and the abandoned nodes outlived the jest environment, keeping the worker alive until the CI timeout (jest runs without `forceExit`). `afterAll` now stops sequencers first, awaits every node stop fully, and resets the clock last. These three jobs are the ones whose tests end with sequencers still running; the distribute test (which stops nodes in-test, before any reset) passed for the same reason.
- Ports #23990 from `merge-train/spartan` (not previously on the v5 line): `CheckpointProposalJob.interrupt()` now propagates to the publisher, cancelling the `sendRequestsAt` slot-deadline sleep on sequencer stop, so a pending vote submission can never block shutdown. The original PR's `e2e_ha_full` teardown changes are superseded by the rework above and were not ported.

## Verification

- Three full local runs of the suite via `run_test.sh ha` (all 8 tests each): green in 255s / 254s / 268s of jest time (the old warp-based suite ran 10+ minutes), with zero occurrences of the old failure signatures (`Fork not found`, `Archiver did not sync`, `discarding pipelined work`); passing runs of the old code showed 12+ `Fork not found` errors even when green.
- One per-test CI-style run (`run_test.sh ha <file> "should distribute work across multiple HA nodes"`): the originally flaky test passes standalone in its own compose stack (7 skipped, 1 passed), exercising the full `TEST_NAME` plumbing.
- `yarn build`, `yarn format`, `yarn lint` clean; `sequencer-client` unit tests pass (back to the pre-change suite after the revert).
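The parentheses problem noted under Parallelization is easy to reproduce with a plain `RegExp`. The test title below is a made-up placeholder, and escaping is shown only as an alternative for comparison; the PR itself retitled the tests instead.

```typescript
// Hypothetical title containing parentheses.
const title = "tolerates clock skew (node ahead)";

// Used directly as a pattern, "(node ahead)" becomes a capture group, so the
// pattern expects "skew node ahead" and never matches the literal parentheses.
const unescaped = new RegExp(title);
console.log(unescaped.test(title)); // false

// Escaping regex metacharacters makes the title match itself again.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}
const escaped = new RegExp(escapeRegExp(title));
console.log(escaped.test(title)); // true
```

A filter that matches nothing makes every test in the job report as skipped rather than failed, which is why this kind of mismatch is easy to miss.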
…154) (#23947) ## Motivation A tx sent for real without gas estimation gets fallback gas settings, whose DA gas limit was hardcoded to assume **4 blocks per checkpoint** (`APPROXIMATE_MAX_DA_GAS_PER_BLOCK = MAX_PROCESSABLE_DA_GAS_PER_CHECKPOINT / 4 = 196608`). The sequencer's real per-block DA allocation, however, divides the checkpoint budget by the **timetable-derived** blocks-per-checkpoint, which on v5 mainnet (72s slots, 6s blocks) is **10**, not 4. When the real value exceeds 4, a default-gas tx declares more DA than any single block admits, so the proposer prefilter — which checks the declared limit, not actual usage — skips it forever. This is what stranded account/contract deploys in the pipelining e2e runs. Separately, the largest tx we want to support — a maximal contract class registration (~97k DA gas / ~3k blob fields) — does not fit a block at 10 blocks/checkpoint with the general `perBlockAllocationMultiplier` of 1.2 (per-block DA cap 94,372; blob cap 2,949). ## Approach - The node derives the **most a single tx may declare on the network** and advertises it in `getNodeInfo` as `txsLimits`. This is a *network admission limit*: a function of network-wide inputs only — the timetable-derived blocks-per-checkpoint, the per-checkpoint budgets, and the network-minimum per-block multipliers (1.2 general / 1.5 DA). It is computed by shared stdlib helpers (`computeNetworkTxGasLimits` / `getNetworkTxGasLimits` in `stdlib/src/gas/tx_gas_limits.ts`) and never depends on a node's local block-gas caps or its (possibly higher) configured multipliers, so advertising (`getNodeInfo`), enforcement (RPC tx acceptance, gossip validation, pending-pool admission), and the wallet all compute the same value. Reqresp and block-proposal tx validation are intentionally left on well-formedness-only checks. 
- The DA admission budget scales with blocks-per-checkpoint via `getDaCheckpointBudgetForTxs`, which subtracts blob-encoding overhead (checkpoint-end marker + first-block and subsequent block-end fields) from the raw blob capacity. At v5 mainnet geometry (10 blocks per checkpoint) this yields **117,668 DA gas** as the per-tx admission limit. The builder uses the same basis, so a tx admitted by the DA limit always fits the first block's blob-field cap. - The wallet reads `txsLimits` **once** (cached for the wallet's lifetime) and uses it internally: as the fallback gas limits when the caller declares none, to clamp the limits it derives from its own pre-send simulation, and to validate caller-declared limits — a declared limit above the admission limit fails fast in the wallet with a descriptive error, mirroring the node's inbound validation. The limits are **not** exposed through the wallet API: apps that really need them ask the node (`getNodeInfo().txsLimits`). `txsLimits` is now a **required** field on `NodeInfo`; clients built against this version cannot talk to pre-field nodes. - Gas-limit padding is removed from aztec.js: the `estimateGas` / `estimatedGasPadding` simulate fee options and the `estimatedGas` result field are gone. `simulate({ includeMetadata: true })` exposes the raw `gasUsed` instead, and apps that want explicit limits pad it themselves. Wallets that simulate before send (embedded, CLI) keep their own internal padding defaults. - A new DA-specific per-block multiplier (default 1.5), applied to DA gas and blob fields only, lets the largest contract class deploy fit a single block at 10 blocks/checkpoint while leaving the general L2 multiplier untouched. Checkpoint-level capping still bounds the tail. - `GasSettings.fallback` now requires explicit `gasLimits` (wallets pass the node-advertised limit); the teardown limit is derived from the effective total so teardown DA can never exceed total DA. 
- The sequencer fails startup (and runtime config updates) only when its per-block allocation *multipliers* are below the network minimums — such a node would admit txs over RPC/gossip it can never pack. Merely restrictive absolute caps (`maxL2BlockGas` / `maxDABlockGas`) are a supported operator knob (the node just builds smaller blocks and leaves larger txs in the pool for other proposers), so they only log a warning. ## API changes - `NodeInfo.txsLimits` is now a **required** field (breaking): `{ gas: { daGas, l2Gas } }` — the most a single tx may declare on the network. Clients that connect to older nodes missing this field will fail. - Wallets validate caller-declared `gasLimits` against the network per-tx admission limit and throw before sending (e.g. `Declared DA gas limit (X) exceeds the maximum this network allows per tx (Y)`). When no limits are declared, the wallet fills in the admission limit. No new wallet API is introduced. - Removed the `estimateGas` / `estimatedGasPadding` simulate fee options and the `estimatedGas` simulation result field (breaking): `simulate({ includeMetadata: true })` now returns the raw `gasUsed` (`totalGas` / `teardownGas`) and apps derive their own limits from it. - `getGasLimits` is no longer exported from `@aztec/aztec.js` (breaking): it moved to `@aztec/wallet-sdk/base-wallet` with signature `(gasUsed, maxTxGasLimits, pad?)`, clamping padded estimates to the admission limit and throwing early if simulated usage exceeds it. A companion `assertGasLimitsWithinNetworkLimits` implements the wallet-side validation. - `GasSettings.fallback` now requires explicit `gasLimits` — read them from the node's `txsLimits.gas` if constructing settings manually. - New sequencer config `perBlockDAAllocationMultiplier` (env `SEQ_PER_BLOCK_DA_ALLOCATION_MULTIPLIER`, default 1.5). 
- Removed from `@aztec/stdlib`: `getDefaultNetworkTxGasLimits`, `getDefaultMaxBlocksPerCheckpoint`, `DEFAULT_MAINNET_AZTEC_SLOT_DURATION`, `DEFAULT_MAINNET_ETHEREUM_SLOT_DURATION`, `DEFAULT_MAINNET_BLOCK_DURATION_MS`, `APPROXIMATE_MAX_DA_GAS_PER_BLOCK`, `FALLBACK_TEARDOWN_L2_GAS_LIMIT`, `FALLBACK_TEARDOWN_DA_GAS_LIMIT`. Renamed `DEFAULT_PER_BLOCK_ALLOCATION_MULTIPLIER` → `MIN_PER_BLOCK_ALLOCATION_MULTIPLIER` and `DEFAULT_PER_BLOCK_DA_ALLOCATION_MULTIPLIER` → `MIN_PER_BLOCK_DA_ALLOCATION_MULTIPLIER`.

## Changes

- **stdlib**: shared `buildProposerTimetable` (timetable); `computeNetworkTxGasLimits` / `getNetworkTxGasLimits` / `getDaCheckpointBudgetForTxs` + per-block multiplier constants (gas); `NodeInfo.txsLimits` is now required; `GasSettings.fallback` requires explicit `gasLimits` and derives teardown from the effective total; `perBlockDAAllocationMultiplier` added to `SequencerConfig` and `BlockBuilderOptions`.
- **constants**: re-exports `MAX_PROCESSABLE_DA_GAS_PER_CHECKPOINT` with a JSDoc documenting that it is the raw, unattainable blob capacity and that tx-data consumers must subtract the encoding overhead.
- **aztec-node**: `getNodeInfo` populates `txsLimits` from `getNetworkTxGasLimits`; the RPC tx-acceptance validator enforces the same network limit it advertises.
- **p2p**: gossip and pending-pool tx validators enforce the network admission limit (`maxTxL2Gas` / `maxTxDAGas`) instead of the node's local block-gas caps; uses the shared `buildProposerTimetable`.
- **validator-client**: `checkpoint_builder` applies the DA multiplier to DA gas and blob fields.
- **sequencer-client**: config default + env mapping; threads the multiplier through the checkpoint proposal job; startup/runtime guard throws only on sub-minimum allocation multipliers and warns on restrictive absolute block-gas caps.
- **wallet-sdk**: `BaseWallet` caches node info for its lifetime, fills in missing gas limits from `txsLimits.gas`, and validates caller-declared limits via `assertGasLimitsWithinNetworkLimits`; `getGasLimits` lives here now.
- **aztec.js**: `Wallet` interface and method schemas lose the estimation surface; `ContractFunctionInteraction`, `BatchCall`, and `DeployMethod` return raw `gasUsed` in simulate metadata instead of padded estimates.
- **wallets / cli-wallet / bot**: the embedded wallet keeps its internal pre-send estimation (padding default 0.1) now clamped to the admission limit; the cli-wallet fee path reads `txsLimits` from the node and derives estimate-only output via a `CLIWallet.estimateGasLimits` helper; the bot sends without explicit limits and lets the wallet derive them.
- **foundation**: registers the new env var.
- **tests**: `tx_gas_limits` and `gas_settings` unit tests (incl. the teardown invariant and the multiplier/cap distinction); a `checkpoint_builder` red/green test that the largest contract class deploy fits at 1.5 but not at 1.2; wallet-side validation tests in `base_wallet` and `embedded_wallet`.
- **docs**: migration notes for the breaking `txsLimits` field, the removed estimation options, `getGasLimits` relocation, `GasSettings.fallback`, removed constants, and the new operator env var.

Builds on #23933 (A-1162), which introduced `MAX_TX_DA_GAS`.

Fixes A-1154
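The wallet-side behavior described above (reject declared limits above the admission limit, pad simulated usage, clamp the result) can be sketched roughly as follows. This is an illustration only, not the `@aztec/wallet-sdk` code: the `Gas` shape, the names `assertWithinNetworkLimits` and `padAndClampGasLimits`, the rounding, and the wording of the L2 and simulated-usage errors are assumptions. Only the DA error text and the 0.1 padding default come from the description.

```typescript
// Hypothetical shape; the real Gas type lives in @aztec/stdlib.
type Gas = { daGas: number; l2Gas: number };

// Sketch of wallet-side validation: a declared limit above the per-tx
// admission limit fails before anything is sent to the node.
function assertWithinNetworkLimits(declared: Gas, max: Gas): void {
  if (declared.daGas > max.daGas) {
    throw new Error(
      `Declared DA gas limit (${declared.daGas}) exceeds the maximum this network allows per tx (${max.daGas})`,
    );
  }
  if (declared.l2Gas > max.l2Gas) {
    // Assumed wording, mirroring the DA message.
    throw new Error(
      `Declared L2 gas limit (${declared.l2Gas}) exceeds the maximum this network allows per tx (${max.l2Gas})`,
    );
  }
}

// Sketch of pad-then-clamp: throw early if raw simulated usage already
// exceeds the admission limit, otherwise pad and cap at the limit.
function padAndClampGasLimits(gasUsed: Gas, max: Gas, pad = 0.1): Gas {
  if (gasUsed.daGas > max.daGas || gasUsed.l2Gas > max.l2Gas) {
    throw new Error('Simulated gas usage exceeds the network per-tx admission limit');
  }
  const padded = (value: number) => Math.ceil(value * (1 + pad));
  return {
    daGas: Math.min(padded(gasUsed.daGas), max.daGas),
    l2Gas: Math.min(padded(gasUsed.l2Gas), max.l2Gas),
  };
}
```

Clamping after padding means a tx whose real usage fits the network limit is never rejected just because the padded estimate would overshoot it.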
…ection (A-1168) (#23977)

## Summary

Identifies and enforces the configuration values that must be identical across all nodes of a network (A-1168), sourcing per-network values from the generated network config. Prevents operators from overriding them unless a new `ALLOW_OVERRIDING_NETWORK_CONFIG` flag is set.

Also:

- Adds a validation step in CI for the generated network configs.
- Fixes a parse error for per-block allocation multipliers.
- Enshrines max blocks per checkpoint as a consensus-wide config entry, checking it is sound wrt block times.

## Network-wide consensus values

`stdlib/src/config/network-consensus-config.ts` defines `NETWORK_CONSENSUS_ENV_VARS`, the env vars required to be the same for every node of a network, in three categories:

- **Timing/protocol consensus**: `ETHEREUM_SLOT_DURATION`, `AZTEC_SLOT_DURATION`, `AZTEC_EPOCH_DURATION`, `SEQ_BLOCK_DURATION_MS`, `MAX_BLOCKS_PER_CHECKPOINT`, `CHECKPOINT_PROPOSAL_SYNC_GRACE_SECONDS`.
- **Network identity / L1-posted deployment params**: chain id, committee size, lags, staking thresholds, mana target, proving cost, governance/slashing contract params, slash amounts.
- **Node-side slashing offense params** (`SLASH_*`): validators must agree on these to reach slashing quorum.

Per-network values live in `spartan/environments/network-defaults.yml` (the source of `cli/src/config/generated/networks.ts`). This PR adds the two missing ones (`MAX_BLOCKS_PER_CHECKPOINT: 10` and `CHECKPOINT_PROPOSAL_SYNC_GRACE_SECONDS: 12`) to the shared prodlike section, and makes devnet's `AZTEC_SLASHING_QUORUM: 17` / `AZTEC_GOVERNANCE_PROPOSER_QUORUM: 151` explicit (values match the Solidity `vm.envOr` defaults, so deployment behavior is unchanged). `maxBlocksPerCheckpoint` is an explicit network value rather than derived per node, so nodes with different operational budgets cannot diverge on checkpoint geometry; mainnet/testnet/devnet geometry (72s slots, 12s L1 slots, 6s blocks) derives exactly 10.
`NetworkConsensusConfig` is composed by `Pick`ing fields from `L1ContractsConfig` and `SequencerConfig`, and `getConsensusConfigFromNetworkEnv` derives env names and parsing from the canonical config mappings, so each field is parsed exactly as the node's config layer would parse it.

## Enforcement layers

- **Compile time**: `chain_l2_config.ts` asserts (via `satisfies`) that every generated network config defines every consensus-critical var; a `@ts-expect-error` compile gate in the test file proves the assertion actually rejects configs missing a var.
- **CI**: `cli/src/config/chain_l2_config.test.ts` validates each generated network config with `validateNetworkConsensusConfig`, which requires `MAX_BLOCKS_PER_CHECKPOINT` to be exactly what a `ProposerTimetable` at the production default budgets derives, plus basic geometry soundness (slot multiples, sub-slot fits in slot, etc.).
- **Startup (cli path)**: `enrichEnvironmentWithChainName` calls the pure `checkConsensusEnvOverrides` before enriching: a consensus var already set in the env to a value diverging from the network config makes startup throw, unless `ALLOW_OVERRIDING_NETWORK_CONFIG=1` is set (then it warns and keeps the operator value). The check returns canonical rewrites for numerically-equal-but-noncanonical values (e.g. `6e3`, which `parseInt`-based config parsing would read as 6), which the cli enrichment layer applies to the env.
- **Startup (node)**: `AztecNodeService` verifies the rollup contract reports the same `aztecSlotDuration`/`aztecEpochDuration` the node is configured with, and throws on mismatch. These are the only L1-timing fields the node config carries that the rollup exposes; the other rollup params (committee size, lags, proof submission epochs, mana limit) are read from L1 directly rather than from config.
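The cli-path override check can be pictured with a small pure function. This is a sketch under assumptions, not the actual `checkConsensusEnvOverrides`: the name `checkConsensusOverridesSketch`, its parameters, and its return shape are hypothetical, and the numeric comparison via `Number` is only one way to detect "numerically equal but non-canonical" values.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical result shape: rewrites to apply to the env, plus warnings
// emitted when the operator explicitly allowed overriding.
type OverrideCheck = { rewrites: Record<string, string>; warnings: string[] };

function checkConsensusOverridesSketch(
  env: Env,
  networkValues: Record<string, string>,
  allowOverride: boolean,
): OverrideCheck {
  const rewrites: Record<string, string> = {};
  const warnings: string[] = [];
  for (const [name, expected] of Object.entries(networkValues)) {
    const actual = env[name];
    // Unset or textually identical values need no action.
    if (actual === undefined || actual === expected) continue;
    // Same number, different spelling (e.g. '6e3' vs '6000'): rewrite to the
    // canonical form so a parseInt-based parser cannot misread it.
    if (actual.trim() !== '' && Number(actual) === Number(expected)) {
      rewrites[name] = expected;
      continue;
    }
    if (!allowOverride) {
      throw new Error(`${name}=${actual} diverges from the network value ${expected}`);
    }
    warnings.push(`${name} overridden: ${actual} (network value ${expected})`);
  }
  return { rewrites, warnings };
}
```

The rewrite step matters because `parseInt('6e3', 10)` stops at the `e` and yields 6, while `Number('6e3')` yields 6000, so leaving the non-canonical spelling in place could silently change the configured value.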
## Where maxBlocksPerCheckpoint applies

- **Proposer**: the sequencer's `ProposerTimetable` computes the locally achievable count from operational budgets and clamps it down to the network value when the network value is lower; sub-slot selection never starts a block past the effective count. When local budgets compute more than the network allows, the timetable warns through an injected logger.
- **Gossip validation**: `proposal_validator.ts` rejects (and penalizes peers for) block proposals with `indexWithinCheckpoint >= min(maxBlocksPerCheckpoint, MAX_ATTESTABLE_BLOCKS_PER_CHECKPOINT)`.
- **Attestation**: `proposal_handler.ts` refuses to attest to checkpoint proposals with more blocks than the configured value; `checkpoint_builder.ts` caps the blocks it assembles.
- **Gossipsub scoring**: peer-rate thresholds are sized from the network config value directly; the gossip layer now uses a plain `ConsensusTimetable` and no longer depends on proposer operational budgets (which were also dropped from `P2PConfig`).

## Also in this PR

- `MIN_PER_BLOCK_ALLOCATION_MULTIPLIER = 1.2` / `MIN_PER_BLOCK_DA_ALLOCATION_MULTIPLIER = 1.5` live in `@aztec/constants`; the sequencer rejects multipliers below the minimum, and `SEQ_PER_BLOCK_ALLOCATION_MULTIPLIER` switched from `numberConfigHelper` (parseInt truncated `1.5` to `1`) to `floatConfigHelper` (mirrors #23947; deliberate copy, conflicts to be resolved when either lands).
- Removed the redundant `checkpointProposalSyncGraceSeconds` defaulting in node `createAndSync`; every consumer (archiver factory, sequencer, p2p timetable) has its own fallback.
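Two of the checks above are simple enough to sketch directly: the gossip bound on `indexWithinCheckpoint`, and the float parsing plus minimum check for allocation multipliers. The function names `isBlockIndexAllowed` and `parseMultiplier` are hypothetical, and the attestable cap is passed in as a parameter because its actual value is not stated here.

```typescript
// Gossip bound from the description: a proposal is rejected when
// indexWithinCheckpoint >= min(maxBlocksPerCheckpoint, maxAttestable).
function isBlockIndexAllowed(
  indexWithinCheckpoint: number,
  maxBlocksPerCheckpoint: number,
  maxAttestableBlocksPerCheckpoint: number,
): boolean {
  const bound = Math.min(maxBlocksPerCheckpoint, maxAttestableBlocksPerCheckpoint);
  return indexWithinCheckpoint < bound;
}

// Multipliers must be parsed as floats: an integer parse would truncate
// '1.5' to 1, which then falls below a 1.2 minimum.
function parseMultiplier(raw: string, min: number): number {
  const value = Number.parseFloat(raw);
  if (!Number.isFinite(value) || value < min) {
    throw new Error(`Multiplier ${raw} is below the network minimum ${min}`);
  }
  return value;
}
```

With a network value of 10 blocks per checkpoint, indices 0 through 9 pass the bound and index 10 is rejected.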
## Spartan deployments

Existing spartan networks that intentionally diverge from the generated defaults keep deploying: `devnet.env` (36s slots, committee size 1) and `testnet.env` (slashing round size 2 epochs) now set `ALLOW_OVERRIDING_NETWORK_CONFIG=true`, plumbed through `deploy_network.sh` into both the `deploy-rollup-contracts` job env and every aztec-image helm release (new `global.allowOverridingNetworkConfig` rendered by the shared aztec-node pod template). `AZTEC_SLOT_DURATION`/`AZTEC_EPOCH_DURATION` are also passed through to node pods so devnet nodes carry the real deployed 36s value and pass the rollup cross-check instead of inheriting the generated 72s default. mainnet/staging/next-net set no conflicting consensus vars and are untouched, so enforcement stays loud by default.

## Known limitations

- `MIN_PER_BLOCK_DA_ALLOCATION_MULTIPLIER` documents the network minimum only; its operator knob and runtime enforcement land with #23947.
- The remote `network_config.json` enrichment runs before `enrichEnvironmentWithChainName`, so a consensus value pushed via the networks repo that diverges from the binary's generated defaults will also be refused at startup (the live JSON sets no consensus values today).
- `ethereumSlotDuration` cannot be cross-checked against the rollup contract (no getter); it is enforced via env only on named networks.
…idates multiple checkpoints` (#24017)

Fixes a flake in `proposer invalidates multiple checkpoints` (`e2e_epochs/epochs_invalidate_block.parallel.test.ts`) reported on `v5-next`: [failed run](http://ci.aztec-labs.com/e4076dd86c434c6f). Replaces #24016 (was based on `merge-train/spartan`; this one targets the v5 line where the flake fired and restructures the test instead of just resizing the timeout).

## Root cause of the flake

`TimeoutError: Operation timed out after 256000ms`: the bare 8-slot `timeoutPromise` waiting for the two bad checkpoints. The bad-slot search from #23608 rejects any candidate pair whose proposer also owns an earlier un-snapshotted pipelined slot, and the rejection window grows with each attempt. In the failed run the current slot was 21 and the search rejected (24,25)…(29,30) before accepting slots **30/31**, 9–10 slots out. The fixed 256s wait expired at 22:48:55, before slot 30 even began (~22:49:00), while the chain healthily mined checkpoints at slots 22–28 underneath; the run was unwinnable at selection time. The race's `.then(() => [CheckpointNumber(0), …])` fallback was also dead code, since `timeoutPromise` rejects.

## Fix: search first, then warp

Instead of starting the sequencers and waiting in real time for whatever slots the search lands on:

- With sequencers stopped, search for a `warpSlot` such that the proposers of the three lead-in slots `warpSlot+1..warpSlot+3` are not the proposers of the bad slots `warpSlot+4`/`warpSlot+5`. A far-away candidate now costs a warp instead of a real-time wait, and `EpochNotStable` during the search is handled by warping forward one epoch (same pattern as the `archiver skips a descendant` test in this file).
- Warp to one L1 block before `warpSlot`, so sequencers get a full L2 slot to boot before the first pipelined build window we rely on (end of `warpSlot`, targeting `warpSlot+1`).
- Start the sequencers and wait for the first good checkpoint (lands at `warpSlot`, or up to `warpSlot+2` on a slow start).
- Apply the malicious config to the bad-slot proposers. The three good lead-in slots guarantee no pipelined job before `badSlot1` can snapshot it, since jobs snapshot config during the last L1 slot of the previous L2 slot.
- Fail fast with a clear assertion if config application was somehow late enough to reach `badSlot1`'s build window, rather than timing out opaquely.
- The 8-slot wait for the bad checkpoints is now correctly sized by construction (`badSlot2` is at most ~6 slots from the wait start), and gets a descriptive timeout message.

Worst case the wait phase is bounded at ~6 slots regardless of how many candidates the search rejects, where previously each rejected candidate pushed the bad checkpoints one slot further past the fixed timeout.

---

*Created by [claudebox](https://claudebox.work/v2/sessions/d509a218614bf4ac) · group: `slackbot`*
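The warp-slot search from the fix above reduces to a small predicate over the proposer schedule. The sketch below is a simplified model, not the test's code: `findWarpSlot`, `proposerAt`, and `maxAttempts` are hypothetical names, proposers are plain strings, and the `EpochNotStable` handling (warping forward one epoch) is left out.

```typescript
// Hypothetical lookup: returns the proposer identity for an L2 slot.
type ProposerLookup = (slot: number) => string;

// Find the first warpSlot whose three lead-in slots (warpSlot+1..+3) have
// no proposer in common with the two bad slots (warpSlot+4, warpSlot+5).
function findWarpSlot(startSlot: number, proposerAt: ProposerLookup, maxAttempts = 64): number {
  for (let warpSlot = startSlot; warpSlot < startSlot + maxAttempts; warpSlot++) {
    const leadIn = [1, 2, 3].map(offset => proposerAt(warpSlot + offset));
    const bad = [proposerAt(warpSlot + 4), proposerAt(warpSlot + 5)];
    if (!leadIn.some(proposer => bad.includes(proposer))) {
      return warpSlot;
    }
  }
  throw new Error(`No suitable warp slot found within ${maxAttempts} slots of ${startSlot}`);
}
```

Because the search runs with sequencers stopped, a candidate that is many slots away only lengthens the warp, not the real-time wait that follows.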
## Motivation
The bot waits for its Fee Juice bridge claim with
`waitForL1ToL2MessageReady` and then immediately simulates the account
deployment that consumes the claim. Readiness was always evaluated
against the `latest` block, but the bot's embedded PXE can be configured
to sync to a slower tip (e.g. `syncChainTip=checkpointed`). When the
tips diverge, readiness passes while the PXE simulation anchors to an
older block whose message tree does not contain the message yet, and
simulation fails with `No L1 to L2 message found for message hash ...`,
sending the bot into a crash loop where it repeatedly validates a claim
it cannot consume.
## Approach
Make readiness answer the question the consumer actually needs: is the
message present at the same chain tip the consuming PXE will anchor its
simulation to?
- `isL1ToL2MessageReady` / `waitForL1ToL2MessageReady` accept an
optional chain tip (`BlockTag`), defaulting to `latest` so existing
callers are unaffected. The helper compares the message checkpoint
against the block at the requested tip.
- The bot does not get a new config knob and no wallet APIs change:
`addBot` extracts `syncChainTip` from the same PXE options its callers
use to build the embedded wallet, and threads it through `BotRunner` →
bot `create` → `BotFactory`. This keeps the readiness tip from drifting
from the PXE's actual config. Polling the node at the PXE's configured
tip (rather than exposing the PXE anchor) is required for the wait to
make progress, since the PXE synchronizer is pull-on-demand and its
anchor only advances on `pxe.sync()`.
- All bot readiness checks now pass the tip: the stored-claim
revalidation and the new-claim wait in `BotFactory`, the cross-chain
setup wait, and the steady-state message selection in `CrossChainBot`.
## API changes
`isL1ToL2MessageReady(node, msgHash, chainTip?)` and
`waitForL1ToL2MessageReady(node, msgHash, { timeoutSeconds, chainTip?
})` in `@aztec/aztec.js/messaging` accept an optional `BlockTag`
(default `'latest'`, preserving previous behavior). Their node
dependency narrowed from `getBlock` to the cheaper `getBlockData`.
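A rough model of the tip-aware check is sketched below. It is not the `@aztec/aztec.js/messaging` implementation: the `BlockTag` values, the `ReadinessNode` interface and its two methods, and the `>=` comparison are all assumptions made for illustration.

```typescript
// Hypothetical tag set; the real BlockTag type may differ.
type BlockTag = 'latest' | 'checkpointed' | 'proven';

// Hypothetical, minimal node surface used only by this sketch.
interface ReadinessNode {
  getMessageCheckpoint(msgHash: string): Promise<number | undefined>;
  getCheckpointAtTip(tip: BlockTag): Promise<number | undefined>;
}

// Pure comparison: the message is ready once the block at the requested tip
// belongs to the message's checkpoint or a later one (assumed semantics).
function isReadyAtTip(messageCheckpoint: number | undefined, tipCheckpoint: number | undefined): boolean {
  return messageCheckpoint !== undefined && tipCheckpoint !== undefined && tipCheckpoint >= messageCheckpoint;
}

// Tip-aware wrapper, defaulting to 'latest' so existing callers are unaffected.
async function isMessageReadySketch(
  node: ReadinessNode,
  msgHash: string,
  chainTip: BlockTag = 'latest',
): Promise<boolean> {
  const [messageCheckpoint, tipCheckpoint] = await Promise.all([
    node.getMessageCheckpoint(msgHash),
    node.getCheckpointAtTip(chainTip),
  ]);
  return isReadyAtTip(messageCheckpoint, tipCheckpoint);
}
```

In this model, a message in checkpoint 5 is ready when `latest` is at checkpoint 7 but not when `checkpointed` is still at checkpoint 4, which is the divergence that produced the bot's crash loop.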
## Changes
- **aztec.js**: tip-aware readiness helpers in `utils/cross_chain.ts`;
new unit tests covering the latest fallback and the tip-aware path.
- **bot**: `BotRunner`, `Bot`/`AmmBot`/`CrossChainBot.create`, and
`BotFactory` accept the PXE sync tip and use it at every L1-to-L2
readiness check.
- **aztec**: `addBot` extracts `syncChainTip` from the PXE options and
passes it to `BotRunner`.
Fixes A-1155
BEGIN_COMMIT_OVERRIDE
fix(p2p): stop checkpoint-replay storm when pruning to an uncheckpointed block (#23967)
refactor(sequencer)!: always enforce timetable with concrete block duration (#23821)
fix(e2e): drop removed enforceTimeTable option from optimistic proving test (#23976)
feat: persist peer bans for a configurable duration (A-1157) (#23922)
refactor!: rename node JSON-RPC to aztec_* prefixes (#23909)
fix(p2p): drive tx protection release from synced blocks instead of wall clock (#23978)
fix(p2p)!: resolve checkpoint tips from stored ids (#23968)
fix: deflake HA full e2e suite by switching to in-proc interval-mining anvil (#23979)
fix(gas)!: client fallback limits track network per-block budget (A-1154) (#23947)
feat: network-wide consensus config with validation and override protection (A-1168) (#23977)
test(e2e): pick bad slots upfront and warp to them in `proposer invalidates multiple checkpoints` (#24017)
fix(bot): check L1-to-L2 message readiness against PXE sync tip (#24004)
fix: Merge Conflicts (#24014)
END_COMMIT_OVERRIDE