fix(archiver): prune blocks without proposed checkpoint by end of build slot #23606
Merged
PhilWindle merged 6 commits on May 29, 2026
Collaborator
Flakey Tests 🤖 says: This CI run detected 2 tests that failed, but were tolerated due to a `.test_patterns.yml` entry.
Adds a multi-node e2e (`epochs_orphan_block_prune.test.ts`) that exercises both defenses from #23606 end-to-end: it picks consecutive distinct proposers P1/P2, makes P1 publish its block but withhold the matching CheckpointProposal, then asserts that every archiver (a) ingests the orphan block at slot S1, (b) prunes it via the wall-clock orphan prune, and (c) lets P2 rebuild block 1 at slot S2 with a checkpoint that lands on L1. To enable that scenario in single-block-per-checkpoint mode, adds a new test-only `skipBroadcastCheckpointProposal` sequencer config. It is a narrower variant of the existing `skipBroadcastProposals`: when set, the sequencer skips the CheckpointProposal gossip broadcast but still broadcasts the held last block standalone, so peers receive every block yet never see a proposed checkpoint.
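The difference between the two test-only flags can be sketched as a small decision function. This is a hypothetical model, not the sequencer's code: only the flag names `skipBroadcastProposals` and `skipBroadcastCheckpointProposal` come from the description above, and the assumption that the broader flag suppresses all proposal gossip is ours.

```typescript
// Hypothetical, simplified model of the broadcast decision.
// Only the two flag names are taken from the PR text.
interface BroadcastFlags {
  skipBroadcastProposals?: boolean;
  skipBroadcastCheckpointProposal?: boolean;
}

interface BroadcastPlan {
  // Whether standalone block proposals (including the held last block) reach peers.
  blockProposals: boolean;
  // Whether the enclosing CheckpointProposal reaches peers.
  checkpointProposal: boolean;
}

function planBroadcast(flags: BroadcastFlags): BroadcastPlan {
  if (flags.skipBroadcastProposals) {
    // Assumption: the broader flag suppresses every proposal broadcast.
    return { blockProposals: false, checkpointProposal: false };
  }
  if (flags.skipBroadcastCheckpointProposal) {
    // Narrower variant: peers receive every block but never a proposed checkpoint.
    return { blockProposals: true, checkpointProposal: false };
  }
  return { blockProposals: true, checkpointProposal: true };
}
```

With `skipBroadcastCheckpointProposal` set, the plan is exactly the orphan-block scenario the e2e test needs: blocks arrive, the checkpoint never does.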
`pruneOrphanProposedBlocks` is wall-clock based and does not touch L1, so it belongs on the `Archiver` rather than the `ArchiverL1Synchronizer`. Moves the method (and its `epochCache` / `dateProvider` dependencies) onto the `Archiver`, called directly from `sync()` between `processInboundQueue()` and `syncFromL1()`. The synchronizer keeps the L1-block-driven `pruneUncheckpointedBlocks` (used to clear late stale blocks once L1 advances past their slot); its inline emit is now duplicated in both prune paths to keep them self-contained. No behavior change — verified by the existing orphan-prune unit tests in `archiver-sync.test.ts` and the full archiver suite.
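The call order described above can be illustrated with a stand-in class. This is a sketch under simplifying assumptions: the method names come from the description, but the bodies are placeholders that only record the order, and the steps are synchronous here purely to keep the example short.

```typescript
// Illustrative stand-in for the archiver. Only the method names
// (sync, processInboundQueue, pruneOrphanProposedBlocks, syncFromL1)
// are taken from the PR text; the bodies just record the call order.
class ArchiverSketch {
  readonly calls: string[] = [];

  private processInboundQueue(): void {
    this.calls.push('processInboundQueue');
  }

  // Wall-clock based and independent of L1, which is why it sits on the
  // archiver rather than on the L1 synchronizer.
  private pruneOrphanProposedBlocks(): void {
    this.calls.push('pruneOrphanProposedBlocks');
  }

  private syncFromL1(): void {
    this.calls.push('syncFromL1');
  }

  sync(): void {
    this.processInboundQueue();
    this.pruneOrphanProposedBlocks();
    this.syncFromL1();
  }
}
```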
d6d80f9 to 0c5e020
PhilWindle reviewed May 29, 2026
    // The L1 rollup contract only exposes proposers for epochs whose randao seed is "stable" (i.e. queryable on L1
    // right now). When we look too far into the future the contract reverts with `ValidatorSelection__EpochNotStable`.
    // We handle this by warping L1 forward one epoch at a time and retrying.
    let S1: SlotNumber | undefined;
Collaborator
Not needed for this PR, but I feel like we have variations of this same code in many places.
Contributor
Author
Agree. In between pipelining and inbox I want to allocate some time to e2e refactoring.
PhilWindle approved these changes May 29, 2026
spalladino added a commit that referenced this pull request Jun 2, 2026
…due (#23807)

## Motivation

The orphan-block guard in `checkSync` (added in #23606) was logging at `warn` on every non-proposer validator, ~once per second for a full slot, every slot. Under pipelining a node receives and re-executes a block proposal for the next checkpoint up to one slot before the matching checkpoint proposal arrives, so the world-state tip legitimately sits in an as-yet-unproposed checkpoint for that whole window. That is the happy path, not the abnormal "proposer published blocks but never the checkpoint" case the guard is meant to flag. Observed on `next-net`: 118 warnings in ~59s on a healthy validator for a single slot.

## Approach

The condition that distinguishes "checkpoint hasn't arrived yet" from "checkpoint will never arrive" is purely temporal, which is exactly what the archiver already computes in `pruneOrphanProposedBlocks` to decide when to prune an orphan block. The guard now reuses that same deadline: it still refuses to build (`return undefined`) whenever the orphan-shaped state holds, but only escalates to `warn` once the enclosing checkpoint is overdue by that deadline; within the normal pipelining window it logs at `debug`. The warn therefore fires at the same instant the archiver would prune the orphan.

## Changes

- **sequencer-client**: Add `isProposedCheckpointOverdue`, mirroring the archiver's orphan-prune deadline (`start of slot after the block's build slot + grace`, grace derived from `blockDurationMs` as the node wiring does). Gate the existing guard's log level on it: `warn` when overdue, `debug` otherwise. Control flow is unchanged.
- **sequencer-client (tests)**: Thread a real `blockSlot` through the orphan-guard test setup and split the warning test into an overdue case (expects `warn`) and a within-window case (expects no `warn`).
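The deadline rule in that commit message (`start of slot after the block's build slot + grace`) can be sketched as follows. The name `isProposedCheckpointOverdue` comes from the commit message, but the signature, the `SlotTiming` shape, the `guardLogLevel` helper, and the inclusive comparison are assumptions made for illustration, not the sequencer-client code.

```typescript
// Sketch only: the timing model and every parameter name are assumptions.
interface SlotTiming {
  genesisTimeMs: number; // wall-clock time at which slot 0 starts (assumed)
  slotDurationMs: number; // length of one slot, assumed constant
}

function slotStartMs(slot: number, timing: SlotTiming): number {
  return timing.genesisTimeMs + slot * timing.slotDurationMs;
}

// True once the checkpoint enclosing a block built in `buildSlot` is overdue,
// i.e. now has reached: start of the slot after the build slot + grace.
// Whether the real comparison is inclusive is an assumption.
function isProposedCheckpointOverdue(
  buildSlot: number,
  nowMs: number,
  graceMs: number,
  timing: SlotTiming,
): boolean {
  const deadlineMs = slotStartMs(buildSlot + 1, timing) + graceMs;
  return nowMs >= deadlineMs;
}

// Hypothetical helper: the guard always refuses to build in the orphan-shaped
// state; only the log level depends on the deadline.
function guardLogLevel(overdue: boolean): 'warn' | 'debug' {
  return overdue ? 'warn' : 'debug';
}
```

Because the same deadline drives both the log level and the archiver's prune, a `warn` from the guard should coincide with the moment the orphan block becomes eligible for pruning.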
Thunkar pushed a commit that referenced this pull request Jun 3, 2026
## Problem

CI on `merge-train/fairies` failed on the boxes `react chromium` test ([log](http://ci.aztec-labs.com/1780510430908759), [failing test](http://ci.aztec-labs.com/243e7294cb8ba269)) with a timeout (code 124). The actual error was during `aztec start` / `createLocalNetwork`:

```
Error: Transaction 0x0826… was dropped. Reason: Tx dropped by P2P node
    at NodeEmbeddedWallet.sendTx
    at DeployAccountMethod.send
    at deployFundedSchnorrAccounts
    at createLocalNetwork
    at aztecStart
```

The local network never came up, so the browser test timed out.

## Root cause

PR #23819 ("embedded wallet defaults to proposed") fixed the embedded wallet so its default wait status is *actually* `PROPOSED`; previously the default was a no-op that fell through to `waitForTx`'s `CHECKPOINTED` default. `PROPOSED` returns as soon as a tx lands in a proposed L2 block. In the serial sandbox setup that races against block pruning: a proposed-but-not-checkpointed block can be pruned by end of build slot (see #23606), and a tx in it is then neither in the archiver nor the pool, so `getTxReceipt` returns `DroppedTxReceipt("Tx dropped by P2P node")`. With the old broken default this path waited for `CHECKPOINTED` and was reliable. The real source of flakiness is the local network setup, not the boxes.

## Fix

Thread an explicit `{ waitForStatus: TxStatus.CHECKPOINTED }` wait through the sandbox-setup sends:

- `createLocalNetwork`: `deployFundedSchnorrAccounts`, `publishStandardAuthRegistry`, `setupBananaFPC`
- `setup-l2-contracts` CLI wait options

The intended product default of `PROPOSED` for normal wallet usage is unchanged; only the CI/sandbox bring-up that needs durable inclusion before the next serial tx is pinned to `CHECKPOINTED`. e2e fixtures use `TestWallet` (BaseWallet's `CHECKPOINTED` default) and are unaffected.

Also reverts the per-box `CHECKPOINTED` waits that #23819 added to the react/vite/vanilla boxes: they didn't fix the flakiness (the local-network setup did), so the box sends go back to using the embedded wallet `PROPOSED` default.

## Verification

TypeScript-only change in `yarn-project` plus box reverts; the box files now match their pre-#23819 state exactly. A full `./bootstrap.sh ci` could not be run in this container (clang 18 vs required 20, zig missing, no remote build cache; the suite is multi-hour). Confirmed by the merge-train CI re-run of the boxes tests.
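The race described under "Root cause" can be modelled with a toy node view. Everything here is a hypothetical stand-in: the status names mirror the ones mentioned in the commit message, but `NodeView`, `getTxStatus`, and `pruneProposed` are illustrative and do not correspond to the real node or wallet APIs.

```typescript
// Toy model of the race; all types and functions are illustrative stand-ins.
const TxStatus = {
  DROPPED: 'dropped',
  PENDING: 'pending',
  PROPOSED: 'proposed',
  CHECKPOINTED: 'checkpointed',
} as const;
type TxStatus = (typeof TxStatus)[keyof typeof TxStatus];

interface NodeView {
  checkpointedTxs: Set<string>; // txs in checkpointed blocks
  proposedTxs: Set<string>; // txs in proposed-but-not-checkpointed blocks
  poolTxs: Set<string>; // txs still in the mempool
}

function getTxStatus(view: NodeView, txHash: string): TxStatus {
  if (view.checkpointedTxs.has(txHash)) return TxStatus.CHECKPOINTED;
  if (view.proposedTxs.has(txHash)) return TxStatus.PROPOSED;
  if (view.poolTxs.has(txHash)) return TxStatus.PENDING;
  return TxStatus.DROPPED;
}

// Models the orphan prune: proposed blocks disappear and, per the commit
// message, their txs end up in neither the archiver nor the pool.
function pruneProposed(view: NodeView): NodeView {
  return { ...view, proposedTxs: new Set<string>() };
}
```

In this model a caller that stopped waiting at `PROPOSED` can later observe the same tx as dropped after a prune, while a tx that reached `CHECKPOINTED` is unaffected, which is the durability the sandbox bring-up relies on.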
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request Jun 4, 2026
BEGIN_COMMIT_OVERRIDE
test(e2e): unskip pipelining related e2e tests (AztecProtocol#23642)
fix(archiver): prune blocks without proposed checkpoint by end of build slot (AztecProtocol#23606)
test: migrate benchmarks to pipelining setup (AztecProtocol#23647)
fix(p2p): fall back to archiver in BLOCK_TXS response validation (AztecProtocol#23624)
docs(slashing): align operator and slasher docs with AZIP-7 (AztecProtocol#23494)
fix(p2p): do not penalize peers that signal a missing block with Fr.ZERO (AztecProtocol#23672)
chore: adjust metrics deployment (AztecProtocol#23676)
fix(cheat-codes): warpL2TimeAtLeastBy advances relative to leading clock (AztecProtocol#23675)
chore: tighten node pool sizes (AztecProtocol#23678)
chore: remove archival nodes (AztecProtocol#23630)
chore: merge blob sink duties into RPC node (AztecProtocol#23631)
fix: sync avm-transpiler Cargo.lock with noir submodule (AztecProtocol#23683)
fix(spartan): set validator lag env vars in tps-scenario (AztecProtocol#23684)
fix: make world-state hash queries reorg-aware to close getWorldState race (AztecProtocol#23677)
fix: pin noir submodule to next's version on merge-train/spartan (AztecProtocol#23690)
fix: ensure image ref is used by bench runner (AztecProtocol#23682)
fix(ci): retry aztec-nr nargo dependency clone on transient network flake (AztecProtocol#23653)
chore: run one-off jobs on network nodes (AztecProtocol#23701)
fix: simulate proposals inside target slot (AztecProtocol#23692)
chore: smaller eth-devnet (AztecProtocol#23704)
chore: enable testnet autoscaling (AztecProtocol#23705)
feat(api)!: redesign node log retrieval API around tag-based queries (AztecProtocol#23625)
fix(sequencer): set own proposed checkpoint locally instead of via p2p loopback (AztecProtocol#23659)
END_COMMIT_OVERRIDE
When the previous proposer sent some block proposals but failed to send the corresponding checkpoint proposal, the current proposer would assume there was no proposed checkpoint to build on top of, but would still use the proposed blocks as chain tip. This meant a failed `canPropose` check against the Rollup contract as soon as it started its slot, since the proposed blocks from the previous proposer meant the proposer had a wrong chain tip.
To fix, the sequencer is now aware that there may be proposed blocks without the corresponding checkpoints, and it can't start building until that's resolved. Also, the archiver now prunes proposed blocks without a checkpoint when the corresponding build slot is over.
Motivation
Under proposer pipelining a node can receive and reexecute the block-only proposals for a checkpoint before (or without ever) receiving the enclosing proposed checkpoint. This leaves the local tip one checkpoint ahead of the checkpointed tip with no proposed checkpoint backing it. A sequencer that then builds the next checkpoint on top of that orphan tip forks the chain off a parent no other node can follow, which was the root cause behind the sentinel CI flake.
Approach
Two complementary defenses. The sequencer's `checkSync` refuses to proceed when the synced block's checkpoint is ahead of the checkpointed tip and no matching proposed checkpoint exists, holding the line during the window before cleanup. The archiver adds a wall-clock orphan prune that, shortly after a block's build slot ends, removes a block-only tip whose checkpoint was never proposed, restoring liveness even while L1 is quiet.
Changes
- `checkSync` rejects syncing onto a proposed block with no matching proposed-checkpoint tip/data, logging a descriptive warning.
- `pruneOrphanProposedBlocks` on the L1 synchronizer, run from `Archiver.sync()` after the inbound queue drains and before L1 sync; prunes after `start(blockSlot) + grace` using the epoch-cache pipelining offset and emits `L2PruneUncheckpointed`. The existing L1-sync prune is preserved (shared prune/emit helper).
- `orphanProposedBlockPruneGraceSeconds` in `ArchiverSpecificConfig`, archiver config mappings (`ARCHIVER_ORPHAN_PROPOSED_BLOCK_PRUNE_GRACE_SECONDS`), `mapArchiverConfig`, the synchronizer/archiver config types, and a new `EnvVar`.
- `blockDurationMs / 1000` when unset, falling back to `MIN_EXECUTION_TIME`; the archiver factory also defaults to `MIN_EXECUTION_TIME`.
- `undefined` and warns; matching proposed checkpoint proceeds.
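The guard condition and the grace default listed above can be sketched as two small functions. These are hypothetical shapes, not the code in this PR: `SyncView`, `canBuildOn`, and `resolveGraceSeconds` are illustrative names, the fallback order is our reading of the change list, and `minExecutionTimeSeconds` stands in for `MIN_EXECUTION_TIME`, whose value and unit are not given here.

```typescript
// Hypothetical inputs; the real checkSync works on richer sync state.
interface SyncView {
  syncedBlockCheckpoint: number; // checkpoint number enclosing the synced block tip
  checkpointedTip: number; // latest checkpoint number confirmed on L1
  proposedCheckpoint?: number; // latest proposed, not yet checkpointed, checkpoint if any
}

// Returns false in the orphan-shaped state: the synced block sits in a
// checkpoint ahead of the checkpointed tip with no matching proposed checkpoint.
function canBuildOn(view: SyncView): boolean {
  const aheadOfCheckpointedTip = view.syncedBlockCheckpoint > view.checkpointedTip;
  const backedByProposal = view.proposedCheckpoint === view.syncedBlockCheckpoint;
  return !aheadOfCheckpointedTip || backedByProposal;
}

// Assumed fallback order: explicit config, then blockDurationMs / 1000,
// then a minimum execution time passed in by the caller.
function resolveGraceSeconds(
  configuredGraceSeconds: number | undefined,
  blockDurationMs: number | undefined,
  minExecutionTimeSeconds: number,
): number {
  if (configuredGraceSeconds !== undefined) return configuredGraceSeconds;
  if (blockDurationMs !== undefined) return blockDurationMs / 1000;
  return minExecutionTimeSeconds;
}
```

Keeping the guard a pure predicate over the sync state makes the two test cases from the change list (no matching proposed checkpoint versus a matching one) straightforward to express.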