
fix(archiver): prune blocks without proposed checkpoint by end of build slot#23606

Merged
PhilWindle merged 6 commits into merge-train/spartan from spl/fix-missing-checkpoint-proposal on May 29, 2026

@spalladino spalladino commented May 27, 2026


When the previous proposer sent some block proposals but failed to send the corresponding checkpoint proposal, the current proposer would assume there was no proposed checkpoint to build on top of, but would still use the proposed blocks as its chain tip. As a result, its canPropose check against the Rollup contract failed as soon as its slot started, because the previous proposer's uncheckpointed blocks left it with the wrong chain tip.

To fix, the sequencer is now aware that there may be proposed blocks without the corresponding checkpoints, and it can't start building until that's resolved. Also, the archiver now prunes proposed blocks without a checkpoint when the corresponding build slot is over.


Motivation

Under proposer pipelining a node can receive and reexecute the block-only proposals for a checkpoint before (or without ever) receiving the enclosing proposed checkpoint. This leaves the local tip one checkpoint ahead of the checkpointed tip with no proposed checkpoint backing it. A sequencer that then builds the next checkpoint on top of that orphan tip forks the chain off a parent no other node can follow, which was the root cause behind the sentinel CI flake.

Approach

Two complementary defenses. The sequencer's checkSync refuses to proceed when the synced block's checkpoint is ahead of the checkpointed tip and no matching proposed checkpoint exists, holding the line during the window before cleanup. The archiver adds a wall-clock orphan prune that, shortly after a block's build slot ends, removes a block-only tip whose checkpoint was never proposed, restoring liveness even while L1 is quiet.
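
A minimal sketch of the sequencer-side guard described above, for illustration only. The `SyncState` shape, the `checkOrphanTip` name, and the field names are hypothetical stand-ins and not the actual `checkSync` code; the sketch assumes checkpoints are identified by increasing numbers.

```typescript
// Hypothetical, simplified view of the state the guard inspects.
interface SyncState {
  /** Checkpoint number that the latest synced (proposed) block belongs to. */
  syncedBlockCheckpoint: number;
  /** Latest checkpoint number already checkpointed. */
  checkpointedTip: number;
  /** Checkpoint number of the proposed-but-not-checkpointed checkpoint, if any. */
  proposedCheckpoint?: number;
}

type GuardResult = { canBuild: true } | { canBuild: false; reason: string };

function checkOrphanTip(state: SyncState): GuardResult {
  const { syncedBlockCheckpoint, checkpointedTip, proposedCheckpoint } = state;

  // The synced block is not ahead of the checkpointed tip: nothing to guard against.
  if (syncedBlockCheckpoint <= checkpointedTip) {
    return { canBuild: true };
  }

  // The synced block is ahead, but a matching proposed checkpoint backs it.
  if (proposedCheckpoint === syncedBlockCheckpoint) {
    return { canBuild: true };
  }

  // Orphan-shaped tip: blocks for a checkpoint that was never proposed.
  return {
    canBuild: false,
    reason: `synced block is in checkpoint ${syncedBlockCheckpoint}, ahead of checkpointed tip ${checkpointedTip}, with no matching proposed checkpoint`,
  };
}
```

In this sketch the caller would refuse to build (the real `checkSync` returns `undefined`) whenever `canBuild` is false, and would retry on a later pass once the archiver has either received the checkpoint or pruned the orphan blocks.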

Changes

  • sequencer-client: checkSync rejects syncing onto a proposed block with no matching proposed-checkpoint tip/data, logging a descriptive warning.
  • archiver: new pruneOrphanProposedBlocks on the L1 synchronizer, run from Archiver.sync() after the inbound queue drains and before L1 sync; prunes after start(blockSlot) + grace using the epoch-cache pipelining offset and emits L2PruneUncheckpointed. The existing L1-sync prune is preserved (shared prune/emit helper).
  • archiver/stdlib/foundation config: new orphanProposedBlockPruneGraceSeconds in ArchiverSpecificConfig, archiver config mappings (ARCHIVER_ORPHAN_PROPOSED_BLOCK_PRUNE_GRACE_SECONDS), mapArchiverConfig, the synchronizer/archiver config types, and a new EnvVar.
  • aztec-node: defaults the grace window from blockDurationMs / 1000 when unset, falling back to MIN_EXECUTION_TIME; the archiver factory also defaults to MIN_EXECUTION_TIME.
  • sequencer-client (tests): orphan tip returns undefined and warns; matching proposed checkpoint proceeds.
  • archiver (tests): no prune before grace; prune + event after grace; no prune when a matching proposed checkpoint exists; queued proposed checkpoint is processed before the prune.
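
The wall-clock prune condition and the grace defaulting listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the archiver's implementation: `SlotTiming`, `slotStart`, `resolveGraceSeconds` and `shouldPruneOrphan` are hypothetical names, slot starts are assumed to be `genesis + slot * slotDuration`, and the deadline is read as "start of the slot after the block's build slot, plus grace", where the build slot is the block's slot minus the pipelining offset. Whether the real comparison is inclusive at the deadline is also an assumption.

```typescript
// Hypothetical slot timing parameters, all in seconds.
interface SlotTiming {
  genesisTime: number;
  slotDuration: number;
}

// Assumed slot-start formula: evenly spaced slots from genesis.
function slotStart(slot: number, timing: SlotTiming): number {
  return timing.genesisTime + slot * timing.slotDuration;
}

// Grace defaulting as described in the change list: explicit config wins,
// then blockDurationMs / 1000, then the MIN_EXECUTION_TIME fallback
// (passed in here because its value is not stated in this PR).
function resolveGraceSeconds(
  configuredGrace: number | undefined,
  blockDurationMs: number | undefined,
  minExecutionTime: number,
): number {
  if (configuredGrace !== undefined) return configuredGrace;
  if (blockDurationMs !== undefined) return blockDurationMs / 1000;
  return minExecutionTime;
}

// True once a block-only tip with no matching proposed checkpoint is past its deadline.
function shouldPruneOrphan(
  nowSeconds: number,
  blockSlot: number,
  pipeliningOffset: number,
  hasMatchingProposedCheckpoint: boolean,
  graceSeconds: number,
  timing: SlotTiming,
): boolean {
  if (hasMatchingProposedCheckpoint) return false;
  const buildSlot = blockSlot - pipeliningOffset;
  const deadline = slotStart(buildSlot + 1, timing) + graceSeconds;
  return nowSeconds >= deadline;
}
```

With a pipelining offset of 1, the deadline in this sketch reduces to `slotStart(blockSlot) + grace`, which matches the `start(blockSlot) + grace` wording in the change list.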

@spalladino spalladino added the ci-no-fail-fast Sets NO_FAIL_FAST in the CI so the run is not aborted on the first failure label May 27, 2026
AztecBot commented May 27, 2026


Flakey Tests

🤖 says: This CI run detected 2 tests that failed, but were tolerated due to a .test_patterns.yml entry.

FLAKED (http://ci.aztec-labs.com/b83902e17e7bb944): yarn-project/end-to-end/scripts/run_test.sh simple src/e2e_epochs/epochs_invalidate_block.parallel.test.ts "archiver skips a descendant of an invalid-attestations checkpoint" (226s) (code: 0) group:e2e-p2p-epoch-flakes
FLAKED (http://ci.aztec-labs.com/18798bcaff695f1b): yarn-project/end-to-end/scripts/run_test.sh simple src/e2e_epochs/epochs_invalidate_block.parallel.test.ts "proposer invalidates multiple checkpoints" (490s) (code: 0) group:e2e-p2p-epoch-flakes

Adds a multi-node e2e (`epochs_orphan_block_prune.test.ts`) that exercises
both defenses from #23606 end-to-end: it picks consecutive distinct
proposers P1/P2, makes P1 publish its block but withhold the matching
CheckpointProposal, then asserts that every archiver (a) ingests the
orphan block at slot S1, (b) prunes it via the wall-clock orphan prune,
and (c) lets P2 rebuild block 1 at slot S2 with a checkpoint that lands
on L1.

To enable that scenario in single-block-per-checkpoint mode, adds a new
test-only `skipBroadcastCheckpointProposal` sequencer config. It is a
narrower variant of the existing `skipBroadcastProposals`: when set, the
sequencer skips the CheckpointProposal gossip broadcast but still
broadcasts the held last block standalone, so peers receive every block
yet never see a proposed checkpoint.

`pruneOrphanProposedBlocks` is wall-clock based and does not touch L1, so
it belongs on the `Archiver` rather than the `ArchiverL1Synchronizer`.
Moves the method (and its `epochCache` / `dateProvider` dependencies)
onto the `Archiver`, called directly from `sync()` between
`processInboundQueue()` and `syncFromL1()`. The synchronizer keeps the
L1-block-driven `pruneUncheckpointedBlocks` (used to clear late stale
blocks once L1 advances past their slot); its inline emit is now
duplicated in both prune paths to keep them self-contained.

No behavior change — verified by the existing orphan-prune unit tests in
`archiver-sync.test.ts` and the full archiver suite.
@spalladino spalladino force-pushed the spl/fix-missing-checkpoint-proposal branch from d6d80f9 to 0c5e020 Compare May 28, 2026 15:27
@spalladino spalladino changed the title fix: prevent building on orphan proposed blocks fix(archiver): prune blocks without proposed checkpoint by end of build slot May 28, 2026
// The L1 rollup contract only exposes proposers for epochs whose randao seed is "stable" (i.e. queryable on L1
// right now). When we look too far into the future the contract reverts with `ValidatorSelection__EpochNotStable`.
// We handle this by warping L1 forward one epoch at a time and retrying.
let S1: SlotNumber | undefined;


Not needed for this PR, but I feel like we have variations of this same code in many places.


Agree. In between pipelining and inbox I want to allocate some time to e2e refactoring.

@PhilWindle PhilWindle merged commit a612452 into merge-train/spartan May 29, 2026
17 checks passed
@PhilWindle PhilWindle deleted the spl/fix-missing-checkpoint-proposal branch May 29, 2026 08:49
spalladino added a commit that referenced this pull request Jun 2, 2026
…due (#23807)

## Motivation

The orphan-block guard in `checkSync` (added in #23606) was logging at
`warn` on every non-proposer validator, ~once per second for a full
slot, every slot. Under pipelining a node receives and re-executes a
block proposal for the next checkpoint up to one slot before the
matching checkpoint proposal arrives, so the world-state tip
legitimately sits in an as-yet-unproposed checkpoint for that whole
window. That is the happy path, not the abnormal "proposer published
blocks but never the checkpoint" case the guard is meant to flag.
Observed on `next-net`: 118 warnings in ~59s on a healthy validator for
a single slot.

## Approach

The condition that distinguishes "checkpoint hasn't arrived yet" from
"checkpoint will never arrive" is purely temporal — which is exactly
what the archiver already computes in `pruneOrphanProposedBlocks` to
decide when to prune an orphan block. The guard now reuses that same
deadline: it still refuses to build (`return undefined`) whenever the
orphan-shaped state holds, but only escalates to `warn` once the
enclosing checkpoint is overdue by that deadline; within the normal
pipelining window it logs at `debug`. The warn therefore fires at the
same instant the archiver would prune the orphan.

## Changes

- **sequencer-client**: Add `isProposedCheckpointOverdue`, mirroring the
archiver's orphan-prune deadline (`start of slot after the block's build
slot + grace`, grace derived from `blockDurationMs` as the node wiring
does). Gate the existing guard's log level on it — `warn` when overdue,
`debug` otherwise. Control flow is unchanged.
- **sequencer-client (tests)**: Thread a real `blockSlot` through the
orphan-guard test setup and split the warning test into an overdue case
(expects `warn`) and a within-window case (expects no `warn`).
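
A rough sketch of the log-level gating described in this commit message, assuming the deadline has already been computed elsewhere. The `isProposedCheckpointOverdue` name comes from the text above, but its signature here, along with `GuardOutcome` and `evaluateOrphanGuard`, is hypothetical.

```typescript
type LogLevel = "warn" | "debug";

interface GuardOutcome {
  /** Whether the sequencer may proceed to build. */
  proceed: boolean;
  /** Level at which the guard would log; unset when the guard does not trigger. */
  logLevel?: LogLevel;
}

// Assumed signature: overdue once the wall clock reaches the orphan-prune deadline.
function isProposedCheckpointOverdue(nowSeconds: number, deadlineSeconds: number): boolean {
  return nowSeconds >= deadlineSeconds;
}

function evaluateOrphanGuard(
  orphanShaped: boolean,
  nowSeconds: number,
  deadlineSeconds: number,
): GuardOutcome {
  // No orphan-shaped state: the guard stays silent and building proceeds.
  if (!orphanShaped) return { proceed: true };

  // Control flow is the same in both cases (refuse to build); only the log level differs.
  const logLevel: LogLevel = isProposedCheckpointOverdue(nowSeconds, deadlineSeconds)
    ? "warn"
    : "debug";
  return { proceed: false, logLevel };
}
```

The point of the split is that a validator inside the normal pipelining window takes the `debug` branch on every pass, so the `warn` is reserved for the case where the checkpoint is actually late.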
Thunkar pushed a commit that referenced this pull request Jun 3, 2026
## Problem

CI on `merge-train/fairies` failed on the boxes `react chromium` test
([log](http://ci.aztec-labs.com/1780510430908759), [failing
test](http://ci.aztec-labs.com/243e7294cb8ba269)) with a timeout (code
124). The actual error was during `aztec start` / `createLocalNetwork`:

```
Error: Transaction 0x0826… was dropped. Reason: Tx dropped by P2P node
  at NodeEmbeddedWallet.sendTx
  at DeployAccountMethod.send
  at deployFundedSchnorrAccounts
  at createLocalNetwork
  at aztecStart
```

The local network never came up, so the browser test timed out.

## Root cause

PR #23819 ("embedded wallet defaults to proposed") fixed the embedded
wallet so its default wait status is *actually* `PROPOSED` — previously
the default was a no-op that fell through to `waitForTx`'s
`CHECKPOINTED` default.

`PROPOSED` returns as soon as a tx lands in a proposed L2 block. In the
serial sandbox setup that races against block pruning: a
proposed-but-not-checkpointed block can be pruned by end of build slot
(see #23606), and a tx in it is then neither in the archiver nor the
pool, so `getTxReceipt` returns `DroppedTxReceipt("Tx dropped by P2P
node")`. With the old broken default this path waited for `CHECKPOINTED`
and was reliable.

The real source of flakiness is the local network setup, not the boxes.

## Fix

Thread an explicit `{ waitForStatus: TxStatus.CHECKPOINTED }` wait
through the sandbox-setup sends:

- `createLocalNetwork`: `deployFundedSchnorrAccounts`,
`publishStandardAuthRegistry`, `setupBananaFPC`
- `setup-l2-contracts` CLI wait options

The intended product default of `PROPOSED` for normal wallet usage is
unchanged; only the CI/sandbox bring-up that needs durable inclusion
before the next serial tx is pinned to `CHECKPOINTED`. e2e fixtures use
`TestWallet` (BaseWallet's `CHECKPOINTED` default) and are unaffected.

Also reverts the per-box `CHECKPOINTED` waits that #23819 added to the
react/vite/vanilla boxes: they didn't fix the flakiness (the
local-network setup did), so the box sends go back to using the embedded
wallet `PROPOSED` default.

## Verification

TypeScript-only change in `yarn-project` plus box reverts; the box files
now match their pre-#23819 state exactly. A full `./bootstrap.sh ci`
could not be run in this container (clang 18 vs required 20, zig
missing, no remote build cache; the suite is multi-hour). Confirmed by
the merge-train CI re-run of the boxes tests.
danielntmd pushed a commit to danielntmd/aztec-packages that referenced this pull request Jun 4, 2026
BEGIN_COMMIT_OVERRIDE
test(e2e): unskip pipelining related e2e tests (AztecProtocol#23642)
fix(archiver): prune blocks without proposed checkpoint by end of build
slot (AztecProtocol#23606)
test: migrate benchmarks to pipelining setup (AztecProtocol#23647)
fix(p2p): fall back to archiver in BLOCK_TXS response validation
(AztecProtocol#23624)
docs(slashing): align operator and slasher docs with AZIP-7 (AztecProtocol#23494)
fix(p2p): do not penalize peers that signal a missing block with Fr.ZERO
(AztecProtocol#23672)
chore: adjust metrics deployment (AztecProtocol#23676)
fix(cheat-codes): warpL2TimeAtLeastBy advances relative to leading clock
(AztecProtocol#23675)
chore: tighten node pool sizes (AztecProtocol#23678)
chore: remove archival nodes (AztecProtocol#23630)
chore: merge blob sink duties into RPC node (AztecProtocol#23631)
fix: sync avm-transpiler Cargo.lock with noir submodule (AztecProtocol#23683)
fix(spartan): set validator lag env vars in tps-scenario (AztecProtocol#23684)
fix: make world-state hash queries reorg-aware to close getWorldState
race (AztecProtocol#23677)
fix: pin noir submodule to next's version on merge-train/spartan
(AztecProtocol#23690)
fix: ensure image ref is used by bench runner (AztecProtocol#23682)
fix(ci): retry aztec-nr nargo dependency clone on transient network
flake (AztecProtocol#23653)
chore: run one-off jobs on network nodes (AztecProtocol#23701)
fix: simulate proposals inside target slot (AztecProtocol#23692)
chore: smaller eth-devnet (AztecProtocol#23704)
chore: enable testnet autoscaling (AztecProtocol#23705)
feat(api)!: redesign node log retrieval API around tag-based queries
(AztecProtocol#23625)
fix(sequencer): set own proposed checkpoint locally instead of via p2p
loopback (AztecProtocol#23659)
END_COMMIT_OVERRIDE