
Commit 30640df

test(e2e): fix data_withholding_slash flake by freezing L1 across restart (#23162)
## Motivation

`e2e_p2p_data_withholding_slash` was flaky because L1 raced past the epoch-8 prune deadline (`aztecProofSubmissionEpochs=0` makes the deadline ~32s after slot 17) while we stopped, wiped, and recreated the 4 validators (~28s). The recreated archivers detected the prune during their initial L1 sync and emitted `L2PruneUnproven` for epoch 8 with the original tx-carrying block, but `EpochPruneWatcher.start()` is only invoked inside `void archiver.waitForInitialSync().then(...)` in `aztec-node/server.ts`, so the listener wasn't attached yet and the event was dropped silently. The recreated validators then built an empty epoch 10 on top of genesis, which pruned cleanly later, producing 4 `VALID_EPOCH_PRUNED` offenses instead of the expected 4 `DATA_WITHHOLDING`.

## Approach

Pause anvil block production between `removeInitialNode` and `stopNodes` so L1 stays inside epoch 8 across the recreate gap. The recreated archivers then ingest checkpoint 1 cleanly during initial sync (no prune fires, so there is nothing to miss), `EpochPruneWatcher.start()` attaches its listener, and we resume L1 with an explicit warp + mine + interval restart so the deadline crossing is deterministic: the prune now fires while the watcher is live, producing `DATA_WITHHOLDING` for epoch 8 as the test expects. A `getCurrentEpoch() < 9` assertion right after pausing fails fast if the timing window ever tightens further.

## Changes

- **end-to-end (tests)**: in `data_withholding_slash.test.ts`, pause L1 mining after `removeInitialNode` and before `stopNodes`; resume after `waitForP2PMeshConnectivity` by warping to current wall-clock time, mining one L1 block, and restoring interval mining. Add a fail-fast assertion that we are still in epoch 8 when we pause.
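The race described in Motivation comes down to event ordering: Node's `EventEmitter` delivers an event only to listeners attached at emit time. Below is a minimal, generic illustration of that ordering bug, using a plain `EventEmitter` and a hypothetical payload; this is not the actual archiver API, only a sketch of why the epoch-8 event was lost.

```ts
import { EventEmitter } from 'node:events';

// Hypothetical stand-in for the archiver's event source; the real
// L2PruneUnproven plumbing in aztec-packages differs in detail.
const archiverEvents = new EventEmitter();

// During initial L1 sync the archiver detects the prune and emits immediately...
archiverEvents.emit('L2PruneUnproven', { epoch: 8 }); // no listener yet: silently dropped

// ...but the watcher only subscribes after waitForInitialSync() resolves,
// so this handler never sees the epoch-8 event and no offense is recorded.
archiverEvents.on('L2PruneUnproven', ev => console.log('prune offense:', ev));
```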
1 parent 4d8791a · commit 30640df

1 file changed: 39 additions & 1 deletion

yarn-project/end-to-end/src/e2e_p2p/data_withholding_slash.test.ts

```diff
@@ -157,8 +157,36 @@ describe('e2e_p2p_data_withholding_slash', () => {
     t.logger.warn('L2 txs mined');

     t.logger.warn('Stopping nodes');
+    // removeInitialNode sends a dummy L1 tx and awaits its receipt to sync the
+    // dateProvider, so it must run while L1 mining is still active.
     await t.removeInitialNode();
-    // Now stop the nodes,
+
+    // Pause L1 block production while we tear down and recreate validators. With
+    // `aztecProofSubmissionEpochs=0`, epoch 8 becomes prunable as soon as epoch 9 begins
+    // (~32s after slot 17). The stop/wipe/recreate cycle takes longer than that, so L1
+    // would otherwise race past the prune deadline before the recreated nodes come up.
+    // When that happens, the recreated archivers detect the prune during their initial
+    // sync (`handleEpochPrune` emits `L2PruneUnproven`), but the `EpochPruneWatcher`
+    // listener is only attached after `archiver.waitForInitialSync()` resolves
+    // (see `aztec-node/server.ts`), so the event is dropped and `DATA_WITHHOLDING` is
+    // never emitted. By freezing L1 here, the recreated archivers ingest checkpoint 1
+    // cleanly during initial sync, the watcher starts and attaches its listener, and
+    // then we resume L1 below so the prune fires while the listener is live.
+    const ethCheatCodes = t.ctx.cheatCodes.eth;
+    await ethCheatCodes.setAutomine(false);
+    await ethCheatCodes.setIntervalMining(0);
+
+    // Fail fast if we paused too late — i.e. if L1 already crossed into epoch 9 before
+    // we got here. In that case the recreated nodes would still see the prune during
+    // initial sync and the test would flake exactly the same way.
+    const epochAtPause = await rollup.getCurrentEpoch();
+    expect(Number(epochAtPause)).toBeLessThan(9);
+
+    // Now stop the validator nodes. With L1 paused, any in-flight L1 submissions from
+    // the validator sequencers would hang `sequencer.stop()` (it awaits pending L1
+    // submissions). Since `minTxsPerBlock=1` and no txs are queued for slot 18+, the
+    // sequencers don't submit further L1 transactions after the slot-17 checkpoint
+    // (already published before `waitForTx` returned), so this is safe.
     await t.stopNodes(nodes);
     // And remove the data directories (which forms the crux of the "attack")
     for (let i = 0; i < NUM_VALIDATORS; i++) {
```
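For readers without the test harness, the pause above maps onto anvil's `evm_setAutomine` and `evm_setIntervalMining` cheat-code RPCs (`ethCheatCodes` is the harness's own wrapper, not shown in this diff). A rough equivalent sketch using viem's test client, assuming a local anvil node at the default RPC URL:

```ts
import { createTestClient, http } from 'viem';
import { foundry } from 'viem/chains';

// Assumed setup: a local anvil instance reachable over HTTP.
const testClient = createTestClient({ mode: 'anvil', chain: foundry, transport: http() });

async function pauseL1(): Promise<void> {
  // Freeze L1 entirely: no mining on incoming txs, and an interval of 0
  // disables periodic mining, so neither blocks nor timestamps advance
  // while the validators are torn down and recreated.
  await testClient.setAutomine(false);
  await testClient.setIntervalMining({ interval: 0 });
}
```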
```diff
@@ -186,6 +214,16 @@ describe('e2e_p2p_data_withholding_slash', () => {
     // Wait for P2P mesh to be fully formed before proceeding
     await t.waitForP2PMeshConnectivity(nodes, NUM_VALIDATORS);

+    // Resume L1 block production. Warp L1 forward to current wall-clock time so the
+    // epoch-8 deadline is crossed immediately on the next L1 block, then re-enable
+    // interval mining. By now each recreated archiver has block 1 stored locally and
+    // its `EpochPruneWatcher` listener is attached, so the next sync iteration emits
+    // `L2PruneUnproven` for epoch 8 to a live listener → `DATA_WITHHOLDING`.
+    const resumeTimestamp = Math.floor(t.ctx.dateProvider.now() / 1000);
+    await ethCheatCodes.setNextBlockTimestamp(resumeTimestamp);
+    await ethCheatCodes.mine();
+    await ethCheatCodes.setIntervalMining(t.ctx.aztecNodeConfig.ethereumSlotDuration);
+
     const offenses = await awaitOffenseDetected({
       epochDuration: t.ctx.aztecNodeConfig.aztecEpochDuration,
       logger: t.logger,
```
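Continuing the viem sketch above, the deterministic resume (warp the next timestamp to now, mine one block to cross the deadline, then restore interval mining) could look like the following; `slotDurationSec` stands in for the test's `ethereumSlotDuration` and is an assumption of this sketch:

```ts
async function resumeL1(slotDurationSec: number): Promise<void> {
  // Warp the next block's timestamp to wall-clock "now" so the epoch-8
  // prune deadline is crossed on the very next block, not gradually.
  const now = BigInt(Math.floor(Date.now() / 1000));
  await testClient.setNextBlockTimestamp({ timestamp: now });

  // Produce the single deadline-crossing block while the watcher is live.
  await testClient.mine({ blocks: 1 });

  // Restore steady block production at the L1 slot cadence.
  await testClient.setIntervalMining({ interval: slotDurationSec });
}
```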
