Commit bd71022
fix: deflake HA full e2e suite by switching to in-proc interval-mining anvil (#23979)
Fixes the flaky HA full suite (`e2e_ha_full`) seen in
http://ci.aztec-labs.com/8e1e980c4886df0d, where "should distribute work
across multiple HA nodes" timed out awaiting a trigger tx. Also
re-enables the suite, which #23976 had skipped.
## Root cause
The HA compose suite was the only block-building suite running against
an L1 with no self-advancing clock. Its anvil container ran in automine
with no `--block-time`, and being external, it was excluded from the
`TestDateProvider` sync that locally-spawned anvils get. L1 chain time
only moved when something mined, while the shared sequencer clock
free-ran. #23821 removed the `AnvilTestWatcher` that used to couple the
two clocks in this mode and replaced it with per-iteration nudges in the
test (clock warp + blind `mine(8)`).
Two consequences, both visible in the failed run's logs:
- The `mine(8)` overshoot put L1 ~1.5 slots ahead of the test clock, so
each iteration's first propose raced its slot boundary and was silently
dropped, followed by a prune that destroyed the pipelined builders'
forks (`Fork not found` on all surviving nodes). This race was lost in
passing runs too.
- Recovery then required the proposers' archiver-sync gate to clear, but
the gate's deadline runs on the free-running test clock while nothing
mines L1 during the test's `waitForTx` — `Archiver did not sync L1 past
slot 109 before slot 110 expired, discarding pipelined work`, repeated
until the jest timeout. Whether a run passed or failed came down to
seconds of margin on this gate.
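
The gap between the two clocks can be illustrated with a toy model. This is a minimal sketch, not code from the repository: `clockLeadAfterWait`, the regime names, and the one-second tick are illustrative assumptions. It also assumes that in the synced regime the test clock free-runs between blocks and snaps to the L1 timestamp whenever a block arrives.

```typescript
// Toy model (hypothetical, not Aztec code): how far the test clock runs ahead
// of L1 chain time while a test idles, e.g. inside a waitForTx-style wait.
type Regime = "automine-unsynced" | "interval-synced";

function clockLeadAfterWait(
  regime: Regime,
  waitSeconds: number,
  blockTimeSeconds: number,
): number {
  let l1Time = 0; // L1 chain time, in seconds
  let testClock = 0; // shared sequencer/test clock, in seconds

  for (let t = 1; t <= waitSeconds; t++) {
    testClock += 1; // the test clock free-runs in both regimes

    if (regime === "interval-synced" && t % blockTimeSeconds === 0) {
      l1Time += blockTimeSeconds; // a block is mined on schedule
      testClock = l1Time; // the clock snaps to the block timestamp
    }
    // In "automine-unsynced" nothing mines while the test idles,
    // so l1Time never moves and the lead keeps growing.
  }
  return testClock - l1Time;
}

console.log(clockLeadAfterWait("automine-unsynced", 30, 12)); // 30
console.log(clockLeadAfterWait("interval-synced", 30, 12)); // 6
```

In the unsynced regime the lead grows without bound for as long as nothing mines, which is the condition under which a deadline measured on the test clock can expire before L1 advances. In the synced regime the lead stays below one block time.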
## Fix
Stop emulating L1 time in the test and run the suite in the same regime
as every other block-building e2e (e.g. `e2e_epochs`):
- Drop the anvil container and `ETHEREUM_HOSTS` from the HA compose
file. With no external L1 configured, `setup()` spawns anvil in-proc
with interval mining (`--block-time = ethereumSlotDuration`) and keeps
the `TestDateProvider` snapped to L1 block timestamps via the existing
stdout listener. The sibling web3signer compose suite already works this
way.
- Add `automineL1Setup: true` so L1 contract deployment runs under
temporary automine before interval mining starts.
- Delete all time scaffolding from the test (clock warps, cheat-mining
heartbeats, archiver sync nudges). Tests submit a tx and wait, in real
time. No assertions change.
No production code changes: with a self-advancing L1, the sequencer and
publisher behave exactly as on a real network.
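
The "snapped to L1 block timestamps" behaviour can be sketched as a date provider that stores an offset from the wall clock and recomputes it on every block. This is a hedged illustration only: `SnappedDateProvider`, `onL1Block`, and the injected wall clock are hypothetical names, not the real `TestDateProvider` API, and the stdout listener that feeds it is omitted.

```typescript
// Hypothetical stand-in for a test date provider that follows L1 block time.
// Names and shape are assumptions for illustration, not the repository's API.
class SnappedDateProvider {
  private offsetMs = 0;
  private readonly wallClockMs: () => number;

  constructor(wallClockMs: () => number) {
    this.wallClockMs = wallClockMs;
  }

  // Current time as seen by sequencers: wall clock plus the snapped offset.
  now(): number {
    return this.wallClockMs() + this.offsetMs;
  }

  // Called whenever a new L1 block is observed; realigns the clock so that
  // now() equals the block timestamp at the moment of the call.
  onL1Block(blockTimestampSeconds: number): void {
    this.offsetMs = blockTimestampSeconds * 1000 - this.wallClockMs();
  }
}

// Usage with a fake wall clock so the behaviour is deterministic.
let wallMs = 1_000_000;
const clock = new SnappedDateProvider(() => wallMs);
clock.onL1Block(5_000); // L1 block timestamp of 5000 s
wallMs += 3_000; // three seconds of real time pass before the next block
console.log(clock.now()); // 5003000
```

Because the offset is recomputed on every block, any drift between the two clocks is bounded by one block interval rather than accumulating over the test.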
## Parallelization
The suite file is renamed to `e2e_ha_full.parallel.test.ts`, so CI runs
each of its 8 tests as an isolated job in its own compose stack instead
of one 15+ minute serial job:
- `bootstrap.sh` expands the HA suite per test name (same mechanism as
the existing `.parallel` simple tests).
- `run_test.sh` forwards the test name into the compose stack and
namespaces the docker compose project per test so concurrent jobs on one
host don't collide.
- `sendTriggerTx` now starts the HA sequencers idempotently, since under
per-test isolation the governance/reload/distribute tests run without
the first test (previously the only caller of `startHASequencers`).
- Three clock-skew test titles contained parentheses, which jest's
`--testNamePattern` interprets as regex groups (the filter would
silently match nothing); they are retitled.
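
The parentheses problem is reproducible with a plain `RegExp`. The sketch below uses a made-up test title (the real titles are not shown in this commit message) and a hypothetical `escapeRegExp` helper. The commit retitles the tests instead of escaping, so the helper is shown only as an alternative.

```typescript
// Escapes regex metacharacters so a literal title can be used as a pattern.
// Hypothetical helper for illustration; the commit retitles the tests instead.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Made-up title containing parentheses (assumption, not from the suite).
const title = "should reject proposals (clock skew)";

// Used unescaped, "(clock skew)" is a capture group that matches the text
// "clock skew" WITHOUT the surrounding parentheses, so the title never matches.
const rawPattern = new RegExp(title);

// Escaped, the parentheses are matched literally.
const escapedPattern = new RegExp(escapeRegExp(title));

console.log(rawPattern.test(title)); // false
console.log(escapedPattern.test(title)); // true
```

Retitling avoids the need to escape the name at every layer it passes through (shell, compose environment, jest), which is likely why it is the simpler fix here.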
## Teardown fix (follow-up to the first CI round)
The first CI round passed every test body but three jobs
(produce-blocks, governance, reload) hung in `afterAll` until the job
timeout. Two compounding causes, both fixed here:
- `afterAll` reset the shared `TestDateProvider` *before* stopping
nodes. The reset rewinds the clock from chain time to wall time —
minutes apart after the automine deploy burst — so vote submissions
armed against the rewound clock pushed sequencer stops out by that gap.
The old 30s abandon-race then gave up, and the abandoned nodes outlived
the jest environment, keeping the worker alive until the CI timeout
(jest runs without `forceExit`). `afterAll` now stops sequencers first,
awaits every node stop fully, and resets the clock last. These three
jobs are the ones whose tests end with sequencers still running. The
distribute test stops its nodes in-test, before any reset, so nothing
was left running to be delayed by the rewound clock and it passed.
- Ports #23990 from `merge-train/spartan` (not previously on the v5
line): `CheckpointProposalJob.interrupt()` now propagates to the
publisher, cancelling the `sendRequestsAt` slot-deadline sleep on
sequencer stop, so a pending vote submission can never block shutdown.
The original PR's `e2e_ha_full` teardown changes are superseded by the
rework above and were not ported.
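
The effect of resetting the clock before stopping nodes can be shown with simple deadline arithmetic. This is a minimal sketch under stated assumptions: it assumes a pending submission sleeps until a slot deadline measured against the shared clock, and `remainingSleepMs`, the five-minute gap, and the four-second deadline are illustrative values, not taken from the code or logs.

```typescript
// How long a sleep-until-deadline still has to wait, given the clock reading.
// Hypothetical helper; the real sendRequestsAt logic is not reproduced here.
function remainingSleepMs(deadlineMs: number, nowMs: number): number {
  return Math.max(0, deadlineMs - nowMs);
}

const wallTimeMs = 1_000_000; // arbitrary wall-clock reading
const chainLeadMs = 5 * 60 * 1000; // assumed gap: chain time 5 min ahead of wall time
const chainTimeMs = wallTimeMs + chainLeadMs;

// A vote submission armed for a slot deadline 4 s ahead in chain time.
const slotDeadlineMs = chainTimeMs + 4_000;

// Clock still on chain time: the sleep ends after 4 s and the stop completes.
const beforeReset = remainingSleepMs(slotDeadlineMs, chainTimeMs);

// Clock rewound to wall time first: the same deadline is now 5 min 4 s away.
const afterReset = remainingSleepMs(slotDeadlineMs, wallTimeMs);

console.log(beforeReset); // 4000
console.log(afterReset); // 304000
```

Stopping sequencers before the reset keeps any pending sleep short, and making the sleep cancellable on interrupt removes the dependency on the deadline altogether.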
## Verification
- Three full local runs of the suite via `run_test.sh ha` (all 8 tests
each): green in 255s / 254s / 268s of jest time (the old warp-based
suite ran 10+ minutes), with zero occurrences of the old failure
signatures (`Fork not found`, `Archiver did not sync`, `discarding
pipelined work`) — passing runs of the old code showed 12+ `Fork not
found` errors even when green.
- One per-test CI-style run (`run_test.sh ha <file> "should distribute
work across multiple HA nodes"`): the originally flaky test passes
standalone in its own compose stack (7 skipped, 1 passed), exercising
the full `TEST_NAME` plumbing.
- `yarn build`, `yarn format`, `yarn lint` clean; `sequencer-client`
unit tests pass (back to the pre-change suite after the revert).

1 parent 6aff5b9, commit bd71022
9 files changed
Lines changed: 143 additions & 163 deletions
File tree
- yarn-project
- end-to-end
- scripts
- ha
- src/composed/ha
- sequencer-client/src
- publisher
- sequencer