Commit bd71022

fix: deflake HA full e2e suite by switching to in-proc interval-mining anvil (#23979)
Fixes the flaky HA full suite (`e2e_ha_full`) seen in http://ci.aztec-labs.com/8e1e980c4886df0d, where "should distribute work across multiple HA nodes" timed out awaiting a trigger tx. Also re-enables the suite, which #23976 had skipped.

## Root cause

The HA compose suite was the only block-building suite running against an L1 with no self-advancing clock. Its anvil container ran in automine with no `--block-time`, and being external, it was excluded from the `TestDateProvider` sync that locally-spawned anvils get. L1 chain time only moved when something mined, while the shared sequencer clock free-ran.

#23821 removed the `AnvilTestWatcher` that used to couple the two clocks in this mode and replaced it with per-iteration nudges in the test (clock warp + blind `mine(8)`). Two consequences, both visible in the failed run's logs:

- The `mine(8)` overshoot put L1 ~1.5 slots ahead of the test clock, so each iteration's first propose raced its slot boundary and was silently dropped, followed by a prune that destroyed the pipelined builders' forks (`Fork not found` on all surviving nodes). This race was lost in passing runs too.
- Recovery then required the proposers' archiver-sync gate to clear, but the gate's deadline runs on the free-running test clock while nothing mines L1 during the test's `waitForTx`: `Archiver did not sync L1 past slot 109 before slot 110 expired, discarding pipelined work`, repeated until the jest timeout. Whether a run passed or failed came down to seconds of margin on this gate.

## Fix

Stop emulating L1 time in the test and run the suite in the same regime as every other block-building e2e (e.g. `e2e_epochs`):

- Drop the anvil container and `ETHEREUM_HOSTS` from the HA compose file. With no external L1 configured, `setup()` spawns anvil in-proc with interval mining (`--block-time = ethereumSlotDuration`) and keeps the `TestDateProvider` snapped to L1 block timestamps via the existing stdout listener. The sibling web3signer compose suite already works this way.
- Add `automineL1Setup: true` so L1 contract deployment runs under temporary automine before interval mining starts.
- Delete all time scaffolding from the test (clock warps, cheat-mining heartbeats, archiver sync nudges). Tests submit a tx and wait, in real time.

No assertions change. No production code changes: with a self-advancing L1, the sequencer and publisher behave exactly as on a real network.

## Parallelization

The suite file is renamed to `e2e_ha_full.parallel.test.ts`, so CI runs each of its 8 tests as an isolated job in its own compose stack instead of one 15+ minute serial job:

- `bootstrap.sh` expands the HA suite per test name (same mechanism as the existing `.parallel` simple tests).
- `run_test.sh` forwards the test name into the compose stack and namespaces the docker compose project per test so concurrent jobs on one host don't collide.
- `sendTriggerTx` now starts the HA sequencers idempotently, since under per-test isolation the governance/reload/distribute tests run without the first test (previously the only caller of `startHASequencers`).
- Three clock-skew test titles contained parentheses, which jest's `--testNamePattern` interprets as regex groups (the filter would silently match nothing); they are retitled.

## Teardown fix (follow-up to the first CI round)

The first CI round passed every test body but three jobs (produce-blocks, governance, reload) hung in `afterAll` until the job timeout. Two compounding causes, both fixed here:

- `afterAll` reset the shared `TestDateProvider` *before* stopping nodes. The reset rewinds the clock from chain time to wall time (minutes apart after the automine deploy burst), so vote submissions armed against the rewound clock pushed sequencer stops out by that gap. The old 30s abandon-race then gave up, and the abandoned nodes outlived the jest environment, keeping the worker alive until the CI timeout (jest runs without `forceExit`). `afterAll` now stops sequencers first, awaits every node stop fully, and resets the clock last. These three jobs are the ones whose tests end with sequencers still running; the distribute test (which stops nodes in-test, before any reset) passed for the same reason.
- Ports #23990 from `merge-train/spartan` (not previously on the v5 line): `CheckpointProposalJob.interrupt()` now propagates to the publisher, cancelling the `sendRequestsAt` slot-deadline sleep on sequencer stop, so a pending vote submission can never block shutdown. The original PR's `e2e_ha_full` teardown changes are superseded by the rework above and were not ported.

## Verification

- Three full local runs of the suite via `run_test.sh ha` (all 8 tests each): green in 255s / 254s / 268s of jest time (the old warp-based suite ran 10+ minutes), with zero occurrences of the old failure signatures (`Fork not found`, `Archiver did not sync`, `discarding pipelined work`); passing runs of the old code showed 12+ `Fork not found` errors even when green.
- One per-test CI-style run (`run_test.sh ha <file> "should distribute work across multiple HA nodes"`): the originally flaky test passes standalone in its own compose stack (7 skipped, 1 passed), exercising the full `TEST_NAME` plumbing.
- `yarn build`, `yarn format`, `yarn lint` clean; `sequencer-client` unit tests pass (back to the pre-change suite after the revert).
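
The parenthesis problem described under Parallelization can be reproduced outside jest. The sketch below uses `grep -E` as a stand-in for jest's JavaScript regex matching (the two agree on how an unescaped group behaves in this case). The title and the helper names `matches_as_regex` and `escape_parens` are hypothetical and do not come from the suite.

```shell
#!/bin/sh
# Hypothetical test title containing parentheses (not taken from the suite).
title='tolerates clock skew (node ahead)'

# Treat $1 as an extended regex and test it against the text in $2.
matches_as_regex() { printf '%s\n' "$2" | grep -Eq -- "$1"; }

# Backslash-escape parentheses so they match literally.
escape_parens() { printf '%s\n' "$1" | sed 's/[()]/\\&/g'; }

# Unescaped, "(node ahead)" is a group, so the pattern looks for
# "skew node ahead" and never sees the literal parentheses in the title.
if matches_as_regex "$title" "$title"; then echo "raw: match"; else echo "raw: no match"; fi
# prints "raw: no match"

# Escaped, the same title matches itself.
if matches_as_regex "$(escape_parens "$title")" "$title"; then echo "escaped: match"; else echo "escaped: no match"; fi
# prints "escaped: match"
```

The commit retitles the tests instead of escaping them, which keeps the title usable as a pattern without any quoting layer in `bootstrap.sh` or `run_test.sh`.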
1 parent 6aff5b9 commit bd71022

9 files changed

Lines changed: 143 additions & 163 deletions

.test_patterns.yml

Lines changed: 1 addition & 1 deletion
@@ -371,7 +371,7 @@ tests:
     owners:
       - *palla

-  - regex: "yarn-project/end-to-end/scripts/run_test.sh ha src/composed/ha/e2e_ha_full.test.ts"
+  - regex: "yarn-project/end-to-end/scripts/run_test.sh ha src/composed/ha/e2e_ha_full.parallel.test.ts"
     owners:
       - *spyros

yarn-project/end-to-end/bootstrap.sh

Lines changed: 7 additions & 1 deletion
@@ -96,7 +96,13 @@ function test_cmds {
   )
   for test in "${tests[@]}"; do
     # We must set ONLY_TERM_PARENT=1 to allow the script to fully control cleanup process.
-    echo "$hash:ONLY_TERM_PARENT=1:TIMEOUT=30m $run_test_script ha $test"
+    if [[ "$test" == *.parallel.test.ts ]]; then
+      while IFS= read -r test_name; do
+        echo "$hash:ONLY_TERM_PARENT=1:TIMEOUT=30m $run_test_script ha $test \"$test_name\""
+      done < <(extract_test_names "$test")
+    else
+      echo "$hash:ONLY_TERM_PARENT=1:TIMEOUT=30m $run_test_script ha $test"
+    fi
   done

   #echo "$hash:ONLY_TERM_PARENT=1 $run_test_script simple src/e2e_multi_validator/e2e_multi_validator_node.test.ts"
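
A minimal sketch of the per-test expansion in this hunk, runnable on its own. The real `extract_test_names` helper is not shown in this diff, so `extract_test_names_sketch` below is a hypothetical stand-in that only recognises plain `it('...')` and `it("...")` calls. The emitted command is shortened (no `$hash` prefix or timeout settings), and the branch is written with POSIX `case` and a pipe rather than the hunk's `[[ ]]` and process substitution so it runs under any `sh`.

```shell
#!/bin/sh
# Hypothetical stand-in for extract_test_names (the real helper is not in this diff).
# Prints the title of each it('...') / it("...") call, one per line.
extract_test_names_sketch() {
  sed -nE "s/^[[:space:]]*it\(['\"]([^'\"]+)['\"].*/\1/p" "$1"
}

# Emit one command per test for *.parallel.test.ts files, else one per file.
expand_ha_cmds() {
  test_file="$1"
  case "$test_file" in
    *.parallel.test.ts)
      extract_test_names_sketch "$test_file" | while IFS= read -r test_name; do
        echo "run_test.sh ha $test_file \"$test_name\""
      done
      ;;
    *)
      echo "run_test.sh ha $test_file"
      ;;
  esac
}

# Demo against a throwaway file with two tests.
demo_dir=$(mktemp -d)
demo_file="$demo_dir/demo.parallel.test.ts"
printf '%s\n' \
  "describe('demo', () => {" \
  "  it('first test', async () => {});" \
  "  it('second test', async () => {});" \
  "});" > "$demo_file"

expand_ha_cmds "$demo_file"        # two lines, one per test title
expand_ha_cmds "src/plain.test.ts" # one line, no test name
rm -rf "$demo_dir"
```

Each emitted line becomes an independent CI job, which is why the test name must survive quoting all the way into the compose stack.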

yarn-project/end-to-end/scripts/ha/docker-compose.yml

Lines changed: 3 additions & 14 deletions
@@ -29,12 +29,6 @@ services:
     volumes:
       - web3signer_keys:/keys

-  anvil:
-    image: aztecprotocol/build:3.0
-    cpus: 1
-    mem_limit: 2G
-    entrypoint: 'anvil --silent -p 8545 --host 0.0.0.0 --chain-id 31337'
-
   end-to-end:
     image: aztecprotocol/build:3.0
     cpus: 4
@@ -51,7 +45,8 @@ services:
     environment:
       JEST_CACHE_DIR: /tmp-jest
       LOG_LEVEL: ${LOG_LEVEL:-verbose}
-      ETHEREUM_HOSTS: http://anvil:8545
+      TEST: ${TEST:-./src/composed/ha/e2e_ha_full.parallel.test.ts}
+      TEST_NAME: ${TEST_NAME:-}
       L1_CHAIN_ID: 31337
       DATABASE_URL: postgresql://aztec:aztec@postgres:5432/aztec_ha_test
       WEB3_SIGNER_URL: http://web3signer:9000
@@ -70,10 +65,6 @@ services:
         while ! nc -z web3signer 9000; do sleep 1; done;
         echo "Web3Signer is ready"

-        # Wait for anvil to be ready
-        while ! nc -z anvil 8545; do sleep 1; done;
-        echo "Anvil is ready"
-
         # Run database migrations
         echo "Running database migrations..."
         cd /root/aztec-packages/yarn-project/aztec
@@ -84,7 +75,7 @@ services:
         cd /root/aztec-packages/yarn-project/end-to-end

         # Run the test
-        setsid ./scripts/test_simple.sh ${TEST:-./src/composed/ha/e2e_ha_sequencer.test.ts} &
+        setsid ./scripts/test_simple.sh "$${TEST}" "$${TEST_NAME}" &
         pid=$$!
         pgid=$$(($$(ps -o pgid= -p $$pid)))
         trap "kill -SIGTERM -$$pgid" SIGTERM
@@ -96,8 +87,6 @@ services:
         condition: service_healthy
       web3signer:
         condition: service_started
-      anvil:
-        condition: service_started

 volumes:
   postgres_data:

yarn-project/end-to-end/scripts/run_test.sh

Lines changed: 5 additions & 2 deletions
@@ -25,7 +25,10 @@ case "$type" in
     TEST=$test exec run_compose_test $test end-to-end $PWD/web3signer
     ;;
   "ha")
-    # Remove volumes on cleanup for HA tests to ensure clean database state on retries
-    TEST=$test REMOVE_COMPOSE_VOLUMES=1 exec run_compose_test $test end-to-end $PWD/ha
+    # Remove volumes on cleanup for HA tests to ensure clean database state on retries.
+    # NAME_POSTFIX namespaces the compose project per test so parallel per-test jobs don't collide.
+    # Compose project names must be lowercase alphanumerics, hyphens, and underscores.
+    postfix=$(echo "$test_name" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9]/_/g')
+    TEST=$test TEST_NAME=$test_name NAME_POSTFIX=${postfix:+_$postfix} REMOVE_COMPOSE_VOLUMES=1 exec run_compose_test $test end-to-end $PWD/ha
     ;;
 esac
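
To see what the postfix pipeline in this hunk produces, the sketch below wraps the same `tr`/`sed` commands and the `${postfix:+_$postfix}` expansion in a function. The name `compose_postfix` is hypothetical; how `run_compose_test` consumes `NAME_POSTFIX` is not shown in this diff and is not modelled here.

```shell
#!/bin/sh
# Hypothetical wrapper around the pipeline from the hunk above.
compose_postfix() {
  # Lowercase the title, then turn every character outside [a-z0-9] into "_".
  postfix=$(echo "$1" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9]/_/g')
  # ${postfix:+_$postfix} expands to nothing when postfix is empty, so a run
  # without a test name gets no postfix at all (not a stray underscore).
  echo "${postfix:+_$postfix}"
}

compose_postfix "should distribute work across multiple HA nodes"
# prints "_should_distribute_work_across_multiple_ha_nodes"

compose_postfix ""
# prints an empty line
```

Distinct titles that differ only in punctuation or case would map to the same postfix; this is harmless only as long as the suite's titles stay distinct after sanitisation.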
