
fix(test): wait for full gossip mesh before committee produces (A-1219) #24149

Merged
PhilWindle merged 1 commit into merge-train/spartan-v5 from phil/a-1219-flaky-e2e_p2p-gossip_network_no_cheat
Jun 17, 2026

@PhilWindle

Collaborator

Problem

e2e_p2p_network › should rollup txs from all peers (and add the validators without cheating) (in gossip_network_no_cheat.test.ts) intermittently fails with TimeoutError: Timeout awaiting first checkpoint published — the chain never gets a first checkpoint onto L1 within 120s.

Log analysis

From CI run 6d6e74a70fce8826:

  • The test did a blind sleep(8000) for peer discovery, then waited for the first checkpoint. On the 2-CPU runner the gossipsub proposal/checkpoint meshes were not fully formed 8s in.
  • The first checkpoint attempt (slot 97, proposer validator-3) reached only 2 of 3 attestations — validators 1 and 2 never received the slot-97 proposal at all (only validator-4 had a live gossip path). No L1 publish was attempted; the proposer aborted locally on the attestation-collection timeout.
  • Because that checkpoint never landed, the L1-confirmed chain stayed at genesis, so every later slot rebuilt a competing un-checkpointed block 1 (new archive). The blocks are pruned (archiver:l1-sync: Pruning blocks after block 0 ...), but the prune lands ~1.5 slots after the block is built — later than the next proposal arrives. So peers still holding a not-yet-pruned block 1 rejected the new proposal with block_number_already_exists, never re-executed, never attested — capping every round at 2/3 forever.

Root cause: the gossip mesh wasn't formed when the committee started producing, so the first proposal reached only a subset of the committee. That both starved the first checkpoint of quorum and split the validators onto competing block-1 forks that never re-converge.
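
The timing race described in the log analysis can be illustrated with a small model. This is only a sketch under stated assumptions: `HeldBlock`, `rejectsProposal`, and the slot arithmetic are hypothetical names invented for illustration, not the archiver's or validator's actual code, and the 1.5-slot prune latency is the approximate figure quoted above.

```typescript
// Hypothetical model, NOT the real archiver or validator code.
// Times are measured in slots.
interface HeldBlock {
  blockNumber: number;
  storedAtSlot: number; // slot at which this peer stored the un-checkpointed block
}

// Approximate prune latency taken from the log analysis (~1.5 slots).
const PRUNE_LATENCY_SLOTS = 1.5;

// A peer rejects a proposal while it still holds an unpruned block with the
// same block number (the block_number_already_exists case).
function rejectsProposal(held: HeldBlock, proposalBlockNumber: number, arrivalSlot: number): boolean {
  const prunedAtSlot = held.storedAtSlot + PRUNE_LATENCY_SLOTS;
  const stillHeld = arrivalSlot < prunedAtSlot;
  return stillHeld && held.blockNumber === proposalBlockNumber;
}

// A peer that stored block 1 from the slot-97 proposal still holds it when the
// slot-98 proposal for a competing block 1 arrives, so it rejects that proposal.
const held: HeldBlock = { blockNumber: 1, storedAtSlot: 97 };
console.log(rejectsProposal(held, 1, 98)); // true
```

In this model the rejection depends only on the prune landing later than the next proposal, which is why the fix here avoids ever creating a competing block 1 rather than trying to win the race.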

Fix

Replace the blind sleep(8000) with waitForP2PMeshConnectivity on the block_proposal, checkpoint_proposal, and checkpoint_attestation topics, requiring a full mesh (N-1 peers per node) so the first proposal reaches the whole committee. The first checkpoint then reaches quorum and lands — after which the chain advances to block 2 and no competing block 1 is ever built.

Also adds a minMeshPeerCount parameter to waitForP2PMeshConnectivity (default 1, preserving existing callers — the helper otherwise only requires a single mesh peer per node, which can leave some committee members unreached at first). Quorum-from-genesis tests pass N-1 for a full mesh.
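
The effect of the threshold can be sketched as follows. This is a minimal illustration and not the real `waitForP2PMeshConnectivity`: the names `MeshCounts`, `hasRequiredMesh`, `waitForMeshSketch`, the `sample` callback, and the timeout and polling values are all assumptions made for the example. Only the idea of a `minMeshPeerCount` that defaults to 1, and passing N-1 for a full mesh, comes from this PR.

```typescript
// Hypothetical stand-ins, NOT the real helper or node interface.
// Maps a gossip topic name to one node's current mesh peer count.
type MeshCounts = Record<string, number>;

// True when every node has at least minMeshPeerCount mesh peers on every topic.
function hasRequiredMesh(perNode: MeshCounts[], topics: string[], minMeshPeerCount = 1): boolean {
  return perNode.every(counts => topics.every(topic => (counts[topic] ?? 0) >= minMeshPeerCount));
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Polls the (assumed) sample callback until the mesh condition holds or the timeout expires.
async function waitForMeshSketch(
  sample: () => Promise<MeshCounts[]>,
  topics: string[],
  minMeshPeerCount = 1,
  timeoutMs = 30000,
  pollMs = 100,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    if (hasRequiredMesh(await sample(), topics, minMeshPeerCount)) {
      return;
    }
    if (Date.now() >= deadline) {
      throw new Error(`mesh not formed: need ${minMeshPeerCount} peer(s) per node on ${topics.join(', ')}`);
    }
    await sleep(pollMs);
  }
}

// Usage sketch: with N nodes, a full mesh means N - 1 mesh peers per node.
// await waitForMeshSketch(sampleCounts, topics, nodeCount - 1);
```

With the default of 1, a node that has a single mesh peer already satisfies the check, which is the situation that let the first proposal miss part of the committee. Passing N-1 makes the wait hold until every node sees every other node on each topic.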

This is the test-side fix that addresses the trigger. There is a separate, more fundamental product-robustness gap — a single missed checkpoint at the chain tip is unrecoverable because of the block_number_already_exists guard vs. the prune latency — which is consensus-sensitive and tracked separately (related to A-1218); it is intentionally not addressed here.

Testing

  • Build, format, lint clean; only the test and its helper changed.
  • Not yet run: the full e2e (gossip_network_no_cheat.test.ts, real-time-dependent, ideally under a 2-CPU constraint). The real validation is running it repeatedly and confirming the committee reaches 3/3 and the first checkpoint publishes within the gate.

Closes A-1219.

gossip_network_no_cheat did a blind `sleep(8000)` for peer discovery,
then waited for the first checkpoint. On a constrained (2-CPU) runner the
gossipsub proposal/checkpoint meshes were not fully formed 8s in, so the
first checkpoint (slot 97) reached only 2 of 3 attestations — two committee
members never received the proposal. Because that checkpoint never landed,
every later slot rebuilt a competing un-checkpointed block 1, which peers
holding an earlier (not-yet-pruned) block 1 rejected as
`block_number_already_exists`, capping every round at 2/3 and timing out the
120s "first checkpoint published" gate.

Replace the sleep with `waitForP2PMeshConnectivity` on the block-proposal,
checkpoint-proposal, and checkpoint-attestation topics, requiring a full mesh
(N-1 peers per node) so the first proposal reaches the whole committee and the
first checkpoint can reach quorum and land — after which no competing block 1
is ever built.

Add a `minMeshPeerCount` parameter to `waitForP2PMeshConnectivity` (default 1,
preserving existing callers) so quorum-from-genesis tests can require a full
mesh rather than the default single mesh peer.
@PhilWindle PhilWindle marked this pull request as ready for review June 17, 2026 10:32
@PhilWindle PhilWindle enabled auto-merge (squash) June 17, 2026 10:45
@PhilWindle PhilWindle merged commit c6488b7 into merge-train/spartan-v5 Jun 17, 2026
18 checks passed
@PhilWindle PhilWindle deleted the phil/a-1219-flaky-e2e_p2p-gossip_network_no_cheat branch June 17, 2026 10:50