fix(test): wait for full gossip mesh before committee produces (A-1219) #24149
Merged
PhilWindle merged 1 commit on Jun 17, 2026
`gossip_network_no_cheat` used a blind `sleep(8000)` for peer discovery, then waited for the first checkpoint. On a constrained (2-CPU) runner the gossipsub proposal/checkpoint meshes were not fully formed 8s in, so the first checkpoint (slot 97) reached only 2 of 3 attestations: two committee members never received the proposal. Because that checkpoint never landed, every later slot rebuilt a competing un-checkpointed block 1, which peers holding an earlier (not-yet-pruned) block 1 rejected as `block_number_already_exists`, capping every round at 2/3 and timing out the 120s "first checkpoint published" gate.

Replace the sleep with `waitForP2PMeshConnectivity` on the block-proposal, checkpoint-proposal, and checkpoint-attestation topics, requiring a full mesh (N-1 peers per node) so the first proposal reaches the whole committee and the first checkpoint can reach quorum and land, after which no competing block 1 is ever built.

Add a `minMeshPeerCount` parameter to `waitForP2PMeshConnectivity` (default 1, preserving existing callers) so quorum-from-genesis tests can require a full mesh rather than the default single mesh peer.
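The mesh-wait behaviour described above can be sketched roughly as follows. This is an illustrative sketch only, not the repository's `waitForP2PMeshConnectivity`: the `MeshNode` interface, `isMeshReady`, `waitForMesh`, and the timeout/poll defaults are all hypothetical names and values chosen for the example. It only shows the idea of polling until every node has at least `minMeshPeerCount` mesh peers on every topic, with a default of 1.

```typescript
// Hypothetical view of a node: how many gossipsub mesh peers it has on a topic.
// (Assumption for illustration; the real node API is not shown in this PR text.)
interface MeshNode {
  meshPeerCount(topic: string): number;
}

// True when every node has at least `minMeshPeerCount` mesh peers on every topic.
function isMeshReady(nodes: MeshNode[], topics: string[], minMeshPeerCount = 1): boolean {
  return nodes.every(node =>
    topics.every(topic => node.meshPeerCount(topic) >= minMeshPeerCount),
  );
}

// Polls `isMeshReady` until it holds; rejects if the (assumed) timeout elapses first.
async function waitForMesh(
  nodes: MeshNode[],
  topics: string[],
  minMeshPeerCount = 1,
  timeoutMs = 30000,
  pollMs = 200,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!isMeshReady(nodes, topics, minMeshPeerCount)) {
    if (Date.now() >= deadline) {
      throw new Error(
        `Mesh not ready: need ${minMeshPeerCount} mesh peer(s) per node on ${topics.join(', ')}`,
      );
    }
    await new Promise<void>(resolve => setTimeout(resolve, pollMs));
  }
}
```

Under these assumptions, a quorum-from-genesis test with N nodes would call something like `await waitForMesh(nodes, topics, nodes.length - 1)` to require a full mesh, while callers that omit the third argument keep the single-peer requirement.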
fcarreiro approved these changes on Jun 17, 2026
Problem
`e2e_p2p_network › should rollup txs from all peers (and add the validators without cheating)` (in `gossip_network_no_cheat.test.ts`) intermittently fails with `TimeoutError: Timeout awaiting first checkpoint published`: the chain never gets a first checkpoint onto L1 within 120s.

Log analysis
From CI run 6d6e74a70fce8826:

The test used a blind `sleep(8000)` for peer discovery, then waited for the first checkpoint. On the 2-CPU runner the gossipsub proposal/checkpoint meshes were not fully formed 8s in.

The first checkpoint (slot 97) therefore reached only 2 of 3 attestations and never landed, so every later slot rebuilt a competing un-checkpointed block 1. Un-checkpointed blocks do get pruned (`archiver:l1-sync: Pruning blocks after block 0 ...`), but the prune lands ~1.5 slots after the block is built, which is after the next proposal has already arrived. So peers still holding a not-yet-pruned block 1 rejected the new proposal with `block_number_already_exists`, never re-executed, and never attested, capping every round at 2/3 forever.

Root cause: the gossip mesh wasn't formed when the committee started producing, so the first proposal reached only a subset of the committee. That both starved the first checkpoint of quorum and split the validators onto competing block-1 forks that never re-converge.
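To make the quorum arithmetic concrete, here is a small sketch of why 2 of 3 attestations is not enough. The threshold rule `floor(2n/3) + 1` is an assumption for illustration (this PR text does not state the exact rule); it is merely consistent with the observation that 2/3 stalls and 3/3 is the target. The function names are hypothetical.

```typescript
// Assumption: a checkpoint needs floor(2n/3) + 1 attestations from a committee of n.
// The PR does not spell out the rule; this is only consistent with "2/3 stalls, 3/3 lands".
function quorumThreshold(committeeSize: number): number {
  return Math.floor((2 * committeeSize) / 3) + 1;
}

// True when the collected attestations meet the assumed threshold.
function reachesQuorum(attestations: number, committeeSize: number): boolean {
  return attestations >= quorumThreshold(committeeSize);
}

console.log(quorumThreshold(3)); // 3
console.log(reachesQuorum(2, 3)); // false: the stuck state seen in the logs
console.log(reachesQuorum(3, 3)); // true: requires the proposal to reach every member
```

With a committee this small, under the assumed rule, a single member missing the first proposal is enough to block the checkpoint, which is why the test needs a full mesh rather than one mesh peer per node.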
Fix
Replace the blind `sleep(8000)` with `waitForP2PMeshConnectivity` on the `block_proposal`, `checkpoint_proposal`, and `checkpoint_attestation` topics, requiring a full mesh (N-1 peers per node) so the first proposal reaches the whole committee. The first checkpoint then reaches quorum and lands, after which the chain advances to block 2 and no competing block 1 is ever built.

Also adds a `minMeshPeerCount` parameter to `waitForP2PMeshConnectivity` (default `1`, preserving existing callers; the helper otherwise only requires a single mesh peer per node, which can leave some committee members unreached at first). Quorum-from-genesis tests pass `N-1` for a full mesh.

This is the test-side fix that addresses the trigger. There is a separate, more fundamental product-robustness gap: a single missed checkpoint at the chain tip is unrecoverable because of the `block_number_already_exists` guard vs. the prune latency. That gap is consensus-sensitive and tracked separately (related to A-1218); it is intentionally not addressed here.

Testing
The relevant test is `gossip_network_no_cheat.test.ts` (real-time-dependent, ideally run under a 2-CPU constraint). The real validation is running it repeatedly and confirming the committee reaches 3/3 and the first checkpoint publishes within the gate.

Closes A-1219.