Commit c6488b7

fix(test): wait for full gossip mesh before committee produces (A-1219) (#24149)
## Problem

`e2e_p2p_network › should rollup txs from all peers (and add the validators without cheating)` (in `gossip_network_no_cheat.test.ts`) intermittently fails with `TimeoutError: Timeout awaiting first checkpoint published`: the chain never gets a first checkpoint onto L1 within 120s.

## Log analysis

From CI run [`6d6e74a70fce8826`](http://ci.aztec-labs.com/6d6e74a70fce8826):

- The test did a blind `sleep(8000)` for peer discovery, then waited for the first checkpoint. On the 2-CPU runner the gossipsub **proposal/checkpoint meshes were not fully formed** 8s in.
- The first checkpoint attempt (slot 97, proposer validator-3) reached only **2 of 3** attestations: validators 1 and 2 never received the slot-97 proposal at all (only validator-4 had a live gossip path). No L1 publish was attempted; the proposer aborted locally on the attestation-collection timeout.
- Because that checkpoint never landed, the L1-confirmed chain stayed at genesis, so every later slot rebuilt a *competing* un-checkpointed block 1 (new archive). The blocks **are** pruned (`archiver:l1-sync: Pruning blocks after block 0 ...`), but the prune lands ~1.5 slots after the block is built, which is later than the next proposal arrives. So peers still holding a not-yet-pruned block 1 rejected the new proposal with `block_number_already_exists`, never re-executed, and never attested, capping every round at 2/3 forever.

Root cause: the gossip mesh wasn't formed when the committee started producing, so the first proposal reached only a subset of the committee. That both starved the first checkpoint of quorum and split the validators onto competing block-1 forks that never re-converge.

## Fix

Replace the blind `sleep(8000)` with `waitForP2PMeshConnectivity` on the `block_proposal`, `checkpoint_proposal`, and `checkpoint_attestation` topics, requiring a **full mesh (N-1 peers per node)** so the first proposal reaches the whole committee. The first checkpoint then reaches quorum and lands, after which the chain advances to block 2 and no competing block 1 is ever built.

Also adds a `minMeshPeerCount` parameter to `waitForP2PMeshConnectivity` (default `1`, preserving existing callers; the helper otherwise only requires a single mesh peer per node, which can leave some committee members unreached at first). Quorum-from-genesis tests pass `N-1` for a full mesh.

This is the test-side fix that addresses the trigger. There is a separate, more fundamental product-robustness gap (a single missed checkpoint at the chain tip is unrecoverable because of the `block_number_already_exists` guard vs. the prune latency), which is consensus-sensitive and tracked separately (related to A-1218); it is intentionally **not** addressed here.

## Testing

- Build, format, lint clean; only the test and its helper changed.
- **Not yet run:** the full e2e (`gossip_network_no_cheat.test.ts`, real-time-dependent, ideally under a 2-CPU constraint). The real validation is running it repeatedly and confirming the committee reaches 3/3 and the first checkpoint publishes within the gate.

Closes A-1219.
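To make the root cause concrete, here is a minimal sketch (not Aztec code) of why "at least one mesh peer per node" does not imply that a proposal reaches the whole committee. The `Mesh` type, the node IDs, and both topologies are assumptions made up for illustration; they are not reconstructed from the CI logs. The model floods a message over mesh links only and ignores GossipSub's lazy gossip, which could repair the gap later but not necessarily within the slot.

```typescript
// Illustrative model only: node IDs and topologies are hypothetical.
// mesh.get(i) lists the peers node i has in its mesh for one topic.
type Mesh = Map<number, number[]>;

// True when every node has at least `min` mesh peers
// (the kind of per-node condition a mesh wait helper checks).
function everyNodeHasAtLeast(mesh: Mesh, min: number): boolean {
  return Array.from(mesh.values()).every(peers => peers.length >= min);
}

// Nodes that receive a message published by `proposer` when every receiver
// forwards it to its own mesh peers (breadth-first flood over mesh links).
function reachedBy(mesh: Mesh, proposer: number): Set<number> {
  const seen = new Set<number>([proposer]);
  const queue: number[] = [proposer];
  while (queue.length > 0) {
    const node = queue.shift()!;
    for (const peer of mesh.get(node) ?? []) {
      if (!seen.has(peer)) {
        seen.add(peer);
        queue.push(peer);
      }
    }
  }
  return seen;
}

// Partial mesh: every node has one mesh peer, yet the graph is split into {1, 2} and {3, 4}.
const partialMesh: Mesh = new Map<number, number[]>([
  [1, [2]],
  [2, [1]],
  [3, [4]],
  [4, [3]],
]);

// Full mesh: every node has N-1 = 3 mesh peers.
const fullMesh: Mesh = new Map<number, number[]>([
  [1, [2, 3, 4]],
  [2, [1, 3, 4]],
  [3, [1, 2, 4]],
  [4, [1, 2, 3]],
]);

// The ">= 1 peer" check passes on the partial mesh, but a proposal from node 3 reaches only 3 and 4.
console.log(everyNodeHasAtLeast(partialMesh, 1), Array.from(reachedBy(partialMesh, 3)));
// With N-1 peers per node, the same proposal reaches all four nodes.
console.log(everyNodeHasAtLeast(fullMesh, 3), Array.from(reachedBy(fullMesh, 3)));
```

In this toy topology the weaker check is satisfied while half the committee is unreachable over mesh links, which is the situation the `N-1` requirement is meant to rule out.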
1 parent 269e6d0 commit c6488b7

2 files changed

Lines changed: 26 additions & 10 deletions

yarn-project/end-to-end/src/e2e_p2p/gossip_network_no_cheat.test.ts

Lines changed: 16 additions & 3 deletions
@@ -13,7 +13,7 @@ import { sleep } from '@aztec/foundation/sleep';
 import { MockZKPassportVerifierAbi } from '@aztec/l1-artifacts/MockZKPassportVerifierAbi';
 import { RollupAbi } from '@aztec/l1-artifacts/RollupAbi';
 import type { SequencerClient } from '@aztec/sequencer-client';
-import { CheckpointAttestation, ConsensusPayload } from '@aztec/stdlib/p2p';
+import { CheckpointAttestation, ConsensusPayload, TopicType } from '@aztec/stdlib/p2p';
 import { ZkPassportProofParams } from '@aztec/stdlib/zkpassport';

 import { jest } from '@jest/globals';
@@ -201,8 +201,21 @@ describe('e2e_p2p_network', () => {
       shouldCollectMetrics(),
     );

-    // wait a bit for peers to discover each other
-    await sleep(8000);
+    // Wait for the gossipsub mesh to fully form before the committee starts producing. With
+    // skipInitialSequencer, the first blocks are built by this committee, and the first checkpoint
+    // must reach quorum (all 4 validators) to land on L1. If the proposal/checkpoint meshes are only
+    // partly formed, some committee members miss the first proposal, the first checkpoint stalls at
+    // 2/3, and every later slot rebuilds a competing un-checkpointed block 1 that peers reject as
+    // `block_number_already_exists`: a permanent 2/3 deadlock. Require a full mesh (N-1 peers per
+    // node) on the proposal/checkpoint topics so the first proposal reaches the whole committee.
+    await t.waitForP2PMeshConnectivity(
+      nodes,
+      NUM_VALIDATORS,
+      60,
+      0.5,
+      [TopicType.block_proposal, TopicType.checkpoint_proposal, TopicType.checkpoint_attestation],
+      NUM_VALIDATORS - 1,
+    );

     // Wait for the first checkpoint to be published to L1 before submitting transactions.
     // With skipInitialSequencer, no blocks exist from setup, so the first blocks are built by the
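The call above requires `NUM_VALIDATORS - 1` mesh peers per node, while the helper's own comment notes that GossipSub stops grafting once a mesh reaches Dlo peers. The following sketch shows one way to reason about when a full-mesh requirement can be expected to be met. It assumes the usual heartbeat rule (a node grafts more peers only while its mesh is below Dlo) and that grafts are accepted; `guaranteedMeshFloor`, `fullMeshIsGuaranteed`, and the example `dLo` value are hypothetical and not part of this commit or of any Aztec configuration.

```typescript
// Lower bound on the mesh size a node keeps growing towards, under the assumption that it
// grafts additional peers only while its mesh is smaller than dLo and that grafts succeed.
function guaranteedMeshFloor(nodeCount: number, dLo: number): number {
  if (!Number.isInteger(nodeCount) || nodeCount < 1) {
    throw new RangeError(`nodeCount must be a positive integer, got ${nodeCount}`);
  }
  if (!Number.isInteger(dLo) || dLo < 0) {
    throw new RangeError(`dLo must be a non-negative integer, got ${dLo}`);
  }
  // A node can have at most nodeCount - 1 peers, and only keeps grafting below dLo.
  return Math.min(nodeCount - 1, dLo);
}

// Whether waiting for a full mesh (N-1 peers per node) is guaranteed to terminate
// under the same assumptions: the last missing peer is grafted only if N-2 < dLo.
function fullMeshIsGuaranteed(nodeCount: number, dLo: number): boolean {
  return guaranteedMeshFloor(nodeCount, dLo) === nodeCount - 1;
}

// Example value only; the real Dlo depends on the node's gossipsub configuration.
const exampleDLo = 4;

console.log(fullMeshIsGuaranteed(4, exampleDLo)); // 4 nodes need 3 peers each, which is within Dlo
console.log(fullMeshIsGuaranteed(10, exampleDLo)); // 10 nodes need 9 peers each, which exceeds Dlo
```

Under these assumptions a small committee such as the one in this test can wait for `N-1`, whereas a larger network might time out on the same requirement and would need a smaller `minMeshPeerCount`.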

yarn-project/end-to-end/src/e2e_p2p/p2p_network.ts

Lines changed: 10 additions & 7 deletions
@@ -431,6 +431,7 @@ export class P2PNetworkTest {
     timeoutSeconds = 30,
     checkIntervalSeconds = 0.1,
     topics: TopicType[] = [TopicType.tx],
+    minMeshPeerCount = 1,
   ) {
     const nodeCount = expectedNodeCount ?? nodes.length;
     const minPeerCount = nodeCount - 1;
@@ -457,27 +458,29 @@ export class P2PNetworkTest {

     this.logger.warn('All nodes connected to P2P mesh');

-    // Wait for GossipSub mesh to form for all specified topics.
-    // We only require at least 1 mesh peer per node because GossipSub
-    // stops grafting once it reaches Dlo peers and won't fill the mesh to all available peers.
+    // Wait for the GossipSub mesh to form for all specified topics. By default we only require at
+    // least 1 mesh peer per node, since GossipSub stops grafting once it reaches Dlo peers and won't
+    // fill the mesh to every available peer. Callers that need a proposal to reach the whole
+    // committee within a slot (e.g. quorum-from-genesis tests) raise `minMeshPeerCount` so the mesh
+    // is fully formed; a single mesh peer can leave some committee members unreached at first.
     for (const topic of topics) {
-      this.logger.warn(`Waiting for GossipSub mesh to form for ${topic} topic...`);
+      this.logger.warn(`Waiting for GossipSub mesh (>= ${minMeshPeerCount} peers per node) for ${topic} topic...`);
       await Promise.all(
         nodes.map(async (node, index) => {
           const p2p = node.getP2P();
           await retryUntil(
             async () => {
               const meshPeers = await p2p.getGossipMeshPeerCount(topic);
               this.logger.debug(`Node ${index} has ${meshPeers} gossip mesh peers for ${topic} topic`);
-              return meshPeers >= 1 ? true : undefined;
+              return meshPeers >= minMeshPeerCount ? true : undefined;
             },
-            `Node ${index} to have gossip mesh peers for ${topic} topic`,
+            `Node ${index} to have >= ${minMeshPeerCount} gossip mesh peers for ${topic} topic`,
             timeoutSeconds,
             checkIntervalSeconds,
           );
         }),
       );
-      this.logger.warn(`All nodes have gossip mesh peers for ${topic} topic`);
+      this.logger.warn(`All nodes have >= ${minMeshPeerCount} gossip mesh peers for ${topic} topic`);
     }
   }
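For readers without the repository at hand, the polling logic in this hunk can be approximated by the self-contained sketch below. It is not the project's implementation: `MeshPeerSource`, `waitForMeshPeers`, `sleepMs`, and `fakeNode` are hypothetical names, the hand-rolled loop stands in for the `retryUntil` helper used in the diff, and only the `getGossipMeshPeerCount(topic)` method name is taken from the code above.

```typescript
// Hypothetical minimal interface; in the diff the equivalent object comes from node.getP2P().
interface MeshPeerSource {
  getGossipMeshPeerCount(topic: string): Promise<number>;
}

const sleepMs = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Polls every node in parallel until each reports at least `minMeshPeerCount` mesh peers
// for `topic`, and rejects if any node is still below the threshold once `timeoutMs` elapses.
async function waitForMeshPeers(
  nodes: MeshPeerSource[],
  topic: string,
  minMeshPeerCount = 1,
  timeoutMs = 30_000,
  intervalMs = 100,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  await Promise.all(
    nodes.map(async (node, index) => {
      for (;;) {
        const meshPeers = await node.getGossipMeshPeerCount(topic);
        if (meshPeers >= minMeshPeerCount) {
          return;
        }
        if (Date.now() >= deadline) {
          throw new Error(
            `Timeout: node ${index} has ${meshPeers} mesh peers for ${topic}, needs >= ${minMeshPeerCount}`,
          );
        }
        await sleepMs(intervalMs);
      }
    }),
  );
}

// Fake node for demonstration: its mesh grows by one peer per poll, capped at `max`.
function fakeNode(max: number): MeshPeerSource {
  let count = 0;
  return {
    getGossipMeshPeerCount: async () => {
      count = Math.min(count + 1, max);
      return count;
    },
  };
}
```

A quorum-from-genesis style caller would pass the full-mesh threshold, for example `await waitForMeshPeers(nodes, 'block_proposal', nodes.length - 1)`, while callers that only need basic connectivity keep the default of `1`.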
