
Commit 12b543c

AIQnetLab and claude committed
fix: v19+v20 — authenticated P2P + BFT-certified rotation + scalable PK registry
v19 — architectural BFT hardening (closes h=90-104 stall on 5-node testnet):

* Authenticated QUIC handshake (Phase 2.A advisory)
  Added optional Dilithium3 proof to NodeHandshake binding (node_id, ts, block_height). Three-way deserialize ladder preserves backward compat with v9.7..v18 peers. Receiver verifies via CONSENSUS_PK_REGISTRY using the same path as consensus messages. Bogus proof drops the connection; legacy peers admitted with [WARN][HANDSHAKE] no_dilithium_proof for audit. Closes the spoofer-on-198.36.48.234 admittance vector.

* Validator-count source authentication
  get_active_validator_count() in genesis epoch now reads from consensus_pk_registry_len() (cryptographically bound identities) instead of count_unique_live_peers + 1 (TLS-only admittance). Spoofers without Dilithium SK no longer inflate 2f+1 thresholds. WARN fallback when registry not yet populated.

* BFT-certified rotation round (eliminates atomic race)
  Producer selection at node.rs:19416 reads p2p.get_highest_certified_round directly from the 2f+1 BFT-certified DashMap instead of CURRENT_TIMEOUT_ROUND atomic. The atomic was reset on every tip-advance, occasionally yielding 0 mid-tick while the network was at non-zero round. Direct read is monotonic, deterministic across honest validators. Atomic kept for telemetry only.

* Range-sync for big gaps
  block_pipeline orphan-parent path now batches sync_blocks(from, to) when child_h - local_tip > 5, replacing N×30s single-flight cascade with one parallel top-3-peers fetch. Cuts gap-recovery from 420s to ~3s for 14-block gaps. Single-flight dedup with 60s TTL.

* Misc cleanup
  LightNodeRotation marked deprecated (light nodes are pure-API mobile wallets, not Helios-style light clients). Hardware-flaky test_high_tps_generation marked #[ignore] alongside benchmark::tests::*.
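The range-sync bullet above describes a threshold decision, which can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in block_pipeline.rs: `SyncPlan`, `plan_gap_recovery` and `RANGE_SYNC_THRESHOLD` are hypothetical names, and the choice to fetch only the missing parents (excluding the child that triggered the orphan path) is also an assumption. Only the threshold of 5 and the `sync_blocks(from, to)` call shape come from the commit message.

```rust
/// Gap size above which the orphan-parent path switches to one batched fetch.
/// The value 5 is from the commit message; the constant name is hypothetical.
const RANGE_SYNC_THRESHOLD: u64 = 5;

/// Hypothetical description of what the pipeline would request next.
#[derive(Debug, PartialEq, Eq)]
enum SyncPlan {
    /// No parent is missing between the local tip and the child block.
    UpToDate,
    /// Small gap: fetch the missing parents one at a time.
    SingleFlight { heights: Vec<u64> },
    /// Large gap: one batched request, in the spirit of `sync_blocks(from, to)`.
    Range { from: u64, to: u64 },
}

/// Decide how to recover the blocks between `local_tip` and an orphan child
/// at height `child_h`. Assumption: the child itself is already in hand, so
/// only heights `local_tip + 1 ..= child_h - 1` are requested.
fn plan_gap_recovery(local_tip: u64, child_h: u64) -> SyncPlan {
    // saturating_sub avoids underflow when the child is at or below the tip.
    let gap = child_h.saturating_sub(local_tip);
    if gap <= 1 {
        SyncPlan::UpToDate
    } else if gap > RANGE_SYNC_THRESHOLD {
        SyncPlan::Range { from: local_tip + 1, to: child_h - 1 }
    } else {
        SyncPlan::SingleFlight { heights: (local_tip + 1..child_h).collect() }
    }
}
```

With these assumptions, a child at height 105 arriving on a node whose tip is 90 would produce a single `Range { from: 91, to: 104 }` request rather than fourteen sequential fetches.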
v20 — PK Registry scaling (supports up to 1M active super-nodes):

* Default cap raised 50K -> 100K (~210 MB at full load, ~5-7 years runway at projected mainnet growth Y1-Y5)
* PkEntry struct with pinned flag (genesis anchors NEVER evicted)
* Lock-free LAST_ACTIVITY DashMap for O(1) activity tracking on every successful PK lookup
* In-line single-shot eviction on cap-full (defence in depth)
* Background periodic sweep evict_idle_consensus_pks() wired into hourly cleanup in rpc.rs (default 30-day idle threshold)
* deactivate_consensus_pk() for explicit unregistration (refuses pinned)
* Env runtime overrides QNET_PK_REGISTRY_CAP (clamped to 1M hard bound) and QNET_PK_REGISTRY_IDLE_DAYS for operator tunability

Tests: 21 new regression tests (8 v20 PK registry + 6 v19 range-sync + 7 v19 handshake helpers). Total 212 passed across qnet-consensus (70) and qnet-integration (142, 12 ignored hardware bench).

Build: cargo build --release clean in 17m26s, 0 warnings. qnet-node.exe binary 22.3 MB optimized.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
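The v20 registry rules (pinned entries, eviction when the cap is full, idle sweep, clamped cap override) can be illustrated with a small single-threaded sketch. It is not the code in consensus_crypto.rs, whose diff is not rendered on this page. Assumptions: a plain `HashMap` stands in for the lock-free DashMap, the last-activity timestamp is folded into the entry instead of a separate `LAST_ACTIVITY` map, the eviction victim is the least recently active unpinned entry, and every name other than `PkEntry` and the `pinned` flag is hypothetical.

```rust
use std::collections::HashMap;

const DEFAULT_CAP: usize = 100_000; // default cap from the commit message
const HARD_CAP: usize = 1_000_000; // hard upper bound from the commit message

/// Hypothetical parser for a cap override such as QNET_PK_REGISTRY_CAP.
/// Unparseable or missing values fall back to the default (an assumption).
fn effective_cap(env_value: Option<&str>) -> usize {
    env_value
        .and_then(|v| v.trim().parse::<usize>().ok())
        .map(|v| v.min(HARD_CAP))
        .unwrap_or(DEFAULT_CAP)
}

struct PkEntry {
    pk: Vec<u8>,
    pinned: bool,       // genesis anchors: never evicted
    last_activity: u64, // seconds; refreshed on every successful lookup
}

struct PkRegistry {
    cap: usize,
    entries: HashMap<String, PkEntry>,
}

impl PkRegistry {
    fn new(cap: usize) -> Self {
        PkRegistry { cap, entries: HashMap::new() }
    }

    /// Insert a key; if the registry is full, evict one unpinned entry first.
    /// Returns false when the registry is full of pinned entries.
    fn register(&mut self, node_id: &str, pk: Vec<u8>, pinned: bool, now: u64) -> bool {
        if !self.entries.contains_key(node_id) && self.entries.len() >= self.cap {
            if !self.evict_one() {
                return false;
            }
        }
        self.entries.insert(node_id.to_string(), PkEntry { pk, pinned, last_activity: now });
        true
    }

    /// Look up a key and refresh its activity timestamp.
    fn lookup(&mut self, node_id: &str, now: u64) -> Option<&[u8]> {
        let entry = self.entries.get_mut(node_id)?;
        entry.last_activity = now;
        Some(entry.pk.as_slice())
    }

    /// Single-shot eviction: drop the least recently active unpinned entry.
    fn evict_one(&mut self) -> bool {
        let victim = self
            .entries
            .iter()
            .filter(|(_, e)| !e.pinned)
            .min_by_key(|(_, e)| e.last_activity)
            .map(|(id, _)| id.clone());
        match victim {
            Some(id) => {
                self.entries.remove(&id);
                true
            }
            None => false,
        }
    }

    /// Periodic sweep: drop unpinned entries idle for `idle_secs` or longer.
    fn evict_idle(&mut self, now: u64, idle_secs: u64) -> usize {
        let before = self.entries.len();
        self.entries
            .retain(|_, e| e.pinned || now.saturating_sub(e.last_activity) < idle_secs);
        before - self.entries.len()
    }

    /// Explicit unregistration; refuses pinned entries.
    fn deactivate(&mut self, node_id: &str) -> bool {
        let pinned = match self.entries.get(node_id) {
            Some(e) => e.pinned,
            None => return false,
        };
        if pinned {
            return false;
        }
        self.entries.remove(node_id);
        true
    }
}
```

The sketch only clamps the upper bound of the cap, as the commit message states; whether the real override also enforces a lower bound is not visible from this page.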
1 parent a705718 commit 12b543c

8 files changed

Lines changed: 1599 additions & 98 deletions


core/qnet-consensus/src/consensus_crypto.rs

Lines changed: 612 additions & 29 deletions
Large diffs are not rendered by default.

development/qnet-integration/src/block_pipeline.rs

Lines changed: 340 additions & 1 deletion
Large diffs are not rendered by default.

development/qnet-integration/src/node.rs

Lines changed: 69 additions & 21 deletions
@@ -590,31 +590,43 @@ pub static LAST_SYNC_PROGRESS_TIME: AtomicU64 = AtomicU64::new(0);
 // slot starts fresh.
 // ═══════════════════════════════════════════════════════════════════════════
 
-/// Current rotation round (certified.max(adopted) from the latest stall
-/// evaluation). Read by producer selection and block construction.
+/// v19: TELEMETRY-ONLY snapshot of the rotation round most recently observed
+/// by the stall-detection loop. Once read by producer selection / block
+/// construction; that consensus path now reads `get_highest_certified_round`
+/// directly to avoid the per-tip-advance reset race that briefly returned 0
+/// while the network was still at a non-zero BFT-certified round.
+///
+/// Kept for: status RPC (`current_timeout_round`), debug logs, telemetry
+/// metrics. Setting / resetting it is side-effect-free with respect to
+/// consensus correctness.
 static CURRENT_TIMEOUT_ROUND: AtomicU64 = AtomicU64::new(0);
 
 /// Height for which `CURRENT_TIMEOUT_ROUND` was last set. Informational;
 /// `reset_timeout_round` clears the round value on every tip advance.
 #[allow(dead_code)]
 static TIMEOUT_ROUND_HEIGHT: AtomicU64 = AtomicU64::new(0);
 
-/// Read the current BFT rotation round (certified.max(adopted)) for producer
-/// selection and block construction.
+/// Read the most recently observed BFT rotation round.
+/// v19: TELEMETRY-ONLY. Consensus paths (producer selection, block
+/// construction) read `get_highest_certified_round(mb_index)` directly
+/// from `SimplifiedP2P` instead — that is the same DashMap the stall
+/// detector itself reads, with no oscillation under tip-advance races.
 pub fn get_current_timeout_round() -> u64 {
     CURRENT_TIMEOUT_ROUND.load(Ordering::SeqCst)
 }
 
-/// Set the current rotation round. Called by the stall-detection loop after
-/// re-reading `certified_timeout_round` and `adopted_timeout_round` from
-/// `unified_p2p`. The value is what producer selection will use this tick.
+/// Update the telemetry snapshot of the current rotation round. Called by
+/// the stall-detection loop after re-reading `certified_timeout_round`.
+/// v19: stored value is no longer authoritative for producer selection.
 pub fn set_timeout_round(round: u64, height: u64) {
     CURRENT_TIMEOUT_ROUND.store(round, Ordering::SeqCst);
     TIMEOUT_ROUND_HEIGHT.store(height, Ordering::SeqCst);
 }
 
-/// Reset rotation round to 0. Called on tip advance — a fresh block
-/// arrived, so the previous slot's failover round no longer applies.
+/// Clear the telemetry snapshot on tip advance.
+/// v19: this no longer affects consensus — producer selection reads the
+/// 2f+1 BFT-certified round directly. The reset is preserved so that
+/// status / debug logs do not stale-display a previous slot's round.
 pub fn reset_timeout_round() {
     CURRENT_TIMEOUT_ROUND.store(0, Ordering::SeqCst);
 }
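To make the reset race described in these doc comments concrete, the following sketch replays one problematic ordering deterministically: the stall loop publishes a round, a tip advance clears it, and only then does the reader sample it. Everything here is illustrative and assumed: `SNAPSHOT_ROUND`, `CertifiedRounds`, `record_certificate` and `replay_interleaving` are hypothetical names, a `Mutex<HashMap>` stands in for the DashMap, and keeping the maximum on every update is one possible way to obtain the monotonic behaviour the commit describes.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Telemetry-style snapshot: overwritten each tick, cleared on tip advance.
static SNAPSHOT_ROUND: AtomicU64 = AtomicU64::new(0);

/// Stand-in for the certified-round map keyed by macroblock index.
struct CertifiedRounds {
    by_mb_index: Mutex<HashMap<u64, u64>>,
}

impl CertifiedRounds {
    fn new() -> Self {
        CertifiedRounds { by_mb_index: Mutex::new(HashMap::new()) }
    }

    /// Record a certified round; the stored value never decreases.
    fn record_certificate(&self, mb_index: u64, round: u64) {
        let mut map = self.by_mb_index.lock().unwrap();
        let slot = map.entry(mb_index).or_insert(0);
        *slot = (*slot).max(round);
    }

    /// Highest certified round for this macroblock index, or 0 if none.
    fn get_highest_certified_round(&self, mb_index: u64) -> u64 {
        self.by_mb_index
            .lock()
            .unwrap()
            .get(&mb_index)
            .copied()
            .unwrap_or(0)
    }
}

/// Replay the ordering "publish, reset, read" and return what each source
/// reports to producer selection: (snapshot value, certified value).
fn replay_interleaving(certified: &CertifiedRounds, mb_index: u64, round: u64) -> (u64, u64) {
    certified.record_certificate(mb_index, round);
    SNAPSHOT_ROUND.store(round, Ordering::SeqCst); // stall-detection tick
    SNAPSHOT_ROUND.store(0, Ordering::SeqCst); // tip advance resets the snapshot
    (
        SNAPSHOT_ROUND.load(Ordering::SeqCst),
        certified.get_highest_certified_round(mb_index),
    )
}
```

In this ordering the snapshot reports 0 while the certified map still reports the agreed round, which is the divergence the v19 change removes from the consensus path.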
@@ -19403,17 +19415,47 @@ impl BlockchainNode {
     }
 }
 
-// v14.8.10: Get timeout_round for deterministic failover from the
-// stored BFT-agreed value. `set_timeout_round()` is called in the
-// stall-detection path above every tick with
-// `certified.max(adopted)` — both operands aggregate only
-// Dilithium3-verified signed votes, so every validator reads the
-// same value once gossip has reached them. On tip advance the
-// value is cleared via `reset_timeout_round()` so a fresh block
-// starts at round 0. Under steady-state (no stall) this is just
-// a cheap atomic load that returns 0 → the primary producer
-// stays selected.
-let timeout_round: u64 = get_current_timeout_round();
+// ═══════════════════════════════════════════════════════════════
+// v19: BFT-CERTIFIED ROTATION ROUND — DIRECT READ
+// ═══════════════════════════════════════════════════════════════
+// Producer selection reads the rotation round DIRECTLY from the
+// 2f+1 BFT-certified state for this height's macroblock index
+// instead of via the globally mutable `CURRENT_TIMEOUT_ROUND`
+// atomic.
+//
+// Why: the legacy atomic was set by the stall-detection loop
+// every tick AND reset to 0 on every tip-advance. Under load,
+// these two writes interleaved with the producer-selection
+// read in this loop, occasionally yielding 0 when the actual
+// BFT-agreed round was non-zero (or vice versa). When local
+// and remote nodes computed different rounds for the same
+// height, they selected different producers — the validator
+// produced an empty-slot attestation while the producer
+// signed a microblock, and the network stalled until gossip
+// reconciled (~30 s × N blocks).
+//
+// The new path queries `get_highest_certified_round(mb_index)`,
+// which is:
+// * a lock-free DashMap read keyed by macroblock index
+// * monotonic (advances only on 2f+1 Dilithium3-verified
+// TimeoutVote certificates) — never resets per-block
+// * deterministic across honest validators once vote
+// gossip propagates (≤ 1 s in steady state)
+//
+// The legacy `CURRENT_TIMEOUT_ROUND` atomic stays in place
+// for telemetry / debug logging and is still updated by the
+// stall-detection loop, but is no longer on the consensus
+// path.
+//
+// Scalability: O(1) DashMap lookup per slot. Identical cost
+// at 5 or 5000 super-nodes.
+// ═══════════════════════════════════════════════════════════════
+let timeout_round: u64 = if let Some(ref p2p) = unified_p2p {
+    let mb_index = next_block_height / 90;
+    p2p.get_highest_certified_round(mb_index)
+} else {
+    0
+};
 
 // v3.8: Use select_microblock_producer_with_round for deterministic failover.
 // Value is read-only after this point (validated via is_my_turn_to_produce,
@@ -19428,7 +19470,13 @@ impl BlockchainNode {
     timeout_round // CRITICAL: Pass timeout_round for deterministic failover!
 ).await;
 
-// v4.3: Cache expected producer for incoming block validation
+// v4.3 / v19: Cache expected producer for incoming block validation.
+// The cached round is the 2f+1 BFT-certified value (read above as
+// `timeout_round`), so all honest validators populate identical
+// entries once vote gossip has propagated. Block-pipeline ingest
+// compares the cached pair against the incoming block's
+// `(producer, timeout_round)` to detect Category B authority
+// violations (same round, wrong signer → HARD reject).
 cache_expected_producer(next_block_height, &current_producer, timeout_round);
 
 let mut is_my_turn_to_produce = current_producer == node_id;
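The cached-producer comparison described in the last hunk might look roughly like the sketch below. It is a guess at the shape, not the block-pipeline code: `ExpectedProducers`, `check_incoming`, `Verdict` and `macroblock_index` are hypothetical names, and the handling of a round mismatch or a missing cache entry is not specified in the diff, so those cases are reported as separate outcomes instead of being decided here. The divisor 90 is taken from `next_block_height / 90` in the diff.

```rust
use std::collections::HashMap;

/// Divisor used in the diff to derive `mb_index` from a block height.
const HEIGHTS_PER_MB_INDEX: u64 = 90;

/// Macroblock index for a height, mirroring `next_block_height / 90`.
fn macroblock_index(height: u64) -> u64 {
    height / HEIGHTS_PER_MB_INDEX
}

/// Hypothetical outcome of comparing an incoming block with the cache.
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    /// Same round and same signer as the cached expectation.
    Accept,
    /// Same round but a different signer: the hard-reject case in the diff.
    HardReject,
    /// Rounds differ; the diff does not say how this is handled.
    RoundMismatch,
    /// Nothing cached for this height yet.
    NotCached,
}

/// Hypothetical cache of (producer, round) per height.
struct ExpectedProducers {
    by_height: HashMap<u64, (String, u64)>,
}

impl ExpectedProducers {
    fn new() -> Self {
        ExpectedProducers { by_height: HashMap::new() }
    }

    /// Store the locally computed producer and round for a height.
    fn cache_expected_producer(&mut self, height: u64, producer: &str, round: u64) {
        self.by_height.insert(height, (producer.to_string(), round));
    }

    /// Compare an incoming block's (producer, round) with the cached pair.
    fn check_incoming(&self, height: u64, producer: &str, round: u64) -> Verdict {
        match self.by_height.get(&height) {
            None => Verdict::NotCached,
            Some((_, cached_round)) if *cached_round != round => Verdict::RoundMismatch,
            Some((expected, _)) if expected.as_str() == producer => Verdict::Accept,
            Some(_) => Verdict::HardReject,
        }
    }
}
```

Under these assumptions, heights 90 through 104 all map to index 1, so they would read the same certified-round entry.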
