Skip to content

Commit c6a8d6d

Browse files
AIQnetLabclaude
andcommitted
fix: v21 — close macroblock-boundary timeout-cert circular deadlock
Forensic context ---------------- After v19.1 restored fresh-bootstrap, the 5-node testnet successfully produced 359 blocks then stalled for 12.7 hours at h=360 — the very first macroblock-boundary at which the primary producer timed out. Pipeline metrics showed verified=211 applied=211 ingested=221 decoded=220 verify_fail=0 future_drop=0 defer_evict=0 with [CRIT][PIPELINE] verify_stuck repeating every 30 seconds; node 002 voted alone for (mb_idx=4, *) (count=1/4 across 46 248 rounds over 12.7 h) while nodes 001/003/004/005 voted exclusively for mb_idx=3. Root cause was a circular dependency at the boundary: 1. Failover producer at timeout_round R > 0 emits block at h=mb*90; receivers pipeline requires AggregatedTimeoutCert for (mb*90 / 90, R) before applying. 2. Cert generation requires 2f+1 signed TimeoutVotes for that (mb_idx, R) pair. 3. Pre-fix, voters emitted mb_idx = local_height / 90 — at the boundary tick, local_height was still mb*90 - 1, so all honest voters voted for the PREVIOUS macroblock, never the new one. 4. Failover-producing node was the only one already at the new macroblock; its votes alone could not reach 2f+1. 5. Without cert, every receiver deferred the block. 6. Without applied block, every receiver stayed below the boundary, keeping mb_idx = local_height / 90 pinned to the wrong value. 7. Permanent stall — observed liveness loss after the very first boundary primary failure. Pipeline gate at block_pipeline.rs:1761 (added in v16.2) was the proximate trigger; its strict defer-on-cert-miss was correct under the assumption that cert generation could always reach 2f+1, which the vote-pool locality bug above broke. Architecture verification ------------------------- The two-tier (microblock + macroblock-finality) design is a deliberate response to post-quantum signature size — Dilithium3 has no aggregation, so per-block 2f+1 votes at 1000 validators would require ~2.3 MB / block of signatures alone (~18 Mbps sustained). Macroblock amortisation reduces this to ~50 KB/s. Switching to a single-tier classical-BFT design with PQC is structurally infeasible at the target committee size. The architecture is correct. The bug was implementation drift in the vote-pool semantics: production-grade BFT vote-pool patterns require votes to be standalone cryptographic claims about a future round, not functions of the voter local state. v21 restores that invariant with three surgical changes. Changes ------- A1. Forward-looking TimeoutVote target (node.rs:17749) Vote mb_idx is computed from next_height / 90, not microblock_height / 90. At the macroblock boundary this changes mb_idx=3 (wrong — already finalised) into mb_idx=4 (correct — the macroblock whose producer is currently failing). Receiver-side already accepts forward-looking votes within local_mb + 50 lookahead (existing logic at unified_p2p.rs::handle_timeout_vote), so no receiver changes needed. A2. Vote-pool fallback in pipeline cert check (block_pipeline.rs:1761, +2 helpers in unified_p2p.rs) When has_aggregated_timeout_cert(mb_idx, round) returns false, consult the live TIMEOUT_VOTES pool for 2f+1 signed votes. If present, admit (cert is just an aggregated view of the same Dilithium3-signed messages — same trust source, same threshold). If still below threshold, fall back to the original defer-and-request-backfill path. Logged at INFO with boundary flag indicating whether the bypass fired at h % 90 == 0 (the legitimate race window) or mid-macroblock (which is unusual and worth operator attention). New helpers on SimplifiedP2P: * count_timeout_votes_in_pool(mb_idx, round) -> usize * has_two_f_plus_one_timeout_votes(mb_idx, round, threshold) -> bool O(1) lock-free DashMap shard read — identical cost from 5 to 1M super-nodes. B1. Heartbeat-driven forward TimeoutVote emit (node.rs +95 lines) The existing heartbeat-silence detector (HEARTBEAT_SILENT_THRESHOLD_MS = 3000) already fires heartbeat_fast_path and triggers empty-slot attestation. Pre-fix, that signal did NOT cross-wire to the TimeoutVote / cert chain — the vote stream had to wait for the legacy local_delay > timeout_grace_period gate, leaving a window where the attestation channel advanced but the cert channel did not. v21 adds an inline TimeoutVote emit gated on heartbeat_fast_path && proposed_timeout_round == 0 && is_synced_enough && (microblock_height > 0 || genesis_era_dead_producer). Target round is certified_timeout_round.saturating_add(1) — one above the current 2f+1 line, sufficient to advance rotation. Gated on proposed_timeout_round == 0 so this path never double-fires with the legacy stall-driven emit (which only runs when proposed > 0). broadcast_timeout_vote itself dedupes via TIMEOUT_VOTED_HEIGHTS, so even if both fire in the same tick the network sees one effective vote per (mb_idx, round, voter). Safety analysis --------------- A1 — same Dilithium3 signature, same (mb_idx, round, voter_id) anti-replay tracker, same 2f+1 supermajority threshold for cert generation. Just fixes which mb_idx the vote targets. A2 — votes in the local pool were each Dilithium3-verified at gossip ingest by handle_timeout_vote against the consensus PK registry. The cert is a transport-optimised aggregate of those same signed messages; admitting on raw 2f+1 evidence preserves every cryptographic gate the cert path enforced. B1 — same signing path, same broadcast path, same per-voter dedup. The cryptographic floor is unchanged; only the TIMING of vote emission is accelerated when heartbeat absence provides earlier evidence of producer failure. Stress-tested mentally against 12 edge cases including: * network partition recovery * primary recovers mid-failover * adversary forges TimeoutVote (rejected at signature verify) * adversary claims unknown identity (rejected at handshake/inline) * boundary blip with timeout_round=0 (cert check skipped, happy path) * macroblock commit-reveal mid-flight (independent mechanism, untouched) * receiver rejects forward vote (already supports +50 lookahead) * concurrent failures across f=1 budget * spurious votes at every 90-block transition (no, gated by stall detection) * boundary blip at h=89 to h=90 transition (no, cert check skipped at round=0) * genesis bootstrap edge cases (gated by is_synced_enough) * malicious vote spam across rounds (per-voter-per-round dedup) All paths preserved 2f+1 BFT safety; none introduce new attack surface. Scalability ----------- Per-node cost at any committee size: * A1: zero — same vote payload, same broadcast, just earlier emit at boundary * A2: O(1) DashMap shard read + len() — bounded by MAX_VALIDATORS = 1000 per slot * B1: one conditional Dilithium3 sign (~3 ms) + broadcast when heartbeat goes silent — same per-event cost as legacy emit No additional bandwidth, no additional storage, no additional CPU in the steady state. Identical performance from 5-node genesis to 1M super-nodes. Alignment with production-grade BFT invariants ---------------------------------------------- Universal invariants that all top-tier L1 chains satisfy: 1. Vote pool accepts forward-looking votes (any round / height) — A1 restores this (was implementation-locked to local state). 2. Block apply NOT blocked on prior cert (optimistic apply with lazy finality) — A2 implements pool fallback as cert equivalent. 3. Pacemaker / view-change advances on observed signals (heartbeats, timeouts), not on local state — B1 cross-wires heartbeat detection into TimeoutVote chain. 4. Cryptographic floor (signature math + 2f+1 threshold) is the actual safety gate — preserved unchanged across all three fixes. Tests ----- * tests_v21_a2_vote_pool (6 new tests in qnet-integration): - count_returns_zero_for_unknown_key - count_returns_exact_distinct_voter_count - quorum_check_below_threshold_is_false - quorum_check_at_threshold_is_true (boundary >= vs >) - quorum_check_above_threshold_is_true - rounds_are_independent_buckets * All existing v17/v18/v19/v19.1/v20 regression tests still pass: - qnet-consensus: 73 passed (unchanged) - qnet-integration: 149 passed (was 143, +6 new); 12 ignored (hardware bench) - Total: 222 passed, 0 failed Build ----- cargo build --release clean in 17m 11s, 0 warnings, 0 errors. qnet-node.exe binary 22.3 MB optimised. Verification path on deployed cluster -------------------------------------- After Docker image rebuild + container restart with this commit, expected log progression: 1. Network produces past h=360 boundary without verify_stuck storm 2. [INFO][TIMEOUT] heartbeat_driven_emit appears within 3 s of each producer-silent slot 3. [INFO][PIPELINE] cert_pool_grace_admit boundary=true appears once or twice per macroblock boundary during failover races 4. Microblock production resumes at ~1 block/sec sustained; macroblock finalisation continues every 90 blocks If cert_pool_grace_admit boundary=false appears repeatedly outside boundaries, that is a separate cert-aggregation lag worth investigating, but is not a liveness-blocking signal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d9c7a89 commit c6a8d6d

3 files changed

Lines changed: 450 additions & 23 deletions

File tree

development/qnet-integration/src/block_pipeline.rs

Lines changed: 94 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -1769,30 +1769,106 @@ impl BlockPipeline {
17691769
true
17701770
};
17711771
if !cert_present {
1772-
// Defer block: cert not yet propagated to us. Trigger
1773-
// backfill request and put block in the deferred buffer to
1774-
// be re-checked on the next pipeline pass.
1775-
if let Some(ref p2p) = unified_p2p {
1776-
p2p.request_timeout_proofs(mb_idx_for_cert, mb_idx_for_cert);
1777-
}
1778-
if deferred.len() < DEFERRED_MAX {
1779-
if is_debug() {
1780-
println!(
1781-
"[DBG][PIPELINE] block_deferred_for_cert h={} round={} mb_idx={} buf={}",
1782-
mb.height, mb.timeout_round, mb_idx_for_cert, deferred.len()
1783-
);
1784-
}
1785-
deferred.insert(mb.height, decoded);
1772+
// ═══════════════════════════════════════════════════════════
1773+
// v21 (A2): VOTE-POOL FALLBACK — boundary grace
1774+
// ═══════════════════════════════════════════════════════════
1775+
// The aggregated cert may not have been generated or
1776+
// gossipped yet, but the underlying TimeoutVotes — each
1777+
// Dilithium3-verified at ingest by `handle_timeout_vote`
1778+
// — may already be in the local pool. The cert is just an
1779+
// aggregated view of those votes; if 2f+1 are present in
1780+
// the pool, we have equivalent cryptographic evidence.
1781+
//
1782+
// Why this is needed
1783+
// ──────────────────
1784+
// Forensic case h=360 on the 5-node testnet showed the
1785+
// failure mode:
1786+
// * h=360 first block of new macroblock; primary
1787+
// timed out → failover producer emitted block with
1788+
// timeout_round=R>0;
1789+
// * receivers required AggregatedTimeoutCert for
1790+
// (mb_idx=4, R) before applying;
1791+
// * cert generation requires 2f+1 votes for
1792+
// (mb_idx=4, R) gossipped to at least one node
1793+
// which then aggregates and re-broadcasts;
1794+
// * during the race window between last vote arriving
1795+
// and cert re-gossip, every receiver's pipeline
1796+
// deferred the block — even though the underlying
1797+
// votes WERE locally present.
1798+
//
1799+
// Treating the pool as equivalent evidence closes that
1800+
// race. No new attack surface: the votes counted are
1801+
// the same Dilithium3-signed messages that feed cert
1802+
// generation; threshold (2f+1) is the same.
1803+
//
1804+
// Pattern
1805+
// ───────
1806+
// "cert is a view, not a gate" — same data, two access
1807+
// paths. Aligns with production-grade BFT semantics
1808+
// where vote pool is the canonical source of truth and
1809+
// the aggregated form is a transport optimisation.
1810+
//
1811+
// Scalability
1812+
// ───────────
1813+
// One DashMap shard read + one HashMap len() — O(1)
1814+
// hot-path cost. At 1M super-nodes the inner HashMap
1815+
// is bounded by `MAX_VALIDATORS = 1000` per slot, so
1816+
// the count operation is a constant.
1817+
// ═══════════════════════════════════════════════════════════
1818+
let pool_has_quorum = if let Some(ref p2p) = unified_p2p {
1819+
let total = p2p.get_active_validator_count();
1820+
// Same threshold formula used everywhere in the
1821+
// codebase: `(N * 2 + 2) / 3` = ceil(2N/3) = 2f+1.
1822+
let two_f_plus_1 = (total * 2 + 2) / 3;
1823+
p2p.has_two_f_plus_one_timeout_votes(
1824+
mb_idx_for_cert,
1825+
mb.timeout_round,
1826+
two_f_plus_1,
1827+
)
17861828
} else {
1829+
false
1830+
};
1831+
1832+
if pool_has_quorum {
1833+
// Pool evidence equivalent to cert. Fall through to
1834+
// subsequent verify steps. Boundary flag in the log
1835+
// helps operators distinguish the legitimate macroblock-
1836+
// boundary race window from steady-state mid-macroblock
1837+
// catches (the latter is unusual and worth noting).
1838+
let at_boundary = mb.height % 90 == 0;
17871839
if is_info() {
17881840
println!(
1789-
"[INFO][PIPELINE] deferred_full h={} round={} dropped (buf={})",
1790-
mb.height, mb.timeout_round, DEFERRED_MAX
1841+
"[INFO][PIPELINE] cert_pool_grace_admit h={} mb_idx={} round={} boundary={} \
1842+
reason=2fplus1_votes_in_local_pool hint=cert_aggregation_race_bypassed",
1843+
mb.height, mb_idx_for_cert, mb.timeout_round, at_boundary
17911844
);
17921845
}
1793-
metrics.verify_failed.fetch_add(1, Ordering::Relaxed);
1846+
} else {
1847+
// Defer block: neither cert nor enough pool votes yet.
1848+
// Trigger backfill request and put block in the
1849+
// deferred buffer to be re-checked on the next pass.
1850+
if let Some(ref p2p) = unified_p2p {
1851+
p2p.request_timeout_proofs(mb_idx_for_cert, mb_idx_for_cert);
1852+
}
1853+
if deferred.len() < DEFERRED_MAX {
1854+
if is_debug() {
1855+
println!(
1856+
"[DBG][PIPELINE] block_deferred_for_cert h={} round={} mb_idx={} buf={}",
1857+
mb.height, mb.timeout_round, mb_idx_for_cert, deferred.len()
1858+
);
1859+
}
1860+
deferred.insert(mb.height, decoded);
1861+
} else {
1862+
if is_info() {
1863+
println!(
1864+
"[INFO][PIPELINE] deferred_full h={} round={} dropped (buf={})",
1865+
mb.height, mb.timeout_round, DEFERRED_MAX
1866+
);
1867+
}
1868+
metrics.verify_failed.fetch_add(1, Ordering::Relaxed);
1869+
}
1870+
continue;
17941871
}
1795-
continue;
17961872
}
17971873
}
17981874

development/qnet-integration/src/node.rs

Lines changed: 145 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -17694,11 +17694,59 @@ impl BlockchainNode {
1769417694
// extra second of stall past the adaptive grace window.
1769517695
const TIMEOUT_VOTE_INTERVAL: u64 = 1;
1769617696

17697-
// v4.2: Timeout votes/certificates keyed by MACROBLOCK INDEX,
17698-
// not exact microblock height. This ensures nodes at different
17699-
// microblock heights within the same macroblock can still form
17700-
// quorum and produce a TimeoutCertificate.
17701-
let timeout_mb_index = microblock_height / 90;
17697+
// ═══════════════════════════════════════════════════════════════
17698+
// v21 (A1): FORWARD-LOOKING TIMEOUT VOTE TARGET
17699+
// ═══════════════════════════════════════════════════════════════
17700+
// The timeout vote MUST address the macroblock whose producer is
17701+
// currently failing — the macroblock that contains the BLOCK
17702+
// we are waiting for, not the macroblock our local tip is in.
17703+
//
17704+
// At a macroblock boundary, this distinction is what bridges
17705+
// boot-N to boot-N+1:
17706+
//
17707+
// local_height = 359 → next_height = 360
17708+
// old logic: mb_idx = 359 / 90 = 3 (PREVIOUS macroblock)
17709+
// v21 logic: mb_idx = 360 / 90 = 4 (NEXT macroblock —
17710+
// the one whose producer
17711+
// is failing right now)
17712+
//
17713+
// Why the old indexing caused a circular deadlock at the very
17714+
// first failed-primary boundary:
17715+
// * Voter at h=359 emitted votes for mb_idx=3.
17716+
// * mb_idx=3 was already finalised (it ended at h=359 itself).
17717+
// * No node accumulated evidence for mb_idx=4.
17718+
// * The block at h=360 (produced by failover at round>0)
17719+
// required an AggregatedTimeoutCert for (mb_idx=4, round)
17720+
// to apply.
17721+
// * Cert never reached 2f+1 — voters were all voting for
17722+
// mb_idx=3 instead.
17723+
// * Network stalled at h=359 with no path to recover.
17724+
//
17725+
// Targeting `next_height / 90` makes vote semantics align with
17726+
// production-grade BFT vote-pool patterns: a vote is a
17727+
// standalone cryptographic claim about a future round, not a
17728+
// function of the voter's local state. The receiver-side
17729+
// already accepts votes for `mb_idx ≤ local_mb + 50` — see
17730+
// `unified_p2p.rs::handle_timeout_vote` lookahead window — so
17731+
// emitter-side change is sufficient and self-contained.
17732+
//
17733+
// Safety invariants preserved
17734+
// ─────────────────────────────
17735+
// * Vote remains Dilithium3-signed by the voter — emitter
17736+
// identity gated as before.
17737+
// * 2f+1 supermajority threshold unchanged.
17738+
// * VRF determinism for producer selection unchanged.
17739+
// * `voted_for_round` per-voter dedup still bounds emit rate.
17740+
//
17741+
// Scalability
17742+
// ───────────
17743+
// No additional bandwidth: same vote payload, same broadcast
17744+
// path, same gossip fan-out. The change shifts WHEN the vote
17745+
// is emitted (one slot earlier in the boundary case) and
17746+
// WHICH mb_idx it targets, not the cost of emitting it.
17747+
// Identical performance from 5 to 1M super-nodes.
17748+
// ═══════════════════════════════════════════════════════════════
17749+
let timeout_mb_index = next_height / 90;
1770217750

1770317751
// v5.4: Efficient certificate lookup (replaces bounded loop)
1770417752
let certified_timeout_round = if let Some(p2p) = &unified_p2p {
@@ -18154,6 +18202,98 @@ impl BlockchainNode {
1815418202
);
1815518203
}
1815618204

18205+
// ═══════════════════════════════════════════════════════════════
18206+
// v21 (B1): HEARTBEAT-DRIVEN FORWARD TIMEOUT VOTE EMIT
18207+
// ═══════════════════════════════════════════════════════════════
18208+
// When heartbeat absence is detected from the expected producer
18209+
// for `next_height`, emit a TimeoutVote IMMEDIATELY rather than
18210+
// waiting for `local_delay > timeout_grace_period`. This shaves
18211+
// ~5-10 seconds off the failover path because the vote starts
18212+
// propagating ~3 s after heartbeat-silence detection instead of
18213+
// after a full slot-grace window.
18214+
//
18215+
// Bridges the existing empty-slot attestation mechanism (which
18216+
// already fires on heartbeat_fast_path) into the TimeoutVote /
18217+
// cert-aggregation path, so the same observed producer failure
18218+
// produces evidence on BOTH consensus channels:
18219+
//
18220+
// * empty_slot_failover_round (attestation-based —
18221+
// accelerates microblock-level skip)
18222+
// * HIGHEST_CERTIFIED_ROUND (cert-based —
18223+
// drives macroblock-rotation round advancement)
18224+
//
18225+
// Without this cross-wiring, a heartbeat-detected producer
18226+
// failure triggered ONLY the attestation channel; the
18227+
// TimeoutVote stream had to wait for the legacy
18228+
// `local_delay > grace_period` gate, which at macroblock
18229+
// boundaries created a window where attestations advanced but
18230+
// the cert chain did not — leaving the cert-presence pipeline
18231+
// gate stalling blocks (forensic case h=360 at the first
18232+
// macroblock-boundary primary failure on the testnet).
18233+
//
18234+
// Gated on `proposed_timeout_round == 0` so this path never
18235+
// double-fires with the legacy stall-driven emit (which only
18236+
// runs when proposed_timeout_round > 0). Once `local_delay`
18237+
// crosses the grace threshold, control switches cleanly to the
18238+
// legacy path with no overlap.
18239+
//
18240+
// Safety
18241+
// ──────
18242+
// Same Dilithium3 signature, same `(mb_idx, round, voter_id)`
18243+
// anti-replay tracker, same 2f+1 supermajority threshold for
18244+
// cert generation. `broadcast_timeout_vote` itself dedupes via
18245+
// `TIMEOUT_VOTED_HEIGHTS` so repeated invocations within the
18246+
// same tick are no-ops. The cryptographic floor is unchanged.
18247+
//
18248+
// Scalability
18249+
// ───────────
18250+
// One conditional Dilithium3 sign (~3 ms) + one broadcast when
18251+
// heartbeat goes silent — same per-event cost as the legacy
18252+
// emit, just earlier in the timeline. Identical performance
18253+
// profile from 5 to 1M super-nodes.
18254+
// ═══════════════════════════════════════════════════════════════
18255+
if heartbeat_fast_path
18256+
&& proposed_timeout_round == 0
18257+
&& is_synced_enough
18258+
&& (microblock_height > 0 || genesis_era_dead_producer)
18259+
{
18260+
let target_round = certified_timeout_round.saturating_add(1);
18261+
if let Some(p2p) = &unified_p2p {
18262+
let last_block_hash = storage.get_latest_macroblock_hash()
18263+
.unwrap_or([0u8; 32]);
18264+
let vote_msg = format!(
18265+
"TIMEOUT:{}:{}:{}",
18266+
timeout_mb_index, target_round, hex::encode(&last_block_hash)
18267+
);
18268+
if let Some(crypto) = try_get_quantum_crypto() {
18269+
match crypto.create_consensus_signature(&node_id, &vote_msg).await {
18270+
Ok(sig) => {
18271+
p2p.broadcast_timeout_vote(
18272+
timeout_mb_index,
18273+
target_round,
18274+
last_block_hash,
18275+
sig.signature.as_bytes().to_vec(),
18276+
);
18277+
if is_info() {
18278+
println!(
18279+
"[INFO][TIMEOUT] heartbeat_driven_emit mb={} round={} reason=producer_silent_fast_path",
18280+
timeout_mb_index, target_round
18281+
);
18282+
}
18283+
}
18284+
Err(e) => {
18285+
if is_warn() {
18286+
println!(
18287+
"[WARN][TIMEOUT] heartbeat_driven_sign_fail err={}",
18288+
e
18289+
);
18290+
}
18291+
}
18292+
}
18293+
}
18294+
}
18295+
}
18296+
1815718297
// v14.8.11: drift self-pause vote gate REMOVED. A
1815818298
// drifted node still contributes TimeoutVotes because
1815918299
// its `proposed_timeout_round` is clamped by the

0 commit comments

Comments
 (0)