Commit 10b972e

AIQnetLab and claude committed
fix: v22 — collapse microblock failover machinery to pure VRF + skip-slot
Architectural simplification of the microblock layer: a single deterministic VRF leader per height plus a time-derived in-rotation fallback computed from the chain-anchored `last_applied_block_timestamp` and the network-corrected `effective_now()` (BFT-time median over the 32-sample on-chain timestamp ring). The macroblock layer (90-block 2f+1 commit-reveal + view-change), existing fork-choice (`chain_weight.rs` cumulative attestation weight), and equivocation slashing (`record_block_equivocation` + `analyze_chain_for_slashing` covering block-equivocation, timeout-equivocation, and generic double-sign) are unchanged.

Forensic context
----------------
Three bugs observed on the 5-node testnet across v21..v22:
* h=360 macroblock-boundary cert-presence deadlock (12.7 h stall)
* h=367 split-brain fork (empty-slot attestation cascade racing cert generation produced an alternate block at the same height)
* node_002 555K pipeline backlog (canonical peers pipeline-jammed by the cert-presence gate, sync requests timed out, fell back to the forked peer, hash_chain_break loop)

Root cause was four overlapping microblock failover mechanisms accumulated over 55 commits (v15..v21) without the earlier ones being removed: rotation_round atomic + TimeoutVote chain + empty-slot attestation cascade + heartbeat-fast-path detection. The O(N^2) interaction surface made races inevitable.

Changes
-------
block_pipeline.rs : +8   -161  net -153
node.rs           : +372 -781  net -409
unified_p2p.rs    : +0   -189  net -189
TOTAL 396 insertions(+), 1186 deletions(-), NET -790 lines

Deleted (microblock failover machinery)
* Cert presence gate at block_pipeline.rs:1761 (incl. v21 A2 vote-pool fallback; both inert once `mb.timeout_round` is always 0).
* Empty-slot attestation cascade body (~200 lines).
* Heartbeat-fast-path consensus integration.
* Pacemaker rotation_round combine + `set_timeout_round` storage.
* Vote-gate `is_synced_enough` block (was gating only the deleted emit).
* v21 (B1) heartbeat-driven TimeoutVote emit.
* Stall-driven TimeoutVote-microblock emit + producer-failover log.
* Adaptive timeout grace period (`base_grace + rotation_extra + drift_grace`).
* `proposed_timeout_round` / `adopted_timeout_round` / `timeout_round_for_rotation` / `f_plus_1` derivations.
* Per-tick `certified_timeout_round` read at the microblock layer.
* `effective_timeout_round_at_start` computation inside the microblock struct construction.
* v21 vote-pool helpers `count_timeout_votes_in_pool` / `has_two_f_plus_one_timeout_votes` and `tests_v21_a2_vote_pool` module.
* `reset_consecutive_empty_slots()` no-op shim from an earlier v22 draft.
* Residual `/* */` commented-out blocks left from incremental edits (~360 lines physically removed).

Added (v22 production path)
* `MAX_CONSECUTIVE_EMPTY_SLOTS = 3` constant.
* `v22_compute_empty_slot_offset(last_applied_block_timestamp, now)` pure helper:
      seconds_silent = now - last_applied_block_timestamp - 1
      offset         = seconds_silent / MAX_CONSECUTIVE_EMPTY_SLOTS
  Returns 0 under healthy production. Walks forward by 1 every `MAX_CONSECUTIVE_EMPTY_SLOTS` silent seconds.
* Producer loop rewrite:
      timeout_round     = 0 always
      primary_producer  = VRF leader at round 0
      empty_slot_offset = v22_compute_empty_slot_offset(
                              last_applied_ts,
                              effective_now())  // BFT-time, not raw clock
      current_producer  = if offset == 0 { primary }
                          else { VRF leader at round = offset }
* `microblock.timeout_round = 0` literal in the MicroBlock constructor.
* `tests_v22_slot_offset`: 6 regression tests pinning every transition boundary plus the NTP-jitter spread bound.

Preserved (unchanged)
* Macroblock 2f+1 commit-reveal finality (commit_phase / reveal_phase / finalize_round).
* Macroblock view-change via `emit_macroblock_view_change_vote` on commit-phase or reveal-phase failure.
* `AggregatedTimeoutCert` storage + gossip handlers (consumed by macroblock view-change only after v22).
* `HIGHEST_CERTIFIED_ROUND` DashMap (macroblock state).
* `chain_weight.rs` LMD-GHOST cumulative attestation fork choice.
* `record_block_equivocation` + `BLOCK_EQUIVOCATION_EVIDENCE` + all slashing detection paths.
* `effective_now()` BFT-time median ring infrastructure.
* Median-past timestamp rule + `TIMESTAMP_FUTURE_TOLERANCE`.
* `observe_clock_drift` EMA monitor + `[WARN][DRIFT]` operator hint.
* v18 active sync + range-sync + parallel-sync (`MAX_PARALLEL_SYNC_PEERS=8`).
* v19 anti-spoof Dilithium handshake + v19.1 fresh-bootstrap auto-anchor.
* v20 PK registry scaling + LRU + 100K cap + env override.
* Chronic stall recovery (`> 120 s` peer-driven resync), simplified to drop the now-meaningless `certified_round == 0` predicate.

Why this is correct for a PQC two-tier blockchain
-------------------------------------------------
Dilithium3 signatures have no aggregation (no BLS equivalent). At a 1000-validator committee, per-block 2/3 voting costs ~2.3 MB / block of signatures alone, which is un-shippable. The two-tier amortisation (one signed microblock per second; 2f+1 macroblock cert every 90 seconds) cuts the sustained bandwidth to roughly 53 KB / sec at 1000 validators.

For the microblock tier, the universal pattern across multi-tier-finality chains is optimistic apply with an empty slot tolerated when the deterministic leader is silent. v22 adopts this pattern and adds an in-rotation fallback bounded by `MAX_CONSECUTIVE_EMPTY_SLOTS` slots, so the 30-block rotation window does not amplify a single failed primary into 30 empty slots.

Safety properties preserved
* Single deterministic VRF leader per height: no two valid producers can claim the same height legitimately.
* Fallback identity is the same `select_microblock_producer_with_round` function with `empty_slot_offset` mixed into the seed; every honest node observing the same `effective_now()` reaches the same fallback identity.
* `effective_now()` = max(wall, median of 32-sample on-chain block timestamps): a node with a drifted local clock converges with the network on the slot-offset computation as long as the chain itself is making progress.
* All Dilithium3 signature verification gates intact at every block.
* 2f+1 macroblock commit-reveal supermajority preserved.
* Anti-replay via signed `(mb_idx, round, voter_id)` tuple in macroblock TimeoutVote.
* Anti-spoof handshake binding (v19) + chain-registered PK (v20) intact.

Failure modes structurally impossible in v22
* Cascade livelock across microblock rounds (v15.0 h=2880-3150): no microblock rounds → no per-voter round scatter.
* Split-brain producer at the same height (v15.13 h=556, v22 h=367): single VRF leader; deterministic time-derived fallback with offset spread ≤ 1 across honest nodes within ±2 s NTP jitter (pinned by `tests_v22_slot_offset::offset_jitter_bound_within_two_seconds_ntp`).
* Macroblock-boundary cert deadlock (v21 h=360): `mb.timeout_round == 0` always → cert presence gate (deleted) cannot fire for any microblock.
* Empty-slot attestation race producing alternate-round block (v22 h=367): empty-slot cascade physically removed; no consensus state advances on local empty-slot observation.

Scalability
-----------
Per-node steady-state cost: two timestamp loads + one integer division per slot for the offset computation. Zero added bandwidth, zero added storage, identical performance profile from 5 to 1M super-nodes. 1000-validator-per-macroblock committee cap unaffected.

Tests
-----
* `tests_v22_slot_offset`: 6 new tests in `node.rs`:
    - offset_zero_under_healthy_production
    - offset_zero_below_threshold
    - offset_one_at_threshold
    - offset_walks_forward_with_silence
    - offset_saturates_on_backward_clock
    - offset_jitter_bound_within_two_seconds_ntp
* `qnet-consensus` : 73 passed, 0 failed
* `qnet-integration` (serial) : 149 passed, 0 failed, 12 ignored (hardware-bench)
* Total : 228 passed across both crates.
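
As an illustration of the boundaries those six tests are described as pinning, here is a minimal sketch of the offset rule. It follows the two formulas given for `v22_compute_empty_slot_offset`; the function name `empty_slot_offset`, the `u64` Unix-second timestamps, and the use of saturating subtraction for a backward clock are assumptions, not the code in `node.rs`.

```rust
/// Value stated in the commit message.
const MAX_CONSECUTIVE_EMPTY_SLOTS: u64 = 3;

/// Hypothetical stand-in for `v22_compute_empty_slot_offset`.
///
/// seconds_silent = now - last_applied_block_timestamp - 1
/// offset         = seconds_silent / MAX_CONSECUTIVE_EMPTY_SLOTS
///
/// Saturating subtraction is an assumption: it makes a clock that reads
/// earlier than the last applied block yield offset 0 instead of wrapping.
pub fn empty_slot_offset(last_applied_block_timestamp: u64, now: u64) -> u64 {
    let seconds_silent = now
        .saturating_sub(last_applied_block_timestamp)
        .saturating_sub(1);
    seconds_silent / MAX_CONSECUTIVE_EMPTY_SLOTS
}

/// Largest offset difference between two nodes whose clocks differ by at
/// most `skew` seconds, scanned over `window` seconds of silence.
pub fn max_offset_spread(last_applied: u64, window: u64, skew: u64) -> u64 {
    let mut worst = 0;
    for now in last_applied..last_applied + window {
        let a = empty_slot_offset(last_applied, now);
        let b = empty_slot_offset(last_applied, now + skew);
        worst = worst.max(b - a); // b >= a because the offset is monotonic in `now`
    }
    worst
}
```

With a last applied timestamp of 1000, this sketch gives offset 0 up to `now = 1003`, offset 1 at `now = 1004`, and offset 2 at `now = 1007`. Two readings of `now` that are at most 2 seconds apart differ by at most one offset step, because 2 is smaller than the divisor 3.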

Build
-----
cargo build --release clean in 18m 47s, 0 warnings, 0 errors.
qnet-node.exe binary 22.3 MB optimised.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
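
The commit message states `effective_now()` as max(wall, median of the 32-sample on-chain block timestamps). A minimal sketch of that rule follows, assuming `u64` Unix-second timestamps. The `TimestampRing` type, its methods, and the choice of the upper median for even sample counts are hypothetical and only illustrate the formula; they are not the repository's implementation.

```rust
/// Ring capacity stated in the commit message.
const RING_CAPACITY: usize = 32;

/// Hypothetical fixed-size ring of recent on-chain block timestamps.
pub struct TimestampRing {
    samples: Vec<u64>,
    next: usize, // slot that the next push overwrites once the ring is full
}

impl TimestampRing {
    pub fn new() -> Self {
        Self { samples: Vec::with_capacity(RING_CAPACITY), next: 0 }
    }

    /// Record a block timestamp, evicting the oldest sample when full.
    pub fn push(&mut self, ts: u64) {
        if self.samples.len() < RING_CAPACITY {
            self.samples.push(ts);
        } else {
            self.samples[self.next] = ts;
        }
        self.next = (self.next + 1) % RING_CAPACITY;
    }

    /// Median of the stored samples (upper median for even counts, an
    /// assumption), or `None` when no block has been observed yet.
    pub fn median(&self) -> Option<u64> {
        if self.samples.is_empty() {
            return None;
        }
        let mut sorted = self.samples.clone();
        sorted.sort_unstable();
        Some(sorted[sorted.len() / 2])
    }
}

/// max(wall clock, chain median); falls back to the wall clock alone
/// while the ring is empty.
pub fn effective_now(wall_clock: u64, ring: &TimestampRing) -> u64 {
    match ring.median() {
        Some(median) => wall_clock.max(median),
        None => wall_clock,
    }
}
```

Under this reading, `max` pulls a lagging local clock up to the chain median, while a clock that runs ahead passes through unchanged. Bounding a fast clock would then fall to other checks, such as the `TIMESTAMP_FUTURE_TOLERANCE` rule listed as preserved.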
1 parent b19c872 commit 10b972e

3 files changed

Lines changed: 396 additions & 1186 deletions

File tree

development/qnet-integration/src/block_pipeline.rs

Lines changed: 8 additions & 161 deletions
@@ -1710,167 +1710,14 @@ impl BlockPipeline {
                 }
             }

-            // ═══════════════════════════════════════════════════════════════════════════
-            // 2b. v16.2: CERT PRESENCE CHECK FOR ROUND>0 BLOCKS (BFT-evidence enforcement)
-            // ═══════════════════════════════════════════════════════════════════════════
-            // A block produced at `timeout_round = R > 0` claims that the network
-            // advanced rotation past round 0 via 2f+1 signed TimeoutVotes. The
-            // cryptographic evidence for that advancement is the
-            // AggregatedTimeoutCertificate stored at `AGGREGATED_TC[(mb_idx, R)]`.
-            //
-            // Cold-boot postmortem at h=154 showed two producers concurrently
-            // emitting blocks at different rotation_rounds because each had
-            // advanced its own `HIGHEST_CERTIFIED_ROUND` independently while
-            // peer gossip was still propagating votes. Without an evidence check
-            // here, ingest accepted any signed block that claimed any round —
-            // letting the network apply two divergent histories before the
-            // mismatch was observed at the next macroblock boundary.
-            //
-            // The fix:
-            // * `block.timeout_round == 0` → no cert needed, accept (happy path).
-            // * `block.timeout_round > 0` → REQUIRE local
-            // `AGGREGATED_TC.get((mb_idx, round)).is_some()`. If absent, defer
-            // the block (pull request_timeout_proofs from peers, re-check on
-            // next pass). The certificate carries 2f+1 Dilithium3 signatures
-            // and is signature-verified at gossip ingest by
-            // `handle_aggregated_timeout_cert`, so presence is sufficient
-            // evidence — we don't re-verify here.
-            //
-            // Safety property: a block at round R is accepted iff this node has
-            // independently observed 2f+1 votes for round R. Two producers at
-            // different rounds at the same height become impossible — the late
-            // one's cert can only be present after 2f+1 votes for ITS round
-            // arrived, which means rotation advanced past the earlier producer's
-            // round and the earlier block is now stale (will be rejected by the
-            // producer authority check below or supplanted by canonical chain).
-            //
-            // Scalability: O(1) DashMap lookup. AGGREGATED_TC bounded by the
-            // active macroblock window (cleanup_old_timeout_data evicts stale
-            // entries). Works identically at 5-node genesis and 1000-node
-            // production committee — cert size doesn't matter, only presence.
-            //
-            // Post-quantum adaptation: Dilithium3 signatures cannot be
-            // aggregated to a constant-size bundle (no BLS equivalent), so we
-            // cannot embed the 2f+1-signed cert directly inside every block
-            // header — at committee=1000 that would be ~2.2 MB per block. The
-            // local-presence model achieves the same safety property without
-            // the bandwidth cost: cert is gossiped once via
-            // `broadcast_aggregated_timeout_cert`, stored at every honest peer,
-            // and consulted here.
-            // ═══════════════════════════════════════════════════════════════════════════
-            if mb.height > 0 && mb.timeout_round > 0 {
-                let mb_idx_for_cert = mb.height / 90;
-                let cert_present = if let Some(ref p2p) = unified_p2p {
-                    p2p.has_aggregated_timeout_cert(mb_idx_for_cert, mb.timeout_round)
-                } else {
-                    // No P2P context (rare — replay path); accept block as
-                    // signature already validates producer identity. The
-                    // canonical safety net is at apply-time hash chain.
-                    true
-                };
-                if !cert_present {
-                    // ═══════════════════════════════════════════════════════════
-                    // v21 (A2): VOTE-POOL FALLBACK — boundary grace
-                    // ═══════════════════════════════════════════════════════════
-                    // The aggregated cert may not have been generated or
-                    // gossipped yet, but the underlying TimeoutVotes — each
-                    // Dilithium3-verified at ingest by `handle_timeout_vote`
-                    // — may already be in the local pool. The cert is just an
-                    // aggregated view of those votes; if 2f+1 are present in
-                    // the pool, we have equivalent cryptographic evidence.
-                    //
-                    // Why this is needed
-                    // ──────────────────
-                    // Forensic case h=360 on the 5-node testnet showed the
-                    // failure mode:
-                    // * h=360 first block of new macroblock; primary
-                    // timed out → failover producer emitted block with
-                    // timeout_round=R>0;
-                    // * receivers required AggregatedTimeoutCert for
-                    // (mb_idx=4, R) before applying;
-                    // * cert generation requires 2f+1 votes for
-                    // (mb_idx=4, R) gossipped to at least one node
-                    // which then aggregates and re-broadcasts;
-                    // * during the race window between last vote arriving
-                    // and cert re-gossip, every receiver's pipeline
-                    // deferred the block — even though the underlying
-                    // votes WERE locally present.
-                    //
-                    // Treating the pool as equivalent evidence closes that
-                    // race. No new attack surface: the votes counted are
-                    // the same Dilithium3-signed messages that feed cert
-                    // generation; threshold (2f+1) is the same.
-                    //
-                    // Pattern
-                    // ───────
-                    // "cert is a view, not a gate" — same data, two access
-                    // paths. Aligns with production-grade BFT semantics
-                    // where vote pool is the canonical source of truth and
-                    // the aggregated form is a transport optimisation.
-                    //
-                    // Scalability
-                    // ───────────
-                    // One DashMap shard read + one HashMap len() — O(1)
-                    // hot-path cost. At 1M super-nodes the inner HashMap
-                    // is bounded by `MAX_VALIDATORS = 1000` per slot, so
-                    // the count operation is a constant.
-                    // ═══════════════════════════════════════════════════════════
-                    let pool_has_quorum = if let Some(ref p2p) = unified_p2p {
-                        let total = p2p.get_active_validator_count();
-                        // Same threshold formula used everywhere in the
-                        // codebase: `(N * 2 + 2) / 3` = ceil(2N/3) = 2f+1.
-                        let two_f_plus_1 = (total * 2 + 2) / 3;
-                        p2p.has_two_f_plus_one_timeout_votes(
-                            mb_idx_for_cert,
-                            mb.timeout_round,
-                            two_f_plus_1,
-                        )
-                    } else {
-                        false
-                    };
-
-                    if pool_has_quorum {
-                        // Pool evidence equivalent to cert. Fall through to
-                        // subsequent verify steps. Boundary flag in the log
-                        // helps operators distinguish the legitimate macroblock-
-                        // boundary race window from steady-state mid-macroblock
-                        // catches (the latter is unusual and worth noting).
-                        let at_boundary = mb.height % 90 == 0;
-                        if is_info() {
-                            println!(
-                                "[INFO][PIPELINE] cert_pool_grace_admit h={} mb_idx={} round={} boundary={} \
-                                 reason=2fplus1_votes_in_local_pool hint=cert_aggregation_race_bypassed",
-                                mb.height, mb_idx_for_cert, mb.timeout_round, at_boundary
-                            );
-                        }
-                    } else {
-                        // Defer block: neither cert nor enough pool votes yet.
-                        // Trigger backfill request and put block in the
-                        // deferred buffer to be re-checked on the next pass.
-                        if let Some(ref p2p) = unified_p2p {
-                            p2p.request_timeout_proofs(mb_idx_for_cert, mb_idx_for_cert);
-                        }
-                        if deferred.len() < DEFERRED_MAX {
-                            if is_debug() {
-                                println!(
-                                    "[DBG][PIPELINE] block_deferred_for_cert h={} round={} mb_idx={} buf={}",
-                                    mb.height, mb.timeout_round, mb_idx_for_cert, deferred.len()
-                                );
-                            }
-                            deferred.insert(mb.height, decoded);
-                        } else {
-                            if is_info() {
-                                println!(
-                                    "[INFO][PIPELINE] deferred_full h={} round={} dropped (buf={})",
-                                    mb.height, mb.timeout_round, DEFERRED_MAX
-                                );
-                            }
-                            metrics.verify_failed.fetch_add(1, Ordering::Relaxed);
-                        }
-                        continue;
-                    }
-                }
-            }
+            // v22: cert presence gate REMOVED. Microblocks no longer carry a
+            // rotation round (`mb.timeout_round` is always 0 — see
+            // `node.rs::microblock_construction`). The previous gate existed to
+            // require AggregatedTimeoutCert presence for round>0 microblocks;
+            // the round>0 case is now structurally unreachable from honest
+            // producers, and dishonest emitters are caught by the signature
+            // gate immediately below. Macroblock layer retains its own 2f+1
+            // commit-reveal finality — that path is unchanged.

             // 3. Signature verification
             // Genesis block (h=0) uses embedded self-signed keys — skip standard verification.
0 commit comments
