Commit eee6001
fix: v16.2 deterministic rotation hardening + observer-based fork recovery
Forensic root cause (h=154 cold-boot deadlock, deploy v16.1):
* My v16.1 escalation ladder Stage 1 locally mutated CURRENT_TIMEOUT_ROUND
outside BFT consensus -> nodes hit cycle threshold at different moments,
bumped global selection input to different values, produced divergent
producer selections and visible forks
* Cross-round pacemaker (v15.0) advanced certified_round in mixed-round
bursts during cold-boot (5 NTP-synced nodes hitting stall detection
simultaneously) -> certified_round raced 1->2->6->11 in 5 seconds,
multiple producers concurrently emitting blocks at different rotation
rounds for the same height
* Phase 4.A source-based 2f+1 destructive rollback was dead code: with
byzantine <= f, source count tops at f, threshold unreachable
* Smart vote gate using p2p.get_best_peer_height() -> race-prone metric
could deny valid votes to nodes that were finality-current
* Phase 2.A force_round_advance + runtime EXCLUDED_PRODUCERS scan in
rotation -> per-node-divergent state injected into producer selection
FIX 1: REMOVE force_round_advance ladder stage
* node.rs::escalate_error_state: stages 2 (resync), 3 (peer_refresh),
4 (halt) retained -- all signal-based, never mutate consensus state
* Stage 1 deletion restores rotation determinism that existed pre-v16.1
FIX 2: CERT PRESENCE CHECK at verify stage
* block_pipeline.rs verify: blocks with timeout_round > 0 require local
AGGREGATED_TC[(mb_idx, round)] before accept -- defer + request_timeout_proofs
if absent. Cert is 2f+1 Dilithium3-signed at gossip ingest, presence
is sufficient evidence
* unified_p2p.rs: has_aggregated_timeout_cert() O(1) lookup
* No block bloat: Dilithium3 not BLS-aggregateable; cert stays gossiped
separately, only presence consulted -- works at committee=1000 (~2.2 MB
cert local memory, never embedded in blocks)
FIX 3: SAME-ROUND-ONLY PACEMAKER on cold-boot
* unified_p2p.rs::handle_timeout_vote: cross-round aggregation disabled
at mb_idx == 0 (genesis). Production unchanged (mb_idx >= 1 keeps
cross-round for liveness)
* Cold-boot determinism: 5 NTP-synced nodes converge on same round R
via same-round 2f+1 simultaneously, no propagation race window
FIX 4: ROUND-CHANGE READY HANDSHAKE
* New NetworkMessage variants: ProducerReady + ReadyAck (signed Dilithium3)
* unified_p2p.rs handlers: producer at round > 0 broadcasts ProducerReady,
peers ack ONLY when local_certified == round (exact match), producer
waits for 2f+1 distinct signed acks before constructing block
* node.rs::wait_for_round_change_ready_quorum: 800ms timeout, yields slot
if quorum not met (pacemaker advances naturally on next stall)
* Steady-state happy path (round=0) has zero handshake overhead
* Handshake fires only at rotation events (rare under healthy network)
FIX 5: PEER-COUNT SIGNAL cold-boot gate
* node.rs: replaced arbitrary 60s magic timer with event-based release
when peer_count + consensus_pk_registry_len reach committee size
AND 10s safety floor elapsed. Auto-calibrates to actual network
conditions (Docker boot + P2P discovery + key exchange variance)
FIX 6: REMOVE redundant runtime exclusion scan
* node.rs::select_microblock_producer_with_round: candidates list is
ALREADY filtered by canonical excluded_producers_for_next_epoch from
mb#N-2 (2f+1-finalised chain state) inside calculate_qualified_candidates.
Adding a runtime EXCLUDED_PRODUCERS DashMap scan introduced
per-node-divergent skip paths -- removed, canonical filter is the
single authoritative source
FIX 7: OBSERVER-BASED 2f+1 DESTRUCTIVE ROLLBACK
* New NetworkMessage::BlockRejection (signed Dilithium3)
* block_pipeline.rs::record_hash_chain_break_witness: f+1 advisory log
+ peer cooldown retained; dead source-based 2f+1 destructive code
removed
* On hash_chain_break, observer signs and broadcasts BlockRejection.
Receivers aggregate distinct observer_ids per (height, source_peer_id);
2f+1 distinct observers raise FORK_RECOVERY_HEIGHT for canonical
destructive rollback path
* Anti-Sybil: signed by observer's registered Dilithium3 key, verified
against consensus_pk_registry. Self-attestation rejected
* Industry-canonical observer-supermajority pattern (works at any
committee size up to 1000)
FIX 8: VOTE GATE finality-locked
* node.rs: smart vote gate uses storage.get_latest_macroblock_index()
(2f+1 commit-reveal-locked on-chain state) instead of race-prone
best_peer_h gossip metric. Node at canonical finalised height can vote
even if microblock tip lags
Properties enforced after v16.2:
* 2f+1 invariant in every consensus decision (verified per fix)
* No wall-clock in consensus (all decisions: chain state OR 2f+1
verified evidence OR explicit 2f+1 signaling)
* All hot-path operations bounded by committee size (<= 1000), no
O(network_size) work
* Single magic number in hot path: MIN_SAFETY_FLOOR_SECS=10 (sanity
floor for cold-boot peer-count gate)
* Industry parity: cert defer-until-evidence, round-change handshake,
observer-supermajority rollback, peer-discovery quorum gate
* Post-quantum adapted: Dilithium3 throughout, defer pattern compensates
for non-aggregateable signatures (no BLS available)
Diff: 3 files, +1116 / -171 lines
Build: cargo build --release exit 0 (14m 17s, 0 warnings)
Workspace: cargo check --release --workspace exit 0 (1m 57s, 0 warnings)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>1 parent b3224b6 commit eee6001
3 files changed
Lines changed: 1116 additions & 171 deletions
0 commit comments