Commit b68a570

AIQnetLab and claude committed
fix: v17.1 — gossip-safe identity binding + bootstrap race guard
REVERT v17 IP-anchor gate from gossip-relayed handlers
─────────────────────────────────────────────────────────
The v17 IP-anchor check ran on 12 P2P handlers that receive gossip-propagated messages: ProducerHeartbeat, BlockRejection, ProducerReady, ReadyAck, ConsensusCommit, ConsensusReveal, VrfLeaderClaim, TimeoutVote, BlockAttestation, EmptySlotAttestationMsg, VrfKeyAnnounce, ActiveNodeAnnouncement.

For gossip-relayed messages, from_peer carries the relay's IP, NOT the originator's signed identity. Anchoring the relay to the genesis IP rejected legitimate cross-genesis traffic and broke 2f+1 quorum (testnet symptom: macroblock #2 stuck, heights diverged 119/210/220/207/222, log floods of genesis_ip_mismatch ... REJECTED).

Identity binding is enforced cryptographically by verify_consensus_signature against CONSENSUS_PK_REGISTRY (immutable, pre-pinned at startup from genesis_anchors.json). Fix #2/#3 from v17 close the legacy fallback and TOFV-on-genesis paths, so the signature gate is the canonical, gossip-safe security boundary.

The check_genesis_ip_gate helper is preserved with #[allow(dead_code)] for any future point-to-point message type where IP anchoring is sound.

ADD bootstrap race guard for genesis nodes
──────────────────────────────────────────
A genesis node started without genesis_anchors.json enters trust-on-first-verify mode for VrfKeyAnnounce: whichever peer announces first locks the genesis identity to its PK. Operator-unaware anchor loss between restarts is exactly how a squat-on-bootstrap attack succeeds.

install_genesis_anchors_at_startup now refuses to start when QNET_BOOTSTRAP_ID is set but anchors are missing, unless QNET_BOOTSTRAP_FRESH=1 explicitly opts into the race window (legitimate for first-ever boot or full-state cleanup). Super-nodes are unaffected: they bind identity via signed NodeRegistration TX.

Added 4 regression tests covering the boot-decision truth table (super/genesis × no-opt-in/opt-in).

Files: 2 modified, +307 / -148 lines.
Tests: 22/22 v17 regression tests pass (was 18, +4 new).
Build: cargo check --release clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 346445f commit b68a570
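
To make the commit message's argument concrete, here is a minimal, std-only sketch of why an IP check on the last hop fails for relayed gossip while a signature check against a pinned key does not. Everything in it is an assumption for illustration: the `GossipMsg` shape, the `ip_gate` and `signature_gate` helpers, the node id and addresses, and especially `toy_sign`, which is a keyed hash standing in for Dilithium3 and offers no security. It does not reproduce `verify_consensus_signature` or `check_genesis_ip_gate`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy stand-in for a signature scheme. NOT Dilithium3 and NOT secure:
/// the pinned "public key" doubles as the signing secret to stay std-only.
fn toy_sign(key: &[u8], payload: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    payload.hash(&mut h);
    h.finish()
}

/// Hypothetical message shape: the originator's id travels with the signed
/// payload, while `from_peer_ip` only names the last hop that delivered it.
struct GossipMsg<'a> {
    node_id: &'a str,
    payload: &'a [u8],
    signature: u64,
    from_peer_ip: &'a str,
}

/// IP-anchor style check: compares the last-hop IP with the IP pinned for
/// the claimed originator.
fn ip_gate(msg: &GossipMsg<'_>, genesis_ips: &HashMap<&str, &str>) -> bool {
    genesis_ips.get(msg.node_id) == Some(&msg.from_peer_ip)
}

/// Signature style check: verifies against the key pinned for `node_id`,
/// independent of which peer relayed the message.
fn signature_gate(msg: &GossipMsg<'_>, registry: &HashMap<&str, Vec<u8>>) -> bool {
    registry
        .get(msg.node_id)
        .map_or(false, |pk| toy_sign(pk, msg.payload) == msg.signature)
}

/// Returns (ip_gate on a relayed message, signature_gate on the same message,
/// signature_gate on a message with a tampered signature).
fn relay_demo() -> (bool, bool, bool) {
    let pk = b"pinned-key-for-genesis_001".to_vec(); // hypothetical key material
    let genesis_ips = HashMap::from([("genesis_001", "10.0.0.1")]);
    let registry = HashMap::from([("genesis_001", pk.clone())]);

    let payload: &[u8] = b"heartbeat";
    let relayed = GossipMsg {
        node_id: "genesis_001",
        payload,
        signature: toy_sign(&pk, payload),
        from_peer_ip: "10.0.0.7", // the relay's address, not the originator's
    };
    let forged = GossipMsg {
        node_id: "genesis_001",
        payload,
        signature: relayed.signature.wrapping_add(1),
        from_peer_ip: "10.0.0.1", // matching IP does not rescue a bad signature
    };

    (
        ip_gate(&relayed, &genesis_ips),
        signature_gate(&relayed, &registry),
        signature_gate(&forged, &registry),
    )
}
```

In this sketch the honestly signed but relayed message fails the IP check and passes the signature check, which is the failure mode the commit message describes for the 12 gossip-relayed handlers.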

2 files changed

Lines changed: 307 additions & 150 deletions

development/qnet-integration/src/genesis_constants.rs

Lines changed: 168 additions & 2 deletions
@@ -374,6 +374,50 @@ pub fn get_all_vrf_keys() -> HashMap<String, Vec<u8>> {
 /// Default location of the genesis anchors JSON file inside the container.
 pub const GENESIS_ANCHORS_PATH: &str = "/app/data/genesis_anchors.json";
 
+/// Outcome of the bootstrap-race guard in `install_genesis_anchors_at_startup`
+/// when the anchors file is absent. Exposed (and computed by the pure helper
+/// `anchors_missing_boot_decision`) so the policy can be unit-tested without
+/// touching process env or invoking `std::process::exit`.
+#[derive(Debug, Eq, PartialEq, Clone, Copy)]
+pub(crate) enum BootDecision {
+    /// Not a genesis node — anchors are irrelevant. Proceed.
+    Allowed,
+    /// Genesis node, no anchors, operator explicitly opted in via
+    /// `QNET_BOOTSTRAP_FRESH=1`. Proceed but emit a CRIT warning every boot
+    /// so the dangerous mode is impossible to miss in operational logs.
+    AllowedFreshOptIn,
+    /// Genesis node, no anchors, no opt-in. Caller must abort startup —
+    /// silently continuing would open the squat-on-bootstrap race window.
+    Refused,
+}
+
+/// Pure-logic decision for whether `install_genesis_anchors_at_startup` may
+/// proceed when the anchors file is absent. Inputs are taken explicitly so
+/// this function is fully testable without reading env vars or panicking.
+///
+/// Policy:
+/// * Super-node (no `QNET_BOOTSTRAP_ID`): always allowed — they bind
+/// identity via signed `NodeRegistration` TX, not via anchors.
+/// * Genesis node + opt-in via `QNET_BOOTSTRAP_FRESH=1`: allowed with a
+/// CRIT warning. The operator has accepted the race risk.
+/// * Genesis node, no opt-in: refused. Caller must terminate the process.
+///
+/// The opt-in is intentionally a single discrete env var rather than a
+/// timeout / heuristic — silent continuation in dangerous mode is exactly
+/// what we are defending against, so the gate must be operator-explicit.
+pub(crate) fn anchors_missing_boot_decision(
+    is_genesis_node: bool,
+    fresh_opt_in: bool,
+) -> BootDecision {
+    if !is_genesis_node {
+        BootDecision::Allowed
+    } else if fresh_opt_in {
+        BootDecision::AllowedFreshOptIn
+    } else {
+        BootDecision::Refused
+    }
+}
+
 /// Load genesis Dilithium3 anchor PKs from `path`. Returns empty map if file
 /// missing or malformed (logged as WARN, not fatal — boot proceeds without
 /// anchors so a fresh cluster can complete first-time keygen + anchor write).
@@ -453,8 +497,76 @@ pub fn load_genesis_anchor_pks_from_file(path: &str) -> HashMap<String, Vec<u8>>
 pub fn install_genesis_anchors_at_startup() -> usize {
     let map = load_genesis_anchor_pks_from_file(GENESIS_ANCHORS_PATH);
     if map.is_empty() {
-        // First-boot path: no anchor file yet. Caller logs the appropriate
-        // INFO; we return 0 so caller can decide whether to fail or proceed.
+        // ─────────────────────────────────────────────────────────────────
+        // v17.1: GENESIS BOOTSTRAP RACE GUARD
+        // ─────────────────────────────────────────────────────────────────
+        // A genesis node started without anchors is in the dangerous "fresh
+        // bootstrap" path: cross-registration via `VrfKeyAnnounce` uses
+        // trust-on-first-verify (the announce handler verifies a self-
+        // signature against the SUPPLIED public key, not against the
+        // registry — see unified_p2p.rs::NetworkMessage::VrfKeyAnnounce).
+        // Whichever peer announces a genesis identity FIRST locks that
+        // identity to its PK in the local consensus PK registry. If a
+        // non-genesis peer (e.g. a whitelisted but otherwise hostile IP)
+        // is online and faster than the legitimate genesis bootstrap, it
+        // can squat the slot.
+        //
+        // Refuse to start unless the operator has explicitly acknowledged
+        // the race by setting `QNET_BOOTSTRAP_FRESH=1`. Two situations are
+        // legitimate uses of that opt-in:
+        // * Truly first-ever cluster boot before any anchors have ever
+        // been auto-written.
+        // * Operator-driven full state cleanup where the
+        // `dilithium_keypair.bin` files were also wiped (so a new
+        // round of cross-registration is required).
+        //
+        // Any other situation — anchors lost between restarts, deploy
+        // script forgot to copy the file, host filesystem corruption —
+        // should fail loudly so the operator can restore from backup
+        // BEFORE the race window opens. Silent continuation in fresh-boot
+        // mode after operator-unaware anchor loss is exactly how an
+        // attacker squat succeeds on the next restart.
+        //
+        // Super-node identities (no `QNET_BOOTSTRAP_ID` env var) do NOT
+        // need anchors — their identity binding is established via signed
+        // `NodeRegistration` TX, which carries the Dilithium3 PK in the
+        // payload and is verified end-to-end. The guard skips them.
+        //
+        // Scalability: O(1) — two env-var lookups and a string compare.
+        // Independent of cluster size or network state.
+        let is_genesis_node = std::env::var("QNET_BOOTSTRAP_ID").is_ok();
+        let fresh_opt_in = std::env::var("QNET_BOOTSTRAP_FRESH")
+            .map(|v| v == "1")
+            .unwrap_or(false);
+        match anchors_missing_boot_decision(is_genesis_node, fresh_opt_in) {
+            BootDecision::Allowed => { /* proceed below */ }
+            BootDecision::AllowedFreshOptIn => {
+                let bootstrap_id = std::env::var("QNET_BOOTSTRAP_ID").unwrap_or_default();
+                eprintln!(
+                    "[CRIT][GENESIS] fresh_bootstrap_mode_active bootstrap_id={} path={} \
+                     risk=identity_squat_window_open \
+                     hint=ensure_QNET_WHITELIST_IPS_contains_only_genesis_or_trusted_peers",
+                    bootstrap_id, GENESIS_ANCHORS_PATH
+                );
+            }
+            BootDecision::Refused => {
+                let bootstrap_id = std::env::var("QNET_BOOTSTRAP_ID").unwrap_or_default();
+                eprintln!(
+                    "[CRIT][GENESIS] genesis_node_started_without_anchors \
+                     bootstrap_id={} path={} action=halt_startup",
+                    bootstrap_id, GENESIS_ANCHORS_PATH
+                );
+                eprintln!(
+                    "[CRIT][GENESIS] hint=restore_genesis_anchors_json_from_backup \
+                     OR set_QNET_BOOTSTRAP_FRESH=1_to_acknowledge_race_risk"
+                );
+                eprintln!(
+                    "[CRIT][GENESIS] race_summary=a_non-genesis_peer_with_valid_dilithium3_keypair \
+                     can_announce_first_and_lock_genesis_identity_to_its_PK_squat_attack"
+                );
+                std::process::exit(2);
+            }
+        }
         return 0;
     }
     let count = map.len();
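
The two env-var lookups in the hunk above are stricter than they might look, and a small sketch can make the edge cases explicit. The helper below is hypothetical (`guard_inputs` is not part of the commit): it mirrors the `is_ok()` and `v == "1"` logic but takes the raw values as `Option<&str>`, with `None` meaning "unset", so it can be exercised without touching the process environment. It cannot model the non-Unicode case, where `std::env::var` returns an error and the real code would treat the variable as unset.

```rust
/// Hypothetical mirror of the guard's env parsing; `None` means the variable
/// is unset. Returns (is_genesis_node, fresh_opt_in).
fn guard_inputs(bootstrap_id: Option<&str>, bootstrap_fresh: Option<&str>) -> (bool, bool) {
    // `std::env::var(..).is_ok()` is true for any set, valid-Unicode value,
    // including an empty string on platforms that allow empty values.
    let is_genesis_node = bootstrap_id.is_some();
    // Only the exact string "1" opts in; "true", "yes" or " 1" do not.
    let fresh_opt_in = bootstrap_fresh.map(|v| v == "1").unwrap_or(false);
    (is_genesis_node, fresh_opt_in)
}
```

Under this reading, `QNET_BOOTSTRAP_FRESH=true` on a genesis node without anchors still ends in a refused start, which matches the stated goal of an operator-explicit gate.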
@@ -624,5 +736,59 @@ mod tests_v17_security {
     fn genesis_node_count_matches_ip_table() {
         assert_eq!(genesis_node_count(), GENESIS_NODE_IPS.len());
     }
+
+    // ────────────────────────────────────────────────────────────────────────
+    // v17.1: BOOTSTRAP-RACE GUARD (anchors_missing_boot_decision)
+    // ────────────────────────────────────────────────────────────────────────
+    // The four cases below exhaustively cover the truth table of the policy
+    // documented above the function. A regression on ANY of these means the
+    // refuse-to-start guard has been broken — either we'd start dangerously
+    // (squat window open) or we'd crash super-nodes that have no business
+    // touching anchors. Both are loud production failures.
+
+    /// Super-node (no `QNET_BOOTSTRAP_ID`) MUST always boot regardless of
+    /// the `QNET_BOOTSTRAP_FRESH` flag — they have no anchor relationship.
+    #[test]
+    fn boot_decision_super_node_no_opt_in_allowed() {
+        assert_eq!(
+            anchors_missing_boot_decision(false, false),
+            BootDecision::Allowed
+        );
+    }
+
+    /// Super-node + opt-in: still allowed. Opt-in is irrelevant for a node
+    /// type that doesn't consult anchors. We don't error on irrelevant flags.
+    #[test]
+    fn boot_decision_super_node_with_opt_in_allowed() {
+        assert_eq!(
+            anchors_missing_boot_decision(false, true),
+            BootDecision::Allowed
+        );
+    }
+
+    /// Genesis node + no opt-in is the SECURITY-CRITICAL case. Booting here
+    /// would let any whitelisted hostile peer with a fresh Dilithium3 keypair
+    /// announce a genesis identity first and pin its PK in the local
+    /// registry — squat-on-bootstrap. The guard MUST refuse.
+    #[test]
+    fn boot_decision_genesis_no_anchors_no_opt_in_refused() {
+        assert_eq!(
+            anchors_missing_boot_decision(true, false),
+            BootDecision::Refused
+        );
+    }
+
+    /// Genesis node + explicit opt-in: allowed but flagged. This is the
+    /// legitimate first-cluster-boot path; it must succeed so a brand-new
+    /// network can complete cross-registration and auto-write its anchors.
+    /// The CRIT log emitted alongside this decision is the operator's
+    /// evidence that they are running in dangerous mode for this boot.
+    #[test]
+    fn boot_decision_genesis_no_anchors_with_opt_in_allowed_with_warning() {
+        assert_eq!(
+            anchors_missing_boot_decision(true, true),
+            BootDecision::AllowedFreshOptIn
+        );
+    }
 }
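
As a compact alternative view of the same truth table, the four cases can be written as one table-driven check. This is only a sketch: `BootDecision` and `anchors_missing_boot_decision` are re-declared locally (copied from the diff, without `pub(crate)`) so the block stands alone, and `truth_table_mismatches` is a hypothetical helper that is not part of the commit.

```rust
#[derive(Debug, Eq, PartialEq, Clone, Copy)]
enum BootDecision {
    Allowed,
    AllowedFreshOptIn,
    Refused,
}

/// Local copy of the policy function from the diff, for self-containment.
fn anchors_missing_boot_decision(is_genesis_node: bool, fresh_opt_in: bool) -> BootDecision {
    if !is_genesis_node {
        BootDecision::Allowed
    } else if fresh_opt_in {
        BootDecision::AllowedFreshOptIn
    } else {
        BootDecision::Refused
    }
}

/// Hypothetical helper: returns every (is_genesis_node, fresh_opt_in) row
/// whose decision differs from the documented policy.
fn truth_table_mismatches() -> Vec<(bool, bool)> {
    let table = [
        (false, false, BootDecision::Allowed),
        (false, true, BootDecision::Allowed),
        (true, false, BootDecision::Refused),
        (true, true, BootDecision::AllowedFreshOptIn),
    ];
    table
        .iter()
        .filter(|(genesis, fresh, want)| anchors_missing_boot_decision(*genesis, *fresh) != *want)
        .map(|(genesis, fresh, _)| (*genesis, *fresh))
        .collect()
}
```

The commit's four separately named tests give clearer failure messages per case; the table form trades that for brevity and makes it harder to forget a row.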