
Commit d9c7a89

AIQnetLab and claude committed
fix: v19.1 — restore fresh-bootstrap auto-creation of genesis_anchors.json
The v19.0 handshake-proof gate was too strict: it returned `Err` whenever
`verify_dilithium_heartbeat_signature_async` returned false, conflating two
genuinely different conditions:

  (a) PK is in the consensus registry but the supplied signature does not
      verify under it — a real identity-squat attempt, MUST drop.
  (b) PK is NOT yet in the registry — a legitimate fresh-bootstrap peer
      whose identity binding will be installed once its `VrfKeyAnnounce`
      (carrying its own self-signature, verified inline) reaches us.

Treating (b) as a hard error broke fresh-cluster bootstrap end-to-end. The
very first cross-peer connection from each genesis node arrives BEFORE that
peer's PK has been cross-registered, so the connection was dropped before
VrfKeyAnnounce gossip could propagate. Without that gossip the registry
never reached 5 entries, `try_autowrite_genesis_anchors_locked` never
triggered, `genesis_anchors.json` was never written, and the cluster sat at
height=0 indefinitely (observed 4h+ stall on five-node genesis testnet,
~13K rejections per node per error category).

This patch restores the universal L1 invariant that connection-level
handshake admits unknown-identity peers (TLS/QUIC is a transport, not an
identity oracle); identity binding happens through signed messages carried
OVER the connection, not embedded in the handshake itself.

Changes
-------

1. `verify_handshake_proof` (quic_transport.rs) — three-state contract:

   * Ok(true)  — proof present and verifies under registered PK
   * Ok(false) — advisory admit (no proof / pre-init / PK absent)
   * Err       — proof present, PK present, signature mismatch

   The PK-absence branch uses `qnet_consensus::has_consensus_pk` to check
   membership separately from signature verification, so the "PK not
   registered yet" condition no longer collapses into the "signature
   invalid" failure path.

   Receiver-side log messages updated from
   `no_dilithium_proof reason=legacy_peer_phase_2A` to
   `advisory_admit reason=pk_unknown_or_no_proof` since the same Ok(false)
   path now covers three operational causes, not just legacy peers.

2. `verify_consensus_signature` Tier 3 (consensus_crypto.rs) — first-seen
   policy for genesis identities aligned with
   `anchors_missing_boot_decision`:

   * anchors loaded → strict reject (steady-state squat protection)
   * anchors absent + QNET_BOOTSTRAP_FRESH=1 → admit TOFV (signature math
     below the gate is the cryptographic floor — an attacker without the
     SK cannot pass it)
   * anchors absent + no opt-in → strict reject (misconfigured deploy
     surfaces explicitly with the actual flag values in the log)

   Decision logic extracted into `tier3_genesis_first_seen_admit`
   pub(crate) helper so production verify-path and unit tests share a
   single source of truth.

Security
--------

`Ok(false)` from the handshake admits a connection but does NOT
authenticate the peer. Every consensus-relevant message that flows over the
admitted connection still passes through full Dilithium3 verification:

   * VrfKeyAnnounce inline self-signature verify (the registration pathway
     itself is cryptographically gated)
   * `verify_consensus_signature` for TimeoutVote / heartbeat / commit
   * Tier 2 catches any later identity mismatch under a registered PK

The Tier 3 fresh-window admit is gated on the same operator-explicit
QNET_BOOTSTRAP_FRESH=1 flag that already governs whether the process is
allowed to start at all when anchors are absent. Operators who do not set
the flag get the same hard reject as before. Operators who deploy
`genesis_anchors.json` get strict mode regardless of the flag.

Scalability
-----------

Hot-path cost stays sub-1% of a core at 1000+ super-node mesh:

   * one lock-free `has_consensus_pk` check per handshake (~50ns)
   * at most one Dilithium3 verify per handshake (~3ms, only when proof
     actually attached AND PK present)
   * one env-var read + RwLock-read on Tier 3 hits (negligible vs the
     Dilithium3 math that follows on the same call)

Tests
-----

   * New `verify_returns_ok_false_for_unknown_pk_with_proof` in
     qnet-integration confirms admit-on-PK-miss.
   * New `tests_v19_1_tier3_fresh_window` module in qnet-consensus pins the
     policy decision across all three states (strict-when-anchors-loaded,
     admit-when-fresh-window-open, strict-when-misconfigured).
   * All existing v19/v20 regression tests still pass: 73 in qnet-consensus
     (was 70, +3 new), 143 in qnet-integration (was 142, +1 new), 0 fail.

Build
-----

cargo build --release clean in 16m 34s, 0 warnings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
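
The three-state handshake contract listed under Changes can be illustrated with a minimal Rust sketch. This is not the code from quic_transport.rs, which this page does not show. `Registry`, `HandshakeError`, and `verify_handshake_proof_sketch` are hypothetical stand-ins, the `verify` closure takes the place of real Dilithium3 verification, and the "pre-init" cause of an advisory admit is left out for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the consensus PK registry (node id -> public key).
#[derive(Default)]
pub struct Registry {
    pks: HashMap<String, Vec<u8>>,
}

impl Registry {
    /// Record a public key for a node id (illustration only).
    pub fn register(&mut self, node_id: &str, pk: Vec<u8>) {
        self.pks.insert(node_id.to_string(), pk);
    }
}

/// Hypothetical error type: the only hard failure is a bad signature
/// under a PK that IS registered.
#[derive(Debug, PartialEq, Eq)]
pub enum HandshakeError {
    SignatureMismatch,
}

/// Sketch of the three-state contract:
///   Ok(true)  - proof present and verifies under the registered PK
///   Ok(false) - advisory admit (no proof, or PK not registered yet)
///   Err(..)   - proof present, PK registered, signature does not verify
/// `verify(pk, proof)` is an assumed placeholder for the signature check.
pub fn verify_handshake_proof_sketch<F>(
    registry: &Registry,
    node_id: &str,
    proof: Option<&[u8]>,
    verify: F,
) -> Result<bool, HandshakeError>
where
    F: Fn(&[u8], &[u8]) -> bool,
{
    // No proof attached: admit the connection without authenticating it.
    let proof = match proof {
        Some(p) => p,
        None => return Ok(false),
    };
    // Membership is checked separately from signature verification, so
    // "PK unknown" never collapses into "signature invalid".
    let pk = match registry.pks.get(node_id) {
        Some(pk) => pk.as_slice(),
        None => return Ok(false),
    };
    if verify(pk, proof) {
        Ok(true)
    } else {
        Err(HandshakeError::SignatureMismatch)
    }
}
```

In this sketch a caller would drop the connection only on `Err`. An `Ok(false)` result admits the connection but leaves the peer unauthenticated, so every later consensus message still has to pass its own signature check.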
1 parent 12b543c commit d9c7a89

2 files changed

Lines changed: 303 additions & 46 deletions


core/qnet-consensus/src/consensus_crypto.rs

Lines changed: 168 additions & 24 deletions
@@ -1602,36 +1602,91 @@ async fn verify_with_real_dilithium(
         None => {
             // Tier 3: policy depends on identity class.
             if node_id.starts_with("genesis_node_") {
-                // Genesis identity with no registry binding. The boot
-                // sequence of every honest node guarantees a binding is
-                // installed BEFORE P2P traffic is processed, so an
-                // unbound genesis claim arriving here is either:
+                // Genesis identity with no registry binding. Three causes:
                 //   (a) a race against a not-yet-completed self-register
-                //       (transient, will resolve on retry/regossip), or
-                //   (b) a squat attempt from a non-genesis peer.
-                // Both cases are handled identically by hard-rejecting:
-                // case (a) self-heals because the legitimate sender's
-                // gossip continues; case (b) is the attack we exist to
-                // block.
+                //       — transient, resolves once the legitimate sender's
+                //       VrfKeyAnnounce or self-register completes;
+                //   (b) a squat attempt from a non-genesis peer
+                //       presenting their own keypair under a genesis
+                //       node_id; or
+                //   (c) the FIRST sync of a fresh-bootstrap cluster: the
+                //       anchor file does not yet exist and the
+                //       cross-registration round-trip via VrfKeyAnnounce
+                //       has not completed for this peer yet.
+                //
+                // v19.1: The previous policy was a blanket hard-reject.
+                // That broke case (c) end-to-end — fresh genesis clusters
+                // could not bootstrap because the very first cross-peer
+                // consensus message was rejected before the registry
+                // could be populated, leaving every genesis node
+                // permanently isolated.
+                //
+                // Aligned policy:
+                //   * If anchors are loaded (`genesis_anchor_pks_len() > 0`),
+                //     the registry MUST already contain every genesis PK
+                //     (anchors are mirrored into the registry at install
+                //     time). A Tier-3 hit on a genesis identity in that
+                //     state is an actual squat attempt → hard reject.
+                //   * If anchors are absent AND `QNET_BOOTSTRAP_FRESH=1`
+                //     is set (operator opted into the fresh-bootstrap
+                //     race window — same gate that allows the process
+                //     to start in `anchors_missing_boot_decision`), this
+                //     is case (c). Admit (TOFV) and let signature math
+                //     below decide the outcome. An attacker without the
+                //     SK for the claimed PK cannot produce a valid
+                //     signature, so the cryptographic floor is
+                //     preserved; the only state we relax is the
+                //     anchor-binding precheck — which by definition does
+                //     not exist yet during fresh bootstrap.
+                //   * Otherwise (no anchors AND no opt-in) it is a
+                //     misconfigured deploy. Hard reject so the operator
+                //     sees the failure and either deploys anchors or
+                //     opts into fresh mode explicitly.
+                //
+                // Security note: the TOFV admit DOES NOT register the
+                // PK. Registration happens through:
+                //   (1) `VrfKeyAnnounce` handler (inline self-signature
+                //       verify + register_consensus_pk_from_chain), or
+                //   (2) signed `NodeRegistration` TX application.
+                // Both are themselves cryptographic proofs of ownership.
+                // Tier-3 here only widens the message-acceptance gate
+                // during the documented fresh window so those
+                // registration flows can complete.
                 let extracted_prefix = if public_key_bytes.len() >= 8 {
                     hex::encode(&public_key_bytes[..8])
                 } else {
                     String::new()
                 };
-                eprintln!(
-                    "[CRIT][CONSENSUS] genesis_pk_first_seen_rejected node={} extracted={}.. \
-                     action=hard_reject hint=anchor_or_self_register_must_run_before_p2p",
-                    node_id, extracted_prefix
-                );
-                return false;
-            }
-            // Non-genesis identity (Super-node, Light-node, etc.). TOFV
-            // is acceptable; chain-state will lock the canonical binding
-            // shortly via NodeRegistration TX application, after which
-            // any future mismatch is caught by Tier 2 above.
-            if public_key_bytes.len() >= 8 {
-                println!("[WARN][CONSENSUS] pk_first_seen node={} extracted={}..",
-                    node_id, hex::encode(&public_key_bytes[..8]));
+                let anchors_loaded = genesis_anchor_pks_len() > 0;
+                let fresh_opt_in =
+                    std::env::var("QNET_BOOTSTRAP_FRESH").as_deref() == Ok("1");
+                if tier3_genesis_first_seen_admit(anchors_loaded, fresh_opt_in) {
+                    // Case (c): admit TOFV, signature math below is the gate.
+                    println!(
+                        "[WARN][CONSENSUS] genesis_pk_first_seen_admit_fresh_window \
+                         node={} extracted={}.. anchors_loaded=false bootstrap_fresh=true \
+                         hint=signature_math_will_decide",
+                        node_id, extracted_prefix
+                    );
+                    // fall through to math verification
+                } else {
+                    eprintln!(
+                        "[CRIT][CONSENSUS] genesis_pk_first_seen_rejected node={} extracted={}.. \
+                         anchors_loaded={} bootstrap_fresh={} action=hard_reject \
+                         hint=deploy_anchors_or_set_QNET_BOOTSTRAP_FRESH",
+                        node_id, extracted_prefix, anchors_loaded, fresh_opt_in
+                    );
+                    return false;
+                }
+            } else {
+                // Non-genesis identity (Super-node, Light-node, etc.). TOFV
+                // is acceptable; chain-state will lock the canonical binding
+                // shortly via NodeRegistration TX application, after which
+                // any future mismatch is caught by Tier 2 above.
+                if public_key_bytes.len() >= 8 {
+                    println!("[WARN][CONSENSUS] pk_first_seen node={} extracted={}..",
+                        node_id, hex::encode(&public_key_bytes[..8]));
+                }
             }
         }
     }
@@ -2064,3 +2119,92 @@ mod tests_v20_pk_registry {
         LAST_ACTIVITY.remove(id);
     }
 }
+
+// ═══════════════════════════════════════════════════════════════════════════
+// v19.1: REGRESSION TESTS — TIER 3 FRESH-BOOTSTRAP WINDOW
+// ═══════════════════════════════════════════════════════════════════════════
+// The Tier 3 path of `verify_consensus_signature` makes a policy decision
+// for first-seen identities. v19.1 widens the policy for genesis identities
+// during the documented fresh-bootstrap window; these tests pin the new
+// contract:
+//
+//   * anchors loaded → strict reject for first-seen genesis (anchor squat)
+//   * anchors absent + QNET_BOOTSTRAP_FRESH=1 → admit TOFV (signature math
+//     is the cryptographic gate; an attacker without the SK cannot pass it)
+//   * anchors absent + no opt-in → strict reject (misconfigured deploy
+//     surfaces explicitly to the operator)
+//
+// The tests assert the POLICY decision via a dedicated pure helper rather
+// than going through the full `verify_consensus_signature` path — that path
+// also performs Dilithium3 math which would require keypair generation +
+// real signatures and is covered by upper-layer integration tests. The
+// policy helper isolates exactly the v19.1 logic added here.
+// ═══════════════════════════════════════════════════════════════════════════
+
+/// Pure-logic helper for the Tier 3 first-seen genesis policy.
+///
+/// Returns `true` when the connection should ADMIT the first-seen claim
+/// (TOFV — signature math will gate further), `false` when it MUST be
+/// hard-rejected as an identity squat or misconfigured deploy.
+///
+/// Single source of truth for the v19.1 policy: the inline verify-path
+/// uses this helper too (see `verify_dilithium_signature` Tier 3
+/// branch), keeping production behaviour and unit-test assertions in
+/// lockstep.
+pub(crate) fn tier3_genesis_first_seen_admit(
+    anchors_loaded: bool,
+    fresh_opt_in: bool,
+) -> bool {
+    !anchors_loaded && fresh_opt_in
+}
+
+#[cfg(test)]
+mod tests_v19_1_tier3_fresh_window {
+    use super::*;
+
+    /// Anchors loaded means every legitimate genesis PK is already in the
+    /// registry. A first-seen claim for a genesis identity in that state
+    /// is an actual squat attempt → MUST hard reject regardless of any
+    /// fresh-bootstrap opt-in flag. This is the steady-state security
+    /// invariant for genesis identity-key binding.
+    #[test]
+    fn tier3_strict_when_anchors_loaded() {
+        // Even with QNET_BOOTSTRAP_FRESH=1 set: anchors are authoritative.
+        assert!(
+            !tier3_genesis_first_seen_admit(/*anchors_loaded=*/ true, /*fresh_opt_in=*/ true),
+            "anchors loaded + fresh opt-in MUST reject (squat attempt under loaded anchors)"
+        );
+        assert!(
+            !tier3_genesis_first_seen_admit(/*anchors_loaded=*/ true, /*fresh_opt_in=*/ false),
+            "anchors loaded + no opt-in MUST reject"
+        );
+    }
+
+    /// The fresh-bootstrap path: no anchors yet AND operator opted in via
+    /// QNET_BOOTSTRAP_FRESH=1. This is the only state in which a first-seen
+    /// genesis claim is admitted — the signature math below the policy
+    /// gate is what actually verifies the message. Without this admit,
+    /// fresh genesis clusters cannot bootstrap (the v19.0 regression).
+    #[test]
+    fn tier3_admit_when_fresh_window_open() {
+        assert!(
+            tier3_genesis_first_seen_admit(/*anchors_loaded=*/ false, /*fresh_opt_in=*/ true),
+            "anchors absent + fresh opt-in MUST admit (case (c) bootstrap)"
+        );
+    }
+
+    /// Misconfigured deploy: anchors absent AND no opt-in. The operator
+    /// did not deploy `genesis_anchors.json` and did not set
+    /// `QNET_BOOTSTRAP_FRESH=1`. Hard reject so the failure is visible
+    /// in operational logs (`anchors_loaded=false bootstrap_fresh=false`)
+    /// and the operator has to make an explicit choice. Silent admit
+    /// here would hide a deploy bug AND open the squat-on-bootstrap
+    /// race that the anchor file exists to close.
+    #[test]
+    fn tier3_strict_when_no_anchors_no_opt_in() {
+        assert!(
+            !tier3_genesis_first_seen_admit(/*anchors_loaded=*/ false, /*fresh_opt_in=*/ false),
+            "anchors absent + no opt-in MUST reject (misconfigured deploy)"
+        );
+    }
+}
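
The Tier 3 policy in this diff is a two-input truth table, and the flag check `std::env::var("QNET_BOOTSTRAP_FRESH").as_deref() == Ok("1")` is true only when the variable is set to exactly `1`. The sketch below restates both in a labelled form so that the two distinct reject reasons stay visible. `Tier3Decision`, `tier3_decision_sketch`, and `is_fresh_opt_in` are hypothetical names for illustration and are not part of the commit.

```rust
/// Hypothetical labelled form of the Tier 3 first-seen genesis policy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier3Decision {
    /// Anchors are loaded, so an unbound genesis claim is treated as a squat.
    RejectAnchorsLoaded,
    /// No anchors and the operator opted in: admit, signature math decides.
    AdmitFreshWindow,
    /// No anchors and no opt-in: misconfigured deploy, reject loudly.
    RejectMisconfigured,
}

impl Tier3Decision {
    /// True only for the fresh-window admit, matching
    /// `!anchors_loaded && fresh_opt_in` from the diff.
    pub fn admits(self) -> bool {
        matches!(self, Tier3Decision::AdmitFreshWindow)
    }
}

/// Map the two policy inputs to a labelled decision.
pub fn tier3_decision_sketch(anchors_loaded: bool, fresh_opt_in: bool) -> Tier3Decision {
    match (anchors_loaded, fresh_opt_in) {
        // Anchors are authoritative regardless of the opt-in flag.
        (true, _) => Tier3Decision::RejectAnchorsLoaded,
        (false, true) => Tier3Decision::AdmitFreshWindow,
        (false, false) => Tier3Decision::RejectMisconfigured,
    }
}

/// Pure equivalent of the env-var comparison in the diff: only the exact
/// string "1" counts as opt-in; an unset variable or any other value does not.
pub fn is_fresh_opt_in(raw: Option<&str>) -> bool {
    raw == Some("1")
}
```

Passing the raw value in as an `Option<&str>` keeps the flag parsing testable without mutating the process environment, which is an assumption of this sketch and not something the commit does.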
