| 1 | +# QuanuX Tier 1 HA: The 3:00 AM Panic Runbook |
| 2 | + |
| 3 | +> [!CAUTION] |
| 4 | +> **READ THIS FIRST:** If you are reading this at 3:00 AM, the cluster is screaming. Do not think. Execute the physical laws of the system. |
| 5 | + |
| 6 | +## The Rule of the KV Lock |
| 7 | +**The Truth lives ONLY in NATS JetStream.** Whoever holds the `quanux.tier1.leader` KV lock is the Leader. No exceptions. |
| 8 | + |
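The lock rule can be illustrated with a minimal in-memory sketch. It is not the NATS JetStream API and not QuanuX code: `TtlLock`, its methods, and the 5-second TTL used in the usage note are hypothetical stand-ins, assuming only that the lock is a single key that one node holds until its TTL lapses.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class _Entry:
    holder: str
    expires_at: float


class TtlLock:
    """In-memory stand-in for a single-key lock with a TTL (hypothetical, not NATS)."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entry: Optional[_Entry] = None

    def holder(self) -> Optional[str]:
        """Return the current holder, or None if the lock is free or expired."""
        entry = self._entry
        if entry is not None and self._clock() < entry.expires_at:
            return entry.holder
        return None

    def acquire(self, node_id: str) -> bool:
        """Take the lock if it is free, or renew it if node_id already holds it."""
        current = self.holder()
        if current is not None and current != node_id:
            return False  # another node is Leader; do not overwrite
        self._entry = _Entry(node_id, self._clock() + self._ttl_s)
        return True
```

With a 5-second TTL, `acquire("nyc")` succeeds, a subsequent `acquire("lon")` fails while NYC keeps renewing, and `acquire("lon")` succeeds only once NYC's entry has expired.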
| 9 | +--- |
| 10 | + |
| 11 | +## 1. If leader dies, do this |
| 12 | +**Symptoms:** The Leader node becomes unresponsive or drops off the network. |
| 13 | +**Standard Protocol:** The system is designed to auto-failover and STONITH the dead node. If you receive an alert that the leader died but the auto-failover succeeded, **no immediate action is required**. Monitor the cluster until the dead node can be physically replaced. |
| 14 | + |
| 15 | +--- |
| 16 | + |
| 17 | +## 2. If OOB network unavailable, do this (Split-Brain / Failed STONITH) |
| 18 | +**Symptoms:** You have two nodes claiming to be Leader. The API is flapping. |
| 19 | +**Cause:** The OOB hardware power-kill (STONITH) did not complete within its 2000 ms timeout during an election. |
| 20 | + |
| 21 | +### Recovery Steps: |
| 22 | +1. **Identify the true lock-holding node:** |
| 23 | + ```bash |
| 24 | + quanuxctl cluster status |
| 25 | + ``` |
| 26 | +2. **Manually Fence the Usurper:** |
| 27 | + Identify the node that DOES NOT hold the lock and terminate it with extreme prejudice. |
| 28 | + ```bash |
| 29 | + quanuxctl cluster fence <rogue_node_id> |
| 30 | + ``` |
| 31 | +3. **Verify App State:** Confirm that FastAPI on the Leader that holds the lock is running the heartbeat loop. |
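The decision in steps 1 and 2 reduces to a set difference: every node that claims leadership but does not hold the lock is a fencing candidate. The sketch below assumes you have already read the lock holder and the list of self-declared Leaders from `quanuxctl cluster status`; the function name and its inputs are hypothetical, and it does not parse any real command output.

```python
from typing import Iterable, List


def nodes_to_fence(lock_holder: str, claimed_leaders: Iterable[str]) -> List[str]:
    """Return the nodes that claim leadership without holding the KV lock.

    lock_holder: node ID that currently holds the lock (the only true Leader).
    claimed_leaders: node IDs that report themselves as Leader.
    """
    # Deduplicate and sort so the result is stable for logging and review.
    return sorted({node for node in claimed_leaders if node != lock_holder})
```

For example, with `lon` holding the lock and both `nyc` and `lon` claiming leadership, the result is `["nyc"]`, which is the ID to pass to `quanuxctl cluster fence`.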
| 32 | + |
| 33 | +--- |
| 34 | + |
| 35 | +## 3. If KV stuck, do this (Total Cluster Freeze) |
| 36 | +**Symptoms:** The Leader is dead (e.g., kernel panic, completely dark), but no Follower is spinning up to take its place. |
| 37 | +**Cause:** A NATS JetStream edge case: the Leader disconnected uncleanly, and the lock's TTL has either not yet expired or is hung. |
| 38 | + |
| 39 | +### Recovery Steps: |
| 40 | +1. **Force Promotion on a Follower:** |
| 41 | + Pick the healthiest Follower (e.g., the geographically closest standby) and force a Raft election override. |
| 42 | + ```bash |
| 43 | + quanuxctl cluster promote <fallback_node_id> |
| 44 | + ``` |
| 45 | +2. **If that fails, Demote the Ghost Leader:** |
| 46 | + ```bash |
| 47 | + quanuxctl cluster demote |
| 48 | + ``` |
| 49 | +3. Wait at least 3 seconds for the BGP and Anycast IP shift; section 4 notes that convergence can take up to 180 seconds. |
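The wait in step 3 can be sketched as a bounded poll rather than a fixed sleep. The bounds come from this runbook (a 3-second minimum here, and the 3 to 180 second convergence range cited in section 4); the function name, the `is_converged` probe, and the 1-second poll interval are assumptions for illustration only.

```python
import time
from typing import Callable


def wait_for_convergence(
    is_converged: Callable[[], bool],
    min_wait_s: float = 3.0,
    max_wait_s: float = 180.0,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until is_converged() is true or max_wait_s has elapsed."""
    start = clock()
    sleep(min_wait_s)  # never probe before the minimum shift time
    while True:
        if is_converged():
            return True
        if clock() - start >= max_wait_s:
            return False  # give up; escalate instead of waiting forever
        sleep(poll_s)
```

The `is_converged` callable is whatever check you trust at 3:00 AM, such as confirming that the promoted node answers on the Anycast address. A `False` return means the upper bound passed without convergence.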
| 50 | + |
| 51 | +--- |
| 52 | + |
| 53 | +## 4. If edge execution nodes detach, do this (The "Long-Dark" & Control Plane Genesis) |
| 54 | +**Symptoms:** Sub-nodes (Tier 4 Execution Nests like SFO) are dropping connection to the Control Plane but still executing trades, or they boot and print "Awaiting Control Plane Genesis". |
| 55 | +**Cause:** |
| 56 | +- *The Long-Dark:* Global Anycast BGP convergence takes 3 to 180 seconds. |
| 57 | +- *Genesis Race Condition:* A Nest booted before the Leader and encountered a `BucketNotFoundError` because the NATS bucket doesn't exist yet. |
| 58 | +**Action:** Let the Ritchie FSM run. *Do nothing.* Edge nodes will blindly execute exits and halt entries. They will safely wait in the dark and automatically reconnect when NATS becomes reachable or the Leader creates the bucket. |
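The behaviour described above (exits always allowed, entries only while connected) can be sketched as a small state machine. This is not the actual Ritchie FSM: the state names, class names, and transition rules below are assumptions reconstructed from this section's description.

```python
from enum import Enum, auto


class LinkState(Enum):
    CONNECTED = auto()
    LONG_DARK = auto()          # was connected, Control Plane now unreachable
    AWAITING_GENESIS = auto()   # Leader has not created the bucket yet


class OrderKind(Enum):
    ENTRY = auto()
    EXIT = auto()


class EdgeFsm:
    """Hypothetical edge-node gate: halts entries whenever the node is not connected."""

    def __init__(self) -> None:
        self.state = LinkState.AWAITING_GENESIS

    def on_control_plane(self, reachable: bool, bucket_exists: bool) -> LinkState:
        """Update the state from the latest connectivity probe."""
        if not reachable:
            # Only a previously connected node enters the Long-Dark;
            # a node that never connected keeps awaiting genesis.
            if self.state is LinkState.CONNECTED:
                self.state = LinkState.LONG_DARK
        elif not bucket_exists:
            self.state = LinkState.AWAITING_GENESIS
        else:
            self.state = LinkState.CONNECTED
        return self.state

    def allows(self, kind: OrderKind) -> bool:
        """Exits are always allowed; entries only while connected."""
        return kind is OrderKind.EXIT or self.state is LinkState.CONNECTED
```

Under these assumptions a Nest that boots before the Leader stays in `AWAITING_GENESIS`, and a connected Nest that loses NATS drops to `LONG_DARK`. Both return to `CONNECTED` on the first probe that finds NATS reachable and the bucket present, with no operator action.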
| 59 | + |
| 60 | +--- |
| 61 | + |
| 62 | +## Reference Walkthrough: DigitalOcean 3-Node Chaos Engineering Test |
| 63 | +*Completed March 2, 2026 across NYC, LON, and SFO components.* |
| 64 | + |
| 65 | +This deployment validates the physical boundaries of our high-availability architecture. |
| 66 | +1. **The Setup**: |
| 67 | + - NYC (Node A): Primary Leader holding NATS KV lock. |
| 68 | + - LON (Node B): Follower, watching NATS. |
| 69 | + - SFO (Node C): Tier 4 Execution edge node. |
| 70 | +2. **The Induction**: NYC eth0 interface was artificially dropped (simulating catastrophic instance failure). |
| 71 | +3. **The Lock Release**: NATS JetStream eventually registered that the NYC session had dropped, and the lock was released. |
| 72 | +4. **The STONITH execution**: LON acquired the lock. LON's Sentinel loop immediately triggered a DigitalOcean API call to power off NYC within 2000 ms, preventing split-brain if eth0 returned. |
| 73 | +5. **The Long-Dark**: SFO lost connection to NYC. SFO engaged the Ritchie FSM, blocking new entries but dumping active exposure. |
| 74 | +6. **Convergence**: Within ~74 seconds, Global Anycast BGP converged to LON. SFO reconnected to LON, recognized the new Leader heartbeat, and resumed normal operation. |