# HA Implementation Roadmap & CLI Expansion Plan

This document is the engineering blueprint for migrating the QuanuX Tier 1 Server from a single-node Python orchestrator to a fully distributed Global Supercluster.

## Phase 1: NATS JetStream and the Analytical Boundary
**Objective:** Replace local state management with a globally distributed KV lock while enforcing the boundary between Execution State and Analytical Memory.

1. **Initialize Global Bucket:** Create a NATS JetStream Key-Value bucket named `quanux_cluster_state`, replicated across $N$ geographic regions.
2. **Lock Definition:** Define the primary lock `quanux.tier1.leader`.
3. **State Reflection (The Dichotomy):**
   - **Control State (NATS JetStream):** Risk profiles, Supervisor limits, active deployments, and real-time orchestration events are strictly appended to the JetStream Event Log for immediate HA replay.
   - **Analytical State (Hybrid):** Historic tick data, heavy backtesting memory, and massive order-flow archives must NEVER be stored in NATS. They are offloaded to a user-configured storage engine (DuckDB, HDF5, NAS). HA failover guarantees the Control State only, not the analytics volume.
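The Control/Analytical dichotomy can be enforced as a routing predicate at the write path. The sketch below is illustrative only — the event kinds and store names are assumptions, not the actual Tier 1 schema:

```python
from dataclasses import dataclass

# Hypothetical event taxonomy; the real one lives in the Tier 1 schema.
CONTROL_KINDS = {"risk_profile", "supervisor_limit", "deployment", "orchestration_event"}
ANALYTICAL_KINDS = {"tick_archive", "backtest_memory", "orderflow_archive"}

@dataclass(frozen=True)
class StateEvent:
    kind: str
    payload: bytes

def route_event(event: StateEvent) -> str:
    """Decide which store an event belongs to.

    Control state is appended to the NATS JetStream Event Log (HA-replayable);
    analytical state must never touch NATS and is offloaded to the
    user-configured engine (DuckDB, HDF5, NAS).
    """
    if event.kind in CONTROL_KINDS:
        return "jetstream"
    if event.kind in ANALYTICAL_KINDS:
        return "analytical_store"
    raise ValueError(f"unclassified event kind: {event.kind}")
```

Making the classifier raise on unknown kinds keeps the boundary strict: nothing lands in NATS by default.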

## Phase 2: Python Background Leader Election Loop
**Objective:** Implement the Active-Passive heartbeat and Raft observer within the FastAPI backend, with defensive STONITH timeouts.

1. **The Observer Task (FastAPI Lifespan):** The Tier 1 Server is driven by FastAPI. The background `asyncio` task (`GlobalSentinelLoop`) does not run in a vacuum; it is spun up and managed by the FastAPI lifespan context manager (or startup/shutdown events). The FastAPI routing logic must also be Raft-aware: while the node is a Follower, orchestration endpoints must automatically reject or redirect traffic until the node is promoted to Leader.
2. **Heartbeat Maintenance:** While the server is Leader, it writes the current timestamp to `quanux.tier1.leader` every `50ms`.
3. **The Watcher:** If the server is a Follower, it establishes a JetStream Watcher on the lock. If the lock's TTL is exceeded (e.g., no update for `250ms`), the server attempts a targeted `Update` with its own `Node_ID` to seize the lock.
4. **Apoptosis Hook (Defended):** Upon acquiring the lock, the backend triggers `execute_stonith(old_leader_id)`. **CRITICAL:** this call must have a strict hard timeout (e.g., `2000ms`); if the IPMI interface of the dead datacenter is offline, the call must not block indefinitely. If the script hits the timeout, it abandons the lock, enters a `CRITICAL_PENDING` state, and fires a severe alarm via `quanuxctl`/SMS/PagerDuty to the Architect.
5. **State Rehydration:** Once fencing is verified, the server replays the NATS Event Log to rehydrate the application state and begins accepting `quanuxctl` and Nest connections.
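Steps 2–4 reduce to a small core of decision logic. The sketch below is a minimal illustration, not the production loop: it assumes `execute_stonith` is injected as a coroutine, and the constants simply restate the figures above (`50ms` heartbeat, `250ms` TTL, `2000ms` STONITH cap):

```python
import asyncio

HEARTBEAT_MS = 50        # Leader heartbeat interval (step 2)
LOCK_TTL_MS = 250        # staleness threshold before a Follower may seize (step 3)
STONITH_TIMEOUT_S = 2.0  # hard cap on the apoptosis hook (step 4)

def lock_expired(last_heartbeat_ms: int, now_ms: int, ttl_ms: int = LOCK_TTL_MS) -> bool:
    """A Follower may attempt the targeted Update once the lock TTL is exceeded."""
    return (now_ms - last_heartbeat_ms) > ttl_ms

async def fence_old_leader(execute_stonith, old_leader_id: str,
                           timeout_s: float = STONITH_TIMEOUT_S) -> str:
    """Run the STONITH hook under a strict hard timeout.

    Returns "FENCED" on success, or "CRITICAL_PENDING" when the hook times
    out (e.g. the dead datacenter's IPMI interface is unreachable) -- in
    that case the caller must abandon the lock and fire the severe alarm.
    """
    try:
        await asyncio.wait_for(execute_stonith(old_leader_id), timeout=timeout_s)
        return "FENCED"
    except asyncio.TimeoutError:
        return "CRITICAL_PENDING"
```

Wrapping the hook in `asyncio.wait_for` is what guarantees the sentinel loop can never be wedged by a dead IPMI endpoint.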

## Phase 3: The `quanuxctl` CLI Expansion
**Objective:** Grant the Architect "God-Mode" over the Raft cluster and the manual failover hierarchy. Any automated clustering protocol must have deterministic manual overrides.

The `quanuxctl` CLI will be expanded to include the `cluster` command group.
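A minimal Typer skeleton for the `cluster` group could look as follows. The command bodies are stubs — the real implementations would query NATS and drive the election logic — and the echoed strings are placeholders:

```python
import typer

app = typer.Typer(help="quanuxctl: QuanuX Tier 1 control CLI (sketch)")
cluster_app = typer.Typer(help="Raft cluster God-Mode overrides")
app.add_typer(cluster_app, name="cluster")

@cluster_app.command()
def status() -> None:
    """Query NATS for global supercluster telemetry (stubbed)."""
    typer.echo("leader: <node-id>  followers: []  lock: quanux.tier1.leader")

@cluster_app.command()
def promote(node_id: str) -> None:
    """Force a manual Raft election override toward NODE_ID."""
    typer.echo(f"promoting {node_id}")

@cluster_app.command()
def demote() -> None:
    """Ask the current Leader to step down without naming a successor."""
    typer.echo("demoting current leader")

@cluster_app.command()
def fence(node_id: str) -> None:
    """Manually trigger STONITH against NODE_ID, bypassing consensus."""
    typer.echo(f"fencing {node_id}")

if __name__ == "__main__":
    app()
```

Using `app.add_typer(..., name="cluster")` is the standard Typer mechanism for sub-command groups, so the four commands below map directly to `quanuxctl cluster <verb>`.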

### `quanuxctl cluster status`
* **Action:** Queries NATS for the telemetry of the global supercluster.
* **Output:**
  * Identifies the current **Leader** (Node ID, Region, Uptime).
  * Lists all **Followers** (Node IDs, Ping to Leader, Replay Lag).
  * Displays the health and state of the `quanux.tier1.leader` lock.
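The status view can be shaped from raw per-node telemetry with a small helper. `NodeTelemetry` and its fields are hypothetical stand-ins for whatever the NATS telemetry subjects actually carry:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeTelemetry:
    node_id: str
    region: str
    uptime_s: int
    ping_ms: float     # round-trip to the current Leader
    replay_lag: int    # events behind the JetStream Event Log head

def summarize_cluster(nodes: list[NodeTelemetry], leader_id: str) -> dict:
    """Shape raw telemetry into the `cluster status` view (illustrative)."""
    leader = next(n for n in nodes if n.node_id == leader_id)
    followers = [n for n in nodes if n.node_id != leader_id]
    return {
        "leader": {"node_id": leader.node_id, "region": leader.region,
                   "uptime_s": leader.uptime_s},
        "followers": [{"node_id": f.node_id, "ping_ms": f.ping_ms,
                       "replay_lag": f.replay_lag} for f in followers],
        "lock": "quanux.tier1.leader",
    }
```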

### `quanuxctl cluster promote <node_id>`
* **Action:** Forces a manual Raft election override.
* **Execution:** Administratively commands the current Leader to drop the lock and artificially boosts the priority/election timer of the specified `<node_id>` so it is guaranteed to become the new Leader.
* **Use Case:** Pre-emptive maintenance of a datacenter, or shifting latency footprints before major economic releases.
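One way to "artificially boost" the election timer is to collapse the promoted node's randomized timeout to near zero so it always calls the election first. A sketch under assumed Raft timeout defaults (`base_ms`/`jitter_ms` are illustrative, not tuned values):

```python
import random
from typing import Optional

def election_timeout_ms(node_id: str, promoted_id: Optional[str],
                        base_ms: int = 150, jitter_ms: int = 150,
                        rng: Optional[random.Random] = None) -> int:
    """Randomized Raft election timeout with a manual-promotion override.

    The promoted node's timeout collapses to 1 ms so it reliably starts the
    election first; all other nodes keep the usual randomized window, which
    is what normally prevents split votes.
    """
    if node_id == promoted_id:
        return 1
    rng = rng or random.Random()
    return base_ms + rng.randrange(jitter_ms)
```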

### `quanuxctl cluster demote`
* **Action:** Forces the current Leader to step down gracefully without explicitly assigning a successor.
* **Execution:** The Leader deletes its lock on `quanux.tier1.leader` and enters a 5-second backoff period during which it refuses to vote or stand for election, allowing the remaining Followers to elect a new Leader.
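The backoff window is a simple gate the ex-Leader consults before voting or standing in any election; a minimal sketch:

```python
DEMOTE_BACKOFF_S = 5.0  # refusal window after a voluntary step-down

def may_participate(stepped_down_at_s: float, now_s: float,
                    backoff_s: float = DEMOTE_BACKOFF_S) -> bool:
    """During the backoff the ex-Leader neither votes nor stands for election,
    guaranteeing the remaining Followers settle leadership among themselves."""
    return (now_s - stepped_down_at_s) >= backoff_s
```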

### `quanuxctl cluster fence <node_id>`
* **Action:** Manually triggers STONITH (Apoptosis) against a rogue or "zombie" node.
* **Execution:** Bypasses Raft consensus entirely and immediately fires the deepest available fencing mechanism (Cryptographic -> OS -> Hardware) against the specified Node ID.
* **Use Case:** Resolving complex network splits, or permanently blinding a node that has been compromised or is behaving erratically outside normal cluster logic.
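The Cryptographic -> OS -> Hardware hierarchy can be modeled as an escalation chain. The mechanism hooks below are placeholders — real implementations would revoke signing keys, kill the OS, or cut power over IPMI:

```python
from typing import Callable, Mapping

FENCE_ORDER = ("cryptographic", "os", "hardware")  # shallowest to deepest

def fence_node(node_id: str, mechanisms: Mapping[str, Callable[[str], bool]]) -> str:
    """Escalate through the fencing hierarchy until one layer confirms isolation.

    Each hook returns True once the target is confirmed blinded. If every
    available layer fails, the node is still live and an operator must
    intervene, so we raise rather than report success.
    """
    for layer in FENCE_ORDER:
        hook = mechanisms.get(layer)
        if hook is not None and hook(node_id):
            return layer
    raise RuntimeError(f"all fencing layers failed for {node_id}")
```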

---
**Execution Mandate:** Development must proceed linearly from Phase 1 to Phase 3. The foundational AI context (`tier1_ha_skill.md`) provides the parameters. Code generation must strictly reference this plan when structuring the `quanuxctl` Typer framework extensions.