@@ -65,3 +65,58 @@ For transient errors (network glitches, 503s), we implement **Exponential Backof
* **Non-Retryable Errors**: `SyntaxError`, `SecurityViolation`, `ContextWindowExceeded`.

The **Retry Node** in the graph manages the loop, ensuring we don't spiral into an infinite retry storm.
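
As a rough illustration of the retry loop described above, here is a minimal sketch. The names `run_with_retry` and `backoff_delay`, the attempt limit, and the delay values are assumptions made for this example, not the project's actual Retry Node; the two custom exception classes are hypothetical stand-ins for the project's own error types.

```python
import time


class SecurityViolation(Exception):
    """Hypothetical stand-in for the project's SecurityViolation error."""


class ContextWindowExceeded(Exception):
    """Hypothetical stand-in for the project's ContextWindowExceeded error."""


# Errors that retrying cannot fix; they are re-raised immediately.
NON_RETRYABLE = (SyntaxError, SecurityViolation, ContextWindowExceeded)


def backoff_delay(attempt, base=0.5, cap=8.0):
    """Delay before the next try: base * 2**attempt, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


def run_with_retry(task, max_attempts=4, sleep=time.sleep):
    """Call `task` until it succeeds, a non-retryable error occurs,
    or `max_attempts` tries have been used (assumed limit)."""
    for attempt in range(max_attempts):
        try:
            return task()
        except NON_RETRYABLE:
            raise  # never retried
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempt budget exhausted: stop instead of looping forever
            sleep(backoff_delay(attempt))
```

The hard cap on attempts is what prevents a retry storm in this sketch; the `sleep` parameter is injectable only so the loop can be exercised without real waiting.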

---

## 4. Operational Scenarios (Runbook)

How the system behaves, and how it recovers, in common failure modes:

| Scenario | System Behavior | Error Code | Recovery |
| :--- | :--- | :--- | :--- |
| **Worker Crash (Segfault)** | Request fails immediately. Worker replaced. | `EXECUTOR_CRASH` | Automatic (next request uses new worker) |
| **DB Timeout (>30s)** | Request cancelled. | `EXECUTION_TIMEOUT` | Retryable (if configured) |
| **DB Outage (5+ fails)** | **Circuit Breaker Trips**. All DB calls rejected fast. | `SERVICE_UNAVAILABLE` | Auto-reset after 30s |
| **LLM Rate Limit (429)** | **Backoff & Retry**. Does NOT trip breaker. | None (handled internally) | Exponential Backoff |
| **LLM Outage (503)** | **Circuit Breaker Trips**. Fails fast. | `SERVICE_UNAVAILABLE` | Auto-reset after 60s |
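
The breaker rows in the table can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's breaker: the class and exception names are hypothetical, the threshold of 5 failures and the 30s reset are taken from the DB Outage row, and the rule that 429s do not count is taken from the LLM Rate Limit row.

```python
import time


class CircuitOpenError(Exception):
    """Hypothetical: raised when a call is rejected without being executed."""


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 from the LLM provider."""


class CircuitBreaker:
    def __init__(self, fail_max=5, reset_timeout=30.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of hitting the broken dependency.
                raise CircuitOpenError("SERVICE_UNAVAILABLE")
            # Reset timeout elapsed: allow one trial call through.
        try:
            result = func(*args, **kwargs)
        except RateLimitError:
            # 429s are left to backoff and do not count as breaker failures.
            raise
        except Exception:
            self.failures += 1
            if self.failures >= self.fail_max:
                self.opened_at = self.clock()  # trip, or re-trip after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Under this sketch, a DB breaker would be built with `reset_timeout=30.0` and an LLM breaker with `reset_timeout=60.0`, matching the recovery column.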

### Deep Dive: Worker Crash Recovery

The following diagram shows how the system handles a hard crash (segfault or OOM kill) in the worker process:

```mermaid
sequenceDiagram
    participant Main as ExecutorNode (Main Process)
    participant Breaker as DB_BREAKER
    participant Sandbox as SandboxManager
    participant Worker as Worker Process (PID: 1234)

    Main->>Breaker: Call _execute_guarded()
    activate Breaker

    Breaker->>Sandbox: execute_in_sandbox(request)
    activate Sandbox

    Sandbox->>Worker: Dispatch Task (Pickle)
    activate Worker

    Note right of Worker: SEGFAULT / OOM
    Worker--xSandbox: Process Terminated (SIGSEGV or SIGKILL)
    deactivate Worker

    Sandbox->>Sandbox: Catch BrokenProcessPool
    Sandbox-->>Breaker: Return Result(success=False, metrics={'is_crash': 1})
    deactivate Sandbox

    Breaker->>Breaker: Check metrics['is_crash']
    Breaker->>Breaker: Raise RuntimeError("Sandbox Crash")

    Note over Breaker: Failure Count++

    Breaker-->>Main: Re-raise RuntimeError
    deactivate Breaker

    Main->>Main: Catch RuntimeError
    Main->>Main: Log CRITICAL Error
    Main-->>User: Return ErrorCode.EXECUTOR_CRASH
```
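
The hand-offs in the diagram might look roughly like this in code. This is a sketch under assumptions rather than the actual `SandboxManager`: the `Result` dataclass, the `ErrorCode` enum, and the three function signatures are hypothetical shapes inferred from the diagram labels, and the breaker's failure counting is omitted. `BrokenProcessPool` is the real exception that `concurrent.futures` raises when a worker process dies abruptly.

```python
import logging
from concurrent.futures.process import BrokenProcessPool
from dataclasses import dataclass, field
from enum import Enum


class ErrorCode(Enum):
    EXECUTOR_CRASH = "EXECUTOR_CRASH"  # hypothetical mirror of the project's enum


@dataclass
class Result:
    success: bool
    output: object = None
    metrics: dict = field(default_factory=dict)


def execute_in_sandbox(pool, func, *args):
    """Sandbox step: run `func` in a worker and report a dead worker as data."""
    try:
        return Result(success=True, output=pool.submit(func, *args).result())
    except BrokenProcessPool:
        # The worker was killed (segfault, OOM kill); no Python exception came back.
        return Result(success=False, metrics={"is_crash": 1})


def execute_guarded(pool, func, *args):
    """Breaker step: turn the crash flag into an exception so it counts as a failure."""
    result = execute_in_sandbox(pool, func, *args)
    if result.metrics.get("is_crash"):
        raise RuntimeError("Sandbox Crash")
    return result


def executor_node(pool, func, *args):
    """Main-process step: log the crash and map it to an error code."""
    try:
        return execute_guarded(pool, func, *args).output
    except RuntimeError:
        logging.critical("Sandbox worker crashed")
        return ErrorCode.EXECUTOR_CRASH
```

Here `pool` would typically be a `concurrent.futures.ProcessPoolExecutor`. Once a pool has raised `BrokenProcessPool` it stays unusable, so replacing the worker, as the table describes, would mean creating a fresh pool before the next request.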