Commit a396587

docs(core): add failure mode analysis and crash recovery diagram
1 parent 5eaa815 commit a396587

1 file changed

Lines changed: 55 additions & 0 deletions

docs/core/reliability.md

@@ -65,3 +65,58 @@ For transient errors (network glitches, 503s), we implement **Exponential Backof
* **Non-Retryable Errors**: `SyntaxError`, `SecurityViolation`, `ContextWindowExceeded`.

The **Retry Node** in the graph manages the loop, ensuring we don't spiral into an infinite retry storm.

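A bounded retry loop of the kind described above could look roughly like the sketch below. It is illustrative only: `retry_with_backoff`, its parameters, and the placeholder exception classes are assumptions made for this example, not the project's actual Retry Node. Non-retryable errors are re-raised immediately, transient errors are retried with a doubling delay plus a little jitter, and the attempt cap is what prevents a retry storm.

```python
import random
import time


# Hypothetical stand-ins for the project's own error classes.
class SecurityViolation(Exception):
    pass


class ContextWindowExceeded(Exception):
    pass


NON_RETRYABLE = (SyntaxError, SecurityViolation, ContextWindowExceeded)


def retry_with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except NON_RETRYABLE:
            # Retrying cannot fix these, so fail fast.
            raise
        except Exception:
            if attempt == max_attempts:
                # Attempt budget exhausted: surface the last error.
                raise
            # Delay doubles each attempt (0.5s, 1s, 2s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% jitter so concurrent callers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
```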
---

## 4. Operational Scenarios (Runbook)

Strategies for common failure modes:

| Scenario | System Behavior | Error Code | Recovery |
| :--- | :--- | :--- | :--- |
| **Worker Crash (Segfault)** | Request fails immediately. Worker replaced. | `EXECUTOR_CRASH` | Automatic (Next request uses new worker) |
| **DB Timeout (>30s)** | Request cancelled. | `EXECUTION_TIMEOUT` | Retryable (if configured) |
| **DB Outage (5+ fails)** | **Circuit Breaker Trips**. All DB calls rejected fast. | `SERVICE_UNAVAILABLE` | Auto-reset after 30s |
| **LLM Rate Limit (429)** | **Backoff & Retry**. Does NOT trip breaker. | `None` (Handled internally) | Exponential Backoff |
| **LLM Outage (503)** | **Circuit Breaker Trips**. Fails fast. | `SERVICE_UNAVAILABLE` | Auto-reset after 60s |

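The breaker rows in the table can be pictured with a minimal sketch like the one below. Everything in it is an assumption made for illustration: the `CircuitBreaker` class, the `RateLimited` marker exception, and the half-open trial call after the reset timeout are not taken from the project's code. The defaults mirror the DB row (trip after 5 failures, reset after 30s), and exceptions listed in `ignored` pass through without counting, which is how a 429 can avoid tripping the breaker.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class RateLimited(Exception):
    """Hypothetical marker for an HTTP 429 response."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0,
                 ignored=(), clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.ignored = ignored      # exception types that never count as failures
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the downstream service.
                raise CircuitOpenError("SERVICE_UNAVAILABLE")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except self.ignored:
            # e.g. rate limits: handled by backoff elsewhere, not counted here.
            raise
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Trip, or re-open after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Under these assumptions, an LLM breaker would be constructed as `CircuitBreaker(reset_timeout=60.0, ignored=(RateLimited,))`, while the DB breaker keeps the defaults.
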
### Deep Dive: Worker Crash Recovery

The following diagram shows how the system handles a "Hard Crash" (Segfault/OOM) in the worker process:

```mermaid
sequenceDiagram
    participant Main as ExecutorNode (Main Process)
    participant Breaker as DB_BREAKER
    participant Sandbox as SandboxManager
    participant Worker as Worker Process (PID: 1234)

    Main->>Breaker: Call _execute_guarded()
    activate Breaker

    Breaker->>Sandbox: execute_in_sandbox(request)
    activate Sandbox

    Sandbox->>Worker: Dispatch Task (Pickle)
    activate Worker

    Note right of Worker: SEGFAULT / OOM
    Worker--xSandbox: Process Terminated (SIGSEGV / SIGKILL)
    deactivate Worker

    Sandbox->>Sandbox: Catch BrokenProcessPool
    Sandbox-->>Breaker: Return Result(success=False, metrics={'is_crash': 1})
    deactivate Sandbox

    Breaker->>Breaker: Check metrics['is_crash']
    Breaker->>Breaker: Raise RuntimeError("Sandbox Crash")

    Note over Breaker: Failure Count++

    Breaker-->>Main: Re-raise RuntimeError
    deactivate Breaker

    Main->>Main: Catch RuntimeError
    Main->>Main: Log CRITICAL Error
    Main-->>User: Return ErrorCode.EXECUTOR_CRASH
```
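The same flow can be approximated in Python with `concurrent.futures`. This is a rough sketch under stated assumptions, not the project's implementation: the `Result` dataclass, the `pool_factory` parameter, and the `execute_guarded` and `run` helpers are hypothetical names chosen to mirror the diagram, and the circuit breaker's failure counting is left out for brevity. Only `ProcessPoolExecutor` and `BrokenProcessPool` are real standard-library names.

```python
import logging
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool
from dataclasses import dataclass, field

logger = logging.getLogger("executor")


@dataclass
class Result:
    success: bool
    value: object = None
    metrics: dict = field(default_factory=dict)


class SandboxManager:
    def __init__(self, pool_factory=lambda: ProcessPoolExecutor(max_workers=1)):
        self._pool_factory = pool_factory
        self._pool = pool_factory()

    def execute_in_sandbox(self, fn, *args):
        try:
            return Result(True, self._pool.submit(fn, *args).result())
        except BrokenProcessPool:
            # A worker died abruptly (e.g. segfault or OOM kill). The pool is
            # unusable afterwards, so replace it for the next request.
            self._pool.shutdown(wait=False)
            self._pool = self._pool_factory()
            return Result(False, metrics={"is_crash": 1})


def execute_guarded(sandbox, fn, *args):
    # In the diagram this step runs under the breaker, which counts the failure.
    result = sandbox.execute_in_sandbox(fn, *args)
    if result.metrics.get("is_crash"):
        raise RuntimeError("Sandbox Crash")
    return result


def run(sandbox, fn, *args):
    try:
        return execute_guarded(sandbox, fn, *args).value
    except RuntimeError:
        logger.critical("worker crashed", exc_info=True)
        return "EXECUTOR_CRASH"  # stand-in for ErrorCode.EXECUTOR_CRASH
```

With a real process pool, the submitted function must be picklable, and on platforms that spawn workers the calling code belongs under an `if __name__ == "__main__":` guard.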
