Commit 316df25 ("docs: Add ADR-001 Sandboxed Execution")
# ADR-001: Sandboxed Execution & Indexing Plane (ENH-001)

## 1. Problem Statement

### The Core Issue: "Blast Radius" & Reliability
Currently, the NL2SQL Agent executes SQL queries via `ExecutorNode`, performs dry-runs via `PhysicalValidatorNode`, and indexes schemas via `OrchestratorVectorStore` **in-process**.

This couples the stability of the Agent to the stability of the underlying SQL drivers and the database itself.

* **Risk 1: Application Crashes (Segfaults)**
  * SQL drivers (ODBC/C extensions) running in the main process can trigger segmentation faults on malformed data or driver bugs.
  * **Impact**: The entire Agent process dies, terminating all concurrent user sessions.
* **Risk 2: Resource Starvation (The "Noisy Neighbor")**
  * Schema indexing (`fetch_schema`) is a heavy, long-running I/O and CPU operation.
  * **Impact**: If run in the main event loop, it starves query execution, causing timeouts for other users.
* **Risk 3: Security (Defense in Depth)**
  * **True risk**: If a driver vulnerability allows RCE, the attacker gains the privileges of the Agent (access to all secrets/envs).
  * **Mitigation goal**: Run execution in a lower-privilege context.
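Risk 2 is concrete: a single synchronous `fetch_schema` call inside the event loop blocks every other coroutine until it returns. A minimal sketch of the fix, offloading the blocking crawl to a worker process so the loop keeps serving queries (the function body is a hypothetical stand-in, not the real crawl):

```python
import asyncio
import multiprocessing
from concurrent.futures import ProcessPoolExecutor


def fetch_schema(datasource_id: str) -> list:
    # Stand-in for the heavy, blocking schema crawl.
    return [f"{datasource_id}.table_{i}" for i in range(3)]


async def index_datasource(datasource_id: str) -> list:
    loop = asyncio.get_running_loop()
    # "fork" keeps this self-contained sketch picklable (POSIX-only).
    # Offloading means the event loop never blocks on the crawl, so
    # user-facing queries keep being served while indexing runs.
    with ProcessPoolExecutor(
        max_workers=1, mp_context=multiprocessing.get_context("fork")
    ) as pool:
        return await loop.run_in_executor(pool, fetch_schema, datasource_id)
```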
## 2. Non-Goals

* **Prompt Injection**: This architecture does not solve LLM jailbreaks (handled by `IntentValidator`).
* **Semantic Correctness**: We do not guarantee the SQL is "right", only that executing it won't crash the Agent.
* **Query Optimization**: We are not rewriting queries, just running them safely.
## 3. Options Analysis

### Option A: Python Multiprocessing (Process Pool) - **Recommended**

Spawn worker processes on the same machine (process-boundary isolation).

* **Mechanism**:
  * **Execution Pool**: A low-latency pool for `ExecutorNode` and `PhysicalValidatorNode` (user-facing queries).
  * **Indexing Pool**: A separate, lower-priority pool for `OrchestratorVectorStore` (background tasks).
* **Controls**:
  * `max_workers`: Hard cap on concurrency to prevent OOM.
  * `initializer`: Set up a persistent, per-worker SQLAlchemy `Engine` for connection pooling.
  * `timeouts`: Hard kill signals for stuck processes.
* **Pros**:
  * **Crash Safety**: The parent catches `BrokenProcessPool` and survives.
  * **Performance**: ~1-10ms overhead (warm worker).
  * **Simplicity**: Deployment agnostic (works on local, Docker, K8s).
* **Cons**:
  * **Weak Security**: Workers share the same filesystem and environment (container) as the Agent unless explicitly scrubbed.
  * **Coupled Scaling**: Validating 1000 queries requires scaling the Agent pods.
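The mechanism and controls above can be sketched with only the standard library. The worker body and names are placeholders; in the real system the initializer would build the SQLAlchemy `Engine` and `_run_query` would execute against it:

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor, TimeoutError
from concurrent.futures.process import BrokenProcessPool

_ENGINE = None  # per-worker global, built once by the initializer


def _init_worker() -> None:
    # In production: _ENGINE = sqlalchemy.create_engine(...), so each
    # warm worker keeps its own connection pool across tasks.
    global _ENGINE
    _ENGINE = object()


def _run_query(sql: str) -> dict:
    # Placeholder for executing against _ENGINE. A driver segfault here
    # kills only this worker process, never the parent Agent.
    return {"success": True, "rows": [], "sql": sql}


_CTX = multiprocessing.get_context("fork")  # POSIX-only; keeps the sketch self-contained

# Distinct pools: low-latency execution vs. lower-priority background indexing.
execution_pool = ProcessPoolExecutor(max_workers=4, mp_context=_CTX, initializer=_init_worker)
indexing_pool = ProcessPoolExecutor(max_workers=1, mp_context=_CTX, initializer=_init_worker)


def execute(sql: str, timeout_sec: float = 30.0) -> dict:
    future = execution_pool.submit(_run_query, sql)
    try:
        return future.result(timeout=timeout_sec)  # hard deadline
    except TimeoutError:
        # cancel() cannot stop an already-running worker; a production
        # implementation would terminate and replace the stuck process.
        future.cancel()
        return {"success": False, "error": "timeout"}
    except BrokenProcessPool:
        # A worker died mid-task (e.g. driver segfault); the parent survives.
        return {"success": False, "error": "worker_crashed"}
```

Note that `max_workers` bounds memory, the `initializer` gives each warm worker its own connection pool, and the `future.result(timeout=...)` call enforces the hard deadline from the parent side.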
### Option B: Dedicated Execution Service (Microservice)

Deploy a separate, independently scalable service (network-boundary isolation).

* **Mechanism**: A FastAPI/gRPC service running in a separate deployment (`execution-service`).
* **Pros**:
  * **Strong Security**: Separate container, separate IAM role, separate filesystem.
  * **Independent Scaling**: Scale Execution (50 replicas) vs Agent (5 replicas).
  * **Connection Pooling**: Excellent, persistent pooling in long-lived pods.
* **Cons**:
  * **Operational Complexity**: Requires service discovery, network policies, and separate CD pipelines.
  * **Latency**: ~10ms+ network I/O overhead.
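From the Agent's side, Option B only swaps the transport: the same request payload goes over HTTP instead of to a local pool. A hedged standard-library sketch (the `execution-service` URL, path, and payload shape are assumptions, not an existing API):

```python
import json
import urllib.request

# Hypothetical in-cluster address, resolved via service discovery.
EXECUTION_SERVICE_URL = "http://execution-service:8080/v1/execute"


def build_request(payload: dict) -> urllib.request.Request:
    # The Agent serializes the same ExecutionRequest payload it would
    # hand to a local pool; only the transport boundary changes.
    return urllib.request.Request(
        EXECUTION_SERVICE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def execute_remote(payload: dict, timeout_sec: float = 30.0) -> dict:
    # The ~10ms+ of network I/O buys a separate container, IAM role,
    # and filesystem for the code that touches raw SQL drivers.
    with urllib.request.urlopen(build_request(payload), timeout=timeout_sec) as resp:
        return json.load(resp)
```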
### Option C: Ephemeral Sandbox ("Spawn-on-Demand" Container)

Spin up a fresh Docker container for *every single query*.

* **Verdict**: **Not viable** for high-throughput real-time apps due to "connection storms" (no pooling) and 1s+ startup latency.
## 4. Comparison Matrix

| Feature | Option A: Multiprocessing | Option B: Dedicated Service | Option C: Ephemeral Container |
| :--- | :--- | :--- | :--- |
| **Crash Recovery** | ✅ Excellent (Parent survives) | ✅ Excellent (Agent survives) | ✅ Excellent |
| **Implementation Cost** | 🟢 Low (Code only) | 🟡 Medium (New Repo/Service) | 🔴 High (Orchestration) |
| **Ops Complexity** | 🟢 Low (Single Pod) | 🟠 High (Multi-Container) | 🔴 Very High |
| **Per-Query Latency** | 🚀 ~1-10ms (Warm) | ~10ms + Network | 🐢 ~1000ms+ |
| **Connection Pooling** | 🟢 Yes (Persistent Workers) | 🟢 Excellent | 🔴 Impossible |
## 5. Interface Contract (Internal Protocol)

Regardless of the implementation (Process or Service), the interface remains constant:

```python
from typing import Any, Dict, List, Literal, Optional

from pydantic import BaseModel


class ExecutionError(BaseModel):
    # Minimal shape; the fields here are illustrative.
    code: str  # e.g. "TIMEOUT", "SYNTAX_ERROR", "WORKER_CRASHED"
    message: str


class ExecutionRequest(BaseModel):
    mode: Literal["execute", "dry_run", "fetch_schema"]
    datasource_id: str
    connection_args: Dict[str, Any]
    sql: str
    parameters: Dict[str, Any]
    limits: Dict[str, int]  # e.g. { "max_rows": 1000, "timeout_sec": 30 }


class ExecutionResponse(BaseModel):
    success: bool
    data: Optional[List[Dict[str, Any]]]  # the result rows
    error: Optional[ExecutionError]
    metrics: Dict[str, float]  # e.g. { "execution_time_ms": 12.5 }
```
## 6. Recommendation

**Phase 1: Option A (Multiprocessing)**

* It solves the critical **Blast Radius** problem immediately.
* It separates **Indexing** from **Execution** via distinct pools.
* It is "Service-Ready": the `ExecutionRequest` protocol makes migrating to Option B trivial later.

**Phase 2: Option B (Dedicated Service)**

* Migrate only when we need to scale Execution independently of the Agent or require strict network segmentation.
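The "Service-Ready" claim can be made concrete: if both backends hide behind one callable signature over the shared contract, Phase 2 becomes a one-line swap at the call site. A minimal sketch (the backend bodies and names are illustrative stand-ins, not the actual codebase):

```python
from typing import Any, Callable, Dict

# Any backend: ExecutionRequest-shaped dict in, ExecutionResponse-shaped dict out.
Backend = Callable[[Dict[str, Any]], Dict[str, Any]]


def pool_backend(request: Dict[str, Any]) -> Dict[str, Any]:
    # Phase 1: stand-in for submitting to the local process pool.
    return {"success": True, "data": [], "error": None,
            "metrics": {"execution_time_ms": 1.0}}


def service_backend(request: Dict[str, Any]) -> Dict[str, Any]:
    # Phase 2: stand-in for POSTing the same payload to execution-service.
    return {"success": True, "data": [], "error": None,
            "metrics": {"execution_time_ms": 12.5}}


def make_executor(backend: Backend) -> Backend:
    def execute(request: Dict[str, Any]) -> Dict[str, Any]:
        # Contract-level validation lives outside the backend choice.
        if request["mode"] not in ("execute", "dry_run", "fetch_schema"):
            return {"success": False, "data": None,
                    "error": {"code": "BAD_MODE", "message": request["mode"]},
                    "metrics": {}}
        return backend(request)
    return execute


# Migrating to Option B later means changing only this one line:
execute = make_executor(pool_backend)
```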
