|
| 1 | +# ADR-001: Sandboxed Execution & Indexing Plane (ENH-001) |
| 2 | + |
| 3 | +## 1. Problem Statement |
| 4 | + |
| 5 | +### The Core Issue: "Blast Radius" & Reliability |
| 6 | + |
| 7 | +Currently, the NL2SQL Agent executes SQL queries via `ExecutorNode`, performs dry-runs via `PhysicalValidatorNode`, and indexes schemas via `OrchestratorVectorStore` **in-process**. |
| 8 | + |
| 9 | +This couples the stability of the Agent to the stability of the underlying SQL Drivers and the Database itself. |
| 10 | + |
| 11 | +* **Risk 1: Application Crashes (Segfaults)** |
| 12 | + * SQL Drivers (ODBC/C-Ext) running in the main process can trigger Segmentation Faults on malformed data or driver bugs. |
| 13 | + * **Impact**: The entire Agent process dies, terminating all concurrent user sessions. |
| 14 | + |
| 15 | +* **Risk 2: Resource Starvation (The "Noisy Neighbor")** |
| 16 | + * Schema Indexing (`fetch_schema`) is a heavy, long-running I/O and CPU operation. |
| 17 | + * **Impact**: If run in the main event loop, it starves query execution, causing timeouts for other users. |
| 18 | + |
| 19 | +* **Risk 3: Security (Defense in Depth)** |
| 20 | + * **True Risk**: If a driver vulnerability allows RCE, the attacker gains the privileges of the Agent (access to all secrets/envs). |
| 21 | + * **Mitigation Goal**: We want to run execution in a lower-privilege context. |
| 22 | + |
| 23 | +## 2. Non-Goals |
| 24 | + |
| 25 | +* **Prompt Injection**: This architecture does not solve LLM jailbreaks (handled by `IntentValidator`). |
| 26 | +* **Semantic Correctness**: We do not guarantee the SQL is "right", only that executing it won't crash the Agent. |
| 27 | +* **Query Optimization**: We are not rewriting queries, just running them safely. |
| 28 | + |
| 29 | +## 3. Options Analysis |
| 30 | + |
| 31 | +### Option A: Python Multiprocessing (Process Pool) - **Recommended** |
| 32 | + |
| 33 | +Spawn worker processes on the same machine (Process Boundary Isolation). |
| 34 | + |
| 35 | +* **Mechanism**: |
| 36 | + * **Execution Pool**: A low-latency pool for `ExecutorNode` and `PhysicalValidatorNode` (User-facing queries). |
| 37 | + * **Indexing Pool**: A separate, lower-priority pool for `OrchestratorVectorStore` (Background tasks). |
| 38 | +* **Controls**: |
| 39 | + * `max_workers`: Hard cap on concurrency to prevent OOM. |
| 40 | + * `initializer`: Setup persistent global `SQLAlchemy Engine` for connection pooling. |
| 41 | + * `timeouts`: Hard kill signals for stuck processes. |
| 42 | +* **Pros**: |
| 43 | + * **Crash Safety**: Parent catches `BrokenProcessPool` and survives. |
| 44 | + * **Performance**: ~1-10ms overhead (warm worker). |
| 45 | + * **Simplicity**: Deployment agnostic (works on Local, Docker, K8s). |
| 46 | +* **Cons**: |
| 47 | + * **Weak Security**: Workers share the same FileSystem and Environment (container) as the Agent context unless explicitly scrubbed. |
| 48 | + * **Coupled Scaling**: Validating 1000 queries requires scaling the Agent pods. |
| 49 | + |
| 50 | +### Option B: Dedicated Execution Service (Microservice) |
| 51 | + |
| 52 | +Deploy a separate, independently scalable service (Network Boundary Isolation). |
| 53 | + |
| 54 | +* **Mechanism**: A FastAPI/gRPC service running in a separate deployment (`execution-service`). |
| 55 | +* **Pros**: |
| 56 | + * **Strong Security**: Separate Container, separate IAM role, separate filesystem. |
| 57 | + * **Independent Scaling**: Scale Execution (50 replicas) vs Agent (5 replicas). |
| 58 | + * **Connection Pooling**: Excellent, persistent pooling in long-lived pods. |
| 59 | +* **Cons**: |
| 60 | + * **Operational Complexity**: Requires Service Discovery, Network Policy, and separate CD pipelines. |
| 61 | + * **Latency**: ~10ms+ Network I/O overhead. |
| 62 | + |
| 63 | +### Option C: Ephemeral Sandbox ("Spawn-on-Demand" Container) |
| 64 | + |
| 65 | +Spin up a fresh Docker container for *every single query*. |
| 66 | + |
| 67 | +* **Verdict**: **Not viable** for high-throughput real-time apps due to "Connection Storms" (no pooling) and 1s+ startup latency. |
| 68 | + |
| 69 | +## 4. Comparison Matrix |
| 70 | + |
| 71 | +| Feature | Option A: Multiprocessing | Option B: Dedicated Service | Option C: Ephemeral Container | |
| 72 | +| :--- | :--- | :--- | :--- | |
| 73 | +| **Crash Recovery** | ✅ Excellent (Parent survives) | ✅ Excellent (Agent survives) | ✅ Excellent | |
| 74 | +| **Implementation Cost** | 🟢 Low (Code only) | 🟡 Medium (New Repo/Service) | 🔴 High (Orchestration) | |
| 75 | +| **Ops Complexity** | 🟢 Low (Single Pod) | 🟠 High (Multi-Container) | 🔴 Very High | |
| 76 | +| **Per-Query Latency** | 🚀 ~1-10ms (Warm) | ⚡ ~10ms + Network | 🐢 ~1000ms+ | |
| 77 | +| **Connection Pooling** | 🟢 Yes (Persistent Workers) | 🟢 Excellent | 🔴 Impossible | |
| 78 | + |
| 79 | +## 5. Interface Contract (Internal Protocol) |
| 80 | + |
| 81 | +Regardless of the implementation (Process or Service), the interface remains constant: |
| 82 | + |
| 83 | +```python |
| 84 | +class ExecutionRequest(BaseModel): |
| 85 | + mode: Literal["execute", "dry_run", "fetch_schema"] |
| 86 | + datasource_id: str |
| 87 | + connection_args: Dict[str, Any] |
| 88 | + sql: str |
| 89 | + parameters: Dict[str, Any] |
| 90 | + limits: Dict[str, int] # e.g. { "max_rows": 1000, "timeout_sec": 30 } |
| 91 | + |
| 92 | +class ExecutionResponse(BaseModel): |
| 93 | + success: bool |
| 94 | + data: Optional[List[Dict[str, Any]]] # The rows |
| 95 | + error: Optional[ExecutionError] |
| 96 | + metrics: Dict[str, float] # { "execution_time_ms": 12.5 } |
| 97 | +``` |
| 98 | + |
| 99 | +## 6. Recommendation |
| 100 | + |
| 101 | +**Phase 1: Option A (Multiprocessing)** |
| 102 | + |
| 103 | +* It solves the critical **Blast Radius** immediately. |
| 104 | +* It separates **Indexing** from **Execution** via distinct pools. |
| 105 | +* It is "Service-Ready": The `ExecutionRequest` protocol makes migrating to Option B trivial later. |
| 106 | + |
| 107 | +**Phase 2: Option B (Dedicated Service)** |
| 108 | + |
| 109 | +* Migrate only when we need to scale Execution independently of the Agent or require strict network segmentation. |
0 commit comments