ambient-code
diff --git a/docs/internal/agents/openshell-runner-adaptation.md b/docs/internal/agents/openshell-runner-adaptation.md
Lines changed: 242 additions & 0 deletions
@@ -0,0 +1,242 @@
+# Adapting ambient-runner to Use OpenShell
+
+> Analysis date: 2026-06-03
+> Companion doc: [OpenShell Security Model Analysis](openshell-security-analysis.md)
+> Target component: `components/runners/ambient-runner/ambient_runner/`
+
+---
+
+## Current Runner Credential Model (The Problem)
+
+The runner puts **real secrets directly into `os.environ`** and the agent's process memory. If the agent inspects its own environment, it sees real credentials.
+
+### How Secrets Flow Today
+
+| Mechanism | File | What Happens |
+|-----------|------|-------------|
+| `populate_runtime_credentials()` | `platform/auth.py` | Fetches real tokens from backend API, writes them into `os.environ`: `GITHUB_TOKEN`, `GITLAB_TOKEN`, `JIRA_API_TOKEN`, `ANTHROPIC_API_KEY`, `CODERABBIT_API_KEY`, etc. |
+| Token files on disk | `platform/auth.py` | Writes real tokens to `/tmp/.ambient_github_token`, `/tmp/.ambient_gitlab_token`, `/tmp/.ambient_kubeconfig` for the git credential helper and `gh` wrapper |
+| Git credential helper | `platform/auth.py` | Shell script at `/tmp/git-credential-ambient` reads the real token from temp file and pipes it to git |
+| `gh` CLI wrapper | `platform/auth.py` | Shell script reads real GitHub token from file, exports `GH_TOKEN`, then exec's the real `gh` |
+| Secret redaction middleware | `middleware/secret_redaction.py` | Post-hoc defense: scrubs secrets from *outbound AG-UI events* only — the agent process still has full access to real secrets in memory and on disk |
+
+### The Gap
+
+```
+Agent reads /proc/self/environ     → sees GITHUB_TOKEN=ghp_real_secret
+Agent runs: cat /tmp/.ambient_*    → sees real tokens
+Agent runs: echo $ANTHROPIC_API_KEY → sees real API key
+```
+
+The redaction middleware protects the *output stream* (events sent to the frontend), not the agent itself. A compromised or misbehaving agent has unrestricted access to all credentials.
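
The gap can be reproduced in a few lines. The sketch below is illustrative only: `DEMO_TOKEN`, `DEMO_SECRET`, and `redact` are hypothetical stand-ins, not the runner's actual variable names or the middleware's API. It shows that scrubbing an outbound event does nothing for a child process that inherits the environment.

```python
import os
import subprocess
import sys

DEMO_SECRET = "ghp_demo_not_a_real_token"  # hypothetical value, for illustration only


def redact(text: str, secrets: list[str]) -> str:
    """Stand-in for output-stream redaction: mask known secret values."""
    for secret in secrets:
        text = text.replace(secret, "[REDACTED]")
    return text


def child_sees_env(name: str) -> str:
    """Spawn a child process and return what it reads from its own environment."""
    result = subprocess.run(
        [sys.executable, "-c", f"import os; print(os.environ.get({name!r}, ''))"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


os.environ["DEMO_TOKEN"] = DEMO_SECRET
# The child inherits os.environ, so it reads the real value.
leaked = child_sees_env("DEMO_TOKEN")
# Redaction only changes what is sent onward, not what the child already saw.
event = redact(f"token is {leaked}", [DEMO_SECRET])
```

The child process obtains the real value before any redaction runs, which is the same position the agent subprocess is in today.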
+
+---
+
+## OpenShell Integration Strategies
+
+### Strategy 1: OpenShell as Sidecar Supervisor (Recommended)
+
+Replace the runner container's direct credential injection with OpenShell's Supervisor running as a sidecar (or init container + persistent process) in the same pod.
+
+#### What Changes
+
+| Component | Current | With OpenShell |
+|-----------|---------|---------------|
+| `auth.py:populate_runtime_credentials()` | Sets `os.environ["GITHUB_TOKEN"] = real_token` | Sets `os.environ["GITHUB_TOKEN"] = "openshell:resolve:env:GITHUB_TOKEN"` |
+| Token files (`/tmp/.ambient_*`) | Contain real tokens | Contain placeholder strings |
+| Git credential helper | Reads real token from file | Reads placeholder; OpenShell proxy rewrites on outbound |
+| `gh` wrapper | Exports real `GH_TOKEN` | Exports placeholder; proxy rewrites |
+| Network egress | Direct to `api.github.com`, etc. | Via OpenShell HTTP CONNECT proxy at `10.200.0.1:3128` |
+| `secret_redaction.py` | Primary defense for output stream | Redundant but kept as defense-in-depth |
+| `_grpc_client.py` | Direct gRPC to API server | Whitelisted in network policy (intra-cluster) |
+| Claude CLI subprocess | Full env access with real secrets | Runs in sandbox netns with placeholders only |
+
+#### Implementation Steps
+
+**1. New OpenShell provider type**
+
+Register Ambient's credential store as an OpenShell provider. The Operator creates a provider config that maps each credential type (github, gitlab, jira, etc.) to the corresponding backend API credential endpoint. Two options:
+
+- OpenShell's Gateway calls the Ambient backend to fetch the real token on demand
+- The Operator pre-populates the provider at pod creation time (simpler, no Gateway dependency)
+
+**2. Modify `platform/auth.py`**
+
+Replace `populate_runtime_credentials()` with a version that writes placeholders instead of real values:
+
+```python
+# Before (current)
+os.environ["GITHUB_TOKEN"] = github_creds["token"]  # real secret
+_GITHUB_TOKEN_FILE.write_text(github_creds["token"])  # real secret on disk
+
+# After (with OpenShell)
+os.environ["GITHUB_TOKEN"] = "openshell:resolve:env:GITHUB_TOKEN"  # placeholder
+_GITHUB_TOKEN_FILE.write_text("openshell:resolve:env:GITHUB_TOKEN")  # placeholder
+# Real secret held only in Supervisor memory → proxy rewrites on outbound
+```
+
+The same pattern applies to all credential types: `GITLAB_TOKEN`, `JIRA_API_TOKEN`, `ANTHROPIC_API_KEY`, `CODERABBIT_API_KEY`, `KUBECONFIG`.
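
One way to apply the pattern uniformly is a small helper that derives the placeholder from the variable name. This is a minimal sketch, not the existing `auth.py` code: `placeholder_for` and `write_placeholder` are hypothetical names, and the placeholder format is copied from the snippet above.

```python
from pathlib import Path
from typing import MutableMapping, Optional

PLACEHOLDER_PREFIX = "openshell:resolve:env:"  # format taken from the snippet above


def placeholder_for(env_name: str) -> str:
    """Return the placeholder string that stands in for a real credential."""
    return f"{PLACEHOLDER_PREFIX}{env_name}"


def write_placeholder(
    env: MutableMapping[str, str],
    env_name: str,
    token_file: Optional[Path] = None,
) -> str:
    """Store a placeholder in the environment mapping and, optionally, on disk."""
    value = placeholder_for(env_name)
    env[env_name] = value
    if token_file is not None:
        token_file.write_text(value)
    return value


# Example with a plain dict standing in for os.environ.
demo_env: dict[str, str] = {}
for name in ("GITHUB_TOKEN", "GITLAB_TOKEN", "JIRA_API_TOKEN"):
    write_placeholder(demo_env, name)
```

Passing the mapping in explicitly keeps the helper testable without touching the real process environment.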
+
+**3. Modify the Dockerfile**
+
+Add the OpenShell Supervisor binary to the runner image. The runner (uvicorn) starts normally; `bridge.py` invokes the Supervisor when it launches the Claude CLI subprocess:
+
+```dockerfile
+# Add OpenShell binary
+COPY --from=openshell/supervisor:latest /usr/bin/openshell-sandbox /usr/bin/openshell-sandbox
+
+# Entrypoint unchanged — uvicorn runs unsandboxed:
+CMD ["/bin/bash", "-c", "umask 0022 && cd /app/ambient-runner && uvicorn main:app --host 0.0.0.0 --port 8001"]
+```
+
+The Supervisor wraps only the Claude CLI subprocess (launched from `bridges/claude/bridge.py`), applying Landlock + seccomp + netns to the agent process. The runner itself (FastAPI/uvicorn, gRPC client, credential fetching) runs outside the sandbox boundary.
+
+**4. Network policy via OpenShell**
+
+Replace the K8s `NetworkPolicy` with OpenShell's per-sandbox network namespace + OPA policy:
+
+```yaml
+network_policies:
+  ambient_backend:
+    name: ambient-backend-access
+    endpoints:
+      - host: backend-service.ambient-code.svc.cluster.local
+        port: 8080
+        protocol: rest
+        access: read-write
+    binaries:
+      - { path: /usr/bin/python3 }
+
+  ambient_grpc:
+    name: ambient-grpc-access
+    endpoints:
+      - host: ambient-api-server.ambient-code.svc.cluster.local
+        port: 9000
+        protocol: connect
+        access: read-write
+    binaries:
+      - { path: /usr/bin/python3 }
+
+  github_api:
+    name: github-api-access
+    endpoints:
+      - host: api.github.com
+        port: 443
+        protocol: rest
+        access: read-write
+
+  anthropic_api:
+    name: anthropic-api-access
+    endpoints:
+      - host: api.anthropic.com
+        port: 443
+        protocol: rest
+        access: read-write
+
+  gitlab_api:
+    name: gitlab-api-access
+    endpoints:
+      - host: "*.gitlab.com"
+        port: 443
+        protocol: rest
+        access: read-write
+```
+
+**5. `_grpc_client.py` — No changes needed**
+
+The gRPC channel to the API server is established by the runner process, which runs outside the OpenShell sandbox boundary. Since only the Claude CLI subprocess is sandboxed, the gRPC client is unaffected.
+
+**6. Modify `bridges/claude/bridge.py`**
+
+The Claude CLI subprocess must route its traffic through the OpenShell proxy via `HTTP_PROXY`/`HTTPS_PROXY`. OpenShell injects these variables automatically when the sandbox starts, so the bridge only needs to pass them through to the subprocess environment instead of dropping or overriding them.
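
A minimal sketch of the environment the bridge could hand to the subprocess, under stated assumptions: `build_agent_env` is a hypothetical helper, not a function in `bridge.py`, and the credential names are the ones listed earlier in this document. It forwards any proxy variables already present and makes sure no credential variable carries anything other than a placeholder.

```python
from typing import Mapping

PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY")
PLACEHOLDER_PREFIX = "openshell:resolve:env:"
CREDENTIAL_VARS = (
    "GITHUB_TOKEN",
    "GITLAB_TOKEN",
    "JIRA_API_TOKEN",
    "ANTHROPIC_API_KEY",
    "CODERABBIT_API_KEY",
)


def build_agent_env(parent_env: Mapping[str, str]) -> dict[str, str]:
    """Copy the parent environment for the agent subprocess, placeholders only."""
    env = dict(parent_env)
    for name in CREDENTIAL_VARS:
        # Safety net: never forward a value that is not already a placeholder.
        if name in env and not env[name].startswith(PLACEHOLDER_PREFIX):
            env[name] = PLACEHOLDER_PREFIX + name
    for name in PROXY_VARS:
        # Some tools read only the lowercase spelling, so mirror it if unset.
        if name in env:
            env.setdefault(name.lower(), env[name])
    return env


agent_env = build_agent_env(
    {
        "PATH": "/usr/bin",
        "GITHUB_TOKEN": "value-that-should-not-leak",
        "HTTPS_PROXY": "http://10.200.0.1:3128",
    }
)
```

The result would be passed as the `env` argument of `subprocess.Popen` or `asyncio.create_subprocess_exec` when the bridge starts the CLI.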
+
+**7. Operator changes**
+
+The Operator (`components/operator/`) configures OpenShell provider + policy per session Job:
+
+- Inject OpenShell provider config as a ConfigMap or Secret
+- Mount the Supervisor binary (or use a sidecar container)
+- Generate per-session OPA policies based on the session's credential bindings
+- Pass the policy YAML as a volume mount
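
Per-session policy generation can be sketched as a lookup from credential bindings to endpoint entries. This is an illustration only: the binding names, `KNOWN_ENDPOINTS`, and `build_network_policies` are hypothetical, the real Operator may be written in another language, and the field names simply mirror the policy example in step 4.

```python
import json

# Hosts, ports, and policy names copied from the step 4 example.
# The binding-name keys ("github", ...) are assumptions for this sketch.
KNOWN_ENDPOINTS = {
    "github": ("github_api", "github-api-access", "api.github.com", 443),
    "gitlab": ("gitlab_api", "gitlab-api-access", "*.gitlab.com", 443),
    "anthropic": ("anthropic_api", "anthropic-api-access", "api.anthropic.com", 443),
}


def build_network_policies(bindings: list[str]) -> dict:
    """Emit one policy entry per credential binding the session actually has."""
    policies = {}
    for binding in bindings:
        if binding not in KNOWN_ENDPOINTS:
            continue  # unknown bindings get no egress rule
        key, name, host, port = KNOWN_ENDPOINTS[binding]
        policies[key] = {
            "name": name,
            "endpoints": [
                {"host": host, "port": port, "protocol": "rest", "access": "read-write"}
            ],
        }
    return {"network_policies": policies}


policy = build_network_policies(["github", "anthropic"])
policy_text = json.dumps(policy, indent=2)  # JSON is accepted by most YAML parsers
```

A session without a GitLab binding gets no `gitlab_api` entry, so the sandbox has no egress rule for that host.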
+
+#### Files to Modify
+
+| File | Change |
+|------|--------|
+| `platform/auth.py` | `populate_runtime_credentials()` writes placeholders, not real tokens |
+| `platform/auth.py` | Token files (`/tmp/.ambient_*`) get placeholder values |
+| `platform/auth.py` | `install_git_credential_helper()` — helper returns placeholder; proxy rewrites |
+| `platform/auth.py` | `install_gh_wrapper()` — wrapper exports placeholder `GH_TOKEN` |
+| `_grpc_client.py` | No changes needed — gRPC runs in runner process, outside Claude subprocess sandbox boundary |
+| `Dockerfile` | Add OpenShell Supervisor binary; `CMD` stays unchanged |
+| `bridges/claude/bridge.py` | Proxy env vars for Claude CLI subprocess |
+| `middleware/secret_redaction.py` | Keep as defense-in-depth (no longer the primary defense) |
+| `components/operator/` | Configure OpenShell provider + policy per session Job |
+
+---
+
+### Strategy 2: OpenShell as Pod Runtime (Operator-Level)
+
+The Operator spawns Jobs using an OpenShell-managed container runtime instead of raw K8s containers. The integration moves up a level — runner code doesn't change, but the Operator configures OpenShell as the execution environment.
+
+**Pros:** Zero runner code changes.
+
+**Cons:** Requires OpenShell's Kubernetes compute driver to be production-ready (currently alpha). Heavier Operator changes. Less control over per-session policy granularity from the runner's perspective.
+
+---
+
+### Strategy 3: OpenShell Provider Bridge (Minimal, Credential-Only)
+
+Adopt only the credential placeholder/proxy pattern without the full sandbox. Write a thin Python adapter that:
+
+1. Starts a local HTTP CONNECT proxy in the runner pod
+2. Holds real secrets in proxy memory (separate process, higher privilege)
+3. Injects placeholders into `os.environ`
+4. Rewrites placeholders to real values on outbound requests
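
The rewrite in step 4 comes down to a string substitution on outbound header values. The sketch below assumes the proxy can already see those values in plaintext and does not cover how it gets that access; `rewrite_placeholders` is a hypothetical name, and the placeholder format is the one used in Strategy 1.

```python
import re

# Matches placeholders such as openshell:resolve:env:GITHUB_TOKEN
PLACEHOLDER_RE = re.compile(r"openshell:resolve:env:([A-Z][A-Z0-9_]*)")


def rewrite_placeholders(value: str, secrets: dict[str, str]) -> str:
    """Replace each known placeholder in value with the real secret."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Leave unknown placeholders untouched instead of guessing.
        return secrets.get(name, match.group(0))

    return PLACEHOLDER_RE.sub(substitute, value)


secrets_in_proxy_memory = {"GITHUB_TOKEN": "demo-secret"}  # illustrative value
header = rewrite_placeholders(
    "Bearer openshell:resolve:env:GITHUB_TOKEN", secrets_in_proxy_memory
)
```

Git over HTTPS typically sends credentials in a Basic `Authorization` header, which is base64-encoded, so a real proxy would presumably need to decode that header before matching. The sketch does not attempt this.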
+
+**Pros:** No Rust dependency, no kernel features (Landlock/seccomp) needed. Works on any kernel version. Smallest change surface.
+
+**Cons:** No Landlock/seccomp/netns isolation — only credential isolation. Agent can still bypass the proxy if it makes raw socket calls (no network namespace enforcement). No L7 inspection or OPA policy evaluation.
+
+---
+
+## Strategy Comparison
+
+| Criterion | Strategy 1 (Sidecar) | Strategy 2 (Pod Runtime) | Strategy 3 (Proxy Only) |
+|-----------|---------------------|------------------------|------------------------|
+| Credential isolation | Full (placeholder/proxy) | Full (placeholder/proxy) | Partial (no netns enforcement) |
+| Network isolation | Full (netns + iptables) | Full (netns + iptables) | None |
+| Filesystem isolation | Landlock LSM | Landlock LSM | None |
+| Syscall filtering | seccomp-BPF | seccomp-BPF | None |
+| L7 inspection (OPA) | Yes | Yes | No |
+| Runner code changes | Moderate (`auth.py`, `Dockerfile`) | None | Small (new proxy module) |
+| Operator changes | Moderate (provider + policy config) | Heavy (new compute driver) | None |
+| Kernel requirements | Linux 5.13+ (Landlock) | Linux 5.13+ (Landlock) | None |
+| OpenShell maturity dependency | Supervisor (stable) | K8s driver (alpha) | None (custom code) |
+| Defense depth | 5 layers | 5 layers | 1 layer |
+
+---
+
+## Recommendation
+
+**Strategy 1 (Sidecar Supervisor)** is the right path. It provides:
+
+- Agent never sees real secrets (inspecting `/proc/self/environ` returns only placeholders)
+- L7 inspection via OPA policies (audit which APIs the agent calls)
+- Landlock + seccomp hardening within the container
+- Binary identity via SHA256 trust-on-first-use (TOFU), so only known binaries can make network calls
+- The existing `secret_redaction.py` becomes a true defense-in-depth layer rather than the primary defense
+
+The critical architectural insight: OpenShell's credential proxy pattern eliminates the single point of failure in the current design. Today, `populate_runtime_credentials()` puts real secrets into a space the agent fully controls. OpenShell moves real secrets into Supervisor memory — a separate privilege domain the agent cannot access.
+
+### Prerequisite: Kernel Version
+
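+OpenShell's Landlock LSM support requires Linux 5.13+. The runner containers are built on UBI 10 (RHEL 10), whose matching host kernel is 6.x, but containers share the node's kernel, so the requirement is met only when the cluster nodes themselves run 5.13 or later. OpenShell's `best_effort` Landlock mode also provides graceful degradation if the kernel lacks support.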
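
A startup check for the kernel prerequisite could look like the sketch below. The function names are hypothetical, and the check is deliberately narrow: it compares only the version reported by the kernel the container is actually running on.

```python
import platform
import re

LANDLOCK_MIN_KERNEL = (5, 13)  # Landlock was merged in Linux 5.13


def parse_kernel_release(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '6.12.0-55.el10'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_supports_landlock(release: str) -> bool:
    """True if the kernel version is new enough to include Landlock."""
    return parse_kernel_release(release) >= LANDLOCK_MIN_KERNEL


def running_kernel_supports_landlock() -> bool:
    """Check the kernel this process runs on (the node's kernel in a container)."""
    return kernel_supports_landlock(platform.release())
```

A passing version check is necessary but not sufficient: Landlock must also be compiled into the kernel and enabled in the active LSM list, which this sketch does not verify.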
+
+### Migration Path
+
+1. **Phase 1 — Credential proxy only (Strategy 3):** Ship a Python-only credential proxy as a proof of concept. This validates that the placeholder/rewrite pattern works with the git credential helper, the `gh` wrapper, and the Claude CLI without requiring the OpenShell binary.
+
+2. **Phase 2 — Sidecar Supervisor (Strategy 1):** Add OpenShell Supervisor binary, network namespace isolation, Landlock, and seccomp. This is the production target.
+
+3. **Phase 3 — OPA policies:** Add L7 inspection with per-session OPA policies generated by the Operator from the session's credential bindings and project settings.