ambient-code
diff --git a/docs/internal/agents/openshell-runner-adaptation.md b/docs/internal/agents/openshell-runner-adaptation.md
Lines changed: 243 additions & 0 deletions
@@ -0,0 +1,243 @@
+# Adapting ambient-runner to Use OpenShell
+
+> Analysis date: 2026-06-03
+> Companion doc: [OpenShell Security Model Analysis](openshell-security-analysis.md)
+> Target component: `components/runners/ambient-runner/ambient_runner/`
+
+---
+
+## Current Runner Credential Model (The Problem)
+
+The runner puts **real secrets directly into `os.environ`** and the agent's process memory. If the agent inspects its own environment, it sees real credentials.
+
+### How Secrets Flow Today
+
+| Mechanism | File | What Happens |
+|-----------|------|-------------|
+| `populate_runtime_credentials()` | `platform/auth.py` | Fetches real tokens from backend API, writes them into `os.environ`: `GITHUB_TOKEN`, `GITLAB_TOKEN`, `JIRA_API_TOKEN`, `ANTHROPIC_API_KEY`, `CODERABBIT_API_KEY`, etc. |
+| Token files on disk | `platform/auth.py` | Writes real tokens to `/tmp/.ambient_github_token`, `/tmp/.ambient_gitlab_token`, `/tmp/.ambient_kubeconfig` for the git credential helper and `gh` wrapper |
+| Git credential helper | `platform/auth.py` | Shell script at `/tmp/git-credential-ambient` reads the real token from temp file and pipes it to git |
+| `gh` CLI wrapper | `platform/auth.py` | Shell script reads real GitHub token from file, exports `GH_TOKEN`, then exec's the real `gh` |
+| Secret redaction middleware | `middleware/secret_redaction.py` | Post-hoc defense: scrubs secrets from *outbound AG-UI events* only — the agent process still has full access to real secrets in memory and on disk |
+
+### The Gap
+
+```
+Agent reads /proc/self/environ     → sees GITHUB_TOKEN=ghp_real_secret
+Agent runs: cat /tmp/.ambient_*    → sees real tokens
+Agent runs: echo $ANTHROPIC_API_KEY → sees real API key
+```
+
+The redaction middleware protects the *output stream* (events sent to the frontend), not the agent itself. A compromised or misbehaving agent has unrestricted access to all credentials.
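
The gap can be illustrated with a minimal sketch. The `redact` helper below is a hypothetical stand-in for the middleware, not its actual code, and the token value is a dummy. It shows that scrubbing an outbound event leaves the value in the environment untouched.

```python
def redact(text, secret_values):
    """Replace every known secret value in an outbound string (hypothetical helper)."""
    for value in secret_values:
        if value:
            text = text.replace(value, "[REDACTED]")
    return text


# A plain dict stands in for os.environ; the token is a dummy value.
env = {"GITHUB_TOKEN": "ghp_example_not_a_real_token"}

# An event that would be sent to the frontend gets scrubbed...
event = f"tool output: {env['GITHUB_TOKEN']}"
outbound = redact(event, env.values())

# ...but the process environment still holds the original value.
still_readable = env["GITHUB_TOKEN"]
```

Redaction changes what leaves the process, not what the process itself can read.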
+
+---
+
+## OpenShell Integration Strategies
+
+### Strategy 1: OpenShell as Sidecar Supervisor (Recommended)
+
+Replace the runner container's direct credential injection with OpenShell's Supervisor running as a sidecar (or init container + persistent process) in the same pod.
+
+#### What Changes
+
+| Component | Current | With OpenShell |
+|-----------|---------|---------------|
+| `auth.py:populate_runtime_credentials()` | Sets `os.environ["GITHUB_TOKEN"] = real_token` | Sets `os.environ["GITHUB_TOKEN"] = "openshell:resolve:env:GITHUB_TOKEN"` |
+| Token files (`/tmp/.ambient_*`) | Contain real tokens | Contain placeholder strings |
+| Git credential helper | Reads real token from file | Reads placeholder; OpenShell proxy rewrites on outbound |
+| `gh` wrapper | Exports real `GH_TOKEN` | Exports placeholder; proxy rewrites |
+| Network egress | Direct to `api.github.com`, etc. | Via OpenShell HTTP CONNECT proxy at `10.200.0.1:3128` |
+| `secret_redaction.py` | Primary defense for output stream | Redundant but kept as defense-in-depth |
+| `_grpc_client.py` | Direct gRPC to API server | Whitelisted in network policy (intra-cluster) |
+| Claude CLI subprocess | Full env access with real secrets | Runs in sandbox netns with placeholders only |
+
+#### Implementation Steps
+
+**1. New OpenShell provider type**
+
+Register Ambient's credential store as an OpenShell provider. The Operator creates a provider config that maps each credential type (github, gitlab, jira, etc.) to the corresponding backend API credential endpoint. Two options:
+
+- OpenShell's Gateway calls the Ambient backend to fetch the real token on demand
+- The Operator pre-populates the provider at pod creation time (simpler, no Gateway dependency)
+
+**2. Modify `platform/auth.py`**
+
+Replace `populate_runtime_credentials()` with a version that writes placeholders instead of real values:
+
+```python
+# Before (current)
+os.environ["GITHUB_TOKEN"] = github_creds["token"]  # real secret
+_GITHUB_TOKEN_FILE.write_text(github_creds["token"])  # real secret on disk
+
+# After (with OpenShell)
+os.environ["GITHUB_TOKEN"] = "openshell:resolve:env:GITHUB_TOKEN"  # placeholder
+_GITHUB_TOKEN_FILE.write_text("openshell:resolve:env:GITHUB_TOKEN")  # placeholder
+# Real secret held only in Supervisor memory → proxy rewrites on outbound
+```
+
+The same pattern applies to all credential types: `GITLAB_TOKEN`, `JIRA_API_TOKEN`, `ANTHROPIC_API_KEY`, `CODERABBIT_API_KEY`, `KUBECONFIG`.
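
A minimal sketch of how that pattern might be generalized, assuming the placeholder format shown in the snippet above. The function name, its parameters, and the `token_files` mapping are hypothetical and are not taken from `platform/auth.py`.

```python
import os
from pathlib import Path

# Placeholder format copied from the snippet above.
PLACEHOLDER_PREFIX = "openshell:resolve:env:"


def placeholder_for(env_name):
    """Build the placeholder string for one environment variable name."""
    return f"{PLACEHOLDER_PREFIX}{env_name}"


def populate_placeholder_credentials(env_names, token_files=None, environ=None):
    """Write placeholders, never real values, into the environment and token files.

    env_names:   environment variable names to populate
    token_files: optional mapping of env var name -> Path of its on-disk token file
    environ:     mapping to mutate; defaults to os.environ
    """
    environ = os.environ if environ is None else environ
    token_files = token_files or {}
    for name in env_names:
        value = placeholder_for(name)
        environ[name] = value
        path = token_files.get(name)
        if path is not None:
            path.write_text(value)
            path.chmod(0o600)  # keep the same tight permissions a real token file would have
    return {name: environ[name] for name in env_names}
```

Passing `environ` explicitly keeps the function testable without touching the real process environment.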
+
+**3. Modify the Dockerfile**
+
+Add the OpenShell Supervisor binary and wrap the runner entrypoint with `openshell-sandbox`:
+
+```dockerfile
+# Copy the OpenShell Supervisor binary (pin a tag or digest instead of `latest` for reproducible builds)
+COPY --from=openshell/supervisor:latest /usr/bin/openshell-sandbox /usr/bin/openshell-sandbox
+
+# The container command (CMD) becomes:
+CMD ["openshell-sandbox", "--provider", "ambient", "--", \
+     "/bin/bash", "-c", "umask 0022 && cd /app/ambient-runner && uvicorn main:app --host 0.0.0.0 --port 8001"]
+```
+
+The Supervisor wraps the uvicorn process, applying Landlock + seccomp + netns before exec.
+
+**4. Network policy via OpenShell**
+
+Replace the K8s `NetworkPolicy` with OpenShell's per-sandbox network namespace + OPA policy:
+
+```yaml
+network_policies:
+  ambient_backend:
+    name: ambient-backend-access
+    endpoints:
+      - host: backend-service.ambient-code.svc.cluster.local
+        port: 8080
+        protocol: rest
+        access: read-write
+    binaries:
+      - { path: /usr/bin/python3 }
+
+  ambient_grpc:
+    name: ambient-grpc-access
+    endpoints:
+      - host: ambient-api-server.ambient-code.svc.cluster.local
+        port: 9000
+        protocol: connect
+        access: read-write
+    binaries:
+      - { path: /usr/bin/python3 }
+
+  github_api:
+    name: github-api-access
+    endpoints:
+      - host: api.github.com
+        port: 443
+        protocol: rest
+        access: read-write
+
+  anthropic_api:
+    name: anthropic-api-access
+    endpoints:
+      - host: api.anthropic.com
+        port: 443
+        protocol: rest
+        access: read-write
+
+  gitlab_api:
+    name: gitlab-api-access
+    endpoints:
+      - host: "*.gitlab.com"
+        port: 443
+        protocol: rest
+        access: read-write
+```
+
+**5. Modify `_grpc_client.py`**
+
+The gRPC channel to the API server needs to be whitelisted in OpenShell's network policy. Since it's intra-cluster, it routes through the proxy with credential rewriting. The `_build_channel()` function may need proxy-awareness if OpenShell's netns routes all TCP through the CONNECT proxy.
+
+**6. Modify `bridges/claude/bridge.py`**
+
+Ensure the Claude CLI subprocess receives `HTTP_PROXY`/`HTTPS_PROXY` so that it routes through the OpenShell proxy. OpenShell injects these variables automatically when the sandbox starts, so the bridge only needs to pass them through to the subprocess env rather than compute them itself.
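
One way the pass-through might look, as a sketch only: `build_subprocess_env` is a hypothetical helper, not a function in `bridges/claude/bridge.py`, and it assumes the bridge builds an explicit env dict for the subprocess instead of inheriting the parent environment wholesale.

```python
import os

# Both spellings are checked because tools differ in which one they read.
PROXY_VARS = (
    "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY",
    "http_proxy", "https_proxy", "no_proxy",
)


def build_subprocess_env(base_env=None, extra=None):
    """Return an env dict for the CLI subprocess that keeps the parent's proxy variables.

    base_env: environment to copy proxy variables from; defaults to os.environ
    extra:    additional variables the caller wants the subprocess to have
    """
    base_env = os.environ if base_env is None else base_env
    env = dict(extra or {})
    for name in PROXY_VARS:
        if name in base_env:
            # The value from the parent environment wins over anything in `extra`.
            env[name] = base_env[name]
    return env
```

The resulting dict can be handed to `subprocess.Popen(..., env=env)` or `asyncio.create_subprocess_exec(..., env=env)`.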
+
+**7. Operator changes**
+
+The Operator (`components/operator/`) configures OpenShell provider + policy per session Job:
+
+- Inject OpenShell provider config as a ConfigMap or Secret
+- Mount the Supervisor binary (or use a sidecar container)
+- Generate per-session OPA policies based on the session's credential bindings
+- Pass the policy YAML as a volume mount
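
The policy-generation step could be sketched roughly as follows. The sketch is in Python for consistency with the runner examples and says nothing about the Operator's own language. The binding names (`github`, `gitlab`, `anthropic`) and the function are hypothetical; the endpoint values are copied from the example policy in step 4.

```python
# binding name -> (policy key, policy name, host, port, protocol), copied from step 4
KNOWN_ENDPOINTS = {
    "github": ("github_api", "github-api-access", "api.github.com", 443, "rest"),
    "gitlab": ("gitlab_api", "gitlab-api-access", "*.gitlab.com", 443, "rest"),
    "anthropic": ("anthropic_api", "anthropic-api-access", "api.anthropic.com", 443, "rest"),
}


def build_network_policies(credential_bindings):
    """Build a network_policies mapping containing only the session's bound credentials."""
    policies = {}
    for binding in sorted(set(credential_bindings)):
        if binding not in KNOWN_ENDPOINTS:
            # Fail closed: an unknown binding should not silently widen or narrow egress.
            raise ValueError(f"no endpoint mapping for credential binding: {binding}")
        key, name, host, port, protocol = KNOWN_ENDPOINTS[binding]
        policies[key] = {
            "name": name,
            "endpoints": [
                {"host": host, "port": port, "protocol": protocol, "access": "read-write"}
            ],
        }
    return {"network_policies": policies}
```

A session bound only to GitHub and Anthropic credentials would then get no GitLab egress rule at all. The returned mapping can be serialized to YAML before it is mounted into the pod.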
+
+#### Files to Modify
+
+| File | Change |
+|------|--------|
+| `platform/auth.py` | `populate_runtime_credentials()` writes placeholders, not real tokens |
+| `platform/auth.py` | Token files (`/tmp/.ambient_*`) get placeholder values |
+| `platform/auth.py` | `install_git_credential_helper()` — helper returns placeholder; proxy rewrites |
+| `platform/auth.py` | `install_gh_wrapper()` — wrapper exports placeholder `GH_TOKEN` |
+| `_grpc_client.py` | Proxy-aware channel construction for intra-cluster gRPC |
+| `Dockerfile` | Add OpenShell Supervisor binary, modify CMD |
+| `bridges/claude/bridge.py` | Proxy env vars for Claude CLI subprocess |
+| `middleware/secret_redaction.py` | Keep as defense-in-depth (now truly redundant) |
+| `components/operator/` | Configure OpenShell provider + policy per session Job |
+
+---
+
+### Strategy 2: OpenShell as Pod Runtime (Operator-Level)
+
+The Operator spawns Jobs using an OpenShell-managed container runtime instead of raw K8s containers. The integration moves up a level — runner code doesn't change, but the Operator configures OpenShell as the execution environment.
+
+**Pros:** Zero runner code changes.
+
+**Cons:** Requires OpenShell's Kubernetes compute driver to be production-ready (currently alpha). Heavier Operator changes. Less control over per-session policy granularity from the runner's perspective.
+
+---
+
+### Strategy 3: OpenShell Provider Bridge (Minimal, Credential-Only)
+
+Adopt only the credential placeholder/proxy pattern without the full sandbox. Write a thin Python adapter that:
+
+1. Starts a local HTTP CONNECT proxy in the runner pod
+2. Holds real secrets in proxy memory (separate process, higher privilege)
+3. Injects placeholders into `os.environ`
+4. Rewrites placeholders to real values on outbound requests
+
+**Pros:** No Rust dependency, no kernel features (Landlock/seccomp) needed. Works on any kernel version. Smallest change surface.
+
+**Cons:** No Landlock/seccomp/netns isolation — only credential isolation. Agent can still bypass the proxy if it makes raw socket calls (no network namespace enforcement). No L7 inspection or OPA policy evaluation.
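
The rewrite step at the core of this strategy might look like the sketch below. It assumes the `openshell:resolve:env:<NAME>` placeholder format used elsewhere in this document; the function names are hypothetical. It covers only the header rewrite. A real proxy would also have to terminate TLS to see request headers at all, because an HTTP CONNECT tunnel carries HTTPS traffic as opaque bytes, and that part is out of scope here.

```python
import base64
import re

PLACEHOLDER_RE = re.compile(r"openshell:resolve:env:([A-Za-z_][A-Za-z0-9_]*)")


def resolve_placeholders(text, secrets):
    """Replace each known placeholder with its real value; leave unknown ones untouched."""
    return PLACEHOLDER_RE.sub(lambda m: secrets.get(m.group(1), m.group(0)), text)


def rewrite_authorization(value, secrets):
    """Rewrite an Authorization header value, including base64-encoded Basic credentials."""
    scheme, _, rest = value.partition(" ")
    if scheme.lower() == "basic" and rest:
        try:
            decoded = base64.b64decode(rest, validate=True).decode("utf-8")
        except (ValueError, UnicodeDecodeError):
            return value  # not valid Basic credentials; pass through unchanged
        resolved = resolve_placeholders(decoded, secrets)
        encoded = base64.b64encode(resolved.encode("utf-8")).decode("ascii")
        return f"{scheme} {encoded}"
    return resolve_placeholders(value, secrets)
```

Basic credentials get special handling because git sends the token base64-encoded, so a plain string search on the header would never find the placeholder.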
+
+---
+
+## Strategy Comparison
+
+| Criterion | Strategy 1 (Sidecar) | Strategy 2 (Pod Runtime) | Strategy 3 (Proxy Only) |
+|-----------|---------------------|------------------------|------------------------|
+| Credential isolation | Full (placeholder/proxy) | Full (placeholder/proxy) | Partial (no netns enforcement) |
+| Network isolation | Full (netns + iptables) | Full (netns + iptables) | None |
+| Filesystem isolation | Landlock LSM | Landlock LSM | None |
+| Syscall filtering | seccomp-BPF | seccomp-BPF | None |
+| L7 inspection (OPA) | Yes | Yes | No |
+| Runner code changes | Moderate (`auth.py`, `_grpc_client.py`, `bridge.py`, `Dockerfile`) | None | Small (new proxy module) |
+| Operator changes | Moderate (provider + policy config) | Heavy (new compute driver) | None |
+| Kernel requirements | Linux 5.13+ (Landlock) | Linux 5.13+ (Landlock) | None |
+| OpenShell maturity dependency | Supervisor (stable) | K8s driver (alpha) | None (custom code) |
+| Defense depth | 5 layers | 5 layers | 1 layer |
+
+---
+
+## Recommendation
+
+**Strategy 1 (Sidecar Supervisor)** is the right path. It provides:
+
+- Agent never sees real secrets (even inspecting `/proc/self/environ` yields only placeholders)
+- L7 inspection via OPA policies (audit which APIs the agent calls)
+- Landlock + seccomp hardening within the container
+- Binary identity via SHA256 trust-on-first-use (TOFU): only known binaries can make network calls
+- The existing `secret_redaction.py` becomes a true defense-in-depth layer rather than the primary defense
+
+The critical architectural insight: OpenShell's credential proxy pattern eliminates the single point of failure in the current design. Today, `populate_runtime_credentials()` puts real secrets into a space the agent fully controls. OpenShell moves real secrets into Supervisor memory — a separate privilege domain the agent cannot access.
+
+### Prerequisite: Kernel Version
+
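+OpenShell's Landlock LSM requires Linux 5.13+ with Landlock enabled in the kernel. Containers share the kernel of the node they run on, so this requirement applies to the cluster nodes rather than to the runner image's UBI 10 (RHEL 10) userland. RHEL 10 ships kernel 6.x, so nodes running RHEL 10 satisfy the version requirement. OpenShell's `best_effort` Landlock mode also provides graceful degradation if the kernel lacks support.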
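
A container uses the kernel of the node it runs on, so a preflight check has to look at the running kernel rather than at the image. The sketch below is one hypothetical way to do that. The helper names are illustrative, the version comparison is only a heuristic (vendor kernels backport features), and `/sys/kernel/security/lsm` may not be readable from inside a container.

```python
import platform
import re
from pathlib import Path

LANDLOCK_MIN_KERNEL = (5, 13)
LSM_LIST_PATH = Path("/sys/kernel/security/lsm")  # comma-separated list of active LSMs


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '5.14.0-427.el9.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_new_enough(release=None):
    """Check the running (or given) kernel release against the Landlock minimum."""
    release = platform.release() if release is None else release
    return parse_kernel_release(release) >= LANDLOCK_MIN_KERNEL


def landlock_enabled(lsm_list_path=LSM_LIST_PATH):
    """Return True or False if the active LSM list is readable, None if it is not."""
    try:
        active = lsm_list_path.read_text().strip().split(",")
    except OSError:
        return None
    return "landlock" in active
```

A `None` result from `landlock_enabled()` means "unknown", not "disabled".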
+
+### Migration Path
+
+1. **Phase 1 — Credential proxy only (Strategy 3):** Ship a Python-only credential proxy as a proof of concept. This validates that the placeholder/rewrite pattern works with the git credential helper, the `gh` wrapper, and the Claude CLI without requiring the OpenShell binary.
+
+2. **Phase 2 — Sidecar Supervisor (Strategy 1):** Add OpenShell Supervisor binary, network namespace isolation, Landlock, and seccomp. This is the production target.
+
+3. **Phase 3 — OPA policies:** Add L7 inspection with per-session OPA policies generated by the Operator from the session's credential bindings and project settings.