Commit 597cc4c

user and claude committed
spec(runner): sync spec with code drift, add OpenShell desired state
- Fix source layout: add model.py, observability files, fixtures/, remove duplicate workspace.py
- Document AGUI_TOKEN session auth middleware and SDK_OPTIONS env var
- Document runtime model switching via POST /model
- Add 'Desired State: OpenShell Credential Isolation' section with migration path

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 233a2cc commit 597cc4c

3 files changed

Lines changed: 589 additions & 2 deletions

Lines changed: 270 additions & 0 deletions
@@ -0,0 +1,270 @@
1+
# Adapting ambient-runner to Use OpenShell
2+
3+
> Analysis date: 2026-06-03
4+
> Companion doc: [OpenShell Security Model Analysis](openshell-security-analysis.md)
5+
> Target component: `components/runners/ambient-runner/ambient_runner/`
6+
7+
---
8+
9+
## Current Runner Credential Model (The Problem)
10+
11+
The runner puts **real secrets directly into `os.environ`** and the agent's process memory. If the agent inspects its own environment, it sees real credentials.
12+
13+
### How Secrets Flow Today
14+
| Mechanism | File | What Happens |
|-----------|------|-------------|
| `populate_runtime_credentials()` | `platform/auth.py` | Fetches real tokens from backend API, writes them into `os.environ`: `GITHUB_TOKEN`, `GITLAB_TOKEN`, `JIRA_API_TOKEN`, `ANTHROPIC_API_KEY`, `CODERABBIT_API_KEY`, etc. |
| Token files on disk | `platform/auth.py` | Writes real tokens to `/tmp/.ambient_github_token`, `/tmp/.ambient_gitlab_token`, `/tmp/.ambient_kubeconfig` for the git credential helper and `gh` wrapper |
| Git credential helper | `platform/auth.py` | Shell script at `/tmp/git-credential-ambient` reads the real token from temp file and pipes it to git |
| `gh` CLI wrapper | `platform/auth.py` | Shell script reads real GitHub token from file, exports `GH_TOKEN`, then exec's the real `gh` |
| Secret redaction middleware | `middleware/secret_redaction.py` | Post-hoc defense: scrubs secrets from *outbound AG-UI events* only; the agent process still has full access to real secrets in memory and on disk |
22+
23+
### The Gap
24+
```
Agent reads /proc/self/environ → sees GITHUB_TOKEN=ghp_real_secret
Agent runs: cat /tmp/.ambient_* → sees real tokens
Agent runs: echo $ANTHROPIC_API_KEY → sees real API key
```
30+
31+
The redaction middleware protects the *output stream* (events sent to the frontend), not the agent itself. A compromised or misbehaving agent has unrestricted access to all credentials.
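
To make the gap concrete, here is a minimal sketch of a child process inheriting a secret from its parent's environment while redaction only changes what is forwarded afterwards. The token value and the helpers `redact` and `run_agent_command` are illustrative stand-ins, not the runner's actual code.

```python
import os
import subprocess
import sys

FAKE_TOKEN = "ghp_example_not_a_real_token"  # illustrative value only


def redact(text: str, secrets: list[str]) -> str:
    """Stand-in for output redaction: mask known secret values in a string."""
    for secret in secrets:
        text = text.replace(secret, "[REDACTED]")
    return text


def run_agent_command(env: dict[str, str]) -> str:
    """Run a child process that prints GITHUB_TOKEN from its own environment."""
    code = "import os; print(os.environ.get('GITHUB_TOKEN', ''))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


child_env = {**os.environ, "GITHUB_TOKEN": FAKE_TOKEN}
seen_by_agent = run_agent_command(child_env)            # the child holds the real value
seen_by_frontend = redact(seen_by_agent, [FAKE_TOKEN])  # only the forwarded text is scrubbed
```

The child process receives the unredacted value regardless of what the redaction step does to the forwarded output, which is the exposure the strategies below aim to close.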
32+
33+
---
34+
35+
## OpenShell Integration Strategies
36+
37+
### Strategy 1: OpenShell Supervisor wrapping Claude CLI (Recommended)
38+
39+
Replace the runner container's direct credential injection with OpenShell's Supervisor wrapping the Claude CLI subprocess. The Supervisor is **not** a sidecar container — it is a binary invoked by `bridge.py` that fork/execs the Claude CLI, applying Landlock, seccomp, and netns isolation in the `pre_exec` closure (after fork, before exec). This gives the Supervisor control over the agent's process setup, which a separate sidecar container cannot achieve.
40+
41+
#### What Changes
42+
| Component | Current | With OpenShell |
|-----------|---------|---------------|
| `auth.py:populate_runtime_credentials()` | Sets `os.environ["GITHUB_TOKEN"] = real_token` | Sets `os.environ["GITHUB_TOKEN"] = "openshell:resolve:env:GITHUB_TOKEN"` |
| Token files (`/tmp/.ambient_*`) | Contain real tokens | Contain placeholder strings |
| Git credential helper | Reads real token from file | Reads placeholder; OpenShell proxy rewrites on outbound |
| `gh` wrapper | Exports real `GH_TOKEN` | Exports placeholder; proxy rewrites |
| Network egress | Direct to `api.github.com`, etc. | Via OpenShell HTTP CONNECT proxy at `10.200.0.1:3128` |
| `secret_redaction.py` | Primary defense for output stream | Redundant but kept as defense-in-depth |
| `_grpc_client.py` | Direct gRPC to API server | Whitelisted in network policy (intra-cluster) |
| Claude CLI subprocess | Full env access with real secrets | Runs in sandbox netns with placeholders only |
53+
54+
#### Implementation Steps
55+
56+
**1. New OpenShell provider type**
57+
58+
Register Ambient's credential store as an OpenShell provider. The Operator creates a provider config that maps each credential type (github, gitlab, jira, etc.) to the corresponding backend API credential endpoint. Two options:
59+
60+
- OpenShell's Gateway calls the Ambient backend to fetch the real token on demand
61+
- The Operator pre-populates the provider at pod creation time (simpler, no Gateway dependency)
62+
63+
**2. Modify `platform/auth.py`**
64+
65+
Replace `populate_runtime_credentials()` with a version that writes placeholders instead of real values:
66+
```python
# Before (current)
os.environ["GITHUB_TOKEN"] = github_creds["token"]  # real secret
_GITHUB_TOKEN_FILE.write_text(github_creds["token"])  # real secret on disk

# After (with OpenShell)
os.environ["GITHUB_TOKEN"] = "openshell:resolve:env:GITHUB_TOKEN"  # placeholder
_GITHUB_TOKEN_FILE.write_text("openshell:resolve:env:GITHUB_TOKEN")  # placeholder
# Real secret held only in Supervisor memory → proxy rewrites on outbound
```
77+
78+
The same pattern applies to all HTTP-based credential types: `GITLAB_TOKEN`, `JIRA_API_TOKEN`, `ANTHROPIC_API_KEY`, `CODERABBIT_API_KEY`.
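
A minimal sketch of how that pattern could be generalized across the listed variables follows. The placeholder format is copied from the snippet above; the helper names `placeholder_for` and `placeholder_env` are hypothetical and do not exist in `platform/auth.py`.

```python
PLACEHOLDER_PREFIX = "openshell:resolve:env:"  # format taken from the snippet above

# HTTP-based credential variables named in this document.
HTTP_CREDENTIALS = (
    "GITHUB_TOKEN",
    "GITLAB_TOKEN",
    "JIRA_API_TOKEN",
    "ANTHROPIC_API_KEY",
    "CODERABBIT_API_KEY",
)


def placeholder_for(name: str) -> str:
    """Return the placeholder string that stands in for the named secret."""
    return f"{PLACEHOLDER_PREFIX}{name}"


def placeholder_env(names: tuple[str, ...] = HTTP_CREDENTIALS) -> dict[str, str]:
    """Build an environment mapping that contains placeholders only."""
    return {name: placeholder_for(name) for name in names}
```

With this shape, `os.environ.update(placeholder_env())` would replace the per-variable assignments, and no real value would pass through the function at all.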
79+
80+
> **HTTP-only limitation:** The placeholder/proxy pattern works at the HTTP layer only. The proxy rewrites `Authorization: Bearer openshell:resolve:env:GITHUB_TOKEN` in HTTP requests, but cannot intercept credential usage in non-HTTP contexts. The git credential helper and `gh` wrapper work because git/gh ultimately make HTTPS requests that pass through the proxy. However, SSH-based git auth, kubeconfig client certificates, and any non-HTTP protocol would receive the placeholder string verbatim. Future credential types using non-HTTP protocols will need a different isolation approach (e.g., agent-side socket forwarding or dedicated MCP tools).
81+
>
82+
> Current credential types and their compatibility:
> - `GITHUB_TOKEN`: HTTP-based, works with proxy rewrite
> - `GITLAB_TOKEN`: HTTP-based, works with proxy rewrite
> - `JIRA_API_TOKEN`: HTTP-based, works with proxy rewrite
> - `ANTHROPIC_API_KEY`: HTTP-based, works with proxy rewrite
> - `CODERABBIT_API_KEY`: HTTP-based, works with proxy rewrite
> - `KUBECONFIG`: **Mixed.** API server calls are HTTPS, so token-based kubeconfig auth works. Client certificate auth embeds the certificate and key in the kubeconfig file and presents them during the TLS handshake, so a placeholder cannot be rewritten for cert-based auth.
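
The header rewrite described in this note can be sketched as a pure function. This is an illustration of the idea under stated assumptions, not OpenShell's proxy code: `rewrite_headers` and the `secrets` mapping are hypothetical names, and the sketch only handles placeholders that appear verbatim in a header value. HTTP Basic credentials are base64-encoded, so a real proxy would have to decode them before matching; how OpenShell handles that case is not covered here.

```python
import re

# Matches the placeholder format used in this document and captures the variable name.
PLACEHOLDER = re.compile(r"openshell:resolve:env:([A-Za-z_][A-Za-z0-9_]*)")


def rewrite_headers(headers: dict[str, str], secrets: dict[str, str]) -> dict[str, str]:
    """Return a copy of headers with known placeholders replaced by real values."""

    def resolve(match: re.Match) -> str:
        # Unknown names are left untouched, so upstream sees the placeholder verbatim.
        return secrets.get(match.group(1), match.group(0))

    return {key: PLACEHOLDER.sub(resolve, value) for key, value in headers.items()}
```

Because the substitution happens on header text, anything that never becomes an HTTP header (an SSH key exchange, a TLS client certificate) is out of its reach, which is the limitation the note describes.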
89+
90+
**3. Modify the Dockerfile**
91+
92+
Add OpenShell Supervisor binary. The runner (uvicorn) starts normally; the Supervisor is invoked by `bridge.py` when launching the Claude CLI subprocess:
93+
```dockerfile
# Add OpenShell binary
COPY --from=openshell/supervisor:latest /usr/bin/openshell-sandbox /usr/bin/openshell-sandbox

# Entrypoint unchanged; uvicorn runs unsandboxed:
CMD ["/bin/bash", "-c", "umask 0022 && cd /app/ambient-runner && uvicorn main:app --host 0.0.0.0 --port 8001"]
```
101+
102+
The Supervisor wraps only the Claude CLI subprocess (launched from `bridges/claude/bridge.py`), applying Landlock + seccomp + netns to the agent process. The runner itself (FastAPI/uvicorn, gRPC client, credential fetching) runs outside the sandbox boundary.
103+
104+
> **Capability requirement:** The Supervisor needs `NET_ADMIN` capability to create the network namespace (`unshare(CLONE_NEWNET)`) and set up the veth pair that routes agent traffic through `10.200.0.1:3128`. Without `CLONE_NEWNET`, placeholders will be sent as-is to upstream APIs — the proxy has no way to intercept requests outside its network namespace. The Operator must add `NET_ADMIN` to the runner container's `securityContext.capabilities.add`.
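
Since a missing capability would silently degrade to placeholders being sent upstream, the runner could check for it before launching the agent. The sketch below parses the `CapEff` bitmask from `/proc/self/status`; the function names are hypothetical, and the only external fact relied on is that `CAP_NET_ADMIN` is capability number 12 in the Linux capability list.

```python
CAP_NET_ADMIN = 12  # capability number from linux/capability.h


def effective_caps_from_status(status_text: str) -> str:
    """Extract the hexadecimal CapEff value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split(":", 1)[1].strip()
    raise ValueError("CapEff line not found in status text")


def has_capability(cap_eff_hex: str, cap_bit: int = CAP_NET_ADMIN) -> bool:
    """Return True if the given capability bit is set in the effective set."""
    return bool(int(cap_eff_hex, 16) & (1 << cap_bit))
```

A caller would read `/proc/self/status`, pass its text to `effective_caps_from_status`, and refuse to start the agent if `has_capability` returns `False`.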
105+
106+
**4. Network policy via OpenShell**
107+
108+
Replace the K8s `NetworkPolicy` with OpenShell's per-sandbox network namespace + OPA policy:
109+
```yaml
network_policies:
  ambient_backend:
    name: ambient-backend-access
    endpoints:
      - host: backend-service.ambient-code.svc.cluster.local
        port: 8080
        protocol: rest
        access: read-write
    binaries:
      - { path: /usr/bin/python3 }

  ambient_grpc:
    name: ambient-grpc-access
    endpoints:
      - host: ambient-api-server.ambient-code.svc.cluster.local
        port: 9000
        protocol: connect
        access: read-write
    binaries:
      - { path: /usr/bin/python3 }

  github_api:
    name: github-api-access
    endpoints:
      - host: api.github.com
        port: 443
        protocol: rest
        access: read-write

  anthropic_api:
    name: anthropic-api-access
    endpoints:
      - host: api.anthropic.com
        port: 443
        protocol: rest
        access: read-write

  gitlab_api:
    name: gitlab-api-access
    endpoints:
      - host: "*.gitlab.com"
        port: 443
        protocol: rest
        access: read-write
```
156+
157+
**5. `_grpc_client.py` — No changes needed**
158+
159+
The gRPC channel to the API server is established by the runner process, which runs outside the OpenShell sandbox boundary. Since only the Claude CLI subprocess is sandboxed, the gRPC client is unaffected.
160+
161+
**6. Modify `bridges/claude/bridge.py`**
162+
163+
The bridge launches Claude CLI via the Supervisor binary instead of directly. The Supervisor fork/execs the agent process, applying sandbox restrictions in the `pre_exec` closure:
164+
```python
# Before (current)
subprocess.Popen(["claude", "--sdk", ...], env=agent_env)

# After (with OpenShell)
subprocess.Popen(
    ["openshell-sandbox", "--provider", "ambient", "--", "claude", "--sdk", ...],
    env=agent_env
)
```
175+
176+
The Supervisor owns the agent's process lifecycle — it creates the netns, applies Landlock/seccomp, drops privileges, then execs the Claude CLI. `HTTP_PROXY`/`HTTPS_PROXY` are injected automatically by the Supervisor into the sandboxed process environment.
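
One way the bridge could assemble that command line is sketched below. The `--provider` flag and the `--` separator are copied from the snippet above and have not been checked against the OpenShell CLI; `build_agent_argv` is a hypothetical helper. The sketch fails closed: if the sandbox binary cannot be found, it raises instead of falling back to an unsandboxed launch.

```python
import shutil
from typing import Optional

SANDBOX_BINARY = "openshell-sandbox"  # binary name used in this document


def build_agent_argv(
    claude_argv: list[str],
    provider: str = "ambient",
    sandbox_path: Optional[str] = None,
) -> list[str]:
    """Prefix the Claude CLI argv with the Supervisor invocation."""
    path = sandbox_path or shutil.which(SANDBOX_BINARY)
    if path is None:
        # Refuse to run the agent without the sandbox wrapper.
        raise FileNotFoundError(f"{SANDBOX_BINARY} not found on PATH")
    return [path, "--provider", provider, "--", *claude_argv]
```

The returned list can be passed directly to `subprocess.Popen` together with the placeholder-only `agent_env`.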
177+
178+
**7. Operator changes**
179+
180+
The Operator (`components/operator/`) configures OpenShell provider + policy per session Job:
181+
182+
- Register the Ambient provider via OpenShell's **gRPC-only** Gateway API (`openshell.v1.OpenShell` service — `CreateProvider`, `SetClusterInference`). There are no REST equivalents; the Gateway multiplexes gRPC and HTTP on port 8080, but provider/inference management is exclusively gRPC. Proto definitions: `proto/openshell.proto`, `proto/inference.proto` in the OpenShell upstream repo.
183+
- Add `NET_ADMIN` capability to the runner container's `securityContext` (required for Supervisor to create network namespace)
184+
- Generate per-session OPA policies based on the session's credential bindings
185+
- Pass the policy YAML as a volume mount
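
Generating a per-session policy from credential bindings might look like the following sketch. The hosts and policy names are copied from the network policy example in step 4; the binding keys (`github`, `gitlab`, `anthropic`) and the function `build_network_policies` are assumptions made for illustration, and the Operator itself is not necessarily written in Python.

```python
# binding key -> (policy key, policy name, host); hosts come from the step 4 example.
ENDPOINTS = {
    "github": ("github_api", "github-api-access", "api.github.com"),
    "gitlab": ("gitlab_api", "gitlab-api-access", "*.gitlab.com"),
    "anthropic": ("anthropic_api", "anthropic-api-access", "api.anthropic.com"),
}


def build_network_policies(bindings: list[str]) -> dict:
    """Build a network_policies document covering only the session's bindings."""
    policies = {}
    for binding in bindings:
        if binding not in ENDPOINTS:
            raise ValueError(f"no endpoint mapping for credential binding: {binding}")
        key, name, host = ENDPOINTS[binding]
        policies[key] = {
            "name": name,
            "endpoints": [
                {"host": host, "port": 443, "protocol": "rest", "access": "read-write"}
            ],
        }
    return {"network_policies": policies}
```

A session bound only to GitHub would then get a policy with a single `github_api` entry, so hosts for credentials the session does not hold never appear in its policy.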
186+
187+
#### Files to Modify
188+
| File | Change |
|------|--------|
| `platform/auth.py` | `populate_runtime_credentials()` writes placeholders, not real tokens |
| `platform/auth.py` | Token files (`/tmp/.ambient_*`) get placeholder values |
| `platform/auth.py` | `install_git_credential_helper()`: helper returns placeholder; proxy rewrites |
| `platform/auth.py` | `install_gh_wrapper()`: wrapper exports placeholder `GH_TOKEN` |
| `_grpc_client.py` | No changes needed: gRPC runs in the runner process, outside the Claude subprocess sandbox boundary |
| `Dockerfile` | Add OpenShell Supervisor binary (entrypoint unchanged) |
| `bridges/claude/bridge.py` | Launch Claude CLI via `openshell-sandbox` binary; Supervisor fork/execs with sandbox pre_exec |
| `middleware/secret_redaction.py` | Keep as defense-in-depth (now truly redundant) |
| `components/operator/` | Configure OpenShell provider via gRPC Gateway API; add `NET_ADMIN` capability; generate per-session OPA policies |
200+
201+
---
202+
203+
### Strategy 2: OpenShell as Pod Runtime (Operator-Level)
204+
205+
The Operator spawns Jobs using an OpenShell-managed container runtime instead of raw K8s containers. The integration moves up a level — runner code doesn't change, but the Operator configures OpenShell as the execution environment.
206+
207+
**Pros:** Zero runner code changes.
208+
209+
**Cons:** Requires OpenShell's Kubernetes compute driver to be production-ready (currently alpha). Heavier Operator changes. Less control over per-session policy granularity from the runner's perspective.
210+
211+
---
212+
213+
### Strategy 3: OpenShell Provider Bridge (Minimal, Credential-Only)
214+
215+
Adopt only the credential placeholder/proxy pattern without the full sandbox. Write a thin Python adapter that:
216+
217+
1. Starts a local HTTP CONNECT proxy in the runner pod
218+
2. Holds real secrets in proxy memory (separate process, higher privilege)
219+
3. Injects placeholders into `os.environ`
220+
4. Rewrites placeholders to real values on outbound requests
221+
222+
**Pros:** No Rust dependency, no kernel features (Landlock/seccomp) needed. Works on any kernel version. Smallest change surface.
223+
224+
**Cons:** No Landlock/seccomp/netns isolation — only credential isolation. Agent can still bypass the proxy if it makes raw socket calls (no network namespace enforcement). No L7 inspection or OPA policy evaluation.
225+
226+
---
227+
228+
## Strategy Comparison
229+
230+
| Criterion | Strategy 1 (Sidecar) | Strategy 2 (Pod Runtime) | Strategy 3 (Proxy Only) |
231+
|-----------|---------------------|------------------------|------------------------|
232+
| Credential isolation | Full (placeholder/proxy) | Full (placeholder/proxy) | Partial (no netns enforcement) |
233+
| Network isolation | Full (netns + iptables) | Full (netns + iptables) | None |
234+
| Filesystem isolation | Landlock LSM | Landlock LSM | None |
235+
| Syscall filtering | seccomp-BPF | seccomp-BPF | None |
236+
| L7 inspection (OPA) | Yes | Yes | No |
237+
| Runner code changes | Moderate (`auth.py`, `bridge.py`, `Dockerfile`) | None | Small (new proxy module) |
238+
| Operator changes | Moderate (provider + policy config) | Heavy (new compute driver) | None |
239+
| Kernel requirements | Linux 5.13+ (Landlock) | Linux 5.13+ (Landlock) | None |
240+
| OpenShell maturity dependency | Supervisor (stable) | K8s driver (alpha) | None (custom code) |
241+
| Container capability requirement | `NET_ADMIN` (for netns setup) | Depends on runtime | None |
242+
| Gateway API protocol | gRPC only (`openshell.v1.OpenShell`) | gRPC only | N/A |
243+
| Credential protocol support | HTTP-only (placeholder/proxy rewrite) | HTTP-only | HTTP-only |
244+
| Defense depth | 5 layers | 5 layers | 1 layer |
245+
246+
---
247+
248+
## Recommendation
249+
250+
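**Strategy 1 (Supervisor wrapping the Claude CLI)** is the right path. As noted above, the Supervisor is a binary invoked by `bridge.py`, not a sidecar container. It provides: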
251+
252+
- Agent never sees real secrets (inspecting `/proc/self/environ` reveals only placeholders)
253+
- L7 inspection via OPA policies (audit which APIs the agent calls)
254+
- Landlock + seccomp hardening within the container
255+
- Binary identity via SHA256 TOFU (only known binaries can make network calls)
256+
- The existing `secret_redaction.py` becomes a true defense-in-depth layer rather than the primary defense
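
The trust-on-first-use idea behind the binary identity bullet can be illustrated with a short sketch: the first digest seen for a path is pinned, and any later mismatch is rejected. This shows the general TOFU technique only; it is not OpenShell's implementation, and `sha256_of` and `check_tofu` are hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_tofu(path: Path, pins: dict[str, str]) -> bool:
    """Pin the digest on first use; afterwards accept only a matching digest."""
    current = sha256_of(path)
    pinned = pins.setdefault(str(path), current)  # first call stores the digest
    return pinned == current
```

A binary that is replaced after its first network call would produce a different digest and fail the check, which is what limits network access to known binaries.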
257+
258+
The critical architectural insight: OpenShell's credential proxy pattern eliminates the single point of failure in the current design. Today, `populate_runtime_credentials()` puts real secrets into a space the agent fully controls. OpenShell moves real secrets into Supervisor memory — a separate privilege domain the agent cannot access.
259+
260+
### Prerequisite: Kernel Version
261+
262+
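OpenShell's use of the Landlock LSM requires Linux 5.13+. Containers share the kernel of the node they run on, so this requirement applies to the cluster nodes rather than to the UBI 10 base image of the runner containers. Nodes running RHEL 10 ship kernel 6.x, which satisfies it. OpenShell's `best_effort` Landlock mode also provides graceful degradation if the kernel lacks support.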
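
A startup check for this prerequisite could be sketched as follows. The helper names are hypothetical. Note that a version check is only a first filter: Landlock must also be compiled in and enabled in the kernel's LSM list, which this sketch does not verify.

```python
import re

LANDLOCK_MIN_VERSION = (5, 13)  # Landlock was merged in Linux 5.13


def kernel_version(release: str) -> tuple[int, int]:
    """Parse the major and minor version from a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def supports_landlock(release: str) -> bool:
    """Return True if the kernel version is at least 5.13."""
    return kernel_version(release) >= LANDLOCK_MIN_VERSION
```

In the runner this would typically be called as `supports_landlock(platform.release())`, which reports the node's kernel even when called from inside a container.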
263+
264+
### Migration Path
265+
266+
1. **Phase 1: Credential proxy only (Strategy 3).** Ship a Python-only credential proxy as a proof of concept. This validates that the placeholder/rewrite pattern works with the git credential helper, the `gh` wrapper, and the Claude CLI without requiring the OpenShell binary.

2. **Phase 2: Supervisor wrapper (Strategy 1).** Add the OpenShell Supervisor binary, network namespace isolation, Landlock, and seccomp. This is the production target.

3. **Phase 3: OPA policies.** Add L7 inspection with per-session OPA policies generated by the Operator from the session's credential bindings and project settings.
