Commit fc5c3e2

user and claude committed
spec(runner): sync spec with code drift, add OpenShell desired state
- Fix source layout: add model.py, observability files, fixtures/, remove duplicate workspace.py
- Document AGUI_TOKEN session auth middleware and SDK_OPTIONS env var
- Document runtime model switching via POST /model
- Add 'Desired State: OpenShell Credential Isolation' section with migration path

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 233a2cc commit fc5c3e2

3 files changed

Lines changed: 561 additions & 2 deletions

Lines changed: 243 additions & 0 deletions
@@ -0,0 +1,243 @@
# Adapting ambient-runner to Use OpenShell

> Analysis date: 2026-06-03
> Companion doc: [OpenShell Security Model Analysis](openshell-security-analysis.md)
> Target component: `components/runners/ambient-runner/ambient_runner/`

---

## Current Runner Credential Model (The Problem)

The runner puts **real secrets directly into `os.environ`** and the agent's process memory. If the agent inspects its own environment, it sees real credentials.

### How Secrets Flow Today

| Mechanism | File | What Happens |
|-----------|------|-------------|
| `populate_runtime_credentials()` | `platform/auth.py` | Fetches real tokens from backend API, writes them into `os.environ`: `GITHUB_TOKEN`, `GITLAB_TOKEN`, `JIRA_API_TOKEN`, `ANTHROPIC_API_KEY`, `CODERABBIT_API_KEY`, etc. |
| Token files on disk | `platform/auth.py` | Writes real tokens to `/tmp/.ambient_github_token`, `/tmp/.ambient_gitlab_token`, `/tmp/.ambient_kubeconfig` for the git credential helper and `gh` wrapper |
| Git credential helper | `platform/auth.py` | Shell script at `/tmp/git-credential-ambient` reads the real token from temp file and pipes it to git |
| `gh` CLI wrapper | `platform/auth.py` | Shell script reads real GitHub token from file, exports `GH_TOKEN`, then exec's the real `gh` |
| Secret redaction middleware | `middleware/secret_redaction.py` | Post-hoc defense: scrubs secrets from *outbound AG-UI events* only; the agent process still has full access to real secrets in memory and on disk |

### The Gap

```
Agent reads /proc/self/environ → sees GITHUB_TOKEN=ghp_real_secret
Agent runs: cat /tmp/.ambient_* → sees real tokens
Agent runs: echo $ANTHROPIC_API_KEY → sees real API key
```
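The three observations above can be reproduced from inside any process that shares the runner's environment and filesystem. The following is a minimal sketch, not runner code: the helper names are hypothetical, and the variable list is taken from the table in "How Secrets Flow Today" and is not exhaustive.

```python
import glob
import os
from typing import Mapping

# Names copied from the credential table; illustrative, not exhaustive.
CREDENTIAL_VARS = (
    "GITHUB_TOKEN",
    "GITLAB_TOKEN",
    "JIRA_API_TOKEN",
    "ANTHROPIC_API_KEY",
    "CODERABBIT_API_KEY",
)


def exposed_env_credentials(environ: Mapping[str, str]) -> list[str]:
    """Return the credential variable names that hold a non-empty value."""
    return [name for name in CREDENTIAL_VARS if environ.get(name)]


def exposed_token_files(pattern: str = "/tmp/.ambient_*") -> list[str]:
    """Return files matching the pattern that the current user can read."""
    return sorted(p for p in glob.glob(pattern) if os.access(p, os.R_OK))
```

Calling `exposed_env_credentials(os.environ)` and `exposed_token_files()` from the agent's process would list exactly what the agent can reach, regardless of what the redaction middleware later removes from outbound events.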

The redaction middleware protects the *output stream* (events sent to the frontend), not the agent itself. A compromised or misbehaving agent has unrestricted access to all credentials.

---

## OpenShell Integration Strategies

### Strategy 1: OpenShell as Sidecar Supervisor (Recommended)

Replace the runner container's direct credential injection with OpenShell's Supervisor running as a sidecar (or init container + persistent process) in the same pod.

#### What Changes

| Component | Current | With OpenShell |
|-----------|---------|---------------|
| `auth.py:populate_runtime_credentials()` | Sets `os.environ["GITHUB_TOKEN"] = real_token` | Sets `os.environ["GITHUB_TOKEN"] = "openshell:resolve:env:GITHUB_TOKEN"` |
| Token files (`/tmp/.ambient_*`) | Contain real tokens | Contain placeholder strings |
| Git credential helper | Reads real token from file | Reads placeholder; OpenShell proxy rewrites on outbound |
| `gh` wrapper | Exports real `GH_TOKEN` | Exports placeholder; proxy rewrites |
| Network egress | Direct to `api.github.com`, etc. | Via OpenShell HTTP CONNECT proxy at `10.200.0.1:3128` |
| `secret_redaction.py` | Primary defense for output stream | Redundant but kept as defense-in-depth |
| `_grpc_client.py` | Direct gRPC to API server | Whitelisted in network policy (intra-cluster) |
| Claude CLI subprocess | Full env access with real secrets | Runs in sandbox netns with placeholders only |

#### Implementation Steps

**1. New OpenShell provider type**

Register Ambient's credential store as an OpenShell provider. The Operator creates a provider config that maps each credential type (github, gitlab, jira, etc.) to the corresponding backend API credential endpoint. Two options:

- OpenShell's Gateway calls the Ambient backend to fetch the real token on demand
- The Operator pre-populates the provider at pod creation time (simpler, no Gateway dependency)

**2. Modify `platform/auth.py`**

Replace `populate_runtime_credentials()` with a version that writes placeholders instead of real values:

```python
# Before (current)
os.environ["GITHUB_TOKEN"] = github_creds["token"]  # real secret
_GITHUB_TOKEN_FILE.write_text(github_creds["token"])  # real secret on disk

# After (with OpenShell)
os.environ["GITHUB_TOKEN"] = "openshell:resolve:env:GITHUB_TOKEN"  # placeholder
_GITHUB_TOKEN_FILE.write_text("openshell:resolve:env:GITHUB_TOKEN")  # placeholder
# Real secret held only in Supervisor memory → proxy rewrites on outbound
```
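The before/after snippet can be generalized into a small helper. This is a sketch under one assumption: that every credential uses the `openshell:resolve:env:<NAME>` form shown for `GITHUB_TOKEN`. The function names are hypothetical and do not exist in `platform/auth.py`.

```python
from typing import Iterable, MutableMapping

# Placeholder prefix copied from the snippet above; assumed to be uniform
# across credential types.
PLACEHOLDER_PREFIX = "openshell:resolve:env:"


def placeholder_for(name: str) -> str:
    """Build the placeholder string for one environment variable name."""
    return f"{PLACEHOLDER_PREFIX}{name}"


def populate_placeholders(environ: MutableMapping[str, str], names: Iterable[str]) -> None:
    """Write a placeholder, never a real value, for each named variable."""
    for name in names:
        environ[name] = placeholder_for(name)


def is_placeholder(value: str) -> bool:
    """Report whether a value looks like a placeholder rather than a secret."""
    return value.startswith(PLACEHOLDER_PREFIX)
```

A check such as `is_placeholder(os.environ["GITHUB_TOKEN"])` could serve as a startup assertion that no real token reached the agent's environment.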

The same pattern applies to all credential types: `GITLAB_TOKEN`, `JIRA_API_TOKEN`, `ANTHROPIC_API_KEY`, `CODERABBIT_API_KEY`, `KUBECONFIG`.

**3. Modify the Dockerfile**

Add the OpenShell Supervisor binary. The runner entrypoint is wrapped with `openshell-sandbox`:

```dockerfile
# Add OpenShell binary
COPY --from=openshell/supervisor:latest /usr/bin/openshell-sandbox /usr/bin/openshell-sandbox

# Entrypoint becomes:
CMD ["openshell-sandbox", "--provider", "ambient", "--", \
     "/bin/bash", "-c", "umask 0022 && cd /app/ambient-runner && uvicorn main:app --host 0.0.0.0 --port 8001"]
```

The Supervisor wraps the uvicorn process, applying Landlock + seccomp + netns before exec.

**4. Network policy via OpenShell**

Replace the K8s `NetworkPolicy` with OpenShell's per-sandbox network namespace + OPA policy:

```yaml
network_policies:
  ambient_backend:
    name: ambient-backend-access
    endpoints:
      - host: backend-service.ambient-code.svc.cluster.local
        port: 8080
        protocol: rest
        access: read-write
    binaries:
      - { path: /usr/bin/python3 }

  ambient_grpc:
    name: ambient-grpc-access
    endpoints:
      - host: ambient-api-server.ambient-code.svc.cluster.local
        port: 9000
        protocol: connect
        access: read-write
    binaries:
      - { path: /usr/bin/python3 }

  github_api:
    name: github-api-access
    endpoints:
      - host: api.github.com
        port: 443
        protocol: rest
        access: read-write

  anthropic_api:
    name: anthropic-api-access
    endpoints:
      - host: api.anthropic.com
        port: 443
        protocol: rest
        access: read-write

  gitlab_api:
    name: gitlab-api-access
    endpoints:
      - host: "*.gitlab.com"
        port: 443
        protocol: rest
        access: read-write
```

**5. Modify `_grpc_client.py`**

The gRPC channel to the API server needs to be whitelisted in OpenShell's network policy. Since it's intra-cluster, it routes through the proxy with credential rewriting. The `_build_channel()` function may need proxy-awareness if OpenShell's netns routes all TCP through the CONNECT proxy.

**6. Modify `bridges/claude/bridge.py`**

Ensure `HTTP_PROXY`/`HTTPS_PROXY` reach the Claude CLI subprocess so it routes through the OpenShell proxy. OpenShell injects these variables automatically when the sandbox starts, so the bridge only needs to pass them through to the subprocess env.
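Passing the proxy settings through can be sketched as below. This assumes the bridge builds the subprocess environment from an explicit base mapping rather than inheriting `os.environ` wholesale. `build_subprocess_env` is a hypothetical helper, not an existing function in `bridge.py`, and the lowercase variants are included only because many HTTP clients read them.

```python
from typing import Mapping

# Proxy variables to forward; upper- and lowercase spellings are both common.
PROXY_VARS = (
    "HTTP_PROXY",
    "HTTPS_PROXY",
    "NO_PROXY",
    "http_proxy",
    "https_proxy",
    "no_proxy",
)


def build_subprocess_env(parent: Mapping[str, str], base: Mapping[str, str]) -> dict[str, str]:
    """Copy `base` and add only the proxy variables that exist in `parent`."""
    env = dict(base)
    for name in PROXY_VARS:
        if name in parent:
            env[name] = parent[name]
    return env
```

The result would be handed to the CLI launch, for example `subprocess.Popen(cmd, env=build_subprocess_env(os.environ, base_env))`. Forwarding an explicit allowlist rather than the whole parent environment keeps unrelated variables out of the subprocess.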

**7. Operator changes**

The Operator (`components/operator/`) configures OpenShell provider + policy per session Job:

- Inject OpenShell provider config as a ConfigMap or Secret
- Mount the Supervisor binary (or use a sidecar container)
- Generate per-session OPA policies based on the session's credential bindings
- Pass the policy YAML as a volume mount
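Generating a per-session policy from credential bindings could look like the following sketch. The Operator's real implementation language and the binding names (`github`, `gitlab`, `anthropic`) are assumptions; the hosts, ports, and field names are copied from the policy example in step 4.

```python
from typing import Iterable

# Endpoint data copied from the step 4 policy example. The mapping from a
# binding name to a policy entry is an assumption of this sketch.
KNOWN_ENDPOINTS = {
    "github": ("api.github.com", 443),
    "gitlab": ("*.gitlab.com", 443),
    "anthropic": ("api.anthropic.com", 443),
}


def build_network_policies(bindings: Iterable[str]) -> dict:
    """Return a policy document covering only the session's bindings."""
    policies = {}
    for binding in sorted(set(bindings)):
        if binding not in KNOWN_ENDPOINTS:
            raise ValueError(f"no endpoint known for binding {binding!r}")
        host, port = KNOWN_ENDPOINTS[binding]
        policies[f"{binding}_api"] = {
            "name": f"{binding}-api-access",
            "endpoints": [
                {"host": host, "port": port, "protocol": "rest", "access": "read-write"}
            ],
        }
    return {"network_policies": policies}
```

A session bound only to GitHub would then get a policy with a single `github_api` entry, so the sandbox could not reach hosts the session has no credential for. The resulting dictionary could be serialized to YAML before being mounted into the pod.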

#### Files to Modify

| File | Change |
|------|--------|
| `platform/auth.py` | `populate_runtime_credentials()` writes placeholders, not real tokens |
| `platform/auth.py` | Token files (`/tmp/.ambient_*`) get placeholder values |
| `platform/auth.py` | `install_git_credential_helper()`: helper returns placeholder; proxy rewrites |
| `platform/auth.py` | `install_gh_wrapper()`: wrapper exports placeholder `GH_TOKEN` |
| `_grpc_client.py` | Proxy-aware channel construction for intra-cluster gRPC |
| `Dockerfile` | Add OpenShell Supervisor binary, modify CMD |
| `bridges/claude/bridge.py` | Proxy env vars for Claude CLI subprocess |
| `middleware/secret_redaction.py` | Keep as defense-in-depth (now truly redundant) |
| `components/operator/` | Configure OpenShell provider + policy per session Job |

---

### Strategy 2: OpenShell as Pod Runtime (Operator-Level)

The Operator spawns Jobs using an OpenShell-managed container runtime instead of raw K8s containers. The integration moves up a level: runner code doesn't change, but the Operator configures OpenShell as the execution environment.

**Pros:** Zero runner code changes.

**Cons:** Requires OpenShell's Kubernetes compute driver to be production-ready (currently alpha). Heavier Operator changes. Less control over per-session policy granularity from the runner's perspective.

---

### Strategy 3: OpenShell Provider Bridge (Minimal, Credential-Only)

Adopt only the credential placeholder/proxy pattern without the full sandbox. Write a thin Python adapter that:

1. Starts a local HTTP CONNECT proxy in the runner pod
2. Holds real secrets in proxy memory (separate process, higher privilege)
3. Injects placeholders into `os.environ`
4. Rewrites placeholders to real values on outbound requests
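Step 4 of the list above, the rewrite itself, can be sketched as a pure function over request headers. This is an illustration only: `rewrite_placeholders` is a hypothetical name, the placeholder grammar is inferred from the single `openshell:resolve:env:GITHUB_TOKEN` example, and the proxy plumbing around it is omitted.

```python
import re
from typing import Mapping

# Placeholder grammar inferred from the one example in this document.
PLACEHOLDER = re.compile(r"openshell:resolve:env:([A-Z0-9_]+)")


def rewrite_placeholders(headers: Mapping[str, str], secrets: Mapping[str, str]) -> dict[str, str]:
    """Return a copy of `headers` with known placeholders replaced by real values."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Placeholders with no matching secret are passed through unchanged.
        return secrets.get(name, match.group(0))

    return {key: PLACEHOLDER.sub(substitute, value) for key, value in headers.items()}
```

Two caveats apply to any real adapter. A plain CONNECT tunnel carries encrypted TLS traffic, so the proxy can only see and rewrite headers if it terminates TLS itself. Git over HTTPS sends credentials as HTTP Basic auth, which is base64-encoded, so a plain string substitution like the one above would not match there without decoding first.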
197+
198+
**Pros:** No Rust dependency, no kernel features (Landlock/seccomp) needed. Works on any kernel version. Smallest change surface.
199+
200+
**Cons:** No Landlock/seccomp/netns isolation — only credential isolation. Agent can still bypass the proxy if it makes raw socket calls (no network namespace enforcement). No L7 inspection or OPA policy evaluation.
201+
202+
---
203+
204+
## Strategy Comparison
205+
| Criterion | Strategy 1 (Sidecar) | Strategy 2 (Pod Runtime) | Strategy 3 (Proxy Only) |
|-----------|---------------------|------------------------|------------------------|
| Credential isolation | Full (placeholder/proxy) | Full (placeholder/proxy) | Partial (no netns enforcement) |
| Network isolation | Full (netns + iptables) | Full (netns + iptables) | None |
| Filesystem isolation | Landlock LSM | Landlock LSM | None |
| Syscall filtering | seccomp-BPF | seccomp-BPF | None |
| L7 inspection (OPA) | Yes | Yes | No |
| Runner code changes | Moderate (`auth.py`, `_grpc_client.py`, `bridge.py`, `Dockerfile`) | None | Small (new proxy module) |
| Operator changes | Moderate (provider + policy config) | Heavy (new compute driver) | None |
| Kernel requirements | Linux 5.13+ (Landlock) | Linux 5.13+ (Landlock) | None |
| OpenShell maturity dependency | Supervisor (stable) | K8s driver (alpha) | None (custom code) |
| Defense depth | 5 layers | 5 layers | 1 layer |

---

## Recommendation

**Strategy 1 (Sidecar Supervisor)** is the right path. It provides:

- Agent never sees real secrets (inspecting `/proc/self/environ` yields only placeholders)
- L7 inspection via OPA policies (audit which APIs the agent calls)
- Landlock + seccomp hardening within the container
- Binary identity via SHA256 TOFU (only known binaries can make network calls)
- The existing `secret_redaction.py` becomes a true defense-in-depth layer rather than the primary defense

The critical architectural insight: OpenShell's credential proxy pattern eliminates the single point of failure in the current design. Today, `populate_runtime_credentials()` puts real secrets into a space the agent fully controls. OpenShell moves real secrets into Supervisor memory, a separate privilege domain the agent cannot access.

### Prerequisite: Kernel Version

OpenShell's Landlock LSM requires Linux 5.13+. Landlock support comes from the kernel of the node a pod runs on, not from the container image. The runner containers are built on UBI 10 (RHEL 10), and RHEL 10 ships kernel 6.x, so the requirement is satisfied on nodes running a matching kernel. OpenShell's `best_effort` Landlock mode also provides graceful degradation if the kernel lacks support.
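The prerequisite can be checked from inside a running container. The sketch below is illustrative and not part of OpenShell: both function names are hypothetical. It compares the kernel release string against 5.13 and, where securityfs is mounted, looks for `landlock` in the kernel's active LSM list at `/sys/kernel/security/lsm`.

```python
import re
from pathlib import Path


def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Compare the leading 'X.Y' of a kernel release string to (major, minor)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= (major, minor)


def landlock_listed(lsm_file: str = "/sys/kernel/security/lsm") -> bool:
    """Report whether 'landlock' appears in the active LSM list, if readable."""
    path = Path(lsm_file)
    if not path.is_file():
        # securityfs is often not mounted inside containers.
        return False
    return "landlock" in path.read_text().strip().split(",")
```

On a node, `kernel_at_least(platform.release(), 5, 13)` gives the version check. A new enough kernel does not guarantee that Landlock is enabled, which is why the LSM list is worth checking separately.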

### Migration Path

1. **Phase 1: Credential proxy only (Strategy 3).** Ship a Python-only credential proxy as a proof of concept. This validates that the placeholder/rewrite pattern works with the git credential helper, `gh` wrapper, and Claude CLI without requiring the OpenShell binary.

2. **Phase 2: Sidecar Supervisor (Strategy 1).** Add the OpenShell Supervisor binary, network namespace isolation, Landlock, and seccomp. This is the production target.

3. **Phase 3: OPA policies.** Add L7 inspection with per-session OPA policies generated by the Operator from the session's credential bindings and project settings.
