Commit 2308ab4

jbpratt and claude committed
feat(specs): add custom runner image specification
Define the stable runner contract and a ProjectSettings-driven image override so workspace admins can layer tools onto the base runner via Dockerfile FROM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 28874a9 commit 2308ab4

1 file changed

Lines changed: 379 additions & 0 deletions

specs/agents/runner-image.spec.md

@@ -0,0 +1,379 @@
1+
# Custom Runner Image Specification
2+
3+
**Date:** 2026-05-12
4+
**Status:** Proposed
5+
**Related:**
6+
- `runner.spec.md` — Runner runtime, AG-UI protocol, bridge layer
7+
- `../control-plane/control-plane.spec.md` — Pod provisioning, image selection, env var injection
8+
- `../api/ambient-model.spec.md` — ProjectSettings, Session data model
9+
- `../security/security.spec.md` — Per-session SA isolation, credential boundaries
10+
11+
---
12+
13+
## Purpose
14+
15+
The Ambient Runner ships a single image containing Python, git, Node.js, Go, and several CLI tools. Workspace admins who need additional tools — Terraform, kubectl, language-specific SDKs, internal CLIs — have no supported extension path short of forking the image.
16+
17+
This spec defines a **stable runner contract** (the set of filesystem paths, HTTP endpoints, environment variables, and security constraints that custom images must preserve), a **Dockerfile FROM extension model** (users layer tools onto a published base image), and a **ProjectSettings-driven image override** (workspace admins declare a custom image per project).
18+
19+
The extension model is Dockerfile FROM only. Init hooks (scripts run at pod startup) were rejected: they are non-reproducible across pods, add startup latency, require runtime network egress that conflicts with NetworkPolicy isolation, and create OpenShift SCC conflicts when installing system packages.
20+
21+
This spec covers only the **image boundary** — what must be true about a container image for the platform to run it as a runner. Runner internals (bridge layer, gRPC transport, credential management) are defined in `runner.spec.md`. Pod provisioning mechanics are defined in `control-plane.spec.md`.
22+
23+
---
24+
25+
## Stable Runner Contract
26+
27+
Everything in this section is the stable interface. Anything not listed here is internal and MAY change without notice between runner releases.
28+
29+
### Requirement: AG-UI HTTP Contract
30+
31+
A custom runner image SHALL expose the AG-UI protocol on the port specified by the `AGUI_PORT` environment variable (default `8001`).
32+
33+
The following endpoints are part of the stable contract:
34+
35+
| Endpoint | Method | Purpose |
36+
|----------|--------|---------|
37+
| `/` | POST | AG-UI run — execute one turn, stream SSE events |
38+
| `/interrupt` | POST | Halt the active run for a thread |
39+
| `/health` | GET | Liveness/readiness probe |
40+
| `/capabilities` | GET | Declare supported features to callers |
41+
| `/events/{thread_id}` | GET | SSE live event stream for a specific thread |
42+
43+
Custom images MUST NOT remove, relocate, or change the response format of these endpoints. The remaining platform endpoints (`/repos`, `/workflow`, `/feedback`, `/mcp-status`, `/content`, `/tasks`) are registered by the `ambient_runner` package and inherited automatically.
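
The endpoint table above can be expressed as a small conformance check. This is an illustrative sketch only: `STABLE_ENDPOINTS` and `missing_endpoints` are hypothetical names, and how a caller obtains the list of routes registered in a custom image is left as an assumption.

```python
# Stable AG-UI endpoints from the table above, as (method, path) pairs.
STABLE_ENDPOINTS = frozenset({
    ("POST", "/"),
    ("POST", "/interrupt"),
    ("GET", "/health"),
    ("GET", "/capabilities"),
    ("GET", "/events/{thread_id}"),
})


def missing_endpoints(registered):
    """Return the stable endpoints absent from `registered`, sorted."""
    return sorted(STABLE_ENDPOINTS - set(registered))
```

An empty result means every stable endpoint is still present; it says nothing about whether the response formats are unchanged.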
44+
45+
#### Scenario: Custom image passes health check
46+
47+
- GIVEN a custom runner image built FROM the base
48+
- WHEN the CP creates a pod and the readiness probe calls `GET /health`
49+
- THEN the response is `200 OK`
50+
- AND the session transitions to `Running` phase
51+
52+
#### Scenario: Custom image serves AG-UI protocol
53+
54+
- GIVEN a custom runner image is running in a session pod
55+
- WHEN the api-server proxies a user message to `POST /`
56+
- THEN the runner processes the turn and streams AG-UI events via SSE
57+
- AND the event format is identical to the standard runner
58+
59+
---
60+
61+
### Requirement: Python Runtime Contract
62+
63+
Custom images SHALL provide Python 3.12+ and SHALL have the `ambient_runner` package installed. The runner process MUST use the same Python major.minor version as the base image.
64+
65+
Custom tools MAY use different Python versions via explicit interpreter paths, but the runner's uvicorn process MUST run under the base image's Python.
66+
67+
#### Scenario: Missing ambient_runner package
68+
69+
- GIVEN a custom image without the `ambient_runner` package
70+
- WHEN the pod starts
71+
- THEN the runner process fails to start
72+
- AND the pod exits with a non-zero exit code
73+
- AND the CP transitions the session to `Failed`
74+
75+
---
76+
77+
### Requirement: Filesystem Contract
78+
79+
A custom runner image SHALL preserve the following filesystem layout:
80+
81+
| Path | Constraint |
82+
|------|------------|
83+
| `/workspace` | MUST exist; EmptyDir mounted by CP at pod creation |
84+
| `/app` | MUST exist; writeable by UID 1001; serves as `HOME` |
85+
| `/app/ambient-runner` | MUST contain installed `ambient_runner` package |
86+
| `/app/vertex` | MUST tolerate read-only Secret mount by CP (when Vertex AI enabled) |
87+
| `/tmp` | MUST be writeable |
88+
89+
Custom images MAY add files and directories anywhere. Custom images MUST NOT remove or relocate the paths listed above.
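
A custom image could be smoke-tested against this layout before it is published. The sketch below is one possible check, not part of the platform: `REQUIRED_DIRS` and `check_layout` are hypothetical names, the `root` parameter exists only so the check can run against a scratch directory, and `/app/vertex` is left out because its constraint concerns a mount made by the CP rather than the image contents.

```python
import os

# Path -> whether the current user must be able to write to it.
# Taken from the table above; /app/vertex is intentionally omitted.
REQUIRED_DIRS = {
    "/workspace": False,
    "/app": True,
    "/app/ambient-runner": False,
    "/tmp": True,
}


def check_layout(root="/"):
    """Return a list of human-readable problems; empty means the layout is intact."""
    problems = []
    for path, need_write in REQUIRED_DIRS.items():
        full = os.path.join(root, path.lstrip("/"))
        if not os.path.isdir(full):
            problems.append(f"{path}: missing")
        elif need_write and not os.access(full, os.W_OK):
            problems.append(f"{path}: not writable")
    return problems
```

Running `check_layout()` inside the built image as UID 1001 would exercise the writability constraints under the same user the pod uses.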
90+
91+
#### Scenario: Custom tools installed in system PATH
92+
93+
- GIVEN a custom image with additional system packages installed
94+
- WHEN a session runs in a pod using this image
95+
- THEN the additional binaries are available in the agent's PATH
96+
- AND all AG-UI endpoints function normally
97+
98+
---
99+
100+
### Requirement: Entrypoint Contract
101+
102+
Custom images SHOULD NOT override CMD or ENTRYPOINT. The platform controls the runner process lifecycle through the base image's default command.
103+
104+
If a custom image needs pre-startup logic, it MAY use a wrapper entrypoint that performs setup and then `exec`s the original command. The runner process MUST:
105+
106+
- Listen on the port specified by `AGUI_PORT` (default `8001`)
107+
- Receive SIGTERM for graceful shutdown (the process must be PID 1, or a direct child of a PID 1 that forwards signals to it)
108+
- Start within the pod's startup timeout
109+
110+
#### Scenario: Wrapper entrypoint preserves signal handling
111+
112+
- GIVEN a custom image with a wrapper entrypoint that execs the runner process
113+
- WHEN the CP sends SIGTERM to the pod
114+
- THEN the runner process receives the signal
115+
- AND shuts down gracefully within `terminationGracePeriodSeconds`
116+
117+
---
118+
119+
### Requirement: Environment Contract
120+
121+
The following environment variables are injected by the CP at pod creation time. Custom images MUST NOT override these in the Dockerfile:
122+
123+
| Variable | Purpose |
124+
|----------|---------|
125+
| `SESSION_ID` | Primary session identifier |
126+
| `PROJECT_NAME` | Project context |
127+
| `WORKSPACE_PATH` | Workspace root (always `/workspace`) |
128+
| `AGUI_PORT` | Runner HTTP port |
129+
| `BACKEND_API_URL` | api-server base URL |
130+
| `AMBIENT_GRPC_URL` | api-server gRPC address |
131+
| `AMBIENT_GRPC_USE_TLS` | TLS flag for gRPC channel |
132+
| `AMBIENT_CP_TOKEN_URL` | CP token endpoint |
133+
| `AMBIENT_CP_TOKEN_PUBLIC_KEY` | RSA public key for token auth |
134+
| `INITIAL_PROMPT` | Auto-execute prompt |
135+
| `IS_RESUME` | Resume flag on pod restart |
136+
| `CREDENTIAL_IDS` | JSON map of resolved credential IDs |
137+
| `RUNNER_TYPE` | Bridge selection (from agent registry) |
138+
139+
The base image also sets `PYTHONUNBUFFERED=1`, `HOME=/app`, and `SHELL=/bin/bash`. Custom images SHOULD preserve these.
140+
141+
Custom images MAY set additional environment variables. Custom images MUST NOT unset CP-injected variables.
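
The rule against overriding CP-injected variables lends itself to a simple Dockerfile lint. The following is a rough sketch under stated assumptions: `reserved_env_overrides` is a hypothetical helper, and the parsing is deliberately naive (it ignores line continuations and quoted values containing spaces).

```python
# Variables injected by the CP, from the table above.
CP_INJECTED = frozenset({
    "SESSION_ID", "PROJECT_NAME", "WORKSPACE_PATH", "AGUI_PORT",
    "BACKEND_API_URL", "AMBIENT_GRPC_URL", "AMBIENT_GRPC_USE_TLS",
    "AMBIENT_CP_TOKEN_URL", "AMBIENT_CP_TOKEN_PUBLIC_KEY",
    "INITIAL_PROMPT", "IS_RESUME", "CREDENTIAL_IDS", "RUNNER_TYPE",
})


def reserved_env_overrides(dockerfile_text):
    """Return CP-injected variable names that the Dockerfile sets via ENV."""
    hits = []
    for line in dockerfile_text.splitlines():
        parts = line.strip().split()
        if len(parts) < 2 or parts[0].upper() != "ENV":
            continue
        if "=" in parts[1]:
            # Form: ENV KEY=value [KEY2=value2 ...]
            keys = [p.split("=", 1)[0] for p in parts[1:] if "=" in p]
        else:
            # Legacy form: ENV KEY value
            keys = [parts[1]]
        hits.extend(k for k in keys if k in CP_INJECTED)
    return hits
```

A non-empty result flags a Dockerfile that would violate the contract; variables outside the table are ignored.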
142+
143+
#### Scenario: Custom image adds environment variables
144+
145+
- GIVEN a custom image with additional `ENV` directives
146+
- WHEN a session pod starts
147+
- THEN both the custom env vars and all CP-injected env vars are present
148+
- AND the runner starts normally
149+
150+
---
151+
152+
### Requirement: Security Contract
153+
154+
A custom runner image SHALL run as UID 1001 with no root privileges.
155+
156+
| Constraint | Enforced by |
157+
|------------|-------------|
158+
| UID 1001 | Dockerfile `USER 1001` |
159+
| `runAsNonRoot: true` | Pod SecurityContext |
160+
| `allowPrivilegeEscalation: false` | Pod SecurityContext |
161+
| `drop: ["ALL"]` capabilities | Pod SecurityContext |
162+
163+
Custom images MAY use `USER 0` during build stages for installing system packages, provided the final `USER` directive sets UID 1001. Custom images SHOULD include OpenShift arbitrary-UID compatibility (`chmod -R g=u` on writeable paths).
164+
165+
#### Scenario: Custom image with system package installation
166+
167+
- GIVEN a custom image that installs system packages as root during build
168+
- AND sets `USER 1001` as the final directive
169+
- WHEN the pod starts with `securityContext.runAsNonRoot: true`
170+
- THEN the pod starts successfully
171+
- AND the installed packages are executable by UID 1001
172+
173+
---
174+
175+
## ProjectSettings Integration
176+
177+
### Requirement: Custom Runner Image Field
178+
179+
The ProjectSettings resource SHALL support a `runner_image` field (string). When set, the CP SHALL use this image instead of the default when creating session pods for that project.
180+
181+
The field SHALL contain a fully qualified container image reference: `registry/repository:tag` or `registry/repository@sha256:digest`. When empty or unset, the CP uses the default image.
182+
183+
#### Scenario: Project with custom runner image
184+
185+
- GIVEN a ProjectSettings with `runner_image` set to a custom image
186+
- WHEN a session is started in that project
187+
- THEN the CP creates the runner pod with the custom image
188+
- AND all other pod configuration (env vars, volumes, security context) is unchanged
189+
190+
#### Scenario: Project without custom runner image
191+
192+
- GIVEN a ProjectSettings with `runner_image` unset
193+
- WHEN a session is started
194+
- THEN the CP uses the default runner image
195+
196+
---
197+
198+
### Requirement: Image Selection Precedence
199+
200+
The CP SHALL select the runner image using the following precedence (highest to lowest):
201+
202+
1. **ProjectSettings `runner_image`** — workspace admin override
203+
2. **Agent registry `container.image`** — per-agent-type default
204+
3. **Operator `RUNNER_IMAGE` env var** — cluster-level default
205+
4. **Hardcoded fallback**
206+
207+
`ProjectSettings.runner_image` overrides the **image** but not the **agent type configuration**. The `RUNNER_TYPE` env var, resource limits, state directory, and other agent-registry settings are still applied from the registry entry matching the session's runner type.
208+
209+
Custom images MUST contain the bridge implementation for every agent type that sessions in this project may use. Images built FROM the standard base inherit all bridges.
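
The precedence order amounts to taking the first non-empty candidate. A minimal sketch, assuming each source is passed in as a plain string (empty or `None` when unset); `select_runner_image` and its parameter names are illustrative, not the CP's actual API:

```python
def select_runner_image(project_image, registry_image, operator_image, fallback):
    """Pick the runner image, highest precedence first.

    project_image:  ProjectSettings `runner_image`
    registry_image: agent registry `container.image`
    operator_image: operator `RUNNER_IMAGE` env var
    fallback:       hardcoded default, assumed to be always set
    """
    for candidate in (project_image, registry_image, operator_image):
        if candidate:
            return candidate
    return fallback
```

Only the image is chosen here; `RUNNER_TYPE` and the other agent-registry settings would still come from the registry entry for the session's runner type.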
210+
211+
#### Scenario: Custom image with non-default runner type
212+
213+
- GIVEN a project with `runner_image` set to a custom image
214+
- AND a session created with a non-default runner type
215+
- WHEN the CP provisions the pod
216+
- THEN the pod uses the custom image
217+
- AND the pod env includes the `RUNNER_TYPE` from the agent registry
218+
- AND the custom image MUST contain the matching bridge implementation
219+
220+
#### Scenario: No custom image — agent registry selects image
221+
222+
- GIVEN a project with `runner_image` unset
223+
- AND a session with a specific runner type
224+
- WHEN the CP provisions the pod
225+
- THEN the pod uses the image from the agent registry entry for that runner type
226+
227+
---
228+
229+
### Requirement: Image Validation
230+
231+
The CP SHALL validate the `runner_image` value before creating pods.
232+
233+
The CP SHALL reject images where the reference is syntactically invalid (missing repository or tag/digest) or the registry host is empty.
234+
235+
The CP SHOULD support an operator-level allowlist of permitted registries via `RUNNER_IMAGE_ALLOWED_REGISTRIES` (comma-separated hostnames). When set, images from unlisted registries SHALL be rejected and the session SHALL transition to `Failed` with a descriptive condition.
236+
237+
When the allowlist is unset, any registry is allowed.
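
The validation rules above could look roughly like the sketch below. It is an assumption-laden illustration rather than the CP's implementation: `parse_allowlist` and `validate_runner_image` are hypothetical names, the first path component is treated as the registry host without checking that it looks like a hostname, and allowlist entries are compared to it by exact match.

```python
import re

_DIGEST = re.compile(r"sha256:[0-9a-f]{64}")


def parse_allowlist(raw):
    """Parse a RUNNER_IMAGE_ALLOWED_REGISTRIES value into a set of hosts."""
    return {h.strip() for h in (raw or "").split(",") if h.strip()}


def validate_runner_image(ref, allowed_registries=None):
    """Return None when the reference is accepted, else a rejection reason."""
    registry, sep, remainder = ref.partition("/")
    if not sep or not registry:
        return "registry host is empty"
    if "@" in remainder:
        repo, _, digest = remainder.partition("@")
        if not repo or not _DIGEST.fullmatch(digest):
            return "missing repository or digest"
    else:
        # The registry (and any port) is already stripped, so the last
        # colon can only separate the repository from the tag.
        repo, _, tag = remainder.rpartition(":")
        if not repo or not tag:
            return "missing repository or tag"
    if allowed_registries and registry not in allowed_registries:
        return f"registry {registry!r} is not in the allowlist"
    return None
```

An empty or unset allowlist parses to an empty set, which the validator treats as "any registry is allowed".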
238+
239+
#### Scenario: Image from disallowed registry
240+
241+
- GIVEN a registry allowlist that does not include `docker.io`
242+
- AND a ProjectSettings with `runner_image` pointing to `docker.io`
243+
- WHEN the CP validates the image reference
244+
- THEN the session transitions to `Failed` with a condition describing the rejection
245+
246+
#### Scenario: No registry allowlist
247+
248+
- GIVEN no registry allowlist configured
249+
- AND a ProjectSettings with `runner_image` pointing to any registry
250+
- WHEN the CP validates the image reference
- THEN the image is accepted
251+
252+
---
253+
254+
### Requirement: Image Pull Credentials
255+
256+
The ProjectSettings resource SHALL support a `runner_image_pull_secret` field (string) containing the name of a Kubernetes Secret (type `kubernetes.io/dockerconfigjson`) in the project's namespace.
257+
258+
When set, the CP SHALL add it to the pod's `spec.imagePullSecrets`.
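
In pod-spec terms, the pull secret becomes one entry in `imagePullSecrets`. The sketch below builds a pod spec fragment as a plain dictionary; `runner_pod_spec` and the container name `runner` are hypothetical, and the real CP presumably sets many more fields.

```python
def runner_pod_spec(image, pull_secret=None):
    """Build a minimal pod spec fragment for the runner container."""
    spec = {"containers": [{"name": "runner", "image": image}]}
    if pull_secret:
        # Kubernetes expects a list of objects with a `name` key.
        spec["imagePullSecrets"] = [{"name": pull_secret}]
    return spec
```

When `runner_image_pull_secret` is unset, the key is omitted entirely rather than set to an empty list.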
259+
260+
#### Scenario: Private registry with pull secret
261+
262+
- GIVEN a ProjectSettings with `runner_image` and `runner_image_pull_secret` set
263+
- AND the referenced Secret exists in the project namespace
264+
- WHEN the CP creates the runner pod
265+
- THEN the pod spec includes the secret in `imagePullSecrets`
266+
267+
---
268+
269+
### Requirement: Image Pull Policy
270+
271+
The CP SHALL set `imagePullPolicy` based on the image reference:
272+
273+
| Reference type | Policy |
274+
|----------------|--------|
275+
| `@sha256:` digest | `IfNotPresent` |
276+
| `localhost/` prefix | `IfNotPresent` |
277+
| All others (tags) | `Always` |
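
The table maps directly onto a two-branch function. A minimal sketch, with `image_pull_policy` as a hypothetical name:

```python
def image_pull_policy(ref):
    """Return the imagePullPolicy for an image reference, per the table above."""
    if "@sha256:" in ref or ref.startswith("localhost/"):
        return "IfNotPresent"
    return "Always"
```

Digest references are immutable, so a cached copy is always the right content; tags can be repointed, which is why they are re-pulled.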
278+
279+
---
280+
281+
### Requirement: RBAC for Runner Image Configuration
282+
283+
Only users with `project_settings:update` permission SHALL be permitted to modify ProjectSettings, including the `runner_image` and `runner_image_pull_secret` fields. This follows the existing endpoint-level RBAC model.
284+
285+
#### Scenario: User without update permission
286+
287+
- GIVEN a user without `project_settings:update` permission
288+
- WHEN they PATCH ProjectSettings with a `runner_image` value
289+
- THEN the request is rejected with `403 Forbidden`
290+
291+
---
292+
293+
### Requirement: Running Sessions Unaffected
294+
295+
When `runner_image` changes on a ProjectSettings resource, the change SHALL apply to **new sessions only**. Running sessions continue using the image they were created with.
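
One way to get this behavior is to resolve the image once, at session creation, and store it on the session. The sketch below models that idea with hypothetical `ProjectSettings` and `Session` classes; it is not the platform's data model.

```python
from dataclasses import dataclass


@dataclass
class ProjectSettings:
    runner_image: str = ""


@dataclass(frozen=True)
class Session:
    name: str
    image: str  # resolved once at creation, never re-read from settings


def start_session(name, settings, default_image):
    """Create a session that snapshots the image in effect right now."""
    return Session(name=name, image=settings.runner_image or default_image)
```

Because `Session` is frozen, a later change to `ProjectSettings.runner_image` cannot reach sessions that already exist.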
296+
297+
#### Scenario: Image change does not affect running sessions
298+
299+
- GIVEN running sessions in a project using image A
300+
- WHEN the admin changes `runner_image` to image B
301+
- THEN running sessions continue with image A
302+
- AND the next session started uses image B
303+
304+
---
305+
306+
## Failure Modes
307+
308+
### Requirement: Health Check Timeout
309+
310+
The CP SHALL configure a readiness probe on the runner container (`GET /health` on `AGUI_PORT`). If the probe does not pass within the pod's startup timeout, the CP SHALL transition the session to `Failed`.
311+
312+
#### Scenario: Custom image crashes on start
313+
314+
- GIVEN a custom image with a broken dependency
315+
- WHEN the pod starts and the runner process fails to initialize
316+
- THEN the pod exits with a non-zero exit code
317+
- AND the CP transitions the session to `Failed`
318+
319+
### Requirement: Bridge Mismatch
320+
321+
When a custom image does not contain the bridge implementation required by the session's `RUNNER_TYPE`, the runner process SHALL fail at startup. The pod logs SHALL contain an error identifying the missing bridge module.
322+
323+
Custom images built FROM the standard base image inherit all bridge implementations and are not affected.
324+
325+
#### Scenario: Custom image missing bridge for session runner type
326+
327+
- GIVEN a custom image that does not include the bridge for a given runner type
328+
- AND a session is created with that runner type
329+
- WHEN the pod starts
330+
- THEN the runner process fails to load the bridge module
331+
- AND the pod exits with a non-zero exit code
332+
- AND the CP transitions the session to `Failed`
333+
334+
### Requirement: Image Pull Failure
335+
336+
When the kubelet cannot pull the custom image, the CP SHALL transition the session to `Failed` with the pull error in the session condition.
337+
338+
#### Scenario: Image does not exist in registry
339+
340+
- GIVEN `runner_image` pointing to a non-existent image
341+
- WHEN the CP creates the pod
342+
- THEN the pod's container enters the `ImagePullBackOff` waiting state
343+
- AND the CP transitions the session to `Failed`
344+
345+
---
346+
347+
## Security Boundary
348+
349+
Custom runner images run within the same security perimeter as the standard runner:
350+
351+
- **Network isolation**: Runner pods are subject to NetworkPolicy. Outbound internet access is blocked by default.
352+
- **Credential isolation**: Credentials are fetched per-turn via cluster-local endpoints only.
353+
- **Per-session ServiceAccount**: Each session gets its own SA with minimal RBAC.
354+
355+
Custom images inherit these constraints.
356+
357+
---
358+
359+
## Base Image Publishing
360+
361+
### Requirement: Published Base Image
362+
363+
The platform SHALL publish a base runner image suitable for `FROM` directives at a stable, versioned tag. The image SHALL be built from the same source as the standard runner image.
364+
365+
Breaking changes to the stable contract SHALL increment the major version.
366+
367+
### Requirement: Contract Version Label
368+
369+
The base image SHALL carry an OCI label indicating the contract version (e.g., `io.ambient-code.runner-contract-version`).
370+
371+
The CP MAY log a warning if the contract version does not match the expected version. The CP SHALL NOT block pod creation based on contract version mismatch.
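
The warn-but-never-block behavior could be sketched as follows. The label key is the example name given above, `contract_version_matches` is a hypothetical helper, and treating a missing label as a mismatch is an assumption of this sketch.

```python
import logging

log = logging.getLogger("runner-image")

# Example label key from the requirement above.
CONTRACT_LABEL = "io.ambient-code.runner-contract-version"


def contract_version_matches(labels, expected="1"):
    """Log a warning on mismatch and report whether the versions agree."""
    found = labels.get(CONTRACT_LABEL)
    if found != expected:
        log.warning(
            "runner contract version mismatch: expected %s, found %s",
            expected, found,
        )
        return False
    return True
```

The return value is informational only: the caller would go on to create the pod whether it is `True` or `False`.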
372+
373+
#### Scenario: Contract version mismatch
374+
375+
- GIVEN the CP expects contract version `1`
376+
- AND a custom image has a different contract version label
377+
- WHEN the CP creates the pod
378+
- THEN the CP logs a warning
379+
- AND the pod is created normally
