| 1 | +# Custom Runner Image Specification |
| 2 | + |
| 3 | +**Date:** 2026-05-12 |
| 4 | +**Status:** Proposed |
| 5 | +**Related:** |
| 6 | + - `runner.spec.md` — Runner runtime, AG-UI protocol, bridge layer |
| 7 | + - `../control-plane/control-plane.spec.md` — Pod provisioning, image selection, env var injection |
| 8 | + - `../api/ambient-model.spec.md` — ProjectSettings, Session data model |
| 9 | + - `../security/security.spec.md` — Per-session SA isolation, credential boundaries |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +## Purpose |
| 14 | + |
| 15 | +The Ambient Runner ships a single image containing Python, git, Node.js, Go, and several CLI tools. Workspace admins who need additional tools — Terraform, kubectl, language-specific SDKs, internal CLIs — have no supported extension path short of forking the image. |
| 16 | + |
| 17 | +This spec defines a **stable runner contract** (the set of filesystem paths, HTTP endpoints, environment variables, and security constraints that custom images must preserve), a **Dockerfile FROM extension model** (users layer tools onto a published base image), and a **ProjectSettings-driven image override** (workspace admins declare a custom image per project). |
| 18 | + |
| 19 | +The extension model is Dockerfile FROM only. Init hooks (scripts run at pod startup) were rejected: they are non-reproducible across pods, add startup latency, require runtime network egress that conflicts with NetworkPolicy isolation, and create OpenShift SCC conflicts when installing system packages. |
| 20 | + |
| 21 | +This spec covers only the **image boundary** — what must be true about a container image for the platform to run it as a runner. Runner internals (bridge layer, gRPC transport, credential management) are defined in `runner.spec.md`. Pod provisioning mechanics are defined in `control-plane.spec.md`. |
| 22 | + |
| 23 | +--- |
| 24 | + |
| 25 | +## Stable Runner Contract |
| 26 | + |
| 27 | +Everything in this section is the stable interface. Anything not listed here is internal and MAY change without notice between runner releases. |
| 28 | + |
| 29 | +### Requirement: AG-UI HTTP Contract |
| 30 | + |
| 31 | +A custom runner image SHALL expose the AG-UI protocol on the port specified by the `AGUI_PORT` environment variable (default `8001`). |
| 32 | + |
| 33 | +The following endpoints are part of the stable contract: |
| 34 | + |
| 35 | +| Endpoint | Method | Purpose | |
| 36 | +|----------|--------|---------| |
| 37 | +| `/` | POST | AG-UI run — execute one turn, stream SSE events | |
| 38 | +| `/interrupt` | POST | Halt the active run for a thread | |
| 39 | +| `/health` | GET | Liveness/readiness probe | |
| 40 | +| `/capabilities` | GET | Declare supported features to callers | |
| 41 | +| `/events/{thread_id}` | GET | SSE live event stream for a specific thread | |
| 42 | + |
| 43 | +Custom images MUST NOT remove, relocate, or change the response format of these endpoints. The remaining platform endpoints (`/repos`, `/workflow`, `/feedback`, `/mcp-status`, `/content`, `/tasks`) are registered by the `ambient_runner` package and inherited automatically. |
| 44 | + |
| 45 | +#### Scenario: Custom image passes health check |
| 46 | + |
| 47 | +- GIVEN a custom runner image built FROM the base |
| 48 | +- WHEN the CP creates a pod and the readiness probe calls `GET /health` |
| 49 | +- THEN the response is `200 OK` |
| 50 | +- AND the session transitions to `Running` phase |
| 51 | + |
| 52 | +#### Scenario: Custom image serves AG-UI protocol |
| 53 | + |
| 54 | +- GIVEN a custom runner image is running in a session pod |
| 55 | +- WHEN the api-server proxies a user message to `POST /` |
| 56 | +- THEN the runner processes the turn and streams AG-UI events via SSE |
| 57 | +- AND the event format is identical to the standard runner |
| 58 | + |
| 59 | +--- |
| 60 | + |
| 61 | +### Requirement: Python Runtime Contract |
| 62 | + |
| 63 | +Custom images SHALL provide Python 3.12+ and SHALL have the `ambient_runner` package installed. The runner process MUST use the same Python major.minor version as the base image. |
| 64 | + |
| 65 | +Custom tools MAY use different Python versions via explicit interpreter paths, but the runner's uvicorn process MUST run under the base image's Python. |
| 66 | + |
| 67 | +#### Scenario: Missing ambient_runner package |
| 68 | + |
| 69 | +- GIVEN a custom image without the `ambient_runner` package |
| 70 | +- WHEN the pod starts |
| 71 | +- THEN the runner process fails to start |
| 72 | +- AND the pod exits with a non-zero exit code |
| 73 | +- AND the CP transitions the session to `Failed` |
| 74 | + |
| 75 | +--- |
| 76 | + |
| 77 | +### Requirement: Filesystem Contract |
| 78 | + |
| 79 | +A custom runner image SHALL preserve the following filesystem layout: |
| 80 | + |
| 81 | +| Path | Constraint | |
| 82 | +|------|------------| |
| 83 | +| `/workspace` | MUST exist; EmptyDir mounted by CP at pod creation | |
| 84 | +| `/app` | MUST exist; writeable by UID 1001; serves as `HOME` | |
| 85 | +| `/app/ambient-runner` | MUST contain installed `ambient_runner` package | |
| 86 | +| `/app/vertex` | MUST tolerate read-only Secret mount by CP (when Vertex AI enabled) | |
| 87 | +| `/tmp` | MUST be writeable | |
| 88 | + |
| 89 | +Custom images MAY add files and directories anywhere. Custom images MUST NOT remove or relocate the paths listed above. |
| 90 | + |
| 91 | +#### Scenario: Custom tools installed in system PATH |
| 92 | + |
| 93 | +- GIVEN a custom image with additional system packages installed |
| 94 | +- WHEN a session runs in a pod using this image |
| 95 | +- THEN the additional binaries are available in the agent's PATH |
| 96 | +- AND all AG-UI endpoints function normally |
| 97 | + |
| 98 | +--- |
| 99 | + |
| 100 | +### Requirement: Entrypoint Contract |
| 101 | + |
| 102 | +Custom images SHOULD NOT override CMD or ENTRYPOINT. The platform controls the runner process lifecycle through the base image's default command. |
| 103 | + |
| 104 | +If a custom image needs pre-startup logic, it MAY use a wrapper entrypoint that performs setup and then `exec`s the original command. The runner process MUST: |
| 105 | + |
| 106 | +- Listen on the port specified by `AGUI_PORT` (default `8001`) |
| 107 | +- Receive SIGTERM for graceful shutdown (process must be PID 1, or a direct child of a PID 1 that forwards signals to it) |
| 108 | +- Start within the pod's startup timeout |
| 109 | + |
| 110 | +#### Scenario: Wrapper entrypoint preserves signal handling |
| 111 | + |
| 112 | +- GIVEN a custom image with a wrapper entrypoint that execs the runner process |
| 113 | +- WHEN the CP sends SIGTERM to the pod |
| 114 | +- THEN the runner process receives the signal |
| 115 | +- AND shuts down gracefully within `terminationGracePeriodSeconds` |
| 116 | + |
| 117 | +--- |
| 118 | + |
| 119 | +### Requirement: Environment Contract |
| 120 | + |
| 121 | +The following environment variables are injected by the CP at pod creation time. Custom images MUST NOT override these in the Dockerfile: |
| 122 | + |
| 123 | +| Variable | Purpose | |
| 124 | +|----------|---------| |
| 125 | +| `SESSION_ID` | Primary session identifier | |
| 126 | +| `PROJECT_NAME` | Project context | |
| 127 | +| `WORKSPACE_PATH` | Workspace root (always `/workspace`) | |
| 128 | +| `AGUI_PORT` | Runner HTTP port | |
| 129 | +| `BACKEND_API_URL` | api-server base URL | |
| 130 | +| `AMBIENT_GRPC_URL` | api-server gRPC address | |
| 131 | +| `AMBIENT_GRPC_USE_TLS` | TLS flag for gRPC channel | |
| 132 | +| `AMBIENT_CP_TOKEN_URL` | CP token endpoint | |
| 133 | +| `AMBIENT_CP_TOKEN_PUBLIC_KEY` | RSA public key for token auth | |
| 134 | +| `INITIAL_PROMPT` | Auto-execute prompt | |
| 135 | +| `IS_RESUME` | Resume flag on pod restart | |
| 136 | +| `CREDENTIAL_IDS` | JSON map of resolved credential IDs | |
| 137 | +| `RUNNER_TYPE` | Bridge selection (from agent registry) | |
| 138 | + |
| 139 | +The base image also sets `PYTHONUNBUFFERED=1`, `HOME=/app`, and `SHELL=/bin/bash`. Custom images SHOULD preserve these. |
| 140 | + |
| 141 | +Custom images MAY set additional environment variables. Custom images MUST NOT unset CP-injected variables. |
| 142 | + |
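The "MUST NOT override these in the Dockerfile" rule can be checked at build time. The following is a minimal sketch of such a lint, not part of the platform: the function name `overridden_cp_vars` is hypothetical, and the parser is deliberately naive (it ignores line continuations and `ARG` substitution).

```python
import shlex

# Variable names copied from the Environment Contract table above.
CP_INJECTED_VARS = frozenset({
    "SESSION_ID", "PROJECT_NAME", "WORKSPACE_PATH", "AGUI_PORT",
    "BACKEND_API_URL", "AMBIENT_GRPC_URL", "AMBIENT_GRPC_USE_TLS",
    "AMBIENT_CP_TOKEN_URL", "AMBIENT_CP_TOKEN_PUBLIC_KEY",
    "INITIAL_PROMPT", "IS_RESUME", "CREDENTIAL_IDS", "RUNNER_TYPE",
})


def overridden_cp_vars(dockerfile_text: str) -> list[str]:
    """Return CP-injected variable names that a Dockerfile sets via ENV.

    Hypothetical helper: handles `ENV K=V [K2=V2 ...]` and the legacy
    `ENV K V` form on a single line only.
    """
    found = []
    for line in dockerfile_text.splitlines():
        stripped = line.strip()
        if not stripped.upper().startswith("ENV "):
            continue
        tokens = shlex.split(stripped[4:])
        if not tokens:
            continue
        if "=" in tokens[0]:
            # Modern form: every token is a KEY=VALUE pair.
            names = [token.split("=", 1)[0] for token in tokens]
        else:
            # Legacy form: first token is the key, the rest is the value.
            names = [tokens[0]]
        found.extend(name for name in names if name in CP_INJECTED_VARS)
    return found
```

An empty result means the Dockerfile sets none of the reserved variables; any returned name points at an `ENV` directive to remove.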
| 143 | +#### Scenario: Custom image adds environment variables |
| 144 | + |
| 145 | +- GIVEN a custom image with additional `ENV` directives |
| 146 | +- WHEN a session pod starts |
| 147 | +- THEN both the custom env vars and all CP-injected env vars are present |
| 148 | +- AND the runner starts normally |
| 149 | + |
| 150 | +--- |
| 151 | + |
| 152 | +### Requirement: Security Contract |
| 153 | + |
| 154 | +A custom runner image SHALL run as UID 1001 with no root privileges. |
| 155 | + |
| 156 | +| Constraint | Enforced by | |
| 157 | +|------------|-------------| |
| 158 | +| UID 1001 | Dockerfile `USER 1001` | |
| 159 | +| `runAsNonRoot: true` | Pod SecurityContext | |
| 160 | +| `allowPrivilegeEscalation: false` | Pod SecurityContext | |
| 161 | +| `drop: ["ALL"]` capabilities | Pod SecurityContext | |
| 162 | + |
| 163 | +Custom images MAY use `USER 0` during build stages for installing system packages, provided the final `USER` directive sets UID 1001. Custom images SHOULD include OpenShift arbitrary-UID compatibility (`chmod -R g=u` on writeable paths). |
| 164 | + |
| 165 | +#### Scenario: Custom image with system package installation |
| 166 | + |
| 167 | +- GIVEN a custom image that installs system packages as root during build |
| 168 | +- AND sets `USER 1001` as the final directive |
| 169 | +- WHEN the pod starts with `securityContext.runAsNonRoot: true` |
| 170 | +- THEN the pod starts successfully |
| 171 | +- AND the installed packages are executable by UID 1001 |
| 172 | + |
| 173 | +--- |
| 174 | + |
| 175 | +## ProjectSettings Integration |
| 176 | + |
| 177 | +### Requirement: Custom Runner Image Field |
| 178 | + |
| 179 | +The ProjectSettings resource SHALL support a `runner_image` field (string). When set, the CP SHALL use this image instead of the default when creating session pods for that project. |
| 180 | + |
| 181 | +The field SHALL contain a fully qualified container image reference: `registry/repository:tag` or `registry/repository@sha256:digest`. When empty or unset, the CP uses the default image. |
| 182 | + |
| 183 | +#### Scenario: Project with custom runner image |
| 184 | + |
| 185 | +- GIVEN a ProjectSettings with `runner_image` set to a custom image |
| 186 | +- WHEN a session is started in that project |
| 187 | +- THEN the CP creates the runner pod with the custom image |
| 188 | +- AND all other pod configuration (env vars, volumes, security context) is unchanged |
| 189 | + |
| 190 | +#### Scenario: Project without custom runner image |
| 191 | + |
| 192 | +- GIVEN a ProjectSettings with `runner_image` unset |
| 193 | +- WHEN a session is started |
| 194 | +- THEN the CP uses the default runner image |
| 195 | + |
| 196 | +--- |
| 197 | + |
| 198 | +### Requirement: Image Selection Precedence |
| 199 | + |
| 200 | +The CP SHALL select the runner image using the following precedence (highest to lowest): |
| 201 | + |
| 202 | +1. **ProjectSettings `runner_image`** — workspace admin override |
| 203 | +2. **Agent registry `container.image`** — per-agent-type default |
| 204 | +3. **Operator `RUNNER_IMAGE` env var** — cluster-level default |
| 205 | +4. **Hardcoded fallback** |
| 206 | + |
| 207 | +`ProjectSettings.runner_image` overrides the **image** but not the **agent type configuration**. The `RUNNER_TYPE` env var, resource limits, state directory, and other agent-registry settings are still applied from the registry entry matching the session's runner type. |
| 208 | + |
| 209 | +Custom images MUST contain the bridge implementation for every agent type that sessions in this project may use. Images built FROM the standard base inherit all bridges. |
| 210 | + |
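The precedence list above amounts to "first non-empty value wins". A minimal sketch, assuming each source has already been resolved to a string or `None`; the function name and the fallback value are hypothetical placeholders, since the spec does not name the hardcoded image.

```python
from typing import Optional

# Hypothetical placeholder; the spec does not state the hardcoded fallback.
FALLBACK_IMAGE = "registry.example/ambient-runner:fallback"


def select_runner_image(
    project_image: Optional[str],
    registry_image: Optional[str],
    operator_image: Optional[str],
    fallback: str = FALLBACK_IMAGE,
) -> str:
    """Pick the runner image, highest precedence first.

    Order: ProjectSettings `runner_image`, agent registry `container.image`,
    operator `RUNNER_IMAGE`, then the hardcoded fallback. Empty strings are
    treated the same as unset.
    """
    for candidate in (project_image, registry_image, operator_image):
        if candidate:
            return candidate
    return fallback
```

Only the image is chosen here. `RUNNER_TYPE`, resource limits, and the other agent-registry settings would still come from the registry entry, independent of which branch supplies the image.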
| 211 | +#### Scenario: Custom image with non-default runner type |
| 212 | + |
| 213 | +- GIVEN a project with `runner_image` set to a custom image |
| 214 | +- AND a session created with a non-default runner type |
| 215 | +- WHEN the CP provisions the pod |
| 216 | +- THEN the pod uses the custom image |
| 217 | +- AND the pod env includes the `RUNNER_TYPE` from the agent registry |
| 218 | +- AND the custom image MUST contain the matching bridge implementation |
| 219 | + |
| 220 | +#### Scenario: No custom image — agent registry selects image |
| 221 | + |
| 222 | +- GIVEN a project with `runner_image` unset |
| 223 | +- AND a session with a specific runner type |
| 224 | +- WHEN the CP provisions the pod |
| 225 | +- THEN the pod uses the image from the agent registry entry for that runner type |
| 226 | + |
| 227 | +--- |
| 228 | + |
| 229 | +### Requirement: Image Validation |
| 230 | + |
| 231 | +The CP SHALL validate the `runner_image` value before creating pods. |
| 232 | + |
| 233 | +The CP SHALL reject images where the reference is syntactically invalid (missing repository or tag/digest) or the registry host is empty. |
| 234 | + |
| 235 | +The CP SHOULD support an operator-level allowlist of permitted registries via `RUNNER_IMAGE_ALLOWED_REGISTRIES` (comma-separated hostnames). When set, images from unlisted registries SHALL be rejected and the session SHALL transition to `Failed` with a descriptive condition. |
| 236 | + |
| 237 | +When the allowlist is unset, any registry is allowed. |
| 238 | + |
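The validation rules above could be sketched as follows. This is an illustration under stated assumptions, not the CP's implementation: `validate_runner_image` is a hypothetical name, the tag and digest patterns are simplified, the text before the first `/` is treated as the registry host, and allowlist entries are compared against that host exactly (including any port).

```python
import re
from typing import Optional

# Simplified patterns; a production parser would follow the full
# image reference grammar.
_TAG_RE = re.compile(r"^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$")
_DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def validate_runner_image(
    image_ref: str, allowed_registries: Optional[str] = None
) -> Optional[str]:
    """Return None when the reference is acceptable, else a rejection reason.

    `allowed_registries` mirrors RUNNER_IMAGE_ALLOWED_REGISTRIES:
    comma-separated hostnames; unset or empty means any registry is allowed.
    """
    host, slash, remainder = image_ref.partition("/")
    if not slash or not host:
        return "registry host is empty"

    if "@" in remainder:
        repository, _, digest = remainder.partition("@")
        if not repository or not _DIGEST_RE.match(digest):
            return "missing repository or digest"
    else:
        repository, colon, tag = remainder.rpartition(":")
        if not colon or not repository or not _TAG_RE.match(tag):
            return "missing repository or tag"

    if allowed_registries:
        allowed = {h.strip() for h in allowed_registries.split(",") if h.strip()}
        if host not in allowed:
            return f"registry {host!r} is not in the allowlist"
    return None
```

A non-`None` return value is the text a caller could place in the session condition when transitioning to `Failed`.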
| 239 | +#### Scenario: Image from disallowed registry |
| 240 | + |
| 241 | +- GIVEN a registry allowlist that does not include `docker.io` |
| 242 | +- AND a ProjectSettings with `runner_image` pointing to `docker.io` |
| 243 | +- WHEN the CP validates the image reference |
| 244 | +- THEN the session transitions to `Failed` with a condition describing the rejection |
| 245 | + |
| 246 | +#### Scenario: No registry allowlist |
| 247 | + |
| 248 | +- GIVEN no registry allowlist configured |
| 249 | +- AND a ProjectSettings with `runner_image` pointing to any registry |
| 250 | +- THEN the image is accepted |
| 251 | + |
| 252 | +--- |
| 253 | + |
| 254 | +### Requirement: Image Pull Credentials |
| 255 | + |
| 256 | +The ProjectSettings resource SHALL support a `runner_image_pull_secret` field (string) containing the name of a Kubernetes Secret (type `kubernetes.io/dockerconfigjson`) in the project's namespace. |
| 257 | + |
| 258 | +When set, the CP SHALL add it to the pod's `spec.imagePullSecrets`. |
| 259 | + |
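As an illustration of the resulting pod spec, the sketch below builds the fragment a CP might merge into `spec` when the field is set. The function name is hypothetical; the `imagePullSecrets` shape (a list of objects with a `name` key) is the standard Kubernetes pod spec form.

```python
from typing import Optional


def image_pull_secrets_fragment(pull_secret_name: Optional[str]) -> dict:
    """Return the pod-spec fragment for `runner_image_pull_secret`.

    Hypothetical helper: yields an empty dict when the field is unset,
    so the pod spec is left unchanged.
    """
    if not pull_secret_name:
        return {}
    return {"imagePullSecrets": [{"name": pull_secret_name}]}
```

For example, `image_pull_secrets_fragment("acme-registry")` yields a fragment referencing a Secret named `acme-registry`, which must exist in the project's namespace.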
| 260 | +#### Scenario: Private registry with pull secret |
| 261 | + |
| 262 | +- GIVEN a ProjectSettings with `runner_image` and `runner_image_pull_secret` set |
| 263 | +- AND the referenced Secret exists in the project namespace |
| 264 | +- WHEN the CP creates the runner pod |
| 265 | +- THEN the pod spec includes the secret in `imagePullSecrets` |
| 266 | + |
| 267 | +--- |
| 268 | + |
| 269 | +### Requirement: Image Pull Policy |
| 270 | + |
| 271 | +The CP SHALL set `imagePullPolicy` based on the image reference: |
| 272 | + |
| 273 | +| Reference type | Policy | |
| 274 | +|----------------|--------| |
| 275 | +| `@sha256:` digest | `IfNotPresent` | |
| 276 | +| `localhost/` prefix | `IfNotPresent` | |
| 277 | +| All others (tags) | `Always` | |
| 278 | + |
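The table above maps directly to a small decision function. A minimal sketch; the name `pull_policy_for` is hypothetical, and the digest check is a plain substring match rather than a full reference parse.

```python
def pull_policy_for(image_ref: str) -> str:
    """Return the imagePullPolicy for an image reference, per the table."""
    if "@sha256:" in image_ref:
        # Digests are immutable, so a cached copy is always correct.
        return "IfNotPresent"
    if image_ref.startswith("localhost/"):
        # Locally loaded images have no registry to pull from.
        return "IfNotPresent"
    # Tags are mutable; always re-check the registry.
    return "Always"
```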
| 279 | +--- |
| 280 | + |
| 281 | +### Requirement: RBAC for Runner Image Configuration |
| 282 | + |
| 283 | +Only users with `project_settings:update` permission SHALL be permitted to modify ProjectSettings, including the `runner_image` and `runner_image_pull_secret` fields. This follows the existing endpoint-level RBAC model. |
| 284 | + |
| 285 | +#### Scenario: User without update permission |
| 286 | + |
| 287 | +- GIVEN a user without `project_settings:update` permission |
| 288 | +- WHEN they PATCH ProjectSettings with a `runner_image` value |
| 289 | +- THEN the request is rejected with `403 Forbidden` |
| 290 | + |
| 291 | +--- |
| 292 | + |
| 293 | +### Requirement: Running Sessions Unaffected |
| 294 | + |
| 295 | +When `runner_image` changes on a ProjectSettings resource, the change SHALL apply to **new sessions only**. Running sessions continue using the image they were created with. |
| 296 | + |
| 297 | +#### Scenario: Image change does not affect running sessions |
| 298 | + |
| 299 | +- GIVEN running sessions in a project using image A |
| 300 | +- WHEN the admin changes `runner_image` to image B |
| 301 | +- THEN running sessions continue with image A |
| 302 | +- AND the next session started uses image B |
| 303 | + |
| 304 | +--- |
| 305 | + |
| 306 | +## Failure Modes |
| 307 | + |
| 308 | +### Requirement: Health Check Timeout |
| 309 | + |
| 310 | +The CP SHALL configure a readiness probe on the runner container (`GET /health` on `AGUI_PORT`). If the probe does not pass within the pod's startup timeout, the CP SHALL transition the session to `Failed`. |
| 311 | + |
| 312 | +#### Scenario: Custom image crashes on start |
| 313 | + |
| 314 | +- GIVEN a custom image with a broken dependency |
| 315 | +- WHEN the pod starts and the runner process fails to initialize |
| 316 | +- THEN the pod exits with a non-zero exit code |
| 317 | +- AND the CP transitions the session to `Failed` |
| 318 | + |
| 319 | +### Requirement: Bridge Mismatch |
| 320 | + |
| 321 | +When a custom image does not contain the bridge implementation required by the session's `RUNNER_TYPE`, the runner process SHALL fail at startup. The pod logs SHALL contain an error identifying the missing bridge module. |
| 322 | + |
| 323 | +Custom images built FROM the standard base image inherit all bridge implementations and are not affected. |
| 324 | + |
| 325 | +#### Scenario: Custom image missing bridge for session runner type |
| 326 | + |
| 327 | +- GIVEN a custom image that does not include the bridge for a given runner type |
| 328 | +- AND a session is created with that runner type |
| 329 | +- WHEN the pod starts |
| 330 | +- THEN the runner process fails to load the bridge module |
| 331 | +- AND the pod exits with a non-zero exit code |
| 332 | +- AND the CP transitions the session to `Failed` |
| 333 | + |
| 334 | +### Requirement: Image Pull Failure |
| 335 | + |
| 336 | +When the kubelet cannot pull the custom image, the CP SHALL transition the session to `Failed` with the pull error in the session condition. |
| 337 | + |
| 338 | +#### Scenario: Image does not exist in registry |
| 339 | + |
| 340 | +- GIVEN `runner_image` pointing to a non-existent image |
| 341 | +- WHEN the CP creates the pod |
| 342 | +- THEN the pod's container enters the `ImagePullBackOff` waiting state |
| 343 | +- AND the CP transitions the session to `Failed` |
| 344 | + |
| 345 | +--- |
| 346 | + |
| 347 | +## Security Boundary |
| 348 | + |
| 349 | +Custom runner images run within the same security perimeter as the standard runner: |
| 350 | + |
| 351 | +- **Network isolation**: Runner pods are subject to NetworkPolicy. Outbound internet access is blocked by default. |
| 352 | +- **Credential isolation**: Credentials are fetched per-turn via cluster-local endpoints only. |
| 353 | +- **Per-session ServiceAccount**: Each session gets its own SA with minimal RBAC. |
| 354 | + |
| 355 | +Custom images inherit these constraints. |
| 356 | + |
| 357 | +--- |
| 358 | + |
| 359 | +## Base Image Publishing |
| 360 | + |
| 361 | +### Requirement: Published Base Image |
| 362 | + |
| 363 | +The platform SHALL publish a base runner image suitable for `FROM` directives at a stable, versioned tag. The image SHALL be built from the same source as the standard runner image. |
| 364 | + |
| 365 | +Breaking changes to the stable contract SHALL increment the major version. |
| 366 | + |
| 367 | +### Requirement: Contract Version Label |
| 368 | + |
| 369 | +The base image SHALL carry an OCI label indicating the contract version (e.g., `io.ambient-code.runner-contract-version`). |
| 370 | + |
| 371 | +The CP MAY log a warning if the contract version does not match the expected version. The CP SHALL NOT block pod creation based on contract version mismatch. |
| 372 | + |
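The warn-but-never-block behavior might look like the sketch below. It assumes the image's OCI labels are already available as a dictionary (how the CP obtains them is not specified here); the function name, logger name, and default expected version are hypothetical.

```python
import logging
from typing import Mapping

logger = logging.getLogger("control_plane.runner_image")  # hypothetical name

# Label key taken from the example in the requirement above.
CONTRACT_LABEL = "io.ambient-code.runner-contract-version"


def contract_version_matches(labels: Mapping[str, str], expected: str = "1") -> bool:
    """Log a warning on contract version mismatch; never raise.

    The return value is informational only. Callers proceed with pod
    creation whether it is True or False.
    """
    actual = labels.get(CONTRACT_LABEL)
    if actual != expected:
        logger.warning(
            "runner contract version mismatch: expected %s, image has %s",
            expected,
            actual if actual is not None else "<no label>",
        )
        return False
    return True
```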
| 373 | +#### Scenario: Contract version mismatch |
| 374 | + |
| 375 | +- GIVEN the CP expects contract version `1` |
| 376 | +- AND a custom image has a different contract version label |
| 377 | +- WHEN the CP creates the pod |
| 378 | +- THEN the CP logs a warning |
| 379 | +- AND the pod is created normally |