| 1 | +# 18. Coding-agent target runtime contract |
| 2 | + |
| 3 | +Date: 2026-07-05 |
| 4 | + |
| 5 | +## Status |
| 6 | + |
| 7 | +Accepted (2026-07-05). Tracks the coding-agent runtime closeout under Beads |
| 8 | +`av-t2o5`, `av-y7eq.10`, `av-y7eq.9`, and `av-t2o5.2`. |
| 9 | + |
| 10 | +This ADR records the target runtime contract. It does not claim live provider |
| 11 | +matrix completion; host-runtime dogfood evidence is owned by `av-y7eq.9`. |
| 12 | +Profile and sandbox isolation evidence is deferred to `av-t2o5.1`. |
| 13 | +Provider-agnostic recorded trajectory replay and removal of `provider: |
| 14 | +copilot-log` from authored target YAML are owned by `av-t2o5.2`. |
| 15 | + |
| 16 | +## Context |
| 17 | + |
| 18 | +AgentV evaluates coding agents in real repositories. Those agents are not |
| 19 | +ordinary LLM APIs: they run tools, read files from disk, mutate workspaces, |
| 20 | +manage auth profiles, stream transcripts, and may spawn their own subprocesses. |
| 21 | +If AgentV imports fragile agent SDKs directly into the main CLI/orchestrator |
| 22 | +process, a provider bug can crash the run before AgentV finalizes |
| 23 | +`.agentv/results/<run_id>/`, `summary.json`, `.internal/index.jsonl`, |
| 24 | +transcripts, and grading artifacts. |
| 25 | + |
| 26 | +The product direction is repo-native, workspace-native evaluation with portable |
| 27 | +run bundles as the source of truth. The default path should therefore evaluate |
| 28 | +the real installed agent/profile on the host when that is what the user wants, |
| 29 | +while still keeping provider instability inside target execution envelopes. |
| 30 | + |
| 31 | +Peer frameworks are evidence, not schema authority: |
| 32 | + |
| 33 | +- Promptfoo local clone `/home/entity/projects/promptfoo/promptfoo` at commit |
| 34 | + `6bfc5a0c7f16f9c4717ac731d276b578e63d0769` separates coding-agent provider |
| 35 | + families such as Codex SDK, Codex app-server, and Claude Agent SDK. Its |
| 36 | + taxonomy explicitly says provider IDs should encode the runtime boundary. |
| 37 | +- Promptfoo also shows useful optional dependency ergonomics: Claude Agent SDK |
| 38 | + is loaded only for that provider, but the SDK still runs in Promptfoo's |
| 39 | + provider process. AgentV adopts optional/lazy SDK loading, but adds an |
| 40 | + AgentV-owned child-runner boundary. |
| 41 | +- Vercel `agent-eval` at commit |
| 42 | + `a9dcc9a8c53dbc22ececc967ded7ab3963f18e67` runs coding agents through |
| 43 | + sandboxed CLI-like execution, records raw and parsed transcripts, and writes |
| 44 | + result bundles under `results/<experiment>/<timestamp>/`. |
| 45 | +- Margin Evals at commit |
| 46 | + `53fb2fd080689efaf7934573d8759d14fc1043e4` uses run-centric artifacts, |
| 47 | + process/runtime logs, managed agent definitions, and trajectory hooks. |
| 48 | +- Harbor uses container environments and ATIF trajectories for benchmark-grade |
| 49 | + agent execution. Its 2026-06-18 change to run `harbor check` and |
| 50 | + `harbor analyze` as Harbor trials instead of in-process Claude SDK calls |
| 51 | + supports the same boundary: produce artifacts from real executions instead |
| 52 | + of hiding work inside the coordinator process. |
| 53 | +- Kata Symphony and Taskplane validate Pi RPC as a process/stdio control |
| 54 | + boundary: `pi --mode rpc` is launched as a live subprocess locally, over |
| 55 | + SSH, or through worker orchestration, rather than being collapsed into an |
| 56 | + in-process SDK call. |
| 57 | +- entireio/cli preserves native agent sessions and derives normalized |
| 58 | + transcript/checkpoint metadata from provider-specific logs. That pattern |
| 59 | + supports AgentV preserving raw logs as provenance and importing them into |
| 60 | + provider-agnostic replay artifacts, rather than exposing one live `*-log` |
| 61 | + target provider per backend. |
| 62 | + |
| 63 | +## Decision |
| 64 | + |
| 65 | +AgentV treats coding-agent targets as external runtimes to orchestrate, not |
| 66 | +libraries to call in-process by default. |
| 67 | + |
| 68 | +Authored targets use this shape: |
| 69 | + |
| 70 | +```yaml |
| 71 | +targets: |
| 71 | +  - id: codex-local |
| 72 | +    provider: codex-app-server |
| 73 | +    runtime: host |
| 74 | +    config: |
| 75 | +      command: ["codex", "app-server"] |
| 76 | +      model: gpt-5-codex |
| 78 | +``` |
| 79 | + |
| 80 | +The fields mean: |
| 81 | + |
| 82 | +| Field | Meaning | |
| 83 | +| --- | --- | |
| 84 | +| `id` | Stable AgentV target identity used for CLI selection, artifacts, Dashboard, and comparisons. | |
| 85 | +| `provider` | Adapter/control boundary such as `codex-cli`, `codex-app-server`, `pi-rpc`, `claude-cli`, or `copilot-sdk`. | |
| 86 | +| `runtime` | Placement/isolation mode: `host`, `profile`, or `sandbox`; may be a string shorthand or an object with `mode`. | |
| 87 | +| `config` | Provider-specific knobs such as `command`, `model`, `cwd`, `timeout_seconds`, auth endpoint settings, permission flags, and provider protocol settings. | |
| 88 | + |
| 89 | +Do not add competing top-level fields such as `isolation`, `sandbox`, |
| 90 | +`profile`, `install`, `container`, `environment`, `executable`, `binary`, |
| 91 | +`args`, or `arguments` for this contract. Process/protocol providers use |
| 92 | +`config.command` as a non-empty argv array. Authored eval concurrency belongs |
| 93 | +under `evaluate_options.max_concurrency`, not inside a target definition. |
| 94 | +Grader selection belongs to `defaults.grader`, CLI overrides, or |
| 95 | +evaluator-level target selection, not to the system-under-test target. |
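
The authoring rules above can be sketched as a small normalization step. This is a minimal illustration, not AgentV's actual validator: `normalizeTarget`, `NormalizedTarget`, and the error messages are hypothetical names, and defaulting an omitted `runtime` to `host` is an assumption that the ADR does not state.

```typescript
// Hypothetical sketch: names and error messages are illustrative only.
type RuntimeMode = "host" | "profile" | "sandbox";

interface NormalizedTarget {
  id: string;
  provider: string;
  runtime: { mode: RuntimeMode };
  config: Record<string, unknown>;
}

// Top-level fields the contract says must not compete with `runtime`/`config`.
const FORBIDDEN_TOP_LEVEL = [
  "isolation", "sandbox", "profile", "install", "container",
  "environment", "executable", "binary", "args", "arguments",
];

const RUNTIME_MODES = ["host", "profile", "sandbox"];

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function normalizeTarget(raw: unknown): NormalizedTarget {
  if (!isRecord(raw)) throw new Error("target must be a mapping");

  for (const field of FORBIDDEN_TOP_LEVEL) {
    if (field in raw) throw new Error("unsupported top-level field: " + field);
  }

  const { id, provider } = raw;
  if (typeof id !== "string" || id.length === 0) throw new Error("id is required");
  if (typeof provider !== "string" || provider.length === 0) {
    throw new Error("provider is required");
  }

  // `runtime` may be a string shorthand or an object with `mode`.
  // Assumption: an omitted runtime is treated as host.
  const rawRuntime = raw.runtime ?? "host";
  const mode = isRecord(rawRuntime) ? rawRuntime.mode : rawRuntime;
  if (typeof mode !== "string" || RUNTIME_MODES.indexOf(mode) === -1) {
    throw new Error("invalid runtime mode: " + String(mode));
  }

  const config = raw.config ?? {};
  if (!isRecord(config)) throw new Error("config must be a mapping");

  // When present, `config.command` must be a non-empty argv array of strings.
  const command = config.command;
  if (command !== undefined) {
    const ok = Array.isArray(command) && command.length > 0 &&
      command.every((part) => typeof part === "string");
    if (!ok) throw new Error("config.command must be a non-empty argv array");
  }

  return { id, provider, runtime: { mode: mode as RuntimeMode }, config };
}
```

Normalizing the string shorthand into `{ mode }` up front means downstream code only has to handle one runtime shape.
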
| 96 | + |
| 97 | +### Provider Boundaries |
| 98 | + |
| 99 | +Process and protocol providers are the preferred defaults: |
| 100 | + |
| 101 | +- `codex-app-server`: preferred Codex rich protocol/control boundary. |
| 102 | +- `codex-cli`: simple Codex subprocess boundary for host/profile execution and |
| 103 | + installed user shims. |
| 104 | +- `pi-rpc`: preferred Pi rich control boundary over stdio/RPC. |
| 105 | +- `pi-cli`: simple Pi subprocess boundary. |
| 106 | +- `claude-cli`: default Claude path through the installed Claude CLI. |
| 107 | +- `copilot-cli`: active Copilot execution through the installed CLI/protocol |
| 108 | + path. |
| 109 | + |
| 110 | +SDK providers are explicit advanced paths: |
| 111 | + |
| 112 | +- `codex-sdk` |
| 113 | +- `pi-sdk` |
| 114 | +- `claude-sdk` |
| 115 | +- `copilot-sdk` |
| 116 | + |
| 117 | +SDK transports run behind an AgentV child-runner process on the host. The parent |
| 118 | +CLI/orchestrator starts the child with the target config and provider request, |
| 119 | +receives structured events/logs and one final provider response envelope, and |
| 120 | +maps child crashes, malformed child output, timeouts, and cancellation into |
| 121 | +provider-scoped errors. The concrete SDK package is imported inside the child |
| 122 | +runner only for the selected SDK target. |
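
One way to picture the parent side of this boundary is a pure function that turns whatever the child left behind into a provider-scoped result. The sketch below makes several assumptions the ADR does not specify: the child emits one JSON message per stdout line, the final response is marked `type: "final"`, and timeouts take precedence over crashes. `ChildOutcome`, `ProviderResult`, and `classifyChildOutcome` are hypothetical names.

```typescript
// Hypothetical shapes: the ADR does not define these field names.
interface ChildOutcome {
  exitCode: number | null; // null when the child was killed by a signal
  signal: string | null;
  timedOut: boolean;
  cancelled: boolean;
  stdoutLines: string[];   // assumed wire format: one JSON message per line
}

type ProviderResult =
  | { ok: true; response: unknown; events: unknown[] }
  | { ok: false; kind: "target_execution"; reason: string; events: unknown[] };

function classifyChildOutcome(outcome: ChildOutcome): ProviderResult {
  const events: unknown[] = [];
  let finalResponse: unknown = undefined;
  let sawFinal = false;
  let malformed = false;

  for (const line of outcome.stdoutLines) {
    if (line.trim() === "") continue;
    let message: unknown;
    try {
      message = JSON.parse(line);
    } catch (_err) {
      malformed = true;
      continue;
    }
    if (typeof message !== "object" || message === null) {
      malformed = true;
      continue;
    }
    const { type, payload } = message as { type?: unknown; payload?: unknown };
    if (type === "final") {
      finalResponse = payload;
      sawFinal = true;
    } else {
      events.push(message); // partial events are kept even when the run fails
    }
  }

  const fail = (reason: string): ProviderResult =>
    ({ ok: false, kind: "target_execution", reason, events });

  // Assumed precedence: timeout, then cancellation, then crash, then bad output.
  if (outcome.timedOut) return fail("timeout");
  if (outcome.cancelled) return fail("cancelled");
  if (outcome.signal !== null || outcome.exitCode !== 0) return fail("child_crash");
  if (malformed || !sawFinal) return fail("malformed_output");

  return { ok: true, response: finalResponse, events };
}
```

Because the function only inspects the recorded outcome, the parent can build its result even when the child died mid-stream.
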
| 123 | + |
| 124 | +This is process isolation for SDK dependency and crash containment. It is not |
| 125 | +Docker/container isolation and does not make `runtime: host` equivalent to |
| 126 | +`runtime.mode: sandbox`. |
| 127 | + |
| 128 | +### Runtime Modes |
| 129 | + |
| 130 | +| Runtime | Boundary | Use case | |
| 131 | +| --- | --- | --- | |
| 132 | +| `host` | Runs the installed CLI, protocol server, or SDK child runner on the current machine with the user's normal auth/profile unless provider config overrides it. | Local research, subscription OAuth, and evaluating the exact installed agent/profile an engineer uses manually. | |
| 133 | +| `profile` | Runs a host process with an isolated home directory, provider homes, temp dirs, and environment, for example by overriding `HOME` and `CODEX_HOME` and applying explicit env allowlists. | Cleaner local evals without full container cost. | |
| 134 | +| `sandbox` | Runs through a separate substrate such as Docker, a managed sandbox, remote worker, or future container backend. | CI, reproducibility, untrusted tasks, and stronger filesystem/runtime containment. | |
| 135 | + |
| 136 | +Host runtime is the first supported path for coding-agent targets. CLI, RPC, |
| 137 | +and app-server transports run against the host-installed agent/profile. SDK |
| 138 | +transports also run on the host, but through the AgentV child-runner process |
| 139 | +described above. |
| 140 | + |
| 141 | +The current implementation supports Docker sandbox execution for generic |
| 142 | +`provider: cli`. Sandbox-aware coding-agent providers are future work. When a |
| 143 | +coding-agent provider is authored with `runtime.mode: sandbox` before a |
| 144 | +sandbox-aware runner exists, AgentV should return a deliberate |
| 145 | +`target_execution` error envelope rather than pretending the target ran or |
| 146 | +crashing the evaluator. |
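
As a rough illustration of that guard, the check below returns an error envelope instead of throwing when a coding-agent provider is paired with sandbox mode. The provider list is taken from this ADR; `checkSandboxSupport`, the `TargetExecutionError` shape, and the message text are hypothetical.

```typescript
// Coding-agent providers named in this ADR; generic `cli` is deliberately absent.
const CODING_AGENT_PROVIDERS = [
  "codex-app-server", "codex-cli", "pi-rpc", "pi-cli", "claude-cli",
  "copilot-cli", "codex-sdk", "pi-sdk", "claude-sdk", "copilot-sdk",
];

// Hypothetical envelope shape; only the `target_execution` kind comes from the ADR.
interface TargetExecutionError {
  kind: "target_execution";
  targetId: string;
  message: string;
}

// Returns an error envelope when the combination is not runnable yet, else null.
function checkSandboxSupport(
  targetId: string,
  provider: string,
  mode: "host" | "profile" | "sandbox",
): TargetExecutionError | null {
  const isCodingAgent = CODING_AGENT_PROVIDERS.indexOf(provider) !== -1;
  if (mode === "sandbox" && isCodingAgent) {
    return {
      kind: "target_execution",
      targetId,
      message: "provider " + provider + " has no sandbox-aware runner yet",
    };
  }
  return null;
}
```

Returning a value rather than throwing keeps the unsupported combination on the same reporting path as any other target failure.
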
| 147 | + |
| 148 | +Codex `config.sandbox_mode` is a Codex provider permission/sandbox knob passed |
| 149 | +to Codex. It is not AgentV `runtime.mode: sandbox`. |
| 150 | + |
| 151 | +### Failure Contract |
| 152 | + |
| 153 | +Coding-agent providers must report target failures through structured |
| 154 | +`target_execution` envelopes whenever possible. That includes: |
| 155 | + |
| 156 | +- spawn failures and missing executables |
| 157 | +- provider nonzero exits |
| 158 | +- malformed provider output |
| 159 | +- provider timeouts or cancellations |
| 160 | +- SDK child-runner crashes before or after partial events |
| 161 | +- sandbox infrastructure failures |
| 162 | +- target task failures returned by a protocol provider |
| 163 | +- failed providers that leave only partial transcripts or logs |
| 164 | + |
| 165 | +Target crashes are target results. They must not become AgentV orchestrator |
| 166 | +crashes or prevent final run-bundle artifacts from being written. |
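
The finalization guarantee can be sketched as a run loop that records every target failure as a result and always reaches the summary write. This is a simplified, synchronous illustration: `runAll`, `TargetResult`, and the injected `execute` and `writeSummary` callbacks are hypothetical, and a real orchestrator would likely be asynchronous. An in-process `try`/`catch` alone does not contain hard crashes, which is why the child-runner boundary exists.

```typescript
// Hypothetical result shape for illustration.
interface TargetResult {
  targetId: string;
  status: "ok" | "target_execution";
  detail: string;
}

function runAll(
  targetIds: string[],
  execute: (targetId: string) => string,           // may throw on target failure
  writeSummary: (results: TargetResult[]) => void, // stands in for run-bundle writing
): TargetResult[] {
  const results: TargetResult[] = [];
  try {
    for (const targetId of targetIds) {
      try {
        results.push({ targetId, status: "ok", detail: execute(targetId) });
      } catch (err) {
        // A target failure becomes a recorded result, not an orchestrator crash.
        const detail = err instanceof Error ? err.message : String(err);
        results.push({ targetId, status: "target_execution", detail });
      }
    }
  } finally {
    // Finalization runs even if something unexpected escapes the loop.
    writeSummary(results);
  }
  return results;
}
```
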
| 167 | + |
| 168 | +### Replay and Log Providers |
| 169 | + |
| 170 | +`provider: copilot-log` is removed from the authored live target surface before |
| 171 | +beta. AgentV should not add `codex-log`, `claude-log`, `pi-log`, or other |
| 172 | +provider-specific log target providers. |
| 173 | + |
| 174 | +Provider-native logs remain useful as raw provenance and import inputs. Copilot |
| 175 | +`events.jsonl` parsing should feed import/normalization into a |
| 176 | +provider-agnostic recorded trajectory replay contract. Replay is an |
| 177 | +eval/orchestrator mode or generic replay target over AgentV trajectory artifacts, |
| 178 | +not a live coding-agent runtime provider. Live Copilot targets remain |
| 179 | +`copilot-cli` and `copilot-sdk`. |
| 180 | + |
| 181 | +This aligns with ADR 0008: raw native transcripts are preserved for debugging |
| 182 | +and parser improvement, while normalized AgentV transcript/trajectory artifacts |
| 183 | +are the durable input to grading, Dashboard inspection, and replay. |
| 184 | + |
| 185 | +## Consequences |
| 186 | + |
| 187 | +- Users can evaluate the same host-installed agent/profile they use manually. |
| 188 | +- Provider IDs remain explicit about control boundary instead of collapsing |
| 189 | + runtime choices into provider config flags. |
| 190 | +- SDK providers stay available when SDK-native events or controls are worth the |
| 191 | + extra complexity, but SDK dependency failures do not take down the parent CLI. |
| 192 | +- `runtime: host` remains lightweight and zero-infra; stronger profile/sandbox |
| 193 | + isolation can be added without changing target identity semantics. |
| 194 | +- Generic Docker sandbox support through `provider: cli` remains valid, while |
| 195 | + sandbox-aware coding-agent adapters are deliberately deferred. |
| 196 | +- Offline grading/replay gets one provider-agnostic path instead of one |
| 197 | + provider-specific `*-log` target surface per backend. |
| 198 | + |
| 199 | +## Alternatives Considered |
| 200 | + |
| 201 | +### Import coding-agent SDKs in the main AgentV process |
| 202 | + |
| 203 | +Rejected. Lazy SDK import is helpful for optional dependencies, but it is not a |
| 204 | +runtime isolation boundary. Prior Pi SDK dogfood exposed stream teardown |
| 205 | +failures that can outlive the apparent agent result and crash the parent |
| 206 | +process. The parent process must own run finalization, timeout enforcement, and |
| 207 | +artifact writing. |
| 208 | + |
| 209 | +### Make SDK providers the default because they are structured |
| 210 | + |
| 211 | +Rejected. SDKs can expose useful events, but the default AgentV path should |
| 212 | +match the real installed CLI/profile where possible and keep the product |
| 213 | +zero-infra. SDK providers are explicit advanced targets. |
| 214 | + |
| 215 | +### Copy Promptfoo provider naming wholesale |
| 216 | + |
| 217 | +Rejected. Promptfoo is useful evidence for explicit provider IDs and optional |
| 218 | +provider dependencies, but AgentV keeps target identity and backend/control |
| 219 | +boundary separate: `id` is stable AgentV target identity, while `provider` names |
| 220 | +the adapter kind. AgentV does not copy Promptfoo's use of `label` as the target |
| 221 | +identity field or carry compatibility aliases where the beta contract can be |
| 222 | +cleaner. |
| 223 | + |
| 224 | +### Put runtime placement under provider-specific config |
| 225 | + |
| 226 | +Rejected. Runtime placement is cross-provider orchestration state. It belongs in |
| 227 | +`runtime`, not in every provider's `config` with different names and precedence |
| 228 | +rules. |
| 229 | + |
| 230 | +### Treat provider logs as live target providers |
| 231 | + |
| 232 | +Rejected. Passive logs do not run an agent and should not satisfy live |
| 233 | +host-runtime dogfood. They are import/replay sources. Keeping them out of |
| 234 | +authored live target YAML avoids a family of `*-log` providers and preserves a |
| 235 | +single normalized replay contract. |
| 236 | + |
| 237 | +## Non-Goals |
| 238 | + |
| 239 | +- Implementing or validating the full live provider matrix. |
| 240 | +- Implementing profile-mode or sandbox-aware coding-agent provider runners. |
| 241 | +- Replacing the generic `provider: cli` sandbox path. |
| 242 | +- Designing the full provider-agnostic replay cassette contract. |
| 243 | +- Adding compatibility aliases for removed beta-only target provider names. |