Commit 2c94ce8

docs(adr): record coding-agent target runtime contract (#1655)
* docs(adr): record coding-agent target runtime contract
* docs(adr): correct authored concurrency field
1 parent 94e91a5 commit 2c94ce8

1 file changed

Lines changed: 243 additions & 0 deletions

@@ -0,0 +1,243 @@
# 18. Coding-agent target runtime contract

Date: 2026-07-05

## Status

Accepted (2026-07-05). Tracks the coding-agent runtime closeout under Beads
`av-t2o5`, `av-y7eq.10`, `av-y7eq.9`, and `av-t2o5.2`.

This ADR records the target runtime contract. It does not claim live provider
matrix completion; host-runtime dogfood evidence is owned by `av-y7eq.9`.
Profile and sandbox isolation evidence is deferred to `av-t2o5.1`.
Provider-agnostic recorded trajectory replay and removal of
`provider: copilot-log` from authored target YAML are owned by `av-t2o5.2`.

## Context

AgentV evaluates coding agents in real repositories. Those agents are not
ordinary LLM APIs: they run tools, read files from disk, mutate workspaces,
manage auth profiles, stream transcripts, and may spawn their own subprocesses.
If AgentV imports fragile agent SDKs directly into the main CLI/orchestrator
process, a provider bug can crash the run before AgentV finalizes
`.agentv/results/<run_id>/`, `summary.json`, `.internal/index.jsonl`,
transcripts, and grading artifacts.

The product direction is repo-native, workspace-native evaluation with portable
run bundles as the source of truth. The default path should therefore evaluate
the real installed agent/profile on the host when that is what the user wants,
while still keeping provider instability inside target execution envelopes.

Peer frameworks are evidence, not schema authority:

- Promptfoo local clone `/home/entity/projects/promptfoo/promptfoo` at commit
  `6bfc5a0c7f16f9c4717ac731d276b578e63d0769` separates coding-agent provider
  families such as Codex SDK, Codex app-server, and Claude Agent SDK. Its
  taxonomy explicitly says provider IDs should encode the runtime boundary.
- Promptfoo also shows useful optional dependency ergonomics: Claude Agent SDK
  is loaded only for that provider, but the SDK still runs in Promptfoo's
  provider process. AgentV adopts optional/lazy SDK loading, but adds an
  AgentV-owned child-runner boundary.
- Vercel `agent-eval` at commit
  `a9dcc9a8c53dbc22ececc967ded7ab3963f18e67` runs coding agents through
  sandboxed CLI-like execution, records raw and parsed transcripts, and writes
  result bundles under `results/<experiment>/<timestamp>/`.
- Margin Evals at commit
  `53fb2fd080689efaf7934573d8759d14fc1043e4` uses run-centric artifacts,
  process/runtime logs, managed agent definitions, and trajectory hooks.
- Harbor uses container environments and ATIF trajectories for benchmark-grade
  agent execution. Its 2026-06-18 change to run `harbor check` and
  `harbor analyze` as Harbor trials instead of in-process Claude SDK calls
  supports the same boundary: produce artifacts from real executions instead
  of hiding work inside the coordinator process.
- Kata Symphony and Taskplane validate Pi RPC as a process/stdio control
  boundary: `pi --mode rpc` is launched as a live subprocess locally, over
  SSH, or through worker orchestration, rather than being collapsed into an
  in-process SDK call.
- entireio/cli preserves native agent sessions and derives normalized
  transcript/checkpoint metadata from provider-specific logs. That pattern
  supports AgentV preserving raw logs as provenance and importing them into
  provider-agnostic replay artifacts, not exposing one live `*-log` target
  provider per backend.

## Decision

AgentV treats coding-agent targets as external runtimes to orchestrate, not
libraries to call in-process by default.

Authored targets use this shape:

```yaml
targets:
  - id: codex-local
    provider: codex-app-server
    runtime: host
    config:
      command: ["codex", "app-server"]
      model: gpt-5-codex
```

The fields mean:

| Field | Meaning |
| --- | --- |
| `id` | Stable AgentV target identity used for CLI selection, artifacts, Dashboard, and comparisons. |
| `provider` | Adapter/control boundary such as `codex-cli`, `codex-app-server`, `pi-rpc`, `claude-cli`, or `copilot-sdk`. |
| `runtime` | Placement/isolation mode: `host`, `profile`, or `sandbox`; may be a string shorthand or an object with `mode`. |
| `config` | Provider-specific knobs such as `command`, `model`, `cwd`, `timeout_seconds`, auth endpoint settings, permission flags, and provider protocol settings. |

Do not add competing top-level fields such as `isolation`, `sandbox`,
`profile`, `install`, `container`, `environment`, `executable`, `binary`,
`args`, or `arguments` for this contract. Process/protocol providers use
`config.command` as a non-empty argv array. Authored eval concurrency belongs
under `evaluate_options.max_concurrency`, not inside a target definition.
Grader selection belongs to `defaults.grader`, CLI overrides, or
evaluator-level target selection, not to the system-under-test target.
97+
### Provider Boundaries
98+
99+
Process and protocol providers are the preferred defaults:
100+
101+
- `codex-app-server`: preferred Codex rich protocol/control boundary.
102+
- `codex-cli`: simple Codex subprocess boundary for host/profile execution and
103+
installed user shims.
104+
- `pi-rpc`: preferred Pi rich control boundary over stdio/RPC.
105+
- `pi-cli`: simple Pi subprocess boundary.
106+
- `claude-cli`: default Claude path through the installed Claude CLI.
107+
- `copilot-cli`: active Copilot execution through the installed CLI/protocol
108+
path.
109+
110+
SDK providers are explicit advanced paths:
111+
112+
- `codex-sdk`
113+
- `pi-sdk`
114+
- `claude-sdk`
115+
- `copilot-sdk`
116+
117+
SDK transports run behind an AgentV child-runner process on the host. The parent
118+
CLI/orchestrator starts the child with the target config and provider request,
119+
receives structured events/logs and one final provider response envelope, and
120+
maps child crashes, malformed child output, timeouts, and cancellation into
121+
provider-scoped errors. The concrete SDK package is imported inside the child
122+
runner only for the selected SDK target.
123+
124+
This is process isolation for SDK dependency and crash containment. It is not
125+
Docker/container isolation and does not make `runtime: host` equivalent to
126+
`runtime.mode: sandbox`.
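
The parent side of the child-runner boundary might look roughly like the sketch below. It assumes the child reads one JSON request on stdin and prints its final response envelope as the last stdout line; that wire format, the `run_child` and `error_envelope` names, and the `reason` codes are all hypothetical. Only the `target_execution` envelope kind comes from this ADR. `DEMO_CHILD` is a stand-in child process, not a real SDK runner.

```python
import json
import subprocess
import sys

# Stand-in child runner for illustration: echoes a successful envelope.
DEMO_CHILD = [
    sys.executable,
    "-c",
    "import json, sys; json.load(sys.stdin); print(json.dumps({'status': 'ok'}))",
]


def error_envelope(reason: str, detail: str) -> dict:
    # Hypothetical envelope fields; `target_execution` is the kind named in the ADR.
    return {"status": "error", "kind": "target_execution", "reason": reason, "detail": detail}


def run_child(argv: list, request: dict, timeout_seconds: float) -> dict:
    """Run one provider request in a child process and always return an envelope."""
    try:
        proc = subprocess.run(
            argv,
            input=json.dumps(request),
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return error_envelope("timeout", f"no response within {timeout_seconds}s")
    except OSError as exc:  # missing executable, permission denied, ...
        return error_envelope("spawn_failed", str(exc))

    if proc.returncode != 0:
        return error_envelope("child_crashed", f"exit code {proc.returncode}")

    lines = proc.stdout.strip().splitlines()
    try:
        # Assumption: the final stdout line is the single response envelope.
        envelope = json.loads(lines[-1])
    except (IndexError, json.JSONDecodeError):
        return error_envelope("malformed_output", proc.stdout[-200:])
    if not isinstance(envelope, dict):
        return error_envelope("malformed_output", proc.stdout[-200:])
    return envelope
```

Calling `run_child(DEMO_CHILD, {"prompt": "hi"}, 10)` returns the child's envelope. Every failure path returns a dictionary instead of raising, so the parent keeps control of run finalization whatever the child does.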

### Runtime Modes

| Runtime | Boundary | Use case |
| --- | --- | --- |
| `host` | Runs the installed CLI, protocol server, or SDK child runner on the current machine with the user's normal auth/profile unless provider config overrides it. | Local research, subscription OAuth, and evaluating the exact installed agent/profile an engineer uses manually. |
| `profile` | Runs a host process with isolated home/provider home/temp/env configuration, such as `HOME`, `CODEX_HOME`, provider homes, temp dirs, and explicit env allowlists. | Cleaner local evals without full container cost. |
| `sandbox` | Runs through a separate substrate such as Docker, a managed sandbox, remote worker, or future container backend. | CI, reproducibility, untrusted tasks, and stronger filesystem/runtime containment. |

Host runtime is the first supported path for coding-agent targets. CLI, RPC,
and app-server transports run against the host-installed agent/profile. SDK
transports also run on the host, but through the AgentV child-runner process
described above.

The current implementation supports Docker sandbox execution for generic
`provider: cli`. Sandbox-aware coding-agent providers are future work. When a
coding-agent provider is authored with `runtime.mode: sandbox` before a
sandbox-aware runner exists, AgentV should return a deliberate
`target_execution` error envelope rather than pretending the target ran or
crashing the evaluator.

Codex `config.sandbox_mode` is a Codex provider permission/sandbox knob passed
to Codex. It is not AgentV `runtime.mode: sandbox`.
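
The deliberate error for unsupported sandbox targets could be produced by a preflight check along these lines. This is a sketch under assumptions: `preflight`, `SANDBOX_AWARE_PROVIDERS`, the `reason` code, and the envelope fields other than the `target_execution` kind are hypothetical.

```python
from typing import Optional

# Per the ADR, only the generic `cli` provider has sandbox execution today.
SANDBOX_AWARE_PROVIDERS = {"cli"}


def runtime_mode(runtime) -> str:
    """Accept the string shorthand or an object with `mode`."""
    return runtime["mode"] if isinstance(runtime, dict) else runtime


def preflight(target: dict) -> Optional[dict]:
    """Return an error envelope if the target cannot run as authored, else None."""
    mode = runtime_mode(target["runtime"])
    if mode == "sandbox" and target["provider"] not in SANDBOX_AWARE_PROVIDERS:
        return {
            "status": "error",
            "kind": "target_execution",  # envelope kind named in the ADR
            "reason": "sandbox_runner_unavailable",  # hypothetical reason code
            "target_id": target["id"],
            "message": f"provider {target['provider']} has no sandbox-aware runner yet",
        }
    # Provider knobs such as Codex `config.sandbox_mode` are not inspected here:
    # they are passed through to the provider and do not change AgentV placement.
    return None
```

Checking only `runtime` keeps the two sandbox concepts apart: a `codex-app-server` target with `runtime: host` and a Codex `config.sandbox_mode` value passes this check unchanged.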

### Failure Contract

Coding-agent providers must report target failures through structured
`target_execution` envelopes whenever possible. That includes:

- spawn failures and missing executables
- provider nonzero exits
- malformed provider output
- provider timeouts or cancellations
- SDK child-runner crashes before or after partial events
- sandbox infrastructure failures
- target task failures returned by a protocol provider
- partial transcripts or logs from a failed provider

Target crashes are target results. They must not become AgentV orchestrator
crashes or prevent final run-bundle artifacts from being written.
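
One way to read "target crashes are target results" is an orchestrator loop that converts every target exception into a result row and writes the run bundle in a `finally` block. The sketch below is illustrative only: `run_all`, the `summary.json` layout, and the result fields are assumptions, and `flaky` is a made-up executor that fails for one target.

```python
import json
import tempfile
from pathlib import Path


def run_all(targets: list, execute, run_dir: Path) -> dict:
    """Execute every target, record crashes as results, and always write summary.json."""
    results = []
    try:
        for target in targets:
            try:
                results.append({"target_id": target["id"], **execute(target)})
            except Exception as exc:  # a target crash is a target result
                results.append({
                    "target_id": target["id"],
                    "status": "error",
                    "kind": "target_execution",
                    "message": f"{type(exc).__name__}: {exc}",
                })
    finally:
        # Finalization runs even if the loop itself is interrupted.
        run_dir.mkdir(parents=True, exist_ok=True)
        summary = {"results": results}
        (run_dir / "summary.json").write_text(json.dumps(summary, indent=2))
    return summary


def flaky(target: dict) -> dict:
    # Made-up executor: one target crashes, the others succeed.
    if target["id"] == "boom":
        raise RuntimeError("stream teardown failed")
    return {"status": "ok"}


run_dir = Path(tempfile.mkdtemp()) / ".agentv" / "results" / "demo-run"
summary = run_all([{"id": "a"}, {"id": "boom"}, {"id": "b"}], flaky, run_dir)
```

In this run the crashing target yields an error row, the targets after it still execute, and `summary.json` exists on disk when `run_all` returns.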

### Replay and Log Providers

`provider: copilot-log` is removed from the authored live target surface before
beta. AgentV should not add `codex-log`, `claude-log`, `pi-log`, or other
provider-specific log target providers.

Provider-native logs remain useful as raw provenance and import inputs. Copilot
`events.jsonl` parsing should feed import/normalization into a
provider-agnostic recorded trajectory replay contract. Replay is an
eval/orchestrator mode or generic replay target over AgentV trajectory artifacts,
not a live coding-agent runtime provider. Live Copilot targets remain
`copilot-cli` and `copilot-sdk`.

This aligns with ADR 0008: raw native transcripts are preserved for debugging
and parser improvement, while normalized AgentV transcript/trajectory artifacts
are the durable input to grading, Dashboard inspection, and replay.

## Consequences

- Users can evaluate the same host-installed agent/profile they use manually.
- Provider IDs remain explicit about control boundary instead of collapsing
  runtime choices into provider config flags.
- SDK providers stay available when SDK-native events or controls are worth the
  extra complexity, but SDK dependency failures do not take down the parent CLI.
- `runtime: host` remains lightweight and zero-infra; stronger profile/sandbox
  isolation can be added without changing target identity semantics.
- Generic Docker sandbox support through `provider: cli` remains valid, while
  sandbox-aware coding-agent adapters are deliberately deferred.
- Offline grading/replay gets one provider-agnostic path instead of one
  provider-specific `*-log` target surface per backend.

## Alternatives Considered

### Import coding-agent SDKs in the main AgentV process

Rejected. Lazy SDK import is helpful for optional dependencies, but it is not a
runtime isolation boundary. Prior Pi SDK dogfood exposed stream teardown
failures that can outlive the apparent agent result and crash the parent
process. The parent process must own run finalization, timeout enforcement, and
artifact writing.

### Make SDK providers the default because they are structured

Rejected. SDKs can expose useful events, but the default AgentV path should
match the real installed CLI/profile where possible and keep the product
zero-infra. SDK providers are explicit advanced targets.

### Copy Promptfoo provider naming wholesale

Rejected. Promptfoo is useful evidence for explicit provider IDs and optional
provider dependencies, but AgentV keeps target identity and backend/control
boundary separate: `id` is stable AgentV target identity, while `provider` names
the adapter kind. AgentV does not copy Promptfoo's use of `label` as the target
identity field or carry compatibility aliases where the beta contract can be
cleaner.

### Put runtime placement under provider-specific config

Rejected. Runtime placement is cross-provider orchestration state. It belongs in
`runtime`, not in every provider's `config` with different names and precedence
rules.

### Treat provider logs as live target providers

Rejected. Passive logs do not run an agent and should not satisfy live
host-runtime dogfood. They are import/replay sources. Keeping them out of
authored live target YAML avoids a family of `*-log` providers and preserves a
single normalized replay contract.

## Non-Goals

- Implementing or validating the full live provider matrix.
- Implementing profile-mode or sandbox-aware coding-agent provider runners.
- Replacing the generic `provider: cli` sandbox path.
- Designing the full provider-agnostic replay cassette contract.
- Adding compatibility aliases for removed beta-only target provider names.
