-A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, Azure OpenAI, Anthropic, and Google Gemini.
+A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI and Azure OpenAI.
**Recommended Models:** Use Claude Sonnet 4.5 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.

-## Requirements
-
-- Node.js 20.0.0 or higher
-- Environment variables for your chosen providers (configured via targets.yaml)
-
-Environment keys (configured via targets.yaml):
-
-- **Azure OpenAI:** Set environment variables specified in your target's `settings.endpoint`, `settings.api_key`, and `settings.model`
-- **Anthropic Claude:** Set environment variables specified in your target's `settings.api_key` and `settings.model`
-- **Google Gemini:** Set environment variables specified in your target's `settings.api_key` and optional `settings.model`
-- **VS Code:** Set environment variable specified in your target's `settings.workspace_env` → `.code-workspace` path
-- **CLI provider:** Configure `command_template` plus optional `cwd`, `env`, `timeout_seconds`, and `healthcheck` fields in targets.yaml; CLI `settings.env` entries are merged into the process environment
-
 ## Targets and Environment Variables

 Execution targets in `.agentv/targets.yaml` decouple evals from providers/settings and provide flexible environment variable mapping.
@@ -205,7 +192,7 @@ Execution targets in `.agentv/targets.yaml` decouple evals from providers/settin
 Each target specifies:

 - `name`: Unique identifier for the target
-- `provider`: The model provider (`azure`, `anthropic`, `gemini`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
+- `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
 - `settings`: Environment variable names to use for this target
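
As a rough illustration of the three fields listed above, a target entry could be modeled and checked along the following lines. This is a sketch under stated assumptions, not AgentV's actual schema: the `TargetDefinition` type and the `isKnownProvider` and `parseTarget` helpers are hypothetical names, and only the provider identifiers are taken from the list above.

```typescript
// Hypothetical model of one entry in .agentv/targets.yaml (not AgentV's real types).
const KNOWN_PROVIDERS = [
  "azure", "anthropic", "gemini", "codex",
  "vscode", "vscode-insiders", "cli", "mock",
] as const;

type ProviderName = (typeof KNOWN_PROVIDERS)[number];

interface TargetDefinition {
  name: string;
  provider: ProviderName;
  settings: Record<string, unknown>;
}

// Type guard: narrows an arbitrary string to one of the documented providers.
function isKnownProvider(value: string): value is ProviderName {
  return (KNOWN_PROVIDERS as readonly string[]).indexOf(value) !== -1;
}

// Validates a parsed YAML entry and returns a typed target, or throws.
function parseTarget(raw: Record<string, unknown>): TargetDefinition {
  const name = raw["name"];
  const provider = raw["provider"];
  if (typeof name !== "string" || name.length === 0) {
    throw new Error("target is missing a non-empty `name`");
  }
  if (typeof provider !== "string" || !isKnownProvider(provider)) {
    throw new Error(`target "${name}" has an unknown provider: ${String(provider)}`);
  }
  const settings = raw["settings"];
  return {
    name,
    provider,
    settings:
      typeof settings === "object" && settings !== null
        ? (settings as Record<string, unknown>)
        : {},
  };
}
```

Keeping the provider list in one `as const` array means the runtime check and the compile-time union cannot drift apart.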

 ### Examples
@@ -221,26 +208,6 @@ Each target specifies:
     model: "AZURE_DEPLOYMENT_NAME"
 ```

-**Anthropic targets:**
-
-```yaml
-- name: anthropic_base
-  provider: anthropic
-  settings:
-    api_key: "ANTHROPIC_API_KEY"
-    model: "ANTHROPIC_MODEL"
-```
-
-**Google Gemini targets:**
-
-```yaml
-- name: gemini_base
-  provider: gemini
-  settings:
-    api_key: "GOOGLE_API_KEY"
-    model: "GOOGLE_GEMINI_MODEL" # Optional, defaults to gemini-2.0-flash-exp
-```
-
 **VS Code targets:**

 ```yaml
@@ -261,7 +228,7 @@ Each target specifies:
 - name: local_cli
   provider: cli
   settings:
-    command_template: 'code chat {PROMPT} {FILES}'
+    command_template: 'somecommand {PROMPT} {FILES}'
     files_format: '--file {path}'
     cwd: PROJECT_ROOT # optional working directory
     env: # merged into process.env
@@ -272,8 +239,22 @@ Each target specifies:
       command_template: code --version
 ```

-CLI placeholders are `{PROMPT}`, `{GUIDELINES}`, `{EVAL_ID}`, `{ATTEMPT}`, and `{FILES}`. Values are shell-escaped automatically; avoid wrapping them in extra quotes unless your CLI requires nested quoting. `{FILES}` renders each file path using `files_format` (supports `{path}` and `{basename}`) and joins with spaces. Optional `healthcheck` probes (HTTP or command) run once before the first eval and abort the run on failure.
-CLI troubleshooting: unsupported placeholders fail validation, so stick to the tokens above; if your CLI logs show doubled quotes, drop extra quoting in `command_template` and rely on the built-in escaping; if healthchecks fail, raise `timeout_seconds` or point the probe at a fast status endpoint.
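
The placeholder behaviour described in the two paragraphs above (shell escaping, `{FILES}` expansion through `files_format`, and rejection of unsupported tokens) can be sketched roughly as follows. This is an illustration only: `shellEscape`, `renderFiles`, and `renderCommand` are hypothetical helpers, and POSIX single-quote escaping is an assumption, since the text only says values are "shell-escaped".

```typescript
// Hypothetical sketch of CLI template rendering; AgentV's real code may differ.

// POSIX-style single-quote escaping (assumed escaping strategy).
function shellEscape(value: string): string {
  return "'" + value.split("'").join("'\\''") + "'";
}

function baseName(path: string): string {
  const parts = path.split("/");
  return parts[parts.length - 1] ?? path;
}

interface RenderInput {
  prompt: string;
  guidelines: string;
  evalId: string;
  attempt: number;
  files: string[];
}

// Renders every file through files_format ({path} / {basename}) and joins with spaces.
function renderFiles(filesFormat: string, files: string[]): string {
  return files
    .map((file) =>
      filesFormat.replace(/\{(path|basename)\}/g, (_m: string, key: string) =>
        shellEscape(key === "path" ? file : baseName(file)),
      ),
    )
    .join(" ");
}

// Substitutes the supported placeholders; any other {TOKEN} is rejected.
function renderCommand(template: string, filesFormat: string, input: RenderInput): string {
  const values: Record<string, string> = {
    PROMPT: shellEscape(input.prompt),
    GUIDELINES: shellEscape(input.guidelines),
    EVAL_ID: shellEscape(input.evalId),
    ATTEMPT: shellEscape(String(input.attempt)),
    FILES: renderFiles(filesFormat, input.files),
  };
  return template.replace(/\{([A-Za-z_]+)\}/g, (_m: string, key: string) => {
    if (!Object.prototype.hasOwnProperty.call(values, key)) {
      throw new Error(`Unsupported placeholder: {${key}}`);
    }
    return values[key] as string;
  });
}
```

Because each value is already quoted by the renderer, a template such as `somecommand {PROMPT} {FILES}` needs no quotes of its own, which is why extra quoting in `command_template` tends to produce doubled quotes.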
+**Codex CLI targets:**
+
+```yaml
+- name: codex_cli
+  provider: codex
+  settings:
+    executable: "CODEX_CLI_PATH" # defaults to `codex` if omitted
+    profile: "CODEX_PROFILE" # matches the profile in ~/.codex/config
+    model: "CODEX_MODEL" # optional, falls back to profile default
+    approval_preset: "CODEX_APPROVAL_PRESET"
+    timeout_seconds: 180
+    cwd: CODEX_WORKSPACE_DIR
+```
+
+Codex targets require the standalone `codex` CLI and a configured profile (via `codex configure`) so credentials are stored in `~/.codex/config` (or whatever path the CLI already uses). AgentV mirrors all guideline and attachment files into a fresh scratch workspace, so the `file://` preread links remain valid even when the CLI runs outside your repo tree.
+Confirm the CLI works by running `codex exec --json --profile <name> "ping"` (or any supported dry run) before starting an eval. This prints JSONL events; seeing `item.completed` messages indicates the CLI is healthy.
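
The manual check described above could also be automated by scanning the captured output for an `item.completed` event. The sketch below assumes each JSONL event carries a top-level `type` field; that shape and the `looksHealthy` helper are illustrative assumptions, not confirmed by the text.

```typescript
// Hypothetical helper: returns true when at least one JSONL line is an
// event whose top-level `type` is "item.completed" (assumed event shape).
function looksHealthy(stdout: string): boolean {
  for (const line of stdout.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.length === 0) continue;
    try {
      const event: unknown = JSON.parse(trimmed);
      if (
        typeof event === "object" &&
        event !== null &&
        (event as { type?: unknown }).type === "item.completed"
      ) {
        return true;
      }
    } catch {
      // Lines that are not JSON (for example stray log output) are skipped.
    }
  }
  return false;
}
```

A caller would pass in the stdout captured from the `codex exec --json` run and treat a `false` result as a reason to stop before dispatching any eval cases.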

 ## Timeout Handling and Retries

@@ -290,22 +271,116 @@ Example with custom timeout settings:
+The sample `codex_cli` target demonstrates how to drive the standalone Codex CLI from AgentV.
+
+1. Install the `codex` CLI (follow the official `codex-cli` README) and run `codex configure` so `~/.codex/config` exists.
+2. Export either `OPENAI_API_KEY` or `CODEX_API_KEY`, plus optional `CODEX_PROFILE`, `CODEX_MODEL`, and `CODEX_APPROVAL_PRESET` values (see `.env.template`).
+3. (Optional) Set `CODEX_CLI_PATH` if the `codex` executable is not already on your `PATH`.
+AgentV mirrors guideline and attachment files into the Codex workspace and passes the combined prompt to `codex exec --json`, so preread links behave the same way as the VS Code provider.
+- Evaluators need parity between the VS Code Copilot provider and OpenAI's Codex CLI so we can compare agent behaviours in identical YAML panels.
+- Codex already exposes a headless `codex exec --json` mode (see `codex-cli/README.md` lines 218-231) and configurable provider profiles (`docs/config.md` lines 61-191), so AgentV can drive it non-interactively.
+- Without first-class support we currently have to fall back to the generic CLI provider, which cannot stage guideline prereads or capture Codex JSON results reliably.
+
+## What Changes
+- Extend the evaluation spec's Provider Integration requirement with a Codex-specific scenario covering executable discovery, workspace staging, JSONL invocation, and structured result parsing.
+- Define target settings (executable path, profile, model, approval preset, cwd) that map directly to Codex CLI knobs and lean on the CLI's existing configuration/credential handling.
+- Ensure attachments and guideline files are mirrored into the Codex workspace with `file://` prereads similar to the VS Code provider so Codex can open them before answering.
+- Surface actionable errors when Codex exits non-zero, times out, or emits invalid JSON to keep eval runs debuggable.
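
To make the last bullet concrete, the three failure modes it names (timeout, non-zero exit, invalid JSON) could be told apart with a small classifier like the one below. The `CliRunResult` and `RunFailure` shapes and the `classifyRun` function are hypothetical and only sketch one possible approach; they are not taken from the AgentV codebase.

```typescript
// Hypothetical result of running the Codex CLI once.
interface CliRunResult {
  exitCode: number | null; // null when the process was killed before exiting
  timedOut: boolean;
  stdout: string;
  stderr: string;
}

interface RunFailure {
  kind: "timeout" | "non_zero_exit" | "invalid_json";
  message: string;
}

// Returns a failure description, or null when the run looks usable.
function classifyRun(result: CliRunResult, timeoutSeconds: number): RunFailure | null {
  if (result.timedOut) {
    return {
      kind: "timeout",
      message: `Codex CLI timed out after ${timeoutSeconds}s`,
    };
  }
  if (result.exitCode !== 0) {
    return {
      kind: "non_zero_exit",
      message: `Codex CLI exited with code ${String(result.exitCode)}\nstderr:\n${result.stderr}`,
    };
  }
  const lines = result.stdout.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const line = (lines[i] ?? "").trim();
    if (line.length === 0) continue;
    try {
      JSON.parse(line);
    } catch (err) {
      return {
        kind: "invalid_json",
        message: `Codex CLI emitted invalid JSON on line ${i + 1}: ${line} (${String(err)})`,
      };
    }
  }
  return null;
}
```

Checking the timeout first matters because a killed process usually also reports a missing or non-zero exit code, and the timeout is the more useful thing to tell the user.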
+The system SHALL support multiple LLM providers with environment-based configuration.
+
+#### Scenario: Azure OpenAI provider
+
+- **WHEN** a test case uses the "azure-openai" provider
+- **THEN** the system reads `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, and `AZURE_DEPLOYMENT_NAME` from environment
+- **AND** invokes Azure OpenAI with the configured settings
+
+#### Scenario: Anthropic provider
+
+- **WHEN** a test case uses the "anthropic" provider
+- **THEN** the system reads `ANTHROPIC_API_KEY` from environment
+- **AND** invokes Anthropic Claude with the configured settings
+
+#### Scenario: Google Gemini provider
+
+- **WHEN** a test case uses the "gemini" provider
+- **THEN** the system reads `GOOGLE_API_KEY` from environment
+- **AND** optionally reads `GOOGLE_GEMINI_MODEL` to override the default model
+- **AND** invokes Google Gemini with the configured settings
+
+#### Scenario: VS Code Copilot provider
+
+- **WHEN** a test case uses the "vscode-copilot" provider
+- **THEN** the system generates a structured prompt file with preread block and SHA tokens
+- **AND** invokes the subagent library to execute the prompt
+- **AND** captures the Copilot response
+
+#### Scenario: Codex CLI provider
+
+- **WHEN** a test case uses the "codex" provider
+- **THEN** the system locates the Codex CLI executable (default `codex`, overrideable via the target)
+- **AND** it mirrors guideline and attachment files into a scratch workspace, emitting the same preread block links used by the VS Code provider so Codex opens every referenced file before answering
+- **AND** it renders the eval prompt into a single string and launches `codex exec --json` plus any configured profile, model, approval preset, and working-directory overrides defined on the target
+- **AND** it verifies the Codex executable is available while delegating profile/config resolution to the CLI itself
+- **AND** it parses the emitted JSONL event stream to capture the final assistant message as the provider response, attaching stdout/stderr when the CLI exits non-zero or returns malformed JSON
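
The last step of the Codex scenario above, capturing the final assistant message from the JSONL stream, might look roughly like this. The `CodexEvent` shape (a completed item carrying a `text` field) is an assumption made for illustration, and `extractFinalMessage` is a hypothetical name. The sketch simply keeps the text of the last completed item and attaches the raw output when a line fails to parse.

```typescript
// Assumed event shape; the spec only states that a JSONL event stream is emitted.
interface CodexEvent {
  type?: string;
  item?: { type?: string; text?: string };
}

// Returns the text of the last completed item, or throws with the raw output attached.
function extractFinalMessage(stdout: string): string {
  let finalText: string | undefined;
  const lines = stdout.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const line = (lines[i] ?? "").trim();
    if (line.length === 0) continue;
    let event: CodexEvent | null;
    try {
      event = JSON.parse(line) as CodexEvent | null;
    } catch (err) {
      throw new Error(
        `Malformed JSON on line ${i + 1} of Codex output (${String(err)}):\n${stdout}`,
      );
    }
    if (
      event !== null &&
      typeof event === "object" &&
      event.type === "item.completed" &&
      event.item !== undefined &&
      typeof event.item.text === "string"
    ) {
      finalText = event.item.text; // later events overwrite earlier ones
    }
  }
  if (finalText === undefined) {
    throw new Error(`Codex output contained no completed message item:\n${stdout}`);
  }
  return finalText;
}
```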
+
+#### Scenario: Mock provider for dry-run
+
+- **WHEN** a test case uses the "mock" provider or dry-run is enabled
+- **THEN** the system returns a predefined mock response
+- **AND** does not make external API calls
+
+#### Scenario: Missing provider credentials
+
+- **WHEN** a provider is selected but required environment variables are missing
+- **THEN** the system fails fast with a clear error message
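
A fail-fast check for the missing-credentials scenario could be sketched as below. It assumes, as the README describes, that each `settings` value names an environment variable; the `resolveSettings` function and its parameters are hypothetical, and the environment is passed in explicitly so the sketch does not depend on Node's `process.env`.

```typescript
// Hypothetical resolver: maps setting keys to environment variable values
// and throws before any eval runs if a required value is missing.
function resolveSettings(
  targetName: string,
  settings: Record<string, string>, // setting key -> environment variable name
  required: string[],
  env: Record<string, string | undefined>,
): Record<string, string> {
  const resolved: Record<string, string> = {};
  for (const key of Object.keys(settings)) {
    const varName = settings[key];
    if (varName === undefined) continue;
    const value = env[varName];
    if (value !== undefined && value !== "") {
      resolved[key] = value;
    }
  }
  const missing = required.filter((key) => resolved[key] === undefined);
  if (missing.length > 0) {
    const details = missing
      .map((key) => `${key} (env: ${settings[key] ?? "not configured"})`)
      .join(", ");
    throw new Error(`Target "${targetName}" is missing required settings: ${details}`);
  }
  return resolved;
}
```

Naming both the setting key and the environment variable in the error message lets the user see at a glance which variable to export.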
+- [x] 1.1 Extend `targets.yaml` validation to accept `provider: codex` with settings for `executable`, `profile`, `model`, `approvalPreset`, `timeoutSeconds`, and optional working directory overrides.
+- [x] 1.2 Add a Codex provider class that stages guideline + attachment files into a scratch workspace, builds the preread block (mirroring the VS Code provider), and renders the eval prompt into a single string Codex can ingest.
+- [x] 1.3 Invoke the Codex CLI (`codex exec --json` by default) with settings-derived flags, stream stdout/stderr, and parse the emitted JSONL event stream to capture the final assistant response.
+- [x] 1.4 Detect missing executables early and surface actionable errors before dispatching eval cases.
+- [x] 1.5 Register the provider in the factory so batching flags, dry-run mode, and retries behave consistently with other providers.
+- [x] 1.6 Document the new provider in README/examples, including sample target entries and instructions for installing Codex CLI.
+
+## 2. Validation
+- [x] 2.1 Add unit tests that stub the Codex executable to ensure prompts, attachments, and CLI arguments are composed correctly.
+- [x] 2.2 Add failure-path tests covering timeouts, malformed JSON, and missing credentials to guarantee clear error surfaces.
+- [x] 2.3 Run `pnpm test packages/core/test/evaluation/providers/codex.test.ts` (new) plus the example eval to confirm Codex targets can execute end-to-end.