Commit b195d7e

feat: add codex provider (#23)
1 parent e4a28bb commit b195d7e

26 files changed

Lines changed: 1307 additions & 565 deletions

README.md

Lines changed: 125 additions & 50 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,6 @@
11
# AgentV
22

3-
A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, Azure OpenAI, Anthropic, and Google Gemini.
3+
A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI and Azure OpenAI.
44

55
## Installation and Setup
66

@@ -183,19 +183,6 @@ Output goes to `.agentv/results/{evalname}_{timestamp}.jsonl` (or `.yaml`) unles
183183

184184
**Recommended Models:** Use Claude Sonnet 4.5 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.
185185

186-
## Requirements
187-
188-
- Node.js 20.0.0 or higher
189-
- Environment variables for your chosen providers (configured via targets.yaml)
190-
191-
Environment keys (configured via targets.yaml):
192-
193-
- **Azure OpenAI:** Set environment variables specified in your target's `settings.endpoint`, `settings.api_key`, and `settings.model`
194-
- **Anthropic Claude:** Set environment variables specified in your target's `settings.api_key` and `settings.model`
195-
- **Google Gemini:** Set environment variables specified in your target's `settings.api_key` and optional `settings.model`
196-
- **VS Code:** Set environment variable specified in your target's `settings.workspace_env` → `.code-workspace` path
197-
- **CLI provider:** Configure `command_template` plus optional `cwd`, `env`, `timeout_seconds`, and `healthcheck` fields in targets.yaml; CLI `settings.env` entries are merged into the process environment
198-
199186
## Targets and Environment Variables
200187

201188
Execution targets in `.agentv/targets.yaml` decouple evals from providers/settings and provide flexible environment variable mapping.
@@ -205,7 +192,7 @@ Execution targets in `.agentv/targets.yaml` decouple evals from providers/settin
205192
Each target specifies:
206193

207194
- `name`: Unique identifier for the target
208-
- `provider`: The model provider (`azure`, `anthropic`, `gemini`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
195+
- `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
209196
- `settings`: Environment variable names to use for this target
210197

211198
### Examples
@@ -221,26 +208,6 @@ Each target specifies:
221208
model: "AZURE_DEPLOYMENT_NAME"
222209
```
223210
224-
**Anthropic targets:**
225-
226-
```yaml
227-
- name: anthropic_base
228-
provider: anthropic
229-
settings:
230-
api_key: "ANTHROPIC_API_KEY"
231-
model: "ANTHROPIC_MODEL"
232-
```
233-
234-
**Google Gemini targets:**
235-
236-
```yaml
237-
- name: gemini_base
238-
provider: gemini
239-
settings:
240-
api_key: "GOOGLE_API_KEY"
241-
model: "GOOGLE_GEMINI_MODEL" # Optional, defaults to gemini-2.0-flash-exp
242-
```
243-
244211
**VS Code targets:**
245212
246213
```yaml
@@ -261,7 +228,7 @@ Each target specifies:
261228
- name: local_cli
262229
provider: cli
263230
settings:
264-
command_template: 'code chat {PROMPT} {FILES}'
231+
command_template: 'somecommand {PROMPT} {FILES}'
265232
files_format: '--file {path}'
266233
cwd: PROJECT_ROOT # optional working directory
267234
env: # merged into process.env
@@ -272,8 +239,22 @@ Each target specifies:
272239
command_template: code --version
273240
```
274241
275-
CLI placeholders are `{PROMPT}`, `{GUIDELINES}`, `{EVAL_ID}`, `{ATTEMPT}`, and `{FILES}`. Values are shell-escaped automatically; avoid wrapping them in extra quotes unless your CLI requires nested quoting. `{FILES}` renders each file path using `files_format` (supports `{path}` and `{basename}`) and joins with spaces. Optional `healthcheck` probes (HTTP or command) run once before the first eval and abort the run on failure.
276-
CLI troubleshooting: unsupported placeholders fail validation, so stick to the tokens above; if your CLI logs show doubled quotes, drop extra quoting in `command_template` and rely on the built-in escaping; if healthchecks fail, raise `timeout_seconds` or point the probe at a fast status endpoint.
242+
**Codex CLI targets:**
243+
244+
```yaml
245+
- name: codex_cli
246+
provider: codex
247+
settings:
248+
executable: "CODEX_CLI_PATH" # defaults to `codex` if omitted
249+
profile: "CODEX_PROFILE" # matches the profile in ~/.codex/config
250+
model: "CODEX_MODEL" # optional, falls back to profile default
251+
approval_preset: "CODEX_APPROVAL_PRESET"
252+
timeout_seconds: 180
253+
cwd: CODEX_WORKSPACE_DIR
254+
```
255+
256+
Codex targets require the standalone `codex` CLI and a configured profile (via `codex configure`) so credentials are stored in `~/.codex/config` (or whatever path the CLI already uses). AgentV mirrors all guideline and attachment files into a fresh scratch workspace, so the `file://` preread links remain valid even when the CLI runs outside your repo tree.
257+
Confirm the CLI works by running `codex exec --json --profile <name> "ping"` (or any supported dry run) before starting an eval. This prints JSONL events; seeing `item.completed` messages indicates the CLI is healthy.
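
As a rough illustration of that health check, the sketch below scans captured JSONL output for `item.completed` events. It assumes each event is a JSON object with a top-level `type` field; that field name, the helper `has_completed_item`, and the `other.event` sample value are assumptions made for illustration, not the Codex CLI's documented schema.

```python
import json


def has_completed_item(jsonl_text: str) -> bool:
    """Return True if any JSONL line is an event whose type is 'item.completed'."""
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Skip non-JSON noise such as plain log lines mixed into stdout.
            continue
        # Assumption: the event name lives in a top-level "type" field.
        if isinstance(event, dict) and event.get("type") == "item.completed":
            return True
    return False


if __name__ == "__main__":
    # Hypothetical captured output; real event payloads will carry more fields.
    sample = '{"type": "other.event"}\n{"type": "item.completed"}\n'
    print(has_completed_item(sample))
```

Piping the CLI's stdout into a check like this would let a script decide whether the CLI looks healthy without a human reading the event stream.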
277258

278259
## Timeout Handling and Retries
279260

@@ -290,22 +271,116 @@ Example with custom timeout settings:
290271
agentv eval evals/projectx/example.yaml --target vscode_projectx --agent-timeout 180 --max-retries 3
291272
```
292273

293-
## How the Evals Work
274+
## Writing Custom Evaluators
275+
276+
### Code Evaluator I/O Contract
277+
278+
Code evaluators receive input via stdin and write output to stdout as JSON.
279+
280+
**Input Format (via stdin):**
281+
```json
282+
{
283+
"task": "string describing the task",
284+
"outcome": "expected outcome description",
285+
"expected": "expected output string",
286+
"output": "generated code/text from the agent",
287+
"system_message": "system message if any",
288+
"guideline_paths": ["path1", "path2"],
289+
"attachments": ["file1", "file2"],
290+
"user_segments": [{"type": "text", "value": "..."}]
291+
}
292+
```
293+
294+
**Output Format (to stdout):**
295+
```json
296+
{
297+
"score": 0.85,
298+
"hits": ["list of successful checks"],
299+
"misses": ["list of failed checks"],
300+
"reasoning": "explanation of the score"
301+
}
302+
```
303+
304+
**Key Points:**
305+
- Evaluators receive **full context** but should select only relevant fields
306+
- Most evaluators only need the `output` field; ignore the rest to avoid false positives
307+
- Complex evaluators can use `task`, `expected`, or `guideline_paths` for context-aware validation
308+
- Score range: `0.0` to `1.0` (float)
309+
- `hits` and `misses` are optional but recommended for debugging
310+
311+
### Code Evaluator Script Template
312+
313+
```python
#!/usr/bin/env python3
import json
import sys


def evaluate(input_data):
    # Extract only the fields you need
    output = input_data.get("output", "")

    # Your validation logic here
    score = 0.0  # float in the range 0.0 to 1.0
    hits = ["successful check 1", "successful check 2"]
    misses = ["failed check 1"]
    reasoning = "Explanation of score"

    return {
        "score": score,
        "hits": hits,
        "misses": misses,
        "reasoning": reasoning
    }


if __name__ == "__main__":
    try:
        input_data = json.loads(sys.stdin.read())
        result = evaluate(input_data)
        print(json.dumps(result, indent=2))
    except Exception as e:
        # Report the failure in the same JSON shape so the caller can parse it
        error_result = {
            "score": 0.0,
            "hits": [],
            "misses": [f"Evaluator error: {str(e)}"],
            "reasoning": f"Evaluator error: {str(e)}"
        }
        print(json.dumps(error_result, indent=2))
        sys.exit(1)
```
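
To make the template more concrete, here is a minimal sketch of an `evaluate` function that scores the `output` field by keyword presence. The `REQUIRED_KEYWORDS` list and the scoring rule (fraction of checks passed) are hypothetical choices for illustration; a real evaluator would substitute its own domain checks.

```python
import json

# Hypothetical checks for illustration only.
REQUIRED_KEYWORDS = ["def ", "return"]


def evaluate(input_data: dict) -> dict:
    """Score the agent output by the fraction of required keywords it contains."""
    output = input_data.get("output", "")
    hits = [f"contains {kw!r}" for kw in REQUIRED_KEYWORDS if kw in output]
    misses = [f"missing {kw!r}" for kw in REQUIRED_KEYWORDS if kw not in output]
    # Score stays within the documented 0.0 to 1.0 range.
    score = len(hits) / len(REQUIRED_KEYWORDS)
    return {
        "score": score,
        "hits": hits,
        "misses": misses,
        "reasoning": f"{len(hits)} of {len(REQUIRED_KEYWORDS)} keyword checks passed",
    }


if __name__ == "__main__":
    sample = {"output": "def add(a, b):\n    return a + b"}
    print(json.dumps(evaluate(sample), indent=2))
```

Because the function reads only `output`, it follows the guidance above about ignoring unrelated input fields.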
350+
351+
### LLM Judge Template Structure
352+
353+
```markdown
354+
# Judge Name
355+
356+
Evaluation criteria and guidelines...
357+
358+
## Scoring Guidelines
359+
0.9-1.0: Excellent
360+
0.7-0.8: Good
361+
...
362+
363+
## Output Format
364+
{
365+
"score": 0.85,
366+
"passed": true,
367+
"reasoning": "..."
368+
}
369+
```
294370

295-
For each eval case in a `.yaml` file:
371+
## Next Steps
296372

297-
1. Parse YAML and collect user messages (inline text and referenced files)
298-
2. Extract code blocks from text for structured prompting
299-
3. Generate a candidate answer via the configured provider/model
300-
4. Score against the expected answer using AI-powered quality grading
301-
5. Output results in JSONL or YAML format with detailed metrics
373+
- Review `docs/examples/simple/evals/example-eval.yaml` to understand the schema
374+
- Create your own eval cases following the schema
375+
- Write custom evaluator scripts for domain-specific validation
376+
- Create LLM judge templates for semantic evaluation
377+
- Set up optimizer configs when ready to improve prompts
302378

303-
### VS Code Copilot Target
379+
## Resources
304380

305-
- Opens your configured workspace and uses the `subagent` library to programmatically invoke VS Code Copilot
306-
- The prompt is built from the `.yaml` user content (task, files, code blocks)
307-
- Copilot is instructed to complete the task within the workspace context
308-
- Results are captured and scored automatically
381+
- [Simple Example README](docs/examples/simple/README.md)
382+
- [Schema Specification](docs/openspec/changes/update-eval-schema-v2/)
383+
- [Ax ACE Documentation](https://github.com/ax-llm/ax/blob/main/docs/ACE.md)
309384

310385
## Scoring and Outputs
311386

apps/cli/package.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
{
22
"name": "agentv",
3-
"version": "0.3.1",
3+
"version": "0.5.0",
44
"description": "CLI entry point for AgentV",
55
"type": "module",
66
"repository": {

docs/examples/simple/.agentv/targets.yaml

Lines changed: 16 additions & 0 deletions
@@ -58,3 +58,19 @@ targets:
5858
healthcheck:
5959
type: command
6060
command_template: uv run ./mock_cli.py --healthcheck
61+
62+
- name: codex_cli
63+
provider: codex
64+
judge_target: azure_base
65+
settings:
66+
# Uses the Codex CLI (defaults to `codex` on PATH)
67+
# executable: CODEX_CLI_PATH # Optional: override executable path
68+
# args: # Optional additional CLI arguments
69+
# - --profile
70+
# - CODEX_PROFILE
71+
# - --model
72+
# - CODEX_MODEL
73+
# - --ask-for-approval
74+
# - CODEX_APPROVAL_PRESET
75+
timeout_seconds: 180
76+
cwd: CODEX_WORKSPACE_DIR # Where scratch workspaces are created
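
The commented-out `args` above suggest how resolved settings could become CLI flags. The sketch below assembles an argument list for `codex exec --json` using only the flags that appear in this change (`--profile`, `--model`, `--ask-for-approval`). The function name `build_codex_argv` and the flag ordering are assumptions; AgentV's actual provider may compose the command differently.

```python
from typing import List, Optional


def build_codex_argv(
    prompt: str,
    executable: str = "codex",
    profile: Optional[str] = None,
    model: Optional[str] = None,
    approval_preset: Optional[str] = None,
) -> List[str]:
    """Build an argv list for a headless Codex run (hypothetical helper)."""
    argv = [executable, "exec", "--json"]
    # Only pass a flag when the corresponding setting has a value.
    if profile:
        argv += ["--profile", profile]
    if model:
        argv += ["--model", model]
    if approval_preset:
        argv += ["--ask-for-approval", approval_preset]
    # The rendered prompt goes last as a single argument.
    argv.append(prompt)
    return argv


if __name__ == "__main__":
    print(build_codex_argv("ping", profile="default"))
```

Passing the result as a list to a process launcher, rather than joining it into a shell string, avoids the need to quote the prompt by hand.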

docs/examples/simple/.env.template

Lines changed: 12 additions & 0 deletions
@@ -21,3 +21,15 @@ PROJECTX_WORKSPACE_PATH=C:/Users/your-username/OneDrive - Company Pty Ltd/sample
2121
# CLI provider sample (used by the local_cli target)
2222
PROJECT_ROOT=D:/GitHub/your-username/agentv/docs/examples/simple
2323
LOCAL_AGENT_TOKEN=your-cli-token
24+
25+
# Codex CLI Configuration
26+
# Either OPENAI_API_KEY or CODEX_API_KEY must be set before running codex targets
27+
OPENAI_API_KEY=your-openai-or-codex-key
28+
CODEX_API_KEY=
29+
CODEX_PROFILE=default
30+
CODEX_MODEL=gpt-4o-mini
31+
CODEX_APPROVAL_PRESET=auto
32+
CODEX_CLI_PATH=C:/Program Files/Codex/bin/codex.exe
33+
CODEX_WORKSPACE_DIR=C:/Temp/agentv-codex
34+
# Optional override if your codex config lives outside the default ~/.codex/config
35+
CODEX_CONFIG_PATH=

docs/examples/simple/README.md

Lines changed: 15 additions & 0 deletions
@@ -98,6 +98,21 @@ To try it locally:
9898
agentv eval evals/cli-provider-demo.yaml --target local_cli
9999
```
100100

101+
### Codex CLI provider sample
102+
103+
The sample `codex_cli` target demonstrates how to drive the standalone Codex CLI from AgentV.
104+
105+
1. Install the `codex` CLI (follow the official `codex-cli` README) and run `codex configure` so `~/.codex/config` exists.
106+
2. Export either `OPENAI_API_KEY` or `CODEX_API_KEY`, plus optional `CODEX_PROFILE`, `CODEX_MODEL`, and `CODEX_APPROVAL_PRESET` values (see `.env.template`).
107+
3. (Optional) Set `CODEX_CLI_PATH` if the `codex` executable is not already on your `PATH`.
108+
4. Run an eval with the Codex target:
109+
110+
```bash
111+
agentv eval evals/example-eval.yaml --target codex_cli
112+
```
113+
114+
AgentV mirrors guideline and attachment files into the Codex workspace and passes the combined prompt to `codex exec --json`, so preread links behave the same way as the VS Code provider.
115+
101116
### With Optimization (Future)
102117

103118
```bash
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
# Change: Add Codex CLI provider for evals
2+
3+
## Why
4+
- Evaluators need parity between the VS Code Copilot provider and OpenAI's Codex CLI so we can compare agent behaviours in identical YAML panels.
5+
- Codex already exposes a headless `codex exec --json` mode (see `codex-cli/README.md` lines 218-231) and configurable provider profiles (`docs/config.md` lines 61-191), so AgentV can drive it non-interactively.
6+
- Without first-class support we currently have to fall back to the generic CLI provider, which cannot stage guideline prereads or capture Codex JSON results reliably.
7+
8+
## What Changes
9+
- Extend the evaluation spec's Provider Integration requirement with a Codex-specific scenario covering executable discovery, workspace staging, JSONL invocation, and structured result parsing.
10+
- Define target settings (executable path, profile, model, approval preset, cwd) that map directly to Codex CLI knobs and lean on the CLI's existing configuration/credential handling.
11+
- Ensure attachments and guideline files are mirrored into the Codex workspace with `file://` prereads similar to the VS Code provider so Codex can open them before answering.
12+
- Surface actionable errors when Codex exits non-zero, times out, or emits invalid JSON to keep eval runs debuggable.
13+
14+
## Impact
15+
- Affected specs: `openspec/specs/evaluation/spec.md` (Provider Integration requirement).
16+
- Affected code: `packages/core/src/evaluation/providers`, provider factory/targets schema, CLI docs and example targets, regression tests under `packages/core/test/evaluation`.
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
1+
## MODIFIED Requirements
2+
3+
### Requirement: Provider Integration
4+
5+
The system SHALL support multiple LLM providers with environment-based configuration.
6+
7+
#### Scenario: Azure OpenAI provider
8+
9+
- **WHEN** a test case uses the "azure-openai" provider
10+
- **THEN** the system reads `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, and `AZURE_DEPLOYMENT_NAME` from environment
11+
- **AND** invokes Azure OpenAI with the configured settings
12+
13+
#### Scenario: Anthropic provider
14+
15+
- **WHEN** a test case uses the "anthropic" provider
16+
- **THEN** the system reads `ANTHROPIC_API_KEY` from environment
17+
- **AND** invokes Anthropic Claude with the configured settings
18+
19+
#### Scenario: Google Gemini provider
20+
21+
- **WHEN** a test case uses the "gemini" provider
22+
- **THEN** the system reads `GOOGLE_API_KEY` from environment
23+
- **AND** optionally reads `GOOGLE_GEMINI_MODEL` to override the default model
24+
- **AND** invokes Google Gemini with the configured settings
25+
26+
#### Scenario: VS Code Copilot provider
27+
28+
- **WHEN** a test case uses the "vscode-copilot" provider
29+
- **THEN** the system generates a structured prompt file with preread block and SHA tokens
30+
- **AND** invokes the subagent library to execute the prompt
31+
- **AND** captures the Copilot response
32+
33+
#### Scenario: Codex CLI provider
34+
35+
- **WHEN** a test case uses the "codex" provider
36+
- **THEN** the system locates the Codex CLI executable (default `codex`, overridable via the target)
37+
- **AND** it mirrors guideline and attachment files into a scratch workspace, emitting the same preread block links used by the VS Code provider so Codex opens every referenced file before answering
38+
- **AND** it renders the eval prompt into a single string and launches `codex exec --json` plus any configured profile, model, approval preset, and working-directory overrides defined on the target
39+
- **AND** it verifies the Codex executable is available while delegating profile/config resolution to the CLI itself
40+
- **AND** it parses the emitted JSONL event stream to capture the final assistant message as the provider response, attaching stdout/stderr when the CLI exits non-zero or returns malformed JSON
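
The workspace-staging step in this scenario could look roughly like the sketch below: copy each guideline or attachment file into a fresh scratch directory and return `file://` links that point at the copies. The helper name `stage_files`, the `agentv-codex-` directory prefix, and the flat layout keyed by basename are assumptions for illustration, not the provider's actual behaviour.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, List, Optional


def stage_files(paths: Iterable[str], workspace_root: Optional[str] = None) -> List[str]:
    """Copy files into a new scratch directory and return file:// links to the copies."""
    # mkdtemp creates a unique directory, so each run gets a fresh workspace.
    scratch = Path(tempfile.mkdtemp(prefix="agentv-codex-", dir=workspace_root))
    links = []
    for source in paths:
        # Assumption: basenames are unique; duplicates would overwrite each other.
        target = scratch / Path(source).name
        shutil.copy2(source, target)
        # as_uri() requires an absolute path, hence resolve().
        links.append(target.resolve().as_uri())
    return links


if __name__ == "__main__":
    demo_dir = Path(tempfile.mkdtemp())
    guide = demo_dir / "guide.md"
    guide.write_text("Follow the style guide.")
    print(stage_files([str(guide)]))
```

Because the links point at the mirrored copies, they stay valid even when the CLI's working directory is outside the repository tree.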
41+
42+
#### Scenario: Mock provider for dry-run
43+
44+
- **WHEN** a test case uses the "mock" provider or dry-run is enabled
45+
- **THEN** the system returns a predefined mock response
46+
- **AND** does not make external API calls
47+
48+
#### Scenario: Missing provider credentials
49+
50+
- **WHEN** a provider is selected but required environment variables are missing
51+
- **THEN** the system fails fast with a clear error message
52+
- **AND** lists the missing environment variables
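
A fail-fast credential check of the kind this scenario describes can be sketched in a few lines. The names `require_env` and `MissingCredentialsError` are hypothetical, and the error wording is only an example of a message that lists every missing variable at once.

```python
import os
from typing import Dict, Iterable, Mapping, Optional


class MissingCredentialsError(RuntimeError):
    """Raised when required environment variables are unset or empty."""


def require_env(names: Iterable[str], env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the requested variables, or raise an error naming all that are missing."""
    env = os.environ if env is None else env
    names = list(names)
    # Treat empty strings as missing, matching blank entries in a .env file.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise MissingCredentialsError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in names}


if __name__ == "__main__":
    try:
        require_env(["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"], env={})
    except MissingCredentialsError as exc:
        print(exc)
```

Collecting every missing name before raising spares the user from fixing one variable per run.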
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
1+
## 1. Implementation
2+
- [x] 1.1 Extend `targets.yaml` validation to accept `provider: codex` with settings for `executable`, `profile`, `model`, `approvalPreset`, `timeoutSeconds`, and optional working directory overrides.
3+
- [x] 1.2 Add a Codex provider class that stages guideline + attachment files into a scratch workspace, builds the preread block (mirroring the VS Code provider), and renders the eval prompt into a single string Codex can ingest.
4+
- [x] 1.3 Invoke the Codex CLI (`codex exec --json` by default) with settings-derived flags, stream stdout/stderr, and parse the emitted JSONL event stream to capture the final assistant response.
5+
- [x] 1.4 Detect missing executables early and surface actionable errors before dispatching eval cases.
6+
- [x] 1.5 Register the provider in the factory so batching flags, dry-run mode, and retries behave consistently with other providers.
7+
- [x] 1.6 Document the new provider in README/examples, including sample target entries and instructions for installing Codex CLI.
8+
9+
## 2. Validation
10+
- [x] 2.1 Add unit tests that stub the Codex executable to ensure prompts, attachments, and CLI arguments are composed correctly.
11+
- [x] 2.2 Add failure-path tests covering timeouts, malformed JSON, and missing credentials to guarantee clear error surfaces.
12+
- [x] 2.3 Run `pnpm test packages/core/test/evaluation/providers/codex.test.ts` (new) plus the example eval to confirm Codex targets can execute end-to-end.

docs/openspec/changes/implement-custom-evaluators/design.md renamed to docs/openspec/changes/archive/2025-11-22-implement-custom-evaluators/design.md

File renamed without changes.

docs/openspec/changes/implement-custom-evaluators/proposal.md renamed to docs/openspec/changes/archive/2025-11-22-implement-custom-evaluators/proposal.md

File renamed without changes.
