EntityProcess
diff --git a/README.md b/README.md
Lines changed: 125 additions & 50 deletions
diff --git a/apps/cli/package.json b/apps/cli/package.json
Lines changed: 1 addition & 1 deletion
diff --git a/docs/examples/simple/.agentv/targets.yaml b/docs/examples/simple/.agentv/targets.yaml
Lines changed: 16 additions & 0 deletions
diff --git a/docs/examples/simple/.env.template b/docs/examples/simple/.env.template
Lines changed: 12 additions & 0 deletions
diff --git a/docs/examples/simple/README.md b/docs/examples/simple/README.md
Lines changed: 15 additions & 0 deletions
diff --git a/docs/openspec/changes/archive/2025-11-22-add-codex-provider/proposal.md b/docs/openspec/changes/archive/2025-11-22-add-codex-provider/proposal.md
Lines changed: 16 additions & 0 deletions
diff --git a/docs/openspec/changes/archive/2025-11-22-add-codex-provider/specs/evaluation/spec.md b/docs/openspec/changes/archive/2025-11-22-add-codex-provider/specs/evaluation/spec.md
Lines changed: 52 additions & 0 deletions
diff --git a/docs/openspec/changes/archive/2025-11-22-add-codex-provider/tasks.md b/docs/openspec/changes/archive/2025-11-22-add-codex-provider/tasks.md
Lines changed: 12 additions & 0 deletions
diff --git a/docs/openspec/changes/implement-custom-evaluators/design.md b/docs/openspec/changes/archive/2025-11-22-implement-custom-evaluators/design.md
docs/openspec/changes/implement-custom-evaluators/design.md renamed to docs/openspec/changes/archive/2025-11-22-implement-custom-evaluators/design.md
diff --git a/docs/openspec/changes/implement-custom-evaluators/proposal.md b/docs/openspec/changes/archive/2025-11-22-implement-custom-evaluators/proposal.md
docs/openspec/changes/implement-custom-evaluators/proposal.md renamed to docs/openspec/changes/archive/2025-11-22-implement-custom-evaluators/proposal.md
@@ -1,6 +1,6 @@
 # AgentV
 
-A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, Azure OpenAI, Anthropic, and Google Gemini.
+A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI and Azure OpenAI.
 
 ## Installation and Setup
 
@@ -183,19 +183,6 @@ Output goes to `.agentv/results/{evalname}_{timestamp}.jsonl` (or `.yaml`) unles
 
 **Recommended Models:** Use Claude Sonnet 4.5 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.
 
-## Requirements
-
-- Node.js 20.0.0 or higher
-- Environment variables for your chosen providers (configured via targets.yaml)
-
-Environment keys (configured via targets.yaml):
-
-- **Azure OpenAI:** Set environment variables specified in your target's `settings.endpoint`, `settings.api_key`, and `settings.model`
-- **Anthropic Claude:** Set environment variables specified in your target's `settings.api_key` and `settings.model`
-- **Google Gemini:** Set environment variables specified in your target's `settings.api_key` and optional `settings.model`
-- **VS Code:** Set environment variable specified in your target's `settings.workspace_env` → `.code-workspace` path
-- **CLI provider:** Configure `command_template` plus optional `cwd`, `env`, `timeout_seconds`, and `healthcheck` fields in targets.yaml; CLI `settings.env` entries are merged into the process environment
-
 ## Targets and Environment Variables
 
 Execution targets in `.agentv/targets.yaml` decouple evals from providers/settings and provide flexible environment variable mapping.
@@ -205,7 +192,7 @@ Execution targets in `.agentv/targets.yaml` decouple evals from providers/settin
 Each target specifies:
 
 - `name`: Unique identifier for the target
-- `provider`: The model provider (`azure`, `anthropic`, `gemini`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
+- `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
 - `settings`: Environment variable names to use for this target
 
 ### Examples
@@ -221,26 +208,6 @@ Each target specifies:
     model: "AZURE_DEPLOYMENT_NAME"
 ```
 
-**Anthropic targets:**
-
-```yaml
-- name: anthropic_base
-  provider: anthropic
-  settings:
-    api_key: "ANTHROPIC_API_KEY"
-    model: "ANTHROPIC_MODEL"
-```
-
-**Google Gemini targets:**
-
-```yaml
-- name: gemini_base
-  provider: gemini
-  settings:
-    api_key: "GOOGLE_API_KEY"
-    model: "GOOGLE_GEMINI_MODEL"  # Optional, defaults to gemini-2.0-flash-exp
-```
-
 **VS Code targets:**
 
 ```yaml
@@ -261,7 +228,7 @@ Each target specifies:
 - name: local_cli
   provider: cli
   settings:
-    command_template: 'code chat {PROMPT} {FILES}'
+    command_template: 'somecommand {PROMPT} {FILES}'
     files_format: '--file {path}'
     cwd: PROJECT_ROOT               # optional working directory
     env:                            # merged into process.env
@@ -272,8 +239,22 @@ Each target specifies:
       command_template: code --version
 ```
 
-CLI placeholders are `{PROMPT}`, `{GUIDELINES}`, `{EVAL_ID}`, `{ATTEMPT}`, and `{FILES}`. Values are shell-escaped automatically; avoid wrapping them in extra quotes unless your CLI requires nested quoting. `{FILES}` renders each file path using `files_format` (supports `{path}` and `{basename}`) and joins with spaces. Optional `healthcheck` probes (HTTP or command) run once before the first eval and abort the run on failure.
-CLI troubleshooting: unsupported placeholders fail validation, so stick to the tokens above; if your CLI logs show doubled quotes, drop extra quoting in `command_template` and rely on the built-in escaping; if healthchecks fail, raise `timeout_seconds` or point the probe at a fast status endpoint.
+**Codex CLI targets:**
+
+```yaml
+- name: codex_cli
+  provider: codex
+  settings:
+    executable: "CODEX_CLI_PATH"     # defaults to `codex` if omitted
+    profile: "CODEX_PROFILE"         # matches the profile in ~/.codex/config
+    model: "CODEX_MODEL"             # optional, falls back to profile default
+    approval_preset: "CODEX_APPROVAL_PRESET"
+    timeout_seconds: 180
+    cwd: CODEX_WORKSPACE_DIR
+```
+
+Codex targets require the standalone `codex` CLI and a configured profile (via `codex configure`) so credentials are stored in `~/.codex/config` (or whatever path the CLI already uses). AgentV mirrors all guideline and attachment files into a fresh scratch workspace, so the `file://` preread links remain valid even when the CLI runs outside your repo tree.
+Confirm the CLI works by running `codex exec --json --profile <name> "ping"` (or any supported dry run) before starting an eval. This prints JSONL events; seeing `item.completed` messages indicates the CLI is healthy.
 
 ## Timeout Handling and Retries
 
@@ -290,22 +271,116 @@ Example with custom timeout settings:
 agentv eval evals/projectx/example.yaml --target vscode_projectx --agent-timeout 180 --max-retries 3
 ```
 
-## How the Evals Work
+## Writing Custom Evaluators
+
+### Code Evaluator I/O Contract
+
+Code evaluators receive input via stdin and write output to stdout as JSON.
+
+**Input Format (via stdin):**
+```json
+{
+  "task": "string describing the task",
+  "outcome": "expected outcome description",
+  "expected": "expected output string",
+  "output": "generated code/text from the agent",
+  "system_message": "system message if any",
+  "guideline_paths": ["path1", "path2"],
+  "attachments": ["file1", "file2"],
+  "user_segments": [{"type": "text", "value": "..."}]
+}
+```
+
+**Output Format (to stdout):**
+```json
+{
+  "score": 0.85,
+  "hits": ["list of successful checks"],
+  "misses": ["list of failed checks"],
+  "reasoning": "explanation of the score"
+}
+```
+
+**Key Points:**
+- Evaluators receive **full context** but should select only relevant fields
+- Most evaluators need only the `output` field; ignore the rest to avoid false positives
+- Complex evaluators can use `task`, `expected`, or `guideline_paths` for context-aware validation
+- Score range: `0.0` to `1.0` (float)
+- `hits` and `misses` are optional but recommended for debugging
+
+### Code Evaluator Script Template
+
+```python
+#!/usr/bin/env python3
+import json
+import sys
+
+def evaluate(input_data):
+    # Extract only the fields you need
+    output = input_data.get("output", "")
+    
+    # Your validation logic here
+    score = 0.0  # replace with a float between 0.0 and 1.0
+    hits = ["successful check 1", "successful check 2"]
+    misses = ["failed check 1"]
+    reasoning = "Explanation of score"
+    
+    return {
+        "score": score,
+        "hits": hits,
+        "misses": misses,
+        "reasoning": reasoning
+    }
+
+if __name__ == "__main__":
+    try:
+        input_data = json.loads(sys.stdin.read())
+        result = evaluate(input_data)
+        print(json.dumps(result, indent=2))
+    except Exception as e:
+        error_result = {
+            "score": 0.0,
+            "hits": [],
+            "misses": [f"Evaluator error: {str(e)}"],
+            "reasoning": f"Evaluator error: {str(e)}"
+        }
+        print(json.dumps(error_result, indent=2))
+        sys.exit(1)
+```
+
+### LLM Judge Template Structure
+
+```markdown
+# Judge Name
+
+Evaluation criteria and guidelines...
+
+## Scoring Guidelines
+0.9-1.0: Excellent
+0.7-0.8: Good
+...
+
+## Output Format
+{
+  "score": 0.85,
+  "passed": true,
+  "reasoning": "..."
+}
+```
 
-For each eval case in a `.yaml` file:
+## Next Steps
 
-1. Parse YAML and collect user messages (inline text and referenced files)
-2. Extract code blocks from text for structured prompting
-3. Generate a candidate answer via the configured provider/model
-4. Score against the expected answer using AI-powered quality grading
-5. Output results in JSONL or YAML format with detailed metrics
+- Review `docs/examples/simple/evals/example-eval.yaml` to understand the schema
+- Create your own eval cases following the schema
+- Write custom evaluator scripts for domain-specific validation
+- Create LLM judge templates for semantic evaluation
+- Set up optimizer configs when ready to improve prompts
 
-### VS Code Copilot Target
+## Resources
 
-- Opens your configured workspace and uses the `subagent` library to programmatically invoke VS Code Copilot
-- The prompt is built from the `.yaml` user content (task, files, code blocks)
-- Copilot is instructed to complete the task within the workspace context
-- Results are captured and scored automatically
+- [Simple Example README](docs/examples/simple/README.md)
+- [Schema Specification](docs/openspec/changes/update-eval-schema-v2/)
+- [Ax ACE Documentation](https://github.com/ax-llm/ax/blob/main/docs/ACE.md)
 
 ## Scoring and Outputs
 
 
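The code evaluator I/O contract added to the README above can be made concrete with a small example. The following is a minimal sketch, not part of the PR: the `REQUIRED_KEYWORDS` list and the keyword-coverage scoring rule are hypothetical and only illustrate how an evaluator might read `output` and return `score`, `hits`, `misses`, and `reasoning`.

```python
import json

# Hypothetical check list (an assumption for illustration only);
# replace it with checks that fit your own eval.
REQUIRED_KEYWORDS = ["def ", "return"]


def evaluate(input_data):
    """Score the agent output by the fraction of required keywords it contains."""
    # Only the `output` field is used; the other input fields are ignored.
    output = input_data.get("output", "")
    hits = [f"contains {kw!r}" for kw in REQUIRED_KEYWORDS if kw in output]
    misses = [f"missing {kw!r}" for kw in REQUIRED_KEYWORDS if kw not in output]
    return {
        "score": len(hits) / len(REQUIRED_KEYWORDS),  # float in 0.0..1.0
        "hits": hits,
        "misses": misses,
        "reasoning": f"{len(hits)} of {len(REQUIRED_KEYWORDS)} required keywords found",
    }


# Example call with an in-memory payload instead of stdin.
sample = {"output": "def add(a, b):\n    return a + b"}
print(json.dumps(evaluate(sample), indent=2))
```

To run this as an evaluator script, wrap `evaluate` in the `__main__` block from the template in the README diff, which reads the payload from stdin and handles errors.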
@@ -1,6 +1,6 @@
 {
   "name": "agentv",
-  "version": "0.3.1",
+  "version": "0.5.0",
   "description": "CLI entry point for AgentV",
   "type": "module",
   "repository": {
 
@@ -58,3 +58,19 @@ targets:
       healthcheck:
         type: command
         command_template: uv run ./mock_cli.py --healthcheck
+
+  - name: codex_cli
+    provider: codex
+    judge_target: azure_base
+    settings:
+      # Uses the Codex CLI (defaults to `codex` on PATH)
+      # executable: CODEX_CLI_PATH        # Optional: override executable path
+      # args:                             # Optional additional CLI arguments
+      #   - --profile
+      #   - CODEX_PROFILE
+      #   - --model
+      #   - CODEX_MODEL
+      #   - --ask-for-approval
+      #   - CODEX_APPROVAL_PRESET
+      timeout_seconds: 180
+      cwd: CODEX_WORKSPACE_DIR            # Where scratch workspaces are created
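The commented `args` entries in the sample target suggest how settings could map to CLI flags. The sketch below is an assumption, not AgentV's implementation: the `build_codex_command` helper and the settings keys are hypothetical. The flag names come from the sample target above, and their placement after `exec --json` is a guess that should be checked against the installed Codex CLI.

```python
def build_codex_command(settings, prompt):
    """Assemble an argv list for `codex exec --json` from resolved target settings.

    `settings` holds already-resolved values (not environment variable names).
    Flag placement after `exec` is an assumption; verify it with your CLI version.
    """
    command = [settings.get("executable", "codex"), "exec", "--json"]
    # Hypothetical mapping from settings keys to the flags shown in targets.yaml.
    flag_for_key = [
        ("profile", "--profile"),
        ("model", "--model"),
        ("approval_preset", "--ask-for-approval"),
    ]
    for key, flag in flag_for_key:
        value = settings.get(key)
        if value:  # skip unset or empty values so the CLI defaults apply
            command.extend([flag, value])
    command.append(prompt)
    return command


print(build_codex_command({"profile": "default", "model": ""}, "ping"))
```

Returning a list rather than a single string lets the command be passed to `subprocess.run` without shell quoting.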
@@ -21,3 +21,15 @@ PROJECTX_WORKSPACE_PATH=C:/Users/your-username/OneDrive - Company Pty Ltd/sample
 # CLI provider sample (used by the local_cli target)
 PROJECT_ROOT=D:/GitHub/your-username/agentv/docs/examples/simple
 LOCAL_AGENT_TOKEN=your-cli-token
+
+# Codex CLI Configuration
+# Either OPENAI_API_KEY or CODEX_API_KEY must be set before running codex targets
+OPENAI_API_KEY=your-openai-or-codex-key
+CODEX_API_KEY=
+CODEX_PROFILE=default
+CODEX_MODEL=gpt-4o-mini
+CODEX_APPROVAL_PRESET=auto
+CODEX_CLI_PATH=C:/Program Files/Codex/bin/codex.exe
+CODEX_WORKSPACE_DIR=C:/Temp/agentv-codex
+# Optional override if your codex config lives outside the default ~/.codex/config
+CODEX_CONFIG_PATH=
@@ -98,6 +98,21 @@ To try it locally:
 agentv eval evals/cli-provider-demo.yaml --target local_cli
 ```
 
+### Codex CLI provider sample
+
+The sample `codex_cli` target demonstrates how to drive the standalone Codex CLI from AgentV.
+
+1. Install the `codex` CLI (follow the official `codex-cli` README) and run `codex configure` so `~/.codex/config` exists.
+2. Export either `OPENAI_API_KEY` or `CODEX_API_KEY`, plus optional `CODEX_PROFILE`, `CODEX_MODEL`, and `CODEX_APPROVAL_PRESET` values (see `.env.template`).
+3. (Optional) Set `CODEX_CLI_PATH` if the `codex` executable is not already on your `PATH`.
+4. Run an eval with the Codex target:
+
+```bash
+agentv eval evals/example-eval.yaml --target codex_cli
+```
+
+AgentV mirrors guideline and attachment files into the Codex workspace and passes the combined prompt to `codex exec --json`, so preread links behave the same way as the VS Code provider.
+
 ### With Optimization (Future)
 
 ```bash
 
@@ -0,0 +1,16 @@
+# Change: Add Codex CLI provider for evals
+
+## Why
+- Evaluators need parity between the VS Code Copilot provider and OpenAI's Codex CLI so we can compare agent behaviours in identical YAML panels.
+- Codex already exposes a headless `codex exec --json` mode (see `codex-cli/README.md` lines 218-231) and configurable provider profiles (`docs/config.md` lines 61-191), so AgentV can drive it non-interactively.
+- Without first-class support we currently have to fall back to the generic CLI provider, which cannot stage guideline prereads or capture Codex JSON results reliably.
+
+## What Changes
+- Extend the evaluation spec's Provider Integration requirement with a Codex-specific scenario covering executable discovery, workspace staging, JSONL invocation, and structured result parsing.
+- Define target settings (executable path, profile, model, approval preset, cwd) that map directly to Codex CLI knobs and lean on the CLI's existing configuration/credential handling.
+- Ensure attachments and guideline files are mirrored into the Codex workspace with `file://` prereads similar to the VS Code provider so Codex can open them before answering.
+- Surface actionable errors when Codex exits non-zero, times out, or emits invalid JSON to keep eval runs debuggable.
+
+## Impact
+- Affected specs: `openspec/specs/evaluation/spec.md` (Provider Integration requirement).
+- Affected code: `packages/core/src/evaluation/providers`, provider factory/targets schema, CLI docs and example targets, regression tests under `packages/core/test/evaluation`.
@@ -0,0 +1,52 @@
+## MODIFIED Requirements
+
+### Requirement: Provider Integration
+
+The system SHALL support multiple LLM providers with environment-based configuration.
+
+#### Scenario: Azure OpenAI provider
+
+- **WHEN** a test case uses the "azure-openai" provider
+- **THEN** the system reads `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, and `AZURE_DEPLOYMENT_NAME` from environment
+- **AND** invokes Azure OpenAI with the configured settings
+
+#### Scenario: Anthropic provider
+
+- **WHEN** a test case uses the "anthropic" provider
+- **THEN** the system reads `ANTHROPIC_API_KEY` from environment
+- **AND** invokes Anthropic Claude with the configured settings
+
+#### Scenario: Google Gemini provider
+
+- **WHEN** a test case uses the "gemini" provider
+- **THEN** the system reads `GOOGLE_API_KEY` from environment
+- **AND** optionally reads `GOOGLE_GEMINI_MODEL` to override the default model
+- **AND** invokes Google Gemini with the configured settings
+
+#### Scenario: VS Code Copilot provider
+
+- **WHEN** a test case uses the "vscode-copilot" provider
+- **THEN** the system generates a structured prompt file with preread block and SHA tokens
+- **AND** invokes the subagent library to execute the prompt
+- **AND** captures the Copilot response
+
+#### Scenario: Codex CLI provider
+
+- **WHEN** a test case uses the "codex" provider
+- **THEN** the system locates the Codex CLI executable (default `codex`, overrideable via the target)
+- **AND** it mirrors guideline and attachment files into a scratch workspace, emitting the same preread block links used by the VS Code provider so Codex opens every referenced file before answering
+- **AND** it renders the eval prompt into a single string and launches `codex exec --json` plus any configured profile, model, approval preset, and working-directory overrides defined on the target
+- **AND** it verifies the Codex executable is available while delegating profile/config resolution to the CLI itself
+- **AND** it parses the emitted JSONL event stream to capture the final assistant message as the provider response, attaching stdout/stderr when the CLI exits non-zero or returns malformed JSON
+
+#### Scenario: Mock provider for dry-run
+
+- **WHEN** a test case uses the "mock" provider or dry-run is enabled
+- **THEN** the system returns a predefined mock response
+- **AND** does not make external API calls
+
+#### Scenario: Missing provider credentials
+
+- **WHEN** a provider is selected but required environment variables are missing
+- **THEN** the system fails fast with a clear error message
+- **AND** lists the missing environment variables
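The Codex scenario above says the provider parses a JSONL event stream and takes the final assistant message as the response. A minimal sketch of that step follows. The `final_assistant_message` helper is hypothetical, and the event shape (`type: "item.completed"` with a nested `item` of type `agent_message` carrying `text`) is an assumption, not confirmed by this PR; only the `item.completed` event name appears in the README.

```python
import json


def final_assistant_message(jsonl_text):
    """Return the text of the last completed assistant message, or None if absent.

    Raises ValueError on a malformed line so the caller can attach stdout/stderr.
    """
    final = None
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        try:
            event = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Malformed JSON on line {line_no}: {exc}") from exc
        # Assumed event shape; verify against your Codex CLI version.
        if not isinstance(event, dict) or event.get("type") != "item.completed":
            continue
        item = event.get("item")
        if isinstance(item, dict) and item.get("type") == "agent_message":
            final = item.get("text", final)  # later messages replace earlier ones
    return final


stream = "\n".join([
    '{"type": "item.completed", "item": {"type": "agent_message", "text": "draft"}}',
    '{"type": "item.completed", "item": {"type": "agent_message", "text": "done"}}',
])
print(final_assistant_message(stream))
```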
@@ -0,0 +1,12 @@
+## 1. Implementation
+- [x] 1.1 Extend `targets.yaml` validation to accept `provider: codex` with settings for `executable`, `profile`, `model`, `approvalPreset`, `timeoutSeconds`, and optional working directory overrides.
+- [x] 1.2 Add a Codex provider class that stages guideline + attachment files into a scratch workspace, builds the preread block (mirroring the VS Code provider), and renders the eval prompt into a single string Codex can ingest.
+- [x] 1.3 Invoke the Codex CLI (`codex exec --json` by default) with settings-derived flags, stream stdout/stderr, and parse the emitted JSONL event stream to capture the final assistant response.
+- [x] 1.4 Detect missing executables early and surface actionable errors before dispatching eval cases.
+- [x] 1.5 Register the provider in the factory so batching flags, dry-run mode, and retries behave consistently with other providers.
+- [x] 1.6 Document the new provider in README/examples, including sample target entries and instructions for installing Codex CLI.
+
+## 2. Validation
+- [x] 2.1 Add unit tests that stub the Codex executable to ensure prompts, attachments, and CLI arguments are composed correctly.
+- [x] 2.2 Add failure-path tests covering timeouts, malformed JSON, and missing credentials to guarantee clear error surfaces.
+- [x] 2.3 Run `pnpm test packages/core/test/evaluation/providers/codex.test.ts` (new) plus the example eval to confirm Codex targets can execute end-to-end.
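Tasks 1.4 and 2.2 describe detecting a missing executable and missing credentials before any eval case is dispatched. One way to sketch such a preflight check, assuming a hypothetical `preflight` helper and a caller-supplied list of required variable names, is:

```python
import os
import shutil


def preflight(executable="codex", required_env=(), env=None):
    """Return a list of human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []
    # shutil.which searches PATH and returns None when the executable is not found.
    if shutil.which(executable) is None:
        problems.append(f"executable not found: {executable}")
    # Treat unset and empty variables the same way.
    missing = [name for name in required_env if not env.get(name)]
    if missing:
        problems.append("missing environment variables: " + ", ".join(missing))
    return problems


for problem in preflight("codex", ["OPENAI_API_KEY"]):
    print(problem)
```

Collecting every problem before reporting, rather than stopping at the first, lets a single error message list all missing pieces. Because `.env.template` accepts either `OPENAI_API_KEY` or `CODEX_API_KEY`, a real check would need an either-or rule that this sketch does not model.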