EntityProcess
diff --git a/apps/web/astro.config.mjs b/apps/web/astro.config.mjs
Lines changed: 1 addition & 1 deletion
diff --git a/apps/web/src/content/docs/docs/evaluation/batch-cli.mdx b/apps/web/src/content/docs/docs/evaluation/batch-cli.mdx
Lines changed: 9 additions & 9 deletions
diff --git a/apps/web/src/content/docs/docs/evaluation/eval-cases.mdx b/apps/web/src/content/docs/docs/evaluation/eval-cases.mdx
Lines changed: 21 additions & 21 deletions
diff --git a/apps/web/src/content/docs/docs/evaluation/eval-files.mdx b/apps/web/src/content/docs/docs/evaluation/eval-files.mdx
Lines changed: 4 additions & 4 deletions
diff --git a/apps/web/src/content/docs/docs/evaluation/examples.mdx b/apps/web/src/content/docs/docs/evaluation/examples.mdx
Lines changed: 6 additions & 6 deletions
diff --git a/apps/web/src/content/docs/docs/evaluation/rubrics.mdx b/apps/web/src/content/docs/docs/evaluation/rubrics.mdx
Lines changed: 3 additions & 3 deletions
diff --git a/apps/web/src/content/docs/docs/evaluation/running-evals.mdx b/apps/web/src/content/docs/docs/evaluation/running-evals.mdx
Lines changed: 4 additions & 4 deletions
diff --git a/apps/web/src/content/docs/docs/evaluation/sdk.mdx b/apps/web/src/content/docs/docs/evaluation/sdk.mdx
Lines changed: 1 addition & 1 deletion
diff --git a/apps/web/src/content/docs/docs/getting-started/quickstart.mdx b/apps/web/src/content/docs/docs/getting-started/quickstart.mdx
Lines changed: 1 addition & 1 deletion
diff --git a/apps/web/src/content/docs/docs/evaluators/code-graders.mdx b/apps/web/src/content/docs/docs/graders/code-graders.mdx
apps/web/src/content/docs/docs/evaluators/code-graders.mdx renamed to apps/web/src/content/docs/docs/graders/code-graders.mdx
@@ -38,7 +38,7 @@ export default defineConfig({
       sidebar: [
         { label: 'Getting Started', autogenerate: { directory: 'docs/getting-started' } },
         { label: 'Evaluation', autogenerate: { directory: 'docs/evaluation' } },
-        { label: 'Evaluators', autogenerate: { directory: 'docs/evaluators' } },
+        { label: 'Graders', autogenerate: { directory: 'docs/graders' } },
         { label: 'Targets', autogenerate: { directory: 'docs/targets' } },
         { label: 'Tools', autogenerate: { directory: 'docs/tools' } },
         { label: 'Guides', autogenerate: { directory: 'docs/guides' } },
 
@@ -14,14 +14,14 @@ Use batch CLI evaluation when:
 - An external tool processes multiple inputs in a single invocation (e.g., AML screening, bulk classification)
 - The runner reads the eval YAML directly to extract all tests
 - Output is JSONL with records keyed by test `id`
-- Each test has its own evaluator to validate its corresponding output record
+- Each test has its own grader to validate its corresponding output record
 
 ## Execution Flow
 
 1. **AgentV** invokes the batch runner once, passing `--eval <yaml-path>` and `--output <jsonl-path>`
 2. **Batch runner** reads the eval YAML, extracts all tests, processes them, and writes JSONL output keyed by `id`
 3. **AgentV** parses the JSONL and routes each record to its matching test by `id`
-4. **Per-test evaluators** validate the output for each test independently
+4. **Per-test graders** validate the output for each test independently
 
 ## Eval File Structure
 
@@ -109,7 +109,7 @@ JSONL where each line is a JSON object with an `id` matching a test:
 {"id": "case-002", "text": "{\"decision\": \"REVIEW\", ...}"}
 ```
 
-The `id` field must match the test `id` for AgentV to route output to the correct evaluator.
+The `id` field must match the test `id` for AgentV to route output to the correct grader.
 
 ### Output with Tool Trajectory
 
@@ -138,11 +138,11 @@ To enable `tool_trajectory` evaluation, include `output` with `tool_calls`:
 }
 ```
 
-AgentV extracts tool calls directly from `output[].tool_calls[]` for `tool_trajectory` evaluators.
+AgentV extracts tool calls directly from `output[].tool_calls[]` for `tool_trajectory` graders.
 
-## Evaluator Implementation
+## Grader Implementation
 
-Each test has its own evaluator that validates the batch runner output. The evaluator receives the standard `code_grader` input via stdin.
+Each test has its own grader that validates the batch runner output. The grader receives the standard `code_grader` input via stdin.
 
 **Input (stdin):**
 ```json
@@ -164,7 +164,7 @@ Each test has its own evaluator that validates the batch runner output. The eval
 }
 ```
 
-### Example Evaluator
+### Example Grader
 
 ```typescript
 import fs from 'node:fs';
@@ -233,7 +233,7 @@ expected_output:
       reasons: []
 ```
 
-The evaluator extracts these fields and compares them against the parsed candidate output.
+The grader extracts these fields and compares them against the parsed candidate output.
 
 ## Target Configuration
 
@@ -259,7 +259,7 @@ Key settings:
 
 ## Best Practices
 
-1. **Use unique test IDs** -- the batch runner and AgentV use `id` to route outputs to the correct evaluator
+1. **Use unique test IDs** -- the batch runner and AgentV use `id` to route outputs to the correct grader
 2. **Structured input** -- put structured data in `user.content` for the runner to extract
 3. **Structured expected_output** -- define expected output as objects for easy comparison
 4. **Deterministic runners** -- batch runners should produce consistent output for reliable testing
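
The `id`-based routing described in the execution flow and in best practice 1 can be illustrated with a small sketch. This is not AgentV's implementation: `routeById` and `OutputRecord` are hypothetical names, and the only behavior assumed is what the docs state, namely that each JSONL line is a JSON object whose `id` matches a test.

```typescript
// Hypothetical sketch: match batch-runner JSONL records to tests by `id`.
// `OutputRecord` and `routeById` are illustrative names, not AgentV APIs.
type OutputRecord = { id: string; text?: string };

function routeById(
  jsonl: string,
  testIds: string[],
): Map<string, OutputRecord | undefined> {
  // Index every non-empty JSONL line by its `id` field.
  const byId = new Map<string, OutputRecord>();
  for (const line of jsonl.split('\n')) {
    const trimmed = line.trim();
    if (trimmed === '') continue;
    const record = JSON.parse(trimmed) as OutputRecord;
    byId.set(record.id, record);
  }

  // Look up each test's record; a test with no record maps to undefined.
  const routed = new Map<string, OutputRecord | undefined>();
  for (const id of testIds) {
    routed.set(id, byId.get(id));
  }
  return routed;
}
```

In this sketch a duplicate `id` silently overwrites the earlier record, which is one reason unique test IDs matter.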
 
@@ -5,7 +5,7 @@ sidebar:
   order: 2
 ---
 
-Tests are individual test entries within an evaluation file. Each test defines input messages, expected outcomes, and optional evaluator overrides.
+Tests are individual test entries within an evaluation file. Each test defines input messages, expected outcomes, and optional grader overrides.
 
 ## Basic Structure
 
@@ -29,9 +29,9 @@ tests:
 | `expected_output` | No | Expected response for comparison (string, object, or message array). Alias: `expected_output` |
 | `execution` | No | Per-case execution overrides (for example `target`, `skip_defaults`) |
 | `workspace` | No | Per-case workspace config (overrides suite-level) |
-| `metadata` | No | Arbitrary key-value pairs passed to evaluators and workspace scripts |
+| `metadata` | No | Arbitrary key-value pairs passed to graders and workspace scripts |
 | `rubrics` | No | Structured evaluation criteria |
-| `assertions` | No | Per-test evaluators |
+| `assertions` | No | Per-test graders |
 
 ## Input
 
@@ -55,7 +55,7 @@ When suite-level `input` is defined in the eval file, those messages are prepend
 
 ## Expected Output
 
-Optional reference response for comparison by evaluators. A string expands to a single assistant message:
+Optional reference response for comparison by graders. A string expands to a single assistant message:
 
 ```yaml
 expected_output: "42"
@@ -71,7 +71,7 @@ expected_output:
 
 ## Per-Case Execution Overrides
 
-Override the default target or evaluators for specific tests:
+Override the default target or graders for specific tests:
 
 ```yaml
 tests:
@@ -87,7 +87,7 @@ tests:
         prompt: ./graders/depth.md
 ```
 
-Per-case `assertions` evaluators are **merged** with root-level `assertions` evaluators — test-specific evaluators run first, then root-level defaults are appended. To opt out of root-level defaults for a specific test, set `execution.skip_defaults: true`:
+Per-case `assertions` graders are **merged** with root-level `assertions` graders — test-specific graders run first, then root-level defaults are appended. To opt out of root-level defaults for a specific test, set `execution.skip_defaults: true`:
 
 ```yaml
 assertions:
@@ -162,11 +162,11 @@ Operational checkout state belongs under `workspace.repos[].checkout.base_commit
 
 ## Per-Test Assertions
 
-The `assertions` field defines evaluators directly on a test. It supports both deterministic assertion types and LLM-based rubric evaluation.
+The `assertions` field defines graders directly on a test. It supports both deterministic assertion types and LLM-based rubric evaluation.
 
 ### Deterministic Assertions
 
-These evaluators run without an LLM call and produce binary (0 or 1) scores:
+These graders run without an LLM call and produce binary (0 or 1) scores:
 
 | Type | Value | Description |
 |------|-------|-------------|
@@ -251,7 +251,7 @@ tests:
         value: ["true/false", "boolean", "expected value"]
 ```
 
-Assertion evaluators auto-generate a `name` when one is not provided (e.g., `contains-DENIED`, `is_json`).
+Assertion graders auto-generate a `name` when one is not provided (e.g., `contains-DENIED`, `is_json`).
 
 ### Rubric Assertions
 
@@ -283,7 +283,7 @@ tests:
 
 ### Required Gates
 
-Any evaluator in `assertions` can be marked as `required`. When a required evaluator fails, the overall test verdict is `fail` regardless of the aggregate score.
+Any grader in `assertions` can be marked as `required`. When a required grader fails, the overall test verdict is `fail` regardless of the aggregate score.
 
 | Value | Behavior |
 |-------|----------|
@@ -303,23 +303,23 @@ assertions:
         weight: 1.0
 ```
 
-Required gates are evaluated after all evaluators run. If any required evaluator falls below its threshold, the verdict is forced to `fail`.
+Required gates are evaluated after all graders run. If any required grader falls below its threshold, the verdict is forced to `fail`.
 
 ### Assertions Merge Behavior
 
 `assertions` can be defined at both suite and test levels:
 
-- Per-test `assertions` evaluators run first.
-- Suite-level `assertions` evaluators are appended automatically.
+- Per-test `assertions` graders run first.
+- Suite-level `assertions` graders are appended automatically.
 - Set `execution.skip_defaults: true` on a test to skip suite-level defaults.
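
The merge rule above can be sketched in a few lines. This is an illustration under stated assumptions rather than AgentV's code: `Grader` and `mergeAssertions` are hypothetical names, and the ordering (per-test first, suite-level appended, nothing appended when `skip_defaults` is set) is taken directly from the list.

```typescript
// Hypothetical sketch of the documented merge order for `assertions`.
// `Grader` and `mergeAssertions` are illustrative names, not AgentV APIs.
type Grader = { name: string; type: string };

function mergeAssertions(
  perTest: Grader[],
  suiteLevel: Grader[],
  skipDefaults = false,
): Grader[] {
  // Per-test graders always come first; suite-level defaults are
  // appended unless the test opts out via execution.skip_defaults.
  return skipDefaults ? [...perTest] : [...perTest, ...suiteLevel];
}
```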
 
 ## How `criteria` and `assertions` Interact
 
-The `criteria` field is a **data field** that describes what the response should accomplish. It is not an evaluator itself — how it gets used depends on whether `assertions` is present.
+The `criteria` field is a **data field** that describes what the response should accomplish. It is not a grader itself — how it gets used depends on whether `assertions` is present.
 
 ### No `assertions` — implicit LLM grader
 
-When a test has no `assertions` field, a default `llm-grader` evaluator runs automatically and uses `criteria` as the evaluation prompt:
+When a test has no `assertions` field, a default `llm-grader` runs automatically and uses `criteria` as the evaluation prompt:
 
 ```yaml
 tests:
@@ -342,14 +342,14 @@ tests:
     input: Generate the spreadsheet report
 ```
 
-### `assertions` present — explicit evaluators only
+### `assertions` present — explicit graders only
 
-When `assertions` is defined, only the declared evaluators run. No implicit grader is added. Graders that are declared (such as `llm-grader`, `code-grader`, or `rubrics`) receive `criteria` as input automatically.
+When `assertions` is defined, only the declared graders run. No implicit grader is added. Graders that are declared (such as `llm-grader`, `code-grader`, or `rubrics`) receive `criteria` as input automatically.
 
-If `assertions` contains only deterministic evaluators (like `contains` or `regex`), the `criteria` field is not evaluated and a warning is emitted:
+If `assertions` contains only deterministic graders (like `contains` or `regex`), the `criteria` field is not evaluated and a warning is emitted:
 
 ```
-Warning: Test 'my-test': criteria is defined but no evaluator in assertions
+Warning: Test 'my-test': criteria is defined but no grader in assertions
 will evaluate it. Add 'type: llm-grader' to assertions, or remove criteria
 if it is documentation-only.
 ```
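
The interaction between `criteria` and `assertions` can be summarized as a small decision function. This is a sketch under assumptions, not AgentV's logic: `planGraders`, `Plan`, and `CRITERIA_AWARE` are hypothetical names, and the set of criteria-aware types contains only the three the docs name as examples (`llm-grader`, `code-grader`, `rubrics`).

```typescript
// Hypothetical sketch of how `criteria` and `assertions` interact.
// All names here are illustrative, not AgentV APIs.
type Assertion = { type: string };
type Plan = { graders: string[]; warning?: string };

// Assumption: only the types the docs name as receiving `criteria`.
const CRITERIA_AWARE = new Set(['llm-grader', 'code-grader', 'rubrics']);

function planGraders(
  testId: string,
  criteria: string | undefined,
  assertions?: Assertion[],
): Plan {
  // No `assertions` field: an implicit llm-grader evaluates `criteria`.
  if (assertions === undefined) {
    return { graders: ['llm-grader'] };
  }

  // `assertions` present: only the declared graders run.
  const graders = assertions.map((a) => a.type);
  const consumesCriteria = graders.some((t) => CRITERIA_AWARE.has(t));

  // Criteria defined but nothing declared will read it: warn.
  if (criteria !== undefined && !consumesCriteria) {
    return {
      graders,
      warning: `Test '${testId}': criteria is defined but no grader in assertions will evaluate it.`,
    };
  }
  return { graders };
}
```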
@@ -367,7 +367,7 @@ tests:
         value: "fix"
 ```
 
-When you need a custom file conversion for only one grader, add `preprocessors` directly to that evaluator:
+When you need a custom file conversion for only one grader, add `preprocessors` directly to that grader:
 
 ```yaml
 preprocessors:
@@ -389,7 +389,7 @@ tests:
 
 ## Metadata
 
-Pass additional context to evaluators via the `metadata` field:
+Pass additional context to graders via the `metadata` field:
 
 ```yaml
 tests:
 
@@ -5,7 +5,7 @@ sidebar:
   order: 1
 ---
 
-Evaluation files define the test cases, targets, and evaluators for an evaluation run. AgentV supports two formats: YAML and JSONL.
+Evaluation files define the test cases, targets, and graders for an evaluation run. AgentV supports two formats: YAML and JSONL.
 
 ## Suites
 
@@ -41,7 +41,7 @@ tests:
 | `execution` | Default execution config (`target`, `fail_on_error`, `threshold`, etc.) |
 | `workspace` | Suite-level workspace config — inline object or string path to an [external workspace file](/docs/guides/workspace-pool/#external-workspace-config) |
 | `tests` | Array of individual tests, or a string path to an external file |
-| `assertions` | Suite-level evaluators appended to each test unless `execution.skip_defaults: true` is set on the test |
+| `assertions` | Suite-level graders appended to each test unless `execution.skip_defaults: true` is set on the test |
 | `input` | Suite-level input messages prepended to each test's input unless `execution.skip_defaults: true` is set on the test |
 
 ### Metadata Fields
@@ -76,7 +76,7 @@ tests:
 
 ### Suite-level Assertions
 
-The `assertions` field is the canonical way to define suite-level evaluators. Suite-level assertions are appended to every test's evaluators unless a test sets `execution.skip_defaults: true`.
+The `assertions` field is the canonical way to define suite-level graders. Suite-level assertions are appended to every test's graders unless a test sets `execution.skip_defaults: true`.
 
 ```yaml
 description: API response validation
@@ -92,7 +92,7 @@ tests:
     input: Check API health
 ```
 
-`assertions` supports all evaluator types, including deterministic assertion types (`contains`, `regex`, `is_json`, `equals`) and `rubrics`. See [Tests](/docs/evaluation/eval-cases/#per-test-assertions) for per-test assertions usage.
+`assertions` supports all grader types, including deterministic assertion types (`contains`, `regex`, `is_json`, `equals`) and `rubrics`. See [Tests](/docs/evaluation/eval-cases/#per-test-assertions) for per-test assertions usage.
 
 ### Assertion Includes
 
 
@@ -69,7 +69,7 @@ tests:
           ```
 ````
 
-## Multi-Evaluator
+## Multi-Grader
 
 Combine a code grader and an LLM grader on the same test:
 
@@ -86,7 +86,7 @@ tests:
       - name: json_format_validator
         type: code-grader
         command: [uv, run, validate_json.py]
-        cwd: ./evaluators
+        cwd: ./graders
       - name: content_evaluator
         type: llm-grader
         prompt: ./graders/semantic_correctness.md
@@ -363,11 +363,11 @@ tests:
 - The batch runner reads the eval YAML via `--eval` flag and outputs JSONL keyed by `id`
 - Put structured data in `user.content` as objects for the runner to extract
 - Use `expected_output` with object fields for structured expected output
-- Each test has its own evaluator to validate its portion of the output
+- Each test has its own grader to validate its portion of the output
 
 ## Suite-level Input
 
-Share a common prompt or system instruction across all tests. Suite-level `input` messages are prepended to each test's input — like suite-level `assertions` for evaluators:
+Share a common prompt or system instruction across all tests. Suite-level `input` messages are prepended to each test's input — like suite-level `assertions` for graders:
 
 ```yaml
 description: Travel assistant evaluation
@@ -418,11 +418,11 @@ See the [suite-level-input example](https://github.com/EntityProcess/agentv/tree
 - Show the pattern, not rigid templates
 - Allow for natural language variation
 - Focus on semantic correctness over exact matching
-- Evaluators handle the actual validation logic
+- Graders handle the actual validation logic
 
 ## Showcases
 
 For complete end-to-end workflows that combine multiple features, see the showcases in [`examples/showcase/`](https://github.com/EntityProcess/agentv/tree/main/examples/showcase):
 
-- **[Multi-Model Benchmark](https://github.com/EntityProcess/agentv/tree/main/examples/showcase/multi-model-benchmark)** — targets matrix × weighted metrics × trials × compare workflow. Runs the same tests against multiple models, scores with weighted evaluators, measures variability, and compares results side-by-side.
+- **[Multi-Model Benchmark](https://github.com/EntityProcess/agentv/tree/main/examples/showcase/multi-model-benchmark)** — targets matrix × weighted metrics × trials × compare workflow. Runs the same tests against multiple models, scores with weighted graders, measures variability, and compares results side-by-side.
 - **[Export Screening](https://github.com/EntityProcess/agentv/tree/main/examples/showcase/export-screening)** — classification eval with confusion matrix metrics and CI gating.
@@ -22,7 +22,7 @@ tests:
       - States time complexity
 ```
 
-All strings are collected into a single rubrics evaluator automatically.
+All strings are collected into a single rubrics grader automatically.
 
 ### Full form for advanced options
 
@@ -120,9 +120,9 @@ score = sum(criterion_score / 10 * weight) / sum(total_weights)
 
 ## Authoring Rubrics
 
-Write rubric criteria directly in `assertions`. If you want help choosing between plain assertions, deterministic evaluators, and rubric or LLM-based grading, use the `agentv-eval-writer` skill. Keep the evaluator choice driven by the criteria rather than one fixed recipe.
+Write rubric criteria directly in `assertions`. If you want help choosing between plain assertions, deterministic graders, and rubric or LLM-based grading, use the `agentv-eval-writer` skill. Keep the grader choice driven by the criteria rather than one fixed recipe.
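
The weighted formula quoted in the hunk context above, `score = sum(criterion_score / 10 * weight) / sum(total_weights)`, can be worked through in code. This is a sketch only: `rubricScore` and `CriterionResult` are hypothetical names, the 0 to 10 criterion scale is inferred from the division by 10, and returning 0 for an empty rubric is an assumption of this example.

```typescript
// Hypothetical sketch of the weighted rubric score formula.
// `CriterionResult` and `rubricScore` are illustrative names.
type CriterionResult = { score: number; weight: number }; // score assumed to be 0-10

function rubricScore(results: CriterionResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  // Assumption: an empty or zero-weight rubric scores 0 rather than NaN.
  if (totalWeight === 0) return 0;

  // Normalize each criterion to 0-1, apply its weight, then average.
  const weighted = results.reduce(
    (sum, r) => sum + (r.score / 10) * r.weight,
    0,
  );
  return weighted / totalWeight;
}
```

For example, two equally weighted criteria scored 10 and 5 give (1.0 + 0.5) / 2 = 0.75.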
 
-## Combining with Other Evaluators
+## Combining with Other Graders
 
 Rubrics work alongside code and LLM graders:
 
 
@@ -75,7 +75,7 @@ agentv eval --dry-run evals/my-eval.yaml
 ```
 
 :::note
-Dry-run returns mock responses that don't match evaluator output schemas. Use it only for testing harness flow, not evaluator logic.
+Dry-run returns mock responses that don't match grader output schemas. Use it only for testing harness flow, not grader logic.
 :::
 
 ### Custom Output Directory
@@ -163,7 +163,7 @@ Each eval test case produces a trace with:
 - **LLM call spans** (`chat <model>`) — model name, token usage (input/output/cached)
 - **Tool call spans** (`execute_tool <name>`) — tool name, arguments, results (with `--otel-capture-content`)
 - **Turn spans** (`agentv.turn.N`) — groups messages by conversation turn (with `--otel-group-turns`)
-- **Evaluator events** — per-grader scores attached to the root span
+- **Grader events** — per-grader scores attached to the root span
 
 :::tip[Claude provider + trace-claude-code plugin]
 When using the Claude provider, AgentV injects `CC_PARENT_SPAN_ID` and `CC_ROOT_SPAN_ID` into the Claude subprocess. If the [trace-claude-code](https://github.com/braintrustdata/braintrust-claude-plugin) plugin is installed, it attaches Claude Code CLI-level tool spans (Read, Write, Bash, etc.) as children of the AgentV eval trace, giving you full visibility into both the eval framework and the agent's internal actions.
@@ -331,14 +331,14 @@ This is the same interface that agent-orchestrated evals use — the EVAL.yaml t
 
 ## Offline Grading
 
-Grade existing agent sessions without re-running them. Import a transcript, then run deterministic evaluators:
+Grade existing agent sessions without re-running them. Import a transcript, then run deterministic graders:
 
 ```bash
 # List sessions and import one
 agentv import claude --list
 agentv import claude --session-id <uuid>
 
-# Run evaluators against the imported transcript
+# Run graders against the imported transcript
 agentv eval evals/my-eval.yaml --transcript .agentv/transcripts/claude-<id>.jsonl
 ```
 
 
@@ -90,7 +90,7 @@ export default defineCodeGrader(({ trace, outputText }) => ({
 
 `defineCodeGrader` graders are referenced in YAML with `type: code-grader` and `command: [bun, run, grader.ts]`. `defineAssertion` uses convention-based discovery instead — just place in `.agentv/assertions/` and reference by name.
 
-For detailed patterns, input/output contracts, and language-agnostic examples, see [Code Graders](/docs/evaluators/code-graders/).
+For detailed patterns, input/output contracts, and language-agnostic examples, see [Code Graders](/docs/graders/code-graders/).
 
 ## Programmatic API
 
 
@@ -72,5 +72,5 @@ Results appear in `.agentv/results/runs/<timestamp>/index.jsonl` with scores, re
 
 - Learn about [eval file formats](/docs/evaluation/eval-files/)
 - Configure [targets](/docs/targets/configuration/) for different providers
-- Create [custom evaluators](/docs/evaluators/custom-evaluators/)
+- Create [custom graders](/docs/graders/custom-graders/)
 - If setup drifts, rerun: `agentv init`