Commit e07634d

christso and claude authored
docs: rename evaluators to graders for consistency with config (#1106)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 5c5bb87 commit e07634d

31 files changed

Lines changed: 150 additions & 150 deletions

apps/web/astro.config.mjs

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ export default defineConfig({
 sidebar: [
 { label: 'Getting Started', autogenerate: { directory: 'docs/getting-started' } },
 { label: 'Evaluation', autogenerate: { directory: 'docs/evaluation' } },
-{ label: 'Evaluators', autogenerate: { directory: 'docs/evaluators' } },
+{ label: 'Graders', autogenerate: { directory: 'docs/graders' } },
 { label: 'Targets', autogenerate: { directory: 'docs/targets' } },
 { label: 'Tools', autogenerate: { directory: 'docs/tools' } },
 { label: 'Guides', autogenerate: { directory: 'docs/guides' } },

apps/web/src/content/docs/docs/evaluation/batch-cli.mdx

Lines changed: 9 additions & 9 deletions
@@ -14,14 +14,14 @@ Use batch CLI evaluation when:
 - An external tool processes multiple inputs in a single invocation (e.g., AML screening, bulk classification)
 - The runner reads the eval YAML directly to extract all tests
 - Output is JSONL with records keyed by test `id`
-- Each test has its own evaluator to validate its corresponding output record
+- Each test has its own grader to validate its corresponding output record

 ## Execution Flow

 1. **AgentV** invokes the batch runner once, passing `--eval <yaml-path>` and `--output <jsonl-path>`
 2. **Batch runner** reads the eval YAML, extracts all tests, processes them, and writes JSONL output keyed by `id`
 3. **AgentV** parses the JSONL and routes each record to its matching test by `id`
-4. **Per-test evaluators** validate the output for each test independently
+4. **Per-test graders** validate the output for each test independently

 ## Eval File Structure

@@ -109,7 +109,7 @@ JSONL where each line is a JSON object with an `id` matching a test:
 {"id": "case-002", "text": "{\"decision\": \"REVIEW\", ...}"}
 ```

-The `id` field must match the test `id` for AgentV to route output to the correct evaluator.
+The `id` field must match the test `id` for AgentV to route output to the correct grader.

 ### Output with Tool Trajectory

@@ -138,11 +138,11 @@ To enable `tool_trajectory` evaluation, include `output` with `tool_calls`:
 }
 ```

-AgentV extracts tool calls directly from `output[].tool_calls[]` for `tool_trajectory` evaluators.
+AgentV extracts tool calls directly from `output[].tool_calls[]` for `tool_trajectory` graders.

-## Evaluator Implementation
+## Grader Implementation

-Each test has its own evaluator that validates the batch runner output. The evaluator receives the standard `code_grader` input via stdin.
+Each test has its own grader that validates the batch runner output. The grader receives the standard `code_grader` input via stdin.

 **Input (stdin):**
 ```json
@@ -164,7 +164,7 @@ Each test has its own evaluator that validates the batch runner output. The eval
 }
 ```

-### Example Evaluator
+### Example Grader

 ```typescript
 import fs from 'node:fs';
@@ -233,7 +233,7 @@ expected_output:
 reasons: []
 ```

-The evaluator extracts these fields and compares them against the parsed candidate output.
+The grader extracts these fields and compares them against the parsed candidate output.

 ## Target Configuration

@@ -259,7 +259,7 @@ Key settings:

 ## Best Practices

-1. **Use unique test IDs** -- the batch runner and AgentV use `id` to route outputs to the correct evaluator
+1. **Use unique test IDs** -- the batch runner and AgentV use `id` to route outputs to the correct grader
 2. **Structured input** -- put structured data in `user.content` for the runner to extract
 3. **Structured expected_output** -- define expected output as objects for easy comparison
 4. **Deterministic runners** -- batch runners should produce consistent output for reliable testing
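
Reviewer note: the routing-by-`id` behavior described in this file could be sketched roughly as follows. This is not AgentV's implementation; `BatchRecord`, `RoutingResult`, and `routeRecords` are hypothetical names, and the only fact taken from the docs is that each JSONL line is a JSON object whose `id` matches a test.

```typescript
// Hypothetical record shape: the docs only guarantee an `id` on each JSONL line.
interface BatchRecord {
  id: string;
  [key: string]: unknown;
}

interface RoutingResult {
  matched: Map<string, BatchRecord>; // one output record per known test id
  missing: string[];                 // test ids that received no output record
  unexpected: string[];              // record ids that match no test
}

// Parse JSONL text and pair each record with the test that shares its id.
function routeRecords(jsonl: string, testIds: string[]): RoutingResult {
  const known = new Set(testIds);
  const matched = new Map<string, BatchRecord>();
  const unexpected: string[] = [];

  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // tolerate blank lines
    const record = JSON.parse(line) as BatchRecord;
    if (!known.has(record.id)) {
      unexpected.push(record.id);
    } else if (matched.has(record.id)) {
      // Duplicate ids make routing ambiguous, hence "use unique test IDs".
      throw new Error(`Duplicate output record for test id "${record.id}"`);
    } else {
      matched.set(record.id, record);
    }
  }

  const missing = testIds.filter((id) => !matched.has(id));
  return { matched, missing, unexpected };
}
```

Throwing on a duplicate `id` is a design choice made for this sketch; a real harness might instead report the affected test as failed.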

apps/web/src/content/docs/docs/evaluation/eval-cases.mdx

Lines changed: 21 additions & 21 deletions
@@ -5,7 +5,7 @@ sidebar:
 order: 2
 ---

-Tests are individual test entries within an evaluation file. Each test defines input messages, expected outcomes, and optional evaluator overrides.
+Tests are individual test entries within an evaluation file. Each test defines input messages, expected outcomes, and optional grader overrides.

 ## Basic Structure

@@ -29,9 +29,9 @@ tests:
 | `expected_output` | No | Expected response for comparison (string, object, or message array). Alias: `expected_output` |
 | `execution` | No | Per-case execution overrides (for example `target`, `skip_defaults`) |
 | `workspace` | No | Per-case workspace config (overrides suite-level) |
-| `metadata` | No | Arbitrary key-value pairs passed to evaluators and workspace scripts |
+| `metadata` | No | Arbitrary key-value pairs passed to graders and workspace scripts |
 | `rubrics` | No | Structured evaluation criteria |
-| `assertions` | No | Per-test evaluators |
+| `assertions` | No | Per-test graders |

 ## Input

@@ -55,7 +55,7 @@ When suite-level `input` is defined in the eval file, those messages are prepend

 ## Expected Output

-Optional reference response for comparison by evaluators. A string expands to a single assistant message:
+Optional reference response for comparison by graders. A string expands to a single assistant message:

 ```yaml
 expected_output: "42"
@@ -71,7 +71,7 @@ expected_output:

 ## Per-Case Execution Overrides

-Override the default target or evaluators for specific tests:
+Override the default target or graders for specific tests:

 ```yaml
 tests:
@@ -87,7 +87,7 @@ tests:
 prompt: ./graders/depth.md
 ```

-Per-case `assertions` evaluators are **merged** with root-level `assertions` evaluators — test-specific evaluators run first, then root-level defaults are appended. To opt out of root-level defaults for a specific test, set `execution.skip_defaults: true`:
+Per-case `assertions` graders are **merged** with root-level `assertions` graders — test-specific graders run first, then root-level defaults are appended. To opt out of root-level defaults for a specific test, set `execution.skip_defaults: true`:

 ```yaml
 assertions:
@@ -162,11 +162,11 @@ Operational checkout state belongs under `workspace.repos[].checkout.base_commit

 ## Per-Test Assertions

-The `assertions` field defines evaluators directly on a test. It supports both deterministic assertion types and LLM-based rubric evaluation.
+The `assertions` field defines graders directly on a test. It supports both deterministic assertion types and LLM-based rubric evaluation.

 ### Deterministic Assertions

-These evaluators run without an LLM call and produce binary (0 or 1) scores:
+These graders run without an LLM call and produce binary (0 or 1) scores:

 | Type | Value | Description |
 |------|-------|-------------|
@@ -251,7 +251,7 @@ tests:
 value: ["true/false", "boolean", "expected value"]
 ```

-Assertion evaluators auto-generate a `name` when one is not provided (e.g., `contains-DENIED`, `is_json`).
+Assertion graders auto-generate a `name` when one is not provided (e.g., `contains-DENIED`, `is_json`).

 ### Rubric Assertions

@@ -283,7 +283,7 @@ tests:

 ### Required Gates

-Any evaluator in `assertions` can be marked as `required`. When a required evaluator fails, the overall test verdict is `fail` regardless of the aggregate score.
+Any grader in `assertions` can be marked as `required`. When a required grader fails, the overall test verdict is `fail` regardless of the aggregate score.

 | Value | Behavior |
 |-------|----------|
@@ -303,23 +303,23 @@ assertions:
 weight: 1.0
 ```

-Required gates are evaluated after all evaluators run. If any required evaluator falls below its threshold, the verdict is forced to `fail`.
+Required gates are evaluated after all graders run. If any required grader falls below its threshold, the verdict is forced to `fail`.

 ### Assertions Merge Behavior

 `assertions` can be defined at both suite and test levels:

-- Per-test `assertions` evaluators run first.
-- Suite-level `assertions` evaluators are appended automatically.
+- Per-test `assertions` graders run first.
+- Suite-level `assertions` graders are appended automatically.
 - Set `execution.skip_defaults: true` on a test to skip suite-level defaults.

 ## How `criteria` and `assertions` Interact

-The `criteria` field is a **data field** that describes what the response should accomplish. It is not an evaluator itself — how it gets used depends on whether `assertions` is present.
+The `criteria` field is a **data field** that describes what the response should accomplish. It is not a grader itself — how it gets used depends on whether `assertions` is present.

 ### No `assertions` — implicit LLM grader

-When a test has no `assertions` field, a default `llm-grader` evaluator runs automatically and uses `criteria` as the evaluation prompt:
+When a test has no `assertions` field, a default `llm-grader` grader runs automatically and uses `criteria` as the evaluation prompt:

 ```yaml
 tests:
@@ -342,14 +342,14 @@ tests:
 input: Generate the spreadsheet report
 ```

-### `assertions` present — explicit evaluators only
+### `assertions` present — explicit graders only

-When `assertions` is defined, only the declared evaluators run. No implicit grader is added. Graders that are declared (such as `llm-grader`, `code-grader`, or `rubrics`) receive `criteria` as input automatically.
+When `assertions` is defined, only the declared graders run. No implicit grader is added. Graders that are declared (such as `llm-grader`, `code-grader`, or `rubrics`) receive `criteria` as input automatically.

-If `assertions` contains only deterministic evaluators (like `contains` or `regex`), the `criteria` field is not evaluated and a warning is emitted:
+If `assertions` contains only deterministic graders (like `contains` or `regex`), the `criteria` field is not evaluated and a warning is emitted:

 ```
-Warning: Test 'my-test': criteria is defined but no evaluator in assertions
+Warning: Test 'my-test': criteria is defined but no grader in assertions
 will evaluate it. Add 'type: llm-grader' to assertions, or remove criteria
 if it is documentation-only.
 ```
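
Reviewer note: the interaction between `criteria` and `assertions` described in this hunk can be summarized in a small sketch. It only illustrates the documented rules, not AgentV's code; `TestCase`, `Grader`, `resolveGraders`, and the `prompt` field on the implicit grader are hypothetical, and the set of deterministic types is limited to those named in the docs (`contains`, `regex`, `is_json`, `equals`).

```typescript
// Hypothetical shapes for illustration only.
type Grader = { type: string; [key: string]: unknown };

interface TestCase {
  id: string;
  criteria?: string;
  assertions?: Grader[];
}

// Deterministic types named in the docs; the real list may be longer.
const DETERMINISTIC_TYPES = new Set(["contains", "regex", "is_json", "equals"]);

function resolveGraders(test: TestCase): { graders: Grader[]; warnings: string[] } {
  const warnings: string[] = [];

  // No `assertions`: an implicit llm-grader uses `criteria` as its prompt.
  if (test.assertions === undefined) {
    return { graders: [{ type: "llm-grader", prompt: test.criteria }], warnings };
  }

  // `assertions` present: only the declared graders run, nothing is added.
  const onlyDeterministic = test.assertions.every((g) => DETERMINISTIC_TYPES.has(g.type));
  if (test.criteria !== undefined && onlyDeterministic) {
    warnings.push(
      `Test '${test.id}': criteria is defined but no grader in assertions will evaluate it.`,
    );
  }
  return { graders: test.assertions, warnings };
}
```

Under these assumptions, a test with `criteria` plus a single `contains` assertion yields one grader and one warning, while the same test without `assertions` yields a single implicit `llm-grader`.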
@@ -367,7 +367,7 @@ tests:
 value: "fix"
 ```

-When you need a custom file conversion for only one grader, add `preprocessors` directly to that evaluator:
+When you need a custom file conversion for only one grader, add `preprocessors` directly to that grader:

 ```yaml
 preprocessors:
@@ -389,7 +389,7 @@ tests:

 ## Metadata

-Pass additional context to evaluators via the `metadata` field:
+Pass additional context to graders via the `metadata` field:

 ```yaml
 tests:

apps/web/src/content/docs/docs/evaluation/eval-files.mdx

Lines changed: 4 additions & 4 deletions
@@ -5,7 +5,7 @@ sidebar:
 order: 1
 ---

-Evaluation files define the test cases, targets, and evaluators for an evaluation run. AgentV supports two formats: YAML and JSONL.
+Evaluation files define the test cases, targets, and graders for an evaluation run. AgentV supports two formats: YAML and JSONL.

 ## Suites

@@ -41,7 +41,7 @@ tests:
 | `execution` | Default execution config (`target`, `fail_on_error`, `threshold`, etc.) |
 | `workspace` | Suite-level workspace config — inline object or string path to an [external workspace file](/docs/guides/workspace-pool/#external-workspace-config) |
 | `tests` | Array of individual tests, or a string path to an external file |
-| `assertions` | Suite-level evaluators appended to each test unless `execution.skip_defaults: true` is set on the test |
+| `assertions` | Suite-level graders appended to each test unless `execution.skip_defaults: true` is set on the test |
 | `input` | Suite-level input messages prepended to each test's input unless `execution.skip_defaults: true` is set on the test |

 ### Metadata Fields
@@ -76,7 +76,7 @@ tests:

 ### Suite-level Assertions

-The `assertions` field is the canonical way to define suite-level evaluators. Suite-level assertions are appended to every test's evaluators unless a test sets `execution.skip_defaults: true`.
+The `assertions` field is the canonical way to define suite-level graders. Suite-level assertions are appended to every test's graders unless a test sets `execution.skip_defaults: true`.

 ```yaml
 description: API response validation
@@ -92,7 +92,7 @@ tests:
 input: Check API health
 ```

-`assertions` supports all evaluator types, including deterministic assertion types (`contains`, `regex`, `is_json`, `equals`) and `rubrics`. See [Tests](/docs/evaluation/eval-cases/#per-test-assertions) for per-test assertions usage.
+`assertions` supports all grader types, including deterministic assertion types (`contains`, `regex`, `is_json`, `equals`) and `rubrics`. See [Tests](/docs/evaluation/eval-cases/#per-test-assertions) for per-test assertions usage.

 ### Assertion Includes

apps/web/src/content/docs/docs/evaluation/examples.mdx

Lines changed: 6 additions & 6 deletions
@@ -69,7 +69,7 @@ tests:
 ```
 ````

-## Multi-Evaluator
+## Multi-Grader

 Combine a code grader and an LLM grader on the same test:

@@ -86,7 +86,7 @@ tests:
 - name: json_format_validator
 type: code-grader
 command: [uv, run, validate_json.py]
-cwd: ./evaluators
+cwd: ./graders
 - name: content_evaluator
 type: llm-grader
 prompt: ./graders/semantic_correctness.md
@@ -363,11 +363,11 @@ tests:
 - The batch runner reads the eval YAML via `--eval` flag and outputs JSONL keyed by `id`
 - Put structured data in `user.content` as objects for the runner to extract
 - Use `expected_output` with object fields for structured expected output
-- Each test has its own evaluator to validate its portion of the output
+- Each test has its own grader to validate its portion of the output

 ## Suite-level Input

-Share a common prompt or system instruction across all tests. Suite-level `input` messages are prepended to each test's input — like suite-level `assertions` for evaluators:
+Share a common prompt or system instruction across all tests. Suite-level `input` messages are prepended to each test's input — like suite-level `assertions` for graders:

 ```yaml
 description: Travel assistant evaluation
@@ -418,11 +418,11 @@ See the [suite-level-input example](https://github.com/EntityProcess/agentv/tree
 - Show the pattern, not rigid templates
 - Allow for natural language variation
 - Focus on semantic correctness over exact matching
-- Evaluators handle the actual validation logic
+- Graders handle the actual validation logic

 ## Showcases

 For complete end-to-end workflows that combine multiple features, see the showcases in [`examples/showcase/`](https://github.com/EntityProcess/agentv/tree/main/examples/showcase):

-- **[Multi-Model Benchmark](https://github.com/EntityProcess/agentv/tree/main/examples/showcase/multi-model-benchmark)** — targets matrix × weighted metrics × trials × compare workflow. Runs the same tests against multiple models, scores with weighted evaluators, measures variability, and compares results side-by-side.
+- **[Multi-Model Benchmark](https://github.com/EntityProcess/agentv/tree/main/examples/showcase/multi-model-benchmark)** — targets matrix × weighted metrics × trials × compare workflow. Runs the same tests against multiple models, scores with weighted graders, measures variability, and compares results side-by-side.
 - **[Export Screening](https://github.com/EntityProcess/agentv/tree/main/examples/showcase/export-screening)** — classification eval with confusion matrix metrics and CI gating.

apps/web/src/content/docs/docs/evaluation/rubrics.mdx

Lines changed: 3 additions & 3 deletions
@@ -22,7 +22,7 @@ tests:
 - States time complexity
 ```

-All strings are collected into a single rubrics evaluator automatically.
+All strings are collected into a single rubrics grader automatically.

 ### Full form for advanced options

@@ -120,9 +120,9 @@ score = sum(criterion_score / 10 * weight) / sum(total_weights)

 ## Authoring Rubrics

-Write rubric criteria directly in `assertions`. If you want help choosing between plain assertions, deterministic evaluators, and rubric or LLM-based grading, use the `agentv-eval-writer` skill. Keep the evaluator choice driven by the criteria rather than one fixed recipe.
+Write rubric criteria directly in `assertions`. If you want help choosing between plain assertions, deterministic graders, and rubric or LLM-based grading, use the `agentv-eval-writer` skill. Keep the grader choice driven by the criteria rather than one fixed recipe.

-## Combining with Other Evaluators
+## Combining with Other Graders

 Rubrics work alongside code and LLM graders:
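
Reviewer note: the scoring formula quoted in the second hunk header, `score = sum(criterion_score / 10 * weight) / sum(total_weights)`, can be worked through in a few lines. This is a sketch and not AgentV's code; `CriterionResult` and `rubricScore` are hypothetical names, the 0 to 10 range for `score` is inferred from the division by 10, and returning 0 for an empty rubric is an assumption.

```typescript
// Hypothetical shape: one graded rubric criterion.
interface CriterionResult {
  score: number;  // assumed to be on a 0-10 scale, as the "/ 10" in the formula suggests
  weight: number; // relative importance of the criterion
}

// score = sum(criterion_score / 10 * weight) / sum(weights)
function rubricScore(results: CriterionResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0; // assumption: no weight means no score
  const weighted = results.reduce((sum, r) => sum + (r.score / 10) * r.weight, 0);
  return weighted / totalWeight;
}
```

For example, a criterion scored 10 with weight 2 and another scored 5 with weight 1 give (1.0 × 2 + 0.5 × 1) / 3, roughly 0.83.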

apps/web/src/content/docs/docs/evaluation/running-evals.mdx

Lines changed: 4 additions & 4 deletions
@@ -75,7 +75,7 @@ agentv eval --dry-run evals/my-eval.yaml
 ```

 :::note
-Dry-run returns mock responses that don't match evaluator output schemas. Use it only for testing harness flow, not evaluator logic.
+Dry-run returns mock responses that don't match grader output schemas. Use it only for testing harness flow, not grader logic.
 :::

 ### Custom Output Directory
@@ -163,7 +163,7 @@ Each eval test case produces a trace with:
 - **LLM call spans** (`chat <model>`) — model name, token usage (input/output/cached)
 - **Tool call spans** (`execute_tool <name>`) — tool name, arguments, results (with `--otel-capture-content`)
 - **Turn spans** (`agentv.turn.N`) — groups messages by conversation turn (with `--otel-group-turns`)
-- **Evaluator events** — per-grader scores attached to the root span
+- **Grader events** — per-grader scores attached to the root span

 :::tip[Claude provider + trace-claude-code plugin]
 When using the Claude provider, AgentV injects `CC_PARENT_SPAN_ID` and `CC_ROOT_SPAN_ID` into the Claude subprocess. If the [trace-claude-code](https://github.com/braintrustdata/braintrust-claude-plugin) plugin is installed, it attaches Claude Code CLI-level tool spans (Read, Write, Bash, etc.) as children of the AgentV eval trace, giving you full visibility into both the eval framework and the agent's internal actions.
@@ -331,14 +331,14 @@ This is the same interface that agent-orchestrated evals use — the EVAL.yaml t

 ## Offline Grading

-Grade existing agent sessions without re-running them. Import a transcript, then run deterministic evaluators:
+Grade existing agent sessions without re-running them. Import a transcript, then run deterministic graders:

 ```bash
 # List sessions and import one
 agentv import claude --list
 agentv import claude --session-id <uuid>

-# Run evaluators against the imported transcript
+# Run graders against the imported transcript
 agentv eval evals/my-eval.yaml --transcript .agentv/transcripts/claude-<id>.jsonl
 ```
apps/web/src/content/docs/docs/evaluation/sdk.mdx

Lines changed: 1 addition & 1 deletion
@@ -90,7 +90,7 @@ export default defineCodeGrader(({ trace, outputText }) => ({

 `defineCodeGrader` graders are referenced in YAML with `type: code-grader` and `command: [bun, run, grader.ts]`. `defineAssertion` uses convention-based discovery instead — just place in `.agentv/assertions/` and reference by name.

-For detailed patterns, input/output contracts, and language-agnostic examples, see [Code Graders](/docs/evaluators/code-graders/).
+For detailed patterns, input/output contracts, and language-agnostic examples, see [Code Graders](/docs/graders/code-graders/).

 ## Programmatic API

apps/web/src/content/docs/docs/getting-started/quickstart.mdx

Lines changed: 1 addition & 1 deletion
@@ -72,5 +72,5 @@ Results appear in `.agentv/results/runs/<timestamp>/index.jsonl` with scores, re

 - Learn about [eval file formats](/docs/evaluation/eval-files/)
 - Configure [targets](/docs/targets/configuration/) for different providers
-- Create [custom evaluators](/docs/evaluators/custom-evaluators/)
+- Create [custom graders](/docs/graders/custom-graders/)
 - If setup drifts, rerun: `agentv init`

apps/web/src/content/docs/docs/evaluators/code-graders.mdx renamed to apps/web/src/content/docs/docs/graders/code-graders.mdx

File renamed without changes.
