| `metadata` | No | Arbitrary key-value pairs passed to evaluators and workspace scripts |
32
+
| `metadata` | No | Arbitrary key-value pairs passed to graders and workspace scripts |
33
33
| `rubrics` | No | Structured evaluation criteria |
34
-
| `assertions` | No | Per-test evaluators |
34
+
| `assertions` | No | Per-test graders |
35
35
36
36
## Input
37
37
@@ -55,7 +55,7 @@ When suite-level `input` is defined in the eval file, those messages are prepend
55
55
56
56
## Expected Output
57
57
58
-
Optional reference response for comparison by evaluators. A string expands to a single assistant message:
58
+
Optional reference response for comparison by graders. A string expands to a single assistant message:
59
59
60
60
```yaml
61
61
expected_output: "42"
@@ -71,7 +71,7 @@ expected_output:
71
71
72
72
## Per-Case Execution Overrides
73
73
74
-
Override the default target or evaluators for specific tests:
74
+
Override the default target or graders for specific tests:
75
75
76
76
```yaml
77
77
tests:
@@ -87,7 +87,7 @@ tests:
87
87
prompt: ./graders/depth.md
88
88
```
89
89
90
-
Per-case `assertions` evaluators are **merged** with root-level `assertions` evaluators — test-specific evaluators run first, then root-level defaults are appended. To opt out of root-level defaults for a specific test, set `execution.skip_defaults: true`:
90
+
Per-case `assertions` graders are **merged** with root-level `assertions` graders — test-specific graders run first, then root-level defaults are appended. To opt out of root-level defaults for a specific test, set `execution.skip_defaults: true`:
91
91
92
92
```yaml
93
93
assertions:
@@ -162,11 +162,11 @@ Operational checkout state belongs under `workspace.repos[].checkout.base_commit
162
162
163
163
## Per-Test Assertions
164
164
165
-
The `assertions` field defines evaluators directly on a test. It supports both deterministic assertion types and LLM-based rubric evaluation.
165
+
The `assertions` field defines graders directly on a test. It supports both deterministic assertion types and LLM-based rubric evaluation.
166
166
167
167
### Deterministic Assertions
168
168
169
-
These evaluators run without an LLM call and produce binary (0 or 1) scores:
169
+
These graders run without an LLM call and produce binary (0 or 1) scores:
Assertion evaluators auto-generate a `name` when one is not provided (e.g., `contains-DENIED`, `is_json`).
254
+
Assertion graders auto-generate a `name` when one is not provided (e.g., `contains-DENIED`, `is_json`).
255
255
256
256
### Rubric Assertions
257
257
@@ -283,7 +283,7 @@ tests:
283
283
284
284
### Required Gates
285
285
286
-
Any evaluator in `assertions` can be marked as `required`. When a required evaluator fails, the overall test verdict is `fail` regardless of the aggregate score.
286
+
Any grader in `assertions` can be marked as `required`. When a required grader fails, the overall test verdict is `fail` regardless of the aggregate score.
287
287
288
288
| Value | Behavior |
289
289
|-------|----------|
@@ -303,23 +303,23 @@ assertions:
303
303
weight: 1.0
304
304
```
305
305
306
-
Required gates are evaluated after all evaluators run. If any required evaluator falls below its threshold, the verdict is forced to `fail`.
306
+
Required gates are evaluated after all graders run. If any required grader falls below its threshold, the verdict is forced to `fail`.
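The gate rule described above can be sketched in a few lines of Python. This is an illustration only, not AgentV's implementation: the `GraderResult` class, its `threshold` field, the `pass_score` parameter, and the use of a weighted mean as the aggregate score are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class GraderResult:
    """Hypothetical result record for one grader (not an AgentV type)."""
    name: str
    score: float             # 0.0 to 1.0
    weight: float = 1.0
    required: bool = False
    threshold: float = 1.0   # assumed: minimum score a required grader must reach


def verdict(results: list, pass_score: float) -> str:
    """Return 'pass' or 'fail' for one test.

    Assumes the aggregate is a weighted mean; the real aggregation may differ.
    """
    total_weight = sum(r.weight for r in results)
    aggregate = (
        sum(r.score * r.weight for r in results) / total_weight
        if total_weight
        else 0.0
    )
    # Gates are checked only after every grader has produced a score.
    if any(r.required and r.score < r.threshold for r in results):
        return "fail"
    return "pass" if aggregate >= pass_score else "fail"
```

In this sketch a test whose aggregate score clears `pass_score` still fails when one required grader scores below its threshold, which is the behavior the paragraph describes.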
307
307
308
308
### Assertions Merge Behavior
309
309
310
310
`assertions` can be defined at both suite and test levels:
311
311
312
-
- Per-test `assertions` evaluators run first.
313
-
- Suite-level `assertions` evaluators are appended automatically.
312
+
- Per-test `assertions` graders run first.
313
+
- Suite-level `assertions` graders are appended automatically.
314
314
- Set `execution.skip_defaults: true` on a test to skip suite-level defaults.
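As a rough sketch of the merge order in the list above, the following Python function builds the effective grader list for one test. The function name and the plain-dict representation of a parsed test are hypothetical; only the field names `assertions` and `execution.skip_defaults` come from the documentation.

```python
def effective_assertions(test: dict, suite_assertions: list) -> list:
    """Return the graders that would run for `test`, in order.

    Per-test graders come first; suite-level graders are appended
    unless the test sets execution.skip_defaults to true.
    """
    own = list(test.get("assertions", []))
    if test.get("execution", {}).get("skip_defaults", False):
        return own
    return own + list(suite_assertions)
```

A test that sets `execution: {skip_defaults: true}` keeps only its own graders, while every other test gets the suite-level graders appended after its own.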
315
315
316
316
## How `criteria` and `assertions` Interact
317
317
318
-
The `criteria` field is a **data field** that describes what the response should accomplish. It is not an evaluator itself — how it gets used depends on whether `assertions` is present.
318
+
The `criteria` field is a **data field** that describes what the response should accomplish. It is not a grader itself; how it gets used depends on whether `assertions` is present.
319
319
320
320
### No `assertions` — implicit LLM grader
321
321
322
-
When a test has no `assertions` field, a default `llm-grader` evaluator runs automatically and uses `criteria` as the evaluation prompt:
322
+
When a test has no `assertions` field, a default `llm-grader` runs automatically and uses `criteria` as the evaluation prompt:
323
323
324
324
```yaml
325
325
tests:
@@ -342,14 +342,14 @@ tests:
342
342
input: Generate the spreadsheet report
343
343
```
344
344
345
-
### `assertions` present — explicit evaluators only
345
+
### `assertions` present — explicit graders only
346
346
347
-
When `assertions` is defined, only the declared evaluators run. No implicit grader is added. Graders that are declared (such as `llm-grader`, `code-grader`, or `rubrics`) receive `criteria` as input automatically.
347
+
When `assertions` is defined, only the declared graders run. No implicit grader is added. Graders that are declared (such as `llm-grader`, `code-grader`, or `rubrics`) receive `criteria` as input automatically.
348
348
349
-
If `assertions` contains only deterministic evaluators (like `contains` or `regex`), the `criteria` field is not evaluated and a warning is emitted:
349
+
If `assertions` contains only deterministic graders (like `contains` or `regex`), the `criteria` field is not evaluated and a warning is emitted:
350
350
351
351
```
352
-
Warning: Test 'my-test': criteria is defined but no evaluator in assertions
352
+
Warning: Test 'my-test': criteria is defined but no grader in assertions
353
353
will evaluate it. Add 'type: llm-grader' to assertions, or remove criteria
354
354
if it is documentation-only.
355
355
```
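The decision described in this section can be summarized with a small Python sketch. It is an illustration under stated assumptions, not the actual AgentV logic: the `plan_graders` function, the dict-based test shape, the `id` key used in the message, and the `DETERMINISTIC` set (taken from the deterministic types named in these docs) are all hypothetical, and suite-level merging is ignored.

```python
from typing import Optional

# Deterministic assertion types named in the docs; the real list may be longer.
DETERMINISTIC = {"contains", "regex", "is_json", "equals"}


def plan_graders(test: dict) -> tuple:
    """Return (graders_to_run, warning_or_None) for one parsed test."""
    criteria: Optional[str] = test.get("criteria")
    assertions = test.get("assertions")

    if assertions is None:
        # No `assertions` field: an implicit llm-grader uses `criteria` as its prompt.
        return [{"type": "llm-grader", "prompt": criteria}], None

    # `assertions` present: only the declared graders run.
    uses_criteria = any(a["type"] not in DETERMINISTIC for a in assertions)
    warning = None
    if criteria and not uses_criteria:
        warning = (
            f"Test '{test.get('id')}': criteria is defined but no grader "
            "in assertions will evaluate it."
        )
    return list(assertions), warning
```

The point of the sketch is that `criteria` is consumed only when some non-deterministic grader is present, either implicitly or by declaration.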
@@ -367,7 +367,7 @@ tests:
367
367
value: "fix"
368
368
```
369
369
370
-
When you need a custom file conversion for only one grader, add `preprocessors` directly to that evaluator:
370
+
When you need a custom file conversion for only one grader, add `preprocessors` directly to that grader:
371
371
372
372
```yaml
373
373
preprocessors:
@@ -389,7 +389,7 @@ tests:
389
389
390
390
## Metadata
391
391
392
-
Pass additional context to evaluators via the `metadata` field:
392
+
Pass additional context to graders via the `metadata` field:
| `workspace` | Suite-level workspace config — inline object or string path to an [external workspace file](/docs/guides/workspace-pool/#external-workspace-config) |
43
43
| `tests` | Array of individual tests, or a string path to an external file |
44
-
| `assertions` | Suite-level evaluators appended to each test unless `execution.skip_defaults: true` is set on the test |
44
+
| `assertions` | Suite-level graders appended to each test unless `execution.skip_defaults: true` is set on the test |
45
45
| `input` | Suite-level input messages prepended to each test's input unless `execution.skip_defaults: true` is set on the test |
46
46
47
47
### Metadata Fields
@@ -76,7 +76,7 @@ tests:
76
76
77
77
### Suite-level Assertions
78
78
79
-
The `assertions` field is the canonical way to define suite-level evaluators. Suite-level assertions are appended to every test's evaluators unless a test sets `execution.skip_defaults: true`.
79
+
The `assertions` field is the canonical way to define suite-level graders. Suite-level assertions are appended to every test's graders unless a test sets `execution.skip_defaults: true`.
80
80
81
81
```yaml
82
82
description: API response validation
@@ -92,7 +92,7 @@ tests:
92
92
input: Check API health
93
93
```
94
94
95
-
`assertions` supports all evaluator types, including deterministic assertion types (`contains`, `regex`, `is_json`, `equals`) and `rubrics`. See [Tests](/docs/evaluation/eval-cases/#per-test-assertions) for per-test assertions usage.
95
+
`assertions` supports all grader types, including deterministic assertion types (`contains`, `regex`, `is_json`, `equals`) and `rubrics`. See [Tests](/docs/evaluation/eval-cases/#per-test-assertions) for per-test assertions usage.
apps/web/src/content/docs/docs/evaluation/examples.mdx (6 additions, 6 deletions)
@@ -69,7 +69,7 @@ tests:
69
69
```
70
70
````
71
71
72
-
## Multi-Evaluator
72
+
## Multi-Grader
73
73
74
74
Combine a code grader and an LLM grader on the same test:
75
75
@@ -86,7 +86,7 @@ tests:
86
86
- name: json_format_validator
87
87
type: code-grader
88
88
command: [uv, run, validate_json.py]
89
-
cwd: ./evaluators
89
+
cwd: ./graders
90
90
- name: content_evaluator
91
91
type: llm-grader
92
92
prompt: ./graders/semantic_correctness.md
@@ -363,11 +363,11 @@ tests:
363
363
- The batch runner reads the eval YAML via `--eval` flag and outputs JSONL keyed by `id`
364
364
- Put structured data in `user.content` as objects for the runner to extract
365
365
- Use `expected_output` with object fields for structured expected output
366
-
- Each test has its own evaluator to validate its portion of the output
366
+
- Each test has its own grader to validate its portion of the output
367
367
368
368
## Suite-level Input
369
369
370
-
Share a common prompt or system instruction across all tests. Suite-level `input` messages are prepended to each test's input — like suite-level `assertions` for evaluators:
370
+
Share a common prompt or system instruction across all tests. Suite-level `input` messages are prepended to each test's input — like suite-level `assertions` for graders:
371
371
372
372
```yaml
373
373
description: Travel assistant evaluation
@@ -418,11 +418,11 @@ See the [suite-level-input example](https://github.com/EntityProcess/agentv/tree
418
418
- Show the pattern, not rigid templates
419
419
- Allow for natural language variation
420
420
- Focus on semantic correctness over exact matching
421
-
- Evaluators handle the actual validation logic
421
+
- Graders handle the actual validation logic
422
422
423
423
## Showcases
424
424
425
425
For complete end-to-end workflows that combine multiple features, see the showcases in [`examples/showcase/`](https://github.com/EntityProcess/agentv/tree/main/examples/showcase):
426
426
427
-
- **[Multi-Model Benchmark](https://github.com/EntityProcess/agentv/tree/main/examples/showcase/multi-model-benchmark)** — targets matrix × weighted metrics × trials × compare workflow. Runs the same tests against multiple models, scores with weighted evaluators, measures variability, and compares results side-by-side.
427
+
- **[Multi-Model Benchmark](https://github.com/EntityProcess/agentv/tree/main/examples/showcase/multi-model-benchmark)** — targets matrix × weighted metrics × trials × compare workflow. Runs the same tests against multiple models, scores with weighted graders, measures variability, and compares results side-by-side.
428
428
- **[Export Screening](https://github.com/EntityProcess/agentv/tree/main/examples/showcase/export-screening)** — classification eval with confusion matrix metrics and CI gating.
Write rubric criteria directly in `assertions`. If you want help choosing between plain assertions, deterministic evaluators, and rubric or LLM-based grading, use the `agentv-eval-writer` skill. Keep the evaluator choice driven by the criteria rather than one fixed recipe.
123
+
Write rubric criteria directly in `assertions`. If you want help choosing between plain assertions, deterministic graders, and rubric or LLM-based grading, use the `agentv-eval-writer` skill. Keep the grader choice driven by the criteria rather than one fixed recipe.
When using the Claude provider, AgentV injects `CC_PARENT_SPAN_ID` and `CC_ROOT_SPAN_ID` into the Claude subprocess. If the [trace-claude-code](https://github.com/braintrustdata/braintrust-claude-plugin) plugin is installed, it attaches Claude Code CLI-level tool spans (Read, Write, Bash, etc.) as children of the AgentV eval trace, giving you full visibility into both the eval framework and the agent's internal actions.
@@ -331,14 +331,14 @@ This is the same interface that agent-orchestrated evals use — the EVAL.yaml t
331
331
332
332
## Offline Grading
333
333
334
-
Grade existing agent sessions without re-running them. Import a transcript, then run deterministic evaluators:
334
+
Grade existing agent sessions without re-running them. Import a transcript, then run deterministic graders:
`defineCodeGrader` graders are referenced in YAML with `type: code-grader` and `command: [bun, run, grader.ts]`. `defineAssertion` uses convention-based discovery instead — just place in `.agentv/assertions/` and reference by name.
92
92
93
-
For detailed patterns, input/output contracts, and language-agnostic examples, see [Code Graders](/docs/evaluators/code-graders/).
93
+
For detailed patterns, input/output contracts, and language-agnostic examples, see [Code Graders](/docs/graders/code-graders/).