|
1 | 1 | --- |
2 | | -title: "Custom Evaluators" |
| 2 | +title: Custom Evaluators |
3 | 3 | weight: 3 |
4 | | -description: "Write your own scoring logic in Python, JavaScript, or any language." |
| 4 | +description: Define custom evaluation logic for agentevals when built-in metrics are not enough. |
5 | 5 | --- |
6 | 6 |
|
7 | | -Beyond the built-in metrics, you can write your own evaluators in Python, JavaScript, or any language. An evaluator is any program that reads JSON from stdin and writes a score to stdout. |
| 7 | +Custom evaluators let you add project-specific scoring logic on top of the trace data agentevals extracts. |
8 | 8 |
|
9 | | -> For the comprehensive guide, see [custom-evaluators.md](https://github.com/agentevals-dev/agentevals/blob/main/docs/custom-evaluators.md) in the repository. |
| 9 | +Use custom evaluators when: |
10 | 10 |
|
11 | | -## Scaffold an Evaluator |
| 11 | +- you need domain-specific scoring rules |
| 12 | +- built-in metrics do not capture the behavior you care about |
| 13 | +- you want deterministic checks alongside model-based judges |
| 14 | +- you want to combine trace metadata with output inspection |
12 | 15 |
|
13 | | -```bash |
14 | | -agentevals evaluator init my_evaluator |
15 | | -``` |
| 16 | +## When to use custom evaluators vs delegated backends |
16 | 17 |
|
17 | | -This creates a directory with boilerplate and a manifest: |
| 18 | +Use **custom evaluators** when the evaluation logic should live in your own codebase. |
18 | 19 |
|
19 | | -``` |
20 | | -my_evaluator/ |
21 | | -├── my_evaluator.py # your scoring logic |
22 | | -└── evaluator.yaml # metadata manifest |
23 | | -``` |
| 20 | +Use a **delegated backend** such as the [OpenAI Evals API backend](/docs/openai-evals-api/) when you want agentevals to package data and send judging to an external evaluation system. |
24 | 21 |
|
25 | | -You can also list supported runtimes and generate config snippets: |
| 22 | +## What custom evaluators operate on |
26 | 23 |
|
27 | | -```bash |
28 | | -agentevals evaluator runtimes # show supported languages |
29 | | -agentevals evaluator config my_evaluator \ |
30 | | - --path ./evaluators/my_evaluator.py # generate config snippet |
31 | | -``` |
| 24 | +Custom evaluators work on normalized data extracted from traces. In practice, that means you can reason about: |
32 | 25 |
|
33 | | -## Implement Scoring Logic |
| 26 | +- prompts and responses |
| 27 | +- tool calls and tool results |
| 28 | +- metadata attached to spans or traces |
| 29 | +- expected outputs or dataset annotations, when present |
34 | 30 |
|
35 | | -Your function receives an `EvalInput` with the agent's invocations and returns an `EvalResult` with a score between 0.0 and 1.0. |
| 31 | +The exact structure depends on your eval configuration and trace contents. |
36 | 32 |
|
37 | | -```python |
38 | | -from agentevals_evaluator_sdk import EvalInput, EvalResult, evaluator |
| 33 | +## General workflow |
39 | 34 |
|
40 | | -@evaluator |
41 | | -def my_evaluator(input: EvalInput) -> EvalResult: |
42 | | - scores = [] |
43 | | - for inv in input.invocations: |
44 | | - # Your scoring logic here |
45 | | - score = 1.0 |
46 | | - scores.append(score) |
| 35 | +1. define the eval set and metrics you want to run |
| 36 | +2. implement a Python evaluator for your scoring logic |
| 37 | +3. register or reference it from your eval configuration |
| 38 | +4. run agentevals against your trace data |
| 39 | +5. inspect the resulting scores in the CLI or UI
47 | 40 |
|
48 | | - return EvalResult( |
49 | | - score=sum(scores) / len(scores) if scores else 0.0, |
50 | | - per_invocation_scores=scores, |
51 | | - ) |
| 41 | +## Good evaluator design principles |
52 | 42 |
|
53 | | -if __name__ == "__main__": |
54 | | - my_evaluator.run() |
55 | | -``` |
| 43 | +A strong custom evaluator is usually: |
56 | 44 |
|
57 | | -Install the SDK standalone with `pip install agentevals-evaluator-sdk` (no heavy dependencies). |
| 45 | +- **focused** on one behavior or failure mode |
| 46 | +- **repeatable** so results are easy to compare over time |
| 47 | +- **well-named** so metrics are readable in reports |
| 48 | +- **trace-aware** so it relies on durable attributes instead of brittle formatting assumptions |
58 | 49 |
|
59 | | -## Reference in Eval Config |
| 50 | +## Common patterns |
60 | 51 |
|
61 | | -```yaml |
62 | | -# eval_config.yaml |
63 | | -evaluators: |
64 | | - - name: tool_trajectory_avg_score |
65 | | - type: builtin |
| 52 | +### Deterministic checks |
66 | 53 |
|
67 | | - - name: my_evaluator |
68 | | - type: code |
69 | | - path: ./evaluators/my_evaluator.py |
70 | | - threshold: 0.7 |
71 | | -``` |
| 54 | +Examples: |
72 | 55 |
|
73 | | -```bash |
74 | | -agentevals run trace.json --config eval_config.yaml --eval-set eval_set.json |
75 | | -``` |
| 56 | +- required tool was called |
| 57 | +- forbidden tool was not called |
| 58 | +- final answer included a required field |
| 59 | +- workflow completed within a step limit |
76 | 60 |
|
77 | | -## Community Evaluators |
| 61 | +### Rubric-based scoring |
78 | 62 |
|
79 | | -Community evaluators can be referenced directly from the shared [evaluators repository](https://github.com/agentevals-dev/evaluators) using `type: remote`: |
| 63 | +Examples: |
80 | 64 |
|
81 | | -```yaml |
82 | | -evaluators: |
83 | | - - name: response_quality |
84 | | - type: remote |
85 | | - source: github |
86 | | - ref: evaluators/response_quality/response_quality.py |
87 | | - threshold: 0.7 |
88 | | - config: |
89 | | - min_response_length: 20 |
90 | | -``` |
| 65 | +- answer relevance |
| 66 | +- factual grounding against context |
| 67 | +- adherence to response format |
| 68 | +- success at completing a user task |
91 | 69 |
|
92 | | -Browse available community evaluators on the [Evaluators](/evaluators/) page, or contribute your own. |
| 70 | +### Hybrid scoring |
93 | 71 |
|
94 | | -## Supported Languages |
| 72 | +Many teams combine deterministic checks with model-based judging. For example: |
95 | 73 |
|
96 | | -Evaluators can be written in any language that reads JSON from stdin and writes JSON to stdout. |
| 74 | +- fail if a critical tool call is missing |
| 75 | +- otherwise apply a quality rubric score |
97 | 76 |
|
98 | | -| Language | Extension | SDK available | |
99 | | -|---|---|---| |
100 | | -| Python | `.py` | `pip install agentevals-evaluator-sdk` | |
101 | | -| JavaScript | `.js` | No SDK yet — just read stdin, write stdout | |
102 | | -| TypeScript | `.ts` | No SDK yet — just read stdin, write stdout | |
| 77 | +## Related docs |
103 | 78 |
|
104 | | -## Further Reading |
| 79 | +- [Eval Set Format](/docs/eval-set-format/) |
| 80 | +- [OTel Compatibility](/docs/otel-compatibility/) |
| 81 | +- [OpenAI Evals API backend](/docs/openai-evals-api/) |
| 82 | +- [Streaming](/docs/streaming/) |
105 | 83 |
|
106 | | -- [Custom Evaluators Guide](https://github.com/agentevals-dev/agentevals/blob/main/docs/custom-evaluators.md) — Full protocol reference |
107 | | -- [Community Evaluators](/evaluators/) — Browse and submit evaluators |
108 | | -- [Eval Set Format](https://github.com/agentevals-dev/agentevals/blob/main/docs/eval-set-format.md) — Schema and field reference for eval set JSON files |
| 84 | +## Recommendation |
| 85 | + |
| 86 | +Start with the smallest evaluator that captures a real product risk. Add more evaluators only when they create a clear signal you intend to track over time. |
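
The "Hybrid scoring" pattern described in the added text (fail on a missing critical tool call, otherwise apply a rubric score) might look roughly like the sketch below. It is an illustration only and does not use the agentevals SDK: the `Invocation` dataclass, the `tool_calls` field, the toy length-based rubric, and the returned dictionary keys are all hypothetical stand-ins chosen for this example. A real evaluator would read whatever structure the eval configuration and trace contents actually provide.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Invocation:
    """Hypothetical stand-in for one agent turn extracted from a trace."""
    response: str
    tool_calls: list[str] = field(default_factory=list)


def required_tool_called(inv: Invocation, tool: str) -> bool:
    """Deterministic check: was the named tool called during this turn?"""
    return tool in inv.tool_calls


def length_rubric(inv: Invocation, min_chars: int = 20) -> float:
    """Toy rubric (an assumption for this sketch): scale the score by
    response length, capped at 1.0. A model-based judge could replace it."""
    return min(len(inv.response.strip()) / min_chars, 1.0)


def hybrid_score(
    inv: Invocation,
    required_tool: str,
    rubric: Callable[[Invocation], float] = length_rubric,
) -> float:
    """Return 0.0 if the critical tool call is missing, else the rubric score."""
    if not required_tool_called(inv, required_tool):
        return 0.0
    return rubric(inv)


def evaluate(invocations: list[Invocation], required_tool: str) -> dict:
    """Average the per-invocation hybrid scores; an empty input scores 0.0."""
    scores = [hybrid_score(inv, required_tool) for inv in invocations]
    overall = sum(scores) / len(scores) if scores else 0.0
    return {"score": overall, "per_invocation_scores": scores}


if __name__ == "__main__":
    sample = [
        Invocation("The refund was issued to your card.", ["lookup_order", "issue_refund"]),
        Invocation("Done.", ["lookup_order"]),  # critical tool call missing
    ]
    print(evaluate(sample, required_tool="issue_refund"))
```

Keeping the deterministic gate separate from the rubric means either half can be swapped or tested on its own, which fits the "focused" and "repeatable" principles listed in the added text.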