Commit d68461a

Merge pull request #10 from agentevals-dev/docs/v0-6-3-refresh
docs: refresh website for v0.6.3 features
2 parents: e87a790 + 97b797d

11 files changed

Lines changed: 468 additions & 342 deletions

content/docs/advanced.md

Lines changed: 50 additions & 25 deletions
@@ -1,36 +1,61 @@
 ---
-title: "Advanced"
-weight: 5
-description: "Deep-dive documentation, REST API, and development setup."
+title: Advanced
+weight: 2
+description: Advanced usage patterns for evaluation backends, deployment, trace compatibility, and scaling agentevals.
 ---
 
-## Docs
+This guide summarizes the main advanced building blocks in agentevals and points to the deeper reference pages.
 
-| Guide | Description |
-|-------|-------------|
-| [Eval Set Format](https://github.com/agentevals-dev/agentevals/blob/main/docs/eval-set-format.md) | Schema, field reference, and examples for golden eval set JSON files |
-| [Custom Evaluators](https://github.com/agentevals-dev/agentevals/blob/main/docs/custom-evaluators.md) | Write your own scoring logic in Python, JavaScript, or any language |
-| [Live Streaming](https://github.com/agentevals-dev/agentevals/blob/main/docs/streaming.md) | Real-time trace streaming, dev server setup, and session management |
-| [OpenTelemetry Compatibility](https://github.com/agentevals-dev/agentevals/blob/main/docs/otel-compatibility.md) | Supported OTel conventions, message delivery mechanisms, and OTLP receiver |
+## Evaluation architecture
 
-## REST API Reference
+agentevals evaluates agent behavior from OpenTelemetry traces instead of replaying the agent.
 
-While the server is running (`agentevals serve`), interactive API documentation is available at:
+Depending on your needs, you can combine:
 
-| Endpoint | Description |
-|----------|-------------|
-| [`/docs`](http://localhost:8001/docs) | Swagger UI with interactive request builder |
-| [`/redoc`](http://localhost:8001/redoc) | ReDoc reference documentation |
-| [`/openapi.json`](http://localhost:8001/openapi.json) | Raw OpenAPI 3.x schema (for code generation or CI) |
+- **built-in metrics** for fast trace-native scoring
+- **custom evaluators** for Python-defined logic tailored to your app
+- **delegated backends** when you want an external system to judge outputs
 
-The OTLP receiver (port 4318) serves its own docs at `http://localhost:4318/docs`.
+The initial delegated option is the [OpenAI Evals API backend](/docs/openai-evals-api/).
 
-## Development
+## Deployment patterns
 
-```bash
-uv run pytest # run tests
-uv run agentevals serve --dev # backend
-cd ui && npm run dev # frontend (separate terminal)
-```
+agentevals can run:
 
-See [DEVELOPMENT.md](https://github.com/agentevals-dev/agentevals/blob/main/DEVELOPMENT.md) for build tiers, Makefile targets, and Nix setup. To contribute, see [CONTRIBUTING.md](https://github.com/agentevals-dev/agentevals/blob/main/CONTRIBUTING.md).
+- locally during development
+- in containers for reproducible environments
+- on Kubernetes using the project Helm chart
+
+For cluster deployment details, configuration knobs, and install examples, see [Kubernetes & Helm](/docs/kubernetes-helm/).
+
+## Trace model and compatibility
+
+The quality of evaluation depends on the shape and completeness of your traces.
+
+If your agent framework emits OpenTelemetry data with different conventions, review [OTel Compatibility](/docs/otel-compatibility/) to understand what agentevals expects and how to adapt inputs.
+
+## Eval definitions
+
+As eval setups grow, it helps to standardize how datasets, evaluators, and metadata are represented.
+
+See [Eval Set Format](/docs/eval-set-format/) for the structure used by agentevals.
+
+## Live and incremental processing
+
+If you want to evaluate data continuously rather than in one batch, see [Streaming](/docs/streaming/).
+
+## Extending agentevals
+
+If built-in metrics are not enough, use [Custom Evaluators](/docs/custom-evaluators/) to implement project-specific scoring logic.
+
+## Recommended reading order
+
+For teams adopting newer v0.6.3 capabilities, this is a good progression:
+
+1. [Quick Start](/docs/quick-start/)
+2. [Eval Set Format](/docs/eval-set-format/)
+3. [Custom Evaluators](/docs/custom-evaluators/)
+4. [OpenAI Evals API backend](/docs/openai-evals-api/)
+5. [Kubernetes & Helm](/docs/kubernetes-helm/)
+6. [OTel Compatibility](/docs/otel-compatibility/)
+7. [Streaming](/docs/streaming/)

content/docs/custom-evaluators.md

Lines changed: 55 additions & 77 deletions
Original file line numberDiff line numberDiff line change
@@ -1,108 +1,86 @@
11
---
2-
title: "Custom Evaluators"
2+
title: Custom Evaluators
33
weight: 3
4-
description: "Write your own scoring logic in Python, JavaScript, or any language."
4+
description: Define custom evaluation logic for agentevals when built-in metrics are not enough.
55
---
66

7-
Beyond the built-in metrics, you can write your own evaluators in Python, JavaScript, or any language. An evaluator is any program that reads JSON from stdin and writes a score to stdout.
7+
Custom evaluators let you add project-specific scoring logic on top of the trace data agentevals extracts.
88

9-
> For the comprehensive guide, see [custom-evaluators.md](https://github.com/agentevals-dev/agentevals/blob/main/docs/custom-evaluators.md) in the repository.
9+
Use custom evaluators when:
1010

11-
## Scaffold an Evaluator
11+
- you need domain-specific scoring rules
12+
- built-in metrics do not capture the behavior you care about
13+
- you want deterministic checks alongside model-based judges
14+
- you want to combine trace metadata with output inspection
1215

13-
```bash
14-
agentevals evaluator init my_evaluator
15-
```
16+
## When to use custom evaluators vs delegated backends
1617

17-
This creates a directory with boilerplate and a manifest:
18+
Use **custom evaluators** when the evaluation logic should live in your own codebase.
1819

19-
```
20-
my_evaluator/
21-
├── my_evaluator.py # your scoring logic
22-
└── evaluator.yaml # metadata manifest
23-
```
20+
Use a **delegated backend** such as the [OpenAI Evals API backend](/docs/openai-evals-api/) when you want agentevals to package data and send judging to an external evaluation system.
2421

25-
You can also list supported runtimes and generate config snippets:
22+
## What custom evaluators operate on
2623

27-
```bash
28-
agentevals evaluator runtimes # show supported languages
29-
agentevals evaluator config my_evaluator \
30-
--path ./evaluators/my_evaluator.py # generate config snippet
31-
```
24+
Custom evaluators work on normalized data extracted from traces. In practice, that means you can reason about:
3225

33-
## Implement Scoring Logic
26+
- prompts and responses
27+
- tool calls and tool results
28+
- metadata attached to spans or traces
29+
- expected outputs or dataset annotations, when present
3430

35-
Your function receives an `EvalInput` with the agent's invocations and returns an `EvalResult` with a score between 0.0 and 1.0.
31+
The exact structure depends on your eval configuration and trace contents.
3632

37-
```python
38-
from agentevals_evaluator_sdk import EvalInput, EvalResult, evaluator
33+
## General workflow
3934

40-
@evaluator
41-
def my_evaluator(input: EvalInput) -> EvalResult:
42-
scores = []
43-
for inv in input.invocations:
44-
# Your scoring logic here
45-
score = 1.0
46-
scores.append(score)
35+
1. define the eval set and metrics you want to run
36+
2. implement a Python evaluator for your scoring logic
37+
3. register or reference it from your eval configuration
38+
4. run agentevals against your trace data
39+
5. inspect the resulting scores in CLI or UI
4740

48-
return EvalResult(
49-
score=sum(scores) / len(scores) if scores else 0.0,
50-
per_invocation_scores=scores,
51-
)
41+
## Good evaluator design principles
5242

53-
if __name__ == "__main__":
54-
my_evaluator.run()
55-
```
43+
A strong custom evaluator is usually:
5644

57-
Install the SDK standalone with `pip install agentevals-evaluator-sdk` (no heavy dependencies).
45+
- **focused** on one behavior or failure mode
46+
- **repeatable** so results are easy to compare over time
47+
- **well-named** so metrics are readable in reports
48+
- **trace-aware** so it relies on durable attributes instead of brittle formatting assumptions
5849

59-
## Reference in Eval Config
50+
## Common patterns
6051

61-
```yaml
62-
# eval_config.yaml
63-
evaluators:
64-
- name: tool_trajectory_avg_score
65-
type: builtin
52+
### Deterministic checks
6653

67-
- name: my_evaluator
68-
type: code
69-
path: ./evaluators/my_evaluator.py
70-
threshold: 0.7
71-
```
54+
Examples:
7255

73-
```bash
74-
agentevals run trace.json --config eval_config.yaml --eval-set eval_set.json
75-
```
56+
- required tool was called
57+
- forbidden tool was not called
58+
- final answer included a required field
59+
- workflow completed within a step limit
7660

77-
## Community Evaluators
61+
### Rubric-based scoring
7862

79-
Community evaluators can be referenced directly from the shared [evaluators repository](https://github.com/agentevals-dev/evaluators) using `type: remote`:
63+
Examples:
8064

81-
```yaml
82-
evaluators:
83-
- name: response_quality
84-
type: remote
85-
source: github
86-
ref: evaluators/response_quality/response_quality.py
87-
threshold: 0.7
88-
config:
89-
min_response_length: 20
90-
```
65+
- answer relevance
66+
- factual grounding against context
67+
- adherence to response format
68+
- success at completing a user task
9169

92-
Browse available community evaluators on the [Evaluators](/evaluators/) page, or contribute your own.
70+
### Hybrid scoring
9371

94-
## Supported Languages
72+
Many teams combine deterministic checks with model-based judging. For example:
9573

96-
Evaluators can be written in any language that reads JSON from stdin and writes JSON to stdout.
74+
- fail if a critical tool call is missing
75+
- otherwise apply a quality rubric score
9776

98-
| Language | Extension | SDK available |
99-
|---|---|---|
100-
| Python | `.py` | `pip install agentevals-evaluator-sdk` |
101-
| JavaScript | `.js` | No SDK yet — just read stdin, write stdout |
102-
| TypeScript | `.ts` | No SDK yet — just read stdin, write stdout |
77+
## Related docs
10378

104-
## Further Reading
79+
- [Eval Set Format](/docs/eval-set-format/)
80+
- [OTel Compatibility](/docs/otel-compatibility/)
81+
- [OpenAI Evals API backend](/docs/openai-evals-api/)
82+
- [Streaming](/docs/streaming/)
10583

106-
- [Custom Evaluators Guide](https://github.com/agentevals-dev/agentevals/blob/main/docs/custom-evaluators.md) — Full protocol reference
107-
- [Community Evaluators](/evaluators/) — Browse and submit evaluators
108-
- [Eval Set Format](https://github.com/agentevals-dev/agentevals/blob/main/docs/eval-set-format.md) — Schema and field reference for eval set JSON files
84+
## Recommendation
85+
86+
Start with the smallest evaluator that captures a real product risk. Add more evaluators only when they create a clear signal you intend to track over time.
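
The "Hybrid scoring" pattern in the new custom-evaluators.md text (fail if a critical tool call is missing, otherwise apply a rubric score) could be sketched roughly as below. This is a minimal illustration, not the agentevals SDK: invocations are assumed to be plain dicts with hypothetical `tool_calls` and `response` keys, and `hybrid_score`, `required_tool_called`, and `length_rubric` are made-up names. A real evaluator would read whatever structure agentevals extracts from the trace.

```python
from typing import Callable

# Hypothetical invocation shape for illustration only; this is not the
# agentevals SDK schema.
Invocation = dict


def required_tool_called(invocation: Invocation, tool_name: str) -> bool:
    """Deterministic check: was the required tool called at least once?"""
    return any(
        call.get("name") == tool_name
        for call in invocation.get("tool_calls", [])
    )


def hybrid_score(
    invocations: list[Invocation],
    required_tool: str,
    rubric: Callable[[Invocation], float],
) -> float:
    """Score 0.0 when the critical tool call is missing, else the rubric score.

    Returns the mean over invocations, or 0.0 for an empty list.
    """
    scores = []
    for inv in invocations:
        if not required_tool_called(inv, required_tool):
            scores.append(0.0)  # hard fail: critical tool call missing
        else:
            # Clamp the rubric result into [0.0, 1.0].
            scores.append(min(1.0, max(0.0, rubric(inv))))
    return sum(scores) / len(scores) if scores else 0.0


def length_rubric(invocation: Invocation) -> float:
    """Stand-in rubric: full credit for a non-empty response."""
    return 1.0 if invocation.get("response", "").strip() else 0.0
```

Running the deterministic gate first keeps the hard failure cheap and repeatable; the rubric (which might be a model-based judge in practice) is only consulted for invocations that pass it.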

content/docs/eval-set-format.md

Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,43 @@
1+
---
2+
title: Eval Set Format
3+
weight: 6
4+
description: The structure agentevals uses to define evaluation datasets, metadata, and scoring inputs.
5+
---
6+
7+
Eval sets provide a repeatable way to organize the inputs and metadata used during evaluation.
8+
9+
## What an eval set is
10+
11+
An eval set typically describes:
12+
13+
- the items or examples being evaluated
14+
- metadata associated with those items
15+
- expected outputs, labels, or references when available
16+
- which evaluators or metrics should be applied
17+
18+
This gives teams a stable structure for comparing results over time.
19+
20+
## Why it matters
21+
22+
A clear eval set format helps you:
23+
24+
- keep evaluation runs consistent
25+
- compare changes across model or agent versions
26+
- connect trace-derived behavior to dataset-level expectations
27+
- share evaluation definitions across local, CI, and Kubernetes environments
28+
29+
## Practical guidance
30+
31+
When designing an eval set:
32+
33+
- keep identifiers stable
34+
- store expected outputs or labels only when they are genuinely part of the task
35+
- attach metadata that is useful for slicing results later
36+
- avoid overloading one eval set with too many unrelated behaviors
37+
38+
## Related docs
39+
40+
- [Quick Start](/docs/quick-start/)
41+
- [Custom Evaluators](/docs/custom-evaluators/)
42+
- [OpenAI Evals API backend](/docs/openai-evals-api/)
43+
- [Streaming](/docs/streaming/)
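
The "Practical guidance" points in the new eval-set-format.md text (stable identifiers, metadata that is useful for slicing) can be illustrated with a small sketch. The item layout below is an assumption made for the example; the page does not spell out a schema, so `id`, `metadata`, `category`, and `mean_score_by` are all hypothetical names rather than agentevals fields or functions.

```python
from collections import defaultdict

# Hypothetical eval items; the field names below are illustrative and are
# not the agentevals eval set schema.
eval_items = [
    {"id": "refund-001", "metadata": {"category": "billing"}},
    {"id": "refund-002", "metadata": {"category": "billing"}},
    {"id": "reset-001", "metadata": {"category": "account"}},
]

# Scores keyed by the same stable ids.
scores = {"refund-001": 1.0, "refund-002": 0.5, "reset-001": 0.0}


def mean_score_by(items, scores, key):
    """Average scores per metadata value, skipping items without a score."""
    buckets = defaultdict(list)
    for item in items:
        if item["id"] in scores:
            value = item["metadata"].get(key, "unknown")
            buckets[value].append(scores[item["id"]])
    return {value: sum(vals) / len(vals) for value, vals in buckets.items()}
```

Because scores are joined to items by id, renaming an id between runs would silently drop that item from the comparison, which is one reason to keep identifiers stable.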

content/docs/faq.md

Lines changed: 15 additions & 33 deletions
Original file line numberDiff line numberDiff line change
@@ -1,47 +1,29 @@
11
---
2-
title: "FAQ"
3-
weight: 6
4-
description: "Frequently asked questions about AgentEvals."
2+
title: FAQ
3+
weight: 10
4+
description: Frequently asked questions about agentevals.
55
---
66

7-
## How does this compare to ADK's evaluations?
7+
## Does agentevals re-run my agent?
88

9-
Unlike ADK's LocalEvalService, which couples agent execution with evaluation, agentevals only handles scoring: it takes pre-recorded traces and compares them against expected behavior using metrics like tool trajectory matching, response quality, and LLM-based judgments.
9+
No. agentevals is built to score behavior from OpenTelemetry traces without re-running the agent.
1010

11-
However, if you're iterating on your agents locally, you can point your agents to agentevals and you will see rich runtime information in your browser. For more details, use the bundled wheel and explore the Local Development option in the UI.
11+
## What kind of telemetry does agentevals use?
1212

13-
## How does this compare to Bedrock AgentCore's evaluation?
13+
agentevals works from OpenTelemetry trace data emitted by your agent system. See [OTel Compatibility](/docs/otel-compatibility/) for more details.
1414

15-
AgentCore's evaluation integration (via `strands-agents-evals`) also couples agent execution with evaluation. It re-invokes the agent for each test case, converts the resulting OTel spans to AWS's ADOT format, and scores them against 4 built-in evaluators (Helpfulness, Accuracy, Harmfulness, Relevance) via a cloud API call. This means you need an AWS account, valid credentials, and network access for every evaluation.
15+
## Can I write my own evaluators?
1616

17-
agentevals takes a different approach: it scores pre-recorded traces locally without re-running anything. It works with standard Jaeger JSON and OTLP formats from any framework, supports open-ended metrics (tool trajectory matching, LLM-based judges, custom scorers), and ships with a CLI and web UI. No cloud dependency required.
17+
Yes. See [Custom Evaluators](/docs/custom-evaluators/).
1818

19-
## What trace formats are supported?
19+
## Can agentevals use external judging backends?
2020

21-
AgentEvals supports **OTLP** (OpenTelemetry Protocol) with `http/protobuf` and `http/json`, plus **Jaeger JSON** trace exports. Works with any OTel-instrumented framework including LangChain, Strands, Google ADK, and others.
21+
Yes. agentevals now includes an initial option to delegate evals to OpenAI's Evals API. See [OpenAI Evals API backend](/docs/openai-evals-api/).
2222

23-
## Do I need to re-run my agent to evaluate it?
23+
## Can I deploy agentevals on Kubernetes?
2424

25-
No. Record once, score as many times as you want. AgentEvals evaluates from existing traces, so you never need to replay expensive LLM calls.
25+
Yes. The project now includes container deployment support and a Helm chart for Kubernetes. See [Kubernetes & Helm](/docs/kubernetes-helm/).
2626

27-
## What frameworks are supported?
27+
## Is agentevals only for batch processing?
2828

29-
Any framework that emits OpenTelemetry spans works out of the box. This includes **LangChain**, **Strands**, **Google ADK**, and any other OTel-instrumented framework. The zero-code integration requires no SDK — just point your agent's OTel exporter to agentevals.
30-
31-
## Can I write custom evaluators?
32-
33-
Yes. Evaluators can be written in Python, JavaScript, or any language that reads JSON from stdin and writes JSON to stdout. See the [Custom Evaluators](/docs/custom-evaluators/) page for details.
34-
35-
A Python SDK is available (`pip install agentevals-evaluator-sdk`) for convenience, but it's not required.
36-
37-
## Can I use this in CI/CD?
38-
39-
Absolutely. The CLI is designed for CI integration. Use `--output json` for machine-readable results. See the [CLI & CI/CD section](/docs/integrations/#cli--cicd) for a GitHub Actions example.
40-
41-
## Is there a community evaluator registry?
42-
43-
Yes. Browse community-contributed evaluators on the [Evaluators](/evaluators/) page, or contribute your own to the [evaluators repository](https://github.com/agentevals-dev/evaluators).
44-
45-
## Is AgentEvals open source?
46-
47-
Yes. AgentEvals is open source and available on [GitHub](https://github.com/agentevals-dev/agentevals). Contributions are welcome!
29+
No. There is also support for streaming-oriented workflows. See [Streaming](/docs/streaming/).
