Commit dc4b9b1

improve README
1 parent da914a2 commit dc4b9b1

1 file changed

Lines changed: 73 additions & 10 deletions

README.md

@@ -35,10 +35,11 @@ agentevals scores performance and inference quality from OpenTelemetry traces. N
3535

3636
agentevals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want. No re-runs, no guesswork.
3737

38-
It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, OpenAI Agents SDK, and others), supports Jaeger JSON and OTLP trace formats, and ships with built-in evaluators, custom evaluator support, and LLM-based judges.
38+
It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, OpenAI Agents SDK, and others), supports Jaeger JSON and native OTLP trace formats, and ships with built-in evaluators, custom evaluator support, and LLM-based judges.
3939

4040
- **CLI** for scripting and CI pipelines
4141
- **Web UI** for visual inspection and local developer experience
42+
- **Kubernetes and OTel support** so you can deploy right next to your agents; works natively in your OpenTelemetry pipeline
4243
- **MCP server** so MCP clients can run evaluations from a conversation
4344

4445
## Why agentevals?
@@ -48,7 +49,7 @@ Most evaluation tools require you to **re-execute your agent** for every test, b
4849
- **No re-execution**: score agents from existing traces without replaying expensive LLM calls
4950
- **Framework-agnostic**: works with any agent framework that emits OpenTelemetry spans
5051
- **Golden eval sets**: compare actual behavior against defined expected behaviors for deterministic pass/fail gating
51-
- **Custom evaluators**: write scoring logic in Python, JavaScript, or any language
52+
- **Custom evaluators**: write scoring logic in Python, JavaScript, or any other language, or offload scoring to the OpenAI Evals API
5253
- **CI/CD ready**: gate deployments on quality thresholds directly in your pipeline
5354
- **Local-first**: no cloud dependency required; everything runs on your machine
5455

@@ -110,25 +111,75 @@ See [DEVELOPMENT.md](DEVELOPMENT.md) for build instructions.
110111

111112
Examples use `agentevals` on your PATH after `pip install agentevals-cli`. If you are working from a clone of this repo, use `uv run agentevals` instead.
112113

113-
Run an evaluation against a sample trace:
114+
The `samples/` directory includes real traces from a Kubernetes Helm agent and matching eval sets that define expected behavior (which tools should be called, what the response should contain).
115+
116+
**Score a trace against an eval set:**
114117

115118
```bash
116119
agentevals run samples/helm.json \
117120
--eval-set samples/eval_set_helm.json \
118121
-m tool_trajectory_avg_score
119122
```
120123

121-
List available evaluators:
124+
The agent was asked to list Helm releases. The eval set expects a call to `helm_list_releases`. It matches:
125+
126+
```
127+
Trace: 3e289017fe03ffd7c4145316d2eb3d0d
128+
Invocations: 1
129+
Metric Score Status Per-Invocation Time
130+
------ ------------------------- ------- -------- ---------------- ------
131+
[PASS] tool_trajectory_avg_score 1 PASSED 1 0ms
132+
```
133+
134+
**Catch a mismatch.** Run a different trace against the same eval set:
122135

123136
```bash
124-
agentevals evaluator list
137+
agentevals run samples/k8s.json \
138+
--eval-set samples/eval_set_helm.json \
139+
-m tool_trajectory_avg_score
125140
```
126141

127-
## Integration
142+
This trace is from a different agent session that never called the expected tool. The evaluation fails:
143+
144+
```
145+
[FAIL] tool_trajectory_avg_score 0 FAILED 0 0ms
146+
Invocation 1 trajectory mismatch:
147+
Expected:
148+
- helm_list_releases({})
149+
Actual:
150+
(none)
151+
```
152+
153+
**Evaluate multiple dimensions at once:**
154+
155+
```bash
156+
agentevals run samples/helm_3.json \
157+
--eval-set samples/evalset_helm_3_2026-02-23.json \
158+
-m tool_trajectory_avg_score \
159+
-m response_match_score
160+
```
161+
162+
`tool_trajectory_avg_score` checks whether the right tools were called. `response_match_score` checks whether the agent's final answer matches the expected response.
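
To make the idea behind a trajectory metric concrete, here is a minimal sketch. It is not the implementation agentevals ships: it assumes an exact, in-order comparison of tool names and arguments, averaged over invocations, and the names `ToolCall`, `invocation_matches`, and `trajectory_avg_score` are hypothetical.

```python
from typing import Any

# Hypothetical representation of one tool call: (tool name, arguments).
ToolCall = tuple[str, dict[str, Any]]


def invocation_matches(expected: list[ToolCall], actual: list[ToolCall]) -> bool:
    """Assumed matching rule: same tools, same arguments, same order."""
    return expected == actual


def trajectory_avg_score(
    expected: list[list[ToolCall]], actual: list[list[ToolCall]]
) -> float:
    """Average the per-invocation result: 1.0 for a match, 0.0 otherwise."""
    if not expected:
        return 0.0
    scores = []
    for index, expected_calls in enumerate(expected):
        # A missing invocation in the trace counts as "no tools called".
        actual_calls = actual[index] if index < len(actual) else []
        matched = invocation_matches(expected_calls, actual_calls)
        scores.append(1.0 if matched else 0.0)
    return sum(scores) / len(scores)
```

Under these assumptions, an expected `helm_list_releases({})` call that appears in the trace scores 1.0, and a trace with no tool calls scores 0.0, which mirrors the pass and fail outputs shown earlier.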
163+
164+
**Explore visually.** Launch the Web UI and upload traces from the browser:
165+
166+
```bash
167+
agentevals serve
168+
# opens http://localhost:8001
169+
```
170+
171+
You can also point any OTel-instrumented agent directly at the built-in receiver (`OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318`). The UI streams tool calls, inputs, and outputs live as your agent runs. For production setups, the same receiver slots into a Kubernetes OTel Collector pipeline as an exporter destination. See [Integrations](#use-cases-and-integrations) and the [Kubernetes example](examples/kubernetes/README.md) for walkthroughs.
172+
173+
**Next steps:**
174+
175+
- `agentevals evaluator list` to see all built-in and community evaluators
176+
- [Custom Evaluators](#custom-evaluators) to write your own scoring logic
177+
178+
## Use cases and integrations
128179

129180
### Zero-Code (Recommended)
130181

131-
Point any OTel-instrumented agent at the receiver. No SDK, no code changes:
182+
Point any OTel-instrumented agent at the agentevals receiver. No SDK, no code changes:
132183

133184
```bash
134185
# Terminal 1
@@ -151,7 +202,7 @@ Traces stream to the UI in real-time. Works with LangChain, Strands, Google ADK,
151202

152203
See [examples/zero-code-examples/](examples/zero-code-examples/) for working examples.
153204

154-
### SDK
205+
### AgentEvals SDK
155206

156207
For programmatic session lifecycle and decorator API:
157208

@@ -166,7 +217,7 @@ with app.session(eval_set_id="my-eval"):
166217

167218
Requires `pip install "agentevals-cli[streaming]"`. See [examples/sdk_example/](examples/sdk_example/) for framework-specific patterns.
168219

169-
## CLI
220+
## CLI for local testing and CI pipelines
170221

171222
```bash
172223
# Multiple traces, JSON output
@@ -311,41 +362,53 @@ See [DEVELOPMENT.md](DEVELOPMENT.md) for build tiers, Makefile targets, and Nix
311362
## FAQ
312363

313364
**Do I need a database or any infrastructure to run agentevals?**
365+
314366
No. agentevals is a single `pip install` with no database, no message queue, and no external services. The CLI evaluates trace files directly from disk. The web UI and live streaming use in-memory session state. You can go from zero to scored traces in under a minute.
315367

316368
**Does the CLI require a running server?**
369+
317370
No. `agentevals run` evaluates trace files entirely offline. The server (`agentevals serve`) is only needed for the web UI, live OTLP streaming, and server-dependent MCP tools like `list_sessions`.
318371

319372
**Can I use agentevals in CI/CD?**
373+
320374
Yes. The CLI is designed for pipeline use: pass trace files and an eval set, set a threshold, and let the exit code gate your deployment. Combine it with `--output json` for machine-readable results. No server process needed.
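
A pipeline step can be as small as the following sketch. It uses only the flags shown in this README (`run`, `--eval-set`, `-m`, `--output json`) and assumes the CLI exits non-zero when an evaluation fails; the helper names `build_command` and `gate` are hypothetical.

```python
import subprocess
import sys


def build_command(trace: str, eval_set: str, metrics: list[str]) -> list[str]:
    """Assemble the agentevals invocation from flags documented above."""
    command = ["agentevals", "run", trace, "--eval-set", eval_set]
    for metric in metrics:
        command += ["-m", metric]
    command += ["--output", "json"]
    return command


def gate(trace: str, eval_set: str, metrics: list[str]) -> int:
    """Run the evaluation and return its exit code for the pipeline to act on."""
    result = subprocess.run(
        build_command(trace, eval_set, metrics),
        capture_output=True,
        text=True,
    )
    # Forward the machine-readable report so the CI log keeps it.
    sys.stdout.write(result.stdout)
    sys.stderr.write(result.stderr)
    return result.returncode
```

A CI script would end with something like `sys.exit(gate("samples/helm.json", "samples/eval_set_helm.json", ["tool_trajectory_avg_score"]))`, so a failing evaluation fails the job.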
321375

322376
**What if I switch agent frameworks?**
377+
323378
Because agentevals uses OpenTelemetry as its universal interface, switching frameworks (e.g., from LangChain to Strands, or from ADK to OpenAI Agents) does not require changing your evaluation setup. As long as your new framework emits OTel spans, the same eval sets and metrics work as before.
324379

325380
**Can I write evaluators in my own language?**
381+
326382
Yes. A custom evaluator is any program that reads JSON from stdin and writes a score to stdout. Python and JavaScript have first-class scaffolding support (`agentevals evaluator init`), but any language works. If your evaluator has a `requirements.txt`, agentevals manages a cached virtual environment automatically.
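
As a rough illustration of the stdin/stdout contract, the sketch below scores the fraction of expected tools that were called. The input fields (`expected_tools`, `tool_calls`) and the `{"score": ...}` output shape are assumptions made for this example, not the documented schema; the scaffold generated by `agentevals evaluator init` is the authoritative starting point.

```python
import json
import sys


def score(payload: dict) -> float:
    """Fraction of expected tools that appear among the called tools.

    Assumed (hypothetical) input shape:
      {"expected_tools": ["name", ...], "tool_calls": [{"name": "..."}, ...]}
    """
    expected = payload.get("expected_tools", [])
    called = {call.get("name") for call in payload.get("tool_calls", [])}
    if not expected:
        return 1.0  # nothing was required, so nothing was missed
    hits = sum(1 for name in expected if name in called)
    return hits / len(expected)


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Read one JSON document from stdin and write a JSON score to stdout."""
    payload = json.load(stdin)
    json.dump({"score": score(payload)}, stdout)
    stdout.write("\n")


# When saved as a standalone script, finish with:
#   if __name__ == "__main__":
#       main()
```

Keeping `score` separate from the I/O in `main` makes the scoring logic easy to unit test without piping data through a process.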
327383

328384
**Can I plug agentevals into an existing OTel pipeline?**
385+
329386
Yes. The OTLP receiver on port 4318 accepts standard `http/protobuf` and `http/json` trace exports, so it slots into any OpenTelemetry pipeline as just another exporter destination. If your pipeline uses gRPC (port 4317), place an [OTel Collector](https://opentelemetry.io/docs/collector/) in front to bridge gRPC to HTTP. The [Kubernetes example](examples/kubernetes/README.md) shows this exact pattern.
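
For a quick connectivity check without an SDK, a hand-built `http/json` export might look like the sketch below. It assumes the receiver listens on the standard OTLP/HTTP traces path (`/v1/traces`) and that `agentevals serve` is running locally; the function names, the scope name, and the span contents are made up for illustration.

```python
import json
import secrets
import time
import urllib.request


def build_otlp_payload(service_name: str, span_name: str) -> dict:
    """Build a minimal OTLP/JSON trace export containing a single span."""
    now = time.time_ns()
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-example"},  # hypothetical scope name
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(now),
                    "endTimeUnixNano": str(now + 1_000_000),
                }],
            }],
        }],
    }


def export(payload: dict, endpoint: str = "http://localhost:4318") -> int:
    """POST the payload as http/json and return the HTTP status code."""
    request = urllib.request.Request(
        f"{endpoint}/v1/traces",  # assumed: standard OTLP/HTTP path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

In practice an OTel SDK or Collector exporter does this for you; the sketch only shows what travels over the wire.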
330387

331388
**Can I deploy agentevals on Kubernetes?**
389+
332390
Yes. A Dockerfile and a [Helm chart](charts/agentevals/) are included. A single pod exposes the web UI (8001), OTLP receiver (4318), and MCP server (8080). See the [Kubernetes example](examples/kubernetes/README.md) for a full walkthrough deploying agentevals alongside kagent and an OTel Collector.
333391

334392
**How does this compare to ADK's evaluations?**
335-
Unlike ADK's LocalEvalService, which couples agent execution with evaluation, agentevals only handles scoring: it takes pre-recorded traces and compares them against expected behavior using metrics like tool trajectory matching, response quality, and LLM-based judgments.
393+
394+
Unlike ADK's eval method, which couples agent execution with evaluation, agentevals only handles scoring: it takes pre-recorded traces and compares them against expected behavior using metrics like tool trajectory matching, response quality, and LLM-based judgments.
336395

337396
However, if you're iterating on your agents locally, you can point your agents to agentevals and you will see rich runtime information in your browser. For more details, use the bundled wheel and explore the Local Development option in the UI.
338397

339398
**How does this compare to Bedrock AgentCore's evaluation?**
399+
340400
AgentCore's evaluation integration (via `strands-agents-evals`) also couples agent execution with evaluation. It re-invokes the agent for each test case, converts the resulting OTel spans to AWS's ADOT format, and scores them against 4 built-in evaluators (Helpfulness, Accuracy, Harmfulness, Relevance) via a cloud API call. This means you need an AWS account, valid credentials, and network access for every evaluation.
341401

342402
agentevals takes a different approach: it scores pre-recorded traces locally without re-running anything. It works with standard Jaeger JSON and OTLP formats from any framework, supports open-ended metrics (tool trajectory matching, LLM-based judges, custom scorers), and ships with a CLI, web UI, and MCP server. No cloud dependency is required, though all of ADK's GCP-based evals are currently included as well.
343403

344404
**How does this compare to LangSmith?**
405+
345406
LangSmith is a cloud platform (self-hosting requires an Enterprise plan) where offline evaluation re-executes your application against curated datasets. Its deepest integration is with LangChain/LangGraph, though it can work with other frameworks. agentevals scores pre-recorded OTel traces without re-execution, requires no cloud account or enterprise license, and uses OpenTelemetry as the universal interface rather than a proprietary SDK.
346407

347408
**How does this compare to Langfuse?**
409+
348410
Langfuse is a full observability platform (requires Postgres, ClickHouse, Redis, and S3 for self-hosting) that supports both offline experiments (re-execution) and online evaluation of ingested traces. Traces must be ingested into Langfuse first via its SDK or OTel integration before they can be scored. agentevals evaluates raw OTel trace files or live OTLP streams directly with no database or platform infrastructure required.
349411

350412
**How does this compare to Opik?**
413+
351414
Opik's primary evaluation path re-runs your application code against a dataset, incurring additional LLM costs per eval run. It also supports online evaluation rules that auto-score production traces. While Opik supports OpenTelemetry ingestion alongside its own SDK, its evaluation workflow still centers on re-execution against datasets. agentevals evaluates pre-recorded OTel traces from any framework without re-execution, and runs entirely locally with no cloud dependency.
