content/docs/faq.md
Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ However, if you're iterating on your agents locally, you can point your agents t
 AgentCore's evaluation integration (via `strands-agents-evals`) also couples agent execution with evaluation. It re-invokes the agent for each test case, converts the resulting OTel spans to AWS's ADOT format, and scores them against 4 built-in evaluators (Helpfulness, Accuracy, Harmfulness, Relevance) via a cloud API call. This means you need an AWS account, valid credentials, and network access for every evaluation.

-agentevals takes a different approach: it scores pre-recorded traces locally without re-running anything. It works with standard Jaeger JSON and OTLP formats from any framework, supports open-ended metrics (tool trajectory matching, LLM-based judges, custom scorers), and ships with a CLI, web UI, and MCP server. No cloud dependency required.
+agentevals takes a different approach: it scores pre-recorded traces locally without re-running anything. It works with standard Jaeger JSON and OTLP formats from any framework, supports open-ended metrics (tool trajectory matching, LLM-based judges, custom scorers), and ships with a CLI and web UI. No cloud dependency required.
content/docs/integrations.md
Lines changed: 2 additions & 38 deletions
@@ -1,10 +1,10 @@
 ---
 title: "Integrations & Use Cases"
 weight: 2
-description: "Zero-code, SDK, CLI/CI, and MCP integration patterns."
+description: "Zero-code, SDK, and CLI/CI integration patterns."
 ---

-AgentEvals can be used in multiple ways depending on your workflow. Evaluate agents with zero code via OTel, programmatically via the SDK, in CI pipelines with the CLI, or conversationally through the MCP server.
+AgentEvals can be used in multiple ways depending on your workflow. Evaluate agents with zero code via OTel, programmatically via the SDK, or in CI pipelines with the CLI.

 > For detailed, working examples covering all integration patterns, see the [examples directory](https://github.com/agentevals-dev/agentevals/tree/main/examples) in the repository.
@@ -127,39 +127,3 @@ jobs:
 "
 ```

----
-
-## MCP Server
-
-Exposes evaluation tools to MCP clients. A `.mcp.json` at the project root lets Claude Code pick it up automatically.
-
-### Available Tools
-
-| Tool | Requires `serve` | Description |
-|------|:---:|-------------|
-| `list_metrics` | yes | List available metrics |
-| `evaluate_traces` | no | Evaluate local trace files (OTLP or Jaeger) |
-| `list_sessions` | yes | List streaming sessions |
-| `summarize_session` | yes | Structured summary of a session's tool calls |
-| `evaluate_sessions` | yes | Evaluate sessions against a golden reference |
-
-### Setup
-
-```bash
-# Start the MCP server
-uv run agentevals mcp
-
-# Custom server URL
-AGENTEVALS_SERVER_URL=http://localhost:9000 uv run agentevals mcp
-```
-
-The React UI and MCP server share the same in-memory session state and can run simultaneously.
-
-### Claude Code Skills
-
-Two slash-command workflows are available in repos with `.claude/skills/`:
-
-| Skill | What it does |
-|-------|-------------|
-| `/eval` | Score traces or compare sessions against a golden reference |
-| `/inspect` | Turn-by-turn narrative of a live session with anomaly detection |