
Commit 5b37a7f

sebbycorp and claude authored and committed
Add custom evaluators feature, remove MCP references, improve code block contrast
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e695e99 commit 5b37a7f

6 files changed

Lines changed: 61 additions & 82 deletions


content/docs/advanced.md

Lines changed: 0 additions & 19 deletions
Original file line number / Diff line number / Diff line change
@@ -25,25 +25,6 @@ While the server is running (`agentevals serve`), interactive API documentation
2525

2626
The OTLP receiver (port 4318) serves its own docs at `http://localhost:4318/docs`.
2727

28-
## MCP Server Tools
29-
30-
| Tool | Requires `serve` | Description |
31-
|------|:---:|-------------|
32-
| `list_metrics` | yes | List available metrics |
33-
| `evaluate_traces` | no | Evaluate local trace files (OTLP or Jaeger) |
34-
| `list_sessions` | yes | List streaming sessions |
35-
| `summarize_session` | yes | Structured summary of a session's tool calls |
36-
| `evaluate_sessions` | yes | Evaluate sessions against a golden reference |
37-
38-
## Claude Code Skills
39-
40-
Two slash-command workflows in `.claude/skills/`, available automatically in repos with the agentevals config:
41-
42-
| Skill | What it does |
43-
|-------|-------------|
44-
| `/eval` | Score traces or compare sessions against a golden reference |
45-
| `/inspect` | Turn-by-turn narrative of a live session with anomaly detection |
46-
4728
## Development
4829

4930
```bash

content/docs/faq.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ However, if you're iterating on your agents locally, you can point your agents t
1414

1515
AgentCore's evaluation integration (via `strands-agents-evals`) also couples agent execution with evaluation. It re-invokes the agent for each test case, converts the resulting OTel spans to AWS's ADOT format, and scores them against 4 built-in evaluators (Helpfulness, Accuracy, Harmfulness, Relevance) via a cloud API call. This means you need an AWS account, valid credentials, and network access for every evaluation.
1616

17-
agentevals takes a different approach: it scores pre-recorded traces locally without re-running anything. It works with standard Jaeger JSON and OTLP formats from any framework, supports open-ended metrics (tool trajectory matching, LLM-based judges, custom scorers), and ships with a CLI, web UI, and MCP server. No cloud dependency required.
17+
agentevals takes a different approach: it scores pre-recorded traces locally without re-running anything. It works with standard Jaeger JSON and OTLP formats from any framework, supports open-ended metrics (tool trajectory matching, LLM-based judges, custom scorers), and ships with a CLI and web UI. No cloud dependency required.
1818

1919
## What trace formats are supported?
2020

content/docs/integrations.md

Lines changed: 2 additions & 38 deletions
@@ -1,10 +1,10 @@
11
---
22
title: "Integrations & Use Cases"
33
weight: 2
4-
description: "Zero-code, SDK, CLI/CI, and MCP integration patterns."
4+
description: "Zero-code, SDK, and CLI/CI integration patterns."
55
---
66

7-
AgentEvals can be used in multiple ways depending on your workflow. Evaluate agents with zero code via OTel, programmatically via the SDK, in CI pipelines with the CLI, or conversationally through the MCP server.
7+
AgentEvals can be used in multiple ways depending on your workflow. Evaluate agents with zero code via OTel, programmatically via the SDK, or in CI pipelines with the CLI.
88

99
> For detailed, working examples covering all integration patterns, see the [examples directory](https://github.com/agentevals-dev/agentevals/tree/main/examples) in the repository.
1010
@@ -127,39 +127,3 @@ jobs:
127127
"
128128
```
129129
130-
---
131-
132-
## MCP Server
133-
134-
Exposes evaluation tools to MCP clients. A `.mcp.json` at the project root lets Claude Code pick it up automatically.
135-
136-
### Available Tools
137-
138-
| Tool | Requires `serve` | Description |
139-
|------|:---:|-------------|
140-
| `list_metrics` | yes | List available metrics |
141-
| `evaluate_traces` | no | Evaluate local trace files (OTLP or Jaeger) |
142-
| `list_sessions` | yes | List streaming sessions |
143-
| `summarize_session` | yes | Structured summary of a session's tool calls |
144-
| `evaluate_sessions` | yes | Evaluate sessions against a golden reference |
145-
146-
### Setup
147-
148-
```bash
149-
# Start the MCP server
150-
uv run agentevals mcp
151-
152-
# Custom server URL
153-
AGENTEVALS_SERVER_URL=http://localhost:9000 uv run agentevals mcp
154-
```
155-
156-
The React UI and MCP server share the same in-memory session state and can run simultaneously.
157-
158-
### Claude Code Skills
159-
160-
Two slash-command workflows are available in repos with `.claude/skills/`:
161-
162-
| Skill | What it does |
163-
|-------|-------------|
164-
| `/eval` | Score traces or compare sessions against a golden reference |
165-
| `/inspect` | Turn-by-turn narrative of a live session with anomaly detection |

content/docs/quick-start.md

Lines changed: 2 additions & 2 deletions
@@ -11,7 +11,7 @@ Grab a wheel from the [releases page](https://github.com/agentevals-dev/agenteva
1111
```bash
1212
pip install agentevals-<version>-py3-none-any.whl
1313

14-
# For MCP server and live streaming support:
14+
# For live streaming support:
1515
pip install "agentevals-<version>-py3-none-any.whl[live]"
1616
```
1717

@@ -61,6 +61,6 @@ Live-streamed traces appear in the "Local Dev" tab, grouped by session ID.
6161

6262
## What's Next
6363

64-
- [Integrations](/docs/integrations/) — Zero-code, SDK, CLI/CI, and MCP integration patterns
64+
- [Integrations](/docs/integrations/) — Zero-code, SDK, and CLI/CI integration patterns
6565
- [Custom Evaluators](/docs/custom-evaluators/) — Build your own evaluators
6666
- [UI Walkthrough](/docs/ui-walkthrough/) — Deep dive into the web UI

layouts/index.html

Lines changed: 8 additions & 9 deletions
@@ -82,6 +82,11 @@ <h3>LLM-as-Judge</h3>
8282
<h3>CI/CD Integration</h3>
8383
<p>Run evaluations in your pipeline with the CLI. Gate deployments on agent behavior quality scores.</p>
8484
</div>
85+
<div class="feature-card">
86+
<div class="feature-icon">&#x1f9e9;</div>
87+
<h3>Custom Evaluators</h3>
88+
<p>Write custom scoring logic in Python, JavaScript, or any language. Share and discover evaluators through the community registry.</p>
89+
</div>
8590
</div>
8691
</section>
8792

@@ -106,7 +111,7 @@ <h3>Define Eval Sets</h3>
106111
<div class="step">
107112
<div class="step-number">3</div>
108113
<h3>Score &amp; Report</h3>
109-
<p>Run evaluations via CLI, Web UI, or MCP server. Get detailed scores and pass/fail results.</p>
114+
<p>Run evaluations via CLI or Web UI. Get detailed scores and pass/fail results.</p>
110115
</div>
111116
</div>
112117
</div>
@@ -119,7 +124,7 @@ <h3>Score &amp; Report</h3>
119124
<h2>Three Ways to Evaluate</h2>
120125
<p>Choose the interface that fits your workflow.</p>
121126
</div>
122-
<div class="interfaces-grid">
127+
<div class="interfaces-grid interfaces-grid-2">
123128
<div class="interface-card">
124129
<div class="interface-icon">&#x2328;</div>
125130
<h3>CLI</h3>
@@ -130,11 +135,6 @@ <h3>CLI</h3>
130135
<h3>Web UI</h3>
131136
<p>Visually inspect traces and interactively evaluate agent behavior. Browse results, compare runs, and drill into details.</p>
132137
</div>
133-
<div class="interface-card">
134-
<div class="interface-icon">&#x1f50c;</div>
135-
<h3>MCP Server</h3>
136-
<p>Run evaluations directly from Claude Code conversations. The MCP server integrates agentevals into your AI workflow.</p>
137-
</div>
138138
</div>
139139
</div>
140140
</section>
@@ -181,8 +181,7 @@ <h2>Get Started</h2>
181181
<span class="comment"># Start the web UI</span>
182182
<span class="cmd">agentevals</span> serve
183183

184-
<span class="comment"># Start the MCP server</span>
185-
<span class="cmd">agentevals</span> mcp</pre>
184+
</pre>
186185
</div>
187186
</div>
188187
</div>

static/css/style.css

Lines changed: 48 additions & 13 deletions
@@ -483,14 +483,27 @@ section {
483483
}
484484

485485
.code-block {
486-
background: var(--bg-card);
487-
border: 1px solid var(--border);
486+
background: #0d0d14;
487+
border: 1px solid rgba(255, 255, 255, 0.12);
488488
border-radius: 12px;
489489
overflow: hidden;
490490
max-width: 700px;
491491
margin: 0 auto;
492492
}
493493

494+
[data-theme="light"] .code-block {
495+
background: #1e1e2e;
496+
border-color: rgba(0, 0, 0, 0.1);
497+
}
498+
499+
[data-theme="light"] .code-body pre {
500+
color: #e0e0e8;
501+
}
502+
503+
[data-theme="light"] .code-body .comment {
504+
color: #6b7280;
505+
}
506+
494507
.code-header {
495508
display: flex;
496509
align-items: center;
@@ -500,6 +513,15 @@ section {
500513
border-bottom: 1px solid var(--border);
501514
}
502515

516+
[data-theme="light"] .code-header {
517+
background: rgba(255, 255, 255, 0.05);
518+
border-bottom-color: rgba(255, 255, 255, 0.08);
519+
}
520+
521+
[data-theme="light"] .code-label {
522+
color: #6b7280;
523+
}
524+
503525
.code-dots {
504526
display: flex;
505527
gap: 6px;
@@ -531,31 +553,31 @@ section {
531553
font-family: var(--font-mono);
532554
font-size: 0.85rem;
533555
line-height: 1.7;
534-
color: var(--text-secondary);
556+
color: var(--text-primary);
535557
}
536558

537559
.code-body .cmd {
538560
color: var(--purple-light);
539561
}
540562

541563
[data-theme="light"] .code-body .cmd {
542-
color: var(--purple-dark);
564+
color: var(--purple-light);
543565
}
544566

545567
.code-body .flag {
546568
color: #22c55e;
547569
}
548570

549571
[data-theme="light"] .code-body .flag {
550-
color: #16a34a;
572+
color: #22c55e;
551573
}
552574

553575
.code-body .string {
554576
color: #fbbf24;
555577
}
556578

557579
[data-theme="light"] .code-body .string {
558-
color: #d97706;
580+
color: #fbbf24;
559581
}
560582

561583
.code-body .comment {
@@ -630,8 +652,15 @@ section {
630652
gap: 1.5rem;
631653
}
632654

655+
.interfaces-grid-2 {
656+
grid-template-columns: repeat(2, 1fr);
657+
max-width: 700px;
658+
margin: 0 auto;
659+
}
660+
633661
@media (max-width: 768px) {
634-
.interfaces-grid {
662+
.interfaces-grid,
663+
.interfaces-grid-2 {
635664
grid-template-columns: 1fr;
636665
}
637666
}
@@ -926,21 +955,30 @@ section {
926955
}
927956

928957
.docs-article pre {
929-
background: var(--bg-card);
930-
border: 1px solid var(--border);
958+
background: #0d0d14;
959+
border: 1px solid rgba(255, 255, 255, 0.12);
931960
border-radius: 8px;
932961
padding: 1.25rem;
933962
overflow-x: auto;
934963
margin-bottom: 1.5rem;
935964
}
936965

966+
[data-theme="light"] .docs-article pre {
967+
background: #1e1e2e;
968+
border-color: rgba(0, 0, 0, 0.1);
969+
}
970+
971+
[data-theme="light"] .docs-article pre code {
972+
color: #e0e0e8;
973+
}
974+
937975
.docs-article pre code {
938976
background: none;
939977
border: none;
940978
padding: 0;
941979
font-size: 0.85rem;
942980
line-height: 1.7;
943-
color: var(--text-secondary);
981+
color: var(--text-primary);
944982
}
945983

946984
.docs-article table {
@@ -1338,11 +1376,8 @@ section {
13381376
}
13391377

13401378
.eval-howto-code pre .cmd { color: var(--purple-light); }
1341-
[data-theme="light"] .eval-howto-code pre .cmd { color: var(--purple-dark); }
13421379
.eval-howto-code pre .flag { color: #22c55e; }
1343-
[data-theme="light"] .eval-howto-code pre .flag { color: #16a34a; }
13441380
.eval-howto-code pre .string { color: #fbbf24; }
1345-
[data-theme="light"] .eval-howto-code pre .string { color: #d97706; }
13461381

13471382
.eval-howto-note {
13481383
margin-top: 1rem;
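
Review note: the commit message says this change improves code block contrast. One way to sanity-check the new colour pairs is the WCAG 2.x contrast-ratio formula. The sketch below is illustrative only and is not part of the repository; the helper names (`relative_luminance`, `contrast_ratio`) are hypothetical, and the hex values are copied from the hunks in `static/css/style.css` above.

```python
"""Rough WCAG 2.x contrast check for the colours introduced in this commit."""


def _linearize(channel_8bit: int) -> float:
    # Convert one 8-bit sRGB channel to its linear-light value (WCAG 2.x definition).
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    # Accepts "#rrggbb" or "rrggbb"; returns relative luminance in [0, 1].
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(fg: str, bg: str) -> float:
    # Ratio of the lighter to the darker luminance, each offset by 0.05.
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


if __name__ == "__main__":
    # Foreground/background pairs taken from the CSS hunks in this diff.
    pairs = [
        ("light theme, pre text", "#e0e0e8", "#1e1e2e"),
        ("light theme, comment", "#6b7280", "#1e1e2e"),
        ("dark theme, flag", "#22c55e", "#0d0d14"),
        ("dark theme, string", "#fbbf24", "#0d0d14"),
    ]
    for label, fg, bg in pairs:
        print(f"{label}: {contrast_ratio(fg, bg):.2f}:1")
```

WCAG 2.x level AA asks for at least 4.5:1 for normal-size text, so any pair that prints below that figure may be worth a second look before merging.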
