revert: undo custom evaluators feature, MCP removal, and code block contrast changes

sebbycorp · claude · sebbycorp · commit db37679e86c0 · 2026-03-22T16:06:20.000-04:00
Reverts changes from 5b37a7f and subsequent commits that built on it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/content/docs/advanced.md b/content/docs/advanced.md
@@ -25,6 +25,25 @@ While the server is running (`agentevals serve`), interactive API documentation
 
 The OTLP receiver (port 4318) serves its own docs at `http://localhost:4318/docs`.
 
+## MCP Server Tools
+
+| Tool | Requires `serve` | Description |
+|------|:---:|-------------|
+| `list_metrics` | yes | List available metrics |
+| `evaluate_traces` | no | Evaluate local trace files (OTLP or Jaeger) |
+| `list_sessions` | yes | List streaming sessions |
+| `summarize_session` | yes | Structured summary of a session's tool calls |
+| `evaluate_sessions` | yes | Evaluate sessions against a golden reference |
+
+## Claude Code Skills
+
+Two slash-command workflows in `.claude/skills/`, available automatically in repos with the agentevals config:
+
+| Skill | What it does |
+|-------|-------------|
+| `/eval` | Score traces or compare sessions against a golden reference |
+| `/inspect` | Turn-by-turn narrative of a live session with anomaly detection |
+
 ## Development
 
 ```bash
diff --git a/content/docs/faq.md b/content/docs/faq.md
@@ -14,7 +14,7 @@ However, if you're iterating on your agents locally, you can point your agents t
 
 AgentCore's evaluation integration (via `strands-agents-evals`) also couples agent execution with evaluation. It re-invokes the agent for each test case, converts the resulting OTel spans to AWS's ADOT format, and scores them against 4 built-in evaluators (Helpfulness, Accuracy, Harmfulness, Relevance) via a cloud API call. This means you need an AWS account, valid credentials, and network access for every evaluation.
 
-agentevals takes a different approach: it scores pre-recorded traces locally without re-running anything. It works with standard Jaeger JSON and OTLP formats from any framework, supports open-ended metrics (tool trajectory matching, LLM-based judges, custom scorers), and ships with a CLI and web UI. No cloud dependency required.
+agentevals takes a different approach: it scores pre-recorded traces locally without re-running anything. It works with standard Jaeger JSON and OTLP formats from any framework, supports open-ended metrics (tool trajectory matching, LLM-based judges, custom scorers), and ships with a CLI, web UI, and MCP server. No cloud dependency required.
 
 ## What trace formats are supported?
 
diff --git a/content/docs/integrations.md b/content/docs/integrations.md
@@ -1,10 +1,10 @@
 ---
 title: "Integrations & Use Cases"
 weight: 2
-description: "Zero-code, SDK, and CLI/CI integration patterns."
+description: "Zero-code, SDK, CLI/CI, and MCP integration patterns."
 ---
 
-AgentEvals can be used in multiple ways depending on your workflow. Evaluate agents with zero code via OTel, programmatically via the SDK, or in CI pipelines with the CLI.
+AgentEvals can be used in multiple ways depending on your workflow. Evaluate agents with zero code via OTel, programmatically via the SDK, in CI pipelines with the CLI, or conversationally through the MCP server.
 
 > For detailed, working examples covering all integration patterns, see the [examples directory](https://github.com/agentevals-dev/agentevals/tree/main/examples) in the repository.
 
@@ -127,3 +127,39 @@ jobs:
           "
 ```
 
+---
+
+## MCP Server
+
+Exposes evaluation tools to MCP clients. A `.mcp.json` at the project root lets Claude Code pick it up automatically.
+
+### Available Tools
+
+| Tool | Requires `serve` | Description |
+|------|:---:|-------------|
+| `list_metrics` | yes | List available metrics |
+| `evaluate_traces` | no | Evaluate local trace files (OTLP or Jaeger) |
+| `list_sessions` | yes | List streaming sessions |
+| `summarize_session` | yes | Structured summary of a session's tool calls |
+| `evaluate_sessions` | yes | Evaluate sessions against a golden reference |
+
+### Setup
+
+```bash
+# Start the MCP server
+uv run agentevals mcp
+
+# Custom server URL
+AGENTEVALS_SERVER_URL=http://localhost:9000 uv run agentevals mcp
+```
+
+The React UI and MCP server share the same in-memory session state and can run simultaneously.
+
+### Claude Code Skills
+
+Two slash-command workflows are available in repos with `.claude/skills/`:
+
+| Skill | What it does |
+|-------|-------------|
+| `/eval` | Score traces or compare sessions against a golden reference |
+| `/inspect` | Turn-by-turn narrative of a live session with anomaly detection |
diff --git a/content/docs/quick-start.md b/content/docs/quick-start.md
@@ -11,7 +11,7 @@ Grab a wheel from the [releases page](https://github.com/agentevals-dev/agenteva
 ```bash
 pip install agentevals-<version>-py3-none-any.whl
 
-# For live streaming support:
+# For MCP server and live streaming support:
 pip install "agentevals-<version>-py3-none-any.whl[live]"
 ```
 
@@ -61,6 +61,6 @@ Live-streamed traces appear in the "Local Dev" tab, grouped by session ID.
 
 ## What's Next
 
-- [Integrations](/docs/integrations/) — Zero-code, SDK, and CLI/CI integration patterns
+- [Integrations](/docs/integrations/) — Zero-code, SDK, CLI/CI, and MCP integration patterns
 - [Custom Evaluators](/docs/custom-evaluators/) — Build your own evaluators
 - [UI Walkthrough](/docs/ui-walkthrough/) — Deep dive into the web UI
diff --git a/layouts/index.html b/layouts/index.html
diff --git a/static/css/style.css b/static/css/style.css