Add custom evaluators feature, remove MCP references, improve code block contrast

sebbycorp · claude · sebbycorp · commit 5b37a7f91940 · 2026-03-22T14:40:34.000-04:00
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/content/docs/advanced.md b/content/docs/advanced.md
@@ -25,25 +25,6 @@ While the server is running (`agentevals serve`), interactive API documentation
 
 The OTLP receiver (port 4318) serves its own docs at `http://localhost:4318/docs`.
 
-## MCP Server Tools
-
-| Tool | Requires `serve` | Description |
-|------|:---:|-------------|
-| `list_metrics` | yes | List available metrics |
-| `evaluate_traces` | no | Evaluate local trace files (OTLP or Jaeger) |
-| `list_sessions` | yes | List streaming sessions |
-| `summarize_session` | yes | Structured summary of a session's tool calls |
-| `evaluate_sessions` | yes | Evaluate sessions against a golden reference |
-
-## Claude Code Skills
-
-Two slash-command workflows in `.claude/skills/`, available automatically in repos with the agentevals config:
-
-| Skill | What it does |
-|-------|-------------|
-| `/eval` | Score traces or compare sessions against a golden reference |
-| `/inspect` | Turn-by-turn narrative of a live session with anomaly detection |
-
 ## Development
 
 ```bash
diff --git a/content/docs/faq.md b/content/docs/faq.md
@@ -14,7 +14,7 @@ However, if you're iterating on your agents locally, you can point your agents t
 
 AgentCore's evaluation integration (via `strands-agents-evals`) also couples agent execution with evaluation. It re-invokes the agent for each test case, converts the resulting OTel spans to AWS's ADOT format, and scores them against 4 built-in evaluators (Helpfulness, Accuracy, Harmfulness, Relevance) via a cloud API call. This means you need an AWS account, valid credentials, and network access for every evaluation.
 
-agentevals takes a different approach: it scores pre-recorded traces locally without re-running anything. It works with standard Jaeger JSON and OTLP formats from any framework, supports open-ended metrics (tool trajectory matching, LLM-based judges, custom scorers), and ships with a CLI, web UI, and MCP server. No cloud dependency required.
+agentevals takes a different approach: it scores pre-recorded traces locally without re-running anything. It works with standard Jaeger JSON and OTLP formats from any framework, supports open-ended metrics (tool trajectory matching, LLM-based judges, custom scorers), and ships with a CLI and web UI. No cloud dependency required.
 
 ## What trace formats are supported?
 
diff --git a/content/docs/integrations.md b/content/docs/integrations.md
@@ -1,10 +1,10 @@
 ---
 title: "Integrations & Use Cases"
 weight: 2
-description: "Zero-code, SDK, CLI/CI, and MCP integration patterns."
+description: "Zero-code, SDK, and CLI/CI integration patterns."
 ---
 
-AgentEvals can be used in multiple ways depending on your workflow. Evaluate agents with zero code via OTel, programmatically via the SDK, in CI pipelines with the CLI, or conversationally through the MCP server.
+AgentEvals can be used in multiple ways depending on your workflow. Evaluate agents with zero code via OTel, programmatically via the SDK, or in CI pipelines with the CLI.
 
 > For detailed, working examples covering all integration patterns, see the [examples directory](https://github.com/agentevals-dev/agentevals/tree/main/examples) in the repository.
 
@@ -127,39 +127,3 @@ jobs:
           "
 ```
 
----
-
-## MCP Server
-
-Exposes evaluation tools to MCP clients. A `.mcp.json` at the project root lets Claude Code pick it up automatically.
-
-### Available Tools
-
-| Tool | Requires `serve` | Description |
-|------|:---:|-------------|
-| `list_metrics` | yes | List available metrics |
-| `evaluate_traces` | no | Evaluate local trace files (OTLP or Jaeger) |
-| `list_sessions` | yes | List streaming sessions |
-| `summarize_session` | yes | Structured summary of a session's tool calls |
-| `evaluate_sessions` | yes | Evaluate sessions against a golden reference |
-
-### Setup
-
-```bash
-# Start the MCP server
-uv run agentevals mcp
-
-# Custom server URL
-AGENTEVALS_SERVER_URL=http://localhost:9000 uv run agentevals mcp
-```
-
-The React UI and MCP server share the same in-memory session state and can run simultaneously.
-
-### Claude Code Skills
-
-Two slash-command workflows are available in repos with `.claude/skills/`:
-
-| Skill | What it does |
-|-------|-------------|
-| `/eval` | Score traces or compare sessions against a golden reference |
-| `/inspect` | Turn-by-turn narrative of a live session with anomaly detection |
diff --git a/content/docs/quick-start.md b/content/docs/quick-start.md
@@ -11,7 +11,7 @@ Grab a wheel from the [releases page](https://github.com/agentevals-dev/agenteva
 ```bash
 pip install agentevals-<version>-py3-none-any.whl
 
-# For MCP server and live streaming support:
+# For live streaming support:
 pip install "agentevals-<version>-py3-none-any.whl[live]"
 ```
 
@@ -61,6 +61,6 @@ Live-streamed traces appear in the "Local Dev" tab, grouped by session ID.
 
 ## What's Next
 
-- [Integrations](/docs/integrations/) — Zero-code, SDK, CLI/CI, and MCP integration patterns
+- [Integrations](/docs/integrations/) — Zero-code, SDK, and CLI/CI integration patterns
 - [Custom Evaluators](/docs/custom-evaluators/) — Build your own evaluators
 - [UI Walkthrough](/docs/ui-walkthrough/) — Deep dive into the web UI
diff --git a/layouts/index.html b/layouts/index.html
@@ -82,6 +82,11 @@ <h3>LLM-as-Judge</h3>
       <h3>CI/CD Integration</h3>
       <p>Run evaluations in your pipeline with the CLI. Gate deployments on agent behavior quality scores.</p>
     </div>
+    <div class="feature-card">
+      <div class="feature-icon">&#x1f9e9;</div>
+      <h3>Custom Evaluators</h3>
+      <p>Write custom scoring logic in Python, JavaScript, or any language. Share and discover evaluators through the community registry.</p>
+    </div>
   </div>
 </section>
 
@@ -106,7 +111,7 @@ <h3>Define Eval Sets</h3>
       <div class="step">
         <div class="step-number">3</div>
         <h3>Score &amp; Report</h3>
-        <p>Run evaluations via CLI, Web UI, or MCP server. Get detailed scores and pass/fail results.</p>
+        <p>Run evaluations via CLI or Web UI. Get detailed scores and pass/fail results.</p>
       </div>
     </div>
   </div>
@@ -119,7 +124,7 @@ <h3>Score &amp; Report</h3>
       <h2>Three Ways to Evaluate</h2>
       <p>Choose the interface that fits your workflow.</p>
     </div>
-    <div class="interfaces-grid">
+    <div class="interfaces-grid interfaces-grid-2">
       <div class="interface-card">
         <div class="interface-icon">&#x2328;</div>
         <h3>CLI</h3>
@@ -130,11 +135,6 @@ <h3>CLI</h3>
         <h3>Web UI</h3>
         <p>Visually inspect traces and interactively evaluate agent behavior. Browse results, compare runs, and drill into details.</p>
       </div>
-      <div class="interface-card">
-        <div class="interface-icon">&#x1f50c;</div>
-        <h3>MCP Server</h3>
-        <p>Run evaluations directly from Claude Code conversations. The MCP server integrates agentevals into your AI workflow.</p>
-      </div>
     </div>
   </div>
 </section>
@@ -181,8 +181,7 @@ <h2>Get Started</h2>
 <span class="comment"># Start the web UI</span>
 <span class="cmd">agentevals</span> serve
 
-<span class="comment"># Start the MCP server</span>
-<span class="cmd">agentevals</span> mcp</pre>
+</pre>
       </div>
     </div>
   </div>
diff --git a/static/css/style.css b/static/css/style.css
@@ -483,14 +483,27 @@ section {
 }
 
 .code-block {
-  background: var(--bg-card);
-  border: 1px solid var(--border);
+  background: #0d0d14;
+  border: 1px solid rgba(255, 255, 255, 0.12);
   border-radius: 12px;
   overflow: hidden;
   max-width: 700px;
   margin: 0 auto;
 }
 
+[data-theme="light"] .code-block {
+  background: #1e1e2e;
+  border-color: rgba(0, 0, 0, 0.1);
+}
+
+[data-theme="light"] .code-body pre {
+  color: #e0e0e8;
+}
+
+[data-theme="light"] .code-body .comment {
+  color: #6b7280;
+}
+
 .code-header {
   display: flex;
   align-items: center;
@@ -500,6 +513,15 @@ section {
   border-bottom: 1px solid var(--border);
 }
 
+[data-theme="light"] .code-header {
+  background: rgba(255, 255, 255, 0.05);
+  border-bottom-color: rgba(255, 255, 255, 0.08);
+}
+
+[data-theme="light"] .code-label {
+  color: #6b7280;
+}
+
 .code-dots {
   display: flex;
   gap: 6px;
@@ -531,31 +553,31 @@ section {
   font-family: var(--font-mono);
   font-size: 0.85rem;
   line-height: 1.7;
-  color: var(--text-secondary);
+  color: var(--text-primary);
 }
 
 .code-body .cmd {
   color: var(--purple-light);
 }
 
 [data-theme="light"] .code-body .cmd {
-  color: var(--purple-dark);
+  color: var(--purple-light);
 }
 
 .code-body .flag {
   color: #22c55e;
 }
 
 [data-theme="light"] .code-body .flag {
-  color: #16a34a;
+  color: #22c55e;
 }
 
 .code-body .string {
   color: #fbbf24;
 }
 
 [data-theme="light"] .code-body .string {
-  color: #d97706;
+  color: #fbbf24;
 }
 
 .code-body .comment {
@@ -630,8 +652,15 @@ section {
   gap: 1.5rem;
 }
 
+.interfaces-grid-2 {
+  grid-template-columns: repeat(2, 1fr);
+  max-width: 700px;
+  margin: 0 auto;
+}
+
 @media (max-width: 768px) {
-  .interfaces-grid {
+  .interfaces-grid,
+  .interfaces-grid-2 {
     grid-template-columns: 1fr;
   }
 }
@@ -926,21 +955,30 @@ section {
 }
 
 .docs-article pre {
-  background: var(--bg-card);
-  border: 1px solid var(--border);
+  background: #0d0d14;
+  border: 1px solid rgba(255, 255, 255, 0.12);
   border-radius: 8px;
   padding: 1.25rem;
   overflow-x: auto;
   margin-bottom: 1.5rem;
 }
 
+[data-theme="light"] .docs-article pre {
+  background: #1e1e2e;
+  border-color: rgba(0, 0, 0, 0.1);
+}
+
+[data-theme="light"] .docs-article pre code {
+  color: #e0e0e8;
+}
+
 .docs-article pre code {
   background: none;
   border: none;
   padding: 0;
   font-size: 0.85rem;
   line-height: 1.7;
-  color: var(--text-secondary);
+  color: var(--text-primary);
 }
 
 .docs-article table {
@@ -1338,11 +1376,8 @@ section {
 }
 
 .eval-howto-code pre .cmd { color: var(--purple-light); }
-[data-theme="light"] .eval-howto-code pre .cmd { color: var(--purple-dark); }
 .eval-howto-code pre .flag { color: #22c55e; }
-[data-theme="light"] .eval-howto-code pre .flag { color: #16a34a; }
 .eval-howto-code pre .string { color: #fbbf24; }
-[data-theme="light"] .eval-howto-code pre .string { color: #d97706; }
 
 .eval-howto-note {
   margin-top: 1rem;