|
1 | | -{{ define "main" }} |
| 1 | +<!DOCTYPE html> |
| 2 | +<html lang="en"> |
| 3 | +<head> |
| 4 | + <meta charset="UTF-8"> |
| 5 | + <meta name="viewport" content="width=device-width, initial-scale=1.0"> |
| 6 | + <title>{{ .Site.Title }}</title> |
| 7 | + {{ partial "head.html" . }} |
| 8 | +</head> |
| 9 | +<body> |
| 10 | + <div class="bg-grid"></div> |
2 | 11 |
|
3 | | -<!-- Navigation --> |
4 | | -<nav class="nav"> |
5 | | - <a href="{{ "/" | relURL }}" class="nav-logo"> |
6 | | - <img src="{{ "images/logo-color.png" | relURL }}" alt="AgentEvals" class="logo-dark"> |
7 | | - <img src="{{ "images/logo-light.png" | relURL }}" alt="AgentEvals" class="logo-light"> |
8 | | - </a> |
9 | | - <button class="nav-toggle" onclick="document.querySelector('.nav-links').classList.toggle('active')" aria-label="Menu">☰</button> |
10 | | - <div class="nav-links"> |
11 | | - <a href="#features">Features</a> |
12 | | - <a href="#how-it-works">How It Works</a> |
13 | | - <a href="#interfaces">Interfaces</a> |
14 | | - <a href="#get-started">Get Started</a> |
15 | | - <a href="{{ "docs/" | relURL }}">Docs</a> |
16 | | - <a href="{{ "evaluators/" | relURL }}">Evaluators</a> |
17 | | - <a href="{{ .Site.Params.discord }}" target="_blank">Discord</a> |
18 | | - <a href="{{ .Site.Params.github }}" target="_blank" class="btn-sm">GitHub</a> |
19 | | - <button class="theme-toggle" onclick="toggleTheme()" aria-label="Toggle theme"> |
20 | | - <span class="icon-sun">☀</span> |
21 | | - <span class="icon-moon">☾</span> |
22 | | - </button> |
23 | | - </div> |
24 | | -</nav> |
25 | | - |
26 | | -<!-- Hero --> |
27 | | -<section class="hero"> |
28 | | - <div class="hero-content"> |
29 | | - <div class="hero-logo"> |
30 | | - <img src="{{ "images/logo-color.png" | relURL }}" alt="AgentEvals" class="logo-dark"> |
31 | | - <img src="{{ "images/logo-color-transparent.png" | relURL }}" alt="AgentEvals" class="logo-light"> |
32 | | - </div> |
33 | | - <h1>Ship Agents <span class="highlight">Reliably</span></h1> |
34 | | - <p>Benchmark your agents before they hit production. AgentEvals scores performance and inference quality from OpenTelemetry traces — no re-runs, no guesswork.</p> |
35 | | - <div class="hero-buttons"> |
36 | | - <a href="{{ .Site.Params.github }}" target="_blank" class="btn btn-primary"> |
37 | | - <svg class="icon" viewBox="0 0 24 24" fill="currentColor"><path d="M12 0c-6.626 0-12 5.373-12 12 0 5.302 3.438 9.8 8.207 11.387.599.111.793-.261.793-.577v-2.234c-3.338.726-4.033-1.416-4.033-1.416-.546-1.387-1.333-1.756-1.333-1.756-1.089-.745.083-.729.083-.729 1.205.084 1.839 1.237 1.839 1.237 1.07 1.834 2.807 1.304 3.492.997.107-.775.418-1.305.762-1.604-2.665-.305-5.467-1.334-5.467-5.931 0-1.311.469-2.381 1.236-3.221-.124-.303-.535-1.524.117-3.176 0 0 1.008-.322 3.301 1.23.957-.266 1.983-.399 3.003-.404 1.02.005 2.047.138 3.006.404 2.291-1.552 3.297-1.23 3.297-1.23.653 1.653.242 2.874.118 3.176.77.84 1.235 1.911 1.235 3.221 0 4.609-2.807 5.624-5.479 5.921.43.372.823 1.102.823 2.222v3.293c0 .319.192.694.801.576 4.765-1.589 8.199-6.086 8.199-11.386 0-6.627-5.373-12-12-12z"/></svg> |
38 | | - View on GitHub |
39 | | - </a> |
40 | | - <a href="{{ "docs/" | relURL }}" class="btn btn-secondary">Read the Docs</a> |
41 | | - <a href="{{ .Site.Params.discord }}" target="_blank" class="btn btn-secondary"> |
42 | | - Join Discord |
43 | | - </a> |
44 | | - </div> |
45 | | - </div> |
46 | | -</section> |
47 | | - |
48 | | -<!-- Features --> |
49 | | -<section id="features" class="container"> |
50 | | - <div class="section-header"> |
51 | | - <h2>Why AgentEvals?</h2> |
52 | | - <p>Evaluate agent behavior from real traces, not synthetic replays.</p> |
53 | | - </div> |
54 | | - <div class="features-grid"> |
55 | | - <div class="feature-card"> |
56 | | - <div class="feature-icon">🔍</div> |
57 | | - <h3>Trace-Based Evaluation</h3> |
58 | | - <p>Parse OTLP streams and Jaeger JSON traces to evaluate agent behavior directly from production or test telemetry data.</p> |
59 | | - </div> |
60 | | - <div class="feature-card"> |
61 | | - <div class="feature-icon">⚡</div> |
62 | | - <h3>No Re-Running Required</h3> |
63 | | - <p>Score agent behavior from existing traces. No need to replay expensive LLM calls or wait for agent re-execution.</p> |
64 | | - </div> |
65 | | - <div class="feature-card"> |
66 | | - <div class="feature-icon">🎯</div> |
67 | | - <h3>Golden Eval Sets</h3> |
68 | | - <p>Define expected behaviors as golden eval sets and score traces against them using ADK's evaluation framework.</p> |
69 | | - </div> |
70 | | - <div class="feature-card"> |
71 | | - <div class="feature-icon">📊</div> |
72 | | - <h3>Trajectory Matching</h3> |
73 | | - <p>Compare agent trajectories with strict, unordered, subset, or superset matching modes for flexible evaluation.</p> |
74 | | - </div> |
75 | | - <div class="feature-card"> |
76 | | - <div class="feature-icon">🤖</div> |
77 | | - <h3>LLM-as-Judge</h3> |
78 | | - <p>Use LLM-powered evaluation for nuanced scoring of agent behavior without requiring reference trajectories.</p> |
79 | | - </div> |
80 | | - <div class="feature-card"> |
81 | | - <div class="feature-icon">🛠</div> |
82 | | - <h3>CI/CD Integration</h3> |
83 | | - <p>Run evaluations in your pipeline with the CLI. Gate deployments on agent behavior quality scores.</p> |
84 | | - </div> |
85 | | - <div class="feature-card"> |
86 | | - <div class="feature-icon">🧩</div> |
87 | | - <h3>Custom Evaluators</h3> |
88 | | - <p>Write custom scoring logic in Python, JavaScript, or any language. Share and discover evaluators through the community registry.</p> |
89 | | - </div> |
90 | | - </div> |
91 | | -</section> |
92 | | - |
93 | | -<!-- How It Works --> |
94 | | -<section id="how-it-works" class="how-it-works"> |
95 | | - <div class="container"> |
96 | | - <div class="section-header"> |
97 | | - <h2>How It Works</h2> |
98 | | - <p>Three steps from traces to scores.</p> |
99 | | - </div> |
100 | | - <div class="steps"> |
101 | | - <div class="step"> |
102 | | - <div class="step-number">1</div> |
103 | | - <h3>Collect Traces</h3> |
104 | | - <p>Instrument your agent with OpenTelemetry or export Jaeger JSON traces from your observability platform.</p> |
| 12 | + <header class="hero"> |
| 13 | + <div class="hero-content"> |
| 14 | + <div class="hero-badge"> |
| 15 | + <span class="badge-dot"></span> |
| 16 | + Open source • Python SDK • OpenTelemetry native |
105 | 17 | </div> |
106 | | - <div class="step"> |
107 | | - <div class="step-number">2</div> |
108 | | - <h3>Define Eval Sets</h3> |
109 | | - <p>Create golden evaluation sets that describe expected agent behaviors, tool calls, and trajectories.</p> |
| 18 | + <h1>Score your AI agent behavior from traces.</h1> |
| 19 | + <p class="hero-subtitle"> |
| 20 | + AgentEvals is the open-source Python framework for scoring AI agent performance and behavior |
| 21 | + from OpenTelemetry traces. Test prompts, tools, memory, and workflows without re-running your agents. |
| 22 | + </p> |
| 23 | + <div class="hero-cta"> |
| 24 | + <a href="/docs/quick-start/" class="btn btn-primary">Quick Start</a> |
| 25 | + <a href="https://github.com/agentevals-dev/agentevals" class="btn btn-secondary" target="_blank" rel="noopener">GitHub</a> |
110 | 26 | </div> |
111 | | - <div class="step"> |
112 | | - <div class="step-number">3</div> |
113 | | - <h3>Score & Report</h3> |
114 | | - <p>Run evaluations via CLI or Web UI. Get detailed scores and pass/fail results.</p> |
| 27 | + <div class="hero-meta"> |
| 28 | + <span>CLI</span> |
| 29 | + <span>Custom Evaluators</span> |
| 30 | + <span>Web UI</span> |
| 31 | + <span>CI/CD</span> |
115 | 32 | </div> |
116 | 33 | </div> |
117 | | - </div> |
118 | | -</section> |
| 34 | + </header> |
119 | 35 |
|
120 | | -<!-- Interfaces --> |
121 | | -<section id="interfaces" class="interfaces"> |
122 | | - <div class="container"> |
123 | | - <div class="section-header"> |
124 | | - <h2>Three Ways to Evaluate</h2> |
125 | | - <p>Choose the interface that fits your workflow.</p> |
126 | | - </div> |
127 | | - <div class="interfaces-grid interfaces-grid-2"> |
128 | | - <div class="interface-card"> |
129 | | - <div class="interface-icon">⌨</div> |
130 | | - <h3>CLI</h3> |
131 | | - <p>Script evaluations and integrate into CI/CD pipelines. Pipe in traces, get scores out. Built for automation.</p> |
132 | | - </div> |
133 | | - <div class="interface-card"> |
134 | | - <div class="interface-icon">🖥</div> |
135 | | - <h3>Web UI</h3> |
136 | | - <p>Visually inspect traces and interactively evaluate agent behavior. Browse results, compare runs, and drill into details.</p> |
| 36 | + <main> |
| 37 | + <section class="features section"> |
| 38 | + <div class="section-header"> |
| 39 | + <span class="section-label">Why AgentEvals</span> |
| 40 | + <h2>Evaluation that matches how agents actually run.</h2> |
| 41 | + <p>Traditional evals re-run entire workflows. AgentEvals scores the traces you already collect, so you can measure behavior in realistic conditions.</p> |
137 | 42 | </div> |
138 | | - </div> |
139 | | - </div> |
140 | | -</section> |
141 | 43 |
|
142 | | -<!-- Custom Evaluators CTA --> |
143 | | -<section class="evaluators-cta"> |
144 | | - <div class="container"> |
145 | | - <div class="evaluators-cta-inner"> |
146 | | - <div class="evaluators-cta-text"> |
147 | | - <h2>Build Your Own Evaluators</h2> |
148 | | - <p>Write custom scoring logic in Python, JavaScript, or any language. Share it with the community through our evaluator registry.</p> |
| 44 | + <div class="feature-grid"> |
| 45 | + <article class="feature-card"> |
| 46 | + <div class="feature-icon">◉</div> |
| 47 | + <h3>Trace-native evaluation</h3> |
| 48 | + <p>Built on OpenTelemetry traces so you can evaluate real production-like runs without replaying agent execution.</p> |
| 49 | + </article> |
| 50 | + <article class="feature-card"> |
| 51 | + <div class="feature-icon">◈</div> |
| 52 | + <h3>Flexible scoring</h3> |
| 53 | + <p>Combine built-in evaluators with custom Python logic to measure correctness, tool usage, memory behavior, and more.</p> |
| 54 | + </article> |
| 55 | + <article class="feature-card"> |
| 56 | + <div class="feature-icon">◎</div> |
| 57 | + <h3>Works in your workflow</h3> |
| 58 | + <p>Run locally with the CLI, automate in CI/CD, or explore results visually in the web UI.</p> |
| 59 | + </article> |
149 | 60 | </div> |
150 | | - <div class="evaluators-cta-actions"> |
151 | | - <a href="{{ "evaluators/" | relURL }}" class="btn btn-primary">Browse Evaluators</a> |
152 | | - <a href="https://github.com/agentevals-dev/evaluators#contributing-an-evaluator" target="_blank" class="btn btn-secondary">Submit Your Own</a> |
| 61 | + </section> |
| 62 | + |
| 63 | + <section class="workflow section"> |
| 64 | + <div class="section-header"> |
| 65 | + <span class="section-label">How it works</span> |
| 66 | + <h2>From traces to scores in three steps.</h2> |
153 | 67 | </div> |
154 | | - </div> |
155 | | - </div> |
156 | | -</section> |
157 | 68 |
|
158 | | -<!-- Get Started --> |
159 | | -<section id="get-started" class="code-section"> |
160 | | - <div class="container"> |
161 | | - <div class="section-header"> |
162 | | - <h2>Get Started</h2> |
163 | | - <p>Up and running in seconds.</p> |
164 | | - </div> |
165 | | - <div class="code-block"> |
166 | | - <div class="code-header"> |
167 | | - <div class="code-dots"> |
168 | | - <span></span><span></span><span></span> |
| 69 | + <div class="workflow-steps"> |
| 70 | + <div class="workflow-step"> |
| 71 | + <span class="step-number">01</span> |
| 72 | + <h3>Collect traces</h3> |
| 73 | + <p>Instrument your agent with OpenTelemetry and emit traces for prompts, tool calls, memory operations, and outputs.</p> |
| 74 | + </div> |
| 75 | + <div class="workflow-step"> |
| 76 | + <span class="step-number">02</span> |
| 77 | + <h3>Define evaluators</h3> |
| 78 | + <p>Choose built-in evaluators or create your own to score the behaviors that matter for your agent.</p> |
| 79 | + </div> |
| 80 | + <div class="workflow-step"> |
| 81 | + <span class="step-number">03</span> |
| 82 | + <h3>Run evaluations</h3> |
| 83 | + <p>Score trace datasets through the CLI or web UI and compare results across prompts, models, or tool strategies.</p> |
169 | 84 | </div> |
170 | | - <span class="code-label">terminal</span> |
171 | 85 | </div> |
172 | | - <div class="code-body"> |
173 | | -<pre><span class="comment"># Install from release wheel</span> |
174 | | -<span class="cmd">pip</span> install agentevals-<version>-py3-none-any.whl |
| 86 | + </section> |
175 | 87 |
|
176 | | -<span class="comment"># Run an evaluation against a trace</span> |
177 | | -<span class="cmd">agentevals</span> run samples/helm.json \ |
178 | | - <span class="flag">--eval-set</span> <span class="string">samples/eval_set_helm.json</span> \ |
179 | | - <span class="flag">-m</span> <span class="string">tool_trajectory_avg_score</span> |
180 | | - |
181 | | -<span class="comment"># Start the web UI</span> |
182 | | -<span class="cmd">agentevals</span> serve |
| 88 | + <section class="docs-preview section"> |
| 89 | + <div class="section-header"> |
| 90 | + <span class="section-label">Docs</span> |
| 91 | + <h2>Start with the path that fits your workflow.</h2> |
| 92 | + </div> |
183 | 93 |
|
184 | | -</pre> |
| 94 | + <div class="docs-grid"> |
| 95 | + {{ range where .Site.RegularPages "Section" "docs" }} |
| 96 | + <a class="doc-card" href="{{ .RelPermalink }}"> |
| 97 | + <div> |
| 98 | + <h3>{{ .Title }}</h3> |
| 99 | + <p>{{ .Description }}</p> |
| 100 | + </div> |
| 101 | + <span class="doc-arrow">→</span> |
| 102 | + </a> |
| 103 | + {{ end }} |
185 | 104 | </div> |
186 | | - </div> |
187 | | - </div> |
188 | | -</section> |
| 105 | + </section> |
189 | 106 |
|
190 | | -<!-- CTA --> |
191 | | -<section class="cta"> |
192 | | - <div class="cta-content"> |
193 | | - <h2>Start Evaluating Your Agents</h2> |
194 | | - <p>Open source. Trace-driven. No re-runs needed.</p> |
195 | | - <div class="cta-buttons"> |
196 | | - <a href="{{ .Site.Params.github }}" target="_blank" class="btn btn-primary"> |
197 | | - <svg class="icon" viewBox="0 0 24 24" fill="currentColor"><path d="M12 0c-6.626 0-12 5.373-12 12 0 5.302 3.438 9.8 8.207 11.387.599.111.793-.261.793-.577v-2.234c-3.338.726-4.033-1.416-4.033-1.416-.546-1.387-1.333-1.756-1.333-1.756-1.089-.745.083-.729.083-.729 1.205.084 1.839 1.237 1.839 1.237 1.07 1.834 2.807 1.304 3.492.997.107-.775.418-1.305.762-1.604-2.665-.305-5.467-1.334-5.467-5.931 0-1.311.469-2.381 1.236-3.221-.124-.303-.535-1.524.117-3.176 0 0 1.008-.322 3.301 1.23.957-.266 1.983-.399 3.003-.404 1.02.005 2.047.138 3.006.404 2.291-1.552 3.297-1.23 3.297-1.23.653 1.653.242 2.874.118 3.176.77.84 1.235 1.911 1.235 3.221 0 4.609-2.807 5.624-5.479 5.921.43.372.823 1.102.823 2.222v3.293c0 .319.192.694.801.576 4.765-1.589 8.199-6.086 8.199-11.386 0-6.627-5.373-12-12-12z"/></svg> |
198 | | - GitHub |
199 | | - </a> |
200 | | - <a href="{{ .Site.Params.discord }}" target="_blank" class="btn btn-secondary"> |
201 | | - Join Discord |
202 | | - </a> |
203 | | - </div> |
204 | | - </div> |
205 | | -</section> |
| 107 | + <section class="usage section"> |
| 108 | + <div class="section-header"> |
| 109 | + <span class="section-label">Usage</span> |
| 110 | + <h2>Two ways to evaluate.</h2> |
| 111 | + <p>Use the CLI for fast, scriptable scoring or the Web UI for visual exploration of evaluation results.</p> |
| 112 | + </div> |
206 | 113 |
|
207 | | -<!-- Footer --> |
208 | | -<footer class="footer"> |
209 | | - <div class="footer-content"> |
210 | | - <a href="{{ "/" | relURL }}" class="footer-logo"> |
211 | | - <img src="{{ "images/logo-color.png" | relURL }}" alt="AgentEvals" class="logo-dark"> |
212 | | - <img src="{{ "images/logo-light.png" | relURL }}" alt="AgentEvals" class="logo-light"> |
213 | | - </a> |
214 | | - <div class="footer-links"> |
215 | | - <a href="{{ "docs/" | relURL }}">Docs</a> |
216 | | - <a href="{{ .Site.Params.github }}" target="_blank">GitHub</a> |
217 | | - <a href="{{ .Site.Params.discord }}" target="_blank">Discord</a> |
218 | | - <a href="https://github.com/agentregistry-dev/" target="_blank">AgentRegistry</a> |
219 | | - </div> |
220 | | - <span class="footer-copy">© {{ now.Year }} AgentEvals. Open source under Apache 2.0.</span> |
221 | | - </div> |
222 | | -</footer> |
| 114 | + <div class="usage-grid"> |
| 115 | + <article class="usage-card"> |
| 116 | + <h3>CLI</h3> |
| 117 | + <p>Run evaluations locally or in CI with straightforward commands and structured outputs.</p> |
| 118 | + <pre><code>agentevals eval run config.yaml</code></pre> |
| 119 | + </article> |
| 120 | + <article class="usage-card"> |
| 121 | + <h3>Web UI</h3> |
| 122 | + <p>Inspect trace datasets, compare runs, and review evaluator outputs in a visual interface.</p> |
| 123 | + <pre><code>agentevals ui</code></pre> |
| 124 | + </article> |
| 125 | + </div> |
| 126 | + </section> |
223 | 127 |
|
224 | | -{{ end }} |
| 128 | + <section class="cta section"> |
| 129 | + <div class="cta-card"> |
| 130 | + <span class="section-label">Get started</span> |
| 131 | + <h2>Bring evaluation into your agent development loop.</h2> |
| 132 | + <p>Install AgentEvals, connect your traces, and start measuring how your agent behaves in the real world.</p> |
| 133 | + <div class="hero-cta"> |
| 134 | + <a href="/docs/quick-start/" class="btn btn-primary">Read the docs</a> |
| 135 | + <a href="https://github.com/agentevals-dev/agentevals" class="btn btn-secondary" target="_blank" rel="noopener">View on GitHub</a> |
| 136 | + </div> |
| 137 | + </div> |
| 138 | + </section> |
| 139 | + </main> |
| 140 | +</body> |
| 141 | +</html> |