Commit 40f92f6

mayurpise and claude committed
style(blog): add chart visuals to the graph-engine benchmark post
Rework the benchmark data tables into styled in-page charts (supporting CSS in blog.css). Content unchanged and corpora remain anonymized.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent b2b239a commit 40f92f6

2 files changed

Lines changed: 231 additions & 45 deletions

web/blog/blog.css

Lines changed: 119 additions & 0 deletions
@@ -324,6 +324,125 @@
     margin: 2.75rem 0;
 }
 
+/* ─── BAR CHARTS (in-post data viz) ───────────────────── */
+
+.chart {
+    margin: 0 0 1.75rem;
+    border: 1px solid var(--border);
+    border-radius: 12px;
+    background: var(--bg-secondary);
+    padding: 1.4rem 1.5rem 1.5rem;
+}
+
+.chart-title {
+    font-family: 'Space Grotesk', sans-serif;
+    font-size: 0.95rem;
+    font-weight: 600;
+    color: var(--text-primary);
+    margin: 0 0 0.35rem;
+    letter-spacing: -0.01em;
+}
+
+.chart-note {
+    font-size: 0.82rem;
+    line-height: 1.5;
+    color: var(--text-muted);
+    margin: 0 0 1.15rem;
+}
+
+.chart-legend {
+    display: flex;
+    gap: 1.1rem;
+    flex-wrap: wrap;
+    margin: 0 0 1.15rem;
+    font-family: 'JetBrains Mono', monospace;
+    font-size: 0.74rem;
+    text-transform: uppercase;
+    letter-spacing: 0.06em;
+    color: var(--text-muted);
+}
+
+.chart-legend span {
+    display: inline-flex;
+    align-items: center;
+    gap: 0.4rem;
+}
+
+.chart-legend span::before {
+    content: '';
+    width: 11px;
+    height: 11px;
+    border-radius: 3px;
+    background: var(--swatch, var(--text-muted));
+}
+
+.chart-legend .lg-engine { --swatch: var(--accent-emerald); }
+.chart-legend .lg-grep { --swatch: var(--accent-rose); }
+.chart-legend .lg-neutral { --swatch: var(--accent-cyan); }
+
+/* A row = one labelled group of bars */
+.chart-row {
+    margin-bottom: 1.15rem;
+}
+
+.chart-row:last-child { margin-bottom: 0; }
+
+.chart-row-label {
+    font-size: 0.88rem;
+    color: var(--text-secondary);
+    margin-bottom: 0.45rem;
+    display: flex;
+    justify-content: space-between;
+    align-items: baseline;
+    gap: 1rem;
+}
+
+.chart-row-label .ratio {
+    font-family: 'JetBrains Mono', monospace;
+    font-weight: 500;
+    color: var(--accent-emerald);
+    white-space: nowrap;
+}
+
+.bar {
+    position: relative;
+    height: 1.45rem;
+    border-radius: 5px;
+    margin-bottom: 0.3rem;
+    min-width: 2px;
+    display: flex;
+    align-items: center;
+    transition: width 0.5s var(--ease-out);
+    background: var(--bar-color, var(--accent-cyan));
+}
+
+.bar:last-child { margin-bottom: 0; }
+
+.bar.engine { background: var(--accent-emerald); }
+.bar.grep { background: var(--accent-rose); }
+.bar.neutral { background: var(--accent-cyan); }
+
+.bar-value {
+    font-family: 'JetBrains Mono', monospace;
+    font-size: 0.74rem;
+    font-weight: 500;
+    color: #fff;
+    padding: 0 0.55rem;
+    white-space: nowrap;
+}
+
+/* When a bar is too short to hold its label, the label sits outside */
+.bar.tiny .bar-value {
+    color: var(--text-secondary);
+    position: absolute;
+    left: calc(100% + 0.4rem);
+}
+
+@media (max-width: 600px) {
+    .chart { padding: 1.1rem 1.1rem 1.2rem; }
+    .chart-row-label { flex-direction: column; gap: 0.15rem; }
+}
+
 /* ─── SHARE BAR ───────────────────────────────────────── */
 
 .post-share {

web/blog/graph-engine-vs-grep/index.html

Lines changed: 112 additions & 45 deletions
@@ -120,61 +120,128 @@ <h2>What We Measured</h2>
 
 <h2>Latency: grep Often Wins, and It Doesn't Matter</h2>
 <p>Five canonical tasks, engine vs grep, on the large polyglot repo:</p>
-<table>
-  <thead>
-    <tr><th>Task</th><th>Engine</th><th>grep</th><th>Winner (speed)</th><th>Accuracy</th></tr>
-  </thead>
-  <tbody>
-    <tr><td>Module inventory</td><td>170 ms</td><td>1 ms</td><td>grep 170&times;</td><td>engine adds edges</td></tr>
-    <tr><td>Hotspot top-10 (fan-in)</td><td>1050 ms</td><td>30 ms</td><td>grep 35&times;</td><td><strong>engine</strong></td></tr>
-    <tr><td>Callers of a function</td><td>310 ms</td><td>20 ms</td><td>grep 16&times;</td><td><strong>engine</strong></td></tr>
-    <tr><td>Dependency cycles</td><td>1020 ms</td><td>&mdash;</td><td>engine only</td><td><strong>engine only</strong></td></tr>
-    <tr><td>Impact / blast radius</td><td>40 ms</td><td>20 ms</td><td>grep 2&times;</td><td><strong>engine only</strong></td></tr>
-  </tbody>
-</table>
+<figure class="chart" role="group" aria-label="Latency per task, engine versus grep, large polyglot repo">
+  <figcaption class="chart-title">Latency per task &mdash; large polyglot repo</figcaption>
+  <p class="chart-note">Wall-clock, median of 5 runs (lower is better). grep wins four of five &mdash; and both sit far under a 3&ndash;30&nbsp;s model turn. <em>Accuracy winner noted per row.</em></p>
+  <div class="chart-legend">
+    <span class="lg-engine">Engine</span>
+    <span class="lg-neutral">grep</span>
+  </div>
+
+  <div class="chart-row">
+    <div class="chart-row-label">Module inventory <span class="ratio">grep 170&times; &middot; engine adds edges</span></div>
+    <div class="bar engine" style="width:16.2%"><span class="bar-value">170 ms</span></div>
+    <div class="bar neutral tiny" style="width:0.1%"><span class="bar-value">1 ms</span></div>
+  </div>
+  <div class="chart-row">
+    <div class="chart-row-label">Hotspot top-10 (fan-in) <span class="ratio">grep 35&times; &middot; engine accurate</span></div>
+    <div class="bar engine" style="width:100%"><span class="bar-value">1050 ms</span></div>
+    <div class="bar neutral tiny" style="width:2.9%"><span class="bar-value">30 ms</span></div>
+  </div>
+  <div class="chart-row">
+    <div class="chart-row-label">Callers of a function <span class="ratio">grep 16&times; &middot; engine accurate</span></div>
+    <div class="bar engine" style="width:29.5%"><span class="bar-value">310 ms</span></div>
+    <div class="bar neutral tiny" style="width:1.9%"><span class="bar-value">20 ms</span></div>
+  </div>
+  <div class="chart-row">
+    <div class="chart-row-label">Dependency cycles <span class="ratio">engine only</span></div>
+    <div class="bar engine" style="width:97.1%"><span class="bar-value">1020 ms</span></div>
+    <div class="bar neutral tiny" style="width:0.1%"><span class="bar-value">grep: n/a</span></div>
+  </div>
+  <div class="chart-row">
+    <div class="chart-row-label">Impact / blast radius <span class="ratio">grep 2&times; &middot; engine only</span></div>
+    <div class="bar engine tiny" style="width:3.8%"><span class="bar-value">40 ms</span></div>
+    <div class="bar neutral tiny" style="width:1.9%"><span class="bar-value">20 ms</span></div>
+  </div>
+</figure>
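Review note: the inline `width` values in the latency chart look like each latency divided by the slowest bar (1050 ms). The commit does not say how they were derived, so the rule below is inferred from the numbers, and `linear_width` is a hypothetical helper name. A minimal Python sketch that reproduces the percentages under that assumption:

```python
# Latencies in ms (engine, grep) copied from the latency chart rows.
# grep has no value for dependency cycles, so it is recorded as None.
LATENCY_MS = {
    "Module inventory": (170, 1),
    "Hotspot top-10 (fan-in)": (1050, 30),
    "Callers of a function": (310, 20),
    "Dependency cycles": (1020, None),
    "Impact / blast radius": (40, 20),
}


def linear_width(value_ms, longest_ms):
    """Bar width in percent on a linear scale where the longest bar is 100%.

    Assumed rule (not stated in the commit): value / longest, one decimal place.
    """
    return round(100 * value_ms / longest_ms, 1)


# The slowest measurement defines the full-width bar.
longest = max(v for pair in LATENCY_MS.values() for v in pair if v is not None)

widths = {
    task: tuple(None if v is None else linear_width(v, longest) for v in pair)
    for task, pair in LATENCY_MS.items()
}

for task, (engine, grep) in widths.items():
    print(f"{task}: engine {engine}%, grep {grep}%")
```

On a linear scale the 1 ms bar comes out at 0.1% of the track, which is presumably why `.bar` sets `min-width: 2px` and the short bars carry the `tiny` class.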
 <p>grep is faster on four of five tasks, sometimes by a lot. We are not going to pretend otherwise. But both interfaces are sub-second, and a single agent turn &mdash; one model inference &mdash; takes 3 to 30 seconds. Shaving 250 ms off a tool call you make a handful of times per turn is invisible. The engine's latency is also an <em>upper bound</em>: each call pays CLI startup plus a stateless query against the on-disk graph. A persistent MCP server amortizes that away. <strong>Latency is the axis that's easy to measure and irrelevant to optimize.</strong> The decisive axes are the next two.</p>
 
 <h2>Accuracy: grep Hands Your Agent Noise</h2>
 <p>This is where text search quietly fails. grep matches <em>characters</em>; it has no concept of a function, a call edge, or a module boundary. On real code, that produces both false positives (it matches things that aren't what you asked for) and false negatives (it misses things that are).</p>
-<table>
-  <thead>
-    <tr><th>Metric (large repo)</th><th>Graph engine</th><th>grep baseline</th><th>Gap</th></tr>
-  </thead>
-  <tbody>
-    <tr><td>Hotspot top-10 precision</td><td><strong>100%</strong> (10/10 real)</td><td>10% &mdash; 9/10 are vars/keywords (<code>is</code>, <code>in</code>, <code>config</code>, <code>len</code>, <code>status</code>&hellip;)</td><td>10&times;</td></tr>
-    <tr><td>Hotspot top-10 recall</td><td><strong>100%</strong></td><td>12% &mdash; missed nearly every real hotspot</td><td>8&times;</td></tr>
-    <tr><td>Callers of a function</td><td><strong>80</strong> precise call edges, test/prod-classified</td><td>202 word-matches (180 in tests, 8 definitions, 22 unrelated)</td><td>recall + precision tax</td></tr>
-    <tr><td>Dependency cycles</td><td>SCCs computed</td><td><strong>0</strong> &mdash; not expressible in grep</td><td>&infin;</td></tr>
-    <tr><td>Blast radius of a change</td><td><strong>44</strong>-node transitive closure</td><td>490 flat hits, no transitivity</td><td>no edges</td></tr>
-  </tbody>
-</table>
+<figure class="chart" role="group" aria-label="Hotspot ranking precision and recall, engine versus grep, large repo">
+  <figcaption class="chart-title">Hotspot top-10 &mdash; precision &amp; recall, large repo</figcaption>
+  <p class="chart-note">Higher is better. grep ranks variable names and keywords (<code>is</code>, <code>in</code>, <code>config</code>, <code>len</code>, <code>status</code>&hellip;) as if they were functions, so it scores near-zero on both.</p>
+  <div class="chart-legend">
+    <span class="lg-engine">Graph engine</span>
+    <span class="lg-grep">grep baseline</span>
+  </div>
+
+  <div class="chart-row">
+    <div class="chart-row-label">Precision <span class="ratio">10&times; gap</span></div>
+    <div class="bar engine" style="width:100%"><span class="bar-value">100% &middot; 10/10 real</span></div>
+    <div class="bar grep tiny" style="width:10%"><span class="bar-value">10% &middot; 9/10 noise</span></div>
+  </div>
+  <div class="chart-row">
+    <div class="chart-row-label">Recall <span class="ratio">8&times; gap</span></div>
+    <div class="bar engine" style="width:100%"><span class="bar-value">100%</span></div>
+    <div class="bar grep tiny" style="width:12%"><span class="bar-value">12% &middot; missed nearly all</span></div>
+  </div>
+</figure>
+<p>The structural queries grep cannot express in percentages at all:</p>
+<ul>
+  <li><strong>Callers of a function</strong> &mdash; engine: <strong>80</strong> precise call edges, test/prod-classified. grep: 202 word-matches (180 in tests, 8 definitions, 22 unrelated).</li>
+  <li><strong>Dependency cycles</strong> &mdash; engine: SCCs computed. grep: <strong>0</strong>, not expressible in text search (&infin; gap).</li>
+  <li><strong>Blast radius of a change</strong> &mdash; engine: <strong>44</strong>-node transitive closure. grep: 490 flat hits, no transitivity, no edges.</li>
+</ul>
 <p><strong>The recall trap is the killer.</strong> Ask grep for the callers of <code>submit_order</code> and <code>grep -wn submit_order</code> returns 202 lines &mdash; mostly tests and definitions &mdash; and <em>zero</em> of the three real production call sites. Why? Because in production the function is invoked as a method: <code>self._order_manager.submit_order(...)</code>. A search for <code>submit_order(</code> never sees <code>.submit_order(</code>. The agent now has to refine its regex <em>and</em> read source to recover the calls grep missed, all while wading through 180 test matches it didn't want. The engine returns all 80 call edges, each attributed to an exact <code>file:function</code> and classified test-vs-prod, in one call. One is a confident answer; the other is a research project.</p>
 
 <h2>Tokens and Cost: The Bill You Actually Pay</h2>
 <p>Accuracy and cost are the same story told twice. To turn grep's noisy output into a <em>correct</em> answer, the agent has to read the files those matches point into &mdash; to tell a definition from a test from a string from a qualified call. The engine's structured output needs no follow-up reads. Here is what each path costs in tokens to reach the correct answer, on both repos:</p>
-<table>
-  <thead>
-    <tr><th>Scenario</th><th>Engine tokens</th><th>grep tokens</th><th>Ratio</th></tr>
-  </thead>
-  <tbody>
-    <tr><td>Callers of one function &mdash; small Rust repo</td><td>2,424</td><td>66,534</td><td><strong>27&times;</strong></td></tr>
-    <tr><td>Callers of one function &mdash; large polyglot repo</td><td>2,669</td><td>289,096</td><td><strong>108&times;</strong></td></tr>
-    <tr><td>Full architecture pass &mdash; small Rust repo</td><td>24,218</td><td>141,297</td><td><strong>5.8&times;</strong></td></tr>
-    <tr><td>Full architecture pass &mdash; large polyglot repo</td><td>48,195</td><td>1,859,804</td><td><strong>39&times;</strong></td></tr>
-  </tbody>
-</table>
+<figure class="chart" role="group" aria-label="Tokens to reach the correct answer, engine versus grep">
+  <figcaption class="chart-title">Tokens to reach the <em>correct</em> answer</figcaption>
+  <p class="chart-note">Lower is better. Bar lengths are on a log scale (the raw gap spans ~770&times;); the ratio on each row is the real linear multiple. Counted with <code>tiktoken o200k_base</code>.</p>
+  <div class="chart-legend">
+    <span class="lg-engine">Engine</span>
+    <span class="lg-grep">grep</span>
+  </div>
+
+  <div class="chart-row">
+    <div class="chart-row-label">Callers of one function &mdash; small Rust repo <span class="ratio">27&times;</span></div>
+    <div class="bar engine tiny" style="width:12%"><span class="bar-value">2,424</span></div>
+    <div class="bar grep" style="width:56%"><span class="bar-value">66,534</span></div>
+  </div>
+  <div class="chart-row">
+    <div class="chart-row-label">Callers of one function &mdash; large polyglot repo <span class="ratio">108&times;</span></div>
+    <div class="bar engine tiny" style="width:13%"><span class="bar-value">2,669</span></div>
+    <div class="bar grep" style="width:75%"><span class="bar-value">289,096</span></div>
+  </div>
+  <div class="chart-row">
+    <div class="chart-row-label">Full architecture pass &mdash; small Rust repo <span class="ratio">5.8&times;</span></div>
+    <div class="bar engine" style="width:42%"><span class="bar-value">24,218</span></div>
+    <div class="bar grep" style="width:66%"><span class="bar-value">141,297</span></div>
+  </div>
+  <div class="chart-row">
+    <div class="chart-row-label">Full architecture pass &mdash; large polyglot repo <span class="ratio">39&times;</span></div>
+    <div class="bar engine" style="width:51%"><span class="bar-value">48,195</span></div>
+    <div class="bar grep" style="width:100%"><span class="bar-value">1,859,804</span></div>
+  </div>
+</figure>
 <p>Notice the direction: on the small repo the caller query is 27&times; cheaper through the engine; on the large repo it's <strong>108&times;</strong>. grep's cost scales with how much noise it produces, which scales with the size of the codebase. The engine's output stays bounded. <strong>The bigger your repo, the more the engine saves &mdash; the opposite of how text search degrades.</strong></p>
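Review note: the token chart says its bars are on a log scale while the per-row ratios are linear. The commit does not give the mapping. The widths are consistent with interpolating `log10(tokens)` so that the smallest count lands at 12% and the largest at 100%, but that 12% floor is an inference from the numbers, and `log_width` is a hypothetical helper. A sketch that checks both the widths and the ratios under that assumption:

```python
import math

# Token counts (engine, grep) copied from the chart rows.
TOKENS = {
    "callers, small Rust repo": (2_424, 66_534),
    "callers, large polyglot repo": (2_669, 289_096),
    "architecture pass, small Rust repo": (24_218, 141_297),
    "architecture pass, large polyglot repo": (48_195, 1_859_804),
}

# Widths in percent (engine, grep) as written in the commit's inline styles.
COMMIT_WIDTHS = {
    "callers, small Rust repo": (12, 56),
    "callers, large polyglot repo": (13, 75),
    "architecture pass, small Rust repo": (42, 66),
    "architecture pass, large polyglot repo": (51, 100),
}


def log_width(value, lo, hi, floor_pct=12.0):
    """Map value onto [floor_pct, 100] percent, linear in log10(value).

    floor_pct=12 is an assumption chosen to match the smallest bar.
    """
    span = math.log10(hi) - math.log10(lo)
    t = (math.log10(value) - math.log10(lo)) / span
    return floor_pct + (100.0 - floor_pct) * t


counts = [v for pair in TOKENS.values() for v in pair]
lo, hi = min(counts), max(counts)

for name, (engine, grep) in TOKENS.items():
    w_engine = log_width(engine, lo, hi)
    w_grep = log_width(grep, lo, hi)
    ratio = grep / engine  # the linear multiple shown on each row
    print(f"{name}: engine {w_engine:.1f}%, grep {w_grep:.1f}%, ratio {ratio:.1f}x")

print(f"raw span: {hi / lo:.0f}x")
```

Under this mapping every computed width falls within about one percentage point of the value in the commit, so the hand-rounded figures look internally consistent with the "~770×" span quoted in the chart note.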
 <p>Translate tokens to dollars at current Claude input prices (Opus 4.8 $5/MTok, Sonnet 4.6 $3/MTok, Haiku 4.5 $1/MTok) and the per-query gap looks small &mdash; fractions of a cent vs a few cents. It stops looking small at fleet scale. Here is one illustrative workload &mdash; a team running 1,000 caller-disambiguation queries a day &mdash; on the large repo:</p>
-<table>
-  <thead>
-    <tr><th>Model</th><th>Engine $/yr</th><th>grep $/yr</th><th>Avoidable spend/yr</th></tr>
-  </thead>
-  <tbody>
-    <tr><td>Opus 4.8</td><td>$4.9k</td><td>$528k</td><td><strong>$523k</strong></td></tr>
-    <tr><td>Sonnet 4.6</td><td>$2.9k</td><td>$317k</td><td><strong>$314k</strong></td></tr>
-    <tr><td>Haiku 4.5</td><td>$1.0k</td><td>$106k</td><td><strong>$105k</strong></td></tr>
-  </tbody>
-</table>
+<figure class="chart" role="group" aria-label="Annual cost, engine versus grep, 1000 caller queries per day on the large repo">
+  <figcaption class="chart-title">Annual cost &mdash; 1,000 caller queries/day, large repo</figcaption>
+  <p class="chart-note">Lower is better. Bars are linear; the bold figure is avoidable spend per year. Illustrative volume &mdash; the <em>ratio</em> is the property that holds.</p>
+  <div class="chart-legend">
+    <span class="lg-engine">Engine $/yr</span>
+    <span class="lg-grep">grep $/yr</span>
+  </div>
+
+  <div class="chart-row">
+    <div class="chart-row-label">Opus 4.8 <span class="ratio">$523k avoidable</span></div>
+    <div class="bar engine tiny" style="width:0.9%"><span class="bar-value">$4.9k</span></div>
+    <div class="bar grep" style="width:100%"><span class="bar-value">$528k</span></div>
+  </div>
+  <div class="chart-row">
+    <div class="chart-row-label">Sonnet 4.6 <span class="ratio">$314k avoidable</span></div>
+    <div class="bar engine tiny" style="width:0.55%"><span class="bar-value">$2.9k</span></div>
+    <div class="bar grep" style="width:60%"><span class="bar-value">$317k</span></div>
+  </div>
+  <div class="chart-row">
+    <div class="chart-row-label">Haiku 4.5 <span class="ratio">$105k avoidable</span></div>
+    <div class="bar engine tiny" style="width:0.2%"><span class="bar-value">$1.0k</span></div>
+    <div class="bar grep" style="width:20.1%"><span class="bar-value">$106k</span></div>
+  </div>
+</figure>
 <p>The exact dollars depend on your volume and model mix &mdash; treat them as illustrative. The <em>ratio</em> doesn't: it's a property of how much each path makes the agent read, and it's tokenizer-independent.</p>
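Review note: the annual figures in the cost chart can be reproduced from the large-repo caller-query token counts (2,669 and 289,096) and the input prices quoted in the post. The sketch below assumes a 365-day year and that only input tokens are billed, neither of which the post states explicitly; `annual_cost` is a hypothetical helper.

```python
# Input prices in USD per million tokens, as quoted in the post.
PRICE_PER_MTOK = {"Opus 4.8": 5.0, "Sonnet 4.6": 3.0, "Haiku 4.5": 1.0}

# Tokens per caller query on the large polyglot repo, from the token chart.
ENGINE_TOKENS = 2_669
GREP_TOKENS = 289_096

QUERIES_PER_DAY = 1_000  # the post's illustrative workload
DAYS_PER_YEAR = 365      # assumption: queries run every day of the year


def annual_cost(tokens_per_query, usd_per_mtok):
    """Yearly input-token spend in USD for one model."""
    tokens_per_year = tokens_per_query * QUERIES_PER_DAY * DAYS_PER_YEAR
    return tokens_per_year / 1_000_000 * usd_per_mtok


costs = {}
for model, price in PRICE_PER_MTOK.items():
    engine = annual_cost(ENGINE_TOKENS, price)
    grep = annual_cost(GREP_TOKENS, price)
    costs[model] = (engine, grep, grep - engine)
    print(
        f"{model}: engine ${engine / 1000:.1f}k, grep ${grep / 1000:.0f}k, "
        f"avoidable ${(grep - engine) / 1000:.0f}k, ratio {grep / engine:.0f}x"
    )
```

The price cancels out of `grep / engine`, so the ratio is the same for every model, which matches the post's point that the dollars are illustrative while the ratio is not.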
 
 <h2>The Scaling Wall: When grep Stops Fitting at All</h2>
