style(blog): add chart visuals to the graph-engine benchmark post

mayurpise · claude · mayurpise · commit 40f92f642bd0 · 2026-06-16T01:54:35.000-07:00
Rework the benchmark data tables into styled in-page charts (supporting
CSS in blog.css). Content unchanged and corpora remain anonymized.
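
For reference, the inline `width` values on the linear latency chart appear to be each value's share of the chart maximum, rounded to one decimal place. A minimal sketch of that arithmetic, assuming that rule holds (the helper `bar_width_pct` is hypothetical and not part of the repo; the token chart uses a log scale and is not covered here):

```python
# Sketch only: assumes a linear scale where the largest value in the
# chart maps to 100% and every width is rounded to one decimal place.
# bar_width_pct is a hypothetical helper, not code from this repo.

def bar_width_pct(value: float, chart_max: float) -> float:
    """Return a CSS width percentage for one bar on a linear scale."""
    if chart_max <= 0:
        raise ValueError("chart_max must be positive")
    return round(value / chart_max * 100, 1)


# Latencies in milliseconds, taken from the latency chart in this commit.
# Dependency cycles has no grep measurement, so only the engine value is listed.
latencies_ms = {
    "Module inventory": [170, 1],
    "Hotspot top-10 (fan-in)": [1050, 30],
    "Callers of a function": [310, 20],
    "Dependency cycles": [1020],
    "Impact / blast radius": [40, 20],
}

# The slowest measurement sets the 100% mark for the whole chart.
chart_max = max(v for values in latencies_ms.values() for v in values)

widths = {
    task: [bar_width_pct(v, chart_max) for v in values]
    for task, values in latencies_ms.items()
}

for task, task_widths in widths.items():
    print(task, [f"width:{w}%" for w in task_widths])
```

Under that assumption the computed widths agree with the values hard-coded in the latency chart markup, which makes them easy to regenerate if the measurements change.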

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/web/blog/blog.css b/web/blog/blog.css
@@ -324,6 +324,125 @@
     margin: 2.75rem 0;
 }
 
+/* ─── BAR CHARTS (in-post data viz) ───────────────────── */
+
+.chart {
+    margin: 0 0 1.75rem;
+    border: 1px solid var(--border);
+    border-radius: 12px;
+    background: var(--bg-secondary);
+    padding: 1.4rem 1.5rem 1.5rem;
+}
+
+.chart-title {
+    font-family: 'Space Grotesk', sans-serif;
+    font-size: 0.95rem;
+    font-weight: 600;
+    color: var(--text-primary);
+    margin: 0 0 0.35rem;
+    letter-spacing: -0.01em;
+}
+
+.chart-note {
+    font-size: 0.82rem;
+    line-height: 1.5;
+    color: var(--text-muted);
+    margin: 0 0 1.15rem;
+}
+
+.chart-legend {
+    display: flex;
+    gap: 1.1rem;
+    flex-wrap: wrap;
+    margin: 0 0 1.15rem;
+    font-family: 'JetBrains Mono', monospace;
+    font-size: 0.74rem;
+    text-transform: uppercase;
+    letter-spacing: 0.06em;
+    color: var(--text-muted);
+}
+
+.chart-legend span {
+    display: inline-flex;
+    align-items: center;
+    gap: 0.4rem;
+}
+
+.chart-legend span::before {
+    content: '';
+    width: 11px;
+    height: 11px;
+    border-radius: 3px;
+    background: var(--swatch, var(--text-muted));
+}
+
+.chart-legend .lg-engine { --swatch: var(--accent-emerald); }
+.chart-legend .lg-grep   { --swatch: var(--accent-rose); }
+.chart-legend .lg-neutral { --swatch: var(--accent-cyan); }
+
+/* A row = one labelled group of bars */
+.chart-row {
+    margin-bottom: 1.15rem;
+}
+
+.chart-row:last-child { margin-bottom: 0; }
+
+.chart-row-label {
+    font-size: 0.88rem;
+    color: var(--text-secondary);
+    margin-bottom: 0.45rem;
+    display: flex;
+    justify-content: space-between;
+    align-items: baseline;
+    gap: 1rem;
+}
+
+.chart-row-label .ratio {
+    font-family: 'JetBrains Mono', monospace;
+    font-weight: 500;
+    color: var(--accent-emerald);
+    white-space: nowrap;
+}
+
+.bar {
+    position: relative;
+    height: 1.45rem;
+    border-radius: 5px;
+    margin-bottom: 0.3rem;
+    min-width: 2px;
+    display: flex;
+    align-items: center;
+    transition: width 0.5s var(--ease-out);
+    background: var(--bar-color, var(--accent-cyan));
+}
+
+.bar:last-child { margin-bottom: 0; }
+
+.bar.engine { background: var(--accent-emerald); }
+.bar.grep   { background: var(--accent-rose); }
+.bar.neutral { background: var(--accent-cyan); }
+
+.bar-value {
+    font-family: 'JetBrains Mono', monospace;
+    font-size: 0.74rem;
+    font-weight: 500;
+    color: #fff;
+    padding: 0 0.55rem;
+    white-space: nowrap;
+}
+
+/* When a bar is too short to hold its label, the label sits outside */
+.bar.tiny .bar-value {
+    color: var(--text-secondary);
+    position: absolute;
+    left: calc(100% + 0.4rem);
+}
+
+@media (max-width: 600px) {
+    .chart { padding: 1.1rem 1.1rem 1.2rem; }
+    .chart-row-label { flex-direction: column; gap: 0.15rem; }
+}
+
 /* ─── SHARE BAR ───────────────────────────────────────── */
 
 .post-share {
diff --git a/web/blog/graph-engine-vs-grep/index.html b/web/blog/graph-engine-vs-grep/index.html
@@ -120,61 +120,128 @@ <h2>What We Measured</h2>
 
                 <h2>Latency: grep Often Wins, and It Doesn't Matter</h2>
                 <p>Five canonical tasks, engine vs grep, on the large polyglot repo:</p>
-                <table>
-                    <thead>
-                        <tr><th>Task</th><th>Engine</th><th>grep</th><th>Winner (speed)</th><th>Accuracy</th></tr>
-                    </thead>
-                    <tbody>
-                        <tr><td>Module inventory</td><td>170 ms</td><td>1 ms</td><td>grep 170&times;</td><td>engine adds edges</td></tr>
-                        <tr><td>Hotspot top-10 (fan-in)</td><td>1050 ms</td><td>30 ms</td><td>grep 35&times;</td><td><strong>engine</strong></td></tr>
-                        <tr><td>Callers of a function</td><td>310 ms</td><td>20 ms</td><td>grep 16&times;</td><td><strong>engine</strong></td></tr>
-                        <tr><td>Dependency cycles</td><td>1020 ms</td><td>&mdash;</td><td>engine only</td><td><strong>engine only</strong></td></tr>
-                        <tr><td>Impact / blast radius</td><td>40 ms</td><td>20 ms</td><td>grep 2&times;</td><td><strong>engine only</strong></td></tr>
-                    </tbody>
-                </table>
+                <figure class="chart" role="group" aria-label="Latency per task, engine versus grep, large polyglot repo">
+                    <figcaption class="chart-title">Latency per task &mdash; large polyglot repo</figcaption>
+                    <p class="chart-note">Wall-clock, median of 5 runs (lower is better). grep wins four of five &mdash; and both sit far under a 3&ndash;30&nbsp;s model turn. <em>Accuracy winner noted per row.</em></p>
+                    <div class="chart-legend">
+                        <span class="lg-engine">Engine</span>
+                        <span class="lg-neutral">grep</span>
+                    </div>
+
+                    <div class="chart-row">
+                        <div class="chart-row-label">Module inventory <span class="ratio">grep 170&times; &middot; engine adds edges</span></div>
+                        <div class="bar engine" style="width:16.2%"><span class="bar-value">170 ms</span></div>
+                        <div class="bar neutral tiny" style="width:0.1%"><span class="bar-value">1 ms</span></div>
+                    </div>
+                    <div class="chart-row">
+                        <div class="chart-row-label">Hotspot top-10 (fan-in) <span class="ratio">grep 35&times; &middot; engine accurate</span></div>
+                        <div class="bar engine" style="width:100%"><span class="bar-value">1050 ms</span></div>
+                        <div class="bar neutral tiny" style="width:2.9%"><span class="bar-value">30 ms</span></div>
+                    </div>
+                    <div class="chart-row">
+                        <div class="chart-row-label">Callers of a function <span class="ratio">grep 16&times; &middot; engine accurate</span></div>
+                        <div class="bar engine" style="width:29.5%"><span class="bar-value">310 ms</span></div>
+                        <div class="bar neutral tiny" style="width:1.9%"><span class="bar-value">20 ms</span></div>
+                    </div>
+                    <div class="chart-row">
+                        <div class="chart-row-label">Dependency cycles <span class="ratio">engine only</span></div>
+                        <div class="bar engine" style="width:97.1%"><span class="bar-value">1020 ms</span></div>
+                        <div class="bar neutral tiny" style="width:0.1%"><span class="bar-value">grep: n/a</span></div>
+                    </div>
+                    <div class="chart-row">
+                        <div class="chart-row-label">Impact / blast radius <span class="ratio">grep 2&times; &middot; engine only</span></div>
+                        <div class="bar engine tiny" style="width:3.8%"><span class="bar-value">40 ms</span></div>
+                        <div class="bar neutral tiny" style="width:1.9%"><span class="bar-value">20 ms</span></div>
+                    </div>
+                </figure>
                 <p>grep is faster on four of five tasks, sometimes by a lot. We are not going to pretend otherwise. But both interfaces are sub-second, and a single agent turn &mdash; one model inference &mdash; takes 3 to 30 seconds. Shaving 250 ms off a tool call you make a handful of times per turn is invisible. The engine's latency is also an <em>upper bound</em>: each call pays CLI startup plus a stateless query against the on-disk graph. A persistent MCP server amortizes that away. <strong>Latency is the axis that's easy to measure and irrelevant to optimize.</strong> The decisive axes are the next two.</p>
 
                 <h2>Accuracy: grep Hands Your Agent Noise</h2>
                 <p>This is where text search quietly fails. grep matches <em>characters</em>; it has no concept of a function, a call edge, or a module boundary. On real code, that produces both false positives (it matches things that aren't what you asked for) and false negatives (it misses things that are).</p>
-                <table>
-                    <thead>
-                        <tr><th>Metric (large repo)</th><th>Graph engine</th><th>grep baseline</th><th>Gap</th></tr>
-                    </thead>
-                    <tbody>
-                        <tr><td>Hotspot top-10 precision</td><td><strong>100%</strong> (10/10 real)</td><td>10% &mdash; 9/10 are vars/keywords (<code>is</code>, <code>in</code>, <code>config</code>, <code>len</code>, <code>status</code>&hellip;)</td><td>10&times;</td></tr>
-                        <tr><td>Hotspot top-10 recall</td><td><strong>100%</strong></td><td>12% &mdash; missed nearly every real hotspot</td><td>8&times;</td></tr>
-                        <tr><td>Callers of a function</td><td><strong>80</strong> precise call edges, test/prod-classified</td><td>202 word-matches (180 in tests, 8 definitions, 22 unrelated)</td><td>recall + precision tax</td></tr>
-                        <tr><td>Dependency cycles</td><td>SCCs computed</td><td><strong>0</strong> &mdash; not expressible in grep</td><td>&infin;</td></tr>
-                        <tr><td>Blast radius of a change</td><td><strong>44</strong>-node transitive closure</td><td>490 flat hits, no transitivity</td><td>no edges</td></tr>
-                    </tbody>
-                </table>
+                <figure class="chart" role="group" aria-label="Hotspot ranking precision and recall, engine versus grep, large repo">
+                    <figcaption class="chart-title">Hotspot top-10 &mdash; precision &amp; recall, large repo</figcaption>
+                    <p class="chart-note">Higher is better. grep ranks variable names and keywords (<code>is</code>, <code>in</code>, <code>config</code>, <code>len</code>, <code>status</code>&hellip;) as if they were functions, so it scores only 10% on precision and 12% on recall.</p>
+                    <div class="chart-legend">
+                        <span class="lg-engine">Graph engine</span>
+                        <span class="lg-grep">grep baseline</span>
+                    </div>
+
+                    <div class="chart-row">
+                        <div class="chart-row-label">Precision <span class="ratio">10&times; gap</span></div>
+                        <div class="bar engine" style="width:100%"><span class="bar-value">100% &middot; 10/10 real</span></div>
+                        <div class="bar grep tiny" style="width:10%"><span class="bar-value">10% &middot; 9/10 noise</span></div>
+                    </div>
+                    <div class="chart-row">
+                        <div class="chart-row-label">Recall <span class="ratio">8&times; gap</span></div>
+                        <div class="bar engine" style="width:100%"><span class="bar-value">100%</span></div>
+                        <div class="bar grep tiny" style="width:12%"><span class="bar-value">12% &middot; missed nearly all</span></div>
+                    </div>
+                </figure>
+                <p>The remaining structural queries don't reduce to a percentage, so here are the raw counts:</p>
+                <ul>
+                    <li><strong>Callers of a function</strong> &mdash; engine: <strong>80</strong> precise call edges, test/prod-classified. grep: 202 word-matches (180 in tests, 8 definitions, 22 unrelated).</li>
+                    <li><strong>Dependency cycles</strong> &mdash; engine: SCCs computed. grep: <strong>0</strong>, not expressible in text search (&infin; gap).</li>
+                    <li><strong>Blast radius of a change</strong> &mdash; engine: <strong>44</strong>-node transitive closure. grep: 490 flat hits, no transitivity, no edges.</li>
+                </ul>
                 <p><strong>The recall trap is the killer.</strong> Ask grep for the callers of <code>submit_order</code> and <code>grep -wn submit_order</code> returns 202 lines &mdash; mostly tests and definitions &mdash; and <em>zero</em> of the three real production call sites. Why? Because in production the function is invoked as a method: <code>self._order_manager.submit_order(...)</code>. A search for <code>submit_order(</code> never sees <code>.submit_order(</code>. The agent now has to refine its regex <em>and</em> read source to recover the calls grep missed, all while wading through 180 test matches it didn't want. The engine returns all 80 call edges, each attributed to an exact <code>file:function</code> and classified test-vs-prod, in one call. One is a confident answer; the other is a research project.</p>
 
                 <h2>Tokens and Cost: The Bill You Actually Pay</h2>
                 <p>Accuracy and cost are the same story told twice. To turn grep's noisy output into a <em>correct</em> answer, the agent has to read the files those matches point into &mdash; to tell a definition from a test from a string from a qualified call. The engine's structured output needs no follow-up reads. Here is what each path costs in tokens to reach the correct answer, on both repos:</p>
-                <table>
-                    <thead>
-                        <tr><th>Scenario</th><th>Engine tokens</th><th>grep tokens</th><th>Ratio</th></tr>
-                    </thead>
-                    <tbody>
-                        <tr><td>Callers of one function &mdash; small Rust repo</td><td>2,424</td><td>66,534</td><td><strong>27&times;</strong></td></tr>
-                        <tr><td>Callers of one function &mdash; large polyglot repo</td><td>2,669</td><td>289,096</td><td><strong>108&times;</strong></td></tr>
-                        <tr><td>Full architecture pass &mdash; small Rust repo</td><td>24,218</td><td>141,297</td><td><strong>5.8&times;</strong></td></tr>
-                        <tr><td>Full architecture pass &mdash; large polyglot repo</td><td>48,195</td><td>1,859,804</td><td><strong>39&times;</strong></td></tr>
-                    </tbody>
-                </table>
+                <figure class="chart" role="group" aria-label="Tokens to reach the correct answer, engine versus grep">
+                    <figcaption class="chart-title">Tokens to reach the <em>correct</em> answer</figcaption>
+                    <p class="chart-note">Lower is better. Bar lengths are on a log scale (the raw gap spans ~770&times;); the ratio on each row is the real linear multiple. Counted with <code>tiktoken o200k_base</code>.</p>
+                    <div class="chart-legend">
+                        <span class="lg-engine">Engine</span>
+                        <span class="lg-grep">grep</span>
+                    </div>
+
+                    <div class="chart-row">
+                        <div class="chart-row-label">Callers of one function &mdash; small Rust repo <span class="ratio">27&times;</span></div>
+                        <div class="bar engine tiny" style="width:12%"><span class="bar-value">2,424</span></div>
+                        <div class="bar grep" style="width:56%"><span class="bar-value">66,534</span></div>
+                    </div>
+                    <div class="chart-row">
+                        <div class="chart-row-label">Callers of one function &mdash; large polyglot repo <span class="ratio">108&times;</span></div>
+                        <div class="bar engine tiny" style="width:13%"><span class="bar-value">2,669</span></div>
+                        <div class="bar grep" style="width:75%"><span class="bar-value">289,096</span></div>
+                    </div>
+                    <div class="chart-row">
+                        <div class="chart-row-label">Full architecture pass &mdash; small Rust repo <span class="ratio">5.8&times;</span></div>
+                        <div class="bar engine" style="width:42%"><span class="bar-value">24,218</span></div>
+                        <div class="bar grep" style="width:66%"><span class="bar-value">141,297</span></div>
+                    </div>
+                    <div class="chart-row">
+                        <div class="chart-row-label">Full architecture pass &mdash; large polyglot repo <span class="ratio">39&times;</span></div>
+                        <div class="bar engine" style="width:51%"><span class="bar-value">48,195</span></div>
+                        <div class="bar grep" style="width:100%"><span class="bar-value">1,859,804</span></div>
+                    </div>
+                </figure>
                 <p>Notice the direction: on the small repo the caller query is 27&times; cheaper through the engine; on the large repo it's <strong>108&times;</strong>. grep's cost scales with how much noise it produces, which scales with the size of the codebase. The engine's output stays bounded. <strong>The bigger your repo, the more the engine saves &mdash; the opposite of how text search degrades.</strong></p>
                 <p>Translate tokens to dollars at current Claude input prices (Opus 4.8 $5/MTok, Sonnet 4.6 $3/MTok, Haiku 4.5 $1/MTok) and the per-query gap looks small &mdash; fractions of a cent vs a few cents. It stops looking small at fleet scale. Here is one illustrative workload &mdash; a team running 1,000 caller-disambiguation queries a day &mdash; on the large repo:</p>
-                <table>
-                    <thead>
-                        <tr><th>Model</th><th>Engine $/yr</th><th>grep $/yr</th><th>Avoidable spend/yr</th></tr>
-                    </thead>
-                    <tbody>
-                        <tr><td>Opus 4.8</td><td>$4.9k</td><td>$528k</td><td><strong>$523k</strong></td></tr>
-                        <tr><td>Sonnet 4.6</td><td>$2.9k</td><td>$317k</td><td><strong>$314k</strong></td></tr>
-                        <tr><td>Haiku 4.5</td><td>$1.0k</td><td>$106k</td><td><strong>$105k</strong></td></tr>
-                    </tbody>
-                </table>
+                <figure class="chart" role="group" aria-label="Annual cost, engine versus grep, 1000 caller queries per day on the large repo">
+                    <figcaption class="chart-title">Annual cost &mdash; 1,000 caller queries/day, large repo</figcaption>
+                    <p class="chart-note">Lower is better. Bars are linear; the figure beside each model name is avoidable spend per year. Illustrative volume &mdash; the <em>ratio</em> is the property that holds.</p>
+                    <div class="chart-legend">
+                        <span class="lg-engine">Engine $/yr</span>
+                        <span class="lg-grep">grep $/yr</span>
+                    </div>
+
+                    <div class="chart-row">
+                        <div class="chart-row-label">Opus 4.8 <span class="ratio">$523k avoidable</span></div>
+                        <div class="bar engine tiny" style="width:0.9%"><span class="bar-value">$4.9k</span></div>
+                        <div class="bar grep" style="width:100%"><span class="bar-value">$528k</span></div>
+                    </div>
+                    <div class="chart-row">
+                        <div class="chart-row-label">Sonnet 4.6 <span class="ratio">$314k avoidable</span></div>
+                        <div class="bar engine tiny" style="width:0.55%"><span class="bar-value">$2.9k</span></div>
+                        <div class="bar grep" style="width:60%"><span class="bar-value">$317k</span></div>
+                    </div>
+                    <div class="chart-row">
+                        <div class="chart-row-label">Haiku 4.5 <span class="ratio">$105k avoidable</span></div>
+                        <div class="bar engine tiny" style="width:0.2%"><span class="bar-value">$1.0k</span></div>
+                        <div class="bar grep" style="width:20.1%"><span class="bar-value">$106k</span></div>
+                    </div>
+                </figure>
                 <p>The exact dollars depend on your volume and model mix &mdash; treat them as illustrative. The <em>ratio</em> doesn't: it's a property of how much each path makes the agent read, and it's tokenizer-independent.</p>
 
                 <h2>The Scaling Wall: When grep Stops Fitting at All</h2>