style(blog): add chart visuals to the graph-engine benchmark post
Rework the benchmark data tables into styled in-page charts (supporting
CSS in blog.css). Content unchanged and corpora remain anonymized.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
<figure class="chart" role="group" aria-label="Latency per task, engine versus grep, large polyglot repo">
124
+
<figcaption class="chart-title">Latency per task, large polyglot repo</figcaption>
125
+
<p class="chart-note">Wall-clock, median of 5 runs (lower is better). grep wins four of five, and both sit far under a 3–30 s model turn. <em>Accuracy winner noted per row.</em></p>
<p>grep is faster on four of five tasks, sometimes by a lot. We are not going to pretend otherwise. But both interfaces are sub-second, and a single agent turn — one model inference — takes 3 to 30 seconds. Shaving 250 ms off a tool call you make a handful of times per turn is invisible. The engine's latency is also an <em>upper bound</em>: each call pays CLI startup plus a stateless query against the on-disk graph. A persistent MCP server amortizes that away. <strong>Latency is the axis that's easy to measure and irrelevant to optimize.</strong> The decisive axes are the next two.</p>
136
158
137
159
<h2>Accuracy: grep Hands Your Agent Noise</h2>
138
160
<p>This is where text search quietly fails. grep matches <em>characters</em>; it has no concept of a function, a call edge, or a module boundary. On real code, that produces both false positives (it matches things that aren't what you asked for) and false negatives (it misses things that are).</p>
<tr><td>Hotspot top-10 recall</td><td><strong>100%</strong></td><td>12% — missed nearly every real hotspot</td><td>8×</td></tr>
146
-
<tr><td>Callers of a function</td><td><strong>80</strong> precise call edges, test/prod-classified</td><td>202 word-matches (180 in tests, 8 definitions, 22 unrelated)</td><td>recall + precision tax</td></tr>
147
-
<tr><td>Dependency cycles</td><td>SCCs computed</td><td><strong>0</strong> — not expressible in grep</td><td>∞</td></tr>
148
-
<tr><td>Blast radius of a change</td><td><strong>44</strong>-node transitive closure</td><td>490 flat hits, no transitivity</td><td>no edges</td></tr>
149
-
</tbody>
150
-
</table>
161
+
<figure class="chart" role="group" aria-label="Hotspot ranking precision and recall, engine versus grep, large repo">
162
+
<figcaption class="chart-title">Hotspot top-10: precision &amp; recall, large repo</figcaption>
163
+
<p class="chart-note">Higher is better. grep ranks variable names and keywords (<code>is</code>, <code>in</code>, <code>config</code>, <code>len</code>, <code>status</code>…) as if they were functions, so it scores near-zero on both.</p>
<p>The structural queries cannot be scored as percentages at all, because grep has no way to express them:</p>
181
+
<ul>
182
+
<li><strong>Callers of a function</strong> — engine: <strong>80</strong> precise call edges, test/prod-classified. grep: 202 word-matches (180 in tests, 8 definitions, 22 unrelated).</li>
183
+
<li><strong>Dependency cycles</strong> — engine: SCCs computed. grep: <strong>0</strong>, not expressible in text search (∞ gap).</li>
184
+
<li><strong>Blast radius of a change</strong> — engine: <strong>44</strong>-node transitive closure. grep: 490 flat hits, no transitivity, no edges.</li>
185
+
</ul>
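<p>To make the last two bullets concrete, here is a minimal sketch of what "transitive closure" and "SCCs" mean on a call graph. The graph, its node names, and the helper functions <code>blast_radius</code> and <code>dependency_cycles</code> are all hypothetical and written for illustration; this is not the engine's implementation, only one common way to compute the same kind of answer once call edges exist.</p>

```python
from collections import defaultdict, deque

# Hypothetical call graph (caller -> callees). All names are invented.
CALLS = {
    "api.place": ["orders.submit_order"],
    "orders.submit_order": ["risk.check", "ledger.post"],
    "risk.check": ["ledger.balance"],
    "ledger.post": ["ledger.balance"],
    "ledger.balance": ["risk.check"],  # closes a cycle with risk.check
    "tests.test_submit": ["orders.submit_order"],
}


def blast_radius(calls, changed):
    """Return every node that reaches `changed` through one or more call edges."""
    callers = defaultdict(set)
    for src, dsts in calls.items():
        for dst in dsts:
            callers[dst].add(src)
    seen, queue = {changed}, deque([changed])
    while queue:  # breadth-first walk over reversed edges
        node = queue.popleft()
        for caller in callers[node]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen - {changed}


def dependency_cycles(calls):
    """Return strongly connected components with more than one node (Tarjan)."""
    index, low, on_stack, stack, cycles = {}, {}, set(), [], []

    def visit(node):
        index[node] = low[node] = len(index)
        stack.append(node)
        on_stack.add(node)
        for nxt in calls.get(node, []):
            if nxt not in index:
                visit(nxt)
                low[node] = min(low[node], low[nxt])
            elif nxt in on_stack:
                low[node] = min(low[node], index[nxt])
        if low[node] == index[node]:  # node is the root of a component
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            if len(component) > 1:  # single nodes (even self-calls) are skipped
                cycles.append(component)

    nodes = set(calls) | {d for dsts in calls.values() for d in dsts}
    for node in sorted(nodes):
        if node not in index:
            visit(node)
    return cycles


print(sorted(blast_radius(CALLS, "ledger.post")))
print(dependency_cycles(CALLS))
```

<p>Both functions walk edges, which is the point: a flat list of text matches has no edges to walk, so neither question can be asked of it directly.</p>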
151
186
<p><strong>The recall trap is the killer.</strong> Ask grep for the callers of <code>submit_order</code> and <code>grep -wn submit_order</code> returns 202 lines — mostly tests and definitions — and <em>zero</em> of the three real production call sites. Why? Because in production the function is invoked as a method: <code>self._order_manager.submit_order(...)</code>. A search for <code>submit_order(</code> never sees <code>.submit_order(</code>. The agent now has to refine its regex <em>and</em> read source to recover the calls grep missed, all while wading through 180 test matches it didn't want. The engine returns all 80 call edges, each attributed to an exact <code>file:function</code> and classified test-vs-prod, in one call. One is a confident answer; the other is a research project.</p>
152
187
153
188
<h2>Tokens and Cost: The Bill You Actually Pay</h2>
154
189
<p>Accuracy and cost are the same story told twice. To turn grep's noisy output into a <em>correct</em> answer, the agent has to read the files those matches point into — to tell a definition from a test from a string from a qualified call. The engine's structured output needs no follow-up reads. Here is what each path costs in tokens to reach the correct answer, on both repos:</p>
<tr><td>Callers of one function — small Rust repo</td><td>2,424</td><td>66,534</td><td><strong>27×</strong></td></tr>
161
-
<tr><td>Callers of one function — large polyglot repo</td><td>2,669</td><td>289,096</td><td><strong>108×</strong></td></tr>
162
-
<tr><td>Full architecture pass — small Rust repo</td><td>24,218</td><td>141,297</td><td><strong>5.8×</strong></td></tr>
163
-
<tr><td>Full architecture pass — large polyglot repo</td><td>48,195</td><td>1,859,804</td><td><strong>39×</strong></td></tr>
164
-
</tbody>
165
-
</table>
190
+
<figure class="chart" role="group" aria-label="Tokens to reach the correct answer, engine versus grep">
191
+
<figcaption class="chart-title">Tokens to reach the <em>correct</em> answer</figcaption>
192
+
<p class="chart-note">Lower is better. Bar lengths are on a log scale (the raw gap spans ~770×); the ratio on each row is the real linear multiple. Counted with <code>tiktoken o200k_base</code>.</p>
193
+
<div class="chart-legend">
194
+
<span class="lg-engine">Engine</span>
195
+
<span class="lg-grep">grep</span>
196
+
</div>
197
+
198
+
<div class="chart-row">
199
+
<div class="chart-row-label">Callers of one function, small Rust repo <span class="ratio">27×</span></div>
<p>Notice the direction: on the small repo the caller query is 27× cheaper through the engine; on the large repo it's <strong>108×</strong>. grep's cost scales with how much noise it produces, which scales with the size of the codebase. The engine's output stays bounded. <strong>The bigger your repo, the more the engine saves — the opposite of how text search degrades.</strong></p>
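<p>The multiples quoted above can be reproduced from the token counts in the benchmark table. The sketch below is plain arithmetic on those published numbers; the dictionary layout and variable names are ours, chosen for illustration.</p>

```python
# Token counts copied from the benchmark table: (engine, grep).
TOKENS = {
    "callers, small Rust repo": (2_424, 66_534),
    "callers, large polyglot repo": (2_669, 289_096),
    "architecture pass, small Rust repo": (24_218, 141_297),
    "architecture pass, large polyglot repo": (48_195, 1_859_804),
}

# Linear multiple per task: how many times more tokens the grep path reads.
ratios = {task: grep / engine for task, (engine, grep) in TOKENS.items()}
for task, r in ratios.items():
    print(f"{task}: {r:.1f}x")

# Spread between the smallest and largest count in the table, which is
# why the chart needs a log scale for its bar lengths.
smallest = min(min(pair) for pair in TOKENS.values())
largest = max(max(pair) for pair in TOKENS.values())
raw_span = largest / smallest
print(f"raw span: {raw_span:.0f}x")
```

<p>The caller-query ratio grows from the small repo to the large one because the grep count grows by more than 4× while the engine count barely moves.</p>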
167
220
<p>Translate tokens to dollars at current Claude input prices (Opus 4.8 $5/MTok, Sonnet 4.6 $3/MTok, Haiku 4.5 $1/MTok) and the per-query gap looks small — fractions of a cent vs a few cents. It stops looking small at fleet scale. Here is one illustrative workload — a team running 1,000 caller-disambiguation queries a day — on the large repo:</p>
<figure class="chart" role="group" aria-label="Annual cost, engine versus grep, 1000 caller queries per day on the large repo">
222
+
<figcaption class="chart-title">Annual cost: 1,000 caller queries/day, large repo</figcaption>
223
+
<p class="chart-note">Lower is better. Bars are linear; the bold figure is avoidable spend per year. Illustrative volume; the <em>ratio</em> is the property that holds.</p>
<p>The exact dollars depend on your volume and model mix — treat them as illustrative. The <em>ratio</em> doesn't: it's a property of how much each path makes the agent read, and it's tokenizer-independent.</p>
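<p>For readers who want to plug in their own volume, here is a rough sketch of the tokens-to-dollars conversion. It uses the large-repo caller-query token counts and the per-MTok input prices quoted above. Two things are assumptions on our part and not taken from the benchmark: a 365-day year, and billing every counted token at the plain input rate with no caching or output tokens. The resulting dollar figures may therefore differ from the ones drawn in the chart.</p>

```python
# Input prices in USD per million tokens, as quoted in the post.
PRICE_PER_MTOK = {"Opus 4.8": 5.00, "Sonnet 4.6": 3.00, "Haiku 4.5": 1.00}

# Tokens to reach the correct answer, caller query on the large repo.
ENGINE_TOKENS = 2_669
GREP_TOKENS = 289_096

QUERIES_PER_DAY = 1_000  # the illustrative workload from the post
DAYS_PER_YEAR = 365      # assumption: queries run every day of the year


def annual_cost(tokens_per_query, price_per_mtok):
    """Yearly spend in USD, assuming every token is billed at the input rate."""
    yearly_tokens = tokens_per_query * QUERIES_PER_DAY * DAYS_PER_YEAR
    return yearly_tokens * price_per_mtok / 1_000_000


for model, price in PRICE_PER_MTOK.items():
    engine = annual_cost(ENGINE_TOKENS, price)
    grep = annual_cost(GREP_TOKENS, price)
    print(f"{model}: engine ${engine:,.0f}, grep ${grep:,.0f}, "
          f"difference ${grep - engine:,.0f}, ratio {grep / engine:.0f}x")
```

<p>Changing <code>QUERIES_PER_DAY</code> or the price table moves every dollar figure, but the engine-to-grep ratio printed on each line stays the same, since price and volume cancel out of it.</p>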
179
246
180
247
<h2>The Scaling Wall: When grep Stops Fitting at All</h2>