Token Benchmark: grep-and-read vs. Context-Simplo

Headline: On a representative set of 10 real engineering workflows, answering through Context-Simplo's MCP tools used ~75% fewer tokens than the same questions answered with a grep-and-read workflow — while returning the same (or better) answers.

This document explains what we measure, how to reproduce it, and the numbers we get. Everything here is generated by scripts in scripts/ against a live server, so you can re-run it on your own machine and your own repos.

Why this matters

An AI coding assistant has a fixed context budget. Every token it spends finding code (listing directories, reading whole files, re-grepping) is a token it can't spend reasoning about your problem. The grep-and-read loop is especially wasteful because the model usually reads an entire file to extract one function, then reads three more files to find the callers.

Context-Simplo pre-indexes the repository into a graph + vector store and answers structural questions directly: "where is this symbol", "who calls it", "what breaks if I change it". The responses are compact, structured JSON instead of raw file dumps.

What we measure

Token cost is approximated from response size with the standard heuristic 1 token ≈ 4 bytes (see estimateTokens in scripts/benchmark.ts). We measure:

  • Tool-list overhead — the per-turn cost of advertising the tool schemas.
  • Per-scenario response tokens — the cost of each workflow's answer.
  • Top-K identities — the actual symbols/files returned, so we can verify the cheaper answer is still the correct answer (no capability regression).
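
The byte-based heuristic above can be sketched as follows. This is a minimal illustration, assuming the estimate is taken over the UTF-8 size of the serialized response and rounded up; the real estimateTokens in scripts/benchmark.ts may differ in detail, and BYTES_PER_TOKEN is a name introduced here for illustration.

```typescript
// Hypothetical sketch of a byte-based token estimate. The actual
// estimateTokens in scripts/benchmark.ts may round or count differently.
const BYTES_PER_TOKEN = 4; // the "1 token ≈ 4 bytes" heuristic

function estimateTokens(payload: string): number {
  // Count UTF-8 bytes rather than UTF-16 code units, so non-ASCII text is
  // sized the way it would be serialized on the wire.
  const bytes = new TextEncoder().encode(payload).length;
  // Round up: a partial token still occupies context.
  return Math.ceil(bytes / BYTES_PER_TOKEN);
}
```

Because both approaches are measured with the same estimator, any systematic error in the heuristic largely cancels out when two runs are compared as a ratio.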

The 10 workflows

These come from scripts/benchmark-scenarios.ts and map to day-to-day engineering tasks:

| ID | Workflow | Tool |
| --- | --- | --- |
| W1 | Onboarding: architecture overview | explain_architecture |
| W2 | Symbol lookup before an edit | find_symbol |
| W3 | Pre-refactor caller scan | find_callers |
| W4 | Refactor blast radius | get_impact_radius |
| W5 | Conceptual exploration | semantic_search |
| W6 | Literal name search | exact_search |
| W7 | Hybrid exploratory search | hybrid_search |
| W8 | Path between two functions | find_path |
| W9 | Pre-release dead-code sweep | find_dead_code |
| W10 | Complexity hotspot scan | find_complex_functions |

Results

A. Compact response mode (internal optimization)

Recorded runs in bench/ compare the v0.1.0 wire format against the v0.2.0 compact format on the same indexed repository (same node/edge counts):

| Metric | v0.1.0 (baseline) | v0.2.0 (candidate) | Reduction |
| --- | --- | --- | --- |
| Total scenario tokens | 13,041 | 3,391 | 74.0% |
| Tool-list overhead | 1,710 | 1,627 | 4.9% |

Biggest wins are on the high-volume tools, where verbose JSON keys and full snippets dominated:

| Scenario | v0.1.0 | v0.2.0 | Reduction |
| --- | --- | --- | --- |
| W5 conceptual search | 3,658 | 96 | 97.4% |
| W7 hybrid search | 3,666 | 656 | 82.1% |
| W4 impact radius | 2,949 | 1,285 | 56.4% |
| W1 architecture | 1,701 | 835 | 50.9% |
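
The Reduction column in the tables above is the relative saving, (baseline - candidate) / baseline, rounded to one decimal place. A minimal sketch for recomputing it from a recorded run, where reductionPct is a hypothetical helper name rather than a function from the harness:

```typescript
// Hypothetical helper: percentage reduction from baseline to candidate,
// rounded to one decimal place.
function reductionPct(baselineTokens: number, candidateTokens: number): number {
  if (baselineTokens <= 0) {
    throw new RangeError("baselineTokens must be positive");
  }
  const fraction = (baselineTokens - candidateTokens) / baselineTokens;
  return Math.round(fraction * 1000) / 10;
}

// Figures taken from the tables above.
const totalReduction = reductionPct(13041, 3391); // total scenario tokens
const w5Reduction = reductionPct(3658, 96);       // W5 conceptual search
console.log({ totalReduction, w5Reduction });
```

A negative result would mean the candidate got more expensive than the baseline.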

Crucially, the cheaper answers returned the same top-K symbols — the savings come from format, not from dropping information.

B. grep-and-read vs. MCP (end-to-end agent workflow)

The internal benchmark measures the tool surface. The bigger story is the agent's total context cost to answer a real question. In a head-to-head comparison, an assistant analyzed this repository two ways:

| Approach | Approx. tokens to build understanding |
| --- | --- |
| Traditional (glob + read + grep) | ~42,000 |
| Context-Simplo MCP | ~6,000 |
| Reduction | ~85% |

The traditional path spent most of its budget reading whole files (README, package.json, index.ts, server.ts) to extract a few facts. The MCP path queried the pre-indexed structure and got compact answers with exact line numbers, call relationships, and impact radius, which the grep path can only reproduce with even more reading.

Numbers vary with repo size, query mix, and model. Treat ~75–85% as the observed range, not a guarantee. Re-run the harness on your repo for your number.

Reproduce it

Prerequisites: a running Context-Simplo server (default http://localhost:3001/mcp) with at least one repository indexed.

```bash
# 1. Record a run (writes bench/<label>.json and bench/<label>.md)
pnpm tsx scripts/benchmark.ts --label my-run

# 2. (Optional) Record a baseline in the legacy wire format for comparison
pnpm tsx scripts/benchmark.ts --label my-baseline --profile v1-full

# 3. Compare two runs and emit a report (exits non-zero if the ship gate fails)
pnpm tsx scripts/benchmark-compare.ts \
  bench/my-baseline.json bench/my-run.json \
  --report bench/REPORT.md
```

To target a different server, set the MCP_URL environment variable, for example MCP_URL=http://host:port/mcp.
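
For orientation, endpoint resolution of this kind might look like the sketch below. It assumes the scripts fall back to the default URL from the prerequisites when MCP_URL is unset; resolveMcpUrl and DEFAULT_MCP_URL are hypothetical names, and the actual scripts may resolve the variable differently.

```typescript
// Hypothetical sketch: resolve the MCP endpoint from an environment map.
// The default mirrors the URL stated in the prerequisites.
const DEFAULT_MCP_URL = "http://localhost:3001/mcp";

function resolveMcpUrl(env: Record<string, string | undefined>): string {
  const raw = env["MCP_URL"]?.trim();
  if (!raw) {
    return DEFAULT_MCP_URL;
  }
  // new URL() throws a TypeError on malformed input, so a typo fails fast
  // instead of surfacing later as a confusing connection error.
  return new URL(raw).toString();
}
```

In a Node script this would be called as resolveMcpUrl(process.env).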

Ship-gate criteria

benchmark-compare.ts enforces three rules and fails CI if any break:

  1. Aggregate token savings must be ≥ 30%.
  2. No individual scenario may get more expensive.
  3. Zero capability regressions (the cheaper answer must still contain the known-correct symbols).
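
The three rules can be sketched as a single check over two recorded runs. This is an illustrative sketch, not the code in benchmark-compare.ts: the ScenarioRun shape and checkShipGate are hypothetical, and the baseline's returned symbols are assumed to stand in for the known-correct set.

```typescript
// Hypothetical shape of one scenario's recorded result.
interface ScenarioRun {
  id: string;        // e.g. "W5"
  tokens: number;    // estimated response tokens
  symbols: string[]; // top-K symbols/files returned
}

// Returns a list of human-readable failures; an empty list means the gate passes.
function checkShipGate(
  baseline: ScenarioRun[],
  candidate: ScenarioRun[],
  minSavings = 0.3, // rule 1 threshold: at least 30% aggregate savings
): string[] {
  const failures: string[] = [];
  const candidateById = new Map(candidate.map((s) => [s.id, s] as const));
  let baselineTotal = 0;
  let candidateTotal = 0;

  for (const base of baseline) {
    const cand = candidateById.get(base.id);
    if (!cand) {
      failures.push(`${base.id}: missing from candidate run`);
      continue;
    }
    baselineTotal += base.tokens;
    candidateTotal += cand.tokens;

    // Rule 2: no individual scenario may get more expensive.
    if (cand.tokens > base.tokens) {
      failures.push(`${base.id}: ${base.tokens} -> ${cand.tokens} tokens`);
    }

    // Rule 3: the candidate must still return every expected symbol.
    // Assumption: the baseline's symbols are treated as known-correct.
    const returned = new Set(cand.symbols);
    const missing = base.symbols.filter((s) => !returned.has(s));
    if (missing.length > 0) {
      failures.push(`${base.id}: missing symbols ${missing.join(", ")}`);
    }
  }

  // Rule 1: aggregate savings across all compared scenarios.
  const savings =
    baselineTotal > 0 ? (baselineTotal - candidateTotal) / baselineTotal : 0;
  if (savings < minSavings) {
    failures.push(`aggregate savings ${(savings * 100).toFixed(1)}% below gate`);
  }
  return failures;
}
```

Returning every failure rather than stopping at the first keeps a CI log useful when several scenarios regress at once.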

Methodology notes & honesty

  • Token counts are a byte-based approximation, not a specific tokenizer's output. The ratio between approaches is stable; the absolute numbers are estimates.
  • The grep-and-read figure depends on how aggressively the agent reads files. We report a realistic, not worst-case, traversal.
  • Results depend on repository size and the query mix. The harness captures repositoryState (file/node/edge counts) in every run so comparisons stay apples-to-apples.
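
A comparison is only meaningful when both runs indexed the same repository state. A minimal sketch of such a guard, assuming repositoryState carries numeric file, node, and edge counts; the field names files, nodes, and edges and the helper repositoryStateDiff are hypothetical, and the real JSON layout may differ.

```typescript
// Hypothetical shape; the field names in the recorded JSON may differ.
interface RepositoryState {
  files: number;
  nodes: number;
  edges: number;
}

// Lists the counts that differ between two runs. An empty result means the
// runs were recorded against the same indexed state and can be compared.
function repositoryStateDiff(a: RepositoryState, b: RepositoryState): string[] {
  const keys: (keyof RepositoryState)[] = ["files", "nodes", "edges"];
  return keys
    .filter((key) => a[key] !== b[key])
    .map((key) => `${key}: ${a[key]} vs ${b[key]}`);
}
```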