Token Benchmark: grep-and-read vs. Context-Simplo

Headline: On a representative set of 10 real engineering workflows, answering through Context-Simplo's MCP tools used ~75% fewer tokens than the same questions answered with a grep-and-read workflow — while returning the same (or better) answers.

This document explains what we measure, how to reproduce it, and the numbers we get. Everything here is generated by scripts in scripts/ against a live server, so you can re-run it on your own machine and your own repos.

Why this matters

An AI coding assistant has a fixed context budget. Every token it spends finding code (listing directories, reading whole files, re-grepping) is a token it can't spend reasoning about your problem. The grep-and-read loop is especially wasteful because the model usually reads an entire file to extract one function, then reads three more files to find the callers.

Context-Simplo pre-indexes the repository into a graph + vector store and answers structural questions directly: "where is this symbol", "who calls it", "what breaks if I change it". The responses are compact, structured JSON instead of raw file dumps.

What we measure

Token cost is approximated from response size using the common rule of thumb that 1 token ≈ 4 bytes (see estimateTokens in scripts/benchmark.ts). We measure:

Tool-list overhead — the per-turn cost of advertising the tool schemas.
Per-scenario response tokens — the cost of each workflow's answer.
Top-K identities — the actual symbols/files returned, so we can verify the cheaper answer is still the correct answer (no capability regression).
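
As a rough illustration of the byte-based heuristic, here is a minimal sketch. It is not the actual estimateTokens from scripts/benchmark.ts: the string input, the UTF-8 byte counting, and the round-up behavior are all assumptions made for this example.

```typescript
// Hypothetical re-implementation of the 1 token ~= 4 bytes heuristic.
// The real estimateTokens in scripts/benchmark.ts may round or measure differently.
const BYTES_PER_TOKEN = 4;

function estimateTokens(payload: string): number {
  // Count UTF-8 bytes rather than UTF-16 code units, so non-ASCII text is sized correctly.
  const bytes = new TextEncoder().encode(payload).length;
  // Round up so that any non-empty payload costs at least one token.
  return Math.ceil(bytes / BYTES_PER_TOKEN);
}

// Made-up sample response, shaped like a compact JSON answer.
const sampleResponse = JSON.stringify({ symbol: "estimateTokens", file: "scripts/benchmark.ts" });
console.log(`estimated tokens: ${estimateTokens(sampleResponse)}`);
```

Because the estimate is linear in bytes, it is best suited to comparing two response formats against each other rather than predicting what a specific tokenizer would report.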

The 10 workflows

These come from scripts/benchmark-scenarios.ts and map to day-to-day engineering tasks:

| ID | Workflow | Tool |
| --- | --- | --- |
| W1 | Onboarding: architecture overview | `explain_architecture` |
| W2 | Symbol lookup before an edit | `find_symbol` |
| W3 | Pre-refactor caller scan | `find_callers` |
| W4 | Refactor blast radius | `get_impact_radius` |
| W5 | Conceptual exploration | `semantic_search` |
| W6 | Literal name search | `exact_search` |
| W7 | Hybrid exploratory search | `hybrid_search` |
| W8 | Path between two functions | `find_path` |
| W9 | Pre-release dead-code sweep | `find_dead_code` |
| W10 | Complexity hotspot scan | `find_complex_functions` |

Results

A. Compact response mode (internal optimization)

Recorded runs in bench/ compare the v0.1.0 wire format against the v0.2.0 compact format on the same indexed repository (same node/edge counts):

| Metric | v0.1.0 (`baseline`) | v0.2.0 (`candidate`) | Reduction |
| --- | ---: | ---: | ---: |
| Total scenario tokens | 13,041 | 3,391 | 74.0% |
| Tool-list overhead | 1,710 | 1,627 | 4.9% |
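
The Reduction column follows from (baseline - candidate) / baseline. The sketch below recomputes both rows from the recorded token counts. The helper name reductionPercent is hypothetical and is not taken from the benchmark scripts.

```typescript
// Hypothetical helper: percentage reduction from baseline to candidate,
// rounded to one decimal place to match the table.
function reductionPercent(baselineTokens: number, candidateTokens: number): number {
  if (baselineTokens <= 0) {
    throw new RangeError("baseline token count must be positive");
  }
  const fraction = (baselineTokens - candidateTokens) / baselineTokens;
  return Math.round(fraction * 1000) / 10;
}

// Token counts copied from the table above.
console.log(reductionPercent(13041, 3391).toFixed(1)); // prints "74.0"
console.log(reductionPercent(1710, 1627).toFixed(1)); // prints "4.9"
```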

Biggest wins are on the high-volume tools, where verbose JSON keys and full snippets dominated:

| Scenario | v0.1.0 | v0.2.0 | Reduction |
| --- | ---: | ---: | ---: |
| W5 conceptual search | 3,658 | 96 | 97.4% |
| W7 hybrid search | 3,666 | 656 | 82.1% |
| W4 impact radius | 2,949 | 1,285 | 56.4% |
| W1 architecture | 1,701 | 835 | 50.9% |

Crucially, the cheaper answers returned the same top-K symbols — the savings come from format, not from dropping information.

B. grep-and-read vs. MCP (end-to-end agent workflow)

The internal benchmark measures the tool surface. The bigger story is the agent's total context cost to answer a real question. In a head-to-head comparison, an assistant analyzed this repository two ways:

| Approach | Approx. tokens to build understanding |
| --- | ---: |
| Traditional (glob + read + grep) | ~42,000 |
| Context-Simplo MCP | ~6,000 |
| Reduction | ~85% |

The traditional path spent most of its budget reading whole files (README, package.json, index.ts, server.ts) to extract a few facts. The MCP path queried pre-indexed structure and got compact answers with exact line numbers, call relationships, and impact radius. The grep path can produce those only with even more reading.

Numbers vary with repo size, query mix, and model. Treat ~75–85% as the observed range, not a guarantee. Re-run the harness on your repo for your number.

Reproduce it

Prerequisites: a running Context-Simplo server (default http://localhost:3001/mcp) with at least one repository indexed.

# 1. Record a run (writes bench/<label>.json and bench/<label>.md)
pnpm tsx scripts/benchmark.ts --label my-run

# 2. (Optional) Record a baseline in the legacy wire format for comparison
pnpm tsx scripts/benchmark.ts --label my-baseline --profile v1-full

# 3. Compare two runs and emit a report (exits non-zero if the ship gate fails)
pnpm tsx scripts/benchmark-compare.ts \
  bench/my-baseline.json bench/my-run.json \
  --report bench/REPORT.md

Point at a different server with MCP_URL=http://host:port/mcp.
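
A minimal sketch of how a script could pick its server URL, assuming MCP_URL is read as an environment variable and falls back to the default endpoint named in the prerequisites. The function name resolveMcpUrl is hypothetical, and the real scripts may resolve the URL differently.

```typescript
// Default endpoint, as stated in the prerequisites above.
const DEFAULT_MCP_URL = "http://localhost:3001/mcp";

// Hypothetical helper: takes an environment-like object so it can be tested
// without touching the real process environment.
function resolveMcpUrl(env: Record<string, string | undefined>): string {
  const raw = env["MCP_URL"]?.trim();
  const candidate = raw ? raw : DEFAULT_MCP_URL;
  // The URL constructor throws a TypeError on malformed input, which fails fast
  // before any benchmark request is sent.
  return new URL(candidate).toString();
}

console.log(resolveMcpUrl({})); // prints "http://localhost:3001/mcp"
```

In a Node script, this would typically be called as resolveMcpUrl(process.env).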

Ship-gate criteria

benchmark-compare.ts enforces three rules and fails CI if any break:

Aggregate token savings must be ≥ 30%.
No individual scenario may get more expensive.
Zero capability regressions (the cheaper answer must still contain the known-correct symbols).
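
The three rules above can be sketched as a single check. This is an illustration only, not the code in benchmark-compare.ts: the ScenarioResult shape, the knownCorrect map, the symbol names, and the function name checkShipGate are all assumptions.

```typescript
// Hypothetical per-scenario record; the real run files may use different fields.
interface ScenarioResult {
  id: string;
  tokens: number;
  topK: string[]; // symbols or files returned by the tool
}

interface GateVerdict {
  passed: boolean;
  failures: string[];
}

const MIN_AGGREGATE_SAVINGS = 0.3; // rule 1: at least 30% aggregate savings

function checkShipGate(
  baseline: ScenarioResult[],
  candidate: ScenarioResult[],
  knownCorrect: Record<string, string[]>,
): GateVerdict {
  const failures: string[] = [];
  const candidateById = new Map(candidate.map((s) => [s.id, s] as const));
  const sum = (runs: ScenarioResult[]): number => runs.reduce((n, s) => n + s.tokens, 0);

  // Rule 1: aggregate token savings.
  const baselineTotal = sum(baseline);
  const savings = baselineTotal > 0 ? (baselineTotal - sum(candidate)) / baselineTotal : 0;
  if (savings < MIN_AGGREGATE_SAVINGS) {
    failures.push(`aggregate savings ${(savings * 100).toFixed(1)}% is below 30%`);
  }

  for (const base of baseline) {
    const cand = candidateById.get(base.id);
    if (!cand) {
      failures.push(`${base.id}: missing from candidate run`);
      continue;
    }
    // Rule 2: no individual scenario may get more expensive.
    if (cand.tokens > base.tokens) {
      failures.push(`${base.id}: ${base.tokens} -> ${cand.tokens} tokens`);
    }
    // Rule 3: the known-correct symbols must still be present.
    const returned = new Set(cand.topK);
    const missing = (knownCorrect[base.id] ?? []).filter((sym) => !returned.has(sym));
    if (missing.length > 0) {
      failures.push(`${base.id}: missing ${missing.join(", ")}`);
    }
  }
  return { passed: failures.length === 0, failures };
}

// Token counts come from the results tables; the symbol names are made up.
const knownCorrect = { W5: ["parseConfig"], W4: ["loadIndex"] };
const baselineRun: ScenarioResult[] = [
  { id: "W5", tokens: 3658, topK: ["parseConfig", "readConfig"] },
  { id: "W4", tokens: 2949, topK: ["loadIndex"] },
];
const candidateRun: ScenarioResult[] = [
  { id: "W5", tokens: 96, topK: ["parseConfig", "readConfig"] },
  { id: "W4", tokens: 1285, topK: ["loadIndex"] },
];
const verdict = checkShipGate(baselineRun, candidateRun, knownCorrect);
console.log(verdict.passed ? "ship gate passed" : verdict.failures.join("\n"));
```

Collecting every failure instead of stopping at the first one makes a CI log more useful, since one run then shows all the rules that broke.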

Methodology notes & honesty

Token counts are a byte-based approximation, not a specific tokenizer's output. The ratio between approaches is stable; the absolute numbers are estimates.
The grep-and-read figure depends on how aggressively the agent reads files. We report a realistic, not worst-case, traversal.
Results depend on repository size and the query mix. The harness captures repositoryState (file/node/edge counts) in every run so comparisons stay apples-to-apples.
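
To illustrate the apples-to-apples point, the sketch below compares the recorded repository state of two runs before their token counts are compared. The field names files, nodes, and edges are assumptions based on the counts mentioned above, and diffRepositoryState is a hypothetical helper.

```typescript
// Assumed shape of the repositoryState captured in each run.
interface RepositoryState {
  files: number;
  nodes: number;
  edges: number;
}

// Returns one human-readable line per count that differs between the two runs.
// An empty array means the runs were recorded against the same index.
function diffRepositoryState(a: RepositoryState, b: RepositoryState): string[] {
  const keys: (keyof RepositoryState)[] = ["files", "nodes", "edges"];
  return keys
    .filter((key) => a[key] !== b[key])
    .map((key) => `${key}: ${a[key]} vs ${b[key]}`);
}

// Made-up counts for illustration.
const before: RepositoryState = { files: 40, nodes: 900, edges: 2100 };
const after: RepositoryState = { files: 40, nodes: 900, edges: 2150 };
console.log(diffRepositoryState(before, after)); // prints [ 'edges: 2100 vs 2150' ]
```

If the list is non-empty, the two runs indexed different content, so any token difference between them may reflect the repository rather than the response format.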
