stainless-code · SutuSebastian · May 26, 2026 · May 25, 2026
diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md
@@ -12,8 +12,9 @@ bun install   # runs `prepare` → Husky git hooks
 bun run dev   # same as `bun src/index.ts` — CLI from source
 bun test
 bun run test:golden   # golden SQL vs fixtures/minimal (also runs at end of `bun run check`)
+bun run test:agent-eval   # probe A/B harness smoke on fixtures/minimal (also runs at end of `bun run check`)
 bun run test:golden:external   # Tier B: local tree via CODEMAP_ROOT / --root (not in CI)
-bun run check   # build, then format:check + lint:ci + test + typecheck, then test:golden
+bun run check   # build, then format:check + lint:ci + test + typecheck, then test:golden + test:agent-eval
 bun run clean   # remove untracked/ignored build artifacts (keeps `.env`, `.codemap/`)
 bun run check-updates   # interactive dependency updates (`bun update -i --latest`)
 ```

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -95,6 +95,9 @@ jobs:
       - name: Golden query regression (fixtures/minimal)
         run: bun run test:golden
 
+      - name: Agent eval probe harness (fixtures/minimal)
+        run: bun run test:agent-eval
+
   build:
     name: 🧰 Build
     needs: skip-ci

diff --git a/.gitignore b/.gitignore
@@ -15,3 +15,4 @@ fixtures/golden/scenarios.external.json
 # QA chat prompts tied to a private/local index (paths + product names)
 fixtures/qa/*.local.md
 fixtures/benchmark/*.local.json
+.agent-eval/
diff --git a/README.md b/README.md
@@ -278,17 +278,18 @@ Tooling: **Oxfmt**, **Oxlint**, **tsgo** (`@typescript/native-preview`).
 | Command                              | Purpose                                                                                                                                                                            |
 | ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `bun run dev`                        | Run the CLI from source (same as `bun src/index.ts`)                                                                                                                               |
-| `bun run check`                      | Build, format check, lint, tests, typecheck, golden queries — run before pushing                                                                                                   |
+| `bun run check`                      | Build, format check, lint, tests, typecheck, golden queries + agent-eval probe smoke — run before pushing                                                                          |
 | `bun run fix`                        | Apply lint fixes, then format                                                                                                                                                      |
 | `bun run test` / `bun run typecheck` | Focused checks                                                                                                                                                                     |
 | `bun run test:golden`                | SQL snapshot regression on `fixtures/minimal` (included in `check`)                                                                                                                |
+| `bun run test:agent-eval`            | Probe A/B harness smoke on `fixtures/minimal` (included in `check`; [docs/benchmark.md § Agent eval harness](docs/benchmark.md#agent-eval-harness))                                |
 | `bun run test:golden:external`       | Tier B: local tree via `CODEMAP_*` / `--root` (not in default `check`)                                                                                                             |
 | `bun run benchmark:query`            | Compare `console.table` vs `--json` stdout size (needs local `.codemap/index.db`; [docs/benchmark.md § Query stdout](docs/benchmark.md#query-stdout-table-vs-json-benchmarkquery)) |
 | `bun run qa:external`                | Index + sanity checks + benchmark on `CODEMAP_ROOT` / `CODEMAP_TEST_BENCH`                                                                                                         |
 
 ```bash
 bun install
-bun run check    # build + format:check + lint:ci + test + typecheck + test:golden
+bun run check    # build + format:check + lint:ci + test + typecheck + test:golden + test:agent-eval
 bun run fix      # oxlint --fix, then oxfmt
 ```
 

diff --git a/docs/README.md b/docs/README.md
@@ -14,7 +14,7 @@ Each topic has exactly one canonical file. Other files cross-reference by relati
 | [architecture.md](./architecture.md)          | Schema, layering, CLI internals, API, [**User config**](./architecture.md#user-config) (Zod), parsers, [Key Files](./architecture.md#key-files).                                                                                                                                                                                                                                                                                                                                                                   |
 | [glossary.md](./glossary.md)                  | Canonical term definitions. Disambiguates pairs like `FileRow` vs `files` table, recipe vs query, schema vs DDL, hub vs barrel.                                                                                                                                                                                                                                                                                                                                                                                    |
 | [agents.md](./agents.md)                      | **`codemap agents init`** — bundled **`templates/agents/`** (thin pointer files) → **`.agents/`** in consumer projects; full content served live by **`codemap skill`** / **`codemap rule`** + **`codemap://skill`** / **`codemap://rule`** from `templates/agent-content/`; section assembler + `*.gen.md` renderers, **[pointer protocol](./agents.md#pointer-protocol-and-staleness-detection)** + staleness nag, per-file IDE symlink/copy, **`--interactive`**, **`--mcp`**, **`.gitignore` / `.codemap.*`**. |
-| [benchmark.md](./benchmark.md)                | [**Indexing another project**](./benchmark.md#indexing-another-project) · [**Benchmark script**](./benchmark.md#the-benchmark-script) · [**Query stdout (table vs JSON)**](./benchmark.md#query-stdout-table-vs-json-benchmarkquery) · [**Custom scenarios**](./benchmark.md#custom-scenarios-codemap_benchmark_config) (`CODEMAP_BENCHMARK_CONFIG`) · [`fixtures/minimal/`](../fixtures/minimal/).                                                                                                                |
+| [benchmark.md](./benchmark.md)                | [**Indexing another project**](./benchmark.md#indexing-another-project) · [**Benchmark script**](./benchmark.md#the-benchmark-script) · [**Query stdout (table vs JSON)**](./benchmark.md#query-stdout-table-vs-json-benchmarkquery) · [**Custom scenarios**](./benchmark.md#custom-scenarios-codemap_benchmark_config) (`CODEMAP_BENCHMARK_CONFIG`) · [**Agent eval harness**](./benchmark.md#agent-eval-harness) · [`fixtures/minimal/`](../fixtures/minimal/).                                                  |
 | [golden-queries.md](./golden-queries.md)      | Golden `query` **design & policy** (Tier A/B, no proprietary trees); runner: [scripts/query-golden.ts](../scripts/query-golden.ts).                                                                                                                                                                                                                                                                                                                                                                                |
 | [fixtures/golden/](../fixtures/golden/)       | [scenarios.json](../fixtures/golden/scenarios.json) + [minimal/](../fixtures/golden/minimal/) — **`bun run test:golden`**; Tier B: [scenarios.external.example.json](../fixtures/golden/scenarios.external.example.json) + **`bun run test:golden:external`** ([benchmark § Fixtures](./benchmark.md#fixtures)).                                                                                                                                                                                                   |
 | [fixtures/benchmark/](../fixtures/benchmark/) | Tracked [scenarios.example.json](../fixtures/benchmark/scenarios.example.json) — copy to `*.local.json` (gitignored) for [`CODEMAP_BENCHMARK_CONFIG`](./benchmark.md#custom-scenarios-codemap_benchmark_config).                                                                                                                                                                                                                                                                                                   |
@@ -61,6 +61,7 @@ Cross-cutting topics that span multiple files. Each has exactly one canonical ho
 | **`CLAUDE.md` / `AGENTS.md` / `GEMINI.md` / Copilot** — managed **`codemap-pointer`** sections, merge vs **`--force`**                                                                                                                                                                                                                                                                       | [agents.md § Pointer files](./agents.md#pointer-files)                                   | Link here; do not duplicate the situation table                                                                                                                                                        |
 | End-user CLI (index, **`query --json`**, **`query --recipe`**, **`query --recipes-json`**, **`query --print-sql`**, **`skill`**, **`rule`**, agents, flags, env) — query has no row cap; use SQL **`LIMIT`**; **`--json`** errors include SQL, DB open, and bootstrap failures; bundled `templates/agent-content/skill/*.md` examples default to **`--json`**                                | [../README.md § CLI](../README.md#cli)                                                   | [architecture § CLI usage](./architecture.md#cli-usage) summarizes and links back; [agents.md](./agents.md)                                                                                            |
 | Golden query regression (`test:golden`, `test:golden:external`, `--update`)                                                                                                                                                                                                                                                                                                                  | [golden-queries.md](./golden-queries.md)                                                 | CONTRIBUTING § Golden queries; [benchmark § Fixtures](./benchmark.md#fixtures)                                                                                                                         |
+| Agent eval probe harness (`test:agent-eval`, `scripts/agent-eval/`)                                                                                                                                                                                                                                                                                                                          | [benchmark § Agent eval harness](./benchmark.md#agent-eval-harness)                      | Reuses golden scenarios via `goldenId`; structural cost A/B (indexed query vs glob/read/grep), not SQL correctness                                                                                     |
 | **`CODEMAP_BENCHMARK_CONFIG`** (per-repo benchmark JSON)                                                                                                                                                                                                                                                                                                                                     | [benchmark § Custom scenarios](./benchmark.md#custom-scenarios-codemap_benchmark_config) | [fixtures/benchmark/scenarios.example.json](../fixtures/benchmark/scenarios.example.json) only                                                                                                         |
 | `bun run qa:external` — index + disk checks + `benchmark.ts` on **`CODEMAP_*`**                                                                                                                                                                                                                                                                                                              | [.github/CONTRIBUTING.md](../.github/CONTRIBUTING.md)                                    | [scripts/qa-external-repo.ts](../scripts/qa-external-repo.ts) (invocation only)                                                                                                                        |
 | **Non-goals (v1)** — what Codemap deliberately doesn't do (full-text search, LSP, static analysis, visualization, daemon, deep intent classification)                                                                                                                                                                                                                                        | [roadmap.md § Non-goals](./roadmap.md#non-goals-v1)                                      | [why-codemap.md § When to reach for something else](./why-codemap.md#when-to-reach-for-something-else) (consumer-facing framing) — links here; [research/](./research/) notes link here, never re-list |

diff --git a/docs/benchmark.md b/docs/benchmark.md
@@ -10,6 +10,7 @@
 | **Measure SQL vs glob+read+regex** after an index exists — `src/benchmark.ts`, scenarios, fixtures                                                       | [§ The benchmark script](#the-benchmark-script)                                  |
 | **Compare `codemap query` table vs `--json` stdout** (lines/bytes) on an existing index                                                                  | [§ Query stdout (`benchmark:query`)](#query-stdout-table-vs-json-benchmarkquery) |
 | **Guardrail full-rebuild per-phase walls against a committed baseline** (local + weekly scheduled)                                                       | [§ Perf baseline (regression guardrail)](#perf-baseline-regression-guardrail)    |
+| **A/B agent eval** — indexed MCP-on vs file-scan MCP-off tool-call + token comparison on fixed probes                                                    | [§ Agent eval harness](#agent-eval-harness)                                      |
 
 ---
 
@@ -282,10 +283,34 @@ bun run dev --full
 bun run benchmark
 ```
 
-**CI:** the workflow **Benchmark (fixture)** runs the same steps with `CODEMAP_ROOT=$GITHUB_WORKSPACE/fixtures/minimal`.
+**CI:** the **Test** job runs `bun run test:agent-eval` after `test:golden` (probe smoke reuses the golden index via `--skip-index` when present; typically ~1–2 min combined); **Benchmark (fixture)** indexes the same corpus and runs `bun run benchmark`.
+
+### Agent eval harness
+
+Dev-only A/B harness in [`scripts/agent-eval/`](../scripts/agent-eval/) (not shipped in npm). Indexes the fixture corpus once, then compares an **indexed query arm** (one simulated `query` tool call per probe via `queryRows`, not an MCP transport round-trip) against an **MCP-off** arm that simulates agent discovery without the index (`glob` → `read` × N → `grep`). Probe **prompts and SQL/recipe** reuse [golden scenarios](../fixtures/golden/scenarios.json) via `goldenId` (override with `--scenarios` / `AGENT_EVAL_SCENARIOS` when using an external corpus); probe definitions live in [`scripts/agent-eval/scenarios.json`](../scripts/agent-eval/scenarios.json) (override with `--probes` / `AGENT_EVAL_PROBES`). The MCP-off **traditional** regex/globs in each probe only approximate naive file discovery, so that arm's results are not expected to be byte-identical to the golden SQL output.
+
+**One-command local run:**
+
+```bash
+bash scripts/agent-eval/run-arms.sh
+# default output: .agent-eval/comparison.json
+# exits non-zero when any probe's scenarioSuccess is false
+```
+
+Environment overrides: `AGENT_EVAL_OUTPUT`, `AGENT_EVAL_FIXTURE_ROOT`, `AGENT_EVAL_SCENARIOS`, `AGENT_EVAL_PROBES`. **`AGENT_EVAL_RUNS`** (or `--runs`) repeats each probe and **averages** `wallMs`, `estTokens`, `resultCount`, and `toolCallCount` (rounded; `estTokens` re-ceiled after averaging); `toolSequence` stays from the first run. **`--skip-index`** skips a full reindex when `.codemap/index.db` already exists (CI smoke reuses the index left by `test:golden`). Optional real agent session logs: `AGENT_EVAL_LOG=path/to/export.json bash scripts/agent-eval/run-arms.sh` (prints parsed tool metrics via `print-log-metrics.ts`).
+
+**Metrics (per scenario and summary):** tool-call sequence + count, wall time, estimated tokens (`chars / 4` on prompt + payload: MCP-on includes SQL, bind values, and JSON rows; MCP-off includes bytes read + grep hits), per-arm `success` (non-empty results), and `scenarioSuccess` (true only when both arms succeed). Results stay in local JSON with no telemetry upload ([plan](./plans/agent-eval-harness.md) L.5).
+
+**Methodology notes:**
+
+- **Probe mode** is deterministic (no LLM): it measures structural cost of indexed SQL vs traditional file scan on the same corpus. Use it for regression guardrails and fixture tuning.
+- **Log mode** parses exported agent transcripts (entries / messages / line formats) when you run live A/B sessions with MCP on vs off. Token estimates include tool `args` / `arguments` payloads and structured `content` part arrays where present; `wallMs` sums per-entry timings when exported.
+- External public repos (zod, fastify, etc.): point `AGENT_EVAL_FIXTURE_ROOT` at an indexed tree, pass matching `--scenarios` / `--probes` overrides, and extend probe definitions — same harness, not duplicated fixtures.
+
+Plan: [`docs/plans/agent-eval-harness.md`](./plans/agent-eval-harness.md). PR CI runs `bun run test:agent-eval` in the **Test** job; optional nightly / `workflow_dispatch` for external fixtures is not wired yet.
 
 **Correctness (golden queries):** `bun run test:golden` indexes `fixtures/minimal`, runs SQL against [fixtures/golden/scenarios.json](../fixtures/golden/scenarios.json), and compares to [fixtures/golden/minimal/](../fixtures/golden/minimal/). See [golden-queries.md](./golden-queries.md). Refresh goldens after intentional fixture or schema changes: `bun scripts/query-golden.ts --update`.
 
 **Tier B (local tree, not in default CI):** `bun run test:golden:external` (or `bun scripts/query-golden.ts --corpus external`) indexes **`CODEMAP_ROOT`**, **`CODEMAP_TEST_BENCH`**, or **`--root`**, loads [fixtures/golden/scenarios.external.json](../fixtures/golden/scenarios.external.json) if present else [scenarios.external.example.json](../fixtures/golden/scenarios.external.example.json), and writes/compares goldens under `fixtures/golden/external/` (gitignored). Use **`match`** in scenarios for subset checks (`minRows`, `everyRowContains`); use **`budgetMs`** with optional **`--strict-budget`** for perf warnings. Do not commit proprietary paths or goldens from private apps.
 
-Scenario titles match the table above; **indexed row counts** on the fixture are stable for a given schema. A larger second fixture is optional — see [roadmap.md](./roadmap.md).
+Scenario titles in the [benchmark scenarios table](#custom-scenarios-codemap_benchmark_config) describe latency fixtures; **agent-eval probes** are a separate three-scenario subset in [`scripts/agent-eval/scenarios.json`](../scripts/agent-eval/scenarios.json). **Indexed row counts** on the fixture are stable for a given schema. A larger second fixture is optional — see [roadmap.md](./roadmap.md).
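
The agent-eval metrics described in the `docs/benchmark.md` hunk above (the `chars / 4` token estimate, averaging across `AGENT_EVAL_RUNS`, and the per-arm `success` / `scenarioSuccess` flags) can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/agent-eval/`: the `ArmMetrics` shape and all function names are hypothetical, "rounded" is assumed to mean `Math.round`, "chars" is assumed to mean JavaScript string length, and deriving `success` from the averaged `resultCount` is also an assumption.

```typescript
// Hypothetical per-arm record; field names follow docs/benchmark.md, but the
// real types in scripts/agent-eval/ may differ.
interface ArmMetrics {
  toolSequence: string[];
  toolCallCount: number;
  wallMs: number;
  estTokens: number;
  resultCount: number;
}

// chars / 4 heuristic over prompt + payload, rounded up to a whole token.
// Assumption: "chars" is the JavaScript string length (UTF-16 code units).
function estimateTokens(prompt: string, payload: string): number {
  return Math.ceil((prompt.length + payload.length) / 4);
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Collapse repeated runs of one arm into a single record.
function averageRuns(runs: ArmMetrics[]): ArmMetrics {
  const first = runs[0];
  if (first === undefined) {
    throw new Error("averageRuns needs at least one run");
  }
  return {
    toolSequence: first.toolSequence, // kept from the first run, not averaged
    toolCallCount: Math.round(mean(runs.map((r) => r.toolCallCount))),
    wallMs: Math.round(mean(runs.map((r) => r.wallMs))),
    estTokens: Math.ceil(mean(runs.map((r) => r.estTokens))), // re-ceiled
    resultCount: Math.round(mean(runs.map((r) => r.resultCount))),
  };
}

// An arm succeeds when it returned at least one result.
function armSuccess(arm: ArmMetrics): boolean {
  return arm.resultCount > 0;
}

// A scenario succeeds only when both arms succeed.
function scenarioSuccess(indexed: ArmMetrics, fileScan: ArmMetrics): boolean {
  return armSuccess(indexed) && armSuccess(fileScan);
}
```

Under these assumptions, two runs with `estTokens` of 10 and 11 average to 10.5 and are reported as 11 after re-ceiling, while `toolSequence` is simply the first run's sequence.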