Merged
3 changes: 2 additions & 1 deletion .github/CONTRIBUTING.md
@@ -12,8 +12,9 @@ bun install # runs `prepare` → Husky git hooks
bun run dev # same as `bun src/index.ts` — CLI from source
bun test
bun run test:golden # golden SQL vs fixtures/minimal (also runs at end of `bun run check`)
bun run test:agent-eval # probe A/B harness smoke on fixtures/minimal (also runs at end of `bun run check`)
bun run test:golden:external # Tier B: local tree via CODEMAP_ROOT / --root (not in CI)
bun run check # build, then format:check + lint:ci + test + typecheck, then test:golden
bun run check # build, then format:check + lint:ci + test + typecheck, then test:golden + test:agent-eval
bun run clean # remove untracked/ignored build artifacts (keeps `.env`, `.codemap/`)
bun run check-updates # interactive dependency updates (`bun update -i --latest`)
```
3 changes: 3 additions & 0 deletions .github/workflows/ci.yml
@@ -95,6 +95,9 @@ jobs:
- name: Golden query regression (fixtures/minimal)
run: bun run test:golden

- name: Agent eval probe harness (fixtures/minimal)
run: bun run test:agent-eval

build:
name: 🧰 Build
needs: skip-ci
1 change: 1 addition & 0 deletions .gitignore
@@ -15,3 +15,4 @@ fixtures/golden/scenarios.external.json
# QA chat prompts tied to a private/local index (paths + product names)
fixtures/qa/*.local.md
fixtures/benchmark/*.local.json
.agent-eval/
5 changes: 3 additions & 2 deletions README.md
@@ -278,17 +278,18 @@ Tooling: **Oxfmt**, **Oxlint**, **tsgo** (`@typescript/native-preview`).
| Command | Purpose |
| ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `bun run dev` | Run the CLI from source (same as `bun src/index.ts`) |
| `bun run check` | Build, format check, lint, tests, typecheck, golden queries — run before pushing |
| `bun run check` | Build, format check, lint, tests, typecheck, golden queries + agent-eval probe smoke — run before pushing |
| `bun run fix` | Apply lint fixes, then format |
| `bun run test` / `bun run typecheck` | Focused checks |
| `bun run test:golden` | SQL snapshot regression on `fixtures/minimal` (included in `check`) |
| `bun run test:agent-eval` | Probe A/B harness smoke on `fixtures/minimal` (included in `check`; [docs/benchmark.md § Agent eval harness](docs/benchmark.md#agent-eval-harness)) |
| `bun run test:golden:external` | Tier B: local tree via `CODEMAP_*` / `--root` (not in default `check`) |
| `bun run benchmark:query` | Compare `console.table` vs `--json` stdout size (needs local `.codemap/index.db`; [docs/benchmark.md § Query stdout](docs/benchmark.md#query-stdout-table-vs-json-benchmarkquery)) |
| `bun run qa:external` | Index + sanity checks + benchmark on `CODEMAP_ROOT` / `CODEMAP_TEST_BENCH` |

```bash
bun install
bun run check # build + format:check + lint:ci + test + typecheck + test:golden
bun run check # build + format:check + lint:ci + test + typecheck + test:golden + test:agent-eval
bun run fix # oxlint --fix, then oxfmt
```

3 changes: 2 additions & 1 deletion docs/README.md
@@ -14,7 +14,7 @@ Each topic has exactly one canonical file. Other files cross-reference by relati
| [architecture.md](./architecture.md) | Schema, layering, CLI internals, API, [**User config**](./architecture.md#user-config) (Zod), parsers, [Key Files](./architecture.md#key-files). |
| [glossary.md](./glossary.md) | Canonical term definitions. Disambiguates pairs like `FileRow` vs `files` table, recipe vs query, schema vs DDL, hub vs barrel. |
| [agents.md](./agents.md) | **`codemap agents init`** — bundled **`templates/agents/`** (thin pointer files) → **`.agents/`** in consumer projects; full content served live by **`codemap skill`** / **`codemap rule`** + **`codemap://skill`** / **`codemap://rule`** from `templates/agent-content/`; section assembler + `*.gen.md` renderers, **[pointer protocol](./agents.md#pointer-protocol-and-staleness-detection)** + staleness nag, per-file IDE symlink/copy, **`--interactive`**, **`--mcp`**, **`.gitignore` / `.codemap.*`**. |
| [benchmark.md](./benchmark.md) | [**Indexing another project**](./benchmark.md#indexing-another-project) · [**Benchmark script**](./benchmark.md#the-benchmark-script) · [**Query stdout (table vs JSON)**](./benchmark.md#query-stdout-table-vs-json-benchmarkquery) · [**Custom scenarios**](./benchmark.md#custom-scenarios-codemap_benchmark_config) (`CODEMAP_BENCHMARK_CONFIG`) · [`fixtures/minimal/`](../fixtures/minimal/). |
| [benchmark.md](./benchmark.md) | [**Indexing another project**](./benchmark.md#indexing-another-project) · [**Benchmark script**](./benchmark.md#the-benchmark-script) · [**Query stdout (table vs JSON)**](./benchmark.md#query-stdout-table-vs-json-benchmarkquery) · [**Custom scenarios**](./benchmark.md#custom-scenarios-codemap_benchmark_config) (`CODEMAP_BENCHMARK_CONFIG`) · [**Agent eval harness**](./benchmark.md#agent-eval-harness) · [`fixtures/minimal/`](../fixtures/minimal/). |
| [golden-queries.md](./golden-queries.md) | Golden `query` **design & policy** (Tier A/B, no proprietary trees); runner: [scripts/query-golden.ts](../scripts/query-golden.ts). |
| [fixtures/golden/](../fixtures/golden/) | [scenarios.json](../fixtures/golden/scenarios.json) + [minimal/](../fixtures/golden/minimal/) — **`bun run test:golden`**; Tier B: [scenarios.external.example.json](../fixtures/golden/scenarios.external.example.json) + **`bun run test:golden:external`** ([benchmark § Fixtures](./benchmark.md#fixtures)). |
| [fixtures/benchmark/](../fixtures/benchmark/) | Tracked [scenarios.example.json](../fixtures/benchmark/scenarios.example.json) — copy to `*.local.json` (gitignored) for [`CODEMAP_BENCHMARK_CONFIG`](./benchmark.md#custom-scenarios-codemap_benchmark_config). |
@@ -61,6 +61,7 @@ Cross-cutting topics that span multiple files. Each has exactly one canonical ho
| **`CLAUDE.md` / `AGENTS.md` / `GEMINI.md` / Copilot** — managed **`codemap-pointer`** sections, merge vs **`--force`** | [agents.md § Pointer files](./agents.md#pointer-files) | Link here; do not duplicate the situation table |
| End-user CLI (index, **`query --json`**, **`query --recipe`**, **`query --recipes-json`**, **`query --print-sql`**, **`skill`**, **`rule`**, agents, flags, env) — query has no row cap; use SQL **`LIMIT`**; **`--json`** errors include SQL, DB open, and bootstrap failures; bundled `templates/agent-content/skill/*.md` examples default to **`--json`** | [../README.md § CLI](../README.md#cli) | [architecture § CLI usage](./architecture.md#cli-usage) summarizes and links back; [agents.md](./agents.md) |
| Golden query regression (`test:golden`, `test:golden:external`, `--update`) | [golden-queries.md](./golden-queries.md) | CONTRIBUTING § Golden queries; [benchmark § Fixtures](./benchmark.md#fixtures) |
| Agent eval probe harness (`test:agent-eval`, `scripts/agent-eval/`) | [benchmark § Agent eval harness](./benchmark.md#agent-eval-harness) | Reuses golden scenarios via `goldenId`; structural cost A/B (indexed query vs glob/read/grep), not SQL correctness |
| **`CODEMAP_BENCHMARK_CONFIG`** (per-repo benchmark JSON) | [benchmark § Custom scenarios](./benchmark.md#custom-scenarios-codemap_benchmark_config) | [fixtures/benchmark/scenarios.example.json](../fixtures/benchmark/scenarios.example.json) only |
| `bun run qa:external` — index + disk checks + `benchmark.ts` on **`CODEMAP_*`** | [.github/CONTRIBUTING.md](../.github/CONTRIBUTING.md) | [scripts/qa-external-repo.ts](../scripts/qa-external-repo.ts) (invocation only) |
| **Non-goals (v1)** — what Codemap deliberately doesn't do (full-text search, LSP, static analysis, visualization, daemon, deep intent classification) | [roadmap.md § Non-goals](./roadmap.md#non-goals-v1) | [why-codemap.md § When to reach for something else](./why-codemap.md#when-to-reach-for-something-else) (consumer-facing framing) — links here; [research/](./research/) notes link here, never re-list |
29 changes: 27 additions & 2 deletions docs/benchmark.md
@@ -10,6 +10,7 @@
| **Measure SQL vs glob+read+regex** after an index exists — `src/benchmark.ts`, scenarios, fixtures | [§ The benchmark script](#the-benchmark-script) |
| **Compare `codemap query` table vs `--json` stdout** (lines/bytes) on an existing index | [§ Query stdout (`benchmark:query`)](#query-stdout-table-vs-json-benchmarkquery) |
| **Guardrail full-rebuild per-phase walls against a committed baseline** (local + weekly scheduled) | [§ Perf baseline (regression guardrail)](#perf-baseline-regression-guardrail) |
| **A/B agent eval** — indexed MCP-on vs file-scan MCP-off tool-call + token comparison on fixed probes | [§ Agent eval harness](#agent-eval-harness) |

---

@@ -282,10 +283,34 @@ bun run dev --full
bun run benchmark
```

**CI:** the workflow **Benchmark (fixture)** runs the same steps with `CODEMAP_ROOT=$GITHUB_WORKSPACE/fixtures/minimal`.
**CI:** the **Test** job runs `bun run test:agent-eval` after `test:golden` (probe smoke reuses the golden index via `--skip-index` when present; typically ~1–2 min combined); **Benchmark (fixture)** indexes the same corpus and runs `bun run benchmark`.

### Agent eval harness

Dev-only A/B harness in [`scripts/agent-eval/`](../scripts/agent-eval/) (not shipped in npm). Indexes the fixture corpus once, then compares an **indexed query arm** (one simulated `query` tool call per probe via `queryRows`, not an MCP transport round-trip) against an **MCP-off** arm that simulates agent discovery without the index (`glob` → `read` × N → `grep`). Probe **prompts and SQL/recipe** reuse [golden scenarios](../fixtures/golden/scenarios.json) via `goldenId` (override with `--scenarios` / `AGENT_EVAL_SCENARIOS` when using an external corpus); probe definitions live in [`scripts/agent-eval/scenarios.json`](../scripts/agent-eval/scenarios.json) (override with `--probes` / `AGENT_EVAL_PROBES`). The MCP-off **traditional** regex/globs in each probe approximate naive file discovery (not byte-identical to golden SQL).
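
The MCP-off arm's `glob` → `read` × N → `grep` shape can be illustrated with a minimal in-memory sketch. Everything here is hypothetical and is not the code in `scripts/agent-eval/`: the `simulateMcpOff` name, the `McpOffResult` shape, and the use of a `Map` in place of the file system are assumptions made so the example stays self-contained.

```typescript
// Hypothetical sketch of an MCP-off discovery pass over an in-memory corpus.
// The real harness reads files from disk; names and shapes here are assumptions.
type ToolName = "glob" | "read" | "grep";

interface McpOffResult {
  toolSequence: ToolName[]; // one entry per simulated tool call
  toolCallCount: number;
  charsRead: number; // total characters "read" across matched files
  hits: string[]; // "path: line" for every matching line
}

// Pass non-global regexes: RegExp.prototype.test is stateful with the `g` flag.
function simulateMcpOff(
  files: Map<string, string>,
  pathFilter: RegExp,
  pattern: RegExp,
): McpOffResult {
  // Step 1: one glob call selects candidate paths.
  const toolSequence: ToolName[] = ["glob"];
  const matched = Array.from(files.keys()).filter((p) => pathFilter.test(p));

  // Step 2: one read call per matched file.
  let charsRead = 0;
  for (const path of matched) {
    toolSequence.push("read");
    charsRead += (files.get(path) ?? "").length;
  }

  // Step 3: one grep call over everything that was read.
  toolSequence.push("grep");
  const hits: string[] = [];
  for (const path of matched) {
    for (const line of (files.get(path) ?? "").split("\n")) {
      if (pattern.test(line)) hits.push(`${path}: ${line.trim()}`);
    }
  }

  return { toolSequence, toolCallCount: toolSequence.length, charsRead, hits };
}
```

Under these assumptions the tool-call count grows with the number of files the glob matches, whereas the indexed arm stays at one `query` call per probe, which is the structural difference the A/B comparison reports.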

**One-command local run:**

```bash
bash scripts/agent-eval/run-arms.sh
# default output: .agent-eval/comparison.json
# exits non-zero when any probe's scenarioSuccess is false
```

Environment overrides: `AGENT_EVAL_OUTPUT`, `AGENT_EVAL_FIXTURE_ROOT`, `AGENT_EVAL_SCENARIOS`, `AGENT_EVAL_PROBES`. **`AGENT_EVAL_RUNS`** (or `--runs`) repeats each probe and **averages** `wallMs`, `estTokens`, `resultCount`, and `toolCallCount` (rounded; `estTokens` re-ceiled after averaging); `toolSequence` stays from the first run. **`--skip-index`** skips a full reindex when `.codemap/index.db` already exists (CI smoke reuses the index left by `test:golden`). Optional real agent session logs: `AGENT_EVAL_LOG=path/to/export.json bash scripts/agent-eval/run-arms.sh` (prints parsed tool metrics via `print-log-metrics.ts`).
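
The multi-run averaging described above can be sketched as follows. This is an illustration only: `ProbeRun` and `averageRuns` are hypothetical names, and the exact rounding calls (`Math.round` for most fields, `Math.ceil` for `estTokens`) are an assumption based on the wording "rounded; `estTokens` re-ceiled after averaging", not a copy of the harness code.

```typescript
// Hypothetical shape for one probe run; field names follow the metrics listed above.
interface ProbeRun {
  wallMs: number;
  estTokens: number;
  resultCount: number;
  toolCallCount: number;
  toolSequence: string[];
}

// Average numeric metrics across repeated runs of the same probe.
// Assumption: Math.round for most fields, Math.ceil for estTokens.
function averageRuns(runs: ProbeRun[]): ProbeRun {
  const first = runs[0];
  if (first === undefined) throw new Error("averageRuns needs at least one run");

  const mean = (pick: (r: ProbeRun) => number): number =>
    runs.reduce((sum, r) => sum + pick(r), 0) / runs.length;

  return {
    wallMs: Math.round(mean((r) => r.wallMs)),
    estTokens: Math.ceil(mean((r) => r.estTokens)),
    resultCount: Math.round(mean((r) => r.resultCount)),
    toolCallCount: Math.round(mean((r) => r.toolCallCount)),
    // The tool sequence is not averaged; it is taken from the first run.
    toolSequence: first.toolSequence,
  };
}
```

Keeping `toolSequence` from the first run avoids having to merge sequences that should be identical in the deterministic probe mode anyway.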

**Metrics (per scenario and summary):** tool-call sequence + count, wall time, estimated tokens (`chars / 4` on prompt + payload — MCP-on includes SQL, bind values, and JSON rows; MCP-off includes bytes read + grep hits), per-arm `success` (non-empty results) plus `scenarioSuccess` when both arms succeed. Results stay local JSON — no telemetry upload ([plan](./plans/agent-eval-harness.md) L.5).
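
A minimal sketch of the token estimate and success flags described above. The function names and the `ArmResult` shape are hypothetical, and rounding up with `Math.ceil` is an assumption; the source only specifies `chars / 4` on prompt plus payload, non-empty results for per-arm `success`, and both arms succeeding for `scenarioSuccess`.

```typescript
// Hypothetical helper: estimate tokens as characters / 4 over prompt + payload.
// Assumption: the quotient is rounded up so short inputs never estimate to 0.
function estimateTokens(prompt: string, payload: string): number {
  return Math.ceil((prompt.length + payload.length) / 4);
}

// Hypothetical per-arm result; only the field needed for the success check.
interface ArmResult {
  resultCount: number;
}

// An arm succeeds when it returned at least one result.
function armSuccess(arm: ArmResult): boolean {
  return arm.resultCount > 0;
}

// A scenario succeeds only when both the MCP-on and MCP-off arms succeed.
function scenarioSucceeded(mcpOn: ArmResult, mcpOff: ArmResult): boolean {
  return armSuccess(mcpOn) && armSuccess(mcpOff);
}
```

For example, a 4-character prompt with a 5-character payload estimates to `Math.ceil(9 / 4)`, that is 3 tokens under this sketch.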

**Methodology notes:**

- **Probe mode** is deterministic (no LLM): it measures structural cost of indexed SQL vs traditional file scan on the same corpus. Use it for regression guardrails and fixture tuning.
- **Log mode** parses exported agent transcripts (entries / messages / line formats) when you run live A/B sessions with MCP on vs off. Token estimates include tool `args` / `arguments` payloads and structured `content` part arrays where present; `wallMs` sums per-entry timings when exported.
- External public repos (zod, fastify, etc.): point `AGENT_EVAL_FIXTURE_ROOT` at an indexed tree, pass matching `--scenarios` / `--probes` overrides, and extend probe definitions — same harness, not duplicated fixtures.

Plan: [`docs/plans/agent-eval-harness.md`](./plans/agent-eval-harness.md). PR CI runs `bun run test:agent-eval` in the **Test** job; optional nightly / `workflow_dispatch` for external fixtures is not wired yet.

**Correctness (golden queries):** `bun run test:golden` indexes `fixtures/minimal`, runs SQL against [fixtures/golden/scenarios.json](../fixtures/golden/scenarios.json), and compares to [fixtures/golden/minimal/](../fixtures/golden/minimal/). See [golden-queries.md](./golden-queries.md). Refresh goldens after intentional fixture or schema changes: `bun scripts/query-golden.ts --update`.

**Tier B (local tree, not in default CI):** `bun run test:golden:external` (or `bun scripts/query-golden.ts --corpus external`) indexes **`CODEMAP_ROOT`**, **`CODEMAP_TEST_BENCH`**, or **`--root`**, loads [fixtures/golden/scenarios.external.json](../fixtures/golden/scenarios.external.json) if present else [scenarios.external.example.json](../fixtures/golden/scenarios.external.example.json), and writes/compares goldens under `fixtures/golden/external/` (gitignored). Use **`match`** in scenarios for subset checks (`minRows`, `everyRowContains`); use **`budgetMs`** with optional **`--strict-budget`** for perf warnings. Do not commit proprietary paths or goldens from private apps.

Scenario titles match the table above; **indexed row counts** on the fixture are stable for a given schema. A larger second fixture is optional — see [roadmap.md](./roadmap.md).
Scenario titles in the [benchmark scenarios table](#custom-scenarios-codemap_benchmark_config) describe latency fixtures; **agent-eval probes** are a separate three-scenario subset in [`scripts/agent-eval/scenarios.json`](../scripts/agent-eval/scenarios.json). **Indexed row counts** on the fixture are stable for a given schema. A larger second fixture is optional — see [roadmap.md](./roadmap.md).