Commit 684ce51

docs: close research sweep — lift decisions, delete adopted archives
Lift dual-agent findings, apply-path rejected alternatives, and backlog items into benchmark/architecture/roadmap; retarget cross-refs; delete three adopted research notes per docs-governance Rule 8.
1 parent 849a121 commit 684ce51

12 files changed

Lines changed: 77 additions & 931 deletions

docs/README.md

Lines changed: 14 additions & 14 deletions
Large diffs are not rendered by default.

docs/architecture.md

Lines changed: 19 additions & 1 deletion
@@ -166,7 +166,25 @@ Three **mutually exclusive** CLI entry shapes; all converge on `applyDiffPayload
166166

167167
**Discover → preview → apply** (agent loop): `query_recipe` / `query --recipe <id> --format diff-json` (or audit baseline `added` rows) → `apply` with `dry_run: true` → `apply` with `yes: true` (+ `force: true` when required). Per-row `actions[].command` on `--json` query output renders a copy-paste shell line (`renderRecipeActionCommands`).
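The preview-then-apply half of this loop can be sketched as below. This is a minimal illustration, not codemap's implementation: only the `dry_run`, `yes`, and `force` fields come from the paragraph above, while `ApplyRequest`, `ApplyOutcome`, `ApplyFn`, `previewThenApply`, and the `conflicts` field are hypothetical names introduced for the example. Stopping when the preview reports conflicts is likewise an assumption about how an agent might behave.

```typescript
// Hypothetical request shape: only `dry_run`, `yes`, and `force` come from the text.
interface ApplyRequest {
  recipe: string;
  dry_run?: boolean;
  yes?: boolean;
  force?: boolean;
}

// Hypothetical outcome shape used only by this sketch.
interface ApplyOutcome {
  ok: boolean;
  conflicts: number; // assumed: rows that did not match during the preview
}

// Hypothetical transport: stands in for whatever actually invokes `apply`.
type ApplyFn = (req: ApplyRequest) => ApplyOutcome;

function previewThenApply(
  recipe: string,
  apply: ApplyFn,
  needsForce = false,
): ApplyOutcome {
  // Step 1: preview only, nothing is written.
  const preview = apply({ recipe, dry_run: true });
  if (!preview.ok || preview.conflicts > 0) {
    return preview; // assumed policy: do not write after an unclean preview
  }
  // Step 2: confirmed write; `force` is added only when required.
  const req: ApplyRequest = { recipe, yes: true };
  if (needsForce) req.force = true;
  return apply(req);
}

// Usage with a fake transport that records the requests it receives.
const calls: ApplyRequest[] = [];
const fakeApply: ApplyFn = (req) => {
  calls.push(req);
  return { ok: true, conflicts: 0 };
};
const result = previewThenApply("rename-preview", fakeApply, true);
```

Injecting the transport keeps the ordering logic testable without touching the file system, which is why the sketch takes `apply` as a parameter instead of calling a CLI.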
168168

169-
**Non-goals on the apply path** (Moat A preserved): no curated write verbs with new semantics (`codemap fix deprecated`, …); **`codemap rename`** is a thin alias to `apply rename-preview` (same recipe + policy gates as outcome aliases → `query --recipe`). No severity / verdict engine on rows; no JS execution at apply time; no Path A AST apply engine; no cross-file transactional rollback. Rejected alternatives + revisit triggers: [synthesis §7](./research/codemap-richer-index-synthesis-2026-05.md#7-rejected-items-with-trigger-conditions) (`organize-imports`, Path A AST apply, trust tiers, …).
169+
**Non-goals on the apply path** (Moat A preserved): no curated write verbs with new semantics (`codemap fix deprecated`, …); **`codemap rename`** is a thin alias to `apply rename-preview` (same recipe + policy gates as outcome aliases → `query --recipe`). No severity / verdict engine on rows; no JS execution at apply time; no Path A AST apply engine; no cross-file transactional rollback.
170+
171+
**Rejected apply-path alternatives** (grep `rg "Path A|trust tiers|auto_fixable"` in `docs/` for related plans):
172+
173+
| Item | Why rejected | Revisit when |
174+
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
175+
| Curated write verbs (`codemap fix deprecated`, …) — excludes thin `codemap rename` → `apply rename-preview` ([#166](https://github.com/stainless-code/codemap/pull/166)) | Premature before recipe layer proves out; pro-verb sources disagreed on cap (8–12 vs 3–5)       | ≥3 diff-shape recipes ship AND agent-host UX needs verb discovery beyond `actions[].command`                                                                                                       |
176+
| Parallel `applyAstPayload()` AST engine (Path A) | Competes with `ts-morph` / `jscodeshift`; printer burden; positioning blur; floor disappearance | ≥2 of: (a) ≥3 external teams hit substring wall; (b) concrete AST-shape demand; (c) ecosystem moves to AST patches AND substring is bottleneck; (d) Path B (`codemap-to-tsmorph`) handoff friction |
177+
| Trust tiers (`safe` / `review` / `risky`) | Taxonomy debt; `auto_fixable` + `apply.autoApplyRecipes` cover same cases | Allowlist insufficient AND ≥2 consumers ship `jq`-style trust filters in CI |
178+
| Per-row confidence scores in `diff-json` | No consensus on computation | Recipe needs per-site ranking when `before_pattern` matches multiple sites |
179+
| Verifier as product surface (typecheck / lint / tests) | Scope creep; watch + reindex covers structural verify | Consumer plan PR with concrete verifier shape |
180+
| Reliability loop (conflict-rate / apply-success metrics) | No telemetry upload ([Floors](./roadmap.md#floors-v1-product-shape)) | Consumer requests offline / self-hosted observability |
181+
| Generalised `references` + `bindings` consolidation before demand | Incremental position tables first | Third position-table lands AND a recipe wants UNION across all three |
182+
| `--branch` / `--output-patch` workflow flags | `--commit` is priority | `--commit` insufficient in practice |
183+
| Multi-line + kind-tagged row contract | Single-line cases first | Recipe needs multi-line AND workarounds fail |
184+
| Cross-file moves (`move_to`) | Higher risk than single-file | Delete-source + insert-dest two-step insufficient |
185+
| Cross-file atomic apply (backup + restore) | Per-file atomicity fine for ≤10 files | Real apply crosses 50 files AND phase-2 failure leaks partial state |
186+
187+
**Backlog (not rejected):** `organize-imports` diff-shape recipe; `codemap-to-tsmorph` Path B adapter (separate package after `apply --rows` shipped). **Tracked elsewhere:** C.9 entry-point integration — [`plans/c9-plugin-layer.md`](./plans/c9-plugin-layer.md).
170188

171189
**Show / snippet wiring:** **`src/cli/show-snippet-args.ts`** (shared argv parser) + **`src/cli/show-snippet-render.ts`** (shared terminal/JSON error helpers) + **`src/cli/cmd-show.ts`** + **`src/cli/cmd-snippet.ts`** — sibling CLI verbs sharing the same parser shape (`<name>` or **`--query '<field:value …>'`** + **`--with-fts`** + `--kind` + `--in <path>` + `--json`; show adds **`--print-sql`**) and the pure engines **`src/application/show-engine.ts`** (exact lookup + envelope builders), **`src/application/search-query-parser.ts`** + **`src/application/search-engine.ts`** (field-qualified search → parameterized SQL on `symbols`, optional `source_fts` join), and **`src/application/show-search-mode.ts`** (shared parse/normalize + FTS resolution + **`executeShowLookup`** + **`formatShowSearchSqlForQuery`** for CLI/MCP/HTTP). Exact lookup: `findSymbolsByName({db, name, kind?, inPath?})`. Query lookup: `searchSymbols({db, parsed, withFts?})`. Snippet FS read: `readSymbolSource({match, projectRoot, indexedContentHash?})` + `getIndexedContentHash(db, filePath)`. **`buildShowResult`** + **`buildSnippetResult`** envelope builders — same engines the MCP show/snippet tools call. Both verbs return the same `{matches, disambiguation?, warning?}` envelope — single match → `{matches: [{...}]}`; multi-match adds `{n, by_kind, files, hint}`; optional **`warning`** when FTS was requested but `source_fts` is empty. Snippet matches add `source` / `stale` / `missing` fields (additive — no shape divergence). **`--in <path>`** and **`path:`** inside **`--query`** normalize through `toProjectRelative(projectRoot, p)` (from **`src/application/validate-engine.ts`**). Stale-file behavior on `snippet`: `hashContent` (from **`src/hash.ts`**) compares on-disk content against `files.content_hash`; mismatch sets `stale: true` but source IS still returned. MCP tools `show` and `snippet` register parallel to the CLI surface (see [§ MCP wiring](#cli-usage)).
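The stale-file rule for `snippet` described above (compare the on-disk hash with the indexed hash, flag a mismatch, still return the source) can be sketched as follows. This is an illustration under stated assumptions: the real `hashContent` lives in `src/hash.ts` and its algorithm is not given here, so the FNV-1a function below is only a stand-in, and `SnippetMatch`, `buildSnippetMatch`, and the `file` field name are hypothetical.

```typescript
// Stand-in hash (32-bit FNV-1a). The algorithm used by the real `hashContent`
// is not stated in the text; this placeholder only needs to be deterministic.
function hashContent(content: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < content.length; i++) {
    h ^= content.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0).toString(16).padStart(8, "0");
}

// Hypothetical match shape: `source` and `stale` are named in the text,
// `name` and `file` are assumed for the example.
interface SnippetMatch {
  name: string;
  file: string;
  source: string;
  stale: boolean;
}

function buildSnippetMatch(
  name: string,
  file: string,
  onDiskContent: string,
  indexedContentHash: string,
): SnippetMatch {
  // A mismatch marks the match as stale, but the source is returned anyway.
  const stale = hashContent(onDiskContent) !== indexedContentHash;
  return { name, file, source: onDiskContent, stale };
}

// Usage: the index was built from `original`; the file was edited afterwards.
const original = "export function resolveStateDir() {}";
const edited = "export function resolveStateDir() { /* changed */ }";
const indexedHash = hashContent(original);

const fresh = buildSnippetMatch("resolveStateDir", "src/a.ts", original, indexedHash);
const staleMatch = buildSnippetMatch("resolveStateDir", "src/a.ts", edited, indexedHash);
```

Returning the source alongside `stale: true` leaves the decision to the caller: it can use the text as-is or reindex first.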
172190

docs/benchmark.md

Lines changed: 28 additions & 3 deletions
@@ -300,7 +300,7 @@ Dev-only A/B harness in [`scripts/agent-eval/`](../scripts/agent-eval/) (not shi
300300
| ------- | ---------------------------------------------------------------------- | --------------------------------------------------- |
301301
| **Log** | `AGENT_EVAL_LOG_ON` + `AGENT_EVAL_LOG_OFF` (or `compare-live-logs.ts`) | Parses exported MCP-on vs MCP-off agent transcripts |
302302

303-
**Eval layers** (full methodology and exploratory findings: [research/agent-eval-findings-2026-05.md](./research/agent-eval-findings-2026-05.md)):
303+
**Eval layers** (dual-agent provisional findings: [§ Dual-agent study](#dual-agent-study-codemap-self-index-provisional) below):
304304

305305
| Layer | MCP-on | MCP-off / baseline | In CI today? |
306306
| -------------- | ------------------------------ | ------------------------------------------ | ----------------------------------------------------------------------------- |
@@ -309,7 +309,7 @@ Dev-only A/B harness in [`scripts/agent-eval/`](../scripts/agent-eval/) (not shi
309309
| **Log** | Parsed MCP-on export | Parsed MCP-off export | Parser smoke only (`test:agent-eval` on sample logs); no CI on ad-hoc exports |
310310
| **Dual-agent** | Live MCP tools in an LLM agent | Same tasks; MCP/`codemap query` prohibited | No (research only) |
311311

312-
**Probe** and **live** index the fixture, then compare MCP-on against a simulated **MCP-off** arm (`glob` → `read` × N → `grep`). **Log** mode is orthogonal to `AGENT_EVAL_MODE`: it compares two exported session logs via `compare-live-logs.ts`. The traditional arm models **naive** discovery; a skilled grep-only agent may match MCP on simple lookups — see the research note § 4.
312+
**Probe** and **live** index the fixture, then compare MCP-on against a simulated **MCP-off** arm (`glob` → `read` × N → `grep`). **Log** mode is orthogonal to `AGENT_EVAL_MODE`: it compares two exported session logs via `compare-live-logs.ts`. The traditional arm models **naive** discovery; a skilled grep-only agent may match MCP on simple lookups — see [§ Dual-agent study](#dual-agent-study-codemap-self-index-provisional) finding #1.
313313

314314
Probe **prompts and SQL/recipe** reuse [golden scenarios](../fixtures/golden/scenarios.json) via `goldenId` (override with `--scenarios` / `AGENT_EVAL_SCENARIOS` when using an external corpus); probe definitions live in [`scripts/agent-eval/scenarios.json`](../scripts/agent-eval/scenarios.json) (override with `--probes` / `AGENT_EVAL_PROBES`). The MCP-off **traditional** regex/globs in each probe approximate naive file discovery (not byte-identical to golden SQL).
315315

@@ -365,7 +365,32 @@ Environment overrides: `AGENT_EVAL_OUTPUT`, `AGENT_EVAL_FIXTURE_ROOT`, `AGENT_EV
365365
| `find-call-sites` | 1 | 25 | 375 | 2,667 |
366366
| **Totals** | **3** | **75** | **601** | **7,955** |
367367

368-
Numbers are stable for a given fixture + schema; re-run locally after intentional schema or probe changes. Dual-agent and self-index studies: [research/agent-eval-findings-2026-05.md](./research/agent-eval-findings-2026-05.md).
368+
Numbers are stable for a given fixture + schema; re-run locally after intentional schema or probe changes.
369+
370+
#### Dual-agent study (codemap self-index, provisional)
371+
372+
Exploratory runs on the **codemap repo** index (not `fixtures/minimal`) — four structural tasks, MCP-on vs MCP-forbidden (grep/read/shell only). Not pinned in CI; methodology caveats apply.
373+
374+
| Task | MCP-on | MCP-off | Verdict |
375+
| ---------------------------------------------------------- | ---------------------------------- | ----------------------------- | -------------------------------------------- |
376+
| Call path `createCodemap` → `resolveStateDir`                | 2 hops, 1 MCP call                 | Same path, 8 tools            | Tie on answer; MCP cheaper                   |
377+
| Transitive dependents of `src/db.ts` (depth 4) | **132 files** | 124–133 (scope-dependent) | MCP exact; grep approximate |
378+
| Rename preview `resolveStateDir` → `resolveStateDirectory` | **8 code files** (21 binding refs) | 11 files (+ docs, comments)   | MCP matches `rename-preview` scope           |
379+
| Upstream callers of `resolveStateDir` (2 hops) | **28 symbols**, 6 depth-1 | ~28 (text-inferred), 23 tools | Similar count; MCP 1 call, higher confidence |
380+
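The "transitive dependents (depth 4)" row amounts to a depth-bounded traversal over reverse import edges. The sketch below shows that traversal in isolation; it is not codemap's query or recipe, and `ReverseDeps`, `transitiveDependents`, and the toy graph are hypothetical. It suggests why an index with resolved edges can give one exact count, while a grep-based walk depends on how each import string is matched and scoped.

```typescript
// Hypothetical input: file -> files that import it directly (reverse edges).
type ReverseDeps = Map<string, string[]>;

function transitiveDependents(
  graph: ReverseDeps,
  root: string,
  maxDepth: number,
): Set<string> {
  const seen = new Set<string>();
  let frontier: string[] = [root];
  // Breadth-first: each pass of the loop expands one more hop from the root.
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const file of frontier) {
      for (const dependent of graph.get(file) ?? []) {
        if (dependent !== root && !seen.has(dependent)) {
          seen.add(dependent);
          next.push(dependent);
        }
      }
    }
    frontier = next;
  }
  return seen;
}

// Toy graph: a chain five hops long plus one extra direct dependent.
const graph: ReverseDeps = new Map([
  ["src/db.ts", ["src/a.ts", "src/b.ts"]],
  ["src/a.ts", ["src/c.ts"]],
  ["src/c.ts", ["src/d.ts"]],
  ["src/d.ts", ["src/e.ts"]],
  ["src/e.ts", ["src/f.ts"]],
]);
const dependents = transitiveDependents(graph, "src/db.ts", 4);
```

With a depth limit of 4, the toy graph yields `a`, `b`, `c`, `d`, and `e`; `src/f.ts` sits five hops away and is excluded.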
381+
**Structural cost (same session):** MCP-on ~6 MCP calls (+ schema reads), ~38 KB payload; MCP-off ~37 tools (21 grep, 13 read, 3 shell), ~85 KB.
382+
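The structural-cost line can be checked with a few lines of arithmetic. The inputs are the approximate figures quoted above (the MCP-on call count excludes schema reads), so the ratios are rough; the variable names are illustrative only.

```typescript
// Approximate figures from the structural-cost line above.
const mcpOn = { toolCalls: 6, payloadKb: 38 };
const mcpOffBreakdown = { grep: 21, read: 13, shell: 3 };

// The MCP-off total is the sum of the per-tool breakdown.
const mcpOffToolCalls = Object.values(mcpOffBreakdown).reduce((a, b) => a + b, 0);
const mcpOff = { toolCalls: mcpOffToolCalls, payloadKb: 85 };

// Ratios of MCP-off cost to MCP-on cost.
const callRatio = mcpOff.toolCalls / mcpOn.toolCalls;
const payloadRatio = mcpOff.payloadKb / mcpOn.payloadKb;

console.log(`tool calls: ${mcpOff.toolCalls} vs ${mcpOn.toolCalls} (${callRatio.toFixed(1)}x)`);
console.log(`payload: ${mcpOff.payloadKb} KB vs ${mcpOn.payloadKb} KB (${payloadRatio.toFixed(1)}x)`);
```

On these numbers the grep-only arm used roughly six times as many tool invocations but only a little over twice the payload, consistent with provisional finding 3 that MCP payloads are not always the smaller ones.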
383+
**Provisional findings:**
384+
385+
1. **Naive discovery vs skilled grep** — harness MCP-off models glob→read→grep. Skilled targeted grep can tie MCP on tool count for simple symbol/import/call-site lookups.
386+
2. **Graph questions favor MCP** — transitive deps, impact, trace, rename-preview: indexed answers in 1–2 calls; grep chains cost more and often report medium confidence.
387+
3. **Token estimate nuance** — recipe payloads with `actions` metadata can make MCP **larger** than grep on simple tasks; MCP still wins on correctness (resolved edges, column-precise call sites, binding kinds).
388+
4. **Dual-agent > simulation** — hand-waving grep token math understates real agent cost (re-reads, shell graph scripts, scope ambiguity).
389+
5. **Not an LLM eval** — layers measure **structural tool cost** and answer alignment with the index, not model reasoning quality or task success rate.
390+
391+
**Limitations:** corpus-dependent (minimal fixture magnifies MCP-off read fan-out); re-run after `SCHEMA_VERSION` or fixture changes; log mode omits full read payloads unless exports include them.
392+
393+
**Follow-up:** scripted dual-agent harness + external fixture CI — [roadmap § Backlog](./roadmap.md#backlog) (`Scripted dual-agent harness`, `Falsifiable benchmark CI`).
369394

370395
**Correctness (golden queries):** `bun run test:golden` indexes `fixtures/minimal`, runs declared **`setup`** steps when present (e.g. coverage ingest), then runs SQL against [fixtures/golden/scenarios.json](../fixtures/golden/scenarios.json) and compares to [fixtures/golden/minimal/](../fixtures/golden/minimal/). See [golden-queries.md](./golden-queries.md). Refresh goldens after intentional fixture or schema changes: `bun scripts/query-golden.ts --update`.
371396
