stainless-code
diff --git a/docs/README.md b/docs/README.md
Lines changed: 14 additions & 14 deletions
diff --git a/docs/architecture.md b/docs/architecture.md
Lines changed: 19 additions & 1 deletion
diff --git a/docs/benchmark.md b/docs/benchmark.md
Lines changed: 28 additions & 3 deletions
@@ -166,7 +166,25 @@ Three **mutually exclusive** CLI entry shapes; all converge on `applyDiffPayload
 
 **Discover → preview → apply** (agent loop): `query_recipe` / `query --recipe <id> --format diff-json` (or audit baseline `added` rows) → `apply` with `dry_run: true` → `apply` with `yes: true` (+ `force: true` when required). Per-row `actions[].command` on `--json` query output renders a copy-paste shell line (`renderRecipeActionCommands`).
 
-**Non-goals on the apply path** (Moat A preserved): no curated write verbs with new semantics (`codemap fix deprecated`, …); **`codemap rename`** is a thin alias to `apply rename-preview` (same recipe + policy gates as outcome aliases → `query --recipe`). No severity / verdict engine on rows; no JS execution at apply time; no Path A AST apply engine; no cross-file transactional rollback. Rejected alternatives + revisit triggers: [synthesis §7](./research/codemap-richer-index-synthesis-2026-05.md#7-rejected-items-with-trigger-conditions) (`organize-imports`, Path A AST apply, trust tiers, …).
+**Non-goals on the apply path** (Moat A preserved): no curated write verbs with new semantics (`codemap fix deprecated`, …); **`codemap rename`** is a thin alias to `apply rename-preview` (same recipe + policy gates as outcome aliases → `query --recipe`). No severity / verdict engine on rows; no JS execution at apply time; no Path A AST apply engine; no cross-file transactional rollback.
+
+**Rejected apply-path alternatives** (run `rg "Path A|trust tiers|auto_fixable" docs/` to find related plans):
+
+| Item                                                                                                                                                                     | Why rejected                                                                                    | Revisit when                                                                                                                                                                                       |
+| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Curated write verbs (`codemap fix deprecated`, …) — excludes thin `codemap rename` → `apply rename-preview` ([#166](https://github.com/stainless-code/codemap/pull/166)) | Premature before recipe layer proves out; pro-verb sources disagreed on cap (8–12 vs 3–5)       | ≥3 diff-shape recipes ship AND agent-host UX needs verb discovery beyond `actions[].command`                                                                                                       |
+| Parallel `applyAstPayload()` AST engine (Path A)                                                                                                                         | Competes with `ts-morph` / `jscodeshift`; printer burden; positioning blur; floor disappearance | ≥2 of: (a) ≥3 external teams hit substring wall; (b) concrete AST-shape demand; (c) ecosystem moves to AST patches AND substring is bottleneck; (d) Path B (`codemap-to-tsmorph`) handoff friction |
+| Trust tiers (`safe` / `review` / `risky`)                                                                                                                                | Taxonomy debt; `auto_fixable` + `apply.autoApplyRecipes` cover same cases                       | Allowlist insufficient AND ≥2 consumers ship `jq`-style trust filters in CI                                                                                                                        |
+| Per-row confidence scores in `diff-json`                                                                                                                                 | No consensus on computation                                                                     | Recipe needs per-site ranking when `before_pattern` matches multiple sites                                                                                                                         |
+| Verifier as product surface (typecheck / lint / tests)                                                                                                                   | Scope creep; watch + reindex covers structural verify                                           | Consumer plan PR with concrete verifier shape                                                                                                                                                      |
+| Reliability loop (conflict-rate / apply-success metrics)                                                                                                                 | No telemetry upload ([Floors](./roadmap.md#floors-v1-product-shape))                            | Consumer requests offline / self-hosted observability                                                                                                                                              |
+| Generalised `references` + `bindings` consolidation before demand                                                                                                        | Incremental position tables first                                                               | Third position-table lands AND a recipe wants UNION across all three                                                                                                                               |
+| `--branch` / `--output-patch` workflow flags                                                                                                                             | `--commit` is priority                                                                          | `--commit` insufficient in practice                                                                                                                                                                |
+| Multi-line + kind-tagged row contract                                                                                                                                    | Single-line cases first                                                                         | Recipe needs multi-line AND workarounds fail                                                                                                                                                       |
+| Cross-file moves (`move_to`)                                                                                                                                             | Higher risk than single-file                                                                    | Delete-source + insert-dest two-step insufficient                                                                                                                                                  |
+| Cross-file atomic apply (backup + restore)                                                                                                                               | Per-file atomicity fine for ≤10 files                                                           | Real apply crosses 50 files AND phase-2 failure leaks partial state                                                                                                                                |
+
+**Backlog (not rejected):** `organize-imports` diff-shape recipe; `codemap-to-tsmorph` Path B adapter (separate package after `apply --rows` shipped). **Tracked elsewhere:** C.9 entry-point integration — [`plans/c9-plugin-layer.md`](./plans/c9-plugin-layer.md).
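
The discover → preview → apply loop described above can be sketched as two `apply` payloads built from the same rows: a dry run first, then a confirmed write. This is a minimal sketch, not the project's implementation. Only `dry_run`, `yes`, and `force` come from the text; the `rows` field name, the `ApplyArgs` type, and the `applyLoopSteps` helper are hypothetical names used for illustration.

```typescript
// Hypothetical argument shape for the `apply` tool. Only dry_run / yes / force
// are named in the docs; `rows` is an assumed field name for the diff payload.
interface ApplyArgs {
  rows: unknown[];
  dry_run?: boolean;
  yes?: boolean;
  force?: boolean;
}

// Preview step: same rows, nothing written.
function previewArgs(rows: unknown[]): ApplyArgs {
  return { rows, dry_run: true };
}

// Confirm step: writes the rows; `force` is set only when a gate requires it.
function confirmArgs(rows: unknown[], needsForce = false): ApplyArgs {
  return needsForce ? { rows, yes: true, force: true } : { rows, yes: true };
}

// Ordered payloads for the two `apply` calls that follow the discover step.
function applyLoopSteps(rows: unknown[], needsForce = false): [ApplyArgs, ApplyArgs] {
  return [previewArgs(rows), confirmArgs(rows, needsForce)];
}
```

Keeping the preview and confirm payloads as pure functions of the same `rows` value makes it easy to check that the write step applies exactly what the dry run showed.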
 
 **Show / snippet wiring:** **`src/cli/show-snippet-args.ts`** (shared argv parser) + **`src/cli/show-snippet-render.ts`** (shared terminal/JSON error helpers) + **`src/cli/cmd-show.ts`** + **`src/cli/cmd-snippet.ts`** — sibling CLI verbs sharing the same parser shape (`<name>` or **`--query '<field:value …>'`** + **`--with-fts`** + `--kind` + `--in <path>` + `--json`; show adds **`--print-sql`**) and the pure engines **`src/application/show-engine.ts`** (exact lookup + envelope builders), **`src/application/search-query-parser.ts`** + **`src/application/search-engine.ts`** (field-qualified search → parameterized SQL on `symbols`, optional `source_fts` join), and **`src/application/show-search-mode.ts`** (shared parse/normalize + FTS resolution + **`executeShowLookup`** + **`formatShowSearchSqlForQuery`** for CLI/MCP/HTTP). Exact lookup: `findSymbolsByName({db, name, kind?, inPath?})`. Query lookup: `searchSymbols({db, parsed, withFts?})`. Snippet FS read: `readSymbolSource({match, projectRoot, indexedContentHash?})` + `getIndexedContentHash(db, filePath)`. **`buildShowResult`** + **`buildSnippetResult`** envelope builders — same engines the MCP show/snippet tools call. Both verbs return the same `{matches, disambiguation?, warning?}` envelope — single match → `{matches: [{...}]}`; multi-match adds `{n, by_kind, files, hint}`; optional **`warning`** when FTS was requested but `source_fts` is empty. Snippet matches add `source` / `stale` / `missing` fields (additive — no shape divergence). **`--in <path>`** and **`path:`** inside **`--query`** normalize through `toProjectRelative(projectRoot, p)` (from **`src/application/validate-engine.ts`**). Stale-file behavior on `snippet`: `hashContent` (from **`src/hash.ts`**) compares on-disk content against `files.content_hash`; mismatch sets `stale: true` but source IS still returned. MCP tools `show` and `snippet` register parallel to the CLI surface (see [§ MCP wiring](#cli-usage)).
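
The stale-file rule for `snippet` (hash mismatch sets `stale: true`, but the source is still returned) can be sketched as below. This is an illustration under stated assumptions, not the code in `src/application/show-engine.ts`: the hash function is injected because the real `hashContent` algorithm is not described here, `snippetFields` and `toyHash` are hypothetical names, and treating a `null` read as `missing: true` is an assumption about what the `missing` field means.

```typescript
// Only the field names source / stale / missing come from the docs.
interface SnippetFields {
  source?: string;
  stale: boolean;
  missing: boolean;
}

function snippetFields(
  onDiskContent: string | null, // null models a file no longer on disk (assumption)
  indexedContentHash: string, // value recorded in files.content_hash at index time
  hashContent: (content: string) => string, // stand-in for the real hashContent
): SnippetFields {
  if (onDiskContent === null) {
    return { stale: false, missing: true };
  }
  // A mismatch flags the match as stale but does not withhold the source.
  const stale = hashContent(onDiskContent) !== indexedContentHash;
  return { source: onDiskContent, stale, missing: false };
}

// Toy hash for demonstration only; it is not the project's hash.
const toyHash = (content: string): string => String(content.length);
```

Returning the source alongside `stale: true` lets a caller decide whether to reindex or to use the possibly shifted text as-is.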
 
 
@@ -300,7 +300,7 @@ Dev-only A/B harness in [`scripts/agent-eval/`](../scripts/agent-eval/) (not shi
 | ------- | ---------------------------------------------------------------------- | --------------------------------------------------- |
 | **Log** | `AGENT_EVAL_LOG_ON` + `AGENT_EVAL_LOG_OFF` (or `compare-live-logs.ts`) | Parses exported MCP-on vs MCP-off agent transcripts |
 
-**Eval layers** (full methodology and exploratory findings: [research/agent-eval-findings-2026-05.md](./research/agent-eval-findings-2026-05.md)):
+**Eval layers** (dual-agent provisional findings: [§ Dual-agent study](#dual-agent-study-codemap-self-index-provisional) below):
 
 | Layer          | MCP-on                         | MCP-off / baseline                         | In CI today?                                                                  |
 | -------------- | ------------------------------ | ------------------------------------------ | ----------------------------------------------------------------------------- |
@@ -309,7 +309,7 @@ Dev-only A/B harness in [`scripts/agent-eval/`](../scripts/agent-eval/) (not shi
 | **Log**        | Parsed MCP-on export           | Parsed MCP-off export                      | Parser smoke only (`test:agent-eval` on sample logs); no CI on ad-hoc exports |
 | **Dual-agent** | Live MCP tools in an LLM agent | Same tasks; MCP/`codemap query` prohibited | No (research only)                                                            |
 
-**Probe** and **live** index the fixture, then compare MCP-on against a simulated **MCP-off** arm (`glob` → `read` × N → `grep`). **Log** mode is orthogonal to `AGENT_EVAL_MODE`: it compares two exported session logs via `compare-live-logs.ts`. The traditional arm models **naive** discovery; a skilled grep-only agent may match MCP on simple lookups — see the research note § 4.
+**Probe** and **live** index the fixture, then compare MCP-on against a simulated **MCP-off** arm (`glob` → `read` × N → `grep`). **Log** mode is orthogonal to `AGENT_EVAL_MODE`: it compares two exported session logs via `compare-live-logs.ts`. The traditional arm models **naive** discovery; a skilled grep-only agent may match MCP on simple lookups — see [§ Dual-agent study](#dual-agent-study-codemap-self-index-provisional) finding #1.
 
 Probe **prompts and SQL/recipe** reuse [golden scenarios](../fixtures/golden/scenarios.json) via `goldenId` (override with `--scenarios` / `AGENT_EVAL_SCENARIOS` when using an external corpus); probe definitions live in [`scripts/agent-eval/scenarios.json`](../scripts/agent-eval/scenarios.json) (override with `--probes` / `AGENT_EVAL_PROBES`). The MCP-off **traditional** regex/globs in each probe approximate naive file discovery (not byte-identical to golden SQL).
 
@@ -365,7 +365,32 @@ Environment overrides: `AGENT_EVAL_OUTPUT`, `AGENT_EVAL_FIXTURE_ROOT`, `AGENT_EV
 | `find-call-sites`            | 1            | 25            | 375                | 2,667               |
 | **Totals**                   | **3**        | **75**        | **601**            | **7,955**           |
 
-Numbers are stable for a given fixture + schema; re-run locally after intentional schema or probe changes. Dual-agent and self-index studies: [research/agent-eval-findings-2026-05.md](./research/agent-eval-findings-2026-05.md).
+Numbers are stable for a given fixture + schema; re-run locally after intentional schema or probe changes.
+
+#### Dual-agent study (codemap self-index, provisional)
+
+Exploratory runs on the **codemap repo** index (not `fixtures/minimal`) — four structural tasks, MCP-on vs MCP-forbidden (grep/read/shell only). Not pinned in CI; methodology caveats apply.
+
+| Task                                                       | MCP-on                             | MCP-off                       | Verdict                                      |
+| ---------------------------------------------------------- | ---------------------------------- | ----------------------------- | -------------------------------------------- |
+| Call path `createCodemap` → `resolveStateDir`              | 2 hops, 1 MCP call                 | Same path, 8 tools            | Tie on answer; MCP cheaper                   |
+| Transitive dependents of `src/db.ts` (depth 4)             | **132 files**                      | 124–133 (scope-dependent)     | MCP exact; grep approximate                  |
+| Rename preview `resolveStateDir` → `resolveStateDirectory` | **8 code files** (21 binding refs) | 11 files (+ docs, comments)   | MCP matches `rename-preview` scope           |
+| Upstream callers of `resolveStateDir` (2 hops)             | **28 symbols**, 6 depth-1          | ~28 (text-inferred), 23 tools | Similar count; MCP 1 call, higher confidence |
+
+**Structural cost (same session):** MCP-on ~6 MCP calls (+ schema reads), ~38 KB payload; MCP-off ~37 tools (21 grep, 13 read, 3 shell), ~85 KB.
+
+**Provisional findings:**
+
+1. **Naive discovery vs skilled grep:** the harness's MCP-off arm models naive `glob` → `read` → `grep` discovery. A skilled, targeted grep can tie MCP on tool count for simple symbol/import/call-site lookups.
+2. **Graph questions favor MCP:** for transitive deps, impact, trace, and rename-preview, the index answers in 1–2 calls; grep chains cost more and often report only medium confidence.
+3. **Token estimate nuance:** recipe payloads with `actions` metadata can make MCP **larger** than grep on simple tasks; MCP still wins on correctness (resolved edges, column-precise call sites, binding kinds).
+4. **Dual-agent > simulation:** back-of-envelope grep token math understates real agent cost (re-reads, shell graph scripts, scope ambiguity).
+5. **Not an LLM eval:** the layers measure **structural tool cost** and answer alignment with the index, not model reasoning quality or task success rate.
+
+**Limitations:** corpus-dependent (minimal fixture magnifies MCP-off read fan-out); re-run after `SCHEMA_VERSION` or fixture changes; log mode omits full read payloads unless exports include them.
+
+**Follow-up:** scripted dual-agent harness + external fixture CI — [roadmap § Backlog](./roadmap.md#backlog) (`Scripted dual-agent harness`, `Falsifiable benchmark CI`).
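
To make the "Structural cost (same session)" figures easier to compare, the ratios can be worked out as below. The inputs are the approximate numbers quoted in that line (the MCP-on call count excludes schema reads), so the results are rough. `ArmCost` and `costRatio` are hypothetical names for this sketch and are not part of the harness.

```typescript
interface ArmCost {
  toolCalls: number;
  payloadKb: number;
}

// Approximate values from the "Structural cost (same session)" line.
const mcpOn: ArmCost = { toolCalls: 6, payloadKb: 38 };
const mcpOff: ArmCost = { toolCalls: 37, payloadKb: 85 };

// How many times more the baseline arm costs than the candidate arm,
// rounded to one decimal place.
function costRatio(baseline: ArmCost, candidate: ArmCost): ArmCost {
  const round1 = (x: number): number => Math.round(x * 10) / 10;
  return {
    toolCalls: round1(baseline.toolCalls / candidate.toolCalls),
    payloadKb: round1(baseline.payloadKb / candidate.payloadKb),
  };
}

const ratio = costRatio(mcpOff, mcpOn);
console.log(ratio); // { toolCalls: 6.2, payloadKb: 2.2 }
```

On these figures the MCP-off arm used roughly six times the tool calls and a little over twice the payload, which is consistent with finding 2 for graph-style tasks.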
 
 **Correctness (golden queries):** `bun run test:golden` indexes `fixtures/minimal`, runs declared **`setup`** steps when present (e.g. coverage ingest), then runs SQL against [fixtures/golden/scenarios.json](../fixtures/golden/scenarios.json) and compares to [fixtures/golden/minimal/](../fixtures/golden/minimal/). See [golden-queries.md](./golden-queries.md). Refresh goldens after intentional fixture or schema changes: `bun scripts/query-golden.ts --update`.