docs: close research sweep — lift decisions, delete adopted archives
Lift dual-agent findings, apply-path rejected alternatives, and backlog
items into benchmark/architecture/roadmap; retarget cross-refs; delete
three adopted research notes per docs-governance Rule 8.
File: docs/architecture.md (19 additions, 1 deletion)
@@ -166,7 +166,25 @@ Three **mutually exclusive** CLI entry shapes; all converge on `applyDiffPayload
**Discover → preview → apply** (agent loop): `query_recipe` / `query --recipe <id> --format diff-json` (or audit baseline `added` rows) → `apply` with `dry_run: true` → `apply` with `yes: true` (+ `force: true` when required). Per-row `actions[].command` on `--json` query output renders a copy-paste shell line (`renderRecipeActionCommands`).
-**Non-goals on the apply path** (Moat A preserved): no curated write verbs with new semantics (`codemap fix deprecated`, …); **`codemap rename`** is a thin alias to `apply rename-preview` (same recipe + policy gates as outcome aliases → `query --recipe`). No severity / verdict engine on rows; no JS execution at apply time; no Path A AST apply engine; no cross-file transactional rollback. Rejected alternatives + revisit triggers: [synthesis §7](./research/codemap-richer-index-synthesis-2026-05.md#7-rejected-items-with-trigger-conditions) (`organize-imports`, Path A AST apply, trust tiers, …).
+**Non-goals on the apply path** (Moat A preserved): no curated write verbs with new semantics (`codemap fix deprecated`, …); **`codemap rename`** is a thin alias to `apply rename-preview` (same recipe + policy gates as outcome aliases → `query --recipe`). No severity / verdict engine on rows; no JS execution at apply time; no Path A AST apply engine; no cross-file transactional rollback.
+
+**Rejected apply-path alternatives** (grep `rg "Path A|trust tiers|auto_fixable"` in `docs/` for related plans):
+| Generalised `references` + `bindings` consolidation before demand | Incremental position tables first | Third position-table lands AND a recipe wants UNION across all three |
+| `--branch` / `--output-patch` workflow flags | `--commit` is priority | `--commit` insufficient in practice |
+| Multi-line + kind-tagged row contract | Single-line cases first | Recipe needs multi-line AND workarounds fail |
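The **discover → preview → apply** loop described in this hunk can be sketched as a pair of `apply` calls: a dry run first, then the confirmed write. This is a minimal illustration only. The flag names `dry_run`, `yes`, and `force` come from the text above; the `ApplyArgs` shape, the `recipe` field, and the helper `planApplyCalls` are hypothetical and are not claimed to match the real `apply` payload.

```typescript
// Hypothetical argument shape: only dry_run, yes, and force are named in the doc.
interface ApplyArgs {
  recipe: string; // assumed field, for illustration only
  dry_run?: boolean;
  yes?: boolean;
  force?: boolean;
}

// Build the two apply calls of the agent loop:
// 1) preview with dry_run: true, 2) confirmed write with yes: true
// (+ force: true only when the caller says it is required).
function planApplyCalls(recipe: string, forceRequired: boolean): ApplyArgs[] {
  const preview: ApplyArgs = { recipe, dry_run: true };
  const confirmed: ApplyArgs = { recipe, yes: true };
  if (forceRequired) {
    confirmed.force = true;
  }
  return [preview, confirmed];
}

// Usage: print the planned calls for a rename preview that needs no force.
console.log(JSON.stringify(planApplyCalls("rename-preview", false), null, 2));
```

Keeping the preview and the confirmed call as separate payloads mirrors the ordering in the text: nothing is written until the dry run has been inspected.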
@@ -309,7 +309,7 @@ Dev-only A/B harness in [`scripts/agent-eval/`](../scripts/agent-eval/) (not shi
| **Log** | Parsed MCP-on export | Parsed MCP-off export | Parser smoke only (`test:agent-eval` on sample logs); no CI on ad-hoc exports |
| **Dual-agent** | Live MCP tools in an LLM agent | Same tasks; MCP/`codemap query` prohibited | No (research only) |
-**Probe** and **live** index the fixture, then compare MCP-on against a simulated **MCP-off** arm (`glob` → `read` × N → `grep`). **Log** mode is orthogonal to `AGENT_EVAL_MODE`: it compares two exported session logs via `compare-live-logs.ts`. The traditional arm models **naive** discovery; a skilled grep-only agent may match MCP on simple lookups — see the research note § 4.
+**Probe** and **live** index the fixture, then compare MCP-on against a simulated **MCP-off** arm (`glob` → `read` × N → `grep`). **Log** mode is orthogonal to `AGENT_EVAL_MODE`: it compares two exported session logs via `compare-live-logs.ts`. The traditional arm models **naive** discovery; a skilled grep-only agent may match MCP on simple lookups — see [§ Dual-agent study](#dual-agent-study-codemap-self-index-provisional) finding #1.
Probe **prompts and SQL/recipe** reuse [golden scenarios](../fixtures/golden/scenarios.json) via `goldenId` (override with `--scenarios` / `AGENT_EVAL_SCENARIOS` when using an external corpus); probe definitions live in [`scripts/agent-eval/scenarios.json`](../scripts/agent-eval/scenarios.json) (override with `--probes` / `AGENT_EVAL_PROBES`). The MCP-off **traditional** regex/globs in each probe approximate naive file discovery (not byte-identical to golden SQL).
-Numbers are stable for a given fixture + schema; re-run locally after intentional schema or probe changes. Dual-agent and self-index studies: [research/agent-eval-findings-2026-05.md](./research/agent-eval-findings-2026-05.md).
+Numbers are stable for a given fixture + schema; re-run locally after intentional schema or probe changes.
+
+#### Dual-agent study (codemap self-index, provisional)
+
+Exploratory runs on the **codemap repo** index (not `fixtures/minimal`) — four structural tasks, MCP-on vs MCP-forbidden (grep/read/shell only). Not pinned in CI; methodology caveats apply.
+1. **Naive discovery vs skilled grep** — harness MCP-off models glob→read→grep. Skilled targeted grep can tie MCP on tool count for simple symbol/import/call-site lookups.
+2. **Graph questions favor MCP** — transitive deps, impact, trace, rename-preview: indexed answers in 1–2 calls; grep chains cost more and often report medium confidence.
+3. **Token estimate nuance** — recipe payloads with `actions` metadata can make MCP **larger** than grep on simple tasks; MCP still wins on correctness (resolved edges, column-precise call sites, binding kinds).
+4. **Dual-agent > simulation** — hand-waving grep token math understates real agent cost (re-reads, shell graph scripts, scope ambiguity).
+5. **Not an LLM eval** — layers measure **structural tool cost** and answer alignment with the index, not model reasoning quality or task success rate.
+
+**Limitations:** corpus-dependent (minimal fixture magnifies MCP-off read fan-out); re-run after `SCHEMA_VERSION` or fixture changes; log mode omits full read payloads unless exports include them.
**Correctness (golden queries):** `bun run test:golden` indexes `fixtures/minimal`, runs declared **`setup`** steps when present (e.g. coverage ingest), then runs SQL against [fixtures/golden/scenarios.json](../fixtures/golden/scenarios.json) and compares to [fixtures/golden/minimal/](../fixtures/golden/minimal/). See [golden-queries.md](./golden-queries.md). Refresh goldens after intentional fixture or schema changes: `bun scripts/query-golden.ts --update`.
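The simulated MCP-off arm in this hunk is described as `glob` → `read` × N → `grep`, which implies a tool-call count of N + 2 for N files read. The sketch below works through that arithmetic under stated assumptions: the function names are hypothetical, it is not the harness's own code, and the sample values (8 files read, 2 indexed calls) are illustrative rather than measured results.

```typescript
// Tool calls for the simulated MCP-off arm: one glob, one read per file, one grep.
// Hypothetical helper; not taken from scripts/agent-eval/.
function traditionalToolCalls(filesRead: number): number {
  if (!Number.isInteger(filesRead) || filesRead < 0) {
    throw new RangeError("filesRead must be a non-negative integer");
  }
  return 1 + filesRead + 1;
}

// How many simulated MCP-off calls are spent per indexed (MCP-on) call.
function toolCallRatio(filesRead: number, mcpCalls: number): number {
  if (!Number.isInteger(mcpCalls) || mcpCalls <= 0) {
    throw new RangeError("mcpCalls must be a positive integer");
  }
  return traditionalToolCalls(filesRead) / mcpCalls;
}

// Illustrative inputs only: 8 files read vs 2 indexed calls.
console.log(traditionalToolCalls(8)); // 10
console.log(toolCallRatio(8, 2)); // 5
```

As finding #1 notes, this models naive discovery only; a skilled grep-only agent reads fewer files, so N shrinks and the ratio can approach 1 on simple lookups.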