stainless-code
diff --git a/docs/plans/apply-write-safety.md b/docs/plans/apply-write-safety.md
Lines changed: 118 additions & 0 deletions
diff --git a/docs/plans/ast-hash-duplication.md b/docs/plans/ast-hash-duplication.md
Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,118 @@
+# Apply write-safety hardening — plan
+
+> **Status:** open · **Priority:** P2 · **Effort:** L (~1–2 weeks)
+>
+> **Motivator:** `codemap apply` already validates line content in phase 1 and writes via sibling temp + `rename`, but the module documents a **TOCTOU window** (phase-1 cache → phase-2 write without re-read) and omits `fsync` before promote. Agent-driven apply runs (recipe → dry-run → `--yes`) need stronger guarantees that disk did not change between validation and write, plus explicit skips for unsafe EOL states.
+>
+> **Roadmap:** [§ Core substrate & platform](../roadmap.md#core-substrate--platform) · extends shipped [Apply engine](../architecture.md#apply--input-modes-transport-and-policy)
+
+---
+
+## Agent start here
+
+Read the **TOCTOU** and **EOL** comments at the top of [`apply-engine.ts`](../../src/application/apply-engine.ts) — this plan retires the TOCTOU bullet. Implement hash cache + phase-2 recheck first; `fsync` + mixed-EOL second. All transports use the same engine.
+
+### Key touchpoints
+
+| File                                                                                 | What to read                                        |
+| ------------------------------------------------------------------------------------ | --------------------------------------------------- |
+| [`src/application/apply-engine.ts`](../../src/application/apply-engine.ts)           | Phase 1 cache, phase 2 write loop, `ConflictReason` |
+| [`src/hash.ts`](../../src/hash.ts)                                                   | `hashContent` (SHA-256) — use this, not xxh3        |
+| [`src/application/apply-run.ts`](../../src/application/apply-run.ts)                 | CLI/MCP entry to engine                             |
+| [`src/cli/cmd-apply.ts`](../../src/cli/cmd-apply.ts)                                 | `--yes` gating                                      |
+| [`src/application/apply-engine.test.ts`](../../src/application/apply-engine.test.ts) | Conflict fixtures                                   |
+| [`docs/architecture.md`](../architecture.md)                                         | § Apply — update after implementation               |
+
+### Architecture
+
+```text
+apply / apply_rows / apply_diff_input
+  → apply-engine phase 1: read file → hashContent + mixed-EOL check → cache
+  → dry-run OR conflicts → stop
+  → phase 2: re-read → hash compare → transform cache → temp write → fsync → rename
+```
+
+### Tracer bullet (slice 1)
+
+Extend `sourceCache` with `contentHash`; phase-2 mismatch → `file content changed` conflict; test simulating disk edit between phases. Ship before `fsync` / mixed-EOL if schedule tight (document partial state in PR).
+
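A minimal sketch of the slice-1 idea, under stated assumptions: `hashContent` below is a local stand-in for the helper in `src/hash.ts` (assumed to return a SHA-256 hex digest of a string), and `captureSource` / `recheckSource` are hypothetical names, not functions from `apply-engine.ts`. It shows the hash captured on the phase-1 read and compared against a fresh read just before the write.

```typescript
import { createHash } from "node:crypto";
import { mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Stand-in for `hashContent` in src/hash.ts (assumption: SHA-256 hex of the string).
function hashContent(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Cache entry shape proposed by the plan; the real engine may carry more fields.
interface CachedSource {
  source: string;
  contentHash: string;
}

type RecheckResult = "ok" | "file content changed";

const sourceCache = new Map<string, CachedSource>();

// Phase 1 (hypothetical helper): read once per file and remember what was validated.
function captureSource(filePath: string): CachedSource {
  const cached = sourceCache.get(filePath);
  if (cached !== undefined) return cached;
  const source = readFileSync(filePath, "utf8");
  const entry: CachedSource = { source, contentHash: hashContent(source) };
  sourceCache.set(filePath, entry);
  return entry;
}

// Phase 2 (hypothetical helper): re-read immediately before writing and compare hashes.
function recheckSource(filePath: string): RecheckResult {
  const cached = sourceCache.get(filePath);
  if (cached === undefined) throw new Error(`no phase-1 capture for ${filePath}`);
  const onDiskHash = hashContent(readFileSync(filePath, "utf8"));
  return onDiskHash === cached.contentHash ? "ok" : "file content changed";
}

// Usage: simulate a disk edit between the two phases.
const recheckDir = mkdtempSync(join(tmpdir(), "apply-recheck-"));
const recheckTarget = join(recheckDir, "example.ts");
writeFileSync(recheckTarget, "const a = 1;\n");
captureSource(recheckTarget);
const beforeEdit = recheckSource(recheckTarget);
writeFileSync(recheckTarget, "const a = 2;\n"); // concurrent edit
const afterEdit = recheckSource(recheckTarget);
console.log(beforeEdit, "/", afterEdit);
```

The same temp-file rewrite could serve as the fixture for the "disk edit between phases" test; how the conflict is attached to rows is left to the engine.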
+### Out of scope (v1)
+
+Cross-file rollback on phase-2 failure; BOM round-trip policy (Q2); adversarial locking.
+
+---
+
+## Current state (shipped)
+
+| Behavior                                    | Status                                 |
+| ------------------------------------------- | -------------------------------------- |
+| Phase-1 line match (`line content drifted`) | ✅                                     |
+| Per-file temp + `rename`                    | ✅                                     |
+| All-or-nothing on phase-1 conflicts         | ✅                                     |
+| Symlink / path-escape guards                | ✅                                     |
+| Content-hash recheck before write           | ❌                                     |
+| `fsync` before `rename`                     | ❌                                     |
+| Mixed CRLF/LF skip                          | ❌                                     |
+| Cross-file rollback on phase-2 failure      | ❌ (deferred per architecture § Apply) |
+
+---
+
+## Pre-locked decisions
+
+| #   | Decision                                                                                                                                                                                                                      | Source                                     |
+| --- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------ |
+| W.1 | **Hash at capture** — on first phase-1 read per file, store `hashContent(source)` via existing `src/hash.ts` (SHA-256).                                                                                                       | Project hash convention                    |
+| W.2 | **Recheck immediately before write** — phase 2 re-reads file (or compares hash of disk vs cached hash); mismatch → conflict `file content changed` for that file's rows; **abort entire apply** (preserve Q2 all-or-nothing). | Closes documented TOCTOU gap               |
+| W.3 | **Atomic promote** — `writeFileSync` → `fsync` (or `fdatasync`) on temp handle → `renameSync`; preserve mode bits where practical (`lstat` before write).                                                                     | Crash-safe promote                         |
+| W.4 | **Mixed line endings** — if file contains both `\r\n` and bare `\n` (excluding trailing file newline edge cases per impl PR), skip file with conflict reason `mixed line endings` — no silent corruption.                     | EOL safety                                 |
+| W.5 | **Envelope extension** — new `ConflictReason` values: `file content changed`, `mixed line endings`; optional per-file `skip_reason` on `ApplyFile` when whole file skipped.                                                   | Q5 envelope stability                      |
+| W.6 | **Cross-file rollback stays deferred** — per-file atomicity + pre-write hash guard is v1 scope; full transaction log out of scope.                                                                                            | [architecture § Apply](../architecture.md) |
+
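The promote sequence in W.3 can be sketched as follows. This is an illustration only: `promoteAtomically` and the temp-file naming scheme are hypothetical, and cleanup of the temp file on failure is omitted. It uses only standard `node:fs` calls.

```typescript
import {
  closeSync,
  fsyncSync,
  lstatSync,
  mkdtempSync,
  openSync,
  readFileSync,
  renameSync,
  writeFileSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { basename, dirname, join } from "node:path";

// Hypothetical helper: write to a sibling temp file, fsync it, then rename over the target.
function promoteAtomically(targetPath: string, content: string): void {
  // Capture permission bits before writing so the replacement can keep them (subject to umask).
  const mode = lstatSync(targetPath).mode & 0o777;
  // Sibling temp path: same directory, so the rename stays on one filesystem.
  const tempPath = join(
    dirname(targetPath),
    `.${basename(targetPath)}.${process.pid}.tmp`,
  );
  const fd = openSync(tempPath, "w", mode);
  try {
    writeFileSync(fd, content, "utf8"); // writeFileSync accepts a file descriptor
    fsyncSync(fd); // flush data to disk before the rename makes it visible
  } finally {
    closeSync(fd);
  }
  renameSync(tempPath, targetPath);
}

// Usage: replace the contents of an existing file.
const promoteDir = mkdtempSync(join(tmpdir(), "apply-promote-"));
const promoteTarget = join(promoteDir, "example.txt");
writeFileSync(promoteTarget, "old\n");
promoteAtomically(promoteTarget, "new\n");
const promoted = readFileSync(promoteTarget, "utf8");
console.log(JSON.stringify(promoted));
```

Opening the temp file explicitly is what makes the `fsync` possible, since the sync call needs a descriptor rather than a path.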
+---
+
+## Implementation steps
+
+1. Extend `sourceCache` to `Map<string, { source: string; contentHash: string }>` in `apply-engine.ts`.
+2. Add `detectMixedLineEndings(source: string): boolean` helper + unit tests (CRLF-only, LF-only, mixed).
+3. Phase-1: after read, run mixed-EOL check; fail rows for that file early.
+4. Phase-2 loop: before transforms, `readFileSync` + `hashContent`; compare to cached hash → conflict or proceed.
+5. Replace the bare `writeFileSync` with `openSync` → write → `fsyncSync(fd)` → `closeSync` on the temp path (`fsyncSync` takes a file descriptor, not a path, so `writeFileSync(path)` alone cannot be synced); then `renameSync`.
+6. Update `apply-engine.test.ts` + `cmd-apply.test.ts` — simulate concurrent edit between phases (mock or temp file rewrite).
+7. Document in `architecture.md` § Apply — retire TOCTOU bullet; list new conflict reasons in `glossary.md` § `codemap apply`.
+8. MCP `apply` / `apply_rows` / `apply_diff_input` inherit via shared engine (no transport changes).
+
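One possible shape for the step-2 helper, as a sketch rather than the final implementation: it treats any file that contains both `\r\n` and a bare `\n` as mixed, and does not special-case the trailing-newline edge cases that W.4 defers to the impl PR.

```typescript
// Returns true when the source contains both CRLF and bare-LF line endings.
// Assumption: a lone "\r" (classic Mac) is not counted as a line ending here.
function detectMixedLineEndings(source: string): boolean {
  let sawCrlf = false;
  let sawBareLf = false;
  for (let i = 0; i < source.length; i++) {
    if (source.charCodeAt(i) !== 10) continue; // 10 = "\n"
    if (i > 0 && source.charCodeAt(i - 1) === 13) {
      sawCrlf = true; // 13 = "\r", so this newline is part of "\r\n"
    } else {
      sawBareLf = true;
    }
    if (sawCrlf && sawBareLf) return true; // stop at the first proof of mixing
  }
  return false;
}

// Usage: the three unit-test cases named in step 2.
const lfOnly = detectMixedLineEndings("a\nb\n");
const crlfOnly = detectMixedLineEndings("a\r\nb\r\n");
const mixed = detectMixedLineEndings("a\r\nb\n");
console.log(lfOnly, crlfOnly, mixed);
```

A single pass with early exit keeps the check cheap enough to run on every phase-1 read.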
+---
+
+### Verification
+
+```bash
+bun test src/application/apply-engine.test.ts src/cli/cmd-apply.test.ts
+bun src/index.ts apply <recipe> --dry-run
+# rewrite file on disk between dry-run and --yes → expect file content changed conflict
+```
+
+---
+
+## Acceptance
+
+- [ ] Edit file on disk after dry-run passes but before `--yes` apply → `file content changed`, zero files modified
+- [ ] Mixed-EOL fixture file → `mixed line endings`, no write
+- [ ] Happy path unchanged: valid apply still returns `applied: true`
+- [ ] `destructiveHint` apply tools document recheck behavior in tool description (synergy with [mcp-tool-annotations](./mcp-tool-annotations.md))
+
+---
+
+## Open decisions (impl PR)
+
+| #   | Question                                                                                                        |
+| --- | --------------------------------------------------------------------------------------------------------------- |
+| Q1  | Re-read and compare full content vs hash-only compare (hash-only is cheaper; full compare avoids hash collisions, which are not a practical concern with SHA-256). |
+| Q2  | BOM preservation — strip/re-emit UTF-8 BOM on round-trip?                                                       |
+| Q3  | Per-file skip vs whole-run abort when one file fails hash recheck (default: whole-run abort per W.2).           |
+
+---
+
+## Dependencies
+
+- Shipped: `apply-engine.ts`, `apply-run.ts`, `hashContent`, apply confirmation gates (`--yes`)
+- Synergy: [evidence-chains-on-recipe-rows](./evidence-chains-on-recipe-rows.md) (agents should dry-run then apply with safety net)
@@ -0,0 +1,151 @@
+# AST-hash duplication — plan
+
+> **Status:** open · **Priority:** P2 · **Effort:** M (~2 weeks)
+>
+> **Motivator:** Agents and maintainers need to find **structurally identical** function bodies across files — same control-flow shape, not merely copy-pasted text with renamed identifiers. Token-level suffix-array engines solve a different problem (literal clones). Codemap exposes duplication as **substrate + recipe**: `symbols.body_hash` at parse time + bundled `duplicates` recipe (`GROUP BY body_hash HAVING COUNT(*) > 1`). No severity primitive, no suppression-by-default.
+>
+> **Roadmap:** [§ Core substrate & platform](../roadmap.md#core-substrate--platform)
+
+---
+
+## Agent start here
+
+Ship **`body_hash` column + migration + one parse fixture** before the `duplicates` recipe. Add a **new extractor** (or extend `symbolsExtractor` pop path) in the **same oxc visitor pass** ([substrate-extraction R.1](./substrate-extraction.md#pre-locked-decisions)). Hash only **function-shaped** symbols in slice 1.
+
+### Key touchpoints
+
+| File                                                                 | What to read                                                                          |
+| -------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
+| [`src/extractors/symbols.ts`](../../src/extractors/symbols.ts)       | Function/method enter + `functionShapeColumns`; where `line_start`/`line_end` are set |
+| [`src/extractors/complexity.ts`](../../src/extractors/complexity.ts) | Pattern for per-symbol body-scoped visitor state                                      |
+| [`src/parser.ts`](../../src/parser.ts)                               | `EXTRACTORS` order (register after `complexityExtractor`); single-pass walk           |
+| [`src/db.ts`](../../src/db.ts)                                       | `SymbolRow` (~L886), `insertSymbols`, `SCHEMA_VERSION` migration pattern              |
+| [`src/extractors/types.ts`](../../src/extractors/types.ts)           | `TierExtractor` contract for `bodyHashExtractor`                                      |
+| [`src/hash.ts`](../../src/hash.ts)                                   | `hashContent` (SHA-256) for canonical body serialization                              |
+| [`templates/recipes/`](../../templates/recipes/)                     | Recipe `.sql` + `.md` pair (e.g. `fan-in`)                                            |
+| [`docs/golden-queries.md`](../golden-queries.md)                     | Register golden scenario for `duplicates` recipe                                      |
+
+### Architecture
+
+```text
+oxc visitor (existing symbol walk)
+  → on function-shaped symbol exit: serialize normalized body AST → hashContent → body_hash
+  → symbol row persisted in symbols.body_hash (nullable for non-function kinds)
+recipe duplicates
+  → SQL GROUP BY body_hash HAVING COUNT(*) > 1
+  → rows: hash group + member symbols (file_path, name, line_start)
+  → query / MCP / HTTP (Moat A — no new verb)
+```
+
+**Not** suffix-array / LCP semantic clones — different problem class (literal copy-paste); stay deferred unless `body_hash` proves insufficient.
+
+### Tracer bullet (slice 1)
+
+1. `body_hash` on `FunctionDeclaration` bodies only; two fixtures with isomorphic bodies → same hash, different names.
+2. `SCHEMA_VERSION` bump.
+3. `duplicates.sql` returns the pair.
+
+Expand to arrows/methods in slice 2.
+
+### Out of scope (v1)
+
+Suffix-array semantic duplication engine; verdict / severity on duplicate groups; default suppressions; hashing type/interface bodies; comment-aware hashing.
+
+---
+
+## Pre-locked decisions
+
+| #   | Decision                                                                                                                                                                | Source                                                                     |
+| --- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- |
+| B.1 | **New column** `symbols.body_hash TEXT` — nullable; populated for function-shaped symbols only in v1.                                                                   | [Moat B](../roadmap.md#moats-load-bearing)                                 |
+| B.2 | **Single-pass extraction** — compute hash in the existing oxc visitor; no second AST walk.                                                                              | [substrate-extraction R.1](./substrate-extraction.md#pre-locked-decisions) |
+| B.3 | **Structural, not textual** — hash canonical serialization of the function **body** subtree (not raw `source.slice`), so whitespace-normalized identical logic matches. | Roadmap differentiation vs suffix-array dupes                              |
+| B.4 | **Moat-A exposure** — bundled recipe id `duplicates` (SQL `GROUP BY body_hash`); consumer applies `LIMIT` / directory filters.                                          | [Moat A](../roadmap.md#moats-load-bearing)                                 |
+| B.5 | **SHA-256 hex** — reuse `hashContent` on the canonical body string (same convention as `files.content_hash`).                                                           | [`src/hash.ts`](../../src/hash.ts)                                         |
+| B.6 | **No verdict primitive** — recipe returns rows; no `pass`/`fail` on duplicate count.                                                                                    | Moat A                                                                     |
+
+---
+
+## Normalization sketch (v1 default — confirm in impl PR)
+
+Canonical string built from a depth-first walk of the body AST:
+
+- Node `type` + ordered child slots
+- **Identifier tokens → placeholder** `$id` (rename-insensitive structural match)
+- **Literal values → kind** (`string`, `number`, …), not the value, so `"a"` and `"b"` still match in structure-only mode; document this false-positive class
+- Skip `loc` / comment attachment
+- Exclude `doc_comment` on the symbol row (comments not in body_hash)
+
+Document the exact rules in `architecture.md` when landed so agents can predict matches.
+
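The normalization rules above can be illustrated on a toy tree. Everything here is an assumption made for the example: `ToyNode` is not the oxc node shape, `canonicalizeBody` and `bodyHash` are hypothetical implementations, and operator handling is left out because the sketch does not specify it.

```typescript
import { createHash } from "node:crypto";

// Toy node shape for illustration only; real parser nodes are richer.
interface ToyNode {
  type: string;
  name?: string; // set on Identifier nodes
  value?: string | number | boolean | null; // set on Literal nodes
  children?: ToyNode[]; // ordered child slots
}

// Depth-first canonical string: identifiers become `$id`, literals keep only their kind.
function canonicalizeBody(node: ToyNode): string {
  if (node.type === "Identifier") return "$id";
  if (node.type === "Literal") {
    const kind = node.value === null ? "null" : typeof node.value;
    return `Literal:${kind}`;
  }
  const children = (node.children ?? []).map((child) => canonicalizeBody(child));
  return `${node.type}(${children.join(",")})`;
}

// SHA-256 hex of the canonical string, mirroring the `hashContent` convention.
function bodyHash(body: ToyNode): string {
  return createHash("sha256").update(canonicalizeBody(body), "utf8").digest("hex");
}

// Small builders to keep the fixtures readable.
const id = (name: string): ToyNode => ({ type: "Identifier", name });
const lit = (value: string | number): ToyNode => ({ type: "Literal", value });
const ret = (arg: ToyNode): ToyNode => ({ type: "ReturnStatement", children: [arg] });
const block = (...children: ToyNode[]): ToyNode => ({ type: "BlockStatement", children });

// `{ return a + 1; }` and `{ return total + 2; }`: same shape, different names and values.
const bodyA = block(ret({ type: "BinaryExpression", children: [id("a"), lit(1)] }));
const bodyB = block(ret({ type: "BinaryExpression", children: [id("total"), lit(2)] }));
// `{ if (a) return 1; }`: different control flow.
const bodyC = block({ type: "IfStatement", children: [id("a"), ret(lit(1))] });

console.log(canonicalizeBody(bodyA));
console.log(bodyHash(bodyA) === bodyHash(bodyB), bodyHash(bodyA) === bodyHash(bodyC));
```

Because literal values are dropped, `bodyA` and `bodyB` collide by design; that is the false-positive class the list above asks to document.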
+---
+
+## Recipe SQL sketch
+
+```sql
+-- illustrative; final SQL in templates/recipes/duplicates.sql
+SELECT body_hash,
+       COUNT(*) AS duplicate_count,
+       GROUP_CONCAT(file_path || ':' || name, ', ') AS members
+FROM symbols
+WHERE body_hash IS NOT NULL
+GROUP BY body_hash
+HAVING COUNT(*) > 1
+ORDER BY duplicate_count DESC;
+```
+
+v1 may emit one row per group or one row per symbol with `duplicate_group_size` — pick in impl PR (golden-query ergonomics).
+
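For fixtures and golden-query expectations, the one-row-per-group shape can be mirrored in memory. This is a sketch only: `SymbolRowLike`, `DuplicateGroup`, and `groupDuplicates` are hypothetical names, and the field set is taken from the SQL sketch rather than from `db.ts`.

```typescript
// Minimal row shape assumed for the example; not the real SymbolRow.
interface SymbolRowLike {
  file_path: string;
  name: string;
  body_hash: string | null;
}

interface DuplicateGroup {
  body_hash: string;
  duplicate_count: number;
  members: string[];
}

// In-memory equivalent of the GROUP BY / HAVING query, one row per group.
function groupDuplicates(rows: SymbolRowLike[]): DuplicateGroup[] {
  const byHash = new Map<string, string[]>();
  for (const row of rows) {
    if (row.body_hash === null) continue; // WHERE body_hash IS NOT NULL
    const members = byHash.get(row.body_hash) ?? [];
    members.push(`${row.file_path}:${row.name}`);
    byHash.set(row.body_hash, members);
  }
  return Array.from(byHash.entries())
    .filter(([, members]) => members.length > 1) // HAVING COUNT(*) > 1
    .map(([body_hash, members]) => ({
      body_hash,
      duplicate_count: members.length,
      members,
    }))
    .sort((a, b) => b.duplicate_count - a.duplicate_count); // ORDER BY duplicate_count DESC
}

// Usage: two functions share a hash, one is unique, one symbol has no body hash.
const duplicateGroups = groupDuplicates([
  { file_path: "a.ts", name: "sum", body_hash: "h1" },
  { file_path: "b.ts", name: "total", body_hash: "h1" },
  { file_path: "c.ts", name: "parse", body_hash: "h2" },
  { file_path: "c.ts", name: "Options", body_hash: null },
]);
console.log(JSON.stringify(duplicateGroups));
```

The per-symbol alternative would instead keep every member row and attach the group size to each, which is the choice left open for the impl PR.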
+---
+
+## Implementation steps
+
+1. **`body-hash.ts` extractor** (or module) — `canonicalizeBody(node): string` + `hashContent`.
+2. Wire on function exit in `symbols.ts` (or dedicated `bodyHashExtractor` registered after symbols).
+3. Extend `SymbolRow` type + `insertSymbols` + migration in `db.ts`.
+4. **`templates/recipes/duplicates.sql` + `.md`** — params: optional `min_count`, `path_prefix`.
+5. Golden fixture: two files, same structure different param names → one duplicate group.
+6. Negative fixture: same name different bodies → different hashes.
+7. Docs — `architecture.md` `symbols.body_hash`; `glossary.md` disambiguate vs suffix-array dupes.
+
+---
+
+### Verification
+
+```bash
+bun test src/extractors/*.test.ts   # add body-hash fixtures
+bun test src/parser.test.ts         # if parse integration tests exist for fixtures
+bun src/index.ts --files <fixture>  # reindex duplicate fixture
+bun src/index.ts query --recipe duplicates --json
+bun run typecheck                   # SymbolRow + insertSymbols column touch db.ts types
+```
+
+Register golden scenario per [`docs/golden-queries.md`](../golden-queries.md); guard via `scripts/query-golden-coverage-matrix.test.mjs`.
+
+---
+
+## Acceptance
+
+- [ ] Two isomorphic function bodies (renamed locals) share `body_hash`
+- [ ] Different control flow → different `body_hash`
+- [ ] `codemap query --recipe duplicates --json` returns groups with `COUNT > 1`
+- [ ] Non-function symbols have `body_hash IS NULL`
+- [ ] Incremental reindex updates hash for changed files
+- [ ] No new pass/fail CLI verb
+
+---
+
+## Open decisions (impl PR)
+
+| #   | Question                                                                                         |
+| --- | ------------------------------------------------------------------------------------------------ |
+| Q1  | v1 kinds: `FunctionDeclaration` only, or include arrows / methods / class methods in slice 1?    |
+| Q2  | Identifier normalization: all → `$id`, or preserve exported param names for stricter matching?   |
+| Q3  | Recipe row shape: one row per duplicate **group** vs one row per **symbol** with group metadata? |
+| Q4  | Minimum body size gate (skip `() => x` one-liners) — default off or `min_body_lines` param?      |
+| Q5  | Index on `symbols(body_hash)` for recipe perf — add in v1 or measure first?                      |
+
+---
+
+## Dependencies
+
+- Shipped: `symbols` extraction, `hashContent`, recipe loader
+- Independent of [churn-complexity-hotspots](./churn-complexity-hotspots.md), [cognitive-complexity](./cognitive-complexity.md)
+- Supersedes motivation for suffix-array semantic dupes (stay deferred)