Skip to content

Commit 1ab9607

Browse files
committed
docs: expand backlog with agent-ready implementation plans
Add roadmap entries and standalone plan files for audit attribution, evidence chains, apply write-safety, churn hotspots, cognitive complexity, AST-hash dupes, CRAP/coverage recipes, CI output formats, MCP annotations, and shell installer — plus agent-start sections on C.9, LSP, and Marketplace Action plans.
1 parent 22588be commit 1ab9607

15 files changed

Lines changed: 1494 additions & 5 deletions

docs/plans/apply-write-safety.md

Lines changed: 118 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,118 @@
1+
# Apply write-safety hardening — plan
2+
3+
> **Status:** open · **Priority:** P2 · **Effort:** L (~1–2 weeks)
4+
>
5+
> **Motivator:** `codemap apply` already validates line content in phase 1 and writes via sibling temp + `rename`, but the module documents a **TOCTOU window** (phase-1 cache → phase-2 write without re-read) and omits `fsync` before promote. Agent-driven apply runs (recipe → dry-run → `--yes`) need stronger guarantees that disk did not change between validation and write, plus explicit skips for unsafe EOL states.
6+
>
7+
> **Roadmap:** [§ Core substrate & platform](../roadmap.md#core-substrate--platform) · extends shipped [Apply engine](../architecture.md#apply--input-modes-transport-and-policy)
8+
9+
---
10+
11+
## Agent start here
12+
13+
Read the **TOCTOU** and **EOL** comments at the top of [`apply-engine.ts`](../../src/application/apply-engine.ts) — this plan retires the TOCTOU bullet. Implement hash cache + phase-2 recheck first; `fsync` + mixed-EOL second. All transports use the same engine.
14+
15+
### Key touchpoints
16+
17+
| File | What to read |
18+
| ------------------------------------------------------------------------------------ | --------------------------------------------------- |
19+
| [`src/application/apply-engine.ts`](../../src/application/apply-engine.ts) | Phase 1 cache, phase 2 write loop, `ConflictReason` |
20+
| [`src/hash.ts`](../../src/hash.ts) | `hashContent` (SHA-256) — use this, not xxh3 |
21+
| [`src/application/apply-run.ts`](../../src/application/apply-run.ts) | CLI/MCP entry to engine |
22+
| [`src/cli/cmd-apply.ts`](../../src/cli/cmd-apply.ts) | `--yes` gating |
23+
| [`src/application/apply-engine.test.ts`](../../src/application/apply-engine.test.ts) | Conflict fixtures |
24+
| [`docs/architecture.md`](../architecture.md) | § Apply — update after implementation |
25+
26+
### Architecture
27+
28+
```text
29+
apply / apply_rows / apply_diff_input
30+
→ apply-engine phase 1: read file → hashContent + mixed-EOL check → cache
31+
→ dry-run OR conflicts → stop
32+
→ phase 2: re-read → hash compare → transform cache → temp write → fsync → rename
33+
```
34+
35+
### Tracer bullet (slice 1)
36+
37+
Extend `sourceCache` with `contentHash`; phase-2 mismatch → `file content changed` conflict; test simulating disk edit between phases. Ship before `fsync` / mixed-EOL if schedule tight (document partial state in PR).
38+
39+
### Out of scope (v1)
40+
41+
Cross-file rollback on phase-2 failure; BOM round-trip policy (Q2); adversarial locking.
42+
43+
---
44+
45+
## Current state (shipped)
46+
47+
| Behavior | Status |
48+
| ------------------------------------------- | -------------------------------------- |
49+
| Phase-1 line match (`line content drifted`) ||
50+
| Per-file temp + `rename` ||
51+
| All-or-nothing on phase-1 conflicts ||
52+
| Symlink / path-escape guards ||
53+
| Content-hash recheck before write ||
54+
| `fsync` before `rename` ||
55+
| Mixed CRLF/LF skip ||
56+
| Cross-file rollback on phase-2 failure | ❌ (deferred per architecture § Apply) |
57+
58+
---
59+
60+
## Pre-locked decisions
61+
62+
| # | Decision | Source |
63+
| --- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------ |
64+
| W.1 | **Hash at capture** — on first phase-1 read per file, store `hashContent(source)` via existing `src/hash.ts` (SHA-256). | Project hash convention |
65+
| W.2 | **Recheck immediately before write** — phase 2 re-reads file (or compares hash of disk vs cached hash); mismatch → conflict `file content changed` for that file's rows; **abort entire apply** (preserve Q2 all-or-nothing). | Closes documented TOCTOU gap |
66+
| W.3 | **Atomic promote**`writeFileSync``fsync` (or `fdatasync`) on temp handle → `renameSync`; preserve mode bits where practical (`lstat` before write). | Crash-safe promote |
67+
| W.4 | **Mixed line endings** — if file contains both `\r\n` and bare `\n` (excluding trailing file newline edge cases per impl PR), skip file with conflict reason `mixed line endings` — no silent corruption. | EOL safety |
68+
| W.5 | **Envelope extension** — new `ConflictReason` values: `file content changed`, `mixed line endings`; optional per-file `skip_reason` on `ApplyFile` when whole file skipped. | Q5 envelope stability |
69+
| W.6 | **Cross-file rollback stays deferred** — per-file atomicity + pre-write hash guard is v1 scope; full transaction log out of scope. | [architecture § Apply](../architecture.md) |
70+
71+
---
72+
73+
## Implementation steps
74+
75+
1. Extend `sourceCache` to `Map<string, { source: string; contentHash: string }>` in `apply-engine.ts`.
76+
2. Add `detectMixedLineEndings(source: string): boolean` helper + unit tests (CRLF-only, LF-only, mixed).
77+
3. Phase-1: after read, run mixed-EOL check; fail rows for that file early.
78+
4. Phase-2 loop: before transforms, `readFileSync` + `hashContent`; compare to cached hash → conflict or proceed.
79+
5. Replace bare `writeFileSync` with open/write/fsync/close or `writeFileSync` + `fsyncSync(fd)` on temp path; then `renameSync`.
80+
6. Update `apply-engine.test.ts` + `cmd-apply.test.ts` — simulate concurrent edit between phases (mock or temp file rewrite).
81+
7. Document in `architecture.md` § Apply — retire TOCTOU bullet; list new conflict reasons in `glossary.md` § `codemap apply`.
82+
8. MCP `apply` / `apply_rows` / `apply_diff_input` inherit via shared engine (no transport changes).
83+
84+
---
85+
86+
### Verification
87+
88+
```bash
89+
bun test src/application/apply-engine.test.ts src/cli/cmd-apply.test.ts
90+
bun src/index.ts apply <recipe> --dry-run
91+
# rewrite file on disk between dry-run and --yes → expect file content changed conflict
92+
```
93+
94+
---
95+
96+
## Acceptance
97+
98+
- [ ] Edit file on disk after dry-run passes but before `--yes` apply → `file content changed`, zero files modified
99+
- [ ] Mixed-EOL fixture file → `mixed line endings`, no write
100+
- [ ] Happy path unchanged: valid apply still returns `applied: true`
101+
- [ ] `destructiveHint` apply tools document recheck behavior in tool description (synergy with [mcp-tool-annotations](./mcp-tool-annotations.md))
102+
103+
---
104+
105+
## Open decisions (impl PR)
106+
107+
| # | Question |
108+
| --- | --------------------------------------------------------------------------------------------------------------- |
109+
| Q1 | Re-read full file vs hash-only compare (hash-only cheaper; re-read safer if hash collision concern irrelevant). |
110+
| Q2 | BOM preservation — strip/re-emit UTF-8 BOM on round-trip? |
111+
| Q3 | Per-file skip vs whole-run abort when one file fails hash recheck (default: whole-run abort per W.2). |
112+
113+
---
114+
115+
## Dependencies
116+
117+
- Shipped: `apply-engine.ts`, `apply-run.ts`, `hashContent`, apply confirmation gates (`--yes`)
118+
- Synergy: [evidence-chains-on-recipe-rows](./evidence-chains-on-recipe-rows.md) (agents should dry-run then apply with safety net)

docs/plans/ast-hash-duplication.md

Lines changed: 151 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,151 @@
1+
# AST-hash duplication — plan
2+
3+
> **Status:** open · **Priority:** P2 · **Effort:** M (~2 weeks)
4+
>
5+
> **Motivator:** Agents and maintainers need to find **structurally identical** function bodies across files — same control-flow shape, not merely copy-pasted text with renamed identifiers. Token-level suffix-array engines solve a different problem (literal clones). Codemap exposes duplication as **substrate + recipe**: `symbols.body_hash` at parse time + bundled `duplicates` recipe (`GROUP BY body_hash HAVING COUNT(*) > 1`). No severity primitive, no suppression-by-default.
6+
>
7+
> **Roadmap:** [§ Core substrate & platform](../roadmap.md#core-substrate--platform)
8+
9+
---
10+
11+
## Agent start here
12+
13+
Ship **`body_hash` column + migration + one parse fixture** before the `duplicates` recipe. Add a **new extractor** (or extend `symbolsExtractor` pop path) in the **same oxc visitor pass** ([substrate-extraction R.1](./substrate-extraction.md#pre-locked-decisions)). Hash only **function-shaped** symbols in slice 1.
14+
15+
### Key touchpoints
16+
17+
| File | What to read |
18+
| -------------------------------------------------------------------- | ------------------------------------------------------------------------------------- |
19+
| [`src/extractors/symbols.ts`](../../src/extractors/symbols.ts) | Function/method enter + `functionShapeColumns`; where `line_start`/`line_end` are set |
20+
| [`src/extractors/complexity.ts`](../../src/extractors/complexity.ts) | Pattern for per-symbol body-scoped visitor state |
21+
| [`src/parser.ts`](../../src/parser.ts) | `EXTRACTORS` registration order; single-pass walk |
22+
| [`src/db.ts`](../../src/db.ts) | `SymbolRow` (~L886), `insertSymbols`, `SCHEMA_VERSION` migration pattern |
23+
| [`src/parser.ts`](../../src/parser.ts) | `EXTRACTORS` array — register new extractor after `complexityExtractor` |
24+
| [`src/extractors/types.ts`](../../src/extractors/types.ts) | `TierExtractor` contract for `bodyHashExtractor` |
25+
| [`src/hash.ts`](../../src/hash.ts) | `hashContent` (SHA-256) for canonical body serialization |
26+
| [`templates/recipes/`](../../templates/recipes/) | Recipe `.sql` + `.md` pair (e.g. `fan-in`) |
27+
| [`docs/golden-queries.md`](../golden-queries.md) | Register golden scenario for `duplicates` recipe |
28+
29+
### Architecture
30+
31+
```text
32+
oxc visitor (existing symbol walk)
33+
→ on function-shaped symbol exit: serialize normalized body AST → hashContent → body_hash
34+
→ symbol row persisted in symbols.body_hash (nullable for non-function kinds)
35+
recipe duplicates
36+
→ SQL GROUP BY body_hash HAVING COUNT(*) > 1
37+
→ rows: hash group + member symbols (file_path, name, line_start)
38+
→ query / MCP / HTTP (Moat A — no new verb)
39+
```
40+
41+
**Not** suffix-array / LCP semantic clones — different problem class (literal copy-paste); stay deferred unless `body_hash` proves insufficient.
42+
43+
### Tracer bullet (slice 1)
44+
45+
1. `body_hash` on `FunctionDeclaration` bodies only; two fixtures with isomorphic bodies → same hash, different names. 2. `SCHEMA_VERSION` bump. 3. `duplicates.sql` returns the pair. Expand to arrows/methods in slice 2.
46+
47+
### Out of scope (v1)
48+
49+
Suffix-array semantic duplication engine; verdict / severity on duplicate groups; default suppressions; hashing type/interface bodies; comment-aware hashing.
50+
51+
---
52+
53+
## Pre-locked decisions
54+
55+
| # | Decision | Source |
56+
| --- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- |
57+
| B.1 | **New column** `symbols.body_hash TEXT` — nullable; populated for function-shaped symbols only in v1. | [Moat B](../roadmap.md#moats-load-bearing) |
58+
| B.2 | **Single-pass extraction** — compute hash in the existing oxc visitor; no second AST walk. | [substrate-extraction R.1](./substrate-extraction.md#pre-locked-decisions) |
59+
| B.3 | **Structural, not textual** — hash canonical serialization of the function **body** subtree (not raw `source.slice`), so whitespace-normalized identical logic matches. | Roadmap differentiation vs suffix-array dupes |
60+
| B.4 | **Moat-A exposure** — bundled recipe id `duplicates` (SQL join on `body_hash`); consumer applies `LIMIT` / directory filters. | [Moat A](../roadmap.md#moats-load-bearing) |
61+
| B.5 | **SHA-256 hex** — reuse `hashContent` on the canonical body string (same convention as `files.content_hash`). | [`src/hash.ts`](../../src/hash.ts) |
62+
| B.6 | **No verdict primitive** — recipe returns rows; no `pass`/`fail` on duplicate count. | Moat A |
63+
64+
---
65+
66+
## Normalization sketch (v1 default — confirm in impl PR)
67+
68+
Canonical string built from a depth-first walk of the body AST:
69+
70+
- Node `type` + ordered child slots
71+
- **Identifier tokens → placeholder** `$id` (rename-insensitive structural match)
72+
- **Literal values → kind** (`string`, `number`, …) not value (so `"a"` vs `"b"` still match structure-only mode — document false-positive class)
73+
- Skip `loc` / comment attachment
74+
- Exclude `doc_comment` on the symbol row (comments not in body_hash)
75+
76+
Document the exact rules in `architecture.md` when landed so agents can predict matches.
77+
78+
---
79+
80+
## Recipe SQL sketch
81+
82+
```sql
83+
-- illustrative; final SQL in templates/recipes/duplicates.sql
84+
SELECT body_hash,
85+
COUNT(*) AS duplicate_count,
86+
GROUP_CONCAT(file_path || ':' || name, ', ') AS members
87+
FROM symbols
88+
WHERE body_hash IS NOT NULL
89+
GROUP BY body_hash
90+
HAVING COUNT(*) > 1
91+
ORDER BY duplicate_count DESC;
92+
```
93+
94+
v1 may emit one row per group or one row per symbol with `duplicate_group_size` — pick in impl PR (golden-query ergonomics).
95+
96+
---
97+
98+
## Implementation steps
99+
100+
1. **`body-hash.ts` extractor** (or module) — `canonicalizeBody(node): string` + `hashContent`.
101+
2. Wire on function exit in `symbols.ts` (or dedicated `bodyHashExtractor` registered after symbols).
102+
3. Extend `SymbolRow` type + `insertSymbols` + migration in `db.ts`.
103+
4. **`templates/recipes/duplicates.sql` + `.md`** — params: optional `min_count`, `path_prefix`.
104+
5. Golden fixture: two files, same structure different param names → one duplicate group.
105+
6. Negative fixture: same name different bodies → different hashes.
106+
7. Docs — `architecture.md` `symbols.body_hash`; `glossary.md` disambiguate vs suffix-array dupes.
107+
108+
---
109+
110+
### Verification
111+
112+
```bash
113+
bun test src/extractors/*.test.ts # add body-hash fixtures
114+
bun test src/parser.test.ts # if parse integration tests exist for fixtures
115+
bun src/index.ts --files <fixture> # reindex duplicate fixture
116+
bun src/index.ts query --recipe duplicates --json
117+
bun run typecheck # SymbolRow + insertSymbols column touch db.ts types
118+
```
119+
120+
Register golden scenario per [`docs/golden-queries.md`](../golden-queries.md); guard via `scripts/query-golden-coverage-matrix.test.mjs`.
121+
122+
---
123+
124+
## Acceptance
125+
126+
- [ ] Two isomorphic function bodies (renamed locals) share `body_hash`
127+
- [ ] Different control flow → different `body_hash`
128+
- [ ] `codemap query --recipe duplicates --json` returns groups with `COUNT > 1`
129+
- [ ] Non-function symbols have `body_hash IS NULL`
130+
- [ ] Incremental reindex updates hash for changed files
131+
- [ ] No new pass/fail CLI verb
132+
133+
---
134+
135+
## Open decisions (impl PR)
136+
137+
| # | Question |
138+
| --- | ------------------------------------------------------------------------------------------------ |
139+
| Q1 | v1 kinds: `FunctionDeclaration` only, or include arrows / methods / class methods in slice 1? |
140+
| Q2 | Identifier normalization: all → `$id`, or preserve exported param names for stricter matching? |
141+
| Q3 | Recipe row shape: one row per duplicate **group** vs one row per **symbol** with group metadata? |
142+
| Q4 | Minimum body size gate (skip `() => x` one-liners) — default off or `min_body_lines` param? |
143+
| Q5 | Index on `symbols(body_hash)` for recipe perf — add in v1 or measure first? |
144+
145+
---
146+
147+
## Dependencies
148+
149+
- Shipped: `symbols` extraction, `hashContent`, recipe loader
150+
- Independent of [churn-complexity-hotspots](./churn-complexity-hotspots.md), [cognitive-complexity](./cognitive-complexity.md)
151+
- Supersedes motivation for suffix-array semantic dupes (stay deferred)

0 commit comments

Comments
 (0)