|
| 1 | +# AST-hash duplication — plan |
| 2 | + |
| 3 | +> **Status:** open · **Priority:** P2 · **Effort:** M (~2 weeks) |
| 4 | +> |
| 5 | +> **Motivator:** Agents and maintainers need to find **structurally identical** function bodies across files — same control-flow shape, not merely copy-pasted text with renamed identifiers. Token-level suffix-array engines solve a different problem (literal clones). Codemap exposes duplication as **substrate + recipe**: `symbols.body_hash` at parse time + bundled `duplicates` recipe (`GROUP BY body_hash HAVING COUNT(*) > 1`). No severity primitive, no suppression-by-default. |
| 6 | +> |
| 7 | +> **Roadmap:** [§ Core substrate & platform](../roadmap.md#core-substrate--platform) |
| 8 | +
|
| 9 | +--- |
| 10 | + |
| 11 | +## Agent start here |
| 12 | + |
| 13 | +Ship **`body_hash` column + migration + one parse fixture** before the `duplicates` recipe. Add a **new extractor** (or extend `symbolsExtractor` pop path) in the **same oxc visitor pass** ([substrate-extraction R.1](./substrate-extraction.md#pre-locked-decisions)). Hash only **function-shaped** symbols in slice 1. |
| 14 | + |
| 15 | +### Key touchpoints |
| 16 | + |
| 17 | +| File | What to read | |
| 18 | +| -------------------------------------------------------------------- | ------------------------------------------------------------------------------------- | |
| 19 | +| [`src/extractors/symbols.ts`](../../src/extractors/symbols.ts) | Function/method enter + `functionShapeColumns`; where `line_start`/`line_end` are set | |
| 20 | +| [`src/extractors/complexity.ts`](../../src/extractors/complexity.ts) | Pattern for per-symbol body-scoped visitor state | |
| 21 | +| [`src/parser.ts`](../../src/parser.ts) | `EXTRACTORS` registration order; single-pass walk | |
| 22 | +| [`src/db.ts`](../../src/db.ts) | `SymbolRow` (~L886), `insertSymbols`, `SCHEMA_VERSION` migration pattern | |
| 23 | +| [`src/parser.ts`](../../src/parser.ts) | `EXTRACTORS` array — register new extractor after `complexityExtractor` | |
| 24 | +| [`src/extractors/types.ts`](../../src/extractors/types.ts) | `TierExtractor` contract for `bodyHashExtractor` | |
| 25 | +| [`src/hash.ts`](../../src/hash.ts) | `hashContent` (SHA-256) for canonical body serialization | |
| 26 | +| [`templates/recipes/`](../../templates/recipes/) | Recipe `.sql` + `.md` pair (e.g. `fan-in`) | |
| 27 | +| [`docs/golden-queries.md`](../golden-queries.md) | Register golden scenario for `duplicates` recipe | |
| 28 | + |
| 29 | +### Architecture |
| 30 | + |
| 31 | +```text |
| 32 | +oxc visitor (existing symbol walk) |
| 33 | + → on function-shaped symbol exit: serialize normalized body AST → hashContent → body_hash |
| 34 | + → symbol row persisted in symbols.body_hash (nullable for non-function kinds) |
| 35 | +recipe duplicates |
| 36 | + → SQL GROUP BY body_hash HAVING COUNT(*) > 1 |
| 37 | + → rows: hash group + member symbols (file_path, name, line_start) |
| 38 | + → query / MCP / HTTP (Moat A — no new verb) |
| 39 | +``` |
| 40 | + |
| 41 | +**Not** suffix-array / LCP semantic clones — different problem class (literal copy-paste); stay deferred unless `body_hash` proves insufficient. |
| 42 | + |
| 43 | +### Tracer bullet (slice 1) |
| 44 | + |
| 45 | +1. `body_hash` on `FunctionDeclaration` bodies only; two fixtures with isomorphic bodies → same hash, different names. 2. `SCHEMA_VERSION` bump. 3. `duplicates.sql` returns the pair. Expand to arrows/methods in slice 2. |
| 46 | + |
| 47 | +### Out of scope (v1) |
| 48 | + |
| 49 | +Suffix-array semantic duplication engine; verdict / severity on duplicate groups; default suppressions; hashing type/interface bodies; comment-aware hashing. |
| 50 | + |
| 51 | +--- |
| 52 | + |
| 53 | +## Pre-locked decisions |
| 54 | + |
| 55 | +| # | Decision | Source | |
| 56 | +| --- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- | |
| 57 | +| B.1 | **New column** `symbols.body_hash TEXT` — nullable; populated for function-shaped symbols only in v1. | [Moat B](../roadmap.md#moats-load-bearing) | |
| 58 | +| B.2 | **Single-pass extraction** — compute hash in the existing oxc visitor; no second AST walk. | [substrate-extraction R.1](./substrate-extraction.md#pre-locked-decisions) | |
| 59 | +| B.3 | **Structural, not textual** — hash canonical serialization of the function **body** subtree (not raw `source.slice`), so whitespace-normalized identical logic matches. | Roadmap differentiation vs suffix-array dupes | |
| 60 | +| B.4 | **Moat-A exposure** — bundled recipe id `duplicates` (SQL join on `body_hash`); consumer applies `LIMIT` / directory filters. | [Moat A](../roadmap.md#moats-load-bearing) | |
| 61 | +| B.5 | **SHA-256 hex** — reuse `hashContent` on the canonical body string (same convention as `files.content_hash`). | [`src/hash.ts`](../../src/hash.ts) | |
| 62 | +| B.6 | **No verdict primitive** — recipe returns rows; no `pass`/`fail` on duplicate count. | Moat A | |
| 63 | + |
| 64 | +--- |
| 65 | + |
| 66 | +## Normalization sketch (v1 default — confirm in impl PR) |
| 67 | + |
| 68 | +Canonical string built from a depth-first walk of the body AST: |
| 69 | + |
| 70 | +- Node `type` + ordered child slots |
| 71 | +- **Identifier tokens → placeholder** `$id` (rename-insensitive structural match) |
| 72 | +- **Literal values → kind** (`string`, `number`, …) not value (so `"a"` vs `"b"` still match structure-only mode — document false-positive class) |
| 73 | +- Skip `loc` / comment attachment |
| 74 | +- Exclude `doc_comment` on the symbol row (comments not in body_hash) |
| 75 | + |
| 76 | +Document the exact rules in `architecture.md` when landed so agents can predict matches. |
| 77 | + |
| 78 | +--- |
| 79 | + |
| 80 | +## Recipe SQL sketch |
| 81 | + |
| 82 | +```sql |
| 83 | +-- illustrative; final SQL in templates/recipes/duplicates.sql |
| 84 | +SELECT body_hash, |
| 85 | + COUNT(*) AS duplicate_count, |
| 86 | + GROUP_CONCAT(file_path || ':' || name, ', ') AS members |
| 87 | +FROM symbols |
| 88 | +WHERE body_hash IS NOT NULL |
| 89 | +GROUP BY body_hash |
| 90 | +HAVING COUNT(*) > 1 |
| 91 | +ORDER BY duplicate_count DESC; |
| 92 | +``` |
| 93 | + |
| 94 | +v1 may emit one row per group or one row per symbol with `duplicate_group_size` — pick in impl PR (golden-query ergonomics). |
| 95 | + |
| 96 | +--- |
| 97 | + |
| 98 | +## Implementation steps |
| 99 | + |
| 100 | +1. **`body-hash.ts` extractor** (or module) — `canonicalizeBody(node): string` + `hashContent`. |
| 101 | +2. Wire on function exit in `symbols.ts` (or dedicated `bodyHashExtractor` registered after symbols). |
| 102 | +3. Extend `SymbolRow` type + `insertSymbols` + migration in `db.ts`. |
| 103 | +4. **`templates/recipes/duplicates.sql` + `.md`** — params: optional `min_count`, `path_prefix`. |
| 104 | +5. Golden fixture: two files, same structure different param names → one duplicate group. |
| 105 | +6. Negative fixture: same name different bodies → different hashes. |
| 106 | +7. Docs — `architecture.md` `symbols.body_hash`; `glossary.md` disambiguate vs suffix-array dupes. |
| 107 | + |
| 108 | +--- |
| 109 | + |
| 110 | +### Verification |
| 111 | + |
| 112 | +```bash |
| 113 | +bun test src/extractors/*.test.ts # add body-hash fixtures |
| 114 | +bun test src/parser.test.ts # if parse integration tests exist for fixtures |
| 115 | +bun src/index.ts --files <fixture> # reindex duplicate fixture |
| 116 | +bun src/index.ts query --recipe duplicates --json |
| 117 | +bun run typecheck # SymbolRow + insertSymbols column touch db.ts types |
| 118 | +``` |
| 119 | + |
| 120 | +Register golden scenario per [`docs/golden-queries.md`](../golden-queries.md); guard via `scripts/query-golden-coverage-matrix.test.mjs`. |
| 121 | + |
| 122 | +--- |
| 123 | + |
| 124 | +## Acceptance |
| 125 | + |
| 126 | +- [ ] Two isomorphic function bodies (renamed locals) share `body_hash` |
| 127 | +- [ ] Different control flow → different `body_hash` |
| 128 | +- [ ] `codemap query --recipe duplicates --json` returns groups with `COUNT > 1` |
| 129 | +- [ ] Non-function symbols have `body_hash IS NULL` |
| 130 | +- [ ] Incremental reindex updates hash for changed files |
| 131 | +- [ ] No new pass/fail CLI verb |
| 132 | + |
| 133 | +--- |
| 134 | + |
| 135 | +## Open decisions (impl PR) |
| 136 | + |
| 137 | +| # | Question | |
| 138 | +| --- | ------------------------------------------------------------------------------------------------ | |
| 139 | +| Q1 | v1 kinds: `FunctionDeclaration` only, or include arrows / methods / class methods in slice 1? | |
| 140 | +| Q2 | Identifier normalization: all → `$id`, or preserve exported param names for stricter matching? | |
| 141 | +| Q3 | Recipe row shape: one row per duplicate **group** vs one row per **symbol** with group metadata? | |
| 142 | +| Q4 | Minimum body size gate (skip `() => x` one-liners) — default off or `min_body_lines` param? | |
| 143 | +| Q5 | Index on `symbols(body_hash)` for recipe perf — add in v1 or measure first? | |
| 144 | + |
| 145 | +--- |
| 146 | + |
| 147 | +## Dependencies |
| 148 | + |
| 149 | +- Shipped: `symbols` extraction, `hashContent`, recipe loader |
| 150 | +- Independent of [churn-complexity-hotspots](./churn-complexity-hotspots.md), [cognitive-complexity](./cognitive-complexity.md) |
| 151 | +- Supersedes motivation for suffix-array semantic dupes (stay deferred) |
0 commit comments