stainless-code · SutuSebastian · Jun 10, 2026
diff --git a/.changeset/ast-hash-duplication.md b/.changeset/ast-hash-duplication.md
@@ -0,0 +1,5 @@
+---
+"@stainless-code/codemap": minor
+---
+
+Add structural duplicate detection: a `symbols.body_hash` column computed at index time (SHA-256 of the canonical function-body AST) and a bundled `duplicates` recipe. Only function-shaped symbols are hashed; bodies with `body_line_count < 2` are skipped.
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -301,6 +301,7 @@ All base tables use `STRICT` mode; **`source_fts`** is an FTS5 virtual table (no
 | return_type          | TEXT       | Stringified return type for function-shaped symbols; NULL when unannotated or N/A                                                                                                                                                                                                                                                                                                                                    |
 | is_async             | INTEGER    | 1 for async function-shaped symbols (`function`, `method`, arrow-assigned `function` kind)                                                                                                                                                                                                                                                                                                                           |
 | is_generator         | INTEGER    | 1 for generator function-shaped symbols                                                                                                                                                                                                                                                                                                                                                                              |
+| body_hash            | TEXT       | SHA-256 hex of the canonicalized function **body** AST (identifiers → `$id`, literals → kind only). Populated for function-shaped symbols when `body_line_count >= 2`; NULL otherwise. Powers the `duplicates` recipe; backed by partial index `idx_symbols_body_hash`.                                                                                                                                    |
 
 ### `calls` — Function-scoped call edges, deduped per file (`STRICT`)
 

diff --git a/docs/glossary.md b/docs/glossary.md
@@ -147,6 +147,10 @@ Per-function decision-point count (REAL column on `symbols`). Computed by the pa
 
 SonarSource-inspired cognitive complexity (INTEGER on `symbols`) for the same function-shaped symbols as cyclomatic `complexity`. Penalizes nested control flow; computed in the same parser walk as McCabe. Recipes: `high-cognitive-complexity` (`min_score` default 15, Sonar rule threshold); `high-complexity-untested` includes the column while filtering on cyclomatic `complexity`.
 
+### `symbols.body_hash` / structural duplicate bodies
+
+SHA-256 hex of a canonicalized function **body** AST (not raw source). Normalization (v1): every identifier → `$id`; literals → kind only (`Literal:string`, …); template literals are walked structurally. Populated for function-shaped symbols (`function`, `method`, `getter`, `setter`) when `body_line_count >= 2`; NULL for trivial one-liners and non-functions. Recipe **`duplicates`** groups rows sharing a hash. Distinct from token-level suffix-array / copy-paste clone detectors: it catches rename-insensitive structural twins, but may false-positive on unrelated functions that share a control-flow skeleton (triage with `snippet`).
+
 ### `source_fts` (FTS5 virtual table) / `--with-fts` / opt-in full-text
 
 Opt-in FTS5 virtual table over file content (`tokenize='porter unicode61'`). Always created (near-zero space when empty); populated only when the resolved config has FTS5 enabled (`.codemap/config.ts` `fts5: true` OR `--with-fts` CLI flag at index time; CLI wins, logs stderr override). Demonstrates the FTS5 ⨯ `symbols` ⨯ `coverage` JOIN composability that ripgrep can't match — bundled recipe `text-in-deprecated-functions` exemplifies the JOIN. Toggle change auto-detects via `meta.fts5_enabled` and forces a full rebuild so `source_fts` is consistently populated. Stderr telemetry `[fts5] source_fts populated: <N> files / <X> KB` on first populate. Distinct from `coverage` — `source_fts` is an FTS5 **virtual** table; `coverage` is a regular `STRICT, WITHOUT ROWID` table. Default OFF preserves `.codemap/index.db` size for non-users (~30–50% growth on text-heavy projects).
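
For the `symbols.body_hash` glossary entry added in this hunk, the v1 normalization can be illustrated with a minimal sketch. This is not the indexer's code: the `AstNode` shape, the `Type(child,child)` serialization, and the helper names `canonicalize`, `literalKind`, and `bodyHash` are all hypothetical, and template literals get no special handling here. Only the `$id` placeholder, the `Literal:<kind>` form, SHA-256 hex, and the `body_line_count >= 2` gate come from the text above.

```typescript
import { createHash } from "node:crypto";

// Hypothetical, simplified AST node (an assumption; not the parser's real types).
type AstNode = {
  type: string;
  name?: string; // set on Identifier nodes
  value?: unknown; // set on Literal nodes
  children?: AstNode[];
};

// Literal kind only, never the literal's value.
function literalKind(value: unknown): string {
  if (value === null) return "null";
  return typeof value; // "string", "number", "boolean", ...
}

// Canonical form: identifiers collapse to `$id`, literals to `Literal:<kind>`,
// everything else keeps its node type and child order.
function canonicalize(node: AstNode): string {
  if (node.type === "Identifier") return "$id";
  if (node.type === "Literal") return `Literal:${literalKind(node.value)}`;
  const kids = (node.children ?? []).map(canonicalize).join(",");
  return `${node.type}(${kids})`;
}

// SHA-256 hex of the canonical body; null for bodies shorter than two lines.
function bodyHash(body: AstNode, bodyLineCount: number): string | null {
  if (bodyLineCount < 2) return null;
  return createHash("sha256").update(canonicalize(body)).digest("hex");
}
```

Under these assumptions, two bodies that differ only in variable names or literal values hash identically, while swapping a number literal for a string literal changes the hash. The same property explains the false positives the entry warns about: unrelated functions with the same statement skeleton canonicalize to the same string.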

diff --git a/docs/golden-queries.md b/docs/golden-queries.md
@@ -78,6 +78,10 @@ Some bundled recipes add optional **`reason`** (TEXT) and **`evidence_json`** (T
 
 `coverage-confirmed-dead` adds **`confidence`** (`high` \| `medium`) on each row — **`high`** when static dead and ingested `coverage_pct = 0`; **`medium`** when static dead but the symbol has no ingested coverage row. Also **`reason`**, **`caller_count`**. Goldens: `coverage-confirmed-dead` (post-ingest mix) and `coverage-confirmed-dead-no-ingest` (`preSetup: clear-coverage`, `everyRowFieldEquals` on `confidence: medium`).
 
+### Duplication columns (`duplicates` recipe)
+
+`duplicates` returns one row per function-shaped symbol in a **`body_hash`** collision group: **`name`**, **`kind`**, **`file_path`**, **`line_start`**, **`line_end`**, **`body_hash`**, **`body_line_count`**, **`duplicate_count`** (group size). Substrate column **`symbols.body_hash`** is populated at index time for named functions, arrows, and class methods when `body_line_count >= 2`. Goldens: `duplicates` (includes the `src/bench/duplicate-body-{a,b}.ts` pair). False positives are possible when unrelated functions share a control-flow skeleton; triage with `snippet`.
+
 ---
 
 ## Status
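
The row shape documented for the `duplicates` recipe in this hunk can be mimicked in memory. The bundled recipe is SQL, so the sketch below is only an equivalent of its "group by hash, keep groups larger than one, emit one row per symbol" behaviour as described above. The `SymbolRow` and `DuplicateRow` types and the `findDuplicates` helper are hypothetical, and the output order (input order) is an assumption.

```typescript
// Column names follow the documented recipe output; the types are assumptions.
type SymbolRow = {
  name: string;
  kind: string;
  file_path: string;
  line_start: number;
  line_end: number;
  body_hash: string | null; // NULL for one-liners and non-functions
  body_line_count: number;
};

type DuplicateRow = SymbolRow & { body_hash: string; duplicate_count: number };

function findDuplicates(symbols: SymbolRow[]): DuplicateRow[] {
  // Pass 1: count symbols per non-null hash (the GROUP BY step).
  const counts = new Map<string, number>();
  for (const s of symbols) {
    if (s.body_hash === null) continue;
    counts.set(s.body_hash, (counts.get(s.body_hash) ?? 0) + 1);
  }
  // Pass 2: emit one row per symbol whose group has more than one member.
  const rows: DuplicateRow[] = [];
  for (const s of symbols) {
    if (s.body_hash === null) continue;
    const n = counts.get(s.body_hash) ?? 0;
    if (n > 1) rows.push({ ...s, body_hash: s.body_hash, duplicate_count: n });
  }
  return rows;
}
```

A consumer would still decide what counts as a problem, for example by filtering on `duplicate_count` or `body_line_count` before reviewing each group.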

diff --git a/docs/plans/ast-hash-duplication.md b/docs/plans/ast-hash-duplication.md
diff --git a/docs/roadmap.md b/docs/roadmap.md
@@ -112,7 +112,7 @@ Predicate-as-API only — enrich row shape and audit deltas; no standalone pass/
 - [ ] **`codemap audit` verdict + thresholds** (v1.x) — `verdict: "pass" | "warn" | "fail"` driven by an `audit.deltas[<key>].{added_max, action}` field on the config object (`.codemap/config.{ts,js,json}`). Triggers: two consumers ship `jq`-based threshold scripts with similar shapes, OR one consumer asks with a concrete config sketch. Until then, raw deltas + consumer-side `jq` is the CI exit-code idiom. **Likely accelerant:** the Marketplace Action (next item) shipping is the most plausible path to firing the trigger — once `- uses: stainless-code/codemap@v1` is the dominant CI path, real `jq` threshold scripts will surface.
 - [ ] **GitHub Marketplace Action — publish + listing finish** — core Action implementation is in-tree: root `action.yml`, `query --ci`, `audit --format sarif` / `--ci`, package-manager detection, dogfood smoke, and opt-in `pr-comment` summary renderer have shipped. Remaining work is the release/listing slice: `MARKETPLACE.md`, `v1.0.0` / floating `v1` tags, Marketplace setup, sacrificial-repo smoke, and making `action-smoke` blocking once the Action tag exists. Action version stream is independent of CLI version (`package.json` currently drives CLI/npm version; Action publishes at its own `v1.0.0`). Plan: [`plans/github-marketplace-action.md`](./plans/github-marketplace-action.md). Effort: S.
 - [ ] **Churn × complexity hotspots** — `file_churn` table (git `log --numstat` over indexed paths, recency-weighted commits, optional trend) + bundled recipe **`churn-complexity-hotspots`** JOINing `symbols.complexity` for ranked refactor targets. Distinct from outcome alias `hotspots` → `fan-in`. Score is a recipe column, not a verdict ([Moat A](./roadmap.md#moats-load-bearing)). Plan: [`plans/churn-complexity-hotspots.md`](./plans/churn-complexity-hotspots.md). Effort: L–M.
-- [ ] **AST-hash duplication** — `symbols.body_hash` column (normalized AST hash via oxc, computed at parse time — Rust-native, fast) + bundled `duplicates` recipe joining on `body_hash` (`GROUP BY body_hash HAVING COUNT(*) > 1`). **Different shape from token-level suffix-array dupes** (catches structurally-identical functions, not copy-paste with renamed variables). Substrate addition — consumer writes the JOIN that decides "this is a problem"; no severity, no suppression-by-default. Plan: [`plans/ast-hash-duplication.md`](./plans/ast-hash-duplication.md). Effort: M.
+- [x] **AST-hash duplication** — `symbols.body_hash` (canonical body AST, identifiers → `$id`, literals → kind; function-shaped symbols; skip `body_line_count < 2`) + partial index + bundled `duplicates` recipe (per-symbol rows, CTE `GROUP BY`). **Different shape from token-level suffix-array dupes.** Contract: [architecture § `symbols` table](./architecture.md#symbols--functions-constants-classes-interfaces-types-enums-strict), [glossary § body_hash](./glossary.md#symbolsbody_hash--structural-duplicate-bodies). Effort: M.
 - [ ] **Falsifiable benchmark CI on named external fixtures** — structural-cost A/B (indexed queries vs `find` + `grep` + `Read`-loop discovery) on zod, fastify, vue-core, next.js. Numbers land in [`docs/benchmark.md`](./benchmark.md); headline figures surface in `MARKETPLACE.md` only after external runs land. Harness: [benchmark § Agent eval harness](./benchmark.md#agent-eval-harness) + external fixture extension; pair with **Agent eval: quality × tokens × wall** for scored completion metrics. **Partial:** manual [`.github/workflows/agent-eval-external.yml`](../.github/workflows/agent-eval-external.yml) for in-repo fixture paths (not zod/fastify/nightly). Effort: M. **Self-index regression guardrail shipped** (#96): `bun run check:perf-baseline` + weekly scheduled workflow (demoted from PR hard gate — GHA runner variance).
 - [ ] **In-repo test bench scale (optional)** — if `fixtures/minimal` outgrows one corpus: add committed `fixtures/bench/` or rename `minimal`→`bench`. Harness map: [`testing-coverage.md`](./testing-coverage.md), [`fixtures/README.md`](../fixtures/README.md).
 

diff --git a/fixtures/CAPABILITIES.json b/fixtures/CAPABILITIES.json
@@ -175,6 +175,15 @@
       ],
       "setup": ["ingest-coverage"]
     },
+    {
+      "id": "duplication.body-hash",
+      "description": "symbols.body_hash structural fingerprint and duplicates recipe",
+      "fixtureFiles": [
+        "src/bench/duplicate-body-a.ts",
+        "src/bench/duplicate-body-b.ts"
+      ],
+      "goldenScenarios": ["duplicates"]
+    },
     {
       "id": "boundaries.suppressions",
       "description": "boundary_rules, suppressions, config-driven violations",