Skip to content

Commit 36106ff

Browse files
feat(duplicates): body_hash structural duplication detection (#178)
* feat(duplicates): body_hash structural dupes + harden pass Add symbols.body_hash (canonical body AST for function-shaped symbols), SCHEMA_VERSION 39, duplicates recipe, golden scenario, and agent rule row. Retire ast-hash-duplication plan into architecture/glossary/golden-queries. * harden: nullish return normalization + O(1) FD body_hash lookup Fix deferred correctness/perf items: return-position Literal:nullish for null/undefined/void 0/bare return; void 0 only (not all void); FD symbol index via markArrowSymbol at push time; docs parity and regression tests. * fix(duplicates): scope filters before grouping + doc parity Apply path_prefix and min_body_lines before duplicate_count aggregation so scoped queries report accurate group sizes. Align golden-queries wording with function-shaped symbol contract (getter/setter included). * harden: script probes, scoped duplicates tests, doc parity Wire duplication.body-hash agent-eval probe; fix spike-crap tier count after fixture symbols; add duplicates-recipe-scope regression tests; setter/nullish-scope unit tests; consumer doc caveats (LIMIT, async/gen).
1 parent bb8bae9 commit 36106ff

35 files changed

Lines changed: 1097 additions & 206 deletions

.changeset/ast-hash-duplication.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,5 @@
1+
---
2+
"@stainless-code/codemap": minor
3+
---
4+
5+
Add structural duplicate detection: `symbols.body_hash` at index time (canonical function body AST) and bundled `duplicates` recipe. Function-shaped symbols only; trivial one-line bodies skipped. Triage collisions with `snippet` — shared control-flow skeletons can false-positive.

docs/architecture.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -301,6 +301,7 @@ All base tables use `STRICT` mode; **`source_fts`** is an FTS5 virtual table (no
301301
| return_type | TEXT | Stringified return type for function-shaped symbols; NULL when unannotated or N/A |
302302
| is_async | INTEGER | 1 for async function-shaped symbols (`function`, `method`, arrow-assigned `function` kind) |
303303
| is_generator | INTEGER | 1 for generator function-shaped symbols |
304+
| body_hash | TEXT | SHA-256 hex of canonicalized function **body** AST (identifiers → `$id`, literals → kind only, absent returns → `Literal:nullish`). Populated for function-shaped symbols when `body_line_count >= 2`; NULL otherwise. Powers `duplicates` recipe. Partial index `idx_symbols_body_hash` |
304305

305306
### `calls` — Function-scoped call edges, deduped per file (`STRICT`)
306307

docs/glossary.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -147,6 +147,10 @@ Per-function decision-point count (REAL column on `symbols`). Computed by the pa
147147

148148
SonarSource-inspired cognitive complexity (INTEGER on `symbols`) for the same function-shaped symbols as cyclomatic `complexity`. Penalizes nested control flow; computed in the same parser walk as McCabe. Recipes: `high-cognitive-complexity` (`min_score` default 15, Sonar rule threshold); `high-complexity-untested` includes the column while filtering on cyclomatic `complexity`.
149149

150+
### `symbols.body_hash` / structural duplicate bodies
151+
152+
SHA-256 hex of a canonicalized function **body** AST (not raw source). Normalization (v1): every identifier → `$id`; literals → kind only (`Literal:string`, …); absent returns (`null`, `undefined`, `void 0`, bare `return`) → `Literal:nullish`; template literals walked structurally. Populated for function-shaped symbols (`function`, `method`, `getter`, `setter`) when `body_line_count >= 2`; NULL for trivial one-liners and non-functions. Recipe **`duplicates`** groups rows sharing a hash. Distinct from token-level suffix-array / copy-paste clone detectors — catches rename-insensitive structural twins; may false-positive on shared control-flow skeletons (triage with `snippet`).
153+
150154
### `source_fts` (FTS5 virtual table) / `--with-fts` / opt-in full-text
151155

152156
Opt-in FTS5 virtual table over file content (`tokenize='porter unicode61'`). Always created (near-zero space when empty); populated only when the resolved config has FTS5 enabled (`.codemap/config.ts` `fts5: true` OR `--with-fts` CLI flag at index time; CLI wins, logs stderr override). Demonstrates the FTS5 ⨯ `symbols``coverage` JOIN composability that ripgrep can't match — bundled recipe `text-in-deprecated-functions` exemplifies the JOIN. Toggle change auto-detects via `meta.fts5_enabled` and forces a full rebuild so `source_fts` is consistently populated. Stderr telemetry `[fts5] source_fts populated: <N> files / <X> KB` on first populate. Distinct from `coverage``source_fts` is an FTS5 **virtual** table; `coverage` is a regular `STRICT, WITHOUT ROWID` table. Default OFF preserves `.codemap/index.db` size for non-users (~30–50% growth on text-heavy projects).

docs/golden-queries.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -78,6 +78,10 @@ Some bundled recipes add optional **`reason`** (TEXT) and **`evidence_json`** (T
7878

7979
`coverage-confirmed-dead` adds **`confidence`** (`high` \| `medium`) on each row — **`high`** when static dead and ingested `coverage_pct = 0`; **`medium`** when static dead but the symbol has no ingested coverage row. Also **`reason`**, **`caller_count`**. Goldens: `coverage-confirmed-dead` (post-ingest mix) and `coverage-confirmed-dead-no-ingest` (`preSetup: clear-coverage`, `everyRowFieldEquals` on `confidence: medium`).
8080

81+
### Duplication columns (`duplicates` recipe)
82+
83+
`duplicates` returns one row per function-shaped symbol in a **`body_hash`** collision group: **`name`**, **`kind`**, **`file_path`**, **`line_start`**, **`line_end`**, **`body_hash`**, **`body_line_count`**, **`duplicate_count`** (in-scope group size after `path_prefix` / `min_body_lines`). Substrate column **`symbols.body_hash`** is populated at index for function-shaped symbols (`function`, `method`, `getter`, `setter`) when `body_line_count >= 2`. Goldens: `duplicates` (includes `src/bench/duplicate-body-{a,b}.ts` pair). False positives possible when unrelated functions share control-flow skeleton or sync vs async/generator bodies match — triage with `snippet`. Recipe caps at **50 rows** (no truncation marker).
84+
8185
---
8286

8387
## Status

docs/plans/ast-hash-duplication.md

Lines changed: 0 additions & 151 deletions
This file was deleted.

docs/roadmap.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -112,7 +112,7 @@ Predicate-as-API only — enrich row shape and audit deltas; no standalone pass/
112112
- [ ] **`codemap audit` verdict + thresholds** (v1.x) — `verdict: "pass" | "warn" | "fail"` driven by an `audit.deltas[<key>].{added_max, action}` field on the config object (`.codemap/config.{ts,js,json}`). Triggers: two consumers ship `jq`-based threshold scripts with similar shapes, OR one consumer asks with a concrete config sketch. Until then, raw deltas + consumer-side `jq` is the CI exit-code idiom. **Likely accelerant:** the Marketplace Action (next item) shipping is the most plausible path to firing the trigger — once `- uses: stainless-code/codemap@v1` is the dominant CI path, real `jq` threshold scripts will surface.
113113
- [ ] **GitHub Marketplace Action — publish + listing finish** — core Action implementation is in-tree: root `action.yml`, `query --ci`, `audit --format sarif` / `--ci`, package-manager detection, dogfood smoke, and opt-in `pr-comment` summary renderer have shipped. Remaining work is the release/listing slice: `MARKETPLACE.md`, `v1.0.0` / floating `v1` tags, Marketplace setup, sacrificial-repo smoke, and making `action-smoke` blocking once the Action tag exists. Action version stream is independent of CLI version (`package.json` currently drives CLI/npm version; Action publishes at its own `v1.0.0`). Plan: [`plans/github-marketplace-action.md`](./plans/github-marketplace-action.md). Effort: S.
114114
- [ ] **Churn × complexity hotspots**`file_churn` table (git `log --numstat` over indexed paths, recency-weighted commits, optional trend) + bundled recipe **`churn-complexity-hotspots`** JOINing `symbols.complexity` for ranked refactor targets. Distinct from outcome alias `hotspots``fan-in`. Score is a recipe column, not a verdict ([Moat A](./roadmap.md#moats-load-bearing)). Plan: [`plans/churn-complexity-hotspots.md`](./plans/churn-complexity-hotspots.md). Effort: L–M.
115-
- [ ] **AST-hash duplication**`symbols.body_hash` column (normalized AST hash via oxc, computed at parse time — Rust-native, fast) + bundled `duplicates` recipe joining on `body_hash` (`GROUP BY body_hash HAVING COUNT(*) > 1`). **Different shape from token-level suffix-array dupes** (catches structurally-identical functions, not copy-paste with renamed variables). Substrate addition — consumer writes the JOIN that decides "this is a problem"; no severity, no suppression-by-default. Plan: [`plans/ast-hash-duplication.md`](./plans/ast-hash-duplication.md). Effort: M.
115+
- [x] **AST-hash duplication**`symbols.body_hash` (canonical body AST, identifiers → `$id`, literals → kind, absent returns → `Literal:nullish`; function-shaped symbols; skip `body_line_count < 2`) + partial index + bundled `duplicates` recipe (per-symbol rows, CTE `GROUP BY`). **Different shape from token-level suffix-array dupes.** Contract: [architecture § `symbols` table](./architecture.md#symbols--functions-constants-classes-interfaces-types-enums-strict), [glossary § body_hash](./glossary.md#symbolsbody_hash--structural-duplicate-bodies). Effort: M.
116116
- [ ] **Falsifiable benchmark CI on named external fixtures** — structural-cost A/B (indexed queries vs `find` + `grep` + `Read`-loop discovery) on zod, fastify, vue-core, next.js. Numbers land in [`docs/benchmark.md`](./benchmark.md); headline figures surface in `MARKETPLACE.md` only after external runs land. Harness: [benchmark § Agent eval harness](./benchmark.md#agent-eval-harness) + external fixture extension; pair with **Agent eval: quality × tokens × wall** for scored completion metrics. **Partial:** manual [`.github/workflows/agent-eval-external.yml`](../.github/workflows/agent-eval-external.yml) for in-repo fixture paths (not zod/fastify/nightly). Effort: M. **Self-index regression guardrail shipped** (#96): `bun run check:perf-baseline` + weekly scheduled workflow (demoted from PR hard gate — GHA runner variance).
117117
- [ ] **In-repo test bench scale (optional)** — if `fixtures/minimal` outgrows one corpus: add committed `fixtures/bench/` or rename `minimal``bench`. Harness map: [`testing-coverage.md`](./testing-coverage.md), [`fixtures/README.md`](../fixtures/README.md).
118118

fixtures/CAPABILITIES.json

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -175,6 +175,15 @@
175175
],
176176
"setup": ["ingest-coverage"]
177177
},
178+
{
179+
"id": "duplication.body-hash",
180+
"description": "symbols.body_hash structural fingerprint and duplicates recipe",
181+
"fixtureFiles": [
182+
"src/bench/duplicate-body-a.ts",
183+
"src/bench/duplicate-body-b.ts"
184+
],
185+
"goldenScenarios": ["duplicates"]
186+
},
178187
{
179188
"id": "boundaries.suppressions",
180189
"description": "boundary_rules, suppressions, config-driven violations",

0 commit comments

Comments
 (0)