Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions .changeset/ast-hash-duplication.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
---
"@stainless-code/codemap": minor
---

Add structural duplicate detection: `symbols.body_hash` at index time (canonical function body AST) and bundled `duplicates` recipe. Function-shaped symbols only; trivial one-line bodies skipped. Triage collisions with `snippet` — shared control-flow skeletons can false-positive.
1 change: 1 addition & 0 deletions docs/architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -301,6 +301,7 @@ All base tables use `STRICT` mode; **`source_fts`** is an FTS5 virtual table (no
| return_type | TEXT | Stringified return type for function-shaped symbols; NULL when unannotated or N/A |
| is_async | INTEGER | 1 for async function-shaped symbols (`function`, `method`, arrow-assigned `function` kind) |
| is_generator | INTEGER | 1 for generator function-shaped symbols |
| body_hash | TEXT | SHA-256 hex of canonicalized function **body** AST (identifiers → `$id`, literals → kind only, absent returns → `Literal:nullish`). Populated for function-shaped symbols when `body_line_count >= 2`; NULL otherwise. Powers `duplicates` recipe. Partial index `idx_symbols_body_hash` |

### `calls` — Function-scoped call edges, deduped per file (`STRICT`)

Expand Down
4 changes: 4 additions & 0 deletions docs/glossary.md
Original file line number Diff line number Diff line change
Expand Up @@ -147,6 +147,10 @@ Per-function decision-point count (REAL column on `symbols`). Computed by the pa

SonarSource-inspired cognitive complexity (INTEGER on `symbols`) for the same function-shaped symbols as cyclomatic `complexity`. Penalizes nested control flow; computed in the same parser walk as McCabe. Recipes: `high-cognitive-complexity` (`min_score` default 15, Sonar rule threshold); `high-complexity-untested` includes the column while filtering on cyclomatic `complexity`.

### `symbols.body_hash` / structural duplicate bodies

SHA-256 hex of a canonicalized function **body** AST (not raw source). Normalization (v1): every identifier → `$id`; literals → kind only (`Literal:string`, …); absent returns (`null`, `undefined`, `void 0`, bare `return`) → `Literal:nullish`; template literals walked structurally. Populated for function-shaped symbols (`function`, `method`, `getter`, `setter`) when `body_line_count >= 2`; NULL for trivial one-liners and non-functions. Recipe **`duplicates`** groups rows sharing a hash. Distinct from token-level suffix-array / copy-paste clone detectors — catches rename-insensitive structural twins; may false-positive on shared control-flow skeletons (triage with `snippet`).

### `source_fts` (FTS5 virtual table) / `--with-fts` / opt-in full-text

Opt-in FTS5 virtual table over file content (`tokenize='porter unicode61'`). Always created (near-zero space when empty); populated only when the resolved config has FTS5 enabled (`.codemap/config.ts` `fts5: true` OR `--with-fts` CLI flag at index time; CLI wins, logs stderr override). Demonstrates the FTS5 ⨯ `symbols` ⨯ `coverage` JOIN composability that ripgrep can't match — bundled recipe `text-in-deprecated-functions` exemplifies the JOIN. Toggle change auto-detects via `meta.fts5_enabled` and forces a full rebuild so `source_fts` is consistently populated. Stderr telemetry `[fts5] source_fts populated: <N> files / <X> KB` on first populate. Distinct from `coverage` — `source_fts` is an FTS5 **virtual** table; `coverage` is a regular `STRICT, WITHOUT ROWID` table. Default OFF preserves `.codemap/index.db` size for non-users (~30–50% growth on text-heavy projects).
Expand Down
4 changes: 4 additions & 0 deletions docs/golden-queries.md
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,10 @@ Some bundled recipes add optional **`reason`** (TEXT) and **`evidence_json`** (T

`coverage-confirmed-dead` adds **`confidence`** (`high` \| `medium`) on each row — **`high`** when static dead and ingested `coverage_pct = 0`; **`medium`** when static dead but the symbol has no ingested coverage row. Also **`reason`**, **`caller_count`**. Goldens: `coverage-confirmed-dead` (post-ingest mix) and `coverage-confirmed-dead-no-ingest` (`preSetup: clear-coverage`, `everyRowFieldEquals` on `confidence: medium`).

### Duplication columns (`duplicates` recipe)

`duplicates` returns one row per function-shaped symbol in a **`body_hash`** collision group: **`name`**, **`kind`**, **`file_path`**, **`line_start`**, **`line_end`**, **`body_hash`**, **`body_line_count`**, **`duplicate_count`** (in-scope group size after `path_prefix` / `min_body_lines`). Substrate column **`symbols.body_hash`** is populated at index for function-shaped symbols (`function`, `method`, `getter`, `setter`) when `body_line_count >= 2`. Goldens: `duplicates` (includes `src/bench/duplicate-body-{a,b}.ts` pair). False positives possible when unrelated functions share control-flow skeleton or sync vs async/generator bodies match — triage with `snippet`. Recipe caps at **50 rows** (no truncation marker).

---

## Status
Expand Down
151 changes: 0 additions & 151 deletions docs/plans/ast-hash-duplication.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/roadmap.md
Original file line number Diff line number Diff line change
Expand Up @@ -112,7 +112,7 @@ Predicate-as-API only — enrich row shape and audit deltas; no standalone pass/
- [ ] **`codemap audit` verdict + thresholds** (v1.x) — `verdict: "pass" | "warn" | "fail"` driven by an `audit.deltas[<key>].{added_max, action}` field on the config object (`.codemap/config.{ts,js,json}`). Triggers: two consumers ship `jq`-based threshold scripts with similar shapes, OR one consumer asks with a concrete config sketch. Until then, raw deltas + consumer-side `jq` is the CI exit-code idiom. **Likely accelerant:** the Marketplace Action (next item) shipping is the most plausible path to firing the trigger — once `- uses: stainless-code/codemap@v1` is the dominant CI path, real `jq` threshold scripts will surface.
- [ ] **GitHub Marketplace Action — publish + listing finish** — core Action implementation is in-tree: root `action.yml`, `query --ci`, `audit --format sarif` / `--ci`, package-manager detection, dogfood smoke, and opt-in `pr-comment` summary renderer have shipped. Remaining work is the release/listing slice: `MARKETPLACE.md`, `v1.0.0` / floating `v1` tags, Marketplace setup, sacrificial-repo smoke, and making `action-smoke` blocking once the Action tag exists. Action version stream is independent of CLI version (`package.json` currently drives CLI/npm version; Action publishes at its own `v1.0.0`). Plan: [`plans/github-marketplace-action.md`](./plans/github-marketplace-action.md). Effort: S.
- [ ] **Churn × complexity hotspots** — `file_churn` table (git `log --numstat` over indexed paths, recency-weighted commits, optional trend) + bundled recipe **`churn-complexity-hotspots`** JOINing `symbols.complexity` for ranked refactor targets. Distinct from outcome alias `hotspots` → `fan-in`. Score is a recipe column, not a verdict ([Moat A](./roadmap.md#moats-load-bearing)). Plan: [`plans/churn-complexity-hotspots.md`](./plans/churn-complexity-hotspots.md). Effort: L–M.
- [ ] **AST-hash duplication** — `symbols.body_hash` column (normalized AST hash via oxc, computed at parse time — Rust-native, fast) + bundled `duplicates` recipe joining on `body_hash` (`GROUP BY body_hash HAVING COUNT(*) > 1`). **Different shape from token-level suffix-array dupes** (catches structurally-identical functions, not copy-paste with renamed variables). Substrate addition — consumer writes the JOIN that decides "this is a problem"; no severity, no suppression-by-default. Plan: [`plans/ast-hash-duplication.md`](./plans/ast-hash-duplication.md). Effort: M.
- [x] **AST-hash duplication** — `symbols.body_hash` (canonical body AST, identifiers → `$id`, literals → kind, absent returns → `Literal:nullish`; function-shaped symbols; skip `body_line_count < 2`) + partial index + bundled `duplicates` recipe (per-symbol rows, CTE `GROUP BY`). **Different shape from token-level suffix-array dupes.** Contract: [architecture § `symbols` table](./architecture.md#symbols--functions-constants-classes-interfaces-types-enums-strict), [glossary § body_hash](./glossary.md#symbolsbody_hash--structural-duplicate-bodies). Effort: M.
- [ ] **Falsifiable benchmark CI on named external fixtures** — structural-cost A/B (indexed queries vs `find` + `grep` + `Read`-loop discovery) on zod, fastify, vue-core, next.js. Numbers land in [`docs/benchmark.md`](./benchmark.md); headline figures surface in `MARKETPLACE.md` only after external runs land. Harness: [benchmark § Agent eval harness](./benchmark.md#agent-eval-harness) + external fixture extension; pair with **Agent eval: quality × tokens × wall** for scored completion metrics. **Partial:** manual [`.github/workflows/agent-eval-external.yml`](../.github/workflows/agent-eval-external.yml) for in-repo fixture paths (not zod/fastify/nightly). Effort: M. **Self-index regression guardrail shipped** (#96): `bun run check:perf-baseline` + weekly scheduled workflow (demoted from PR hard gate — GHA runner variance).
- [ ] **In-repo test bench scale (optional)** — if `fixtures/minimal` outgrows one corpus: add committed `fixtures/bench/` or rename `minimal`→`bench`. Harness map: [`testing-coverage.md`](./testing-coverage.md), [`fixtures/README.md`](../fixtures/README.md).

Expand Down
9 changes: 9 additions & 0 deletions fixtures/CAPABILITIES.json
Original file line number Diff line number Diff line change
Expand Up @@ -175,6 +175,15 @@
],
"setup": ["ingest-coverage"]
},
{
"id": "duplication.body-hash",
"description": "symbols.body_hash structural fingerprint and duplicates recipe",
"fixtureFiles": [
"src/bench/duplicate-body-a.ts",
"src/bench/duplicate-body-b.ts"
],
"goldenScenarios": ["duplicates"]
},
{
"id": "boundaries.suppressions",
"description": "boundary_rules, suppressions, config-driven violations",
Expand Down
Loading
Loading