Skip to content
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions .changeset/ast-hash-duplication.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
---
"@stainless-code/codemap": minor
---

Add structural duplicate detection: `symbols.body_hash` at index time (canonical function body AST) and bundled `duplicates` recipe. Function-shaped symbols only; trivial one-line bodies skipped.
1 change: 1 addition & 0 deletions docs/architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -301,6 +301,7 @@ All base tables use `STRICT` mode; **`source_fts`** is an FTS5 virtual table (no
| return_type | TEXT | Stringified return type for function-shaped symbols; NULL when unannotated or N/A |
| is_async | INTEGER | 1 for async function-shaped symbols (`function`, `method`, arrow-assigned `function` kind) |
| is_generator | INTEGER | 1 for generator function-shaped symbols |
| body_hash | TEXT | SHA-256 hex of canonicalized function **body** AST (identifiers → `$id`, literals → kind only). Populated for function-shaped symbols when `body_line_count >= 2`; NULL otherwise. Powers `duplicates` recipe. Partial index `idx_symbols_body_hash` |

### `calls` — Function-scoped call edges, deduped per file (`STRICT`)

Expand Down
4 changes: 4 additions & 0 deletions docs/glossary.md
Original file line number Diff line number Diff line change
Expand Up @@ -147,6 +147,10 @@ Per-function decision-point count (REAL column on `symbols`). Computed by the pa

SonarSource-inspired cognitive complexity (INTEGER on `symbols`) for the same function-shaped symbols as cyclomatic `complexity`. Penalizes nested control flow; computed in the same parser walk as McCabe. Recipes: `high-cognitive-complexity` (`min_score` default 15, Sonar rule threshold); `high-complexity-untested` includes the column while filtering on cyclomatic `complexity`.

### `symbols.body_hash` / structural duplicate bodies

SHA-256 hex of a canonicalized function **body** AST (not raw source). Normalization (v1): every identifier → `$id`; literals → kind only (`Literal:string`, …); template literals walked structurally. Populated for function-shaped symbols (`function`, `method`, `getter`, `setter`) when `body_line_count >= 2`; NULL for trivial one-liners and non-functions. Recipe **`duplicates`** groups rows sharing a hash. Distinct from token-level suffix-array / copy-paste clone detectors — catches rename-insensitive structural twins; may false-positive on shared control-flow skeletons (triage with `snippet`).

### `source_fts` (FTS5 virtual table) / `--with-fts` / opt-in full-text

Opt-in FTS5 virtual table over file content (`tokenize='porter unicode61'`). Always created (near-zero space when empty); populated only when the resolved config has FTS5 enabled (`.codemap/config.ts` `fts5: true` OR `--with-fts` CLI flag at index time; CLI wins, logs stderr override). Demonstrates the FTS5 ⨯ `symbols` ⨯ `coverage` JOIN composability that ripgrep can't match — bundled recipe `text-in-deprecated-functions` exemplifies the JOIN. Toggle change auto-detects via `meta.fts5_enabled` and forces a full rebuild so `source_fts` is consistently populated. Stderr telemetry `[fts5] source_fts populated: <N> files / <X> KB` on first populate. Distinct from `coverage` — `source_fts` is an FTS5 **virtual** table; `coverage` is a regular `STRICT, WITHOUT ROWID` table. Default OFF preserves `.codemap/index.db` size for non-users (~30–50% growth on text-heavy projects).
Expand Down
4 changes: 4 additions & 0 deletions docs/golden-queries.md
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,10 @@ Some bundled recipes add optional **`reason`** (TEXT) and **`evidence_json`** (T

`coverage-confirmed-dead` adds **`confidence`** (`high` \| `medium`) on each row — **`high`** when static dead and ingested `coverage_pct = 0`; **`medium`** when static dead but the symbol has no ingested coverage row. Also **`reason`**, **`caller_count`**. Goldens: `coverage-confirmed-dead` (post-ingest mix) and `coverage-confirmed-dead-no-ingest` (`preSetup: clear-coverage`, `everyRowFieldEquals` on `confidence: medium`).

### Duplication columns (`duplicates` recipe)

`duplicates` returns one row per function-shaped symbol in a **`body_hash`** collision group: **`name`**, **`kind`**, **`file_path`**, **`line_start`**, **`line_end`**, **`body_hash`**, **`body_line_count`**, **`duplicate_count`** (group size). Substrate column **`symbols.body_hash`** is populated at index for named functions, arrows, and class methods when `body_line_count >= 2`. Goldens: `duplicates` (includes `src/bench/duplicate-body-{a,b}.ts` pair). False positives possible when unrelated functions share control-flow skeleton — triage with `snippet`.
Comment thread
coderabbitai[bot] marked this conversation as resolved.
Outdated

---

## Status
Expand Down
151 changes: 0 additions & 151 deletions docs/plans/ast-hash-duplication.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/roadmap.md
Original file line number Diff line number Diff line change
Expand Up @@ -112,7 +112,7 @@ Predicate-as-API only — enrich row shape and audit deltas; no standalone pass/
- [ ] **`codemap audit` verdict + thresholds** (v1.x) — `verdict: "pass" | "warn" | "fail"` driven by an `audit.deltas[<key>].{added_max, action}` field on the config object (`.codemap/config.{ts,js,json}`). Triggers: two consumers ship `jq`-based threshold scripts with similar shapes, OR one consumer asks with a concrete config sketch. Until then, raw deltas + consumer-side `jq` is the CI exit-code idiom. **Likely accelerant:** the Marketplace Action (next item) shipping is the most plausible path to firing the trigger — once `- uses: stainless-code/codemap@v1` is the dominant CI path, real `jq` threshold scripts will surface.
- [ ] **GitHub Marketplace Action — publish + listing finish** — core Action implementation is in-tree: root `action.yml`, `query --ci`, `audit --format sarif` / `--ci`, package-manager detection, dogfood smoke, and opt-in `pr-comment` summary renderer have shipped. Remaining work is the release/listing slice: `MARKETPLACE.md`, `v1.0.0` / floating `v1` tags, Marketplace setup, sacrificial-repo smoke, and making `action-smoke` blocking once the Action tag exists. Action version stream is independent of CLI version (`package.json` currently drives CLI/npm version; Action publishes at its own `v1.0.0`). Plan: [`plans/github-marketplace-action.md`](./plans/github-marketplace-action.md). Effort: S.
- [ ] **Churn × complexity hotspots** — `file_churn` table (git `log --numstat` over indexed paths, recency-weighted commits, optional trend) + bundled recipe **`churn-complexity-hotspots`** JOINing `symbols.complexity` for ranked refactor targets. Distinct from outcome alias `hotspots` → `fan-in`. Score is a recipe column, not a verdict ([Moat A](./roadmap.md#moats-load-bearing)). Plan: [`plans/churn-complexity-hotspots.md`](./plans/churn-complexity-hotspots.md). Effort: L–M.
- [ ] **AST-hash duplication** — `symbols.body_hash` column (normalized AST hash via oxc, computed at parse time — Rust-native, fast) + bundled `duplicates` recipe joining on `body_hash` (`GROUP BY body_hash HAVING COUNT(*) > 1`). **Different shape from token-level suffix-array dupes** (catches structurally-identical functions, not copy-paste with renamed variables). Substrate addition — consumer writes the JOIN that decides "this is a problem"; no severity, no suppression-by-default. Plan: [`plans/ast-hash-duplication.md`](./plans/ast-hash-duplication.md). Effort: M.
- [x] **AST-hash duplication** — `symbols.body_hash` (canonical body AST, identifiers → `$id`, literals → kind; function-shaped symbols; skip `body_line_count < 2`) + partial index + bundled `duplicates` recipe (per-symbol rows, CTE `GROUP BY`). **Different shape from token-level suffix-array dupes.** Contract: [architecture § `symbols` table](./architecture.md#symbols--functions-constants-classes-interfaces-types-enums-strict), [glossary § body_hash](./glossary.md#symbolsbody_hash--structural-duplicate-bodies). Effort: M.
- [ ] **Falsifiable benchmark CI on named external fixtures** — structural-cost A/B (indexed queries vs `find` + `grep` + `Read`-loop discovery) on zod, fastify, vue-core, next.js. Numbers land in [`docs/benchmark.md`](./benchmark.md); headline figures surface in `MARKETPLACE.md` only after external runs land. Harness: [benchmark § Agent eval harness](./benchmark.md#agent-eval-harness) + external fixture extension; pair with **Agent eval: quality × tokens × wall** for scored completion metrics. **Partial:** manual [`.github/workflows/agent-eval-external.yml`](../.github/workflows/agent-eval-external.yml) for in-repo fixture paths (not zod/fastify/nightly). Effort: M. **Self-index regression guardrail shipped** (#96): `bun run check:perf-baseline` + weekly scheduled workflow (demoted from PR hard gate — GHA runner variance).
- [ ] **In-repo test bench scale (optional)** — if `fixtures/minimal` outgrows one corpus: add committed `fixtures/bench/` or rename `minimal`→`bench`. Harness map: [`testing-coverage.md`](./testing-coverage.md), [`fixtures/README.md`](../fixtures/README.md).

Expand Down
9 changes: 9 additions & 0 deletions fixtures/CAPABILITIES.json
Original file line number Diff line number Diff line change
Expand Up @@ -175,6 +175,15 @@
],
"setup": ["ingest-coverage"]
},
{
"id": "duplication.body-hash",
"description": "symbols.body_hash structural fingerprint and duplicates recipe",
"fixtureFiles": [
"src/bench/duplicate-body-a.ts",
"src/bench/duplicate-body-b.ts"
],
"goldenScenarios": ["duplicates"]
},
{
"id": "boundaries.suppressions",
"description": "boundary_rules, suppressions, config-driven violations",
Expand Down
Loading
Loading