docs: dedupe cross-links; canonical fingerprints in architecture

SutuSebastian · SutuSebastian · commit ab04ab54557a · 2026-04-06T21:26:57.000+03:00
- Single layering pointer in README; CONTRIBUTING line merged with JSDoc note
- docs/README: one conventions + contributors line
- architecture § Schema: Fingerprints paragraph; files.content_hash points to it
- benchmark: Results blurb links to architecture; fix fixtures anchor; trim Key Takeaways
- why-codemap: one benchmark link from Solution
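
For readers of the new Fingerprints paragraph in architecture § Schema: a minimal sketch of what "SHA-256 hex of raw file bytes" could look like, assuming `node:crypto` (available on both Node and Bun). The names `contentHash` and `needsReindex` are hypothetical and used only for illustration; the real code lives in `src/hash.ts` and may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper (not the actual src/hash.ts export):
// SHA-256 over the raw bytes, returned as lowercase hex.
export function contentHash(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Hypothetical incremental check: a file is reindexed only when there is
// no stored files.content_hash for it, or the stored value no longer
// matches the hash of the current bytes.
export function needsReindex(
  storedHash: string | undefined,
  bytes: Uint8Array,
): boolean {
  return storedHash !== contentHash(bytes);
}
```

Hashing raw bytes rather than decoded text keeps the fingerprint independent of encoding or newline handling, which is one plausible reason the value is the same on both runtimes.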
diff --git a/README.md b/README.md
@@ -73,7 +73,7 @@ await cm.index({ quiet: true });
 const rows = cm.query("SELECT name FROM symbols LIMIT 5");
 ```
 
-`createCodemap` configures a process-global runtime (`initCodemap`); only **one active project per process** is supported. Advanced: `runCodemapIndex` for an open DB handle. Layering (`cli` → `application` → infrastructure): [docs/architecture.md](docs/architecture.md).
+`createCodemap` configures a process-global runtime (`initCodemap`); only **one active project per process** is supported. Advanced: `runCodemapIndex` for an open DB handle. **Module layout:** [docs/architecture.md § Layering](docs/architecture.md#layering).
 
 ---
 
@@ -94,7 +94,7 @@ bun run check    # build + format:check + lint + test + typecheck
 bun run fix      # oxlint --fix, then oxfmt
 ```
 
-**Readability & DX:** Prefer clear names and small functions over cleverness. **Public API** surface (`createCodemap`, `Codemap`, config types, `runCodemapIndex`, adapter exports) should stay **documented with JSDoc** so consumers get good hovers and published `.d.ts` stay useful. **Layering** (`cli` → `application` → `adapters` / parsers → SQLite): see [docs/architecture.md](docs/architecture.md). More for contributors: [.github/CONTRIBUTING.md](.github/CONTRIBUTING.md).
+**Readability & DX:** Prefer clear names and small functions; keep **JSDoc** on public exports. [.github/CONTRIBUTING.md](.github/CONTRIBUTING.md) has contributor workflow and conventions.
 
 ---
 
diff --git a/docs/README.md b/docs/README.md
@@ -13,6 +13,4 @@ Technical docs for **[@stainless-code/codemap](https://github.com/stainless-code
 | [roadmap.md](./roadmap.md)             | Forward-looking backlog (not a `src/` inventory)                                        |
 | [why-codemap.md](./why-codemap.md)     | Why index + SQL for agents                                                              |
 
-**Conventions:** one topic per file; link with relative paths; no hardcoded symbol/file counts (use `codemap query` / `bun run dev query`); no source line numbers. **Contributors:** keep public API JSDoc useful; run `bun run check` — see [CONTRIBUTING](../.github/CONTRIBUTING.md).
-
-**Also:** [.gitignore](../.gitignore) (`.codemap.db`), [.oxfmtrc.json](../.oxfmtrc.json) / [.oxlintrc.json](../.oxlintrc.json), [.agents/](../.agents/) / [.cursor/](../.cursor/) — [CONTRIBUTING](../.github/CONTRIBUTING.md).
+**Conventions:** one topic per file; relative links; no symbol/file counts or source line numbers in docs (use `codemap query` / `bun run dev query` to measure). **Contributors:** `bun run check`, JSDoc on public API — [.github/CONTRIBUTING.md](../.github/CONTRIBUTING.md) (tooling, `.agents/` / `.cursor/`, `.gitignore` / format config).
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -141,21 +141,23 @@ The npm package exports **`createCodemap`**, **`Codemap`** (`query`, `index`), *
 
 ## Schema
 
-Current schema version: **2** — see [Schema Versioning](#schema-versioning) for details
+**Fingerprints:** incremental runs compare **`files.content_hash`** — SHA-256 hex of raw file bytes from [`src/hash.ts`](../src/hash.ts) (same on Node and Bun). Details in the **`files`** table below.
 
-All tables use `STRICT` mode. Tables marked with `WITHOUT ROWID` store data directly in the primary key B-tree. See [SQLite Performance Configuration](#sqlite-performance-configuration) for details.
+Current schema version: **2** — see [Schema Versioning](#schema-versioning) for details.
+
+All tables use `STRICT` mode. Tables marked with `WITHOUT ROWID` store data directly in the primary key B-tree. PRAGMAs and index design: [SQLite Performance Configuration](#sqlite-performance-configuration).
 
 ### `files` — Every indexed file (`STRICT`)
 
-| Column        | Type    | Description                                       |
-| ------------- | ------- | ------------------------------------------------- |
-| path          | TEXT PK | Relative path from project root                   |
-| content_hash  | TEXT    | SHA-256 hex (`src/hash.ts`, same on Node and Bun) |
-| size          | INTEGER | File size in bytes                                |
-| line_count    | INTEGER | Total lines                                       |
-| language      | TEXT    | `ts`, `tsx`, `css`, `md`, etc.                    |
-| last_modified | INTEGER | File mtime (epoch ms)                             |
-| indexed_at    | INTEGER | When this row was written                         |
+| Column        | Type    | Description                                    |
+| ------------- | ------- | ---------------------------------------------- |
+| path          | TEXT PK | Relative path from project root                |
+| content_hash  | TEXT    | SHA-256 hex — see **Fingerprints** at § Schema |
+| size          | INTEGER | File size in bytes                             |
+| line_count    | INTEGER | Total lines                                    |
+| language      | TEXT    | `ts`, `tsx`, `css`, `md`, etc.                 |
+| last_modified | INTEGER | File mtime (epoch ms)                          |
+| indexed_at    | INTEGER | When this row was written                      |
 
 ### `symbols` — Functions, variables, classes, interfaces, type aliases, enums (`STRICT`)
 
diff --git a/docs/benchmark.md b/docs/benchmark.md
@@ -62,7 +62,7 @@ Each scenario runs both approaches back-to-back on the same machine, same data.
 
 ## Results
 
-Example snapshot from `bun src/benchmark.ts` immediately after `bun src/index.ts --full` on **this repository** (small tree; many scenario result counts are zero — that is expected here). Numbers vary by machine and project shape. Settings: schema v2, SHA-256 content fingerprints (`src/hash.ts`), `db.query()` caching, covering/partial indexes, mmap, worker threads, deferred indexes, `batchInsert` helper.
+Example snapshot from `bun src/benchmark.ts` immediately after `bun src/index.ts --full` on **this repository** (small tree; many scenario counts are zero). Numbers vary by machine and project. Schema, indexes, and content fingerprints: [architecture.md § Schema](./architecture.md#schema).
 
 | Scenario                                | Index Time | Results | Trad. Time | Results | Files Read | Bytes Read | Speedup  |
 | --------------------------------------- | ---------- | ------- | ---------- | ------- | ---------- | ---------- | -------- |
@@ -76,7 +76,7 @@ Example snapshot from `bun src/benchmark.ts` immediately after `bun src/index.ts
 
 **Totals**: Index ~408µs vs Traditional ~26.7ms (**~65× overall** on a sample run). Traditional bytes read total ~393 KB (not megabytes) because the globbed sets are small.
 
-On a **large app** indexed via `--root`, the same queries typically return non-zero rows; the indexed side stays sub-millisecond while the traditional side reads megabytes for broad globs. [Fixtures (planned)](#fixtures-planned) describes the plan for CI-friendly trees.
+On a **large app** indexed via `--root`, the same queries typically return non-zero rows; the indexed side stays sub-millisecond while the traditional side reads megabytes for broad globs. Repeatable numbers: [Fixtures](#fixtures).
 
 ### Run-to-run variance
 
@@ -90,22 +90,15 @@ The indexed CSS scenario uses `ORDER BY name LIMIT 50` — see `benchmark.ts` fo
 
 ### Speed
 
-- **Symbol / component queries** — covering indexes resolve from the index B-tree; indexed time stays sub-millisecond while the traditional path reads every matching file for regex
-- **TODO markers** — pre-extracted markers across indexed file types vs a narrower traditional glob
-- **Imports** — `imports` table vs full-file scan for a given module prefix
-  Indexed SQL timings above are sub-millisecond per scenario. See [architecture.md § SQLite Performance Configuration](./architecture.md#sqlite-performance-configuration) for PRAGMAs and indexes.
+Indexed queries use **covering / partial indexes** on the SQLite side; the traditional path scales with **files read** and regex work. PRAGMAs and index design: [architecture.md § SQLite Performance Configuration](./architecture.md#sqlite-performance-configuration).
 
 ### Accuracy
 
-- **React components**: Index uses the same JSX/TSX component heuristic as the rest of the tool; regex “export” scans can over- or under-count vs `components`
-- **CSS tokens**: Indexed rows are structured; raw `--var` regexes often pick up duplicates and non-token matches
-- **TODO markers**: Index scans more configured extensions than a single glob in the benchmark’s traditional path
+Structured parsing vs regex tradeoffs (components, CSS, markers, imports): [why-codemap.md § Accuracy Gains](./why-codemap.md#accuracy-gains).
 
-See [why-codemap.md § Accuracy Gains](./why-codemap.md#accuracy-gains) for the full analysis.
+### Token impact (AI agents)
 
-### Token Impact (AI Agents)
-
-See [why-codemap.md § Token Efficiency](./why-codemap.md#token-efficiency) for the full analysis. On a large tree, the traditional approach can read tens of megabytes across scenarios; indexed queries return only matching rows.
+[why-codemap.md § Token Efficiency](./why-codemap.md#token-efficiency).
 
 ### Reindex Cost
 
diff --git a/docs/why-codemap.md b/docs/why-codemap.md
@@ -13,12 +13,10 @@ This burns context window, wastes tokens, slows response time, and produces less
 
 ## The Solution
 
-A pre-built SQLite index (`.codemap.db`) that extracts and structures code metadata at index time. Agents query it with SQL instead of scanning files. Build and query timings: [benchmark.md](./benchmark.md).
+A pre-built SQLite index (`.codemap.db`) that extracts and structures code metadata at index time. Agents query it with SQL instead of scanning files. Timings, scenarios, and methodology: [benchmark.md](./benchmark.md).
 
 ## Speed Gains
 
-Measured via `bun src/benchmark.ts` — see [benchmark.md](./benchmark.md) for full methodology.
-
 ### Headline pattern
 
 Indexed queries stay **sub-millisecond** per scenario on typical trees; the traditional path scales with **how many files** it must read and scan. On a large application, overall speedups on the order of **tens to hundreds ×** are common for structural questions; exact ratios depend on the project and hardware. Re-run the benchmark after major changes or when pointing `--root` at a different repo.