From 1eb6f01a11c2a137d27031ec94bd58a60fc8757b Mon Sep 17 00:00:00 2001
From: Sutu Sebastian <sebiitv@gmail.com>
Date: Fri, 1 May 2026 10:52:15 +0300
Subject: [PATCH 1/4] docs(plans): draft codemap-audit (B.5) plan
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Per docs/README.md Rule 3 (plans/<feature-name>.md, link from roadmap),
draft the design pass for the highest-leverage Tier B candidate from
docs/research/fallow.md before writing any code.

Plan covers:

- Snapshot-strategy trade-offs (worktree+full-index vs. B.6 baseline
  reuse vs. on-demand snapshot table) — recommends Option A (temp
  worktree under .codemap.audit-<sha>/) for v1; defers caching and
  perf optimization until a real consumer hits the wall.
- Built-in deltas for v1: files, dependencies, deprecated, visibility,
  barrels (top-N membership change), hot files (fan-in/fan-out top-N
  movement). Out: cycles, boundary crossings, markers, css_*.
- Verdict shape: pass/warn/fail with thresholds opt-in via
  codemap.config.audit; v1 emits raw deltas only (default pass).
  Exit codes 0/1/2 extend the `git diff --exit-code` convention
  (0 = clean, non-zero = needs attention).
- Composition table: --json / --summary work; --changed-since /
  --group-by / --baseline are mutex (different output shapes or
  semantics).
- Tracer-bullet sequence: 7 commits for end-to-end ship.
- Open questions surfaced rather than guessed — worktree location,
  empty-diff warning, per-delta actions, perf ceiling.

Roadmap entry added pointing at the plan; backlog item moved to top
since it's now actively designed.
---
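Reviewer note (illustrative only, not part of the patch): the set-diff
deltas listed in the message above (files, dependencies, deprecated,
visibility) and the top-N membership changes could be computed roughly as
below. This is a minimal TypeScript sketch under assumptions: `setDiff`,
`topN`, and the row shapes are hypothetical names, not codemap APIs, and
the plan itself proposes running the diff in SQL across two attached DBs.

```typescript
// Hypothetical helper, not a codemap API: rows present on only one side.
type Delta<T> = { added: T[]; removed: T[] };

function setDiff<T>(base: T[], head: T[], key: (row: T) => string): Delta<T> {
  const baseKeys = new Set(base.map(key));
  const headKeys = new Set(head.map(key));
  return {
    added: head.filter((row) => !baseKeys.has(key(row))),
    removed: base.filter((row) => !headKeys.has(key(row))),
  };
}

// `files` delta: each row is just a path.
const files = setDiff(["src/a.ts", "src/b.ts"], ["src/b.ts", "src/c.ts"], (p) => p);

// `dependencies` delta: an edge is keyed by both endpoints.
type Edge = { from_path: string; to_path: string };
const deps = setDiff<Edge>(
  [{ from_path: "src/a.ts", to_path: "src/b.ts" }],
  [{ from_path: "src/c.ts", to_path: "src/b.ts" }],
  (e) => JSON.stringify([e.from_path, e.to_path]),
);

// Assumed reading of "top-N membership change": set-diff the two top-N lists.
type Ranked = { path: string; score: number };
const topN = (rows: Ranked[], n: number): string[] =>
  [...rows].sort((a, b) => b.score - a.score).slice(0, n).map((r) => r.path);

const hot = setDiff(
  topN([{ path: "x.ts", score: 9 }, { path: "y.ts", score: 5 }, { path: "z.ts", score: 1 }], 2),
  topN([{ path: "x.ts", score: 9 }, { path: "y.ts", score: 2 }, { path: "z.ts", score: 7 }], 2),
  (p) => p,
);
```

Keying rows by a string keeps one helper usable for every delta; only the
key function changes per table.
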
 docs/plans/codemap-audit.md | 194 ++++++++++++++++++++++++++++++++++++
 docs/roadmap.md             |   1 +
 2 files changed, 195 insertions(+)
 create mode 100644 docs/plans/codemap-audit.md
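
Reviewer note (illustrative only, not part of the patch): the verdict
reduction described in the plan ("highest action wins", thresholds opt-in,
default `pass`) and the 0/1/2 exit codes might look like the sketch below.
`added_max` and `action` come from the plan's config sample; `reduceVerdict`,
`exitCode`, and `severity` are hypothetical names, and treating a threshold
as exceeded when the count is strictly greater than `added_max` is an
assumption drawn from the "any new @deprecated fails" comment.

```typescript
type Action = "warn" | "fail";
type Verdict = "pass" | Action;
type Threshold = { added_max: number; action: Action };

// Ordering used both for "highest action wins" and for the exit code.
const severity: Record<Verdict, number> = { pass: 0, warn: 1, fail: 2 };

// Hypothetical reducer: start at "pass", escalate when a threshold is exceeded.
function reduceVerdict(
  addedCounts: Record<string, number>,
  thresholds: Record<string, Threshold>,
): Verdict {
  let verdict: Verdict = "pass";
  for (const [delta, threshold] of Object.entries(thresholds)) {
    const added = addedCounts[delta] ?? 0;
    if (added > threshold.added_max && severity[threshold.action] > severity[verdict]) {
      verdict = threshold.action;
    }
  }
  return verdict;
}

// 0 on pass, 1 on warn, 2 on fail.
const exitCode = (verdict: Verdict): number => severity[verdict];

// Thresholds copied from the plan's config example.
const thresholds: Record<string, Threshold> = {
  dependencies: { added_max: 50, action: "warn" },
  deprecated: { added_max: 0, action: "fail" },
};

const noConfig = reduceVerdict({ dependencies: 500 }, {});
const warned = reduceVerdict({ dependencies: 60 }, thresholds);
const failed = reduceVerdict({ dependencies: 60, deprecated: 1 }, thresholds);
```

With no thresholds configured the reducer never leaves `pass`, which matches
the "v1 emits raw deltas only" default.
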
diff --git a/docs/plans/codemap-audit.md b/docs/plans/codemap-audit.md
new file mode 100644
index 00000000..a9354c8c
--- /dev/null
+++ b/docs/plans/codemap-audit.md
@@ -0,0 +1,194 @@
+# Plan — `codemap audit --base <ref>`
+
+> Two-snapshot structural-drift verdict for a PR / branch. Adopted from [`docs/research/fallow.md` § Tier B B.5](../research/fallow.md) — explicitly the "single highest-leverage candidate" of that scan.
+
+**Status:** Open — design pass; not yet implemented.
+**Cross-refs:** [`docs/research/fallow.md`](../research/fallow.md) (motivation) · [`docs/architecture.md` § CLI usage](../architecture.md#cli-usage) (where wiring lands) · [`.agents/lessons.md`](../../.agents/lessons.md) (changesets bump policy).
+
+---
+
+## 1. Goal
+
+One command returns a structured verdict for what changed between a base ref and `HEAD`:
+
+```text
+codemap audit --base origin/main [--json] [--summary]
+↓
+{
+  "verdict": "pass" | "warn" | "fail",
+  "base": { "ref": "origin/main", "sha": "<sha>", "indexed_at": <ms> },
+  "head": { "sha": "<sha>", "indexed_at": <ms> },
+  "deltas": {
+    "files":        { "added": [...], "removed": [...] },
+    "dependencies": { "added": [...], "removed": [...] },
+    "deprecated":   { "added": [...], "removed": [...] },
+    "visibility":   { "added": [...], "removed": [...] },
+    "barrels":      { "movements": [...] },
+    "hot_files":    { "movements": [...] }
+  }
+}
+```
+
+Wraps existing recipes; doesn't grow a new analysis layer. Stays consistent with codemap's structural-index thesis ([`docs/why-codemap.md` § What Codemap is not](../why-codemap.md#what-codemap-is-not)).
+
+## 2. Non-goals (v1)
+
+- **Dead-code / duplication / complexity verdicts.** Those are fallow's territory and a non-goal per [`docs/roadmap.md` § Non-goals (v1)](../roadmap.md#non-goals-v1).
+- **Code-quality scoring / grading.** No "code health 87/100" output.
+- **Auto-fix / SARIF output.** Separate concerns — SARIF is B.8, auto-fix is explicitly out (D.14 in the research note).
+- **Cross-repo audit** (audit `origin/main` of project A from a checkout of project B). Out of scope; reuse `--root` for the simpler "audit a different tree" case.
+- **Continuous mode.** One-shot CLI, same as `codemap query`.
+
+## 3. Snapshot strategy
+
+The verdict is a diff between two indexed snapshots. Three credible architectures:
+
+### Option A: Temp DB on the base ref (worktree-style)
+
+```text
+1. git worktree add /tmp/codemap-audit-<sha> <base-ref>
+2. codemap --root /tmp/codemap-audit-<sha> --full   # builds .codemap.db there
+3. Open both DBs, run delta queries cross-DB, emit verdict.
+4. git worktree remove /tmp/codemap-audit-<sha>
+```
+
+**Pros:** Same code path as a normal index run on the base; no special "snapshot" abstraction; deltas are pure SQL across two attached DBs; reproducible regardless of how `HEAD` evolves.
+
+**Cons:** Spawns a worktree + full reindex per audit (cold cost ~seconds for codemap-sized projects, more for large monorepos). Disk churn under `/tmp`.
+
+### Option B: In-memory base via the existing `query_baselines` table (B.6 reuse)
+
+```text
+1. On main, periodically: for each "tracked" recipe, codemap query --save-baseline -r <id>.
+2. On a PR branch: codemap audit --base <name> diffs the live query results against the saved snapshots.
+```
+
+**Pros:** Zero new infra — reuses B.6 directly. Snapshots are addressable / nameable. No cold reindex.
+
+**Cons:** Requires baselines to be saved at the right moment (git-hook or CI step). Doesn't capture deltas the user didn't pre-baseline. Doesn't naturally express "deltas in the dependency graph as a whole" — only as far as recipes go.
+
+### Option C: On-demand snapshot table for the audit (hybrid)
+
+```text
+1. codemap audit --base <ref> reads <ref> from git, computes audit-shaped queries against the
+   *checked-out* tree at <ref> (using `git show <ref>:<file>` or `git archive` to materialise
+   files in memory / a temp dir), populates a tiny in-DB `audit_snapshot` table with just the
+   columns needed for the deltas (no full reindex).
+2. Diff in SQL; drop the snapshot table.
+```
+
+**Pros:** No worktree spawn; no extra infra in main code paths; deltas are scoped to what the audit needs.
+
+**Cons:** Implementing a "mini-indexer" that runs only the queries we need at `<ref>` is more code than (A) and the abstraction doesn't transfer.
+
+### Recommendation
+
+**Start with Option A** (temp worktree + full index). Reasons:
+
+1. Simplest to implement correctly — no new abstractions; the existing `--full --root /tmp/...` path already works.
+2. Cold cost on codemap (~150 files) is sub-second; on JordanCoin-sized projects (a few thousand files) it is still under 5s. Acceptable for "run on PR" usage.
+3. Future optimisation: cache `<sha> → /tmp/codemap-audit-<sha>/.codemap.db` so repeated audits on the same base hit the cache.
+4. Doesn't entangle the audit with B.6's user-facing baseline workflow (which has different semantics: user-named, hand-saved).
+
+**Reconsider Option B** if Option A's perf becomes a problem AND audits are happening in tight loops (e.g. file-watch trigger).
+
+## 4. Built-in deltas (v1)
+
+Each delta wraps an existing query / recipe. All structural — no new analysis layer.
+
+| Delta key      | What it surfaces                                                                                                                     | Source                                                                                                                     |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------- |
+| `files`        | New / deleted indexed files                                                                                                          | `SELECT path FROM files` (set diff)                                                                                        |
+| `dependencies` | New / deleted edges in the file-to-file dependency graph                                                                             | `SELECT from_path, to_path FROM dependencies` (set diff)                                                                   |
+| `deprecated`   | New / removed `@deprecated` symbols                                                                                                  | `--recipe deprecated-symbols` (set diff)                                                                                   |
+| `visibility`   | New / removed visibility-tagged symbols (`@internal` / `@beta` / `@alpha` / `@private` — `@public` is the surface itself, not noise) | `SELECT name, kind, visibility, file_path FROM symbols WHERE visibility IS NOT NULL AND visibility != 'public'` (set diff) |
+| `barrels`      | Files that crossed an export-count threshold (e.g. <10 → ≥10)                                                                        | `--recipe barrel-files` (compare top-N membership)                                                                         |
+| `hot_files`    | Files that gained / lost rank in the fan-in or fan-out top-15                                                                        | `--recipe fan-in` / `--recipe fan-out` (compare top-N membership)                                                          |
+
+**Out of v1** (reconsider once shipped):
+
+- `cycles` — needs cycle detection on the dependency graph; not a recipe today
+- `boundary_crossings` — needs a project-supplied glob list (similar to the future `audit-pr-architecture` skill kit); no canonical source
+- `markers` — TODO/FIXME drift is noisy and project-specific
+- `css_*` deltas — narrow audience; defer
+
+## 5. Verdict shape
+
+`pass | warn | fail` derived from per-delta thresholds. **Defaults exposed but conservative:**
+
+| Delta | Default threshold                               |
+| ----- | ----------------------------------------------- |
+| any   | `pass` (thresholds are opt-in via config in v1) |
+
+In other words: **v1 emits raw deltas only**. The verdict is always `pass` unless the user opts in via `codemap.config.*`. Reasoning: structural deltas don't have a universally-meaningful threshold ("how many new dependency edges is too many?" depends entirely on the project), and the research note explicitly biases toward "first pass exposes raw deltas only and lets the consumer set thresholds."
+
+### Threshold config (v1.x)
+
+Once per-project use surfaces concrete thresholds, fold into `codemap.config.*`:
+
+```ts
+// codemap.config.ts
+export default defineConfig({
+  audit: {
+    deltas: {
+      dependencies: { added_max: 50, action: "warn" },
+      deprecated: { added_max: 0, action: "fail" }, // any new @deprecated fails
+      visibility: { added_max: 5, action: "warn" },
+    },
+    // verdict reduction: highest action wins (fail > warn > pass)
+  },
+});
+```
+
+Validated via existing `codemapUserConfigSchema` (Zod) — see [`docs/architecture.md` § User config](../architecture.md#user-config). Schema additions are minor changesets per [`.agents/lessons.md` "changesets bump policy"](../../.agents/lessons.md) (no `.codemap.db` impact).
+
+## 6. Composition with existing flags
+
+| Flag                             | Behaviour with `audit`                                                                                             |
+| -------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
+| `--json`                         | Default for the verdict shape; non-JSON falls back to `console.table` per delta + a one-line verdict summary.      |
+| `--summary`                      | Collapses every delta to `{added: N, removed: N}`; verdict + base/head metadata stay. Useful for CI status checks. |
+| `--changed-since`                | **Mutex** — `audit` is itself a "changed-since" operation; combining would be confusing. Parser-level error.       |
+| `--group-by`                     | **Mutex** — verdict shape is already structured; bucketing is the consumer's job on the output JSON.               |
+| `--save-baseline` / `--baseline` | **Mutex** — different snapshot semantics (B.6 is user-named; audit is base-ref-driven).                            |
+| `--recipe`                       | N/A — `audit` isn't a `query` subcommand; it's its own top-level command.                                          |
+
+## 7. CLI surface
+
+```text
+codemap audit --base <ref> [--json] [--summary] [--root <dir>] [--config <file>]
+```
+
+- `--base <ref>` — required. Any committish (`origin/main`, `HEAD~5`, sha, tag).
+- `--root` / `--config` / `--help` / `-h` — same shape as the rest of the CLI (handled by `bootstrap`).
+- Exit codes: **0** on `pass`, **1** on `warn`, **2** on `fail`. (CI-friendly; extends the `git diff --exit-code` convention, where 0 means no differences and 1 means differences.)
+
+## 8. Tracer-bullet sequence
+
+Per [`.agents/rules/tracer-bullets`](../../.agents/rules/tracer-bullets.md), commit each slice end-to-end:
+
+1. **CLI scaffold** — `codemap audit --help` works; `--base <ref>` parsed; `runAuditCmd` calls a stub that returns `{verdict: "pass", deltas: {}}`. Smoke + commit.
+2. **Worktree + base index** — Option A spawn-and-index implementation; assert two `.codemap.db` files exist. Commit.
+3. **First delta — `files`** — minimal end-to-end vertical slice: open both DBs, set-diff `path`, emit `{files: {added, removed}}`. Smoke + commit.
+4. **Remaining deltas** — `dependencies`, `deprecated`, `visibility`, `barrels`, `hot_files` — each as a separate commit so individual tests can be reviewed.
+5. **Threshold config** — Zod schema additions + verdict reduction; default `pass` until user opts in. Commit.
+6. **Docs + agents update** — `architecture.md § Audit wiring`, glossary entry, README CLI block, rule + skill across `.agents/` and `templates/agents/` (Rule 10). Commit.
+7. **Changeset** — patch (no schema bump). Commit.
+
+Estimated total: 1–2 days end-to-end across ~7 commits.
+
+## 9. Open questions
+
+- **Should the temp worktree live under `.codemap/audit-<sha>/` (project-local) or `/tmp/codemap-audit-<sha>` (system temp)?** Project-local is gitignorable via the existing `.codemap.*` glob (works only if the dir is named `.codemap.audit-<sha>`); system temp is auto-cleaned but loses the cache benefit across reboots. **Lean: project-local, naming `.codemap.audit-<sha>` so the existing gitignore covers it.**
+- **Should `audit` warn when `<base>` and `HEAD` are identical?** Almost certainly user error (probably wanted `--base origin/main` not `--base HEAD`). Surface a warning, exit 0 with empty deltas.
+- **Should the verdict include `actions` per delta key?** Recipe `actions` (Tier A.1) attach to row sets; an audit delta is a higher-level concept. v1 punts; v1.x can add `audit.actions: { dependencies: "review-coupling-spike" }` if patterns emerge.
+- **Cross-snapshot performance ceiling.** At what project size does Option A become unacceptable (>30s)? Need a benchmark fixture; defer until a real consumer hits the wall.
+
+## 10. References
+
+- Motivation: [`docs/research/fallow.md` § Tier B B.5](../research/fallow.md) ("single highest-leverage candidate").
+- Snapshot primitive prior art: PR #30 — `query_baselines` table + `--save-baseline` / `--baseline`.
+- Composition: PR #26 — Tier A flags (`--summary` / `--changed-since` / `--group-by` / per-row `actions`).
+- Visibility column prior art: PR #28 — `symbols.visibility` (B.7).
+- CLI conventions: [`docs/architecture.md` § CLI usage](../architecture.md#cli-usage).
+- Doc lifecycle: this file follows the **Plan** type per [`docs/README.md` § Document Lifecycle](../README.md#document-lifecycle) — **delete on ship**, lift the canonical bits into `architecture.md` per Rule 2.
diff --git a/docs/roadmap.md b/docs/roadmap.md
index a8819df3..91c3c61a 100644
--- a/docs/roadmap.md
+++ b/docs/roadmap.md
@@ -36,6 +36,7 @@ Codemap stays a structural-index primitive that other tools can consume. Out of
 
 ## Backlog
 
+- [ ] **`codemap audit --base <ref>`** — two-snapshot structural-drift verdict for a PR / branch (new files / deps / `@deprecated` / visibility / barrel / hot-file deltas; `pass`/`warn`/`fail` exit codes). Plan: [`plans/codemap-audit.md`](./plans/codemap-audit.md). Builds on B.6 (snapshot primitive), B.7 (`visibility`), Tier A flags (composition).
 - [ ] **MCP** server wrapping `query` — single stdio tool first (`query` SQL string → JSON rows), then expand to `recipe`, `list_recipes`, `schema`, `index`. Resources expose the bundled `SKILL.md` and recipe catalog
 - [ ] **HTTP API** — `codemap serve [--port] [--host 127.0.0.1]` exposing `POST /query`, `GET /recipes`, `GET /recipes/:id`, `GET /schema`, `GET /context`. Bind to loopback by default; reject non-loopback unless `--host` overridden. Unblocks tools that don't speak MCP yet
 - [ ] **Recipes-as-content registry** — pair every bundled recipe in `src/cli/query-recipes.ts` with a sibling `.md` (or YAML frontmatter) describing _when to use, follow-up SQL_; surface in `--recipes-json`. Plus **project-local recipes** loaded from `.codemap/recipes/*.{sql,md}` so teams can ship internal SQL without an adapter API

From 906ecba13a8ae4c636643d7a9446b30bd2b547b0 Mon Sep 17 00:00:00 2001
From: Sutu Sebastian <sebiitv@gmail.com>
Date: Fri, 1 May 2026 11:03:44 +0300
Subject: [PATCH 2/4] docs(agents): adopt grill-me +
 improve-codebase-architecture skills (mattpocock)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two Tier 3 maintainer-only skills sourced from mattpocock/skills:

grill-me — pure interview pattern. Walk a design tree branch by branch,
recommend an answer per question, ask one at a time. 8-line skill, zero
cost when not invoked. Filled a gap visible in the codemap-audit plan
(this PR's first commit): I made many decisions by myself; grill-me would
have surfaced them for a second opinion before they crystallised in the
doc.

improve-codebase-architecture — Ousterhout-style deepening vocabulary
(module / interface / seam / adapter / depth / leverage / locality), the
deletion test, "one adapter = hypothetical seam, two = real," dependency
categories (DEEPENING.md), and parallel-sub-agent "Design It Twice"
interface exploration (INTERFACE-DESIGN.md). Translated CONTEXT.md /
docs/adr/ references → docs/glossary.md / docs/plans/ to fit codemap's
existing docs framework (per Rule 9 + Rule 3); ADR-offer flow dropped
since codemap lifts decisions from plans into architecture.md per Rule 2.
Companion files (LANGUAGE.md, DEEPENING.md, INTERFACE-DESIGN.md) are
verbatim — they don't reference CONTEXT.md / ADRs.

Both adopted as maintainer-only (under .agents/skills/ + .cursor/skills/
symlinks per agents-first-convention). Not added to templates/agents/
since that surface ships only the codemap rule + skill — same precedent
as PR #25 for audit-pr-architecture / docs-governance / etc.

agents-tier-system Tier 3 list updated with both skills + the existing
docs-governance entry that was missing.

improve-codebase-architecture composes with grill-me in its grilling-loop
step (deepening candidates get grilled, not auto-accepted). Skipped
grill-with-docs (the third skill in the upstream "grill" family) — it
requires standing up CONTEXT.md / docs/adr/ infrastructure that conflicts
with codemap's lift-to-architecture-then-delete-the-plan lifecycle.
---
 .agents/rules/agents-tier-system.md           |  2 +-
 .agents/skills/grill-me/SKILL.md              | 12 +++
 .../DEEPENING.md                              | 37 +++++++++
 .../INTERFACE-DESIGN.md                       | 44 ++++++++++
 .../improve-codebase-architecture/LANGUAGE.md | 53 ++++++++++++
 .../improve-codebase-architecture/SKILL.md    | 80 +++++++++++++++++++
 .cursor/skills/grill-me                       |  1 +
 .cursor/skills/improve-codebase-architecture  |  1 +
 8 files changed, 229 insertions(+), 1 deletion(-)
 create mode 100644 .agents/skills/grill-me/SKILL.md
 create mode 100644 .agents/skills/improve-codebase-architecture/DEEPENING.md
 create mode 100644 .agents/skills/improve-codebase-architecture/INTERFACE-DESIGN.md
 create mode 100644 .agents/skills/improve-codebase-architecture/LANGUAGE.md
 create mode 100644 .agents/skills/improve-codebase-architecture/SKILL.md
 create mode 120000 .cursor/skills/grill-me
 create mode 120000 .cursor/skills/improve-codebase-architecture
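
Reviewer note (illustrative only, not part of the patch): the "one adapter =
hypothetical seam, two = real" rule from the message above can be made
concrete with a small port and two adapters. This is a sketch under
assumptions: `GlossarySource`, `inMemoryGlossary`, `textGlossary`, and
`undefinedTerms` are hypothetical names invented for the example and do not
exist in this repo or in the upstream skill.

```typescript
// Port: what the deep module needs from the outside world (hypothetical).
interface GlossarySource {
  define(term: string): string | undefined;
}

// Adapter 1: in-memory table, the stand-in used by tests.
function inMemoryGlossary(entries: Record<string, string>): GlossarySource {
  const table = new Map(Object.entries(entries));
  return { define: (term) => table.get(term) };
}

// Adapter 2: parses "term: definition" lines; the reader is injected,
// so production could pass a file read while tests pass a literal string.
function textGlossary(readText: () => string): GlossarySource {
  const table = new Map<string, string>();
  for (const line of readText().split("\n")) {
    const colon = line.indexOf(":");
    if (colon > 0) {
      table.set(line.slice(0, colon).trim(), line.slice(colon + 1).trim());
    }
  }
  return { define: (term) => table.get(term) };
}

// Deep module: the logic lives here and only sees the port.
function undefinedTerms(terms: string[], glossary: GlossarySource): string[] {
  return terms.filter((term) => glossary.define(term) === undefined);
}

const viaMemory = undefinedTerms(
  ["seam", "port"],
  inMemoryGlossary({ seam: "where an interface lives" }),
);
const viaText = undefinedTerms(
  ["seam", "port"],
  textGlossary(() => "seam: where an interface lives\n"),
);
```

Because two adapters satisfy the same port, the seam is real by the skill's
own test; with only one of them the interface would be plain indirection.
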

diff --git a/.agents/rules/agents-tier-system.md b/.agents/rules/agents-tier-system.md
index 1053c21d..8a7ac913 100644
--- a/.agents/rules/agents-tier-system.md
+++ b/.agents/rules/agents-tier-system.md
@@ -50,7 +50,7 @@ Today's Tier-2 rules:
 
 Pure intent-triggered. The skill description is detailed enough that Cursor surfaces it on relevant phrases. No always-on cost.
 
-Skills stay rule-less when the work is **explicitly invoked** by the user, not pattern-triggered (e.g. `audit-pr-architecture`, `docs-lifecycle-sweep` in this repo; `improve-codebase-architecture`, `gritql-codemods`, `ubiquitous-language` in larger codebases).
+Skills stay rule-less when the work is **explicitly invoked** by the user, not pattern-triggered. Today: `audit-pr-architecture`, `docs-governance`, `docs-lifecycle-sweep`, `grill-me`, `improve-codebase-architecture`. (Skills like `gritql-codemods` and `ubiquitous-language` would also fit this tier if adopted.)
 
 ## Authoring guidelines
 
diff --git a/.agents/skills/grill-me/SKILL.md b/.agents/skills/grill-me/SKILL.md
new file mode 100644
index 00000000..3345f3cc
--- /dev/null
+++ b/.agents/skills/grill-me/SKILL.md
@@ -0,0 +1,12 @@
+---
+name: grill-me
+description: Interview the user relentlessly about a plan or design until reaching shared understanding, resolving each branch of the decision tree. Use when user wants to stress-test a plan, get grilled on their design, or mentions "grill me".
+---
+
+Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.
+
+Ask the questions one at a time, waiting for feedback before continuing.
+
+If a question can be answered by exploring the codebase, explore the codebase instead. In this repo, that means querying [`codemap`](../codemap/SKILL.md) (the structural index) before reaching for `Grep` or `Read` — see the [`codemap` rule](../../rules/codemap.md).
+
+When agreement crystallises on a question that affects an in-flight `docs/plans/<name>.md`, write the answer into the plan inline as you go — don't batch them up. The plan doc is the durable record; the chat transcript is not.
diff --git a/.agents/skills/improve-codebase-architecture/DEEPENING.md b/.agents/skills/improve-codebase-architecture/DEEPENING.md
new file mode 100644
index 00000000..c52fdfd9
--- /dev/null
+++ b/.agents/skills/improve-codebase-architecture/DEEPENING.md
@@ -0,0 +1,37 @@
+# Deepening
+
+How to deepen a cluster of shallow modules safely, given its dependencies. Assumes the vocabulary in [LANGUAGE.md](LANGUAGE.md) — **module**, **interface**, **seam**, **adapter**.
+
+## Dependency categories
+
+When assessing a candidate for deepening, classify its dependencies. The category determines how the deepened module is tested across its seam.
+
+### 1. In-process
+
+Pure computation, in-memory state, no I/O. Always deepenable — merge the modules and test through the new interface directly. No adapter needed.
+
+### 2. Local-substitutable
+
+Dependencies that have local test stand-ins (PGLite for Postgres, in-memory filesystem). Deepenable if the stand-in exists. The deepened module is tested with the stand-in running in the test suite. The seam is internal; no port at the module's external interface.
+
+### 3. Remote but owned (Ports & Adapters)
+
+Your own services across a network boundary (microservices, internal APIs). Define a **port** (interface) at the seam. The deep module owns the logic; the transport is injected as an **adapter**. Tests use an in-memory adapter. Production uses an HTTP/gRPC/queue adapter.
+
+Recommendation shape: _"Define a port at the seam, implement an HTTP adapter for production and an in-memory adapter for testing, so the logic sits in one deep module even though it's deployed across a network."_
+
+### 4. True external (Mock)
+
+Third-party services (Stripe, Twilio, etc.) you don't control. The deepened module takes the external dependency as an injected port; tests provide a mock adapter.
+
+## Seam discipline
+
+- **One adapter means a hypothetical seam. Two adapters means a real one.** Don't introduce a port unless at least two adapters are justified (typically production + test). A single-adapter seam is just indirection.
+- **Internal seams vs external seams.** A deep module can have internal seams (private to its implementation, used by its own tests) as well as the external seam at its interface. Don't expose internal seams through the interface just because tests use them.
+
+## Testing strategy: replace, don't layer
+
+- Old unit tests on shallow modules become waste once tests at the deepened module's interface exist — delete them.
+- Write new tests at the deepened module's interface. The **interface is the test surface**.
+- Tests assert on observable outcomes through the interface, not internal state.
+- Tests should survive internal refactors — they describe behaviour, not implementation. If a test has to change when the implementation changes, it's testing past the interface.
diff --git a/.agents/skills/improve-codebase-architecture/INTERFACE-DESIGN.md b/.agents/skills/improve-codebase-architecture/INTERFACE-DESIGN.md
new file mode 100644
index 00000000..7d69c405
--- /dev/null
+++ b/.agents/skills/improve-codebase-architecture/INTERFACE-DESIGN.md
@@ -0,0 +1,44 @@
+# Interface Design
+
+When the user wants to explore alternative interfaces for a chosen deepening candidate, use this parallel sub-agent pattern. Based on "Design It Twice" (Ousterhout) — your first idea is unlikely to be the best.
+
+Uses the vocabulary in [LANGUAGE.md](LANGUAGE.md) — **module**, **interface**, **seam**, **adapter**, **leverage**.
+
+## Process
+
+### 1. Frame the problem space
+
+Before spawning sub-agents, write a user-facing explanation of the problem space for the chosen candidate:
+
+- The constraints any new interface would need to satisfy
+- The dependencies it would rely on, and which category they fall into (see [DEEPENING.md](DEEPENING.md))
+- A rough illustrative code sketch to ground the constraints — not a proposal, just a way to make the constraints concrete
+
+Show this to the user, then immediately proceed to Step 2. The user reads and thinks while the sub-agents work in parallel.
+
+### 2. Spawn sub-agents
+
+Spawn 3+ sub-agents in parallel using the Agent / Task tool. Each must produce a **radically different** interface for the deepened module.
+
+Prompt each sub-agent with a separate technical brief (file paths, coupling details, dependency category from [DEEPENING.md](DEEPENING.md), what sits behind the seam). The brief is independent of the user-facing problem-space explanation in Step 1. Give each agent a different design constraint:
+
+- Agent 1: "Minimize the interface — aim for 1–3 entry points max. Maximise leverage per entry point."
+- Agent 2: "Maximise flexibility — support many use cases and extension."
+- Agent 3: "Optimise for the most common caller — make the default case trivial."
+- Agent 4 (if applicable): "Design around ports & adapters for cross-seam dependencies."
+
+Include both [LANGUAGE.md](LANGUAGE.md) vocabulary and [`docs/glossary.md`](../../../docs/glossary.md) vocabulary in the brief so each sub-agent names things consistently with the architecture language and the project's domain language.
+
+Each sub-agent outputs:
+
+1. Interface (types, methods, params — plus invariants, ordering, error modes)
+2. Usage example showing how callers use it
+3. What the implementation hides behind the seam
+4. Dependency strategy and adapters (see [DEEPENING.md](DEEPENING.md))
+5. Trade-offs — where leverage is high, where it's thin
+
+### 3. Present and compare
+
+Present designs sequentially so the user can absorb each one, then compare them in prose. Contrast by **depth** (leverage at the interface), **locality** (where change concentrates), and **seam placement**.
+
+After comparing, give your own recommendation: which design you think is strongest and why. If elements from different designs would combine well, propose a hybrid. Be opinionated — the user wants a strong read, not a menu.
diff --git a/.agents/skills/improve-codebase-architecture/LANGUAGE.md b/.agents/skills/improve-codebase-architecture/LANGUAGE.md
new file mode 100644
index 00000000..dd9b60fe
--- /dev/null
+++ b/.agents/skills/improve-codebase-architecture/LANGUAGE.md
@@ -0,0 +1,53 @@
+# Language
+
+Shared vocabulary for every suggestion this skill makes. Use these terms exactly — don't substitute "component," "service," "API," or "boundary." Consistent language is the whole point.
+
+## Terms
+
+**Module**
+Anything with an interface and an implementation. Deliberately scale-agnostic — applies equally to a function, class, package, or tier-spanning slice.
+_Avoid_: unit, component, service.
+
+**Interface**
+Everything a caller must know to use the module correctly. Includes the type signature, but also invariants, ordering constraints, error modes, required configuration, and performance characteristics.
+_Avoid_: API, signature (too narrow — those refer only to the type-level surface).
+
+**Implementation**
+What's inside a module — its body of code. Distinct from **Adapter**: a thing can be a small adapter with a large implementation (a Postgres repo) or a large adapter with a small implementation (an in-memory fake). Reach for "adapter" when the seam is the topic; "implementation" otherwise.
+
+**Depth**
+Leverage at the interface — the amount of behaviour a caller (or test) can exercise per unit of interface they have to learn. A module is **deep** when a large amount of behaviour sits behind a small interface. A module is **shallow** when the interface is nearly as complex as the implementation.
+
+**Seam** _(from Michael Feathers)_
+A place where you can alter behaviour without editing in that place. The _location_ at which a module's interface lives. Choosing where to put the seam is its own design decision, distinct from what goes behind it.
+_Avoid_: boundary (overloaded with DDD's bounded context).
+
+**Adapter**
+A concrete thing that satisfies an interface at a seam. Describes _role_ (what slot it fills), not substance (what's inside).
+
+**Leverage**
+What callers get from depth. More capability per unit of interface they have to learn. One implementation pays back across N call sites and M tests.
+
+**Locality**
+What maintainers get from depth. Change, bugs, knowledge, and verification concentrate at one place rather than spreading across callers. Fix once, fixed everywhere.
+
+## Principles
+
+- **Depth is a property of the interface, not the implementation.** A deep module can be internally composed of small, mockable, swappable parts — they just aren't part of the interface. A module can have **internal seams** (private to its implementation, used by its own tests) as well as the **external seam** at its interface.
+- **The deletion test.** Imagine deleting the module. If complexity vanishes, the module wasn't hiding anything (it was a pass-through). If complexity reappears across N callers, the module was earning its keep.
+- **The interface is the test surface.** Callers and tests cross the same seam. If you want to test _past_ the interface, the module is probably the wrong shape.
+- **One adapter means a hypothetical seam. Two adapters means a real one.** Don't introduce a seam unless something actually varies across it.
+
+## Relationships
+
+- A **Module** has exactly one **Interface** (the surface it presents to callers and tests).
+- **Depth** is a property of a **Module**, measured against its **Interface**.
+- A **Seam** is where a **Module**'s **Interface** lives.
+- An **Adapter** sits at a **Seam** and satisfies the **Interface**.
+- **Depth** produces **Leverage** for callers and **Locality** for maintainers.
+
+## Rejected framings
+
+- **Depth as ratio of implementation-lines to interface-lines** (Ousterhout): rewards padding the implementation. We use depth-as-leverage instead.
+- **"Interface" as the TypeScript `interface` keyword or a class's public methods**: too narrow — interface here includes every fact a caller must know.
+- **"Boundary"**: overloaded with DDD's bounded context. Say **seam** or **interface**.
diff --git a/.agents/skills/improve-codebase-architecture/SKILL.md b/.agents/skills/improve-codebase-architecture/SKILL.md
new file mode 100644
index 00000000..91d53c25
--- /dev/null
+++ b/.agents/skills/improve-codebase-architecture/SKILL.md
@@ -0,0 +1,80 @@
+---
+name: improve-codebase-architecture
+description: Find deepening opportunities in the codebase, informed by the domain language in docs/glossary.md and the architecture in docs/architecture.md. Use when the user wants to improve architecture, find refactoring opportunities, consolidate tightly-coupled modules, or make a codebase more testable and AI-navigable.
+---
+
+# Improve Codebase Architecture
+
+Surface architectural friction and propose **deepening opportunities** — refactors that turn shallow modules into deep ones. The aim is testability and AI-navigability.
+
+## Glossary
+
+Use these terms exactly in every suggestion. Consistent language is the point — don't drift into "component," "service," "API," or "boundary." Full definitions in [LANGUAGE.md](LANGUAGE.md).
+
+- **Module** — anything with an interface and an implementation (function, class, package, slice).
+- **Interface** — everything a caller must know to use the module: types, invariants, error modes, ordering, config. Not just the type signature.
+- **Implementation** — the code inside.
+- **Depth** — leverage at the interface: a lot of behaviour behind a small interface. **Deep** = high leverage. **Shallow** = interface nearly as complex as the implementation.
+- **Seam** — where an interface lives; a place behaviour can be altered without editing in place. (Use this, not "boundary.")
+- **Adapter** — a concrete thing satisfying an interface at a seam.
+- **Leverage** — what callers get from depth.
+- **Locality** — what maintainers get from depth: change, bugs, knowledge concentrated in one place.
+
+Key principles (see [LANGUAGE.md](LANGUAGE.md) for the full list):
+
+- **Deletion test**: imagine deleting the module. If complexity vanishes, it was a pass-through. If complexity reappears across N callers, it was earning its keep.
+- **The interface is the test surface.**
+- **One adapter = hypothetical seam. Two adapters = real seam.**
+
+This skill is _informed_ by the project's domain model. The domain language in [`docs/glossary.md`](../../../docs/glossary.md) gives names to good seams; the layering described in [`docs/architecture.md`](../../../docs/architecture.md) records the structural decisions the skill should not re-litigate.
+
+## Process
+
+### 1. Explore
+
+Read [`docs/glossary.md`](../../../docs/glossary.md) (canonical domain terms) and the relevant section of [`docs/architecture.md`](../../../docs/architecture.md) (canonical layering / wiring) first.
+
+Then walk the codebase via [`codemap`](../codemap/SKILL.md) — the structural SQLite index. Per the [`codemap` rule](../../rules/codemap.md), querying the index beats grepping for symbol-shaped questions:
+
+```bash
+codemap query --json "SELECT name, signature, file_path FROM symbols WHERE file_path LIKE 'src/cli/%' AND kind = 'function'"
+codemap query --json "SELECT from_path, COUNT(*) AS deps FROM dependencies GROUP BY from_path ORDER BY deps DESC LIMIT 10"
+codemap query --json -r barrel-files
+```
+
+Don't follow rigid heuristics — explore organically and note where you experience friction:
+
+- Where does understanding one concept require bouncing between many small modules?
+- Where are modules **shallow** — interface nearly as complex as the implementation?
+- Where have pure functions been extracted just for testability, but the real bugs hide in how they're called (no **locality**)?
+- Where do tightly-coupled modules leak across their seams?
+- Which parts of the codebase are untested, or hard to test through their current interface?
+
+Apply the **deletion test** to anything you suspect is shallow: if you deleted it, would its complexity vanish (it was a pass-through) or reappear across its callers (it was earning its keep)? A pass-through is the signal you want.
+
+### 2. Present candidates
+
+Present a numbered list of deepening opportunities. For each candidate:
+
+- **Files** — which files/modules are involved
+- **Problem** — why the current architecture is causing friction
+- **Solution** — plain English description of what would change
+- **Benefits** — explained in terms of locality and leverage, and also in how tests would improve
+
+**Use [`docs/glossary.md`](../../../docs/glossary.md) vocabulary for the domain, and [LANGUAGE.md](LANGUAGE.md) vocabulary for the architecture.** If the glossary defines `barrel file`, talk about "the barrel-file detection module" — not "the FooBarHandler," and not "the barrel service."
+
+**Architecture conflicts**: if a candidate contradicts [`docs/architecture.md` § Layering](../../../docs/architecture.md#layering), only surface it when the friction is real enough to warrant revisiting that layering. Mark it clearly (e.g. _"contradicts architecture.md § Layering — but worth reopening because…"_). Don't list every theoretical refactor the layering forbids.
+
+Do NOT propose interfaces yet. Ask the user: "Which of these would you like to explore?"
+
+### 3. Grilling loop
+
+Once the user picks a candidate, drop into a grilling conversation (per [`grill-me`](../grill-me/SKILL.md)). Walk the design tree with them — constraints, dependencies, the shape of the deepened module, what sits behind the seam, what tests survive.
+
+Side effects happen inline as decisions crystallize:
+
+- **Naming a deepened module after a concept not in `docs/glossary.md`?** Add the term to the glossary right there per [`docs/README.md` Rule 9](../../../docs/README.md). Disambiguations (TS shape vs SQL table, etc.) take priority.
+- **Sharpening a fuzzy term during the conversation?** Update `docs/glossary.md` right there.
+- **Surfacing a structural decision worth recording?** If the candidate becomes a planned refactor, draft `docs/plans/<topic>.md` per [`docs/README.md` Rule 3](../../../docs/README.md). Codemap doesn't ship ADRs — decisions of record lift into [`docs/architecture.md`](../../../docs/architecture.md) on ship per [`docs/README.md` Rule 2](../../../docs/README.md), and the plan file is deleted.
+- **Want to explore alternative interfaces for the deepened module?** See [INTERFACE-DESIGN.md](INTERFACE-DESIGN.md).
+- **Sub-rules for what counts as a "deepening" candidate**: see [DEEPENING.md](DEEPENING.md).
diff --git a/.cursor/skills/grill-me b/.cursor/skills/grill-me
new file mode 120000
index 00000000..eea91a86
--- /dev/null
+++ b/.cursor/skills/grill-me
@@ -0,0 +1 @@
+../../.agents/skills/grill-me
\ No newline at end of file
diff --git a/.cursor/skills/improve-codebase-architecture b/.cursor/skills/improve-codebase-architecture
new file mode 120000
index 00000000..be3dac9e
--- /dev/null
+++ b/.cursor/skills/improve-codebase-architecture
@@ -0,0 +1 @@
+../../.agents/skills/improve-codebase-architecture
\ No newline at end of file

From 1037469407d0e894c95b689177e9f147ca8fe504 Mon Sep 17 00:00:00 2001
From: Sutu Sebastian <sebiitv@gmail.com>
Date: Fri, 1 May 2026 11:14:50 +0300
Subject: [PATCH 3/4] docs(agents): adopt diagnose + write-a-skill skills
 (mattpocock)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two more Tier 3 maintainer-only skills sourced from mattpocock/skills:

diagnose — disciplined 6-phase loop for hard bugs and perf regressions
(reproduce → minimise → hypothesise → instrument → fix → cleanup). Core
thesis: "build the right feedback loop, and the bug is 90% fixed."
Translation: explore-the-codebase step now points at codemap (per the
codemap rule's STOP-before-grep) and docs/glossary.md (per Rule 9
canonical terms); ADR mention dropped (no ADR infra in this repo);
Phase-5 seam discipline cross-references improve-codebase-architecture
(adopted in 906ecba); Phase-6 cleanup includes a one-line lessons.md
append per the lessons-rule discipline. scripts/hitl-loop.template.sh
ships verbatim — no codemap-specific assumptions.

write-a-skill — meta-skill for creating new skills. Translation: front
section explicitly cites our agents-first-convention (file layout) and
agents-tier-system (tier choice + durability), plus the
maintainer-only-vs-shipped distinction (precedent: PR #25). Examples
cite codemap precedents (improve-codebase-architecture for the
companion-files split; pr-comment-fact-check for single-file). Review
checklist adapted: tier choice + rule-pairing decision + tier-list
update added.

Both adopted as maintainer-only (.agents/skills/ + .cursor/skills/
symlinks per agents-first-convention). Not added to templates/agents/
— consumer surface stays codemap-skill-only.

agents-tier-system Tier 3 list updated: diagnose, write-a-skill added
(alongside grill-me + improve-codebase-architecture from the prior
commit). Skipped grill-with-docs (requires standing up CONTEXT.md /
docs/adr/ infra; conflicts with codemap's lift-to-architecture-and-
delete-the-plan lifecycle).
---
 .agents/rules/agents-tier-system.md           |   2 +-
 .agents/skills/diagnose/SKILL.md              | 116 ++++++++++++
 .../diagnose/scripts/hitl-loop.template.sh    |  41 ++++
 .agents/skills/write-a-skill/SKILL.md         | 176 ++++++++++++++++++
 .cursor/skills/diagnose                       |   1 +
 .cursor/skills/write-a-skill                  |   1 +
 6 files changed, 336 insertions(+), 1 deletion(-)
 create mode 100644 .agents/skills/diagnose/SKILL.md
 create mode 100755 .agents/skills/diagnose/scripts/hitl-loop.template.sh
 create mode 100644 .agents/skills/write-a-skill/SKILL.md
 create mode 120000 .cursor/skills/diagnose
 create mode 120000 .cursor/skills/write-a-skill

diff --git a/.agents/rules/agents-tier-system.md b/.agents/rules/agents-tier-system.md
index 8a7ac913..d2910c61 100644
--- a/.agents/rules/agents-tier-system.md
+++ b/.agents/rules/agents-tier-system.md
@@ -50,7 +50,7 @@ Today's Tier-2 rules:
 
 Pure intent-triggered. The skill description is detailed enough that Cursor surfaces it on relevant phrases. No always-on cost.
 
-Skills stay rule-less when the work is **explicitly invoked** by the user, not pattern-triggered. Today: `audit-pr-architecture`, `docs-governance`, `docs-lifecycle-sweep`, `grill-me`, `improve-codebase-architecture`. (Skills like `gritql-codemods` and `ubiquitous-language` would also fit this tier if adopted.)
+Skills stay rule-less when the work is **explicitly invoked** by the user, not pattern-triggered. Today: `audit-pr-architecture`, `diagnose`, `docs-governance`, `docs-lifecycle-sweep`, `grill-me`, `improve-codebase-architecture`, `write-a-skill`. (Skills like `gritql-codemods` and `ubiquitous-language` would also fit this tier if adopted.)
 
 ## Authoring guidelines
 
diff --git a/.agents/skills/diagnose/SKILL.md b/.agents/skills/diagnose/SKILL.md
new file mode 100644
index 00000000..50278d51
--- /dev/null
+++ b/.agents/skills/diagnose/SKILL.md
@@ -0,0 +1,116 @@
+---
+name: diagnose
+description: Disciplined diagnosis loop for hard bugs and performance regressions. Build a feedback loop → reproduce → hypothesise → instrument → fix + regression-test → cleanup. Use when user says "diagnose this" / "debug this", reports a bug, says something is broken/throwing/failing, or describes a performance regression.
+---
+
+# Diagnose
+
+A discipline for hard bugs. Skip phases only when explicitly justified.
+
+When exploring the codebase, query [`codemap`](../codemap/SKILL.md) (the structural SQLite index) before reaching for `Grep` or `Read` per the [`codemap` rule](../../rules/codemap.md) — symbol-shaped questions ("where is X defined?", "what calls X?") have direct answers in the `symbols` / `calls` tables. Read the relevant section of [`docs/architecture.md`](../../../docs/architecture.md) to ground the mental model of layering, and check [`docs/glossary.md`](../../../docs/glossary.md) for canonical domain terms (file types, recipe ids, schema columns).
+
+## Phase 1 — Build a feedback loop
+
+**This is the skill.** Everything else is mechanical. If you have a fast, deterministic, agent-runnable pass/fail signal for the bug, you will find the cause — bisection, hypothesis-testing, and instrumentation all just consume that signal. If you don't have one, no amount of staring at code will save you.
+
+Spend disproportionate effort here. **Be aggressive. Be creative. Refuse to give up.**
+
+### Ways to construct one — try them in roughly this order
+
+1. **Failing test** at whatever seam reaches the bug — unit, integration, e2e. Codemap convention: `src/**/<name>.test.ts` for unit + integration; `fixtures/golden/` for query-shape regressions; `bun test <file>` runs them.
+2. **CLI invocation** with a fixture input, diffing stdout against a known-good snapshot. Examples: `bun src/index.ts query --json …` against `fixtures/minimal/`, golden runner under `scripts/query-golden.ts`.
+3. **Replay a captured trace.** Save a real `.codemap.db` / config / fixture file to disk; replay it through the code path in isolation.
+4. **Throwaway harness.** Spin up a minimal subset (one parser, one DB connection) that exercises the bug code path with a single function call.
+5. **Property / fuzz loop.** If the bug is "sometimes wrong output", run 1000 random inputs and look for the failure mode.
+6. **Bisection harness.** If the bug appeared between two known states (commit, dataset, version), automate "boot at state X, check, repeat" so you can `git bisect run` it.
+7. **Differential loop.** Run the same input through old-version vs new-version (or two configs) and diff outputs. The B.6 baseline machinery (`codemap query --save-baseline` / `--baseline`) is built for exactly this — use it.
+8. **HITL bash script.** Last resort. If a human must click or copy a value out of the IDE, drive _them_ with [`scripts/hitl-loop.template.sh`](scripts/hitl-loop.template.sh) so the loop is still structured. Captured output feeds back to you.
+
+Build the right feedback loop, and the bug is 90% fixed.
+
+### Iterate on the loop itself
+
+Treat the loop as a product. Once you have _a_ loop, ask:
+
+- Can I make it faster? (Cache setup, skip unrelated init, narrow the test scope.)
+- Can I make the signal sharper? (Assert on the specific symptom, not "didn't crash".)
+- Can I make it more deterministic? (Pin time, seed RNG, isolate filesystem, freeze network.)
+
+A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower.
+
+### Non-deterministic bugs
+
+The goal is not a clean repro but a **higher reproduction rate**. Loop the trigger 100×, parallelise, add stress, narrow timing windows, inject sleeps. A 50%-flake bug is debuggable; 1% is not — keep raising the rate until it's debuggable.
+
+### When you genuinely cannot build a loop
+
+Stop and say so explicitly. List what you tried. Ask the user for: (a) access to whatever environment reproduces it, (b) a captured artifact (HAR file, log dump, core dump, screen recording with timestamps, broken `.codemap.db`), or (c) permission to add temporary instrumentation. Do **not** proceed to hypothesise without a loop.
+
+Do not proceed to Phase 2 until you have a loop you believe in.
+
+## Phase 2 — Reproduce
+
+Run the loop. Watch the bug appear.
+
+Confirm:
+
+- [ ] The loop produces the failure mode the **user** described — not a different failure that happens to be nearby. Wrong bug = wrong fix.
+- [ ] The failure is reproducible across multiple runs (or, for non-deterministic bugs, reproducible at a high enough rate to debug against).
+- [ ] You have captured the exact symptom (error message, wrong output, slow timing) so later phases can verify the fix actually addresses it.
+
+Do not proceed until you reproduce the bug.
+
+## Phase 3 — Hypothesise
+
+Generate **3–5 ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea.
+
+Each hypothesis must be **falsifiable**: state the prediction it makes.
+
+> Format: "If `<X>` is the cause, then `<Y>` will make the bug disappear / `<Z>` will make it worse."
+
+If you cannot state the prediction, the hypothesis is a vibe — discard or sharpen it.
+
+**Show the ranked list to the user before testing.** They often have domain knowledge that re-ranks instantly ("we just changed #3"), or know hypotheses they've already ruled out. Cheap checkpoint, big time saver. Don't block on it — proceed with your ranking if the user is AFK.
+
+## Phase 4 — Instrument
+
+Each probe must map to a specific prediction from Phase 3. **Change one variable at a time.**
+
+Tool preference:
+
+1. **Debugger / REPL inspection** if the env supports it. One breakpoint beats ten logs.
+2. **Targeted logs** at the boundaries that distinguish hypotheses.
+3. Never "log everything and grep".
+
+**Tag every debug log** with a unique prefix, e.g. `[DEBUG-a4f2]`. Cleanup at the end becomes a single grep. Untagged logs survive; tagged logs die.
+
+**Perf branch.** For performance regressions, logs are usually the wrong tool. Instead: establish a baseline measurement (timing harness, `performance.now()`, profiler, query plan, `--performance` flag for index runs), then bisect. Measure first, fix second.
+
+## Phase 5 — Fix + regression test
+
+Write the regression test **before the fix** — but only if there is a **correct seam** for it (per the [`improve-codebase-architecture`](../improve-codebase-architecture/SKILL.md) vocabulary).
+
+A correct seam is one where the test exercises the **real bug pattern** as it occurs at the call site. If the only available seam is too shallow (single-caller test when the bug needs multiple callers, unit test that can't replicate the chain that triggered the bug), a regression test there gives false confidence.
+
+**If no correct seam exists, that itself is the finding.** Note it. The codebase architecture is preventing the bug from being locked down. Flag this for the next phase.
+
+If a correct seam exists:
+
+1. Turn the minimised repro into a failing test at that seam.
+2. Watch it fail.
+3. Apply the fix.
+4. Watch it pass.
+5. Re-run the Phase 1 feedback loop against the original (un-minimised) scenario.
+
+## Phase 6 — Cleanup + post-mortem
+
+Required before declaring done:
+
+- [ ] Original repro no longer reproduces (re-run the Phase 1 loop)
+- [ ] Regression test passes (or absence of seam is documented)
+- [ ] All `[DEBUG-…]` instrumentation removed (`grep` the prefix)
+- [ ] Throwaway prototypes deleted (or moved to a clearly-marked debug location)
+- [ ] The hypothesis that turned out correct is stated in the commit / PR message — so the next debugger learns
+- [ ] If the post-mortem yields a permanent insight, append a one-line entry to [`.agents/lessons.md`](../../lessons.md) per the lessons-rule discipline
+
+**Then ask: what would have prevented this bug?** If the answer involves architectural change (no good test seam, tangled callers, hidden coupling), hand off to [`improve-codebase-architecture`](../improve-codebase-architecture/SKILL.md) with the specifics. Make the recommendation **after** the fix is in, not before: you have more information now than when you started.
diff --git a/.agents/skills/diagnose/scripts/hitl-loop.template.sh b/.agents/skills/diagnose/scripts/hitl-loop.template.sh
new file mode 100755
index 00000000..b67c86bf
--- /dev/null
+++ b/.agents/skills/diagnose/scripts/hitl-loop.template.sh
@@ -0,0 +1,41 @@
+#!/usr/bin/env bash
+# Human-in-the-loop reproduction loop.
+# Copy this file, edit the steps below, and run it.
+# The agent runs the script; the user follows prompts in their terminal.
+#
+# Usage:
+#   bash hitl-loop.template.sh
+#
+# Two helpers:
+#   step    "<instruction>"         → show instruction, wait for Enter
+#   capture VAR "<question>"        → show question, read response into VAR
+#
+# At the end, captured values are printed as KEY=VALUE for the agent to parse.
+
+set -euo pipefail
+
+step() {
+  printf '\n>>> %s\n' "$1"
+  read -r -p "    [Enter when done] " _
+}
+
+capture() {
+  local var="$1" question="$2" answer
+  printf '\n>>> %s\n' "$question"
+  read -r -p "    > " answer
+  printf -v "$var" '%s' "$answer"
+}
+
+# --- edit below ---------------------------------------------------------
+
+step "Open the app at http://localhost:3000 and sign in."
+
+capture ERRORED "Click the 'Export' button. Did it throw an error? (y/n)"
+
+capture ERROR_MSG "Paste the error message (or 'none'):"
+
+# --- edit above ---------------------------------------------------------
+
+printf '\n--- Captured ---\n'
+printf 'ERRORED=%s\n' "$ERRORED"
+printf 'ERROR_MSG=%s\n' "$ERROR_MSG"
diff --git a/.agents/skills/write-a-skill/SKILL.md b/.agents/skills/write-a-skill/SKILL.md
new file mode 100644
index 00000000..a0f9611d
--- /dev/null
+++ b/.agents/skills/write-a-skill/SKILL.md
@@ -0,0 +1,176 @@
+---
+name: write-a-skill
+description: Create new agent skills with proper structure, progressive disclosure, and bundled resources. Use when user wants to create, write, or build a new skill (or asks "how do I write a skill?", "draft a SKILL.md for X").
+---
+
+# Writing Skills
+
+Discipline for authoring `.agents/skills/<name>/SKILL.md` files in this repo.
+
+## Repo conventions you must respect
+
+Before drafting any skill in codemap, internalise these (they trump anything in this skill):
+
+- **File layout** — [`agents-first-convention`](../../rules/agents-first-convention.md): the source-of-truth file is `.agents/skills/<name>/SKILL.md`; the `.cursor/skills/<name>` entry is a **symlink** back. Never put original content under `.cursor/`.
+- **Tier choice** — [`agents-tier-system`](../../rules/agents-tier-system.md): every new skill is Tier 1 (always-on, paired with a rule), Tier 2 (auto-attached to a glob, paired with a rule), or Tier 3 (discoverable, no rule). **Skills with `NEVER` / `ALWAYS` clauses deserve a rule pairing.** Pure intent-trigger skills (no hard "must" clauses) stay Tier 3.
+- **Maintainer-only vs shipped** — `.agents/skills/` is the dev-side mirror; `templates/agents/skills/` is what `codemap agents init` ships to npm consumers. The bundled template surface today is **only** the `codemap` skill — every other skill in `.agents/skills/` is maintainer-only (precedent: PR #25). Don't add a skill to `templates/agents/` unless it's something every consumer of the published package would want.
+
+## Process
+
+### 1. Gather requirements
+
+Ask the user:
+
+- What task / domain does the skill cover?
+- What specific use cases should it handle?
+- Does it need executable scripts (under `scripts/`) or just instructions?
+- Any reference materials to include?
+- **Tier choice**: does the skill have always-on principles (any `NEVER` / `ALWAYS` clauses)? If yes, it deserves a Tier-1 or Tier-2 rule pairing per [`agents-tier-system`](../../rules/agents-tier-system.md).
+
+### 2. Draft the skill
+
+Create:
+
+- `SKILL.md` with concise instructions (under 100 lines if possible; see "When to split files" below)
+- Companion files (`LANGUAGE.md`, `REFERENCE.md`, `EXAMPLES.md`, etc.) when content exceeds 100 lines or has distinct domains
+- `scripts/<name>.{sh,ts}` when a deterministic operation is invoked repeatedly (saves tokens vs generated code)
+
+Use [`grill-me`](../grill-me/SKILL.md) on yourself to surface decisions before you write — what's the trigger phrase shape? What's the boundary with adjacent skills? What's the durability test (does this skill still read correctly six months from now)?
+
+### 3. Wire the file layout
+
+```bash
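+# Source of truth ("my-skill" is a placeholder; set NAME to the new skill's directory name)
+NAME=my-skill; mkdir -p ".agents/skills/$NAME" && touch ".agents/skills/$NAME/SKILL.md"
+
+# Cursor symlink (per agents-first-convention)
+ln -s "../../.agents/skills/$NAME" ".cursor/skills/$NAME"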
+```
+
+### 4. Update the tier list
+
+Add the skill to the relevant list in [`agents-tier-system.md`](../../rules/agents-tier-system.md) so the inventory stays accurate.
+
+### 5. Review
+
+Ask the user:
+
+- Does this cover your use cases?
+- Anything missing or unclear?
+- Should any section be more / less detailed?
+
+Run the [Review checklist](#review-checklist) before declaring done.
+
+## Skill structure
+
+```text
+.agents/skills/<name>/
+├── SKILL.md              # Main instructions (required)
+├── LANGUAGE.md           # Vocabulary the skill enforces (if any)
+├── REFERENCE.md          # Detailed docs (if SKILL.md exceeds ~100 lines)
+├── EXAMPLES.md           # Usage examples (if needed)
+└── scripts/              # Utility scripts (if needed)
+    └── helper.sh
+```
+
+## SKILL.md template
+
+```md
+---
+name: skill-name
+description: Brief description of capability. Use when [specific triggers — verbs and nouns the user is likely to say, plus contexts where the skill applies].
+---
+
+# Skill Name
+
+## Quick start
+
+[Minimal working example — what the user does on first invocation]
+
+## Workflows
+
+[Step-by-step processes with checklists for complex tasks]
+
+## Advanced features
+
+[Link to companion files: See [REFERENCE.md](REFERENCE.md) / [LANGUAGE.md](LANGUAGE.md)]
+```
+
+## Description requirements
+
+The description is **the only thing the agent sees** when deciding which skill to load. It's surfaced in the discoverable-skills list alongside every other installed skill. Get this right or your skill never fires.
+
+**Goal**: Give the agent just enough info to know:
+
+1. What capability this skill provides
+2. When / why to trigger it (specific keywords, contexts, file types)
+
+**Format**:
+
+- Max ~1024 chars
+- Write in third person
+- First sentence: what it does
+- Second sentence: "Use when [specific triggers]"
+- Include the verbs and nouns the user is likely to say (per [`agents-tier-system` § Tier 3 description](../../rules/agents-tier-system.md))
+
+**Good example**:
+
+```text
+Triage and fact-check PR review comments against the actual codebase, project rules, and skills. Use when the user asks to address PR comments, respond to reviewer feedback, check if a comment is correct, fact-check a reviewer's claim, decide which comments to push back on, or sort hallucinated suggestions from real ones. Triggers on phrases like "check PR comments", "are these comments right".
+```
+
+**Bad example**:
+
+```text
+Helps with PRs.
+```
+
+The bad example gives the agent no way to distinguish this from any other PR-adjacent skill.
+
+## When to add scripts
+
+Add utility scripts under `scripts/` when:
+
+- Operation is deterministic (validation, formatting, bisection harness)
+- Same code would be generated repeatedly across invocations
+- Errors need explicit handling that's tedious to re-derive
+
+Scripts save tokens and improve reliability vs generated code.
+
+## When to split files
+
+Split into companion files when:
+
+- `SKILL.md` exceeds ~100 lines
+- Content has distinct domains (vocabulary vs process vs templates)
+- Advanced features are rarely needed and would balloon the main file
+
+Codemap precedents:
+
+- [`improve-codebase-architecture`](../improve-codebase-architecture/SKILL.md) splits into `LANGUAGE.md` (vocab), `DEEPENING.md` (sub-rules), `INTERFACE-DESIGN.md` (parallel-sub-agent pattern).
+- [`pr-comment-fact-check`](../pr-comment-fact-check/SKILL.md) stays single-file because every section is in-flow process.
+
+## Durability discipline
+
+Per [`agents-tier-system` § Authoring discipline: durability](../../rules/agents-tier-system.md):
+
+- **Don't cite specific audit / plan / research filenames as canonical examples.** Plans are mortal under [`docs-lifecycle-sweep`](../docs-lifecycle-sweep/SKILL.md). Use shape placeholders (`<topic>.md`) instead.
+- **Don't cite specific commit hashes or PR numbers as the only path to context.** Summarise inline.
+- **Don't cite source-code line numbers.** Reference symbols by name.
+
+If the skill still reads correctly six months from now, even after every doc you didn't write has been rewritten, it's durable.
+
+## Review checklist
+
+After drafting, verify:
+
+- [ ] Description includes triggers ("Use when…")
+- [ ] `SKILL.md` under 100 lines OR has split companion files
+- [ ] No time-sensitive info (no "as of 2026-04…")
+- [ ] Consistent terminology — drift kills clarity
+- [ ] Concrete examples included
+- [ ] Cross-references one level deep (don't chain `SKILL.md → REFERENCE.md → DEEP-DIVE.md → REFERENCE2.md`)
+- [ ] File layout follows [`agents-first-convention`](../../rules/agents-first-convention.md) (`.agents/` source + `.cursor/` symlink)
+- [ ] Tier choice documented per [`agents-tier-system`](../../rules/agents-tier-system.md); rule pairing added if the skill has `NEVER` / `ALWAYS` clauses
+- [ ] Skill listed in the appropriate tier section of `agents-tier-system.md`
+- [ ] Decision recorded in the PR description: maintainer-only (`.agents/` only) vs shipped (`templates/agents/` too)
diff --git a/.cursor/skills/diagnose b/.cursor/skills/diagnose
new file mode 120000
index 00000000..7d4b7c9e
--- /dev/null
+++ b/.cursor/skills/diagnose
@@ -0,0 +1 @@
+../../.agents/skills/diagnose
\ No newline at end of file
diff --git a/.cursor/skills/write-a-skill b/.cursor/skills/write-a-skill
new file mode 120000
index 00000000..8e09e460
--- /dev/null
+++ b/.cursor/skills/write-a-skill
@@ -0,0 +1 @@
+../../.agents/skills/write-a-skill
\ No newline at end of file

From c2b16c29a3f290a5a8851d35847902b4b6da4b25 Mon Sep 17 00:00:00 2001
From: Sutu Sebastian <sebiitv@gmail.com>
Date: Fri, 1 May 2026 12:39:15 +0300
Subject: [PATCH 4/4] docs(plans): refine codemap-audit plan via grill-me
 dogfood (8 Qs resolved)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Used the just-adopted grill-me skill against the plan to surface
decisions I'd waved away in the first draft. Eight Qs walked, all
agreed answers written into the plan inline (per grill-me discipline).

Q1 Snapshot strategy → ship BOTH modes; v1 = --baseline <name>
   (B.6 reuse), v1.x = --base <ref> (worktree+reindex). Mutex.
   Was: Option A only. Now: both shipped, B first because it's the
   real use case (delta-against-saved-state) and reuses existing
   infra without worktree code.

Q2 Built-in deltas → 3 deltas in v1 (files / dependencies /
   deprecated), not 6. visibility / barrels / hot_files / cycles /
   boundary_crossings / markers / css_* deferred with explicit
   triggers to revisit. Was: padding the v1 surface.

Q3 Verdict shape → DROPPED from v1 entirely. No `verdict` field, no
   exit codes 1/2, no codemap.config.audit schema. Raw deltas only.
   Consumers compose --json + jq for CI exit codes; v1.x ships
   verdict + thresholds when real consumer config patterns emerge.
   Was: opt-in thresholds with always-pass default (placeholder).

Q4 Diff identity → each delta defines its own canonical projection
   (fixed SELECT … ORDER BY) and validates baseline column-set
   membership before diffing. Baseline's stored `sql` is informational.
   Isolates audit from underlying-table schema drift (e.g. the v4→v5
   visibility column bump).

Q5 Terminal output → `git status`-style: terse on no-drift, per-delta
   tables only when non-empty. --summary collapses to one line.
   Mirrors the JSON shape with row-arrays-vs-counts distinction.

Q6 File layout → split src/cli/cmd-audit.ts (CLI) + src/application/
   audit-engine.ts (engine). Mirrors existing cmd-index.ts ↔
   index-engine.ts seam. Engine testable independent of CLI shape;
   v1.x --base <ref> slice becomes mechanical.

Q7 Same-SHA warning → NO. The renderer's metadata header already
   exposes baseline.git_ref, so the user can spot the case. A
   heuristic warning would add noise and nontrivial code for little
   signal. The settled open question is struck from §9.

Q8 Index freshness → audit auto-runs `runCodemapIndex({mode:
   "incremental"})` as a prelude. Same discipline as the codemap
   rule's "re-index after editing source." Sub-second when no source
   changed. --no-index opts out for the rare frozen-DB CI case.

Net result: v1 is meaningfully smaller (3 deltas, no verdict, no
schema additions, no worktree code) and ships faster (~½ day across
~6 commits). Each deferred piece has an explicit trigger so the
deferral isn't open-ended hand-waving.
---
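A minimal TypeScript sketch of the Q4 flow described above: validate
baseline column-set membership, project to the canonical columns, then
multiset-diff on JSON.stringify identity. Every name below (DeltaSpec,
projectRow, validateColumns, diffProjected, computeDelta) is
hypothetical and for illustration only. The plan's real helpers are
getQueryBaseline and diffRows; their signatures are not shown in this
patch, so the sketch deliberately does not call them.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not codemap's API.
type Row = Record<string, unknown>;

interface DeltaSpec {
  key: string;
  requiredColumns: string[]; // canonical projection columns, in fixed order
}

// Keep only the canonical columns, in canonical order, so that
// JSON.stringify gives the same identity regardless of source key order.
function projectRow(row: Row, columns: string[]): Row {
  const out: Row = {};
  for (const col of columns) out[col] = row[col];
  return out;
}

// Assumes baseline rows share one column set (as SQL result rows do),
// so only the first row is inspected. An empty baseline has nothing to check.
function validateColumns(spec: DeltaSpec, baselineName: string, rows: Row[]): void {
  const first = rows[0];
  if (first === undefined) return;
  const actual = Object.keys(first);
  const missing = spec.requiredColumns.filter((c) => !actual.includes(c));
  if (missing.length > 0) {
    throw new Error(
      `codemap audit: baseline "${baselineName}" is missing required columns ` +
        `for delta "${spec.key}": got [${actual.join(", ")}], ` +
        `need [${spec.requiredColumns.join(", ")}].`,
    );
  }
}

// Multiset diff: a row that appears twice on one side and once on the
// other yields exactly one added/removed entry.
function diffProjected(baseline: Row[], current: Row[]): { added: Row[]; removed: Row[] } {
  const counts = new Map<string, number>();
  for (const row of baseline) {
    const id = JSON.stringify(row);
    counts.set(id, (counts.get(id) ?? 0) + 1);
  }
  const added: Row[] = [];
  for (const row of current) {
    const id = JSON.stringify(row);
    const n = counts.get(id) ?? 0;
    if (n > 0) counts.set(id, n - 1);
    else added.push(row);
  }
  const removed: Row[] = [];
  for (const [id, n] of counts) {
    for (let i = 0; i < n; i++) removed.push(JSON.parse(id) as Row);
  }
  return { added, removed };
}

function computeDelta(
  spec: DeltaSpec,
  baselineName: string,
  baselineRows: Row[],
  currentRows: Row[],
): { added: Row[]; removed: Row[] } {
  validateColumns(spec, baselineName, baselineRows);
  const base = baselineRows.map((r) => projectRow(r, spec.requiredColumns));
  const head = currentRows.map((r) => projectRow(r, spec.requiredColumns));
  return diffProjected(base, head);
}
```

Projecting both sides before computing identity is what lets a baseline
with extra columns (or a different key order) still diff cleanly.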
 docs/plans/codemap-audit.md | 338 ++++++++++++++++++++++++++----------
 1 file changed, 243 insertions(+), 95 deletions(-)
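
A second sketch, for the Q5 output rule ("terse on no-drift", with
--summary collapsing row arrays to counts). The function and type names
(summarize, summaryLine, Delta, DeltaCounts) are hypothetical; only the
line format is taken from the plan's section 7.1 examples, and the
baseline metadata header line is omitted.

```typescript
// Hypothetical sketch: names are assumptions, not codemap's API.
type Row = Record<string, unknown>;
type Delta = { added: Row[]; removed: Row[] };
type DeltaCounts = { added: number; removed: number };

// Collapse each delta's row arrays to counts (the --summary --json shape).
function summarize(deltas: Record<string, Delta>): Record<string, DeltaCounts> {
  const out: Record<string, DeltaCounts> = {};
  for (const [key, d] of Object.entries(deltas)) {
    out[key] = { added: d.added.length, removed: d.removed.length };
  }
  return out;
}

// Render the one-line drift summary; terse when nothing changed.
function summaryLine(counts: Record<string, DeltaCounts>): string {
  const entries = Object.entries(counts);
  const drifted = entries.some(([, c]) => c.added + c.removed > 0);
  if (!drifted) {
    return `no drift across ${entries.map(([k]) => k).join(" / ")}.`;
  }
  return (
    "drift: " +
    entries.map(([k, c]) => `${k} +${c.added}/-${c.removed}`).join(", ")
  );
}
```

Deriving the terminal line from the same counts object as the JSON
summary keeps the two output modes from drifting apart.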

diff --git a/docs/plans/codemap-audit.md b/docs/plans/codemap-audit.md
index a9354c8c..75bef4bd 100644
--- a/docs/plans/codemap-audit.md
+++ b/docs/plans/codemap-audit.md
@@ -1,6 +1,6 @@
-# Plan — `codemap audit --base <ref>`
+# Plan — `codemap audit`
 
-> Two-snapshot structural-drift verdict for a PR / branch. Adopted from [`docs/research/fallow.md` § Tier B B.5](../research/fallow.md) — explicitly the "single highest-leverage candidate" of that scan.
+> Two-snapshot structural-drift audit for a PR / branch. **v1 ships `--baseline <name>`** (diff against a B.6 saved baseline); **v1.x adds `--base <ref>`** (worktree+reindex). Adopted from [`docs/research/fallow.md` § Tier B B.5](../research/fallow.md), which explicitly calls it the "single highest-leverage candidate" of that scan.
 
 **Status:** Open — design pass; not yet implemented.
 **Cross-refs:** [`docs/research/fallow.md`](../research/fallow.md) (motivation) · [`docs/architecture.md` § CLI usage](../architecture.md#cli-usage) (where wiring lands) · [`.agents/lessons.md`](../../.agents/lessons.md) (changesets bump policy).
@@ -9,26 +9,27 @@
 
 ## 1. Goal
 
-One command returns a structured verdict for what changed between a base ref and `HEAD`:
+One command returns the structural deltas between a saved snapshot (or a git ref) and the current `HEAD` index:
 
 ```text
-codemap audit --base origin/main [--json] [--summary]
+codemap audit --baseline <name>     # diff vs a B.6-style saved baseline (v1)
+codemap audit --base <ref>          # diff vs a worktree+reindex of <ref> (v1.x)
 ↓
 {
-  "verdict": "pass" | "warn" | "fail",
-  "base": { "ref": "origin/main", "sha": "<sha>", "indexed_at": <ms> },
+  "base": { "source": "baseline" | "ref", "name": "...", "sha": "...", "indexed_at": <ms> },
   "head": { "sha": "<sha>", "indexed_at": <ms> },
   "deltas": {
     "files":        { "added": [...], "removed": [...] },
     "dependencies": { "added": [...], "removed": [...] },
-    "deprecated":   { "added": [...], "removed": [...] },
-    "visibility":   { "added": [...], "removed": [...] },
-    "barrels":      { "movements": [...] },
-    "hot_files":    { "movements": [...] }
+    "deprecated":   { "added": [...], "removed": [...] }
   }
 }
 ```
 
+**v1 ships raw deltas only** — no `verdict` field, exit 0 on success regardless of delta size. A native verdict (`pass | warn | fail` with `codemap.config.audit` thresholds) is a v1.x slice; until then, consumers compose `--json` + `jq` for CI exit codes (one-liner). Rationale in [§5 Verdict shape](#5-verdict-shape).
+
+**v1 auto-runs an incremental index before every audit** so `head` reflects the current source tree. `--no-index` opts out (audit a frozen DB). Rationale in [§7 CLI surface](#7-cli-surface).
+
 Wraps existing recipes; doesn't grow a new analysis layer. Stays consistent with codemap's structural-index thesis ([`docs/why-codemap.md` § What Codemap is not](../why-codemap.md#what-codemap-is-not)).
 
 ## 2. Non-goals (v1)
@@ -39,150 +40,297 @@ Wraps existing recipes; doesn't grow a new analysis layer. Stays consistent with
 - **Cross-repo audit** (audit `origin/main` of project A from a checkout of project B). Out of scope; reuse `--root` for the simpler "audit a different tree" case.
 - **Continuous mode.** One-shot CLI, same as `codemap query`.
 
-## 3. Snapshot strategy
+## 3. Snapshot strategy — two modes, ship Option B first
 
-The verdict is a diff between two indexed snapshots. Three credible architectures:
+The audit output is a diff between two indexed snapshots. There are two valid sources for the "before" snapshot, and they solve subtly different problems, **so codemap audit ships both modes** (mutually exclusive; pick one per invocation).
 
-### Option A: Temp DB on the base ref (worktree-style)
+| Mode                     | Best at                                                                                                                                                                                                                 | CLI                               |
+| ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------- |
+| **B — baseline reuse**   | "What's drifted vs a snapshot I deliberately took **then**" — fast, no cold reindex, reproducible because the snapshot is frozen in `.codemap.db`                                                                       | `codemap audit --baseline <name>` |
+| **A — worktree+reindex** | "What's drifted vs an arbitrary ref I name **now**" — no pre-baseline needed, but spawns a worktree + full reindex per audit, and is sensitive to clone staleness (`origin/main` may be hours behind the actual remote) | `codemap audit --base <ref>`      |
 
-```text
-1. git worktree add /tmp/codemap-audit-<sha> <base-ref>
-2. codemap --root /tmp/codemap-audit-<sha> --full   # builds .codemap.db there
-3. Open both DBs, run delta queries cross-DB, emit verdict.
-4. git worktree remove /tmp/codemap-audit-<sha>
-```
+### Decision: ship **Option B first** (v1), Option A in v1.x
 
-**Pros:** Same code path as a normal index run on the base; no special "snapshot" abstraction; deltas are pure SQL across two attached DBs; reproducible regardless of how `HEAD` evolves.
+Reasons:
 
-**Cons:** Spawns a worktree + full reindex per audit (cold cost ~seconds for codemap-sized projects, more for large monorepos). Disk churn under `/tmp`.
+1. **Cheaper to ship.** Option B reuses the B.6 `query_baselines` table verbatim — no worktree code, no cold-reindex perf concern, no `git fetch` staleness handling.
+2. **Most acute pain is delta-against-saved-state.** Real workflow: `codemap query --save-baseline -r <recipe>` on `main` → branch → refactor → `codemap audit --baseline <recipe>`. This is what B.6 was built for; audit just collapses recipe-by-recipe baselines into one delta envelope.
+3. **`--base <ref>` is genuinely a different shape.** It needs a fetch-or-fail prelude, a worktree spawn, a temp `.codemap.db` build, and cleanup. Each adds CLI surface and bug surface; deferring lets us validate the delta shape (and, later, any verdict / threshold shape) under B before committing to the worktree path.
+4. **Cache benefit of Option A only matters at scale.** Codemap-sized projects index in sub-second; the cache benefit of `<sha> → /tmp/codemap-audit-<sha>/.codemap.db` only pays back on multi-thousand-file repos. Defer until a real consumer hits it.
 
-### Option B: In-memory base via the existing `query_baselines` table (B.6 reuse)
+### Option C: dropped
 
-```text
-1. On main, periodically: for each "tracked" recipe, codemap query --save-baseline -r <id>.
-2. On a PR branch: codemap audit --base <name> diffs the live query results against the saved snapshots.
-```
+An earlier draft included a third "on-demand snapshot table" hybrid. Killed during planning: it's a mini-indexer that doesn't transfer to other use cases and adds the code-volume of Option A without its conceptual simplicity. Revisit only if both A and B prove insufficient.
 
-**Pros:** Zero new infra — reuses B.6 directly. Snapshots are addressable / nameable. No cold reindex.
+### v1 `--baseline` mechanics
 
-**Cons:** Requires baselines to be saved at the right moment (git-hook or CI step). Doesn't capture deltas the user didn't pre-baseline. Doesn't naturally express "deltas in the dependency graph as a whole" — only as far as recipes go.
+- The baseline must already exist in `query_baselines` (saved by `codemap query --save-baseline`). If not, exit 1 with `codemap: no baseline named "<name>". Use --baselines to list.` (same error shape as `codemap query --baseline`).
+- Audit doesn't introduce its own baseline-save side effect — the user explicitly opts in via `--save-baseline`. Single source of truth for "snapshot lives here" stays the B.6 surface.
+- The envelope's `base.source` is `"baseline"`; `base.name` is the baseline name; `base.sha` is the baseline's recorded `git_ref`; `base.indexed_at` is the baseline's `created_at`.
 
-### Option C: On-demand snapshot table for the audit (hybrid)
+### v1.x `--base <ref>` mechanics (when shipped later)
 
-```text
-1. codemap audit --base <ref> reads <ref> from git, computes audit-shaped queries against the
-   *checked-out* tree at <ref> (using `git show <ref>:<file>` or `git archive` to materialise
-   files in memory / a temp dir), populates a tiny in-DB `audit_snapshot` table with just the
-   columns needed for the deltas (no full reindex).
-2. Diff in SQL; drop the snapshot table.
-```
+- Spawn a worktree under `.codemap.audit-<sha>/` (gitignored by the existing `.codemap.*` glob).
+- `codemap --full --root .codemap.audit-<sha>` builds the temp DB.
+- Diff queries run cross-DB; results are emitted in the same `{base, head, deltas}` envelope with `base.source = "ref"`.
+- Cleanup removes the worktree (cache decision deferred — see open questions §9).
+- `--base` and `--baseline` are mutex (one snapshot source per invocation).
 
-**Pros:** No worktree spawn; no extra infra in main code paths; deltas are scoped to what the audit needs.
+## 4. Built-in deltas (v1)
 
-**Cons:** Implementing a "mini-indexer" that runs only the queries we need at <ref> is more code than (A) and the abstraction doesn't transfer.
+Each delta wraps an existing query / recipe. All structural — no new analysis layer. **v1 ships three deltas only**; the rest are deferred (each carries an explicit trigger so we don't re-litigate from scratch).
 
-### Recommendation
+| Delta key      | What it surfaces                                         | Baseline source contract                                                                              |
+| -------------- | -------------------------------------------------------- | ----------------------------------------------------------------------------------------------------- |
+| `files`        | New / deleted indexed files                              | Baseline must come from `SELECT path FROM files` (or `--recipe files-hashes` — same `path` column).   |
+| `dependencies` | New / deleted edges in the file-to-file dependency graph | Baseline must come from `SELECT from_path, to_path FROM dependencies` (no `DISTINCT` — composite PK). |
+| `deprecated`   | New / removed `@deprecated` symbols                      | Baseline must come from `--recipe deprecated-symbols`.                                                |
 
-**Start with Option A** (temp worktree + full index). Reasons:
+### Delta function shape
 
-1. Simplest to implement correctly — no new abstractions; the existing `--full --root /tmp/...` path already works.
-2. Cold cost on codemap (~150 files) is sub-second; on JordanCoin-sized projects (~few thousand files) still under 5s. Acceptable for "run on PR" usage.
-3. Future optimisation: cache `<sha> → /tmp/codemap-audit-<sha>/.codemap.db` so repeated audits on the same base hit the cache.
-4. Doesn't entangle the audit with B.6's user-facing baseline workflow (which has different semantics: user-named, hand-saved).
+Each delta defines its own **canonical projection** (a fixed `SELECT … ORDER BY …`) and runs that projection on both sides of the diff. The baseline's stored `sql` is informational — **not replayed**. This isolates the audit from underlying-table schema drift (e.g. SCHEMA_VERSION 4 → 5 added `symbols.visibility`; baselines saved before the bump must still diff cleanly).
 
-**Reconsider Option B** if Option A's perf becomes a problem AND audits are happening in tight loops (e.g. file-watch trigger).
+Per-delta canonical projection:
 
-## 4. Built-in deltas (v1)
+| Delta          | Canonical SQL (run on both baseline-projection AND current DB)                                              |
+| -------------- | ----------------------------------------------------------------------------------------------------------- |
+| `files`        | `SELECT path FROM files ORDER BY path`                                                                      |
+| `dependencies` | `SELECT from_path, to_path FROM dependencies ORDER BY from_path, to_path`                                   |
+| `deprecated`   | `SELECT name, kind, file_path FROM symbols WHERE doc_comment LIKE '%@deprecated%' ORDER BY file_path, name` |
+
+Each delta function:
+
+1. Loads the named baseline via `getQueryBaseline(db, name)` (B.6 helper from `db.ts`).
+2. Parses `rows_json` to row objects.
+3. **Validates baseline column-set membership.** The delta's canonical projection has a fixed required-columns list (e.g. `dependencies` requires `from_path`, `to_path`). If any required column is missing from the baseline rows, surface a clean error:
 
-Each delta wraps an existing query / recipe. All structural — no new analysis layer.
+   ```text
+   codemap audit: baseline "<name>" is missing required columns
+   for delta "<delta-key>": got [<actual>], need [<required>].
+   Re-save with: codemap query --save-baseline=<name> -r <recipe>
+   ```
 
-| Delta key      | What it surfaces                                                                                                                     | Source                                                                                                                     |
-| -------------- | ------------------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------- |
-| `files`        | New / deleted indexed files                                                                                                          | `SELECT path FROM files` (set diff)                                                                                        |
-| `dependencies` | New / deleted edges in the file-to-file dependency graph                                                                             | `SELECT from_path, to_path FROM dependencies` (set diff)                                                                   |
-| `deprecated`   | New / removed `@deprecated` symbols                                                                                                  | `--recipe deprecated-symbols` (set diff)                                                                                   |
-| `visibility`   | New / removed visibility-tagged symbols (`@internal` / `@beta` / `@alpha` / `@private` — `@public` is the surface itself, not noise) | `SELECT name, kind, visibility, file_path FROM symbols WHERE visibility IS NOT NULL AND visibility != 'public'` (set diff) |
-| `barrels`      | Files that crossed an export-count threshold (e.g. <10 → ≥10)                                                                        | `--recipe barrel-files` (compare top-N membership)                                                                         |
-| `hot_files`    | Files that gained / lost rank in the fan-in or fan-out top-15                                                                        | `--recipe fan-in` / `--recipe fan-out` (compare top-N membership)                                                          |
+4. **Projects baseline rows** to the canonical column subset (extra columns are dropped — agents can still inspect the full baseline via `codemap query --baselines`).
+5. Runs the canonical SQL against the current DB.
+6. Set-diffs via the existing `diffRows` helper from `cmd-query.ts` (multiset, identity = canonical `JSON.stringify(row)` over the projected columns).
+7. Returns `{added: [...], removed: [...]}` — projected rows only.
 
-**Out of v1** (reconsider once shipped):
+This means a baseline saved from `--recipe deprecated-symbols` (which returns 6 columns) and a baseline saved from a leaner ad-hoc `SELECT name, kind, file_path FROM symbols WHERE doc_comment LIKE '%@deprecated%'` both work — as long as the required column set is satisfied. Schema bumps that add columns also keep working — the projection drops the new columns. Schema bumps that remove a required column would break the delta — that's the intended behaviour (the delta's contract has changed).
 
-- `cycles` — needs cycle detection on the dependency graph; not a recipe today
-- `boundary_crossings` — needs a project-supplied glob list (similar to the future `audit-pr-architecture` skill kit); no canonical source
-- `markers` — TODO/FIXME drift is noisy and project-specific
-- `css_*` deltas — narrow audience; defer
+### Deferred — add later when needed
+
+| Delta                | Why deferred (v1)                                                                                                                                                         | Trigger to revisit                                                                            |
+| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
+| `visibility`         | Already covered by `codemap query --baseline visibility-tags` from B.6 directly; v1 audit doesn't add much on top.                                                        | A consumer wants visibility deltas in the same JSON envelope as `files` / `dependencies`.     |
+| `barrels`            | "Top-N membership change" has fuzzy threshold semantics ("rank movement" vs "joined / left top-20"). Defer until a clear semantic emerges from real use.                  | Two consumers ask for "this file just became a barrel" as a verdict-shaping signal.           |
+| `hot_files`          | Same fuzzy-threshold problem as `barrels` (fan-in / fan-out top-N movement).                                                                                              | Same.                                                                                         |
+| `cycles`             | Needs cycle detection on `dependencies`; not a recipe today.                                                                                                              | Cycle detection lands as a recipe (or PRAGMA-driven SQL); audit consumes it.                  |
+| `boundary_crossings` | Needs a project-supplied glob list (the [`audit-pr-architecture`](../../.agents/skills/audit-pr-architecture/SKILL.md) skill's § 2 territory); no canonical source today. | The `audit-pr-architecture` skill formalises a per-repo "boundaries" config codemap can read. |
+| `markers`            | TODO / FIXME drift is noisy and project-specific.                                                                                                                         | A consumer asks for it explicitly.                                                            |
+| `css_*` deltas       | Narrow audience.                                                                                                                                                          | Same.                                                                                         |
+
+**Adding a delta later is mechanical** (one delta function + one test + one doc note, plus one threshold-config field once v1.x thresholds exist). **Removing one is harder** (consumer scripts and, later, threshold configs depend on it; removing breaks user setups). Defer-by-default.
 
 ## 5. Verdict shape
 
-`pass | warn | fail` derived from per-delta thresholds. **Defaults exposed but conservative:**
+**v1 ships no `verdict` field.** Exit 0 on success regardless of delta size. The output envelope is `{base, head, deltas}` — adding `verdict` later is purely additive and forward-compatible.
+
+### Why no verdict in v1
+
+1. **Honesty about what we know.** Structural deltas don't have a universally-meaningful threshold ("how many new dependency edges is too many?" depends entirely on the project). Inventing defaults or shipping a placeholder both pretend we do.
+2. **Real consumers shape the config, not me guessing.** When two consumers ship `jq`-based CI scripts with similar threshold shapes, that pattern becomes the v1.x schema. Until then, no schema commitment.
+3. **fallow already covers the code-quality verdict use case.** A consumer who wants `pass/warn/fail` on dead code, dupes, or complexity runs `fallow audit --base origin/main` — that's fallow's product class ([`docs/roadmap.md` § Non-goals](../roadmap.md#non-goals-v1)). Codemap audit's job is the **structural-delta** signal fallow can't see (new dependency edges, new files, new `@deprecated` drift).
+4. **Cheap consumer-side bridge.** `codemap audit --baseline X --json | jq -e '.deltas.dependencies.added | length <= 50'` exits 1 when the threshold trips. CI-driven thresholds work today without codemap shipping the verdict.
 
-| Delta | Default threshold                               |
-| ----- | ----------------------------------------------- |
-| any   | `pass` (thresholds are opt-in via config in v1) |
+### v1.x trigger to revisit
 
-In other words: **v1 emits raw deltas only**. The verdict is always `pass` unless the user opts in via `codemap.config.*`. Reasoning: structural deltas don't have a universally-meaningful threshold ("how many new dependency edges is too many?" depends entirely on the project), and the research note explicitly biases toward "first pass exposes raw deltas only and lets the consumer set thresholds."
+Add the native verdict + threshold config when **either** of:
 
-### Threshold config (v1.x)
+- Two consumers independently ship `jq`-based threshold scripts with similar shapes (the pattern crystallises the config schema).
+- One consumer asks for native thresholds with a concrete config sketch.
 
-Once per-project use surfaces concrete thresholds, fold into `codemap.config.*`:
+### Sketch (informational, not v1 commitment)
+
+When the trigger fires, the shape will likely look like:
 
 ```ts
-// codemap.config.ts
+// codemap.config.ts (v1.x — NOT shipped in v1)
 export default defineConfig({
   audit: {
     deltas: {
       dependencies: { added_max: 50, action: "warn" },
       deprecated: { added_max: 0, action: "fail" }, // any new @deprecated fails
-      visibility: { added_max: 5, action: "warn" },
     },
     // verdict reduction: highest action wins (fail > warn > pass)
   },
 });
 ```
 
-Validated via existing `codemapUserConfigSchema` (Zod) — see [`docs/architecture.md` § User config](../architecture.md#user-config). Schema additions are minor changesets per [`.agents/lessons.md` "changesets bump policy"](../../.agents/lessons.md) (no `.codemap.db` impact).
+Validated via existing `codemapUserConfigSchema` (Zod) — see [`docs/architecture.md` § User config](../architecture.md#user-config). Schema additions are minor changesets per [`.agents/lessons.md` "changesets bump policy"](../../.agents/lessons.md) (no `.codemap.db` impact). Exit codes 0/1/2 ship together with `verdict` — never half-shipped.
 
 ## 6. Composition with existing flags
 
-| Flag                             | Behaviour with `audit`                                                                                             |
-| -------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
-| `--json`                         | Default for the verdict shape; non-JSON falls back to `console.table` per delta + a one-line verdict summary.      |
-| `--summary`                      | Collapses every delta to `{added: N, removed: N}`; verdict + base/head metadata stay. Useful for CI status checks. |
-| `--changed-since`                | **Mutex** — `audit` is itself a "changed-since" operation; combining would be confusing. Parser-level error.       |
-| `--group-by`                     | **Mutex** — verdict shape is already structured; bucketing is the consumer's job on the output JSON.               |
-| `--save-baseline` / `--baseline` | **Mutex** — different snapshot semantics (B.6 is user-named; audit is base-ref-driven).                            |
-| `--recipe`                       | N/A — `audit` isn't a `query` subcommand; it's its own top-level command.                                          |
+| Flag                | Behaviour with `audit`                                                                                                                                                                 |
+| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `--json`            | Emits the `{base, head, deltas}` envelope. See [§7.1 Output shapes](#71-output-shapes) for the terminal-mode (no `--json`) layout.                                                     |
+| `--summary`         | Collapses every delta in the output to counts: with `--json` → `deltas.<key>.{added: N, removed: N}`; without → a single line. See [§7.1](#71-output-shapes).                          |
+| `--baseline <name>` | **Snapshot source** — diff against the named B.6 baseline. v1 default mode.                                                                                                            |
+| `--base <ref>`      | **Snapshot source** — diff against a worktree+reindex of `<ref>`. v1.x. **Mutex with `--baseline`** (one snapshot source per invocation).                                              |
+| `--save-baseline`   | **N/A** — audit doesn't save baselines. Use `codemap query --save-baseline -r <recipe>` first, then `codemap audit --baseline <name>`. Single source of truth for snapshots stays B.6. |
+| `--changed-since`   | **Mutex** — `audit` is itself a "changed-since" operation; combining would be confusing.                                                                                               |
+| `--group-by`        | **Mutex** — output shape is already structured; bucketing is the consumer's job on the output JSON.                                                                                    |
+| `--no-index`        | **Skip the auto-incremental-index prelude.** Default is to re-index first so `head` is fresh; `--no-index` audits the DB as-is.                                                        |
+| `--recipe`          | N/A — `audit` isn't a `query` subcommand. The v1 deltas internally pin canonical SQL (per §4) — not user-selectable.                                                                   |
 
 ## 7. CLI surface
 
 ```text
-codemap audit --base <ref> [--json] [--summary] [--root <dir>] [--config <file>]
+# v1 (ships first):
+codemap audit --baseline <name> [--json] [--summary] [--no-index] [--root <dir>] [--config <file>]
+
+# v1.x (ships after v1 validates the delta shape):
+codemap audit --base <ref>      [--json] [--summary] [--no-index] [--root <dir>] [--config <file>]
 ```
 
-- `--base <ref>` — required. Any committish (`origin/main`, `HEAD~5`, sha, tag).
+- `--baseline <name>` — v1. Required (or `--base <ref>` once shipped). Name must exist in `query_baselines`; saved by `codemap query --save-baseline`.
+- `--base <ref>` — v1.x. Any committish (`origin/main`, `HEAD~5`, sha, tag).
+- **`--baseline` and `--base` are mutex** — exactly one snapshot source per invocation.
+- `--no-index` — skip the auto-incremental-index prelude (see below). Default audits a fresh `head` snapshot.
 - `--root` / `--config` / `--help` / `-h` — same shape as the rest of the CLI (handled by `bootstrap`).
-- Exit codes: **0** on `pass`, **1** on `warn`, **2** on `fail`. (CI-friendly; mirrors `git diff --exit-code`.)
+- **Exit codes (v1):** `0` on success, `1` on bootstrap / DB / baseline-not-found errors. No verdict-driven exit codes until v1.x ships `verdict`.
+
+### Auto-incremental-index prelude
+
+Before computing deltas, `runAuditCmd` calls `runCodemapIndex({ mode: "incremental" })` (the same code path as a bare `codemap` invocation). Reasons:
+
+1. **Same discipline as the codemap rule.** Agents are already told "After completing a step that modified source files, re-index before making any further queries." The audit is a query consumer; auto-indexing treats it the same way.
+2. **Cheap when there's nothing to do.** Incremental indexing is sub-second when no source has changed since last index — git-diff narrows the set to zero.
+3. **Avoids silent staleness.** Without the prelude, an agent that runs `audit` after editing source but before re-indexing would get a `head` snapshot that's older than the changes it just made. The deltas would lie.
+4. **`--no-index` escape hatch** for the rare case of "audit a frozen DB without touching files" (e.g. CI fetches a pre-built `.codemap.db` artifact and just wants the diff).
+
+The prelude reuses `runCodemapIndex` from `application/run-index.ts` — no new code for the indexing step itself, just a single-call wrapper in `cmd-audit.ts`.
+
+### 7.1 Output shapes
+
+Mirrors `git status` — terse on the common (no-drift) case, expressive when there's actual signal. Three output modes from the same data:
+
+**Terminal mode (no `--json`), no drift:**
+
+```text
+audit "pre-refactor" (saved 2 days ago @ abc1234, 152 rows)
+  → no drift across files / dependencies / deprecated.
+```
+
+**Terminal mode (no `--json`), with drift:**
+
+```text
+audit "pre-refactor" (saved 2 days ago @ abc1234, 152 rows)
+  → drift: files +1/-0, dependencies +3/-2, deprecated +1/-0
+
+  files (+1):
+    ┌─────────┬──────────────────────────┐
+    │ (index) │ path                     │
+    ├─────────┼──────────────────────────┤
+    │ 0       │ src/cli/cmd-audit.ts     │
+    └─────────┴──────────────────────────┘
+
+  dependencies (+3 / -2):
+    [console.table here]
+
+  deprecated (+1):
+    [console.table here]
+```
+
+`console.table` blocks are emitted **only for deltas with rows** — empty deltas don't print a `(no results)` placeholder (would be three of them in the no-drift case, all noise).
+
+**`--summary` (no `--json`):**
+
+```text
+audit "pre-refactor" (saved 2 days ago @ abc1234, 152 rows)
+  → drift: files +1/-0, dependencies +3/-2, deprecated +1/-0
+```
+
+Same one-line summary as terminal mode's drift header — no per-delta tables.
+
+**`--summary --json`:**
+
+```json
+{
+  "base": {
+    "source": "baseline",
+    "name": "pre-refactor",
+    "sha": "abc1234",
+    "indexed_at": 1714557600000
+  },
+  "head": { "sha": "def5678", "indexed_at": 1714560000000 },
+  "deltas": {
+    "files": { "added": 1, "removed": 0 },
+    "dependencies": { "added": 3, "removed": 2 },
+    "deprecated": { "added": 1, "removed": 0 }
+  }
+}
+```
+
+Counts replace the row arrays; envelope is otherwise identical to the full `--json` shape.
 
 ## 8. Tracer-bullet sequence
 
-Per [`.agents/rules/tracer-bullets`](../../.agents/rules/tracer-bullets.md), commit each slice end-to-end:
+Per [`.agents/rules/tracer-bullets`](../../.agents/rules/tracer-bullets.md), commit each slice end-to-end. **v1 ships only `--baseline <name>` (Option B).** `--base <ref>` (Option A) ships in a separate v1.x PR.
+
+### File layout
+
+The audit splits along codemap's existing `cli/` ↔ `application/` seam — same shape as `cmd-index.ts` ↔ `application/index-engine.ts`:
+
+| File                                   | Responsibility                                                                                                                                                                                                         |
+| -------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `src/cli/cmd-audit.ts`                 | argv parse (`--baseline`, `--no-index`, `--json`, `--summary`), delegation to `runAudit`, terminal-mode renderer (per §7.1).                                                                                           |
+| `src/application/audit-engine.ts`      | Delta registry (key → canonical SQL + required columns), baseline column-set validation, per-delta diff functions, the `{base, head, deltas}` envelope assembly. Exported entry point: `runAudit({db, baselineName})`. |
+| `src/cli/cmd-audit.test.ts`            | argv → option-bag tests (parser shape, mutex errors, etc.).                                                                                                                                                            |
+| `src/application/audit-engine.test.ts` | Engine tests — exercise `runAudit` against in-memory DB + canned baselines; assert envelope shape and the column-set-validation error path.                                                                            |
+
+The split:
+
+- **Mirrors existing layering** (`cli/cmd-index.ts` ↔ `application/index-engine.ts`) — architectural consistency.
+- **Makes the engine testable independent of CLI shape** — `audit-engine.test.ts` doesn't care about argv.
+- **Makes the v1.x `--base <ref>` slice mechanical** — worktree+reindex code lives in `cmd-audit.ts` (CLI orchestration); the engine just gets a different `db` handle pointing at the temp DB.
+- **Forward-compatible with a programmatic `Codemap.audit()` method** if `api.ts` ever exposes it.
+
+### v1 tracer-bullet sequence — `--baseline <name>`
+
+1. **CLI scaffold** — `cmd-audit.ts` + `audit-engine.ts` skeletons. `codemap audit --help` works; `--baseline <name>` and `--no-index` parsed; auto-incremental-index prelude wired (calls `runCodemapIndex({ mode: "incremental" })` unless `--no-index`); `runAudit` returns `{base: {source: "baseline", ...}, head: {...}, deltas: {}}` stub. Smoke + commit.
+2. **Delta registry + first delta — `files`** — engine grows the canonical-projection registry (`{key, sql, requiredColumns}`); `files` delta implements load-baseline → validate-columns → project → diff via `diffRows`. CLI renders one terminal-mode block. Commit.
+3. **Remaining deltas** — `dependencies`, `deprecated` — each as a separate commit. Each adds one registry entry + one delta function + tests. Renderer extends naturally.
+4. **Terminal-mode polish** — implement the no-drift / drift / `--summary` output shapes from §7.1; `cmd-audit.test.ts` covers all three.
+5. **Docs + agents update** — `architecture.md § Audit wiring`, glossary entry, README CLI block, rule + skill across `.agents/` and `templates/agents/` (Rule 10). Commit.
+6. **Changeset** — patch (no schema bump; reuses existing `query_baselines` table). Commit.
+
+Estimated total: ~1 day end-to-end across ~6 commits. The threshold-config / verdict step is **explicitly out** of v1 (see §5).
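
The load-baseline → validate-columns → project → diff pipeline from step 2 could look roughly like the sketch below. Everything here is illustrative: the SQL text, the `DeltaSpec` type, and the helper names are assumptions, and `diffRowsSketch` is a stand-in for codemap's existing `diffRows`, whose real signature is not shown in this plan. Rows are passed in as plain arrays so the sketch runs without a database.

```typescript
type Row = Record<string, unknown>;

// One registry entry per delta key: the canonical projection plus the columns
// a saved baseline must contain for the diff to be meaningful.
interface DeltaSpec {
  key: string;
  sql: string; // illustrative only; not executed in this sketch
  requiredColumns: string[];
}

const registry: Record<string, DeltaSpec> = {
  files: { key: "files", sql: "SELECT path FROM files", requiredColumns: ["path"] },
};

// Fails loudly when the baseline was saved from a query with a different column set.
function validateColumns(spec: DeltaSpec, baselineRows: Row[]): void {
  const first = baselineRows[0];
  if (first === undefined) return; // an empty baseline has nothing to validate
  const missing = spec.requiredColumns.filter((col) => !(col in first));
  if (missing.length > 0) {
    throw new Error(`baseline lacks columns for "${spec.key}": ${missing.join(", ")}`);
  }
}

// Keeps only the required columns, in registry order, so rows compare stably.
function project(spec: DeltaSpec, rows: Row[]): Row[] {
  return rows.map((row) => {
    const out: Row = {};
    for (const col of spec.requiredColumns) out[col] = row[col];
    return out;
  });
}

// Stand-in for diffRows: set difference keyed on the serialized projected row.
function diffRowsSketch(base: Row[], head: Row[]): { added: Row[]; removed: Row[] } {
  const keyOf = (row: Row): string => JSON.stringify(row);
  const baseKeys = new Set(base.map(keyOf));
  const headKeys = new Set(head.map(keyOf));
  return {
    added: head.filter((row) => !baseKeys.has(keyOf(row))),
    removed: base.filter((row) => !headKeys.has(keyOf(row))),
  };
}

function auditDelta(spec: DeltaSpec, baselineRows: Row[], headRows: Row[]) {
  validateColumns(spec, baselineRows);
  return diffRowsSketch(project(spec, baselineRows), project(spec, headRows));
}
```

Projecting both sides to the same column list before diffing means extra columns in a saved baseline do not register as drift.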
+
+### v1.x — `--base <ref>` (separate PR)
+
+1. Worktree spawn + temp-DB build (`codemap --full --root .codemap.audit-<sha>`).
+2. Cross-DB delta queries (same delta definitions as v1, swap snapshot source).
+3. Cleanup + cache decision (see open question §9).
+4. Docs + Rule 10 update.
+5. Changeset.
+
+Deferred until: (a) v1 validates the delta shape under real use, AND (b) at least one consumer asks for "audit against an arbitrary ref I haven't pre-baselined."
+
+### v1.x — `verdict` + threshold config (separate PR, separate trigger)
+
+Independent slice from `--base <ref>`. Triggers and shape sketched in [§5 Verdict shape](#5-verdict-shape).
+
+## 9. Open questions (v1.x)
 
-1. **CLI scaffold** — `codemap audit --help` works; `--base <ref>` parsed; `runAuditCmd` calls a stub that returns `{verdict: "pass", deltas: {}}`. Smoke + commit.
-2. **Worktree + base index** — Option A spawn-and-index implementation; assert two `.codemap.db` files exist. Commit.
-3. **First delta — `files`** — minimal end-to-end vertical slice: open both DBs, set-diff `path`, emit `{files: {added, removed}}`. Smoke + commit.
-4. **Remaining deltas** — `dependencies`, `deprecated`, `visibility`, `barrels`, `hot_files` — each as a separate commit so individual tests can be reviewed.
-5. **Threshold config** — Zod schema additions + verdict reduction; default `pass` until user opts in. Commit.
-6. **Docs + agents update** — `architecture.md § Audit wiring`, glossary entry, README CLI block, rule + skill across `.agents/` and `templates/agents/` (Rule 10). Commit.
-7. **Changeset** — patch (no schema bump). Commit.
+These all defer to v1.x or later — none block the v1 ship.
 
-Estimated total: 1–2 days end-to-end across ~7 commits.
+- **Worktree location for `--base <ref>`** — `.codemap.audit-<sha>/` (project-local; gitignored by the existing `.codemap.*` glob) vs `/tmp/codemap-audit-<sha>` (system-temp; auto-cleaned but loses cache across reboots). **Lean: project-local, named to match the gitignore.** Settled when v1.x ships.
+- **`actions` per delta key** — recipe `actions` (Tier A.1) attach to row sets; an audit delta is a higher-level concept. v1 doesn't include `actions` at all (no verdict either — see §5). v1.x can add `audit.actions: { dependencies: "review-coupling-spike" }` if patterns emerge.
+- **Cross-snapshot performance ceiling for `--base <ref>`** — at what project size does the worktree+full-reindex path become unacceptable (>30s)? Needs a benchmark fixture; defer until a real consumer hits the wall.
 
-## 9. Open questions
+### Settled during the design pass
 
-- **Should the temp worktree live under `.codemap/audit-<sha>/` (project-local) or `/tmp/codemap-audit-<sha>` (system temp)?** Project-local is gitignorable via the existing `.codemap.*` glob (works only if the dir is named `.codemap.audit-<sha>`); system temp is auto-cleaned but loses the cache benefit across reboots. **Lean: project-local, naming `.codemap.audit-<sha>` so the existing gitignore covers it.**
-- **Should `audit` warn when `<base>` and `HEAD` are identical?** Almost certainly user error (probably wanted `--base origin/main` not `--base HEAD`). Surface a warning, exit 0 with empty deltas.
-- **Should the verdict include `actions` per delta key?** Recipe `actions` (Tier A.1) attach to row sets; an audit delta is a higher-level concept. v1 punts; v1.x can add `audit.actions: { dependencies: "review-coupling-spike" }` if patterns emerge.
-- **Cross-snapshot performance ceiling.** At what project size does Option A become unacceptable (>30s)? Need a benchmark fixture; defer until a real consumer hits the wall.
+- **Should `audit` warn when `<base>` and `HEAD` are identical?** **No.** The renderer's metadata header (`audit "X" (saved 2 days ago @ abc1234, 152 rows)`) already exposes the baseline's `git_ref`, so the user can spot a same-SHA mistake from the existing output. Adding a warning would be noise in the common case (zero deltas after a small change is exactly what you want) and heuristic-driven in the edge cases (detecting a "divergent baseline" requires merge-base inspection, which is meaningful code for a low-signal warning). Reconsider only if a real consumer reports losing time to it.
 
 ## 10. References