
Commit 035ee23

docs: codemap-audit (B.5) plan + adopt grill-me & improve-codebase-architecture skills (#32)
Two unrelated docs changes batched:

## 1. Plan: `codemap audit --base <ref>` (B.5)

Per `docs/README.md` Rule 3 (plans live in `plans/<feature-name>.md`, link from `roadmap.md`), drafts the design for **B.5** before writing any code. The research note explicitly calls this "the single highest-leverage candidate this refresh."

| Decision | v1 |
| --- | --- |
| **Snapshot strategy** | Temp worktree + full reindex under `.codemap.audit-<sha>/` (gitignored by the existing `.codemap.*` glob). Defers caching / perf-tuning until a real consumer hits the wall. |
| **Built-in deltas** | `files`, `dependencies`, `deprecated`, `visibility`, `barrels`, `hot_files`. Each wraps an existing recipe — no new analysis layer. |
| **Verdict** | `pass` / `warn` / `fail` with thresholds **opt-in via `codemap.config.audit`**. v1 emits raw deltas only (default `pass`). |
| **Exit codes** | `0` / `1` / `2` — mirrors `git diff --exit-code`. |
| **Composition** | `--json` / `--summary` work; `--changed-since` / `--group-by` / `--save-baseline` / `--baseline` are mutex (different shapes / semantics). |
| **Tracer-bullet sequence** | 7 commits: scaffold → worktree → first delta → remaining deltas → threshold config → docs+agents (Rule 10) → changeset. |

Both prerequisites just merged on `main`: B.6 (PR #30) proves the snapshot-in-DB primitive; B.7 (PR #28) provides the `symbols.visibility` column the `visibility` delta needs.

## 2. Adopt two Tier 3 skills from [`mattpocock/skills`](https://github.com/mattpocock/skills)

Sourced after evaluating three skills mid-thread; the two adopted ones earn their always-zero-cost slot:

| Skill | What |
| --- | --- |
| **`grill-me`** | 8-line interview-pattern skill. Walk a design tree branch by branch, recommend an answer per question, ask one at a time. Filled the gap visible in commit 1's plan: I made many decisions by myself; `grill-me` would have surfaced them for second opinion before they crystallised. |
| **`improve-codebase-architecture`** | Ousterhout-style deepening vocabulary (`module / interface / seam / adapter / depth / leverage / locality`), the deletion test, "one adapter = hypothetical seam, two = real," dependency categories (`DEEPENING.md`), and parallel-sub-agent "Design It Twice" interface exploration (`INTERFACE-DESIGN.md`). |

Both are maintainer-only (under `.agents/skills/` + `.cursor/skills/` symlinks per `agents-first-convention`). **Not added to `templates/agents/`** — same precedent as PR #25 (consumer surface ships only the codemap rule + skill).

### Translation notes

`improve-codebase-architecture/SKILL.md` adapted at three points to fit codemap's docs framework (the upstream version assumes `CONTEXT.md` + `docs/adr/`; we have neither):

- `CONTEXT.md` references → `docs/glossary.md` (Rule 9 already enforces glossary updates per PR).
- `docs/adr/` references → `docs/plans/<topic>.md` (Rule 3 — but plans are mortal; decisions of record lift to `architecture.md` per Rule 2 then the plan is deleted).
- "Offer ADR on rejection" step → dropped. Codemap doesn't keep decision records; the closest is "lift to architecture.md."

Companion files (`LANGUAGE.md`, `DEEPENING.md`, `INTERFACE-DESIGN.md`) ship **verbatim** — none reference `CONTEXT.md` or ADRs.

`grill-me/SKILL.md` extended with two short codemap-specific notes: prefer `codemap` over `Grep` when exploring (per the `codemap` rule), and write crystallised answers into the in-flight `docs/plans/<name>.md` inline (Rule 3).

### Skipped

- **`grill-with-docs`** (the third skill in the upstream "grill" family) — requires standing up CONTEXT.md / `docs/adr/` infrastructure that conflicts with the lift-to-architecture-then-delete-the-plan lifecycle codemap already runs. The salvageable ADR 3-criteria gate is recorded in this conversation; lift if codemap ever needs ADRs.

### Tier 3 list updated

`.agents/rules/agents-tier-system.md` Tier 3 list extended with both new skills, and the previously-missing `docs-governance` + `docs-lifecycle-sweep` entries from PR #25.

## Test plan

- [x] `bun run check` green (no behavior changed; pure docs + skills).
- [x] All cross-references resolve (plan → research → architecture / lessons; skill files → glossary.md / architecture.md / codemap rule / each other).
- [x] `.cursor/skills/{grill-me,improve-codebase-architecture}` symlinks resolve.
- [x] Plan calls itself out as **Plan** type per `docs/README.md § Document Lifecycle` — delete on ship, lift to `architecture.md`.
- [ ] CI green.

1 parent a309d52 commit 035ee23

15 files changed

Lines changed: 907 additions & 1 deletion


.agents/rules/agents-tier-system.md

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ Today's Tier-2 rules:

 Pure intent-triggered. The skill description is detailed enough that Cursor surfaces it on relevant phrases. No always-on cost.

-Skills stay rule-less when the work is **explicitly invoked** by the user, not pattern-triggered (e.g. `audit-pr-architecture`, `docs-lifecycle-sweep` in this repo; `improve-codebase-architecture`, `gritql-codemods`, `ubiquitous-language` in larger codebases).
+Skills stay rule-less when the work is **explicitly invoked** by the user, not pattern-triggered. Today: `audit-pr-architecture`, `diagnose`, `docs-governance`, `docs-lifecycle-sweep`, `grill-me`, `improve-codebase-architecture`, `write-a-skill`. (Skills like `gritql-codemods` and `ubiquitous-language` would also fit this tier if adopted.)

 ## Authoring guidelines

.agents/skills/diagnose/SKILL.md

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
---
name: diagnose
description: Disciplined diagnosis loop for hard bugs and performance regressions. Reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when user says "diagnose this" / "debug this", reports a bug, says something is broken/throwing/failing, or describes a performance regression.
---

# Diagnose

A discipline for hard bugs. Skip phases only when explicitly justified.

When exploring the codebase, query [`codemap`](../codemap/SKILL.md) (the structural SQLite index) before reaching for `Grep` or `Read` per the [`codemap` rule](../../rules/codemap.md) — symbol-shaped questions ("where is X defined?", "what calls X?") have direct answers in the `symbols` / `calls` tables. Read the relevant section of [`docs/architecture.md`](../../../docs/architecture.md) to ground the mental model of layering, and check [`docs/glossary.md`](../../../docs/glossary.md) for canonical domain terms (file types, recipe ids, schema columns).

## Phase 1 — Build a feedback loop

**This is the skill.** Everything else is mechanical. If you have a fast, deterministic, agent-runnable pass/fail signal for the bug, you will find the cause — bisection, hypothesis-testing, and instrumentation all just consume that signal. If you don't have one, no amount of staring at code will save you.

Spend disproportionate effort here. **Be aggressive. Be creative. Refuse to give up.**

### Ways to construct one — try them in roughly this order

1. **Failing test** at whatever seam reaches the bug — unit, integration, e2e. Codemap convention: `src/**/<name>.test.ts` for unit + integration; `fixtures/golden/` for query-shape regressions; `bun test <file>` runs them.
2. **CLI invocation** with a fixture input, diffing stdout against a known-good snapshot. Examples: `bun src/index.ts query --json …` against `fixtures/minimal/`, golden runner under `scripts/query-golden.ts`.
3. **Replay a captured trace.** Save a real `.codemap.db` / config / fixture file to disk; replay it through the code path in isolation.
4. **Throwaway harness.** Spin up a minimal subset (one parser, one DB connection) that exercises the bug code path with a single function call.
5. **Property / fuzz loop.** If the bug is "sometimes wrong output", run 1000 random inputs and look for the failure mode.
6. **Bisection harness.** If the bug appeared between two known states (commit, dataset, version), automate "boot at state X, check, repeat" so you can `git bisect run` it.
7. **Differential loop.** Run the same input through old-version vs new-version (or two configs) and diff outputs. The B.6 baseline machinery (`codemap query --save-baseline` / `--baseline`) is built for exactly this — use it.
8. **HITL bash script.** Last resort. If a human must click or copy a value out of the IDE, drive _them_ with [`scripts/hitl-loop.template.sh`](scripts/hitl-loop.template.sh) so the loop is still structured. Captured output feeds back to you.

Build the right feedback loop, and the bug is 90% fixed.

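
As a minimal sketch of the bisection harness in item 6: the predicate handed to `git bisect run` only has to translate the loop's result into an exit code. `git bisect run` reads exit code 0 as "good", 125 as "skip this commit", and any other code from 1 to 127 as "bad". The names `LoopOutcome` and `bisectExitCode` are hypothetical, chosen for illustration; they are not part of codemap.

```typescript
// Hypothetical helper: names are illustrative, not part of codemap.
type LoopOutcome = "pass" | "fail" | "cannot-run";

// Map a feedback-loop result onto the exit codes `git bisect run` understands.
function bisectExitCode(outcome: LoopOutcome): number {
  switch (outcome) {
    case "pass":
      return 0; // this commit is good
    case "fail":
      return 1; // this commit is bad (any code 1-127 except 125 would do)
    case "cannot-run":
      return 125; // e.g. the commit does not build: ask bisect to skip it
  }
}
```

A wrapper script would run the Phase 1 loop at the checked-out commit, pass the result through this mapping, and exit with the returned code. Keeping a separate "cannot-run" outcome matters: reporting an unbuildable commit as "bad" would send the bisection to the wrong half of the history.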
### Iterate on the loop itself

Treat the loop as a product. Once you have _a_ loop, ask:

- Can I make it faster? (Cache setup, skip unrelated init, narrow the test scope.)
- Can I make the signal sharper? (Assert on the specific symptom, not "didn't crash".)
- Can I make it more deterministic? (Pin time, seed RNG, isolate filesystem, freeze network.)

A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower.

### Non-deterministic bugs

The goal is not a clean repro but a **higher reproduction rate**. Loop the trigger 100×, parallelise, add stress, narrow timing windows, inject sleeps. A 50%-flake bug is debuggable; 1% is not — keep raising the rate until it's debuggable.

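
One way to make "reproduction rate" measurable is to wrap the trigger in a counting loop. The sketch below is illustrative only: `reproductionRate` and the stand-in `flaky` trigger are hypothetical names, and counting a thrown error as a reproduction is an assumption that may not suit every bug.

```typescript
// Hypothetical helper: run a trigger repeatedly and report how often it fails.
// `trigger` returns true when the bug reproduced on that attempt.
function reproductionRate(trigger: () => boolean, runs: number): number {
  if (runs <= 0) throw new RangeError("runs must be positive");
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    try {
      if (trigger()) failures++;
    } catch {
      failures++; // assumption: a thrown error also counts as a reproduction
    }
  }
  return failures / runs;
}

// Stand-in trigger for demonstration only: reports a failure on every 4th call.
let calls = 0;
const flaky = (): boolean => ++calls % 4 === 0;

const rate = reproductionRate(flaky, 100);
console.log(`reproduced in ${(rate * 100).toFixed(0)}% of runs`); // prints "reproduced in 25% of runs"
```

Re-measuring after each change to the loop (added stress, narrower timing window) shows whether that change actually raised the rate.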
### When you genuinely cannot build a loop

Stop and say so explicitly. List what you tried. Ask the user for: (a) access to whatever environment reproduces it, (b) a captured artifact (HAR file, log dump, core dump, screen recording with timestamps, broken `.codemap.db`), or (c) permission to add temporary instrumentation. Do **not** proceed to hypothesise without a loop.

Do not proceed to Phase 2 until you have a loop you believe in.

## Phase 2 — Reproduce

Run the loop. Watch the bug appear.

Confirm:

- [ ] The loop produces the failure mode the **user** described — not a different failure that happens to be nearby. Wrong bug = wrong fix.
- [ ] The failure is reproducible across multiple runs (or, for non-deterministic bugs, reproducible at a high enough rate to debug against).
- [ ] You have captured the exact symptom (error message, wrong output, slow timing) so later phases can verify the fix actually addresses it.

Do not proceed until you reproduce the bug.

## Phase 3 — Hypothesise

Generate **3–5 ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea.

Each hypothesis must be **falsifiable**: state the prediction it makes.

> Format: "If `<X>` is the cause, then `<Y>` will make the bug disappear / `<Z>` will make it worse."

If you cannot state the prediction, the hypothesis is a vibe — discard or sharpen it.

**Show the ranked list to the user before testing.** They often have domain knowledge that re-ranks instantly ("we just changed #3"), or know hypotheses they've already ruled out. Cheap checkpoint, big time saver. Don't block on it — proceed with your ranking if the user is AFK.

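
To make the "falsifiable and ranked" requirement concrete, the hypothesis list can be treated as data. This is only a sketch: the `Hypothesis` shape, its field names, and the numeric `likelihood` used for ranking are assumptions for illustration, not something the skill prescribes.

```typescript
// Illustrative shape only; field names are assumptions, not a codemap API.
interface Hypothesis {
  cause: string; // the <X> in the format above
  prediction: string; // the <Y> / <Z>: what should change if <X> is the cause
  likelihood: number; // rough estimate in [0, 1], used only for ranking
}

// Drop hypotheses with no stated prediction, then rank most likely first.
function rankHypotheses(candidates: Hypothesis[]): Hypothesis[] {
  return candidates
    .filter((h) => h.prediction.trim().length > 0) // no prediction = a vibe
    .sort((a, b) => b.likelihood - a.likelihood);
}
```

Filtering before sorting mirrors the rule above: a hypothesis without a prediction never reaches the ranked list shown to the user.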
## Phase 4 — Instrument

Each probe must map to a specific prediction from Phase 3. **Change one variable at a time.**

Tool preference:

1. **Debugger / REPL inspection** if the env supports it. One breakpoint beats ten logs.
2. **Targeted logs** at the boundaries that distinguish hypotheses.
3. Never "log everything and grep".

**Tag every debug log** with a unique prefix, e.g. `[DEBUG-a4f2]`. Cleanup at the end becomes a single grep. Untagged logs survive; tagged logs die.

**Perf branch.** For performance regressions, logs are usually wrong. Instead: establish a baseline measurement (timing harness, `performance.now()`, profiler, query plan, `--performance` flag for index runs), then bisect. Measure first, fix second.

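
The tagged-log convention from this phase can be enforced with a tiny wrapper, so that no probe is written without its prefix. A minimal sketch, assuming a hypothetical `makeDebugLogger` helper (not an existing codemap utility):

```typescript
// Hypothetical helper: build a logger whose every line carries one unique tag.
function makeDebugLogger(tag: string, sink: (line: string) => void = console.log) {
  const prefix = `[DEBUG-${tag}]`;
  return {
    prefix, // exposed so cleanup knows exactly which string to search for
    log: (message: string): void => sink(`${prefix} ${message}`),
  };
}

const debug = makeDebugLogger("a4f2");
debug.log("cache miss for key=users"); // prints "[DEBUG-a4f2] cache miss for key=users"
```

Because every line starts with the same literal prefix, the Phase 6 cleanup check reduces to searching the source tree for `[DEBUG-a4f2]` and removing each hit.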
## Phase 5 — Fix + regression test

Write the regression test **before the fix** — but only if there is a **correct seam** for it (per the [`improve-codebase-architecture`](../improve-codebase-architecture/SKILL.md) vocabulary).

A correct seam is one where the test exercises the **real bug pattern** as it occurs at the call site. If the only available seam is too shallow (single-caller test when the bug needs multiple callers, unit test that can't replicate the chain that triggered the bug), a regression test there gives false confidence.

**If no correct seam exists, that itself is the finding.** Note it. The codebase architecture is preventing the bug from being locked down. Flag this for the next phase.

If a correct seam exists:

1. Turn the minimised repro into a failing test at that seam.
2. Watch it fail.
3. Apply the fix.
4. Watch it pass.
5. Re-run the Phase 1 feedback loop against the original (un-minimised) scenario.

## Phase 6 — Cleanup + post-mortem

Required before declaring done:

- [ ] Original repro no longer reproduces (re-run the Phase 1 loop)
- [ ] Regression test passes (or absence of seam is documented)
- [ ] All `[DEBUG-…]` instrumentation removed (`grep` the prefix)
- [ ] Throwaway prototypes deleted (or moved to a clearly-marked debug location)
- [ ] The hypothesis that turned out correct is stated in the commit / PR message — so the next debugger learns
- [ ] If the post-mortem yields a permanent insight, append a one-line entry to [`.agents/lessons.md`](../../lessons.md) per the lessons-rule discipline

**Then ask: what would have prevented this bug?** If the answer involves architectural change (no good test seam, tangled callers, hidden coupling), hand off to [`improve-codebase-architecture`](../improve-codebase-architecture/SKILL.md) with the specifics. Make the recommendation **after** the fix is in, not before — you have more information now than when you started.

.agents/skills/diagnose/scripts/hitl-loop.template.sh

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
#!/usr/bin/env bash
# Human-in-the-loop reproduction loop.
# Copy this file, edit the steps below, and run it.
# The agent runs the script; the user follows prompts in their terminal.
#
# Usage:
#   bash hitl-loop.template.sh
#
# Two helpers:
#   step "<instruction>"       → show instruction, wait for Enter
#   capture VAR "<question>"   → show question, read response into VAR
#
# At the end, captured values are printed as KEY=VALUE for the agent to parse.

set -euo pipefail

step() {
  printf '\n>>> %s\n' "$1"
  read -r -p " [Enter when done] " _
}

capture() {
  local var="$1" question="$2" answer
  printf '\n>>> %s\n' "$question"
  read -r -p " > " answer
  printf -v "$var" '%s' "$answer"
}

# --- edit below ---------------------------------------------------------

step "Open the app at http://localhost:3000 and sign in."

capture ERRORED "Click the 'Export' button. Did it throw an error? (y/n)"

capture ERROR_MSG "Paste the error message (or 'none'):"

# --- edit above ---------------------------------------------------------

printf '\n--- Captured ---\n'
printf 'ERRORED=%s\n' "$ERRORED"
printf 'ERROR_MSG=%s\n' "$ERROR_MSG"

.agents/skills/grill-me/SKILL.md

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
---
name: grill-me
description: Interview the user relentlessly about a plan or design until reaching shared understanding, resolving each branch of the decision tree. Use when user wants to stress-test a plan, get grilled on their design, or mentions "grill me".
---

Interview me relentlessly about every aspect of this plan until we reach a shared understanding. Walk down each branch of the design tree, resolving dependencies between decisions one-by-one. For each question, provide your recommended answer.

Ask the questions one at a time, waiting for feedback before continuing.

If a question can be answered by exploring the codebase, explore the codebase instead. In this repo, that means querying [`codemap`](../codemap/SKILL.md) (the structural index) before reaching for `Grep` or `Read` — see the [`codemap` rule](../../rules/codemap.md).

When agreement crystallises on a question that affects an in-flight `docs/plans/<name>.md`, write the answer into the plan inline as you go — don't batch them up. The plan doc is the durable record; the chat transcript is not.

.agents/skills/improve-codebase-architecture/DEEPENING.md

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
# Deepening

How to deepen a cluster of shallow modules safely, given its dependencies. Assumes the vocabulary in [LANGUAGE.md](LANGUAGE.md): **module**, **interface**, **seam**, **adapter**.

## Dependency categories

When assessing a candidate for deepening, classify its dependencies. The category determines how the deepened module is tested across its seam.

### 1. In-process

Pure computation, in-memory state, no I/O. Always deepenable — merge the modules and test through the new interface directly. No adapter needed.

### 2. Local-substitutable

Dependencies that have local test stand-ins (PGLite for Postgres, in-memory filesystem). Deepenable if the stand-in exists. The deepened module is tested with the stand-in running in the test suite. The seam is internal; no port at the module's external interface.

### 3. Remote but owned (Ports & Adapters)

Your own services across a network boundary (microservices, internal APIs). Define a **port** (interface) at the seam. The deep module owns the logic; the transport is injected as an **adapter**. Tests use an in-memory adapter. Production uses an HTTP/gRPC/queue adapter.

Recommendation shape: _"Define a port at the seam, implement an HTTP adapter for production and an in-memory adapter for testing, so the logic sits in one deep module even though it's deployed across a network."_

### 4. True external (Mock)

Third-party services (Stripe, Twilio, etc.) you don't control. The deepened module takes the external dependency as an injected port; tests provide a mock adapter.

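
Categories 3 and 4 share the same mechanical shape: a port at the seam and an adapter injected into the deep module. The sketch below illustrates that shape under stated assumptions; `NotifierPort`, `InMemoryNotifier`, and `OrderService` are hypothetical names invented for this example and come from no real codebase.

```typescript
// Illustrative only: all names here are hypothetical.

// The port: the interface at the seam.
interface NotifierPort {
  send(recipient: string, message: string): void;
}

// Test adapter: records messages in memory instead of crossing the network.
class InMemoryNotifier implements NotifierPort {
  readonly sent: Array<{ recipient: string; message: string }> = [];

  send(recipient: string, message: string): void {
    this.sent.push({ recipient, message });
  }
}

// The deep module: owns the logic and receives the transport as an adapter.
class OrderService {
  private readonly notifier: NotifierPort;

  constructor(notifier: NotifierPort) {
    this.notifier = notifier;
  }

  // Returns the number of items ordered; notifies the customer as a side effect.
  placeOrder(customer: string, items: string[]): number {
    if (items.length === 0) throw new Error("order must contain at least one item");
    this.notifier.send(customer, `Order received: ${items.length} item(s)`);
    return items.length;
  }
}
```

A production adapter (HTTP, gRPC, or a queue client) would implement the same `NotifierPort`. Tests construct `OrderService` with `InMemoryNotifier` and assert on `sent`, which keeps them at the module's interface.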
## Seam discipline

- **One adapter means a hypothetical seam. Two adapters means a real one.** Don't introduce a port unless at least two adapters are justified (typically production + test). A single-adapter seam is just indirection.
- **Internal seams vs external seams.** A deep module can have internal seams (private to its implementation, used by its own tests) as well as the external seam at its interface. Don't expose internal seams through the interface just because tests use them.

## Testing strategy: replace, don't layer

- Old unit tests on shallow modules become waste once tests at the deepened module's interface exist — delete them.
- Write new tests at the deepened module's interface. The **interface is the test surface**.
- Tests assert on observable outcomes through the interface, not internal state.
- Tests should survive internal refactors — they describe behaviour, not implementation. If a test has to change when the implementation changes, it's testing past the interface.

.agents/skills/improve-codebase-architecture/INTERFACE-DESIGN.md

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
# Interface Design

When the user wants to explore alternative interfaces for a chosen deepening candidate, use this parallel sub-agent pattern. Based on "Design It Twice" (Ousterhout) — your first idea is unlikely to be the best.

Uses the vocabulary in [LANGUAGE.md](LANGUAGE.md): **module**, **interface**, **seam**, **adapter**, **leverage**.

## Process

### 1. Frame the problem space

Before spawning sub-agents, write a user-facing explanation of the problem space for the chosen candidate:

- The constraints any new interface would need to satisfy
- The dependencies it would rely on, and which category they fall into (see [DEEPENING.md](DEEPENING.md))
- A rough illustrative code sketch to ground the constraints — not a proposal, just a way to make the constraints concrete

Show this to the user, then immediately proceed to Step 2. The user reads and thinks while the sub-agents work in parallel.

### 2. Spawn sub-agents

Spawn 3+ sub-agents in parallel using the Agent / Task tool. Each must produce a **radically different** interface for the deepened module.

Prompt each sub-agent with a separate technical brief (file paths, coupling details, dependency category from [DEEPENING.md](DEEPENING.md), what sits behind the seam). The brief is independent of the user-facing problem-space explanation in Step 1. Give each agent a different design constraint:

- Agent 1: "Minimize the interface — aim for 1–3 entry points max. Maximise leverage per entry point."
- Agent 2: "Maximise flexibility — support many use cases and extension."
- Agent 3: "Optimise for the most common caller — make the default case trivial."
- Agent 4 (if applicable): "Design around ports & adapters for cross-seam dependencies."

Include both [LANGUAGE.md](LANGUAGE.md) vocabulary and [`docs/glossary.md`](../../../docs/glossary.md) vocabulary in the brief so each sub-agent names things consistently with the architecture language and the project's domain language.

Each sub-agent outputs:

1. Interface (types, methods, params — plus invariants, ordering, error modes)
2. Usage example showing how callers use it
3. What the implementation hides behind the seam
4. Dependency strategy and adapters (see [DEEPENING.md](DEEPENING.md))
5. Trade-offs — where leverage is high, where it's thin

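
The brief construction in Step 2 (same technical context, one distinct constraint per agent) can be sketched as plain data. Everything here is hypothetical: `SharedContext`, `SubAgentBrief`, and `buildBriefs` are illustrative names, and the sketch deliberately stops short of calling any agent tool, since that API depends on the environment.

```typescript
// Hypothetical shapes: not the API of any particular agent framework.
interface SharedContext {
  filePaths: string[];
  dependencyCategory: "in-process" | "local-substitutable" | "remote-owned" | "true-external";
  vocabulary: string[]; // terms drawn from LANGUAGE.md and docs/glossary.md
}

interface SubAgentBrief extends SharedContext {
  agent: number; // 1-based, matching "Agent 1", "Agent 2", ...
  constraint: string;
}

// One brief per constraint: same technical context, different design pressure.
function buildBriefs(shared: SharedContext, constraints: string[]): SubAgentBrief[] {
  return constraints.map((constraint, index) => ({
    ...shared,
    agent: index + 1,
    constraint,
  }));
}
```

The resulting briefs would then be handed to whatever Agent / Task tool is available, all at once, so the designs are produced in parallel and independently of each other.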
### 3. Present and compare

Present designs sequentially so the user can absorb each one, then compare them in prose. Contrast by **depth** (leverage at the interface), **locality** (where change concentrates), and **seam placement**.

After comparing, give your own recommendation: which design you think is strongest and why. If elements from different designs would combine well, propose a hybrid. Be opinionated — the user wants a strong read, not a menu.
