docs(agents): adopt diagnose + write-a-skill skills (mattpocock)

SutuSebastian · SutuSebastian · commit 1037469407d0 · 2026-05-01T11:14:50.000+03:00
Two more Tier 3 maintainer-only skills sourced from mattpocock/skills: diagnose — disciplined 6-phase loop for hard bugs and perf regressions (reproduce → minimise → hypothesise → instrument → fix → cleanup). Core thesis: "build the right feedback loop, and the bug is 90% fixed." Translation: explore-the-codebase step now points at codemap (per the codemap rule's STOP-before-grep) and docs/glossary.md (per Rule 9 canonical terms); ADR mention dropped (no ADR infra in this repo); Phase-5 seam discipline cross-references improve-codebase-architecture (adopted in 906ecba); Phase-6 cleanup includes a one-line lessons.md append per the lessons-rule discipline. scripts/hitl-loop.template.sh ships verbatim — no codemap-specific assumptions. write-a-skill — meta-skill for creating new skills. Translation: front section explicitly cites our agents-first-convention (file layout) and agents-tier-system (tier choice + durability), plus the maintainer-only-vs-shipped distinction (precedent: PR #25). Examples cite codemap precedents (improve-codebase-architecture for the companion-files split; pr-comment-fact-check for single-file). Review checklist adapted: tier choice + rule-pairing decision + tier-list update added. Both adopted as maintainer-only (.agents/skills/ + .cursor/skills/ symlinks per agents-first-convention). Not added to templates/agents/ — consumer surface stays codemap-skill-only. agents-tier-system Tier 3 list updated: diagnose, write-a-skill added (alongside grill-me + improve-codebase-architecture from the prior commit). Skipped grill-with-docs (requires standing up CONTEXT.md / docs/adr/ infra; conflicts with codemap's lift-to-architecture-and- delete-the-plan lifecycle).
diff --git a/.agents/rules/agents-tier-system.md b/.agents/rules/agents-tier-system.md
@@ -50,7 +50,7 @@ Today's Tier-2 rules:
 
 Pure intent-triggered. The skill description is detailed enough that Cursor surfaces it on relevant phrases. No always-on cost.
 
-Skills stay rule-less when the work is **explicitly invoked** by the user, not pattern-triggered. Today: `audit-pr-architecture`, `docs-governance`, `docs-lifecycle-sweep`, `grill-me`, `improve-codebase-architecture`. (Skills like `gritql-codemods` and `ubiquitous-language` would also fit this tier if adopted.)
+Skills stay rule-less when the work is **explicitly invoked** by the user, not pattern-triggered. Today: `audit-pr-architecture`, `diagnose`, `docs-governance`, `docs-lifecycle-sweep`, `grill-me`, `improve-codebase-architecture`, `write-a-skill`. (Skills like `gritql-codemods` and `ubiquitous-language` would also fit this tier if adopted.)
 
 ## Authoring guidelines
 
diff --git a/.agents/skills/diagnose/SKILL.md b/.agents/skills/diagnose/SKILL.md
@@ -0,0 +1,116 @@
+---
+name: diagnose
+description: Disciplined diagnosis loop for hard bugs and performance regressions. Reproduce → minimise → hypothesise → instrument → fix → regression-test. Use when user says "diagnose this" / "debug this", reports a bug, says something is broken/throwing/failing, or describes a performance regression.
+---
+
+# Diagnose
+
+A discipline for hard bugs. Skip phases only when explicitly justified.
+
+When exploring the codebase, query [`codemap`](../codemap/SKILL.md) (the structural SQLite index) before reaching for `Grep` or `Read` per the [`codemap` rule](../../rules/codemap.md) — symbol-shaped questions ("where is X defined?", "what calls X?") have direct answers in the `symbols` / `calls` tables. Read the relevant section of [`docs/architecture.md`](../../../docs/architecture.md) to ground the mental model of layering, and check [`docs/glossary.md`](../../../docs/glossary.md) for canonical domain terms (file types, recipe ids, schema columns).
+
+## Phase 1 — Build a feedback loop
+
+**This is the skill.** Everything else is mechanical. If you have a fast, deterministic, agent-runnable pass/fail signal for the bug, you will find the cause — bisection, hypothesis-testing, and instrumentation all just consume that signal. If you don't have one, no amount of staring at code will save you.
+
+Spend disproportionate effort here. **Be aggressive. Be creative. Refuse to give up.**
+
+### Ways to construct one — try them in roughly this order
+
+1. **Failing test** at whatever seam reaches the bug — unit, integration, e2e. Codemap convention: `src/**/<name>.test.ts` for unit + integration; `fixtures/golden/` for query-shape regressions; `bun test <file>` runs them.
+2. **CLI invocation** with a fixture input, diffing stdout against a known-good snapshot. Examples: `bun src/index.ts query --json …` against `fixtures/minimal/`, golden runner under `scripts/query-golden.ts`.
+3. **Replay a captured trace.** Save a real `.codemap.db` / config / fixture file to disk; replay it through the code path in isolation.
+4. **Throwaway harness.** Spin up a minimal subset (one parser, one DB connection) that exercises the bug code path with a single function call.
+5. **Property / fuzz loop.** If the bug is "sometimes wrong output", run 1000 random inputs and look for the failure mode.
+6. **Bisection harness.** If the bug appeared between two known states (commit, dataset, version), automate "boot at state X, check, repeat" so you can `git bisect run` it.
+7. **Differential loop.** Run the same input through old-version vs new-version (or two configs) and diff outputs. The B.6 baseline machinery (`codemap query --save-baseline` / `--baseline`) is built for exactly this — use it.
+8. **HITL bash script.** Last resort. If a human must click or copy a value out of the IDE, drive _them_ with [`scripts/hitl-loop.template.sh`](scripts/hitl-loop.template.sh) so the loop is still structured. Captured output feeds back to you.
+
+Build the right feedback loop, and the bug is 90% fixed.
+
+### Iterate on the loop itself
+
+Treat the loop as a product. Once you have _a_ loop, ask:
+
+- Can I make it faster? (Cache setup, skip unrelated init, narrow the test scope.)
+- Can I make the signal sharper? (Assert on the specific symptom, not "didn't crash".)
+- Can I make it more deterministic? (Pin time, seed RNG, isolate filesystem, freeze network.)
+
+A 30-second flaky loop is barely better than no loop. A 2-second deterministic loop is a debugging superpower.
+
+### Non-deterministic bugs
+
+The goal is not a clean repro but a **higher reproduction rate**. Loop the trigger 100×, parallelise, add stress, narrow timing windows, inject sleeps. A 50%-flake bug is debuggable; 1% is not — keep raising the rate until it's debuggable.
+
+### When you genuinely cannot build a loop
+
+Stop and say so explicitly. List what you tried. Ask the user for: (a) access to whatever environment reproduces it, (b) a captured artifact (HAR file, log dump, core dump, screen recording with timestamps, broken `.codemap.db`), or (c) permission to add temporary instrumentation. Do **not** proceed to hypothesise without a loop.
+
+Do not proceed to Phase 2 until you have a loop you believe in.
+
+## Phase 2 — Reproduce
+
+Run the loop. Watch the bug appear.
+
+Confirm:
+
+- [ ] The loop produces the failure mode the **user** described — not a different failure that happens to be nearby. Wrong bug = wrong fix.
+- [ ] The failure is reproducible across multiple runs (or, for non-deterministic bugs, reproducible at a high enough rate to debug against).
+- [ ] You have captured the exact symptom (error message, wrong output, slow timing) so later phases can verify the fix actually addresses it.
+
+Do not proceed until you reproduce the bug.
+
+## Phase 3 — Hypothesise
+
+Generate **3–5 ranked hypotheses** before testing any of them. Single-hypothesis generation anchors on the first plausible idea.
+
+Each hypothesis must be **falsifiable**: state the prediction it makes.
+
+> Format: "If `<X>` is the cause, then `<Y>` will make the bug disappear / `<Z>` will make it worse."
+
+If you cannot state the prediction, the hypothesis is a vibe — discard or sharpen it.
+
+**Show the ranked list to the user before testing.** They often have domain knowledge that re-ranks instantly ("we just changed #3"), or know hypotheses they've already ruled out. Cheap checkpoint, big time saver. Don't block on it — proceed with your ranking if the user is AFK.
+
+## Phase 4 — Instrument
+
+Each probe must map to a specific prediction from Phase 3. **Change one variable at a time.**
+
+Tool preference:
+
+1. **Debugger / REPL inspection** if the env supports it. One breakpoint beats ten logs.
+2. **Targeted logs** at the boundaries that distinguish hypotheses.
+3. Never "log everything and grep".
+
+**Tag every debug log** with a unique prefix, e.g. `[DEBUG-a4f2]`. Cleanup at the end becomes a single grep. Untagged logs survive; tagged logs die.
+
+**Perf branch.** For performance regressions, logs are usually wrong. Instead: establish a baseline measurement (timing harness, `performance.now()`, profiler, query plan, `--performance` flag for index runs), then bisect. Measure first, fix second.
+
+## Phase 5 — Fix + regression test
+
+Write the regression test **before the fix** — but only if there is a **correct seam** for it (per the [`improve-codebase-architecture`](../improve-codebase-architecture/SKILL.md) vocabulary).
+
+A correct seam is one where the test exercises the **real bug pattern** as it occurs at the call site. If the only available seam is too shallow (single-caller test when the bug needs multiple callers, unit test that can't replicate the chain that triggered the bug), a regression test there gives false confidence.
+
+**If no correct seam exists, that itself is the finding.** Note it. The codebase architecture is preventing the bug from being locked down. Flag this for the next phase.
+
+If a correct seam exists:
+
+1. Turn the minimised repro into a failing test at that seam.
+2. Watch it fail.
+3. Apply the fix.
+4. Watch it pass.
+5. Re-run the Phase 1 feedback loop against the original (un-minimised) scenario.
+
+## Phase 6 — Cleanup + post-mortem
+
+Required before declaring done:
+
+- [ ] Original repro no longer reproduces (re-run the Phase 1 loop)
+- [ ] Regression test passes (or absence of seam is documented)
+- [ ] All `[DEBUG-…]` instrumentation removed (`grep` the prefix)
+- [ ] Throwaway prototypes deleted (or moved to a clearly-marked debug location)
+- [ ] The hypothesis that turned out correct is stated in the commit / PR message — so the next debugger learns
+- [ ] If the post-mortem yields a permanent insight, append a one-line entry to [`.agents/lessons.md`](../../lessons.md) per the lessons-rule discipline
+
+**Then ask: what would have prevented this bug?** If the answer involves architectural change (no good test seam, tangled callers, hidden coupling) hand off to [`improve-codebase-architecture`](../improve-codebase-architecture/SKILL.md) with the specifics. Make the recommendation **after** the fix is in, not before — you have more information now than when you started.
diff --git a/.agents/skills/diagnose/scripts/hitl-loop.template.sh b/.agents/skills/diagnose/scripts/hitl-loop.template.sh
@@ -0,0 +1,41 @@
+#!/usr/bin/env bash
+# Human-in-the-loop reproduction loop.
+# Copy this file, edit the steps below, and run it.
+# The agent runs the script; the user follows prompts in their terminal.
+#
+# Usage:
+#   bash hitl-loop.template.sh
+#
+# Two helpers:
+#   step    "<instruction>"         → show instruction, wait for Enter
+#   capture VAR "<question>"        → show question, read response into VAR
+#
+# At the end, captured values are printed as KEY=VALUE for the agent to parse.
+
+set -euo pipefail
+
+step() {
+  printf '\n>>> %s\n' "$1"
+  read -r -p "    [Enter when done] " _
+}
+
+capture() {
+  local var="$1" question="$2" answer
+  printf '\n>>> %s\n' "$question"
+  read -r -p "    > " answer
+  printf -v "$var" '%s' "$answer"
+}
+
+# --- edit below ---------------------------------------------------------
+
+step "Open the app at http://localhost:3000 and sign in."
+
+capture ERRORED "Click the 'Export' button. Did it throw an error? (y/n)"
+
+capture ERROR_MSG "Paste the error message (or 'none'):"
+
+# --- edit above ---------------------------------------------------------
+
+printf '\n--- Captured ---\n'
+printf 'ERRORED=%s\n' "$ERRORED"
+printf 'ERROR_MSG=%s\n' "$ERROR_MSG"
diff --git a/.agents/skills/write-a-skill/SKILL.md b/.agents/skills/write-a-skill/SKILL.md
@@ -0,0 +1,176 @@
+---
+name: write-a-skill
+description: Create new agent skills with proper structure, progressive disclosure, and bundled resources. Use when user wants to create, write, or build a new skill (or asks "how do I write a skill?", "draft a SKILL.md for X").
+---
+
+# Writing Skills
+
+Discipline for authoring `.agents/skills/<name>/SKILL.md` files in this repo.
+
+## Repo conventions you must respect
+
+Before drafting any skill in codemap, internalise these (they trump anything in this skill):
+
+- **File layout** — [`agents-first-convention`](../../rules/agents-first-convention.md): the source-of-truth file is `.agents/skills/<name>/SKILL.md`; the `.cursor/skills/<name>` entry is a **symlink** back. Never put original content under `.cursor/`.
+- **Tier choice** — [`agents-tier-system`](../../rules/agents-tier-system.md): every new skill is Tier 1 (always-on, paired with a rule), Tier 2 (auto-attached to a glob, paired with a rule), or Tier 3 (discoverable, no rule). **Skills with `NEVER` / `ALWAYS` clauses deserve a rule pairing.** Pure intent-trigger skills (no hard "must" clauses) stay Tier 3.
+- **Maintainer-only vs shipped** — `.agents/skills/` is the dev-side mirror; `templates/agents/skills/` is what `codemap agents init` ships to npm consumers. The bundled template surface today is **only** the `codemap` skill — every other skill in `.agents/skills/` is maintainer-only (precedent: PR #25). Don't add a skill to `templates/agents/` unless it's something every consumer of the published package would want.
+
+## Process
+
+### 1. Gather requirements
+
+Ask the user:
+
+- What task / domain does the skill cover?
+- What specific use cases should it handle?
+- Does it need executable scripts (under `scripts/`) or just instructions?
+- Any reference materials to include?
+- **Tier choice**: does the skill have always-on principles (any `NEVER` / `ALWAYS` clauses)? If yes, it deserves a Tier-1 or Tier-2 rule pairing per [`agents-tier-system`](../../rules/agents-tier-system.md).
+
+### 2. Draft the skill
+
+Create:
+
+- `SKILL.md` with concise instructions (under 100 lines if possible — see "When to split" below)
+- Companion files (`LANGUAGE.md`, `REFERENCE.md`, `EXAMPLES.md`, etc.) when content exceeds 100 lines or has distinct domains
+- `scripts/<name>.{sh,ts}` when a deterministic operation is invoked repeatedly (saves tokens vs generated code)
+
+Use [`grill-me`](../grill-me/SKILL.md) on yourself to surface decisions before you write — what's the trigger phrase shape? What's the boundary with adjacent skills? What's the durability test (does this skill still read correctly six months from now)?
+
+### 3. Wire the file layout
+
+```bash
+# Source of truth
+.agents/skills/<name>/SKILL.md
+
+# Cursor symlink (per agents-first-convention)
+ln -s ../../.agents/skills/<name> .cursor/skills/<name>
+```
+
+### 4. Update the tier list
+
+Add the skill to the relevant list in [`agents-tier-system.md`](../../rules/agents-tier-system.md) so the inventory stays accurate.
+
+### 5. Review
+
+Ask the user:
+
+- Does this cover your use cases?
+- Anything missing or unclear?
+- Should any section be more / less detailed?
+
+Run the [Review checklist](#review-checklist) before declaring done.
+
+## Skill structure
+
+```text
+.agents/skills/<name>/
+├── SKILL.md              # Main instructions (required)
+├── LANGUAGE.md           # Vocabulary the skill enforces (if any)
+├── REFERENCE.md          # Detailed docs (if SKILL.md exceeds ~100 lines)
+├── EXAMPLES.md           # Usage examples (if needed)
+└── scripts/              # Utility scripts (if needed)
+    └── helper.sh
+```
+
+## SKILL.md template
+
+```md
+---
+name: skill-name
+description: Brief description of capability. Use when [specific triggers — verbs and nouns the user is likely to say, plus contexts where the skill applies].
+---
+
+# Skill Name
+
+## Quick start
+
+[Minimal working example — what the user does on first invocation]
+
+## Workflows
+
+[Step-by-step processes with checklists for complex tasks]
+
+## Advanced features
+
+[Link to companion files: See [REFERENCE.md](REFERENCE.md) / [LANGUAGE.md](LANGUAGE.md)]
+```
+
+## Description requirements
+
+The description is **the only thing the agent sees** when deciding which skill to load. It's surfaced in the discoverable-skills list alongside every other installed skill. Get this right or your skill never fires.
+
+**Goal**: Give the agent just enough info to know:
+
+1. What capability this skill provides
+2. When / why to trigger it (specific keywords, contexts, file types)
+
+**Format**:
+
+- Max ~1024 chars
+- Write in third person
+- First sentence: what it does
+- Second sentence: "Use when [specific triggers]"
+- Include the verbs and nouns the user is likely to say (per [`agents-tier-system` § Tier 3 description](../../rules/agents-tier-system.md))
+
+**Good example**:
+
+```text
+Triage and fact-check PR review comments against the actual codebase, project rules, and skills. Use when the user asks to address PR comments, respond to reviewer feedback, check if a comment is correct, fact-check a reviewer's claim, decide which comments to push back on, or sort hallucinated suggestions from real ones. Triggers on phrases like "check PR comments", "are these comments right".
+```
+
+**Bad example**:
+
+```text
+Helps with PRs.
+```
+
+The bad example gives the agent no way to distinguish this from any other PR-adjacent skill.
+
+## When to add scripts
+
+Add utility scripts under `scripts/` when:
+
+- Operation is deterministic (validation, formatting, bisection harness)
+- Same code would be generated repeatedly across invocations
+- Errors need explicit handling that's tedious to re-derive
+
+Scripts save tokens and improve reliability vs generated code.
+
+## When to split files
+
+Split into companion files when:
+
+- `SKILL.md` exceeds ~100 lines
+- Content has distinct domains (vocabulary vs process vs templates)
+- Advanced features are rarely needed and would balloon the main file
+
+Cite codemap precedents:
+
+- [`improve-codebase-architecture`](../improve-codebase-architecture/SKILL.md) splits into `LANGUAGE.md` (vocab), `DEEPENING.md` (sub-rules), `INTERFACE-DESIGN.md` (parallel-sub-agent pattern).
+- [`pr-comment-fact-check`](../pr-comment-fact-check/SKILL.md) stays single-file because every section is in-flow process.
+
+## Durability discipline
+
+Per [`agents-tier-system` § Authoring discipline: durability](../../rules/agents-tier-system.md):
+
+- **Don't cite specific audit / plan / research filenames as canonical examples.** Plans are mortal under [`docs-lifecycle-sweep`](../docs-lifecycle-sweep/SKILL.md). Use shape placeholders (`<topic>.md`) instead.
+- **Don't cite specific commit hashes or PR numbers as the only path to context.** Summarise inline.
+- **Don't cite source-code line numbers.** Reference symbols by name.
+
+If the skill still reads correctly six months from now after every doc you didn't write got rewritten, it's durable.
+
+## Review checklist
+
+After drafting, verify:
+
+- [ ] Description includes triggers ("Use when…")
+- [ ] `SKILL.md` under 100 lines OR has split companion files
+- [ ] No time-sensitive info (no "as of 2026-04…")
+- [ ] Consistent terminology — drift kills clarity
+- [ ] Concrete examples included
+- [ ] Cross-references one level deep (don't chain `SKILL.md → REFERENCE.md → DEEP-DIVE.md → REFERENCE2.md`)
+- [ ] File layout follows [`agents-first-convention`](../../rules/agents-first-convention.md) (`.agents/` source + `.cursor/` symlink)
+- [ ] Tier choice documented per [`agents-tier-system`](../../rules/agents-tier-system.md); rule pairing added if the skill has `NEVER` / `ALWAYS` clauses
+- [ ] Skill listed in the appropriate tier section of `agents-tier-system.md`
+- [ ] Decision recorded in the PR description: maintainer-only (`.agents/` only) vs shipped (`templates/agents/` too)
diff --git a/.cursor/skills/diagnose b/.cursor/skills/diagnose
@@ -0,0 +1 @@
+../../.agents/skills/diagnose
diff --git a/.cursor/skills/write-a-skill b/.cursor/skills/write-a-skill
@@ -0,0 +1 @@
+../../.agents/skills/write-a-skill