
Commit 989a35b

Author: jgstern-agent (committed)
docs(playbooks): cruft audit + introduce cruft-audit-playbook
Apply the cruft audit findings (cruft_audit_05022026_0736.md):

- Delete TRACKER_SYNC_PENDING manual-cleanup workaround in process-validation-queue-with-bakeoffs-and-uat (WI-nutin / PR #3427 shipped fcntl.flock + auto-recovery; the workaround targets a problem that no longer exists).
- Delete historical NOTE about ADR-0025/0026 reclassification in fundamental-concept-audit (the renamed audit-findings 0001/0002 are referenced just above; the bucket-boundary rule lives in docs/adr/README.md).
- Trim the "now" temporal qualifier from the cycle-includes-reflect parenthetical in bakeoff-broad-priorities and bakeoff-deep-priorities.
- Trim the "The user's repeated pushback this session was a signal —" prefix on the over-filing anti-pattern bullet (session-anchored with no recoverable referent; the lesson alone is the teaching).
- Trim "Diminishing returns are real." from the changelog-audit budget bullet (a truism redundant with the concrete surrounding rules).
- Trim "of this session" from the KNOWN_LANGS hard rule (the recoverable shape is fully extracted in the rest of the sentence).

Introduce cruft-audit-playbook.md capturing the methodology used: two-pass (syntactic grep + semantic read) mediated by an interactive interview, a calibration loop, and the cruft / trim / not-cruft / doc-consistency taxonomy. Add a concise essentialization to AGENTS.md and register the playbook in on_transcript_change.py PLAYBOOKS.

Signed-off-by: jgstern-agent <josh-agent@iterabloom.com>
1 parent f711393 commit 989a35b

9 files changed

Lines changed: 197 additions & 14 deletions

.agent/agent_playbooks_protocols_sops_skills/bakeoff-broad-priorities.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
<!-- SPDX-License-Identifier: AGPL-3.0-or-later -->
### BROAD Mode Priority Queue:
-1. **Reflect on bakeoff results:** After each cycle, run `./scripts/bakeoff-broad-reflect` then `./scripts/bakeoff-broad-reflect aggregate` to synthesize findings. This is the primary feedback signal for coverage gaps. (`cycle` now includes reflect automatically; use `--skip-reflect` for fast iteration only.)
+1. **Reflect on bakeoff results:** After each cycle, run `./scripts/bakeoff-broad-reflect` then `./scripts/bakeoff-broad-reflect aggregate` to synthesize findings. This is the primary feedback signal for coverage gaps. (`cycle` includes reflect automatically; use `--skip-reflect` for fast iteration only.)
2. **Aggregate across sessions:** When prior sessions have reflect data, run `./scripts/bakeoff-broad-reflect aggregate` to surface cross-session trends. **Binary rule on a CONVERGED bakeoff:** if the tracker has any ready items, aggregate is NOT required — prefer tracker work and only return to aggregation after the backlog drains. Aggregation is only the natural next step when the bakeoff is not converged, or when it is converged AND the tracker is empty.
3. **Linkers:** post-call-graph-time edge recovery across four subcategories — Protocol (framework-agnostic pattern matching: HTTP URL / SQL / pub-sub topic / event name), Bridge (language-pair FFI and runtime-bridging conventions: JNI, wasm_bindgen, Tauri IPC, Cgo, pyffi, …), Framework (framework-specific dispatch: decorator registries, DI containers, ORM method dispatch, React JSX composition, middleware chains), and Infrastructure (structural utilities: containment, inheritance, module-import resolution). See [ADR-0003-ext: Linker Subcategory Restoration](../../docs/adr/0003-linker-subcategory-restoration.md) for the subcategory taxonomy and [docs/LINKERS.md](../../docs/LINKERS.md) for the 45-linker catalogue. Prioritise by expected false-positive-reduction volume on the current prospector corpus (INV-nimuj), not by novelty of language pair — within-language Framework-subcategory gaps empirically dominate cross-language gaps by ~10× in dead-code FP volume.
4. **Frameworks** (see `docs/FRAMEWORKS.md` for comprehensive list, 150+ frameworks): Pattern detection is YAML-driven metadata enrichment that tags symbols with concept metadata (`route`, `task`, `middleware`, `model`, …). Feeds Framework-subcategory linkers, which consume the tagged concepts and emit dispatch edges. Each new framework YAML typically pairs with one or more Framework-subcategory linkers.

.agent/agent_playbooks_protocols_sops_skills/bakeoff-deep-priorities.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
<!-- SPDX-License-Identifier: AGPL-3.0-or-later -->
### DEEP Mode Priority Queue:
When in DEEP mode, focus on feature quality rather than coverage breadth:
-1. **Reflect on bakeoff results:** After each cycle, run `./scripts/bakeoff-deep-reflect` then `./scripts/bakeoff-deep-reflect aggregate` to assess developer usefulness. This IS the mode's core feedback loop — reflecting on whether outputs help developers is the entire point of DEEP mode. (`cycle` now includes reflect automatically; use `--skip-reflect` for fast iteration only.)
+1. **Reflect on bakeoff results:** After each cycle, run `./scripts/bakeoff-deep-reflect` then `./scripts/bakeoff-deep-reflect aggregate` to assess developer usefulness. This IS the mode's core feedback loop — reflecting on whether outputs help developers is the entire point of DEEP mode. (`cycle` includes reflect automatically; use `--skip-reflect` for fast iteration only.)
2. **Aggregate across sessions:** Run `./scripts/bakeoff-deep-reflect aggregate --all` and `./scripts/bakeoff-deep compare <A> <B>` to track improvement trajectories. **Binary rule on a CONVERGED bakeoff:** if the tracker has any ready items, aggregate is NOT required — prefer tracker work and only return to aggregation after the backlog drains. Aggregation is only the natural next step when the bakeoff is not converged, or when it is converged AND the tracker is empty.
3. **Slice quality:** Does forward slice capture actual dependencies?
4. **Reverse slice:** Does it correctly identify callers?

.agent/agent_playbooks_protocols_sops_skills/changelog-audit-playbook.md

Lines changed: 1 addition & 1 deletion
@@ -113,7 +113,7 @@ This prevents drift — if you start editing without a plan, you risk reorganizi

This prevents context-window overload from holding a 200+ line section in working memory. Each edit is small, self-contained, and verifiable. If you need to move content between subsections (e.g., merging IO catalog items), do the deletion and insertion as two sequential edits.

-**Budget:** Spend no more than 3 rounds of organization edits per changelog. If the section still feels messy after 3 rounds, it's good enough. Diminishing returns are real. Phase 2 should take 10-15 minutes per changelog — if you've been editing for 20+ minutes on one, stop.
+**Budget:** Spend no more than 3 rounds of organization edits per changelog. If the section still feels messy after 3 rounds, it's good enough. Phase 2 should take 10-15 minutes per changelog — if you've been editing for 20+ minutes on one, stop.

### Guard Rails

Lines changed: 141 additions & 0 deletions
.agent/agent_playbooks_protocols_sops_skills/cruft-audit-playbook.md

@@ -0,0 +1,141 @@
<!-- SPDX-License-Identifier: AGPL-3.0-or-later -->
# Cruft Audit Playbook

A procedure for systematically removing dead text from prompts (AGENTS.md, playbooks, hook summaries) without trimming the content that's actually doing work. Cruft accumulates over time as transitions happen ("we used to X, now we Y"), workarounds outlive the bugs they worked around, and session-anchored references point at sessions readers can't access. The hard part is distinguishing dead text from prose that anchors a rule, supplies rationale, or teaches a hard-to-articulate skill.

## When to run

- **Periodic, ~quarterly.** Cruft accumulates silently — a calendar trigger surfaces it.
- **After a transition lands** that deprecates a workaround (e.g., a structural fix replacing a hand-cleanup procedure). Search the playbooks for "until X lands" / "until that ships" referring to the now-shipped X.
- **When a prompt feels stale.** Subjective signal but real — when reading a playbook produces "wait, is this still right?" more than once, audit it.
- **When the human says "is there cruft we could remove?"** Common request after a season of fast change.

NOT a substitute for the conceptual-leak audit (`what-if-we-dont-know-what-the-fuck-we-are-talking-about-audit`). Cruft audit asks "is this text dead?". Concept audit asks "is this category coherent?". Different question, different mode.

## The methodology

Two complementary passes, **both** mediated by an interactive interview with the human:

### Pass 1: Syntactic grep

Cheap, surfaces obvious candidates:

```bash
# History markers
grep -rn -E "\b(deprecated|previously|no longer|originally|formerly|legacy|obsolete)\b" \
  .agent/agent_playbooks_protocols_sops_skills/ AGENTS.md

# Temporal qualifiers (require word boundaries — many real "now"s exist in prose)
grep -rn -E "\bnow\b" .agent/agent_playbooks_protocols_sops_skills/ AGENTS.md

# Session-anchored references
grep -rn -E "this (session|PR|investigation|run)|earlier today" \
  .agent/agent_playbooks_protocols_sops_skills/ AGENTS.md

# Stale "until X" workarounds
grep -rn -E "until [^.]*(ships|lands|merges|fixes)" \
  .agent/agent_playbooks_protocols_sops_skills/ AGENTS.md

# FIXMEs / TODOs in prompt text
grep -rn -E "\b(FIXME|TODO|XXX)\b" .agent/agent_playbooks_protocols_sops_skills/ AGENTS.md
```

Verify that referenced things still exist / are still in their claimed state:
- Files referenced (`docs/X.md`, `scripts/Y`) — `test -f`
- Tracker items cited as "until <ID> lands" — `scripts/tracker show <ID>` to confirm status
- Configuration knobs and worker module paths — confirm they exist and are still referenced from the code
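The file-existence check can be scripted. A minimal sketch, assuming the audit runs from the repo root; the `verify_doc_refs` helper name is hypothetical, and the tracker lookup (repo-specific `scripts/tracker`) is only illustrated in a comment:

```shell
#!/bin/sh
# Sketch: check that every docs/*.md path mentioned in a playbook still exists.
verify_doc_refs() {
  playbook="$1"
  # Pull out referenced markdown paths like docs/LINKERS.md
  grep -o -E 'docs/[A-Za-z0-9_./-]+\.md' "$playbook" | sort -u | while read -r ref; do
    if [ -f "$ref" ]; then
      echo "OK $ref"
    else
      echo "MISSING $ref"
    fi
  done
  # Tracker items cited as "until <ID> lands" would be checked analogously, e.g.:
  #   grep -o -E 'WI-[a-z]+' "$playbook" | xargs -n1 scripts/tracker show
}
```

A `MISSING` line is only a candidate, not a verdict — the reference may be load-bearing prose about a planned file.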

### Pass 2: Semantic read

What the grep won't catch — overexplanation, dead anecdotes, hypothetical cases that no longer happen, fallbacks for files that now always exist. Read each playbook section and ask:

1. **Reachability.** In this repo's actual configuration, is this branch / option / fallback ever entered?
2. **Recoverable referent.** When the prose mentions "this session" / "the user said" / "earlier", does the reader have the content (quote, link, extracted pattern), or is it self-citation pointing at a session they can't access?
3. **Teaching content.** Does the prose extract a recoverable pattern shape, supply rationale, or describe a hard-to-articulate generative skill? Or is it pure historical record of a specific past event?
4. **Rule already encoded.** Does the surrounding concrete rule already deliver the lesson this prose is restating?

### Critical: interactive interview mediates both passes

Neither pass alone produces accurate verdicts. The same word is sometimes cruft and sometimes load-bearing — `(deprecated)` next to a tracker status is anchoring a term that still appears in live data; `(deprecated)` next to a feature that's been removed entirely is dead text. You cannot tell from the regex hit alone.

**Do not auto-apply syntactic-pass hits.** Surface them as candidates with full context (file path, line range, **5 lines of surrounding context**, why-flagged tag) and have the human classify each.
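Surfacing a candidate with its context block is a one-liner. A sketch, assuming GNU grep; the `surface_candidates` helper and the tag vocabulary are illustrative, not a fixed schema:

```shell
#!/bin/sh
# Sketch: print one candidate class with file:line anchors, 5 lines of
# surrounding context, and a why-flagged tag prefix for the interview.
surface_candidates() {
  pattern="$1"; tag="$2"; shift 2
  # -H forces the filename prefix, -n gives line anchors, -C 5 gives context
  grep -rnH -C 5 -E "$pattern" "$@" | sed "s/^/[$tag] /"
}

# Example invocation (paths are whatever the audit covers):
#   surface_candidates 'this (session|PR|investigation|run)' session-anchor \
#     .agent/agent_playbooks_protocols_sops_skills/ AGENTS.md
```

One tag per grep family (history-marker, temporal, session-anchor, until-x, todo) keeps the interview transcript sortable.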

## Calibration loop

The taxonomy below is the synthesis of one such session. The session itself was a calibration loop:

1. **Round 1: 5-7 candidates, no opinion from auditor.** The human classifies (`cruft` / `not cruft` / `explain`) and supplies the reasoning when classifying. The auditor updates its mental model.
2. **Round 2: 5-7 candidates, with auditor opinion + reasoning.** The human confirms or corrects. The auditor refines.
3. **Round 3 (optional): 5 more, with opinions.** If the hit rate is high (≥4/5 confirmed), proceed.
4. **Autonomous pass.** The auditor produces the full report applying the calibrated taxonomy.

Three procedural notes from the session:

- **Show 5 lines of surrounding context, not just the candidate.** A flagged sentence in isolation reads ambiguously; in context the verdict is often clear.
- **Read every word in the flagged area, not just the snippet.** In one round the auditor flagged a workaround for a fixed bug but missed the session-anchored prefix sitting two lines above the flagged content.
- **One candidate at a time during calibration.** Batches of 6 overwhelmed the human reviewer; one-at-a-time was the workable cadence.

## The calibrated taxonomy

### Cruft (delete)

- **Stale `until X lands` workarounds** where X has shipped. Verify by tracker lookup or by checking whether the bug being worked around can still be reproduced.
- **Workarounds for problems the tooling now handles automatically.** The skill being taught lives inside the code; the prose is redundant scaffolding.
- **Session-anchored references with no recoverable referent.** "The user's repeated pushback this session was a signal — every new item costs queue-management overhead." Drop the prefix; keep the lesson.
- **Pure historical records of specific past events with no extracted teaching shape.** "ADR-0025 and ADR-0026 were originally filed as ADRs in error and have been reclassified" — the renamed artifacts are referenced just above, the rule lives in another doc, the NOTE is pure history.

### Trim (one-word / short-phrase removal, surrounding content fine)

- **Stale temporal qualifiers** ("now", "currently", "still", "recently") that presuppose reader memory of a transition.
  - Example: "(`cycle` now includes reflect automatically; use `--skip-reflect` for fast iteration only.)" — drop "now"; the rest is current API affordance.
- **Session-anchor prefixes/parentheticals** when the surrounding sentence already extracts the recoverable pattern.
  - Example: "The most expensive mistake of this session was a hand-rolled `KNOWN_LANGS` set that omitted `jsonnet` and `rst`, producing 3,000+ false-flag invalid-language nodes." — the recoverable shape (KNOWN_LANGS / jsonnet+rst / 3,000+) is the teaching; the "of this session" qualifier is dead. Replace with de-anchored framing.
- **Truism reminders** that are redundant with concrete surrounding rules.
  - Example: a "Diminishing returns are real." sentence inside a budget rule that already says "good enough after 3 rounds, stop at 20 minutes."
  - Test: if the truism were removed, would the reader lose information? If not, trim. Truisms ARE valuable in isolation — they are trim candidates only when the concrete encoding is already present.

### Not cruft (keep)

- **Worked-example anchors** that survive future state changes. "first UAT, do not modify" remains accurate even after subsequent UAT campaigns ship.
- **Rationale or consequences beyond the rule.** "The full transcript lives on disk; you can search it freely; you never have to re-run the command" — repeats the rule reductively, but supplies *why* it matters and what an agent gains.
- **Concrete-situation illustrations.** Heuristic checklists where every bullet resolves to the same action are doing pattern-recognition work, not redundancy. "Will the output fit on one screen? If no, **redirect to a file**." The bullets prime the agent to recognize entry-shapes.
- **Live-data anchors.** Terms that still appear in extant tracker items, archives, or code — even when the term is "deprecated" in policy. The reader will encounter the term and need a referent.
- **Generative/teaching prose** for hard-to-articulate skills. "If `docs/blind-spots.md` does not yet exist, take 5 minutes to consider what the new frame *almost* assumes — what edge cases or alternative shapes the new structure makes harder to express." Even when the file currently exists, the fallback teaches *how* to do the activity de novo.
- **Fallbacks** for templatization / fork / fresh-clone scenarios — reachable under plausible future configurations.
- **Borderline cases default to keep.** When the cost is small (one phrase) and the case could go either way, the default is keep. The audit revisits next cycle.

### Doc consistency issues (separate category — only flag when likely to derail)

A numbering mismatch (overview says "seven phases", body has eight) is a doc bug, not cruft. Only flag it if (a) the inconsistency would meaningfully derail an agent reading the doc AND (b) the fix has bounded ripple effects. The cost-benefit usually doesn't favor making it a finding.

## What this audit does NOT cover

- **Conceptual leaks** — a single field smuggling unrelated information. That's the fundamental-concept-audit.
- **Stale rules in code/tooling** that the prose accurately describes — those are tooling work items, not prompt cruft.
- **Coverage gaps** in the playbook system — that's a different audit ("are we missing a playbook for X?").

## Output format

Produce an audit report (do not auto-apply). The report file lives at `~/hypergumbo_lab_notebook/cruft_audit_<MMDDYYYY_HHMM>.md`. Structure:

1. **Summary.** Coverage scope, count by verdict (cruft / trim / borderline / consistency).
2. **Methodology.** Brief restatement of the taxonomy applied.
3. **Findings.** Per-finding sections with:
   - File path and line range
   - Current text (verbatim, with surrounding context)
   - Why this is the verdict it got
   - Action (delete / specific replacement / no change)
   - "After" text where useful
4. **Items considered and rejected.** Representative not-cruft cases with the reasoning. Transparency about why these did NOT make the report protects against over-trimming on future runs.
5. **Aggregate scale.** Total lines audited / lines changed.
6. **Apply / hold.** The human decides whether to apply.
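The six-section skeleton can be scaffolded before the findings are written. A sketch under stated assumptions: the `new_cruft_report` helper name is hypothetical, and only the notebook path and filename pattern come from the playbook above:

```shell
#!/bin/sh
# Sketch: scaffold the report file with the six sections of the output format.
new_cruft_report() {
  notebook="${1:-$HOME/hypergumbo_lab_notebook}"
  mkdir -p "$notebook"
  # Filename pattern follows cruft_audit_<MMDDYYYY_HHMM>.md
  report="$notebook/cruft_audit_$(date +%m%d%Y_%H%M).md"
  cat > "$report" <<'EOF'
# Cruft audit report

## 1. Summary
<!-- coverage scope; counts by verdict: cruft / trim / borderline / consistency -->

## 2. Methodology
<!-- brief restatement of the taxonomy applied -->

## 3. Findings
<!-- per finding: path + line range, verbatim text with context, verdict reasoning, action, "after" text -->

## 4. Items considered and rejected
<!-- representative not-cruft cases and why they did NOT make the report -->

## 5. Aggregate scale
<!-- total lines audited / lines changed -->

## 6. Apply / hold
<!-- human decision -->
EOF
  echo "$report"
}
```

Printing the path on stdout lets the calling session capture it for the apply / hold step.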

When the human approves application, ship as a single PR touching only the playbooks involved. `.agent/**` is governance, so a governance approval is required for the apply step (the audit itself is read-only and does not need governance approval).

## Anti-patterns

- **Auto-applying syntactic-pass hits.** The same word is sometimes cruft, sometimes load-bearing. The grep surfaces candidates; only the interactive interview produces verdicts.
- **Skipping the calibration rounds.** Going straight to autonomous mode produces over-aggressive trimming. The first session used six examples to discover that "could be terser" is the wrong bar.
- **Reductive logical interpretation.** "These four bullets all resolve to the same action — they're redundant." In a prompt, redundancy that anchors pattern recognition is doing work the reductive read misses.
- **Trimming worked examples to abstract them.** "Specific repo names, dates, and PR numbers will rot" — true, but they're also what makes the rule recognizable. Strip the session anchor that points at unrecoverable context; keep the concrete shape.
- **Failing to verify referenced tracker IDs / file paths.** "Until WI-X lands" might be cruft if WI-X is closed, or load-bearing if it's open. You can't tell without checking.

.agent/agent_playbooks_protocols_sops_skills/process-validation-queue-with-bakeoffs-and-uat.md

Lines changed: 2 additions & 3 deletions
@@ -90,7 +90,7 @@ for repo in REPOS:
# Output: claim_id -> (verdict, evidence)
```

-**Hard rule: import from the actual codebase, do not hand-roll allowlists.** The most expensive mistake of this session was a hand-rolled `KNOWN_LANGS` set in the verification script that omitted `jsonnet` and `rst`, producing 3,000+ false-flag invalid-language nodes. Use `from hypergumbo_core.taxonomy import LANGUAGES` (or whatever is canonical for the property being checked). When the codebase changes, the verification script automatically tracks.
+**Hard rule: import from the actual codebase, do not hand-roll allowlists.** A real mistake we hit: a hand-rolled `KNOWN_LANGS` set in the verification script that omitted `jsonnet` and `rst`, producing 3,000+ false-flag invalid-language nodes. Use `from hypergumbo_core.taxonomy import LANGUAGES` (or whatever is canonical for the property being checked). When the codebase changes, the verification script automatically tracks.

The script's output is the basis for filling the YAML assessments in phase 5.

@@ -364,8 +364,7 @@ A single processing session can use both paths in parallel. The cohort path live
## Process anti-patterns to avoid

- **Trusting auto-pr / merge-pr state announcements without cross-check.** When the API was returning 5xx during polling, the script's "🔄 Closing PR" message can be wrong. Always confirm with `./scripts/ci-debug pr-status <num>` before reporting up. Existing INV-rahib invariant covers this on the tool side; agent-side discipline is to not parrot the script's state-changes verbatim.
-- **Over-filing tracker items.** When extending a discussion on an existing item would suffice, do that instead of spawning a new item. The user's repeated pushback this session was a signal — every new item costs queue-management overhead.
-- **Manual cleanup of `.git/TRACKER_SYNC_PENDING`.** This marker leaks on SIGKILL (e.g., when the reflect-aggregate step's 60s subprocess timeout fires). Per WI-nutin, the right structural fix is fcntl.flock; until that lands, recognize the symptom (auto-pr exits with "Error: tracker sync in progress") and check whether a sync process is actually running before deleting the marker. Don't make the manual cleanup step part of the routine.
+- **Over-filing tracker items.** When extending a discussion on an existing item would suffice, do that instead of spawning a new item. Every new item costs queue-management overhead.
- **Hand-rolled allowlists in verification scripts.** Drift from the codebase's canonical sources guarantees false flags. Always import from the canonical module (`from hypergumbo_core.taxonomy import LANGUAGES`).
- **`init`-for-every-iteration.** Validation re-runs of a cohort go in iter-002 of the same session, not a fresh init. The bakeoff-artifacts-guide playbook covers this; it's worth re-reading.
371370
