Real ways loops fail — and how good design mitigates them. Use this when debugging a misbehaving loop or writing a new pattern.
| Severity | Meaning |
|---|---|
| S1 — Annoying | Wasted time/tokens, no user harm |
| S2 — Harmful | Wrong code merged, bad tickets, alert fatigue |
| S3 — Critical | Security, data loss, production incident |
Symptom: Same PR or CI job gets automated fix attempts 5+ times; never converges.
Severity: S2
Causes:
- Verifier too weak or same session as implementer
- Root cause misdiagnosed (symptom fixing)
- Flaky test treated as regression
Mitigations:
- Hard cap on attempts (e.g. 3) → escalate to human
- Separate verifier model / higher reasoning effort
- Classify flakes in triage; quarantine instead of code change
- Record attempt count in state file
Symptom: STATE.md references merged PRs, closed tickets, or stale branches.
Severity: S1 → S2 (loop acts on ghosts)
Causes:
- No prune step at end of run
- State file not read at start
- Multiple loops writing same file without schema
Mitigations:
- Prune closed/merged items every run
Last runtimestamp + validate IDs against live API- One state file per loop pattern, or clear sections
Symptom: Verifier "approves" but tests fail in CI or review finds obvious bugs.
Severity: S2
Causes:
- Verifier prompt too vague ("looks good")
- Verifier doesn't run tests
- Same model, same context as implementer
Mitigations:
- Verifier must run test/lint commands and report output
- Different instructions: "find reasons to reject"
- Stronger model on verifier for unattended loops
Symptom: Slack/email pings every 5 minutes; team mutes the bot.
Severity: S1 → S2 (real escalations missed)
Causes:
- Notify on every run, not every actionable finding
- Low bar for "high priority" in triage skill
Mitigations:
- Notify only when human decision required
- Digest mode for report-only loops
- Tighten triage "High Priority" rules
Symptom: Bill spikes; loop runs full sub-agent chains on empty or noisy triage.
Severity: S1
Causes:
- Sub-minute cadence with heavy sub-agents
- No early exit when watchlist empty
- Retrying entire pipeline on transient API errors
Mitigations:
- Cheaper triage-only pass first; spawn sub-agents only for actionable items
scheduler_deletewhen nothing to watch- Daily token budget → pause loop
- See operating-loops.md
Symptom: Loop refactors unrelated modules, "fixes" design issues, or touches denylisted paths.
Severity: S2 → S3
Causes:
- minimal-fix skill too permissive
- No path allowlist/denylist
- Triage puts architectural work in "High Priority"
Mitigations:
- safety.md denylist enforced in skills
- "Smallest possible diff" + verifier checks touched files
- Triage skill: signal only, no invention
Symptom: Velocity up, but no one can explain recent changes; review becomes rubber-stamp.
Severity: S2 (long-term)
Causes:
- Human stops reading loop output
- Auto-merge on growing allowlist
- No weekly human synthesis of loop actions
Mitigations:
- Mandatory human review for non-trivial PRs
- Weekly "loop digest" read by owner
- Cap auto-merge to truly trivial paths
Symptom: "The loop handles it" — no opinions on correctness or design.
Severity: S2 (cultural)
Causes:
- Loop success metric = volume, not quality
- No human gates on medium-risk work
Mitigations:
- Explicit human gates in every pattern
- Success metric: time saved with quality bar held
- Osmani: "Build it like someone who intends to stay the engineer"
Symptom: Two sub-agents edit same files; merge conflicts; corrupted state.
Severity: S2
Causes:
- No worktree isolation
- Two loops acting on same PR without coordination
Mitigations:
isolation: worktreefor all code-editing sub-agents- Lock or queue in state: "PR #1234 — worktree in progress"
Symptom: Loop stuck retrying; human never notified.
Severity: S2
Causes:
- Max attempts not implemented
- Escalation only writes to state no one reads
Mitigations:
- Connector ping on escalation (Slack, Linear comment)
High Priority (waiting on human)section in STATE.md- Alert if item in that section >24h
Have a story? Add a row via PR to this doc or open an issue with:
- Pattern name
- Symptom
- What mitigated it (or didn't)