Failure Mode Catalog

Real ways loops fail — and how good design mitigates them. Use this when debugging a misbehaving loop or writing a new pattern.

Classification

Severity	Meaning
S1 — Annoying	Wasted time/tokens, no user harm
S2 — Harmful	Wrong code merged, bad tickets, alert fatigue
S3 — Critical	Security, data loss, production incident

Infinite Fix Loop

Symptom: Same PR or CI job gets automated fix attempts 5+ times; never converges.

Severity: S2

Causes:

Verifier too weak or same session as implementer
Root cause misdiagnosed (symptom fixing)
Flaky test treated as regression

Mitigations:

Hard cap on attempts (e.g. 3) → escalate to human
Separate verifier model / higher reasoning effort
Classify flakes in triage; quarantine instead of code change
Record attempt count in state file

State Rot

Symptom: STATE.md references merged PRs, closed tickets, or stale branches.

Severity: S1 → S2 (loop acts on ghosts)

Causes:

No prune step at end of run
State file not read at start
Multiple loops writing same file without schema

Mitigations:

Prune closed/merged items every run
Last run timestamp + validate IDs against live API
One state file per loop pattern, or clear sections

Verifier Theater

Symptom: Verifier "approves" but tests fail in CI or review finds obvious bugs.

Severity: S2

Causes:

Verifier prompt too vague ("looks good")
Verifier doesn't run tests
Same model, same context as implementer

Mitigations:

Verifier must run test/lint commands and report output
Different instructions: "find reasons to reject"
Stronger model on verifier for unattended loops

Notification Fatigue

Symptom: Slack/email pings every 5 minutes; team mutes the bot.

Severity: S1 → S2 (real escalations missed)

Causes:

Notify on every run, not every actionable finding
Low bar for "high priority" in triage skill

Mitigations:

Notify only when human decision required
Digest mode for report-only loops
Tighten triage "High Priority" rules

Token Burn

Symptom: Bill spikes; loop runs full sub-agent chains on empty or noisy triage.

Severity: S1

Causes:

Sub-minute cadence with heavy sub-agents
No early exit when watchlist empty
Retrying entire pipeline on transient API errors

Mitigations:

Cheaper triage-only pass first; spawn sub-agents only for actionable items
scheduler_delete when nothing to watch
Daily token budget → pause loop
See operating-loops.md

Over-Reach (Wrong Scope)

Symptom: Loop refactors unrelated modules, "fixes" design issues, or touches denylisted paths.

Severity: S2 → S3

Causes:

minimal-fix skill too permissive
No path allowlist/denylist
Triage puts architectural work in "High Priority"

Mitigations:

safety.md denylist enforced in skills
"Smallest possible diff" + verifier checks touched files
Triage skill: signal only, no invention

Comprehension Debt Spiral

Symptom: Velocity up, but no one can explain recent changes; review becomes rubber-stamp.

Severity: S2 (long-term)

Causes:

Human stops reading loop output
Auto-merge on growing allowlist
No weekly human synthesis of loop actions

Mitigations:

Mandatory human review for non-trivial PRs
Weekly "loop digest" read by owner
Cap auto-merge to truly trivial paths

Cognitive Surrender

Symptom: "The loop handles it" — no opinions on correctness or design.

Severity: S2 (cultural)

Causes:

Loop success metric = volume, not quality
No human gates on medium-risk work

Mitigations:

Explicit human gates in every pattern
Success metric: time saved with quality bar held
Osmani: "Build it like someone who intends to stay the engineer"

Parallel Collision

Symptom: Two sub-agents edit same files; merge conflicts; corrupted state.

Severity: S2

Causes:

No worktree isolation
Two loops acting on same PR without coordination

Mitigations:

isolation: worktree for all code-editing sub-agents
Lock or queue in state: "PR #1234 — worktree in progress"

Escalation Failure

Symptom: Loop stuck retrying; human never notified.

Severity: S2

Causes:

Max attempts not implemented
Escalation only writes to state no one reads

Mitigations:

Connector ping on escalation (Slack, Linear comment)
High Priority (waiting on human) section in STATE.md
Alert if item in that section >24h

Contributing Failures

Have a story? Add a row via PR to this doc or open an issue with:

Pattern name
Symptom
What mitigated it (or didn't)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failure Mode Catalog

Classification

Infinite Fix Loop

State Rot

Verifier Theater

Notification Fatigue

Token Burn

Over-Reach (Wrong Scope)

Comprehension Debt Spiral

Cognitive Surrender

Parallel Collision

Escalation Failure

Contributing Failures

FilesExpand file tree

failure-modes.md

Latest commit

History

failure-modes.md

File metadata and controls

Failure Mode Catalog

Classification

Infinite Fix Loop

State Rot

Verifier Theater

Notification Fatigue

Token Burn

Over-Reach (Wrong Scope)

Comprehension Debt Spiral

Cognitive Surrender

Parallel Collision

Escalation Failure

Contributing Failures