Commit be59faa
jgstern-agent
feat(supervisor): meta-circuit-breaker with chain tracking + kill switch (WI-mujuk)
Hardens the WI-razub supervisor against persistent-failure loops that
the existing 24h rate limit absorbs instead of catching. A broken
playbook, corrupt state file, or bad env var that makes every fresh
spawn die immediately produces this pattern under the old design:
Day 1: 8 useless spawns over ~8 min, silence for ~23h52m
Day 2: same
Day 3: same (invisible forever)
The kill switch converts that silent loop into a loud "investigate me"
state by refusing to spawn after N chained failures of a specific shape.
## Three pieces
**1. Chain-length tracking in meta.json (scaffolding).** Each newly
spawned session's `meta.json` records:
- `replaces`: session-id of the session it replaced (None for root)
- `chain_length`: 1 for root, prior + 1 on replacement
- `consecutive_no_progress`: running count, reset by progress
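
A minimal sketch of how these three fields could be derived, assuming the prior session's `meta.json` has already been loaded into a dict. The helper names `root_meta` and `replacement_meta` are hypothetical, as is the assumption that a root session starts with `consecutive_no_progress` at 0; only the field names come from this commit.

```python
from typing import Optional, TypedDict


class ChainMeta(TypedDict):
    replaces: Optional[str]          # session-id of the replaced session, None for root
    chain_length: int                # 1 for root, prior + 1 on replacement
    consecutive_no_progress: int     # running count, reset by progress


def root_meta() -> ChainMeta:
    """Chain fields for a session that replaces nothing (assumed starting count: 0)."""
    return {"replaces": None, "chain_length": 1, "consecutive_no_progress": 0}


def replacement_meta(prior: ChainMeta, prior_session_id: str, made_progress: bool) -> ChainMeta:
    """Chain fields for a session that replaces `prior_session_id`."""
    if made_progress:
        counter = 0  # progress replacement resets the running count
    else:
        counter = prior["consecutive_no_progress"] + 1
    return {
        "replaces": prior_session_id,
        "chain_length": prior["chain_length"] + 1,
        "consecutive_no_progress": counter,
    }
```

Keeping the derivation a pure function of the prior metadata means the chain logic can be exercised without tmux or a state directory.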
**2. No-progress failure classification (scaffolding).** At replacement
time, the dying session's tmux pane-byte count is the spawn-to-kill
delta (pane starts empty). If ≤ 512 bytes, the CLI produced nothing
visible → no-progress failure, increment counter. If > 512, real work
happened → progress replacement, reset counter to 0.
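
The classification rule above can be sketched as follows. The constant name and value match the commit; the function names `is_no_progress` and `next_counter` are hypothetical and not necessarily what the supervisor uses.

```python
NO_PROGRESS_PANE_BYTES = 512  # threshold named in this commit


def is_no_progress(pane_bytes: int) -> bool:
    """True when the dying session produced at most the threshold in pane output."""
    return pane_bytes <= NO_PROGRESS_PANE_BYTES


def next_counter(current: int, pane_bytes: int) -> int:
    """Increment on a no-progress failure, reset to 0 on a progress replacement."""
    return current + 1 if is_no_progress(pane_bytes) else 0
```

Note that the boundary is inclusive: exactly 512 bytes still counts as no progress, and 513 is the first value treated as real work.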
**3. Consecutive-failure kill switch (the actual signal).** When a
replacement would push `consecutive_no_progress` to the threshold
(default 5), the supervisor writes `supervisor.auto-paused` and
refuses all future spawns until the operator runs the new
`agent-supervisor resume` subcommand. The dying session is still
killed; we just don't start another.
Time-agnostic by design — 5 failures over 24h trip it identically
to 5 failures in 5 minutes. A persistent bug trips it regardless of
cadence.
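
A rough sketch of the sentinel mechanics, assuming the sentinel is a plain file inside a state directory. The `state_dir` parameter, the file contents, and the function signatures are assumptions made for illustration; the commit confirms only the sentinel name, the threshold constant, and the existence of `auto_paused()` and the `resume` subcommand.

```python
from pathlib import Path

CONSECUTIVE_NO_PROGRESS_KILL_SWITCH = 5   # default threshold from this commit
SENTINEL_NAME = "supervisor.auto-paused"  # sentinel file named in this commit


def auto_paused(state_dir: Path) -> bool:
    """True while the sentinel file exists (hypothetical signature)."""
    return (state_dir / SENTINEL_NAME).exists()


def maybe_trip_kill_switch(
    state_dir: Path,
    next_count: int,
    threshold: int = CONSECUTIVE_NO_PROGRESS_KILL_SWITCH,
) -> bool:
    """Write the sentinel if the replacement would reach the threshold.

    No timestamps are consulted, which is what makes the check time-agnostic.
    """
    if next_count >= threshold:
        # File contents are an assumption; only the file's existence matters here.
        (state_dir / SENTINEL_NAME).write_text(f"consecutive_no_progress={next_count}\n")
        return True
    return False


def resume(state_dir: Path) -> bool:
    """Clear the sentinel; returns False (a no-op) if the supervisor was not paused."""
    sentinel = state_dir / SENTINEL_NAME
    if sentinel.exists():
        sentinel.unlink()
        return True
    return False
```

Using a file rather than in-memory state means the pause survives supervisor restarts, so a crash-looping supervisor cannot accidentally clear its own kill switch.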
## What this adds to status / CLI
- `status` JSON gains `auto_paused`, `kill_switch_threshold`, and
per-session `chain_length` / `consecutive_no_progress` / `replaces`.
- New `agent-supervisor resume` subcommand: clears the sentinel,
records an operator-driven clear in `respawn_log.log` so it's
distinguishable from a cold start.
- New constants `NO_PROGRESS_PANE_BYTES=512` and
`CONSECUTIVE_NO_PROGRESS_KILL_SWITCH=5`.
- `poll_once` and `spawn_fresh` both check `auto_paused()` and
short-circuit when it's true.
- `replace_session` now captures pane-bytes BEFORE killing the
session so the classification is correct even on a hard-kill path.
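
The capture-before-kill ordering in the last bullet can be illustrated with injected callables, so the sketch runs without tmux. The function name `replace_session_sketch` and its parameters are hypothetical; the fallback to 0 bytes on a capture failure follows the behavior described in the tests section of this commit.

```python
from typing import Callable


def replace_session_sketch(
    session_id: str,
    capture_pane_bytes: Callable[[str], int],
    kill: Callable[[str], None],
) -> int:
    """Measure pane output first, then kill; return the byte count to classify."""
    try:
        pane_bytes = capture_pane_bytes(session_id)
    except Exception:
        # A session we cannot read from is treated as having produced nothing.
        pane_bytes = 0
    # Kill only after measuring: once the session is gone the pane cannot be read.
    kill(session_id)
    return pane_bytes
```

In the real supervisor the two callables would presumably wrap tmux invocations; passing them in keeps the ordering testable in isolation.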
## Precedence ordering (tested explicitly)
autonomous_intent=OFF > attached_client > auto_paused > rate_limit
An OFF intent short-circuits the supervisor entirely (no wasted work
while disabled). An attached human prevents replacement, which
prevents the chain from growing, which prevents auto-pause. This is
by design: a human watching is a human diagnosing, so we don't kill
their workspace.
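
One way to read the precedence ordering is as a chain of early returns, checked from highest to lowest priority. The function name, parameter names, and returned labels below are illustrative assumptions, not the supervisor's actual API.

```python
def supervisor_decision(
    intent_on: bool,
    client_attached: bool,
    paused: bool,
    rate_limited: bool,
) -> str:
    """Return the first blocking condition in precedence order, else 'proceed'."""
    if not intent_on:
        return "intent_off"        # autonomous_intent=OFF wins over everything
    if client_attached:
        return "attached_client"   # never replace a session a human is watching
    if paused:
        return "auto_paused"       # kill switch has tripped
    if rate_limited:
        return "rate_limit"        # existing 24h rate limit
    return "proceed"
```

For example, `supervisor_decision(True, True, True, True)` returns `"attached_client"`, reflecting that an attached human takes precedence over both auto-pause and the rate limit.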
## Dropped from the original proposal
The earlier WI-mujuk design had a fourth piece: short-window fast-fire
cooldown ("3+ spawns in 10 min → cooldown 30 min"). Dropped after
analysis showed (a) the kill switch fires on the chain counter BEFORE
a cooldown would trigger in a real loop (the sequence completes in
~5 min at 60s poll), and (b) a 30-min cooldown is a SHORTER breather
than the existing 24h rate limit, so the cooldown would allow MORE
resource waste per day, not less. Documented in the tracker item's
"Explicitly NOT doing" section.
## Tests (24 new, all 71 pass combined with existing 47)
- Threshold classification: 0 / 100 / 512 / 513 / 4096 bytes.
- Chain tracking: root spawn, replaces pointer, counter increment on
no-progress, counter reset on progress.
- Kill switch: fires on 5th consecutive no-progress, not on 4th,
time-agnostic, interrupted by progress replacement.
- Auto-paused blocks `spawn_fresh` and `poll_once`.
- `autonomous_intent=OFF` short-circuits before meta-breaker logic.
- Attached client prevents replacement (and thus auto-pause) even on
a chain at count=4.
- `resume` subcommand clears sentinel (subprocess smoke test).
- `resume` on non-paused supervisor is a no-op.
- Capture-pane failure treated as 0 bytes (a session we can't read
from definitely isn't making progress).
Docs updated: new "Recovering from auto-pause" section in
`docs/agent-supervisor.md` with the recommended investigation
recipe before running `resume`, and the status-output + state-dir
+ edge-case sections incorporate the new fields / files.
Implements WI-mujuk-gadum-lulog-dijiz-lomap-vorar-tudat-lusop.
Signed-off-by: jgstern-agent <josh-agent@iterabloom.com>
1 parent c44410b commit be59faa
5 files changed
Lines changed: 698 additions & 14 deletions
File tree
- .ci
- docs
- scripts
- tests