Commit be59faa

jgstern-agent committed
feat(supervisor): meta-circuit-breaker with chain tracking + kill switch (WI-mujuk)
Hardens the WI-razub supervisor against persistent-failure loops that the existing 24h rate limit absorbs instead of catching. A broken playbook, corrupt state file, or bad env var that makes every fresh spawn die immediately produces this pattern under the old design:

    Day 1: 8 useless spawns over ~8 min, silence for ~23h52m
    Day 2: same
    Day 3: same (invisible forever)

The kill switch converts that silent loop into a loud "investigate me" state by refusing to spawn after N chained failures of a specific shape.

## Three pieces

**1. Chain-length tracking in meta.json (scaffolding).** Every replaced session's new `meta.json` records:

- `replaces`: session-id of the session it replaced (None for root)
- `chain_length`: 1 for root, prior + 1 on replacement
- `consecutive_no_progress`: running count, reset by progress

**2. No-progress failure classification (scaffolding).** At replacement time, the dying session's tmux pane-byte count is the spawn-to-kill delta (pane starts empty). If ≤ 512 bytes, the CLI produced nothing visible → no-progress failure, increment counter. If > 512, real work happened → progress replacement, reset counter to 0.

**3. Consecutive-failure kill switch (the actual signal).** When a replacement would push `consecutive_no_progress` to the threshold (default 5), the supervisor writes `supervisor.auto-paused` and refuses all future spawns until the operator runs the new `agent-supervisor resume` subcommand. The dying session is still killed; we just don't start another.

Time-agnostic by design: 5 failures over 24h trip it identically to 5 failures in 5 minutes. A persistent bug trips it regardless of cadence.

## What this adds to status / CLI

- `status` JSON gains `auto_paused`, `kill_switch_threshold`, and per-session `chain_length` / `consecutive_no_progress` / `replaces`.
- New `agent-supervisor resume` subcommand: clears the sentinel, records an operator-driven clear in `respawn_log.log` so it's distinguishable from a cold start.
- New constants `NO_PROGRESS_PANE_BYTES=512` and `CONSECUTIVE_NO_PROGRESS_KILL_SWITCH=5`.
- `poll_once` and `spawn_fresh` both check `auto_paused()` and short-circuit when it's true.
- `replace_session` now captures pane-bytes BEFORE killing the session so the classification is correct even on a hard-kill path.

## Precedence ordering (tested explicitly)

    autonomous_intent=OFF > attached_client > auto_paused > rate_limit

An OFF intent short-circuits the supervisor entirely (no wasted work while disabled). An attached human prevents replacement, which prevents the chain from growing, which prevents auto-pause. This is by design: a human watching is a human diagnosing, don't kill their workspace.

## Dropped from the original proposal

The earlier WI-mujuk design had a fourth piece: short-window fast-fire cooldown ("3+ spawns in 10 min → cooldown 30 min"). Dropped after analysis showed (a) the kill switch fires on the chain counter BEFORE a cooldown would trigger in a real loop (the sequence completes in ~5 min at 60s poll), and (b) a 30-min cooldown is a SHORTER breather than the existing 24h rate limit, so the cooldown would allow MORE resource waste per day, not less. Documented in the tracker item's "Explicitly NOT doing" section.

## Tests (24 new, all 71 pass combined with existing 47)

- Threshold classification: 0 / 100 / 512 / 513 / 4096 bytes.
- Chain tracking: root spawn, replaces pointer, counter increment on no-progress, counter reset on progress.
- Kill switch: fires on 5th consecutive no-progress, not on 4th, time-agnostic, interrupted by progress replacement.
- Auto-paused blocks `spawn_fresh` and `poll_once`.
- `autonomous_intent=OFF` short-circuits before meta-breaker logic.
- Attached client prevents replacement (and thus auto-pause) even on a chain at count=4.
- `resume` subcommand clears sentinel (subprocess smoke test).
- `resume` on non-paused supervisor is a no-op.
- Capture-pane failure treated as 0 bytes (a session we can't read from definitely isn't making progress).

Docs updated: new "Recovering from auto-pause" section in `docs/agent-supervisor.md` with the recommended investigation recipe before running `resume`, and the status-output + state-dir + edge-case sections incorporate the new fields / files.

Implements WI-mujuk-gadum-lulog-dijiz-lomap-vorar-tudat-lusop.

Signed-off-by: jgstern-agent <josh-agent@iterabloom.com>
1 parent c44410b commit be59faa
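
The three pieces in the commit message (chain tracking, no-progress classification, kill switch) can be sketched roughly as below. This is an illustration only, not the code in `scripts/agent-supervisor`: the `ChainMeta` dataclass and the `is_no_progress` / `plan_replacement` helpers are hypothetical names, and the sketch assumes the running counter is carried on the successor's metadata. Only the two constants and the three `meta.json` field names come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Constants named in the commit message.
NO_PROGRESS_PANE_BYTES = 512
CONSECUTIVE_NO_PROGRESS_KILL_SWITCH = 5


@dataclass
class ChainMeta:
    """Hypothetical container for the chain fields stored in meta.json."""
    session_id: str
    replaces: Optional[str] = None      # None for a root session
    chain_length: int = 1               # 1 for root, prior + 1 on replacement
    consecutive_no_progress: int = 0    # reset to 0 by a progress replacement


def is_no_progress(pane_bytes: int) -> bool:
    """At or below the byte threshold, the dying session counts as a no-progress failure."""
    return pane_bytes <= NO_PROGRESS_PANE_BYTES


def plan_replacement(
    dying: ChainMeta, pane_bytes: int, new_session_id: str
) -> Tuple[Optional[ChainMeta], bool]:
    """Return (successor_meta, auto_pause).

    The successor is None when the kill switch trips; the dying session
    would still be killed, but no new one is started.
    """
    if is_no_progress(pane_bytes):
        counter = dying.consecutive_no_progress + 1
    else:
        counter = 0  # real work happened, so the chain is healthy again
    if counter >= CONSECUTIVE_NO_PROGRESS_KILL_SWITCH:
        return None, True
    successor = ChainMeta(
        session_id=new_session_id,
        replaces=dying.session_id,
        chain_length=dying.chain_length + 1,
        consecutive_no_progress=counter,
    )
    return successor, False
```

Note that no timestamps appear anywhere in the decision, which is what makes the switch time-agnostic: five failures trip it whether they span five minutes or five days. Under the commit's rule that a capture-pane failure counts as 0 bytes, a caller would pass `pane_bytes=0` in that case.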

5 files changed

Lines changed: 698 additions & 14 deletions

.ci/affected-tests.txt

Lines changed: 2 additions & 2 deletions
@@ -1,9 +1,9 @@
 # Test selection manifest
-# Generated by smart-test at 2026-04-18T04:36:14-04:00
+# Generated by smart-test at 2026-04-18T05:29:04-04:00
 # Mode: targeted
 # Baseline: c01512b82b712fcdf8352f9a9f487d9c624927c8
 # Reason: no Python source files changed
-# Changed files: 10
+# Changed files: 13
 # Changed source files: 0
 # Selected tests: 0
 #

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -12,6 +12,8 @@ This changelog tracks the **tool version** (package releases). The **schema vers

 ### Added

+- **Agent-supervisor meta-circuit-breaker: chain tracking + no-progress kill switch** (WI-mujuk, hardens WI-razub): `scripts/agent-supervisor` now classifies every replacement as a **no-progress failure** (dying session's pane-byte count ≤ 512 — the CLI produced nothing visible between spawn and kill) or a **progress replacement** (> 512), tracks a chain via new `replaces` / `chain_length` / `consecutive_no_progress` fields in each session's `meta.json`, and **auto-pauses** after 5 consecutive no-progress failures on the same chain by writing `supervisor.auto-paused` into the state dir. Auto-paused supervisors refuse all new spawns (both the poll loop and direct `spawn_fresh` calls) until the operator runs the new `agent-supervisor resume` subcommand, which clears the sentinel and lets the next poll tick start a fresh chain (`chain_length=1`). The kill switch is deliberately **time-agnostic**: 5 consecutive no-progress failures trigger it the same whether they happened in 5 minutes or 5 days — a persistent bug absorbed by the existing 24h rate limit (8 useless spawns every morning, silence the rest of the day, invisible forever) now converts into a loud "investigate me" state. Progress replacements reset the counter to 0, so a chain that eventually makes headway is NOT at risk of auto-pause. `agent-supervisor status` surfaces `auto_paused`, `kill_switch_threshold`, and per-session `chain_length` / `consecutive_no_progress` / `replaces` so operators can see how close each chain is to tripping. Attached-client check still takes precedence: a human watching prevents replacement, which prevents the chain from growing — auto-pause can't fire on a session a human is diagnosing. Also: `respawn_log.log` records `AUTO-PAUSED: …` lines with the chain tail so retrospective audits can identify the failure pattern, and the new `resume` subcommand records the operator-driven clear so it's distinguishable from a cold start. 24 new tests cover threshold classification (0 / 100 / 512 / 513 / 4096 bytes), chain propagation across replacements, counter-reset on progress replacement, time-agnostic triggering, `autonomous_intent=OFF` short-circuit precedence, attached-client precedence, and the CLI subcommand's subprocess behavior. `docs/agent-supervisor.md` now has a "Recovering from auto-pause" section with the recommended investigation recipe before running `resume`.
+
 - **`docs/agent-supervisor.md` operator guide** (follow-up to WI-razub): net-new user-facing doc covering the `scripts/agent-supervisor` daemon's operator workflow — prerequisites, first-time setup (`loop-toggle DEEP` + `agent-supervisor run &`), daily operations (attach / detach / pause / resume / shutdown), `status` JSON field semantics, edge cases (two supervisors, human attached, rate-limited, crashed, missing tmux), a troubleshooting matrix, the state-directory layout, an explicit "what the supervisor does NOT do" list, and cross-references to AGENTS.md + the script docstring. Fills the documentation gap the design doc left — the WI-razub "Concrete user UX" section lived in the tracker thread, not anywhere a workstation operator would find it. Linked from the main `README.md` Links section.

 - **Vendor Parity for Respawn AGENTS.md section** (WI-batob, sub-item of WI-razub respawn mechanism): new authoritative table in `AGENTS.md` documenting, for each of Claude Code / Codex CLI / Cursor / Gemini CLI, the per-turn hook path (WI-sipov heartbeat), the session-start hook path (WI-sakod respawn branch), the graceful-exit keystroke the supervisor sends via `tmux send-keys`, the non-interactive CLI invocation for `tmux new-session`, and any vendor-specific quirks. Verification status is explicit: Claude Code's `/quit` is verified; the other three are marked "unverified — FIXME WI-batob" with a documented verification procedure (throwaway tmux session, send the keystroke, confirm the CLI process exits within 30s). Adding a new vendor requires four coordinated changes in the same PR (table row, `VENDOR_TABLE` entry in `scripts/agent-supervisor`, per-turn hook sourcing `touch_heartbeat.sh`, session-start hook sourcing `session_start_logic.sh`); the existing structural-guard tests in `tests/test_touch_heartbeat.py` and `tests/test_session_start_respawn.py` will fire if any hook wire-up is missed. Completes the last open sub-item of WI-razub.
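
The precedence that the commit message states (`autonomous_intent=OFF > attached_client > auto_paused > rate_limit`) and that this changelog entry relies on amounts to a fixed order of checks in which the first true condition wins. A minimal sketch follows, assuming each condition has already been reduced to a boolean. The `Gate` enum and `first_blocking_gate` function are hypothetical names for illustration and do not appear in the commit.

```python
from enum import Enum


class Gate(Enum):
    """Hypothetical labels for the reason the supervisor declines to act."""
    INTENT_OFF = "autonomous_intent=OFF"
    ATTACHED_CLIENT = "attached_client"
    AUTO_PAUSED = "auto_paused"
    RATE_LIMITED = "rate_limit"
    PROCEED = "proceed"


def first_blocking_gate(
    intent_off: bool,
    client_attached: bool,
    auto_paused: bool,
    rate_limited: bool,
) -> Gate:
    """Evaluate the gates in the documented precedence order.

    Earlier gates shadow later ones, so an attached human is reported as
    the reason even when the supervisor is also auto-paused or rate-limited.
    """
    if intent_off:
        return Gate.INTENT_OFF
    if client_attached:
        return Gate.ATTACHED_CLIENT
    if auto_paused:
        return Gate.AUTO_PAUSED
    if rate_limited:
        return Gate.RATE_LIMITED
    return Gate.PROCEED
```

For example, `first_blocking_gate(False, True, True, True)` yields `Gate.ATTACHED_CLIENT`. The real supervisor may apply these checks at different points (for instance, the attached-client check concerns replacement of an existing session), so treat this as a summary of the ordering rather than of the control flow.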

docs/agent-supervisor.md

Lines changed: 38 additions & 4 deletions
@@ -80,6 +80,34 @@ Prefer the narrow form if you want to temporarily disable autonomous mode on *ju

 The running daemon consumes the sentinel on its next poll tick (≤ 60 s) and exits cleanly. Your live CLIs keep running until you close them; the supervisor just stops respawning. Re-arm with another `agent-supervisor run &` whenever you come back.

+## Recovering from auto-pause (WI-mujuk meta-circuit-breaker)
+
+If the supervisor detects **5 consecutive no-progress failures on the same chain** — meaning it spawned a session, that session died without rendering anything useful, it spawned another, same result, and this happened five times in a row — it writes `supervisor.auto-paused` into its state dir and stops spawning entirely. Running `agent-supervisor status` will show `"auto_paused": true`.
+
+This is a load-bearing signal, not a rate limit: a persistent bug (broken playbook, corrupt state file, bad env var, session-start hook crashing) would otherwise burn through the 24h rate-limit budget every day forever, invisible from the outside. Auto-pause converts that silent loop into a loud "investigate me" state.
+
+Before clearing the pause, find out what went wrong:
+
+```bash
+./scripts/agent-supervisor status | jq '.sessions[] | {session, chain_length, consecutive_no_progress, pane_bytes}'
+tail -20 ~/hypergumbo_lab_notebook/agent-supervisor/respawn_log.log
+```
+
+The log's last few lines show the chain tail: which sessions died, whether each was classified no-progress (pane ≤ 512 bytes at kill) or progress, and the `AUTO-PAUSED: N consecutive no-progress failures...` entry. Common root causes:
+
+- Session-start hook crashing immediately → CLI dies before printing anything.
+- `HYPERGUMBO_RESPAWN=1` branch of `session_start_logic.sh` failing → `loop-toggle` call errors.
+- `autonomous_intent.txt` pointing at a mode whose bakeoff directory is missing.
+- Vendor CLI not installed / no longer on `$PATH`.
+
+Once you've fixed the underlying issue, resume:
+
+```bash
+./scripts/agent-supervisor resume
+```
+
+This removes the sentinel and the next poll tick will spawn a fresh chain (`chain_length=1`, `consecutive_no_progress=0`). The respawn log records the operator-driven resume so audits can tell it apart from a cold start.
+
 ## `status` output

 ```bash
@@ -90,19 +118,23 @@ Returns a JSON object with:

 - `intent` — current value of `autonomous_intent.txt`.
 - `rate_limit` — rolling 24h spawn count, the cap (default 8), and whether a spawn is currently allowed.
-- `sessions[]` — one entry per hypergumbo-prefixed tmux session, with `meta` (the stored session-id / CLI pid / vendor / start UTC), `clients_attached`, `pane_bytes` (raw scrollback size in bytes), and `heartbeat_age_sec` (seconds since the per-turn hooks last touched the heartbeat file).
+- `sessions[]` — one entry per hypergumbo-prefixed tmux session, with `meta` (the stored session-id / CLI pid / vendor / start UTC / `replaces` / `chain_length` / `consecutive_no_progress`), `clients_attached`, `pane_bytes` (raw scrollback size in bytes), `heartbeat_age_sec` (seconds since the per-turn hooks last touched the heartbeat file), plus top-level `chain_length`, `consecutive_no_progress`, and `replaces` fields lifted out of `meta` for convenience.
 - `stop_requested` — true if a stop sentinel is in flight.
+- `auto_paused` — true when the WI-mujuk kill switch has fired; clear with `agent-supervisor resume`.
+- `kill_switch_threshold` — the value of `CONSECUTIVE_NO_PROGRESS_KILL_SWITCH` (default 5) so tooling can compare against `consecutive_no_progress` without hardcoding the constant.

-Use `pane_bytes` + `heartbeat_age_sec` together to debug "is this session actually working?" — if pane bytes haven't grown but the heartbeat is fresh, the CLI is stuck in a tool that's not emitting output. If both are stale, the CLI itself is frozen.
+Use `pane_bytes` + `heartbeat_age_sec` together to debug "is this session actually working?" — if pane bytes haven't grown but the heartbeat is fresh, the CLI is stuck in a tool that's not emitting output. If both are stale, the CLI itself is frozen. And use `consecutive_no_progress` + `chain_length` to tell "this is a fresh chain trying to start" (small, near zero) apart from "this chain is in trouble and close to auto-pause" (approaching `kill_switch_threshold`).

 ## Edge cases

 - **Two supervisors for the same project.** The second `agent-supervisor run` invocation fails `fcntl.flock` acquisition on `supervisor.lock` and exits with "another supervisor is already running". This is the enforced single-instance invariant; don't work around it.
 - **You want to run a vendor CLI by hand.** Either launch it in a tmux session whose name does NOT start with `hypergumbo-session-` (the supervisor will ignore it entirely), or `loop-toggle OFF` first and it won't get touched.
 - **Rate-limited.** If the supervisor has spawned 8 sessions in the last 24 hours (default soft cap), the next spawn is skipped with a log entry in `respawn_log.log` instead of proceeding. Fix the underlying problem — pounding on the spawn button would indicate a deeper issue.
+- **Auto-paused (WI-mujuk).** After 5 consecutive no-progress failures on the same chain, the supervisor writes `supervisor.auto-paused` and stops spawning. See "Recovering from auto-pause" above.
 - **Supervisor crashes.** Nothing gets auto-spawned until you restart it with `agent-supervisor run &`. The daemon is not self-restarting by design.
 - **Tmux is not installed.** The `run_subprocess` seam returns rc=127 for every tmux call, so `status` works and reports zero sessions. The `run` loop no-ops each tick. Install tmux to unstick.
 - **CLI refuses graceful exit.** The supervisor polls `kill -0 <cli_pid>` for 30 seconds after sending the vendor exit keystroke. If the CLI is still alive, it falls back to `tmux kill-session` + direct invocation of `kill-transcript-sync.sh` / `rotate-on-session-end.sh` (the per-session cleanup scripts are already idempotent). An entry appears in `respawn_log.log` as `forced-kill fallback for session <name>`.
+- **Human attached during a chain close to auto-pause.** The attached-client check takes precedence over the kill switch — an attached session is never replaced, so its chain can't grow, so auto-pause can't fire on it. This is deliberate: a human watching is a human diagnosing. Detach to let the chain progress to its natural outcome.

 ## Troubleshooting

@@ -113,16 +145,18 @@ Use `pane_bytes` + `heartbeat_age_sec` together to debug "is this session actual
 | `respawn_log.log` shows repeated "rate-limit reached" | 8 spawns in 24h — usually indicates a loop somewhere upstream | Read the log tail + `agent_notes.json` for a pattern; don't just raise the cap |
 | Fresh CLI launches but doesn't enable autonomous mode | `autonomous_intent.txt` is OFF or missing | `loop-toggle --set-intent DEEP` (narrow-write, doesn't touch the current session) |
 | Fresh CLI launches but the session-start hook doesn't inject the seed prompt | Vendor's hook file missing or unwired | Verify `.agent/hooks/<vendor>/session-start.sh` exists and sources `_shared/session_start_logic.sh` |
+| `status` shows `auto_paused: true` | Kill switch fired after 5 consecutive no-progress failures | See "Recovering from auto-pause" above; investigate log tail, fix root cause, run `agent-supervisor resume` |

 ## State directory layout

 `~/hypergumbo_lab_notebook/agent-supervisor/` (override with `AGENT_SUPERVISOR_STATE_DIR`):

 - `supervisor.lock` — flock + pid-file for single-instance enforcement.
 - `supervisor.stop-sentinel` — present when a stop is requested; consumed on the next tick.
-- `<session>.meta.json` — written on spawn: session_id, cli_pid, vendor, project_dir, tmux session name, start_utc.
+- `supervisor.auto-paused` — present when the WI-mujuk kill switch has fired; contents are a human-readable reason line; cleared by `agent-supervisor resume`.
+- `<session>.meta.json` — written on spawn: session_id, cli_pid, vendor, project_dir, tmux session name, start_utc, `replaces`, `chain_length`, `consecutive_no_progress`.
 - `<session>.heartbeat` — touched by the per-turn hooks (telemetry only; never a spawn/replace input).
-- `respawn_log.log` — append-only audit of every spawn / replace / rate-limit event.
+- `respawn_log.log` — append-only audit of every spawn / replace / rate-limit / auto-pause event.
 - `rate_limit.json` — rolling 24h spawn timestamps.

 ## What the supervisor does NOT do
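
Separately from the diff itself: the updated `status` documentation says `kill_switch_threshold` is exposed so that tooling can compare it against `consecutive_no_progress` without hardcoding the constant. A small sketch of such tooling is below. It assumes the per-session `session`, `chain_length`, and `consecutive_no_progress` keys that the documented `jq` recipe selects. The `chain_headroom` function and the `SAMPLE_STATUS` payload are hypothetical and are not real supervisor output.

```python
import json

# Hypothetical payload shaped like the documented `status` fields.
SAMPLE_STATUS = json.dumps({
    "auto_paused": False,
    "kill_switch_threshold": 5,
    "sessions": [
        {"session": "hypergumbo-session-a", "chain_length": 5,
         "consecutive_no_progress": 4},
        {"session": "hypergumbo-session-b", "chain_length": 1,
         "consecutive_no_progress": 0},
    ],
})


def chain_headroom(status_json: str) -> dict:
    """Map each session name to the number of further consecutive
    no-progress failures that would trip the kill switch."""
    status = json.loads(status_json)
    threshold = status["kill_switch_threshold"]
    return {
        entry["session"]: threshold - entry["consecutive_no_progress"]
        for entry in status["sessions"]
    }


if __name__ == "__main__":
    for name, remaining in chain_headroom(SAMPLE_STATUS).items():
        print(f"{name}: {remaining} failure(s) from auto-pause")
```

In practice the input would come from `agent-supervisor status` rather than a literal. Reading the threshold from the payload keeps the script correct if the default of 5 is ever changed.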
