|
| 1 | +<!-- SPDX-License-Identifier: AGPL-3.0-or-later --> |
| 2 | +# Agent Supervisor — Operator Guide |
| 3 | + |
| 4 | +`scripts/agent-supervisor` is a long-running daemon that monitors tmux sessions running hypergumbo-aware agent CLIs (Claude Code, Codex CLI, Cursor, Gemini CLI) and replaces a stuck session with a fresh one when your project-level intent says autonomous work is desired but the current session has stopped making progress. |
| 5 | + |
| 6 | +This guide covers the operator workflow. For the design rationale, see tracker item `WI-razub` and the related vendor-contract documentation in [`AGENTS.md` → Vendor Parity for Respawn](../AGENTS.md). |
| 7 | + |
| 8 | +## What the supervisor solves |
| 9 | + |
| 10 | +The stop-hook circuit breaker (5 consecutive no-progress hashes) correctly detects a stagnating autonomous session, but it does so by permitting the session to terminate, which gives a stuck agent a one-way exit out of long-running work. The supervisor closes that loop: when a session trips the breaker (or crashes, or exits cleanly), the supervisor spawns a fresh CLI with a clean context, seeded with a generic "familiarize yourself with this repo" prompt, so forward-march resumes automatically. |
| 11 | + |
| 12 | +The supervisor's authoritative signal is tmux pane-byte delta over a rolling 15-minute window — "is the pane actually scrolling?" — NOT any file the agent itself writes. Per-session heartbeat files exist (touched by every vendor's per-turn hook) but are telemetry only; they surface in `status` output but are never consulted for spawn/replace decisions. |
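To make the "is the pane actually scrolling?" signal concrete, here is a minimal sketch of a frozen-pane tracker. It is not the supervisor's actual code: `FrozenPaneTracker`, `PaneObservation`, and `observe` are hypothetical names, and the sketch assumes the caller already has a byte count for the pane (how the real script measures it is not shown in this guide).

```python
from __future__ import annotations

import time
from dataclasses import dataclass


@dataclass
class PaneObservation:
    """Last byte count seen for one pane, and when that count last changed."""
    pane_bytes: int
    last_change: float


class FrozenPaneTracker:
    """Hypothetical in-memory tracker; observations are lost on restart."""

    def __init__(self, window_sec: float = 15 * 60) -> None:
        self.window_sec = window_sec
        self._seen: dict[str, PaneObservation] = {}

    def observe(self, session: str, pane_bytes: int, now: float | None = None) -> bool:
        """Record one sample; return True once the pane has been unchanged
        for at least window_sec."""
        now = time.monotonic() if now is None else now
        prev = self._seen.get(session)
        if prev is None or prev.pane_bytes != pane_bytes:
            # First sighting, or the pane moved: (re)start the frozen clock.
            self._seen[session] = PaneObservation(pane_bytes, now)
            return False
        return now - prev.last_change >= self.window_sec
```

Because the observations live only in memory, the first sample after a restart always seeds a fresh clock, which matches the behavior described under "What the supervisor does NOT do".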
| 13 | + |
| 14 | +## Prerequisites |
| 15 | + |
| 16 | +- `tmux` installed on the workstation. |
| 17 | +- One or more vendor CLIs installed and on `$PATH`: `claude`, `codex`, `cursor`, `gemini`. |
| 18 | +- `python3` (standard library only — no extra dependencies). |
| 19 | +- You have run `./scripts/dev-install` in this repo so the hooks and scripts are wired up. |
| 20 | + |
| 21 | +> **Verification status note.** The exit-keystroke for Claude Code is verified. For Codex / Cursor / Gemini the supervisor's table is best-effort and marked `FIXME WI-batob` in both `scripts/agent-supervisor::VENDOR_TABLE` and the AGENTS.md parity table. Before relying on the supervisor to respawn those vendors in production, do the one-time verification step documented in AGENTS.md. |
| 22 | + |
| 23 | +## First-time setup |
| 24 | + |
| 25 | +Two commands per workstation. Run them once and the supervisor owns the lifecycle from then on. |
| 26 | + |
| 27 | +```bash |
| 28 | +./scripts/loop-toggle DEEP # writes autonomous_intent.txt = DEEP |
| 29 | + # (also writes AUTONOMOUS_MODE.txt = DEEP |
| 30 | + # for today's session — preserves old UX) |
| 31 | + |
| 32 | +./scripts/agent-supervisor run & # starts the daemon in background |
| 33 | +``` |
| 34 | + |
| 35 | +The supervisor creates `~/hypergumbo_lab_notebook/agent-supervisor/` if it doesn't exist. Override the default with `AGENT_SUPERVISOR_STATE_DIR=<path>` if you need the state elsewhere. |
| 36 | + |
| 37 | +Substitute `BROAD` for `DEEP` if you want breadth / linker-coverage work instead of feature-quality work — see [AGENTS.md § Mode Selection](../AGENTS.md). |
| 38 | + |
| 39 | +## Normal operation |
| 40 | + |
| 41 | +Once the supervisor is running, it polls every 60 seconds (tunable via `--interval N`). On each tick it: |
| 42 | + |
| 43 | +1. Reads `autonomous_intent.txt`. If OFF, does nothing. |
| 44 | +2. Enumerates tmux sessions whose name starts with `hypergumbo-session-` (reserved prefix — human-managed tmux sessions are never touched). |
| 45 | +3. For each such session, checks: is a tmux client attached? is the recorded CLI PID alive? has the pane scrolled in the last 15 minutes? |
| 46 | +4. Acts: if no session exists, spawn one. If a session is attached, do nothing (human is watching). If the CLI is dead OR the pane has been frozen for ≥ 15 minutes, run the replacement sequence. |
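The per-tick steps above can be read as a small decision function. The sketch below is one interpretation of that list, not the script's real implementation: `Action`, `SessionState`, and `decide` are hypothetical names, and it evaluates a single session at a time where the real loop iterates over every prefixed session.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    NOTHING = "nothing"
    SPAWN = "spawn"
    REPLACE = "replace"


@dataclass
class SessionState:
    clients_attached: int  # number of tmux clients attached to the session
    cli_alive: bool        # is the recorded CLI PID still alive?
    frozen: bool           # pane unchanged for >= 15 minutes


def decide(intent: str, session: Optional[SessionState]) -> Action:
    """Return the action for one tick, following the four steps in order."""
    if intent == "OFF":
        return Action.NOTHING          # step 1: intent short-circuits everything
    if session is None:
        return Action.SPAWN            # no managed session exists yet
    if session.clients_attached > 0:
        return Action.NOTHING          # a human is watching; never replace
    if not session.cli_alive or session.frozen:
        return Action.REPLACE          # dead CLI or frozen pane
    return Action.NOTHING              # healthy, detached session
```

Checking for an attached client before the dead/frozen test reflects the rule that attachment blocks replacement even when the pane is frozen.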
| 47 | + |
| 48 | +### Watching a live session |
| 49 | + |
| 50 | +The supervisor launches sessions in detached mode. To observe one: |
| 51 | + |
| 52 | +```bash |
| 53 | +./scripts/agent-supervisor status # lists live sessions + pane bytes + heartbeat ages |
| 54 | +tmux attach -t hypergumbo-session-<UTC-timestamp> |
| 55 | +``` |
| 56 | + |
| 57 | +Detach without killing the CLI by pressing `Ctrl-B`, then `d` (tmux's default detach binding). |
| 58 | + |
| 59 | +**Important:** while you are attached, the supervisor will NOT replace the session even if the pane freezes — an attached client blocks replacement, by design. Detach when you're done watching so the watchdog can do its job. |
| 60 | + |
| 61 | +### Pausing the loop |
| 62 | + |
| 63 | +```bash |
| 64 | +./scripts/loop-toggle OFF # flips intent to OFF (and today's session mode, too) |
| 65 | +``` |
| 66 | + |
| 67 | +The supervisor continues running but its decision matrix short-circuits on OFF: no spawns, no replacements. Any live CLI finishes its current work and idles. Resume with another `loop-toggle DEEP` / `BROAD`. |
| 68 | + |
| 69 | +Prefer the narrow form if you want to temporarily disable autonomous mode on *just* the currently-running CLI without flipping project intent: |
| 70 | + |
| 71 | +```bash |
| 72 | +./scripts/loop-toggle --set-session-mode OFF # session only; intent stays on |
| 73 | +``` |
| 74 | + |
| 75 | +### Shutting down for the day |
| 76 | + |
| 77 | +```bash |
| 78 | +./scripts/agent-supervisor stop # writes supervisor.stop-sentinel |
| 79 | +``` |
| 80 | + |
| 81 | +The running daemon consumes the sentinel on its next poll tick (≤ 60 s) and exits cleanly. Your live CLIs keep running until you close them; the supervisor just stops respawning. Re-arm with another `agent-supervisor run &` whenever you come back. |
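The stop handshake can be pictured as a file the loop checks, and deletes, at the top of each tick. This is a sketch under assumptions: only the `supervisor.stop-sentinel` file name comes from this guide, while `consume_stop_sentinel`, `run_loop`, and their parameters are hypothetical.

```python
import time
from pathlib import Path
from typing import Callable


def consume_stop_sentinel(state_dir: Path) -> bool:
    """Delete the sentinel if present; return True when a stop was requested."""
    try:
        (state_dir / "supervisor.stop-sentinel").unlink()
    except FileNotFoundError:
        return False
    return True


def run_loop(
    state_dir: Path,
    tick: Callable[[], None],
    interval: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run tick() every `interval` seconds until the sentinel appears.

    Returns the number of ticks executed before the clean exit.
    """
    ticks = 0
    while not consume_stop_sentinel(state_dir):
        tick()
        ticks += 1
        sleep(interval)
    return ticks
```

Since the sentinel is only checked between ticks, a stop request takes effect within one polling interval, which is where the "≤ 60 s" bound comes from.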
| 82 | + |
| 83 | +## `status` output |
| 84 | + |
| 85 | +```bash |
| 86 | +./scripts/agent-supervisor status | jq . |
| 87 | +``` |
| 88 | + |
| 89 | +Returns a JSON object with: |
| 90 | + |
| 91 | +- `intent` — current value of `autonomous_intent.txt`. |
| 92 | +- `rate_limit` — rolling 24h spawn count, the cap (default 8), and whether a spawn is currently allowed. |
| 93 | +- `sessions[]` — one entry per hypergumbo-prefixed tmux session, with `meta` (the stored session-id / CLI pid / vendor / start UTC), `clients_attached`, `pane_bytes` (raw scrollback size in bytes), and `heartbeat_age_sec` (seconds since the per-turn hooks last touched the heartbeat file). |
| 94 | +- `stop_requested` — true if a stop sentinel is in flight. |
| 95 | + |
| 96 | +Use `pane_bytes` + `heartbeat_age_sec` together to debug "is this session actually working?" — if pane bytes haven't grown but the heartbeat is fresh, the CLI is stuck in a tool that's not emitting output. If both are stale, the CLI itself is frozen. |
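As a rough illustration of that reasoning, the sketch below classifies a session from two `status` samples taken some time apart. `diagnose` and the `fresh_within_sec` threshold are assumptions made for the example; the guide does not define what counts as a "fresh" heartbeat.

```python
from typing import Optional


def diagnose(
    prev_pane_bytes: int,
    cur_pane_bytes: int,
    heartbeat_age_sec: Optional[float],
    fresh_within_sec: float = 120.0,  # hypothetical freshness threshold
) -> str:
    """Combine pane growth and heartbeat age into a coarse diagnosis."""
    pane_grew = cur_pane_bytes > prev_pane_bytes
    heartbeat_fresh = (
        heartbeat_age_sec is not None and heartbeat_age_sec <= fresh_within_sec
    )
    if pane_grew:
        return "working"
    if heartbeat_fresh:
        return "stuck in a quiet tool"  # hooks still fire, pane is silent
    return "frozen"                     # neither signal is moving
```

For example, feeding it `pane_bytes` from two consecutive `status` calls plus the latest `heartbeat_age_sec` gives a quick first guess before attaching to the session.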
| 97 | + |
| 98 | +## Edge cases |
| 99 | + |
| 100 | +- **Two supervisors for the same project.** The second `agent-supervisor run` invocation fails `fcntl.flock` acquisition on `supervisor.lock` and exits with "another supervisor is already running". This is the enforced single-instance invariant; don't work around it. |
| 101 | +- **You want to run a vendor CLI by hand.** Either launch it in a tmux session whose name does NOT start with `hypergumbo-session-` (the supervisor will ignore it entirely), or `loop-toggle OFF` first and it won't get touched. |
| 102 | +- **Rate-limited.** If the supervisor has spawned 8 sessions in the last 24 hours (default soft cap), the next spawn is skipped and a log entry is written to `respawn_log.log`. Fix the underlying problem rather than working around the cap; repeated respawns indicate a deeper issue. |
| 103 | +- **Supervisor crashes.** Nothing gets auto-spawned until you restart it with `agent-supervisor run &`. The daemon is not self-restarting by design. |
| 104 | +- **Tmux is not installed.** The `run_subprocess` seam returns rc=127 for every tmux call, so `status` works and reports zero sessions. The `run` loop no-ops each tick. Install tmux to unstick. |
| 105 | +- **CLI refuses graceful exit.** The supervisor polls `kill -0 <cli_pid>` for 30 seconds after sending the vendor exit keystroke. If the CLI is still alive, it falls back to `tmux kill-session` + direct invocation of `kill-transcript-sync.sh` / `rotate-on-session-end.sh` (the per-session cleanup scripts are already idempotent). An entry appears in `respawn_log.log` as `forced-kill fallback for session <name>`. |
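The graceful-exit wait in the last bullet amounts to polling the PID with signal 0 until a deadline. Below is a minimal POSIX-only sketch of that pattern; `pid_alive` and `wait_for_exit` are hypothetical names, and the 0.5-second poll interval is an assumption (the guide only states the 30-second budget).

```python
import os
import time
from typing import Callable


def pid_alive(pid: int) -> bool:
    """Python equivalent of `kill -0 <pid>`: probe without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # no such process
    except PermissionError:
        return True    # process exists but belongs to another user
    return True


def wait_for_exit(
    pid: int,
    timeout_sec: float = 30.0,
    poll_sec: float = 0.5,  # assumed poll interval
    alive: Callable[[int], bool] = pid_alive,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the process exits before the deadline, else False.

    A False result is the point where a caller would fall back to a forced kill.
    """
    deadline = clock() + timeout_sec
    while alive(pid):
        if clock() >= deadline:
            return False
        sleep(poll_sec)
    return True
```

One caveat of this probe: an exited child that has not been reaped yet (a zombie) still answers signal 0, so a real implementation has to account for that separately.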
| 106 | + |
| 107 | +## Troubleshooting |
| 108 | + |
| 109 | +| Symptom | Likely cause | Fix | |
| 110 | +| --- | --- | --- | |
| 111 | +| `agent-supervisor run` fails with "another supervisor is already running" | flock still held by a supervisor PID | `agent-supervisor status` to confirm, then `ps -fp <pid>` on the PID in `supervisor.lock`; if that PID is dead, remove the lock file and retry | |
| 112 | +| Live session not getting replaced despite being stuck | You're attached to it, or the pane has scrolled within 15 min | Detach (`Ctrl-B`, then `d`); or wait out the 15-minute frozen window | |
| 113 | +| `respawn_log.log` shows repeated "rate-limit reached" | 8 spawns in 24h — usually indicates a loop somewhere upstream | Read the log tail + `agent_notes.json` for a pattern; don't just raise the cap | |
| 114 | +| Fresh CLI launches but doesn't enable autonomous mode | `autonomous_intent.txt` is OFF or missing | `loop-toggle --set-intent DEEP` (narrow-write, doesn't touch the current session) | |
| 115 | +| Fresh CLI launches but the session-start hook doesn't inject the seed prompt | Vendor's hook file missing or unwired | Verify `.agent/hooks/<vendor>/session-start.sh` exists and sources `_shared/session_start_logic.sh` | |
| 116 | + |
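For context on the first row of the table, this is roughly what a non-blocking `fcntl.flock` single-instance guard looks like. It is a sketch, not the supervisor's code: `acquire_single_instance_lock` is a hypothetical helper, and writing the PID into the lock file is assumed from the "flock + pid-file" description of `supervisor.lock`.

```python
import fcntl
import os
from typing import Optional


def acquire_single_instance_lock(lock_path: str) -> Optional[int]:
    """Try to take an exclusive, non-blocking flock on lock_path.

    Returns the open file descriptor on success (keep it open for the
    daemon's lifetime), or None if another holder already has the lock.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None  # "another supervisor is already running"
    # We own the lock: record our PID so operators can inspect the holder.
    os.ftruncate(fd, 0)
    os.write(fd, f"{os.getpid()}\n".encode())
    return fd
```

The lock is tied to the open descriptor rather than to the file's contents, so the PID inside the file is informational only.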
| 117 | +## State directory layout |
| 118 | + |
| 119 | +`~/hypergumbo_lab_notebook/agent-supervisor/` (override with `AGENT_SUPERVISOR_STATE_DIR`): |
| 120 | + |
| 121 | +- `supervisor.lock` — flock + pid-file for single-instance enforcement. |
| 122 | +- `supervisor.stop-sentinel` — present when a stop is requested; consumed on the next tick. |
| 123 | +- `<session>.meta.json` — written on spawn: session_id, cli_pid, vendor, project_dir, tmux session name, start_utc. |
| 124 | +- `<session>.heartbeat` — touched by the per-turn hooks (telemetry only; never a spawn/replace input). |
| 125 | +- `respawn_log.log` — append-only audit of every spawn / replace / rate-limit event. |
| 126 | +- `rate_limit.json` — rolling 24h spawn timestamps. |
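The rolling 24-hour cap can be sketched as a list of spawn timestamps that is pruned on every check. The on-disk format here (a JSON array of epoch seconds) is an assumption, as are the names `load_recent` and `try_record_spawn`; only the file name and the default cap of 8 come from this guide.

```python
import json
import time
from pathlib import Path
from typing import List, Optional

DAY_SEC = 24 * 60 * 60


def load_recent(path: Path, now: float, window_sec: float = DAY_SEC) -> List[float]:
    """Load spawn timestamps and keep only those inside the rolling window."""
    try:
        stamps = json.loads(path.read_text())
    except FileNotFoundError:
        stamps = []
    return [t for t in stamps if now - t < window_sec]


def try_record_spawn(path: Path, cap: int = 8, now: Optional[float] = None) -> bool:
    """Record a spawn if the cap allows it; return whether it was allowed."""
    now = time.time() if now is None else now
    recent = load_recent(path, now)
    if len(recent) >= cap:
        return False  # caller would log "rate-limit reached" and skip the spawn
    recent.append(now)
    path.write_text(json.dumps(recent))
    return True
```

Pruning on read keeps the file small and means a denied spawn becomes allowed again as soon as the oldest timestamp ages out of the window.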
| 127 | + |
| 128 | +## What the supervisor does NOT do |
| 129 | + |
| 130 | +- **Decide mode.** The human still picks BROAD vs DEEP via `loop-toggle`. The supervisor only mirrors project intent into each spawned session. |
| 131 | +- **Self-heal tmux.** If tmux is down, the supervisor waits silently for it to come back. |
| 132 | +- **Restart after crash.** No systemd / cron wiring by default — you launch `agent-supervisor run &` manually (or add it to your shell rc). |
| 133 | +- **Persist pane history across restarts.** Pane-byte observations are in-memory only; after a supervisor restart, the first tick per session seeds a new observation and the 15-minute frozen clock restarts. |
| 134 | +- **Consult the heartbeat.** Heartbeats are for your debugging / retrospective metrics, not the spawn/replace decision. See WI-sipov. |
| 135 | + |
| 136 | +## Deferred follow-ups |
| 137 | + |
| 138 | +These are noted on their tracker items and would extend the supervisor's reach without changing today's contract: |
| 139 | + |
| 140 | +- **Stop-hook + long-running-command heartbeats.** Today the heartbeat is only touched by per-turn hooks. Wrappers like `auto-pr` / `bakeoff-*` / `smart-test` don't yet have a supervisor-exported session_id env var to key their heartbeat touches. (Tracked as a follow-up on WI-sipov.) |
| 141 | +- **Codex / Cursor / Gemini exit keystrokes.** Marked `FIXME WI-batob` in the supervisor's `VENDOR_TABLE` and in the AGENTS.md parity table. Claude Code is verified; the others need a one-time "start the CLI in a throwaway tmux, send the keystroke, confirm exit within 30s" verification. |
| 142 | + |
| 143 | +## Related reading |
| 144 | + |
| 145 | +- [AGENTS.md § Vendor Parity for Respawn](../AGENTS.md) — the per-vendor contract table (hook paths, exit keystrokes, CLI invocations). |
| 146 | +- [AGENTS.md § Premature Stopping Prevention](../AGENTS.md) — the autonomous-mode framework the supervisor plugs into. |
| 147 | +- `scripts/agent-supervisor` — inline design notes in the script's docstring. |
| 148 | +- `scripts/loop-toggle --help` — the intent/mode split (`--set-intent` / `--set-session-mode`). |
| 149 | +- Tracker item `WI-razub-duluf-nobun-rulit-dapam-jipal-dafud-nahob` — the full design discussion and resolution notes. |