| 1 | +# Fleet Telemetry Cases |
| 2 | + |
| 3 | +Live cases surfaced by `/tmp/codex-fleet-telemetry-*.jsonl` and the in-process |
| 4 | +supervisors during real bringups. Each entry documents the symptom, the |
| 5 | +detection signal, and the fix that addresses it. |
| 6 | + |
| 7 | +## F1 — Dead panes silent in overview |
| 8 | + |
| 9 | +**Symptom (live 2026-05-18):** `Pane is dead (signal 15, Mon May 18 11:43:27 2026)` |
| 10 | +on 5+ panes of the `codex-fleet` session. The operator noticed only by scrolling |
| 11 | +into each pane manually; the overview chrome rendered them as if alive. |
| 12 | + |
| 13 | +**Detection signal:** |
| 14 | +```jsonl |
| 15 | +{"kind":"pane","pane_id":"%16","last_line":"Pane is dead (signal 15, Mon May 18 11:43:27 2026)","blocked":0,"stall_secs":0} |
| 16 | +``` |
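
A telemetry record like the one above can be picked out with plain `grep` and `sed`. This is a minimal sketch, not part of the fleet scripts: the function name `dead_pane_ids` is hypothetical, and it assumes the compact one-record-per-line JSON shown in the sample (no spaces around colons). A real JSON parser would be more robust.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the pane_id of every telemetry record whose
# last_line reports a dead pane. Reads JSONL on stdin.
# Assumes compact JSON exactly as in the sample record.
dead_pane_ids() {
  grep -E '"kind":"pane"' \
    | grep -E '"last_line":"Pane is dead \(' \
    | sed -E 's/.*"pane_id":"([^"]+)".*/\1/'
}
```

Usage would look like `cat /tmp/codex-fleet-telemetry-*.jsonl | dead_pane_ids`, which prints one pane id per line.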
| 17 | + |
| 18 | +**Fix:** `scripts/codex-fleet/show-fleet.sh:dead_panes_report()` reads |
| 19 | +`tmux list-panes -F '#{pane_dead}'` and emits a JSON summary on stderr. |
| 20 | +Markers under `/tmp/claude-viz/dead-pane-firstseen/` track first-seen |
| 21 | +timestamps so we can alert once a pane has been dead for more than 60s. |
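
The first-seen marker idea can be sketched as follows. This is not the actual `dead_panes_report()`; the function name `dead_panes_over_age`, the one-file-per-pane marker layout, and the plain-text output (instead of the JSON summary on stderr) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only. Input (stdin): lines of "<pane_id> <pane_dead>", e.g. from
#   tmux list-panes -a -F '#{pane_id} #{pane_dead}'
# $1: marker directory, $2: alert threshold in seconds,
# $3: current epoch time (optional, defaults to now; handy for testing).
dead_panes_over_age() {
  local marker_dir="$1" max_age="$2" now="${3:-$(date +%s)}"
  local pane_id dead marker first_seen age
  mkdir -p "$marker_dir"
  while read -r pane_id dead; do
    marker="$marker_dir/${pane_id#\%}"   # "%16" -> ".../16"
    if [ "$dead" != "1" ]; then
      rm -f "$marker"                    # pane is alive: forget any old marker
      continue
    fi
    # Record the first time this pane was seen dead.
    [ -f "$marker" ] || printf '%s\n' "$now" > "$marker"
    first_seen=$(cat "$marker")
    age=$((now - first_seen))
    if [ "$age" -gt "$max_age" ]; then
      printf '%s %s\n' "$pane_id" "$age" # pane id and seconds dead
    fi
  done
}
```

Clearing the marker when a pane is alive again means that a respawned pane which later dies starts a fresh 60s window rather than alerting immediately.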
| 22 | + |
| 23 | +--- |
| 24 | + |
| 25 | +## F2 — Cap-probe cache outlived quota recovery |
| 26 | + |
| 27 | +**Symptom (live 2026-05-18):** The first `full-bringup.sh` run found 5/6 healthy |
| 28 | +accounts; a fresh `--no-cap-cache` re-run ~5min later found 8/8 healthy. |
| 29 | +The 300s default `CACHE_TTL_HEALTHY` outlived the actual quota window |
| 30 | +during a normal fleet bringup. |
| 31 | + |
| 32 | +**Fix:** `scripts/codex-fleet/cap-probe.sh` lowers the `CACHE_TTL_HEALTHY` default |
| 33 | +to 60s, adds a `CODEX_FLEET_CAP_CACHE_TTL` env override, and zeroes the TTL |
| 34 | +when `/tmp/claude-viz/bringup-failure.marker` exists. |
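
The TTL resolution described above might look roughly like this. The function name `resolve_healthy_ttl` is hypothetical, and the precedence (failure marker beats the env override) is an assumption; the source only says the marker zeroes the TTL.

```shell
#!/usr/bin/env bash
# Sketch: resolve the healthy-cache TTL in seconds.
# Assumed precedence: failure marker > env override > 60s default.
# $1: marker path (optional; defaults to the path named in the fix).
resolve_healthy_ttl() {
  local marker="${1:-/tmp/claude-viz/bringup-failure.marker}"
  if [ -e "$marker" ]; then
    echo 0        # a failed bringup forces a fresh probe
    return
  fi
  echo "${CODEX_FLEET_CAP_CACHE_TTL:-60}"
}
```

With the marker taking precedence, a retry after a failed bringup never trusts cached results, even if an operator has raised the TTL through the environment.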
| 35 | + |
| 36 | +--- |
| 37 | + |
| 38 | +## F3 + F7 — wake-prompt and trust-prompt never fire on bringup |
| 39 | + |
| 40 | +**Symptom (live 2026-05-18):** `fleet-ticker-2:wake-prompt` window blank |
| 41 | +after bringup; 8 workers in `codex-fleet-2` stuck at default Codex |
| 42 | +placeholders (`"Implement {feature}"`). Separately, FLEET_ID=3's 8 workers |
| 43 | +each blocked on `Do you trust the contents of this directory?` → |
| 44 | +`External agent config detected` → `Press enter to continue`. |
| 45 | + |
| 46 | +**Fix:** |
| 47 | +- `scripts/codex-fleet/codex-first-launch-supervisor.sh` (new) drains all |
| 48 | + three first-launch prompts in parallel. Verified live: 8/8 panes drained. |
| 49 | +- `scripts/codex-fleet/full-bringup.sh` calls it just before the `DONE.` |
| 50 | + banner, gated by `CODEX_FLEET_AUTO_BYPASS` (default `1`). Auto-wake follows |
| 51 | + immediately after, gated by `CODEX_FLEET_AUTO_WAKE` (default `1`). |
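
To illustrate how a supervisor could tell which first-launch prompt a pane is showing, here is a minimal classifier over captured pane text. It is a sketch, not the code in `codex-first-launch-supervisor.sh`: the function name, the returned labels, and the match priority are all assumptions, and the keys needed to dismiss each prompt are deliberately left out.

```shell
#!/usr/bin/env bash
# Sketch: classify the visible pane text (stdin) as one of the three
# first-launch prompts quoted in the symptom, or "none".
# Labels and priority order are illustrative assumptions.
classify_first_launch_prompt() {
  local screen
  screen=$(cat)
  case $screen in
    *"Do you trust the contents of this directory?"*) echo trust ;;
    *"External agent config detected"*)               echo external-config ;;
    *"Press enter to continue"*)                      echo press-enter ;;
    *)                                                echo none ;;
  esac
}
```

A caller could feed it the visible screen, for example `tmux capture-pane -p -t "$pane" | classify_first_launch_prompt`, and loop until it reports `none`.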
| 52 | + |
| 53 | +--- |
| 54 | + |
| 55 | +## F4 — plan-watcher rejects depends_on plans |
| 56 | + |
| 57 | +**Symptom (live 2026-05-18):** |
| 58 | +``` |
| 59 | +[plan-watcher] PLAN-VALIDATE: ERROR 5 |
| 60 | +[plan-watcher] {"ok":false,"errors":["tasks[1] '…' has depends_on=[0] but --allow-waves was not passed", …]} |
| 61 | +[plan-watcher] plan-validator reported hard errors; skipping dispatch this tick |
| 62 | +``` |
| 63 | +Force-claim silently fell back to `trading-edge-foundations-pt2-2026-05-18` |
| 64 | +while our priority plan `marketing-content-waves-2026-05-18` (which used |
| 65 | +`depends_on`) was rejected on every tick. |
| 66 | + |
| 67 | +**Fix:** `scripts/codex-fleet/plan-watcher.sh:run_plan_validator()` passes |
| 68 | +`--allow-waves` (matching what `full-bringup.sh` does at publish time). |
| 69 | +The `CODEX_FLEET_PLAN_VALIDATOR_FLAGS` env variable layers extra operator flags |
| 70 | +on top without losing the baseline. |
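
The flag layering can be sketched like this. The function name `plan_validator_flags` is hypothetical, and whitespace splitting of the env value is an assumption about how operators would supply several flags.

```shell
#!/usr/bin/env bash
# Sketch: print validator flags one per line, baseline first, then any
# operator extras from CODEX_FLEET_PLAN_VALIDATOR_FLAGS.
plan_validator_flags() {
  local flag
  printf '%s\n' --allow-waves
  # Intentionally unquoted so the value splits on whitespace. This also
  # means glob characters in the value would be expanded.
  for flag in ${CODEX_FLEET_PLAN_VALIDATOR_FLAGS:-}; do
    printf '%s\n' "$flag"
  done
}
```

Because the baseline is emitted unconditionally, setting the env variable can only add flags; it cannot drop `--allow-waves` by accident.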
| 71 | + |
| 72 | +--- |
| 73 | + |
| 74 | +## F5 — force-claim silently drops dispatch on non-idle panes |
| 75 | + |
| 76 | +**Symptom (live 2026-05-18):** force-claim log showed `not in a mode` 9× per |
| 77 | +tick on panes that were busy with prior work. The Colony claim had already |
| 78 | +been consumed; the dispatch silently failed; the subtask sat orphaned. |
| 79 | + |
| 80 | +**Fix:** `scripts/codex-fleet/force-claim.sh:dispatch()` runs a pane-ready |
| 81 | +check via `tmux display-message -p '#{pane_in_mode}'` plus a visible-screen |
| 82 | +heuristic (last 10 lines must contain the `›` input glyph and not contain |
| 83 | +`Working (...esc to interrupt)`) before `send-keys`. Non-ready panes |
| 84 | +return early with `[defer]` so the Colony claim is not consumed and the |
| 85 | +subtask returns to `available` for the next tick. |
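
The pane-ready check described above could be sketched as a pure function over the two inputs, which keeps it testable without a live tmux server. The name `pane_ready` and its calling convention are assumptions, not the code in `force-claim.sh`.

```shell
#!/usr/bin/env bash
# Sketch of the readiness heuristic.
# $1: value of '#{pane_in_mode}' for the pane ("1" when in a mode)
# stdin: captured pane text, e.g. from: tmux capture-pane -p -t "$pane"
# Returns 0 when the pane looks idle at the input prompt, 1 otherwise.
pane_ready() {
  local in_mode="$1" tail_text
  tail_text=$(tail -n 10)              # only the last 10 visible lines count
  [ "$in_mode" != "1" ] || return 1    # pane is in a mode: not ready
  case $tail_text in
    *"Working ("*"esc to interrupt)"*) return 1 ;;  # worker is busy
  esac
  case $tail_text in
    *"›"*) return 0 ;;                 # input glyph visible: ready
  esac
  return 1
}

# Possible call site (illustrative only):
#   mode=$(tmux display-message -p -t "$pane" '#{pane_in_mode}')
#   if ! tmux capture-pane -p -t "$pane" | pane_ready "$mode"; then
#     echo "[defer] $pane"; return 0
#   fi
```

Checking readiness before `send-keys`, rather than after a failed send, is what keeps the Colony claim from being consumed on a pane that cannot accept the prompt.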
| 86 | + |
| 87 | +--- |
| 88 | + |
| 89 | +## F6 — Codex auto-submit not firing on send-keys |
| 90 | + |
| 91 | +**Symptom (live 2026-05-18):** Worker context drops from 92% to 83% (keys |
| 92 | +arrived in the input box) but Colony shows 0 claims and the worker stays |
| 93 | +at the input prompt. The typed prompt sits there unsubmitted. |
| 94 | + |
| 95 | +**Fix (still investigating):** `scripts/codex-fleet/test/codex-auto-submit-test.sh` |
| 96 | +spawns a 1-pane fleet against a no-op plan, sends the wake prompt via the |
| 97 | +candidate submit-key sequence, and asserts >=1 Colony claim within 90s. |
| 98 | +Candidate sequences tested: `Enter`, `Enter Enter`, `tmux paste-buffer`, |
| 99 | +`Tab Enter`. The smoke test is the gate; the working sequence lands in |
| 100 | +`force-claim.sh:dispatch()` once identified. |
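
The "assert >=1 claim within 90s" step amounts to polling with a deadline. A minimal sketch follows; the function name `wait_for_claim` is hypothetical, and the command that reports the claim count is left to the caller, since the Colony interface is not described here.

```shell
#!/usr/bin/env bash
# Sketch: poll a caller-supplied command until it prints a count >= 1
# or the timeout expires.
# $1: timeout in seconds, $2: poll interval in seconds,
# remaining args: command that prints the current claim count.
wait_for_claim() {
  local timeout="$1" interval="$2" deadline count
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  while :; do
    count=$("$@" 2>/dev/null) || count=0
    # Non-numeric output makes the test fail quietly and counts as "no claim".
    [ "${count:-0}" -ge 1 ] 2>/dev/null && return 0
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep "$interval"
  done
}
```

In the smoke test this would run once per candidate submit sequence, for example `wait_for_claim 90 5 some_claim_counter`, where `some_claim_counter` stands in for whatever reports Colony claims.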