Problem
The Claude Agent SDK turn doesn't surface a "no events for N minutes" condition. If a tool call (Bash invoking a long-running subprocess) hangs, the SDK just waits. The orchestrator has no way to detect or interrupt this.
In #127, nous added nous status --watch, which displays a STUCK marker after 5 min of no streaming events. That marker is informational only, meant for a human watching the screen; the orchestrator itself doesn't act on it.
Concrete impact (paper-burst friction-test, 2026-05-26)
One BLIS subprocess (externality-credit_seed11) hung at 100% CPU. The agent's Bash tool call (./blis run … --seed 11 …) waited for it. The SDK waited for the tool call. The orchestrator waited for the SDK turn. Total stall: 40+ minutes of zero progress, no events to executor_log.jsonl, no log lines to stdout. The only way out was external intervention (nous stop → kill).
The brief budgeted "iter-2: 5–10 min wall." Reality: a single arm hung indefinitely.
Proposal
Two layers:
(1) Streaming-event silence detector in orchestrator/sdk_dispatch.py. Each time _tee_event writes an event to the file, the orchestrator records the timestamp. A background thread (or async task) periodically checks whether the most recent event is older than silence_threshold (default 10 min, configurable per campaign) and, if so, emits a structured warning to retry_log.jsonl:
{
  "phase": "execute-analyze",
  "iteration": 1,
  "failure_type": "sdk_silence",
  "elapsed_silent_seconds": 612,
  "last_event": {"type": "AssistantMessage", "tool_name": "Bash", ...}
}
This doesn't interrupt the turn — but it surfaces the stall so nous status --watch shows it, retry_log records it, and external monitors (CI heartbeat checks) can act.
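A minimal sketch of the detector in layer (1), under stated assumptions: SilenceDetector, record_event, check, and append_warning are hypothetical names, not existing nous or SDK APIs. The clock is injectable so a test can simulate ten silent minutes without sleeping. The record fields mirror the warning shown above.

```python
import json
import time
from typing import Callable, Optional


class SilenceDetector:
    """Hypothetical helper: tracks the last streaming event and reports silence."""

    def __init__(
        self,
        phase: str,
        iteration: int,
        silence_threshold: float = 600.0,  # seconds; the proposed 10 min default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.phase = phase
        self.iteration = iteration
        self.silence_threshold = silence_threshold
        self._clock = clock
        self._last_seen = clock()
        self._last_event: Optional[dict] = None
        self._warned = False

    def record_event(self, event: dict) -> None:
        """Call wherever an event is teed to the log (assumed hook point)."""
        self._last_event = event
        self._last_seen = self._clock()
        self._warned = False  # a new event ends the silent stretch

    def check(self) -> Optional[dict]:
        """Return one warning record per silent stretch, else None."""
        elapsed = self._clock() - self._last_seen
        if self._warned or elapsed < self.silence_threshold:
            return None
        self._warned = True
        return {
            "phase": self.phase,
            "iteration": self.iteration,
            "failure_type": "sdk_silence",
            "elapsed_silent_seconds": int(elapsed),
            "last_event": self._last_event,
        }


def append_warning(path: str, warning: dict) -> None:
    """Append one JSON object per line, matching the .jsonl convention."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(warning) + "\n")
```

A background thread or async task would call check() on a short interval and pass any non-None result to append_warning. Emitting only one warning per silent stretch keeps retry_log.jsonl from filling with repeats during a long hang.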
(2) Optional hard-kill on prolonged silence. When silence exceeds kill_threshold (default off), the orchestrator cancels the SDK turn and raises an exception. Default off because hard-kill in the middle of a possibly-recoverable turn is destructive. Operators opt in per campaign:
# campaign.yaml
sdk_timeouts:
  silence_threshold_seconds: 600   # warn after 10 min
  kill_threshold_seconds: 1800     # kill after 30 min
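One way the optional hard-kill in layer (2) could be sketched, assuming the SDK turn is consumed as an async iterator of events. SdkSilenceTimeout and iterate_with_kill are hypothetical names; the sketch uses only asyncio from the standard library and makes no claim about the real SDK's interface. Passing kill_threshold=None models "default off".

```python
import asyncio
from typing import AsyncIterator, Optional


class SdkSilenceTimeout(RuntimeError):
    """Hypothetical exception raised when the turn is killed for silence."""


async def iterate_with_kill(
    events: AsyncIterator[dict],
    kill_threshold: Optional[float],
) -> AsyncIterator[dict]:
    """Re-yield events; abort if the gap between two events exceeds the threshold.

    kill_threshold=None disables the kill (wait_for then waits indefinitely).
    """
    iterator = events.__aiter__()
    while True:
        try:
            # wait_for cancels the pending __anext__() when the timeout expires.
            event = await asyncio.wait_for(iterator.__anext__(), timeout=kill_threshold)
        except StopAsyncIteration:
            return  # the turn finished normally
        except asyncio.TimeoutError:
            raise SdkSilenceTimeout(
                f"no streaming event for {kill_threshold} seconds"
            ) from None
        yield event
```

The orchestrator would wrap its event loop as `async for event in iterate_with_kill(turn_events, kill_threshold)` and treat SdkSilenceTimeout like any other failed turn. Cancelling the awaiting coroutine does not by itself guarantee that a hung child process exits, so a real implementation would presumably also need to terminate the subprocess.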
Files to touch
orchestrator/sdk_dispatch.py — track last-event timestamp, optional async monitor.
orchestrator/schemas/campaign.schema.yaml — accept sdk_timeouts block.
orchestrator/status.py — read silence-warning from retry_log.jsonl, surface in panel/line output.
tests/test_sdk_dispatch.py — drive a fake runner that sleeps; assert silence warning fires.
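The test described for tests/test_sdk_dispatch.py might look roughly like the sketch below. Everything here is a stand-in: fake_runner, monitor_silence, and run_turn_with_monitor are hypothetical names, and the thresholds are scaled down to fractions of a second so the test finishes quickly.

```python
import asyncio
import time


async def monitor_silence(state: dict, threshold: float, poll: float, warnings: list) -> None:
    """Poll the last-event timestamp; record one warning per silent stretch."""
    warned_for = None
    while True:
        await asyncio.sleep(poll)
        last = state["last_event_at"]
        elapsed = time.monotonic() - last
        if elapsed >= threshold and warned_for != last:
            warnings.append(
                {"failure_type": "sdk_silence", "elapsed_silent_seconds": elapsed}
            )
            warned_for = last


async def fake_runner(state: dict, hang_seconds: float) -> None:
    """Emit one event, then go silent, imitating a hung tool call."""
    state["last_event_at"] = time.monotonic()
    await asyncio.sleep(hang_seconds)


async def run_turn_with_monitor(
    threshold: float = 0.05, poll: float = 0.01, hang_seconds: float = 0.3
) -> list:
    state = {"last_event_at": time.monotonic()}
    warnings: list = []
    monitor = asyncio.create_task(monitor_silence(state, threshold, poll, warnings))
    try:
        await fake_runner(state, hang_seconds)
    finally:
        # Stop the monitor once the turn ends, whether it succeeded or failed.
        monitor.cancel()
        await asyncio.gather(monitor, return_exceptions=True)
    return warnings


def test_silence_warning_fires() -> None:
    warnings = asyncio.run(run_turn_with_monitor())
    assert len(warnings) == 1
    assert warnings[0]["failure_type"] == "sdk_silence"
```

Running the monitor as a task alongside the runner keeps the check independent of the turn's own progress, which is the point: the turn is the thing that might be stuck.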
Related
--output-format stream-json; nous status --watch TUI #127 (nous status STUCK marker: same idea, observer-only).
Discovered in
paper-burst friction-test, 2026-05-26.