Detect SDK turn silence (no streaming events for N min) and surface to retry_log #201

@sriumcp

Description

Problem

The Claude Agent SDK does not surface a "no events for N minutes" condition during a turn. If a tool call hangs (for example, Bash invoking a long-running subprocess), the SDK simply waits, and the orchestrator has no way to detect or interrupt the stall.

In #127, nous added nous status --watch, which displays a STUCK marker after 5 min of no streaming events. That marker is informational, meant for a human watching the screen; the orchestrator itself doesn't act on it.

Concrete impact (paper-burst friction-test, 2026-05-26)

One BLIS subprocess (externality-credit_seed11) hung at 100% CPU. The agent's Bash tool call (./blis run … --seed 11 …) waited for it. The SDK waited for the tool call. The orchestrator waited for the SDK turn. Total stall: 40+ minutes of zero progress, no events to executor_log.jsonl, no log lines to stdout. The only way out was external intervention (nous stop → kill).

The brief budgeted "iter-2: 5–10 min wall." Reality: a single arm hung indefinitely.

Proposal

Two layers:

(1) Streaming-event silence detector in orchestrator/sdk_dispatch.py. Each time _tee_event writes an event to the file, the orchestrator records the timestamp. A background thread (or async task) periodically checks whether the most recent event is older than silence_threshold (silence_threshold_seconds in campaign.yaml; default 10 min, configurable per campaign) and, if so, emits a structured warning to retry_log.jsonl:

{
  "phase": "execute-analyze",
  "iteration": 1,
  "failure_type": "sdk_silence",
  "elapsed_silent_seconds": 612,
  "last_event": {"type": "AssistantMessage", "tool_name": "Bash", ...}
}

This doesn't interrupt the turn, but it surfaces the stall: nous status --watch shows it, retry_log.jsonl records it, and external monitors (such as CI heartbeat checks) can act on it.
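A minimal sketch of layer (1), assuming a thread-based monitor. The names SilenceDetector, record_event, check, and run are hypothetical and do not come from the existing code; the assumption is that _tee_event would call record_event with the event it just wrote. The clock is injectable so a test can advance time without sleeping.

```python
import json
import threading
import time
from pathlib import Path
from typing import Callable, Optional


class SilenceDetector:
    """Hypothetical helper: logs one sdk_silence record per silent stretch."""

    def __init__(self, retry_log: Path, phase: str, iteration: int,
                 silence_threshold: float = 600.0,
                 clock: Callable[[], float] = time.monotonic):
        self.retry_log = retry_log
        self.phase = phase
        self.iteration = iteration
        self.silence_threshold = silence_threshold
        self._clock = clock
        self._lock = threading.Lock()
        self._last_event: Optional[dict] = None
        self._last_seen = clock()
        self._warned = False

    def record_event(self, event: dict) -> None:
        """Assumed to be called from _tee_event each time an event is written."""
        with self._lock:
            self._last_event = event
            self._last_seen = self._clock()
            self._warned = False  # a new event re-arms the warning

    def check(self) -> Optional[dict]:
        """Append a record if the threshold is exceeded; return it, else None."""
        with self._lock:
            silent = self._clock() - self._last_seen
            if silent < self.silence_threshold or self._warned:
                return None
            self._warned = True
            record = {
                "phase": self.phase,
                "iteration": self.iteration,
                "failure_type": "sdk_silence",
                "elapsed_silent_seconds": int(silent),
                "last_event": self._last_event,
            }
        with self.retry_log.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        return record

    def run(self, stop: threading.Event, poll_interval: float = 5.0) -> None:
        """Body for a background thread: poll until stop is set."""
        while not stop.wait(poll_interval):
            self.check()
```

Under these assumptions the monitor could be started with threading.Thread(target=detector.run, args=(stop,), daemon=True).start() before the turn and stopped with stop.set() afterwards. Writing at most one record per silent stretch keeps retry_log.jsonl from filling up during a long hang; whether repeated warnings are preferable is a design choice left open here.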

(2) Optional hard-kill on prolonged silence. When silence exceeds kill_threshold (default off), the orchestrator cancels the SDK turn and raises an exception. The default is off because a hard kill in the middle of a possibly recoverable turn is destructive. Operators opt in per campaign:

# campaign.yaml
sdk_timeouts:
  silence_threshold_seconds: 600   # warn after 10 min
  kill_threshold_seconds: 1800     # kill after 30 min
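One way layer (2) might look, sketched with asyncio and assuming the SDK turn can be wrapped in a cancellable task. SdkTimeouts, SdkSilenceKilled, run_turn_with_kill, and the touch callback are all hypothetical names; the campaign dict is assumed to be the already-parsed campaign.yaml. A missing kill_threshold_seconds leaves the hard kill off, matching the proposed default.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional


@dataclass
class SdkTimeouts:
    """Mirrors the sdk_timeouts block; None means hard-kill is off."""
    silence_threshold_seconds: float = 600.0
    kill_threshold_seconds: Optional[float] = None

    @classmethod
    def from_campaign(cls, campaign: dict) -> "SdkTimeouts":
        block = campaign.get("sdk_timeouts") or {}
        return cls(
            silence_threshold_seconds=block.get("silence_threshold_seconds", 600.0),
            kill_threshold_seconds=block.get("kill_threshold_seconds"),
        )


class SdkSilenceKilled(RuntimeError):
    """Raised when a turn is cancelled after exceeding the kill threshold."""


async def run_turn_with_kill(
    turn: Callable[[Callable[[], None]], Awaitable[object]],
    timeouts: SdkTimeouts,
    poll_interval: float = 1.0,
) -> object:
    """Run turn(touch); cancel it if touch() is not called for too long."""
    last_seen = time.monotonic()

    def touch() -> None:
        # The turn is assumed to call this once per streaming event.
        nonlocal last_seen
        last_seen = time.monotonic()

    task = asyncio.ensure_future(turn(touch))
    while True:
        done, _ = await asyncio.wait({task}, timeout=poll_interval)
        if done:
            return task.result()
        kill = timeouts.kill_threshold_seconds
        if kill is not None and time.monotonic() - last_seen >= kill:
            task.cancel()
            # Let the cancellation finish before reporting the kill.
            await asyncio.gather(task, return_exceptions=True)
            raise SdkSilenceKilled(f"no streaming events for {kill:.0f}s")
```

Cancelling the task only stops the awaiting coroutine. Whether a hung subprocess behind the Bash tool call is also terminated depends on how the SDK handles cancellation, which this sketch does not address.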

Files to touch

  • orchestrator/sdk_dispatch.py — track last-event timestamp, optional async monitor.
  • orchestrator/schemas/campaign.schema.yaml — accept sdk_timeouts block.
  • orchestrator/status.py — read silence-warning from retry_log.jsonl, surface in panel/line output.
  • tests/test_sdk_dispatch.py — drive a fake runner that sleeps; assert silence warning fires.
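The test idea in the last bullet could be sketched as below, here using the async-monitor variant. monitor_silence and fake_runner are hypothetical stand-ins, not existing functions in orchestrator/sdk_dispatch.py, and the thresholds are shrunk to fractions of a second so the test runs quickly.

```python
import asyncio
import json
import tempfile
import time
from pathlib import Path


async def monitor_silence(state: dict, retry_log: Path, threshold: float,
                          poll_interval: float) -> None:
    """Hypothetical async monitor: one sdk_silence record per silent stretch."""
    warned_for = None
    while True:
        await asyncio.sleep(poll_interval)
        silent = time.monotonic() - state["last_seen"]
        if silent >= threshold and warned_for != state["last_seen"]:
            warned_for = state["last_seen"]
            record = {"failure_type": "sdk_silence",
                      "elapsed_silent_seconds": round(silent, 3)}
            with retry_log.open("a", encoding="utf-8") as fh:
                fh.write(json.dumps(record) + "\n")


async def fake_runner(state: dict, sleep_seconds: float) -> None:
    """Emit one event, then go quiet, like a hung Bash tool call."""
    state["last_seen"] = time.monotonic()
    await asyncio.sleep(sleep_seconds)


def test_silence_warning_fires() -> None:
    async def scenario(retry_log: Path) -> None:
        state = {"last_seen": time.monotonic()}
        monitor = asyncio.ensure_future(
            monitor_silence(state, retry_log, threshold=0.05, poll_interval=0.01))
        await fake_runner(state, sleep_seconds=0.2)
        monitor.cancel()
        await asyncio.gather(monitor, return_exceptions=True)

    with tempfile.TemporaryDirectory() as tmp:
        retry_log = Path(tmp) / "retry_log.jsonl"
        asyncio.run(scenario(retry_log))
        records = [json.loads(line)
                   for line in retry_log.read_text().splitlines()]
    assert len(records) == 1
    assert records[0]["failure_type"] == "sdk_silence"
```

A real test would import the monitor from orchestrator/sdk_dispatch.py rather than define it inline, and might inject a fake clock instead of relying on short real sleeps.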

Related

Discovered in

paper-burst friction-test, 2026-05-26.


Labels

enhancement (New feature or request)
