Detect SDK turn silence (no streaming events for N min) and surface to retry_log #201

## Problem

The Claude Agent SDK turn doesn't surface a "no events for N minutes" condition. If a tool call hangs (for example, Bash invoking a long-running subprocess), the SDK just waits. The orchestrator has no way to detect or interrupt this.

In #127, nous added `nous status --watch` which displays a `STUCK` marker after 5 min of no streaming events — but that's an *informational* marker for a human watching the screen. The orchestrator itself doesn't act on it.

## Concrete impact (paper-burst friction-test, 2026-05-26)

One BLIS subprocess (externality-credit_seed11) hung at 100% CPU. The agent's Bash tool call (`./blis run … --seed 11 …`) waited for it. The SDK waited for the tool call. The orchestrator waited for the SDK turn. **Total stall: 40+ minutes of zero progress, no events to executor_log.jsonl, no log lines to stdout.** The only way out was external intervention (`nous stop` → kill).

The brief budgeted "iter-2: 5–10 min wall." Reality: a single arm hung indefinitely.

## Proposal

Two layers:

**(1) Streaming-event silence detector** in `orchestrator/sdk_dispatch.py`. Each time `_tee_event` writes an event to the file, the orchestrator records the timestamp. A background thread (or async task) periodically checks whether the most recent event is older than `silence_threshold` (default 10 min, configurable per campaign) and, if so, appends a structured warning to `retry_log.jsonl`:

```json
{
  "phase": "execute-analyze",
  "iteration": 1,
  "failure_type": "sdk_silence",
  "elapsed_silent_seconds": 612,
  "last_event": {"type": "AssistantMessage", "tool_name": "Bash", ...}
}
```

This doesn't interrupt the turn — but it **surfaces** the stall so `nous status --watch` shows it, retry_log records it, and external monitors (CI heartbeat checks) can act.
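A minimal sketch of layer (1), to make the bookkeeping concrete. The class name `SilenceDetector`, its methods, and the injectable `clock` are hypothetical and not taken from the existing `sdk_dispatch.py`; only the record fields follow the JSON shape above. It assumes one warning per silent stretch is enough, which the proposal does not specify.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional


class SilenceDetector:
    """Hypothetical helper: remembers the last streaming event and reports silence."""

    def __init__(
        self,
        retry_log: Path,
        phase: str,
        iteration: int,
        silence_threshold: float = 600.0,  # seconds; the proposal's 10 min default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.retry_log = retry_log
        self.phase = phase
        self.iteration = iteration
        self.silence_threshold = silence_threshold
        self._clock = clock
        self._last_event: Optional[dict] = None
        self._last_seen = clock()
        self._warned = False

    def record_event(self, event: dict) -> None:
        """Call from wherever _tee_event writes an event to disk."""
        self._last_event = event
        self._last_seen = self._clock()
        self._warned = False  # a fresh event ends the current silent stretch

    def check(self) -> Optional[dict]:
        """Append at most one warning per silent stretch; return it, else None."""
        silent = self._clock() - self._last_seen
        if self._warned or silent < self.silence_threshold:
            return None
        record = {
            "phase": self.phase,
            "iteration": self.iteration,
            "failure_type": "sdk_silence",
            "elapsed_silent_seconds": int(silent),
            "last_event": self._last_event,
        }
        with self.retry_log.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        self._warned = True
        return record
```

The background thread or async task would then only need to call `check()` every few seconds. Using a monotonic clock keeps the elapsed time immune to wall-clock adjustments, and passing the clock in makes the detector testable without real sleeps.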

**(2) Optional hard-kill on prolonged silence.** When silence exceeds `kill_threshold` (default off), the orchestrator cancels the SDK turn and raises an exception. Default off because hard-kill in the middle of a possibly-recoverable turn is destructive. Operators opt in per campaign:

```yaml
# campaign.yaml
sdk_timeouts:
  silence_threshold_seconds: 600   # warn after 10 min
  kill_threshold_seconds: 1800     # kill after 30 min
```
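One possible shape for layer (2), assuming the SDK turn is driven as an asyncio awaitable. `run_turn_with_watchdog`, `SdkSilenceTimeout`, and the `last_event_at` callback are hypothetical names for illustration. Whether cancelling the task also terminates a hung subprocess depends on the SDK and is not verified here.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


class SdkSilenceTimeout(RuntimeError):
    """Hypothetical exception raised once kill_threshold is exceeded."""


async def run_turn_with_watchdog(
    turn: Awaitable[T],
    last_event_at: Callable[[], float],  # monotonic time of the last event
    kill_threshold: Optional[float],     # seconds; None means hard-kill is off
    poll_interval: float = 5.0,
) -> T:
    task = asyncio.ensure_future(turn)
    if kill_threshold is None:
        # Default behaviour: never interrupt the turn.
        return await task
    while True:
        # Wake up either when the turn finishes or when the poll interval elapses.
        done, _ = await asyncio.wait({task}, timeout=poll_interval)
        if done:
            return task.result()  # re-raises the turn's own exception, if any
        silent = time.monotonic() - last_event_at()
        if silent >= kill_threshold:
            task.cancel()
            # Let the cancellation propagate before reporting the timeout.
            await asyncio.gather(task, return_exceptions=True)
            raise SdkSilenceTimeout(f"no streaming events for {silent:.0f}s")
```

With `kill_threshold_seconds` unset in `campaign.yaml`, the caller would pass `kill_threshold=None` and the turn runs exactly as it does today.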

## Files to touch

- `orchestrator/sdk_dispatch.py` — track last-event timestamp, optional async monitor.
- `orchestrator/schemas/campaign.schema.yaml` — accept `sdk_timeouts` block.
- `orchestrator/status.py` — read silence-warning from retry_log.jsonl, surface in panel/line output.
- `tests/test_sdk_dispatch.py` — drive a fake runner that sleeps; assert silence warning fires.
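For the `orchestrator/status.py` item, a rough sketch of reading the most recent silence warning back out of `retry_log.jsonl`. The function names and the `SILENT` output format are hypothetical; the sketch assumes one JSON object per line with the fields shown in the Proposal, and it does not decide when an old warning should be treated as stale.

```python
import json
from pathlib import Path
from typing import Optional


def latest_silence_warning(retry_log: Path) -> Optional[dict]:
    """Return the last sdk_silence record in the log, or None if there is none."""
    if not retry_log.exists():
        return None
    latest = None
    with retry_log.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a partially written trailing line
            if isinstance(record, dict) and record.get("failure_type") == "sdk_silence":
                latest = record
    return latest


def format_silence_line(record: dict) -> str:
    """Render a one-line summary (hypothetical format) for the status output."""
    minutes = record["elapsed_silent_seconds"] // 60
    return (
        f"SILENT {minutes} min "
        f"(phase={record['phase']}, iteration={record['iteration']})"
    )
```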

## Related

- [N6: nous stop should be phase-boundary, not just iteration-boundary].
- [N8: ExecuteAnalyzeIncompleteError (the silence detector should feed into this diagnostic when artifacts are missing)].
- #127 (nous status STUCK marker — same idea, observer-only).

## Discovered in

paper-burst friction-test, 2026-05-26.

