
nous stop should honor phase boundaries, not just iteration boundaries #198

@sriumcp


Problem

nous stop <target> (added in the #189 wave) writes a STOP sentinel that the orchestrator honors only between iterations: the check sits in the iteration loop in orchestrator/campaign.py, just before each run_iteration() call.
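For readers unfamiliar with the sentinel mechanism, here is a minimal sketch of how a file-based STOP sentinel and an iteration-loop check could fit together. It assumes the sentinel is a file named STOP in the campaign's work directory whose contents are the operator's reason, and that CampaignStopped is a plain exception; write_stop_sentinel and run_campaign are hypothetical stand-ins, not the actual nous code.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


class CampaignStopped(Exception):
    """Raised when an operator has requested a stop via the STOP sentinel."""


# Assumption: the sentinel is a file literally named "STOP" in work_dir,
# containing the operator-supplied reason.
SENTINEL_NAME = "STOP"


def write_stop_sentinel(work_dir: Path, reason: str = "") -> Path:
    """Hypothetical stand-in for what `nous stop <target>` does: drop a sentinel file."""
    sentinel = work_dir / SENTINEL_NAME
    sentinel.write_text(reason)
    return sentinel


def _raise_if_stopped(work_dir: Path, where: str) -> None:
    """Raise CampaignStopped if the sentinel exists; otherwise do nothing."""
    sentinel = work_dir / SENTINEL_NAME
    if sentinel.exists():
        reason = sentinel.read_text().strip()
        raise CampaignStopped(f"stopped_by_user: {reason} ({where})")


def run_campaign(work_dir: Path, iterations: int,
                 run_iteration: Callable[[int], None]) -> List[int]:
    """Current behaviour: the sentinel is checked only at iteration boundaries."""
    completed = []
    for i in range(1, iterations + 1):
        _raise_if_stopped(work_dir, where=f"before iter-{i}")
        run_iteration(i)
        completed.append(i)
    return completed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        wd = Path(d)

        def fake_iteration(i: int) -> None:
            # The operator runs `nous stop` while iter-1 is still in progress.
            if i == 1:
                write_stop_sentinel(wd, "agent stuck")

        try:
            run_campaign(wd, iterations=3, run_iteration=fake_iteration)
        except CampaignStopped as exc:
            print(exc)  # stopped_by_user: agent stuck (before iter-2)
```

In this sketch a stop written during iter-1 only surfaces once iter-1 has fully returned, at the check before iter-2.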

That granularity is too coarse for the most common reason an operator wants to stop: the current iteration has gone off the rails (agent stuck, BLIS subprocess hanging, runaway costs), and they want to halt now, not 30+ minutes later when EXECUTE_ANALYZE eventually completes.

Repro (paper-burst friction-test, 2026-05-26)

EXECUTE_ANALYZE was running 50 BLIS arms in parallel. One arm (externality-credit_seed11) hung at 100% CPU for 40+ minutes, which blocked the agent's tool call and, with it, the SDK turn. The operator wanted to stop, and nous stop wrote the sentinel, but the sentinel would not be checked until the iteration-loop boundary before iter-2, which would only be reached after iter-1's stuck EXECUTE_ANALYZE returned. Net result: Ctrl-C / kill was needed for an immediate halt, and nous stop was effectively a no-op in this case.
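A toy timing model makes the gap concrete. It assumes, purely for illustration, that an iteration consists of just the three phases named in the proposal below, run in that order with made-up durations, and that a phase already running can never be interrupted; halt_time is a hypothetical helper, not part of nous.

```python
from itertools import accumulate
from typing import List


def halt_time(phase_durations: List[float], stop_at: float, per_phase: bool) -> float:
    """Minutes after iteration start at which a stop requested at `stop_at` is honored.

    Assumes 0 < stop_at <= sum(phase_durations), i.e. the stop lands mid-iteration,
    and that a running phase is never interrupted (true for both designs).
    """
    ends = list(accumulate(phase_durations))
    starts = [0.0] + ends[:-1]
    if per_phase:
        # Proposal: the sentinel is also checked just before each phase starts.
        for start in starts:
            if start >= stop_at:
                return start
    # Current behaviour (or no phase boundary left): wait for the iteration boundary.
    return ends[-1]


if __name__ == "__main__":
    # Made-up durations (minutes) for DESIGN, HUMAN_DESIGN_GATE, EXECUTE_ANALYZE.
    durations = [5.0, 2.0, 45.0]
    for stop_at in (3.0, 20.0):
        print(stop_at,
              halt_time(durations, stop_at, per_phase=False),
              halt_time(durations, stop_at, per_phase=True))
```

With these numbers, a stop issued during DESIGN is honored at minute 5 instead of minute 52, while a stop issued during EXECUTE_ANALYZE still waits until minute 52 because, in this three-phase model, no phase boundary remains. That residual gap is the mid-phase case the stretch goal addresses.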

Proposal

Add phase-boundary stop checks inside run_iteration itself, in addition to the existing iteration-loop check:

```python
# orchestrator/iteration.py
def run_iteration(...):
    ...
    if _enter_phase(engine, "DESIGN"):
        _raise_if_stopped(work_dir, where="before DESIGN")
        ...
    if _enter_phase(engine, "HUMAN_DESIGN_GATE"):
        _raise_if_stopped(work_dir, where="before HUMAN_DESIGN_GATE")
        ...
    if _enter_phase(engine, "EXECUTE_ANALYZE"):
        _raise_if_stopped(work_dir, where="before EXECUTE_ANALYZE")
        ...
```

This still doesn't interrupt a stuck SDK turn mid-flight (the agent's tool call is opaque to the orchestrator), but it lets the operator halt cleanly at any phase boundary instead of waiting for the next iteration boundary.
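A runnable sketch of the proposed checks, under stated assumptions: the three if-blocks are collapsed into a loop over a hypothetical PHASES list; FakeEngine and this _enter_phase are stand-ins that assume _enter_phase returns True when a phase still needs to run (for example, it was not already completed on a resumed run); and the sentinel is assumed to be a STOP file in work_dir.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Optional, Set

# Hypothetical phase order; the real engine may have more phases.
PHASES = ["DESIGN", "HUMAN_DESIGN_GATE", "EXECUTE_ANALYZE"]


class CampaignStopped(Exception):
    """Raised when the operator has requested a stop."""


def _raise_if_stopped(work_dir: Path, where: str) -> None:
    # Assumption: the sentinel is a file named STOP whose content is the reason.
    sentinel = work_dir / "STOP"
    if sentinel.exists():
        reason = sentinel.read_text().strip()
        raise CampaignStopped(f"stopped_by_user: {reason} ({where})")


class FakeEngine:
    """Stand-in for the real engine: remembers which phases already completed."""

    def __init__(self, completed: Optional[Set[str]] = None) -> None:
        self.completed: Set[str] = set(completed or ())
        self.current: Optional[str] = None


def _enter_phase(engine: FakeEngine, phase: str) -> bool:
    """Hypothetical: return True (and mark the phase current) if it still needs to run."""
    if phase in engine.completed:
        return False
    engine.current = phase
    return True


def run_iteration(engine: FakeEngine, work_dir: Path,
                  handlers: Dict[str, Callable[[], None]]) -> None:
    for phase in PHASES:
        if _enter_phase(engine, phase):
            # The proposed change: honor the STOP sentinel at every phase boundary.
            _raise_if_stopped(work_dir, where=f"before {phase}")
            handlers[phase]()
            engine.completed.add(phase)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        wd = Path(d)
        ran = []

        def design() -> None:
            ran.append("DESIGN")
            (wd / "STOP").write_text("runaway costs")  # nous stop lands mid-DESIGN

        handlers = {
            "DESIGN": design,
            "HUMAN_DESIGN_GATE": lambda: ran.append("HUMAN_DESIGN_GATE"),
            "EXECUTE_ANALYZE": lambda: ran.append("EXECUTE_ANALYZE"),
        }
        try:
            run_iteration(FakeEngine(), wd, handlers)
        except CampaignStopped as exc:
            print(exc, ran)
```

A sentinel written while DESIGN runs makes the check before HUMAN_DESIGN_GATE raise, so DESIGN finishes but no later phase starts. Driving the checks from one loop also makes it harder to forget the check when a new phase is added.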

Stretch goal: a --reason "I want to halt mid-phase" option could set a more aggressive flag that the dispatcher checks while teeing streaming events, raising CampaignStopped from inside the SDK callback. That would be a real interrupt, but messier; do the phase-boundary checks first.
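One possible shape for the stretch goal, sketched under loudly stated assumptions: the streaming events are modeled as a plain Python iterable of dicts, the "more aggressive flag" is a hypothetical STOP_NOW file next to the normal sentinel, and tee_events is an illustrative name rather than the dispatcher's actual function.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Iterator


class CampaignStopped(Exception):
    """Raised when the operator has requested a stop."""


# Hypothetical file name for the "more aggressive" mid-phase flag.
URGENT_SENTINEL = "STOP_NOW"


def tee_events(events: Iterable[dict], work_dir: Path,
               sink: Callable[[dict], None]) -> Iterator[dict]:
    """Forward each streaming event to `sink` (e.g. a transcript log) and to the
    caller, checking the urgent flag before each event is passed on."""
    for event in events:
        if (work_dir / URGENT_SENTINEL).exists():
            # Raising here propagates to whatever code is consuming the stream.
            raise CampaignStopped("stopped_by_user: mid-phase interrupt")
        sink(event)
        yield event


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        wd = Path(d)

        def fake_stream():
            yield {"type": "text", "n": 1}
            (wd / URGENT_SENTINEL).write_text("")  # operator sets the urgent flag
            yield {"type": "text", "n": 2}

        try:
            for _ in tee_events(fake_stream(), wd, sink=print):
                pass
        except CampaignStopped as exc:
            print(exc)
```

Because the flag is only checked when an event arrives, this sketch would not fire while a hung tool call emits nothing at all, as in the repro; catching that case would additionally need a timer or a watcher thread.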

Files to touch

  • orchestrator/iteration.py — sprinkle _raise_if_stopped calls at each _enter_phase site.
  • orchestrator/campaign.py — surface "stopped at phase X" in the ledger row's error field (currently says "stopped_by_user: ...", which is fine).
  • tests/test_nous_stop.py — add a fixture where the sentinel is written between phases and assert the next phase doesn't start.
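A self-contained sketch of what the new test in tests/test_nous_stop.py might check, combined with the ledger change. Everything here is hypothetical: the phase list, the STOP file name, the dict-shaped ledger row, and run_iteration_with_ledger are stand-ins for the real orchestrator code. It uses tempfile rather than pytest's tmp_path fixture so it runs without pytest, though pytest would still collect the test_ function.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Optional

# Hypothetical phase order, mirroring the proposal.
PHASES = ["DESIGN", "HUMAN_DESIGN_GATE", "EXECUTE_ANALYZE"]


class CampaignStopped(Exception):
    """Carries the phase at which the stop was honored, for the ledger."""

    def __init__(self, reason: str, phase: str) -> None:
        super().__init__(f"stopped_by_user: {reason} (stopped at phase {phase})")
        self.phase = phase


def _raise_if_stopped(work_dir: Path, phase: str) -> None:
    sentinel = work_dir / "STOP"  # assumed sentinel file name
    if sentinel.exists():
        raise CampaignStopped(sentinel.read_text().strip(), phase)


def run_iteration(work_dir: Path, handlers: Dict[str, Callable[[], None]]) -> None:
    for phase in PHASES:
        _raise_if_stopped(work_dir, phase)  # check before every phase
        handlers[phase]()


def run_iteration_with_ledger(work_dir: Path,
                              handlers: Dict[str, Callable[[], None]]) -> Dict[str, Optional[str]]:
    """Hypothetical campaign.py wrapper: turn a stop into the ledger row's error field."""
    try:
        run_iteration(work_dir, handlers)
    except CampaignStopped as exc:
        return {"error": str(exc)}
    return {"error": None}


def test_sentinel_between_phases_blocks_next_phase() -> None:
    with tempfile.TemporaryDirectory() as d:
        work_dir = Path(d)
        started = []

        def design() -> None:
            started.append("DESIGN")
            # Simulate `nous stop` landing between DESIGN and the next phase.
            (work_dir / "STOP").write_text("agent stuck")

        handlers = {
            "DESIGN": design,
            "HUMAN_DESIGN_GATE": lambda: started.append("HUMAN_DESIGN_GATE"),
            "EXECUTE_ANALYZE": lambda: started.append("EXECUTE_ANALYZE"),
        }
        row = run_iteration_with_ledger(work_dir, handlers)

        assert started == ["DESIGN"]  # the next phase never started
        assert row["error"] == (
            "stopped_by_user: agent stuck (stopped at phase HUMAN_DESIGN_GATE)"
        )
```

Keeping the existing "stopped_by_user: ..." prefix and appending the phase means anything already matching on that prefix keeps working.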

Discovered in

paper-burst friction-test, 2026-05-26.
