Skip to content

Commit c13f871

Browse files
sriumcpclaude
andauthored
* paper-burst friction-test cleanup (AI-native-Systems-Research#203) Closes AI-native-Systems-Research#193, AI-native-Systems-Research#194, AI-native-Systems-Research#195, AI-native-Systems-Research#196, AI-native-Systems-Research#197, AI-native-Systems-Research#198, AI-native-Systems-Research#199, AI-native-Systems-Research#200, AI-native-Systems-Research#201, AI-native-Systems-Research#202. Refs AI-native-Systems-Research#203. Ten nous-side improvements surfaced by a clean-room friction-test of the `paper-burst` campaign on top of post-AI-native-Systems-Research#189 nous (PR AI-native-Systems-Research#192). Each fix lands with regression coverage; the full suite goes from 939 → 993 passing tests (+54 new, 0 regressions). Behaviour changes ----------------- **Critical** * `AI-native-Systems-Research#193` — SDK sandbox configurable. ``ClaudeAgentOptions.permission_mode`` defaults to ``"bypassPermissions"`` so campaigns can write outside the launched cwd (the orchestrator runs the agent with cwd=worktree but BLIS / build / test outputs land at <main-repo>/.nous/<run>/...). Operators opt out via ``campaign.sandbox: default`` or ``nous run --sandbox default``. * `AI-native-Systems-Research#194` — ``state.json.iteration`` matches ``runs/iter-N/`` from the start of iter-1. Engine increments on leaving INIT (whether via PRE_WORK or directly to DESIGN) in addition to DONE→DESIGN. Fixes ``nous status`` showing "iter 0" while artifacts live in iter-1/, and silences the misleading "starting fresh" path on resume. **Operator UX** * `AI-native-Systems-Research#195` — ``nous status`` "last tool" view actually populates. ``_tee_event`` now walks ``message.content`` for ``ToolUseBlock`` entries (where ``name`` actually lives) instead of trying ``getattr(message, "tool_name")`` on the top-level message (which was always None on AssistantMessage). * `AI-native-Systems-Research#197` — ``--max-iterations`` survives ``nous resume``. The effective cap is persisted to ``state.json`` on first run and read back on resume when no CLI flag is supplied. CLI flag still wins; resume prints which source it used. * `AI-native-Systems-Research#198` — ``nous stop`` honours phase boundaries, not just iteration boundaries. ``_enter_phase`` checks the STOP sentinel before each phase transition, so a long EXECUTE_ANALYZE can be halted at the next gate boundary instead of waiting for the next iteration. **Structural / extensibility** * `AI-native-Systems-Research#199` — Per-campaign iter-root whitelist extension. New ``campaign.validation.iter_root_extensions`` lets paper-* campaigns declare ``analysis_summary.json`` / ``manifest.json`` / etc. as allowed iter-root artifacts. Validator merges with the global ``_KNOWN_ROOT_FILES`` whitelist. * `AI-native-Systems-Research#200` — ``ExecuteAnalyzeIncompleteError`` analog of AI-native-Systems-Research#187 for the EXECUTE phase. Fires before ``validate_execution`` when ``experiment_plan.yaml`` / ``findings.json`` / ``principle_updates.json`` is missing on disk after dispatch. Names the missing files and lists the four common causes (max_turns exhaustion, subprocess hang, polling-loop stall, API stall) each pointing at a concrete artifact. Writes ``failure_type: "execute_incomplete"`` to retry_log.jsonl. * `AI-native-Systems-Research#201` — Post-turn SDK silence detector. After each SDK turn, ``summarize_silence_gaps`` walks the streaming ``executor_log.jsonl`` and writes ``failure_type: "sdk_silence"`` to retry_log.jsonl when the longest gap between events exceeds ``campaign.sdk_timeouts.silence_threshold_seconds`` (default 600s). Observation-only — doesn't interrupt or fail the turn. **Polish** * `AI-native-Systems-Research#196` — Dispatch log line uses ``type(self).__name__`` so SDK campaigns log as ``SDKDispatcher``, not ``CLIDispatcher`` (the parent class whose ``logger.info`` was leaking the name). * `AI-native-Systems-Research#202` — Resume's "starting fresh" warning rephrased to informational wording ("treating as iter-1; existing artifacts preserved"), demoted WARNING → INFO. After AI-native-Systems-Research#194 this path mostly never fires anyway. Tests ----- 993 passed, 1 skipped, 0 failed (was 939 on `reflective`; +54 new). New behavioural test files: * ``test_sdk_sandbox.py`` — ``permission_mode`` defaults to bypass; campaign override; explicit kwarg override; schema validation. * ``test_sdk_silence_detector.py`` — gap detection helper; threshold trigger; below-threshold no-op; opt-out via threshold=0. * ``test_max_iterations_persist.py`` — persist/read helpers; resume resolution chain (CLI > state > campaign > default). * ``test_execute_artifact_assertion.py`` — missing-artifact helper, diagnostic message shape, retry_log entry shape. Updated regression coverage in ``test_engine.py`` (iteration counter ticks on leaving INIT and on DONE→DESIGN), ``test_campaign.py`` (resume warning is INFO with new wording), ``test_validate.py`` (iter-root extension whitelist), ``test_nous_stop.py`` (phase-boundary stop checks), ``test_cli_dispatch.py`` (log line uses runtime class name), ``test_sdk_dispatch.py`` (last-tool extraction from ToolUseBlock). Migration notes --------------- * No breaking changes to existing campaigns. All new fields (``sandbox``, ``sdk_timeouts``, ``validation.iter_root_extensions``) are optional with sensible defaults; old campaigns keep working unchanged. * Programmatic callers of ``validate_design`` / ``validate_execution`` may now pass an optional ``campaign`` kwarg to opt into the iter-root extension whitelist. Omitting it preserves the strict default. * ``state.json`` gains an optional ``max_iterations`` field. Old state files without it work fine on resume — the resolver falls back to campaign.yaml + hardcoded default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * review fixups: tighter typing + missing wiring tests + doc accuracy PR AI-native-Systems-Research#204 review surfaced one bug-tier finding plus a handful of suggestions; this commit folds in everything. Changes ------- * `orchestrator/sdk_dispatch.py` — defensive Python-side validation of ``campaign.sdk_timeouts.silence_threshold_seconds`` at construction (mirroring the ``sandbox`` enum check). Schema enforces ``minimum: 0`` but ad-hoc dict callers bypass schema; this closes the gap. Also tightened ``SDKRunner.permission_mode`` to ``Literal["bypassPermissions"] | None`` (was ``str | None``) and promoted the silence-detector fallback log from ``logger.debug`` → ``logger.warning`` so a perpetually-failing detector surfaces. * `orchestrator/iteration.py` — ``_enter_phase`` now requires ``work_dir`` (was optional, defaulted to None which silently skipped the stop-sentinel check). All in-repo callers already pass it. ``ExecuteAnalyzeIncompleteError.__init__`` validates that ``missing`` is non-empty and every entry is in ``_REQUIRED_EXECUTE_ARTIFACTS`` — kills the typo class at the raise site. Error message clarified about where each missing artifact belongs (iter root vs ``results/``). * `orchestrator/campaign.py` — ``_persist_max_iterations`` docstring rewritten to distinguish silent benign no-ops (state.json absent / not a dict / value unchanged) from logged failures (read/parse/write errors). * `orchestrator/cli.py` — ``_cmd_resume``'s max_iterations resolution chain comment merged steps 3+4 (campaign.yaml fallback and hardcoded default 10 are the same code path). Tests (+6 new, 0 regressions; 993 → 999 total) ---------------------------------------------- * ``test_execute_artifact_assertion.py`` — new end-to-end test pinning the orchestrator glue: when EXECUTE_ANALYZE's dispatch returns cleanly without writing artifacts, ``run_iteration`` writes the retry_log row + raises ``ExecuteAnalyzeIncompleteError``. Uses ``InlineDispatcher.dispatch`` no-op stub so no live LLM, no signal- file polling. * ``test_nous_stop.py`` — new end-to-end test pinning AI-native-Systems-Research#198 call-site wiring: a STOP sentinel staged at DESIGN halts at the HUMAN_DESIGN_GATE boundary instead of running the gate. Catches any regression that drops ``work_dir`` from one of the four ``_enter_phase`` call sites in ``run_iteration``. * ``test_max_iterations_persist.py`` — two new tests for AI-native-Systems-Research#197 resolution-chain steps 3 (campaign.yaml fallback) and 4 (hardcoded default 10) when state.json has no ``max_iterations`` field. * ``test_sdk_sandbox.py`` — two new tests for AI-native-Systems-Research#193 CLI flag end-to-end (``--sandbox default`` overrides ``campaign.sandbox: bypass``; absent flag preserves campaign value). * ``test_nous_stop.py::test_enter_phase_requires_work_dir`` — replaces the prior "without work_dir skips sentinel" test with one that pins the new contract (TypeError on missing kwarg). * ``test_integration_real_execution.py`` — updated 8 ``_enter_phase`` call sites to pass ``tmp_path`` as ``work_dir``. Reviewed but skipped -------------------- The "worktree leak on validate_execution failure path" finding from the silent-failure-hunter agent did not reproduce on inspection: ``validate_execution`` only logs warnings, doesn't raise. The worktree cleanup at iteration.py line 1011 runs unconditionally after that. The two raise sites further down (findings.json missing / schema fail) fire AFTER cleanup, so no leak. The agent referenced ``ScriptError``/``RecoverableExecutionError`` handlers that don't exist in the file — appears to have hallucinated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent c6a6ee6 commit c13f871

20 files changed

Lines changed: 1921 additions & 57 deletions

orchestrator/campaign.py

Lines changed: 71 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -38,6 +38,7 @@
3838
from orchestrator.engine import Engine
3939
from orchestrator.gates import HumanGate
4040
from orchestrator.inline_dispatch import InlineDispatcher
41+
from orchestrator.util import atomic_write
4142
from orchestrator.ledger import append_failed_row, append_ledger_row
4243
from orchestrator.llm_dispatch import LLMDispatcher
4344
from orchestrator.metrics import summarize_metrics
@@ -175,6 +176,60 @@ def _generate_report(
175176
print(f" Report generation skipped: {exc}")
176177

177178

179+
def _persist_max_iterations(work_dir: Path, max_iterations: int) -> None:
180+
"""Write effective max_iterations into state.json (#197).
181+
182+
Best-effort. Three early-return paths are silent benign no-ops:
183+
* state.json doesn't exist (run hasn't called setup_work_dir yet)
184+
* state.json content isn't a dict (corrupt; load_state will catch it)
185+
* value is unchanged (idempotent skip)
186+
187+
OS / parse / write *errors* are logged at WARNING level with the
188+
fallback chain noted, since the feature is operational sugar
189+
(carries the original cap across resume) and not load-bearing for
190+
correctness.
191+
"""
192+
state_path = Path(work_dir) / "state.json"
193+
if not state_path.exists():
194+
return
195+
try:
196+
state = json.loads(state_path.read_text())
197+
if not isinstance(state, dict):
198+
return
199+
if state.get("max_iterations") == max_iterations:
200+
return
201+
state["max_iterations"] = int(max_iterations)
202+
atomic_write(state_path, json.dumps(state, indent=2) + "\n")
203+
except (OSError, json.JSONDecodeError) as exc:
204+
logger.warning(
205+
"Could not persist max_iterations=%d to state.json (%s): "
206+
"resume will fall back to CLI flag / campaign.yaml / default.",
207+
max_iterations, exc,
208+
)
209+
210+
211+
def read_persisted_max_iterations(work_dir: Path) -> int | None:
212+
"""Read max_iterations from state.json if present (#197).
213+
214+
Returns None when state.json doesn't exist, can't be parsed, or
215+
doesn't carry a max_iterations field. Callers should fall back to
216+
their normal resolution chain (CLI flag → campaign.yaml → default).
217+
"""
218+
state_path = Path(work_dir) / "state.json"
219+
if not state_path.exists():
220+
return None
221+
try:
222+
state = json.loads(state_path.read_text())
223+
except (OSError, json.JSONDecodeError):
224+
return None
225+
if not isinstance(state, dict):
226+
return None
227+
val = state.get("max_iterations")
228+
if not isinstance(val, int) or val < 1:
229+
return None
230+
return val
231+
232+
178233
def _resume_completed_campaign(work_dir: Path, max_iterations: int) -> int:
179234
"""Decide where to resume a campaign and, if DONE, advance it.
180235
@@ -194,8 +249,16 @@ def _resume_completed_campaign(work_dir: Path, max_iterations: int) -> int:
194249
if engine.phase not in ("INIT", "DONE"):
195250
start = engine.iteration
196251
if start < 1:
197-
logger.warning(
198-
"state.json has iteration=%d (< 1); starting fresh.", start,
252+
# #202: pre-#194, the engine kept state.iteration=0 throughout
253+
# iter-1 (incrementing only on DONE→DESIGN). The earlier WARNING
254+
# "starting fresh" wording read like data loss; in practice
255+
# existing iter-1 artifacts are preserved and the resume
256+
# continues at state.phase. Phrase it informationally.
257+
logger.info(
258+
"state.json has iteration=%d (no completed iterations yet); "
259+
"treating as iter-1. Existing artifacts under runs/iter-1/ "
260+
"are preserved; resuming at phase=%s.",
261+
start, engine.phase,
199262
)
200263
return 1
201264
if start > max_iterations:
@@ -309,6 +372,12 @@ def run_campaign(
309372
)
310373
preflight_dispatcher.preflight_check()
311374

375+
# #197: persist effective max_iterations into state.json so a later
376+
# `nous resume` (without --max-iterations) honors the original cap
377+
# instead of silently defaulting to 10. The state file is the single
378+
# source of truth across run/resume invocations.
379+
_persist_max_iterations(work_dir, max_iterations)
380+
312381
start_iter = _resume_completed_campaign(work_dir, max_iterations)
313382

314383
max_redesigns = 3

orchestrator/cli.py

Lines changed: 38 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -114,6 +114,11 @@ def _cmd_run(args):
114114
file=sys.stderr,
115115
)
116116
sys.exit(1)
117+
# #193: --sandbox CLI flag overrides campaign.sandbox if both present;
118+
# leaving it unset preserves whatever the campaign.yaml declares (or
119+
# the SDKDispatcher default of "bypass").
120+
if getattr(args, "sandbox", None) is not None:
121+
campaign["sandbox"] = args.sandbox
117122
run_campaign(
118123
campaign,
119124
work_dir,
@@ -132,7 +137,7 @@ def _cmd_run(args):
132137
def _cmd_resume(args):
133138
import logging
134139

135-
from orchestrator.campaign import run_campaign
140+
from orchestrator.campaign import run_campaign, read_persisted_max_iterations
136141

137142
logging.basicConfig(level=logging.DEBUG if args.verbose else logging.INFO)
138143

@@ -150,7 +155,31 @@ def _cmd_resume(args):
150155
print("resume requires campaign.yaml", file=sys.stderr)
151156
sys.exit(1)
152157

153-
max_iterations = args.max_iterations if args.max_iterations is not None else campaign.get("max_iterations", 10)
158+
# #197: max_iterations resolution chain on resume:
159+
# 1. CLI --max-iterations (explicit override wins).
160+
# 2. state.json (preserves the cap from the original `nous run`).
161+
# 3. campaign.yaml.max_iterations, or the hardcoded default 10 if
162+
# campaign.yaml doesn't pin it. (Both flow through the same
163+
# `campaign.get("max_iterations", 10)` call — legacy state files
164+
# pre-dating #197 land here.)
165+
if args.max_iterations is not None:
166+
max_iterations = args.max_iterations
167+
print(f"Resuming with max_iterations={max_iterations} (CLI override).")
168+
else:
169+
persisted = read_persisted_max_iterations(work_dir)
170+
if persisted is not None:
171+
max_iterations = persisted
172+
print(
173+
f"Resuming with max_iterations={max_iterations} "
174+
f"(persisted from original `nous run`)."
175+
)
176+
else:
177+
max_iterations = campaign.get("max_iterations", 10)
178+
print(
179+
f"Resuming with max_iterations={max_iterations} "
180+
f"(from campaign.yaml / default — state.json had no "
181+
f"persisted value)."
182+
)
154183
run_campaign(
155184
campaign,
156185
work_dir,
@@ -609,6 +638,13 @@ def main():
609638
"an enclosing agent framework. The legacy 'api' backend "
610639
"was removed in #183.",
611640
)
641+
p_run.add_argument(
642+
"--sandbox", choices=["bypass", "default"], default=None,
643+
help="SDK filesystem sandbox mode (#193). Default 'bypass' (set "
644+
"via campaign.sandbox). Pass 'default' to use the SDK's "
645+
"default permission gating — only sensible when the "
646+
"campaign's writes all land under the launched cwd.",
647+
)
612648
p_run.add_argument(
613649
"--bundle", type=Path, default=None,
614650
help="Path to a pre-authored bundle.yaml. Skips DESIGN's agent "

orchestrator/cli_dispatch.py

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -186,7 +186,14 @@ def dispatch(
186186
else:
187187
atomic_write(output_path, json.dumps(data, indent=2) + "\n")
188188

189-
logger.info("CLIDispatcher: role=%s phase=%s -> %s", role, phase, output_path)
189+
logger.info(
190+
# #196: report the runtime class name, not the parent's. SDKDispatcher
191+
# inherits from CLIDispatcher and runs through this path, but the
192+
# log line should say "SDKDispatcher" so operators don't think the
193+
# legacy claude -p subprocess (removed in #183) is in use.
194+
"%s: role=%s phase=%s -> %s",
195+
type(self).__name__, role, phase, output_path,
196+
)
190197

191198
def _retry_cli_parse(self, original_prompt: str, error: Exception, fmt: str) -> dict:
192199
feedback = (

orchestrator/engine.py

Lines changed: 9 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -121,9 +121,16 @@ def transition(self, to_state: str) -> None:
121121
f"Invalid transition: {current} -> {to_state}. "
122122
f"Valid: {TRANSITIONS[current]}"
123123
)
124-
# Build candidate state before writing to disk
124+
# Build candidate state before writing to disk.
125+
# #194: increment iteration whenever we leave INIT (iter-1 begins,
126+
# whether via PRE_WORK or directly to DESIGN) and whenever DONE
127+
# transitions to DESIGN (iter-N+1 begins). Pre-#194, the counter
128+
# only ticked on DONE→DESIGN, so state.iteration stayed at 0
129+
# throughout iter-1 even though artifacts lived at runs/iter-1/.
125130
new_state = dict(self._state)
126-
if current == "DONE" and to_state == "DESIGN":
131+
if current == "INIT":
132+
new_state["iteration"] += 1
133+
elif current == "DONE" and to_state == "DESIGN":
127134
new_state["iteration"] += 1
128135
new_state["phase"] = to_state
129136
new_state["timestamp"] = datetime.now(timezone.utc).isoformat()

0 commit comments

Comments
 (0)