Skip to content

Commit ef6403e

Browse files
sriumcpclaude
andauthored
fix: clear all post-AI-native-Systems-Research#204 friction (AI-native-Systems-Research#205-AI-native-Systems-Research#217) — rehearsal mode + reliable iters + operational handoff (AI-native-Systems-Research#218)
* fix: surgical rehearsal iterations + reliable full iterations (post-AI-native-Systems-Research#204) Lands the seven highest-leverage friction fixes from the post-AI-native-Systems-Research#204 paper-burst rerun (tracker AI-native-Systems-Research#213), targeted at making rehearsal iterations surgical and full iterations reliable. Closes AI-native-Systems-Research#205 — SDK turn now wraps each ``__anext__`` in ``anyio.fail_after``; per-message silence raises SDKTransientError so the existing retry path catches mid-turn hangs instead of blocking the campaign forever. New helper ``aiter_with_silence_watchdog`` is pure-Python testable. New knob: ``campaign.sdk_timeouts.turn_silence_threshold_seconds`` (defaults to ``silence_threshold_seconds``). Closes AI-native-Systems-Research#206 — bundle schema accepts ``complexity_tier`` / ``tier_justification`` under ``metadata`` (where the agent's natural instinct put them in AI-native-Systems-Research#203). Reader prefers metadata, falls back to the legacy top-level location for backward-compat. Methodology text clarifies the canonical location. Closes AI-native-Systems-Research#208 — ``nous stop`` CLI message now says "phase boundary" (the post-AI-native-Systems-Research#198 reality), not "iteration boundary". Docstring updated to match. Closes AI-native-Systems-Research#212 (v1) — campaign YAML grows ``iterations: [...]`` with per-entry ``mode: rehearsal | real``. The DESIGN methodology reads the resolved mode and renders mode-specific guidance: rehearsal iterations focus on apparatus + feasibility checks, use ONE seed, emit ``brief_amendments.md``; real iterations run the full bundle. Defaults to ``real`` for missing/unknown config (backward-compat). Same model for both modes — model swap is a v2 lever. Closes AI-native-Systems-Research#214 — REPORT extractor's prompt now includes a pre-rendered enumeration of ``runs/iter-*/results/*.json`` plus a retry_log summary, so a campaign that completed 4/5 policies before failing can no longer have its 40 valid arms dismissed as "no data." ``report.md`` template explicitly instructs the extractor to engage with partial data and distinguish apparatus failure from null experimental outcome (search-orientation discipline). Closes AI-native-Systems-Research#215 — ``meta_findings.json`` no longer reports empty ``nous_asks`` on iteration failure. Single ``sdk_silence`` and unactionable ``api_error`` entries each produce structured asks; the "no surprises" notes string is replaced with a "heuristics gap" message when retry_log has unclassified entries. Closes AI-native-Systems-Research#216 — ``build_error_message`` synthesizes useful diagnostic context (stop_reason, num_turns, message-class counts) when the SDK returns ``is_error=True`` with empty/None result text. retry_log rows are no longer literally ``"None"``. Out of scope (filed for follow-up): - AI-native-Systems-Research#207 — status reader walkback for ``Last tool`` - AI-native-Systems-Research#209 — worktree blis-binary copy/symlink - AI-native-Systems-Research#210 — operational decisions in bundle (handoff structure) - AI-native-Systems-Research#211 — bundle_amendments.jsonl drift record - AI-native-Systems-Research#217 — failed-iteration distinction in ``nous status`` Tests ----- +51 new tests (1050 passed, 1 skipped). Coverage by area: - ``test_complexity_tier``: +5 metadata-location tests (AI-native-Systems-Research#206) - ``test_nous_stop``: +1 CLI-message test (AI-native-Systems-Research#208) - ``test_sdk_dispatch``: +9 watchdog + +7 build_error_message (AI-native-Systems-Research#205, AI-native-Systems-Research#216) - ``test_meta_findings``: +3 retry_log-walking (AI-native-Systems-Research#215) - ``test_llm_dispatch``: +10 results_summary / retry_log_summary / report context (AI-native-Systems-Research#214) - ``test_iteration_mode``: +18 (new file) for iteration_mode_for / mode_guidance_for / schema acceptance / design-prompt rendering (AI-native-Systems-Research#212) - ``test_prompt_loader``: tiny ctx fix to surface the new design placeholders. All tests are behavioral and use existing seam-injected fakes per CLAUDE.md ("no live LLM calls in tests"). The watchdog logic is pulled into a standalone async helper specifically so it's testable without ``claude_agent_sdk``. Refs AI-native-Systems-Research#213 (tracker) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: remaining friction subissues — handoff, status UX (AI-native-Systems-Research#207/AI-native-Systems-Research#209/AI-native-Systems-Research#210/AI-native-Systems-Research#211/AI-native-Systems-Research#217) Lands the five remaining friction subissues from tracker AI-native-Systems-Research#213 so the next paper-burst rerun has the full friction surface cleared. Composes with the prior commit (rehearsal mode + reliability) to satisfy the goal: surgical rehearsal iterations + full iterations that don't fail. Closes AI-native-Systems-Research#207 — ``nous status`` no longer shows ``Last tool: ?`` when the tail of executor_log is a SystemMessage / TaskNotificationMessage / pure ThinkingBlock. The reader walks backward through up to 200 recent events looking for the nearest ``tool_name``. New ``StatusSnapshot.last_tool_name`` field; format_one_liner / format_watch_panel use it as primary, falling back to the legacy last-event lookup. Closes AI-native-Systems-Research#217 — ``nous status`` distinguishes failed iterations from clean completions. New ``StatusSnapshot.failed_iterations`` field. ``Completed: N`` now counts only clean rows (CONFIRMED / REFUTED); rows with ``status: "FAILED"`` (the shape ``append_failed_row`` writes) contribute to ``Failed: N``. The watch panel shows the Failed line only when non-zero (uncluttered for healthy campaigns); ``--line`` shows ``N done / M failed`` when M > 0. Closes AI-native-Systems-Research#209 — bundle schema gains ``experiment_spec.preflight_commands``, a list of shell commands the EXECUTE_ANALYZE agent must run before fan-out. Methodology updates instruct the executor to run them before smoke testing — fixes the worktree's missing-binary friction and any similar build-state-doesn't-cross-the-worktree-boundary surprise. Closes AI-native-Systems-Research#210 — bundle schema gains structured operational handoff fields: ``fanout_template``, ``classification_function``, ``verified_parameters``. The DESIGN methodology instructs agents to populate them when they've manually verified things in the worktree; the EXECUTE_ANALYZE methodology instructs the executor to treat them as canonical (no re-derivation). Together with AI-native-Systems-Research#209, EXECUTE_ANALYZE no longer pays the token cost of re-grepping the target's source for tenant_id semantics, GNU-parallel quoting, or cache-size sanity. Closes AI-native-Systems-Research#211 — bundle drift gets a formal record. EXECUTE_ANALYZE appends to ``runs/iter-N/inputs/bundle_amendments.jsonl`` whenever it overrides a parameter from ``experiment_spec.verified_parameters`` during smoke / validation. New helper ``_format_bundle_amendments_summary`` reads the log and feeds the REPORT extractor's prompt (alongside AI-native-Systems-Research#214's results_summary and retry_log_summary), so the report describes the *executed* parameters, not the *prescribed* ones. Why one commit -------------- These five fixes share two threads — handoff structure (AI-native-Systems-Research#209, AI-native-Systems-Research#210, AI-native-Systems-Research#211) and status UX (AI-native-Systems-Research#207, AI-native-Systems-Research#217) — and each is small. Splitting would inflate review surface without splitting risk. Tests ----- +24 new tests (1074 passed, 1 skipped). Coverage by area: - ``test_status``: +5 walkback (AI-native-Systems-Research#207) + +7 failed-iter counter (AI-native-Systems-Research#217) - ``test_experiment_spec`` (new file): +6 schema acceptance for the experiment_spec block (AI-native-Systems-Research#209/AI-native-Systems-Research#210), +5 bundle_amendments helper rendering (AI-native-Systems-Research#211), +1 REPORT context inclusion (AI-native-Systems-Research#211). All tests are behavioral and use existing seam-injected fakes. Schema tests assert both forward (new field works) and backward (omitting the new block still validates) compat. Refs AI-native-Systems-Research#213 (tracker) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(review): close all PR-review findings — production-path correctness, invariants, behavioral tests Addresses 17 review findings from /pr-review-toolkit:review-pr (5 agents: code-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer, type-design-analyzer). Two were CRITICAL bugs the prior commits silently introduced; the rest are correctness, hygiene, and test-discipline improvements. Critical fixes -------------- * **Thin templates were silent in production.** The PromptLoader prefers ``<template>_thin.md`` whenever ``work_dir/CLAUDE.md`` exists, and ``campaign.py`` writes that file before iter 1. The prior commits only updated the FULL templates, so ``{{iteration_mode}}`` / ``{{mode_guidance}}`` (AI-native-Systems-Research#212) never reached production agents — tests passed by exercising the wrong branch (no CLAUDE.md in tmp_path). ``design_thin.md`` now includes the placeholders, and the parametrized ``TestDesignPromptIncludesMode`` exercises BOTH template paths (with_claude_md=True is the production hot path). Static methodology text for AI-native-Systems-Research#209/AI-native-Systems-Research#210/AI-native-Systems-Research#211 already reaches the agent via system_prompt (verified by ``test_real_methodology_preamble_carries_new_methodology``). * **``experiment_spec.additionalProperties: true`` defeated schema validation.** A typo'd field name (``preflight_command`` singular, ``verifid_parameters``, etc.) silently passed validation, EXECUTE_ANALYZE ignored the typo, and the divergence showed only as runtime failure — exactly the silent-handoff failure AI-native-Systems-Research#209/AI-native-Systems-Research#210 were added to fix, re-introduced through schema permissiveness. Locked the outer block to ``additionalProperties: false`` (the four sub-fields are an enumerated DESIGN→EXECUTE_ANALYZE contract; new fields should be schema-versioned). ``verified_parameters.additionalProperties: true`` stays — it's correctly modelling an open knob namespace. High-priority correctness ------------------------- * **``aiter_with_silence_watchdog`` no longer leaks the SDK stream.** Wrapped the loop in ``try/finally`` that calls ``aclose()`` (when the underlying iterator is an async generator) so SDK-stream tasks and sockets are released on both clean completion and silence-raise. Two new behavioral tests (``test_silence_raise_closes_underlying_iterator``, ``test_clean_completion_also_closes_iterator``) record ``aclose_called`` on a fake async generator and assert. * **``_walkback_for_tool_name`` is now bounded I/O.** Was ``read_text().splitlines()`` which slurped the whole file before applying the 200-event cap — defeating the cap on large logs. Switched to ``deque(file, maxlen=limit)`` so the file streams line-by-line and only the last N are retained. Two cap-boundary tests verify: tool event past the cap returns None; tool event just inside the cap returns it. * **``SDKResult`` invariant locked.** Added ``__post_init__`` that raises ``ValueError`` when ``is_error=True`` with an empty ``error_message``. The friction-test rerun produced ``error: "None"`` rows in retry_log because this invariant was enforced by convention (build_error_message) and a producer that forgot to call it would silently break it. Now construction-time failure points operators at ``build_error_message`` directly. Removed unused ``extra: dict`` field (YAGNI). * **Tightened the over-tolerant assertion.** ``test_none_result_falls_back_to_diagnostic`` was ``"None" not in msg or "no result text" in msg`` — a disjunction that passed if EITHER half was true, including for messages that DID contain literal ``"None"`` (the very bug AI-native-Systems-Research#216 fixed). Replaced with positive-only assertions: starts with the canonical prefix, contains every expected ``key=value``, no ``: None`` or ``= None`` embedded. Test discipline (per CLAUDE.md "behavioral, no live LLM calls") --------------------------------------------------------------- * **Replaced ``runner.calls[0]["turn_silence_threshold"]`` structural assertions** with ``TestTurnSilenceThresholdActuallyControlsBehavior``: three behavioral tests using a runner whose return value depends on the threshold value. Tests now assert on disk artifacts (``design_log.md`` content) and ``retry_log.jsonl`` rows — rename-safe. * **Added the missing E2E system_prompt regression test** (``test_real_methodology_preamble_carries_new_methodology``): loads the project's real prompts/methodology, confirms preflight_commands / fanout_template / classification_function / verified_parameters / bundle_amendments.jsonl all reach the system_prompt. Catches a future edit that drops the operational-handoff section from execute_analyze.md. * **Added test_design_renders_in_both_template_paths parametrize fixture** so the production thin-template path is always exercised alongside the full-template path. Type design (lower-friction APIs) --------------------------------- * ``iteration_mode_for(...) -> Mode`` (Literal["rehearsal", "real"]). Same for ``mode_guidance_for``. Type checker now catches typo'd modes in callers; runtime behavior unchanged for valid inputs. * ``mode_guidance_for`` now ``raise ValueError`` on unknown mode (was silent default to REAL_GUIDANCE). REAL is the more dangerous default (a typo would run a full experiment when scope-shrink was intended); fail loudly instead. Operational brittleness ----------------------- * **``failed_iterations`` counter pinned against future drift.** New module-level ``_TERMINAL_FAILURE_STATUSES = frozenset({"FAILED"})`` with comment that any new failure status (``"TIMEOUT"``, ``"CANCELLED"``) must be added there. Was ``r.get("status") == "FAILED"`` equality, which would silently bucket a future ``"TIMEOUT"`` row as a clean completion. * **bundle_amendments helper surfaces corruption instead of swallowing.** Malformed JSONL lines now produce ``"X amendment(s) + N malformed line(s) skipped"`` in the report context. ``OSError`` (permission denied, truncated file) produces ``"<iter>: bundle_amendments.jsonl unreadable (PermissionError)"``. Tests: ``test_malformed_lines_surfaced_not_silently_skipped``, ``test_unreadable_amendments_log_surfaced``. Comment trim ------------ * Removed the temporal claim in ``iteration_mode.py`` (``"agents at sonnet level have always run real iterations until AI-native-Systems-Research#212"``) — would rot fast. * Trimmed duplicate ``AI-native-Systems-Research#207`` and ``AI-native-Systems-Research#217`` issue references in ``status.py`` per CLAUDE.md ("references rot as the codebase evolves"). Kept one canonical reference per concept. Tests ----- +15 new tests; full suite 1089 passed, 1 skipped, 0 regressions. Behavioral throughout: assertions on disk content, retry_log rows, schema validation, module-level state (``aclose_called``). Refs AI-native-Systems-Research#213 (tracker) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent c13f871 commit ef6403e

22 files changed

Lines changed: 2491 additions & 35 deletions

orchestrator/cli.py

Lines changed: 9 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -196,10 +196,11 @@ def _cmd_stop(args):
196196
"""Ask a running campaign to wind down cleanly between phases.
197197
198198
Writes a ``STOP`` sentinel at the campaign work_dir root. The
199-
next time the orchestrator passes a checkpoint (between
200-
iterations today; between phases is on the roadmap), it raises
201-
``CampaignStopped``, persists a ``stopped_by_user`` ledger row,
202-
and exits without orphaning worktrees or pending dispatcher calls.
199+
next time the orchestrator passes a checkpoint — at the start of
200+
each iteration AND at every phase transition within an iteration
201+
(#198) — it raises ``CampaignStopped``, persists a
202+
``stopped_by_user`` ledger row, and exits without orphaning
203+
worktrees or pending dispatcher calls.
203204
204205
For mid-iteration interruption, ``Ctrl+C`` still works — the
205206
engine's atomic checkpoint means the next ``nous resume`` picks
@@ -229,8 +230,10 @@ def _cmd_stop(args):
229230
if reason:
230231
print(f"Reason: {reason}")
231232
print(
232-
"The campaign will halt at the next iteration boundary. To "
233-
"cancel the stop request, delete the sentinel file."
233+
"The campaign will halt at the next phase boundary (a phase "
234+
"transition within the current iteration, or the start of "
235+
"the next iteration — whichever comes first). To cancel the "
236+
"stop request, delete the sentinel file."
234237
)
235238

236239

orchestrator/complexity_tier.py

Lines changed: 27 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -37,7 +37,12 @@
3737

3838

3939
def _read_bundle_tier(path: Path) -> int | None:
40-
"""Read complexity_tier from a bundle.yaml. None if missing or malformed."""
40+
"""Read complexity_tier from a bundle.yaml. None if missing or malformed.
41+
42+
Looks under ``metadata`` first, then falls back to the legacy top-level
43+
location (#206). When both are populated, the metadata value wins so
44+
``metadata`` is the canonical place to put it going forward.
45+
"""
4146
if not path.exists():
4247
return None
4348
try:
@@ -46,12 +51,32 @@ def _read_bundle_tier(path: Path) -> int | None:
4651
return None
4752
if not isinstance(data, dict):
4853
return None
54+
metadata = data.get("metadata")
55+
if isinstance(metadata, dict):
56+
tier = metadata.get("complexity_tier")
57+
if isinstance(tier, int) and 1 <= tier <= 4:
58+
return tier
4959
tier = data.get("complexity_tier")
5060
if isinstance(tier, int) and 1 <= tier <= 4:
5161
return tier
5262
return None
5363

5464

65+
def _read_bundle_justification(bundle: object) -> str | None:
66+
"""Pull tier_justification from metadata, falling back to root (#206)."""
67+
if not isinstance(bundle, dict):
68+
return None
69+
metadata = bundle.get("metadata")
70+
if isinstance(metadata, dict):
71+
j = metadata.get("tier_justification")
72+
if isinstance(j, str) and j.strip():
73+
return j
74+
j = bundle.get("tier_justification")
75+
if isinstance(j, str) and j.strip():
76+
return j
77+
return None
78+
79+
5580
def prior_iteration_tiers(work_dir: Path, *, up_to: int) -> dict[int, int]:
5681
"""Return {iteration: tier} for completed prior iterations.
5782
@@ -139,7 +164,7 @@ def format_tier_summary(
139164
# Show justification if present.
140165
try:
141166
bundle = yaml.safe_load(Path(bundle_path).read_text())
142-
justification = bundle.get("tier_justification") if isinstance(bundle, dict) else None
167+
justification = _read_bundle_justification(bundle)
143168
except (OSError, yaml.YAMLError):
144169
justification = None
145170
if justification:

orchestrator/iteration_mode.py

Lines changed: 121 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,121 @@
1+
"""Per-iteration mode resolution.
2+
3+
A campaign's ``iterations: [...]`` list, when present, lets operators tag
4+
each iteration as ``rehearsal`` or ``real``. The DESIGN methodology reads
5+
the resolved mode from its prompt context and scope-shrinks accordingly:
6+
rehearsal iterations focus on the *apparatus check* (does the workload
7+
spec parse, do BLIS args bind, does analysis validate?) and the
8+
*feasibility check* (does the parameter regime engage the mechanism?),
9+
emitting ``brief_amendments.md`` for any campaign-spec friction. Real
10+
iterations run the full bundle at full scope.
11+
12+
This module is **pure Python — no LLM, no I/O**. The orchestrator and
13+
LLMDispatcher import the resolver to populate prompt context;
14+
test_iteration_mode covers the cases.
15+
"""
16+
from __future__ import annotations
17+
18+
from typing import Literal
19+
20+
21+
# Type alias used by callers that want the type-checker to enforce the
22+
# enum at the API surface (instead of duck-typed strings flowing through).
23+
Mode = Literal["rehearsal", "real"]
24+
25+
# Default when the campaign omits ``iterations``, or when an iteration
26+
# index is out of range. ``real`` is the conservative default — a
27+
# rehearsal-mode iteration scope-shrinks; defaulting to it could mean
28+
# "skip the full experiment by accident."
29+
DEFAULT_MODE: Mode = "real"
30+
31+
VALID_MODES: tuple[Mode, ...] = ("rehearsal", "real")
32+
33+
34+
def iteration_mode_for(campaign: dict, iteration: int) -> Mode:
35+
"""Return the mode for iteration N, defaulting to ``real``.
36+
37+
Out-of-range index, missing block, or malformed entry: ``real``.
38+
"""
39+
if iteration < 1:
40+
return DEFAULT_MODE
41+
iters = campaign.get("iterations")
42+
if not isinstance(iters, list) or not iters:
43+
return DEFAULT_MODE
44+
idx = iteration - 1
45+
if idx >= len(iters):
46+
return DEFAULT_MODE
47+
entry = iters[idx]
48+
if not isinstance(entry, dict):
49+
return DEFAULT_MODE
50+
mode = entry.get("mode")
51+
if mode in VALID_MODES:
52+
return mode # type: ignore[return-value] — narrowed by membership
53+
return DEFAULT_MODE
54+
55+
56+
REHEARSAL_GUIDANCE = """\
57+
This iteration is a **REHEARSAL** (#212). Optimize for fast feedback over
58+
scientific completeness. Two distinct goals — score them separately:
59+
60+
1. **Apparatus check.** Does the experimental machinery work end-to-end?
61+
- Does the workload spec parse?
62+
- Do BLIS / target-system args bind correctly?
63+
- Does the analysis script schema-validate at least one result?
64+
- Are the canonical seeds usable, or do they trip a known bug?
65+
66+
2. **Feasibility check.** Is the parameter regime worth running?
67+
- Does the workload actually engage the mechanism under test? (e.g.
68+
does the burst create KV pressure, vs all adversary requests being
69+
dropped_unservable?)
70+
- Does the policy contrast actually differentiate on this workload,
71+
or do both arms produce identical metrics?
72+
73+
**Scope discipline for rehearsals:**
74+
- Use ONE seed (the first canonical seed for this campaign).
75+
- Use the contrast-pair arms only (h-main vs the most direct control).
76+
Do NOT fan out across all arms in a multi-arm bundle.
77+
- Keep wall-time small. If a rehearsal is going to take more than ~5–10
78+
minutes, you're doing too much.
79+
80+
**What to emit alongside findings:**
81+
If you find any campaign-spec or brief inconsistencies (paths the
82+
validator rejects, broken argv quoting, wall-time claims that don't
83+
match reality, single-tenant probes when the target requires multi-
84+
tenant, etc.), write them to ``runs/iter-N/brief_amendments.md`` —
85+
one entry per finding, with file path + suggested change. The next
86+
``real`` iteration will read this; future runs of the same campaign
87+
will benefit indefinitely.
88+
89+
**Do NOT:**
90+
- Author full multi-arm bundles. Keep arms minimal.
91+
- Run all canonical seeds. One seed is enough to verify apparatus
92+
+ feasibility.
93+
- Conclude on the research question. Rehearsals don't confirm or
94+
refute hypotheses; they validate the apparatus.
95+
"""
96+
97+
REAL_GUIDANCE = """\
98+
This iteration is a **REAL** run (#212). Run the bundle at full scope:
99+
all arms, full seed list, full workload. Do not scope-shrink.
100+
101+
If a prior ``rehearsal`` iteration emitted ``brief_amendments.md``,
102+
read it before authoring the bundle — apply the amendments and don't
103+
re-discover the same friction.
104+
"""
105+
106+
107+
def mode_guidance_for(mode: Mode) -> str:
108+
"""Return the prompt block that guides the agent for ``mode``.
109+
110+
Raises ``ValueError`` on an unknown mode value. Silently defaulting
111+
to REAL_GUIDANCE was the prior behavior; that's the more dangerous
112+
default (rehearsal is the conservative one), so we fail loudly
113+
instead of running a full experiment when a typo says otherwise.
114+
"""
115+
if mode == "rehearsal":
116+
return REHEARSAL_GUIDANCE
117+
if mode == "real":
118+
return REAL_GUIDANCE
119+
raise ValueError(
120+
f"unknown iteration mode {mode!r}; expected one of {VALID_MODES}"
121+
)

0 commit comments

Comments
 (0)