Commit ef6403e
fix: clear all post-AI-native-Systems-Research#204 friction (AI-native-Systems-Research#205-AI-native-Systems-Research#217) — rehearsal mode + reliable iters + operational handoff (AI-native-Systems-Research#218)
* fix: surgical rehearsal iterations + reliable full iterations (post-AI-native-Systems-Research#204)
Lands the seven highest-leverage friction fixes from the post-AI-native-Systems-Research#204
paper-burst rerun (tracker AI-native-Systems-Research#213), targeted at making rehearsal
iterations surgical and full iterations reliable.
Closes AI-native-Systems-Research#205 — SDK turn now wraps each ``__anext__`` in
``anyio.fail_after``; per-message silence raises SDKTransientError
so the existing retry path catches mid-turn hangs instead of blocking
the campaign forever. New helper ``aiter_with_silence_watchdog`` is
pure-Python testable. New knob:
``campaign.sdk_timeouts.turn_silence_threshold_seconds`` (defaults
to ``silence_threshold_seconds``).
Closes AI-native-Systems-Research#206 — bundle schema accepts ``complexity_tier`` /
``tier_justification`` under ``metadata`` (where the agent's natural
instinct put them in AI-native-Systems-Research#203). Reader prefers metadata, falls back to
the legacy top-level location for backward-compat. Methodology text
clarifies the canonical location.
Closes AI-native-Systems-Research#208 — ``nous stop`` CLI message now says "phase boundary"
(the post-AI-native-Systems-Research#198 reality), not "iteration boundary". Docstring updated
to match.
Closes AI-native-Systems-Research#212 (v1) — campaign YAML grows ``iterations: [...]`` with
per-entry ``mode: rehearsal | real``. The DESIGN methodology reads
the resolved mode and renders mode-specific guidance: rehearsal
iterations focus on apparatus + feasibility checks, use ONE seed,
emit ``brief_amendments.md``; real iterations run the full bundle.
Defaults to ``real`` for missing/unknown config (backward-compat).
Same model for both modes — model swap is a v2 lever.
Closes AI-native-Systems-Research#214 — REPORT extractor's prompt now includes a pre-rendered
enumeration of ``runs/iter-*/results/*.json`` plus a retry_log
summary, so a campaign that completed 4/5 policies before failing
can no longer have its 40 valid arms dismissed as "no data."
``report.md`` template explicitly instructs the extractor to engage
with partial data and distinguish apparatus failure from null
experimental outcome (search-orientation discipline).
Closes AI-native-Systems-Research#215 — ``meta_findings.json`` no longer reports empty
``nous_asks`` on iteration failure. Single ``sdk_silence`` and
unactionable ``api_error`` entries each produce structured asks;
the "no surprises" notes string is replaced with a "heuristics gap"
message when retry_log has unclassified entries.
Closes AI-native-Systems-Research#216 — ``build_error_message`` synthesizes useful diagnostic
context (stop_reason, num_turns, message-class counts) when the SDK
returns ``is_error=True`` with empty/None result text. retry_log
rows are no longer literally ``"None"``.
Out of scope (filed for follow-up):
- AI-native-Systems-Research#207 — status reader walkback for ``Last tool``
- AI-native-Systems-Research#209 — worktree blis-binary copy/symlink
- AI-native-Systems-Research#210 — operational decisions in bundle (handoff structure)
- AI-native-Systems-Research#211 — bundle_amendments.jsonl drift record
- AI-native-Systems-Research#217 — failed-iteration distinction in ``nous status``
Tests
-----
+51 new tests (1050 passed, 1 skipped). Coverage by area:
- ``test_complexity_tier``: +5 metadata-location tests (AI-native-Systems-Research#206)
- ``test_nous_stop``: +1 CLI-message test (AI-native-Systems-Research#208)
- ``test_sdk_dispatch``: +9 watchdog + +7 build_error_message (AI-native-Systems-Research#205, AI-native-Systems-Research#216)
- ``test_meta_findings``: +3 retry_log-walking (AI-native-Systems-Research#215)
- ``test_llm_dispatch``: +10 results_summary / retry_log_summary / report
context (AI-native-Systems-Research#214)
- ``test_iteration_mode``: +18 (new file) for iteration_mode_for /
mode_guidance_for / schema acceptance / design-prompt rendering (AI-native-Systems-Research#212)
- ``test_prompt_loader``: tiny ctx fix to surface the new design
placeholders.
All tests are behavioral and use existing seam-injected fakes per
CLAUDE.md ("no live LLM calls in tests"). The watchdog logic is
pulled into a standalone async helper specifically so it's testable
without ``claude_agent_sdk``.
Refs AI-native-Systems-Research#213 (tracker)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: remaining friction subissues — handoff, status UX (AI-native-Systems-Research#207/AI-native-Systems-Research#209/AI-native-Systems-Research#210/AI-native-Systems-Research#211/AI-native-Systems-Research#217)
Lands the five remaining friction subissues from tracker AI-native-Systems-Research#213 so the next
paper-burst rerun has the full friction surface cleared. Composes with the
prior commit (rehearsal mode + reliability) to satisfy the goal: surgical
rehearsal iterations + full iterations that don't fail.
Closes AI-native-Systems-Research#207 — ``nous status`` no longer shows ``Last tool: ?`` when
the tail of executor_log is a SystemMessage / TaskNotificationMessage /
pure ThinkingBlock. The reader walks backward through up to 200 recent
events looking for the nearest ``tool_name``. New
``StatusSnapshot.last_tool_name`` field; format_one_liner /
format_watch_panel use it as primary, falling back to the legacy
last-event lookup.
Closes AI-native-Systems-Research#217 — ``nous status`` distinguishes failed iterations from
clean completions. New ``StatusSnapshot.failed_iterations`` field.
``Completed: N`` now counts only clean rows (CONFIRMED / REFUTED);
rows with ``status: "FAILED"`` (the shape ``append_failed_row`` writes)
contribute to ``Failed: N``. The watch panel shows the Failed line
only when non-zero (uncluttered for healthy campaigns); ``--line``
shows ``N done / M failed`` when M > 0.
Closes AI-native-Systems-Research#209 — bundle schema gains ``experiment_spec.preflight_commands``,
a list of shell commands the EXECUTE_ANALYZE agent must run before
fan-out. Methodology updates instruct the executor to run them before
smoke testing — fixes the worktree's missing-binary friction and any
similar build-state-doesn't-cross-the-worktree-boundary surprise.
Closes AI-native-Systems-Research#210 — bundle schema gains structured operational handoff
fields: ``fanout_template``, ``classification_function``,
``verified_parameters``. The DESIGN methodology instructs agents to
populate them when they've manually verified things in the worktree;
the EXECUTE_ANALYZE methodology instructs the executor to treat
them as canonical (no re-derivation). Together with AI-native-Systems-Research#209,
EXECUTE_ANALYZE no longer pays the token cost of re-grepping the
target's source for tenant_id semantics, GNU-parallel quoting, or
cache-size sanity.
Closes AI-native-Systems-Research#211 — bundle drift gets a formal record. EXECUTE_ANALYZE
appends to ``runs/iter-N/inputs/bundle_amendments.jsonl`` whenever
it overrides a parameter from ``experiment_spec.verified_parameters``
during smoke / validation. New helper
``_format_bundle_amendments_summary`` reads the log and feeds the
REPORT extractor's prompt (alongside AI-native-Systems-Research#214's results_summary and
retry_log_summary), so the report describes the *executed*
parameters, not the *prescribed* ones.
Why one commit
--------------
These five fixes share two threads — handoff structure (AI-native-Systems-Research#209, AI-native-Systems-Research#210,
AI-native-Systems-Research#211) and status UX (AI-native-Systems-Research#207, AI-native-Systems-Research#217) — and each is small. Splitting would
inflate review surface without splitting risk.
Tests
-----
+24 new tests (1074 passed, 1 skipped). Coverage by area:
- ``test_status``: +5 walkback (AI-native-Systems-Research#207) + +7 failed-iter counter (AI-native-Systems-Research#217)
- ``test_experiment_spec`` (new file): +6 schema acceptance for the
experiment_spec block (AI-native-Systems-Research#209/AI-native-Systems-Research#210), +5 bundle_amendments helper
rendering (AI-native-Systems-Research#211), +1 REPORT context inclusion (AI-native-Systems-Research#211).
All tests are behavioral and use existing seam-injected fakes. Schema
tests assert both forward (new field works) and backward (omitting the
new block still validates) compat.
Refs AI-native-Systems-Research#213 (tracker)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(review): close all PR-review findings — production-path correctness, invariants, behavioral tests
Addresses 17 review findings from /pr-review-toolkit:review-pr (5 agents:
code-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer,
type-design-analyzer). Two were CRITICAL bugs the prior commits silently
introduced; the rest are correctness, hygiene, and test-discipline
improvements.
Critical fixes
--------------
* **Thin templates were silent in production.** The PromptLoader prefers
``<template>_thin.md`` whenever ``work_dir/CLAUDE.md`` exists, and
``campaign.py`` writes that file before iter 1. The prior commits only
updated the FULL templates, so ``{{iteration_mode}}`` /
``{{mode_guidance}}`` (AI-native-Systems-Research#212) never reached production agents — tests
passed by exercising the wrong branch (no CLAUDE.md in tmp_path).
``design_thin.md`` now includes the placeholders, and the parametrized
``TestDesignPromptIncludesMode`` exercises BOTH template paths
(with_claude_md=True is the production hot path). Static methodology
text for AI-native-Systems-Research#209/AI-native-Systems-Research#210/AI-native-Systems-Research#211 already reaches the agent via system_prompt
(verified by ``test_real_methodology_preamble_carries_new_methodology``).
* **``experiment_spec.additionalProperties: true`` defeated schema
validation.** A typo'd field name (``preflight_command`` singular,
``verifid_parameters``, etc.) silently passed validation, EXECUTE_ANALYZE
ignored the typo, and the divergence showed only as runtime failure —
exactly the silent-handoff failure AI-native-Systems-Research#209/AI-native-Systems-Research#210 were added to fix,
re-introduced through schema permissiveness. Locked the outer block
to ``additionalProperties: false`` (the four sub-fields are an
enumerated DESIGN→EXECUTE_ANALYZE contract; new fields should be
schema-versioned). ``verified_parameters.additionalProperties: true``
stays — it's correctly modelling an open knob namespace.
High-priority correctness
-------------------------
* **``aiter_with_silence_watchdog`` no longer leaks the SDK stream.**
Wrapped the loop in ``try/finally`` that calls ``aclose()`` (when
the underlying iterator is an async generator) so SDK-stream tasks
and sockets are released on both clean completion and silence-raise.
Two new behavioral tests (``test_silence_raise_closes_underlying_iterator``,
``test_clean_completion_also_closes_iterator``) record ``aclose_called``
on a fake async generator and assert.
* **``_walkback_for_tool_name`` is now bounded I/O.** Was
``read_text().splitlines()`` which slurped the whole file before
applying the 200-event cap — defeating the cap on large logs.
Switched to ``deque(file, maxlen=limit)`` so the file streams
line-by-line and only the last N are retained. Two cap-boundary
tests verify: tool event past the cap returns None;
tool event just inside the cap returns it.
* **``SDKResult`` invariant locked.** Added ``__post_init__`` that
raises ``ValueError`` when ``is_error=True`` with an empty
``error_message``. The friction-test rerun produced
``error: "None"`` rows in retry_log because this invariant was
enforced by convention (build_error_message) and a producer that
forgot to call it would silently break it. Now construction-time
failure points operators at ``build_error_message`` directly.
Removed unused ``extra: dict`` field (YAGNI).
* **Tightened the over-tolerant assertion.** ``test_none_result_falls_back_to_diagnostic``
was ``"None" not in msg or "no result text" in msg`` — a disjunction
that passed if EITHER half was true, including for messages that
DID contain literal ``"None"`` (the very bug AI-native-Systems-Research#216 fixed). Replaced
with positive-only assertions: starts with the canonical prefix,
contains every expected ``key=value``, no ``: None`` or ``= None``
embedded.
Test discipline (per CLAUDE.md "behavioral, no live LLM calls")
---------------------------------------------------------------
* **Replaced ``runner.calls[0]["turn_silence_threshold"]`` structural
assertions** with ``TestTurnSilenceThresholdActuallyControlsBehavior``:
three behavioral tests using a runner whose return value depends on
the threshold value. Tests now assert on disk artifacts
(``design_log.md`` content) and ``retry_log.jsonl`` rows — rename-safe.
* **Added the missing E2E system_prompt regression test**
(``test_real_methodology_preamble_carries_new_methodology``): loads
the project's real prompts/methodology, confirms preflight_commands /
fanout_template / classification_function / verified_parameters /
bundle_amendments.jsonl all reach the system_prompt. Catches a future
edit that drops the operational-handoff section from execute_analyze.md.
* **Added test_design_renders_in_both_template_paths parametrize
fixture** so the production thin-template path is always exercised
alongside the full-template path.
Type design (lower-friction APIs)
---------------------------------
* ``iteration_mode_for(...) -> Mode`` (Literal["rehearsal", "real"]).
Same for ``mode_guidance_for``. Type checker now catches typo'd modes
in callers; runtime behavior unchanged for valid inputs.
* ``mode_guidance_for`` now ``raise ValueError`` on unknown mode (was
silent default to REAL_GUIDANCE). REAL is the more dangerous default
(a typo would run a full experiment when scope-shrink was intended);
fail loudly instead.
Operational brittleness
-----------------------
* **``failed_iterations`` counter pinned against future drift.** New
module-level ``_TERMINAL_FAILURE_STATUSES = frozenset({"FAILED"})``
with comment that any new failure status (``"TIMEOUT"``,
``"CANCELLED"``) must be added there. Was ``r.get("status") ==
"FAILED"`` equality, which would silently bucket a future
``"TIMEOUT"`` row as a clean completion.
* **bundle_amendments helper surfaces corruption instead of swallowing.**
Malformed JSONL lines now produce ``"X amendment(s) + N malformed
line(s) skipped"`` in the report context. ``OSError`` (permission
denied, truncated file) produces ``"<iter>: bundle_amendments.jsonl
unreadable (PermissionError)"``. Tests:
``test_malformed_lines_surfaced_not_silently_skipped``,
``test_unreadable_amendments_log_surfaced``.
Comment trim
------------
* Removed the temporal claim in ``iteration_mode.py`` (``"agents at
sonnet level have always run real iterations until AI-native-Systems-Research#212"``) — would
rot fast.
* Trimmed duplicate ``AI-native-Systems-Research#207`` and ``AI-native-Systems-Research#217`` issue references in
``status.py`` per CLAUDE.md ("references rot as the codebase
evolves"). Kept one canonical reference per concept.
Tests
-----
+15 new tests; full suite 1089 passed, 1 skipped, 0 regressions.
Behavioral throughout: assertions on disk content, retry_log rows,
schema validation, module-level state (``aclose_called``).
Refs AI-native-Systems-Research#213 (tracker)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>1 parent c13f871 commit ef6403e
22 files changed
Lines changed: 2491 additions & 35 deletions
File tree
- orchestrator
- schemas
- prompts/methodology
- tests
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
196 | 196 | | |
197 | 197 | | |
198 | 198 | | |
199 | | - | |
200 | | - | |
201 | | - | |
202 | | - | |
| 199 | + | |
| 200 | + | |
| 201 | + | |
| 202 | + | |
| 203 | + | |
203 | 204 | | |
204 | 205 | | |
205 | 206 | | |
| |||
229 | 230 | | |
230 | 231 | | |
231 | 232 | | |
232 | | - | |
233 | | - | |
| 233 | + | |
| 234 | + | |
| 235 | + | |
| 236 | + | |
234 | 237 | | |
235 | 238 | | |
236 | 239 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
37 | 37 | | |
38 | 38 | | |
39 | 39 | | |
40 | | - | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
41 | 46 | | |
42 | 47 | | |
43 | 48 | | |
| |||
46 | 51 | | |
47 | 52 | | |
48 | 53 | | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
49 | 59 | | |
50 | 60 | | |
51 | 61 | | |
52 | 62 | | |
53 | 63 | | |
54 | 64 | | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
55 | 80 | | |
56 | 81 | | |
57 | 82 | | |
| |||
139 | 164 | | |
140 | 165 | | |
141 | 166 | | |
142 | | - | |
| 167 | + | |
143 | 168 | | |
144 | 169 | | |
145 | 170 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
0 commit comments