Commit 0ae90a9

sriumcp and claude authored
fix(methodology): close all 5 subissues of tracker AI-native-Systems-Research#225 (post-AI-native-Systems-Research#218 follow-up) (AI-native-Systems-Research#227)
* fix(methodology): close all 5 subissues of tracker AI-native-Systems-Research#225 (post-AI-native-Systems-Research#218 follow-up)

Lands v1 scope of every methodology subissue revealed by the post-AI-native-Systems-Research#218 paper-burst rerun (2026-05-27). Each subissue has a concrete fix that materially improves the rehearsal/real iteration loop and prepares the codebase for cross-run knowledge propagation.

Closes AI-native-Systems-Research#221: Render iteration_mode + execute_mode_guidance in EXECUTE_ANALYZE prompt context. The DESIGN-phase agent (post-AI-native-Systems-Research#212) honors rehearsal scope-shrink for probes, but the bundle it authors declares the full experimental design, and the EXECUTE_ANALYZE-phase agent (which had no mode signal) dutifully fanned out the full bundle anyway. New ``execute_mode_guidance_for(mode)`` returns rehearsal/real text distinct from the design-phase helper. Plumbed through ``_build_context`` for ``phase == "execute-analyze"``; rendered into ``execute_analyze.md`` + ``execute_analyze_thin.md`` with ``{{iteration_mode}}`` + ``{{mode_guidance}}`` placeholders. Test parametrized over ``with_claude_md`` (production thin path).

Closes AI-native-Systems-Research#222: bundle.experiment_spec gains a structured ``rehearsal_subset`` field (seeds, arms, extra_validation_only). Schema-locked enum so a typo'd field name fails validation. The DESIGN methodology instructs agents to populate it when iter is rehearsal; EXECUTE_ANALYZE honors it (per AI-native-Systems-Research#221's mode_guidance). Composes with AI-native-Systems-Research#221: prose-only scope-shrink was unreliable; a structured field is enforceable.

Closes AI-native-Systems-Research#223 v1: structured ``brief_amendments.jsonl`` schema + REPORT-context renderer. New ``brief_amendments.schema.json`` with required fields (``id, brief_section, problem, fix, priority``) and an enumerated priority. New ``_format_brief_amendments_summary(work_dir)`` helper renders amendments grouped by priority into the REPORT prompt. CLI ``nous brief apply-amendments`` deferred to v2.

Closes AI-native-Systems-Research#224 v1: deterministic ``promote_gate.evaluate_promote_gate(work_dir, iteration) -> dict`` function. Pure Python; reads findings.json, brief_amendments.jsonl, applied_amendments.jsonl. Decision rule: missing/invalid findings → ``abort``; unapplied BLOCKING amendment → ``revise``; else → ``promote``. Engine state-machine integration (the actual halting behavior at iter boundaries) deferred to v2. This PR lands the decision logic so it's testable in isolation before any engine state changes.

Closes AI-native-Systems-Research#226: bundle.experiment_spec gains a structured ``timing_observations`` block (per-policy expected wall-time + recommended_turn_silence_threshold_seconds). ``SDKDispatcher.dispatch`` reads the prior iter's bundle for the recommended threshold and applies it as a per-call override; restores the campaign default after. Resolution chain: bundle override > campaign default > factory default (600s). Methodology prescribes that rehearsal-mode agents record per-policy timing observations during feasibility probes, so the recurring ``externality-credit`` slowness across three reruns becomes structural data instead of folklore.

Refs AI-native-Systems-Research#225 (tracker, five children covered).

Tests: +44 new (1133 passed, 1 skipped, 0 regressions). Behavioral throughout: assertions on resolved ctx values, on-disk artifacts, schema validation. Per CLAUDE.md ("no live LLM calls"), all tests use existing seam-injected fakes.

Compaction-safe plan at ``docs/plans/methodology-improvements-pr.md`` captures the full implementation map; memory entries at ``project_methodology_pr_in_flight.md`` and ``project_paper_burst_workload_divergence.md`` carry session context across compactions.
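The threshold handling described for AI-native-Systems-Research#226 can be sketched as below. This is a minimal illustration, not the code from the commit: ``SketchDispatcher``, ``recommend_threshold``, ``resolve_threshold``, and the 30-second buffer are hypothetical names and values; only the resolution order, the 600s factory default, and the "~3× the slowest policy + buffer" rule come from the text above.

```python
from typing import Callable, Optional

FACTORY_DEFAULT_SECONDS = 600.0  # factory default named in the commit message


def recommend_threshold(wall_times: dict[str, float],
                        buffer_seconds: float = 30.0) -> float:
    """Roughly 3x the slowest observed policy plus a buffer.

    The buffer size is an assumption; the commit does not state one.
    """
    if not wall_times:
        raise ValueError("no timing observations recorded")
    return 3 * max(wall_times.values()) + buffer_seconds


def resolve_threshold(bundle_override: Optional[float],
                      campaign_default: Optional[float]) -> float:
    """Resolution chain: bundle override > campaign default > factory default."""
    if bundle_override is not None:
        return bundle_override
    if campaign_default is not None:
        return campaign_default
    return FACTORY_DEFAULT_SECONDS


class SketchDispatcher:
    """Hypothetical stand-in for SDKDispatcher; models only the threshold."""

    def __init__(self, campaign_default: Optional[float] = None) -> None:
        self.threshold = resolve_threshold(None, campaign_default)

    def dispatch(self, runner: Callable[[float], str],
                 bundle_override: Optional[float] = None) -> str:
        saved = self.threshold
        self.threshold = resolve_threshold(bundle_override, saved)
        try:
            return runner(self.threshold)
        finally:
            # Restore even when the runner raises, so one call's override
            # never leaks into the next call.
            self.threshold = saved
```

Putting the restore in ``finally`` is what makes the failure path safe: a runner that raises still leaves the dispatcher at the campaign default.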
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(review): close PR AI-native-Systems-Research#227 review findings: silent-failure, docstring, e2e coverage

Addresses 11 findings from /pr-review-toolkit:review-pr (5 agents: code-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer, type-design-analyzer). Three are critical regressions of patterns PR AI-native-Systems-Research#218 had killed; the rest are correctness, documentation, and test-coverage improvements.

Critical fixes
--------------

* **Narrowed `except (OSError, Exception)` → `(OSError, yaml.YAMLError)`** in ``SDKDispatcher._bundle_recommended_turn_silence_threshold``. The prior broad-except swallowed ImportError, MemoryError, and any future YAML library refactor, defeating PR AI-native-Systems-Research#218's silent-failure guarantees. Both branches now ``logger.warning`` so operators see why the override didn't apply. ImportError on missing PyYAML now propagates as it should (it's an environmental defect, not a runtime fallback case).

* **`_format_brief_amendments_summary` docstring corrected.** Previous docstring claimed "schema-validates rows individually (skipping malformed with a visible warning)." Code only does ``json.loads``; no schema validation. Updated docstring to describe what the code actually does: the schema is enforced by the agent that *writes* the file (per methodology), not by this renderer.

* **DESIGN-phase REHEARSAL_GUIDANCE path mismatch.** Prior text told agents to write ``runs/iter-N/brief_amendments.md`` (legacy markdown path); the EXECUTE-phase guidance, the schema, the renderer, and the promote gate ALL use ``runs/iter-N/inputs/brief_amendments.jsonl`` (post-AI-native-Systems-Research#223 structured). DESIGN-following agents would have silently dropped amendments on the floor. Both phases now point at the same JSONL path with the same required-fields list.
High-priority correctness
-------------------------

* **Promote gate: malformed amendment lines downgrade to revise** (asymmetric-risk choice). ``_read_jsonl_with_skips`` returns ``(rows, malformed_count)``. If brief_amendments.jsonl has any unparseable lines, the gate cannot rule out a hidden BLOCKING entry; silently treating that as "no BLOCKING amendments" risks false promotion past corruption. Now: emits ``revise`` with ``malformed_amendment_lines: N`` in the result dict and reasoning text that names the file path. Operator inspects vs. wastes an iteration's tokens: symmetric cost reversal.

* **Promote gate scope explicitly documented as iter-local.** Per the brief_amendments schema, ``id`` is "stable within this iter's amendments" (not globally unique). The gate reads only iter-N's amendments, so iter-1 BLOCKING amendments that were never applied do NOT re-flag at iter-2. Docstring now states this explicitly + notes the v2 work (composite IDs, apply-amendments CLI) needed to cross-iter-scope. Callers MUST run the gate after every iter that emits BLOCKING amendments, not just the last one.

* **Restore-after-failure now has a behavioral test.** Post-AI-native-Systems-Research#218 the ``SDKDispatcher.dispatch`` override-and-restore lives in ``try/finally``, but no test exercised the failure path. New ``test_dispatch_restores_threshold_when_runner_raises`` constructs a runner that raises ``SDKTransientError``, asserts dispatch raises, AND asserts the dispatcher's stored threshold equals the campaign default afterward. A regression that moves the restore out of ``finally`` is now caught.

* **End-to-end coupling test across AI-native-Systems-Research#221+AI-native-Systems-Research#222+AI-native-Systems-Research#223+AI-native-Systems-Research#224.** Each subissue had per-feature tests, but no single test verified the chain (rehearsal mode → execute honors rehearsal_subset → BLOCKING amendment written → gate decides revise). New ``TestEndToEndIntegration.test_rehearsal_emits_blocking_amendment_then_gate_revises`` walks the full pipeline using only public functions and schemas, exercising the most likely future regression class (mode resolver bug, schema field rename, gate logic drift) in one place. Also verifies the apply-amendments-then-promote happy path.

Documentation / methodology
---------------------------

* **Stripped issue-number references from agent-facing prose.** The agent has no GitHub access; ``(AI-native-Systems-Research#212)``, ``(AI-native-Systems-Research#221)``, ``— AI-native-Systems-Research#222`` in ``EXECUTE_REHEARSAL_GUIDANCE`` etc. were noise. Kept in Python docstrings/comments where developers benefit. Same applies to the ``(post-AI-native-Systems-Research#223 v2)`` parenthetical that was leaking into the promote_gate's operator-facing reasoning text.

* **EXECUTE_REAL_GUIDANCE rewrites the halt-mechanism description.** Previous text said "halt with a failure_note.md", but the actual halt mechanism (post-AI-native-Systems-Research#224 v1) is ``decision=revise`` from the promote gate, which the engine acts on (v2 wiring). Updated to describe the agent's role correctly: read amendments, apply them to run config, write findings.json with appropriate status. The failure_note.md is now a fallback for the "I cannot apply this amendment" case, not the primary halt mechanism.

* **`_decision` docstring removed** (redundant with function name). Module-level docstring trimmed of v1/v2 task-tracking framing (that lives in the PR description; rots in code).

Tests
-----

+8 new tests (1137 passed, 1 skipped, 0 regressions). All behavioral. The end-to-end test specifically exercises every artifact + schema + function in the AI-native-Systems-Research#221-226 cluster as a single chain.

Refs PR AI-native-Systems-Research#227 review findings.
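The promote-gate rule described in these messages (missing or invalid findings → ``abort``; malformed amendment lines or an unapplied BLOCKING amendment → ``revise``; otherwise ``promote``) can be sketched as follows. This is an illustration under stated assumptions, not the committed ``promote_gate`` module: the locations of ``findings.json`` and ``applied_amendments.jsonl``, the ``id`` key in applied rows, the ``reasoning`` key, and the function names are all hypothetical. Only the decision values, ``malformed_amendment_lines``, and the ``brief_amendments.jsonl`` path come from the source.

```python
import json
from pathlib import Path


def read_jsonl_with_skips(path: Path) -> tuple[list[dict], int]:
    """Return (rows, malformed_count); a missing file yields ([], 0)."""
    if not path.exists():
        return [], 0
    rows: list[dict] = []
    malformed = 0
    for line in path.read_text().splitlines():
        if not line.strip():
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            malformed += 1
            continue
        if isinstance(obj, dict):
            rows.append(obj)
        else:
            malformed += 1  # valid JSON but not an object: count as corrupt
    return rows, malformed


def evaluate_gate_sketch(work_dir: Path, iteration: int) -> dict:
    iter_dir = work_dir / "runs" / f"iter-{iteration}"
    findings_path = iter_dir / "findings.json"  # assumed location
    try:
        json.loads(findings_path.read_text())
    except (OSError, ValueError):
        return {"decision": "abort",
                "reasoning": f"missing or invalid {findings_path}"}

    amendments_path = iter_dir / "inputs" / "brief_amendments.jsonl"
    amendments, malformed = read_jsonl_with_skips(amendments_path)
    if malformed:
        # An unparseable line could hide a BLOCKING entry, so do not promote.
        return {"decision": "revise",
                "malformed_amendment_lines": malformed,
                "reasoning": f"{malformed} unparseable line(s) in {amendments_path}"}

    # Assumed location and row shape for the applied-amendments log.
    applied, _ = read_jsonl_with_skips(iter_dir / "inputs" / "applied_amendments.jsonl")
    applied_ids = {row.get("id") for row in applied}
    unapplied = [row.get("id") for row in amendments
                 if row.get("priority") == "BLOCKING"
                 and row.get("id") not in applied_ids]
    if unapplied:
        return {"decision": "revise",
                "reasoning": f"unapplied BLOCKING amendments: {unapplied}"}
    return {"decision": "promote",
            "reasoning": "no unapplied BLOCKING amendments"}
```

Because the sketch reads only ``iter-N``'s files, it is iter-local in the same sense the message describes: an unapplied BLOCKING amendment from an earlier iteration would not be seen.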
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent ef6403e commit 0ae90a9

14 files changed

Lines changed: 1750 additions & 5 deletions

orchestrator/iteration_mode.py

Lines changed: 103 additions & 5 deletions
@@ -81,10 +81,15 @@ def iteration_mode_for(campaign: dict, iteration: int) -> Mode:
 If you find any campaign-spec or brief inconsistencies (paths the
 validator rejects, broken argv quoting, wall-time claims that don't
 match reality, single-tenant probes when the target requires multi-
-tenant, etc.), write them to ``runs/iter-N/brief_amendments.md`` —
-one entry per finding, with file path + suggested change. The next
-``real`` iteration will read this; future runs of the same campaign
-will benefit indefinitely.
+tenant, etc.), write them to
+``runs/iter-N/inputs/brief_amendments.jsonl`` as one structured JSON
+object per line. Required fields: ``id`` (pattern ``BA-N``),
+``brief_section``, ``problem``, ``fix``, ``priority`` (one of
+``BLOCKING``, ``HIGH``, ``MEDIUM``, ``LOW``, ``INFO``). Optional
+``evidence``, ``impact``. Schema:
+``orchestrator/schemas/brief_amendments.schema.json``. The promote
+gate, the REPORT extractor, and the future ``apply-amendments`` CLI
+all read this structured form.
 
 **Do NOT:**
 - Author full multi-arm bundles. Keep arms minimal.
@@ -105,7 +110,7 @@ def iteration_mode_for(campaign: dict, iteration: int) -> Mode:
 
 
 def mode_guidance_for(mode: Mode) -> str:
-    """Return the prompt block that guides the agent for ``mode``.
+    """Return the DESIGN-phase prompt block that guides the agent for ``mode``.
 
     Raises ``ValueError`` on an unknown mode value. Silently defaulting
     to REAL_GUIDANCE was the prior behavior; that's the more dangerous
@@ -119,3 +124,96 @@ def mode_guidance_for(mode: Mode) -> str:
     raise ValueError(
         f"unknown iteration mode {mode!r}; expected one of {VALID_MODES}"
     )
+
+
+# ─── Execute-phase mode guidance (#221) ──────────────────────────────────
+#
+# The DESIGN agent's mode_guidance shaped how it scope-shrunk probes /
+# bundle authoring. EXECUTE_ANALYZE needs its OWN mode-aware guidance
+# so it doesn't fan out the bundle at full scope when iter is rehearsal.
+# Without this, post-#212 paper-burst reruns observed the DESIGN agent
+# honoring rehearsal scope while EXECUTE_ANALYZE dutifully ran the full
+# 50-arm experiment anyway — defeating the cost asymmetry that was the
+# entire economic argument for #212.
+
+EXECUTE_REHEARSAL_GUIDANCE = """\
+This iteration is in **REHEARSAL** mode. The DESIGN agent's bundle
+declares the full experimental design (so iter-2 / future runs can
+run it untouched). YOUR JOB this iter:
+
+1. **Honor the rehearsal scope.** If the bundle's
+   ``experiment_spec.rehearsal_subset`` is populated, execute ONLY
+   that subset (typically: 1 seed × the contrast-pair arms).
+   Do NOT fan out the full ``experiment_spec`` — that's iter-2's job.
+   If ``rehearsal_subset`` is missing, default to: first canonical
+   seed + ``h-main`` and ``h-control-negative`` arms only.
+
+2. **Validate the analysis pipeline.** Schema-pass at least one
+   result through the analysis_summary.json computation. If the
+   analysis script fails or returns null where data is present,
+   fix the script (or surface the issue) before iter-2 runs.
+
+3. **Append per-policy timing observations.** During the
+   feasibility / contrast-pair runs, measure wall-clock per policy.
+   Record into ``experiment_spec.timing_observations``:
+   ``expected_wall_time_seconds_per_policy: { ea-wfq: 25, wfq: 23, ... }``
+   and a derived ``recommended_turn_silence_threshold_seconds``
+   (~3× the slowest observed policy + buffer). iter-2's watchdog
+   reads these to calibrate.
+
+4. **Emit ``brief_amendments.jsonl``** at
+   ``runs/iter-N/inputs/brief_amendments.jsonl`` if you find any
+   campaign-spec friction (workload params, timing claims, missing
+   flags, etc.). One JSON object per line; required fields: ``id``
+   (pattern ``BA-N``), ``brief_section``, ``problem``, ``fix``,
+   ``priority`` (BLOCKING / HIGH / MEDIUM / LOW / INFO). Optional
+   ``evidence``, ``impact``.
+
+5. **Append to ``bundle_amendments.jsonl``** when you override
+   any parameter from ``experiment_spec.verified_parameters``.
+
+6. **Write findings.json with ``mode: rehearsal``** in the outcome,
+   noting that scientific claims are deferred to iter-2. The
+   ``experiment_valid: true`` flag means "the apparatus works" —
+   not "the hypothesis is confirmed/refuted."
+
+**Do NOT:**
+- Fan out the full bundle's seeds × policies grid.
+- Mark h-main as CONFIRMED / REFUTED based on rehearsal data.
+- Skip writing ``brief_amendments.jsonl`` if you discovered
+  campaign-spec friction.
+"""
+
+EXECUTE_REAL_GUIDANCE = """\
+This iteration is in **REAL** mode. Run the full experiment_spec at
+the bundle's prescribed scope: all arms, full seed list.
+
+If a prior ``rehearsal`` iter emitted ``brief_amendments.jsonl``, read
+it BEFORE launching the experiment. Any ``priority: BLOCKING``
+amendments encode constraints iter-2 must respect (e.g., a workload
+parameter the rehearsal verified is required for the experiment to
+engage the mechanism). Apply each BLOCKING amendment to your run
+configuration and proceed; if you cannot apply one, write a
+``failure_note.md`` describing why and STOP — the campaign should
+revise the brief before continuing.
+
+Write ``findings.json`` with ``mode: real`` and a CONFIRMED / REFUTED
+/ NULL status per arm. Append ``bundle_amendments.jsonl`` for any
+parameter overrides observed during execution (silent drift breaks
+reproducibility).
+"""
+
+
+def execute_mode_guidance_for(mode: Mode) -> str:
+    """Return the EXECUTE_ANALYZE-phase prompt block for ``mode`` (#221).
+
+    Distinct from ``mode_guidance_for`` (which targets the DESIGN agent).
+    Raises ``ValueError`` on unknown modes for the same fail-loud reason.
+    """
+    if mode == "rehearsal":
+        return EXECUTE_REHEARSAL_GUIDANCE
+    if mode == "real":
+        return EXECUTE_REAL_GUIDANCE
+    raise ValueError(
+        f"unknown iteration mode {mode!r}; expected one of {VALID_MODES}"
+    )
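Both guidance blocks in this file name the same required fields for a ``brief_amendments.jsonl`` row. As a rough illustration of what a writer-side check could look like, the sketch below validates one row before serializing it. It is an assumption-laden example: ``amendment_errors`` and ``to_jsonl_line`` are hypothetical helpers, the ``BA-N`` pattern is read here as ``BA-`` followed by digits, and the real contract is the JSON Schema at ``orchestrator/schemas/brief_amendments.schema.json``, which is not shown in this diff.

```python
import json
import re

REQUIRED_FIELDS = ("id", "brief_section", "problem", "fix", "priority")
PRIORITIES = ("BLOCKING", "HIGH", "MEDIUM", "LOW", "INFO")
# Assumed reading of the ``BA-N`` pattern: "BA-" followed by one or more digits.
ID_PATTERN = re.compile(r"BA-\d+")


def amendment_errors(row: dict) -> list[str]:
    """Return human-readable problems with one amendment row (empty if none)."""
    errors = [f"missing required field: {name}"
              for name in REQUIRED_FIELDS if name not in row]
    if "id" in row and not ID_PATTERN.fullmatch(str(row["id"])):
        errors.append(f"id {row['id']!r} does not match BA-N")
    if "priority" in row and row["priority"] not in PRIORITIES:
        errors.append(f"priority {row['priority']!r} not in {PRIORITIES}")
    return errors


def to_jsonl_line(row: dict) -> str:
    """Serialize a valid row as one JSONL line; raise ValueError otherwise."""
    errors = amendment_errors(row)
    if errors:
        raise ValueError("; ".join(errors))
    return json.dumps(row, ensure_ascii=False)


# Illustrative row only; the field values are made up for the example.
example = {
    "id": "BA-1",
    "brief_section": "timing",
    "problem": "wall-time claim does not match observed runs",
    "fix": "update the expected wall time in the brief",
    "priority": "HIGH",
}
line = to_jsonl_line(example)
```

Raising on an invalid row follows the same fail-loud choice as ``execute_mode_guidance_for``.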

orchestrator/llm_dispatch.py

Lines changed: 109 additions & 0 deletions
@@ -133,6 +133,94 @@ def _format_results_summary(work_dir: Path) -> str:
     return "\n".join(lines)
 
 
+def _format_brief_amendments_summary(work_dir: Path) -> str:
+    """#223: surface structured ``brief_amendments.jsonl`` entries to
+    the REPORT extractor.
+
+    Each amendment is a JSON object with required fields
+    ``id, brief_section, problem, fix, priority``. Optional
+    ``evidence``, ``impact``. The schema lives at
+    ``orchestrator/schemas/brief_amendments.schema.json`` and is
+    enforced by the agent that *writes* the file (per methodology) —
+    this renderer JSON-decodes each row and surfaces a count of
+    lines that failed to parse so the operator sees corruption,
+    but does not itself re-validate against the schema.
+
+    Walks ``runs/iter-*/inputs/brief_amendments.jsonl`` and renders a
+    per-iter listing grouped by priority. The REPORT extractor can use
+    this to: (a) cite which amendments shaped the iteration's findings,
+    (b) flag which BLOCKING amendments still need applying to the
+    upstream brief (the cross-run learning loop).
+    """
+    runs_dir = work_dir / "runs"
+    if not runs_dir.is_dir():
+        return "(no iteration directories — no brief amendments to report.)"
+    iter_dirs = sorted(
+        (d for d in runs_dir.iterdir()
+         if d.is_dir() and d.name.startswith("iter-")),
+        key=lambda d: d.name,
+    )
+    sections: list[str] = []
+    total = 0
+    for iter_dir in iter_dirs:
+        log = iter_dir / "inputs" / "brief_amendments.jsonl"
+        if not log.exists():
+            continue
+        try:
+            text = log.read_text()
+        except OSError as exc:
+            sections.append(
+                f"- {iter_dir.name}: brief_amendments.jsonl unreadable "
+                f"({type(exc).__name__})"
+            )
+            continue
+        rows: list[dict] = []
+        skipped_malformed = 0
+        for line in text.splitlines():
+            if not line.strip():
+                continue
+            try:
+                rows.append(json.loads(line))
+            except json.JSONDecodeError:
+                skipped_malformed += 1
+        if not rows and skipped_malformed == 0:
+            continue
+        # Group by priority for at-a-glance triage. BLOCKING first, then
+        # HIGH / MEDIUM / LOW / INFO. Unknown priorities sort last.
+        priority_order = {
+            "BLOCKING": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3, "INFO": 4,
+        }
+        rows_sorted = sorted(
+            rows,
+            key=lambda r: priority_order.get(
+                str(r.get("priority", "")).upper(), 99
+            ),
+        )
+        header = f"- {iter_dir.name}: {len(rows)} amendment(s)"
+        if skipped_malformed:
+            header += f" + {skipped_malformed} malformed line(s) skipped"
+        sections.append(header)
+        total += len(rows)
+        cap = 20
+        for r in rows_sorted[:cap]:
+            aid = r.get("id", "?")
+            prio = r.get("priority", "?")
+            section = r.get("brief_section", "?")
+            problem = r.get("problem", "")
+            sections.append(
+                f" - [{prio}] {aid} (target: {section}) — "
+                + (problem[:160] + "..." if len(problem) > 160 else problem)
+            )
+        if len(rows_sorted) > cap:
+            sections.append(f" - ... and {len(rows_sorted) - cap} more")
+    if not sections:
+        return (
+            "(no brief_amendments.jsonl entries — the campaign brief was "
+            "consistent with the agent's runs; no amendments queued.)"
+        )
+    return "\n".join(sections)
+
+
 def _format_bundle_amendments_summary(work_dir: Path) -> str:
     """#211: surface bundle_amendments.jsonl entries to the REPORT extractor.
 
@@ -594,6 +682,19 @@ def _build_context(
                     "No design handoff available — explore the system directly."
                 )
 
+            # #221: per-iteration mode signal in EXECUTE_ANALYZE too. The
+            # post-#212 paper-burst rerun observed the DESIGN agent
+            # honoring rehearsal scope-shrink while EXECUTE_ANALYZE
+            # dutifully fanned out the full bundle anyway — because the
+            # mode signal only flowed to DESIGN. Rendering it in execute
+            # too closes that gap.
+            from orchestrator.iteration_mode import (
+                iteration_mode_for, execute_mode_guidance_for,
+            )
+            mode = iteration_mode_for(self.campaign, iteration)
+            ctx["iteration_mode"] = mode
+            ctx["mode_guidance"] = execute_mode_guidance_for(mode)
+
             if perspective is not None:
                 ctx["perspective_name"] = perspective
 
@@ -656,6 +757,14 @@ def _build_context(
             ctx["bundle_amendments_summary"] = (
                 _format_bundle_amendments_summary(self.work_dir)
             )
+            # #223: structured brief_amendments — propagate to REPORT
+            # so the extractor can cite which amendments shaped the
+            # iteration's findings AND surface BLOCKING amendments
+            # that haven't been applied to the upstream brief yet
+            # (cross-run learning loop).
+            ctx["brief_amendments_summary"] = (
+                _format_brief_amendments_summary(self.work_dir)
+            )
 
         return ctx
 
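The context keys set in ``_build_context`` (``iteration_mode``, ``mode_guidance``) are what the ``{{iteration_mode}}`` and ``{{mode_guidance}}`` placeholders in the execute templates consume. The project's template renderer is not part of this diff, so the following is only a guess at the general shape: ``render_template`` is a hypothetical helper that does plain ``{{name}}`` substitution and fails loudly on a placeholder with no context value.

```python
import re

# Matches ``{{name}}`` with optional inner whitespace (assumed syntax).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, ctx: dict[str, str]) -> str:
    """Replace ``{{name}}`` placeholders; raise on names missing from ctx."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in ctx:
            raise KeyError(f"template placeholder {key!r} has no context value")
        return str(ctx[key])
    return PLACEHOLDER.sub(substitute, template)


# Stand-in template text; the real execute_analyze.md is not shown here.
template = "Mode: {{iteration_mode}}\n\n{{mode_guidance}}\n"
ctx = {
    "iteration_mode": "rehearsal",
    "mode_guidance": "This iteration is in **REHEARSAL** mode.",
}
rendered = render_template(template, ctx)
```

Raising on a missing key, rather than leaving the placeholder in the prompt, keeps a dropped mode signal from silently reaching the agent, which is the failure AI-native-Systems-Research#221 describes.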
