Commit 0ae90a9
* fix(methodology): close all 5 subissues of tracker #225 (post-#218 follow-up)
Lands v1 scope of every methodology subissue revealed by the post-#218
paper-burst rerun (2026-05-27). Each subissue has a concrete fix that
materially improves the rehearsal/real iteration loop and prepares the
codebase for cross-run knowledge propagation.
Closes #221 — Render iteration_mode + execute_mode_guidance in
EXECUTE_ANALYZE prompt context. The DESIGN-phase agent (post-#212)
honors rehearsal scope-shrink for probes, but the bundle it authors
declares the full experimental design, and the EXECUTE_ANALYZE-phase
agent (which had no mode signal) dutifully fanned out the full
bundle anyway. New ``execute_mode_guidance_for(mode)`` returns
rehearsal/real text distinct from the design-phase helper. Plumbed
through ``_build_context`` for ``phase == "execute-analyze"``;
rendered into ``execute_analyze.md`` + ``execute_analyze_thin.md``
with ``{{iteration_mode}}`` + ``{{mode_guidance}}`` placeholders.
Test parametrized over ``with_claude_md`` (production thin path).
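A minimal sketch of the mode plumbing described above, showing one way a mode-specific helper could feed the ``{{iteration_mode}}`` and ``{{mode_guidance}}`` placeholders. The guidance strings, the ``render_placeholders`` helper, and the error handling for unknown modes are assumptions for illustration; only ``execute_mode_guidance_for``, the two guidance constant names, and the placeholder names come from this commit.

```python
# Hypothetical guidance text; the real strings live in the methodology module.
EXECUTE_REHEARSAL_GUIDANCE = "Rehearsal: run only the bundle's rehearsal_subset."
EXECUTE_REAL_GUIDANCE = "Real: run the full experimental design in the bundle."


def execute_mode_guidance_for(mode: str) -> str:
    """Return execute-phase guidance for an iteration mode (assumed signature)."""
    guidance = {
        "rehearsal": EXECUTE_REHEARSAL_GUIDANCE,
        "real": EXECUTE_REAL_GUIDANCE,
    }
    if mode not in guidance:
        # Assumption: an unknown mode fails loudly rather than falling back.
        raise ValueError(f"unknown iteration mode: {mode!r}")
    return guidance[mode]


def render_placeholders(template: str, ctx: dict[str, str]) -> str:
    """Replace {{key}} placeholders with ctx values (illustrative renderer)."""
    for key, value in ctx.items():
        template = template.replace("{{" + key + "}}", value)
    return template


ctx = {
    "iteration_mode": "rehearsal",
    "mode_guidance": execute_mode_guidance_for("rehearsal"),
}
prompt = render_placeholders("Mode: {{iteration_mode}}\n{{mode_guidance}}", ctx)
```

With both keys resolved in the context, the EXECUTE_ANALYZE agent sees the mode signal that it previously lacked.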
Closes #222 — bundle.experiment_spec gains a structured
``rehearsal_subset`` field (seeds, arms, extra_validation_only).
The schema closes the field set, so a typo'd field name fails validation. The
DESIGN methodology instructs agents to populate it when iter is
rehearsal; EXECUTE_ANALYZE honors it (per #221's mode_guidance).
Composes with #221: prose-only scope-shrink was unreliable; a
structured field is enforceable.
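To illustrate why a structured field is enforceable where prose is not, here is a hand-rolled, stdlib-only stand-in for the schema check. The three field names come from this commit; the value types, the ``validate_rehearsal_subset`` helper, and the error message are assumptions, and the real check is presumably done by the bundle's JSON schema rather than by code like this.

```python
# Field names come from the commit message; their value types are assumptions.
ALLOWED_REHEARSAL_SUBSET_FIELDS = frozenset(
    {"seeds", "arms", "extra_validation_only"}
)


def validate_rehearsal_subset(subset: dict) -> None:
    """Raise ValueError if the subset has a field outside the closed set."""
    unknown = sorted(set(subset) - ALLOWED_REHEARSAL_SUBSET_FIELDS)
    if unknown:
        raise ValueError(f"unknown rehearsal_subset field(s): {unknown}")


# A well-formed subset passes silently.
validate_rehearsal_subset({"seeds": [0, 1], "arms": ["baseline"]})

# A typo'd field name is rejected instead of being silently ignored.
try:
    validate_rehearsal_subset({"seedz": [0, 1]})
except ValueError as exc:
    error_message = str(exc)
```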
Closes #223 v1 — structured ``brief_amendments.jsonl`` schema +
REPORT-context renderer. New ``brief_amendments.schema.json`` with
required fields (``id, brief_section, problem, fix, priority``)
and an enumerated priority. New
``_format_brief_amendments_summary(work_dir)`` helper renders
amendments grouped by priority into the REPORT prompt. CLI
``nous brief apply-amendments`` deferred to v2.
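A rough sketch of grouping amendments by priority for the REPORT prompt. The required field names match the schema described above; the priority levels other than ``BLOCKING``, the output layout, and the ``format_amendments_summary`` name are hypothetical, and the function takes already-parsed rows rather than a ``work_dir`` to stay self-contained.

```python
from collections import defaultdict

# Assumption: only BLOCKING is named in this commit; the other levels are
# hypothetical placeholders for the schema's enumerated priorities.
PRIORITY_ORDER = ["BLOCKING", "HIGH", "LOW"]


def format_amendments_summary(rows: list[dict]) -> str:
    """Render amendment rows grouped by priority, highest priority first."""
    by_priority: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        by_priority[row["priority"]].append(row)

    lines: list[str] = []
    for priority in PRIORITY_ORDER:
        group = by_priority.get(priority, [])
        if not group:
            continue
        lines.append(f"{priority}:")
        for row in group:
            lines.append(
                f"- [{row['id']}] {row['brief_section']}: "
                f"{row['problem']} -> {row['fix']}"
            )
    return "\n".join(lines)


rows = [
    {"id": "a1", "brief_section": "metrics", "problem": "p95 undefined",
     "fix": "define p95 window", "priority": "LOW"},
    {"id": "a2", "brief_section": "arms", "problem": "baseline missing",
     "fix": "add baseline arm", "priority": "BLOCKING"},
]
summary = format_amendments_summary(rows)
```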
Closes #224 v1 — deterministic
``promote_gate.evaluate_promote_gate(work_dir, iteration) -> dict``
function. Pure Python; reads findings.json, brief_amendments.jsonl,
applied_amendments.jsonl. Decision rule: missing/invalid findings →
``abort``; unapplied BLOCKING amendment → ``revise``; else →
``promote``. Engine state-machine integration (the actual halting
behavior at iter boundaries) deferred to v2 — this PR lands the
decision logic so it's testable in isolation before any engine
state changes.
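The decision rule above can be sketched as a pure function over already-loaded data. This is a simplification under stated assumptions: the real ``evaluate_promote_gate`` reads the three files from ``work_dir``, whereas the hypothetical ``decide_promotion`` below takes parsed inputs, treats any non-dict findings as invalid, and matches applied amendments to amendments by ``id``.

```python
from typing import Optional


def decide_promotion(findings: Optional[dict],
                     amendments: list[dict],
                     applied_ids: set[str]) -> dict:
    """Apply the v1 rule: abort, then revise, then promote (first match wins)."""
    # Missing or invalid findings: nothing trustworthy to promote.
    if not isinstance(findings, dict):
        return {"decision": "abort", "reasoning": "findings missing or invalid"}

    # Assumption: applied amendments are matched to amendments by their id.
    unapplied = [
        a["id"] for a in amendments
        if a.get("priority") == "BLOCKING" and a["id"] not in applied_ids
    ]
    if unapplied:
        return {"decision": "revise",
                "reasoning": f"unapplied BLOCKING amendments: {unapplied}"}

    return {"decision": "promote",
            "reasoning": "no unapplied BLOCKING amendments"}


blocking = [{"id": "a1", "priority": "BLOCKING"}]
no_findings = decide_promotion(None, [], set())
needs_revision = decide_promotion({"status": "ok"}, blocking, set())
ready = decide_promotion({"status": "ok"}, blocking, {"a1"})
```

Keeping the rule free of engine state is what makes it testable in isolation, which is the stated reason for landing it ahead of the state-machine integration.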
Closes #226 — bundle.experiment_spec gains a structured
``timing_observations`` block (per-policy expected wall-time +
recommended_turn_silence_threshold_seconds). ``SDKDispatcher.dispatch``
reads the prior iter's bundle for the recommended threshold and
applies it as a per-call override; restores the campaign default
after. Resolution chain: bundle override > campaign default >
factory default (600s). Methodology prescribes that rehearsal-mode
agents record per-policy timing observations during feasibility
probes — the recurring ``externality-credit`` slowness across
three reruns becomes structural data instead of folklore.
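A minimal sketch of the resolution chain and the per-call override with restore. The ``Dispatcher`` class is an illustrative stand-in, not ``SDKDispatcher`` itself; its attribute names and the ``run`` callable are assumptions, and only the 600s factory default and the precedence order come from this commit.

```python
from typing import Callable, Optional

FACTORY_DEFAULT_SECONDS = 600.0  # factory default stated in the commit message


def resolve_threshold(bundle_override: Optional[float],
                      campaign_default: Optional[float]) -> float:
    """Resolve: bundle override > campaign default > factory default."""
    if bundle_override is not None:
        return bundle_override
    if campaign_default is not None:
        return campaign_default
    return FACTORY_DEFAULT_SECONDS


class Dispatcher:
    """Illustrative stand-in for SDKDispatcher; not the real class."""

    def __init__(self, campaign_default: Optional[float] = None) -> None:
        self.threshold = resolve_threshold(None, campaign_default)

    def dispatch(self, run: Callable[[float], str],
                 bundle_override: Optional[float] = None) -> str:
        saved = self.threshold
        if bundle_override is not None:
            self.threshold = bundle_override
        try:
            return run(self.threshold)
        finally:
            # Restore even if run() raises, so the override stays per-call.
            self.threshold = saved


dispatcher = Dispatcher(campaign_default=300.0)
result = dispatcher.dispatch(lambda t: f"ran with {t}", bundle_override=900.0)
```

After the call, ``dispatcher.threshold`` is back at the campaign default, so one slow policy's recommendation does not leak into later dispatches.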
Refs #225 (tracker — five children covered).
Tests: +44 new (1133 passed, 1 skipped, 0 regressions). Behavioral
throughout: assertions on resolved ctx values, on-disk artifacts,
schema validation. Per CLAUDE.md "no live LLM calls" — all tests
use existing seam-injected fakes.
Compaction-safe plan at ``docs/plans/methodology-improvements-pr.md``
captures the full implementation map; memory entries at
``project_methodology_pr_in_flight.md`` and
``project_paper_burst_workload_divergence.md`` carry session
context across compactions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(review): close PR #227 review findings — silent-failure, docstring, e2e coverage
Addresses 11 findings from /pr-review-toolkit:review-pr (5 agents:
code-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer,
type-design-analyzer). Three are critical regressions of patterns PR #218
had killed; rest are correctness, documentation, and test-coverage
improvements.
Critical fixes
--------------
* **Narrowed `except (OSError, Exception)` → `(OSError, yaml.YAMLError)`**
in ``SDKDispatcher._bundle_recommended_turn_silence_threshold``. The
prior broad-except swallowed ImportError, MemoryError, and any future
YAML library refactor — defeating PR #218's silent-failure guarantees.
Both branches now ``logger.warning`` so operators see why the override
didn't apply. ImportError on missing PyYAML now propagates as it
should (it's an environmental defect, not a runtime fallback case).
* **`_format_brief_amendments_summary` docstring corrected.** Previous
docstring claimed "schema-validates rows individually (skipping
malformed with a visible warning)." Code only does ``json.loads``;
no schema validation. Updated docstring to describe what the code
actually does — the schema is enforced by the agent that *writes*
the file (per methodology), not by this renderer.
* **DESIGN-phase REHEARSAL_GUIDANCE path mismatch.** Prior text told
agents to write ``runs/iter-N/brief_amendments.md`` (legacy markdown
path); the EXECUTE-phase guidance, the schema, the renderer, and the
promote gate ALL use ``runs/iter-N/inputs/brief_amendments.jsonl``
(post-#223 structured). DESIGN-following agents would have silently
dropped amendments on the floor. Both phases now point at the same
JSONL path with the same required-fields list.
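The narrowed-except pattern from the first critical fix might look roughly like the sketch below. To keep it runnable without PyYAML, JSON parsing and ``json.JSONDecodeError`` stand in for YAML parsing and ``yaml.YAMLError``; the function name, the flat key layout, and the log wording are assumptions.

```python
import json
import logging
import tempfile
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def read_recommended_threshold(bundle_path: Path) -> Optional[float]:
    """Return the bundle's recommended threshold, or None if unreadable."""
    try:
        bundle = json.loads(bundle_path.read_text(encoding="utf-8"))
    except OSError as exc:
        logger.warning("bundle %s unreadable, override not applied: %s",
                       bundle_path, exc)
        return None
    except json.JSONDecodeError as exc:  # stand-in for yaml.YAMLError
        logger.warning("bundle %s unparseable, override not applied: %s",
                       bundle_path, exc)
        return None
    # Anything else (ImportError, MemoryError, ...) propagates to the caller.
    # Assumption: the bundle is a mapping with the key at the top level.
    value = bundle.get("recommended_turn_silence_threshold_seconds")
    return float(value) if value is not None else None


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "bundle.json"
    good.write_text('{"recommended_turn_silence_threshold_seconds": 900}',
                    encoding="utf-8")
    bad = Path(tmp) / "broken.json"
    bad.write_text("{not json", encoding="utf-8")

    from_good = read_recommended_threshold(good)
    from_bad = read_recommended_threshold(bad)
    from_missing = read_recommended_threshold(Path(tmp) / "absent.json")
```

Only the two anticipated failure modes fall back to ``None``, and each leaves a warning in the log; every other exception surfaces as a real defect.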
High-priority correctness
-------------------------
* **Promote gate: malformed amendment lines downgrade to revise**
(asymmetric-risk choice). ``_read_jsonl_with_skips`` returns
``(rows, malformed_count)``. If brief_amendments.jsonl has any
unparseable lines, the gate cannot rule out a hidden BLOCKING
entry — silently treating that as "no BLOCKING amendments" risks
false promotion past corruption. Now: emits ``revise`` with
``malformed_amendment_lines: N`` in the result dict and reasoning
text that names the file path. The trade-off: an operator inspects
the file, rather than a possibly corrupt promotion wasting an
iteration's tokens.
* **Promote gate scope explicitly documented as iter-local.** Per the
brief_amendments schema, ``id`` is "stable within this iter's
amendments" (not globally unique). The gate reads only iter-N's
amendments, so iter-1 BLOCKING amendments that were never applied
do NOT re-flag at iter-2. Docstring now states this explicitly and
notes the v2 work (composite IDs, apply-amendments CLI) needed to
extend the gate's scope across iterations. Callers MUST run the gate
after every iter that emits BLOCKING amendments, not just the last one.
* **Restore-after-failure now has a behavioral test.** Post-#218 the
``SDKDispatcher.dispatch`` override-and-restore lives in
``try/finally``, but no test exercised the failure path. New
``test_dispatch_restores_threshold_when_runner_raises`` constructs
a runner that raises ``SDKTransientError``, asserts dispatch
raises, AND asserts the dispatcher's stored threshold equals the
campaign default afterward. A regression that moves the restore
out of ``finally`` is now caught.
* **End-to-end coupling test across #221+#222+#223+#224.** Each
subissue had per-feature tests, but no single test verified the
chain (rehearsal mode → execute honors rehearsal_subset → BLOCKING
amendment written → gate decides revise). New
``TestEndToEndIntegration.test_rehearsal_emits_blocking_amendment_then_gate_revises``
walks the full pipeline using only public functions and schemas,
exercising the most likely future regression class (mode resolver
bug, schema field rename, gate logic drift) in one place. Also
verifies the apply-amendments-then-promote happy path.
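The malformed-line downgrade from the first item in this list can be sketched as follows. The names ``read_jsonl_with_skips`` and ``gate_on_amendments`` are hypothetical, the functions take JSONL text rather than a file path to stay self-contained, applied-amendment tracking is omitted for brevity, and counting non-object lines as malformed is an assumption.

```python
import json


def read_jsonl_with_skips(text: str) -> tuple[list[dict], int]:
    """Parse JSONL text; return (rows, malformed_count), skipping blank lines."""
    rows: list[dict] = []
    malformed = 0
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            malformed += 1
            continue
        if isinstance(row, dict):
            rows.append(row)
        else:
            # Assumption: a non-object line (e.g. a bare number) is malformed.
            malformed += 1
    return rows, malformed


def gate_on_amendments(text: str, path: str) -> dict:
    """Downgrade to revise when any line could hide a BLOCKING entry."""
    rows, malformed = read_jsonl_with_skips(text)
    if malformed:
        return {"decision": "revise",
                "malformed_amendment_lines": malformed,
                "reasoning": f"{malformed} unparseable line(s) in {path}"}
    blocking = [r.get("id") for r in rows if r.get("priority") == "BLOCKING"]
    if blocking:
        return {"decision": "revise",
                "reasoning": f"BLOCKING amendments: {blocking}"}
    return {"decision": "promote", "reasoning": "no BLOCKING amendments"}


# The second line is truncated mid-string, so it cannot be ruled out as BLOCKING.
corrupt = '{"id": "a1", "priority": "LOW"}\n{"id": "a2", "priority": "BLOCK'
corrupt_result = gate_on_amendments(
    corrupt, "runs/iter-1/inputs/brief_amendments.jsonl")
clean_result = gate_on_amendments(
    '{"id": "a1", "priority": "LOW"}\n',
    "runs/iter-1/inputs/brief_amendments.jsonl")
```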
Documentation / methodology
---------------------------
* **Stripped issue-number references from agent-facing prose.** The
agent has no GitHub access; ``(#212)``, ``(#221)``, ``— #222`` in
``EXECUTE_REHEARSAL_GUIDANCE`` etc. were noise. Kept in Python
docstrings/comments where developers benefit. Same applies to the
``(post-#223 v2)`` parenthetical that was leaking into the
promote_gate's operator-facing reasoning text.
* **EXECUTE_REAL_GUIDANCE rewrites the halt-mechanism description.**
Previous text said "halt with a failure_note.md" — but the actual
halt mechanism (post-#224 v1) is ``decision=revise`` from the
promote gate, which the engine acts on (v2 wiring). Updated to
describe the agent's role correctly: read amendments, apply them
to run config, write findings.json with appropriate status. The
failure_note.md is now a fallback for the "I cannot apply this
amendment" case, not the primary halt mechanism.
* **`_decision` docstring removed** (redundant with function name).
Module-level docstring trimmed of v1/v2 task-tracking framing
(that belongs in the PR description; it rots in code).
Tests
-----
+8 new tests (1137 passed, 1 skipped, 0 regressions). All behavioral.
The end-to-end test specifically exercises every artifact + schema +
function in the #221-226 cluster as a single chain.
Refs PR #227 review findings.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent ef6403e commit 0ae90a9
14 files changed
Lines changed: 1750 additions & 5 deletions
File tree
- orchestrator
- schemas
- prompts/methodology
- tests