Skip to content

Commit bb99cfd

Browse files
sriumcpclaude
andcommitted
fix(review): close PR #227 review findings — silent-failure, docstring, e2e coverage
Addresses 11 findings from /pr-review-toolkit:review-pr (5 agents: code-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer, type-design-analyzer). Three are critical regressions of patterns PR #218 had killed; rest are correctness, documentation, and test-coverage improvements. Critical fixes -------------- * **Narrowed `except (OSError, Exception)` → `(OSError, yaml.YAMLError)`** in ``SDKDispatcher._bundle_recommended_turn_silence_threshold``. The prior broad-except swallowed ImportError, MemoryError, and any future YAML library refactor — defeating PR #218's silent-failure guarantees. Both branches now ``logger.warning`` so operators see why the override didn't apply. ImportError on missing PyYAML now propagates as it should (it's an environmental defect, not a runtime fallback case). * **`_format_brief_amendments_summary` docstring corrected.** Previous docstring claimed "schema-validates rows individually (skipping malformed with a visible warning)." Code only does ``json.loads``; no schema validation. Updated docstring to describe what the code actually does — the schema is enforced by the agent that *writes* the file (per methodology), not by this renderer. * **DESIGN-phase REHEARSAL_GUIDANCE path mismatch.** Prior text told agents to write ``runs/iter-N/brief_amendments.md`` (legacy markdown path); the EXECUTE-phase guidance, the schema, the renderer, and the promote gate ALL use ``runs/iter-N/inputs/brief_amendments.jsonl`` (post-#223 structured). DESIGN-following agents would have silently dropped amendments on the floor. Both phases now point at the same JSONL path with the same required-fields list. High-priority correctness ------------------------- * **Promote gate: malformed amendment lines downgrade to revise** (asymmetric-risk choice). ``_read_jsonl_with_skips`` returns ``(rows, malformed_count)``. If brief_amendments.jsonl has any unparseable lines, the gate cannot rule out a hidden BLOCKING entry — silently treating that as "no BLOCKING amendments" risks false promotion past corruption. Now: emits ``revise`` with ``malformed_amendment_lines: N`` in the result dict and reasoning text that names the file path. Operator inspects vs. wastes an iteration's tokens — symmetric cost reversal. * **Promote gate scope explicitly documented as iter-local.** Per the brief_amendments schema, ``id`` is "stable within this iter's amendments" (not globally unique). The gate reads only iter-N's amendments, so iter-1 BLOCKING amendments that were never applied do NOT re-flag at iter-2. Docstring now states this explicitly + notes the v2 work (composite IDs, apply-amendments CLI) needed to cross-iter-scope. Callers MUST run the gate after every iter that emits BLOCKING amendments, not just the last one. * **Restore-after-failure now has a behavioral test.** Post-#218 the ``SDKDispatcher.dispatch`` override-and-restore lives in ``try/finally``, but no test exercised the failure path. New ``test_dispatch_restores_threshold_when_runner_raises`` constructs a runner that raises ``SDKTransientError``, asserts dispatch raises, AND asserts the dispatcher's stored threshold equals the campaign default afterward. A regression that moves the restore out of ``finally`` is now caught. * **End-to-end coupling test across #221+#222+#223+#224.** Each subissue had per-feature tests, but no single test verified the chain (rehearsal mode → execute honors rehearsal_subset → BLOCKING amendment written → gate decides revise). New ``TestEndToEndIntegration.test_rehearsal_emits_blocking_amendment_then_gate_revises`` walks the full pipeline using only public functions and schemas, exercising the most likely future regression class (mode resolver bug, schema field rename, gate logic drift) in one place. Also verifies the apply-amendments-then-promote happy path. Documentation / methodology --------------------------- * **Stripped issue-number references from agent-facing prose.** The agent has no GitHub access; ``(#212)``, ``(#221)``, ``— #222`` in ``EXECUTE_REHEARSAL_GUIDANCE`` etc. were noise. Kept in Python docstrings/comments where developers benefit. Same applies to the ``(post-#223 v2)`` parenthetical that was leaking into the promote_gate's operator-facing reasoning text. * **EXECUTE_REAL_GUIDANCE rewrites the halt-mechanism description.** Previous text said "halt with a failure_note.md" — but the actual halt mechanism (post-#224 v1) is ``decision=revise`` from the promote gate, which the engine acts on (v2 wiring). Updated to describe the agent's role correctly: read amendments, apply them to run config, write findings.json with appropriate status. The failure_note.md is now a fallback for the "I cannot apply this amendment" case, not the primary halt mechanism. * **`_decision` docstring removed** (redundant with function name). Module-level docstring trimmed of v1/v2 task-tracking framing (that lives in the PR description; rots in code). Tests ----- +8 new tests (1137 passed, 1 skipped, 0 regressions). All behavioral. The end-to-end test specifically exercises every artifact + schema + function in the #221-226 cluster as a single chain. Refs PR #227 review findings. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 749cd76 commit bb99cfd

6 files changed

Lines changed: 344 additions & 57 deletions

File tree

orchestrator/iteration_mode.py

Lines changed: 33 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -81,10 +81,15 @@ def iteration_mode_for(campaign: dict, iteration: int) -> Mode:
8181
If you find any campaign-spec or brief inconsistencies (paths the
8282
validator rejects, broken argv quoting, wall-time claims that don't
8383
match reality, single-tenant probes when the target requires multi-
84-
tenant, etc.), write them to ``runs/iter-N/brief_amendments.md`` —
85-
one entry per finding, with file path + suggested change. The next
86-
``real`` iteration will read this; future runs of the same campaign
87-
will benefit indefinitely.
84+
tenant, etc.), write them to
85+
``runs/iter-N/inputs/brief_amendments.jsonl`` as one structured JSON
86+
object per line. Required fields: ``id`` (pattern ``BA-N``),
87+
``brief_section``, ``problem``, ``fix``, ``priority`` (one of
88+
``BLOCKING``, ``HIGH``, ``MEDIUM``, ``LOW``, ``INFO``). Optional
89+
``evidence``, ``impact``. Schema:
90+
``orchestrator/schemas/brief_amendments.schema.json``. The promote
91+
gate, the REPORT extractor, and the future ``apply-amendments`` CLI
92+
all read this structured form.
8893
8994
**Do NOT:**
9095
- Author full multi-arm bundles. Keep arms minimal.
@@ -132,13 +137,13 @@ def mode_guidance_for(mode: Mode) -> str:
132137
# entire economic argument for #212.
133138

134139
EXECUTE_REHEARSAL_GUIDANCE = """\
135-
This iteration is in **REHEARSAL** mode (#212). The DESIGN agent's
136-
bundle declares the full experimental design (so iter-2 / future runs
137-
can run it untouched). YOUR JOB this iter:
140+
This iteration is in **REHEARSAL** mode. The DESIGN agent's bundle
141+
declares the full experimental design (so iter-2 / future runs can
142+
run it untouched). YOUR JOB this iter:
138143
139144
1. **Honor the rehearsal scope.** If the bundle's
140-
``experiment_spec.rehearsal_subset`` is populated (#222), execute
141-
ONLY that subset (typically: 1 seed × the contrast-pair arms).
145+
``experiment_spec.rehearsal_subset`` is populated, execute ONLY
146+
that subset (typically: 1 seed × the contrast-pair arms).
142147
Do NOT fan out the full ``experiment_spec`` — that's iter-2's job.
143148
If ``rehearsal_subset`` is missing, default to: first canonical
144149
seed + ``h-main`` and ``h-control-negative`` arms only.
@@ -148,23 +153,23 @@ def mode_guidance_for(mode: Mode) -> str:
148153
analysis script fails or returns null where data is present,
149154
fix the script (or surface the issue) before iter-2 runs.
150155
151-
3. **Append per-policy timing observations** (#226). During the
156+
3. **Append per-policy timing observations.** During the
152157
feasibility / contrast-pair runs, measure wall-clock per policy.
153158
Record into ``experiment_spec.timing_observations``:
154159
``expected_wall_time_seconds_per_policy: { ea-wfq: 25, wfq: 23, ... }``
155160
and a derived ``recommended_turn_silence_threshold_seconds``
156161
(~3× the slowest observed policy + buffer). iter-2's watchdog
157162
reads these to calibrate.
158163
159-
4. **Emit ``brief_amendments.jsonl``** (post-#223 structured form) at
164+
4. **Emit ``brief_amendments.jsonl``** at
160165
``runs/iter-N/inputs/brief_amendments.jsonl`` if you find any
161166
campaign-spec friction (workload params, timing claims, missing
162167
flags, etc.). One JSON object per line; required fields: ``id``
163168
(pattern ``BA-N``), ``brief_section``, ``problem``, ``fix``,
164169
``priority`` (BLOCKING / HIGH / MEDIUM / LOW / INFO). Optional
165170
``evidence``, ``impact``.
166171
167-
5. **Append to ``bundle_amendments.jsonl``** (#211) when you override
172+
5. **Append to ``bundle_amendments.jsonl``** when you override
168173
any parameter from ``experiment_spec.verified_parameters``.
169174
170175
6. **Write findings.json with ``mode: rehearsal``** in the outcome,
@@ -180,15 +185,22 @@ def mode_guidance_for(mode: Mode) -> str:
180185
"""
181186

182187
EXECUTE_REAL_GUIDANCE = """\
183-
This iteration is in **REAL** mode (#212). Run the full experiment_spec
184-
at the bundle's prescribed scope: all arms, full seed list. If a prior
185-
``rehearsal`` iter emitted ``brief_amendments.jsonl``, read it before
186-
launching the experiment — if any ``priority: BLOCKING`` amendments
187-
exist that haven't been applied to the brief, halt with a
188-
``failure_note.md`` (promotion-gate logic, #224). Otherwise run the
189-
full bundle, write ``findings.json`` with ``mode: real`` and a CONFIRMED /
190-
REFUTED / NULL status per arm, and append ``bundle_amendments.jsonl``
191-
(#211) for any parameter overrides observed during execution.
188+
This iteration is in **REAL** mode. Run the full experiment_spec at
189+
the bundle's prescribed scope: all arms, full seed list.
190+
191+
If a prior ``rehearsal`` iter emitted ``brief_amendments.jsonl``, read
192+
it BEFORE launching the experiment. Any ``priority: BLOCKING``
193+
amendments encode constraints iter-2 must respect (e.g., a workload
194+
parameter the rehearsal verified is required for the experiment to
195+
engage the mechanism). Apply each BLOCKING amendment to your run
196+
configuration and proceed; if you cannot apply one, write a
197+
``failure_note.md`` describing why and STOP — the campaign should
198+
revise the brief before continuing.
199+
200+
Write ``findings.json`` with ``mode: real`` and a CONFIRMED / REFUTED
201+
/ NULL status per arm. Append ``bundle_amendments.jsonl`` for any
202+
parameter overrides observed during execution (silent drift breaks
203+
reproducibility).
192204
"""
193205

194206

orchestrator/llm_dispatch.py

Lines changed: 11 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -140,14 +140,17 @@ def _format_brief_amendments_summary(work_dir: Path) -> str:
140140
Each amendment is a JSON object with required fields
141141
``id, brief_section, problem, fix, priority``. Optional
142142
``evidence``, ``impact``. The schema lives at
143-
``orchestrator/schemas/brief_amendments.schema.json``.
144-
145-
Walks ``runs/iter-*/inputs/brief_amendments.jsonl``, schema-validates
146-
rows individually (skipping malformed with a visible warning), and
147-
renders a per-iter listing grouped by priority. The REPORT extractor
148-
can use this to: (a) cite which amendments shaped the iteration's
149-
findings, (b) flag which BLOCKING amendments still need applying
150-
to the upstream brief (the cross-run learning loop).
143+
``orchestrator/schemas/brief_amendments.schema.json`` and is
144+
enforced by the agent that *writes* the file (per methodology) —
145+
this renderer JSON-decodes each row and surfaces a count of
146+
lines that failed to parse so the operator sees corruption,
147+
but does not itself re-validate against the schema.
148+
149+
Walks ``runs/iter-*/inputs/brief_amendments.jsonl`` and renders a
150+
per-iter listing grouped by priority. The REPORT extractor can use
151+
this to: (a) cite which amendments shaped the iteration's findings,
152+
(b) flag which BLOCKING amendments still need applying to the
153+
upstream brief (the cross-run learning loop).
151154
"""
152155
runs_dir = work_dir / "runs"
153156
if not runs_dir.is_dir():

orchestrator/promote_gate.py

Lines changed: 73 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -43,23 +43,40 @@
4343
VALID_DECISIONS: tuple[Decision, ...] = ("promote", "revise", "abort")
4444

4545

46-
def _read_jsonl(path: Path) -> list[dict]:
47-
"""Read a JSONL file; skip malformed lines silently."""
46+
def _read_jsonl_with_skips(path: Path) -> tuple[list[dict], int]:
47+
"""Read a JSONL file. Returns ``(valid_rows, malformed_count)``.
48+
49+
Malformed-line counts are surfaced (not silently dropped) because
50+
the gate makes BLOCKING decisions on these files: a corrupt line
51+
that was meant to be a BLOCKING amendment must not silently
52+
register as "no BLOCKING amendments" and let the campaign promote
53+
to its next iteration.
54+
"""
4855
if not path.exists():
49-
return []
50-
rows: list[dict] = []
56+
return [], 0
5157
try:
5258
text = path.read_text()
5359
except OSError:
54-
return []
60+
return [], 0
61+
rows: list[dict] = []
62+
malformed = 0
5563
for line in text.splitlines():
5664
line = line.strip()
5765
if not line:
5866
continue
5967
try:
6068
rows.append(json.loads(line))
6169
except json.JSONDecodeError:
62-
continue
70+
malformed += 1
71+
return rows, malformed
72+
73+
74+
def _read_jsonl(path: Path) -> list[dict]:
75+
"""Backward-compat thin wrapper. Drops the malformed count for
76+
callers that don't need it (today: ``applied_amendments.jsonl``,
77+
where the cost of a corrupted apply log is symmetric — false
78+
negatives + false positives both possible)."""
79+
rows, _ = _read_jsonl_with_skips(path)
6380
return rows
6481

6582

@@ -92,12 +109,26 @@ def evaluate_promote_gate(
92109
``<work_dir>/applied_amendments.jsonl`` → ``revise``
93110
("blocking amendment must be applied to the brief before
94111
iter-(N+1) can produce valid data").
112+
3b. Any malformed line(s) in ``brief_amendments.jsonl`` →
113+
``revise`` (cannot rule out a hidden BLOCKING entry;
114+
asymmetric-risk choice).
95115
4. Otherwise → ``promote``.
96116
117+
**Scoping (v1):** the gate reads only iter-N's
118+
``brief_amendments.jsonl``, NOT amendments from earlier iters. Per
119+
the schema, ``id`` (e.g. ``BA-1``) is "stable within this iter's
120+
amendments" — *not globally unique*. So an iter-1 BLOCKING
121+
amendment that the operator never applied will NOT be re-flagged
122+
when this function is called for iter-2 (iter-2's gate looks at
123+
iter-2's amendments only). The cross-iter "still-pending" view is
124+
deferred to v2 (apply-amendments CLI + composite IDs); for v1,
125+
callers MUST run the gate after EVERY iter that emits any
126+
BLOCKING amendment, not just the last one.
127+
97128
Engine integration (v2): the engine calls this between iterations
98129
and acts on ``decision``: ``promote`` → start iter-(N+1),
99130
``revise`` → halt with ``CampaignStopped("revise")`` so the operator
100-
can run ``nous brief apply-amendments``, ``abort`` → halt with
131+
can apply amendments, ``abort`` → halt with
101132
``CampaignStopped("abort")``.
102133
"""
103134
work_dir = Path(work_dir)
@@ -132,8 +163,14 @@ def evaluate_promote_gate(
132163
feasibility=False,
133164
)
134165

135-
# 3. Brief-amendments gate.
136-
amendments = _read_jsonl(iter_dir / "inputs" / "brief_amendments.jsonl")
166+
# 3. Brief-amendments gate. Counts malformed lines separately —
167+
# a corrupted line could have been a BLOCKING amendment, and
168+
# silently letting it disappear past the gate is exactly the
169+
# asymmetric-risk failure (false promote >> false revise) we
170+
# want to avoid.
171+
amendments, malformed = _read_jsonl_with_skips(
172+
iter_dir / "inputs" / "brief_amendments.jsonl",
173+
)
137174
applied_rows = _read_jsonl(work_dir / "applied_amendments.jsonl")
138175
applied_ids = {
139176
str(r.get("id"))
@@ -158,12 +195,34 @@ def evaluate_promote_gate(
158195
f"{len(blocking_unapplied)} BLOCKING brief_amendment(s) "
159196
f"({ids}) have not been applied to the upstream brief. "
160197
f"Iter-{iteration + 1} would re-discover or re-trip "
161-
f"these issues. Run `nous brief apply-amendments` (post-#223 "
162-
f"v2) or apply manually, then resume."
198+
f"these issues. Apply the amendments manually (or wait "
199+
f"for `nous brief apply-amendments`), then resume."
163200
),
164201
blocking=[str(a.get("id", "?")) for a in blocking_unapplied],
165202
applied=sorted(applied_ids),
166203
feasibility=True,
204+
malformed_lines=malformed,
205+
)
206+
207+
# 3b. Malformed-line safety: if the amendments file had any
208+
# unparseable lines, we cannot rule out a hidden BLOCKING entry.
209+
# Asymmetric risk: false revise costs an operator a few minutes
210+
# of inspection; false promote past corruption can waste an
211+
# iteration's tokens. Choose revise.
212+
if malformed > 0:
213+
return _decision(
214+
"revise",
215+
reasoning=(
216+
f"{malformed} malformed line(s) in "
217+
f"runs/iter-{iteration}/inputs/brief_amendments.jsonl "
218+
f"could not be parsed. We cannot rule out a hidden "
219+
f"BLOCKING amendment among them. Inspect the file and "
220+
f"fix the malformed lines before iter-{iteration + 1}."
221+
),
222+
blocking=[],
223+
applied=sorted(applied_ids),
224+
feasibility=True,
225+
malformed_lines=malformed,
167226
)
168227

169228
# 4. Default → promote.
@@ -177,6 +236,7 @@ def evaluate_promote_gate(
177236
blocking=[],
178237
applied=sorted(applied_ids),
179238
feasibility=True,
239+
malformed_lines=0,
180240
)
181241

182242

@@ -187,8 +247,8 @@ def _decision(
187247
blocking: list[str],
188248
applied: list[str],
189249
feasibility: bool,
250+
malformed_lines: int = 0,
190251
) -> dict:
191-
"""Build the result dict in the canonical shape."""
192252
return {
193253
"decision": decision,
194254
"reasoning": reasoning,
@@ -197,4 +257,5 @@ def _decision(
197257
"feasibility_check": {
198258
"passed": feasibility,
199259
},
260+
"malformed_amendment_lines": malformed_lines,
200261
}

orchestrator/sdk_dispatch.py

Lines changed: 29 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -620,24 +620,43 @@ def _maybe_log_silence(self, iteration: int, phase: str) -> None:
620620

621621
def _bundle_recommended_turn_silence_threshold(
622622
self, iteration: int) -> float | None:
623-
"""#226: read the rehearsal-recorded watchdog threshold override
624-
from ``bundle.experiment_spec.timing_observations.recommended_turn_silence_threshold_seconds``
625-
for the iter immediately PRIOR to the current one (the rehearsal
626-
iter, when present).
627-
628-
Returns None if the bundle / field doesn't exist or is malformed.
629-
Pure file reader — no LLM, no I/O beyond a single read.
623+
"""Read the rehearsal-recorded watchdog threshold override from the
624+
prior iter's bundle.experiment_spec.timing_observations.
625+
626+
Returns ``None`` for: iter-1 (no prior bundle), missing bundle file,
627+
unparseable YAML, or absent/malformed
628+
``recommended_turn_silence_threshold_seconds``. On parse failures
629+
(corrupt YAML, missing PyYAML), logs a warning so operators see
630+
why the override didn't apply — silently falling back to the
631+
campaign default would be the silent-failure pattern PR #218
632+
was specifically meant to kill.
630633
"""
631634
if iteration < 2:
632635
return None
633636
prior_iter_dir = self.work_dir / "runs" / f"iter-{iteration - 1}"
634637
bundle_path = prior_iter_dir / "bundle.yaml"
635638
if not bundle_path.exists():
636639
return None
640+
# Import yaml outside the try/except: an ImportError here is
641+
# an environmental defect that should propagate, not a
642+
# silent fallback to the campaign default.
643+
import yaml as _yaml
637644
try:
638-
import yaml as _yaml
639-
data = _yaml.safe_load(bundle_path.read_text())
640-
except (OSError, Exception): # noqa: BLE001
645+
text = bundle_path.read_text()
646+
data = _yaml.safe_load(text)
647+
except OSError as exc:
648+
logger.warning(
649+
"iter-%d bundle unreadable; skipping timing-override "
650+
"(%s: %s)",
651+
iteration - 1, type(exc).__name__, exc,
652+
)
653+
return None
654+
except _yaml.YAMLError as exc:
655+
logger.warning(
656+
"iter-%d bundle YAML invalid; skipping timing-override "
657+
"(falling back to campaign default %.0fs): %s",
658+
iteration - 1, self._turn_silence_threshold, exc,
659+
)
641660
return None
642661
if not isinstance(data, dict):
643662
return None

0 commit comments

Comments
 (0)