fix(review): close PR #227 review findings — silent-failure, docstring, e2e coverage

sriumcp · claude · sriumcp · commit bb99cfdedbae · 2026-05-27T11:10:15.000-04:00
Addresses 11 findings from /pr-review-toolkit:review-pr (5 agents:
code-reviewer, pr-test-analyzer, silent-failure-hunter,
comment-analyzer, type-design-analyzer). Three are critical regressions
of patterns PR #218 had killed; the rest are correctness,
documentation, and test-coverage improvements.

Critical fixes
--------------

* **Narrowed `except (OSError, Exception)` → `(OSError, yaml.YAMLError)`**
  in ``SDKDispatcher._bundle_recommended_turn_silence_threshold``. The
  prior broad except swallowed ImportError, MemoryError, and any
  exception a future YAML library refactor might introduce, defeating
  PR #218's silent-failure guarantees. Both branches now call
  ``logger.warning`` so operators see why the override didn't apply.
  ImportError on missing PyYAML now propagates, as it should: it is an
  environmental defect, not a runtime fallback case.

* **`_format_brief_amendments_summary` docstring corrected.** The
  previous docstring claimed "schema-validates rows individually
  (skipping malformed with a visible warning)." The code only does
  ``json.loads``; there is no schema validation. The docstring now
  describes what the code actually does: the schema is enforced by the
  agent that *writes* the file (per methodology), not by this renderer.

* **DESIGN-phase REHEARSAL_GUIDANCE path mismatch.** Prior text told
  agents to write ``runs/iter-N/brief_amendments.md`` (legacy markdown
  path); the EXECUTE-phase guidance, the schema, the renderer, and the
  promote gate ALL use ``runs/iter-N/inputs/brief_amendments.jsonl``
  (post-#223 structured). DESIGN-following agents would have silently
  dropped amendments on the floor. Both phases now point at the same
  JSONL path with the same required-fields list.

High-priority correctness
-------------------------

* **Promote gate: malformed amendment lines downgrade to revise**
  (asymmetric-risk choice). ``_read_jsonl_with_skips`` returns
  ``(rows, malformed_count)``. If brief_amendments.jsonl has any
  unparseable lines, the gate cannot rule out a hidden BLOCKING entry;
  silently treating that as "no BLOCKING amendments" risks false
  promotion past corruption. The gate now emits ``revise`` with
  ``malformed_amendment_lines: N`` in the result dict and reasoning
  text that names the file path. The costs are asymmetric: a false
  revise costs the operator a few minutes of inspection, while a false
  promote can waste an iteration's tokens.

* **Promote gate scope explicitly documented as iter-local.** Per the
  brief_amendments schema, ``id`` is "stable within this iter's
  amendments" (not globally unique). The gate reads only iter-N's
  amendments, so iter-1 BLOCKING amendments that were never applied do
  NOT re-flag at iter-2. The docstring now states this explicitly and
  notes the v2 work (composite IDs, apply-amendments CLI) needed for
  cross-iter scoping. Callers MUST run the gate after every iter that
  emits BLOCKING amendments, not just the last one.

* **Restore-after-failure now has a behavioral test.** Post-#218 the
  ``SDKDispatcher.dispatch`` override-and-restore lives in
  ``try/finally``, but no test exercised the failure path. New
  ``test_dispatch_restores_threshold_when_runner_raises`` constructs a
  runner that raises ``SDKTransientError``, asserts dispatch raises,
  AND asserts the dispatcher's stored threshold equals the campaign
  default afterward. A regression that moves the restore out of
  ``finally`` is now caught.

* **End-to-end coupling test across #221+#222+#223+#224.** Each
  subissue had per-feature tests, but no single test verified the chain
  (rehearsal mode → execute honors rehearsal_subset → BLOCKING
  amendment written → gate decides revise). New
  ``TestEndToEndIntegration.test_rehearsal_emits_blocking_amendment_then_gate_revises``
  walks the full pipeline using only public functions and schemas,
  exercising the most likely future regression classes (mode resolver
  bug, schema field rename, gate logic drift) in one place. It also
  verifies the apply-amendments-then-promote happy path.
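
For readers without the test diff at hand, the restore-after-failure
check described above can be sketched roughly as follows. This is a
minimal illustration, not the committed test: ``ToyDispatcher``,
``TransientError``, ``failing_runner`` and ``restores_after_failure``
are hypothetical stand-ins for ``SDKDispatcher``, ``SDKTransientError``
and the real test body, and the threshold values are made up.

```python
from typing import Callable, Optional


class TransientError(Exception):
    """Hypothetical stand-in for SDKTransientError."""


class ToyDispatcher:
    """Hypothetical miniature of an override-and-restore dispatcher."""

    def __init__(self, campaign_default: float) -> None:
        self.threshold = campaign_default

    def dispatch(self, runner: Callable[[], str],
                 override: Optional[float] = None) -> str:
        saved = self.threshold
        if override is not None:
            self.threshold = override
        try:
            return runner()
        finally:
            # Runs on success and on failure, so the override never leaks.
            self.threshold = saved


def failing_runner() -> str:
    raise TransientError("simulated transient failure")


def restores_after_failure() -> bool:
    """Return True if dispatch raises AND the default is back in place."""
    dispatcher = ToyDispatcher(campaign_default=120.0)
    try:
        dispatcher.dispatch(failing_runner, override=75.0)
    except TransientError:
        return dispatcher.threshold == 120.0
    return False  # dispatch swallowed the error: also a regression
```

If the restore were moved out of ``finally`` in this sketch, the
threshold would still read 75.0 after the exception and the check would
return False, which is the regression class the new test is meant to
catch.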

Documentation / methodology
---------------------------

* **Stripped issue-number references from agent-facing prose.** The
  agent has no GitHub access; ``(#212)``, ``(#221)``, ``— #222`` in
  ``EXECUTE_REHEARSAL_GUIDANCE`` etc. were noise. They are kept in
  Python docstrings/comments where developers benefit. The same applies
  to the ``(post-#223 v2)`` parenthetical that was leaking into the
  promote_gate's operator-facing reasoning text.

* **EXECUTE_REAL_GUIDANCE rewrites the halt-mechanism description.**
  Previous text said "halt with a failure_note.md", but the actual halt
  mechanism (post-#224 v1) is ``decision=revise`` from the promote
  gate, which the engine acts on (v2 wiring). Updated to describe the
  agent's role correctly: read amendments, apply them to run config,
  write findings.json with appropriate status. The failure_note.md is
  now a fallback for the "I cannot apply this amendment" case, not the
  primary halt mechanism.

* **`_decision` docstring removed** (redundant with the function name).
  Module-level docstring trimmed of v1/v2 task-tracking framing (that
  framing lives in the PR description; it rots in code).

Tests
-----

+8 new tests (1137 passed, 1 skipped, 0 regressions). All behavioral.
The end-to-end test specifically exercises every artifact + schema +
function in the #221-226 cluster as a single chain.

Refs PR #227 review findings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/orchestrator/iteration_mode.py b/orchestrator/iteration_mode.py
@@ -81,10 +81,15 @@ def iteration_mode_for(campaign: dict, iteration: int) -> Mode:
 If you find any campaign-spec or brief inconsistencies (paths the
 validator rejects, broken argv quoting, wall-time claims that don't
 match reality, single-tenant probes when the target requires multi-
-tenant, etc.), write them to ``runs/iter-N/brief_amendments.md`` —
-one entry per finding, with file path + suggested change. The next
-``real`` iteration will read this; future runs of the same campaign
-will benefit indefinitely.
+tenant, etc.), write them to
+``runs/iter-N/inputs/brief_amendments.jsonl`` as one structured JSON
+object per line. Required fields: ``id`` (pattern ``BA-N``),
+``brief_section``, ``problem``, ``fix``, ``priority`` (one of
+``BLOCKING``, ``HIGH``, ``MEDIUM``, ``LOW``, ``INFO``). Optional
+``evidence``, ``impact``. Schema:
+``orchestrator/schemas/brief_amendments.schema.json``. The promote
+gate, the REPORT extractor, and the future ``apply-amendments`` CLI
+all read this structured form.
 
 **Do NOT:**
 - Author full multi-arm bundles. Keep arms minimal.
@@ -132,13 +137,13 @@ def mode_guidance_for(mode: Mode) -> str:
 # entire economic argument for #212.
 
 EXECUTE_REHEARSAL_GUIDANCE = """\
-This iteration is in **REHEARSAL** mode (#212). The DESIGN agent's
-bundle declares the full experimental design (so iter-2 / future runs
-can run it untouched). YOUR JOB this iter:
+This iteration is in **REHEARSAL** mode. The DESIGN agent's bundle
+declares the full experimental design (so iter-2 / future runs can
+run it untouched). YOUR JOB this iter:
 
 1. **Honor the rehearsal scope.** If the bundle's
-   ``experiment_spec.rehearsal_subset`` is populated (#222), execute
-   ONLY that subset (typically: 1 seed × the contrast-pair arms).
+   ``experiment_spec.rehearsal_subset`` is populated, execute ONLY
+   that subset (typically: 1 seed × the contrast-pair arms).
    Do NOT fan out the full ``experiment_spec`` — that's iter-2's job.
    If ``rehearsal_subset`` is missing, default to: first canonical
    seed + ``h-main`` and ``h-control-negative`` arms only.
@@ -148,23 +153,23 @@ def mode_guidance_for(mode: Mode) -> str:
    analysis script fails or returns null where data is present,
    fix the script (or surface the issue) before iter-2 runs.
 
-3. **Append per-policy timing observations** (#226). During the
+3. **Append per-policy timing observations.** During the
    feasibility / contrast-pair runs, measure wall-clock per policy.
    Record into ``experiment_spec.timing_observations``:
    ``expected_wall_time_seconds_per_policy: { ea-wfq: 25, wfq: 23, ... }``
    and a derived ``recommended_turn_silence_threshold_seconds``
    (~3× the slowest observed policy + buffer). iter-2's watchdog
    reads these to calibrate.
 
-4. **Emit ``brief_amendments.jsonl``** (post-#223 structured form) at
+4. **Emit ``brief_amendments.jsonl``** at
    ``runs/iter-N/inputs/brief_amendments.jsonl`` if you find any
    campaign-spec friction (workload params, timing claims, missing
    flags, etc.). One JSON object per line; required fields: ``id``
    (pattern ``BA-N``), ``brief_section``, ``problem``, ``fix``,
    ``priority`` (BLOCKING / HIGH / MEDIUM / LOW / INFO). Optional
    ``evidence``, ``impact``.
 
-5. **Append to ``bundle_amendments.jsonl``** (#211) when you override
+5. **Append to ``bundle_amendments.jsonl``** when you override
    any parameter from ``experiment_spec.verified_parameters``.
 
 6. **Write findings.json with ``mode: rehearsal``** in the outcome,
@@ -180,15 +185,22 @@ def mode_guidance_for(mode: Mode) -> str:
 """
 
 EXECUTE_REAL_GUIDANCE = """\
-This iteration is in **REAL** mode (#212). Run the full experiment_spec
-at the bundle's prescribed scope: all arms, full seed list. If a prior
-``rehearsal`` iter emitted ``brief_amendments.jsonl``, read it before
-launching the experiment — if any ``priority: BLOCKING`` amendments
-exist that haven't been applied to the brief, halt with a
-``failure_note.md`` (promotion-gate logic, #224). Otherwise run the
-full bundle, write ``findings.json`` with ``mode: real`` and a CONFIRMED /
-REFUTED / NULL status per arm, and append ``bundle_amendments.jsonl``
-(#211) for any parameter overrides observed during execution.
+This iteration is in **REAL** mode. Run the full experiment_spec at
+the bundle's prescribed scope: all arms, full seed list.
+
+If a prior ``rehearsal`` iter emitted ``brief_amendments.jsonl``, read
+it BEFORE launching the experiment. Any ``priority: BLOCKING``
+amendments encode constraints iter-2 must respect (e.g., a workload
+parameter the rehearsal verified is required for the experiment to
+engage the mechanism). Apply each BLOCKING amendment to your run
+configuration and proceed; if you cannot apply one, write a
+``failure_note.md`` describing why and STOP — the campaign should
+revise the brief before continuing.
+
+Write ``findings.json`` with ``mode: real`` and a CONFIRMED / REFUTED
+/ NULL status per arm. Append ``bundle_amendments.jsonl`` for any
+parameter overrides observed during execution (silent drift breaks
+reproducibility).
 """
 
 
diff --git a/orchestrator/llm_dispatch.py b/orchestrator/llm_dispatch.py
@@ -140,14 +140,17 @@ def _format_brief_amendments_summary(work_dir: Path) -> str:
     Each amendment is a JSON object with required fields
     ``id, brief_section, problem, fix, priority``. Optional
     ``evidence``, ``impact``. The schema lives at
-    ``orchestrator/schemas/brief_amendments.schema.json``.
-
-    Walks ``runs/iter-*/inputs/brief_amendments.jsonl``, schema-validates
-    rows individually (skipping malformed with a visible warning), and
-    renders a per-iter listing grouped by priority. The REPORT extractor
-    can use this to: (a) cite which amendments shaped the iteration's
-    findings, (b) flag which BLOCKING amendments still need applying
-    to the upstream brief (the cross-run learning loop).
+    ``orchestrator/schemas/brief_amendments.schema.json`` and is
+    enforced by the agent that *writes* the file (per methodology) —
+    this renderer JSON-decodes each row and surfaces a count of
+    lines that failed to parse so the operator sees corruption,
+    but does not itself re-validate against the schema.
+
+    Walks ``runs/iter-*/inputs/brief_amendments.jsonl`` and renders a
+    per-iter listing grouped by priority. The REPORT extractor can use
+    this to: (a) cite which amendments shaped the iteration's findings,
+    (b) flag which BLOCKING amendments still need applying to the
+    upstream brief (the cross-run learning loop).
     """
     runs_dir = work_dir / "runs"
     if not runs_dir.is_dir():
diff --git a/orchestrator/promote_gate.py b/orchestrator/promote_gate.py
@@ -43,23 +43,40 @@
 VALID_DECISIONS: tuple[Decision, ...] = ("promote", "revise", "abort")
 
 
-def _read_jsonl(path: Path) -> list[dict]:
-    """Read a JSONL file; skip malformed lines silently."""
+def _read_jsonl_with_skips(path: Path) -> tuple[list[dict], int]:
+    """Read a JSONL file. Returns ``(valid_rows, malformed_count)``.
+
+    Malformed-line counts are surfaced (not silently dropped) because
+    the gate makes BLOCKING decisions on these files: a corrupt line
+    that was meant to be a BLOCKING amendment must not silently
+    register as "no BLOCKING amendments" and let the campaign promote
+    to its next iteration.
+    """
     if not path.exists():
-        return []
-    rows: list[dict] = []
+        return [], 0
     try:
         text = path.read_text()
     except OSError:
-        return []
+        return [], 0
+    rows: list[dict] = []
+    malformed = 0
     for line in text.splitlines():
         line = line.strip()
         if not line:
             continue
         try:
             rows.append(json.loads(line))
         except json.JSONDecodeError:
-            continue
+            malformed += 1
+    return rows, malformed
+
+
+def _read_jsonl(path: Path) -> list[dict]:
+    """Backward-compat thin wrapper. Drops the malformed count for
+    callers that don't need it (today: ``applied_amendments.jsonl``,
+    where the cost of a corrupted apply log is symmetric — false
+    negatives + false positives both possible)."""
+    rows, _ = _read_jsonl_with_skips(path)
     return rows
 
 
@@ -92,12 +109,26 @@ def evaluate_promote_gate(
            ``<work_dir>/applied_amendments.jsonl`` → ``revise``
            ("blocking amendment must be applied to the brief before
            iter-(N+1) can produce valid data").
+        3b. Any malformed line(s) in ``brief_amendments.jsonl`` →
+           ``revise`` (cannot rule out a hidden BLOCKING entry;
+           asymmetric-risk choice).
         4. Otherwise → ``promote``.
 
+    **Scoping (v1):** the gate reads only iter-N's
+    ``brief_amendments.jsonl``, NOT amendments from earlier iters. Per
+    the schema, ``id`` (e.g. ``BA-1``) is "stable within this iter's
+    amendments" — *not globally unique*. So an iter-1 BLOCKING
+    amendment that the operator never applied will NOT be re-flagged
+    when this function is called for iter-2 (iter-2's gate looks at
+    iter-2's amendments only). The cross-iter "still-pending" view is
+    deferred to v2 (apply-amendments CLI + composite IDs); for v1,
+    callers MUST run the gate after EVERY iter that emits any
+    BLOCKING amendment, not just the last one.
+
     Engine integration (v2): the engine calls this between iterations
     and acts on ``decision``: ``promote`` → start iter-(N+1),
     ``revise`` → halt with ``CampaignStopped("revise")`` so the operator
-    can run ``nous brief apply-amendments``, ``abort`` → halt with
+    can apply amendments, ``abort`` → halt with
     ``CampaignStopped("abort")``.
     """
     work_dir = Path(work_dir)
@@ -132,8 +163,14 @@ def evaluate_promote_gate(
             feasibility=False,
         )
 
-    # 3. Brief-amendments gate.
-    amendments = _read_jsonl(iter_dir / "inputs" / "brief_amendments.jsonl")
+    # 3. Brief-amendments gate. Counts malformed lines separately —
+    # a corrupted line could have been a BLOCKING amendment, and
+    # silently letting it disappear past the gate is exactly the
+    # asymmetric-risk failure (false promote >> false revise) we
+    # want to avoid.
+    amendments, malformed = _read_jsonl_with_skips(
+        iter_dir / "inputs" / "brief_amendments.jsonl",
+    )
     applied_rows = _read_jsonl(work_dir / "applied_amendments.jsonl")
     applied_ids = {
         str(r.get("id"))
@@ -158,12 +195,34 @@ def evaluate_promote_gate(
                 f"{len(blocking_unapplied)} BLOCKING brief_amendment(s) "
                 f"({ids}) have not been applied to the upstream brief. "
                 f"Iter-{iteration + 1} would re-discover or re-trip "
-                f"these issues. Run `nous brief apply-amendments` (post-#223 "
-                f"v2) or apply manually, then resume."
+                f"these issues. Apply the amendments manually (or wait "
+                f"for `nous brief apply-amendments`), then resume."
             ),
             blocking=[str(a.get("id", "?")) for a in blocking_unapplied],
             applied=sorted(applied_ids),
             feasibility=True,
+            malformed_lines=malformed,
+        )
+
+    # 3b. Malformed-line safety: if the amendments file had any
+    # unparseable lines, we cannot rule out a hidden BLOCKING entry.
+    # Asymmetric risk: false revise costs an operator a few minutes
+    # of inspection; false promote past corruption can waste an
+    # iteration's tokens. Choose revise.
+    if malformed > 0:
+        return _decision(
+            "revise",
+            reasoning=(
+                f"{malformed} malformed line(s) in "
+                f"runs/iter-{iteration}/inputs/brief_amendments.jsonl "
+                f"could not be parsed. We cannot rule out a hidden "
+                f"BLOCKING amendment among them. Inspect the file and "
+                f"fix the malformed lines before iter-{iteration + 1}."
+            ),
+            blocking=[],
+            applied=sorted(applied_ids),
+            feasibility=True,
+            malformed_lines=malformed,
         )
 
     # 4. Default → promote.
@@ -177,6 +236,7 @@ def evaluate_promote_gate(
         blocking=[],
         applied=sorted(applied_ids),
         feasibility=True,
+        malformed_lines=0,
     )
 
 
@@ -187,8 +247,8 @@ def _decision(
     blocking: list[str],
     applied: list[str],
     feasibility: bool,
+    malformed_lines: int = 0,
 ) -> dict:
-    """Build the result dict in the canonical shape."""
     return {
         "decision": decision,
         "reasoning": reasoning,
@@ -197,4 +257,5 @@ def _decision(
         "feasibility_check": {
             "passed": feasibility,
         },
+        "malformed_amendment_lines": malformed_lines,
     }
diff --git a/orchestrator/sdk_dispatch.py b/orchestrator/sdk_dispatch.py
@@ -620,24 +620,43 @@ def _maybe_log_silence(self, iteration: int, phase: str) -> None:
 
     def _bundle_recommended_turn_silence_threshold(
             self, iteration: int) -> float | None:
-        """#226: read the rehearsal-recorded watchdog threshold override
-        from ``bundle.experiment_spec.timing_observations.recommended_turn_silence_threshold_seconds``
-        for the iter immediately PRIOR to the current one (the rehearsal
-        iter, when present).
-
-        Returns None if the bundle / field doesn't exist or is malformed.
-        Pure file reader — no LLM, no I/O beyond a single read.
+        """Read the rehearsal-recorded watchdog threshold override from the
+        prior iter's bundle.experiment_spec.timing_observations.
+
+        Returns ``None`` for: iter-1 (no prior bundle), missing bundle file,
+        unparseable YAML, or absent/malformed
+        ``recommended_turn_silence_threshold_seconds``. On read or parse
+        failures (unreadable file, corrupt YAML), logs a warning so
+        operators see why the override didn't apply; a missing PyYAML
+        raises ImportError. Silently falling back to the campaign default
+        would be the silent-failure pattern PR #218 was meant to kill.
         """
         if iteration < 2:
             return None
         prior_iter_dir = self.work_dir / "runs" / f"iter-{iteration - 1}"
         bundle_path = prior_iter_dir / "bundle.yaml"
         if not bundle_path.exists():
             return None
+        # Import yaml outside the try/except: an ImportError here is
+        # an environmental defect that should propagate, not a
+        # silent fallback to the campaign default.
+        import yaml as _yaml
         try:
-            import yaml as _yaml
-            data = _yaml.safe_load(bundle_path.read_text())
-        except (OSError, Exception):  # noqa: BLE001
+            text = bundle_path.read_text()
+            data = _yaml.safe_load(text)
+        except OSError as exc:
+            logger.warning(
+                "iter-%d bundle unreadable; skipping timing-override "
+                "(%s: %s)",
+                iteration - 1, type(exc).__name__, exc,
+            )
+            return None
+        except _yaml.YAMLError as exc:
+            logger.warning(
+                "iter-%d bundle YAML invalid; skipping timing-override "
+                "(falling back to campaign default %.0fs): %s",
+                iteration - 1, self._turn_silence_threshold, exc,
+            )
             return None
         if not isinstance(data, dict):
             return None
diff --git a/tests/test_experiment_spec.py b/tests/test_experiment_spec.py
diff --git a/tests/test_promote_gate.py b/tests/test_promote_gate.py