fix(meta-findings): fully resolve AI-native-Systems-Research#242 — ledger floor + eager retry_log + nous reports (AI-native-Systems-Research#243)

sriumcp · web-flow · commit 46a288f8dc3c · 2026-05-29T09:00:54.000-04:00
* fix(meta-findings): add ledger.json failure floor + missing-artifact detector (AI-native-Systems-Research#242)

The existing nous_asks heuristics all key off retry_log.jsonl or llm_metrics.jsonl. Sometimes a campaign's dispatcher dies before writing those artifacts (e.g., a single-iteration campaign that fails at the SDK call). When that happens, every heuristic short-circuits and meta_findings.json reports 0/0/0 across all three named streams, even though ledger.json records the failure plainly.

Acceptance fixture: paper-burst.post-204-rerun.1779882732/. Its ledger.json has iter-1 status="FAILED" error="SDK returned error after 1 attempt(s): None", but retry_log.jsonl, llm_metrics.jsonl, and runs/iter-1/findings.json are all absent. Pre-fix, the campaign's emitted meta_findings.json had 0 entries across all named streams; post-fix, it surfaces ≥1 nous_asks entry.

Two new pure-Python detectors in orchestrator/meta_findings.py:

- _detect_nous_asks_from_ledger_failures(ledger): one nous_ask per iterations[*].status == "FAILED" row, kind="dispatch", citing iter-N and the error string truncated to 120 characters.
- _detect_nous_asks_from_missing_artifacts(work_dir, state, ledger): one nous_ask when state.iteration >= 1 and last_entered_phase != "IDLE" but retry_log.jsonl and runs/iter-N/findings.json are absent. kind="observability".

ledger.json is the right substrate for the failure floor. The orchestrator writes it itself, not the dispatcher subprocess, so by construction it survives dispatcher-side crashes. Both detectors degrade silently on missing or malformed input, mirroring the existing detectors.

Schema unchanged. Both kinds ("dispatch", "observability") are already in the meta_findings.schema.json nous_ask enum.

Two adjacent items raised in the issue body are explicitly out of scope here and will be tracked as separate follow-ups: eager initialisation of retry_log.jsonl at iteration start, and an on-demand `nous reports <run_id>` subcommand.
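The following is a minimal, self-contained sketch of the failure-floor reasoning. It rebuilds the acceptance-fixture shape, where ledger.json is the only artifact on disk, in a temporary directory and runs a simplified stand-in for `_detect_nous_asks_from_ledger_failures` over it. `load_json_or_none` and `ledger_failure_asks` are hypothetical names used only for illustration; they are not functions from orchestrator/meta_findings.py. The sketch also assumes that a missing or corrupt ledger.json is loaded as `None`, which is one way to get the silent degradation described above.

```python
import json
import tempfile
from pathlib import Path


def load_json_or_none(path: Path):
    """Hypothetical loader: parsed JSON, or None if missing or malformed."""
    try:
        return json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return None


def ledger_failure_asks(ledger) -> list:
    """Simplified stand-in for _detect_nous_asks_from_ledger_failures."""
    if not isinstance(ledger, dict):
        return []  # missing or malformed ledger: degrade silently
    asks = []
    for row in ledger.get("iterations") or []:
        if not isinstance(row, dict) or row.get("status") != "FAILED":
            continue
        n = row.get("iteration")
        if not isinstance(n, int):
            continue
        # Truncate to 120 chars; fall back to a placeholder when empty.
        err = (row.get("error") or "").strip()[:120] or "(no error text)"
        asks.append({
            "kind": "dispatch",
            "evidence": f'ledger.json: iter-{n} status=FAILED, error="{err}"',
        })
    return asks


with tempfile.TemporaryDirectory() as tmp:
    ledger_path = Path(tmp) / "ledger.json"

    # Acceptance-fixture shape: ledger.json is the only artifact on disk.
    ledger_path.write_text(json.dumps({
        "iterations": [{
            "iteration": 1,
            "status": "FAILED",
            "error": "SDK returned error after 1 attempt(s): None",
        }],
    }))
    fixture_asks = ledger_failure_asks(load_json_or_none(ledger_path))

    # A truncated ledger.json yields no asks instead of raising.
    ledger_path.write_text('{"iterations": [')
    corrupt_asks = ledger_failure_asks(load_json_or_none(ledger_path))

for ask in fixture_asks:
    print(ask["kind"], ask["evidence"])
print(len(corrupt_asks))
```

The sketch makes one design point explicit: the detector reads only ledger.json. Because of that, it still fires when retry_log.jsonl and llm_metrics.jsonl never existed.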
Tests: 13 new (TestLedgerFailureDetection, TestMissingArtifactDetection, TestPost204RerunAcceptanceFixture). Full suite: 1229 passed, 1 skipped.

Closes AI-native-Systems-Research#242

* fix(meta-findings): add eager retry_log init + nous reports subcommand (AI-native-Systems-Research#242)

Expands the original ledger-floor PR to fully resolve AI-native-Systems-Research#242 by adding the two follow-ups that the issue body called out as adjacent items.

## Eager retry_log.jsonl init at iteration start

orchestrator/iteration.py:setup_work_dir now creates an empty retry_log.jsonl alongside state.json/ledger.json/principles.json. Before this fix, orchestrator.metrics created retry_log.jsonl lazily on the first dispatch failure. A dispatcher-side crash before any retry therefore left no parseable artifact at all, and every retry-log-keyed heuristic in meta_findings.py was blind to the failure. The eager touch guarantees that downstream tooling always sees a parseable artifact.

The missing-artifact detector in meta_findings.py is updated to treat an empty retry_log.jsonl as semantically equivalent to "no dispatch retries logged", which is the original signal it cared about. The canonical post-AI-native-Systems-Research#242 catastrophic-failure shape is now ledger row FAILED + empty retry_log + missing findings.json, and the detector fires on it.

## `nous reports` subcommand

A new `nous reports <target>` CLI subcommand re-emits meta_findings.json on demand for any work_dir, regardless of whether the campaign reached a clean terminal transition. Pure-Python; zero LLM tokens. Useful for:

- Legacy campaigns that pre-date the in-line emitter wired into campaign.py.
- Aborted campaigns that never reached the four call sites that invoke the emitter automatically.
- Re-emission after this PR's heuristic changes: the post-204-rerun campaign goes from 0 nous_asks to 2 (one dispatch ask citing the ledger FAILED row, one observability ask citing missing artifacts).
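The jump from 0 to 2 nous_asks above comes from the two new detectors firing independently on the same campaign. The sketch below rebuilds the canonical post-#242 catastrophic-failure shape in a temporary directory: a FAILED ledger row, an eagerly created but empty retry_log.jsonl, and no findings.json. It then tallies asks using simplified stand-ins for both detectors. The helper names (`init_retry_log`, `retry_log_state`, `tally_242_asks`), the EXECUTE_ANALYZE phase, and the retry-row payload are all illustrative assumptions. The real emitter also runs the pre-existing retry-log and metrics heuristics, and this sketch leaves them out.

```python
import json
import tempfile
from pathlib import Path


def init_retry_log(work_dir: Path) -> None:
    """Eager init in the spirit of setup_work_dir: create only if missing."""
    retry_log = work_dir / "retry_log.jsonl"
    if not retry_log.exists():
        retry_log.touch()


def retry_log_state(work_dir: Path) -> str:
    """Classify retry_log.jsonl as 'absent', 'empty', or 'has_rows'."""
    retry_log = work_dir / "retry_log.jsonl"
    if not retry_log.exists():
        return "absent"
    return "has_rows" if retry_log.stat().st_size > 0 else "empty"


def tally_242_asks(work_dir: Path, state: dict, ledger: dict) -> int:
    """Count asks from simplified versions of the two #242 detectors."""
    rows = [r for r in (ledger.get("iterations") or []) if isinstance(r, dict)]
    # Ledger floor: one "dispatch" ask per FAILED row.
    count = sum(1 for r in rows if r.get("status") == "FAILED")

    # Missing-artifact detector: state progressed, retry log absent or
    # empty, and no findings.json for the latest ledger iteration.
    iters = [r["iteration"] for r in rows
             if isinstance(r.get("iteration"), int) and r["iteration"] >= 1]
    iteration = state.get("iteration")
    phase = state.get("last_entered_phase")
    progressed = (isinstance(iteration, int) and iteration >= 1
                  and phase not in (None, "", "IDLE"))
    if progressed and iters:
        findings = work_dir / "runs" / f"iter-{max(iters)}" / "findings.json"
        if retry_log_state(work_dir) != "has_rows" and not findings.exists():
            count += 1  # one "observability" ask
    return count


state = {"iteration": 1, "last_entered_phase": "EXECUTE_ANALYZE"}
ledger = {"iterations": [{"iteration": 1, "status": "FAILED",
                          "error": "SDK returned error after 1 attempt(s): None"}]}

with tempfile.TemporaryDirectory() as tmp:
    work_dir = Path(tmp)
    init_retry_log(work_dir)
    shape_before = retry_log_state(work_dir)
    asks_before = tally_242_asks(work_dir, state, ledger)

    # A logged retry (hypothetical row shape) silences the observability
    # ask, and a repeated init must not clobber the existing content.
    (work_dir / "retry_log.jsonl").write_text(json.dumps({"attempt": 1}) + "\n")
    init_retry_log(work_dir)
    shape_after = retry_log_state(work_dir)
    asks_after = tally_242_asks(work_dir, state, ledger)

print(shape_before, asks_before, shape_after, asks_after)
```

Treating "empty" the same as "absent" keeps the detector's original meaning intact: once the eager init exists, only the presence of actual retry rows should silence the observability ask.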
When the target work_dir is not at phase=DONE/STOPPED, the emitted meta_findings.json gets a `notes` field flagging the non-terminal state. This keeps triage tooling from conflating on-demand emission with a clean terminal record.

The target can be a campaign.yaml or a work_dir / run_id resolvable via NOUS_CAMPAIGN_PARENT. The campaign.yaml form is preferred because it supplies target_system context for the instrumentation/documentation heuristics.

## Tests added (+7)

- TestMissingArtifactDetection.test_empty_retry_log_still_triggers: pins the post-AI-native-Systems-Research#242 catastrophic-failure shape.
- TestSetupWorkDirLegacyDefault.test_creates_empty_retry_log_jsonl: asserts that setup_work_dir creates the file empty.
- TestSetupWorkDirLegacyDefault.test_retry_log_existing_content_not_clobbered: idempotency under repeated setup_work_dir calls.
- TestCmdReports (4 tests): work_dir target, yaml target, partial-state annotation, terminal-state non-annotation.

End-to-end: running `nous reports` against the actual paper-burst.post-204-rerun.1779882732/ campaign now emits 2 nous_asks (was 0 before this PR). One cites the iter-1 status=FAILED ledger row, and the other cites the missing per-iteration artifacts.

Full suite: 1236 passed (+7), 1 skipped, 0 failures.
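The terminal check and `notes` annotation that `_cmd_reports` performs can be exercised in isolation. The sketch below makes one assumption, taken from the diff: state.json's `last_entered_phase` (falling back to `phase`) alone decides terminality. `is_terminal`, `annotate_partial`, and the shortened note text are hypothetical; the sketch also explicitly treats a state.json that is not a JSON object as non-terminal.

```python
import json
import tempfile
from pathlib import Path

TERMINAL_PHASES = ("DONE", "STOPPED")
PARTIAL_NOTE = (
    "Emitted on-demand via `nous reports` against a non-terminal work_dir "
    "(state.json: phase is not DONE/STOPPED)."
)


def is_terminal(work_dir: Path) -> bool:
    """True only when state.json parses to an object at DONE/STOPPED."""
    try:
        state = json.loads((work_dir / "state.json").read_text())
    except (OSError, json.JSONDecodeError):
        return False  # missing or unreadable state counts as non-terminal
    if not isinstance(state, dict):
        return False
    phase = state.get("last_entered_phase") or state.get("phase")
    return phase in TERMINAL_PHASES


def annotate_partial(payload: dict, terminal: bool) -> dict:
    """Append the partial-state note, keeping any note already present."""
    if not terminal:
        prior = payload.get("notes") or ""
        payload["notes"] = f"{prior} {PARTIAL_NOTE}".strip() if prior else PARTIAL_NOTE
    return payload


with tempfile.TemporaryDirectory() as tmp:
    results = {}
    for name, phase in [("done", "DONE"), ("midflight", "EXECUTE_ANALYZE")]:
        work_dir = Path(tmp) / name
        work_dir.mkdir()
        (work_dir / "state.json").write_text(
            json.dumps({"iteration": 1, "last_entered_phase": phase}))
        payload = {"nous_asks": [], "notes": "prior note"}
        results[name] = annotate_partial(payload, is_terminal(work_dir))

print(results["done"]["notes"])
print(results["midflight"]["notes"])
```

Appending to any existing `notes` instead of overwriting it means an earlier annotation from the emitter survives re-emission. Defaulting to non-terminal when state.json is unreadable errs toward flagging the record as partial.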
diff --git a/orchestrator/cli.py b/orchestrator/cli.py
@@ -562,6 +562,89 @@ def _cmd_report(args):
     _generate_report(campaign, work_dir, args.model, agent=args.agent, timeout=args.timeout)
 
 
+def _cmd_reports(args):
+    """On-demand re-emission of meta_findings.json (#242).
+
+    Runs the pure-Python emitter against any work_dir, regardless of
+    whether the campaign reached a clean terminal transition. Useful for
+    legacy campaigns that pre-date the in-line emission wired into
+    campaign.py, and for campaigns that aborted mid-phase and so never
+    reached the four call sites that invoke the emitter automatically.
+
+    Target may be a campaign.yaml (preferred — gives full target_system
+    context for the heuristics) or a work_dir / run_id (emitted with an
+    empty target_system stub).
+    """
+    import json as _json
+    import yaml as _yaml
+    from orchestrator.meta_findings import (
+        emit_meta_findings,
+        write_meta_findings,
+    )
+    from orchestrator.validate import validate_meta_findings
+
+    work_dir = resolve_work_dir(args.target)
+
+    campaign: dict = {"target_system": {}}
+    if args.target.endswith((".yaml", ".yml")):
+        try:
+            data = _yaml.safe_load(Path(args.target).read_text())
+            if isinstance(data, dict):
+                campaign = data
+        except (_yaml.YAMLError, OSError) as exc:
+            print(
+                f"Warning: could not parse {args.target} ({exc}); "
+                f"emitting against empty target_system context.",
+                file=sys.stderr,
+            )
+
+    payload = emit_meta_findings(work_dir, campaign)
+
+    state_path = work_dir / "state.json"
+    is_terminal = False
+    if state_path.exists():
+        try:
+            state = _json.loads(state_path.read_text())
+            phase = state.get("last_entered_phase") or state.get("phase")
+            is_terminal = phase in ("DONE", "STOPPED")
+        except (_json.JSONDecodeError, OSError, AttributeError):
+            pass
+
+    if not is_terminal:
+        prior = payload.get("notes") or ""
+        suffix = (
+            "Emitted on-demand via `nous reports` against a non-terminal "
+            "work_dir (state.json: phase is not DONE/STOPPED). The "
+            "three streams reflect partial state; re-emit after "
+            "campaign termination for the canonical record."
+        )
+        payload["notes"] = (prior + " " + suffix).strip() if prior else suffix
+
+    target = write_meta_findings(work_dir, payload)
+    result = validate_meta_findings(work_dir)
+    if result["status"] == "fail":
+        print(
+            f"Warning: emitted meta_findings.json failed self-validation: "
+            f"{result['errors']}",
+            file=sys.stderr,
+        )
+
+    n_lessons = len(payload.get("campaign_design_lessons") or [])
+    n_repo = len(payload.get("target_system_asks") or [])
+    n_nous = len(payload.get("nous_asks") or [])
+    print(
+        f"{target}  "
+        f"({n_lessons} design lesson(s), {n_repo} repo ask(s), "
+        f"{n_nous} nous ask(s))"
+    )
+    if not is_terminal:
+        print(
+            "Note: emitted against a non-terminal work_dir; see "
+            "meta_findings.json `notes` field.",
+            file=sys.stderr,
+        )
+
+
 def _cmd_replay(args):
     import subprocess
     import yaml
@@ -847,6 +930,19 @@ def main():
     p_replay.add_argument("--iter", required=True, type=int)
     p_replay.set_defaults(func=_cmd_replay)
 
+    p_reports = subparsers.add_parser(
+        "reports",
+        help="Re-emit meta_findings.json on demand for any work_dir (#242). "
+             "Pure-Python; zero LLM tokens. Works against legacy or aborted "
+             "campaigns that never reached the in-line emitter.",
+    )
+    p_reports.add_argument(
+        "target",
+        help="campaign.yaml (preferred — supplies target_system context) "
+             "OR a work_dir / run_id resolvable via NOUS_CAMPAIGN_PARENT.",
+    )
+    p_reports.set_defaults(func=_cmd_reports)
+
     # `create-campaign` (issue #89): scaffold a heavily-commented
     # campaign.yaml that names the four agent-reachable fields and
     # warns about the domain_adapter_layer trap.
diff --git a/orchestrator/iteration.py b/orchestrator/iteration.py
@@ -799,6 +799,17 @@ def setup_work_dir(run_id: str, repo_path: str | None = None) -> Path:
         dest = work_dir / t
         if not dest.exists():
             shutil.copy(TEMPLATES_DIR / t, dest)
+
+    # #242: eagerly create an empty retry_log.jsonl. The orchestrator
+    # writes it on first dispatch failure, so a dispatcher-side crash
+    # before any retry would leave no trail at all — making the
+    # retry-log-keyed heuristics in meta_findings.py blind to the
+    # failure. Touching it here guarantees downstream tooling always
+    # sees a parseable artifact, even if it's empty.
+    retry_log = work_dir / "retry_log.jsonl"
+    if not retry_log.exists():
+        retry_log.touch()
+
     state = json.loads((work_dir / "state.json").read_text())
     state["run_id"] = run_id
     # #239: record resolved paths as per-campaign source of truth.
diff --git a/orchestrator/meta_findings.py b/orchestrator/meta_findings.py
@@ -345,6 +345,106 @@ def _detect_nous_asks(
     return asks
 
 
+def _detect_nous_asks_from_ledger_failures(ledger: dict | list | None) -> list[dict]:
+    """Surface ledger.json rows where status="FAILED" as nous_asks (#242).
+
+    Closes the structural blind spot where every other detector keys off
+    retry_log.jsonl: a dispatcher that dies before writing the retry log
+    leaves only ledger.json as evidence, and the previous emitter
+    silently produced 0/0/0 streams. ledger.json is written by the
+    orchestrator itself, so it survives dispatcher-side crashes.
+    """
+    asks: list[dict] = []
+    if not isinstance(ledger, dict):
+        return asks
+    for row in ledger.get("iterations") or []:
+        if not isinstance(row, dict):
+            continue
+        if row.get("status") != "FAILED":
+            continue
+        iteration = row.get("iteration")
+        if not isinstance(iteration, int):
+            continue
+        err = (row.get("error") or "").strip()
+        err_short = err[:120] if err else "(no error text)"
+        asks.append({
+            "ask": (
+                "Investigate the dispatcher failure that prevented this "
+                "iteration from completing; if the dispatcher died before "
+                "writing findings.json or any retry_log.jsonl rows, this "
+                "ledger row is the only diagnostic surface."
+            ),
+            "evidence": (
+                f"ledger.json: iter-{iteration} status=FAILED, "
+                f"error=\"{err_short}\""
+            ),
+            "kind": "dispatch",
+        })
+    return asks
+
+
+def _detect_nous_asks_from_missing_artifacts(
+    work_dir: Path, state: dict | list | None, ledger: dict | list | None,
+) -> list[dict]:
+    """Flag campaigns where state.json shows iteration progressed but
+    per-iteration artifacts are absent (#242).
+
+    Triggered when state.iteration >= 1 and state.last_entered_phase
+    is not IDLE, ledger.json has at least one row with iteration >= 1,
+    retry_log.jsonl is absent or empty, and runs/iter-N/findings.json
+    is absent for the highest ledger iter N. This is the post-#204 rerun shape:
+    the dispatcher died after the orchestrator advanced state but
+    before writing any per-iteration artifacts.
+    """
+    if not isinstance(state, dict) or not isinstance(ledger, dict):
+        return []
+    iteration = state.get("iteration")
+    phase = state.get("last_entered_phase")
+    if not isinstance(iteration, int) or iteration < 1:
+        return []
+    if not phase or phase == "IDLE":
+        return []
+
+    ledger_iters: list[int] = []
+    for row in ledger.get("iterations") or []:
+        if not isinstance(row, dict):
+            continue
+        n = row.get("iteration")
+        if isinstance(n, int) and n >= 1:
+            ledger_iters.append(n)
+    if not ledger_iters:
+        return []
+    latest_iter = max(ledger_iters)
+
+    # After #242's eager init, retry_log.jsonl is touched at setup_work_dir
+    # so it always exists for fresh campaigns. Treat a 0-row file as
+    # equivalent to "no dispatch retries logged" — the original semantic
+    # the detector cares about.
+    retry_log = work_dir / "retry_log.jsonl"
+    has_retry_rows = retry_log.exists() and retry_log.stat().st_size > 0
+    findings = work_dir / "runs" / f"iter-{latest_iter}" / "findings.json"
+
+    if has_retry_rows:
+        return []
+    if findings.exists():
+        return []
+
+    retry_state = "absent" if not retry_log.exists() else "empty"
+    return [{
+        "ask": (
+            "Investigate why the iteration progressed in state.json but "
+            "the dispatcher produced no per-iteration artifacts. The "
+            "iteration ended with no findings.json and no retry rows."
+        ),
+        "evidence": (
+            f"state.json: iteration={iteration} phase={phase} — "
+            f"runs/iter-{latest_iter}/findings.json absent, "
+            f"retry_log.jsonl {retry_state}."
+        ),
+        "kind": "observability",
+    }]
+
+
 def _detect_design_lessons(work_dir: Path) -> list[dict]:
     """Find lessons about campaign design from per-iteration findings."""
     lessons: list[dict] = []
@@ -453,7 +553,11 @@ def emit_meta_findings(
         "iterations_completed": iterations_completed,
         "campaign_design_lessons": _detect_design_lessons(work_dir),
         "target_system_asks": _detect_target_system_asks(campaign, retries),
-        "nous_asks": _detect_nous_asks(metrics, retries),
+        "nous_asks": (
+            _detect_nous_asks(metrics, retries)
+            + _detect_nous_asks_from_ledger_failures(ledger)
+            + _detect_nous_asks_from_missing_artifacts(work_dir, state, ledger)
+        ),
     }
 
     # Deployment recommendation (issue #170): every campaign emits a
diff --git a/tests/test_cli.py b/tests/test_cli.py
@@ -5,7 +5,7 @@
 from pathlib import Path
 from unittest.mock import patch, MagicMock
 
-from orchestrator.cli import resolve_work_dir, _cmd_run, _cmd_resume, _cmd_validate, _cmd_status, _cmd_cost, _cmd_report, _cmd_replay
+from orchestrator.cli import resolve_work_dir, _cmd_run, _cmd_resume, _cmd_validate, _cmd_status, _cmd_cost, _cmd_report, _cmd_replay, _cmd_reports
 
 
 class TestResolveWorkDir:
@@ -379,3 +379,141 @@ def test_replay_reports_failed_command(self, tmp_path, capsys):
                 _cmd_replay(args)
         err = capsys.readouterr().err
         assert "h-main/bad" in err
+
+
+class TestCmdReports:
+    """`nous reports` re-emits meta_findings.json on demand (#242)."""
+
+    def test_emits_meta_findings_against_workdir(
+            self, tmp_path: Path, capsys: pytest.CaptureFixture) -> None:
+        """Against a work_dir directly (run_id with no yaml in scope),
+        the command emits a meta_findings.json with empty target_system
+        context. Useful for triaging legacy or aborted campaigns.
+        """
+        work_dir = tmp_path / ".nous" / "legacy-run"
+        work_dir.mkdir(parents=True)
+        (work_dir / "state.json").write_text(json.dumps({
+            "run_id": "legacy-run",
+            "iteration": 1,
+            "last_entered_phase": "DONE",
+        }))
+        (work_dir / "ledger.json").write_text(json.dumps({
+            "iterations": [{"iteration": 1, "status": "FAILED",
+                            "error": "SDK returned error after 1 attempt(s): None"}],
+        }))
+
+        args = argparse.Namespace(target=str(work_dir))
+        _cmd_reports(args)
+        out = capsys.readouterr().out
+
+        # Output one-liner reports the artifact path + counts.
+        assert "meta_findings.json" in out
+        assert "nous ask" in out
+
+        # Artifact must exist on disk and be schema-valid.
+        mf_path = work_dir / "meta_findings.json"
+        assert mf_path.exists()
+        payload = json.loads(mf_path.read_text())
+        assert payload["schema_version"] == "1"
+        # The acceptance fixture from #242: ledger FAILED row → ≥ 1 nous_ask.
+        assert payload["nous_asks"], payload
+
+    def test_marks_partial_when_workdir_not_terminal(
+            self, tmp_path: Path, capsys: pytest.CaptureFixture) -> None:
+        """A campaign whose state.json shows phase != DONE/STOPPED is
+        emitted with a `notes` field flagging partial state, so triage
+        tooling doesn't conflate on-demand emission with a clean
+        terminal record.
+        """
+        work_dir = tmp_path / ".nous" / "midflight"
+        work_dir.mkdir(parents=True)
+        (work_dir / "state.json").write_text(json.dumps({
+            "run_id": "midflight",
+            "iteration": 1,
+            "last_entered_phase": "EXECUTE_ANALYZE",
+        }))
+        (work_dir / "ledger.json").write_text(json.dumps({
+            "iterations": [{"iteration": 1, "status": "FAILED",
+                            "error": "abort"}],
+        }))
+
+        args = argparse.Namespace(target=str(work_dir))
+        _cmd_reports(args)
+
+        payload = json.loads((work_dir / "meta_findings.json").read_text())
+        assert "notes" in payload
+        assert "non-terminal" in payload["notes"].lower()
+        err = capsys.readouterr().err
+        assert "non-terminal" in err.lower()
+
+    def test_does_not_mark_partial_when_terminal(
+            self, tmp_path: Path) -> None:
+        """A campaign at phase=DONE must NOT be flagged as non-terminal."""
+        work_dir = tmp_path / ".nous" / "done-run"
+        work_dir.mkdir(parents=True)
+        (work_dir / "state.json").write_text(json.dumps({
+            "run_id": "done-run",
+            "iteration": 1,
+            "last_entered_phase": "DONE",
+        }))
+        (work_dir / "ledger.json").write_text(json.dumps({
+            "iterations": [{"iteration": 1, "status": "completed"}],
+        }))
+        # Build a complete iter-1 finding so the heuristic streams stay quiet.
+        iter_dir = work_dir / "runs" / "iter-1"
+        iter_dir.mkdir(parents=True)
+        (iter_dir / "findings.json").write_text(json.dumps({
+            "iteration": 1, "bundle_ref": "stub", "arms": [],
+            "experiment_valid": True, "discrepancy_analysis": "stub",
+        }))
+
+        args = argparse.Namespace(target=str(work_dir))
+        _cmd_reports(args)
+
+        payload = json.loads((work_dir / "meta_findings.json").read_text())
+        notes = payload.get("notes") or ""
+        assert "non-terminal" not in notes.lower(), notes
+
+    def test_yaml_target_loads_full_campaign_context(
+            self, tmp_path: Path) -> None:
+        """When target is a campaign.yaml, target_system fields drive the
+        instrumentation/documentation heuristics. Absence of declared
+        observable_metrics should still surface a target_system_ask.
+        """
+        import yaml as _yaml
+
+        repo = tmp_path / "myrepo"
+        repo.mkdir()
+        work_dir = repo / ".nous" / "yaml-run"
+        work_dir.mkdir(parents=True)
+        (work_dir / "state.json").write_text(json.dumps({
+            "run_id": "yaml-run",
+            "iteration": 1,
+            "last_entered_phase": "DONE",
+        }))
+        (work_dir / "ledger.json").write_text(json.dumps({
+            "iterations": [{"iteration": 1, "status": "completed"}],
+        }))
+        iter_dir = work_dir / "runs" / "iter-1"
+        iter_dir.mkdir(parents=True)
+        (iter_dir / "findings.json").write_text(json.dumps({
+            "iteration": 1, "bundle_ref": "stub", "arms": [],
+            "experiment_valid": True, "discrepancy_analysis": "stub",
+        }))
+
+        campaign_yaml = tmp_path / "campaign.yaml"
+        campaign_yaml.write_text(_yaml.dump({
+            "run_id": "yaml-run",
+            "target_system": {
+                "name": "demo",
+                "repo_path": str(repo),
+                # Intentionally no observable_metrics → instrumentation ask.
+            },
+        }))
+
+        args = argparse.Namespace(target=str(campaign_yaml))
+        _cmd_reports(args)
+
+        payload = json.loads((work_dir / "meta_findings.json").read_text())
+        kinds = {a.get("kind") for a in payload["target_system_asks"]}
+        assert "instrumentation" in kinds, payload["target_system_asks"]
diff --git a/tests/test_meta_findings.py b/tests/test_meta_findings.py
diff --git a/tests/test_work_dir_resolver.py b/tests/test_work_dir_resolver.py