Commit 46a288f

fix(meta-findings): fully resolve AI-native-Systems-Research#242 — ledger floor + eager retry_log + nous reports (AI-native-Systems-Research#243)
* fix(meta-findings): add ledger.json failure floor + missing-artifact detector (AI-native-Systems-Research#242)

The existing nous_asks heuristics all key off retry_log.jsonl or llm_metrics.jsonl. When a campaign's dispatcher dies before writing those artifacts (e.g., a single-iteration campaign that fails at the SDK call), every heuristic short-circuits and meta_findings.json reports 0/0/0 across all three named streams — even though the failure is recorded plainly in ledger.json.

Acceptance fixture: paper-burst.post-204-rerun.1779882732/. Its ledger.json has iter-1 status="FAILED" error="SDK returned error after 1 attempt(s): None", but retry_log.jsonl, llm_metrics.jsonl, and runs/iter-1/findings.json are all absent. Pre-fix the campaign's emitted meta_findings.json had 0 entries across all named streams; post-fix it surfaces ≥1 nous_asks entry.

Two new pure-Python detectors in orchestrator/meta_findings.py:

- _detect_nous_asks_from_ledger_failures(ledger): one nous_ask per iterations[*].status == "FAILED" row, kind="dispatch", citing iter-N and a 120-char-truncated error string.
- _detect_nous_asks_from_missing_artifacts(work_dir, state, ledger): one nous_ask when state.iteration >= 1 and last_entered_phase != "IDLE" but retry_log.jsonl and runs/iter-N/findings.json are absent. kind="observability".

ledger.json is the right substrate for the failure floor — it's written by the orchestrator itself, not by the dispatcher subprocess, so it survives dispatcher-side crashes by construction. Both detectors degrade silently on missing or malformed input, mirroring the existing detectors.

Schema unchanged. Both kinds ("dispatch", "observability") are already in the meta_findings.schema.json nous_ask enum.

Two adjacent items raised in the issue body are explicitly out of scope here and will be tracked as separate follow-ups: eager initialisation of retry_log.jsonl at iteration start, and an on-demand `nous reports <run_id>` subcommand.

Tests: 13 new (TestLedgerFailureDetection, TestMissingArtifactDetection, TestPost204RerunAcceptanceFixture). Full suite: 1229 passed, 1 skipped.

Closes AI-native-Systems-Research#242

* fix(meta-findings): add eager retry_log init + nous reports subcommand (AI-native-Systems-Research#242)

Expands the original ledger-floor PR to fully resolve AI-native-Systems-Research#242 by adding the two follow-ups the issue body called out as adjacent items.

## Eager retry_log.jsonl init at iteration start

orchestrator/iteration.py:setup_work_dir now touches retry_log.jsonl empty alongside state.json/ledger.json/principles.json. Before this fix, retry_log.jsonl was created lazily by orchestrator.metrics on first dispatch failure — meaning a dispatcher-side crash before any retry left no parseable artifact at all, blinding every retry-log-keyed heuristic in meta_findings.py to the failure. The eager touch guarantees downstream tooling always sees a parseable artifact.

The missing-artifact detector in meta_findings.py is updated to recognize an empty retry_log.jsonl as semantically equivalent to "no dispatch retries logged" (the original signal it cared about). The canonical post-AI-native-Systems-Research#242 catastrophic-failure shape is now: ledger row FAILED + empty retry_log + missing findings.json — and the detector fires on it.

## `nous reports` subcommand

A new `nous reports <target>` CLI subcommand re-emits meta_findings.json on demand for any work_dir, regardless of whether the campaign reached a clean terminal transition. Pure-Python; zero LLM tokens. Useful for:

- Legacy campaigns that pre-date the in-line emitter wired into campaign.py.
- Aborted campaigns that never reached the four call sites that invoke the emitter automatically.
- Re-emission after this PR's heuristics changes — the post-204-rerun campaign goes from 0 nous_asks to 2 (one dispatch ask citing the ledger FAILED row, one observability ask citing missing artifacts).

When the target work_dir is not at phase=DONE/STOPPED, the emitted meta_findings.json is annotated with a `notes` field flagging non-terminal state, so triage tooling doesn't conflate on-demand emission with a clean terminal record.

Target accepts a campaign.yaml (preferred — supplies target_system context for instrumentation/documentation heuristics) or a work_dir / run_id resolvable via NOUS_CAMPAIGN_PARENT.

## Tests added (+7)

- TestMissingArtifactDetection.test_empty_retry_log_still_triggers — pins the post-AI-native-Systems-Research#242 catastrophic-failure shape.
- TestSetupWorkDirLegacyDefault.test_creates_empty_retry_log_jsonl — asserts setup_work_dir touches the file empty.
- TestSetupWorkDirLegacyDefault.test_retry_log_existing_content_not_clobbered — idempotency under repeated setup_work_dir calls.
- TestCmdReports (4 tests) — work_dir target, yaml target, partial state annotation, terminal state non-annotation.

End-to-end: running `nous reports` against the actual paper-burst.post-204-rerun.1779882732/ campaign now emits 2 nous_asks (was 0 before this PR), each citing iter-1 status=FAILED and the missing per-iteration artifacts.

Full suite: 1236 passed (+7), 1 skipped, 0 failures.
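The catastrophic-failure shape named above (ledger row FAILED, empty or absent retry_log.jsonl, missing findings.json) can be reproduced on disk in a few lines. The following is a minimal sketch, not code from this commit: `is_catastrophic_shape` is a hypothetical helper name, and it assumes only the file layout described in the commit message.

```python
import json
import tempfile
from pathlib import Path


def is_catastrophic_shape(work_dir: Path) -> bool:
    """Hypothetical predicate: True when the ledger records a FAILED row, the
    retry log is absent or empty, and the latest failed iteration has no
    findings.json."""
    try:
        ledger = json.loads((work_dir / "ledger.json").read_text())
    except (OSError, json.JSONDecodeError):
        return False  # degrade silently on missing or malformed input
    rows = ledger.get("iterations") if isinstance(ledger, dict) else None
    failed = [
        r["iteration"] for r in rows or []
        if isinstance(r, dict)
        and r.get("status") == "FAILED"
        and isinstance(r.get("iteration"), int)
    ]
    if not failed:
        return False
    retry_log = work_dir / "retry_log.jsonl"
    has_retry_rows = retry_log.exists() and retry_log.stat().st_size > 0
    findings = work_dir / "runs" / f"iter-{max(failed)}" / "findings.json"
    return not has_retry_rows and not findings.exists()


with tempfile.TemporaryDirectory() as tmp:
    work_dir = Path(tmp)
    (work_dir / "ledger.json").write_text(json.dumps({
        "iterations": [{"iteration": 1, "status": "FAILED",
                        "error": "SDK returned error after 1 attempt(s): None"}],
    }))
    (work_dir / "retry_log.jsonl").touch()  # eager init leaves an empty file
    shape_before = is_catastrophic_shape(work_dir)

    # Once findings.json exists for iter-1, the shape no longer matches.
    iter_dir = work_dir / "runs" / "iter-1"
    iter_dir.mkdir(parents=True)
    (iter_dir / "findings.json").write_text("{}")
    shape_after = is_catastrophic_shape(work_dir)

print(shape_before, shape_after)
```

The sketch folds both conditions into one predicate for illustration; the commit itself keeps them as two separate detectors so that each contributes its own nous_ask.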
1 parent 18303a4 commit 46a288f

6 files changed

Lines changed: 765 additions & 2 deletions

orchestrator/cli.py

Lines changed: 96 additions & 0 deletions
@@ -562,6 +562,89 @@ def _cmd_report(args):
     _generate_report(campaign, work_dir, args.model, agent=args.agent, timeout=args.timeout)
 
 
+def _cmd_reports(args):
+    """On-demand re-emission of meta_findings.json (#242).
+
+    Runs the pure-Python emitter against any work_dir, regardless of
+    whether the campaign reached a clean terminal transition. Useful for
+    legacy campaigns that pre-date the in-line emission wired into
+    campaign.py, and for campaigns that aborted mid-phase and so never
+    reached the four call sites that invoke the emitter automatically.
+
+    Target may be a campaign.yaml (preferred — gives full target_system
+    context for the heuristics) or a work_dir / run_id (emitted with an
+    empty target_system stub).
+    """
+    import json as _json
+    import yaml as _yaml
+    from orchestrator.meta_findings import (
+        emit_meta_findings,
+        write_meta_findings,
+    )
+    from orchestrator.validate import validate_meta_findings
+
+    work_dir = resolve_work_dir(args.target)
+
+    campaign: dict = {"target_system": {}}
+    if args.target.endswith((".yaml", ".yml")):
+        try:
+            data = _yaml.safe_load(Path(args.target).read_text())
+            if isinstance(data, dict):
+                campaign = data
+        except (_yaml.YAMLError, OSError) as exc:
+            print(
+                f"Warning: could not parse {args.target} ({exc}); "
+                f"emitting against empty target_system context.",
+                file=sys.stderr,
+            )
+
+    payload = emit_meta_findings(work_dir, campaign)
+
+    state_path = work_dir / "state.json"
+    is_terminal = False
+    if state_path.exists():
+        try:
+            state = _json.loads(state_path.read_text())
+            phase = state.get("last_entered_phase") or state.get("phase")
+            is_terminal = phase in ("DONE", "STOPPED")
+        except (_json.JSONDecodeError, OSError):
+            pass
+
+    if not is_terminal:
+        prior = payload.get("notes") or ""
+        suffix = (
+            f"Emitted on-demand via `nous reports` against a non-terminal "
+            f"work_dir (state.json: phase is not DONE/STOPPED). The "
+            f"three streams reflect partial state — re-emit after "
+            f"campaign termination for the canonical record."
+        )
+        payload["notes"] = (prior + " " + suffix).strip() if prior else suffix
+
+    target = write_meta_findings(work_dir, payload)
+    result = validate_meta_findings(work_dir)
+    if result["status"] == "fail":
+        print(
+            f"Warning: emitted meta_findings.json failed self-validation: "
+            f"{result['errors']}",
+            file=sys.stderr,
+        )
+
+    n_lessons = len(payload.get("campaign_design_lessons") or [])
+    n_repo = len(payload.get("target_system_asks") or [])
+    n_nous = len(payload.get("nous_asks") or [])
+    print(
+        f"{target} "
+        f"({n_lessons} design lesson(s), {n_repo} repo ask(s), "
+        f"{n_nous} nous ask(s))"
+    )
+    if not is_terminal:
+        print(
+            "Note: emitted against a non-terminal work_dir; see "
+            "meta_findings.json `notes` field.",
+            file=sys.stderr,
+        )
+
+
 def _cmd_replay(args):
     import subprocess
     import yaml
@@ -847,6 +930,19 @@ def main():
     p_replay.add_argument("--iter", required=True, type=int)
     p_replay.set_defaults(func=_cmd_replay)
 
+    p_reports = subparsers.add_parser(
+        "reports",
+        help="Re-emit meta_findings.json on demand for any work_dir (#242). "
+             "Pure-Python; zero LLM tokens. Works against legacy or aborted "
+             "campaigns that never reached the in-line emitter.",
+    )
+    p_reports.add_argument(
+        "target",
+        help="campaign.yaml (preferred — supplies target_system context) "
+             "OR a work_dir / run_id resolvable via NOUS_CAMPAIGN_PARENT.",
+    )
+    p_reports.set_defaults(func=_cmd_reports)
+
     # `create-campaign` (issue #89): scaffold a heavily-commented
     # campaign.yaml that names the four agent-reachable fields and
     # warns about the domain_adapter_layer trap.
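The non-terminal annotation in `_cmd_reports` can be read in isolation as a small pure function. The sketch below is illustrative only: `annotate_if_non_terminal` and the shortened `NON_TERMINAL_NOTE` text are hypothetical names introduced here, and the real command also writes and validates the payload.

```python
from __future__ import annotations

TERMINAL_PHASES = ("DONE", "STOPPED")
# Shortened stand-in for the longer note text used in the commit.
NON_TERMINAL_NOTE = "Emitted on-demand against a non-terminal work_dir."


def annotate_if_non_terminal(payload: dict, state: dict | None) -> bool:
    """Append a note to payload["notes"] unless state shows a terminal phase.

    Returns True when the phase is terminal. A missing or malformed state is
    treated as non-terminal, matching the is_terminal = False default above.
    """
    state = state if isinstance(state, dict) else {}
    phase = state.get("last_entered_phase") or state.get("phase")
    is_terminal = phase in TERMINAL_PHASES
    if not is_terminal:
        prior = payload.get("notes") or ""
        payload["notes"] = f"{prior} {NON_TERMINAL_NOTE}".strip()
    return is_terminal


midflight = {"nous_asks": [], "notes": "prior note."}
annotate_if_non_terminal(midflight, {"last_entered_phase": "EXECUTE_ANALYZE"})

done = {"nous_asks": []}
annotate_if_non_terminal(done, {"last_entered_phase": "DONE"})

no_state = {}
annotate_if_non_terminal(no_state, None)

print(midflight["notes"])
print("notes" in done)
print(no_state["notes"])
```

Existing notes are preserved and the new note is appended, so a payload that already carries a note from the emitter does not lose it.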

orchestrator/iteration.py

Lines changed: 11 additions & 0 deletions
@@ -799,6 +799,17 @@ def setup_work_dir(run_id: str, repo_path: str | None = None) -> Path:
         dest = work_dir / t
         if not dest.exists():
             shutil.copy(TEMPLATES_DIR / t, dest)
+
+    # #242: eagerly create an empty retry_log.jsonl. The orchestrator
+    # writes it on first dispatch failure, so a dispatcher-side crash
+    # before any retry would leave no trail at all — making the
+    # retry-log-keyed heuristics in meta_findings.py blind to the
+    # failure. Touching it here guarantees downstream tooling always
+    # sees a parseable artifact, even if it's empty.
+    retry_log = work_dir / "retry_log.jsonl"
+    if not retry_log.exists():
+        retry_log.touch()
+
     state = json.loads((work_dir / "state.json").read_text())
     state["run_id"] = run_id
     # #239: record resolved paths as per-campaign source of truth.
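The eager-init guard is meant to be idempotent: repeated `setup_work_dir` calls must not clobber rows already written. A minimal sketch of that property, assuming a standalone helper (`ensure_retry_log` is a hypothetical name, and the `{"attempt": 1}` row is an invented placeholder, not the real retry-log schema):

```python
import tempfile
from pathlib import Path


def ensure_retry_log(work_dir: Path) -> Path:
    """Create an empty retry_log.jsonl if absent; leave existing content alone."""
    retry_log = work_dir / "retry_log.jsonl"
    if not retry_log.exists():
        retry_log.touch()
    return retry_log


with tempfile.TemporaryDirectory() as tmp:
    work_dir = Path(tmp)

    # First call creates a zero-byte file.
    log_path = ensure_retry_log(work_dir)
    size_after_first = log_path.stat().st_size

    # Simulate a logged retry (placeholder row), then call again.
    log_path.write_text('{"attempt": 1}\n')
    ensure_retry_log(work_dir)
    content_after_second = log_path.read_text()

print(size_after_first)
print(repr(content_after_second))
```

`Path.touch()` never truncates an existing file, so the `exists()` guard is not what protects the content; what it does is leave the file's modification time untouched on repeat calls.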

orchestrator/meta_findings.py

Lines changed: 105 additions & 1 deletion
@@ -345,6 +345,106 @@ def _detect_nous_asks(
     return asks
 
 
+def _detect_nous_asks_from_ledger_failures(ledger: dict | list | None) -> list[dict]:
+    """Surface ledger.json rows where status="FAILED" as nous_asks (#242).
+
+    Closes the structural blind spot where every other detector keys off
+    retry_log.jsonl: a dispatcher that dies before writing the retry log
+    leaves only ledger.json as evidence, and the previous emitter
+    silently produced 0/0/0 streams. ledger.json is written by the
+    orchestrator itself, so it survives dispatcher-side crashes.
+    """
+    asks: list[dict] = []
+    if not isinstance(ledger, dict):
+        return asks
+    for row in ledger.get("iterations") or []:
+        if not isinstance(row, dict):
+            continue
+        if row.get("status") != "FAILED":
+            continue
+        iteration = row.get("iteration")
+        if not isinstance(iteration, int):
+            continue
+        err = (row.get("error") or "").strip()
+        err_short = err[:120] if err else "(no error text)"
+        asks.append({
+            "ask": (
+                "Investigate the dispatcher failure that prevented this "
+                "iteration from completing — the iteration ended without "
+                "producing findings.json or retry_log.jsonl, leaving the "
+                "ledger row as the only diagnostic surface."
+            ),
+            "evidence": (
+                f"ledger.json: iter-{iteration} status=FAILED, "
+                f"error=\"{err_short}\""
+            ),
+            "kind": "dispatch",
+        })
+    return asks
+
+
+def _detect_nous_asks_from_missing_artifacts(
+    work_dir: Path, state: dict | list | None, ledger: dict | list | None,
+) -> list[dict]:
+    """Flag campaigns where state.json shows iteration progressed but
+    per-iteration artifacts are absent (#242).
+
+    Triggered when state.iteration >= 1 and state.last_entered_phase
+    is not IDLE, ledger.json has at least one row with iteration >= 1,
+    retry_log.jsonl is absent, and runs/iter-N/findings.json is absent
+    for the highest ledger iter N. This is the post-#204 rerun shape:
+    the dispatcher died after the orchestrator advanced state but
+    before writing any per-iteration artifacts.
+    """
+    if not isinstance(state, dict) or not isinstance(ledger, dict):
+        return []
+    iteration = state.get("iteration")
+    phase = state.get("last_entered_phase")
+    if not isinstance(iteration, int) or iteration < 1:
+        return []
+    if not phase or phase == "IDLE":
+        return []
+
+    ledger_iters: list[int] = []
+    for row in ledger.get("iterations") or []:
+        if not isinstance(row, dict):
+            continue
+        n = row.get("iteration")
+        if isinstance(n, int) and n >= 1:
+            ledger_iters.append(n)
+    if not ledger_iters:
+        return []
+    latest_iter = max(ledger_iters)
+
+    # After #242's eager init, retry_log.jsonl is touched at setup_work_dir
+    # so it always exists for fresh campaigns. Treat a 0-row file as
+    # equivalent to "no dispatch retries logged" — the original semantic
+    # the detector cares about.
+    retry_log = work_dir / "retry_log.jsonl"
+    has_retry_rows = retry_log.exists() and retry_log.stat().st_size > 0
+    findings = work_dir / "runs" / f"iter-{latest_iter}" / "findings.json"
+
+    if has_retry_rows:
+        return []
+    if findings.exists():
+        return []
+
+    retry_state = "absent" if not retry_log.exists() else "empty"
+    return [{
+        "ask": (
+            "Investigate why the iteration progressed in state.json but "
+            "the dispatcher produced no per-iteration artifacts. The "
+            "iteration ended with no findings.json and no retry rows."
+        ),
+        "evidence": (
+            f"state.json: iteration={iteration} phase={phase} — "
+            f"runs/iter-{latest_iter}/findings.json absent, "
+            f"retry_log.jsonl {retry_state}."
+        ),
+        "kind": "observability",
+    }]
+
+
 def _detect_design_lessons(work_dir: Path) -> list[dict]:
     """Find lessons about campaign design from per-iteration findings."""
     lessons: list[dict] = []
@@ -453,7 +553,11 @@ def emit_meta_findings(
         "iterations_completed": iterations_completed,
         "campaign_design_lessons": _detect_design_lessons(work_dir),
         "target_system_asks": _detect_target_system_asks(campaign, retries),
-        "nous_asks": _detect_nous_asks(metrics, retries),
+        "nous_asks": (
+            _detect_nous_asks(metrics, retries)
+            + _detect_nous_asks_from_ledger_failures(ledger)
+            + _detect_nous_asks_from_missing_artifacts(work_dir, state, ledger)
+        ),
     }
 
     # Deployment recommendation (issue #170): every campaign emits a
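To see what the ledger-failure floor produces on concrete input, the sketch below re-implements a condensed version of the detector and runs it on a small in-memory ledger. `ledger_failure_asks` is a hypothetical stand-in for `_detect_nous_asks_from_ledger_failures`; it drops the long `ask` text for brevity, and the iter-3 row is invented to exercise the 120-character truncation.

```python
def ledger_failure_asks(ledger) -> list[dict]:
    """Condensed illustration: one entry per FAILED ledger row."""
    asks: list[dict] = []
    if not isinstance(ledger, dict):
        return asks  # malformed input degrades to "no asks"
    for row in ledger.get("iterations") or []:
        if not isinstance(row, dict) or row.get("status") != "FAILED":
            continue
        iteration = row.get("iteration")
        if not isinstance(iteration, int):
            continue
        err = (row.get("error") or "").strip()
        err_short = err[:120] if err else "(no error text)"
        asks.append({
            "evidence": f'ledger.json: iter-{iteration} status=FAILED, '
                        f'error="{err_short}"',
            "kind": "dispatch",
        })
    return asks


ledger = {
    "iterations": [
        {"iteration": 1, "status": "FAILED",
         "error": "SDK returned error after 1 attempt(s): None"},
        {"iteration": 2, "status": "completed"},
        {"iteration": 3, "status": "FAILED", "error": "x" * 300},  # invented row
        "not-a-row",  # malformed entries are skipped
    ],
}

asks = ledger_failure_asks(ledger)
for ask in asks:
    print(ask["kind"], len(ask["evidence"]), ask["evidence"][:60])
```

Only the two FAILED rows yield entries; the completed row and the malformed string are skipped, and passing `None` or a list instead of a dict returns an empty list.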

tests/test_cli.py

Lines changed: 139 additions & 1 deletion
@@ -5,7 +5,7 @@
 from pathlib import Path
 from unittest.mock import patch, MagicMock
 
-from orchestrator.cli import resolve_work_dir, _cmd_run, _cmd_resume, _cmd_validate, _cmd_status, _cmd_cost, _cmd_report, _cmd_replay
+from orchestrator.cli import resolve_work_dir, _cmd_run, _cmd_resume, _cmd_validate, _cmd_status, _cmd_cost, _cmd_report, _cmd_replay, _cmd_reports
 
 
 class TestResolveWorkDir:
@@ -379,3 +379,141 @@ def test_replay_reports_failed_command(self, tmp_path, capsys):
         _cmd_replay(args)
         err = capsys.readouterr().err
         assert "h-main/bad" in err
+
+
+class TestCmdReports:
+    """`nous reports` re-emits meta_findings.json on demand (#242)."""
+
+    def test_emits_meta_findings_against_workdir(
+            self, tmp_path: Path, capsys: pytest.CaptureFixture) -> None:
+        """Against a work_dir directly (run_id with no yaml in scope),
+        the command emits a meta_findings.json with empty target_system
+        context. Useful for triaging legacy or aborted campaigns.
+        """
+        work_dir = tmp_path / ".nous" / "legacy-run"
+        work_dir.mkdir(parents=True)
+        (work_dir / "state.json").write_text(json.dumps({
+            "run_id": "legacy-run",
+            "iteration": 1,
+            "last_entered_phase": "DONE",
+        }))
+        (work_dir / "ledger.json").write_text(json.dumps({
+            "iterations": [{"iteration": 1, "status": "FAILED",
+                            "error": "SDK returned error after 1 attempt(s): None"}],
+        }))
+
+        args = argparse.Namespace(target=str(work_dir))
+        _cmd_reports(args)
+        out = capsys.readouterr().out
+
+        # Output one-liner reports the artifact path + counts.
+        assert "meta_findings.json" in out
+        assert "nous ask" in out
+
+        # Artifact must exist on disk and be schema-valid.
+        mf_path = work_dir / "meta_findings.json"
+        assert mf_path.exists()
+        payload = json.loads(mf_path.read_text())
+        assert payload["schema_version"] == "1"
+        # The acceptance fixture from #242: ledger FAILED row → ≥ 1 nous_ask.
+        assert payload["nous_asks"], payload
+
+    def test_marks_partial_when_workdir_not_terminal(
+            self, tmp_path: Path, capsys: pytest.CaptureFixture) -> None:
+        """A campaign whose state.json shows phase != DONE/STOPPED is
+        emitted with a `notes` field flagging partial state, so triage
+        tooling doesn't conflate on-demand emission with a clean
+        terminal record.
+        """
+        work_dir = tmp_path / ".nous" / "midflight"
+        work_dir.mkdir(parents=True)
+        (work_dir / "state.json").write_text(json.dumps({
+            "run_id": "midflight",
+            "iteration": 1,
+            "last_entered_phase": "EXECUTE_ANALYZE",
+        }))
+        (work_dir / "ledger.json").write_text(json.dumps({
+            "iterations": [{"iteration": 1, "status": "FAILED",
+                            "error": "abort"}],
+        }))
+
+        args = argparse.Namespace(target=str(work_dir))
+        _cmd_reports(args)
+
+        payload = json.loads((work_dir / "meta_findings.json").read_text())
+        assert "notes" in payload
+        assert "non-terminal" in payload["notes"].lower()
+        err = capsys.readouterr().err
+        assert "non-terminal" in err.lower()
+
+    def test_does_not_mark_partial_when_terminal(
+            self, tmp_path: Path) -> None:
+        """A campaign at phase=DONE must NOT be flagged as non-terminal."""
+        work_dir = tmp_path / ".nous" / "done-run"
+        work_dir.mkdir(parents=True)
+        (work_dir / "state.json").write_text(json.dumps({
+            "run_id": "done-run",
+            "iteration": 1,
+            "last_entered_phase": "DONE",
+        }))
+        (work_dir / "ledger.json").write_text(json.dumps({
+            "iterations": [{"iteration": 1, "status": "completed"}],
+        }))
+        # Build a complete iter-1 finding so the heuristic streams stay quiet.
+        iter_dir = work_dir / "runs" / "iter-1"
+        iter_dir.mkdir(parents=True)
+        (iter_dir / "findings.json").write_text(json.dumps({
+            "iteration": 1, "bundle_ref": "stub", "arms": [],
+            "experiment_valid": True, "discrepancy_analysis": "stub",
+        }))
+
+        args = argparse.Namespace(target=str(work_dir))
+        _cmd_reports(args)
+
+        payload = json.loads((work_dir / "meta_findings.json").read_text())
+        notes = payload.get("notes") or ""
+        assert "non-terminal" not in notes.lower(), notes
+
+    def test_yaml_target_loads_full_campaign_context(
+            self, tmp_path: Path) -> None:
+        """When target is a campaign.yaml, target_system fields drive the
+        instrumentation/documentation heuristics. Absence of declared
+        observable_metrics should still surface a target_system_ask.
+        """
+        import yaml as _yaml
+
+        repo = tmp_path / "myrepo"
+        repo.mkdir()
+        work_dir = repo / ".nous" / "yaml-run"
+        work_dir.mkdir(parents=True)
+        (work_dir / "state.json").write_text(json.dumps({
+            "run_id": "yaml-run",
+            "iteration": 1,
+            "last_entered_phase": "DONE",
+        }))
+        (work_dir / "ledger.json").write_text(json.dumps({
+            "iterations": [{"iteration": 1, "status": "completed"}],
+        }))
+        iter_dir = work_dir / "runs" / "iter-1"
+        iter_dir.mkdir(parents=True)
+        (iter_dir / "findings.json").write_text(json.dumps({
+            "iteration": 1, "bundle_ref": "stub", "arms": [],
+            "experiment_valid": True, "discrepancy_analysis": "stub",
+        }))
+
+        campaign_yaml = tmp_path / "campaign.yaml"
+        campaign_yaml.write_text(_yaml.dump({
+            "run_id": "yaml-run",
+            "target_system": {
+                "name": "demo",
+                "repo_path": str(repo),
+                # Intentionally no observable_metrics → instrumentation ask.
+            },
+        }))
+
+        args = argparse.Namespace(target=str(campaign_yaml))
+        _cmd_reports(args)
+
+        payload = json.loads((work_dir / "meta_findings.json").read_text())
+        kinds = {a.get("kind") for a in payload["target_system_asks"]}
+        assert "instrumentation" in kinds, payload["target_system_asks"]
