Skip to content

Commit 6d4345e

Browse files
sriumcpclaude
andauthored
Tracking AI-native-Systems-Research#176: sort_bench-surfaced integration gaps (3 children) (AI-native-Systems-Research#180)
* fix: wire update_best_found into iteration finalize — best_found.json now written on every iteration The sort_bench dry-run on 2026-05-25 surfaced a Phase-A-without-Phase-B gap: PR AI-native-Systems-Research#172 shipped orchestrator.composite_score.update_best_found with passing unit tests, but no production code path called it after findings.json was finalized. A live `nous run` left best_found.json missing at the work_dir root, which cascaded into the deployment recommendation reporting `fall_back_to_baseline` on a 100%-CONFIRMED campaign (AI-native-Systems-Research#178 covers the cascade hardening; this commit fixes the root cause). What lands: * New `finalize_iteration(work_dir, iter_dir, iteration, campaign)` public seam in orchestrator.iteration. Encapsulates the deterministic post-gate Python steps: 1. _merge_principles (existing). 2. update_best_found (NEW — closes the missing wire-up). 3. claude_md.regenerate_from_disk (existing, best-effort). Tests drive this seam directly with fixture findings; the unit tests for update_best_found that already passed in CI didn't catch the gap because they invoked the function in isolation, not via the production code path. * `_resolve_objective(campaign)` reads `objective` or `objective_preset` from campaign.yaml and returns an ObjectiveSpec. Tolerant of malformed declarations (returns None, falls through to legacy status-based ranking). * run_iteration() now calls finalize_iteration instead of inlining the steps, so the production path and the test path are the same path. Same console output as before, plus a new line confirming best_found.json was updated. Backward-compat: campaigns without an `objective:` block (the sort_bench-style case that surfaced this) still get a populated best_found.json via the legacy CONFIRMED=1.0 / PARTIALLY_CONFIRMED=0.5 / REFUTED=0.0 ranking already implemented in update_best_found. Behavioral tests (tests/test_iteration_finalize.py): 4 cases covering: - best_found.json is written with non-empty top_k (regression for AI-native-Systems-Research#177) - legacy fallback (no objective declared) still produces best_found - objective_preset is honored (compound-return-style weights match) - missing findings.json is tolerated; empty top_k is the result All tests drive `finalize_iteration` directly with fixture state — no live LLM, no subprocess, no orchestrator full-loop simulation. Closes AI-native-Systems-Research#177. Refs AI-native-Systems-Research#176, AI-native-Systems-Research#168, AI-native-Systems-Research#172. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: deployment_recommendation distinguishes missing vs empty best_found.json The sort_bench dry-run on 2026-05-25 produced a `fall_back_to_baseline` verdict with empty caveats on a 100%-CONFIRMED campaign. The verdict was technically conservative-correct (best_found.json was missing, so there was nothing to rank), but the empty caveats list made it indistinguishable from a real "no candidate beat baseline" outcome. This commit hardens make_deployment_recommendation against three distinct failure modes: 1. best_found.json is missing entirely — upstream wiring gap (AI-native-Systems-Research#177's root cause). Caveat now cites the filename, names update_best_found as the function that should have written it, and references issue AI-native-Systems-Research#177 so the operator knows where to look. 2. best_found.json is present but top_k is empty — the legitimate "search ran, nothing beat baseline" case. Caveat distinguishes this from AI-native-Systems-Research#1 by citing the file and the empty-top_k state. 3. best_found.json is present but top_k[0] is corrupt (unexpected type). Caveat reports the actual type observed and points at AI-native-Systems-Research#177 for investigation. The verdict stays `fall_back_to_baseline` in all three cases — that's the conservative, safe answer. What changes is the caveats list, so the operator can act on the real cause rather than misreading silence as a failed campaign. All auto-generated caveats pass meta_findings.validate_caveat (AI-native-Systems-Research#170 floor): each cites a concrete artifact path, a numeric indicator, or an issue/code reference. The validator-floor regression tests assert this directly — vague aspirations cannot ship as deployment caveats regardless of source. Behavioral tests (tests/test_deployment_recommendation.py): 4 new cases covering the missing case, the empty-top_k case, and validator- floor compliance for both. The original sort_bench symptom (silent fall_back with empty caveats) would now be caught by the test_missing_best_found_caveat_cites_filename_and_issue regression. Closes AI-native-Systems-Research#178. Refs AI-native-Systems-Research#176, AI-native-Systems-Research#170, AI-native-Systems-Research#172. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: principles classifier + validator warning for empirical_content adoption The sort_bench dry-run on 2026-05-25 surfaced that extracted principles ship with empirical_content / derivation_type unset because the AI-native-Systems-Research#86 methodology prompt is advisory and the schema treats both fields as optional. RP-2 in that run was a clear empirical observation ("timsort uses 460 comparisons on nearly-sorted input") but was filed silently with both fields None. This commit closes the gap with the A+B composition recommended in issue AI-native-Systems-Research#179: deterministic auto-classifier (A) + soft validator warning (B). No prompt change — the methodology blurb is left intact as a hint, but adoption no longer depends on prompt compliance. What lands: * New orchestrator/principles_classifier.py: - classify_principle(p): pure function returning a copy with empirical_content / derivation_type filled in based on text heuristics. Existing explicit values are preserved (explicit > heuristic). - classify_principles(ps): batch wrapper. - classify_principle_updates_in_place(iter_dir): rewrites runs/iter-N/principle_updates.json atomically. Idempotent: re- running on an already-classified file produces byte-equal output. Heuristic priority: definitional ('by definition', 'is defined as') > algebraic ('iff', 'identity', 'theorem', 'algebraic') > empirical (iter-N + numeric measurement + process verbs). Empirical requires >= 2 markers — a lone iter-N reference is too weak. When neither side fires strongly, fields are left None and the validator warning surfaces the residual. * orchestrator/validate.py gains validate_principles_have_empirical_content(ps). Returns WARN-prefixed strings for category=domain principles with unset fields after classification. Meta-category principles (constraint principles emitted by orchestrator.refute_constraints per AI-native-Systems-Research#169) are exempt — they're orchestrator-emitted facts, not LLM-extracted observations. * orchestrator/iteration.finalize_iteration now calls the classifier BEFORE _merge_principles, so the merged principles.json reflects the tags on its very first write. After the merge, the validator scans principles.json and logs WARN-prefixed messages for any residuals — visible in the run log without rolling back the merge. Behavioral tests (tests/test_principles_classifier.py): 15 cases covering the obvious-empirical case (RP-2 from sort_bench), obvious-algebraic case (RP-9 from AI-native-Systems-Research#84/AI-native-Systems-Research#86), definitional case, explicit-tag preservation, partial-tag fill-in, neutral statement left-unclassified, batch processing, in-place file mutation + idempotence, validator behavior on unset / classified / meta principles, and the end-to-end finalize → classifier → merge path. This commit closes the parent tracker: with AI-native-Systems-Research#177, AI-native-Systems-Research#178, and AI-native-Systems-Research#179 all landed in this PR, every gap surfaced by the sort_bench dry-run is closed. Closes AI-native-Systems-Research#179. Closes AI-native-Systems-Research#176. Refs AI-native-Systems-Research#86, AI-native-Systems-Research#169, AI-native-Systems-Research#170, AI-native-Systems-Research#172, AI-native-Systems-Research#174. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d528db3 commit 6d4345e

7 files changed

Lines changed: 925 additions & 16 deletions

orchestrator/deployment_recommendation.py

Lines changed: 38 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -173,14 +173,48 @@ def make_deployment_recommendation(
173173
caveats) when there's nothing to recommend — never raises.
174174
"""
175175
work_dir = Path(work_dir)
176-
best_found = _read_json(work_dir / "best_found.json")
176+
best_found_path = work_dir / "best_found.json"
177+
best_found = _read_json(best_found_path)
178+
179+
# Issue #178: distinguish "no candidate beat baseline" (genuine
180+
# fall-back) from "best_found.json is missing" (upstream wiring
181+
# gap — see #177). Both keep the conservative fall_back_to_baseline
182+
# verdict, but the caveats now tell the operator what actually
183+
# happened. Each caveat passes meta_findings.validate_caveat
184+
# (cites a concrete artifact name + numeric / issue reference).
185+
if best_found is None:
186+
return DeploymentRecommendation(
187+
verdict="fall_back_to_baseline",
188+
caveats=[
189+
f"best_found.json not present at {best_found_path}; "
190+
f"cannot rank candidates. The iteration finalize step "
191+
f"either did not run or did not call update_best_found. "
192+
f"See issue #177 in orchestrator/iteration.py."
193+
],
194+
)
177195

178-
if not best_found or not best_found.get("top_k"):
179-
return DeploymentRecommendation(verdict="fall_back_to_baseline")
196+
if not best_found.get("top_k"):
197+
return DeploymentRecommendation(
198+
verdict="fall_back_to_baseline",
199+
caveats=[
200+
f"best_found.json present at {best_found_path} but "
201+
f"top_k is empty (k={best_found.get('k', 0)}); no "
202+
f"candidate scored above baseline across the iterations "
203+
f"recorded in runs/iter-N/findings.json."
204+
],
205+
)
180206

181207
top = best_found["top_k"][0]
182208
if not isinstance(top, dict):
183-
return DeploymentRecommendation(verdict="fall_back_to_baseline")
209+
return DeploymentRecommendation(
210+
verdict="fall_back_to_baseline",
211+
caveats=[
212+
f"best_found.json top_k[0] has unexpected type "
213+
f"{type(top).__name__!r} at {best_found_path}; "
214+
f"expected dict. Investigate whether update_best_found "
215+
f"wrote a corrupt entry — see issue #177."
216+
],
217+
)
184218

185219
best_score = float(top.get("score", 0.0))
186220
iteration = int(top.get("iteration", 0))

orchestrator/iteration.py

Lines changed: 105 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -162,6 +162,102 @@ def _enter_phase(engine, phase):
162162
return True
163163

164164

165+
def _resolve_objective(campaign: dict):
166+
"""Resolve campaign.yaml's objective block to an ObjectiveSpec, or None.
167+
168+
Issue #177: the iteration finalize step calls update_best_found with
169+
this objective. Legacy campaigns without `objective` or `objective_preset`
170+
fall through to the legacy status-based ranking inside update_best_found.
171+
"""
172+
if not isinstance(campaign, dict):
173+
return None
174+
from orchestrator.composite_score import ObjectiveSpec, get_preset
175+
176+
if (preset := campaign.get("objective_preset")):
177+
try:
178+
return get_preset(str(preset))
179+
except ValueError:
180+
return None
181+
182+
obj = campaign.get("objective")
183+
if isinstance(obj, dict) and obj.get("weights"):
184+
try:
185+
return ObjectiveSpec(
186+
weights={str(k): float(v) for k, v in obj["weights"].items()},
187+
metric_extractors=dict(obj.get("metric_extractors") or {}),
188+
deploy_threshold=float(obj.get("deploy_threshold", 0.1)),
189+
)
190+
except (TypeError, ValueError):
191+
return None
192+
return None
193+
194+
195+
def finalize_iteration(
196+
*,
197+
work_dir: Path,
198+
iter_dir: Path,
199+
iteration: int,
200+
campaign: dict,
201+
) -> None:
202+
"""Run the deterministic post-gate finalize steps for an iteration.
203+
204+
Public seam (issue #177) so integration tests can drive the same
205+
code path that ``run_iteration`` calls after HUMAN_FINDINGS_GATE
206+
approves. The sort_bench dry-run on 2026-05-25 surfaced the gap:
207+
``update_best_found`` shipped in PR #172 with passing unit tests
208+
but no caller — this function is the caller.
209+
210+
Steps (deterministic Python, no LLM):
211+
1. Classify principle_updates.json in place — fill empirical_content
212+
/ derivation_type from text heuristics (issue #179).
213+
2. Merge ``principle_updates.json`` into ``principles.json``.
214+
3. Re-rank candidates and atomically rewrite ``best_found.json``
215+
(issue #168 / #177).
216+
4. Surface validator warnings for any residual unclassified
217+
domain principles (issue #179, #86).
218+
5. Regenerate per-campaign ``CLAUDE.md`` so the next iteration's
219+
session sees the updated principles + handoff (issue #131).
220+
221+
Tolerant of partial fixtures: missing principle_updates.json,
222+
missing findings.json, and CLAUDE.md regeneration failures all
223+
soft-fail — the iteration's terminal artifacts (``best_found.json``,
224+
``principles.json``) are still written.
225+
"""
226+
from orchestrator.composite_score import update_best_found
227+
from orchestrator.principles_classifier import classify_principle_updates_in_place
228+
from orchestrator.validate import validate_principles_have_empirical_content
229+
230+
# Classify BEFORE merge so principles.json reflects the tags on its
231+
# very first write (issue #179).
232+
classify_principle_updates_in_place(iter_dir)
233+
234+
_merge_principles(work_dir, iter_dir)
235+
236+
objective = _resolve_objective(campaign)
237+
update_best_found(work_dir, objective=objective, top_k=5)
238+
239+
# Surface validator warnings for residual unclassified domain
240+
# principles. Advisory only — doesn't roll back the merge.
241+
principles_path = work_dir / "principles.json"
242+
if principles_path.exists():
243+
try:
244+
store = json.loads(principles_path.read_text())
245+
for warning in validate_principles_have_empirical_content(
246+
store.get("principles", []),
247+
):
248+
logger.warning("%s", warning)
249+
except (OSError, json.JSONDecodeError):
250+
pass
251+
252+
# CLAUDE.md regenerate is best-effort; failure here doesn't roll back
253+
# the merged principles or the best_found ranking.
254+
try:
255+
from orchestrator.claude_md import regenerate_from_disk
256+
regenerate_from_disk(work_dir, campaign, iteration=iteration)
257+
except (OSError, RuntimeError) as exc:
258+
logger.warning("Failed to regenerate CLAUDE.md: %s", exc)
259+
260+
165261
def _merge_principles(work_dir: Path, iter_dir: Path) -> None:
166262
"""Merge principle_updates.json into the shared principles.json store."""
167263
updates_path = iter_dir / "principle_updates.json"
@@ -534,19 +630,16 @@ def _max_turns_for(phase_key: str) -> int:
534630
print("Aborted.")
535631
return IterationOutcome.ABORTED
536632

537-
# ─── PRINCIPLE MERGE (Python, no LLM) ─────────────────────────────────
538-
_merge_principles(work_dir, iter_dir)
633+
# ─── FINALIZE: merge principles + write best_found.json + CLAUDE.md ───
634+
# Issue #177: the sort_bench dry-run on 2026-05-25 surfaced that
635+
# update_best_found (#168) had no caller in the production path.
636+
# finalize_iteration is the caller. Tests drive it directly.
637+
finalize_iteration(
638+
work_dir=work_dir, iter_dir=iter_dir,
639+
iteration=iteration, campaign=campaign,
640+
)
539641
print(f" -> Principles merged into {work_dir / 'principles.json'}")
540-
541-
# ─── CLAUDE.md REGENERATE (Python, no LLM) — issue #131 ───────────────
542-
# Refresh per-campaign CLAUDE.md so the next iteration's session loads
543-
# the updated principles + handoff via Claude Code's auto-context loading.
544-
try:
545-
from orchestrator.claude_md import regenerate_from_disk
546-
regenerate_from_disk(work_dir, campaign, iteration=iteration)
547-
except (OSError, RuntimeError) as exc:
548-
# Best-effort: a CLAUDE.md write failure shouldn't abort the iteration.
549-
logger.warning("Failed to regenerate CLAUDE.md: %s", exc)
642+
print(f" -> best_found.json updated at {work_dir / 'best_found.json'}")
550643

551644
if final:
552645
engine.transition("DONE")
Lines changed: 183 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,183 @@
1+
"""Deterministic post-extraction classifier for principle empiricism (issue #179).
2+
3+
The sort_bench dry-run on 2026-05-25 surfaced that extracted principles
4+
ship with `empirical_content` and `derivation_type` (issue #86) unset
5+
because the methodology prompt is advisory and the schema treats them
6+
as optional. RP-2 in that run was a clear empirical observation
7+
(*"timsort uses 460 comparisons on nearly-sorted input"*) but was
8+
silently filed without tags.
9+
10+
This module provides a deterministic Python heuristic that runs on
11+
``principle_updates.json`` before merge into ``principles.json``,
12+
filling the fields when the statement is classifiable. Residual
13+
unclassifiable principles are caught by the validator warning in
14+
``orchestrator.validate.validate_principles_have_empirical_content``.
15+
16+
Approach: composable A+B from issue #179.
17+
* A: This module — deterministic auto-classifier.
18+
* B: ``validate.py`` — soft validator emitting WARN on residual misses.
19+
20+
Heuristic priority:
21+
1. Existing explicit tags are preserved (explicit > heuristic).
22+
2. Algebraic markers (`iff`, `algebraic`, `identity`, `theorem`) → algebraic.
23+
3. Definitional markers (`is defined as`, `by definition`) → definitional.
24+
4. Empirical markers (`iter-N`, numeric measurements with units,
25+
`observed`, `measured`, `experiments`) → empirical.
26+
5. Otherwise leave None — the validator warning surfaces to the human.
27+
28+
No LLM, no live calls. Tests assert on the heuristic's verdicts for
29+
known statement shapes.
30+
"""
31+
from __future__ import annotations
32+
33+
import json
34+
import re
35+
from copy import deepcopy
36+
from pathlib import Path
37+
38+
from orchestrator.util import atomic_write
39+
40+
41+
_ALGEBRAIC_MARKERS = (
42+
re.compile(r"\biff\b", re.IGNORECASE),
43+
re.compile(r"\bif\s+and\s+only\s+if\b", re.IGNORECASE),
44+
re.compile(r"\balgebraic(?:ally)?\b", re.IGNORECASE),
45+
re.compile(r"\bidentity\b", re.IGNORECASE),
46+
re.compile(r"\bequivalent(?:ly)?\s+to\b", re.IGNORECASE),
47+
re.compile(r"\bfollows\s+from\b", re.IGNORECASE),
48+
re.compile(r"\btheorem\b", re.IGNORECASE),
49+
re.compile(r"\baxiom\b", re.IGNORECASE),
50+
re.compile(r"\bproof\b", re.IGNORECASE),
51+
)
52+
53+
_DEFINITIONAL_MARKERS = (
54+
re.compile(r"\bis\s+defined\s+as\b", re.IGNORECASE),
55+
re.compile(r"\bby\s+definition\b", re.IGNORECASE),
56+
re.compile(r"\bdefinitional(?:ly)?\b", re.IGNORECASE),
57+
)
58+
59+
_EMPIRICAL_MARKERS = (
60+
# Iteration / arm citations
61+
re.compile(r"\biter[-_ ]?\d+\b", re.IGNORECASE),
62+
re.compile(r"\barm[-_]?\w+\b", re.IGNORECASE),
63+
# Empirical-process verbs
64+
re.compile(r"\bobserved\b", re.IGNORECASE),
65+
re.compile(r"\bmeasured\b", re.IGNORECASE),
66+
re.compile(r"\bfound\s+that\b", re.IGNORECASE),
67+
re.compile(r"\bexperiments?\b", re.IGNORECASE),
68+
re.compile(r"\bdiscover(?:ed|y)?\b", re.IGNORECASE),
69+
re.compile(r"\bempirical(?:ly)?\b", re.IGNORECASE),
70+
# Numeric measurements with units (high signal)
71+
re.compile(
72+
r"\b\d+(?:\.\d+)?\s*"
73+
r"(?:%|ms|us|s|MB|GB|comparisons?|tokens?|seeds?|x)\b",
74+
re.IGNORECASE,
75+
),
76+
# Concrete equations / values: "= 460", "approximately 0.85"
77+
re.compile(r"=\s*\d{2,}"),
78+
re.compile(r"\bratio\s*=?\s*\d", re.IGNORECASE),
79+
)
80+
81+
82+
def classify_principle(p: dict) -> dict:
83+
"""Return a copy of ``p`` with ``empirical_content`` / ``derivation_type``
84+
filled in if the heuristic fires and the field is currently unset.
85+
86+
Pure: does not mutate the input. Existing values are preserved
87+
(explicit > heuristic). When neither side fires strongly, returns
88+
a copy with the fields still ``None`` — the validator warning
89+
surfaces the residual to the human.
90+
"""
91+
if not isinstance(p, dict):
92+
return p # malformed; let downstream validators catch it
93+
out = deepcopy(p)
94+
95+
# If both fields are already set, no change.
96+
has_empirical = out.get("empirical_content") is not None
97+
has_derivation = out.get("derivation_type") is not None
98+
if has_empirical and has_derivation:
99+
return out
100+
101+
statement = str(out.get("statement") or "")
102+
algebraic_hits = sum(1 for r in _ALGEBRAIC_MARKERS if r.search(statement))
103+
definitional_hits = sum(1 for r in _DEFINITIONAL_MARKERS if r.search(statement))
104+
empirical_hits = sum(1 for r in _EMPIRICAL_MARKERS if r.search(statement))
105+
106+
# Case 1: ``empirical_content`` was explicitly set; derivation_type
107+
# follows. True ⇒ empirical; False ⇒ algebraic or definitional
108+
# depending on which marker family dominates.
109+
if has_empirical and not has_derivation:
110+
if out.get("empirical_content") is True:
111+
out["derivation_type"] = "empirical"
112+
else:
113+
if definitional_hits >= 1 and definitional_hits >= algebraic_hits:
114+
out["derivation_type"] = "definitional"
115+
else:
116+
out["derivation_type"] = "algebraic"
117+
return out
118+
119+
# Case 2: ``derivation_type`` was explicitly set; empirical_content
120+
# follows by definition (only "empirical" → True; the others → False).
121+
if has_derivation and not has_empirical:
122+
out["empirical_content"] = (out.get("derivation_type") == "empirical")
123+
return out
124+
125+
# Case 3: neither set. Apply the heuristic with priority:
126+
# definitional > algebraic > empirical.
127+
128+
# Definitional markers are most specific — "is defined as" /
129+
# "by definition" override algebraic markers that may co-occur.
130+
if definitional_hits >= 1:
131+
out["empirical_content"] = False
132+
out["derivation_type"] = "definitional"
133+
return out
134+
135+
# Algebraic markers — at least one of {iff, theorem, identity, …}
136+
# AND no stronger empirical signal.
137+
if algebraic_hits >= 1 and algebraic_hits >= empirical_hits:
138+
out["empirical_content"] = False
139+
out["derivation_type"] = "algebraic"
140+
return out
141+
142+
# Empirical markers — require at least 2 (single iter-N alone is too
143+
# weak; we want corroborating evidence like a numeric measurement
144+
# or a process verb).
145+
if empirical_hits >= 2 and empirical_hits > algebraic_hits:
146+
out["empirical_content"] = True
147+
out["derivation_type"] = "empirical"
148+
return out
149+
150+
# Neither side fired strongly. Leave fields as-is (likely None) —
151+
# validator will warn for category=domain principles.
152+
return out
153+
154+
155+
def classify_principles(principles: list[dict]) -> list[dict]:
156+
"""Classify a list of principle dicts; returns a new list."""
157+
if not isinstance(principles, list):
158+
return principles
159+
return [classify_principle(p) for p in principles]
160+
161+
162+
def classify_principle_updates_in_place(iter_dir: Path) -> None:
163+
"""Read ``runs/iter-N/principle_updates.json``, classify, and write back atomically.
164+
165+
No-op if the file is missing or malformed. Idempotent: re-running on
166+
an already-classified file produces byte-equal output.
167+
168+
This is the seam ``finalize_iteration`` calls before
169+
``_merge_principles``, so the merged ``principles.json`` reflects
170+
the tags on its very first write.
171+
"""
172+
updates_path = Path(iter_dir) / "principle_updates.json"
173+
if not updates_path.exists():
174+
return
175+
try:
176+
updates = json.loads(updates_path.read_text())
177+
except (OSError, json.JSONDecodeError):
178+
return
179+
if not isinstance(updates, list):
180+
return
181+
182+
classified = classify_principles(updates)
183+
atomic_write(updates_path, json.dumps(classified, indent=2) + "\n")

0 commit comments

Comments
 (0)