Commit 6d4345e
Tracking AI-native-Systems-Research#176: sort_bench-surfaced integration gaps (3 children) (AI-native-Systems-Research#180)
* fix: wire update_best_found into iteration finalize — best_found.json now written on every iteration
The sort_bench dry-run on 2026-05-25 surfaced a Phase-A-without-Phase-B
gap: PR AI-native-Systems-Research#172 shipped orchestrator.composite_score.update_best_found
with passing unit tests, but no production code path called it after
findings.json was finalized. A live `nous run` left best_found.json
missing at the work_dir root, which cascaded into the deployment
recommendation reporting `fall_back_to_baseline` on a 100%-CONFIRMED
campaign (AI-native-Systems-Research#178 covers the cascade hardening; this commit fixes the
root cause).
What lands:
* New `finalize_iteration(work_dir, iter_dir, iteration, campaign)`
public seam in orchestrator.iteration. Encapsulates the deterministic
post-gate Python steps:
1. _merge_principles (existing).
2. update_best_found (NEW — closes the missing wire-up).
3. claude_md.regenerate_from_disk (existing, best-effort).
Tests drive this seam directly with fixture findings; the unit tests
for update_best_found that already passed in CI didn't catch the gap
because they invoked the function in isolation, not via the
production code path.
* `_resolve_objective(campaign)` reads `objective` or `objective_preset`
from campaign.yaml and returns an ObjectiveSpec. Tolerant of
malformed declarations (returns None, falls through to legacy
status-based ranking).
* run_iteration() now calls finalize_iteration instead of inlining the
steps, so the production path and the test path are the same path.
Same console output as before, plus a new line confirming
best_found.json was updated.
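The ordering of the post-gate steps described above can be sketched as follows. This is a minimal illustration, not the code in orchestrator.iteration: the three step callables are injected as parameters, and their signatures, the `log` hook, and the returned list of completed steps are assumptions made for the example.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List


def finalize_iteration_sketch(
    work_dir: Path,
    iter_dir: Path,
    iteration: int,
    campaign: Dict[str, Any],
    *,
    # Hypothetical signatures: the real helpers may take different arguments.
    merge_principles: Callable[[Path, Path], None],
    update_best_found: Callable[[Path, Dict[str, Any]], None],
    regenerate_claude_md: Callable[[Path], None],
    log: Callable[[str], None] = print,
) -> List[str]:
    """Run the deterministic post-gate steps in a fixed order."""
    completed: List[str] = []

    # 1. Merge this iteration's principle updates (existing step).
    merge_principles(work_dir, iter_dir)
    completed.append("merge_principles")

    # 2. Refresh best_found.json (the step that was previously never called).
    update_best_found(work_dir, campaign)
    completed.append("update_best_found")
    log(f"iter-{iteration}: best_found.json updated")

    # 3. Best-effort regeneration: a failure is logged, not raised.
    try:
        regenerate_claude_md(work_dir)
        completed.append("regenerate_claude_md")
    except Exception as exc:
        log(f"WARN: claude_md regeneration skipped: {exc}")

    return completed
```

Because both the production loop and the tests would call the same function, a missing wire-up of step 2 shows up in a test that only checks the files written after the call.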
Backward-compat: campaigns without an `objective:` block (the
sort_bench-style case that surfaced this) still get a populated
best_found.json via the legacy CONFIRMED=1.0 / PARTIALLY_CONFIRMED=0.5
/ REFUTED=0.0 ranking already implemented in update_best_found.
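The legacy status-based ranking can be pictured with the sketch below. Only the three status weights come from this commit message; the finding field names (`id`, `status`), the `k` parameter, and the shape of each top_k entry are assumptions for illustration.

```python
from typing import Any, Dict, List

# Status weights as stated in the commit message.
LEGACY_STATUS_SCORE = {
    "CONFIRMED": 1.0,
    "PARTIALLY_CONFIRMED": 0.5,
    "REFUTED": 0.0,
}


def rank_by_status(findings: List[Dict[str, Any]], k: int = 3) -> List[Dict[str, Any]]:
    """Return up to k findings ordered by legacy status score, best first."""
    scored = [
        {"id": f["id"], "score": LEGACY_STATUS_SCORE[f["status"]]}
        for f in findings
        # Findings with an unknown or missing status are skipped.
        if f.get("status") in LEGACY_STATUS_SCORE
    ]
    # sorted() is stable, so ties keep their original order.
    return sorted(scored, key=lambda e: e["score"], reverse=True)[:k]
```

With this fallback, a campaign that declares no objective still yields a non-empty top_k as long as at least one finding carries a recognized status.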
Behavioral tests (tests/test_iteration_finalize.py): 4 cases covering:
- best_found.json is written with non-empty top_k (regression for AI-native-Systems-Research#177)
- legacy fallback (no objective declared) still produces best_found
- objective_preset is honored (compound-return-style weights match)
- missing findings.json is tolerated; empty top_k is the result
All tests drive `finalize_iteration` directly with fixture state — no
live LLM, no subprocess, no orchestrator full-loop simulation.
Closes AI-native-Systems-Research#177.
Refs AI-native-Systems-Research#176, AI-native-Systems-Research#168, AI-native-Systems-Research#172.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: deployment_recommendation distinguishes missing vs empty best_found.json
The sort_bench dry-run on 2026-05-25 produced a `fall_back_to_baseline`
verdict with empty caveats on a 100%-CONFIRMED campaign. The verdict
was technically conservative-correct (best_found.json was missing,
so there was nothing to rank), but the empty caveats list made it
indistinguishable from a real "no candidate beat baseline" outcome.
This commit hardens make_deployment_recommendation against three
distinct failure modes:
1. best_found.json is missing entirely — upstream wiring gap
(AI-native-Systems-Research#177's root cause). Caveat now cites the filename, names
update_best_found as the function that should have written it,
and references issue AI-native-Systems-Research#177 so the operator knows where to look.
2. best_found.json is present but top_k is empty — the legitimate
"search ran, nothing beat baseline" case. Caveat distinguishes
this from case 1 by citing the file and the empty-top_k state.
3. best_found.json is present but top_k[0] is corrupt (unexpected
type). Caveat reports the actual type observed and points at
AI-native-Systems-Research#177 for investigation.
The verdict stays `fall_back_to_baseline` in all three cases — that's
the conservative, safe answer. What changes is the caveats list, so
the operator can act on the real cause rather than misreading
silence as a failed campaign.
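The three-way distinction above could look roughly like this. It is a sketch under assumptions, not the body of make_deployment_recommendation: the helper name, the caveat wording, and the expectation that a valid top_k entry is a dict are all hypothetical.

```python
import json
from pathlib import Path
from typing import List


def best_found_caveats(work_dir: Path) -> List[str]:
    """Explain why no candidate can be recommended, or return [] if one can."""
    path = work_dir / "best_found.json"

    # Case 1: the file was never written (upstream wiring gap).
    if not path.exists():
        return [f"{path.name} is missing: update_best_found did not write it (see #177)"]

    data = json.loads(path.read_text(encoding="utf-8"))
    top_k = data.get("top_k") if isinstance(data, dict) else None

    # Case 2: the file exists but ranks nothing (a non-list top_k is treated the same).
    if not isinstance(top_k, list) or len(top_k) == 0:
        return [f"{path.name} is present but top_k is empty (0 entries)"]

    # Case 3: the leading entry is not the assumed dict shape.
    if not isinstance(top_k[0], dict):
        observed = type(top_k[0]).__name__
        return [f"{path.name} top_k[0] has unexpected type {observed} (see #177)"]

    return []
```

In all three non-empty cases the caller would still return `fall_back_to_baseline`; only the attached caveat differs.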
All auto-generated caveats pass meta_findings.validate_caveat
(AI-native-Systems-Research#170 floor): each cites a concrete artifact path, a numeric
indicator, or an issue/code reference. The validator-floor regression
tests assert this directly — vague aspirations cannot ship as
deployment caveats regardless of source.
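A caveat floor of this kind can be approximated with a few patterns. The sketch below is not meta_findings.validate_caveat; the function name, the regular expressions, and the list of recognized file extensions are assumptions chosen to illustrate the three accepted kinds of evidence.

```python
import re

# Hypothetical patterns for the three kinds of concrete evidence.
ARTIFACT_PATH = re.compile(r"\b[\w./-]+\.(?:json|yaml|yml|py|md)\b")
NUMERIC_INDICATOR = re.compile(r"\d")
ISSUE_OR_CODE_REF = re.compile(r"#\d+|\b\w+\(\)|`[^`]+`")


def caveat_meets_floor(caveat: str) -> bool:
    """True if the caveat cites a path, a number, or an issue/code reference."""
    return any(
        pattern.search(caveat)
        for pattern in (ARTIFACT_PATH, NUMERIC_INDICATOR, ISSUE_OR_CODE_REF)
    )
```

A caveat such as "best_found.json is missing (see #177)" passes on two counts, while "Results may not generalize." cites nothing concrete and would be rejected.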
Behavioral tests (tests/test_deployment_recommendation.py): 4 new
cases covering the missing case, the empty-top_k case, and validator-
floor compliance for both. The original sort_bench symptom (silent
fall_back with empty caveats) would now be caught by the
test_missing_best_found_caveat_cites_filename_and_issue regression.
Closes AI-native-Systems-Research#178.
Refs AI-native-Systems-Research#176, AI-native-Systems-Research#170, AI-native-Systems-Research#172.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: principles classifier + validator warning for empirical_content adoption
The sort_bench dry-run on 2026-05-25 surfaced that extracted principles
ship with empirical_content / derivation_type unset because the AI-native-Systems-Research#86
methodology prompt is advisory and the schema treats both fields as
optional. RP-2 in that run was a clear empirical observation
("timsort uses 460 comparisons on nearly-sorted input") but was filed
silently with both fields None.
This commit closes the gap with the A+B composition recommended in
issue AI-native-Systems-Research#179: deterministic auto-classifier (A) + soft validator
warning (B). No prompt change — the methodology blurb is left intact
as a hint, but adoption no longer depends on prompt compliance.
What lands:
* New orchestrator/principles_classifier.py:
- classify_principle(p): pure function returning a copy with
empirical_content / derivation_type filled in based on text
heuristics. Existing explicit values are preserved (explicit >
heuristic).
- classify_principles(ps): batch wrapper.
- classify_principle_updates_in_place(iter_dir): rewrites
runs/iter-N/principle_updates.json atomically. Idempotent: re-
running on an already-classified file produces byte-equal output.
Heuristic priority: definitional ('by definition', 'is defined as')
> algebraic ('iff', 'identity', 'theorem', 'algebraic') > empirical
(iter-N + numeric measurement + process verbs). Empirical requires
>= 2 markers — a lone iter-N reference is too weak.
When no heuristic fires strongly, both fields are left None and the
validator warning surfaces the residual.
* orchestrator/validate.py gains validate_principles_have_empirical_content(ps).
Returns WARN-prefixed strings for category=domain principles with
unset fields after classification. Meta-category principles
(constraint principles emitted by orchestrator.refute_constraints
per AI-native-Systems-Research#169) are exempt — they're orchestrator-emitted facts, not
LLM-extracted observations.
* orchestrator/iteration.finalize_iteration now calls the classifier
BEFORE _merge_principles, so the merged principles.json reflects
the tags on its very first write. After the merge, the validator
scans principles.json and logs WARN-prefixed messages for any
residuals — visible in the run log without rolling back the merge.
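The heuristic priority and the soft validator can be sketched together. The keyword lists for the definitional and algebraic tiers and the two-marker threshold come from this commit message; everything else is an assumption for illustration: the `statement` and `id` field names, the process-verb list, the value vocabulary (`derivation_type` as a string, `empirical_content` as a bool), and the warning text.

```python
import re
from typing import Any, Dict, List, Optional

DEFINITIONAL = ("by definition", "is defined as")
ALGEBRAIC = ("iff", "identity", "theorem", "algebraic")
ITER_REF = re.compile(r"\biter-\d+\b")
NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")
PROCESS_VERB = re.compile(r"\b(?:uses|used|measured|observed|took|ran)\b")  # assumed list


def guess_derivation(text: str) -> Optional[str]:
    """Apply the tiers in priority order; None means no tier fired strongly."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in DEFINITIONAL):
        return "definitional"
    if any(re.search(rf"\b{re.escape(word)}\b", lowered) for word in ALGEBRAIC):
        return "algebraic"
    # Drop iter-N references before counting numbers so "iter-3" is one marker.
    without_iter = ITER_REF.sub("", lowered)
    markers = sum(
        bool(hit)
        for hit in (
            ITER_REF.search(lowered),
            NUMBER.search(without_iter),
            PROCESS_VERB.search(lowered),
        )
    )
    return "empirical" if markers >= 2 else None


def classify_principle(p: Dict[str, Any]) -> Dict[str, Any]:
    """Return a tagged copy; explicit values always win over the heuristic."""
    out = dict(p)
    guess = guess_derivation(out.get("statement", ""))
    if guess is not None:
        if out.get("derivation_type") is None:
            out["derivation_type"] = guess
        if out.get("empirical_content") is None:
            out["empirical_content"] = guess == "empirical"
    return out


def warn_unclassified(principles: List[Dict[str, Any]]) -> List[str]:
    """WARN strings for domain principles with unset fields; meta is exempt."""
    warnings = []
    for p in principles:
        if p.get("category") != "domain":
            continue
        missing = [f for f in ("empirical_content", "derivation_type") if p.get(f) is None]
        if missing:
            warnings.append(f"WARN: principle {p.get('id', '?')} has unset {', '.join(missing)}")
    return warnings
```

Running the classifier before the merge and the validator after it means the warning list contains only the principles the heuristics could not place.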
Behavioral tests (tests/test_principles_classifier.py): 15 cases
covering the obvious-empirical case (RP-2 from sort_bench),
obvious-algebraic case (RP-9 from AI-native-Systems-Research#84/AI-native-Systems-Research#86), definitional case,
explicit-tag preservation, partial-tag fill-in, neutral statement
left-unclassified, batch processing, in-place file mutation +
idempotence, validator behavior on unset / classified / meta
principles, and the end-to-end finalize → classifier → merge path.
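The in-place mutation and idempotence properties exercised by these tests can be illustrated with a generic atomic rewrite. This is a sketch, not classify_principle_updates_in_place: the helper name, the `transform` callback, and the serialization settings (sorted keys, two-space indent, trailing newline) are assumptions.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Callable


def rewrite_json_atomically(path: Path, transform: Callable[[Any], Any]) -> None:
    """Load JSON, transform it, and replace the file in a single rename."""
    data = json.loads(path.read_text(encoding="utf-8"))
    # Deterministic serialization keeps repeated runs byte-equal.
    payload = json.dumps(transform(data), indent=2, sort_keys=True) + "\n"

    # Write to a temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(payload)
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If `transform` leaves already-tagged entries unchanged, a second run serializes the same data and the file content does not change.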
This commit closes the parent tracker: with AI-native-Systems-Research#177, AI-native-Systems-Research#178, and AI-native-Systems-Research#179 all
landed in this PR, every gap surfaced by the sort_bench dry-run is
closed.
Closes AI-native-Systems-Research#179.
Closes AI-native-Systems-Research#176.
Refs AI-native-Systems-Research#86, AI-native-Systems-Research#169, AI-native-Systems-Research#170, AI-native-Systems-Research#172, AI-native-Systems-Research#174.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d528db3 commit 6d4345e
7 files changed
Lines changed: 925 additions & 16 deletions