Three issues uncovered by the first real campaign that exercised the trackers from #162 / #166 / #173: a tiny sort_bench campaign comparing timsort vs naive quicksort. The campaign confirmed both arms cleanly (h-main: 85.3× ratio CONFIRMED, h-control-negative: 2.16× ratio CONFIRMED, prediction accuracy 2/2 = 100%) — but three of the artifacts the merged work was supposed to produce came out wrong or unset.
How they were found
A single live nous run against /tmp/sort_bench_target/ on 2026-05-25 produced findings.json correctly (the per-iteration test path everyone hits in CI) but failed to produce or correctly populate three downstream artifacts. Three PRs and ~205 unit tests passed; the first real campaign run hit all three gaps immediately. Classic Phase-A-shipped-but-Phase-B-not-wired pattern (#177, #178); plus a prompt-vs-validator gap (#179).
Children
Why one tracker for all three
#177 is the root cause; #178 is the cascade symptom. Fixing #177 makes #178 harmless in the happy path, but #178 is still worth fixing for robustness — a corrupt or partially-written best_found.json should surface a clear caveat, not a silent fall_back_to_baseline verdict.
#179 is a different surface area (prompt + validator vs orchestrator wiring) but stems from the same lesson: the unit tests passed, the real campaign exposed the gap. All three failures are downstream artifacts that no integration test was watching for.
Single-PR landing
One branch off upstream/reflective, three feat commits (one per child), tracking PR title Tracking #176: sort_bench-surfaced integration gaps (3 children). Mirrors #153/#161/#171/#172/#174.
Recommended ordering: #177 first (root cause), then #178 (cascade hardening), then #179 (separate concern but composes cleanly with the others' integration test fixture).
Test discipline (CLAUDE.md)
The missing safeguard is end-to-end tests, not more units. The fix here MUST include integration tests:
No live LLM, MCMC, Optuna trial, or subprocess in tests. The fixture is fully scripted findings + minimal state.json.
/goal predicate (for the tracking issue)
A PR exists with base upstream/reflective and head sriumcp:<branch> AND the working tree of that PR satisfies ALL of:
- the production code path that finalizes a
runs/iter-N/findings.json also writes best_found.json at the work_dir root before transitioning to HUMAN_FINDINGS_GATE
tests/ includes an end-to-end test asserting best_found.json exists with non-empty top_k after a fixture iteration where the bundle declares ≥ 1 arm with metric metadata
make_deployment_recommendation distinguishes the missing-best_found case from the no-competitive-candidate case (verdict + caveats)
tests/ includes a regression test for the missing-best_found case asserting the verdict is NOT silently fall_back_to_baseline with empty payload
- A principles classifier exists and is invoked on
principle_updates.json before merging into principles.json
tests/ includes a regression test asserting that an obvious-empirical statement (numeric measurements + iter-N reference) gets tagged empirical_content: true, and an obvious-algebraic statement (iff, by definition) gets tagged empirical_content: false
- The validator emits a WARN when a
category=domain principle has unset empirical_content after auto-classification
pytest -q exit code is 0
- the PR body references
Closes #177, Closes #178, Closes #179, and Refs #176
Relationship to prior trackers
Flagged as the highest-leverage post-merge work after the sort_bench dry-run on 2026-05-25 (see PR #175's merge for the campaign that surfaced this). Two of three children regressions stem from PR #172 (Tracking #166 — Search-oriented Nous), which shipped update_best_found and make_deployment_recommendation as Phase-A functions without wiring them into the iteration loop. The third (#179) stems from PR #174 (Tracking #173 — Trustworthy & transferable knowledge), which shipped empirical_content / derivation_type as schema-additive fields with only an advisory prompt for adoption.
Three issues uncovered by the first real campaign that exercised the trackers from #162 / #166 / #173: a tiny
sort_benchcampaign comparing timsort vs naive quicksort. The campaign confirmed both arms cleanly (h-main: 85.3× ratio CONFIRMED, h-control-negative: 2.16× ratio CONFIRMED, prediction accuracy 2/2 = 100%) — but three of the artifacts the merged work was supposed to produce came out wrong or unset.How they were found
A single live
nous runagainst/tmp/sort_bench_target/on 2026-05-25 producedfindings.jsoncorrectly (the per-iteration test path everyone hits in CI) but failed to produce or correctly populate three downstream artifacts. Three PRs and ~205 unit tests passed; the first real campaign run hit all three gaps immediately. Classic Phase-A-shipped-but-Phase-B-not-wired pattern (#177, #178); plus a prompt-vs-validator gap (#179).Children
best_found.jsonis never written during a real iteration (update_best_foundexists as a function with passing unit tests, but is not called from any production code path)deployment_recommendationreportsfall_back_to_baselinewith an empty payload wheneverbest_found.jsonis missing, silently misrepresenting wildly successful campaigns as non-resultsempirical_content/derivation_typeunset because the methodology prompt is advisory and the schema treats them as optionalWhy one tracker for all three
#177 is the root cause; #178 is the cascade symptom. Fixing #177 makes #178 harmless in the happy path, but #178 is still worth fixing for robustness — a corrupt or partially-written
best_found.jsonshould surface a clear caveat, not a silentfall_back_to_baselineverdict.#179 is a different surface area (prompt + validator vs orchestrator wiring) but stems from the same lesson: the unit tests passed, the real campaign exposed the gap. All three failures are downstream artifacts that no integration test was watching for.
Single-PR landing
One branch off
upstream/reflective, three feat commits (one per child), tracking PR titleTracking #176: sort_bench-surfaced integration gaps (3 children). Mirrors #153/#161/#171/#172/#174.Recommended ordering: #177 first (root cause), then #178 (cascade hardening), then #179 (separate concern but composes cleanly with the others' integration test fixture).
Test discipline (CLAUDE.md)
The missing safeguard is end-to-end tests, not more units. The fix here MUST include integration tests:
runs/iter-N/findings.json+principle_updates.jsonas a fixture (no LLM).nous runexercises).best_found.jsonexists at the work_dir root with non-emptytop_k(bug: best_found.json never written during real iteration — update_best_found not wired in #177).meta_findings.json::deployment_recommendationreflects the actual top candidate, not a fall-back, and that the missing/empty cases produce caveats (bug: deployment_recommendation silently falls back when best_found.json is missing #178).empirical_contentset on obvious-empirical and obvious-algebraic statements (ux: extracted principles ship with empirical_content/derivation_type unset #179).No live LLM, MCMC, Optuna trial, or subprocess in tests. The fixture is fully scripted findings + minimal state.json.
/goal predicate (for the tracking issue)
A PR exists with base
upstream/reflectiveand headsriumcp:<branch>AND the working tree of that PR satisfies ALL of:runs/iter-N/findings.jsonalso writesbest_found.jsonat the work_dir root before transitioning to HUMAN_FINDINGS_GATEtests/includes an end-to-end test assertingbest_found.jsonexists with non-emptytop_kafter a fixture iteration where the bundle declares ≥ 1 arm with metric metadatamake_deployment_recommendationdistinguishes the missing-best_found case from the no-competitive-candidate case (verdict + caveats)tests/includes a regression test for the missing-best_found case asserting the verdict is NOT silentlyfall_back_to_baselinewith empty payloadprinciple_updates.jsonbefore merging intoprinciples.jsontests/includes a regression test asserting that an obvious-empirical statement (numeric measurements + iter-N reference) gets taggedempirical_content: true, and an obvious-algebraic statement (iff,by definition) gets taggedempirical_content: falsecategory=domainprinciple has unsetempirical_contentafter auto-classificationpytest -qexit code is 0Closes #177,Closes #178,Closes #179, andRefs #176Relationship to prior trackers
Flagged as the highest-leverage post-merge work after the sort_bench dry-run on 2026-05-25 (see PR #175's merge for the campaign that surfaced this). Two of three children regressions stem from PR #172 (Tracking #166 — Search-oriented Nous), which shipped
update_best_foundandmake_deployment_recommendationas Phase-A functions without wiring them into the iteration loop. The third (#179) stems from PR #174 (Tracking #173 — Trustworthy & transferable knowledge), which shippedempirical_content/derivation_typeas schema-additive fields with only an advisory prompt for adoption.