Tracking: integration gaps surfaced by sort_bench dry-run (3 children)

Three issues uncovered by the first real campaign that exercised the trackers from #162 / #166 / #173: a tiny `sort_bench` campaign comparing timsort vs naive quicksort. The campaign confirmed both arms cleanly (h-main: 85.3× ratio CONFIRMED, h-control-negative: 2.16× ratio CONFIRMED, prediction accuracy 2/2 = 100%) — but three of the artifacts the merged work was supposed to produce came out wrong or unset.

## How they were found

A single live `nous run` against `/tmp/sort_bench_target/` on 2026-05-25 produced `findings.json` correctly (the per-iteration test path everyone hits in CI) but failed to produce or correctly populate three downstream artifacts. **Three PRs and ~205 unit tests passed; the first real campaign run hit all three gaps immediately.** Classic Phase-A-shipped-but-Phase-B-not-wired pattern (#177, #178); plus a prompt-vs-validator gap (#179).

## Children

- [ ] #177 — `best_found.json` is never written during a real iteration (`update_best_found` exists as a function with passing unit tests, but is not called from any production code path)
- [ ] #178 — `deployment_recommendation` reports `fall_back_to_baseline` with an empty payload whenever `best_found.json` is missing, silently misrepresenting wildly successful campaigns as non-results
- [ ] #179 — extracted principles ship with `empirical_content` / `derivation_type` unset because the methodology prompt is advisory and the schema treats them as optional

## Why one tracker for all three

#177 is the root cause; #178 is the cascade symptom. Fixing #177 makes #178 harmless in the happy path, but #178 is still worth fixing for robustness — a corrupt or partially-written `best_found.json` should surface a clear caveat, not a silent `fall_back_to_baseline` verdict.

#179 is a different surface area (prompt + validator vs orchestrator wiring) but stems from the same lesson: **the unit tests passed, the real campaign exposed the gap.** All three failures are downstream artifacts that no integration test was watching for.

## Single-PR landing

One branch off `upstream/reflective`, three feat commits (one per child), tracking PR title `Tracking #176: sort_bench-surfaced integration gaps (3 children)`. Mirrors #153/#161/#171/#172/#174.

**Recommended ordering:** #177 first (root cause), then #178 (cascade hardening), then #179 (separate concern but composes cleanly with the others' integration test fixture).

## Test discipline (CLAUDE.md)

The missing safeguard is end-to-end tests, not more units. The fix here MUST include integration tests:
  * Set up `runs/iter-N/findings.json` + `principle_updates.json` as a fixture (no LLM).
  * Drive the orchestrator's per-iteration finalize step (the same code path `nous run` exercises).
  * Assert `best_found.json` exists at the work_dir root with non-empty `top_k` (#177).
  * Assert `meta_findings.json::deployment_recommendation` reflects the actual top candidate, not a fall-back, and that the missing/empty cases produce caveats (#178).
  * Assert auto-classified principles have `empirical_content` set on obvious-empirical and obvious-algebraic statements (#179).

No live LLM, MCMC, Optuna trial, or subprocess in tests. The fixture is fully scripted findings + minimal state.json.

## /goal predicate (for the tracking issue)

A PR exists with base `upstream/reflective` and head `sriumcp:<branch>` AND the working tree of that PR satisfies ALL of:
  - the production code path that finalizes a `runs/iter-N/findings.json` also writes `best_found.json` at the work_dir root before transitioning to HUMAN_FINDINGS_GATE
  - `tests/` includes an end-to-end test asserting `best_found.json` exists with non-empty `top_k` after a fixture iteration where the bundle declares ≥ 1 arm with metric metadata
  - `make_deployment_recommendation` distinguishes the missing-best_found case from the no-competitive-candidate case (verdict + caveats)
  - `tests/` includes a regression test for the missing-best_found case asserting the verdict is NOT silently `fall_back_to_baseline` with empty payload
  - A principles classifier exists and is invoked on `principle_updates.json` before merging into `principles.json`
  - `tests/` includes a regression test asserting that an obvious-empirical statement (numeric measurements + iter-N reference) gets tagged `empirical_content: true`, and an obvious-algebraic statement (`iff`, `by definition`) gets tagged `empirical_content: false`
  - The validator emits a WARN when a `category=domain` principle has unset `empirical_content` after auto-classification
  - `pytest -q` exit code is 0
  - the PR body references `Closes #177`, `Closes #178`, `Closes #179`, and `Refs #176`

## Relationship to prior trackers

Flagged as the highest-leverage post-merge work after the sort_bench dry-run on 2026-05-25 (see PR #175's merge for the campaign that surfaced this). Two of three children regressions stem from PR #172 (Tracking #166 — Search-oriented Nous), which shipped `update_best_found` and `make_deployment_recommendation` as Phase-A functions without wiring them into the iteration loop. The third (#179) stems from PR #174 (Tracking #173 — Trustworthy & transferable knowledge), which shipped `empirical_content` / `derivation_type` as schema-additive fields with only an advisory prompt for adoption.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tracking: integration gaps surfaced by sort_bench dry-run (3 children) #176

How they were found

Children

Why one tracker for all three

Single-PR landing

Test discipline (CLAUDE.md)

/goal predicate (for the tracking issue)

Relationship to prior trackers

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Tracking: integration gaps surfaced by sort_bench dry-run (3 children) #176

Description

How they were found

Children

Why one tracker for all three

Single-PR landing

Test discipline (CLAUDE.md)

/goal predicate (for the tracking issue)

Relationship to prior trackers

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions