Tracking: integration gaps surfaced by sort_bench dry-run (3 children) #176

@sriumcp

Three issues were uncovered by the first real campaign that exercised the trackers from #162 / #166 / #173: a tiny sort_bench campaign comparing timsort vs naive quicksort. The campaign confirmed both arms cleanly (h-main: 85.3× ratio CONFIRMED, h-control-negative: 2.16× ratio CONFIRMED, prediction accuracy 2/2 = 100%), but three of the artifacts the merged work was supposed to produce came out wrong or unset.

How they were found

A single live nous run against /tmp/sort_bench_target/ on 2026-05-25 produced findings.json correctly (the per-iteration test path everyone hits in CI) but failed to produce or correctly populate three downstream artifacts. Three PRs and ~205 unit tests passed; the first real campaign run hit all three gaps immediately. This is the classic Phase-A-shipped-but-Phase-B-not-wired pattern (#177, #178), plus a prompt-vs-validator gap (#179).

Children

Why one tracker for all three

#177 is the root cause; #178 is the cascade symptom. Fixing #177 makes #178 harmless in the happy path, but #178 is still worth fixing for robustness — a corrupt or partially-written best_found.json should surface a clear caveat, not a silent fall_back_to_baseline verdict.
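The robustness argument for #178 can be sketched as follows. This is a minimal illustration, not the actual make_deployment_recommendation: the helper names (`load_best_found`, `recommend`), the `best_found_unavailable` and `deploy_candidate` verdicts, and the caveat strings are all hypothetical. Only best_found.json, top_k, and fall_back_to_baseline come from this issue.

```python
import json
from pathlib import Path


def load_best_found(work_dir):
    """Return (payload, caveat); payload is None when the file is unusable.

    Hypothetical helper: the real loader and its schema may differ.
    """
    path = Path(work_dir) / "best_found.json"
    if not path.exists():
        return None, "best_found.json is missing"
    try:
        payload = json.loads(path.read_text())
    except ValueError:
        # json.JSONDecodeError and UnicodeDecodeError both subclass ValueError,
        # so this covers truncated, empty, and mis-encoded files.
        return None, "best_found.json is corrupt or partially written"
    if not isinstance(payload, dict) or not isinstance(payload.get("top_k"), list):
        return None, "best_found.json has no usable top_k list"
    return payload, None


def recommend(work_dir):
    """Keep 'artifact unavailable' distinct from 'no competitive candidate'."""
    payload, caveat = load_best_found(work_dir)
    if caveat is not None:
        # Hypothetical verdict name: surfaces the problem instead of hiding it.
        return {"verdict": "best_found_unavailable", "caveats": [caveat]}
    if not payload["top_k"]:
        return {
            "verdict": "fall_back_to_baseline",
            "caveats": ["no competitive candidate in top_k"],
        }
    return {"verdict": "deploy_candidate", "candidate": payload["top_k"][0], "caveats": []}
```

The point of the split is that a reader of the verdict can tell a wiring failure from a genuine "nothing beat the baseline" result without inspecting the work_dir by hand.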

#179 is a different surface area (prompt + validator vs orchestrator wiring) but stems from the same lesson: the unit tests passed, the real campaign exposed the gap. All three failures are downstream artifacts that no integration test was watching for.

Single-PR landing

One branch off upstream/reflective, three feat commits (one per child), tracking PR title Tracking #176: sort_bench-surfaced integration gaps (3 children). Mirrors #153/#161/#171/#172/#174.

Recommended ordering: #177 first (root cause), then #178 (cascade hardening), then #179 (separate concern but composes cleanly with the others' integration test fixture).

Test discipline (CLAUDE.md)

The missing safeguard is end-to-end tests, not more unit tests. The fix here MUST include integration tests:

No live LLM, MCMC, Optuna trial, or subprocess in tests. The fixture is fully scripted findings + minimal state.json.
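A fully scripted fixture of this kind might look like the sketch below. Everything about the shapes is an assumption made for illustration: the findings.json schema (`arms`, `metric`), the state.json contents, the "higher ratio ranks first" ordering, and `finalize_iteration`, which is a stand-in for whatever production function finalizes an iteration. In the real test the stand-in would be replaced by a call into the orchestrator.

```python
import json
import tempfile
from pathlib import Path


def write_fixture(work_dir, iteration=1):
    """Write scripted findings plus a minimal state.json (no LLM, no subprocess)."""
    iter_dir = Path(work_dir) / "runs" / f"iter-{iteration}"
    iter_dir.mkdir(parents=True)
    findings = {  # hypothetical schema; arm ids and ratios are from the dry-run
        "arms": [
            {"id": "h-main", "metric": {"name": "ratio", "value": 85.3}},
            {"id": "h-control-negative", "metric": {"name": "ratio", "value": 2.16}},
        ]
    }
    (iter_dir / "findings.json").write_text(json.dumps(findings))
    (Path(work_dir) / "state.json").write_text(json.dumps({"iteration": iteration}))
    return iter_dir


def finalize_iteration(work_dir, iteration=1):
    """Stand-in for the production finalize path (hypothetical)."""
    findings_path = Path(work_dir) / "runs" / f"iter-{iteration}" / "findings.json"
    findings = json.loads(findings_path.read_text())
    scored = [arm for arm in findings["arms"] if "metric" in arm]
    top_k = sorted(scored, key=lambda arm: arm["metric"]["value"], reverse=True)
    # best_found.json lands at the work_dir root before the gate transition.
    (Path(work_dir) / "best_found.json").write_text(json.dumps({"top_k": top_k}))
    return "HUMAN_FINDINGS_GATE"


def test_best_found_written_before_gate():
    with tempfile.TemporaryDirectory() as work_dir:
        write_fixture(work_dir)
        next_state = finalize_iteration(work_dir)
        best = json.loads((Path(work_dir) / "best_found.json").read_text())
        assert next_state == "HUMAN_FINDINGS_GATE"
        assert best["top_k"], "top_k must be non-empty when an arm has metric metadata"
```

Because the fixture only touches a temporary directory, the same setup can be reused by the #178 and #179 tests.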

/goal predicate (for the tracking issue)

A PR exists with base upstream/reflective and head sriumcp:<branch> AND the working tree of that PR satisfies ALL of:

  • the production code path that finalizes a runs/iter-N/findings.json also writes best_found.json at the work_dir root before transitioning to HUMAN_FINDINGS_GATE
  • tests/ includes an end-to-end test asserting best_found.json exists with non-empty top_k after a fixture iteration where the bundle declares ≥ 1 arm with metric metadata
  • make_deployment_recommendation distinguishes the missing-best_found case from the no-competitive-candidate case (verdict + caveats)
  • tests/ includes a regression test for the missing-best_found case asserting the verdict is NOT silently fall_back_to_baseline with empty payload
  • a principles classifier exists and is invoked on principle_updates.json before merging into principles.json
  • tests/ includes a regression test asserting that an obvious-empirical statement (numeric measurements + iter-N reference) gets tagged empirical_content: true, and an obvious-algebraic statement (iff, by definition) gets tagged empirical_content: false
  • the validator emits a WARN when a category=domain principle has unset empirical_content after auto-classification
  • pytest -q exit code is 0
  • the PR body references Closes #177, Closes #178, Closes #179, and Refs #176
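The classifier and validator items in the predicate above could be approached with something like the following sketch. It is a deliberately crude heuristic under stated assumptions: the regexes, the function names (`classify`, `tag_principles`), and the principle fields `id` and `statement` are hypothetical, and a real classifier would likely need more than pattern matching. Only empirical_content, category=domain, and the WARN behaviour come from this issue.

```python
import re

# Hypothetical heuristics: a number with a unit-like suffix, an iter-N
# reference, and definitional markers.
MEASUREMENT = re.compile(r"\d+(?:\.\d+)?\s*(?:×|%|ms\b|x\b)")
ITER_REF = re.compile(r"\biter-\d+\b")
ALGEBRAIC = re.compile(r"\biff\b|\bby definition\b", re.IGNORECASE)


def classify(statement):
    """Return True (empirical), False (algebraic), or None (undecided)."""
    if ALGEBRAIC.search(statement):
        return False
    if MEASUREMENT.search(statement) and ITER_REF.search(statement):
        return True
    return None


def tag_principles(updates):
    """Fill unset empirical_content in place and return validator warnings."""
    warnings = []
    for principle in updates:
        if principle.get("empirical_content") is None:
            principle["empirical_content"] = classify(principle["statement"])
        if principle.get("category") == "domain" and principle["empirical_content"] is None:
            warnings.append(
                f"WARN: domain principle {principle.get('id', '?')} "
                "has unset empirical_content after auto-classification"
            )
    return warnings
```

Returning None for undecided statements, rather than guessing, is what lets the validator warn on domain principles instead of silently accepting a default.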

Relationship to prior trackers

Flagged as the highest-leverage post-merge work after the sort_bench dry-run on 2026-05-25 (see PR #175's merge for the campaign that surfaced this). Two of the three child regressions stem from PR #172 (Tracking #166: Search-oriented Nous), which shipped update_best_found and make_deployment_recommendation as Phase-A functions without wiring them into the iteration loop. The third (#179) stems from PR #174 (Tracking #173: Trustworthy & transferable knowledge), which shipped empirical_content / derivation_type as schema-additive fields with only an advisory prompt for adoption.
