Post-merge review fixes: reframe lede, quarantine #29 cells, add Wilson CIs by Lightheartdevs · Pull Request #32 · Light-Heart-Labs/MMBT-Messy-Model-Bench-Tests

Lightheartdevs · 2026-05-30T18:21:53Z

Acts on the post-merge review of #31. The bundle was rated high-usefulness but the lede was the weakest part (leading with "scale ties across 15×", the most attackable claim) and one dependency (#29) was open. This addresses the three "if you publish" asks.

Changes

Reframe the lede (findings.md headlines + README TL;DR). Now leads with the two strongest, hardest-to-dismiss results (① and ②) and demotes the former headline to a supporting line (③):
- ① small-N misreads cells (the p3_market 1/3→8/10 flip demo)
- ② thinking is net-negative on constraint-bound synthesis, cross-validated across ~15× scale via the same p3_doc word-limit loop
- ③ "aggregate ties" demoted to a supporting line with both confounds attached up front: cross-quant (Q3 vs FP4 — not a clean scale axis) and N-asymmetry (only 397B is N=10).
Quarantine #29 ("Regrade historical phase-1 cells (27B/Coder) — flat-vs-nested metric bug (fixed in #28)"): the 27B/Coder phase-1 reference cells (graded with the same flat-vs-nested bug this entry fixed) are withheld (‡) pending the regrade. Their p2/p3 cells are unaffected and retained (the cross-model section uses only those).
Statistical honesty section — Wilson 95% CIs on the thesis-carrying cells: p3_doc think-vs-no-think intervals are disjoint (real); p3_pm borderline; p3_market +2 is noise. "Net −10" is directionally solid but carried by p3_doc; the cross-model reproduction is what makes the direction trustworthy, not the single-model magnitude.
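
For readers who want to check the interval arithmetic, here is a minimal sketch of a Wilson score interval and a disjointness check. It is not the repository's grading code: the function names `wilson_interval` and `disjoint` are hypothetical, and the only counts used are the 1/3 and 8/10 from the p3_market flip demo above. The p3_doc counts are not stated in this PR text, so they are not reproduced here.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, trials: int,
                    confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (hypothetical helper)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / trials
                    + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def disjoint(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two (low, high) intervals share no point."""
    return a[1] < b[0] or b[1] < a[0]


# Counts taken from the p3_market flip demo in this PR: 1/3, then 8/10.
small_n = wilson_interval(1, 3)
large_n = wilson_interval(8, 10)

if __name__ == "__main__":
    print(f"1/3  -> {small_n[0]:.3f} .. {small_n[1]:.3f}")
    print(f"8/10 -> {large_n[0]:.3f} .. {large_n[1]:.3f}")
    print("disjoint:", disjoint(small_n, large_n))
```

Under these assumptions the 1/3 interval runs from roughly 0.06 to 0.79 and the 8/10 interval from roughly 0.49 to 0.94, so the two overlap: the N=3 reading was never strong enough to rule out the rate later seen at N=10. Treating disjoint intervals as "real" is a conservative rule of thumb rather than a formal hypothesis test.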

Not in this PR (tracked)

Lift a comparator to N≥3 on market/pm/doc so the tie isn't N=10-vs-N=1 — needs the box (currently running MiniMax-M2.7 N=10); will follow.
Land the #29 regrade itself (this PR only quarantines; the regrade re-runs the old 27B/Coder p1 cells on the fixed grader).

🤖 Generated with Claude Code

…ilson CIs

Acts on review feedback that the headline ("scale ties across 15x") was the most attackable claim while the methodological results were the strongest:

- Reframe lede (findings.md + README TL;DR): lead with (1) small-N-misreads-cells demo and (2) thinking-net-negative-on-synthesis; DEMOTE "scale ties" to #3 with both confounds attached up front (cross-quant Q3-vs-FP4, N-asymmetry N=10-vs-N=1).
- Quarantine the 27B/Coder phase-1 reference cells (‡) pending issue #29; they used the same flat-vs-nested grader bug this entry fixed. p2/p3 cells unaffected, retained.
- Add a "Statistical honesty" section with Wilson 95% CIs on the thesis-carrying cells: p3_doc think-vs-no-think CIs are disjoint (real); p3_pm borderline; p3_market +2 is noise. "net -10" is directionally solid but carried by p3_doc; the cross-model reproduction (same loop on 27B) is what makes the direction trustworthy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Lightheartdevs merged commit fd27f37 into main May 30, 2026
1 check passed

Lightheartdevs mentioned this pull request May 31, 2026

Add MiniMax-M2.7-NVFP4 (N=5, TP=2): temp serving-trap + exhaustive-completer findings #33

Merged

Lightheartdevs merged 1 commit into main from review-fixes-397b-n10-2026-05-30
