Post-merge review fixes: reframe lede, quarantine #29 cells, add Wilson CIs #32

Merged: Lightheartdevs merged 1 commit into main from review-fixes-397b-n10-2026-05-30 on May 30, 2026

@Lightheartdevs

Acts on the post-merge review of #31. The bundle was rated high-usefulness but the lede was the weakest part (leading with "scale ties across 15×", the most attackable claim) and one dependency (#29) was open. This addresses the three "if you publish" asks.

Changes

  1. Reframe the lede (findings.md headlines + README TL;DR). It now leads with the two strongest, hardest-to-dismiss results:
    • small-N misreads cells (the p3_market 1/3→8/10 flip demo)
    • thinking is net-negative on constraint-bound synthesis, cross-validated across ~15× scale via the same p3_doc word-limit loop
    "Aggregate ties" is demoted to a supporting line with both confounds attached up front: cross-quant (Q3 vs FP4, not a clean scale axis) and N-asymmetry (only 397B is N=10).
  2. Quarantine the cells covered by #29 ("Regrade historical phase-1 cells (27B/Coder) — flat-vs-nested metric bug (fixed in #28)"): the 27B/Coder phase-1 reference cells (graded with the same flat-vs-nested bug this entry fixed) are withheld (‡) pending the regrade. The 27B/Coder p2/p3 cells are unaffected and retained (the cross-model section uses only those).
  3. Statistical honesty section — Wilson 95% CIs on the thesis-carrying cells: p3_doc think-vs-no-think intervals are disjoint (real); p3_pm borderline; p3_market +2 is noise. "Net −10" is directionally solid but carried by p3_doc; the cross-model reproduction is what makes the direction trustworthy, not the single-model magnitude.
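For readers who want to check the interval arithmetic, here is a minimal sketch of a Wilson 95% score interval and a disjointness test. It is an illustration, not the repo's actual grading code: the function names `wilson_interval` and `intervals_disjoint` are hypothetical, and the only counts used are the 1/3 and 8/10 figures from the p3_market flip demo mentioned above. Counts for the other cells are not stated in this PR, so they are not shown.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.959964 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    # Centre is shrunk toward 0.5, which is what keeps small-N intervals sane.
    centre = (p_hat + z2 / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z2 / (4.0 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def intervals_disjoint(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when the two closed intervals share no point."""
    return a[1] < b[0] or b[1] < a[0]


if __name__ == "__main__":
    small_n = wilson_interval(1, 3)    # the N=3 reading of the cell
    larger_n = wilson_interval(8, 10)  # the same cell at N=10
    print("1/3  ->", small_n)
    print("8/10 ->", larger_n)
    print("disjoint:", intervals_disjoint(small_n, larger_n))
```

At N=3 the interval runs from roughly 0.06 to 0.79, so a 1/3 reading is compatible with almost any underlying pass rate. At N=10 the 8/10 interval is roughly 0.49 to 0.94. The two overlap, which is the small-N point in miniature: the N=3 result could not rule out the rate later observed at N=10.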

Not in this PR (tracked)

🤖 Generated with Claude Code

…ilson CIs

Acts on review feedback that the headline ("scale ties across 15x") was the most
attackable claim while the methodological results were the strongest:

- Reframe lede (findings.md + README TL;DR): lead with (1) small-N-misreads-cells
  demo and (2) thinking-net-negative-on-synthesis; DEMOTE "scale ties" to #3 with
  both confounds attached up front (cross-quant Q3-vs-FP4, N-asymmetry N=10-vs-N=1).
- Quarantine the 27B/Coder phase-1 reference cells (‡) pending issue #29 — they used
  the same flat-vs-nested grader bug this entry fixed. p2/p3 cells unaffected, retained.
- Add a "Statistical honesty" section with Wilson 95% CIs on the thesis-carrying cells:
  p3_doc think-vs-no-think CIs are disjoint (real); p3_pm borderline; p3_market +2 is
  noise. "net -10" is directionally solid but carried by p3_doc; the cross-model
  reproduction (same loop on 27B) is what makes the direction trustworthy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Lightheartdevs merged commit fd27f37 into main on May 30, 2026
1 check passed