Post-merge review fixes: reframe lede, quarantine #29 cells, add Wilson CIs #32
Merged
Acts on review feedback that the headline ("scale ties across 15x") was the most
attackable claim while the methodological results were the strongest:
- Reframe lede (findings.md + README TL;DR): lead with (1) small-N-misreads-cells
demo and (2) thinking-net-negative-on-synthesis; DEMOTE "scale ties" to #3 with
both confounds attached up front (cross-quant Q3-vs-FP4, N-asymmetry N=10-vs-N=1).
- Quarantine the 27B/Coder phase-1 reference cells (‡) pending issue #29 — they used
the same flat-vs-nested grader bug this entry fixed. p2/p3 cells unaffected, retained.
- Add a "Statistical honesty" section with Wilson 95% CIs on the thesis-carrying cells:
p3_doc think-vs-no-think CIs are disjoint (real); p3_pm borderline; p3_market +2 is
noise. "net -10" is directionally solid but carried by p3_doc; the cross-model
reproduction (same loop on 27B) is what makes the direction trustworthy.
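
For readers who want to check the intervals, here is a minimal sketch of a Wilson 95% score interval in plain Python. It is not the code used in this PR. The helper names `wilson_interval` and `disjoint` are hypothetical, and reading the p3_market "1/3" and "8/10" figures as passes out of runs is an assumption.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    # The centre is pulled toward 0.5; the pull is strongest when trials is small.
    center = (p_hat + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def disjoint(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two intervals do not overlap at all."""
    return a[1] < b[0] or b[1] < a[0]


# Assumed reading of the p3_market figures: 1 pass in 3 runs, then 8 passes in 10 runs.
small_n = wilson_interval(1, 3)
larger_n = wilson_interval(8, 10)
print(f"N=3:  {small_n[0]:.3f} to {small_n[1]:.3f}")
print(f"N=10: {larger_n[0]:.3f} to {larger_n[1]:.3f}")
print("disjoint:", disjoint(small_n, larger_n))
```

Under that reading, the N=3 interval spans most of the unit range and overlaps the N=10 interval, so a single small-N cell cannot distinguish the two rates. Applying the same `disjoint` check to two arms of a cell is one way to express the "disjoint (real)" versus "borderline" distinction drawn above.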
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Acts on the post-merge review of #31. The bundle was rated high-usefulness but the lede was the weakest part (leading with "scale ties across 15×", the most attackable claim) and one dependency (#29) was open. This addresses the three "if you publish" asks.
Changes
- … (p3_market 1/3→8/10 flip demo) … p3_doc word-limit loop …
- … (‡) pending the regrade. Their p2/p3 cells are unaffected and retained (the cross-model section uses only those).
- … p3_doc think-vs-no-think intervals are disjoint (real); p3_pm borderline; p3_market +2 is noise. "Net −10" is directionally solid but carried by p3_doc; the cross-model reproduction is what makes the direction trustworthy, not the single-model magnitude.

Not in this PR (tracked)
🤖 Generated with Claude Code