[N=3] Qwen3.5-397B-A17B firmed results + P2a guard (#28 merged without it) #30
Run names + the idempotent skip are keyed by LABEL only, so running --thinking off then on under one label silently skipped the second arm as already-complete. Mirror the reasoning-effort guard: require the mode encoded in the label (nothink / think), reject otherwise with a corrected example. Closes the last review item; this was deferred while the N=3 run was executing the script.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
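A minimal sketch of the kind of guard this commit describes, not the script's actual code. The function name `check_label`, the `REQUIRED_TOKEN` mapping, and the separator set are all assumptions made for illustration. The label is split into tokens before matching because `think` is a substring of `nothink`, so a plain substring test would accept the wrong arm.

```python
import re

# Hypothetical mapping from the --thinking value to the token the label must carry.
REQUIRED_TOKEN = {"off": "nothink", "on": "think"}


def check_label(label: str, thinking: str) -> None:
    """Raise ValueError unless the label encodes the requested thinking mode."""
    token = REQUIRED_TOKEN[thinking]
    other = REQUIRED_TOKEN["on" if thinking == "off" else "off"]
    # Split on common separators so "nothink" never matches as "think".
    parts = re.split(r"[-_.]", label.lower())
    if token in parts:
        return
    # Build a corrected example: drop the wrong-mode token, append the right one.
    kept = [p for p in parts if p and p != other]
    suggestion = "-".join(kept + [token])
    raise ValueError(
        f"label {label!r} does not encode --thinking {thinking}; "
        f"try something like {suggestion!r}"
    )
```

With this shape, `check_label("q397b-nothink", "off")` passes silently, while the same label with `"on"` is rejected before any run name or skip check is computed, so the second arm can no longer be mistaken for an already-complete run.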
Both arms complete at N=3 (graded with the fixed phase1_grade.py). no-think 23/36, think 22/36. Retire the N=1 "provisional" framing; findings.md / README / manifest / QUALITATIVE now carry the N=3 numbers.

Key N=3 corrections to the N=1 story:
- Thinking is net -1 but NOT inert (N=1's read): it redistributes. It stabilizes p3_market (1/3->3/3, zero runaways) while hurting p3_pm (2/3->0/3) and p3_doc (2/3->1/3). Changes WHERE 397B succeeds, not how often.
- N=3 exposes single-draw luck: no-think p3_market (N1 PASS->1/3), p3_pm (N1 FAIL->2/3), p3_doc (PASS->2/3) are high-variance; mid-tier + bugfix are 3/3.
- Runaway resistance confirmed across all 72 cells (incl. p3_market think 3/3).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
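The "net -1 but not inert" arithmetic can be checked by hand from the three cells named above. The sketch below uses only those three pass counts; the dictionary layout and the `deltas` helper are illustrative assumptions, not the repository's data format.

```python
# Passes out of 3 reps per arm, for the three cells named in the commit message.
no_think = {"p3_market": 1, "p3_pm": 2, "p3_doc": 2}
think = {"p3_market": 3, "p3_pm": 0, "p3_doc": 1}


def deltas(base: dict, other: dict) -> dict:
    """Per-cell change in pass count going from `base` to `other`."""
    return {cell: other[cell] - base[cell] for cell in base}


d = deltas(no_think, think)
net = sum(d.values())
print(d)    # per-cell shifts: +2, -2, -1
print(net)  # -1, the same as the scorecard difference 22 - 23
```

Because these three cells alone sum to -1, which equals the 22/36 vs 23/36 gap, the remaining cells must net to zero between arms. That is what "redistributes" means here: sizeable per-cell movement, almost no change in the total.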
0acba1b to 6a640dc
Retargeted base to main (was stacked on #28's branch). Heads-up: #28 merged at
Self-audit against the grade/summary data caught that "all 72 cells done_signal / never ran away" was false: 69/72 done_signal. The 3 non-clean exits are all no-think: p3_market v2/v3 hit the 500-iter stuck threshold and p3_pm v1 model_stopped. Zero max_tokens/length runaways IS true (the Flash failure mode), but the pathology is stalling, not runaway.

Reframed across findings.md / README / QUALITATIVE / manifest into a sharper finding: 397B stalls quietly rather than running away, and thinking clears the market stall (think p3_market 3/3 done_signal vs no-think 2-stuck/1-pass). Also retired the last stale N=1 status line in QUALITATIVE.md. Scorecard totals (23/36, 22/36) re-verified against grade.json.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
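A rough sketch of the audit described above: tally exit reasons across all cells and separate "runaway" exits from "stall" exits. The `exit_reason` field name and the record layout are assumptions for illustration and may not match the real summary files. The counts are the ones stated in the commit message (69 done_signal, 2 stuck, 1 model_stopped).

```python
from collections import Counter

# Exit reasons treated as runaways (hit the token/length cap), per the commit message.
RUNAWAY_REASONS = ("max_tokens", "length")


def tally_exits(cells: list) -> Counter:
    """Count cells by exit reason; `exit_reason` is a hypothetical field name."""
    return Counter(c["exit_reason"] for c in cells)


def runaway_count(counts: Counter) -> int:
    """Number of cells that ended by hitting a token or length limit."""
    return sum(counts.get(r, 0) for r in RUNAWAY_REASONS)


# Synthetic records reproducing the stated totals, not real run data.
cells = (
    [{"exit_reason": "done_signal"}] * 69
    + [{"exit_reason": "stuck"}] * 2          # no-think p3_market v2/v3
    + [{"exit_reason": "model_stopped"}] * 1  # no-think p3_pm v1
)

counts = tally_exits(cells)
print(counts["done_signal"], sum(counts.values()))  # 69 of 72
print(runaway_count(counts))                        # 0 runaways
```

Keeping the two checks separate is the point of the correction: "no runaways" and "every cell exited cleanly" are different claims, and only the first one held.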
Self-audit (against the grade/summary data)

Verified the published claims against ground truth. One real correction:
Verified accurate (no change): scorecard vs grade.json (23/36, 22/36 and every cell match); testwrite-think |
Draft / WIP. Stacked on #28 (the N=1 + tooling/grader-fix PR) — base is that branch, so this PR's diff is only the N=3 delta. Merge after #28.
Firms the 397B entry from N=1 → N=3, graded with the fixed phase1_grade.py from #28.

Status
What N=3 already shows (no-think)
Replicates expose that two N=1 verdicts were single-draw luck:
- p3_market: N=1 ✓ → 1/3 (high variance, hand-graded citation dimension).
- p3_pm: N=1 ✗ → 2/3 (intermittent under-recall, not systematic).
- p3_doc: N=1 ✓ → 2/3 (occasional 700-word-limit trip).

Zero-variance (all 3 reps): p1_bugfix, the grounded mid-tier (p2_extract/ci/hallucination/triage), p3_business, and the consistent fails (testwrite/refactor/writing). The grader-fix-sensitive cells (bugfix 3/3, testwrite 0/3, refactor 0/3) are perfectly consistent → the P1 fix is stable.
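The zero-variance vs high-variance split above amounts to a simple rule over per-cell pass counts. A minimal sketch, assuming three reps per cell; the `classify` helper and its category labels are illustrative, and the counts are the no-think figures quoted in this description.

```python
def classify(passes: int, reps: int = 3) -> str:
    """Label a cell by how consistent its replicates were."""
    if passes == reps:
        return "stable pass"
    if passes == 0:
        return "stable fail"
    return "high variance"


# No-think pass counts quoted in the description (out of 3 reps each).
no_think = {
    "p3_market": 1,
    "p3_pm": 2,
    "p3_doc": 2,
    "p1_bugfix": 3,
    "testwrite": 0,
    "refactor": 0,
}

for cell, passes in no_think.items():
    print(f"{cell}: {passes}/3 -> {classify(passes)}")
```

Under this rule a single N=1 draw from a "high variance" cell can land on either verdict, which is why the N=1 results for p3_market, p3_pm, and p3_doc did not hold up.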
Remaining before un-drafting
🤖 Generated with Claude Code