[N=3] Qwen3.5-397B-A17B firmed results + P2a guard (#28 merged without it) by Lightheartdevs · Pull Request #30 · Light-Heart-Labs/MMBT-Messy-Model-Bench-Tests

Lightheartdevs · 2026-05-29T21:34:32Z

Draft / WIP. Stacked on #28 (the N=1 + tooling/grader-fix PR) — base is that branch, so this PR's diff is only the N=3 delta. Merge after #28.

Firms the 397B entry from N=1 → N=3, graded with the fixed phase1_grade.py from #28.

Status

✅ No-think arm complete (N=3): 23/36 reps passed, 8/12 cells by majority. The N=1 headline holds.
⏳ Think arm running — column fills on completion.

What N=3 already shows (no-think)

Replicates expose three N=1 verdicts that rested on a single draw; two of them flip under majority vote:

p3_market: N=1 ✓ → 1/3 (high variance, hand-graded citation dimension).
p3_pm: N=1 ✗ → 2/3 (intermittent under-recall, not systematic).
p3_doc: N=1 ✓ → 2/3 (occasional 700-word-limit trip).

Zero-variance (all 3 reps): p1_bugfix, the grounded mid-tier (p2_extract/ci/hallucination/triage), p3_business — and the consistent fails (testwrite/refactor/writing). The grader-fix-sensitive cells (bugfix 3/3, testwrite 0/3, refactor 0/3) are perfectly consistent → the P1 fix is stable.
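The two headline numbers can be recomputed from the per-cell counts listed above. The sketch below is illustrative only: the dict layout and the `tally` helper are hypothetical, cell names are abbreviated as they appear in the text, and it assumes the zero-variance passes are 3/3 and the consistent fails are 0/3 (which reproduces the reported 23/36).

```python
# Per-cell pass counts for the no-think arm at N=3, read from the summary above.
# Layout and helper are hypothetical; this is not taken from phase1_grade.py.
NO_THINK_PASSES = {
    "p1_bugfix": 3,
    "p2_extract": 3, "ci": 3, "hallucination": 3, "triage": 3,
    "p3_business": 3,
    "p3_market": 1, "p3_pm": 2, "p3_doc": 2,
    "testwrite": 0, "refactor": 0, "writing": 0,
}
REPS = 3


def tally(passes, reps):
    """Return (rep passes, rep total, cells passing by majority, cell count)."""
    rep_passes = sum(passes.values())
    # A cell passes by majority when strictly more than half its reps pass.
    majority_cells = sum(1 for n in passes.values() if 2 * n > reps)
    return rep_passes, len(passes) * reps, majority_cells, len(passes)


rep_passes, rep_total, maj, cells = tally(NO_THINK_PASSES, REPS)
print(f"{rep_passes}/{rep_total} reps, {maj}/{cells} cells by majority")
# prints: 23/36 reps, 8/12 cells by majority
```

p3_market (1/3) drops out of the majority count while p3_pm (2/3) enters it, which is why the 8/12 headline is unchanged from N=1 even though individual verdicts moved.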

Remaining before un-drafting

Think N=3 column + think-vs-no-think at N=3 (does "thinking net −1" hold, or was the p3_doc flip also draw-luck?).
Think-arm per-cell variance (esp. p3_market runaway behavior across reps).
Fold N=3 into findings.md / README / manifest; retire the "N=1 provisional" framing.

🤖 Generated with Claude Code

Run names + the idempotent skip are keyed by LABEL only, so running --thinking off then on under one label silently skipped the second arm as already-complete.

Mirror the reasoning-effort guard: require the mode encoded in the label (nothink / think), reject otherwise with a corrected example.

Closes the last review item; this was deferred while the N=3 run was executing the script.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
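For readers who have not opened run_microbench.sh, the logic of the guard can be sketched as follows. This is a Python illustration, not the shell code in the PR: the function name `check_label`, the example label `397b-n3`, and the rule that the mode must appear as a separator-delimited token are all assumptions.

```python
import re

# Mode token that the label must carry for each --thinking value (from the commit message).
MODE_TOKEN = {"off": "nothink", "on": "think"}


def check_label(label: str, thinking: str) -> None:
    """Raise ValueError unless the label encodes the requested thinking mode."""
    token = MODE_TOKEN[thinking]
    # Split on non-alphanumerics so that "nothink" does not satisfy a "think" check.
    parts = re.split(r"[^A-Za-z0-9]+", label)
    if token not in parts:
        raise ValueError(
            f"label '{label}' does not encode --thinking {thinking}; "
            f"use something like '{label}-{token}'"
        )


check_label("397b-n3-nothink", "off")  # accepted: label and mode agree
try:
    check_label("397b-n3", "on")       # rejected: one label would serve both arms
except ValueError as err:
    print(err)
```

Because the skip is keyed by label, forcing the mode into the label gives each arm its own key, so the second arm can no longer be mistaken for a finished run.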

Both arms complete at N=3 (graded with the fixed phase1_grade.py). no-think 23/36, think 22/36. Retire the N=1 "provisional" framing; findings.md / README / manifest / QUALITATIVE now carry the N=3 numbers.

Key N=3 corrections to the N=1 story:

- Thinking is net -1 but NOT inert (N=1's read): it redistributes. It stabilizes p3_market (1/3->3/3, zero runaways) while hurting p3_pm (2/3->0/3) and p3_doc (2/3->1/3). Changes WHERE 397B succeeds, not how often.
- N=3 exposes single-draw luck: no-think p3_market (N1 PASS->1/3), p3_pm (N1 FAIL->2/3), p3_doc (PASS->2/3) are high-variance; mid-tier + bugfix are 3/3.
- Runaway resistance confirmed across all 72 cells (incl. p3_market think 3/3).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
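The "net -1" figure follows directly from the three cells that moved. A small worked check, assuming (as the 23/36 and 22/36 totals imply) that the other nine cells score the same in both arms; the variable names are illustrative:

```python
# Pass counts out of 3 for the cells that differ between arms (from the commit message).
NO_THINK = {"p3_market": 1, "p3_pm": 2, "p3_doc": 2}
THINK = {"p3_market": 3, "p3_pm": 0, "p3_doc": 1}
NO_THINK_TOTAL = 23  # reported no-think total out of 36

# Per-cell change when thinking is switched on.
deltas = {cell: THINK[cell] - NO_THINK[cell] for cell in NO_THINK}
net = sum(deltas.values())
think_total = NO_THINK_TOTAL + net

print(deltas)       # {'p3_market': 2, 'p3_pm': -2, 'p3_doc': -1}
print(net)          # -1
print(think_total)  # 22
```

The gains and losses are each as large as two reps in a single cell, yet they nearly cancel, which is the sense in which thinking changes where the model succeeds rather than how often.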

Lightheartdevs · 2026-05-29T23:01:43Z

Retargeted the base to main (was stacked on #28's branch). Heads-up: #28 merged at 500c423, i.e. without the P2a --thinking guard (7ba3650); the merge landed a moment before that push made it in. P2a is carried here instead (it's in this PR's history), so merging #30 brings both the guard and the N=3 results into main. Diff is clean: run_microbench.sh (P2a) + the 4 entry docs folded to N=3.

Self-audit against the grade/summary data caught that "all 72 cells done_signal / never ran away" was false: 69/72 done_signal. The 3 non-clean exits are all no-think: p3_market v2/v3 hit the 500-iter stuck threshold and p3_pm v1 model_stopped.

Zero max_tokens/length runaways IS true (the Flash failure mode), but the pathology is stalling, not runaway. Reframed across findings.md / README / QUALITATIVE / manifest into a sharper finding: 397B stalls quietly rather than running away, and thinking clears the market stall (think p3_market 3/3 done_signal vs no-think 2-stuck/1-pass).

Also retired the last stale N=1 status line in QUALITATIVE.md. Scorecard totals (23/36, 22/36) re-verified against grade.json.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Lightheartdevs · 2026-05-29T23:22:29Z

Self-audit (against the grade/summary data)

Verified the published claims against ground truth. One real correction:

🔴 Fixed an overclaim: "all 72 cells done_signal / never ran away" was false — 69/72 done_signal. The 3 non-clean exits are all no-think: p3_market v2/v3 hit the 500-iter stuck threshold and p3_pm v1 model_stopped. Zero max_tokens/length runaways is still true (that's the Flash failure mode), so reframed as: 397B's pathology is stalling, not runaway — and thinking clears the market stall (think p3_market 3/3 done_signal vs no-think 2-stuck/1-pass). Corrected in findings/README/QUALITATIVE/manifest (4809534).
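The corrected claim separates two questions: how many cells exited cleanly, and how many ran away. A minimal sketch of that tally, using only the counts stated above; the tuple layout and the `stuck` label for the 500-iter threshold are hypothetical, not the schema of the grade/summary files:

```python
from collections import Counter

TOTAL_CELLS = 72  # 12 cells x 3 reps x 2 arms

# The three non-clean exits named in the audit: (arm, cell, rep, exit reason).
NON_CLEAN = [
    ("nothink", "p3_market", "v2", "stuck"),
    ("nothink", "p3_market", "v3", "stuck"),
    ("nothink", "p3_pm", "v1", "model_stopped"),
]
# Exit reasons that would count as a runaway (the Flash failure mode).
RUNAWAY_REASONS = {"max_tokens", "length"}

counts = Counter(reason for _arm, _cell, _rep, reason in NON_CLEAN)
counts["done_signal"] = TOTAL_CELLS - len(NON_CLEAN)
runaways = sum(counts[reason] for reason in RUNAWAY_REASONS)

print(counts["done_signal"], counts["stuck"], counts["model_stopped"], runaways)
# prints: 69 2 1 0
```

Keeping the runaway reasons as their own set makes the distinction explicit: a cell can fail to reach done_signal without being a runaway, which is exactly the case for all three exceptions here.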

Verified accurate (no change): scorecard vs grade.json (23/36, 22/36 and every cell match); testwrite-think cov=99 / 153 passing, fails on logalyzer_unchanged; bugfix ruff 2→0, bench 11.2s→0.537s; redistribution deltas (market +2 / pm −2 / doc −1); internal-doc consistency; links resolve; no run-logs/agent-pilot in the diff; P2a guard present and carried to main.

User Name and others added 2 commits May 29, 2026 18:55

Lightheartdevs force-pushed the qwen3.5-397b-n3-results-2026-05-29 branch from 0acba1b to 6a640dc Compare May 29, 2026 22:59

Lightheartdevs marked this pull request as ready for review May 29, 2026 22:59

Lightheartdevs changed the base branch from add-qwen3.5-397b-microbench-2026-05-29 to main May 29, 2026 23:01

Lightheartdevs changed the title ~~[N=3] Qwen3.5-397B-A17B microbench — firmed results (WIP)~~ [N=3] Qwen3.5-397B-A17B firmed results + P2a guard (#28 merged without it) May 29, 2026

Lightheartdevs merged commit aea2585 into main May 29, 2026
1 check passed

Lightheartdevs deleted the qwen3.5-397b-n3-results-2026-05-29 branch May 29, 2026 23:37

This was referenced May 30, 2026

Bench autopilot tooling + 397B N=10 + GPU power analysis + unified cross-model #31

Merged

Qwen3.6-27B-FP8 full microbench N=5 — think vs no-think (clean FP8 redo) #34

Merged

Lightheartdevs merged 3 commits into main from qwen3.5-397b-n3-results-2026-05-29
