Qwen3.6-27B-FP8 full microbench N=5 — think vs no-think (clean FP8 redo) by Lightheartdevs · Pull Request #34 · Light-Heart-Labs/MMBT-Messy-Model-Bench-Tests

Lightheartdevs · 2026-05-31T23:23:32Z

The clean FP8 redo of the Qwen3.6-27B run that the 397B entry had to exclude as a Q8/FP8 serving failure. Full MMBT 12-family microbench at N=5, both reasoning modes. Grid complete: 120 cells attempted, 119 graded.

FP8 serving is stable (the gap the 397B entry flagged is closed)

| run | clean `done_signal` | errors |
|---|---|---|
| excluded Q8 attempt | 4/36 | 23/36 runaway + 8/36 HTTP-400 (instability storm) |
| this FP8 run | 113/119 | 6: 5 HTTP-400 + 1 max_tokens |

The 6 FP8 errors are the model looping itself into the 131072 context ceiling on the two hardest tasks (market, business-think — e.g. business-think_v2 errored at iter 186 / 12M cumulative prompt-tokens), not quant instability.
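
The finish-reason arithmetic for the FP8 run can be checked with a short sketch. The counts are the ones quoted in this description; the dictionary keys and the `summarize` helper are hypothetical labels for illustration, not the schema of `manifest.json`.

```python
from collections import Counter

# Counts copied from this PR description; key names are illustrative only.
fp8_census = Counter({"done_signal": 113, "http_400": 5, "max_tokens": 1})


def summarize(census: Counter) -> dict:
    """Return total cells, clean cells, error cells, and the clean percentage."""
    total = sum(census.values())
    clean = census["done_signal"]
    return {
        "total": total,
        "clean": clean,
        "errors": total - clean,
        "clean_pct": round(100 * clean / total, 1),
    }


summary = summarize(fp8_census)
print(summary)
```

Under these counts the census sums to 119 cells with 6 errors, which matches the 113/119 figure quoted for the FP8 run.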

Headline — thinking is net-negative for 27B

No-think 35/60 (58%) vs think 29/60 (48%). One think cell (market_v3) was dropped, so over graded cells the think arm is 29/59.

Smoking gun: p2_triage 0/5 think vs 5/5 no-think (thinking reasons its way off the closed label set). p3_writing also breaks under thinking (1/5 think vs 5/5 no-think).
Thinking's only win is p3_business (5/5 vs 1/5) — length discipline, not insight.
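
As a quick check of the headline percentages and the per-task swings, here is a minimal sketch. The pass counts come from this description and from the commit note that one think cell was dropped; the helper name and data layout are assumptions made for illustration.

```python
def pass_pct(passed: int, total: int) -> int:
    """Pass rate as a whole-number percentage."""
    return round(100 * passed / total)


# (passed, total) per arm, as quoted in this PR.
arms = {
    "no-think": (35, 60),
    "think": (29, 60),
    "think (graded only)": (29, 59),  # market_v3 dropped, per the commit note
}
arm_pct = {name: pass_pct(p, t) for name, (p, t) in arms.items()}

# Per-task passes out of 5: (think, no-think).
tasks = {
    "p2_triage": (0, 5),
    "p3_writing": (1, 5),
    "p3_business": (5, 1),
}
# Positive delta means thinking helped on that task.
task_delta = {name: think - nothink for name, (think, nothink) in tasks.items()}

print(arm_pct)
print(task_delta)
```

Only p3_business comes out with a positive delta, consistent with it being the sole task where thinking wins.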

Qualitative

27B is the content winner on p3 longform (best bias recall 8/8 — beats 397B-think's 5/8; sharpest synthesis; best citation honesty) held back by a near-miss word cap (briefs ~700 by wc, grader counts 703–705) and market scrape-loop variance.

Registration (repo contract)

manifest.json — provenance, image digest, serving config, finish-reason census, run inventory
3 claims.yaml entries (serving-viable, thinking-net-negative, triage-overthinking)
indexed in hardware-tests/README.md + root README.md
market_v3.skip-reason for the dropped operator-killed rep

Known open items (not blockers)

Hand-grading dimensions (stance/calibration/fabrication) still null — qualitative section is close reading.
Reviewer decision: keep standalone, or fold the FP8 results in as the proper N=5 "27B" column of the 397B entry (replacing its N=1 Q4 ref).
p3_doc word-count grader counts markdown tokens — worth a grader fix before quoting doc pass-rates.
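
To illustrate the kind of inflation the p3_doc item describes, the sketch below compares a count that treats every whitespace-separated token as a word with a count that skips tokens made only of markdown punctuation. Both functions and the sample text are hypothetical; this is not the MMBT grader's code and it does not reproduce the 703–705 counts.

```python
import re


def scaffolding_inclusive_count(text: str) -> int:
    """Count every whitespace-separated token, markdown markers included."""
    return len(text.split())


def prose_word_count(text: str) -> int:
    """Count only tokens that contain at least one letter or digit."""
    return sum(1 for tok in text.split() if re.search(r"[A-Za-z0-9]", tok))


# Tiny made-up sample: a heading, a bullet with a table-style cell, and a rule.
sample = "## Key risks\n- **Churn** rose | 12% |\n---\n"

inclusive = scaffolding_inclusive_count(sample)
prose = prose_word_count(sample)
print(inclusive, prose)
```

On this sample the heading marker, bullet dash, pipes, and horizontal rule account for half of the inclusive count, which is how a brief that sits just under a word cap could be pushed over it by formatting alone.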

🤖 Generated with Claude Code

The clean FP8 redo of the 27B run the 397B entry had to exclude as a Q8/FP8 serving failure — 112 cells, 0 runaways, 0 HTTP-400 storms. Headline: thinking is net-negative for 27B on this bench (no-think 35/60 = 58%, think trends lower). Smoking gun: p2_triage 0/5 think vs 5/5 no-think (overthinks a closed label set). Thinking's only win is p3_business (length discipline, not insight). Qualitative: 27B is the content winner on p3 longform (best bias recall, sharpest synthesis, best citation honesty) held back by a word-count grader artifact and market scrape-loop variance. In-progress: no-think complete (60/60); think p3 tail still landing. Hand-grading dimensions not yet filled. Will finalize when the grid completes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…fixes) Grid complete (119/120 graded). Corrects the WIP draft's quantitative errors that the audit caught:

- finish-reason reality: 113/119 clean done_signal, 6 errors (5 HTTP-400 + 1 max_tokens), NOT "0 errors / 112 clean". The 6 are model loop-into-ctx-overflow on market/business-think, not FP8 instability (the honest framing of "FP8 viable where Q8 failed").
- final aggregates: no-think 35/60, think 29/60 (think net-negative, complete).
- p3_market-think corrected to 3/4 (61 iters / 32k tok), v3 dropped-stuck.
- p3_doc word-cap stated as real 705-vs-700 near-miss, not invented 697/711.
- p3_writing think 1/5 vs no-think 5/5 added (think breaks it).

Adds the registration the repo contract requires: manifest.json (provenance + finish-reason census + run inventory), 3 claims.yaml entries, index rows in hardware-tests/README.md and root README.md, and market_v3.skip-reason.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…or, taxonomy note Tier-1 audit fixes: state temp=0.3 and the 500W cap (vs 397B's 600W) in the reproduce block; clarify think p3_market is 3/4 (v3 dropped) so the arm is 29/59 graded; add the microbench-index 'where this lives' taxonomy note. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

User Name and others added 2 commits May 31, 2026 19:23

Lightheartdevs marked this pull request as ready for review June 1, 2026 01:29

Lightheartdevs changed the title ~~[WIP] Qwen3.6-27B-FP8 full microbench N=5 — think vs no-think (clean FP8 redo)~~ Qwen3.6-27B-FP8 full microbench N=5 — think vs no-think (clean FP8 redo) Jun 1, 2026

Lightheartdevs mentioned this pull request Jun 1, 2026

Grader: p3_doc word-count counts markdown scaffolding as words → penalizes citation-dense formatting #36

Open

Lightheartdevs mentioned this pull request Jun 1, 2026

docs: microbench cross-tree index + synthesis-doc currency (org fixes) #37

Merged

Merge remote-tracking branch 'origin/main' into capture-qwen3.6-27b-f…

cb915da

…p8-microbench-2026-05-31 # Conflicts: # claims.yaml

Lightheartdevs merged commit a74a8ad into main Jun 1, 2026
1 check failed

Lightheartdevs merged 4 commits into main from capture-qwen3.6-27b-fp8-microbench-2026-05-31
