Light-Heart-Labs · Lightheartdevs · Jun 1, 2026 · Jun 1, 2026
diff --git a/COMPARISON.md b/COMPARISON.md
@@ -241,10 +241,10 @@ All three arms use **Cyankiwi 4-bit AWQ** community quants. Multiple field repor
 
 What this means for the data here:
 - Within-quant comparison (Coder-Next vs 27B at the same Cyankiwi 4-bit AWQ) **is** informative — the differential is a model-behavior gap, not a quant artifact.
-- Absolute model capability at higher precision (FP8 / UD4 / BF16) is **not** characterized.
+- Absolute model capability at higher precision (FP8 / UD4 / BF16) is **partly characterized now** — see below.
 - Effects that depend on a thinking-mechanism (the `--no-think` ship-rate jump, the word-trim loop reduction) are **unlikely to be quant-specific** — they're about the trace, not the weights' precision.
 
-The FP8 re-run is the highest-priority follow-up.
+**Update (2026-05-31): the FP8 re-run is done.** A clean **Qwen3.6-27B-FP8** run of the full 12-family grid is published at [`hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/`](hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/). It confirms the prediction above: **thinking is still net-negative at FP8** (no-think 35/60 vs think 29/60), so that finding is *not* a 4-bit-AWQ artifact. FP8 serving was also stable (113/119 clean) where the earlier Q8 attempt was a serving failure. The AWQ-underperforms-FP8 *absolute-capability* question for 27B is now directly addressable from that entry; 35B-A3B at higher precision remains the open follow-up. See [`MICROBENCH-INDEX.md`](MICROBENCH-INDEX.md) for the full cross-quant picture.
 
 ### Other VRAM tiers
 

diff --git a/MICROBENCH-INDEX.md b/MICROBENCH-INDEX.md
@@ -0,0 +1,44 @@
+# Microbench Index — the 12-family agentic microbench, across both trees
+
+> **Why this file exists.** The MMBT "12-family agentic microbench" (the same harness, task families,
+> think/no-think comparison, and `done_signal`/PASS scorecard) is a **model-behavior** study. Its entries
+> are split across two top-level trees for an *accidental* reason — some models needed the dual-Blackwell
+> rig (so they landed under `hardware-tests/`), the earlier 4-bit runs are under `benchmarks/`. This index
+> gathers all of them in one place so you don't have to know which tree a model happened to land in.
+>
+> Each entry below is a 12-family microbench. The `hardware-tests/` ones also carry a secondary power
+> section, but their *headline* is model behavior.
+
+## All 12-family microbench entries
+
+| Entry | Tree | Models / arms | N | Headline |
+|---|---|---|---|---|
+| [`microbench-2026-04-28`](benchmarks/microbench-2026-04-28/) | benchmarks/ | Qwen3.6-**27B-AWQ** vs Qwen3-Coder-Next-**AWQ** | 3 | Aggregate-tied ~7/12 each; complementary task-class strengths; Coder-Next much faster/cheaper. |
+| [`microbench-phase-b-2026-05-02`](benchmarks/microbench-phase-b-2026-05-02/) | benchmarks/ | + **27B-AWQ no-think** third arm; 4 differential cells to N=10 | 10 | 27B ships 86.8% no-think vs 75% think (same `p3_doc` word-limit loop); Coder-Next market 0/10 (Wilson [0, 27.8%]). |
+| [`qwen3.5-397b-vs-step3.7-flash-2026-05-29`](hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/) | hardware-tests/ | **397B-A17B** (Q3 GGUF) no-think/think; **Step-3.7-Flash** (NVFP4) low/med/high; **MiniMax-M2.7** (NVFP4); **27B-Q4 / Coder-Q4** refs | 10 (397B) / 1 (Step-3.7-Flash) / 5 (MiniMax) | Thinking net-negative (397B 82→72); small-N misreads cells; aggregate ties ~7–8/12 across ~15× scale; MiniMax "exhaustive completer" + temp serving-trap. |
+| [`qwen3.6-27b-fp8-microbench-2026-05-31`](hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/) | hardware-tests/ | Qwen3.6-**27B-FP8** no-think/think | 5 | Thinking net-negative (35/60 vs 29/60); `p2_triage` 0/5 think vs 5/5 no-think; FP8 serving stable where Q8 failed. |
+
+## The four "27B"s — don't conflate them
+
+This study uses **four** easily confused labels: Qwen3.6-27B in three quantized forms (AWQ, Q8, FP8), plus its sibling MoE, 35B-A3B. When you see "27B," check which:
+
+| Label | What it is | Where |
+|---|---|---|
+| **27B-AWQ** (= "27B-Q4") | Cyankiwi 4-bit AWQ, vLLM | `microbench-2026-04-28`, `microbench-phase-b-2026-05-02`; the "27B-Q4 ref" columns in the 397B entry; the AWQ rows in `SCORECARD.md` / `COMPARISON.md` |
+| **27B-Q8** | Q8_0 GGUF, llama.cpp | `hardware-tests/qwen3.6-q8-fleet-2026-05-17` (throughput); attempted on the microbench harness but **excluded as a serving failure** (23/36 token-runaway) — see the 397B entry |
+| **27B-FP8** | official FP8, vLLM | `hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31` (this is the clean FP8 redo of the excluded Q8 attempt) |
+| **35B-A3B** (sibling MoE) | Qwen3.6-35B-A3B, various quants | referenced as the small-MoE comparator; Q8 MoE crashes on Blackwell sm_120 (known kernel bug) |
+
+## The one finding that holds across all of them
+
+**Thinking is net-negative on this agentic microbench**, consistently across a ~15× parameter range (no-think → think):
+397B 82/120 → 72/120 (−10), 27B-AWQ 86.8% → 75% ship rate, 27B-FP8 35/60 → 29/60 (−6). All three move in the same direction, largely via
+the same `p3_doc` word-limit / over-production mechanism (see [issue #36](https://github.com/Light-Heart-Labs/MMBT-Messy-Model-Bench-Tests/issues/36)
+for the grader artifact that compounds it). Failure *temperament* tracks lineage, not size: Qwen-family
+models (397B, 27B) **stall**; Coder-Next / Flash / MiniMax (at temp=0.3) **run away**.
+
+## Note on organization
+
+This index is the low-disruption fix for the cross-tree split (it avoids moving directories, which would
+break links and git history). If the microbench corpus keeps growing, the cleaner long-term move is a
+dedicated `microbenchmarks/` tree; this index is the bridge until then.
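
The index above quotes a Wilson interval for the Coder-Next market cell (0/10, Wilson [0, 27.8%]). The following is a minimal sketch of how such an interval can be computed, assuming the standard two-sided Wilson score interval at roughly 95% confidence (z ≈ 1.96). The confidence level is not stated in the source, and the function name `wilson_interval` is hypothetical, not part of the MMBT tooling.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.959964) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial pass rate.

    z = 1.959964 corresponds to ~95% confidence (an assumption; the
    source does not state which level it used).
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = passes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Reproduce the interval quoted for the 0/10 cell.
low, high = wilson_interval(0, 10)
print(f"0/10 -> [{low:.1%}, {high:.1%}]")  # 0/10 -> [0.0%, 27.8%]

# The same helper applies to the aggregate counts quoted in the index.
for label, passes, runs in [("27B-FP8 no-think", 35, 60), ("27B-FP8 think", 29, 60)]:
    lo_, hi_ = wilson_interval(passes, runs)
    print(f"{label}: {passes}/{runs} -> [{lo_:.1%}, {hi_:.1%}]")
```

Wilson is a reasonable choice at small N because, unlike the normal-approximation interval, it stays informative when the observed rate is 0 or 1.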
diff --git a/README.md b/README.md
@@ -15,6 +15,7 @@ but I'm making it public so that other people can use it too.
 | Where the benchmark folders start | [`benchmarks/README.md`](benchmarks/README.md) — agent-task benchmark landing page |
 | **"Coder-Next or 27B (or 27B-no-think) for my task?"** | [`COMPARISON.md`](COMPARISON.md) — head-to-head decision doc |
 | The full single-table comparison across all entries | [`SCORECARD.md`](SCORECARD.md) |
+| **All 12-family microbench results (across both trees) + the four "27B"s** | [`MICROBENCH-INDEX.md`](MICROBENCH-INDEX.md) — cross-tree microbench index + quant disambiguation |
 | How repo size is managed | [`REPO-SPACE.md`](REPO-SPACE.md) — storage hotspots and artifact policy |
 | How to benchmark a new local model | [`tooling/ADDING-A-MODEL.md`](tooling/ADDING-A-MODEL.md) |
 | How to replay a specific past run | [`tooling/REPRODUCING.md`](tooling/REPRODUCING.md) |
@@ -74,6 +75,8 @@ For the benchmark landing page and per-folder navigation map, start with
 |---|---|---|
 | [`vllm-power-sweep-2026-04-29`](hardware-tests/vllm-power-sweep-2026-04-29/) | 7 GPU power caps × 5 min sustained vLLM load × 2 concurrencies (N=1, N=32) × 2 AWQ-INT4 models (Dense Qwen3.6-27B, MoE Coder-Next), 28 cells total, on RTX PRO 6000 Blackwell. | Throughput-vs-power-cap curve, native draw at unbounded cap, and per-cap thermal envelope. Validates the 500 W production cap (within 3.3 % of optimal in every scenario), and shows Coder-Next ≈ 1.8× faster batched / 2.3× faster single-stream than dense 27 B at every cap. The findings doc carries an "Audit notes" section flagging two per-cap "winner" markers that don't survive a re-read of the raw CSVs (a vLLM container warmup transient and a single-window thermal clock dip distort the per-cap winners without changing the plateau-shape headline). |
 | [`qwen3.6-q8-fleet-2026-05-17`](hardware-tests/qwen3.6-q8-fleet-2026-05-17/) | Same Qwen3.6 Q8 GGUF model bytes across Blackwell 6000 Tower, DGX Spark, EVO X2 / Strix Halo, and M5 Max MacBook Pro under a pinned llama.cpp SHA, with vLLM appendix rows on Tower2. | Cross-platform single-user prefill/decode/TTFT, backend failure modes, thermal field notes, and cost-throughput caveats for local AI hardware debates. Multi-user serving conclusions are explicitly held. |
+| [`best-stack-followup-2026-05-17`](hardware-tests/best-stack-followup-2026-05-17/) | Follow-up bundle: MLX on the M5 Max, and Dream-Server on ROCm 7 on the Strix Halo. | Best-serving-stack notes per platform (MLX beats Metal on M5; ROCm 7 works on Strix; no prefill lift). |
+| [`qwen3.5-397b-vs-step3.7-flash-2026-05-29`](hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/) | **A 12-family agentic microbench (model behavior), filed here because it needed the dual-Blackwell rig.** 397B-A17B (Q3 GGUF) no-think/think N=10, + Step-3.7-Flash, MiniMax-M2.7, and 27B/Coder-Q4 refs. | Thinking is net-negative (397B 82→72); small-N misreads cells; aggregate ties ~7–8/12 across ~15× scale; MiniMax temp serving-trap. Secondary: dual-GPU power telemetry. See [`MICROBENCH-INDEX.md`](MICROBENCH-INDEX.md). |
 | [`local-ai-hardware-valuation-2026-05-17`](hardware-tests/local-ai-hardware-valuation-2026-05-17/) | Derived valuation worksheet built from editable price/spec inputs plus the Qwen3.6 27B Q8 hardware measurements. | Recomputable buyer metrics: `$/usable AI GB`, `$/GB/s`, `$/measured decode tok/s`, `$/measured prefill tok/s`, capacity-bandwidth score, and rough 5-year energy/TCO lines. Use this when market prices change and you want the same mental model to survive the refresh. |
 | [`step3.7-flash-nvfp4-dual-blackwell-2026-05-28`](hardware-tests/step3.7-flash-nvfp4-dual-blackwell-2026-05-28/) | Setup/config note: serving `stepfun-ai/Step-3.7-Flash-NVFP4` (201B MoE VLM, day-one) under vLLM on 2× RTX PRO 6000 Blackwell (sm_120, no NVLink), TP=2, native NVFP4 + FP8 KV. | The working launch command and the four non-obvious flags it took to get there, with full diagnostic trail: `--disable-custom-all-reduce` (custom all-reduce deadlocks without P2P/NVLink), `--moe-backend cutlass` (only native-FP4 MoE kernel that supports the model's SWIGLUSTEP activation), no expert-parallel, native max-model-len. No official 2×6000 recipe exists upstream. Companion to the Step-3.7 microbench entry. |
 

diff --git a/ROADMAP.md b/ROADMAP.md
@@ -8,13 +8,15 @@
 
 ## Active follow-ups (in priority order)
 
-### 1. FP8 re-run of the 12-cell microbench grid &nbsp; **[contributor-welcome]**
+### 1. FP8 re-run of the 12-cell microbench grid &nbsp; ~~**[contributor-welcome]**~~ **✅ DONE (2026-05-31) for 27B**
 
 **Source**: [`KNOWN-LIMITATIONS.md` § Cyankiwi 4-bit AWQ field reports](KNOWN-LIMITATIONS.md#quantization-specificity), [`benchmarks/microbench-phase-b-2026-05-02/findings.md` § Recommended follow-ups](benchmarks/microbench-phase-b-2026-05-02/findings.md#recommended-follow-ups)
 
-Multiple practitioners report that the Cyankiwi 4-bit AWQ quants underperform official Qwen FP8 of the same base models. Re-running the full 12-cell × N=10 grid on FP8 would let current findings generalize across quants or be bounded as quant-specific.
+Multiple practitioners report that the Cyankiwi 4-bit AWQ quants underperform official Qwen FP8 of the same base models. Re-running the full 12-cell grid on FP8 shows whether the current findings generalize across quants or should be bounded as quant-specific.
 
-What to do: pull official Qwen FP8 quants, run the 4-command friendly path in [`tooling/ADDING-A-MODEL.md`](tooling/ADDING-A-MODEL.md) for each model arm, submit a PR with the results.
+**Done for Qwen3.6-27B:** [`hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/`](hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/) — full 12-family grid, N=5, both reasoning modes, official FP8 on vLLM. Result: the **thinking-net-negative** finding holds at FP8 (no-think 35/60 vs think 29/60), so it's not a 4-bit-AWQ artifact; FP8 serving was stable (113/119 clean) where the Q8 attempt failed. **Still open (contributor-welcome):** the same FP8/higher-precision re-run for **Coder-Next** and especially **35B-A3B** (fails at 4-bit; Q8 MoE also hits a Blackwell sm_120 kernel bug — a quant/engine-headroom question). See [`MICROBENCH-INDEX.md`](MICROBENCH-INDEX.md).
+
+What to do: pull official Qwen FP8 quants, run the 4-command friendly path in [`tooling/ADDING-A-MODEL.md`](tooling/ADDING-A-MODEL.md) for each remaining model arm, submit a PR with the results.
 
 Hardware: needs FP8-capable GPU. RTX PRO 6000 / H100 / similar.
 

diff --git a/SCORECARD.md b/SCORECARD.md
@@ -5,6 +5,19 @@
 > **For a head-to-head decision between Coder-Next, 27B-thinking, and 27B-no-think organized by task class, see [`COMPARISON.md`](COMPARISON.md).** This SCORECARD is the grand summary; COMPARISON is the model-selection synthesis.
 >
 > **Read [`KNOWN-LIMITATIONS.md`](KNOWN-LIMITATIONS.md) before quoting any cell.** Several columns are hand-graded against ground truth where it exists, "not graded" where it doesn't. Confidence levels are noted per column.
+>
+> **⚠️ Scope of this SCORECARD (read before comparing to other entries).** Every "27B" cell below is
+> **Qwen3.6-27B at 4-bit AWQ** (Cyankiwi), and the tables cover only the `benchmarks/` microbench arms
+> (27B-AWQ / Coder-Next-AWQ). They do **not** include the later `hardware-tests/` microbenches —
+> **397B, Step-3.7-Flash, MiniMax-M2.7, and the clean 27B-FP8 redo** — nor do they disambiguate the four
+> different "27B"s (AWQ / Q8 / FP8 / 35B-A3B sibling). For the full microbench picture across both trees
+> and the quant disambiguation, see **[`MICROBENCH-INDEX.md`](MICROBENCH-INDEX.md)**.
+>
+> **Newer microbench results (summary; full detail in the linked entries):**
+> - **397B-A17B** (Q3 GGUF), N=10: no-think 82/120, think 72/120 — *thinking net-negative*.
+> - **27B-FP8**, N=5: no-think 35/60, think 29/60 — *thinking net-negative*; FP8 serving stable where Q8 failed. ([entry](hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/))
+> - **Step-3.7-Flash** (NVFP4): 7/8/8 low/med/high; **MiniMax-M2.7** (NVFP4), N=5: 35/60 — "exhaustive completer" + a temp=0.3 serving-trap. ([entry](hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/))
+> - Across a ~15× parameter range, aggregate scores tie at ~7–8/12, and **thinking is net-negative in every think/no-think pair measured** (397B, 27B-AWQ, 27B-FP8); this is the consistent cross-model finding.
 
 ## Cell-name legend (microbench)
 
@@ -257,6 +270,6 @@ These additions would tighten the recommendations above; until they land, the re
 3. **Failed-run artifacts published** (receipts + transcripts for the 5+ unsuccessful local-model runs not currently in MMBT). Would let a reader see expected failure modes per model.
 4. **N=10+ on the highest-signal cells** (Coder-Next on `dreamserver-1-pr-audit`, 27B on the same; both on `microbench-2026-04-28/adversarial-hallucination`). Would bound the variance the current N=3 only suggests.
 5. **Different PR shapes** in the dreamserver-1-pr-audit family — the current PR has subtle architectural distinctions; a docs-only PR or a security PR would test different failure modes.
-6. **Higher-precision quantizations** of the same models (FP8, BF16). Particularly for 35B-A3B which fails at 4-bit; might be a quantization-headroom issue rather than a base-model issue.
+6. **Higher-precision quantizations** of the same models (FP8, BF16). **Partly done (2026-05-31):** a clean **27B-FP8** run of the full 12-family grid is published ([`hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/`](hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/)) — FP8 serving is stable and the thinking-net-negative finding holds, so the headline conclusions generalize past 4-bit AWQ for 27B. Still open: **35B-A3B** at higher precision (it fails at 4-bit; the Q8 MoE path also hits a Blackwell sm_120 kernel bug — a quant/engine-headroom question, not yet settled).
 
 None of these are in scope for the current MMBT publication. They're separate experiments.
diff --git a/benchmarks/microbench-2026-04-28/README.md b/benchmarks/microbench-2026-04-28/README.md
@@ -2,6 +2,8 @@
 
 > 12 task families, 2 local models, N=3 each. Smaller-scope tasks than the dreamserver-PR-audit / wallstreet-intern-test benchmarks above — each task is a 5-30 minute deliverable rather than a multi-hour audit. Designed to surface task-class-specific differences between Qwen3.6-27B-AWQ and Qwen3-Coder-Next-AWQ that the larger benchmarks couldn't isolate.
 
+> **Part of a larger arc.** This same 12-family microbench has since been run on more models (397B, Step-3.7-Flash, MiniMax-M2.7, and a clean 27B-FP8 redo) — some of which live under `hardware-tests/` because they needed the dual-Blackwell rig. For the full cross-tree index and the disambiguation of the four "27B"s (AWQ here, vs Q8 / FP8 / 35B-A3B), see [`../../MICROBENCH-INDEX.md`](../../MICROBENCH-INDEX.md).
+
 ## Read these first
 
 - [`findings.md`](findings.md) — cross-cutting writeup. Headline reads, daily-driver-guide updates, caveats. Read this before drilling into individual task-family folders.

diff --git a/benchmarks/microbench-phase-b-2026-05-02/README.md b/benchmarks/microbench-phase-b-2026-05-02/README.md
@@ -1,5 +1,7 @@
 # microbench-phase-b-2026-05-02 — N=10 expansion + 27B-no-think third arm
 
+> **Part of a larger arc.** This 12-family microbench has since been extended to more models (397B, Step-3.7-Flash, MiniMax-M2.7, a clean 27B-FP8 redo), some filed under `hardware-tests/` because they needed the dual-Blackwell rig. The **thinking-net-negative** finding first sharpened here (27B 86.8% no-think vs 75% think) recurs across that whole arc. Full cross-tree index + the four-"27B" disambiguation: [`../../MICROBENCH-INDEX.md`](../../MICROBENCH-INDEX.md).
+
 > **How this entry relates to [`microbench-2026-04-28`](../microbench-2026-04-28/)**: this entry is the *current* picture for the 4 differential cells (p2_hallucination, p3_business, p3_doc, p3_market) at N=10 across all three model arms, and the *first* picture for 27B-no-think across the full 12-family grid (N=10). The 2026-04-28 entry remains the current N=3 baseline for the other 8 cells on Coder-Next + 27B-thinking — it is **not superseded**, and many cross-references in this entry point back to it. **Read both for the full picture.**
 >
 > **Of the ~240 runs in this batch, this entry publishes one representative run per (cell × model arm) — 22 representatives total.** Per-run artifacts (cost.json / grade.json / label.json / summary.json / receipt.json) for the remaining ~220 runs live on the source bench's `submit/phase-b-overnight-2026-05-02` branch (sibling branch in this repo), which preserves the full transcripts + workspace tarballs for reproducibility.

diff --git a/hardware-tests/README.md b/hardware-tests/README.md
@@ -28,6 +28,8 @@ Do not mix those two tables in a cross-host ranking.
 | Bundle | Primary question | Main caution |
 |---|---|---|
 | [`qwen3.6-q8-fleet-2026-05-17`](qwen3.6-q8-fleet-2026-05-17/) | How do four local-AI hardware classes handle the same dense and MoE Qwen3.6 workloads? | Multi-user serving is held; Tower2 MoE uses a defended vLLM FP8 exception because native llama.cpp Q8 crashes. |
+| [`best-stack-followup-2026-05-17`](best-stack-followup-2026-05-17/) | What's the best serving stack per platform (MLX vs Metal on M5; ROCm 7 on Strix Halo)? | Platform-specific; MLX beats Metal on M5, ROCm 7 works on Strix, no prefill lift. |
+| [`qwen3.5-397b-vs-step3.7-flash-2026-05-29`](qwen3.5-397b-vs-step3.7-flash-2026-05-29/) | **(Model-behavior microbench, filed here for the rig.)** Does thinking help; do results tie across scale; 397B / Step-3.7 / MiniMax / 27B-Q4 refs. | Thinking net-negative across ~15× scale; small-N misreads cells; MiniMax temp serving-trap. A 12-family agentic microbench — see [`../MICROBENCH-INDEX.md`](../MICROBENCH-INDEX.md). Secondary: dual-GPU power. |
 | [`local-ai-hardware-valuation-2026-05-17`](local-ai-hardware-valuation-2026-05-17/) | What are buyers paying per usable memory GB, bandwidth, and measured 27B Q8 tok/s? | Prices and wall-power assumptions are time-bound inputs. |
 | [`vllm-power-sweep-2026-04-29`](vllm-power-sweep-2026-04-29/) | Where is the RTX PRO 6000 Blackwell LLM-serving power-cap plateau? | Tower2-only, vLLM-only, AWQ-INT4 models. |
 | [`ltx23-power-sweep-2026-05-05`](ltx23-power-sweep-2026-05-05/) | Does the same GPU power-cap curve apply to diffusion/video generation? | Workload-specific to the LTX-2.3 workflow tested. |