Merged
4 changes: 2 additions & 2 deletions COMPARISON.md
@@ -241,10 +241,10 @@ All three arms use **Cyankiwi 4-bit AWQ** community quants. Multiple field repor

What this means for the data here:
- Within-quant comparison (Coder-Next vs 27B at the same Cyankiwi 4-bit AWQ) **is** informative — the differential is a model-behavior gap, not a quant artifact.
- Absolute model capability at higher precision (FP8 / UD4 / BF16) is **not** characterized.
- Absolute model capability at higher precision (FP8 / UD4 / BF16) is **partly characterized now** — see below.
- Effects that depend on a thinking-mechanism (the `--no-think` ship-rate jump, the word-trim loop reduction) are **unlikely to be quant-specific** — they're about the trace, not the weights' precision.

The FP8 re-run is the highest-priority follow-up.
**Update (2026-05-31): the FP8 re-run is done.** A clean **Qwen3.6-27B-FP8** run of the full 12-family grid is published at [`hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/`](hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/). It confirms the prediction above: **thinking is still net-negative at FP8** (no-think 35/60 vs think 29/60), so that finding is *not* a 4-bit-AWQ artifact. FP8 serving was also stable (113/119 clean) where the earlier Q8 attempt was a serving failure. The AWQ-underperforms-FP8 *absolute-capability* question for 27B is now directly addressable from that entry; 35B-A3B at higher precision remains the open follow-up. See [`MICROBENCH-INDEX.md`](MICROBENCH-INDEX.md) for the full cross-quant picture.

### Other VRAM tiers

44 changes: 44 additions & 0 deletions MICROBENCH-INDEX.md
@@ -0,0 +1,44 @@
# Microbench Index — the 12-family agentic microbench, across both trees

> **Why this file exists.** The MMBT "12-family agentic microbench" (the same harness, task families,
> think/no-think comparison, and `done_signal`/PASS scorecard) is a **model-behavior** study. Its entries
> are split across two top-level trees for an *accidental* reason — some models needed the dual-Blackwell
> rig (so they landed under `hardware-tests/`), while the earlier 4-bit runs are under `benchmarks/`. This index
> gathers all of them in one place so you don't have to know which tree a model happened to land in.
>
> Each entry below is a 12-family microbench. The `hardware-tests/` ones also carry a secondary power
> section, but their *headline* is model behavior.

## All 12-family microbench entries

| Entry | Tree | Models / arms | N | Headline |
|---|---|---|---|---|
| [`microbench-2026-04-28`](benchmarks/microbench-2026-04-28/) | benchmarks/ | Qwen3.6-**27B-AWQ** vs Qwen3-Coder-Next-**AWQ** | 3 | Aggregate-tied ~7/12 each; complementary task-class strengths; Coder-Next much faster/cheaper. |
| [`microbench-phase-b-2026-05-02`](benchmarks/microbench-phase-b-2026-05-02/) | benchmarks/ | + **27B-AWQ no-think** third arm; 4 differential cells to N=10 | 10 | 27B ships 86.8% no-think vs 75% think (same `p3_doc` word-limit loop); Coder-Next market 0/10 (Wilson [0, 27.8%]). |
| [`qwen3.5-397b-vs-step3.7-flash-2026-05-29`](hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/) | hardware-tests/ | **397B-A17B** (Q3 GGUF) no-think/think; **Step-3.7-Flash** (NVFP4) low/med/high; **MiniMax-M2.7** (NVFP4); **27B-Q4 / Coder-Q4** refs | 10 / 1 / 5 | Thinking net-negative (397B 82→72); small-N misreads cells; aggregate ties ~7–8/12 across ~15× scale; MiniMax "exhaustive completer" + temp serving-trap. |
| [`qwen3.6-27b-fp8-microbench-2026-05-31`](hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/) | hardware-tests/ | Qwen3.6-**27B-FP8** no-think/think | 5 | Thinking net-negative (35/60 vs 29/60); `p2_triage` 0/5 think vs 5/5 no-think; FP8 serving stable where Q8 failed. |
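
The `Wilson [0, 27.8%]` bound quoted for the Coder-Next market cell (0/10) in the table above can be reproduced in a few lines. This is a minimal sketch, assuming a two-sided 95% Wilson score interval with z = 1.96 (the index does not state the confidence level); the function name `wilson_interval` is illustrative and is not part of the MMBT tooling.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Two-sided Wilson score interval for a pass rate of passes/n.

    z = 1.96 is an assumption (95% confidence), not something the index states.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = passes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


low, high = wilson_interval(0, 10)
print(f"0/10 -> [{low:.1%}, {high:.1%}]")
```

Under these assumptions the upper bound for 0/10 comes out at roughly 27.8%, matching the table. The same helper can be applied to any other cell to see how wide the interval still is at N=3 or N=5.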

## The four "27B"s — don't conflate them

This study discusses **four** easily conflated models under the "27B" umbrella: three quantizations of Qwen3.6-27B, plus its sibling MoE, 35B-A3B. When you see "27B," check which:

| Label | What it is | Where |
|---|---|---|
| **27B-AWQ** (= "27B-Q4") | Cyankiwi 4-bit AWQ, vLLM | `microbench-2026-04-28`, `microbench-phase-b-2026-05-02`; the "27B-Q4 ref" columns in the 397B entry; the AWQ rows in `SCORECARD.md` / `COMPARISON.md` |
| **27B-Q8** | Q8_0 GGUF, llama.cpp | `hardware-tests/qwen3.6-q8-fleet-2026-05-17` (throughput); attempted on the microbench harness but **excluded as a serving failure** (23/36 token-runaway) — see the 397B entry |
| **27B-FP8** | official FP8, vLLM | `hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31` (this is the clean redo of the excluded Q8/FP8 attempt) |
| **35B-A3B** (sibling MoE) | Qwen3.6-35B-A3B, various quants | referenced as the small-MoE comparator; Q8 MoE crashes on Blackwell sm_120 (known kernel bug) |

## The one finding that holds across all of them

**Thinking is net-negative on this agentic microbench**, consistently across a ~15× parameter range:
397B 82→72 of 120 (−10), 27B-AWQ 86.8%→75% ship rate, 27B-FP8 35→29 of 60 (−6); all point the same direction, largely via
the same `p3_doc` word-limit / over-production mechanism (see [issue #36](https://github.com/Light-Heart-Labs/MMBT-Messy-Model-Bench-Tests/issues/36)
for the grader artifact that compounds it). Failure *temperament* tracks lineage, not size: Qwen-family
models (397B, 27B) **stall**; Coder-Next / Flash / MiniMax(@temp0.3) **run away**.
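
To put the three results above on one scale, the raw tallies can be converted to pass rates and percentage-point deltas. The sketch below assumes denominators of 12 families × N runs per arm (120 for 397B at N=10, 60 for 27B-FP8 at N=5) and treats the 27B-AWQ ship rate as an already-computed percentage; the `Arm` type and variable names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arm:
    label: str
    no_think: float  # pass (or ship) rate in [0, 1]
    think: float     # pass (or ship) rate in [0, 1]

    @property
    def delta_pp(self) -> float:
        """Think minus no-think, in percentage points (negative = thinking hurts)."""
        return (self.think - self.no_think) * 100


# Figures are taken from the index; denominators are assumed to be 12 x N.
arms = [
    Arm("397B-A17B (Q3 GGUF)", 82 / 120, 72 / 120),
    Arm("27B-AWQ (ship rate)", 0.868, 0.75),
    Arm("27B-FP8", 35 / 60, 29 / 60),
]

for arm in arms:
    print(f"{arm.label:22s} {arm.no_think:6.1%} -> {arm.think:6.1%} ({arm.delta_pp:+.1f} pp)")
```

Note that the 27B-AWQ figures are ship rates while the other two are PASS tallies, so the deltas indicate a shared direction rather than a like-for-like effect size.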

## Note on organization

This index is the low-disruption fix for the cross-tree split (it avoids moving directories, which would
break links and git history). If the microbench corpus keeps growing, the cleaner long-term move is a
dedicated `microbenchmarks/` tree; this index is the bridge until then.
3 changes: 3 additions & 0 deletions README.md
@@ -15,6 +15,7 @@ but I'm making it public so that other people can use it too.
| Where the benchmark folders start | [`benchmarks/README.md`](benchmarks/README.md) — agent-task benchmark landing page |
| **"Coder-Next or 27B (or 27B-no-think) for my task?"** | [`COMPARISON.md`](COMPARISON.md) — head-to-head decision doc |
| The full single-table comparison across all entries | [`SCORECARD.md`](SCORECARD.md) |
| **All 12-family microbench results (across both trees) + the four "27B"s** | [`MICROBENCH-INDEX.md`](MICROBENCH-INDEX.md) — cross-tree microbench index + quant disambiguation |
| How repo size is managed | [`REPO-SPACE.md`](REPO-SPACE.md) — storage hotspots and artifact policy |
| How to benchmark a new local model | [`tooling/ADDING-A-MODEL.md`](tooling/ADDING-A-MODEL.md) |
| How to replay a specific past run | [`tooling/REPRODUCING.md`](tooling/REPRODUCING.md) |
@@ -74,6 +75,8 @@ For the benchmark landing page and per-folder navigation map, start with
|---|---|---|
| [`vllm-power-sweep-2026-04-29`](hardware-tests/vllm-power-sweep-2026-04-29/) | 7 GPU power caps × 5 min sustained vLLM load × 2 concurrencies (N=1, N=32) × 2 AWQ-INT4 models (Dense Qwen3.6-27B, MoE Coder-Next), 28 cells total, on RTX PRO 6000 Blackwell. | Throughput-vs-power-cap curve, native draw at unbounded cap, and per-cap thermal envelope. Validates the 500 W production cap (within 3.3 % of optimal in every scenario), and shows Coder-Next ≈ 1.8× faster batched / 2.3× faster single-stream than dense 27 B at every cap. The findings doc carries an "Audit notes" section flagging two per-cap "winner" markers that don't survive a re-read of the raw CSVs (a vLLM container warmup transient and a single-window thermal clock dip distort the per-cap winners without changing the plateau-shape headline). |
| [`qwen3.6-q8-fleet-2026-05-17`](hardware-tests/qwen3.6-q8-fleet-2026-05-17/) | Same Qwen3.6 Q8 GGUF model bytes across Blackwell 6000 Tower, DGX Spark, EVO X2 / Strix Halo, and M5 Max MacBook Pro under a pinned llama.cpp SHA, with vLLM appendix rows on Tower2. | Cross-platform single-user prefill/decode/TTFT, backend failure modes, thermal field notes, and cost-throughput caveats for local AI hardware debates. Multi-user serving conclusions are explicitly held. |
| [`best-stack-followup-2026-05-17`](hardware-tests/best-stack-followup-2026-05-17/) | Follow-up bundle: MLX on the M5 Max, and Dream-Server on ROCm 7 on the Strix Halo. | Best-serving-stack notes per platform (MLX beats Metal on M5; ROCm 7 works on Strix; no prefill lift). |
| [`qwen3.5-397b-vs-step3.7-flash-2026-05-29`](hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/) | **A 12-family agentic microbench (model behavior), filed here because it needed the dual-Blackwell rig.** 397B-A17B (Q3 GGUF) no-think/think N=10, + Step-3.7-Flash, MiniMax-M2.7, and 27B/Coder-Q4 refs. | Thinking is net-negative (397B 82→72); small-N misreads cells; aggregate ties ~7–8/12 across ~15× scale; MiniMax temp serving-trap. Secondary: dual-GPU power telemetry. See [`MICROBENCH-INDEX.md`](MICROBENCH-INDEX.md). |
| [`local-ai-hardware-valuation-2026-05-17`](hardware-tests/local-ai-hardware-valuation-2026-05-17/) | Derived valuation worksheet built from editable price/spec inputs plus the Qwen3.6 27B Q8 hardware measurements. | Recomputable buyer metrics: `$/usable AI GB`, `$/GB/s`, `$/measured decode tok/s`, `$/measured prefill tok/s`, capacity-bandwidth score, and rough 5-year energy/TCO lines. Use this when market prices change and you want the same mental model to survive the refresh. |
| [`step3.7-flash-nvfp4-dual-blackwell-2026-05-28`](hardware-tests/step3.7-flash-nvfp4-dual-blackwell-2026-05-28/) | Setup/config note: serving `stepfun-ai/Step-3.7-Flash-NVFP4` (201B MoE VLM, day-one) under vLLM on 2× RTX PRO 6000 Blackwell (sm_120, no NVLink), TP=2, native NVFP4 + FP8 KV. | The working launch command and the four non-obvious flags it took to get there, with full diagnostic trail: `--disable-custom-all-reduce` (custom all-reduce deadlocks without P2P/NVLink), `--moe-backend cutlass` (only native-FP4 MoE kernel that supports the model's SWIGLUSTEP activation), no expert-parallel, native max-model-len. No official 2×6000 recipe exists upstream. Companion to the Step-3.7 microbench entry. |

8 changes: 5 additions & 3 deletions ROADMAP.md
@@ -8,13 +8,15 @@

## Active follow-ups (in priority order)

### 1. FP8 re-run of the 12-cell microbench grid   **[contributor-welcome]**
### 1. FP8 re-run of the 12-cell microbench grid   ~~**[contributor-welcome]**~~ **✅ DONE (2026-05-31) for 27B**

**Source**: [`KNOWN-LIMITATIONS.md` § Cyankiwi 4-bit AWQ field reports](KNOWN-LIMITATIONS.md#quantization-specificity), [`benchmarks/microbench-phase-b-2026-05-02/findings.md` § Recommended follow-ups](benchmarks/microbench-phase-b-2026-05-02/findings.md#recommended-follow-ups)

Multiple practitioners report that the Cyankiwi 4-bit AWQ quants underperform official Qwen FP8 of the same base models. Re-running the full 12-cell × N=10 grid on FP8 would let current findings generalize across quants or be bounded as quant-specific.
Multiple practitioners report that the Cyankiwi 4-bit AWQ quants underperform official Qwen FP8 of the same base models. Re-running the full 12-cell grid on FP8 would let current findings generalize across quants or be bounded as quant-specific.

What to do: pull official Qwen FP8 quants, run the 4-command friendly path in [`tooling/ADDING-A-MODEL.md`](tooling/ADDING-A-MODEL.md) for each model arm, submit a PR with the results.
**Done for Qwen3.6-27B:** [`hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/`](hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/) — full 12-family grid, N=5, both reasoning modes, official FP8 on vLLM. Result: the **thinking-net-negative** finding holds at FP8 (no-think 35/60 vs think 29/60), so it's not a 4-bit-AWQ artifact; FP8 serving was stable (113/119 clean) where the Q8 attempt failed. **Still open (contributor-welcome):** the same FP8/higher-precision re-run for **Coder-Next** and especially **35B-A3B** (fails at 4-bit; Q8 MoE also hits a Blackwell sm_120 kernel bug — a quant/engine-headroom question). See [`MICROBENCH-INDEX.md`](MICROBENCH-INDEX.md).

What to do: pull official Qwen FP8 quants, run the 4-command friendly path in [`tooling/ADDING-A-MODEL.md`](tooling/ADDING-A-MODEL.md) for each remaining model arm, submit a PR with the results.

Hardware: needs FP8-capable GPU. RTX PRO 6000 / H100 / similar.

15 changes: 14 additions & 1 deletion SCORECARD.md
@@ -5,6 +5,19 @@
> **For a head-to-head decision between Coder-Next, 27B-thinking, and 27B-no-think organized by task class, see [`COMPARISON.md`](COMPARISON.md).** This SCORECARD is the grand summary; COMPARISON is the model-selection synthesis.
>
> **Read [`KNOWN-LIMITATIONS.md`](KNOWN-LIMITATIONS.md) before quoting any cell.** Several columns are hand-graded against ground truth where it exists, "not graded" where it doesn't. Confidence levels are noted per column.
>
> **⚠️ Scope of this SCORECARD (read before comparing to other entries).** Every "27B" cell below is
> **Qwen3.6-27B at 4-bit AWQ** (Cyankiwi), and the tables cover only the `benchmarks/` microbench arms
> (27B-AWQ / Coder-Next-AWQ). They do **not** include the later `hardware-tests/` microbenches —
> **397B, Step-3.7-Flash, MiniMax-M2.7, and the clean 27B-FP8 redo** — nor do they disambiguate the four
> different "27B"s (AWQ / Q8 / FP8 / 35B-A3B sibling). For the full microbench picture across both trees
> and the quant disambiguation, see **[`MICROBENCH-INDEX.md`](MICROBENCH-INDEX.md)**.
>
> **Newer microbench results (summary; full detail in the linked entries):**
> - **397B-A17B** (Q3 GGUF), N=10: no-think 82/120, think 72/120 — *thinking net-negative*.
> - **27B-FP8**, N=5: no-think 35/60, think 29/60 — *thinking net-negative*; FP8 serving stable where Q8 failed. ([entry](hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/))
> - **Step-3.7-Flash** (NVFP4): 7/8/8 low/med/high; **MiniMax-M2.7** (NVFP4), N=5: 35/60 — "exhaustive completer" + a temp=0.3 serving-trap. ([entry](hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/))
> - Across a ~15× range of model scale, aggregate scores tie at ~7–8/12 and **thinking is net-negative everywhere**: the consistent cross-model finding.

## Cell-name legend (microbench)

@@ -257,6 +270,6 @@ These additions would tighten the recommendations above; until they land, the re
3. **Failed-run artifacts published** (receipts + transcripts for the 5+ unsuccessful local-model runs not currently in MMBT). Would let a reader see expected failure modes per model.
4. **N=10+ on the highest-signal cells** (Coder-Next on `dreamserver-1-pr-audit`, 27B on the same; both on `microbench-2026-04-28/adversarial-hallucination`). Would bound the variance the current N=3 only suggests.
5. **Different PR shapes** in the dreamserver-1-pr-audit family — the current PR has subtle architectural distinctions; a docs-only PR or a security PR would test different failure modes.
6. **Higher-precision quantizations** of the same models (FP8, BF16). Particularly for 35B-A3B which fails at 4-bit; might be a quantization-headroom issue rather than a base-model issue.
6. **Higher-precision quantizations** of the same models (FP8, BF16). **Partly done (2026-05-31):** a clean **27B-FP8** run of the full 12-family grid is published ([`hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/`](hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/)) — FP8 serving is stable and the thinking-net-negative finding holds, so the headline conclusions generalize past 4-bit AWQ for 27B. Still open: **35B-A3B** at higher precision (it fails at 4-bit; the Q8 MoE path also hits a Blackwell sm_120 kernel bug — a quant/engine-headroom question, not yet settled).

None of these are in scope for the current MMBT publication. They're separate experiments.
2 changes: 2 additions & 0 deletions benchmarks/microbench-2026-04-28/README.md
@@ -2,6 +2,8 @@

> 12 task families, 2 local models, N=3 each. Smaller-scope tasks than the dreamserver-PR-audit / wallstreet-intern-test benchmarks above — each task is a 5-30 minute deliverable rather than a multi-hour audit. Designed to surface task-class-specific differences between Qwen3.6-27B-AWQ and Qwen3-Coder-Next-AWQ that the larger benchmarks couldn't isolate.

> **Part of a larger arc.** This same 12-family microbench has since been run on more models (397B, Step-3.7-Flash, MiniMax-M2.7, and a clean 27B-FP8 redo) — some of which live under `hardware-tests/` because they needed the dual-Blackwell rig. For the full cross-tree index and the disambiguation of the four "27B"s (AWQ here, vs Q8 / FP8 / 35B-A3B), see [`../../MICROBENCH-INDEX.md`](../../MICROBENCH-INDEX.md).

## Read these first

- [`findings.md`](findings.md) — cross-cutting writeup. Headline reads, daily-driver-guide updates, caveats. Read this before drilling into individual task-family folders.
2 changes: 2 additions & 0 deletions benchmarks/microbench-phase-b-2026-05-02/README.md
@@ -1,5 +1,7 @@
# microbench-phase-b-2026-05-02 — N=10 expansion + 27B-no-think third arm

> **Part of a larger arc.** This 12-family microbench has since been extended to more models (397B, Step-3.7-Flash, MiniMax-M2.7, a clean 27B-FP8 redo), some filed under `hardware-tests/` because they needed the dual-Blackwell rig. The **thinking-net-negative** finding first sharpened here (27B 86.8% no-think vs 75% think) recurs across that whole arc. Full cross-tree index + the four-"27B" disambiguation: [`../../MICROBENCH-INDEX.md`](../../MICROBENCH-INDEX.md).

> **How this entry relates to [`microbench-2026-04-28`](../microbench-2026-04-28/)**: this entry is the *current* picture for the 4 differential cells (p2_hallucination, p3_business, p3_doc, p3_market) at N=10 across all three model arms, and the *first* picture for 27B-no-think across the full 12-family grid (N=10). The 2026-04-28 entry remains the current N=3 baseline for the other 8 cells on Coder-Next + 27B-thinking — it is **not superseded**, and many cross-references in this entry point back to it. **Read both for the full picture.**
>
> **Of the ~240 runs in this batch, this entry publishes one representative run per (cell × model arm) — 22 representatives total.** Per-run artifacts (cost.json / grade.json / label.json / summary.json / receipt.json) for the remaining ~220 runs live on the source bench's `submit/phase-b-overnight-2026-05-02` branch (sibling branch in this repo), which preserves the full transcripts + workspace tarballs for reproducibility.
2 changes: 2 additions & 0 deletions hardware-tests/README.md
@@ -28,6 +28,8 @@ Do not mix those two tables in a cross-host ranking.
| Bundle | Primary question | Main caution |
|---|---|---|
| [`qwen3.6-q8-fleet-2026-05-17`](qwen3.6-q8-fleet-2026-05-17/) | How do four local-AI hardware classes handle the same dense and MoE Qwen3.6 workloads? | Multi-user serving is held; Tower2 MoE uses a defended vLLM FP8 exception because native llama.cpp Q8 crashes. |
| [`best-stack-followup-2026-05-17`](best-stack-followup-2026-05-17/) | What's the best serving stack per platform (MLX vs Metal on M5; ROCm 7 on Strix Halo)? | Platform-specific; MLX beats Metal on M5, ROCm 7 works on Strix, no prefill lift. |
| [`qwen3.5-397b-vs-step3.7-flash-2026-05-29`](qwen3.5-397b-vs-step3.7-flash-2026-05-29/) | **(Model-behavior microbench, filed here for the rig.)** Does thinking help; do results tie across scale; 397B / Step-3.7 / MiniMax / 27B-Q4 refs. | Thinking net-negative across ~15× scale; small-N misreads cells; MiniMax temp serving-trap. A 12-family agentic microbench — see [`../MICROBENCH-INDEX.md`](../MICROBENCH-INDEX.md). Secondary: dual-GPU power. |
| [`local-ai-hardware-valuation-2026-05-17`](local-ai-hardware-valuation-2026-05-17/) | What are buyers paying per usable memory GB, bandwidth, and measured 27B Q8 tok/s? | Prices and wall-power assumptions are time-bound inputs. |
| [`vllm-power-sweep-2026-04-29`](vllm-power-sweep-2026-04-29/) | Where is the RTX PRO 6000 Blackwell LLM-serving power-cap plateau? | Tower2-only, vLLM-only, AWQ-INT4 models. |
| [`ltx23-power-sweep-2026-05-05`](ltx23-power-sweep-2026-05-05/) | Does the same GPU power-cap curve apply to diffusion/video generation? | Workload-specific to the LTX-2.3 workflow tested. |