From 09af796eedd717ba931e0e7f6bbb33e9b20552e5 Mon Sep 17 00:00:00 2001
From: User Name
Date: Sun, 31 May 2026 21:54:29 -0400
Subject: [PATCH] docs: microbench cross-tree index + synthesis-doc currency (Tier 2/3 org fixes)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Addresses the repo-organization audit findings:

Tier 3 (taxonomy): the 12-family agentic microbench (a model-behavior study) is split across benchmarks/ and hardware-tests/ by which GPU/quant a model needed, not by question. Low-disruption fix (no directory moves):
- NEW MICROBENCH-INDEX.md — gathers all 12-family microbench entries across both trees + disambiguates the four "27B"s (AWQ / Q8 / FP8 / 35B-A3B).
- "where this lives" taxonomy notes in the benchmarks/ microbench READMEs (the 397B and 27B-FP8 entry notes ship in PRs #33/#34).

Tier 2 (currency): SCORECARD/COMPARISON/ROADMAP were frozen at 2026-05-02 and still listed the now-done FP8 re-run as future work.
- SCORECARD: 27B-quant scope banner (its "27B" = AWQ) + newer-results summary (397B/Step/MiniMax/FP8) + "would change this picture" #6 marked partly-done.
- COMPARISON: "FP8 re-run is highest-priority follow-up" -> done, cross-linked.
- ROADMAP item 1: marked DONE for 27B, narrowed remaining to Coder/35B-A3B.

Tier 1 (discoverability): indexed the previously-unindexed best-stack and qwen3.5-397b entries in root README + hardware-tests/README; added a MICROBENCH-INDEX row to the five-minute-answers table.

Merge note: references the 27B-FP8 (PR #34) and MiniMax (PR #33) entries — merge those first. Light expected overlap with #34 on the README index tables (inserted at different anchors to minimize it) and with #33/#34 on claims.yaml (not touched here).

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 COMPARISON.md                                 |  4 +-
 MICROBENCH-INDEX.md                           | 44 +++++++++++++++++++
 README.md                                     |  3 ++
 ROADMAP.md                                    |  8 ++--
 SCORECARD.md                                  | 15 ++++++-
 benchmarks/microbench-2026-04-28/README.md    |  2 +
 .../microbench-phase-b-2026-05-02/README.md   |  2 +
 hardware-tests/README.md                      |  2 +
 8 files changed, 74 insertions(+), 6 deletions(-)
 create mode 100644 MICROBENCH-INDEX.md

diff --git a/COMPARISON.md b/COMPARISON.md
index 643b2693..9bc37c3e 100644
--- a/COMPARISON.md
+++ b/COMPARISON.md
@@ -241,10 +241,10 @@ All three arms use **Cyankiwi 4-bit AWQ** community quants. Multiple field repor
 What this means for the data here:

 - Within-quant comparison (Coder-Next vs 27B at the same Cyankiwi 4-bit AWQ) **is** informative — the differential is a model-behavior gap, not a quant artifact.
-- Absolute model capability at higher precision (FP8 / UD4 / BF16) is **not** characterized.
+- Absolute model capability at higher precision (FP8 / UD4 / BF16) is **partly characterized now** — see below.
 - Effects that depend on a thinking-mechanism (the `--no-think` ship-rate jump, the word-trim loop reduction) are **unlikely to be quant-specific** — they're about the trace, not the weights' precision.

-The FP8 re-run is the highest-priority follow-up.
+**Update (2026-05-31): the FP8 re-run is done.** A clean **Qwen3.6-27B-FP8** run of the full 12-family grid is published at [`hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/`](hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/). It confirms the prediction above: **thinking is still net-negative at FP8** (no-think 35/60 vs think 29/60), so that finding is *not* a 4-bit-AWQ artifact. FP8 serving was also stable (113/119 clean) where the earlier Q8 attempt was a serving failure. The AWQ-underperforms-FP8 *absolute-capability* question for 27B is now directly addressable from that entry; 35B-A3B at higher precision remains the open follow-up. See [`MICROBENCH-INDEX.md`](MICROBENCH-INDEX.md) for the full cross-quant picture.

 ### Other VRAM tiers

diff --git a/MICROBENCH-INDEX.md b/MICROBENCH-INDEX.md
new file mode 100644
index 00000000..6d1a97c7
--- /dev/null
+++ b/MICROBENCH-INDEX.md
@@ -0,0 +1,44 @@
+# Microbench Index — the 12-family agentic microbench, across both trees
+
+> **Why this file exists.** The MMBT "12-family agentic microbench" (the same harness, task families,
+> think/no-think comparison, and `done_signal`/PASS scorecard) is a **model-behavior** study. Its entries
+> are split across two top-level trees for an *accidental* reason — some models needed the dual-Blackwell
+> rig (so they landed under `hardware-tests/`), the earlier 4-bit runs are under `benchmarks/`. This index
+> gathers all of them in one place so you don't have to know which tree a model happened to land in.
+>
+> Each entry below is a 12-family microbench. The `hardware-tests/` ones also carry a secondary power
+> section, but their *headline* is model behavior.
+
+## All 12-family microbench entries
+
+| Entry | Tree | Models / arms | N | Headline |
+|---|---|---|---|---|
+| [`microbench-2026-04-28`](benchmarks/microbench-2026-04-28/) | benchmarks/ | Qwen3.6-**27B-AWQ** vs Qwen3-Coder-Next-**AWQ** | 3 | Aggregate-tied ~7/12 each; complementary task-class strengths; Coder-Next much faster/cheaper. |
+| [`microbench-phase-b-2026-05-02`](benchmarks/microbench-phase-b-2026-05-02/) | benchmarks/ | + **27B-AWQ no-think** third arm; 4 differential cells to N=10 | 10 | 27B ships 86.8% no-think vs 75% think (same `p3_doc` word-limit loop); Coder-Next market 0/10 (Wilson [0, 27.8%]). |
+| [`qwen3.5-397b-vs-step3.7-flash-2026-05-29`](hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/) | hardware-tests/ | **397B-A17B** (Q3 GGUF) no-think/think; **Step-3.7-Flash** (NVFP4) low/med/high; **MiniMax-M2.7** (NVFP4); **27B-Q4 / Coder-Q4** refs | 10 / 1 / 5 | Thinking net-negative (397B 82→72); small-N misreads cells; aggregate ties ~7–8/12 across ~15× scale; MiniMax "exhaustive completer" + temp serving-trap. |
+| [`qwen3.6-27b-fp8-microbench-2026-05-31`](hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/) | hardware-tests/ | Qwen3.6-**27B-FP8** no-think/think | 5 | Thinking net-negative (35/60 vs 29/60); `p2_triage` 0/5 think vs 5/5 no-think; FP8 serving stable where Q8 failed. |
+
+## The four "27B"s — don't conflate them
+
+This study references Qwen3.6-27B in **four** different forms. When you see "27B," check which:
+
+| Label | What it is | Where |
+|---|---|---|
+| **27B-AWQ** (= "27B-Q4") | Cyankiwi 4-bit AWQ, vLLM | `microbench-2026-04-28`, `microbench-phase-b-2026-05-02`; the "27B-Q4 ref" columns in the 397B entry; the AWQ rows in `SCORECARD.md` / `COMPARISON.md` |
+| **27B-Q8** | Q8_0 GGUF, llama.cpp | `hardware-tests/qwen3.6-q8-fleet-2026-05-17` (throughput); attempted on the microbench harness but **excluded as a serving failure** (23/36 token-runaway) — see the 397B entry |
+| **27B-FP8** | official FP8, vLLM | `hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31` (this is the clean redo of the excluded Q8/FP8 attempt) |
+| **35B-A3B** (sibling MoE) | Qwen3.6-35B-A3B, various quants | referenced as the small-MoE comparator; Q8 MoE crashes on Blackwell sm_120 (known kernel bug) |
+
+## The one finding that holds across all of them
+
+**Thinking is net-negative on this agentic microbench**, consistently across a ~15× parameter range:
+397B 82→72 (−10), 27B-AWQ 86.8%→75% ship rate, 27B-FP8 35→29 (−6) — all the same direction, largely via
+the same `p3_doc` word-limit / over-production mechanism (see [issue #36](https://github.com/Light-Heart-Labs/MMBT-Messy-Model-Bench-Tests/issues/36)
+for the grader artifact that compounds it). Failure *temperament* tracks lineage, not size: Qwen-family
+models (397B, 27B) **stall**; Coder-Next / Flash / MiniMax(@temp0.3) **run away**.
+
+## Note on organization
+
+This index is the low-disruption fix for the cross-tree split (it avoids moving directories, which would
+break links and git history). If the microbench corpus keeps growing, the cleaner long-term move is a
+dedicated `microbenchmarks/` tree; this index is the bridge until then.
diff --git a/README.md b/README.md
index bd464e72..fbcf7e0a 100644
--- a/README.md
+++ b/README.md
@@ -15,6 +15,7 @@ but I'm making it public so that other people can use it too.
 | Where the benchmark folders start | [`benchmarks/README.md`](benchmarks/README.md) — agent-task benchmark landing page |
 | **"Coder-Next or 27B (or 27B-no-think) for my task?"** | [`COMPARISON.md`](COMPARISON.md) — head-to-head decision doc |
 | The full single-table comparison across all entries | [`SCORECARD.md`](SCORECARD.md) |
+| **All 12-family microbench results (across both trees) + the four "27B"s** | [`MICROBENCH-INDEX.md`](MICROBENCH-INDEX.md) — cross-tree microbench index + quant disambiguation |
 | How repo size is managed | [`REPO-SPACE.md`](REPO-SPACE.md) — storage hotspots and artifact policy |
 | How to benchmark a new local model | [`tooling/ADDING-A-MODEL.md`](tooling/ADDING-A-MODEL.md) |
 | How to replay a specific past run | [`tooling/REPRODUCING.md`](tooling/REPRODUCING.md) |
@@ -74,6 +75,8 @@ For the benchmark landing page and per-folder navigation map, start with
 |---|---|---|
 | [`vllm-power-sweep-2026-04-29`](hardware-tests/vllm-power-sweep-2026-04-29/) | 7 GPU power caps × 5 min sustained vLLM load × 2 concurrencies (N=1, N=32) × 2 AWQ-INT4 models (Dense Qwen3.6-27B, MoE Coder-Next), 28 cells total, on RTX PRO 6000 Blackwell. | Throughput-vs-power-cap curve, native draw at unbounded cap, and per-cap thermal envelope. Validates the 500 W production cap (within 3.3 % of optimal in every scenario), and shows Coder-Next ≈ 1.8× faster batched / 2.3× faster single-stream than dense 27 B at every cap. The findings doc carries an "Audit notes" section flagging two per-cap "winner" markers that don't survive a re-read of the raw CSVs (a vLLM container warmup transient and a single-window thermal clock dip distort the per-cap winners without changing the plateau-shape headline). |
 | [`qwen3.6-q8-fleet-2026-05-17`](hardware-tests/qwen3.6-q8-fleet-2026-05-17/) | Same Qwen3.6 Q8 GGUF model bytes across Blackwell 6000 Tower, DGX Spark, EVO X2 / Strix Halo, and M5 Max MacBook Pro under a pinned llama.cpp SHA, with vLLM appendix rows on Tower2. | Cross-platform single-user prefill/decode/TTFT, backend failure modes, thermal field notes, and cost-throughput caveats for local AI hardware debates. Multi-user serving conclusions are explicitly held. |
+| [`best-stack-followup-2026-05-17`](hardware-tests/best-stack-followup-2026-05-17/) | Follow-up bundle: MLX on the M5 Max, and Dream-Server on ROCm 7 on the Strix Halo. | Best-serving-stack notes per platform (MLX beats Metal on M5; ROCm 7 works on Strix; no prefill lift). |
+| [`qwen3.5-397b-vs-step3.7-flash-2026-05-29`](hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/) | **A 12-family agentic microbench (model behavior), filed here because it needed the dual-Blackwell rig.** 397B-A17B (Q3 GGUF) no-think/think N=10, + Step-3.7-Flash, MiniMax-M2.7, and 27B/Coder-Q4 refs. | Thinking is net-negative (397B 82→72); small-N misreads cells; aggregate ties ~7–8/12 across ~15× scale; MiniMax temp serving-trap. Secondary: dual-GPU power telemetry. See [`MICROBENCH-INDEX.md`](MICROBENCH-INDEX.md). |
 | [`local-ai-hardware-valuation-2026-05-17`](hardware-tests/local-ai-hardware-valuation-2026-05-17/) | Derived valuation worksheet built from editable price/spec inputs plus the Qwen3.6 27B Q8 hardware measurements. | Recomputable buyer metrics: `$/usable AI GB`, `$/GB/s`, `$/measured decode tok/s`, `$/measured prefill tok/s`, capacity-bandwidth score, and rough 5-year energy/TCO lines. Use this when market prices change and you want the same mental model to survive the refresh. |
 | [`step3.7-flash-nvfp4-dual-blackwell-2026-05-28`](hardware-tests/step3.7-flash-nvfp4-dual-blackwell-2026-05-28/) | Setup/config note: serving `stepfun-ai/Step-3.7-Flash-NVFP4` (201B MoE VLM, day-one) under vLLM on 2× RTX PRO 6000 Blackwell (sm_120, no NVLink), TP=2, native NVFP4 + FP8 KV. | The working launch command and the four non-obvious flags it took to get there, with full diagnostic trail: `--disable-custom-all-reduce` (custom all-reduce deadlocks without P2P/NVLink), `--moe-backend cutlass` (only native-FP4 MoE kernel that supports the model's SWIGLUSTEP activation), no expert-parallel, native max-model-len. No official 2×6000 recipe exists upstream. Companion to the Step-3.7 microbench entry. |

diff --git a/ROADMAP.md b/ROADMAP.md
index e2e897f8..1cd4baf2 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -8,13 +8,15 @@

 ## Active follow-ups (in priority order)

-### 1. FP8 re-run of the 12-cell microbench grid   **[contributor-welcome]**
+### 1. FP8 re-run of the 12-cell microbench grid   ~~**[contributor-welcome]**~~ **✅ DONE (2026-05-31) for 27B**

 **Source**: [`KNOWN-LIMITATIONS.md` § Cyankiwi 4-bit AWQ field reports](KNOWN-LIMITATIONS.md#quantization-specificity), [`benchmarks/microbench-phase-b-2026-05-02/findings.md` § Recommended follow-ups](benchmarks/microbench-phase-b-2026-05-02/findings.md#recommended-follow-ups)

-Multiple practitioners report that the Cyankiwi 4-bit AWQ quants underperform official Qwen FP8 of the same base models. Re-running the full 12-cell × N=10 grid on FP8 would let current findings generalize across quants or be bounded as quant-specific.
+Multiple practitioners report that the Cyankiwi 4-bit AWQ quants underperform official Qwen FP8 of the same base models. Re-running the full 12-cell grid on FP8 would let current findings generalize across quants or be bounded as quant-specific.

-What to do: pull official Qwen FP8 quants, run the 4-command friendly path in [`tooling/ADDING-A-MODEL.md`](tooling/ADDING-A-MODEL.md) for each model arm, submit a PR with the results.
+**Done for Qwen3.6-27B:** [`hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/`](hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/) — full 12-family grid, N=5, both reasoning modes, official FP8 on vLLM. Result: the **thinking-net-negative** finding holds at FP8 (no-think 35/60 vs think 29/60), so it's not a 4-bit-AWQ artifact; FP8 serving was stable (113/119 clean) where the Q8 attempt failed. **Still open (contributor-welcome):** the same FP8/higher-precision re-run for **Coder-Next** and especially **35B-A3B** (fails at 4-bit; Q8 MoE also hits a Blackwell sm_120 kernel bug — a quant/engine-headroom question). See [`MICROBENCH-INDEX.md`](MICROBENCH-INDEX.md).
+
+What to do: pull official Qwen FP8 quants, run the 4-command friendly path in [`tooling/ADDING-A-MODEL.md`](tooling/ADDING-A-MODEL.md) for each remaining model arm, submit a PR with the results.

 Hardware: needs FP8-capable GPU. RTX PRO 6000 / H100 / similar.

diff --git a/SCORECARD.md b/SCORECARD.md
index 9299750a..daa2f48b 100644
--- a/SCORECARD.md
+++ b/SCORECARD.md
@@ -5,6 +5,19 @@
 > **For a head-to-head decision between Coder-Next, 27B-thinking, and 27B-no-think organized by task class, see [`COMPARISON.md`](COMPARISON.md).** This SCORECARD is the grand summary; COMPARISON is the model-selection synthesis.
 >
 > **Read [`KNOWN-LIMITATIONS.md`](KNOWN-LIMITATIONS.md) before quoting any cell.** Several columns are hand-graded against ground truth where it exists, "not graded" where it doesn't. Confidence levels are noted per column.
+>
+> **⚠️ Scope of this SCORECARD (read before comparing to other entries).** Every "27B" cell below is
+> **Qwen3.6-27B at 4-bit AWQ** (Cyankiwi), and the tables cover only the `benchmarks/` microbench arms
+> (27B-AWQ / Coder-Next-AWQ). They do **not** include the later `hardware-tests/` microbenches —
+> **397B, Step-3.7-Flash, MiniMax-M2.7, and the clean 27B-FP8 redo** — nor do they disambiguate the four
+> different "27B"s (AWQ / Q8 / FP8 / 35B-A3B sibling). For the full microbench picture across both trees
+> and the quant disambiguation, see **[`MICROBENCH-INDEX.md`](MICROBENCH-INDEX.md)**.
+>
+> **Newer microbench results (summary; full detail in the linked entries):**
+> - **397B-A17B** (Q3 GGUF), N=10: no-think 82/120, think 72/120 — *thinking net-negative*.
+> - **27B-FP8**, N=5: no-think 35/60, think 29/60 — *thinking net-negative*; FP8 serving stable where Q8 failed. ([entry](hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/))
+> - **Step-3.7-Flash** (NVFP4): 7/8/8 low/med/high; **MiniMax-M2.7** (NVFP4), N=5: 35/60 — "exhaustive completer" + a temp=0.3 serving-trap. ([entry](hardware-tests/qwen3.5-397b-vs-step3.7-flash-2026-05-29/))
+> - Across ~15× of scale the aggregate ties ~7–8/12 and **thinking is net-negative everywhere** — the consistent cross-model finding.

 ## Cell-name legend (microbench)

@@ -257,6 +270,6 @@ These additions would tighten the recommendations above; until they land, the re
 3. **Failed-run artifacts published** (receipts + transcripts for the 5+ unsuccessful local-model runs not currently in MMBT). Would let a reader see expected failure modes per model.
 4. **N=10+ on the highest-signal cells** (Coder-Next on `dreamserver-1-pr-audit`, 27B on the same; both on `microbench-2026-04-28/adversarial-hallucination`). Would bound the variance the current N=3 only suggests.
 5. **Different PR shapes** in the dreamserver-1-pr-audit family — the current PR has subtle architectural distinctions; a docs-only PR or a security PR would test different failure modes.
-6. **Higher-precision quantizations** of the same models (FP8, BF16). Particularly for 35B-A3B which fails at 4-bit; might be a quantization-headroom issue rather than a base-model issue.
+6. **Higher-precision quantizations** of the same models (FP8, BF16). **Partly done (2026-05-31):** a clean **27B-FP8** run of the full 12-family grid is published ([`hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/`](hardware-tests/qwen3.6-27b-fp8-microbench-2026-05-31/)) — FP8 serving is stable and the thinking-net-negative finding holds, so the headline conclusions generalize past 4-bit AWQ for 27B. Still open: **35B-A3B** at higher precision (it fails at 4-bit; the Q8 MoE path also hits a Blackwell sm_120 kernel bug — a quant/engine-headroom question, not yet settled).

 None of these are in scope for the current MMBT publication. They're separate experiments.
diff --git a/benchmarks/microbench-2026-04-28/README.md b/benchmarks/microbench-2026-04-28/README.md
index 19689d88..bee3b3f0 100644
--- a/benchmarks/microbench-2026-04-28/README.md
+++ b/benchmarks/microbench-2026-04-28/README.md
@@ -2,6 +2,8 @@

 > 12 task families, 2 local models, N=3 each. Smaller-scope tasks than the dreamserver-PR-audit / wallstreet-intern-test benchmarks above — each task is a 5-30 minute deliverable rather than a multi-hour audit. Designed to surface task-class-specific differences between Qwen3.6-27B-AWQ and Qwen3-Coder-Next-AWQ that the larger benchmarks couldn't isolate.

+> **Part of a larger arc.** This same 12-family microbench has since been run on more models (397B, Step-3.7-Flash, MiniMax-M2.7, and a clean 27B-FP8 redo) — some of which live under `hardware-tests/` because they needed the dual-Blackwell rig. For the full cross-tree index and the disambiguation of the four "27B"s (AWQ here, vs Q8 / FP8 / 35B-A3B), see [`../../MICROBENCH-INDEX.md`](../../MICROBENCH-INDEX.md).
+
 ## Read these first

 - [`findings.md`](findings.md) — cross-cutting writeup. Headline reads, daily-driver-guide updates, caveats. Read this before drilling into individual task-family folders.
diff --git a/benchmarks/microbench-phase-b-2026-05-02/README.md b/benchmarks/microbench-phase-b-2026-05-02/README.md
index 5a15fe22..6cc8c753 100644
--- a/benchmarks/microbench-phase-b-2026-05-02/README.md
+++ b/benchmarks/microbench-phase-b-2026-05-02/README.md
@@ -1,5 +1,7 @@
 # microbench-phase-b-2026-05-02 — N=10 expansion + 27B-no-think third arm

+> **Part of a larger arc.** This 12-family microbench has since been extended to more models (397B, Step-3.7-Flash, MiniMax-M2.7, a clean 27B-FP8 redo), some filed under `hardware-tests/` because they needed the dual-Blackwell rig. The **thinking-net-negative** finding first sharpened here (27B 86.8% no-think vs 75% think) recurs across that whole arc. Full cross-tree index + the four-"27B" disambiguation: [`../../MICROBENCH-INDEX.md`](../../MICROBENCH-INDEX.md).
+
 > **How this entry relates to [`microbench-2026-04-28`](../microbench-2026-04-28/)**: this entry is the *current* picture for the 4 differential cells (p2_hallucination, p3_business, p3_doc, p3_market) at N=10 across all three model arms, and the *first* picture for 27B-no-think across the full 12-family grid (N=10). The 2026-04-28 entry remains the current N=3 baseline for the other 8 cells on Coder-Next + 27B-thinking — it is **not superseded**, and many cross-references in this entry point back to it. **Read both for the full picture.**
 >
 > **Of the ~240 runs in this batch, this entry publishes one representative run per (cell × model arm) — 22 representatives total.** Per-run artifacts (cost.json / grade.json / label.json / summary.json / receipt.json) for the remaining ~220 runs live on the source bench's `submit/phase-b-overnight-2026-05-02` branch (sibling branch in this repo), which preserves the full transcripts + workspace tarballs for reproducibility.
diff --git a/hardware-tests/README.md b/hardware-tests/README.md
index 3ade3204..266216bf 100644
--- a/hardware-tests/README.md
+++ b/hardware-tests/README.md
@@ -28,6 +28,8 @@ Do not mix those two tables in a cross-host ranking.
 | Bundle | Primary question | Main caution |
 |---|---|---|
 | [`qwen3.6-q8-fleet-2026-05-17`](qwen3.6-q8-fleet-2026-05-17/) | How do four local-AI hardware classes handle the same dense and MoE Qwen3.6 workloads? | Multi-user serving is held; Tower2 MoE uses a defended vLLM FP8 exception because native llama.cpp Q8 crashes. |
+| [`best-stack-followup-2026-05-17`](best-stack-followup-2026-05-17/) | What's the best serving stack per platform (MLX vs Metal on M5; ROCm 7 on Strix Halo)? | Platform-specific; MLX beats Metal on M5, ROCm 7 works on Strix, no prefill lift. |
+| [`qwen3.5-397b-vs-step3.7-flash-2026-05-29`](qwen3.5-397b-vs-step3.7-flash-2026-05-29/) | **(Model-behavior microbench, filed here for the rig.)** Does thinking help; do results tie across scale; 397B / Step-3.7 / MiniMax / 27B-Q4 refs. | Thinking net-negative across ~15× scale; small-N misreads cells; MiniMax temp serving-trap. A 12-family agentic microbench — see [`../MICROBENCH-INDEX.md`](../MICROBENCH-INDEX.md). Secondary: dual-GPU power. |
 | [`local-ai-hardware-valuation-2026-05-17`](local-ai-hardware-valuation-2026-05-17/) | What are buyers paying per usable memory GB, bandwidth, and measured 27B Q8 tok/s? | Prices and wall-power assumptions are time-bound inputs. |
 | [`vllm-power-sweep-2026-04-29`](vllm-power-sweep-2026-04-29/) | Where is the RTX PRO 6000 Blackwell LLM-serving power-cap plateau? | Tower2-only, vLLM-only, AWQ-INT4 models. |
 | [`ltx23-power-sweep-2026-05-05`](ltx23-power-sweep-2026-05-05/) | Does the same GPU power-cap curve apply to diffusion/video generation? | Workload-specific to the LTX-2.3 workflow tested. |