fix(blog): reframe NVL72 bullet — disagg still applies, wide EP doesn't

functionstackx · claude · functionstackx · commit d4e9e4372aeb · 2026-05-25T23:36:31.000-04:00
Wider expert parallelism doesn't compound on a 10B-active / 256-small-expert
model the way it does on DeepSeek R1 or Kimi K2.5, but disaggregated
prefill + decode on NVL72 is still a valid next lever for MiniMax-M2.5: KV
moves between prefill and decode pools over NVLink 5, and the decode pool
absorbs more concurrency past the single-node saturation knee. Drops the
speculative FP4 KV cache and "see MTP bullet" trailers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/packages/app/content/blog/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar.mdx b/packages/app/content/blog/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar.mdx
@@ -188,7 +188,7 @@ B200 NVFP4's perf/$ lift over H100 climbs **monotonically from 3.96x at 22 tok/s
 
 Three gaps still expand or sharpen the headline number from here:
 
-- **Why this doesn't get an NVL72 chapter.** Unlike DeepSeek R1 (37B active / 671B total) or Kimi K2.5 (32B active / 1T total) — where wider expert parallelism on NVL72 compounds via the compute-comm overlap on EP collectives over rack-scale NVLink — **MiniMax-M2.5 is too small for wide EP to be the right lever**. At 10B active params on 256 small experts, each rank in a TP=2 / 8-GPU configuration already holds only a handful of experts; widening EP across a 72-GPU NVLink domain doesn't shrink the per-rank weight footprint enough to matter. The 17.6k tok/s/GPU plateau on B200 NVFP4 is driven by attention and KV-cache HBM reads at high concurrency, not by NVLink fabric headroom. The cleaner Blackwell-side next levers for this model are **B300 NVFP4** (1.5x the dense FP4 compute per GPU at the same 8 TB/s HBM bandwidth), **FP4 KV cache** when vLLM ships it for the MiniMax attention layer, and the **MTP** path in the previous bullet.
+- **NVL72 disagg (without wide EP).** **Wider expert parallelism isn't the right lever for this model**: at 10B active params on 256 small experts, each rank in a TP=2 / 8-GPU configuration already holds only a handful of experts, so widening EP across a 72-GPU NVLink domain doesn't shrink the per-rank weight footprint enough to matter (unlike DeepSeek R1 or Kimi K2.5, where wide EP compounds via compute-comm overlap on the EP collectives). **Disaggregated prefill + decode is still on the table**, though: today's single-node aggregated recipe puts both stages on the same TP=2 island, where they contend for HBM bandwidth at the concurrency 256+ saturation knee; a disagg recipe on GB200/GB300 NVL72 would move KV between dedicated prefill and decode pools over NVLink 5 and let the decode pool absorb more concurrency before saturating. No InferenceX disagg recipe for MiniMax on NVL72 has shipped yet.
 
 For MiniMax-M2.5 serving on vLLM today, B200 NVFP4 is the cheaper choice by 4x–8.2x across every interactivity point H100 can reach.