Commit d4e9e43

fix(blog): reframe NVL72 bullet — disagg still applies, wide EP doesn't
Wider expert parallelism doesn't compound on a 10B-active / 256-small-expert model the way it does on DeepSeek R1 or Kimi K2.5, but disaggregated prefill + decode on NVL72 is still a valid next lever for MiniMax-M2.5 (KV between pools over NVLink 5, decode pool absorbs more concurrency past the single-node saturation knee). Drops the speculative FP4 KV cache and "see MTP bullet" trailers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 66a1670 commit d4e9e43

1 file changed

Lines changed: 1 addition & 1 deletion

packages/app/content/blog/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar.mdx

@@ -188,7 +188,7 @@ B200 NVFP4's perf/$ lift over H100 climbs **monotonically from 3.96x at 22 tok/s

Three gaps still expand or sharpen the headline number from here:

Removed (old line 191):
- **Why this doesn't get an NVL72 chapter.** Unlike DeepSeek R1 (37B active / 671B total) or Kimi K2.5 (32B active / 1T total) — where wider expert parallelism on NVL72 compounds via the compute-comm overlap on EP collectives over rack-scale NVLink — **MiniMax-M2.5 is too small for wide EP to be the right lever**. At 10B active params on 256 small experts, each rank in a TP=2 / 8-GPU configuration already holds only a handful of experts; widening EP across a 72-GPU NVLink domain doesn't shrink the per-rank weight footprint enough to matter. The 17.6k tok/s/GPU plateau on B200 NVFP4 is driven by attention and KV-cache HBM reads at high concurrency, not by NVLink fabric headroom. The cleaner Blackwell-side next levers for this model are **B300 NVFP4** (1.5x the dense FP4 compute per GPU at the same 8 TB/s HBM bandwidth), **FP4 KV cache** when vLLM ships it for the MiniMax attention layer, and the **MTP** path in the previous bullet.
Added (new line 191):
- **NVL72 disagg (without wide EP).** **Wider expert parallelism isn't the right lever for this model** — at 10B active params on 256 small experts, each rank in a TP=2 / 8-GPU configuration already holds only a handful of experts, so widening EP across a 72-GPU NVLink domain doesn't shrink the per-rank weight footprint enough to matter (unlike DeepSeek R1 or Kimi K2.5, where wide EP compounds via compute-comm overlap on the EP collectives). **Disaggregated prefill + decode is still on the table** though: today's single-node aggregated recipe puts both stages on the same TP=2 island and contends for HBM bandwidth at the conc 256+ saturation knee; a disagg recipe on GB200/GB300 NVL72 would move KV between dedicated prefill and decode pools over NVLink 5 and let the decode pool absorb more concurrency before saturating. No InferenceX disagg recipe for MiniMax on NVL72 has shipped yet.

For MiniMax-M2.5 serving on vLLM today, B200 NVFP4 is the cheaper choice by 4x–8.2x across every interactivity point H100 can reach.
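
The expert-count arithmetic behind the wide-EP argument in the changed bullet can be sketched as follows. This is a rough illustration only: it assumes experts are sharded evenly across an EP group that spans every GPU in the domain, which the post does not state, and `experts_per_rank` is a hypothetical helper, not part of vLLM or any recipe. It counts resident experts only and says nothing about weight bytes, attention, or KV-cache traffic.

```python
import math


def experts_per_rank(num_experts: int, ep_size: int) -> int:
    """Upper bound on experts resident on one rank under an even split.

    Assumption (not from the post): experts are divided as evenly as
    possible across ep_size ranks, so the busiest rank holds the ceiling.
    """
    if ep_size <= 0:
        raise ValueError("ep_size must be positive")
    return math.ceil(num_experts / ep_size)


NUM_EXPERTS = 256  # expert count quoted in the bullet

# 8 = single-node GPU count, 72 = NVL72 domain size; both from the bullet.
for ep_size in (8, 72):
    count = experts_per_rank(NUM_EXPERTS, ep_size)
    print(f"EP={ep_size}: at most {count} experts per rank")
```

Whether that reduction matters depends on how much of the per-rank footprint is expert weights in the first place, which is the judgement the bullet is making.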