feat(blog): B200 NVFP4 vs H100 FP8 on MiniMax-M2.5 — up to 8.2x better perf/$#387

Merged
functionstackx merged 2 commits into master from blog/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar
May 26, 2026

@functionstackx functionstackx commented May 26, 2026

Summary

  • New blog post comparing B200 vLLM NVFP4 vs H100 vLLM FP8 on MiniMax-M2.5 8K/1K (GHA run 26306422380, measured 2026-05-22).
  • Headline: up to 8.2x better performance per dollar at 110 tok/s/user (H100 FP8 at $0.74 per million tokens vs B200 NVFP4 at $0.091 per million tokens). The lift grows monotonically from 4.0x at 22 tok/s/user, because H100's curve falls off faster than B200 NVFP4's at high interactivity.
  • At the peak, the lift decomposes into a 2.94x generation step (Blackwell vs Hopper at FP8) and a 2.77x precision step (B200 FP8 → B200 NVFP4), unlocked by vllm-project/vllm #36307, the trtllm-gen FP8 MoE modular kernel that finally accepts MiniMax's routing-logits dtype.
  • On-Paper Specs section grounds the gap in silicon: B200 has 2.27x more FP8 compute, 4.55x more FP4 compute than H100 has FP8 compute, and 2.39x more HBM bandwidth, at 1.50x the TCO. The silicon ceiling is 1.51x–3.03x; the measured 8.2x is ~2.7x above the top of that range, and the gap is the kernel.
  • Quality benchmarks figure (new): MiniMax M2.5 vs Claude Opus 4.5/4.6 / Gemini 3 Pro / GPT-5.2 on SWE-Bench family + Terminal Bench + VIBE-Pro. Within 1–4 points of Opus across the board, leads on Multi-SWE-Bench.
  • MoE kernel comparison figure (supplementary, 1K/1K) showing trtllm vs deep_gemm vs triton MoE backends on B200.
  • M2.7 transferability called out in body and chart captions.
  • "What's Next" honestly notes that wide-EP on NVL72 is NOT the right next lever for a 10B-active model, and points at B300 NVFP4, FP4 KV cache, and MTP instead.
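
For reviewers who want to sanity-check the headline arithmetic, here is a minimal sketch that uses only the figures quoted in the summary above. It assumes the silicon ceiling is read as a compute ratio divided by the TCO ratio; that reading is inferred from the quoted 1.51x–3.03x range and is not confirmed by the post itself. All variable names are illustrative.

```python
# Figures quoted in the PR summary, in $ per million tokens at 110 tok/s/user.
H100_FP8_COST_PER_M = 0.74
B200_NVFP4_COST_PER_M = 0.091

# Perf/$ lift at equal interactivity is the inverse ratio of the two costs.
lift_from_cost = H100_FP8_COST_PER_M / B200_NVFP4_COST_PER_M

# The two steps quoted in the decomposition multiply into the overall lift.
GENERATION_STEP = 2.94  # Blackwell vs Hopper, both at FP8
PRECISION_STEP = 2.77   # B200 FP8 -> B200 NVFP4
lift_from_steps = GENERATION_STEP * PRECISION_STEP

# Assumption: silicon ceiling = on-paper compute ratio / TCO ratio.
FP8_COMPUTE_RATIO = 2.27         # B200 FP8 vs H100 FP8
FP4_VS_FP8_COMPUTE_RATIO = 4.55  # B200 FP4 vs H100 FP8
TCO_RATIO = 1.50
ceiling_low = FP8_COMPUTE_RATIO / TCO_RATIO
ceiling_high = FP4_VS_FP8_COMPUTE_RATIO / TCO_RATIO

# How far the quoted headline sits above the top of the silicon range.
HEADLINE_LIFT = 8.2
gap_over_ceiling = HEADLINE_LIFT / ceiling_high

print(f"lift from cost ratio:  {lift_from_cost:.2f}x")
print(f"lift from step product: {lift_from_steps:.2f}x")
print(f"silicon ceiling:        {ceiling_low:.2f}x to {ceiling_high:.2f}x")
print(f"headline over ceiling:  {gap_over_ceiling:.2f}x")
```

With the rounded inputs both derivations land near 8.1x rather than exactly 8.2x; the quoted headline presumably comes from unrounded measurements, so a small mismatch here is expected rather than a sign of an error in the post.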

Test plan

  • Vercel preview renders at `/blog/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar`
  • Hero benchmark, GPU-specs radar, quality-benchmarks, and MoE-kernel-comparison figure blocks render in both themes (the same image currently fills both the light and dark slots; drop in real dark exports later if desired)
  • Iso-interactivity table values match the live cost view for spot-checked rows (22 / 50 / 80 / 110)
  • OG image generates
  • Sitemap + RSS pick up the new slug
  • FAQ JSON-LD passes a schema validator
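
As a rough local pre-check for the FAQ JSON-LD item, the sketch below builds a schema.org `FAQPage` object and verifies its minimal shape. The helper names `make_faq_jsonld` and `check_faq_jsonld` and the sample question are hypothetical and do not come from the post; this is not a substitute for a real schema validator.

```python
import json


def make_faq_jsonld(pairs):
    """Build a schema.org FAQPage dict from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def check_faq_jsonld(doc):
    """Return a list of problems; an empty list means the minimal shape is right."""
    problems = []
    if doc.get("@type") != "FAQPage":
        problems.append("top-level @type must be FAQPage")
    entities = doc.get("mainEntity")
    if not isinstance(entities, list) or not entities:
        problems.append("mainEntity must be a non-empty list")
        return problems
    for i, item in enumerate(entities):
        if item.get("@type") != "Question":
            problems.append(f"mainEntity[{i}] must have @type Question")
        if not item.get("name"):
            problems.append(f"mainEntity[{i}] is missing name")
        answer = item.get("acceptedAnswer") or {}
        if answer.get("@type") != "Answer" or not answer.get("text"):
            problems.append(f"mainEntity[{i}] needs an Answer with text")
    return problems


# Hypothetical sample entry; the wording is illustrative, not the post's FAQ.
faq = make_faq_jsonld([
    (
        "How much better is B200 NVFP4 perf/$ than H100 FP8 on MiniMax-M2.5?",
        "Up to 8.2x at 110 tok/s/user in the 2026-05-22 run.",
    ),
])

# Round-trip through JSON, as the page would embed it in a script tag.
payload = json.dumps(faq, indent=2)
problems = check_faq_jsonld(json.loads(payload))
print(problems)
```

A check like this only catches missing keys or wrong types; rich-result eligibility still needs the external validator named in the test plan.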

🤖 Generated with Claude Code


Note

Low Risk
Content-only MDX addition; no application logic, auth, or data-path changes.

Overview
Adds a new InferenceX blog article at packages/app/content/blog/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar.mdx comparing B200 vLLM NVFP4 vs H100 vLLM FP8 on MiniMax-M2.5 at 8K/1K (2026-05-22 run), with headline up to ~8.2× better perf/$ at iso-interactivity and a decomposition into ~2.94× generation (FP8) plus ~2.77× precision (NVFP4), tied to vLLM PR #36307 (modular trtllm-gen FP8 MoE kernel for MiniMax routing).

The post includes benchmark and cost tables, iso-interactivity perf/$ analysis, Figure assets (throughput, GPU-specs radar, coding-quality bars, MoE-kernel comparison), DashboardCTA links to the filtered InferenceX views, and FAQPage JSON-LD for SEO.

Reviewed by Cursor Bugbot for commit d4e9e43. Bugbot is set up for automated code reviews on this repo.

…r perf/$

New post comparing B200 vLLM NVFP4 vs H100 vLLM FP8 on MiniMax-M2.5 8K/1K
(GHA run 26306422380, measured 2026-05-22). Headline: up to 8.2x better
performance per dollar at 110 tok/s/user, growing monotonically from
4.0x at 22 tok/s/user. Decomposes at the peak into a 2.94x generation
step (Blackwell vs Hopper at FP8) and a 2.77x precision step (B200 FP8
→ B200 NVFP4) unlocked by vLLM PR #36307 (the trtllm-gen FP8 MoE
modular kernel that finally accepts MiniMax's routing-logits dtype).

Sections: lede with decomposition, hero throughput chart, model
architecture + M2.7 transferability note, "Why MiniMax-M2.5 Is Worth
Optimizing For" with quality benchmarks vs Claude Opus 4.5/4.6 / Gemini
3 Pro / GPT-5.2, On-Paper H100 vs B200 specs (radar + table + silicon
ratios that bound the perf/$ ceiling), TRT-LLM MoE Kernel Integration
into vLLM (PR #36307 + per-kernel comparison figure), per-config
tables, iso-interactivity perf/$ table, What's Next (MTP, why NVL72
wide-EP isn't the right lever at 10B active, H100 stack room), FAQ.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot commented May 26, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| inferencemax-app | Ready | Preview, Comment | May 26, 2026 3:37am |

Wider expert parallelism doesn't compound on a 10B-active / 256-small-expert
model the way it does on DeepSeek R1 or Kimi K2.5, but disaggregated
prefill + decode on NVL72 is still a valid next lever for MiniMax-M2.5 (KV
between pools over NVLink 5, decode pool absorbs more concurrency past the
single-node saturation knee). Drops the speculative FP4 KV cache and
"see MTP bullet" trailers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit d4e9e43.

@functionstackx functionstackx merged commit 36be82b into master May 26, 2026
20 checks passed
@functionstackx functionstackx deleted the blog/b200-minimax-m2-5-vllm-nvfp4-vs-h100-fp8-perf-per-dollar branch May 26, 2026 03:42