feat(blog): GB200 NVL72 vs B200 disagg on DeepSeek R1 FP4 — up to 4.4x at 125 tok/s/user by functionstackx · Pull Request #381 · SemiAnalysisAI/InferenceX-app

functionstackx · 2026-05-26T00:46:10Z

Summary

New blog post comparing GB200 NVL72 vs B200, both running Dynamo TRT-LLM + MTP and both disaggregated prefill/decode, on DeepSeek R1 0528 FP4 1k/1k.

Headline numbers (InferenceX 2026-05-22, run 26306422380)

Peak iso-interactivity gap: 4.39x at 125 tok/s/user (4,130 vs 941 tok/s/GPU on the spline-interpolated Pareto)
Peak throughput per GPU: 1.17x (14,659 vs 12,515) at the left end where both SKUs converge on narrow EP=4 + DP attention
Honest inversion: B200 wins above ~250 tok/s/user by ~1.2x — small batches fit on one NVLink island and the cross-rack hop becomes pure overhead
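
The two headline ratios can be re-derived from the quoted tok/s/GPU figures. A minimal sketch; the dictionary keys and the `ratio` helper are illustrative names, not identifiers from the repo:

```python
# Headline tok/s/GPU figures quoted above: (GB200 NVL72, B200).
HEADLINE = {
    "iso_interactivity_125": (4130, 941),   # at 125 tok/s/user
    "peak_throughput": (14659, 12515),      # left end of the chart
}


def ratio(gb200: float, b200: float) -> float:
    """Return the GB200 NVL72 / B200 throughput ratio, rounded to 2 places."""
    return round(gb200 / b200, 2)


ratios = {name: ratio(*pair) for name, pair in HEADLINE.items()}
print(ratios)
```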

Technical framing

The post explains the gap via compute-comm overlap on the EP dispatch/combine collectives. NVLink 5 at 900 GB/s per GPU uni-directional lets the dispatch fit inside the GEMM time budget; ConnectX-7 RoCEv2 Ethernet at 50 GB/s per GPU (400 Gbit/s ÷ 8) is 18x slower and exposes the collective as raw communication time between kernel launches. Includes the GB200 NVL72 rack diagram (72 GPUs / 9 NVSwitch5 trays / 4 33kW power shelves) right after the lede so readers see the scale-up domain visually before the prose.
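
The bandwidth arithmetic is small enough to check by hand. The sketch below reproduces the 18x figure and shows how a 36x figure would arise if NVLink's bi-directional number were divided by the uni-directional NIC number (presumably the source of the earlier error; the PR does not spell this out). `exposed_comm_s` is a deliberately simplified overlap model and an assumption of this example, not the post's method: it treats the collective as fully hidden whenever it finishes within the GEMM time.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert Gbit/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8


NVLINK5_UNI_GB_S = 900.0                   # per GPU, uni-directional (from the PR)
NVLINK5_BIDI_GB_S = 2 * NVLINK5_UNI_GB_S   # 1,800 GB/s bi-directional
CX7_UNI_GB_S = gbit_to_gbyte(400)          # 400 Gbit/s -> 50 GB/s uni-directional

like_for_like = NVLINK5_UNI_GB_S / CX7_UNI_GB_S   # uni-di vs uni-di
mixed_units = NVLINK5_BIDI_GB_S / CX7_UNI_GB_S    # bi-di vs uni-di (misleading)


def exposed_comm_s(t_gemm_s: float, payload_gb: float, link_gb_s: float) -> float:
    """Communication time left over after overlapping with a GEMM.

    Simplified assumption: perfect overlap, so only the part of the
    transfer that outlasts the GEMM shows up as exposed time.
    """
    t_comm_s = payload_gb / link_gb_s
    return max(0.0, t_comm_s - t_gemm_s)


print(like_for_like, mixed_units)
```

Under this model, a payload and GEMM time that leave zero exposed time on NVLink can leave most of the transfer exposed at 50 GB/s, which is the shape of the argument the post makes.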

Skill updates bundled in this PR

Lessons that came out of writing this draft, codified in .claude/skills/write-inferencex-blog/SKILL.md:

Three-regime mapping fixed. High interactivity (small batch) is weight-bandwidth-bound; low interactivity (huge batch) is compute + KV-bandwidth-bound; middle is network-bound. Previously had weight-bandwidth-bound labeled on the wrong side of the chart.
Bandwidth-units rule. Always state uni-di vs bi-di explicitly, convert Gbit/s → GB/s inline, NVLink-to-IB/RoCE ratio is 18x not 36x. Flags the prior Kimi K2.5 post for the same 36x error that should be corrected separately.
Browser-editor collision warning + kill-editor-before-commit step. The auto-save race overwrote my expansion paragraph twice during this draft.
"Do not put the interpolation algorithm in the body" — monotone-cubic-Hermite mentions and source-file paths in published prose are slop.
House style. Vague positional adjectives ("middle of the curve") banned in favor of concrete interactivity ranges.
New "Reusable technical framings" section with the rack-scale-wide-EP-MoE template (the 5-step compute-comm overlap argument).
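
The corrected three-regime mapping can be written down as a lookup. This is only a sketch: the 75 and 175 tok/s/user edges are taken from the commit message's "75-175 tok/s/user band", and treating them as hard cutoffs (rather than gradual transitions) is an assumption of the example. `regime` is a hypothetical helper, not part of SKILL.md.

```python
# Band edges in tok/s/user. The values come from the commit message's
# "75-175 tok/s/user band"; using them as sharp cutoffs is an assumption.
NETWORK_BAND = (75.0, 175.0)


def regime(interactivity: float) -> str:
    """Map an interactivity level to the bottleneck named in the skill update."""
    low, high = NETWORK_BAND
    if interactivity < low:
        # Low interactivity, huge batch.
        return "compute + KV-bandwidth-bound"
    if interactivity <= high:
        # EP dispatch/combine collectives dominate.
        return "network-bound"
    # High interactivity, small batch.
    return "weight-bandwidth-bound"


print(regime(125.0))
```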

Test plan

Replace dark-theme images — benchmark-dark.png and gb200-nvl72-rack-dark.png are currently copies of the light theme. Drop real dark exports before merging.
Click-through Vercel preview when it builds — verify both Figures render, rack diagram displays at the right size, tables aren't broken.
Verify the dashboard preset URL (g_runid=26306422380&i_seq=1k/1k&i_active=b200_dynamo-trt_mtp%2Cgb200_dynamo-trt_mtp) lands on the right comparison.
Sanity-check 1–2 cells of the iso-interactivity table by piping rows through iso_interactivity.py.
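
For the preset-URL check, the query string can be decoded before clicking through, to confirm the run ID, sequence shape, and both SKUs are what the post claims. A minimal sketch using only the standard library; `parse_preset` is a hypothetical helper, and the parameter meanings are inferred from their names:

```python
from urllib.parse import parse_qs

# Query string exactly as quoted in the test plan.
PRESET_QUERY = (
    "g_runid=26306422380&i_seq=1k/1k"
    "&i_active=b200_dynamo-trt_mtp%2Cgb200_dynamo-trt_mtp"
)


def parse_preset(query: str) -> dict:
    """Decode the preset query; split the comma-separated i_active list."""
    params = {key: values[0] for key, values in parse_qs(query).items()}
    params["i_active"] = params["i_active"].split(",")  # %2C decodes to ","
    return params


preset = parse_preset(PRESET_QUERY)
print(preset)
```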

🤖 Generated with Claude Code

Note

Low Risk
Content and documentation-only changes (MDX blog + skill markdown); no runtime, auth, or data-path code.

Overview
Adds a new InferenceX benchmark post comparing GB200 NVL72 vs B200 on DeepSeek R1 0528 FP4 1k/1k, both with Dynamo TRT-LLM + MTP and disaggregated prefill/decode. The lede anchors 4.39x tok/s/GPU at 125 tok/s/user (iso-interactivity), with tables for per-concurrency configs, throughput and $/M iso-interactivity bands, an honest B200 win above ~250 tok/s/user, rack + Pareto <Figure> blocks, dashboard CTAs, live chart link, and FAQ JsonLd.

The bundled write-inferencex-blog skill update codifies editorial and numeric guardrails from drafting this piece: explicit uni-di vs bi-di bandwidth math (18x NVLink vs RoCE/IB, not 36x), image naming/placement, title/subtitle patterns, no spline-algorithm prose in published posts, browser-editor collision + kill-before-commit, tighter house style, and a new “rack-scale NVL72 wide EP on sparse MoE” reusable framing (three-regime curve mapping).

^{Reviewed by Cursor Bugbot for commit 5d55881. Bugbot is set up for automated code reviews on this repo. Configure here.}

… at 125 tok/s/user

GB200 NVL72 Dynamo TRT-LLM + MTP vs B200 disagg multinode Dynamo TRT-LLM + MTP on DeepSeek R1 0528 FP4 1k/1k.

Peak iso-interactivity ratio 4.39x at 125 tok/s/user (4,130 vs 941 tok/s/GPU on the spline-interpolated Pareto). Peak throughput per GPU 14,659 vs 12,515 (1.17x) at the left end of the chart where both SKUs run narrow EP=4 + DP attention. Curves cross above 250 tok/s/user where small batches fit on one NVLink island and B200 wins back.

The mechanic: NVL72's 72-GPU NVLink-5 scale-up domain runs at 900 GB/s per GPU uni-directional. B200 multinode disagg crosses ConnectX-7 RoCEv2 Ethernet at 400 Gbit/s = 50 GB/s uni-di per GPU, 18x slower. In the 75-175 tok/s/user band the EP dispatch/combine collectives are the bottleneck: a fast network lets the runtime overlap them with the matmul; a slow network exposes them as raw communication time.

Includes the GB200 NVL72 rack diagram so readers can see the 72-GPU / 9-NVSwitch5 layout up front.

SKILL.md updates from lessons in this draft:

- Three-regime mapping fixed: high interactivity (small batch) = weight-bandwidth-bound, low interactivity (huge batch) = compute + KV-bandwidth-bound, middle = network-bound. Previously had weight-bandwidth-bound labeled on the wrong side of the chart.
- Bandwidth-units rule: always state uni-di vs bi-di explicitly, convert Gbit/s to GB/s in the same sentence, NVLink-to-IB/RoCE ratio is 18x not 36x. Flags the prior Kimi K2.5 post for the same 36x error.
- Browser-editor collision warning + kill-editor-before-commit step, learned the hard way during this draft when auto-save clobbered an expansion twice.
- "Do not put the interpolation algorithm in the body": no monotone-cubic-Hermite mentions or source-file paths in the published prose, that's slop.
- House style: vague positional adjectives like "in the middle of the curve" banned in favor of concrete interactivity ranges.
- New "Reusable technical framings" section with the rack-scale-wide-EP-MoE template (the 5-step compute-comm overlap argument).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel · 2026-05-26T00:46:17Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
inferencemax-app	Ready	Preview, Comment	May 26, 2026 12:56am

Sends readers from the lede to the InferenceX GPU specs page so they can see scale-up topology, HBM bandwidth, and TDP for each SKU without leaving the dashboard. Links only the first mention of each SKU to avoid over-linking. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The links sit better next to the NVLink and ConnectX-7 bandwidth numbers — readers who want to dig into the per-GPU topology and fabric specs can click through right where the spec discussion happens, not in the headline number. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1. Title/subtitle conventions: model + headline ratio + interactivity anchor, not a laundry list of framework + precision + parallelism. Frameworks belong in the body. Compares the good vs avoid form for PR #381's title.
2. Image filename convention: descriptive slugs (benchmark-, rack-, topology-, timeline-), never numeric (figure1, figure2). Both light and dark variants required even if dark is a placeholder.
3. Architectural diagrams (rack layouts, topology) go between the architectural prose and the DashboardCTA, near the top. Visual grounds the technical claim before the data tables appear. References the GB200 NVL72 rack diagram placement from this PR.
4. Iso-interactivity row-pruning heuristic: never open the table with an _unreachable_ row. First row must have two real numbers so the reader anchors on a real comparison. _unreachable_ rows are fine in the middle/end.
5. House-style additions:
   - Cross-links (/gpu-specs, prior posts, recipe docs) go next to the sentence that motivates the click, not in the lede.
   - Write tight first, expand only on request. Long preemptive explanations get trimmed back by the reviewer (and overwritten by the browser editor's auto-save while you wait).
   - Don't restate table contents in prose. Use prose around tables for the WHY, not to summarize the WHAT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
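
The row-pruning heuristic in item 4 amounts to dropping leading rows until one has real numbers for both SKUs. A minimal sketch, assuming a hypothetical row shape in which `None` stands for an _unreachable_ cell; `IsoRow` and `prune_leading_unreachable` are illustrative names, not code from the repo:

```python
from typing import Optional, TypedDict


class IsoRow(TypedDict):
    """One iso-interactivity table row (hypothetical shape)."""
    interactivity: float      # tok/s/user
    gb200: Optional[float]    # tok/s/GPU, None = _unreachable_
    b200: Optional[float]     # tok/s/GPU, None = _unreachable_


def prune_leading_unreachable(rows: list[IsoRow]) -> list[IsoRow]:
    """Drop rows from the top until the first row has two real numbers.

    Unreachable rows in the middle or at the end are kept as-is.
    """
    for index, row in enumerate(rows):
        if row["gb200"] is not None and row["b200"] is not None:
            return rows[index:]
    return []  # no row offers a real two-sided comparison
```

Only the head of the table is trimmed, so a trailing row where one SKU has no recipe still appears.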

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.


cursor · 2026-05-26T01:03:37Z

+      "name": "How much faster is GB200 NVL72 than B200 on DeepSeek R1 FP4 with Dynamo TRT-LLM and MTP?",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "On DeepSeek R1 0528 FP4 at 1k/1k with Dynamo TRT-LLM + MTP and disaggregated prefill/decode on both SKUs, GB200 NVL72 delivers up to 4.39x throughput per GPU vs B200 at iso-interactivity, peaking at 125 tok/s/user (4,130 vs 941 tok/s/GPU on the dashboard's monotone-cubic-Hermite Pareto interpolation). At peak throughput (sub-25 tok/s/user) the gap shrinks to 1.15x because both SKUs run narrow EP=4 with DP attention on the same TP=32 disagg shape and the workload is decode-memory-bandwidth bound. Above 250 tok/s/user the curves cross and B200 wins by about 1.2x because at small batch sizes the workload fits inside an 8-GPU NVLink island and the cross-rack hop is pure overhead. Measured on InferenceX 2026-05-22, run 26306422380."


FAQ claims "TP=32" but tables show EP=4

Medium Severity

The FAQ structured data claims "both SKUs run narrow EP=4 with DP attention on the same TP=32 disagg shape" but neither SKU uses TP=32 in the data tables. The peak-throughput rows show GB200 NVL72 at "12 GPU, TP=12" prefill / "20 GPU, EP=4" decode, and B200 at "12 GPU, TP=12" prefill / "20 GPU, EP=4" decode. "TP=32" is a fabricated configuration that doesn't match the benchmark data. This incorrect claim will appear in Google rich snippets.


cursor · 2026-05-26T01:03:37Z

+      "name": "Where does B200 still beat GB200 NVL72 on DeepSeek R1 FP4?",
+      "acceptedAnswer": {
+        "@type": "Answer",
+        "text": "Above roughly 250 tok/s/user. At that interactivity the workload runs at very small batch sizes (4 to 25 concurrent users on a 40-GPU decode pool in the B200 recipe), the per-token decode work is small enough that all-to-all bandwidth isn't the bottleneck, and the workload comfortably fits inside an 8-GPU NVLink island. B200 saves the cross-rack hop and wins by about 1.2x at 275 tok/s/user. NVL72 in this dataset has no recipe that runs below 286 tok/s/user, so above that point only B200 is reachable. The very-low-batch regime is also where rack-scale NVL72's advantage is structurally smallest because there are few tokens in flight for the wide NVLink bandwidth to carry."


Directional error: "below" should be "above" for interactivity

Medium Severity

The FAQ says "NVL72 in this dataset has no recipe that runs below 286 tok/s/user" but this is backwards. The GB200 NVL72 table shows the maximum interactivity is 286.40 tok/s/user (at Conc 4) — all other recipes run below that. The correct statement is "no recipe that reaches above 286 tok/s/user," which is consistent with the second half of the same sentence: "so above that point only B200 is reachable."
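
The direction of the claim is easy to pin down as a predicate: a target interactivity is reachable for a SKU only if it does not exceed that SKU's highest measured point. A small sketch using the 286.40 tok/s/user maximum quoted above; `gb200_reachable` is a hypothetical helper:

```python
# Highest GB200 NVL72 interactivity in the table, per the comment above.
GB200_MAX_INTERACTIVITY = 286.40  # tok/s/user, at Conc 4


def gb200_reachable(
    target: float, max_interactivity: float = GB200_MAX_INTERACTIVITY
) -> bool:
    """True if some GB200 NVL72 recipe reaches the target interactivity."""
    return target <= max_interactivity


print(gb200_reachable(275.0), gb200_reachable(300.0))
```

At 275 tok/s/user both SKUs are reachable, which is why the FAQ can quote a 1.2x B200 win there; past 286.40 only B200 has data.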


vercel Bot deployed to Preview May 26, 2026 00:46 View deployment

vercel Bot deployed to Preview May 26, 2026 00:51 View deployment

vercel Bot deployed to Preview May 26, 2026 00:52 View deployment

functionstackx merged commit d6643d9 into master May 26, 2026
16 of 17 checks passed

functionstackx deleted the blog/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt branch May 26, 2026 00:55

vercel Bot deployed to Preview May 26, 2026 00:56 View deployment

cursor Bot reviewed May 26, 2026

functionstackx merged 4 commits into master from blog/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt
