feat(blog): GB200 NVL72 vs B200 disagg on DeepSeek R1 FP4 - up to 4.4x at 125 tok/s/user #381
… at 125 tok/s/user

GB200 NVL72 Dynamo TRT-LLM + MTP vs B200 disagg multinode Dynamo TRT-LLM + MTP on DeepSeek R1 0528 FP4 1k/1k.

Peak iso-interactivity ratio 4.39x at 125 tok/s/user (4,130 vs 941 tok/s/GPU on the spline-interpolated Pareto). Peak throughput per GPU 14,659 vs 12,515 (1.17x) at the left end of the chart where both SKUs run narrow EP=4 + DP attention. Curves cross above 250 tok/s/user where small batches fit on one NVLink island and B200 wins back.

The mechanic: NVL72's 72-GPU NVLink-5 scale-up domain runs at 900 GB/s per GPU uni-directional. B200 multinode disagg crosses ConnectX-7 RoCEv2 Ethernet at 400 Gbit/s = 50 GB/s uni-di per GPU, 18x slower. In the 75-175 tok/s/user band the EP dispatch/combine collectives are the bottleneck: fast network lets the runtime overlap them with the matmul; slow network exposes them as raw communication time.

Includes the GB200 NVL72 rack diagram so readers can see the 72-GPU / 9-NVSwitch5 layout up front.

SKILL.md updates from lessons in this draft:
- Three-regime mapping fixed: high interactivity (small batch) = weight-bandwidth-bound, low interactivity (huge batch) = compute + KV-bandwidth-bound, middle = network-bound. Previously had weight-bandwidth-bound labeled on the wrong side of the chart.
- Bandwidth-units rule: always state uni-di vs bi-di explicitly, convert Gbit/s to GB/s in the same sentence, NVLink-to-IB/RoCE ratio is 18x not 36x. Flags the prior Kimi K2.5 post for the same 36x error.
- Browser-editor collision warning + kill-editor-before-commit step, learned the hard way during this draft when auto-save clobbered an expansion twice.
- "Do not put the interpolation algorithm in the body": no monotone-cubic-Hermite mentions or source-file paths in the published prose, that's slop.
- House style: vague positional adjectives like "in the middle of the curve" banned in favor of concrete interactivity ranges.
- New "Reusable technical framings" section with the rack-scale-wide-EP-MoE template (the 5-step compute-comm overlap argument).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
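The two headline ratios quoted in this commit message can be re-derived from the per-GPU throughput figures it cites. A minimal Python check follows; the dictionary keys and the `speedup` helper are illustrative names, not identifiers from the repo.

```python
# Throughput figures (tok/s/GPU) exactly as quoted in the commit message.
# The key names are hypothetical labels chosen for this sketch.
HEADLINE = {
    "iso_interactivity_125": {"gb200_nvl72": 4130, "b200": 941},
    "peak_throughput": {"gb200_nvl72": 14659, "b200": 12515},
}


def speedup(gb200_nvl72: float, b200: float) -> float:
    """Return GB200 NVL72 tok/s/GPU divided by B200 tok/s/GPU."""
    return gb200_nvl72 / b200


# Round to two decimals, matching the precision used in the prose.
ratios = {name: round(speedup(**pair), 2) for name, pair in HEADLINE.items()}
print(ratios)
```

Rounded to two decimals, the results are 4.39 and 1.17, matching the figures in the message.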
Sends readers from the lede to the InferenceX GPU specs page so they can see scale-up topology, HBM bandwidth, and TDP for each SKU without leaving the dashboard. Links only the first mention of each SKU to avoid over-linking.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The links sit better next to the NVLink and ConnectX-7 bandwidth numbers: readers who want to dig into the per-GPU topology and fabric specs can click through right where the spec discussion happens, not in the headline number.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. Title/subtitle conventions: model + headline ratio + interactivity anchor, not a laundry list of framework + precision + parallelism. Frameworks belong in the body. Compares the good vs avoid form for PR #381's title.
2. Image filename convention: descriptive slugs (benchmark-, rack-, topology-, timeline-), never numeric (figure1, figure2). Both light and dark variants required even if dark is a placeholder.
3. Architectural diagrams (rack layouts, topology) go between the architectural prose and the DashboardCTA, near the top. Visual grounds the technical claim before the data tables appear. References the GB200 NVL72 rack diagram placement from this PR.
4. Iso-interactivity row-pruning heuristic: never open the table with an _unreachable_ row. First row must have two real numbers so the reader anchors on a real comparison. _unreachable_ rows are fine in the middle/end.
5. House-style additions:
   - Cross-links (/gpu-specs, prior posts, recipe docs) go next to the sentence that motivates the click, not in the lede.
   - Write tight first, expand only on request. Long preemptive explanations get trimmed back by the reviewer (and overwritten by the browser editor's auto-save while you wait).
   - Don't restate table contents in prose. Use prose around tables for the WHY, not to summarize the WHAT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
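The row-pruning heuristic in item 4 can be sketched in a few lines of Python. This is only an illustration of the rule as described: the `Row` type and `prune_leading_unreachable` are hypothetical names, `None` stands in for an _unreachable_ cell, and every value except the 125 tok/s/user row is a placeholder, not benchmark data.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    interactivity: float      # tok/s/user band
    gb200: Optional[float]    # tok/s/GPU, None means "unreachable"
    b200: Optional[float]     # tok/s/GPU, None means "unreachable"


def prune_leading_unreachable(rows: list[Row]) -> list[Row]:
    """Drop rows from the top until the first row has two real numbers.

    Unreachable rows after that first anchor row are kept as-is.
    """
    for i, row in enumerate(rows):
        if row.gb200 is not None and row.b200 is not None:
            return rows[i:]
    return []  # no row has a real comparison


rows = [
    Row(50.0, None, 100.0),     # placeholder values, not benchmark data
    Row(125.0, 4130.0, 941.0),  # the headline comparison from this PR
    Row(300.0, None, 200.0),    # placeholder; unreachable rows may stay later
]
kept = prune_leading_unreachable(rows)
print([r.interactivity for r in kept])
```

Only the leading rows are trimmed, so a trailing _unreachable_ row still appears in the table, as the heuristic allows.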
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 5d55881. Configure here.
"name": "How much faster is GB200 NVL72 than B200 on DeepSeek R1 FP4 with Dynamo TRT-LLM and MTP?",
"acceptedAnswer": {
  "@type": "Answer",
  "text": "On DeepSeek R1 0528 FP4 at 1k/1k with Dynamo TRT-LLM + MTP and disaggregated prefill/decode on both SKUs, GB200 NVL72 delivers up to 4.39x throughput per GPU vs B200 at iso-interactivity, peaking at 125 tok/s/user (4,130 vs 941 tok/s/GPU on the dashboard's monotone-cubic-Hermite Pareto interpolation). At peak throughput (sub-25 tok/s/user) the gap shrinks to 1.15x because both SKUs run narrow EP=4 with DP attention on the same TP=32 disagg shape and the workload is decode-memory-bandwidth bound. Above 250 tok/s/user the curves cross and B200 wins by about 1.2x because at small batch sizes the workload fits inside an 8-GPU NVLink island and the cross-rack hop is pure overhead. Measured on InferenceX 2026-05-22, run 26306422380."
FAQ claims "TP=32" but tables show EP=4
Medium Severity
The FAQ structured data claims "both SKUs run narrow EP=4 with DP attention on the same TP=32 disagg shape" but neither SKU uses TP=32 in the data tables. The peak-throughput rows show GB200 NVL72 at "12 GPU, TP=12" prefill / "20 GPU, EP=4" decode, and B200 at "12 GPU, TP=12" prefill / "20 GPU, EP=4" decode. "TP=32" is a fabricated configuration that doesn't match the benchmark data. This incorrect claim will appear in Google rich snippets.
"name": "Where does B200 still beat GB200 NVL72 on DeepSeek R1 FP4?",
"acceptedAnswer": {
  "@type": "Answer",
  "text": "Above roughly 250 tok/s/user. At that interactivity the workload runs at very small batch sizes (4 to 25 concurrent users on a 40-GPU decode pool in the B200 recipe), the per-token decode work is small enough that all-to-all bandwidth isn't the bottleneck, and the workload comfortably fits inside an 8-GPU NVLink island. B200 saves the cross-rack hop and wins by about 1.2x at 275 tok/s/user. NVL72 in this dataset has no recipe that runs below 286 tok/s/user, so above that point only B200 is reachable. The very-low-batch regime is also where rack-scale NVL72's advantage is structurally smallest because there are few tokens in flight for the wide NVLink bandwidth to carry."
Directional error: "below" should be "above" for interactivity
Medium Severity
The FAQ says "NVL72 in this dataset has no recipe that runs below 286 tok/s/user" but this is backwards. The GB200 NVL72 table shows the maximum interactivity is 286.40 tok/s/user (at Conc 4) — all other recipes run below that. The correct statement is "no recipe that reaches above 286 tok/s/user," which is consistent with the second half of the same sentence: "so above that point only B200 is reachable."
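The direction of the comparison is easy to check mechanically. The sketch below encodes only the upper bound cited in this comment (286.40 tok/s/user at Conc 4); the constant and function names are hypothetical, and the lower end of the NVL72 range is not modeled because it does not appear in this thread.

```python
# Maximum interactivity in the GB200 NVL72 table, as cited in this comment.
GB200_NVL72_MAX_TOK_S_USER = 286.40


def reachable_on_nvl72(target_tok_s_user: float) -> bool:
    """True if the target sits at or below the highest measured NVL72 point.

    Only the upper bound is checked; the lower bound is out of scope here.
    """
    return target_tok_s_user <= GB200_NVL72_MAX_TOK_S_USER


print(reachable_on_nvl72(275.0))  # inside the measured range
print(reachable_on_nvl72(300.0))  # above the maximum, so only B200 is reachable
```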


Summary
New blog post comparing GB200 NVL72 vs B200, both running Dynamo TRT-LLM + MTP and both disaggregated prefill/decode, on DeepSeek R1 0528 FP4 1k/1k.
Headline numbers (InferenceX 2026-05-22, run 26306422380)
Technical framing
The post explains the gap via compute-comm overlap on the EP dispatch/combine collectives. NVLink 5 at 900 GB/s per GPU uni-directional lets the dispatch fit inside the GEMM time budget; ConnectX-7 RoCEv2 Ethernet at 50 GB/s per GPU uni-directional (400 Gbit/s ÷ 8) is 18x slower and exposes the collective as raw communication time between kernel launches. Includes the GB200 NVL72 rack diagram (72 GPUs, 9 NVSwitch5 trays, four 33 kW power shelves) right after the lede so readers see the scale-up domain visually before the prose.
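The bandwidth arithmetic behind the 18x figure is short enough to write out. The Python sketch below uses the two per-GPU numbers stated in this PR; the variable names are illustrative. The last line shows one plausible way to land on 36x, by setting NVLink 5's bi-directional figure (1,800 GB/s per GPU) against the uni-directional Ethernet rate. That reading of the earlier error is an assumption, not something this PR states.

```python
NVLINK5_UNI_GB_PER_S = 900.0   # per GPU, uni-directional (from the PR)
CX7_GBIT_PER_S = 400.0         # ConnectX-7 RoCEv2 line rate per GPU (from the PR)


def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert Gbit/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8.0


cx7_uni_gb_per_s = gbit_to_gbyte(CX7_GBIT_PER_S)           # 400 / 8
like_for_like = NVLINK5_UNI_GB_PER_S / cx7_uni_gb_per_s    # uni-di vs uni-di

# Assumed source of the 36x figure: bi-di NVLink divided by uni-di Ethernet.
mixed_units = (2 * NVLINK5_UNI_GB_PER_S) / cx7_uni_gb_per_s

print(cx7_uni_gb_per_s, like_for_like, mixed_units)
```

Comparing uni-directional to uni-directional gives 18x; mixing the two conventions doubles it, which is why the skill update asks for the direction to be stated next to every bandwidth number.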
Skill updates bundled in this PR
Lessons that came out of writing this draft, codified in .claude/skills/write-inferencex-blog/SKILL.md:

Test plan
- benchmark-dark.png and gb200-nvl72-rack-dark.png are currently copies of the light theme. Drop real dark exports before merging.
- … g_runid=26306422380&i_seq=1k/1k&i_active=b200_dynamo-trt_mtp%2Cgb200_dynamo-trt_mtp) lands on the right comparison.
- … iso_interactivity.py.

🤖 Generated with Claude Code
Note
Low Risk
Content and documentation-only changes (MDX blog + skill markdown); no runtime, auth, or data-path code.
Overview
Adds a new InferenceX benchmark post comparing GB200 NVL72 vs B200 on DeepSeek R1 0528 FP4 1k/1k, both with Dynamo TRT-LLM + MTP and disaggregated prefill/decode. The lede anchors 4.39x tok/s/GPU at 125 tok/s/user (iso-interactivity), with tables for per-concurrency configs, throughput and $/M iso-interactivity bands, an honest B200 win above ~250 tok/s/user, rack + Pareto <Figure> blocks, dashboard CTAs, live chart link, and FAQ JsonLd.

The bundled write-inferencex-blog skill update codifies editorial and numeric guardrails from drafting this piece: explicit uni-di vs bi-di bandwidth math (18x NVLink vs RoCE/IB, not 36x), image naming/placement, title/subtitle patterns, no spline-algorithm prose in published posts, browser-editor collision + kill-before-commit, tighter house style, and a new “rack-scale NVL72 wide EP on sparse MoE” reusable framing (three-regime curve mapping).