feat(blog): GB200 NVL72 vs B200 disagg on DeepSeek R1 FP4 — up to 4.4x at 125 tok/s/user#381

Merged
functionstackx merged 4 commits into master from blog/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt
May 26, 2026

@functionstackx functionstackx commented May 26, 2026

Summary

New blog post comparing GB200 NVL72 vs B200, both running Dynamo TRT-LLM + MTP and both disaggregated prefill/decode, on DeepSeek R1 0528 FP4 1k/1k.

Headline numbers (InferenceX 2026-05-22, run 26306422380)

  • Peak iso-interactivity gap: 4.39x at 125 tok/s/user (4,130 vs 941 tok/s/GPU on the spline-interpolated Pareto)
  • Peak throughput per GPU: 1.17x (14,659 vs 12,515) at the left end where both SKUs converge on narrow EP=4 + DP attention
  • Honest inversion: B200 wins above ~250 tok/s/user by ~1.2x — small batches fit on one NVLink island and the cross-rack hop becomes pure overhead
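The two headline ratios can be re-derived from the tok/s/GPU figures quoted in the bullets above. This is only an arithmetic check on the numbers in this description, not a reproduction of the dashboard's interpolation; the variable and function names are illustrative.

```python
# tok/s/GPU figures quoted in the headline bullets of this PR description.
gb200_at_125, b200_at_125 = 4130, 941   # iso-interactivity point, 125 tok/s/user
gb200_peak, b200_peak = 14659, 12515    # peak throughput per GPU


def speedup(numerator: float, denominator: float) -> float:
    """Return numerator / denominator rounded to two decimals."""
    return round(numerator / denominator, 2)


iso_gap = speedup(gb200_at_125, b200_at_125)
peak_gap = speedup(gb200_peak, b200_peak)
print(f"iso-interactivity gap: {iso_gap}x, peak-throughput gap: {peak_gap}x")
```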

Technical framing

The post explains the gap via compute-comm overlap on the EP dispatch/combine collectives. NVLink 5 at 900 GB/s per GPU uni-directional lets the dispatch fit inside the GEMM time budget; ConnectX-7 RoCEv2 Ethernet at 50 GB/s per GPU uni-directional (400 Gbit/s ÷ 8) is 18x slower and exposes the collective as raw communication time between kernel launches. The post places the GB200 NVL72 rack diagram (72 GPUs / 9 NVSwitch5 trays / four 33 kW power shelves) right after the lede so readers see the scale-up domain visually before the prose.
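A minimal sketch of the bandwidth arithmetic and the overlap argument follows. The link rates come from the paragraph above. The payload size and GEMM time budget are made-up placeholders chosen only to show the mechanism, and `exposed_comm_s` is a hypothetical helper, not part of any runtime. Attributing a 36x figure to mixing a bi-directional NVLink rate with a uni-directional Ethernet rate is an assumption; this PR does not state where that number came from.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert a link rate from Gbit/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8.0


NVLINK5_UNI_GB_S = 900.0              # per GPU, uni-directional (from the PR text)
ROCE_UNI_GB_S = gbit_to_gbyte(400.0)  # ConnectX-7 400 Gbit/s -> 50 GB/s uni-di

ratio_uni = NVLINK5_UNI_GB_S / ROCE_UNI_GB_S          # like-for-like comparison
ratio_mixed = (2 * NVLINK5_UNI_GB_S) / ROCE_UNI_GB_S  # bi-di vs uni-di (assumed error source)


def exposed_comm_s(payload_gb: float, link_gb_s: float, gemm_s: float) -> float:
    """Communication time left over after hiding the transfer behind a GEMM."""
    transfer_s = payload_gb / link_gb_s
    return max(0.0, transfer_s - gemm_s)


# Hypothetical placeholders, NOT measured values from the benchmark.
PAYLOAD_GB = 0.045   # per-GPU dispatch payload
GEMM_S = 100e-6      # GEMM time budget available for overlap

exposed_nvlink = exposed_comm_s(PAYLOAD_GB, NVLINK5_UNI_GB_S, GEMM_S)
exposed_roce = exposed_comm_s(PAYLOAD_GB, ROCE_UNI_GB_S, GEMM_S)
print(ratio_uni, ratio_mixed, exposed_nvlink, exposed_roce)
```

With these placeholder inputs the NVLink transfer hides entirely behind the GEMM, while the Ethernet transfer leaves most of its time exposed, which is the shape of the argument the post makes.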

Skill updates bundled in this PR

Lessons that came out of writing this draft, codified in .claude/skills/write-inferencex-blog/SKILL.md:

  • Three-regime mapping fixed. High interactivity (small batch) is weight-bandwidth-bound; low interactivity (huge batch) is compute + KV-bandwidth-bound; middle is network-bound. Previously had weight-bandwidth-bound labeled on the wrong side of the chart.
  • Bandwidth-units rule. Always state uni-di vs bi-di explicitly, convert Gbit/s → GB/s inline, NVLink-to-IB/RoCE ratio is 18x not 36x. Flags the prior Kimi K2.5 post for the same 36x error that should be corrected separately.
  • Browser-editor collision warning + kill-editor-before-commit step. The auto-save race overwrote my expansion paragraph twice during this draft.
  • "Do not put the interpolation algorithm in the body" — monotone-cubic-Hermite mentions and source-file paths in published prose are slop.
  • House style. Vague positional adjectives ("middle of the curve") banned in favor of concrete interactivity ranges.
  • New "Reusable technical framings" section with the rack-scale-wide-EP-MoE template (the 5-step compute-comm overlap argument).
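The corrected three-regime mapping can be written down as a small lookup. This sketch assumes the network-bound band is the 75 to 175 tok/s/user range named in the commit message for this PR; the skill file may draw the boundaries differently, and `Regime` and `classify` are hypothetical names.

```python
from enum import Enum


class Regime(Enum):
    COMPUTE_KV_BOUND = "compute + KV-bandwidth-bound"  # low interactivity, huge batch
    NETWORK_BOUND = "network-bound"                    # middle band
    WEIGHT_BW_BOUND = "weight-bandwidth-bound"         # high interactivity, small batch


# Assumed band edges in tok/s/user, taken from the "75-175 tok/s/user band"
# wording in the commit message; treat them as illustrative.
NETWORK_BAND = (75.0, 175.0)


def classify(tok_s_user: float) -> Regime:
    """Map an interactivity value to the regime that bounds it."""
    low, high = NETWORK_BAND
    if tok_s_user < low:
        return Regime.COMPUTE_KV_BOUND
    if tok_s_user <= high:
        return Regime.NETWORK_BOUND
    return Regime.WEIGHT_BW_BOUND


for x in (25, 125, 275):
    print(x, classify(x).value)
```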

Test plan

  • Replace dark-theme images: benchmark-dark.png and gb200-nvl72-rack-dark.png are currently copies of the light theme. Drop in real dark exports before merging.
  • Click-through Vercel preview when it builds — verify both Figures render, rack diagram displays at the right size, tables aren't broken.
  • Verify the dashboard preset URL (g_runid=26306422380&i_seq=1k/1k&i_active=b200_dynamo-trt_mtp%2Cgb200_dynamo-trt_mtp) lands on the right comparison.
  • Sanity-check 1–2 cells of the iso-interactivity table by piping rows through iso_interactivity.py.
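Before clicking through, the preset query string from the test plan can be decoded to confirm it names the intended run, sequence length, and SKU pair. This sketch uses only the Python standard library and checks the query string itself; it makes no assumption about the dashboard's host or path.

```python
from urllib.parse import parse_qs

# Query string copied from the test-plan bullet above.
query = (
    "g_runid=26306422380&i_seq=1k/1k"
    "&i_active=b200_dynamo-trt_mtp%2Cgb200_dynamo-trt_mtp"
)

params = parse_qs(query)                   # percent-decodes %2C to a comma
run_id = params["g_runid"][0]
seq = params["i_seq"][0]
active = params["i_active"][0].split(",")  # the two series in the comparison

print(run_id, seq, active)
```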

🤖 Generated with Claude Code


Note

Low Risk
Content and documentation-only changes (MDX blog + skill markdown); no runtime, auth, or data-path code.

Overview
Adds a new InferenceX benchmark post comparing GB200 NVL72 vs B200 on DeepSeek R1 0528 FP4 1k/1k, both with Dynamo TRT-LLM + MTP and disaggregated prefill/decode. The lede anchors 4.39x tok/s/GPU at 125 tok/s/user (iso-interactivity), with tables for per-concurrency configs, throughput and $/M iso-interactivity bands, an honest B200 win above ~250 tok/s/user, rack + Pareto <Figure> blocks, dashboard CTAs, live chart link, and FAQ JsonLd.

The bundled write-inferencex-blog skill update codifies editorial and numeric guardrails from drafting this piece: explicit uni-di vs bi-di bandwidth math (18x NVLink vs RoCE/IB, not 36x), image naming/placement, title/subtitle patterns, no spline-algorithm prose in published posts, browser-editor collision + kill-before-commit, tighter house style, and a new “rack-scale NVL72 wide EP on sparse MoE” reusable framing (three-regime curve mapping).

Reviewed by Cursor Bugbot for commit 5d55881. Bugbot is set up for automated code reviews on this repo. Configure here.

… at 125 tok/s/user

GB200 NVL72 Dynamo TRT-LLM + MTP vs B200 disagg multinode Dynamo
TRT-LLM + MTP on DeepSeek R1 0528 FP4 1k/1k. Peak iso-interactivity
ratio 4.39x at 125 tok/s/user (4,130 vs 941 tok/s/GPU on the
spline-interpolated Pareto). Peak throughput per GPU 14,659 vs
12,515 (1.17x) at the left end of the chart where both SKUs run
narrow EP=4 + DP attention. Curves cross above 250 tok/s/user where
small batches fit on one NVLink island and B200 wins back.

The mechanic: NVL72's 72-GPU NVLink-5 scale-up domain runs at 900
GB/s per GPU uni-directional. B200 multinode disagg crosses
ConnectX-7 RoCEv2 Ethernet at 400 Gbit/s = 50 GB/s uni-di per GPU,
18x slower. In the 75-175 tok/s/user band the EP dispatch/combine
collectives are the bottleneck — fast network lets the runtime
overlap them with the matmul; slow network exposes them as raw
communication time. Includes the GB200 NVL72 rack diagram so
readers can see the 72-GPU / 9-NVSwitch5 layout up front.

SKILL.md updates from lessons in this draft:

- Three-regime mapping fixed: high interactivity (small batch) =
  weight-bandwidth-bound, low interactivity (huge batch) =
  compute + KV-bandwidth-bound, middle = network-bound. Previously
  had weight-bandwidth-bound labeled on the wrong side of the chart.
- Bandwidth-units rule: always state uni-di vs bi-di explicitly,
  convert Gbit/s to GB/s in the same sentence, NVLink-to-IB/RoCE
  ratio is 18x not 36x. Flags the prior Kimi K2.5 post for the
  same 36x error.
- Browser-editor collision warning + kill-editor-before-commit
  step, learned the hard way during this draft when auto-save
  clobbered an expansion twice.
- "Do not put the interpolation algorithm in the body" — no
  monotone-cubic-Hermite mentions or source-file paths in the
  published prose, that's slop.
- House style: vague positional adjectives like "in the middle of
  the curve" banned in favor of concrete interactivity ranges.
- New "Reusable technical framings" section with the rack-scale-
  wide-EP-MoE template (the 5-step compute-comm overlap argument).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot commented May 26, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| inferencemax-app | Ready | Preview, Comment | May 26, 2026 12:56am |

Sends readers from the lede to the InferenceX GPU specs page so they
can see scale-up topology, HBM bandwidth, and TDP for each SKU
without leaving the dashboard. Links only the first mention of each
SKU to avoid over-linking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The links sit better next to the NVLink and ConnectX-7 bandwidth
numbers — readers who want to dig into the per-GPU topology and
fabric specs can click through right where the spec discussion
happens, not in the headline number.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. Title/subtitle conventions — model + headline ratio + interactivity
   anchor, not a laundry list of framework + precision + parallelism.
   Frameworks belong in the body. Compares the good vs avoid form for
   PR #381's title.

2. Image filename convention — descriptive slugs (benchmark-, rack-,
   topology-, timeline-), never numeric (figure1, figure2). Both
   light and dark variants required even if dark is a placeholder.

3. Architectural diagrams (rack layouts, topology) go between the
   architectural prose and the DashboardCTA, near the top. Visual
   grounds the technical claim before the data tables appear.
   References the GB200 NVL72 rack diagram placement from this PR.

4. Iso-interactivity row-pruning heuristic — never open the table
   with an _unreachable_ row. First row must have two real numbers
   so the reader anchors on a real comparison. _unreachable_ rows
   are fine in the middle/end.

5. House-style additions:
   - Cross-links (/gpu-specs, prior posts, recipe docs) go next to
     the sentence that motivates the click, not in the lede.
   - Write tight first, expand only on request. Long preemptive
     explanations get trimmed back by the reviewer (and overwritten
     by the browser editor's auto-save while you wait).
   - Don't restate table contents in prose. Use prose around tables
     for the WHY, not to summarize the WHAT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
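The iso-interactivity row-pruning heuristic from item 4 of the commit message above could be sketched as follows. The row layout, the use of `None` for an _unreachable_ cell, and the function name are all assumptions made for illustration. Only the 125 tok/s/user row uses figures from this PR; the other rows are made-up placeholders.

```python
from typing import List, Optional, Tuple

# (interactivity in tok/s/user, GB200 tok/s/GPU, B200 tok/s/GPU); None = _unreachable_
Row = Tuple[float, Optional[float], Optional[float]]


def prune_leading_unreachable(rows: List[Row]) -> List[Row]:
    """Drop rows from the top until the first row has two real numbers.

    Unreachable rows after that first real comparison are kept as-is.
    """
    for index, (_, gb200, b200) in enumerate(rows):
        if gb200 is not None and b200 is not None:
            return rows[index:]
    return []  # no row has a real comparison


rows: List[Row] = [
    (25.0, None, 1.0),       # placeholder row with one unreachable cell
    (125.0, 4130.0, 941.0),  # figures quoted in this PR
    (300.0, None, 1.0),      # placeholder; unreachable rows are fine at the end
]
pruned = prune_leading_unreachable(rows)
print(pruned)
```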
@functionstackx functionstackx merged commit d6643d9 into master May 26, 2026
16 of 17 checks passed
@functionstackx functionstackx deleted the blog/gb200-nvl72-vs-b200-disagg-deepseek-r1-fp4-dynamo-trt branch May 26, 2026 00:55

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

"name": "How much faster is GB200 NVL72 than B200 on DeepSeek R1 FP4 with Dynamo TRT-LLM and MTP?",
"acceptedAnswer": {
"@type": "Answer",
"text": "On DeepSeek R1 0528 FP4 at 1k/1k with Dynamo TRT-LLM + MTP and disaggregated prefill/decode on both SKUs, GB200 NVL72 delivers up to 4.39x throughput per GPU vs B200 at iso-interactivity, peaking at 125 tok/s/user (4,130 vs 941 tok/s/GPU on the dashboard's monotone-cubic-Hermite Pareto interpolation). At peak throughput (sub-25 tok/s/user) the gap shrinks to 1.15x because both SKUs run narrow EP=4 with DP attention on the same TP=32 disagg shape and the workload is decode-memory-bandwidth bound. Above 250 tok/s/user the curves cross and B200 wins by about 1.2x because at small batch sizes the workload fits inside an 8-GPU NVLink island and the cross-rack hop is pure overhead. Measured on InferenceX 2026-05-22, run 26306422380."

FAQ claims "TP=32" but tables show EP=4

Medium Severity

The FAQ structured data claims "both SKUs run narrow EP=4 with DP attention on the same TP=32 disagg shape" but neither SKU uses TP=32 in the data tables. The peak-throughput rows show GB200 NVL72 at "12 GPU, TP=12" prefill / "20 GPU, EP=4" decode, and B200 at "12 GPU, TP=12" prefill / "20 GPU, EP=4" decode. "TP=32" is a fabricated configuration that doesn't match the benchmark data. This incorrect claim will appear in Google rich snippets.


"name": "Where does B200 still beat GB200 NVL72 on DeepSeek R1 FP4?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Above roughly 250 tok/s/user. At that interactivity the workload runs at very small batch sizes (4 to 25 concurrent users on a 40-GPU decode pool in the B200 recipe), the per-token decode work is small enough that all-to-all bandwidth isn't the bottleneck, and the workload comfortably fits inside an 8-GPU NVLink island. B200 saves the cross-rack hop and wins by about 1.2x at 275 tok/s/user. NVL72 in this dataset has no recipe that runs below 286 tok/s/user, so above that point only B200 is reachable. The very-low-batch regime is also where rack-scale NVL72's advantage is structurally smallest because there are few tokens in flight for the wide NVLink bandwidth to carry."

Directional error: "below" should be "above" for interactivity

Medium Severity

The FAQ says "NVL72 in this dataset has no recipe that runs below 286 tok/s/user" but this is backwards. The GB200 NVL72 table shows the maximum interactivity is 286.40 tok/s/user (at Conc 4) — all other recipes run below that. The correct statement is "no recipe that reaches above 286 tok/s/user," which is consistent with the second half of the same sentence: "so above that point only B200 is reachable."
