
feat(blog): add MI355X vs B200 GLM-5 FP8 SGLang post #378

Merged: functionstackx merged 10 commits into master from glm5-mi355-vs-b200 on May 25, 2026

@functionstackx (Contributor) commented May 25, 2026

Summary

  • New blog post: AMD MI355X SGLang FP8 undercuts NVIDIA B200 SGLang FP8 per million tokens on GLM-5, 14 weeks after the model's 2026-02-11 release
  • Peak gap: 1.41x at 18 tok/s/user with MTP (40% cheaper) and 1.36x at 10 tok/s/user without MTP on the 8k/1k workload, single-node
  • Walks through sgl-project/sglang#21511 (HaiShaw): FP8 KV cache + FP8 attention via TileLang, reusing fused_qk_rope_cat_and_cache_mla for both Q and KV quant on MI355
  • Covers GLM-5 architecture (744B/40B active, 256 experts top-8, glm_moe_dsa, DSA + MLA, 200K ctx)
  • All tables sourced from InferenceX 2026-05-20 run (g_runid=26187777287); chart preset linked from both DashboardCTA blocks
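
The 1.41x and 40% figures above are two ways of quoting one cost gap, and they are easy to conflate. The sketch below converts a cost ratio into both percentages, assuming the ratio means B200 cost per million tokens divided by MI355X cost per million tokens (the PR text does not spell out the definition). The function names are illustrative and not part of the repo.

```python
def premium_pct(cost_ratio: float) -> float:
    """Percent by which the pricier option exceeds the cheaper one (ratio = expensive / cheap)."""
    return (cost_ratio - 1.0) * 100.0


def saving_pct(cost_ratio: float) -> float:
    """Percent by which the cheaper option undercuts the pricier one, for the same ratio."""
    return (1.0 - 1.0 / cost_ratio) * 100.0


# Ratios quoted in the summary; the expensive/cheap interpretation is an assumption.
for ratio in (1.41, 1.36):
    print(f"{ratio:.2f}x -> premium {premium_pct(ratio):.1f}%, saving {saving_pct(ratio):.1f}%")
# 1.41x -> premium 41.0%, saving 29.1%
# 1.36x -> premium 36.0%, saving 26.5%
```

Under that reading, 1.41x is a roughly 41% B200 premium and a roughly 29% MI355X saving, so it may be worth confirming against the source CSV which of the two the "40% cheaper" wording is meant to describe.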

Test plan

  • Visual check on local dev server (run pnpm dev, then open /blog/mi355x-glm5-fp8-sglang-40-cheaper-than-b200)
  • Verify chart preset link resolves to the correct GLM-5 FP8 view with i_metric=y_costh and the four series active
  • Confirm OG image renders and RSS feed picks up the post
  • Sanity-check the GLM-5 parameter counts (744B/40B, 256 experts) against the official ZAI announcement
  • Confirm the 2026-05-20 chart numbers in the iso-interactivity table still match once a newer dump publishes
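
The preset-link check in the test plan could be scripted rather than eyeballed. A minimal sketch, assuming the preset is encoded as URL query parameters: `i_metric=y_costh` and `g_runid=26187777287` come from this PR, while the host, the path, and the helper name `check_preset_link` are placeholders. The "four series active" check is left out because the parameter that controls it is not named here.

```python
from urllib.parse import parse_qs, urlparse


def check_preset_link(url: str, expected: dict[str, str]) -> list[str]:
    """Return human-readable mismatches between a link's query string and the expected params."""
    params = parse_qs(urlparse(url).query)
    problems = []
    for key, want in expected.items():
        got = params.get(key, [])
        if got != [want]:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems


# Hypothetical link: example.invalid and /dashboard stand in for the real dashboard URL.
link = "https://example.invalid/dashboard?i_metric=y_costh&g_runid=26187777287"
print(check_preset_link(link, {"i_metric": "y_costh", "g_runid": "26187777287"}))  # []
```

An empty list means every expected parameter is present exactly once with the expected value.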

🤖 Generated with Claude Code


Note

Low Risk
Content-only additions (documentation skill and static MDX); no application logic, auth, or data pipeline changes.

Overview
Adds a Claude skill (.claude/skills/write-inferencex-blog/SKILL.md) that documents how to draft InferenceX benchmark posts: source-of-truth priority (CSV vs chart), TCO/cost formulas, slug/frontmatter, MDX sections (DashboardCTA, Figure, FAQ JsonLd), and commit/PR workflow. The skill points at this post as the AMD-vs-NVIDIA single-node cost template.

Publishes a new MDX article at packages/app/content/blog/mi355x-glm5-fp8-sglang-40-cheaper-than-b200.mdx claiming MI355X SGLang FP8 on GLM-5 8k/1k is up to 40% cheaper per million tokens than B200 (peak 1.41x with MTP at 18 tok/s/user), tied to SGLang PR #21511 and InferenceX PR #1440, with per-concurrency tables, iso-interactivity comparisons (including where B200 wins above ~90 tok/s/user), preset dashboard links, and five FAQ JSON-LD entries.

Reviewed by Cursor Bugbot for commit c2f98a5.

14 weeks after GLM-5's release, MI355X SGLang FP8 undercuts B200 SGLang
FP8 per million tokens across the single-node Pareto on 8k/1k — peak
1.41x with MTP at 18 tok/s/user, 1.36x non-MTP at 10 tok/s/user.
Walks through SGLang PR #21511 (HaiShaw) fusing QK rope cat + MLA cache
+ FP8 quant on MI355 via TileLang.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot commented May 25, 2026

Project: inferencemax-app
Deployment: Ready
Actions: Preview, Comment
Updated (UTC): May 25, 2026 11:22pm

Comment thread packages/app/content/blog/mi355x-glm5-fp8-sglang-40-cheaper-than-b200.mdx Outdated
Removes the redundant kernel-fusion recap (already covered in the
"What Shipped to Make This Happen" section) and lifts the MI355X
capability sentence into its own paragraph for clearer pacing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…kill

Removes the stray blank line between the MTP iso-interactivity table
header and its data rows that was preventing markdown from parsing
them as a table (rendering all rows as a single pipe-delimited
paragraph instead).
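
This class of bug (a blank line splitting a pipe table so the rows render as a paragraph) is easy to lint for. A rough sketch of such a check, assuming table rows begin with `|`; the function name is hypothetical and the sample cells are placeholders, not data from the post.

```python
def find_split_tables(markdown: str) -> list[int]:
    """Return 1-based line numbers of blank lines sandwiched between two pipe-table rows."""
    lines = markdown.splitlines()

    def is_row(text: str) -> bool:
        # Heuristic: treat any line whose first non-space character is '|' as a table row.
        return text.strip().startswith("|")

    return [
        i + 1
        for i in range(1, len(lines) - 1)
        if lines[i].strip() == "" and is_row(lines[i - 1]) and is_row(lines[i + 1])
    ]


sample = "\n".join([
    "| col A | col B |",
    "| --- | --- |",
    "",  # stray blank line: the rows below no longer belong to the table
    "| a | b |",
])
print(find_split_tables(sample))  # [3]
```

The heuristic also flags two intentionally separate tables that are divided by a single blank line, so hits need a manual look.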

Also adds .claude/skills/write-inferencex-blog/SKILL.md, codifying
the structure, numeric-verification workflow, frontmatter, MDX
components, dashboard-link conventions, and FAQ JSON-LD pattern that
this PR's post follows — so future InferenceX blog posts can be
authored against a consistent template.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread packages/app/content/blog/mi355x-glm5-fp8-sglang-40-cheaper-than-b200.mdx Outdated
Comment thread packages/app/content/blog/mi355x-glm5-fp8-sglang-40-cheaper-than-b200.mdx Outdated
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
B200 ran on lmsysorg/sglang:v0.5.12-cu130; MI355X ran on
lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
functionstackx and others added 3 commits May 25, 2026 19:16
There is no MI355X GLM-5 disagg or wide-EP recipe yet. Updates both
the What's Next bullet and the matching FAQ answer to state the gap
directly rather than implying a recipe exists but underperforms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…llout

Replaces "playbook exists" framing with the direct statement that AMD
has still not shipped disagg for GLM-5. Applied to both the bullet
and the matching FAQ answer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Data run date (2026-05-20) stays as-is in the body since that's when
the InferenceX measurement happened.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).


Reviewed by Cursor Bugbot for commit e8a9524.

@functionstackx functionstackx enabled auto-merge (squash) May 25, 2026 23:19
1. Soften "across the entire Pareto" claim in lede and subtitle to
   "across most of the Pareto" with the ~10-77 tok/s/user range
   called out explicitly. The MTP table already shows B200 noses
   ahead above ~90 tok/s/user.

2. Correct "TP=4 dominates across the whole range" in the
   iso-interactivity intro — TP=4 dominates up to ~77 tok/s/user;
   TP=8 conc 4 takes over at ~90 tok/s/user where TP=4 can't reach.

3. Fix FAQ overstatement: MTP "roughly doubles" -> "lifts ~1.34x" on
   the cited concurrency 32 data point (1,274 -> 1,707 tok/s/GPU).
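
The corrected multiplier in item 3 can be reproduced from the two throughput figures it cites. A small sketch, reading 1,274 as the non-MTP and 1,707 as the MTP tok/s/GPU value at concurrency 32; the helper name is illustrative.

```python
def uplift(before: float, after: float) -> float:
    """Throughput multiplier when going from `before` to `after`."""
    return after / before


# Figures quoted in the commit message (tok/s/GPU at concurrency 32).
non_mtp, mtp = 1274, 1707
print(f"{uplift(non_mtp, mtp):.2f}x")  # 1.34x
```

A result near 1.34x supports the softer wording; "roughly doubles" would need a multiplier close to 2x.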

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx merged commit 09dc863 into master May 25, 2026
14 of 15 checks passed
@functionstackx functionstackx deleted the glm5-mi355-vs-b200 branch May 25, 2026 23:21