Commit cc2330c

feat(blog): B200 NVFP4 vs H200 INT4 on Kimi K2.5/K2.6 — up to 2.95x better perf/$
On 8K/1K with vllm/vllm-openai:v0.21.0, B200 NVFP4 is 2.71x-2.95x cheaper per million tokens than H200 INT4 across the 30-90 tok/s/user serving band (peak 2.95x at 32 tok/s/user, $0.140/M vs $0.413/M).

The cost gap decomposes into B200's silicon ratios over H200 (1.67x HBM BW, 1.28x HBM capacity that unlocks TP=4 vs TP=8, no FP4 tensor cores on Hopper at all) composed with the NVFP4 precision unlock, divided by B200's 1.38x TCO penalty.

Kimi K2.5 and K2.6 are the open-weights models powering xAI's Cursor Composer 2 and Composer 2.5, leading SWE-Bench Pro at 58.6% over GPT-5.4 / Opus 4.6 / Gemini 3.1 Pro. Both releases share the same backbone (K2.6 is a post-training refinement of K2.5), so every serving curve applies one-to-one to both.

Also adds an X-not-Y antithesis ban to the write-inferencex-blog SKILL house style ("the gap is silicon x precision, not framework" etc.). The construction reads as performatively contrarian AI flexing and was getting reflexively cut on review; codifying the ban so future drafts don't repeat it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 19ae49e commit cc2330c
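
The headline arithmetic in the commit message can be checked with a short sketch. It assumes the two quoted prices are 0.140 and 0.413 dollars per million tokens, that the 1.38x TCO penalty is a per-GPU hourly cost ratio, and that cost per token is hourly TCO divided by tokens served per GPU-hour. The function and constant names are hypothetical, and the implied throughput figure is derived here rather than quoted from the commit.

```python
# Assumed reading of the commit message: dollars per million tokens at 32 tok/s/user.
B200_NVFP4_COST_PER_M = 0.140
H200_INT4_COST_PER_M = 0.413

# B200 TCO relative to H200, as stated in the commit message (assumed per GPU-hour).
B200_TCO_PENALTY = 1.38


def cost_advantage(baseline_cost: float, candidate_cost: float) -> float:
    """How many times cheaper the candidate is than the baseline."""
    return baseline_cost / candidate_cost


def implied_throughput_ratio(cost_ratio: float, tco_penalty: float) -> float:
    """Per-GPU throughput ratio needed to produce cost_ratio.

    Assumes cost per token = hourly TCO / tokens per GPU-hour, so
    cost_ratio = throughput_ratio / tco_penalty.
    """
    return cost_ratio * tco_penalty


ratio = cost_advantage(H200_INT4_COST_PER_M, B200_NVFP4_COST_PER_M)
throughput = implied_throughput_ratio(ratio, B200_TCO_PENALTY)

print(f"{ratio:.2f}x cheaper per million tokens")
print(f"{throughput:.2f}x implied per-GPU throughput advantage")
```

Under those assumptions, the silicon and precision factors together would need to deliver roughly four times the per-GPU throughput to net out at the quoted cost advantage after the TCO penalty.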

8 files changed

Lines changed: 208 additions & 0 deletions

.claude/skills/write-inferencex-blog/SKILL.md

Lines changed: 7 additions & 0 deletions
@@ -362,6 +362,13 @@ After the PR opens, expect Cursor Bugbot to flag correctness issues in the prose
 - **Write tight first, expand only on request.** Default to 1-3 short paragraphs per explanation; trust the reader to ask for more detail in review. Long preemptive expansions get trimmed back by the reviewer (and overwritten by the browser editor's auto-save while you wait). The compute-comm-overlap framing template in the "Reusable technical framings" section is the upper bound — don't go longer than that even for the most central technical argument.
 - **Don't restate the table contents in prose.** If the reader can see "4,130 vs 941 tok/s/GPU = 4.39x at 125 tok/s/user" in the iso-interactivity row, don't also write it in the closing paragraph after the table. Use the prose around tables to explain the WHY, not to summarize the WHAT. A closing paragraph that just restates the headline number gets removed in editorial review.
 - Don't apologize for non-coverage in the lede — save it for "What's Next".
+- **Don't use the "X, not Y" antithesis construction for emphasis.** AI writing tics this hard — phrases like "the gap is silicon × precision, **not** framework", "every gain came from the kernels, **not** the silicon", "it's a software story, **not** a hardware one", "this is a real lever, **not** a paper one". Reads as performatively contrarian flexing and is one of the loudest AI-prose tells. State the thing on its own; if the "Y" the reader might have guessed is actually plausible-but-wrong, address it on its merits in a separate sentence (or skip it — usually the table that follows kills the wrong guess on its own).
+- Avoid: "The gap is silicon × precision, not framework."
+- Use instead: "The gap is silicon × precision." (or, if you really need to neutralize the framework guess: "Both run the same vLLM build; the spread comes from the silicon and the precision.")
+- Avoid: "This is a real lever, not a paper one."
+- Use instead: just delete the sentence — the data already shows it is real.
+- Avoid: "The lift came from the kernels, not the silicon."
+- Use instead: "Same hardware on both dates — every gain came from the kernels."
 
 ## Reusable technical framings
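
The new house-style rule could be spot-checked mechanically before review. The following is a minimal sketch of such a check, not part of this commit: the regex is a hypothetical heuristic that flags a comma followed by "not" and a short closing phrase, and `find_antithesis` is an illustrative name. It will miss variants and may flag legitimate sentences, so hits are prompts for a human look rather than verdicts.

```python
import re

# Heuristic (an assumption, not the skill's own tooling): ", not <short phrase>"
# ending a sentence, with optional Markdown bold markers around "not".
ANTITHESIS = re.compile(r",\s+\**not\**\s+[^,.;:!?]{1,40}[.!?]", re.IGNORECASE)


def find_antithesis(text: str) -> list[tuple[int, str]]:
    """Return (line_number, matched_fragment) for each suspected 'X, not Y' use."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in ANTITHESIS.finditer(line):
            hits.append((line_number, match.group(0).strip()))
    return hits


if __name__ == "__main__":
    draft = (
        "The gap is silicon × precision, not framework.\n"
        "Both run the same vLLM build; the spread comes from the silicon and the precision.\n"
        "This is a real lever, **not** a paper one.\n"
    )
    for line_number, fragment in find_antithesis(draft):
        print(f"line {line_number}: {fragment}")
```

Run against a draft `.mdx` file, this would surface the "Avoid" examples from the rule while leaving the "Use instead" rewrites untouched.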

packages/app/content/blog/b200-nvfp4-vs-h200-int4-kimi-k2-vllm-perf-per-dollar.mdx

Lines changed: 201 additions & 0 deletions
Large diffs are not rendered by default.
