
Commit 147f063

TimDettmers and claude committed:

Add GLM-4.7 NVFP4 vs BF16 MoE benchmark results on B200

NVFP4 pipeline wins 1.07x–2.17x across all GLM-4.7 shapes. The down projection (K=13696, N=4096) sees up to 2.17x and 322 TFLOPS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent 874dc65

File tree: 2 files changed, +100 −14 lines

bench_moe_pipeline.py

Lines changed: 61 additions & 14 deletions
```diff
@@ -99,34 +99,81 @@ def main():
     print(f"Compute capability: {torch.cuda.get_device_capability(0)}")
     print()
 
+    # GLM-4.7 (352B MoE) shapes from benchmarks/bench_moe_gemm_sm100.py
+    # gate_up: K=4096, N=13696
+    # down: K=13696, N=4096
     configs = [
+        # --- GLM-4.7 gate_up (K=4096, N=13696) ---
         {
-            "name": "Small (Mixtral-like, few tokens)",
+            "name": "GLM4.7 gate_up 8e×8tok",
             "num_experts": 8,
             "K": 4096,
-            "N": 14336,
-            "tokens_per_expert": [4, 8, 2, 6, 4, 8, 2, 6],
+            "N": 13696,
+            "tokens_per_expert": [8] * 8,
         },
         {
-            "name": "Medium (Mixtral-like, moderate tokens)",
+            "name": "GLM4.7 gate_up 8e×32tok",
             "num_experts": 8,
             "K": 4096,
-            "N": 14336,
-            "tokens_per_expert": [32, 48, 16, 64, 24, 40, 56, 8],
+            "N": 13696,
+            "tokens_per_expert": [32] * 8,
         },
         {
-            "name": "Large (Mixtral-like, many tokens)",
+            "name": "GLM4.7 gate_up 8e×64tok",
             "num_experts": 8,
             "K": 4096,
-            "N": 14336,
-            "tokens_per_expert": [128, 128, 128, 128, 128, 128, 128, 128],
+            "N": 13696,
+            "tokens_per_expert": [64] * 8,
         },
         {
-            "name": "DeepSeek-like (more experts, smaller)",
-            "num_experts": 16,
-            "K": 2048,
-            "N": 5632,
-            "tokens_per_expert": [8] * 16,
+            "name": "GLM4.7 gate_up 8e×128tok",
+            "num_experts": 8,
+            "K": 4096,
+            "N": 13696,
+            "tokens_per_expert": [128] * 8,
+        },
+        {
+            "name": "GLM4.7 gate_up 8e skewed",
+            "num_experts": 8,
+            "K": 4096,
+            "N": 13696,
+            "tokens_per_expert": [128, 64, 32, 16, 8, 4, 2, 1],
+        },
+        # --- GLM-4.7 down (K=13696, N=4096) ---
+        {
+            "name": "GLM4.7 down 8e×8tok",
+            "num_experts": 8,
+            "K": 13696,
+            "N": 4096,
+            "tokens_per_expert": [8] * 8,
+        },
+        {
+            "name": "GLM4.7 down 8e×32tok",
+            "num_experts": 8,
+            "K": 13696,
+            "N": 4096,
+            "tokens_per_expert": [32] * 8,
+        },
+        {
+            "name": "GLM4.7 down 8e×64tok",
+            "num_experts": 8,
+            "K": 13696,
+            "N": 4096,
+            "tokens_per_expert": [64] * 8,
+        },
+        {
+            "name": "GLM4.7 down 8e×128tok",
+            "num_experts": 8,
+            "K": 13696,
+            "N": 4096,
+            "tokens_per_expert": [128] * 8,
+        },
+        {
+            "name": "GLM4.7 down 8e skewed",
+            "num_experts": 8,
+            "K": 13696,
+            "N": 4096,
+            "tokens_per_expert": [128, 64, 32, 16, 8, 4, 2, 1],
         },
     ]
```

Lines changed: 39 additions & 0 deletions

# NVFP4 MoE Pipeline vs BF16 Benchmark — B200

**GPU**: NVIDIA B200 (SM_100, compute capability 10.0)
**Date**: 2026-03-09
**Benchmark**: `bench_moe_pipeline.py` (100 iterations, 20 warmup)
## GLM-4.7 (352B MoE) Shapes

### gate_up (K=4096, N=13696)

| Config | BF16 (ms) | NVFP4 (ms) | Speedup | BF16 TFLOPS | NVFP4 TFLOPS |
|---|---|---|---|---|---|
| 8e × 8 tokens (64 total) | 0.501 | 0.267 | **1.87x** | 14.33 | 26.85 |
| 8e × 32 tokens (256 total) | 0.533 | 0.321 | **1.66x** | 53.90 | 89.58 |
| 8e × 64 tokens (512 total) | 0.555 | 0.383 | **1.45x** | 103.52 | 150.17 |
| 8e × 128 tokens (1024 total) | 0.597 | 0.514 | **1.16x** | 192.55 | 223.57 |
| 8e skewed (255 total) | 0.538 | 0.506 | **1.07x** | 53.14 | 56.60 |
### down (K=13696, N=4096)

| Config | BF16 (ms) | NVFP4 (ms) | Speedup | BF16 TFLOPS | NVFP4 TFLOPS |
|---|---|---|---|---|---|
| 8e × 8 tokens (64 total) | 0.546 | 0.254 | **2.15x** | 13.16 | 28.29 |
| 8e × 32 tokens (256 total) | 0.588 | 0.271 | **2.17x** | 48.85 | 106.01 |
| 8e × 64 tokens (512 total) | 0.578 | 0.296 | **1.95x** | 99.39 | 194.12 |
| 8e × 128 tokens (1024 total) | 0.599 | 0.356 | **1.68x** | 191.81 | 322.79 |
| 8e skewed (255 total) | 0.562 | 0.308 | **1.83x** | 50.87 | 93.01 |
## Summary

- NVFP4 pipeline wins every configuration (1.07x–2.17x over BF16)
- Down projection benefits most (large K, small N → memory-bandwidth-bound, where NVFP4's roughly 3.6x smaller weight footprint, 4-bit weights plus block scales, cuts traffic)
- Few tokens per expert shows the largest speedup (pipeline-overhead elimination dominates)
- Peak throughput: 322.8 TFLOPS (down proj, 128 tok/expert)
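The bandwidth argument can be sanity-checked with a rough byte count of per-layer weight traffic. This sketch assumes NVFP4's layout of 4-bit weights with one FP8 scale per 16-element block, and ignores activation traffic, per-tensor scales, and caching:

```python
# Rough weight-traffic estimate for the GLM-4.7 down projection
# (K=13696, N=4096, 8 experts). Weight bytes only; activations,
# the per-tensor scale, and cache effects are ignored.
K, N, experts = 13696, 4096, 8
elems = experts * K * N

bf16_bytes = elems * 2                  # 16-bit weights
nvfp4_bytes = elems // 2 + elems // 16  # 4-bit weights + FP8 scale per 16-elem block

print(f"BF16:  {bf16_bytes / 2**20:.0f} MiB")
print(f"NVFP4: {nvfp4_bytes / 2**20:.0f} MiB")
print(f"ratio: {bf16_bytes / nvfp4_bytes:.2f}x")  # ~3.56x
```

With ~3.6x fewer bytes to stream per layer, a bandwidth-bound shape like the down projection gains the most, which matches the speedup tables above.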
## Method

- **BF16 baseline**: per-expert `torch.matmul` in a Python loop (represents the standard MoE dispatch pattern)
- **NVFP4 pipeline**: 6-kernel fused pipeline (abs_max → quantize_raw → scatter → scale_swizzle → batched_GEMM → gather), zero host-GPU sync in the compute path
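The BF16 baseline described above is just a per-expert matmul loop. A minimal sketch, with function name and tensor layout chosen for illustration rather than taken from the repo:

```python
import torch

def bf16_moe_baseline(x, weights, tokens_per_expert):
    """Per-expert matmul loop over tokens grouped by expert.

    x:       [total_tokens, K], tokens sorted by expert
    weights: [num_experts, K, N]
    """
    outs, start = [], 0
    for e, count in enumerate(tokens_per_expert):
        xe = x[start:start + count]    # this expert's token slice
        outs.append(xe @ weights[e])   # [count, K] @ [K, N]
        start += count
    return torch.cat(outs, dim=0)      # [total_tokens, N]

# Tiny shapes for illustration; the benchmark uses K=4096, N=13696 etc.
x = torch.randn(64, 32, dtype=torch.bfloat16)
w = torch.randn(8, 32, 48, dtype=torch.bfloat16)
print(bf16_moe_baseline(x, w, [8] * 8).shape)  # torch.Size([64, 48])
```

Each iteration launches a separate GEMM and slicing op from Python, which is why the fused pipeline's advantage is largest at low token counts, where this per-expert launch overhead dominates.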
