
Commit 147f063

TimDettmers and claude committed:

Add GLM-4.7 NVFP4 vs BF16 MoE benchmark results on B200

NVFP4 pipeline wins 1.07x–2.17x across all GLM-4.7 shapes. The down projection (K=13696, N=4096) sees up to 2.17x and 322 TFLOPS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent 874dc65

File tree: 2 files changed, +100 −14 lines

bench_moe_pipeline.py

Lines changed: 61 additions & 14 deletions
```diff
@@ -99,34 +99,81 @@ def main():
     print(f"Compute capability: {torch.cuda.get_device_capability(0)}")
     print()
 
+    # GLM-4.7 (352B MoE) shapes from benchmarks/bench_moe_gemm_sm100.py
+    # gate_up: K=4096, N=13696
+    # down: K=13696, N=4096
     configs = [
+        # --- GLM-4.7 gate_up (K=4096, N=13696) ---
         {
-            "name": "Small (Mixtral-like, few tokens)",
+            "name": "GLM4.7 gate_up 8e×8tok",
             "num_experts": 8,
             "K": 4096,
-            "N": 14336,
-            "tokens_per_expert": [4, 8, 2, 6, 4, 8, 2, 6],
+            "N": 13696,
+            "tokens_per_expert": [8] * 8,
         },
         {
-            "name": "Medium (Mixtral-like, moderate tokens)",
+            "name": "GLM4.7 gate_up 8e×32tok",
             "num_experts": 8,
             "K": 4096,
-            "N": 14336,
-            "tokens_per_expert": [32, 48, 16, 64, 24, 40, 56, 8],
+            "N": 13696,
+            "tokens_per_expert": [32] * 8,
         },
         {
-            "name": "Large (Mixtral-like, many tokens)",
+            "name": "GLM4.7 gate_up 8e×64tok",
             "num_experts": 8,
             "K": 4096,
-            "N": 14336,
-            "tokens_per_expert": [128, 128, 128, 128, 128, 128, 128, 128],
+            "N": 13696,
+            "tokens_per_expert": [64] * 8,
         },
         {
-            "name": "DeepSeek-like (more experts, smaller)",
-            "num_experts": 16,
-            "K": 2048,
-            "N": 5632,
-            "tokens_per_expert": [8] * 16,
+            "name": "GLM4.7 gate_up 8e×128tok",
+            "num_experts": 8,
+            "K": 4096,
+            "N": 13696,
+            "tokens_per_expert": [128] * 8,
+        },
+        {
+            "name": "GLM4.7 gate_up 8e skewed",
+            "num_experts": 8,
+            "K": 4096,
+            "N": 13696,
+            "tokens_per_expert": [128, 64, 32, 16, 8, 4, 2, 1],
+        },
+        # --- GLM-4.7 down (K=13696, N=4096) ---
+        {
+            "name": "GLM4.7 down 8e×8tok",
+            "num_experts": 8,
+            "K": 13696,
+            "N": 4096,
+            "tokens_per_expert": [8] * 8,
+        },
+        {
+            "name": "GLM4.7 down 8e×32tok",
+            "num_experts": 8,
+            "K": 13696,
+            "N": 4096,
+            "tokens_per_expert": [32] * 8,
+        },
+        {
+            "name": "GLM4.7 down 8e×64tok",
+            "num_experts": 8,
+            "K": 13696,
+            "N": 4096,
+            "tokens_per_expert": [64] * 8,
+        },
+        {
+            "name": "GLM4.7 down 8e×128tok",
+            "num_experts": 8,
+            "K": 13696,
+            "N": 4096,
+            "tokens_per_expert": [128] * 8,
+        },
+        {
+            "name": "GLM4.7 down 8e skewed",
+            "num_experts": 8,
+            "K": 13696,
+            "N": 4096,
+            "tokens_per_expert": [128, 64, 32, 16, 8, 4, 2, 1],
         },
     ]
```

Lines changed: 39 additions & 0 deletions

# NVFP4 MoE Pipeline vs BF16 Benchmark — B200

**GPU**: NVIDIA B200 (SM_100, compute capability 10.0)
**Date**: 2026-03-09
**Benchmark**: `bench_moe_pipeline.py` (100 iterations, 20 warmup)
## GLM-4.7 (352B MoE) Shapes

### gate_up (K=4096, N=13696)

| Config | BF16 (ms) | NVFP4 (ms) | Speedup | BF16 TFLOPS | NVFP4 TFLOPS |
|---|---|---|---|---|---|
| 8e × 8 tokens (64 total) | 0.501 | 0.267 | **1.87x** | 14.33 | 26.85 |
| 8e × 32 tokens (256 total) | 0.533 | 0.321 | **1.66x** | 53.90 | 89.58 |
| 8e × 64 tokens (512 total) | 0.555 | 0.383 | **1.45x** | 103.52 | 150.17 |
| 8e × 128 tokens (1024 total) | 0.597 | 0.514 | **1.16x** | 192.55 | 223.57 |
| 8e skewed (255 total) | 0.538 | 0.506 | **1.07x** | 53.14 | 56.60 |
### down (K=13696, N=4096)

| Config | BF16 (ms) | NVFP4 (ms) | Speedup | BF16 TFLOPS | NVFP4 TFLOPS |
|---|---|---|---|---|---|
| 8e × 8 tokens (64 total) | 0.546 | 0.254 | **2.15x** | 13.16 | 28.29 |
| 8e × 32 tokens (256 total) | 0.588 | 0.271 | **2.17x** | 48.85 | 106.01 |
| 8e × 64 tokens (512 total) | 0.578 | 0.296 | **1.95x** | 99.39 | 194.12 |
| 8e × 128 tokens (1024 total) | 0.599 | 0.356 | **1.68x** | 191.81 | 322.79 |
| 8e skewed (255 total) | 0.562 | 0.308 | **1.83x** | 50.87 | 93.01 |
## Summary

- NVFP4 pipeline wins every configuration (1.07x–2.17x over BF16)
- Down projection benefits most (large K, small N → memory-bandwidth-bound, where NVFP4's roughly 3.6x smaller weight footprint, 4-bit weights plus block scales, cuts traffic)
- Few tokens per expert shows the largest speedup (pipeline-overhead elimination dominates)
- Peak throughput: 322.8 TFLOPS (down proj, 128 tok/expert)
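The bandwidth argument can be sanity-checked with a rough byte count of per-layer weight traffic. This sketch assumes NVFP4's layout of 4-bit weights with one FP8 scale per 16-element block, and ignores activation traffic, per-tensor scales, and caching:

```python
# Rough weight-traffic estimate for the GLM-4.7 down projection
# (K=13696, N=4096, 8 experts). Weight bytes only; activations,
# the per-tensor scale, and cache effects are ignored.
K, N, experts = 13696, 4096, 8
elems = experts * K * N

bf16_bytes = elems * 2                  # 16-bit weights
nvfp4_bytes = elems // 2 + elems // 16  # 4-bit weights + FP8 scale per 16-elem block

print(f"BF16:  {bf16_bytes / 2**20:.0f} MiB")
print(f"NVFP4: {nvfp4_bytes / 2**20:.0f} MiB")
print(f"ratio: {bf16_bytes / nvfp4_bytes:.2f}x")  # ~3.56x
```

With ~3.6x fewer bytes to stream per layer, a bandwidth-bound shape like the down projection gains the most, which matches the speedup tables above.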
## Method

- **BF16 baseline**: per-expert `torch.matmul` in a Python loop (represents the standard MoE dispatch pattern)
- **NVFP4 pipeline**: 6-kernel fused pipeline (abs_max → quantize_raw → scatter → scale_swizzle → batched_GEMM → gather), zero host-GPU sync in the compute path
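The BF16 baseline described above is just a per-expert matmul loop. A minimal sketch, with function name and tensor layout chosen for illustration rather than taken from the repo:

```python
import torch

def bf16_moe_baseline(x, weights, tokens_per_expert):
    """Per-expert matmul loop over tokens grouped by expert.

    x:       [total_tokens, K], tokens sorted by expert
    weights: [num_experts, K, N]
    """
    outs, start = [], 0
    for e, count in enumerate(tokens_per_expert):
        xe = x[start:start + count]    # this expert's token slice
        outs.append(xe @ weights[e])   # [count, K] @ [K, N]
        start += count
    return torch.cat(outs, dim=0)      # [total_tokens, N]

# Tiny shapes for illustration; the benchmark uses K=4096, N=13696 etc.
x = torch.randn(64, 32, dtype=torch.bfloat16)
w = torch.randn(8, 32, 48, dtype=torch.bfloat16)
print(bf16_moe_baseline(x, w, [8] * 8).shape)  # torch.Size([64, 48])
```

Each iteration launches a separate GEMM and slicing op from Python, which is why the fused pipeline's advantage is largest at low token counts, where this per-expert launch overhead dominates.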
