Commit 96ba7a2
Add optimization plan: three-kernel strategy for kbit inference
Comprehensive analysis revealed the optimal kernel dispatch:
- Scalar GEMV (new, P0): decode M=1-4, projected 3-5x over cuBLAS
- Grouped GEMM (existing): MoE at batch>=8, 1.6-2x over bmm
- Dequant + cuBLAS (existing): dense prefill, ~80-90% of fp16 speed
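The three-way dispatch above can be sketched as a simple rule on batch size M and layer type (function name, thresholds as stated in this message; the kernels themselves are assumed to exist elsewhere):

```python
def pick_kernel(m: int, is_moe: bool, is_decode: bool) -> str:
    """Illustrative dispatch for the three-kernel strategy.

    - decode at tiny batch (M=1-4): scalar GEMV over the kbit weights
    - MoE layers at batch >= 8: grouped GEMM
    - everything else (dense prefill): dequantize, then call cuBLAS
    """
    if is_decode and m <= 4:
        return "scalar_gemv"    # P0: projected 3-5x over cuBLAS at M=1-4
    if is_moe and m >= 8:
        return "grouped_gemm"   # 1.6-2x over a bmm-based baseline
    return "dequant_cublas"     # ~80-90% of fp16 cuBLAS speed
```

For example, `pick_kernel(1, False, True)` selects the scalar GEMV path, while a dense prefill batch of 512 falls through to dequant + cuBLAS.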
Key findings documented:
- Fused MMA kernel achieves only 31% BW efficiency (vs cuBLAS 69%)
due to MMA waste at M=1 and dequant instruction overhead
- The 3.6x data compression yields only 1.6-2x speedup because the
efficiency gap cancels half the compression advantage
- Dequant kernel is fast (42us, 72% peak BW) when absmax is
pre-encoded to E4M4; passing fp32 adds 800us of re-encoding
- Dense layers are always L2-resident for target models, so the
fused kernel can never beat cuBLAS on them
- MLP fusion (gate/up/down) saves <0.1% of memory traffic
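The compression-vs-efficiency arithmetic in the findings above checks out directly; for a memory-bound kernel, effective speedup is bytes saved times relative bandwidth efficiency (numbers are the ones stated in this commit):

```python
compression = 3.6       # kbit bytes read vs fp16 bytes read
fused_bw_eff = 0.31     # fused MMA kernel, fraction of peak bandwidth
cublas_bw_eff = 0.69    # fp16 cuBLAS, fraction of peak bandwidth

# Memory-bound speedup = data compression x relative BW efficiency.
speedup = compression * (fused_bw_eff / cublas_bw_eff)
print(f"{speedup:.2f}x")  # → 1.62x: the efficiency gap cancels ~half of 3.6x
```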
bench_crossover.py: adds dense-crossover and full-model speedup tables
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3 files changed, 820 insertions(+), 3 deletions(-)