
Commit 96ba7a2

TimDettmers and claude committed
Add optimization plan: three-kernel strategy for kbit inference
Comprehensive analysis revealed the optimal kernel dispatch:
- Scalar GEMV (new, P0): decode at M=1-4, projected 3-5x over cuBLAS
- Grouped GEMM (existing): MoE at batch >= 8, 1.6-2x over bmm
- Dequant + cuBLAS (existing): dense prefill, ~80-90% of fp16 speed

Key findings documented:
- The fused MMA kernel achieves only 31% bandwidth efficiency (vs cuBLAS at 69%) due to MMA waste at M=1 and dequant instruction overhead
- The 3.6x data compression yields only a 1.6-2x speedup because the efficiency gap cancels half of the compression advantage
- The dequant kernel is fast (42us, 72% of peak bandwidth) when absmax is pre-encoded to E4M4; passing fp32 adds 800us of re-encoding
- Dense layers are always L2-resident for the target models, so the fused kernel can never beat cuBLAS on them
- MLP fusion (gate/up/down) saves <0.1% of memory traffic

bench_crossover.py: dense crossover + full model speedup tables

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
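The three-way dispatch above can be sketched as a small selection function. This is a hypothetical illustration of the plan, not the repository's actual API; the function name `choose_kernel` and the exact thresholds are assumptions drawn only from the commit message.

```python
def choose_kernel(m: int, is_moe: bool) -> str:
    """Pick a kernel for an M x K @ K x N kbit matmul (hypothetical sketch).

    Thresholds follow the commit message: scalar GEMV for decode at
    M=1-4, grouped GEMM for MoE batches >= 8, dequant + cuBLAS otherwise.
    """
    if m <= 4:
        return "scalar_gemv"     # decode: projected 3-5x over cuBLAS
    if is_moe and m >= 8:
        return "grouped_gemm"    # MoE batches: 1.6-2x over bmm
    return "dequant_cublas"      # dense prefill: ~80-90% of fp16 speed
```

A decode step (M=1) would route to the scalar GEMV, while a dense prefill batch would dequantize and fall back to cuBLAS.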
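The "compression advantage cancels" finding follows from simple arithmetic: the achievable speedup of a bandwidth-bound kernel is the data-size reduction scaled by the relative bandwidth efficiency. A minimal sketch using the figures from the commit message:

```python
compression = 3.6      # kbit weights move 3.6x less data than fp16
fused_bw_eff = 0.31    # fused MMA kernel: fraction of peak bandwidth
cublas_bw_eff = 0.69   # cuBLAS baseline: fraction of peak bandwidth

# For a bandwidth-bound op, speedup = compression * (our eff / baseline eff).
speedup = compression * fused_bw_eff / cublas_bw_eff
```

This evaluates to roughly 1.6x, consistent with the observed 1.6-2x: the 2.2x efficiency gap between the fused kernel and cuBLAS eats more than half of the 3.6x compression advantage.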
1 parent daa2f12 commit 96ba7a2

File tree

3 files changed: +820 / -3 lines


0 commit comments
