
Commit d800bb5

TimDettmers and claude committed
Rewrite kbit scalar GEMV: 2-4.5x faster, beats cuBLAS fp16 at M=1
Complete rewrite of the kbit_scalar_gemv kernel for single-token inference. Achieves 3-5x speedup over cuBLAS fp16 GEMV at M=1 by leveraging 3.2x data compression from k-bit quantization.

Kernel architecture (v8):
- 64 threads (2 warps) per block, 1 output column per block, grid=N
- Dequant-once inner loop: decode each weight once, FMA across all M rows
- int4 vector loads for A (8 fp16 per load), eliminating L1 thrashing
- Vectorized B plane loads (int4 for k=4, uint2 for k=2)
- Per-M launch_bounds: M<=2 uses 24 blocks/SM (0 spills), M>=3 uses 16 blocks/SM (0 spills, relaxed register budget)
- No split-K, no workspace allocation, simplified Python wiring

Performance (RTX 4090, K=2048, N=5120, ncu single-kernel):
- k=4, M=1: 13.1 us, 512 GB/s (3.9x faster than cuBLAS fp16)
- k=4, M=2: 14.8 us, 450 GB/s (only 13% slower than M=1)
- k=4, M=3: 16.6 us, 401 GB/s
- k=4, M=4: 19.8 us, 337 GB/s
- Crossover vs cuBLAS: M~2 (tensor cores close the gap)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
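The "dequant-once" structure in the architecture notes can be illustrated with a CPU reference in NumPy: each packed k-bit weight is decoded exactly once, then reused in an FMA against every one of the M activation rows. The 4-bit packing layout and codebook below are hypothetical stand-ins for illustration, not the actual kbit kernel format.

```python
# CPU sketch (NumPy) of a dequant-once k-bit GEMV inner loop.
# Hypothetical layout: two 4-bit codebook indices per byte, weights
# stored column-major (flat index n*K + kk for element B[kk, n]).
import numpy as np

def kbit_gemv_dequant_once(A, B_packed, codebook, K, N):
    """A: (M, K) activations; B_packed: (K*N//2,) uint8 packed weights;
    codebook: (16,) float32 dequant values. Returns C = A @ B, shape (M, N)."""
    M = A.shape[0]
    C = np.zeros((M, N), dtype=np.float32)
    for n in range(N):                # one output column per "block"
        for kk in range(K):
            idx = n * K + kk          # flat index of weight B[kk, n]
            byte = B_packed[idx // 2]
            nibble = (byte >> (4 * (idx % 2))) & 0xF
            w = codebook[nibble]      # decode the weight once ...
            for m in range(M):        # ... then FMA across all M rows
                C[m, n] += A[m, kk] * w
    return C
```

This inverts the naive loop order (per-row decode) so the decode cost is amortized over M, which is why the M=2 timing above sits only 13% behind M=1.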
1 parent aba14e8 commit d800bb5
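A quick back-of-envelope check ties the quoted bandwidth to the quoted time, assuming the GB/s figure counts the compressed weight bytes (fp16 size divided by the stated 3.2x compression) and that activation/output traffic at M=1 is negligible:

```python
# Sanity check for the k=4, M=1 row: 13.1 us at K=2048, N=5120.
K, N = 2048, 5120
fp16_bytes = K * N * 2               # dense fp16 weight matrix
compressed_bytes = fp16_bytes / 3.2  # 3.2x compression from the commit message
t = 13.1e-6                          # kernel time in seconds
bw = compressed_bytes / t            # effective bandwidth, bytes/s
print(f"{bw / 1e9:.0f} GB/s")        # → 500 GB/s
```

That lands within a few percent of the quoted 512 GB/s, consistent with the kernel being memory-bound on the compressed weight stream.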

File tree

9 files changed: +953 −2305 lines


agents/scalar_gemv_guide.md

Lines changed: 0 additions & 383 deletions
This file was deleted.

0 commit comments
