Commit d800bb5
Rewrite kbit scalar GEMV: 2-4.5x faster, beats cuBLAS fp16 at M=1
Complete rewrite of the kbit_scalar_gemv kernel for single-token inference.
Achieves a 2-4.5x speedup over cuBLAS fp16 GEMV at M=1 by leveraging the
~3.2x data compression from k-bit quantization.
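One way to account for the ~3.2x figure (hypothetical block size; the commit does not state the actual quantization block layout):

```cpp
// Hypothetical accounting for the ~3.2x compression claim: 4-bit codes plus
// one fp16 scale per block of 16 weights gives 4 + 16/16 = 5 effective
// bits/weight, so compression vs fp16 is 16/5 = 3.2x. The real kernel's
// block size and scale format may differ; this is one consistent reading.
constexpr double bits_per_weight(int code_bits, int scale_bits, int block) {
    return code_bits + static_cast<double>(scale_bits) / block;
}
constexpr double compression = 16.0 / bits_per_weight(4, 16, 16);  // 3.2x
```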
Kernel architecture (v8):
- 64 threads (2 warps) per block, 1 output column per block, grid=N
- Dequant-once inner loop: decode weight once, FMA across all M rows
- int4 vector loads for A (8 fp16 per load), eliminates L1 thrashing
- Vectorized B plane loads (int4 for k=4, uint2 for k=2)
- Per-M launch_bounds: M<=2 uses 24 blocks/SM (0 spills),
M>=3 uses 16 blocks/SM (0 spills, relaxed register budget)
- No split-K, no workspace allocation, simplified Python wiring
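The dequant-once structure above can be sketched on the CPU. This is a minimal sketch, not the kernel itself: the packing (two 4-bit codes per byte, column-major B, one scale per column) and the symmetric `code - 8` decode are assumptions; the actual kernel uses vectorized plane loads and a possibly different codebook.

```cpp
#include <cstdint>
#include <vector>

// CPU sketch of the "dequant-once" inner loop: each packed 4-bit weight is
// decoded exactly once, then reused in an FMA against every row of A.
// Assumed layout: B column-major, two codes per byte (low nibble first),
// one float scale per output column, symmetric zero point at code 8.
std::vector<float> gemv_k4(const std::vector<float>& A,      // M x K
                           const std::vector<uint8_t>& Bq,   // K*N/2 packed
                           const std::vector<float>& scale,  // N
                           int M, int K, int N) {
    std::vector<float> C(M * N, 0.0f);
    for (int n = 0; n < N; ++n) {                  // one output column
        for (int kk = 0; kk < K; ++kk) {
            uint8_t byte = Bq[(n * K + kk) / 2];
            int code = (kk & 1) ? (byte >> 4) : (byte & 0xF);
            float w = (code - 8) * scale[n];       // decode once...
            for (int m = 0; m < M; ++m)            // ...FMA across all M rows
                C[m * N + n] += A[m * K + kk] * w;
        }
    }
    return C;
}
```

Because the decode cost is paid once per weight regardless of M, small increases in M are nearly free, which matches the flat M=1 to M=2 timings below.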
Performance (RTX 4090, K=2048, N=5120, ncu single-kernel timings):
- k=4 M=1: 13.1 us, 512 GB/s (3.9x faster than cuBLAS fp16)
- k=4 M=2: 14.8 us, 450 GB/s (only 13% slower than M=1)
- k=4 M=3: 16.6 us, 401 GB/s
- k=4 M=4: 19.8 us, 337 GB/s
- Crossover vs cuBLAS: M~2 (tensor cores close the gap)
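A quick sanity check on the M=1, k=4 row (a back-of-envelope sketch, not from the commit):

```cpp
// At 4 bits/weight the packed B matrix alone is K*N/2 bytes; streaming it
// in 13.1 us already implies roughly 400 GB/s. The 512 GB/s ncu reports
// additionally counts scale metadata, the A activations, and the C write,
// consistent with the kernel being memory-bound on B.
double effective_gbps(double K, double N, double kbits, double seconds) {
    double b_bytes = K * N * kbits / 8.0;  // packed weight bytes only
    return b_bytes / seconds / 1e9;        // achieved GB/s from B alone
}
```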
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
File tree (9 files changed: +953, -2305 lines):
- agents
- bitsandbytes
  - backends/cuda
- csrc
- tests