Three changes to the production kernel (kbit_gemm_prod):
1. Branchless absmax decode: new decode_e4m4_absmax_branchless() eliminates
BSSY/BSYNC divergence-handling pairs in SASS. Subnormals are decoded on the
normal path (acceptable since no real weight block has absmax < 2^-10).
2. Interleaved bit extraction: all 4 fragment elements' bit extractions
interleaved in a single loop over K_BITS, giving the compiler more ILP
across elements and bit-planes.
3. Two-tier k_splits heuristic: Tier 1 (severe SM underutilization, < 25%)
splits aggressively. Tier 2 (new) splits conservatively (cap 2) when data
exceeds L2 cache (> 24 MB) and SM utilization is moderate. Llama3-8B
improves ~25% with k_splits=2.
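The two-tier heuristic in item 3 can be sketched roughly as follows. This is
an illustration under stated assumptions, not the code in this commit: the
names SplitInputs and choose_k_splits, the utilization estimate (output tiles
divided by SM count), the 75% "moderate" cutoff, and the Tier 1 split count
are all hypothetical. Only the 25% threshold, the 24 MB L2 figure, and the
Tier 2 cap of 2 come from the description above.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical inputs; the real kernel launcher may track different quantities.
struct SplitInputs {
    int output_tiles;         // thread blocks launched when K is not split
    int k_tiles;              // number of tiles along the K dimension
    int num_sms;              // streaming multiprocessors on the device
    std::int64_t data_bytes;  // bytes streamed by the GEMM
};

// Returns the number of K splits (1 means no split).
inline int choose_k_splits(const SplitInputs& in) {
    constexpr double kSevereUtil = 0.25;    // Tier 1 threshold (from the commit message)
    constexpr double kModerateUtil = 0.75;  // ASSUMPTION: upper bound for "moderate"
    constexpr std::int64_t kL2Bytes = 24LL * 1024 * 1024;  // 24 MB (from the commit message)

    if (in.output_tiles <= 0 || in.k_tiles <= 0 || in.num_sms <= 0) {
        return 1;  // degenerate shape: do not split
    }

    // ASSUMPTION: utilization is estimated as blocks per SM, capped at 1.
    const double util =
        std::min(1.0, static_cast<double>(in.output_tiles) / in.num_sms);

    if (util < kSevereUtil) {
        // Tier 1: split aggressively, aiming for about one block per SM
        // (ASSUMPTION), but never more splits than there are K tiles.
        const int want = (in.num_sms + in.output_tiles - 1) / in.output_tiles;
        return std::max(1, std::min(want, in.k_tiles));
    }

    if (util < kModerateUtil && in.data_bytes > kL2Bytes) {
        // Tier 2: the working set spills out of L2, so split conservatively.
        return std::min(2, in.k_tiles);
    }

    return 1;  // well utilized or L2-resident: no split
}
```

With these assumed thresholds, a shape with 64 output tiles on a 128-SM
device and 64 MB of data falls into Tier 2 and gets k_splits=2, while the
same shape with 8 MB of data is left unsplit.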
MoE shapes remain at 0.3-0.4x vs cuBLAS. The bottleneck is structural
(1264 SASS instructions per k_tile, 1.3% tensor core utilization), so
Phase 2 restructuring (dequant-during-fetch) is needed.
Also adds optimization2.md documenting root cause analysis and the
dequant-during-fetch restructuring plan.
195/195 tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>