Commit 0d77a61

TimDettmers and claude committed
docs: Rewrite optimization plan — revert v2, focus on grouped expert GEMM
The v2 kernel (dequant-during-fetch with ldmatrix.x2.trans for B) was implemented and tested. It passed all 85 production tests but showed no speedup on MoE shapes (~70 µs, unchanged). Moving dequant from the compute stage to the fetch stage just moved the bottleneck — the total ALU work is unchanged.

Analysis of overlap strategies shows none are viable on Ada (sm_89):

- mma.sync is synchronous — warps stall during MMA and cannot do ALU work
- MMA latency (~30 cycles) is 10-40x too short to hide dequant (~300+ cycles)
- Warp specialization provides negligible overlap for the same reason

Confirmed via web search: consumer Blackwell (sm_120, RTX 5090, RTX PRO 6000) also uses mma.sync. Only Hopper (sm_90a, wgmma) and datacenter Blackwell (sm_100a, tcgen05.mma) have async MMA.

New plan: keep the v1 inner loop (already 2x over cuBLAS on large shapes) and implement a grouped expert GEMM that batches all MoE expert invocations into one kernel launch. This fixes SM utilization (3% → 100%) and makes the workload DRAM-bound, where the 3.6x compression advantage applies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
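The 3% → 100% utilization claim can be illustrated with back-of-the-envelope arithmetic. This is a sketch only: the expert count, CTAs per expert, and SM count below are assumed for illustration and are not taken from the actual kernel.

```python
# Illustrative occupancy arithmetic for per-expert vs. grouped GEMM launches.
# All constants are hypothetical, chosen to reproduce the 3% -> 100% figures.
NUM_SMS = 128          # SM count of an Ada-class GPU (assumed)
NUM_EXPERTS = 32       # MoE expert count (assumed)
CTAS_PER_EXPERT = 4    # a small per-expert GEMM needs only a few thread blocks

# One kernel launch per expert: each launch occupies only a handful of SMs,
# and the rest of the GPU sits idle for the duration of that launch.
per_launch_util = CTAS_PER_EXPERT / NUM_SMS          # 4 / 128 = 3.125%

# Grouped expert GEMM: a single launch contains every expert's tiles, so the
# block scheduler can fill the whole GPU once total CTAs >= SM count.
total_ctas = NUM_EXPERTS * CTAS_PER_EXPERT           # 32 * 4 = 128 CTAs
grouped_util = min(1.0, total_ctas / NUM_SMS)        # capped at 100%

print(f"per-expert launch: {per_launch_util:.0%}")   # 3%
print(f"grouped launch:    {grouped_util:.0%}")      # 100%
```

The same batching is also what makes the workload DRAM-bound: with all SMs busy, aggregate bandwidth demand rises until memory traffic, not launch overhead, is the limit, which is where the compressed format's smaller footprint pays off.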
1 parent dc4343b commit 0d77a61

File tree

1 file changed: +168 −373 lines


0 commit comments
