Commit 0d77a61
docs: Rewrite optimization plan — revert v2, focus on grouped expert GEMM
V2 kernel (dequant-during-fetch with ldmatrix.x2.trans for B) was
implemented and tested. It passed all 85 production tests but showed
no speedup on MoE shapes (~70us unchanged). Moving dequant from
compute to fetch just moved the bottleneck — the total ALU work is
unchanged.
Analysis of overlap strategies shows none are viable on Ada (sm_89):
- mma.sync is synchronous — warps stall during MMA, cannot do ALU
- MMA latency (~30 cycles) is 10-40x too short to hide dequant (~300+ cycles)
- Warp specialization provides negligible overlap for the same reason
Confirmed via web search: consumer Blackwell (sm_120, RTX 5090,
RTX PRO 6000) also uses mma.sync. Only Hopper (sm_90a, wgmma) and
Blackwell datacenter (sm_100a, tcgen05.mma) have async MMA.
New plan: keep v1 inner loop (already 2x over cuBLAS on large shapes),
implement grouped expert GEMM to batch MoE expert invocations into
one kernel launch. This fixes SM utilization (3% → 100%) and makes
the workload DRAM-bound where the 3.6x compression advantage applies.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>