Commit 0d77a61
docs: Rewrite optimization plan — revert v2, focus on grouped expert GEMM
V2 kernel (dequant-during-fetch with ldmatrix.x2.trans for B) was
implemented and tested. It passed all 85 production tests but showed
no speedup on MoE shapes (~70us unchanged). Moving dequant from
compute to fetch just moved the bottleneck — the total ALU work is
unchanged.
Analysis of overlap strategies shows none are viable on Ada (sm_89):
- mma.sync is synchronous — warps stall during MMA, cannot do ALU
- MMA latency (~30 cycles) is 10-40x too short to hide dequant (~300+ cycles)
- Warp specialization provides negligible overlap for the same reason
Confirmed via web search: consumer Blackwell (sm_120, RTX 5090,
RTX PRO 6000) also uses mma.sync. Only Hopper (sm_90a, wgmma) and
Blackwell datacenter (sm_100a, tcgen05.mma) have async MMA.
New plan: keep v1 inner loop (already 2x over cuBLAS on large shapes),
implement grouped expert GEMM to batch MoE expert invocations into
one kernel launch. This fixes SM utilization (3% → 100%) and makes
the workload DRAM-bound where the 3.6x compression advantage applies.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>