Commit efc8716
perf: Double-buffered GEMM kernel for load/compute overlap
While K-step k computes from smem buffer[cur], the cooperative load for
K-step k+1 writes to buffer[nxt]. The warp scheduler interleaves the
global load instructions with MMA compute, hiding load latency. Only one
__syncthreads per K-step instead of two.
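
A minimal sketch of the loop structure described above, not the kernel from this commit. It assumes a hypothetical 16x16 tile, scalar FMA in place of MMA, row-major float matrices, and dimensions that are multiples of the tile size. The names gemm_double_buffered and gemm_max_abs_error are illustrative only, and the smem footprint here (4096 bytes) differs from the commit's.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed tile size; 16 x 16 = 256 threads per block

// C[M x N] = A[M x K] * B[K x N], row-major, all dims multiples of TILE.
__global__ void __launch_bounds__(256, 3)
gemm_double_buffered(const float* A, const float* B, float* C,
                     int M, int N, int K) {
    __shared__ float As[2][TILE][TILE];  // two smem buffers per operand
    __shared__ float Bs[2][TILE][TILE];
    const int tx = threadIdx.x, ty = threadIdx.y;
    const int row = blockIdx.y * TILE + ty;
    const int col = blockIdx.x * TILE + tx;
    const int numTiles = K / TILE;

    // Prologue: a single cooperative load of tile 0, then one sync.
    As[0][ty][tx] = A[row * K + tx];
    Bs[0][ty][tx] = B[ty * N + col];
    __syncthreads();

    float acc = 0.0f;
    for (int t = 0; t < numTiles; ++t) {
        const int cur = t & 1, nxt = cur ^ 1;
        // Issue the load for K-step t+1 into buffer[nxt] before computing,
        // so the scheduler can overlap it with the math on buffer[cur].
        if (t + 1 < numTiles) {
            const int k0 = (t + 1) * TILE;
            As[nxt][ty][tx] = A[row * K + k0 + tx];
            Bs[nxt][ty][tx] = B[(k0 + ty) * N + col];
        }
        #pragma unroll
        for (int k = 0; k < TILE; ++k)
            acc += As[cur][ty][k] * Bs[cur][k][tx];
        // The only barrier per K-step: after it, buffer[nxt] is fully
        // written and buffer[cur] is no longer read, so the roles can swap.
        __syncthreads();
    }
    C[row * N + col] = acc;
}

// Host helper: runs the kernel and returns the max abs error vs. a CPU loop.
float gemm_max_abs_error(int M, int N, int K) {
    assert(M % TILE == 0 && N % TILE == 0 && K % TILE == 0);
    std::vector<float> hA(M * K), hB(K * N), hC(M * N, 0.0f);
    for (int i = 0; i < M * K; ++i) hA[i] = (i % 7) * 0.25f;
    for (int i = 0; i < K * N; ++i) hB[i] = (i % 5) * 0.5f;

    float *dA, *dB, *dC;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid(N / TILE, M / TILE);
    gemm_double_buffered<<<grid, block>>>(dA, dB, dC, M, N, K);
    // A blocking cudaMemcpy waits for the kernel to finish.
    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    float maxErr = 0.0f;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += hA[i * K + k] * hB[k * N + j];
            maxErr = std::fmax(maxErr, std::fabs(ref - hC[i * N + j]));
        }
    return maxErr;
}
```

Calling gemm_max_abs_error(64, 48, 80) from a main() built with nvcc exercises five K-steps, so both buffers are reused. The single barrier is sufficient because each K-step writes only the buffer that every thread finished reading before the previous barrier.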
Reduces prologue to a single load + sync. Uses __launch_bounds__(256, 3) for
~85 regs/thread headroom. Total smem: 2 × 5760 = 11520 bytes/block.
1 parent d95c3ed commit efc8716
1 file changed, +140 -152 lines changed