Commit efc8716

perf: Double-buffered GEMM kernel for load/compute overlap
While K-step k computes from smem buffer[cur], the cooperative load for K-step k+1 writes to buffer[nxt]. The warp scheduler can interleave the global load instructions with the MMA compute, hiding load latency. Because loads and compute target different buffers, each K-step needs only one __syncthreads() instead of two, and the prologue shrinks to a single load + sync.

Uses __launch_bounds__(256, 3), which gives a budget of ~85 regs/thread (65536 / (256 × 3), assuming a 64K-entry register file per SM). Total smem: 2 × 5760 = 11520 bytes/block.
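The buffer rotation described above can be sketched as follows. This is a minimal illustration, not the kernel from this commit: the 16 × 16 tile size, the scalar FMA inner loop (standing in for the MMA path), and the names `gemm_double_buffered` and `double_buffer_max_error` are all assumptions made for the example. It shows only the control structure: one load + sync in the prologue, then a single `__syncthreads()` per K-step.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // hypothetical tile edge; 16 x 16 = 256 threads per block

// One thread per C element. Scalar FMAs stand in for the commit's MMA path.
__global__ void __launch_bounds__(TILE * TILE, 3)
gemm_double_buffered(const float* A, const float* B, float* C, int M, int N, int K) {
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];
    const int tx = threadIdx.x, ty = threadIdx.y;
    const int row = blockIdx.y * TILE + ty;
    const int col = blockIdx.x * TILE + tx;
    const int steps = (K + TILE - 1) / TILE;

    // Prologue: a single cooperative load of K-step 0, then one barrier.
    As[0][ty][tx] = (row < M && tx < K) ? A[row * K + tx] : 0.0f;
    Bs[0][ty][tx] = (ty < K && col < N) ? B[ty * N + col] : 0.0f;
    __syncthreads();

    float acc = 0.0f;
    for (int s = 0; s < steps; ++s) {
        const int cur = s & 1, nxt = cur ^ 1;
        if (s + 1 < steps) {  // condition is uniform across the block
            // Issue the loads for K-step s+1 into the buffer nobody is reading.
            const int ak = (s + 1) * TILE + tx, bk = (s + 1) * TILE + ty;
            As[nxt][ty][tx] = (row < M && ak < K) ? A[row * K + ak] : 0.0f;
            Bs[nxt][ty][tx] = (bk < K && col < N) ? B[bk * N + col] : 0.0f;
        }
        for (int k = 0; k < TILE; ++k) acc += As[cur][ty][k] * Bs[cur][k][tx];
        // The only barrier in the K-step: it publishes buffer[nxt] and also
        // guarantees buffer[cur] is no longer read before it is overwritten.
        __syncthreads();
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Returns max |GPU - CPU| over all C elements, or -1.0f if a CUDA call fails.
float double_buffer_max_error(int M, int N, int K) {
    std::vector<float> hA(M * K), hB(K * N), hC(M * N), ref(M * N, 0.0f);
    for (int i = 0; i < M * K; ++i) hA[i] = float(i % 7) - 3.0f;
    for (int i = 0; i < K * N; ++i) hB[i] = float(i % 5) - 2.0f;
    for (int i = 0; i < M; ++i)        // CPU reference
        for (int k = 0; k < K; ++k)
            for (int j = 0; j < N; ++j) ref[i * N + j] += hA[i * K + k] * hB[k * N + j];

    auto bad = [](cudaError_t e) { return e != cudaSuccess; };
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    const dim3 block(TILE, TILE), grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    bool failed =
        bad(cudaMalloc(&dA, hA.size() * sizeof(float))) ||
        bad(cudaMalloc(&dB, hB.size() * sizeof(float))) ||
        bad(cudaMalloc(&dC, hC.size() * sizeof(float))) ||
        bad(cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice)) ||
        bad(cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice));
    if (!failed) {
        gemm_double_buffered<<<grid, block>>>(dA, dB, dC, M, N, K);
        failed = bad(cudaGetLastError()) || bad(cudaDeviceSynchronize()) ||
                 bad(cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost));
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (failed) return -1.0f;

    float err = 0.0f;
    for (int i = 0; i < M * N; ++i) err = fmaxf(err, fabsf(hC[i] - ref[i]));
    return err;
}

int main() {
    // Sizes deliberately not multiples of TILE, to exercise the bounds checks.
    const float err = double_buffer_max_error(70, 50, 45);
    std::printf("max abs error: %g\n", err);
    return (err >= 0.0f && err < 1e-3f) ? 0 : 1;
}
```

The single barrier is sufficient because the buffer written during step s is the one that was read during step s-1, and the barrier at the end of step s-1 has already retired those reads. A single-buffered loop would need a second barrier between the compute and the next overwrite.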
1 parent d95c3ed commit efc8716

1 file changed

+140
-152
lines changed
