Commit 9c96365
perf: Register-pipelined GEMM with explicit load/compute separation
Restructure the K-loop to explicitly separate global load issue from
load use: (1) issue global loads into registers, (2) compute MMA from
smem, (3) sync, (4) write registers to smem. Steps 1 and 2 overlap
because the loaded registers are not read until step 4, so the warp
scheduler can issue MMA instructions while the loads are in flight.
Uses a single smem buffer with launch_bounds(256,4) for maximum
occupancy (4 blocks/SM x 256 threads = 32 warps/SM).

1 parent efc8716 commit 9c96365
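
The four-step loop order described above can be sketched as follows. This is a minimal illustration, not the committed kernel: it assumes square row-major matrices with N a multiple of the tile size, one output element per thread, and a plain FMA loop standing in for the MMA instructions. The names `gemm_reg_pipelined`, `run_check`, and `TILE` are hypothetical. The second barrier after step 4 is an assumption that the single-buffer scheme needs so the new tile is visible before the next compute step.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int TILE = 16;  // assumed tile edge: 16 x 16 = 256 threads per block

// C = A * B for square N x N row-major matrices, N a multiple of TILE.
// 256 threads per block, at least 4 resident blocks per SM requested.
__global__ void __launch_bounds__(256, 4)
gemm_reg_pipelined(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];  // single smem buffer per operand
    __shared__ float Bs[TILE][TILE];

    const int tx = threadIdx.x, ty = threadIdx.y;
    const int row = blockIdx.y * TILE + ty;
    const int col = blockIdx.x * TILE + tx;

    // Prologue: stage tile 0 in smem before entering the K-loop.
    As[ty][tx] = A[row * N + tx];
    Bs[ty][tx] = B[ty * N + col];
    __syncthreads();

    float acc = 0.0f;
    const int numTiles = N / TILE;
    for (int t = 0; t < numTiles; ++t) {
        const bool hasNext = (t + 1 < numTiles);
        float aNext = 0.0f, bNext = 0.0f;

        // (1) Issue global loads for tile t+1 into registers.
        if (hasNext) {
            const int k0 = (t + 1) * TILE;
            aNext = A[row * N + k0 + tx];
            bNext = B[(k0 + ty) * N + col];
        }

        // (2) Compute on tile t from smem; aNext/bNext are not read here,
        //     so the loads above can stay in flight during this loop.
        #pragma unroll
        for (int k = 0; k < TILE; ++k) acc += As[ty][k] * Bs[k][tx];

        // (3) Sync: every thread is done reading tile t.
        __syncthreads();

        // (4) Write the prefetched registers to smem.
        if (hasNext) {
            As[ty][tx] = aNext;
            Bs[ty][tx] = bNext;
        }
        __syncthreads();  // assumed extra barrier: tile t+1 now visible
    }
    C[row * N + col] = acc;
}

// Host-side check against a CPU reference; returns true on a match.
bool run_check(int N = 64) {
    const size_t bytes = sizeof(float) * N * N;
    std::vector<float> hA(N * N), hB(N * N), hC(N * N);
    for (int i = 0; i < N * N; ++i) {
        hA[i] = float(i % 7) - 3.0f;  // small integers keep float sums exact
        hB[i] = float(i % 5) - 2.0f;
    }
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess ||
        cudaMalloc(&dB, bytes) != cudaSuccess ||
        cudaMalloc(&dC, bytes) != cudaSuccess) return false;
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
    gemm_reg_pipelined<<<grid, block>>>(dA, dB, dC, N);
    bool ok = cudaMemcpy(hC.data(), dC, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    for (int r = 0; ok && r < N; ++r)
        for (int c = 0; c < N; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < N; ++k) ref += hA[r * N + k] * hB[k * N + c];
            if (std::fabs(ref - hC[r * N + c]) > 1e-3f) { ok = false; break; }
        }
    return ok;
}
```

Calling `run_check()` from a `main` and building with `nvcc` should exercise the sketch on a small problem. The `__launch_bounds__(256, 4)` qualifier asks the compiler to limit register use so that four 256-thread blocks can be resident per SM, which is where the 32 warps/SM figure comes from (4 x 256 / 32).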
1 file changed, +115 -79 lines changed