Commit 9c96365

perf: Register-pipelined GEMM with explicit load/compute separation
Restructure the K-loop to explicitly separate issuing global loads from using them. Each iteration (1) issues global loads for the next tile into registers, (2) computes MMAs on the current tile from smem, (3) syncs, and (4) writes the staged registers to smem. Steps 1 and 2 overlap because the warp scheduler can issue MMA instructions while the loads are still in flight. Uses a single smem buffer with __launch_bounds__(256, 4) to maximize occupancy (4 blocks/SM x 8 warps/block = 32 warps/SM).
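The four-step loop described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: it assumes FP32 FMAs in place of tensor-core MMA instructions, row-major matrices whose dimensions are multiples of the tile sizes, and hypothetical names throughout (`gemm_reg_pipelined`, `load_tiles`, `store_tiles`, `max_abs_error_vs_reference`, and the tile constants `BM`, `BN`, `BK`, `TM`, `TN`).

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int BM = 64, BN = 64, BK = 16;  // block tile sizes (assumed, not from the commit)
constexpr int TM = 4, TN = 4;             // per-thread output tile (assumed)

// Stage one BMxBK tile of A and one BKxBN tile of B in registers (4 floats each per thread).
__device__ void load_tiles(const float* A, const float* B, int N, int K,
                           int row0, int col0, int k0, float (&ra)[4], float (&rb)[4]) {
    for (int i = 0; i < 4; ++i) {
        int e = threadIdx.x + i * 256;    // element index within the 1024-element tile
        ra[i] = A[(row0 + e / BK) * K + k0 + e % BK];
        rb[i] = B[(k0 + e / BN) * N + col0 + e % BN];
    }
}

// Write the staged registers into the single smem buffer (A is stored as As[k][m]).
__device__ void store_tiles(float (*As)[BM], float (*Bs)[BN],
                            const float (&ra)[4], const float (&rb)[4]) {
    for (int i = 0; i < 4; ++i) {
        int e = threadIdx.x + i * 256;
        As[e % BK][e / BK] = ra[i];
        Bs[e / BN][e % BN] = rb[i];
    }
}

__global__ void __launch_bounds__(256, 4)
gemm_reg_pipelined(const float* A, const float* B, float* C, int N, int K) {
    __shared__ float As[BK][BM];
    __shared__ float Bs[BK][BN];
    int row0 = blockIdx.y * BM, col0 = blockIdx.x * BN;
    int tr = (threadIdx.x / 16) * TM;     // threads form a 16x16 grid over the block tile
    int tc = (threadIdx.x % 16) * TN;
    float acc[TM][TN] = {};
    float ra[4], rb[4];

    // Prologue: bring the first tile into smem.
    load_tiles(A, B, N, K, row0, col0, 0, ra, rb);
    store_tiles(As, Bs, ra, rb);
    __syncthreads();

    for (int k0 = 0; k0 < K; k0 += BK) {
        bool has_next = k0 + BK < K;
        if (has_next)                                   // (1) issue global loads for the next tile
            load_tiles(A, B, N, K, row0, col0, k0 + BK, ra, rb);
        for (int k = 0; k < BK; ++k)                    // (2) compute on the current tile from smem
            for (int i = 0; i < TM; ++i)
                for (int j = 0; j < TN; ++j)
                    acc[i][j] += As[k][tr + i] * Bs[k][tc + j];
        __syncthreads();                                // (3) all threads are done reading smem
        if (has_next) store_tiles(As, Bs, ra, rb);      // (4) overwrite smem with the staged tile
        __syncthreads();                                // new tile visible before the next compute
    }
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            C[(row0 + tr + i) * N + col0 + tc + j] = acc[i][j];
}

// Host helper: run the kernel and return the max abs error against a CPU reference.
float max_abs_error_vs_reference(int M, int N, int K) {
    std::vector<float> hA(M * K), hB(K * N), hC(M * N);
    for (int i = 0; i < M * K; ++i) hA[i] = (i % 7) * 0.25f;
    for (int i = 0; i < K * N; ++i) hB[i] = (i % 5) * 0.5f;
    float *dA, *dB, *dC;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);
    gemm_reg_pipelined<<<dim3(N / BN, M / BM), 256>>>(dA, dB, dC, N, K);
    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);
    float err = 0.0f;
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += hA[m * K + k] * hB[k * N + n];
            err = std::fmax(err, std::fabs(ref - hC[m * N + n]));
        }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return err;
}
```

Because the values loaded in step 1 are not consumed until step 4, nothing in step 2 depends on them, which is what lets the compute proceed while the loads are outstanding. With a single smem buffer, the second barrier is needed so that no thread starts the next compute phase before the new tile is fully written. The transposed store into `As` in this sketch is not tuned for bank conflicts.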
1 parent efc8716 commit 9c96365

1 file changed: +115 -79 lines changed

