This repository contains a CUDA learning example for optimizing single-precision general matrix multiplication (GEMM):
C = A x B
A: M x K
B: K x N
C: M x N
The current code uses row-major matrix storage and compares each GPU kernel against a reference result generated by the warmup/native kernel path.
gemmnative: one CUDA thread computes one output element.gemmUnroll2: one thread computes two output columns.gemmUnroll4: one thread computes four output columns.gemmSmem: tiled GEMM using shared memory.gemmSmemPad: shared-memory GEMM with padding.gemmSmemUnroll2: shared-memory GEMM combined with column-direction unrolling.gemmSmemUnroll4: shared-memory GEMM combined with four-column unrolling.gemmSmemregisterTile22: shared-memory GEMM with a2 x 2register tile per thread.gemmSmemregisterTile24: shared-memory GEMM with a2 x 4register tile per thread.gemmSmemregisterTile44: shared-memory GEMM with a4 x 4register tile per thread.gemmSmemregisterTile44TILEK32:4 x 4register tile with a larger K tile of 32.gemmSmemregisterTile44TILEK32_SMLoadOptimization: experimental cooperative shared-memory loading version.gemmSmemregisterTile44TILEK32_Btrans: experimental version that stores the B tile transposed in shared memory.gemmSmemregisterTile44TILEK32_Block32x8: experimental32 x 8thread-block version.gemmSmemregisterTile44TILEK32_DB: experimental double-buffered shared-memory version.cublasSgemm: cuBLAS SGEMM baseline for comparison.
GEMM-optimization/
src/
GEMM.cu
kernels.cuh
utils.cuh
profiling/
GEMMNative_NsightCompute.pdf
GEMMRegisterTile24_NsightCompute.pdf
GEMMRegisterTile44_NsightCompute.pdf
GEMMRegisterTile44TILEK32_NsightCompute.pdf
GEMMRegisterTile44_SMLoadOptimization_NsightCompute.pdf
GEMMRegisterTile44_Btrans_NsightCompute.pdf
GEMMRegisterTile44TILEK32_DB_NsightCompute.pdf
README.md
GEMM.cu keeps the main benchmark and validation flow. kernels.cuh contains the CUDA kernels and kernel dispatch logic, while utils.cuh contains data initialization, correctness checking, the optional CPU reference, and the cuBLAS launcher.
Compile with nvcc:
nvcc -arch=sm_89 src/GEMM.cu -lcublas -o GEMMFor another GPU architecture, replace sm_89 with the architecture supported by your device.
./GEMMThe first optional argument selects which kernel to run:
./GEMM 0 # run all kernels
./GEMM 7 # run gemmSmemUnroll4 only
./GEMM 9 # run gemmSmemregisterTile24 only
./GEMM 10 # run cuBLAS SGEMM only
./GEMM 11 # run gemmSmemregisterTile44 only
./GEMM 12 # run gemmSmemregisterTile44TILEK32 only
./GEMM 13 # run gemmSmemregisterTile44TILEK32_SMLoadOptimization only
./GEMM 14 # run gemmSmemregisterTile44TILEK32_Btrans only
./GEMM 15 # run gemmSmemregisterTile44TILEK32_Block32x8 only
./GEMM 16 # run gemmSmemregisterTile44TILEK32_DB onlyKernel IDs:
| ID | Kernel |
|---|---|
| 1 | gemmnative |
| 2 | gemmUnroll2 |
| 3 | gemmUnroll4 |
| 4 | gemmSmem |
| 5 | gemmSmemPad |
| 6 | gemmSmemUnroll2 |
| 7 | gemmSmemUnroll4 |
| 8 | gemmSmemregisterTile22 |
| 9 | gemmSmemregisterTile24 |
| 10 | cublasSgemm |
| 11 | gemmSmemregisterTile44 |
| 12 | gemmSmemregisterTile44TILEK32 |
| 13 | gemmSmemregisterTile44TILEK32_SMLoadOptimization |
| 14 | gemmSmemregisterTile44TILEK32_Btrans |
| 15 | gemmSmemregisterTile44TILEK32_Block32x8 |
| 16 | gemmSmemregisterTile44TILEK32_DB |
The default matrix size in the source is:
int M = 1 << 12;
int N = 1 << 12;
int K = 1 << 12;The program validates each optimized kernel against a reference result generated by the warmup/native kernel path:
- Input matrices are initialized with random FP32 values.
- Error tolerance is
1e-2. - Each selected kernel is checked against the reference output.
Expected successful output includes:
Match !
for each kernel being validated.
The kernels can be profiled with NVIDIA Nsight Compute or Nsight Systems. A typical optimization progression in this file is:
native -> unroll2 -> unroll4 -> shared memory -> shared memory + unroll -> register tiling
One profiling run on an NVIDIA GeForce RTX 4060 Laptop GPU produced the following kernel durations for M = N = K = 4096 and 16 x 16 thread blocks. The GEMM workload performs:
2 * M * N * K = 137,438,953,472 floating-point operations
Approximate throughput is computed as:
TFLOP/s = 2 * M * N * K / duration_seconds / 1e12
| Kernel | Duration | Speedup vs Native | Approx. TFLOP/s |
|---|---|---|---|
gemmnative |
248.43 ms | 1.00x | 0.55 |
gemmUnroll2 |
185.53 ms | 1.34x | 0.74 |
gemmUnroll4 |
177.22 ms | 1.40x | 0.78 |
gemmSmem |
177.99 ms | 1.40x | 0.77 |
gemmSmemPad |
263.77 ms | 0.94x | 0.52 |
gemmSmemUnroll2 |
158.55 ms | 1.57x | 0.87 |
gemmSmemUnroll4 |
149.35 ms | 1.66x | 0.92 |
gemmSmemregisterTile22 |
92.47 ms | 2.69x | 1.49 |
gemmSmemregisterTile24 |
81.59 ms | 3.05x | 1.68 |
gemmSmemregisterTile44 |
48.10 ms | 5.16x | 2.86 |
gemmSmemregisterTile44TILEK32 |
42.54 ms | 5.84x | 3.23 |
gemmSmemregisterTile44TILEK32_SMLoadOptimization |
50.13 ms | 4.96x | 2.74 |
gemmSmemregisterTile44TILEK32_Btrans |
160.35 ms | 1.55x | 0.86 |
gemmSmemregisterTile44TILEK32_Block32x8 |
48.51 ms | 5.12x | 2.83 |
gemmSmemregisterTile44TILEK32_DB |
46.20 ms | 5.38x | 2.98 |
In this version, gemmSmemregisterTile44TILEK32 is the fastest custom kernel in the benchmark above, reaching about 5.84x speedup over the native kernel and about 55.1% of the cuBLAS SGEMM throughput. cuBLAS is called with CUBLAS_DEFAULT_MATH and uses the row-major equivalence C^T = B^T x A^T.
gemmSmemPad is slower than the naive kernel in this benchmark. This suggests that padding alone does not guarantee better performance: extra shared-memory indexing and synchronization overhead can dominate when memory coalescing and data reuse are not improved enough.
The register-tiled kernels improve performance mainly by increasing per-thread work and reusing loaded shared-memory values across multiple output elements. Moving from a 2 x 4 tile to a 4 x 4 tile further reduces the number of thread blocks and improves arithmetic intensity, but it also increases register usage and lowers achieved occupancy. Increasing the K tile from 16 to 32 reduces K-tile loop iterations and synchronization points, improving the best custom kernel from 48.10 ms to 42.54 ms. The custom kernel is still slower than cuBLAS, which uses highly tuned tiling, scheduling, instruction selection, and architecture-specific kernels.
The cooperative shared-memory loading experiment, gemmSmemregisterTile44TILEK32_SMLoadOptimization, is slower than the original TILE_K=32 version in this benchmark (50.13 ms vs 42.54 ms). This suggests that more even shared-memory loading does not automatically improve performance when the original load pattern is already efficient enough and the extra indexing/loop structure reduces instruction scheduling efficiency.
The transposed-B shared-memory experiment, gemmSmemregisterTile44TILEK32_Btrans, is much slower (160.35 ms) and increases register usage from 48 to 65 registers per thread. In this kernel, B's global-memory loads are already coalesced, so transposing B in shared memory does not fix a major global-memory bottleneck. It also changes the compute loop from SmemB[k][n] to SmemB[n][k], which creates a less favorable shared-memory access pattern for this thread layout and likely introduces more bank conflicts or shared-memory replay overhead.
The 32 x 8 thread-block experiment changes the C tile shape from a balanced 64 x 64 block tile to a wider 32 x 128 block tile. It still validates correctly, but it increases register usage to 64 registers per thread and is slower than the best 16 x 16 block version (48.51 ms vs 42.54 ms). The double-buffered version uses two shared-memory buffers and improves on the 32 x 8 and cooperative-loading experiments, but it is still slower than the original TILE_K=32 kernel (46.20 ms vs 42.54 ms). Since this implementation uses normal global loads rather than cp.async, the double-buffer structure does not yet provide true overlap between global-memory loading and computation.
The experimental 4 x 8 register-tiled kernel is kept commented out in the source. It was disabled because Nsight Compute did not report a stable timing for this version, likely due to excessive register pressure from holding 32 accumulators per thread.
cuBLAS SGEMM is included as a production-grade baseline. The custom kernels are not expected to beat cuBLAS in this learning project; the important comparison is how much of cuBLAS performance the best custom kernel reaches and what optimization steps close the gap.
For M = N = K = 4096 on the RTX 4060 Laptop GPU:
| Implementation | Duration | Approx. TFLOP/s |
|---|---|---|
| cuBLAS SGEMM | 23.44 ms | 5.86 |
Best custom kernel: gemmSmemregisterTile44TILEK32 |
42.54 ms | 3.23 |
Native custom kernel: gemmnative |
248.43 ms | 0.55 |
See profiling reports:
- Naive GEMM Nsight Compute report
- 2x4 Register Tiling Nsight Compute report
- 4x4 Register Tiling Nsight Compute report
- 4x4 Register Tiling with TILE_K=32 Nsight Compute report
- 4x4 TILE_K=32 Cooperative Shared-Memory Loading Nsight Compute report
- 4x4 TILE_K=32 Transposed-B Shared-Memory Nsight Compute report
- 4x4 TILE_K=32 Double-Buffered Shared-Memory Nsight Compute report
The native baseline and the fastest custom kernel were inspected with NVIDIA Nsight Compute. The table below keeps the most interview-relevant metrics directly in the README:
| Kernel | Registers / Thread | Achieved Occupancy | DRAM Throughput | Static Shared Memory / Block | Notes |
|---|---|---|---|---|---|
gemmnative |
36 | 99.89% | 30.93% | 0 KB | One thread computes one output element |
gemmSmemregisterTile24 |
40 | 99.31% | 41.63% | 5.12 KB | 2 x 4 register tile with shared-memory reuse |
gemmSmemregisterTile44 |
48 | 82.46% | 35.90% | 8.19 KB | 4 x 4 register tile with more per-thread work |
gemmSmemregisterTile44TILEK32 |
48 | 82.45% | 13.13% | 16.38 KB | 4 x 4 register tile with TILE_K=32 |
This comparison shows what the register-tiled kernels actually improve: each thread computes multiple output elements, so the grid is much smaller than the native one-thread-per-element kernel, while shared-memory tiling improves data reuse across the K dimension. The 4 x 4 version is faster than the 2 x 4 version, but the higher register count lowers achieved occupancy from 99.31% to about 82%. The TILE_K=32 version keeps the same register count while using more shared memory per block, reducing synchronization overhead and improving runtime.
The cuBLAS baseline was also profiled with Nsight Compute. For M = N = K = 4096, cuBLAS launched ampere_sgemm_64x64_nn and showed:
| Metric | Value |
|---|---|
| Duration | 23.44 ms |
| Compute throughput | 74.18% |
| Memory throughput | 62.18% |
| L1/TEX cache throughput | 62.58% |
| L2 cache throughput | 30.92% |
| DRAM throughput | 9.96% |
Two additional experimental kernels were profiled to understand whether shared-memory load redistribution or transposing the B tile helps this implementation:
| Kernel | Duration | Compute Throughput | Memory Throughput | Registers / Thread | Result |
|---|---|---|---|---|---|
gemmSmemregisterTile44TILEK32_SMLoadOptimization |
50.13 ms | 86.71% | 86.71% | 48 | Slower than the original TILE_K=32 kernel |
gemmSmemregisterTile44TILEK32_Btrans |
160.35 ms | 16.57% | 96.58% | 65 | Much slower due to higher register pressure and unfavorable shared-memory access |
gemmSmemregisterTile44TILEK32_Block32x8 |
48.51 ms | 96.55% | 96.55% | 64 | Wider block tile, but higher register pressure |
gemmSmemregisterTile44TILEK32_DB |
46.20 ms | 94.08% | 94.08% | 56 | Normal double buffering without cp.async overlap |
The key lesson from these profiles is that optimization changes need to be validated against the actual access pattern. The best custom kernel remains gemmSmemregisterTile44TILEK32: it keeps register usage at 48 registers per thread, avoids the costly transposed shared-memory access pattern, and reaches 42.54 ms on the benchmark above.