Skip to content

LioEinaudi/GEMM-optimization

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 

Repository files navigation

GEMM Optimization

This repository contains a CUDA learning example for optimizing single-precision general matrix multiplication (GEMM):

C = A x B
A: M x K
B: K x N
C: M x N

The current code uses row-major matrix storage and compares each GPU kernel against a reference result generated by the warmup/native kernel path.

Implemented Kernels

  • gemmnative: one CUDA thread computes one output element.
  • gemmUnroll2: one thread computes two output columns.
  • gemmUnroll4: one thread computes four output columns.
  • gemmSmem: tiled GEMM using shared memory.
  • gemmSmemPad: shared-memory GEMM with padding.
  • gemmSmemUnroll2: shared-memory GEMM combined with column-direction unrolling.
  • gemmSmemUnroll4: shared-memory GEMM combined with four-column unrolling.
  • gemmSmemregisterTile22: shared-memory GEMM with a 2 x 2 register tile per thread.
  • gemmSmemregisterTile24: shared-memory GEMM with a 2 x 4 register tile per thread.
  • gemmSmemregisterTile44: shared-memory GEMM with a 4 x 4 register tile per thread.
  • gemmSmemregisterTile44TILEK32: 4 x 4 register tile with a larger K tile of 32.
  • gemmSmemregisterTile44TILEK32_SMLoadOptimization: experimental cooperative shared-memory loading version.
  • gemmSmemregisterTile44TILEK32_Btrans: experimental version that stores the B tile transposed in shared memory.
  • gemmSmemregisterTile44TILEK32_Block32x8: experimental 32 x 8 thread-block version.
  • gemmSmemregisterTile44TILEK32_DB: experimental double-buffered shared-memory version.
  • cublasSgemm: cuBLAS SGEMM baseline for comparison.

Repository Layout

GEMM-optimization/
  src/
    GEMM.cu
    kernels.cuh
    utils.cuh
  profiling/
    GEMMNative_NsightCompute.pdf
    GEMMRegisterTile24_NsightCompute.pdf
    GEMMRegisterTile44_NsightCompute.pdf
    GEMMRegisterTile44TILEK32_NsightCompute.pdf
    GEMMRegisterTile44_SMLoadOptimization_NsightCompute.pdf
    GEMMRegisterTile44_Btrans_NsightCompute.pdf
    GEMMRegisterTile44TILEK32_DB_NsightCompute.pdf
  README.md

GEMM.cu keeps the main benchmark and validation flow. kernels.cuh contains the CUDA kernels and kernel dispatch logic, while utils.cuh contains data initialization, correctness checking, the optional CPU reference, and the cuBLAS launcher.

Build

Compile with nvcc:

nvcc -arch=sm_89 src/GEMM.cu -lcublas -o GEMM

For another GPU architecture, replace sm_89 with the architecture supported by your device.

Run

./GEMM

The first optional argument selects which kernel to run:

./GEMM 0      # run all kernels
./GEMM 7      # run gemmSmemUnroll4 only
./GEMM 9      # run gemmSmemregisterTile24 only
./GEMM 10     # run cuBLAS SGEMM only
./GEMM 11     # run gemmSmemregisterTile44 only
./GEMM 12     # run gemmSmemregisterTile44TILEK32 only
./GEMM 13     # run gemmSmemregisterTile44TILEK32_SMLoadOptimization only
./GEMM 14     # run gemmSmemregisterTile44TILEK32_Btrans only
./GEMM 15     # run gemmSmemregisterTile44TILEK32_Block32x8 only
./GEMM 16     # run gemmSmemregisterTile44TILEK32_DB only

Kernel IDs:

ID Kernel
1 gemmnative
2 gemmUnroll2
3 gemmUnroll4
4 gemmSmem
5 gemmSmemPad
6 gemmSmemUnroll2
7 gemmSmemUnroll4
8 gemmSmemregisterTile22
9 gemmSmemregisterTile24
10 cublasSgemm
11 gemmSmemregisterTile44
12 gemmSmemregisterTile44TILEK32
13 gemmSmemregisterTile44TILEK32_SMLoadOptimization
14 gemmSmemregisterTile44TILEK32_Btrans
15 gemmSmemregisterTile44TILEK32_Block32x8
16 gemmSmemregisterTile44TILEK32_DB

The default matrix size in the source is:

int M = 1 << 12;
int N = 1 << 12;
int K = 1 << 12;

Validation

The program validates each optimized kernel against a reference result generated by the warmup/native kernel path:

  • Input matrices are initialized with random FP32 values.
  • Error tolerance is 1e-2.
  • Each selected kernel is checked against the reference output.

Expected successful output includes:

Match !

for each kernel being validated.

Profiling

The kernels can be profiled with NVIDIA Nsight Compute or Nsight Systems. A typical optimization progression in this file is:

native -> unroll2 -> unroll4 -> shared memory -> shared memory + unroll -> register tiling

One profiling run on an NVIDIA GeForce RTX 4060 Laptop GPU produced the following kernel durations for M = N = K = 4096 and 16 x 16 thread blocks. The GEMM workload performs:

2 * M * N * K = 137,438,953,472 floating-point operations

Approximate throughput is computed as:

TFLOP/s = 2 * M * N * K / duration_seconds / 1e12
Kernel Duration Speedup vs Native Approx. TFLOP/s
gemmnative 248.43 ms 1.00x 0.55
gemmUnroll2 185.53 ms 1.34x 0.74
gemmUnroll4 177.22 ms 1.40x 0.78
gemmSmem 177.99 ms 1.40x 0.77
gemmSmemPad 263.77 ms 0.94x 0.52
gemmSmemUnroll2 158.55 ms 1.57x 0.87
gemmSmemUnroll4 149.35 ms 1.66x 0.92
gemmSmemregisterTile22 92.47 ms 2.69x 1.49
gemmSmemregisterTile24 81.59 ms 3.05x 1.68
gemmSmemregisterTile44 48.10 ms 5.16x 2.86
gemmSmemregisterTile44TILEK32 42.54 ms 5.84x 3.23
gemmSmemregisterTile44TILEK32_SMLoadOptimization 50.13 ms 4.96x 2.74
gemmSmemregisterTile44TILEK32_Btrans 160.35 ms 1.55x 0.86
gemmSmemregisterTile44TILEK32_Block32x8 48.51 ms 5.12x 2.83
gemmSmemregisterTile44TILEK32_DB 46.20 ms 5.38x 2.98

In this version, gemmSmemregisterTile44TILEK32 is the fastest custom kernel in the benchmark above, reaching about 5.84x speedup over the native kernel and about 55.1% of the cuBLAS SGEMM throughput. cuBLAS is called with CUBLAS_DEFAULT_MATH and uses the row-major equivalence C^T = B^T x A^T.

Observations

gemmSmemPad is slower than the naive kernel in this benchmark. This suggests that padding alone does not guarantee better performance: extra shared-memory indexing and synchronization overhead can dominate when memory coalescing and data reuse are not improved enough.

The register-tiled kernels improve performance mainly by increasing per-thread work and reusing loaded shared-memory values across multiple output elements. Moving from a 2 x 4 tile to a 4 x 4 tile further reduces the number of thread blocks and improves arithmetic intensity, but it also increases register usage and lowers achieved occupancy. Increasing the K tile from 16 to 32 reduces K-tile loop iterations and synchronization points, improving the best custom kernel from 48.10 ms to 42.54 ms. The custom kernel is still slower than cuBLAS, which uses highly tuned tiling, scheduling, instruction selection, and architecture-specific kernels.

The cooperative shared-memory loading experiment, gemmSmemregisterTile44TILEK32_SMLoadOptimization, is slower than the original TILE_K=32 version in this benchmark (50.13 ms vs 42.54 ms). This suggests that more even shared-memory loading does not automatically improve performance when the original load pattern is already efficient enough and the extra indexing/loop structure reduces instruction scheduling efficiency.

The transposed-B shared-memory experiment, gemmSmemregisterTile44TILEK32_Btrans, is much slower (160.35 ms) and increases register usage from 48 to 65 registers per thread. In this kernel, B's global-memory loads are already coalesced, so transposing B in shared memory does not fix a major global-memory bottleneck. It also changes the compute loop from SmemB[k][n] to SmemB[n][k], which creates a less favorable shared-memory access pattern for this thread layout and likely introduces more bank conflicts or shared-memory replay overhead.

The 32 x 8 thread-block experiment changes the C tile shape from a balanced 64 x 64 block tile to a wider 32 x 128 block tile. It still validates correctly, but it increases register usage to 64 registers per thread and is slower than the best 16 x 16 block version (48.51 ms vs 42.54 ms). The double-buffered version uses two shared-memory buffers and improves on the 32 x 8 and cooperative-loading experiments, but it is still slower than the original TILE_K=32 kernel (46.20 ms vs 42.54 ms). Since this implementation uses normal global loads rather than cp.async, the double-buffer structure does not yet provide true overlap between global-memory loading and computation.

The experimental 4 x 8 register-tiled kernel is kept commented out in the source. It was disabled because Nsight Compute did not report a stable timing for this version, likely due to excessive register pressure from holding 32 accumulators per thread.

cuBLAS Baseline

cuBLAS SGEMM is included as a production-grade baseline. The custom kernels are not expected to beat cuBLAS in this learning project; the important comparison is how much of cuBLAS performance the best custom kernel reaches and what optimization steps close the gap.

For M = N = K = 4096 on the RTX 4060 Laptop GPU:

Implementation Duration Approx. TFLOP/s
cuBLAS SGEMM 23.44 ms 5.86
Best custom kernel: gemmSmemregisterTile44TILEK32 42.54 ms 3.23
Native custom kernel: gemmnative 248.43 ms 0.55

Nsight Compute Summary

See profiling reports:

The native baseline and the fastest custom kernel were inspected with NVIDIA Nsight Compute. The table below keeps the most interview-relevant metrics directly in the README:

Kernel Registers / Thread Achieved Occupancy DRAM Throughput Static Shared Memory / Block Notes
gemmnative 36 99.89% 30.93% 0 KB One thread computes one output element
gemmSmemregisterTile24 40 99.31% 41.63% 5.12 KB 2 x 4 register tile with shared-memory reuse
gemmSmemregisterTile44 48 82.46% 35.90% 8.19 KB 4 x 4 register tile with more per-thread work
gemmSmemregisterTile44TILEK32 48 82.45% 13.13% 16.38 KB 4 x 4 register tile with TILE_K=32

This comparison shows what the register-tiled kernels actually improve: each thread computes multiple output elements, so the grid is much smaller than the native one-thread-per-element kernel, while shared-memory tiling improves data reuse across the K dimension. The 4 x 4 version is faster than the 2 x 4 version, but the higher register count lowers achieved occupancy from 99.31% to about 82%. The TILE_K=32 version keeps the same register count while using more shared memory per block, reducing synchronization overhead and improving runtime.

The cuBLAS baseline was also profiled with Nsight Compute. For M = N = K = 4096, cuBLAS launched ampere_sgemm_64x64_nn and showed:

Metric Value
Duration 23.44 ms
Compute throughput 74.18%
Memory throughput 62.18%
L1/TEX cache throughput 62.58%
L2 cache throughput 30.92%
DRAM throughput 9.96%

Two additional experimental kernels were profiled to understand whether shared-memory load redistribution or transposing the B tile helps this implementation:

Kernel Duration Compute Throughput Memory Throughput Registers / Thread Result
gemmSmemregisterTile44TILEK32_SMLoadOptimization 50.13 ms 86.71% 86.71% 48 Slower than the original TILE_K=32 kernel
gemmSmemregisterTile44TILEK32_Btrans 160.35 ms 16.57% 96.58% 65 Much slower due to higher register pressure and unfavorable shared-memory access
gemmSmemregisterTile44TILEK32_Block32x8 48.51 ms 96.55% 96.55% 64 Wider block tile, but higher register pressure
gemmSmemregisterTile44TILEK32_DB 46.20 ms 94.08% 94.08% 56 Normal double buffering without cp.async overlap

The key lesson from these profiles is that optimization changes need to be validated against the actual access pattern. The best custom kernel remains gemmSmemregisterTile44TILEK32: it keeps register usage at 48 registers per thread, avoids the costly transposed shared-memory access pattern, and reaches 42.54 ms on the benchmark above.

About

CUDA FP32 GEMM optimization with loop unrolling, shared memory tiling, register tiling, benchmarking, and Nsight profiling.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages