|
1 | 1 | # SGEMM Optimization: From Naive to Tensor Core |
2 | 2 |
|
| 3 | +[CI](https://github.com/LessUp/sgemm-optimization/actions/workflows/ci.yml) |
| 4 | +[Docs](https://lessup.github.io/sgemm-optimization/) |
3 | 5 | [License](LICENSE) |
6 | 8 |
|
7 | 9 | English | [简体中文](README.zh-CN.md) |
8 | 10 |
|
9 | | -Hand-written, progressively optimized matrix multiplication — the "Hello World" of HPC. |
| 11 | +Hand-written, progressively optimized CUDA matrix multiplication — the "Hello World" of HPC. Five kernel variants demonstrate core GPU optimization techniques, from a naive triple loop to **Tensor Core WMMA reaching 40% of cuBLAS throughput**. |
10 | 12 |
|
11 | 13 | ## Performance (RTX 3060 Laptop, 1024×1024×1024) |
12 | 14 |
|
13 | | -| Kernel | GFLOPS | vs cuBLAS | |
14 | | -|--------|--------|-----------| |
15 | | -| cuBLAS (ref) | 5727 | 100% | |
16 | | -| Tensor Core (WMMA) | 2300 | 40.2% | |
17 | | -| Tiled (32×32) | 753 | 13.1% | |
18 | | -| Double Buffer | 701 | 12.2% | |
19 | | -| Bank Conflict Free | 673 | 11.8% | |
20 | | -| Naive | 604 | 10.6% | |
21 | | - |
22 | | -## Optimization Levels |
23 | | - |
24 | | -| Level | Description | Key Technique | |
25 | | -|-------|-------------|---------------| |
26 | | -| Naive | Basic triple loop | One thread per output element | |
27 | | -| Tiled | Shared memory tiling | Data reuse, reduced global memory access | |
28 | | -| Bank Conflict Free | Eliminate bank conflicts | Shared memory padding (+1) | |
29 | | -| Double Buffer | Pipeline overlap | Compute/memory overlap | |
30 | | -| Tensor Core | WMMA API | Hardware-accelerated matrix ops (FP16→FP32) | |
| 15 | +| Kernel | GFLOPS | vs cuBLAS | Time | Key Technique | |
| 16 | +|--------|-------:|----------:|-----:|---------------| |
| 17 | +| **cuBLAS** (ref) | 5727 | 100% | 0.375 ms | NVIDIA optimized library | |
| 18 | +| **Tensor Core** (WMMA) | 2300 | 40.2% | 0.934 ms | FP16→FP32 mixed precision | |
| 19 | +| **Tiled** (32×32) | 753 | 13.1% | 2.853 ms | Shared memory blocking | |
| 20 | +| **Double Buffer** | 701 | 12.2% | 3.064 ms | Compute-memory overlap | |
| 21 | +| **Bank Conflict Free** | 673 | 11.8% | 3.190 ms | Shared memory padding (+1) | |
| 22 | +| **Naive** | 604 | 10.6% | 3.553 ms | One thread per output element | |
31 | 23 |
|
32 | | -## Build & Run |
| 24 | +*All kernels verified against cuBLAS (allclose: rtol=1e-3, atol=1e-4; Tensor Core: rtol=5e-2)* |
| 25 | + |
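
The GFLOPS and Time columns above are consistent with the usual SGEMM operation count of 2·M·N·K (one multiply and one add per inner-product term). A minimal sketch of that conversion follows; the function names are hypothetical, and whether `benchmark.cuh` computes its figures exactly this way is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations for C = A * B with A (M x K) and B (K x N):
// one multiply and one add per inner-product term, so 2 * M * N * K.
// Hypothetical helper, not taken from the repository.
inline double sgemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    assert(m > 0 && n > 0 && k > 0);
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Convert a measured kernel time in milliseconds to GFLOPS.
inline double sgemm_gflops(std::int64_t m, std::int64_t n, std::int64_t k,
                           double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    const double seconds = elapsed_ms * 1e-3;
    return sgemm_flops(m, n, k) / seconds / 1e9;
}
```

For example, `sgemm_gflops(1024, 1024, 1024, 0.375)` comes out at roughly 5727, which matches the cuBLAS row.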
| 26 | +## Optimization Roadmap |
33 | 27 |
|
34 | | -```bash |
35 | | -make GPU_ARCH=sm_86 # Adjust for your GPU |
36 | | -./build/sgemm_benchmark |
| 28 | +``` |
| 29 | + ┌─────────┐     ┌──────────┐     ┌──────────────┐     ┌───────────────┐
| 30 | + │  Naive  │────▶│  Tiled   │────▶│  Bank-Free   │────▶│ Double Buffer │
| 31 | + │ 604 GF  │     │  753 GF  │     │    673 GF    │     │    701 GF     │
| 32 | + └─────────┘     └──────────┘     └──────────────┘     └───────┬───────┘
| 33 | +                                                               │
| 34 | +                                                               ▼
| 35 | +                                                     ┌───────────────────┐
| 36 | +                                                     │    Tensor Core    │
| 37 | +                                                     │  2300 GF (WMMA)   │
| 38 | +                                                     └───────────────────┘
37 | 39 | ``` |
38 | 40 |
|
39 | | -## Key Optimization Techniques |
| 41 | +| Stage | What Changes | Why It Helps | |
| 42 | +|-------|-------------|--------------| |
| 43 | +| **Naive → Tiled** | Load tiles into shared memory | Data reuse reduces global memory traffic by TILE_SIZE× | |
| 44 | +| **Tiled → Bank-Free** | Pad shared memory `[32][33]` | Eliminates 32-way bank conflicts on column access | |
| 45 | +| **Bank-Free → Double Buffer** | Two shared-memory buffers | Overlaps next-tile load with current-tile compute | |
| 46 | +| **→ Tensor Core** | WMMA API `mma_sync` | Dedicated matrix units, ~8× peak over CUDA cores | |
40 | 47 |
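
The first two rows of the table can be checked with simple host-side arithmetic. The sketch below is a simplified model rather than the repository's code: it counts global-memory element loads while ignoring caches, and it assumes 32 shared-memory banks of 4-byte words with a warp of 32 threads reading one column of a row-major `float` tile. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Naive kernel model: every output element reads a full row of A and a full
// column of B from global memory (caches ignored).
inline std::int64_t naive_global_loads(std::int64_t m, std::int64_t n,
                                       std::int64_t k) {
    return 2 * m * n * k;
}

// Tiled kernel model: each TILE x TILE thread block loads one tile of A and
// one tile of B per step along K, and reuses them from shared memory.
inline std::int64_t tiled_global_loads(std::int64_t m, std::int64_t n,
                                       std::int64_t k, std::int64_t tile) {
    assert(tile > 0 && m % tile == 0 && n % tile == 0 && k % tile == 0);
    const std::int64_t blocks = (m / tile) * (n / tile);
    const std::int64_t k_steps = k / tile;
    return blocks * k_steps * 2 * tile * tile;
}

// Worst-case number of threads in a warp that land on the same bank when
// thread t reads tile[t][col], with `row_stride` floats per tile row.
inline int column_access_conflict_degree(int row_stride, int col) {
    constexpr int kBanks = 32;
    constexpr int kWarpSize = 32;
    assert(row_stride > 0 && col >= 0 && col < row_stride);
    int hits[kBanks] = {0};
    for (int t = 0; t < kWarpSize; ++t) {
        const int word = t * row_stride + col;  // 4-byte word index
        ++hits[word % kBanks];
    }
    int worst = 0;
    for (int b = 0; b < kBanks; ++b) {
        if (hits[b] > worst) worst = hits[b];
    }
    return worst;
}
```

Under this model, a 1024×1024×1024 problem with 32×32 tiles needs 32× fewer global loads than the naive kernel, and the column-access conflict degree drops from 32 with a row stride of 32 to 1 with a row stride of 33.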
|
41 | | -1. **Memory Coalescing** — Warp-aligned memory access for full bandwidth |
42 | | -2. **Shared Memory Tiling** — O(N³/TILE_SIZE) global memory reduction |
43 | | -3. **Bank Conflict Elimination** — +1 padding for 32x bandwidth recovery |
44 | | -4. **Double Buffering** — Overlap next-tile load with current-tile compute |
45 | | -5. **Tensor Core (WMMA)** — 16×16×16 hardware MMA, ~8x over CUDA Cores |
| 48 | +## Build & Run |
| 49 | + |
| 50 | +```bash |
| 51 | +# Makefile (adjust GPU arch for your hardware) |
| 52 | +make GPU_ARCH=sm_86 |
| 53 | +make benchmark |
| 54 | + |
| 55 | +# Or CMake |
| 56 | +cmake -S . -B build -DCMAKE_BUILD_TYPE=Release |
| 57 | +cmake --build build -j$(nproc) |
| 58 | +./build/bin/sgemm_benchmark |
| 59 | +``` |
46 | 60 |
|
47 | 61 | ## Project Structure |
48 | 62 |
|
49 | 63 | ``` |
50 | | -├── src/kernels/ # 5 kernel implementations |
51 | | -├── src/utils/ # CUDA utils, benchmark, verification |
52 | | -├── src/main.cu # Entry point |
53 | | -├── tests/test_sgemm.cu # Google Test property tests |
54 | | -└── Makefile |
| 64 | +sgemm-optimization/ |
| 65 | +├── src/ |
| 66 | +│ ├── kernels/ |
| 67 | +│ │ ├── naive_sgemm.cuh # Naive: basic triple loop |
| 68 | +│ │ ├── tiled_sgemm.cuh # Tiled: shared memory blocking |
| 69 | +│ │ ├── bank_conflict_free_sgemm.cuh # Bank conflict elimination |
| 70 | +│ │ ├── double_buffer_sgemm.cuh # Double buffer pipeline |
| 71 | +│ │ └── tensor_core_sgemm.cuh # Tensor Core (WMMA API) |
| 72 | +│ ├── utils/ |
| 73 | +│ │ ├── cuda_utils.cuh # CUDA error checking & utilities |
| 74 | +│ │ ├── benchmark.cuh # Benchmark framework (CUDA Events) |
| 75 | +│ │ └── verify.cuh # Correctness verification (vs cuBLAS) |
| 76 | +│ └── main.cu # Entry point |
| 77 | +├── tests/ |
| 78 | +│ └── test_sgemm.cu # Google Test property tests |
| 79 | +├── roofline_data_*.csv # Roofline analysis data |
| 80 | +├── CMakeLists.txt # CMake build (recommended) |
| 81 | +└── Makefile # Make build (quick start) |
55 | 82 | ``` |
56 | 83 |
|
| 84 | +## Testing |
| 85 | + |
| 86 | +Property-based tests with Google Test: |
| 87 | + |
| 88 | +| Property | What It Verifies | |
| 89 | +|----------|-----------------| |
| 90 | +| **Numerical correctness** | All kernels match cuBLAS output (allclose) | |
| 91 | +| **Tensor Core tolerance** | Correct under relaxed FP16 tolerance | |
| 92 | +| **Error detection** | Verification system catches injected errors | |
| 93 | +| **Dimension invariance** | All kernels handle arbitrary aligned sizes | |
| 94 | + |
| 95 | +```bash |
| 96 | +make test |
| 97 | +# Or (ctest --test-dir needs CMake 3.20+): cmake --build build --target test_sgemm && ctest --test-dir build
| 98 | +``` |
| 99 | + |
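
As an illustration of the "allclose" check named in the table, here is a minimal host-side sketch. It assumes the NumPy-style criterion `|actual - expected| <= atol + rtol * |expected|`; the exact formula used in `verify.cuh` is not shown in this README, so treat both the function and its signature as hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Element-wise tolerance check in the NumPy allclose style (an assumption,
// not necessarily what verify.cuh does). `expected` plays the role of the
// cuBLAS reference output.
inline bool allclose(const std::vector<float>& actual,
                     const std::vector<float>& expected,
                     float rtol, float atol) {
    assert(actual.size() == expected.size());
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const float diff = std::fabs(actual[i] - expected[i]);
        const float bound = atol + rtol * std::fabs(expected[i]);
        // Written as !(diff <= bound) so that NaN values fail the check.
        if (!(diff <= bound)) {
            return false;
        }
    }
    return true;
}
```

With the tolerances quoted earlier, an FP32 kernel would be checked with `rtol=1e-3f, atol=1e-4f`, while the Tensor Core kernel would use the relaxed `rtol=5e-2f` to absorb FP16 input rounding.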
| 100 | +## GPU Architecture Reference |
| 101 | + |
| 102 | +| GPU Family | Architecture | Compute Capability | Build Flag | |
| 103 | +|------------|-------------|-------------------|-----------| |
| 104 | +| Tesla V100 | Volta | sm_70 | `GPU_ARCH=sm_70` | |
| 105 | +| RTX 2080 | Turing | sm_75 | `GPU_ARCH=sm_75` | |
| 106 | +| A100 / RTX 3090 | Ampere | sm_80 / sm_86 | `GPU_ARCH=sm_80` (A100), `GPU_ARCH=sm_86` (RTX 3090) |
| 107 | +| RTX 4090 / L40 | Ada Lovelace | sm_89 | `GPU_ARCH=sm_89` | |
| 108 | +| H100 | Hopper | sm_90 | `GPU_ARCH=sm_90` | |
| 109 | + |
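
If the GPU is not listed, the build flag can be derived from its compute capability. The sketch below assumes the major and minor numbers have already been obtained, for example from the `major` and `minor` fields that `cudaGetDeviceProperties` fills in `cudaDeviceProp`. The helper names are hypothetical and not part of this repository.

```cpp
#include <cassert>
#include <string>

// Build the GPU_ARCH value (e.g. "sm_86") from a compute capability such as
// 8.6. Hypothetical helper for illustration only.
inline std::string gpu_arch_flag(int major, int minor) {
    assert(major >= 1 && minor >= 0 && minor <= 9);
    return "sm_" + std::to_string(major) + std::to_string(minor);
}

// The WMMA API used by the Tensor Core kernel needs compute capability 7.0
// (Volta) or newer.
inline bool supports_wmma(int major) {
    return major >= 7;
}
```

For an RTX 3060 Laptop (compute capability 8.6) this yields `sm_86`, the value used in the build commands above.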
| 110 | +## Engineering Quality |
| 111 | + |
| 112 | +- **Build**: CMake 3.18+ with `target_include_directories`, `target_compile_options` (generator expressions), FetchContent for GTest v1.14.0 |
| 113 | +- **Code style**: clang-format enforced via CI |
| 114 | +- **CI**: GitHub Actions — CUDA container build + format check |
| 115 | +- **Testing**: Google Test property-based verification against cuBLAS |
| 116 | + |
| 117 | +## References |
| 118 | + |
| 119 | +- [CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/) |
| 120 | +- [How to Optimize a CUDA Matmul Kernel](https://siboehm.com/articles/22/CUDA-MMM) — Simon Boehm |
| 121 | +- [CUTLASS](https://github.com/NVIDIA/cutlass) — NVIDIA's high-performance GEMM library |
| 122 | +- [cuBLAS Documentation](https://docs.nvidia.com/cuda/cublas/) |
| 123 | +- [Roofline Model](https://crd.lbl.gov/divisions/amcr/computer-science-amcr/par/research/roofline/) |
| 124 | + |
57 | 125 | ## License |
58 | 126 |
|
59 | 127 | MIT License |