Commit b96125d

chore: GitHub Pages optimization & workflow standardization
1 parent 6a4ef1d commit b96125d

7 files changed

Lines changed: 321 additions & 99 deletions

.github/workflows/pages.yml

Lines changed: 8 additions & 5 deletions
@@ -4,8 +4,10 @@ on:
   push:
     branches: [main]
     paths:
-      - '*.md'
-      - 'docs/**'
+      - 'index.md'
+      - 'README.md'
+      - 'README.zh-CN.md'
+      - 'changelog/**'
       - '_config.yml'
       - '.github/workflows/pages.yml'
   workflow_dispatch:
@@ -23,14 +25,15 @@ jobs:
   build:
     runs-on: ubuntu-latest
     steps:
-      - name: Checkout
+      - name: Checkout (sparse — docs only)
         uses: actions/checkout@v4
         with:
           sparse-checkout-cone-mode: false
           sparse-checkout: |
-            *.md
+            index.md
+            README.md
+            README.zh-CN.md
             _config.yml
-            docs
             changelog
             LICENSE
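
Review note: the effect of narrowing the `paths` filter can be checked with a small model of the matching rules. The sketch below is a simplification, not GitHub's implementation: it assumes only `*` (stays within one path segment) and `**` (may cross `/`) need handling, and `docs/guide.md` and `changelog/note.md` are hypothetical paths used for illustration.

```python
import re

def filter_to_regex(pattern: str) -> str:
    """Translate a path filter into a regex (simplified: only * and ** are handled)."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")       # ** may cross directory separators
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")    # * stays within a single path segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "".join(out)

def triggers(filters, changed_path):
    """True if any filter matches the whole changed path."""
    return any(re.fullmatch(filter_to_regex(f), changed_path) for f in filters)

# Filter lists taken from the two sides of the diff above.
OLD = ["*.md", "docs/**", "_config.yml", ".github/workflows/pages.yml"]
NEW = ["index.md", "README.md", "README.zh-CN.md", "changelog/**",
       "_config.yml", ".github/workflows/pages.yml"]

# Hypothetical changed paths, used only to compare the two filter sets.
for path in ["README.md", "CONTRIBUTING.md", "docs/guide.md", "changelog/note.md"]:
    print(f"{path:20s} old={triggers(OLD, path)} new={triggers(NEW, path)}")
```

Under this model, an edit to a root-level Markdown file that is not part of the site (such as `CONTRIBUTING.md`) no longer starts a Pages build, while changes under `changelog/` now do.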

.gitignore

Lines changed: 8 additions & 0 deletions
@@ -31,6 +31,14 @@ CMakeUserPresets.json
 *.swo
 *~
 
+# Jekyll
+_site/
+.jekyll-cache/
+.jekyll-metadata
+
+# Cache
+.cache/
+
 # OS
 .DS_Store
 Thumbs.db

README.md

Lines changed: 102 additions & 34 deletions
@@ -1,59 +1,127 @@
 # SGEMM Optimization: From Naive to Tensor Core
 
+[![CI](https://github.com/LessUp/sgemm-optimization/actions/workflows/ci.yml/badge.svg)](https://github.com/LessUp/sgemm-optimization/actions/workflows/ci.yml)
+[![Pages](https://github.com/LessUp/sgemm-optimization/actions/workflows/pages.yml/badge.svg)](https://lessup.github.io/sgemm-optimization/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 ![CUDA](https://img.shields.io/badge/CUDA-11.0+-76B900?logo=nvidia&logoColor=white)
 ![C++](https://img.shields.io/badge/C%2B%2B-17-00599C?logo=c%2B%2B&logoColor=white)
 
 English | [简体中文](README.zh-CN.md)
 
-Hand-written, progressively optimized matrix multiplication — the "Hello World" of HPC.
+Hand-written, progressively optimized CUDA matrix multiplication — the "Hello World" of HPC. Five kernel variants demonstrate core GPU optimization techniques, from a naive triple loop to **Tensor Core WMMA reaching 40% of cuBLAS throughput**.
 
 ## Performance (RTX 3060 Laptop, 1024×1024×1024)
 
-| Kernel | GFLOPS | vs cuBLAS |
-|--------|--------|-----------|
-| cuBLAS (ref) | 5727 | 100% |
-| Tensor Core (WMMA) | 2300 | 40.2% |
-| Tiled (32×32) | 753 | 13.1% |
-| Double Buffer | 701 | 12.2% |
-| Bank Conflict Free | 673 | 11.8% |
-| Naive | 604 | 10.6% |
-
-## Optimization Levels
-
-| Level | Description | Key Technique |
-|-------|-------------|---------------|
-| Naive | Basic triple loop | One thread per output element |
-| Tiled | Shared memory tiling | Data reuse, reduced global memory access |
-| Bank Conflict Free | Eliminate bank conflicts | Shared memory padding (+1) |
-| Double Buffer | Pipeline overlap | Compute/memory overlap |
-| Tensor Core | WMMA API | Hardware-accelerated matrix ops (FP16→FP32) |
+| Kernel | GFLOPS | vs cuBLAS | Time | Key Technique |
+|--------|-------:|----------:|-----:|---------------|
+| **cuBLAS** (ref) | 5727 | 100% | 0.375 ms | NVIDIA optimized library |
+| **Tensor Core** (WMMA) | 2300 | 40.2% | 0.934 ms | FP16→FP32 mixed precision |
+| **Tiled** (32×32) | 753 | 13.1% | 2.853 ms | Shared memory blocking |
+| **Double Buffer** | 701 | 12.2% | 3.064 ms | Compute-memory overlap |
+| **Bank Conflict Free** | 673 | 11.8% | 3.190 ms | Shared memory padding (+1) |
+| **Naive** | 604 | 10.6% | 3.553 ms | One thread per output element |
 
-## Build & Run
+*All kernels verified against cuBLAS (allclose: rtol=1e-3, atol=1e-4; Tensor Core: rtol=5e-2)*
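
Review note: the new Time column can be cross-checked against the GFLOPS column. The sketch below assumes the usual GEMM operation count of 2·M·N·K floating-point operations (one multiply and one add per inner-product term); the README does not state which count its benchmark uses, so treat this as a consistency check rather than the project's own formula.

```python
def gflops(m: int, n: int, k: int, time_ms: float) -> float:
    """Throughput in GFLOPS, assuming 2*M*N*K floating-point operations per GEMM."""
    flops = 2 * m * n * k
    seconds = time_ms * 1e-3
    return flops / seconds / 1e9

# Times copied from the added Performance table (1024 x 1024 x 1024).
times_ms = {
    "cuBLAS (ref)": 0.375,
    "Tensor Core (WMMA)": 0.934,
    "Tiled (32x32)": 2.853,
    "Double Buffer": 3.064,
    "Bank Conflict Free": 3.190,
    "Naive": 3.553,
}

for name, t in times_ms.items():
    print(f"{name:20s} {gflops(1024, 1024, 1024, t):8.1f} GFLOPS")
```

Under that assumption the computed values agree with the table to within rounding of the reported times; the Tensor Core row comes out at roughly 2299 against the listed 2300.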
+
+## Optimization Roadmap
 
-```bash
-make GPU_ARCH=sm_86 # Adjust for your GPU
-./build/sgemm_benchmark
+```
+┌─────────┐ ┌──────────┐ ┌──────────────┐ ┌───────────────┐
+│ Naive │────▶│ Tiled │────▶│ Bank-Free │────▶│ Double Buffer │
+│ 604 GF │ │ 753 GF │ │ 673 GF │ │ 701 GF │
+└─────────┘ └──────────┘ └──────────────┘ └───────┬───────┘
+
+
+┌───────────────────┐
+│ Tensor Core │
+│ 2300 GF (WMMA) │
+└───────────────────┘
 ```
 
-## Key Optimization Techniques
+| Stage | What Changes | Why It Helps |
+|-------|-------------|--------------|
+| **Naive → Tiled** | Load tiles into shared memory | Data reuse reduces global memory traffic by TILE_SIZE× |
+| **Tiled → Bank-Free** | Pad shared memory `[32][33]` | Eliminates 32-way bank conflicts on column access |
+| **Bank-Free → Double Buffer** | Two shared-memory buffers | Overlaps next-tile load with current-tile compute |
+| **→ Tensor Core** | WMMA API `mma_sync` | Dedicated matrix units, ~8× peak over CUDA cores |
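
Review note: the `[32][33]` padding claim in the Tiled → Bank-Free row can be illustrated with plain index arithmetic. The sketch below assumes 32 shared-memory banks of 4-byte words and a warp of 32 threads in which each thread reads a different row of the same column; it models the bank mapping only and is not taken from the repository's kernels.

```python
from collections import Counter

NUM_BANKS = 32   # assumption: 32 banks, one 4-byte float per bank slot
WARP_SIZE = 32

def banks_for_column_access(row_stride: int, col: int = 0) -> list:
    """Bank hit by each thread when thread t reads tile[t][col] from a row-major tile."""
    return [(row * row_stride + col) % NUM_BANKS for row in range(WARP_SIZE)]

def conflict_degree(banks: list) -> int:
    """Largest number of threads that land on the same bank."""
    return max(Counter(banks).values())

unpadded = banks_for_column_access(row_stride=32)   # tile declared as [32][32]
padded = banks_for_column_access(row_stride=33)     # tile declared as [32][33]

print("unpadded conflict degree:", conflict_degree(unpadded))
print("padded conflict degree:  ", conflict_degree(padded))
```

With a row stride of 32 every thread maps to the same bank, so the access serializes 32 ways. With a stride of 33 the bank index becomes `(row + col) % 32`, which is distinct for each thread.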
 
-1. **Memory Coalescing** — Warp-aligned memory access for full bandwidth
-2. **Shared Memory Tiling** — O(N³/TILE_SIZE) global memory reduction
-3. **Bank Conflict Elimination** — +1 padding for 32x bandwidth recovery
-4. **Double Buffering** — Overlap next-tile load with current-tile compute
-5. **Tensor Core (WMMA)** — 16×16×16 hardware MMA, ~8x over CUDA Cores
+## Build & Run
+
+```bash
+# Makefile (adjust GPU arch for your hardware)
+make GPU_ARCH=sm_86
+make benchmark
+
+# Or CMake
+cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
+cmake --build build -j$(nproc)
+./build/bin/sgemm_benchmark
+```
 
 ## Project Structure
 
 ```
-├── src/kernels/ # 5 kernel implementations
-├── src/utils/ # CUDA utils, benchmark, verification
-├── src/main.cu # Entry point
-├── tests/test_sgemm.cu # Google Test property tests
-└── Makefile
+sgemm-optimization/
+├── src/
+│ ├── kernels/
+│ │ ├── naive_sgemm.cuh # Naive: basic triple loop
+│ │ ├── tiled_sgemm.cuh # Tiled: shared memory blocking
+│ │ ├── bank_conflict_free_sgemm.cuh # Bank conflict elimination
+│ │ ├── double_buffer_sgemm.cuh # Double buffer pipeline
+│ │ └── tensor_core_sgemm.cuh # Tensor Core (WMMA API)
+│ ├── utils/
+│ │ ├── cuda_utils.cuh # CUDA error checking & utilities
+│ │ ├── benchmark.cuh # Benchmark framework (CUDA Events)
+│ │ └── verify.cuh # Correctness verification (vs cuBLAS)
+│ └── main.cu # Entry point
+├── tests/
+│ └── test_sgemm.cu # Google Test property tests
+├── roofline_data_*.csv # Roofline analysis data
+├── CMakeLists.txt # CMake build (recommended)
+└── Makefile # Make build (quick start)
 ```
 
+## Testing
+
+Property-based tests with Google Test:
+
+| Property | What It Verifies |
+|----------|-----------------|
+| **Numerical correctness** | All kernels match cuBLAS output (allclose) |
+| **Tensor Core tolerance** | Correct under relaxed FP16 tolerance |
+| **Error detection** | Verification system catches injected errors |
+| **Dimension invariance** | All kernels handle arbitrary aligned sizes |
+
+```bash
+make test
+# Or: cmake --build build --target test_sgemm && ctest --test-dir build
+```
+
+## GPU Architecture Reference
+
+| GPU Family | Architecture | Compute Capability | Build Flag |
+|------------|-------------|-------------------|-----------|
+| Tesla V100 | Volta | sm_70 | `GPU_ARCH=sm_70` |
+| RTX 2080 | Turing | sm_75 | `GPU_ARCH=sm_75` |
+| RTX 3090 / A100 | Ampere | sm_80 / sm_86 | `GPU_ARCH=sm_86` |
+| RTX 4090 / L40 | Ada Lovelace | sm_89 | `GPU_ARCH=sm_89` |
+| H100 | Hopper | sm_90 | `GPU_ARCH=sm_90` |
+
+## Engineering Quality
+
+- **Build**: CMake 3.18+ with `target_include_directories`, `target_compile_options` (generator expressions), FetchContent for GTest v1.14.0
+- **Code style**: clang-format enforced via CI
+- **CI**: GitHub Actions — CUDA container build + format check
+- **Testing**: Google Test property-based verification against cuBLAS
+
+## References
+
+- [CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/)
+- [How to Optimize a CUDA Matmul Kernel](https://siboehm.com/articles/22/CUDA-MMM) — Simon Boehm
+- [CUTLASS](https://github.com/NVIDIA/cutlass) — NVIDIA's high-performance GEMM library
+- [cuBLAS Documentation](https://docs.nvidia.com/cuda/cublas/)
+- [Roofline Model](https://crd.lbl.gov/divisions/amcr/computer-science-amcr/par/research/roofline/)
+
 ## License
 
 MIT License

_config.yml

Lines changed: 29 additions & 1 deletion
@@ -1,5 +1,33 @@
 title: SGEMM Optimization
-description: From Naive to Tensor Core — Progressive SGEMM Optimization
+description: >-
+  From Naive to Tensor Core — Progressive CUDA SGEMM Optimization.
+  Hand-written matrix multiplication kernels demonstrating core GPU optimization techniques.
+url: "https://lessup.github.io"
+baseurl: "/sgemm-optimization"
+lang: zh-CN
+
 remote_theme: pages-themes/cayman@v0.2.0
 plugins:
   - jekyll-remote-theme
+
+exclude:
+  - build/
+  - cmake-build-*/
+  - src/
+  - tests/
+  - .github/
+  - .kiro/
+  - .vscode/
+  - .editorconfig
+  - .gitignore
+  - .clang-format
+  - CMakeLists.txt
+  - CMakePresets.json
+  - Makefile
+  - CONTRIBUTING.md
+  - LICENSE
+  - "*.csv"
+  - "*.cu"
+  - "*.cuh"
+  - "*.o"
+  - "*.so"
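
Review note: the new `url` and `baseurl` keys together determine the public address of each page. The sketch below is a rough model of that composition (site URL, then base path, then page path). The helper name `site_url` is hypothetical and this is not Jekyll's own code.

```python
def site_url(url: str, baseurl: str, path: str = "/") -> str:
    """Join site URL, base path, and page path with exactly one slash between parts."""
    base = url.rstrip("/")
    if baseurl.strip("/"):
        base += "/" + baseurl.strip("/")
    return base + "/" + path.lstrip("/")

# Values taken from the _config.yml diff above.
home = site_url("https://lessup.github.io", "/sgemm-optimization", "/")
print(home)
```

For the values in this commit, the home page resolves to the same address that the Pages badge in the README links to.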
Lines changed: 30 additions & 0 deletions
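@@ -0,0 +1,30 @@
+# GitHub Pages Optimization (2026-03-10)
+
+## Changes
+
+### _config.yml
+- Add SEO metadata (url, baseurl, lang)
+- Add an exclude list that keeps source code, build artifacts, CSV data, and other non-documentation files out of the site, speeding up the Jekyll build
+
+### index.md (GitHub Pages home page)
+- Rewrite as a professional Chinese landing page, consistent with the project's overall style
+- Add a CI badge
+- Add a time column to the performance table
+- Use an ASCII box diagram for the optimization roadmap
+- Add a Roofline model analysis table, a GPU architecture reference table, and a tech stack table
+- Add project structure, testing and verification, and quick start (including CMake build) sections
+
+### pages.yml
+- Narrow the path trigger filter from the broad `*.md` to specific file names (index.md, README.md, README.zh-CN.md)
+- Narrow sparse-checkout to match, checking out only the files the Jekyll build needs
+
+### README.md
+- Add CI / Pages badges
+- Add a time column to the performance table
+- Add an ASCII optimization roadmap diagram
+- Add CMake build instructions
+- Expand the project structure (with file descriptions)
+- Add Testing, GPU Architecture Reference, and Engineering Quality sections
+
+### .gitignore
+- Add `_site/`, `.jekyll-cache/`, `.jekyll-metadata`, `.cache/`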
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Workflow Deep Standardization
+
+Date: 2026-03-10
+
+## Changes
+
+- CI workflow: unify the `permissions: contents: read` and `concurrency` configuration
+- Pages workflow: add the `actions/configure-pages@v5` step
+- Pages workflow: add a `paths` trigger filter to reduce unnecessary builds
+
+## Background
+
+Second round of repository-wide GitHub Actions deep standardization: unify naming, permissions, concurrency, path filtering, and caching strategy.
