chore: GitHub Pages optimization & workflow standardization

LessUp · LessUp · commit b96125d0c6d0 · 2026-03-10T01:37:44.000+08:00
diff --git a/.github/workflows/pages.yml b/.github/workflows/pages.yml
@@ -4,8 +4,10 @@ on:
   push:
     branches: [main]
     paths:
-      - '*.md'
-      - 'docs/**'
+      - 'index.md'
+      - 'README.md'
+      - 'README.zh-CN.md'
+      - 'changelog/**'
       - '_config.yml'
       - '.github/workflows/pages.yml'
   workflow_dispatch:
@@ -23,14 +25,15 @@ jobs:
   build:
     runs-on: ubuntu-latest
     steps:
-      - name: Checkout
+      - name: Checkout (sparse — docs only)
         uses: actions/checkout@v4
         with:
           sparse-checkout-cone-mode: false
           sparse-checkout: |
-            *.md
+            index.md
+            README.md
+            README.zh-CN.md
             _config.yml
-            docs
             changelog
             LICENSE
 
diff --git a/.gitignore b/.gitignore
@@ -31,6 +31,14 @@ CMakeUserPresets.json
 *.swo
 *~
 
+# Jekyll
+_site/
+.jekyll-cache/
+.jekyll-metadata
+
+# Cache
+.cache/
+
 # OS
 .DS_Store
 Thumbs.db
diff --git a/README.md b/README.md
@@ -1,59 +1,127 @@
 # SGEMM Optimization: From Naive to Tensor Core
 
+[![CI](https://github.com/LessUp/sgemm-optimization/actions/workflows/ci.yml/badge.svg)](https://github.com/LessUp/sgemm-optimization/actions/workflows/ci.yml)
+[![Pages](https://github.com/LessUp/sgemm-optimization/actions/workflows/pages.yml/badge.svg)](https://lessup.github.io/sgemm-optimization/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
 ![CUDA](https://img.shields.io/badge/CUDA-11.0+-76B900?logo=nvidia&logoColor=white)
 ![C++](https://img.shields.io/badge/C%2B%2B-17-00599C?logo=c%2B%2B&logoColor=white)
 
 English | [简体中文](README.zh-CN.md)
 
-Hand-written, progressively optimized matrix multiplication — the "Hello World" of HPC.
+Hand-written, progressively optimized CUDA matrix multiplication — the "Hello World" of HPC. Five kernel variants demonstrate core GPU optimization techniques, from a naive triple loop to **Tensor Core WMMA reaching 40% of cuBLAS throughput**.
 
 ## Performance (RTX 3060 Laptop, 1024×1024×1024)
 
-| Kernel | GFLOPS | vs cuBLAS |
-|--------|--------|-----------|
-| cuBLAS (ref) | 5727 | 100% |
-| Tensor Core (WMMA) | 2300 | 40.2% |
-| Tiled (32×32) | 753 | 13.1% |
-| Double Buffer | 701 | 12.2% |
-| Bank Conflict Free | 673 | 11.8% |
-| Naive | 604 | 10.6% |
-
-## Optimization Levels
-
-| Level | Description | Key Technique |
-|-------|-------------|---------------|
-| Naive | Basic triple loop | One thread per output element |
-| Tiled | Shared memory tiling | Data reuse, reduced global memory access |
-| Bank Conflict Free | Eliminate bank conflicts | Shared memory padding (+1) |
-| Double Buffer | Pipeline overlap | Compute/memory overlap |
-| Tensor Core | WMMA API | Hardware-accelerated matrix ops (FP16→FP32) |
+| Kernel | GFLOPS | vs cuBLAS | Time | Key Technique |
+|--------|-------:|----------:|-----:|---------------|
+| **cuBLAS** (ref) | 5727 | 100% | 0.375 ms | NVIDIA optimized library |
+| **Tensor Core** (WMMA) | 2300 | 40.2% | 0.934 ms | FP16→FP32 mixed precision |
+| **Tiled** (32×32) | 753 | 13.1% | 2.853 ms | Shared memory blocking |
+| **Double Buffer** | 701 | 12.2% | 3.064 ms | Compute-memory overlap |
+| **Bank Conflict Free** | 673 | 11.8% | 3.190 ms | Shared memory padding (+1) |
+| **Naive** | 604 | 10.6% | 3.553 ms | One thread per output element |
 
-## Build & Run
+*All kernels verified against cuBLAS (allclose: rtol=1e-3, atol=1e-4; Tensor Core: rtol=5e-2)*
+
+## Optimization Roadmap
 
-```bash
-make GPU_ARCH=sm_86   # Adjust for your GPU
-./build/sgemm_benchmark
+```
+  ┌─────────┐     ┌──────────┐     ┌──────────────┐     ┌───────────────┐
+  │  Naive  │────▶│  Tiled   │────▶│  Bank-Free   │────▶│ Double Buffer │
+  │ 604 GF  │     │ 753 GF   │     │   673 GF     │     │   701 GF      │
+  └─────────┘     └──────────┘     └──────────────┘     └───────┬───────┘
+                                                                │
+                                                                ▼
+                                                    ┌───────────────────┐
+                                                    │   Tensor Core     │
+                                                    │   2300 GF (WMMA)  │
+                                                    └───────────────────┘
 ```
 
-## Key Optimization Techniques
+| Stage | What Changes | Why It Helps |
+|-------|-------------|--------------|
+| **Naive → Tiled** | Load tiles into shared memory | Data reuse reduces global memory traffic by TILE_SIZE× |
+| **Tiled → Bank-Free** | Pad shared memory `[32][33]` | Eliminates 32-way bank conflicts on column access |
+| **Bank-Free → Double Buffer** | Two shared-memory buffers | Overlaps next-tile load with current-tile compute |
+| **→ Tensor Core** | WMMA API `mma_sync` | Dedicated matrix units, ~8× peak over CUDA cores |
 
-1. **Memory Coalescing** — Warp-aligned memory access for full bandwidth
-2. **Shared Memory Tiling** — O(N³/TILE_SIZE) global memory reduction
-3. **Bank Conflict Elimination** — +1 padding for 32x bandwidth recovery
-4. **Double Buffering** — Overlap next-tile load with current-tile compute
-5. **Tensor Core (WMMA)** — 16×16×16 hardware MMA, ~8x over CUDA Cores
+## Build & Run
+
+```bash
+# Makefile (adjust GPU arch for your hardware)
+make GPU_ARCH=sm_86
+make benchmark
+
+# Or CMake
+cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
+cmake --build build -j$(nproc)
+./build/bin/sgemm_benchmark
+```
 
 ## Project Structure
 
 ```
-├── src/kernels/           # 5 kernel implementations
-├── src/utils/             # CUDA utils, benchmark, verification
-├── src/main.cu            # Entry point
-├── tests/test_sgemm.cu    # Google Test property tests
-└── Makefile
+sgemm-optimization/
+├── src/
+│   ├── kernels/
+│   │   ├── naive_sgemm.cuh              # Naive: basic triple loop
+│   │   ├── tiled_sgemm.cuh              # Tiled: shared memory blocking
+│   │   ├── bank_conflict_free_sgemm.cuh # Bank conflict elimination
+│   │   ├── double_buffer_sgemm.cuh      # Double buffer pipeline
+│   │   └── tensor_core_sgemm.cuh        # Tensor Core (WMMA API)
+│   ├── utils/
+│   │   ├── cuda_utils.cuh               # CUDA error checking & utilities
+│   │   ├── benchmark.cuh                # Benchmark framework (CUDA Events)
+│   │   └── verify.cuh                   # Correctness verification (vs cuBLAS)
+│   └── main.cu                          # Entry point
+├── tests/
+│   └── test_sgemm.cu                    # Google Test property tests
+├── roofline_data_*.csv                  # Roofline analysis data
+├── CMakeLists.txt                       # CMake build (recommended)
+└── Makefile                             # Make build (quick start)
 ```
 
+## Testing
+
+Property-based tests with Google Test:
+
+| Property | What It Verifies |
+|----------|-----------------|
+| **Numerical correctness** | All kernels match cuBLAS output (allclose) |
+| **Tensor Core tolerance** | Correct under relaxed FP16 tolerance |
+| **Error detection** | Verification system catches injected errors |
+| **Dimension invariance** | All kernels handle arbitrary aligned sizes |
+
+```bash
+make test
+# Or: cmake --build build --target test_sgemm && (cd build && ctest)
+```
+
+## GPU Architecture Reference
+
+| GPU Family | Architecture | Compute Capability | Build Flag |
+|------------|-------------|-------------------|-----------|
+| Tesla V100 | Volta | sm_70 | `GPU_ARCH=sm_70` |
+| RTX 2080 | Turing | sm_75 | `GPU_ARCH=sm_75` |
+| A100 / RTX 3090 | Ampere | sm_80 / sm_86 | `GPU_ARCH=sm_80` / `GPU_ARCH=sm_86` |
+| RTX 4090 / L40 | Ada Lovelace | sm_89 | `GPU_ARCH=sm_89` |
+| H100 | Hopper | sm_90 | `GPU_ARCH=sm_90` |
+
+## Engineering Quality
+
+- **Build**: CMake 3.18+ with `target_include_directories`, `target_compile_options` (generator expressions), FetchContent for GTest v1.14.0
+- **Code style**: clang-format enforced via CI
+- **CI**: GitHub Actions — CUDA container build + format check
+- **Testing**: Google Test property-based verification against cuBLAS
+
+## References
+
+- [CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/)
+- [How to Optimize a CUDA Matmul Kernel](https://siboehm.com/articles/22/CUDA-MMM) — Simon Boehm
+- [CUTLASS](https://github.com/NVIDIA/cutlass) — NVIDIA's high-performance GEMM library
+- [cuBLAS Documentation](https://docs.nvidia.com/cuda/cublas/)
+- [Roofline Model](https://crd.lbl.gov/divisions/amcr/computer-science-amcr/par/research/roofline/)
+
 ## License
 
 MIT License
diff --git a/_config.yml b/_config.yml
@@ -1,5 +1,33 @@
 title: SGEMM Optimization
-description: From Naive to Tensor Core — Progressive SGEMM Optimization
+description: >-
+  From Naive to Tensor Core — Progressive CUDA SGEMM Optimization.
+  Hand-written matrix multiplication kernels demonstrating core GPU optimization techniques.
+url: "https://lessup.github.io"
+baseurl: "/sgemm-optimization"
+lang: zh-CN
+
 remote_theme: pages-themes/cayman@v0.2.0
 plugins:
   - jekyll-remote-theme
+
+exclude:
+  - build/
+  - cmake-build-*/
+  - src/
+  - tests/
+  - .github/
+  - .kiro/
+  - .vscode/
+  - .editorconfig
+  - .gitignore
+  - .clang-format
+  - CMakeLists.txt
+  - CMakePresets.json
+  - Makefile
+  - CONTRIBUTING.md
+  - LICENSE
+  - "*.csv"
+  - "*.cu"
+  - "*.cuh"
+  - "*.o"
+  - "*.so"
diff --git a/changelog/2026-03-10_pages-optimization.md b/changelog/2026-03-10_pages-optimization.md
@@ -0,0 +1,30 @@
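+# GitHub Pages Optimization (2026-03-10)
+
+## Changes
+
+### _config.yml
+- Add SEO metadata (url, baseurl, lang)
+- Add an exclude list covering source code, build artifacts, CSV data, and other non-documentation files to speed up the Jekyll build
+
+### index.md (GitHub Pages home page)
+- Rewrite as a professional Chinese landing page consistent with the project's overall style
+- Add CI badge
+- Add a time column to the performance table
+- Render the optimization roadmap as an ASCII box diagram
+- Add a Roofline model analysis table, a GPU architecture reference table, and a tech stack table
+- Add project structure, testing, and quick start (including CMake build) sections
+
+### pages.yml
+- Narrow the path trigger filter from the broad `*.md` to specific file names (index.md, README.md, README.zh-CN.md)
+- Narrow sparse-checkout to match, checking out only the files the Jekyll build needs
+
+### README.md
+- Add CI / Pages badges
+- Add a time column to the performance table
+- Add an ASCII optimization roadmap diagram
+- Add CMake build instructions
+- Expand the project structure (with file descriptions)
+- Add Testing, GPU Architecture Reference, and Engineering Quality sections
+
+### .gitignore
+- Add `_site/`, `.jekyll-cache/`, `.jekyll-metadata`, `.cache/`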
diff --git a/changelog/2026-03-10_workflow-deep-standardization.md b/changelog/2026-03-10_workflow-deep-standardization.md
@@ -0,0 +1,13 @@
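+# Workflow Deep Standardization
+
+Date: 2026-03-10
+
+## Changes
+
+- CI workflow: unify the `permissions: contents: read` and `concurrency` configuration
+- Pages workflow: add the `actions/configure-pages@v5` step
+- Pages workflow: add a `paths` trigger filter to reduce unnecessary builds
+
+## Background
+
+Second round of repository-wide GitHub Actions deep standardization: unified naming, permissions, concurrency, path filtering, and caching strategy.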
diff --git a/index.md b/index.md