English | [简体中文](README.zh-CN.md)

A CUDA optimization lab for AI kernels, organized as a set of focused kernel modules, tests, examples, and lightweight Python bindings.

## What is in the repository

- `src/common/`: shared CUDA utilities such as tensor wrappers, timers, launch helpers, and reduction primitives
- `src/01_elementwise/` to `src/07_cuda13_features/`: numbered kernel modules covering elementwise ops, reductions, GEMM, convolution, attention, quantization, and newer CUDA features
- `tests/`: GoogleTest + RapidCheck coverage across kernel modules
- `examples/`: currently shipped CUDA and Python examples
- `python/`: nanobind bindings plus benchmark scripts
- `docs/`: optimization notes and Python binding docs

## Build the C++/CUDA project

```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
ctest --test-dir build --output-on-failure
```

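If CMake's default architecture detection does not match your GPU, the standard `CMAKE_CUDA_ARCHITECTURES` cache variable can pin the target SM at configure time. The value `90` below is only an illustration (Hopper); substitute the compute capability of your own hardware.

```shell
# Sketch: pin the CUDA architecture explicitly (the value 90 is an example).
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=90
cmake --build build -j$(nproc)
```
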
## Build the Python bindings

The current Python extension is named `hpc_ai_opt` and exposes low-level submodules such as `elementwise`, `reduction`, and `gemm`.

```bash
cmake -S . -B build -DBUILD_PYTHON_BINDINGS=ON
cmake --build build
export PYTHONPATH="$(pwd)/build/python:${PYTHONPATH}"
python -c "import hpc_ai_opt; print(hpc_ai_opt.__doc__)"
python examples/python/basic_usage.py
```

## Build the shipped examples

```bash
cmake -S . -B build -DBUILD_EXAMPLES=ON
cmake --build build --target relu_example gemm_benchmark
```

## Current Python API shape

```python
import torch
import hpc_ai_opt

x = torch.randn(1024, 1024, device="cuda", dtype=torch.float32)
y = torch.empty_like(x)

hpc_ai_opt.elementwise.relu(x, y)
```
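One way to spot-check a kernel such as `elementwise.relu` on small inputs is against a trivial host-side reference. `relu_ref` below is a hypothetical helper for illustration, not part of the bindings; it runs without a GPU.

```python
# Hypothetical helper (not part of hpc_ai_opt): a tiny host-side ReLU
# reference for spot-checking kernel outputs on small inputs.
def relu_ref(values):
    """ReLU on a flat Python list: max(0, v) elementwise."""
    return [v if v > 0.0 else 0.0 for v in values]

print(relu_ref([-1.5, 0.0, 2.0]))  # [0.0, 0.0, 2.0]
```

To compare against the CUDA path, one would copy the kernel's output tensor back to the host and check it elementwise against `relu_ref` within a small tolerance.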

The current bindings are intentionally thin:
- CUDA tensors are passed in directly
- output tensors are allocated by the caller
- some kernels require explicit shape arguments

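The caller-allocates-output convention can be hidden behind a small adapter if a `y = f(x)` style is preferred. `make_out_wrapper` below is a hypothetical sketch, not shipped with the bindings; the toy kernel and list allocator stand in so the example runs without a GPU.

```python
# Hypothetical adapter (not shipped with hpc_ai_opt): turns an
# out-parameter kernel `kernel(x, out)` into a `y = f(x)` call.
def make_out_wrapper(kernel, alloc):
    def wrapped(x):
        out = alloc(x)    # caller-side allocation happens here, once
        kernel(x, out)    # kernel writes into the preallocated buffer
        return out
    return wrapped

# Toy stand-ins so the sketch runs anywhere: a "kernel" that writes
# ReLU results into `out`, and an allocator for flat Python lists.
def toy_relu_kernel(x, out):
    for i, v in enumerate(x):
        out[i] = v if v > 0.0 else 0.0

relu = make_out_wrapper(toy_relu_kernel, lambda x: [0.0] * len(x))
print(relu([-2.0, 3.0]))  # [0.0, 3.0]
```

With the real bindings one could pass `hpc_ai_opt.elementwise.relu` as the kernel and `torch.empty_like` as the allocator, since the API example above takes `(input, output)` in that order.
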
## Requirements

- CUDA Toolkit 13.1+
- CMake 3.24+
- A C++20 compiler
- An NVIDIA GPU with CUDA support
- PyTorch with CUDA support for the Python example path

## Documentation

- `docs/README.md`
- `docs/python/index.rst`
- `docs/01_gemm_optimization.md`
- `docs/04_flash_attention.md`