
Commit 4cdb149

Author: shijiashuai

docs: align support matrix and CI messaging

Update project documentation, issue links, and CI wording to match the current CUDA baseline, experimental module boundaries, and the fact that GitHub-hosted checks do not exercise full native CUDA validation.

1 parent e157199

File tree

10 files changed: +152 −622 lines

.github/ISSUE_TEMPLATE/config.yml

Lines changed: 2 additions & 2 deletions

@@ -1,8 +1,8 @@
 blank_issues_enabled: false
 contact_links:
   - name: Documentation
-    url: https://github.com/yourusername/HPC-AI-Optimization-Lab/tree/main/docs
+    url: https://github.com/LessUp/hpc-ai-optimization-lab/tree/main/docs
     about: Check the documentation before opening an issue
   - name: Discussions
-    url: https://github.com/yourusername/HPC-AI-Optimization-Lab/discussions
+    url: https://github.com/LessUp/hpc-ai-optimization-lab/discussions
     about: Ask questions and discuss ideas

.github/workflows/ci.yml

Lines changed: 10 additions & 0 deletions

@@ -82,6 +82,16 @@ jobs:
             raise SystemExit(f"Expected '{expected}' in {file_path}")
           PY
+      - name: Verify CI scope is documented
+        run: |
+          python - <<'PY'
+          from pathlib import Path
+
+          readme = Path('README.md').read_text(encoding='utf-8')
+          if 'does **not** currently provide full native CUDA build-and-test coverage' not in readme:
+              raise SystemExit('README.md must describe the current CI scope clearly')
+          PY
+
   docs:
     name: Build Documentation
     runs-on: ubuntu-latest
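The new workflow step is a plain string-containment check, so it can be reproduced locally before pushing. A minimal sketch, assuming you run it from the repository root (the `check_ci_scope` function name is illustrative; the CI step itself is an inline heredoc):

```python
# Local reproduction of the "Verify CI scope is documented" CI step:
# fail fast if README.md no longer states the CI coverage caveat.
from pathlib import Path

MARKER = 'does **not** currently provide full native CUDA build-and-test coverage'


def check_ci_scope(readme_path: str = 'README.md') -> None:
    """Raise SystemExit when the CI-scope sentence is missing from the README."""
    readme = Path(readme_path).read_text(encoding='utf-8')
    if MARKER not in readme:
        raise SystemExit('README.md must describe the current CI scope clearly')


if __name__ == '__main__':
    check_ci_scope()
```

Because the check is exact-substring matching, rewording that sentence in README.md (even changing the bold markers) will fail this step until the workflow is updated in lockstep.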

.kiro/specs/hpc-ai-optimization-lab/requirements.md

Lines changed: 8 additions & 8 deletions

@@ -2,7 +2,7 @@
 ## Introduction

-HPC-AI-Optimization-Lab is a "living" textbook of high-performance CUDA operator development, aiming to provide a complete evolution path from naive implementations to extreme optimization. The project uses modern C++20, leverages CUDA 13.1 and Hopper/Blackwell architecture features (TMA, WGMMA, FP8), and validates everything hands-on against PyTorch through Python bindings.
+HPC-AI-Optimization-Lab is a teaching- and experimentation-oriented repository of high-performance CUDA operators that provides a step-by-step optimization path starting from naive implementations. The project uses modern C++20; its stable core targets CUDA 12.4+, while experimental examples and fallback paths cover newer CUDA / Hopper features, and everything is validated hands-on against PyTorch through Python bindings.

 ## Glossary

@@ -34,7 +34,7 @@
 2. THE Build_System SHALL use FetchContent to pull dependencies automatically (fmt, googletest, nanobind, cutlass)
 3. THE Build_System SHALL auto-detect the current GPU architecture and set the corresponding -gencode flags
 4. WHEN the user runs cmake && make THEN THE Build_System SHALL compile all kernels successfully
-5. THE Docker_Environment SHALL provide a reproducible development environment based on a CUDA 13.1 image
+5. THE Docker_Environment SHALL provide a reproducible development environment based on the CUDA baseline image declared in the project documentation
 6. WHEN the Docker container starts THEN THE Docker_Environment SHALL include all required build tools and dependencies

 ### Requirement 2: Common utility library

@@ -112,11 +112,11 @@
 #### Acceptance Criteria

-1. THE TMA_Module SHALL implement asynchronous data movement using cuda::memcpy_async or PTX instructions
-2. THE TMA_Module SHALL show how registers and SMs are freed by letting the copy engine move data automatically
-3. THE Cluster_Module SHALL use the Thread Block Clusters feature of the Hopper architecture
-4. THE Cluster_Module SHALL implement direct shared-memory access between blocks (Distributed Shared Memory)
-5. THE FP8_Module SHALL implement GEMM using the e4m3 and e5m2 data types
+1. THE TMA_Module SHALL provide at least one runnable experimental example or fallback path, clearly labeling how it differs from the real TMA hardware path
+2. THE TMA_Module SHALL document the target capabilities of the real TMA path and the boundaries of the current implementation
+3. THE Cluster_Module SHALL provide at least one runnable experimental example or fallback path, clearly labeling how it differs from real Thread Block Clusters
+4. THE Cluster_Module SHALL document the target semantics of Distributed Shared Memory and the boundaries of the current implementation
+5. THE FP8_Module SHALL provide at least a testable experimental demo path and clearly state that it is not equivalent to a real Hopper FP8 Tensor Core implementation
 6. THE FP8_Module SHALL demonstrate FP8 scaling techniques

 ### Requirement 8: Quantized operators

@@ -175,6 +175,6 @@
 #### Acceptance Criteria

 1. THE Convolution_Module SHALL implement Implicit GEMM convolution
-2. THE Convolution_Module SHALL implement Winograd convolution
+2. THE Convolution_Module SHALL expose a Winograd interface; when the full implementation is missing, it MUST clearly label the fallback semantics and keep results correct
 3. THE Convolution_Module SHALL support common convolution parameters (stride, padding, dilation)
 4. WHEN the optimized convolution kernel runs THEN THE Kernel SHALL approach cuDNN performance
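The revised Winograd requirement describes an "interface with a labeled, correctness-preserving fallback" pattern. A minimal sketch of that pattern, with all names hypothetical (the repository's actual entry points are CUDA kernels, not these Python functions):

```python
# Hypothetical illustration of the requirement: the Winograd entry point
# stays callable, but routes to the validated reference path and labels
# the fallback explicitly until a full Winograd kernel lands.
import warnings


def conv2d_implicit_gemm(x, w):
    # Stand-in for the validated implicit-GEMM reference path; a real
    # version would launch the CUDA kernel. Toy elementwise placeholder:
    return [xi * w for xi in x]


def conv2d_winograd(x, w):
    """Winograd interface; currently a correctness-preserving fallback."""
    warnings.warn(
        'conv2d_winograd: full Winograd kernel not implemented; '
        'falling back to implicit GEMM',
        stacklevel=2,
    )
    return conv2d_implicit_gemm(x, w)
```

The key property the acceptance criterion asks for is that callers get correct results either way, and the fallback is visible (here via a warning) rather than silent.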

CHANGELOG.md

Lines changed: 2 additions & 2 deletions

@@ -115,5 +115,5 @@
 |---------|------|-------------|
 | 0.1.0 | 2024-01-01 | Initial release |

-[Unreleased]: https://github.com/yourusername/HPC-AI-Optimization-Lab/compare/v0.1.0...HEAD
-[0.1.0]: https://github.com/yourusername/HPC-AI-Optimization-Lab/releases/tag/v0.1.0
+[Unreleased]: https://github.com/LessUp/hpc-ai-optimization-lab/compare/v0.1.0...HEAD
+[0.1.0]: https://github.com/LessUp/hpc-ai-optimization-lab/releases/tag/v0.1.0

CONTRIBUTING.md

Lines changed: 22 additions & 50 deletions

@@ -29,7 +29,7 @@ Please read and follow our [Code of Conduct](CODE_OF_CONDUCT.md).
 ### Feature Requests

-1. Check [Issues](https://github.com/yourusername/HPC-AI-Optimization-Lab/issues) to confirm the feature has not already been requested
+1. Check [Issues](https://github.com/LessUp/hpc-ai-optimization-lab/issues) to confirm the feature has not already been requested
 2. Create a new Issue using the Feature Request template
 3. Describe the purpose and expected behavior of the feature

@@ -38,7 +38,7 @@
 1. Fork the repository
 2. Create a feature branch (`git checkout -b feature/amazing-feature`)
 3. Write code and tests
-4. Make sure all tests pass
+4. Make sure the relevant builds and tests pass in a local CUDA environment
 5. Commit your changes (`git commit -m 'Add amazing feature'`)
 6. Push the branch (`git push origin feature/amazing-feature`)
 7. Open a Pull Request

@@ -49,37 +49,37 @@
 | Dependency | Version requirement |
 |------|----------|
-| CUDA | 12.0+ (13.1+ recommended) |
+| CUDA | 12.4+ |
 | CMake | 3.24+ |
 | C++ compiler | GCC 11+ / Clang 14+ |
 | Python | 3.8+ |

+Notes:
+- The current `docker/Dockerfile` is based on CUDA 12.4.1.
+- The contents of `src/07_cuda13_features/` are currently experimental/placeholder material.
+- The implemented `flash_attention` path currently only officially supports `float + head_dim == 64`.
+
 ### Local Build

 ```bash
 # Clone the repository
 git clone https://github.com/LessUp/hpc-ai-optimization-lab.git
-cd HPC-AI-Optimization-Lab
-
-# Create a build directory
-mkdir build && cd build
-
-# Configure
-cmake .. -DCMAKE_BUILD_TYPE=Debug -GNinja
+cd hpc-ai-optimization-lab

-# Build
-ninja
+# Configure + build
+cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug -GNinja
+cmake --build build

 # Run the tests
-ctest --output-on-failure
+ctest --test-dir build --output-on-failure
 ```

 ### Docker Environment

 ```bash
 cd docker
 docker-compose up -d
-docker exec -it hpc-dev bash
+docker exec -it hpc-ai-lab bash
 ```

 ### Install pre-commit hooks

@@ -97,26 +97,13 @@
 - Use modern C++20 features
 - Use RAII to manage resources
 - Use Concepts to constrain template parameters
-- All public APIs require Doxygen comments
-
-```cpp
-/**
- * @brief ReLU activation function
- * @tparam T Data type (float, __half)
- * @param input Input tensor pointer
- * @param output Output tensor pointer
- * @param n Number of elements
- * @param stream CUDA stream (optional)
- */
-template<typename T>
-void relu(const T* input, T* output, size_t n, cudaStream_t stream = nullptr);
-```
+- Provide clear comments for public APIs

 ### Python

 - Follow PEP 8
-- Use type hints
-- Use docstrings
+- Keep the interface thin and consistent with the underlying C++ semantics
+- Validate input arguments at the boundary

 ### Naming Conventions

@@ -152,30 +139,17 @@
 - `test`: testing
 - `chore`: build/tooling

-### Example
-
-```
-feat(gemm): add Tensor Core WMMA implementation
-
-- Implement WMMA API for FP16 GEMM
-- Add unit tests for correctness
-- Update documentation
-
-Closes #123
-```
-
 ## Pull Request Process

-1. **Make sure the tests pass**: all CI checks must pass
-2. **Update documentation**: if you add a new feature, update the relevant docs
-3. **Add tests**: new features must have corresponding tests
-4. **Code review**: at least one maintainer must review
-5. **Squash merge**: PRs are squash-merged into main
+1. **Make sure local verification passes**: the default CI currently covers only lightweight checks; native CUDA builds and tests must be run locally or on a GPU runner
+2. **Update documentation**: if you change public behavior, the support matrix, or the positioning of experimental modules, update the docs in step
+3. **Add tests**: add regression tests when fixing bugs or adding behavior
+4. **Code review**: at least one maintainer must review

 ### PR Checklist

 - [ ] Code follows the project style guide
-- [ ] All tests pass
+- [ ] Relevant local builds and tests pass
 - [ ] Necessary tests added
 - [ ] Relevant documentation updated
 - [ ] Commit message follows the convention

@@ -186,5 +160,3 @@
 - Read the [documentation](docs/)
 - Search [Issues](https://github.com/LessUp/hpc-ai-optimization-lab/issues)
 - Create a new Issue to ask
-
-Thanks for your contribution! 🚀
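The new Python guideline "validate input arguments at the boundary" can be sketched as follows. This is illustrative only, not the repository's actual wrapper code: the real bindings are nanobind/C++, and `relu_checked` plus the duck-typed `.shape`/`.is_cuda` checks are hypothetical stand-ins:

```python
# Sketch of "validate at the boundary, stay thin" for a binding wrapper:
# check shapes and devices up front, then hand off to the kernel unchanged.
def relu_checked(x, y, launch_kernel):
    """x, y: tensor-like objects exposing .shape and .is_cuda.
    launch_kernel: the underlying binding, called only after validation."""
    if not (getattr(x, 'is_cuda', False) and getattr(y, 'is_cuda', False)):
        raise ValueError('relu: both tensors must live on a CUDA device')
    if x.shape != y.shape:
        raise ValueError(f'relu: shape mismatch {x.shape} vs {y.shape}')
    # The wrapper stays thin: no copies, no allocation; the output y is
    # caller-allocated and the kernel launch remains asynchronous.
    launch_kernel(x, y)
```

Validating before launch keeps failures synchronous and attributable to the call site, instead of surfacing later as an asynchronous CUDA error.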

README.md

Lines changed: 42 additions & 9 deletions

@@ -2,17 +2,34 @@
 English | [简体中文](README.zh-CN.md)

-A CUDA optimization lab for AI kernels, organized as a set of focused kernel modules, tests, examples, and lightweight Python bindings.
+A CUDA kernel lab for AI workloads, organized as focused modules for elementwise ops, reductions, GEMM, convolution, attention, quantization, and experimental newer-CUDA paths.

 ## What is in the repository

 - `src/common/`: shared CUDA utilities such as tensor wrappers, timers, launch helpers, and reduction primitives
-- `src/01_elementwise/` to `src/07_cuda13_features/`: numbered kernel modules covering elementwise ops, reductions, GEMM, convolution, attention, quantization, and newer CUDA features
+- `src/01_elementwise/` to `src/07_cuda13_features/`: numbered kernel modules covering elementwise ops, reductions, GEMM, convolution, attention, quantization, and experimental newer-CUDA features
 - `tests/`: GoogleTest + RapidCheck coverage across kernel modules
-- `examples/`: currently shipped CUDA and Python examples
+- `examples/`: shipped CUDA and Python examples
 - `python/`: nanobind bindings plus benchmark scripts
 - `docs/`: optimization notes and Python binding docs

+## Support matrix
+
+### Stable / validated focus of this repository
+
+- Core CUDA kernels under `src/01_elementwise` to `src/06_quantization`
+- CMake-based native builds and CTest-based validation
+- Thin Python bindings for a subset of elementwise, reduction, and GEMM kernels
+
+### Experimental / fallback areas
+
+The following modules currently exist as educational or compatibility-oriented paths rather than production-grade implementations:
+
+- `src/04_convolution/conv_winograd.cu`: currently falls back to the validated implicit-GEMM convolution path
+- `src/07_cuda13_features/tma.cu`: currently uses a regular kernel copy fallback
+- `src/07_cuda13_features/cluster.cu`: currently uses a portable block-reduction fallback
+- `src/07_cuda13_features/fp8_gemm.cu`: currently demonstrates scaled float behavior rather than a true Hopper FP8 kernel
+
 ## Build the C++/CUDA project

@@ -33,6 +50,12 @@
 python -c "import hpc_ai_opt; print(hpc_ai_opt.__doc__)"
 python examples/python/basic_usage.py
 ```

+The bindings are intentionally thin:
+- CUDA tensors are passed in directly
+- output tensors are allocated by the caller
+- several kernels require explicit shape arguments
+- wrappers validate basic tensor sizes/arguments, then launch CUDA work asynchronously
+
 ## Build the shipped examples

@@ -52,19 +75,29 @@
 y = torch.empty_like(x)
 hpc_ai_opt.elementwise.relu(x, y)
 ```

-The current bindings are intentionally thin:
-- CUDA tensors are passed in directly
-- output tensors are allocated by the caller
-- some kernels require explicit shape arguments
-
 ## Requirements

-- CUDA Toolkit 13.1+
+- CUDA Toolkit 12.4+
 - CMake 3.24+
 - A C++20 compiler
 - An NVIDIA GPU with CUDA support
 - PyTorch with CUDA support for the Python example path

+Notes:
+- The Docker development environment is currently based on CUDA 12.4.1.
+- Experimental modules under `src/07_cuda13_features/` are not evidence of full Hopper/Blackwell feature coverage.
+- `flash_attention` currently supports `float` with `head_dim == 64` in the shipped implementation.
+
+## CI and verification scope
+
+The default GitHub Actions workflow is intentionally lightweight and currently validates:
+- formatting
+- repository/documentation consistency checks
+- documentation builds
+
+It does **not** currently provide full native CUDA build-and-test coverage on GitHub-hosted runners.
+For native verification, run the local CMake + CTest flow shown above on a machine with a working CUDA toolchain and GPU.
+
 ## Documentation

 - `docs/README.md`
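The README now documents a narrow support envelope for `flash_attention` (float with `head_dim == 64`). A caller-side guard for that envelope might look like the following; the function and constant names are hypothetical, and only the supported-configuration facts come from the documentation above:

```python
# Hypothetical pre-flight check for the documented flash_attention
# support envelope: float32 inputs with head_dim == 64.
SUPPORTED_HEAD_DIM = 64
SUPPORTED_DTYPE = 'float32'


def check_flash_attention_config(dtype: str, head_dim: int) -> None:
    """Raise ValueError for configurations outside the shipped support."""
    if dtype != SUPPORTED_DTYPE or head_dim != SUPPORTED_HEAD_DIM:
        raise ValueError(
            f'flash_attention: only {SUPPORTED_DTYPE} with head_dim == '
            f'{SUPPORTED_HEAD_DIM} is supported; got {dtype}, head_dim={head_dim}'
        )
```

Rejecting unsupported dtypes and head dimensions up front gives callers a clear Python-level error instead of an opaque kernel failure.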
