
Commit 4cdb149

Author: shijiashuai

docs: align support matrix and CI messaging

Update project documentation, issue links, and CI wording to match the current CUDA baseline, experimental module boundaries, and the fact that GitHub-hosted checks do not exercise full native CUDA validation.

1 parent e157199

File tree

10 files changed: +152 −622 lines

.github/ISSUE_TEMPLATE/config.yml

Lines changed: 2 additions & 2 deletions

@@ -1,8 +1,8 @@
 blank_issues_enabled: false
 contact_links:
   - name: Documentation
-    url: https://github.com/yourusername/HPC-AI-Optimization-Lab/tree/main/docs
+    url: https://github.com/LessUp/hpc-ai-optimization-lab/tree/main/docs
     about: Check the documentation before opening an issue
   - name: Discussions
-    url: https://github.com/yourusername/HPC-AI-Optimization-Lab/discussions
+    url: https://github.com/LessUp/hpc-ai-optimization-lab/discussions
     about: Ask questions and discuss ideas

.github/workflows/ci.yml

Lines changed: 10 additions & 0 deletions

@@ -82,6 +82,16 @@ jobs:
             raise SystemExit(f"Expected '{expected}' in {file_path}")
           PY
+      - name: Verify CI scope is documented
+        run: |
+          python - <<'PY'
+          from pathlib import Path
+
+          readme = Path('README.md').read_text(encoding='utf-8')
+          if 'does **not** currently provide full native CUDA build-and-test coverage' not in readme:
+              raise SystemExit('README.md must describe the current CI scope clearly')
+          PY
+
   docs:
     name: Build Documentation
     runs-on: ubuntu-latest
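The new workflow step is a plain string-containment check, so it can be reproduced locally before pushing. A minimal sketch, assuming you run it from the repository root (the `check_ci_scope` function name is illustrative; the CI step itself is an inline heredoc):

```python
# Local reproduction of the "Verify CI scope is documented" CI step:
# fail fast if README.md no longer states the CI coverage caveat.
from pathlib import Path

MARKER = 'does **not** currently provide full native CUDA build-and-test coverage'


def check_ci_scope(readme_path: str = 'README.md') -> None:
    """Raise SystemExit when the CI-scope sentence is missing from the README."""
    readme = Path(readme_path).read_text(encoding='utf-8')
    if MARKER not in readme:
        raise SystemExit('README.md must describe the current CI scope clearly')


if __name__ == '__main__':
    check_ci_scope()
```

Because the check is exact-substring matching, rewording that sentence in README.md (even changing the bold markers) will fail this step until the workflow is updated in lockstep.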

.kiro/specs/hpc-ai-optimization-lab/requirements.md

Lines changed: 8 additions & 8 deletions

@@ -2,7 +2,7 @@
 ## Introduction

-HPC-AI-Optimization-Lab is a "living" textbook of high-performance CUDA operator development, aiming to provide a complete evolution path from naive implementations to extreme optimization. The project uses modern C++20, leverages CUDA 13.1 and Hopper/Blackwell architecture features (TMA, WGMMA, FP8), and validates everything hands-on against PyTorch through Python bindings.
+HPC-AI-Optimization-Lab is a teaching- and experimentation-oriented repository of high-performance CUDA operators that provides a step-by-step optimization path starting from naive implementations. The project uses modern C++20; its stable core targets CUDA 12.4+, while experimental examples and fallback paths cover newer CUDA / Hopper features, and everything is validated hands-on against PyTorch through Python bindings.

 ## Glossary

@@ -34,7 +34,7 @@
 2. THE Build_System SHALL use FetchContent to pull dependencies automatically (fmt, googletest, nanobind, cutlass)
 3. THE Build_System SHALL auto-detect the current GPU architecture and set the corresponding -gencode flags
 4. WHEN the user runs cmake && make THEN THE Build_System SHALL compile all kernels successfully
-5. THE Docker_Environment SHALL provide a reproducible development environment based on a CUDA 13.1 image
+5. THE Docker_Environment SHALL provide a reproducible development environment based on the CUDA baseline image declared in the project documentation
 6. WHEN the Docker container starts THEN THE Docker_Environment SHALL include all required build tools and dependencies

 ### Requirement 2: Common utility library

@@ -112,11 +112,11 @@
 #### Acceptance Criteria

-1. THE TMA_Module SHALL implement asynchronous data movement using cuda::memcpy_async or PTX instructions
-2. THE TMA_Module SHALL show how registers and SMs are freed by letting the copy engine move data automatically
-3. THE Cluster_Module SHALL use the Thread Block Clusters feature of the Hopper architecture
-4. THE Cluster_Module SHALL implement direct shared-memory access between blocks (Distributed Shared Memory)
-5. THE FP8_Module SHALL implement GEMM using the e4m3 and e5m2 data types
+1. THE TMA_Module SHALL provide at least one runnable experimental example or fallback path, clearly labeling how it differs from the real TMA hardware path
+2. THE TMA_Module SHALL document the target capabilities of the real TMA path and the boundaries of the current implementation
+3. THE Cluster_Module SHALL provide at least one runnable experimental example or fallback path, clearly labeling how it differs from real Thread Block Clusters
+4. THE Cluster_Module SHALL document the target semantics of Distributed Shared Memory and the boundaries of the current implementation
+5. THE FP8_Module SHALL provide at least a testable experimental demo path and clearly state that it is not equivalent to a real Hopper FP8 Tensor Core implementation
 6. THE FP8_Module SHALL demonstrate FP8 scaling techniques

 ### Requirement 8: Quantized operators

@@ -175,6 +175,6 @@
 #### Acceptance Criteria

 1. THE Convolution_Module SHALL implement Implicit GEMM convolution
-2. THE Convolution_Module SHALL implement Winograd convolution
+2. THE Convolution_Module SHALL expose a Winograd interface; when the full implementation is missing, it MUST clearly label the fallback semantics and keep results correct
 3. THE Convolution_Module SHALL support common convolution parameters (stride, padding, dilation)
 4. WHEN the optimized convolution kernel runs THEN THE Kernel SHALL approach cuDNN performance
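The revised Winograd requirement describes an "interface with a labeled, correctness-preserving fallback" pattern. A minimal sketch of that pattern, with all names hypothetical (the repository's actual entry points are CUDA kernels, not these Python functions):

```python
# Hypothetical illustration of the requirement: the Winograd entry point
# stays callable, but routes to the validated reference path and labels
# the fallback explicitly until a full Winograd kernel lands.
import warnings


def conv2d_implicit_gemm(x, w):
    # Stand-in for the validated implicit-GEMM reference path; a real
    # version would launch the CUDA kernel. Toy elementwise placeholder:
    return [xi * w for xi in x]


def conv2d_winograd(x, w):
    """Winograd interface; currently a correctness-preserving fallback."""
    warnings.warn(
        'conv2d_winograd: full Winograd kernel not implemented; '
        'falling back to implicit GEMM',
        stacklevel=2,
    )
    return conv2d_implicit_gemm(x, w)
```

The key property the acceptance criterion asks for is that callers get correct results either way, and the fallback is visible (here via a warning) rather than silent.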

CHANGELOG.md

Lines changed: 2 additions & 2 deletions

@@ -115,5 +115,5 @@
 |---------|------|-------------|
 | 0.1.0 | 2024-01-01 | Initial release |

-[Unreleased]: https://github.com/yourusername/HPC-AI-Optimization-Lab/compare/v0.1.0...HEAD
-[0.1.0]: https://github.com/yourusername/HPC-AI-Optimization-Lab/releases/tag/v0.1.0
+[Unreleased]: https://github.com/LessUp/hpc-ai-optimization-lab/compare/v0.1.0...HEAD
+[0.1.0]: https://github.com/LessUp/hpc-ai-optimization-lab/releases/tag/v0.1.0

CONTRIBUTING.md

Lines changed: 22 additions & 50 deletions

@@ -29,7 +29,7 @@ Please read and follow our [Code of Conduct](CODE_OF_CONDUCT.md).
 ### Feature Requests

-1. Check [Issues](https://github.com/yourusername/HPC-AI-Optimization-Lab/issues) to confirm the feature has not already been requested
+1. Check [Issues](https://github.com/LessUp/hpc-ai-optimization-lab/issues) to confirm the feature has not already been requested
 2. Create a new Issue using the Feature Request template
 3. Describe the purpose and expected behavior of the feature

@@ -38,7 +38,7 @@
 1. Fork the repository
 2. Create a feature branch (`git checkout -b feature/amazing-feature`)
 3. Write code and tests
-4. Make sure all tests pass
+4. Make sure the relevant builds and tests pass in a local CUDA environment
 5. Commit your changes (`git commit -m 'Add amazing feature'`)
 6. Push the branch (`git push origin feature/amazing-feature`)
 7. Open a Pull Request

@@ -49,37 +49,37 @@
 | Dependency | Version requirement |
 |------|----------|
-| CUDA | 12.0+ (13.1+ recommended) |
+| CUDA | 12.4+ |
 | CMake | 3.24+ |
 | C++ compiler | GCC 11+ / Clang 14+ |
 | Python | 3.8+ |

+Notes:
+- The current `docker/Dockerfile` is based on CUDA 12.4.1.
+- The contents of `src/07_cuda13_features/` are currently experimental/placeholder material.
+- The implemented `flash_attention` path currently only officially supports `float + head_dim == 64`.
+
 ### Local Build

 ```bash
 # Clone the repository
 git clone https://github.com/LessUp/hpc-ai-optimization-lab.git
-cd HPC-AI-Optimization-Lab
-
-# Create a build directory
-mkdir build && cd build
-
-# Configure
-cmake .. -DCMAKE_BUILD_TYPE=Debug -GNinja
+cd hpc-ai-optimization-lab

-# Build
-ninja
+# Configure + build
+cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug -GNinja
+cmake --build build

 # Run the tests
-ctest --output-on-failure
+ctest --test-dir build --output-on-failure
 ```

 ### Docker Environment

 ```bash
 cd docker
 docker-compose up -d
-docker exec -it hpc-dev bash
+docker exec -it hpc-ai-lab bash
 ```

 ### Install pre-commit hooks

@@ -97,26 +97,13 @@
 - Use modern C++20 features
 - Use RAII to manage resources
 - Use Concepts to constrain template parameters
-- All public APIs require Doxygen comments
-
-```cpp
-/**
- * @brief ReLU activation function
- * @tparam T Data type (float, __half)
- * @param input Input tensor pointer
- * @param output Output tensor pointer
- * @param n Number of elements
- * @param stream CUDA stream (optional)
- */
-template<typename T>
-void relu(const T* input, T* output, size_t n, cudaStream_t stream = nullptr);
-```
+- Provide clear comments for public APIs

 ### Python

 - Follow PEP 8
-- Use type hints
-- Use docstrings
+- Keep the interface thin and consistent with the underlying C++ semantics
+- Validate input arguments at the boundary

 ### Naming Conventions

@@ -152,30 +139,17 @@
 - `test`: testing
 - `chore`: build/tooling

-### Example
-
-```
-feat(gemm): add Tensor Core WMMA implementation
-
-- Implement WMMA API for FP16 GEMM
-- Add unit tests for correctness
-- Update documentation
-
-Closes #123
-```
-
 ## Pull Request Process

-1. **Make sure the tests pass**: all CI checks must pass
-2. **Update documentation**: if you add a new feature, update the relevant docs
-3. **Add tests**: new features must have corresponding tests
-4. **Code review**: at least one maintainer must review
-5. **Squash merge**: PRs are squash-merged into main
+1. **Make sure local verification passes**: the default CI currently covers only lightweight checks; native CUDA builds and tests must be run locally or on a GPU runner
+2. **Update documentation**: if you change public behavior, the support matrix, or the positioning of experimental modules, update the docs in step
+3. **Add tests**: add regression tests when fixing bugs or adding behavior
+4. **Code review**: at least one maintainer must review

 ### PR Checklist

 - [ ] Code follows the project style guide
-- [ ] All tests pass
+- [ ] Relevant local builds and tests pass
 - [ ] Necessary tests added
 - [ ] Relevant documentation updated
 - [ ] Commit message follows the convention

@@ -186,5 +160,3 @@
 - Read the [documentation](docs/)
 - Search [Issues](https://github.com/LessUp/hpc-ai-optimization-lab/issues)
 - Create a new Issue to ask
-
-Thanks for your contribution! 🚀
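The new Python guideline "validate input arguments at the boundary" can be sketched as follows. This is illustrative only, not the repository's actual wrapper code: the real bindings are nanobind/C++, and `relu_checked` plus the duck-typed `.shape`/`.is_cuda` checks are hypothetical stand-ins:

```python
# Sketch of "validate at the boundary, stay thin" for a binding wrapper:
# check shapes and devices up front, then hand off to the kernel unchanged.
def relu_checked(x, y, launch_kernel):
    """x, y: tensor-like objects exposing .shape and .is_cuda.
    launch_kernel: the underlying binding, called only after validation."""
    if not (getattr(x, 'is_cuda', False) and getattr(y, 'is_cuda', False)):
        raise ValueError('relu: both tensors must live on a CUDA device')
    if x.shape != y.shape:
        raise ValueError(f'relu: shape mismatch {x.shape} vs {y.shape}')
    # The wrapper stays thin: no copies, no allocation; the output y is
    # caller-allocated and the kernel launch remains asynchronous.
    launch_kernel(x, y)
```

Validating before launch keeps failures synchronous and attributable to the call site, instead of surfacing later as an asynchronous CUDA error.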

README.md

Lines changed: 42 additions & 9 deletions

@@ -2,17 +2,34 @@
 English | [简体中文](README.zh-CN.md)

-A CUDA optimization lab for AI kernels, organized as a set of focused kernel modules, tests, examples, and lightweight Python bindings.
+A CUDA kernel lab for AI workloads, organized as focused modules for elementwise ops, reductions, GEMM, convolution, attention, quantization, and experimental newer-CUDA paths.

 ## What is in the repository

 - `src/common/`: shared CUDA utilities such as tensor wrappers, timers, launch helpers, and reduction primitives
-- `src/01_elementwise/` to `src/07_cuda13_features/`: numbered kernel modules covering elementwise ops, reductions, GEMM, convolution, attention, quantization, and newer CUDA features
+- `src/01_elementwise/` to `src/07_cuda13_features/`: numbered kernel modules covering elementwise ops, reductions, GEMM, convolution, attention, quantization, and experimental newer-CUDA features
 - `tests/`: GoogleTest + RapidCheck coverage across kernel modules
-- `examples/`: currently shipped CUDA and Python examples
+- `examples/`: shipped CUDA and Python examples
 - `python/`: nanobind bindings plus benchmark scripts
 - `docs/`: optimization notes and Python binding docs

+## Support matrix
+
+### Stable / validated focus of this repository
+
+- Core CUDA kernels under `src/01_elementwise` to `src/06_quantization`
+- CMake-based native builds and CTest-based validation
+- Thin Python bindings for a subset of elementwise, reduction, and GEMM kernels
+
+### Experimental / fallback areas
+
+The following modules currently exist as educational or compatibility-oriented paths rather than production-grade implementations:
+
+- `src/04_convolution/conv_winograd.cu`: currently falls back to the validated implicit-GEMM convolution path
+- `src/07_cuda13_features/tma.cu`: currently uses a regular kernel copy fallback
+- `src/07_cuda13_features/cluster.cu`: currently uses a portable block-reduction fallback
+- `src/07_cuda13_features/fp8_gemm.cu`: currently demonstrates scaled float behavior rather than a true Hopper FP8 kernel
+
 ## Build the C++/CUDA project

@@ -33,6 +50,12 @@
 python -c "import hpc_ai_opt; print(hpc_ai_opt.__doc__)"
 python examples/python/basic_usage.py
 ```

+The bindings are intentionally thin:
+- CUDA tensors are passed in directly
+- output tensors are allocated by the caller
+- several kernels require explicit shape arguments
+- wrappers validate basic tensor sizes/arguments, then launch CUDA work asynchronously
+
 ## Build the shipped examples

@@ -52,19 +75,29 @@
 y = torch.empty_like(x)
 hpc_ai_opt.elementwise.relu(x, y)
 ```

-The current bindings are intentionally thin:
-- CUDA tensors are passed in directly
-- output tensors are allocated by the caller
-- some kernels require explicit shape arguments
-
 ## Requirements

-- CUDA Toolkit 13.1+
+- CUDA Toolkit 12.4+
 - CMake 3.24+
 - A C++20 compiler
 - An NVIDIA GPU with CUDA support
 - PyTorch with CUDA support for the Python example path

+Notes:
+- The Docker development environment is currently based on CUDA 12.4.1.
+- Experimental modules under `src/07_cuda13_features/` are not evidence of full Hopper/Blackwell feature coverage.
+- `flash_attention` currently supports `float` with `head_dim == 64` in the shipped implementation.
+
+## CI and verification scope
+
+The default GitHub Actions workflow is intentionally lightweight and currently validates:
+- formatting
+- repository/documentation consistency checks
+- documentation builds
+
+It does **not** currently provide full native CUDA build-and-test coverage on GitHub-hosted runners.
+For native verification, run the local CMake + CTest flow shown above on a machine with a working CUDA toolchain and GPU.
+
 ## Documentation

 - `docs/README.md`
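The README now documents a narrow support envelope for `flash_attention` (float with `head_dim == 64`). A caller-side guard for that envelope might look like the following; the function and constant names are hypothetical, and only the supported-configuration facts come from the documentation above:

```python
# Hypothetical pre-flight check for the documented flash_attention
# support envelope: float32 inputs with head_dim == 64.
SUPPORTED_HEAD_DIM = 64
SUPPORTED_DTYPE = 'float32'


def check_flash_attention_config(dtype: str, head_dim: int) -> None:
    """Raise ValueError for configurations outside the shipped support."""
    if dtype != SUPPORTED_DTYPE or head_dim != SUPPORTED_HEAD_DIM:
        raise ValueError(
            f'flash_attention: only {SUPPORTED_DTYPE} with head_dim == '
            f'{SUPPORTED_HEAD_DIM} is supported; got {dtype}, head_dim={head_dim}'
        )
```

Rejecting unsupported dtypes and head dimensions up front gives callers a clear Python-level error instead of an opaque kernel failure.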
