Commit 6c02402

Author: shijiashuai
refactor: complete stabilization and close-out phase
Major improvements:
- Fix GEMM Step 6/7 documentation inconsistencies
- Add type support matrix for GEMM optimizations
- Add status management to all OpenSpec specs
- Redesign Git Pages landing page with high marketing value
- Remove docs/node_modules (95MB) and boilerplate docs
- Enhance CUTLASS and INT8 GEMM documentation

Code changes:
- Add status annotations to GEMM optimization steps
- Document CUTLASS baseline type limitations
- Document INT8 GEMM optimization level delegation
- Add frontmatter status to all 9 specs

Documentation improvements:
- Complete type support matrix in API docs (EN/ZH-CN)
- Update GEMM guide with implementation status
- Create marketing-focused landing page with performance highlights
- Add learning path and quick start sections

Cleanup:
- Remove docs/node_modules (95MB)
- Remove boilerplate: CODE_OF_CONDUCT.md, SECURITY.md, CONTRIBUTING.md
- Add docs/.gitignore

This completes the stabilization phase. The repository is now in a clean, credible, and easy-to-understand state ready for archival.
1 parent eb04f93 commit 6c02402

24 files changed

Lines changed: 389 additions & 354 deletions

.github/workflows/ci.yml

Lines changed: 30 additions & 0 deletions
@@ -6,6 +6,12 @@ on:
   pull_request:
     branches: [main, master]
   workflow_dispatch:
+    inputs:
+      cuda_build:
+        description: 'Run CUDA compilation test'
+        required: false
+        default: false
+        type: boolean

 permissions:
   contents: read
@@ -74,3 +80,27 @@ jobs:

       - name: Build VitePress site
         run: npm --prefix docs run docs:build
+
+  # CUDA compilation test - manually triggered or on workflow_dispatch
+  cuda-build:
+    name: CUDA Build
+    runs-on: ubuntu-latest
+    if: ${{ github.event_name == 'workflow_dispatch' && inputs.cuda_build == true }}
+    container:
+      image: nvidia/cuda:12.4.0-devel-ubuntu22.04
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Install CMake
+        run: |
+          apt-get update
+          apt-get install -y cmake
+
+      - name: Configure CMake
+        run: cmake --preset default
+        env:
+          CXX: g++
+
+      - name: Build
+        run: cmake --build --preset default -j$(nproc)

AGENTS.md

Lines changed: 14 additions & 0 deletions
@@ -6,6 +6,13 @@ Shared repository guidance for AI assistants.

 This project is in **stabilization / close-out mode**. Prefer simplification, consolidation, and trustworthiness over feature expansion. Delete or archive low-value material once a stronger canonical replacement exists.

+## Project-specific context
+
+- **Implementation status**: GEMM Steps 1-5 are fully implemented. Step 6 (MMA PTX) delegates to Step 5 for stability. Step 7 (Software Pipelining) is planned for future implementation.
+- **Type support**: CUTLASS baseline only supports float. INT8 GEMM has complete SharedMemTiling optimization; other optimization levels delegate to SharedMemTiling.
+- **Python bindings**: Currently expose `elementwise`, `reduction`, and `gemm` only. Not all C++ modules have Python bindings.
+- **CI limitation**: GPU validation requires local execution or self-hosted infrastructure; GitHub-hosted runners do not provide CUDA support.
+
 ## Canonical sources of truth

 - **Active work**: `openspec/changes/<change>/`
@@ -59,6 +66,13 @@ ctest --preset default

 If the configured build tree exposes zero tests or stale results, reconfigure before trusting it.

+## Code style and conventions
+
+- **CUDA kernel organization**: One kernel per file under `src/<module>/`
+- **Error handling**: Use `CUDA_CHECK` macro for all CUDA API calls; throw `std::invalid_argument` for invalid parameters
+- **Documentation sync**: When updating code, ensure README, docs, and API reference reflect the changes
+- **Performance claims**: Only state TFLOPS numbers that have been measured; mark projected or estimated values clearly
+
 ## Tooling and automation posture

 - Keep hooks and automation narrow; every retained check should protect a real recurring failure mode.
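
The delegation rules described in the AGENTS.md additions above (Step 6 falls back to Step 5, and INT8 GEMM routes every optimization level to SharedMemTiling) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `GemmStep` enum, both `resolve_*` functions, and the choice to throw for Step 7 are hypothetical and do not come from the repository. Only the use of `std::invalid_argument` for invalid parameters follows the stated convention.

```cpp
#include <stdexcept>

// Hypothetical names for the seven GEMM steps listed in the README table.
enum class GemmStep {
    Naive = 1,
    SharedMemTiling,
    DoubleBuffering,
    RegisterTiling,
    TensorCoreWmma,
    TensorCoreMmaPtx,
    SoftwarePipelining
};

// Returns the step whose kernel would actually run for a requested step.
// Assumption: Step 7 is rejected because it is only planned; the real
// project may handle it differently.
constexpr GemmStep resolve_gemm_step(GemmStep requested) {
    switch (requested) {
        case GemmStep::Naive:
        case GemmStep::SharedMemTiling:
        case GemmStep::DoubleBuffering:
        case GemmStep::RegisterTiling:
        case GemmStep::TensorCoreWmma:
            return requested;  // Steps 1-5 are implemented directly
        case GemmStep::TensorCoreMmaPtx:
            return GemmStep::TensorCoreWmma;  // Step 6 delegates to Step 5
        case GemmStep::SoftwarePipelining:
            throw std::invalid_argument("Step 7 is planned, not implemented");
    }
    throw std::invalid_argument("unknown GEMM step");
}

// INT8 GEMM: every valid optimization level maps to SharedMemTiling.
constexpr GemmStep resolve_int8_step(GemmStep requested) {
    if (requested < GemmStep::Naive || requested > GemmStep::SoftwarePipelining) {
        throw std::invalid_argument("unknown GEMM step");
    }
    return GemmStep::SharedMemTiling;
}
```

Keeping the mapping in one place like this would make the documented status and the dispatch behavior easy to compare, which is the kind of documentation sync the conventions ask for.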

CODE_OF_CONDUCT.md

Lines changed: 0 additions & 133 deletions
This file was deleted.

CONTRIBUTING.md

Lines changed: 0 additions & 120 deletions
This file was deleted.

README.md

Lines changed: 12 additions & 9 deletions
@@ -186,17 +186,20 @@ See [Troubleshooting Guide](docs/en/guide/troubleshooting.md) for more.

 FP32 matrix multiplication (4096×4096) on NVIDIA A100:

-| Step | Technique | Performance | Speedup |
-|:----:|-----------|-------------|:-------:|
-| 1 | Naive implementation | 0.5 TFLOPS | |
-| 2 | Shared memory tiling | 2.0 TFLOPS | |
-| 3 | Double buffering | 3.5 TFLOPS | |
-| 4 | Register tiling | 6.0 TFLOPS | 12× |
-| 5 | **Tensor Core (WMMA)** | **50+ TFLOPS** | **100×** |
-| 6 | Tensor Core (MMA PTX) | 60+ TFLOPS | 120× |
-| 7 | Software pipelining | 70+ TFLOPS | 140× |
+| Step | Technique | Performance | Speedup | Status |
+|:----:|-----------|-------------|:-------:|:------:|
+| 1 | Naive implementation | 0.5 TFLOPS | | |
+| 2 | Shared memory tiling | 2.0 TFLOPS | | |
+| 3 | Double buffering | 3.5 TFLOPS | | |
+| 4 | Register tiling | 6.0 TFLOPS | 12× | |
+| 5 | **Tensor Core (WMMA)** | **50+ TFLOPS** | **100×** | |
+| 6 | Tensor Core (MMA PTX)* | ~60 TFLOPS | ~120× | 🚧 |
+| 7 | Software pipelining* | ~70 TFLOPS | ~140× | 🚧 |

 > 💡 The progression from Step 1 to Step 5 demonstrates why modern AI hardware achieves remarkable speedups through specialized units.
+>
+> *Step 6 currently delegates to Step 5 for stability. Step 7 is planned for future implementation.
+> †Performance values are projected estimates.

 ### Module Status

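
For readers checking the table's arithmetic, here is a minimal sketch of how a GEMM throughput figure and a speedup ratio are usually derived. It assumes the common convention of counting 2·N³ floating-point operations for an N×N×N multiplication. The function names are hypothetical and not part of the repository, and the elapsed time must come from a real measurement.

```cpp
// Floating-point operations in an N x N x N GEMM: each of the N*N outputs
// needs N multiplies and N adds, hence 2 * N^3.
constexpr double gemm_flops(double n) { return 2.0 * n * n * n; }

// Throughput in TFLOPS for a kernel that ran for elapsed_seconds
// (a measured value supplied by the caller).
constexpr double gemm_tflops(double n, double elapsed_seconds) {
    return gemm_flops(n) / elapsed_seconds / 1.0e12;
}

// Speedup of one throughput figure relative to a baseline figure.
constexpr double speedup(double tflops, double baseline_tflops) {
    return tflops / baseline_tflops;
}
```

With the figures in the table, `speedup(6.0, 0.5)` gives 12 and `speedup(50.0, 0.5)` gives 100, matching the Step 4 and Step 5 rows relative to the Step 1 baseline.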

SECURITY.md

Lines changed: 0 additions & 36 deletions
This file was deleted.

docs/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+node_modules/
