AICL-Lab
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
Lines changed: 30 additions & 0 deletions
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 14 additions & 0 deletions
diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
Lines changed: 0 additions & 133 deletions
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 0 additions & 120 deletions
diff --git a/README.md b/README.md
Lines changed: 12 additions & 9 deletions
diff --git a/SECURITY.md b/SECURITY.md
Lines changed: 0 additions & 36 deletions
diff --git a/docs/.gitignore b/docs/.gitignore
Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,12 @@ on:
   pull_request:
     branches: [main, master]
   workflow_dispatch:
+    inputs:
+      cuda_build:
+        description: 'Run CUDA compilation test'
+        required: false
+        default: false
+        type: boolean
 
 permissions:
   contents: read
@@ -74,3 +80,27 @@ jobs:
 
       - name: Build VitePress site
         run: npm --prefix docs run docs:build
+
+  # CUDA compilation test: runs only on a manual workflow_dispatch with cuda_build enabled
+  cuda-build:
+    name: CUDA Build
+    runs-on: ubuntu-latest
+    if: ${{ github.event_name == 'workflow_dispatch' && inputs.cuda_build == true }}
+    container:
+      image: nvidia/cuda:12.4.0-devel-ubuntu22.04
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Install CMake
+        run: |
+          apt-get update
+          apt-get install -y cmake
+
+      - name: Configure CMake
+        run: cmake --preset default
+        env:
+          CXX: g++
+
+      - name: Build
+        run: cmake --build --preset default -j$(nproc)
@@ -6,6 +6,13 @@ Shared repository guidance for AI assistants.
 
 This project is in **stabilization / close-out mode**. Prefer simplification, consolidation, and trustworthiness over feature expansion. Delete or archive low-value material once a stronger canonical replacement exists.
 
+## Project-specific context
+
+- **Implementation status**: GEMM Steps 1-5 are fully implemented. Step 6 (MMA PTX) delegates to Step 5 for stability. Step 7 (Software Pipelining) is planned for future implementation.
+- **Type support**: The CUTLASS baseline supports only `float`. INT8 GEMM has a complete SharedMemTiling optimization; its other optimization levels delegate to SharedMemTiling.
+- **Python bindings**: Currently expose `elementwise`, `reduction`, and `gemm` only. Not all C++ modules have Python bindings.
+- **CI limitation**: GPU validation requires local execution or self-hosted infrastructure; standard GitHub-hosted runners have no NVIDIA GPU, so CI can at most compile CUDA code, not run it.
+
 ## Canonical sources of truth
 
 - **Active work**: `openspec/changes/<change>/`
@@ -59,6 +66,13 @@ ctest --preset default
 
 If the configured build tree exposes zero tests or stale results, reconfigure before trusting it.
 
+## Code style and conventions
+
+- **CUDA kernel organization**: One kernel per file under `src/<module>/`
+- **Error handling**: Wrap every CUDA API call in the `CUDA_CHECK` macro; throw `std::invalid_argument` for invalid parameters
+- **Documentation sync**: When updating code, ensure README, docs, and API reference reflect the changes
+- **Performance claims**: Only state TFLOPS numbers that have been measured; mark projected or estimated values clearly
+
 ## Tooling and automation posture
 
 - Keep hooks and automation narrow; every retained check should protect a real recurring failure mode.
 
@@ -186,17 +186,20 @@ See [Troubleshooting Guide](docs/en/guide/troubleshooting.md) for more.
 
 FP32 matrix multiplication (4096×4096) on NVIDIA A100:
 
-| Step | Technique | Performance | Speedup |
-|:----:|-----------|-------------|:-------:|
-| 1 | Naive implementation | 0.5 TFLOPS | 1× |
-| 2 | Shared memory tiling | 2.0 TFLOPS | 4× |
-| 3 | Double buffering | 3.5 TFLOPS | 7× |
-| 4 | Register tiling | 6.0 TFLOPS | 12× |
-| 5 | **Tensor Core (WMMA)** | **50+ TFLOPS** | **100×** |
-| 6 | Tensor Core (MMA PTX) | 60+ TFLOPS | 120× |
-| 7 | Software pipelining | 70+ TFLOPS | 140× |
+| Step | Technique | Performance | Speedup | Status |
+|:----:|-----------|-------------|:-------:|:------:|
+| 1 | Naive implementation | 0.5 TFLOPS | 1× | ✅ |
+| 2 | Shared memory tiling | 2.0 TFLOPS | 4× | ✅ |
+| 3 | Double buffering | 3.5 TFLOPS | 7× | ✅ |
+| 4 | Register tiling | 6.0 TFLOPS | 12× | ✅ |
+| 5 | **Tensor Core (WMMA)** | **50+ TFLOPS** | **100×** | ✅ |
+| 6 | Tensor Core (MMA PTX)* | ~60 TFLOPS† | ~120× | 🚧 |
+| 7 | Software pipelining* | ~70 TFLOPS† | ~140× | 🚧 |
 
 > 💡 The progression from Step 1 to Step 5 demonstrates why modern AI hardware achieves remarkable speedups through specialized units.
+> 
+> *Step 6 currently delegates to Step 5 for stability. Step 7 is planned for future implementation.  
+> †Performance values are projected estimates.
 
 ### Module Status
 
 
@@ -0,0 +1 @@
+node_modules/
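
For readers checking the README benchmark table in this diff: the Speedup column appears to be each step's throughput divided by the Step 1 baseline, and a TFLOPS figure for a GEMM conventionally counts 2·M·N·K floating-point operations. The following is a minimal Python sketch of that arithmetic, not code from the repository. The function names are hypothetical, the table values are copied from the diff, and "50+" is taken at its lower bound of 50.

```python
# TFLOPS figures copied from the README table (4096x4096 FP32 GEMM, NVIDIA A100).
# Steps 6 and 7 are the values the README marks as projected estimates.
# "50+" for Step 5 is taken at its lower bound (an assumption for this sketch).
TFLOPS_BY_STEP = {1: 0.5, 2: 2.0, 3: 3.5, 4: 6.0, 5: 50.0, 6: 60.0, 7: 70.0}


def gemm_flops(m, n, k):
    """FLOPs for C = A @ B with A (m x k) and B (k x n): one multiply and one add per term."""
    return 2 * m * n * k


def measured_tflops(m, n, k, seconds):
    """Convert a measured kernel time in seconds into TFLOPS."""
    return gemm_flops(m, n, k) / seconds / 1e12


def speedups(tflops_by_step, baseline_step=1):
    """Throughput of each step relative to the baseline step."""
    base = tflops_by_step[baseline_step]
    return {step: value / base for step, value in tflops_by_step.items()}


if __name__ == "__main__":
    print(gemm_flops(4096, 4096, 4096))
    for step, ratio in speedups(TFLOPS_BY_STEP).items():
        print(f"Step {step}: {ratio:g}x")
```

Under these assumptions the ratios reproduce the table's Speedup column, which makes it easy to re-derive the column whenever a projected value is replaced by a measured one.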