Commit 60b7ee0

shijiashuaiqwencoder and Qwen-Coder committed
refactor: optimize project directory structure

Major improvements:
- Rename python/ to cuda_llm_ops/ for proper Python package structure
- Update setup.py to use correct package paths and extension naming
- Add .pyi stub file for IDE type safety
- Add __init__.py to tests/ and benchmarks/ directories
- Add .pre-commit-config.yaml for automated code quality checks
- Remove duplicate CLAUDE.md (merged into AGENTS.md)
- Remove empty specs/db/ directory (no database in project)
- Clean up requirements.txt to core dependencies only
- Update ruff configuration to allow conventional M, N, K variable names
- Reorganize documentation structure with proper categorization
- Update CI workflow paths to reflect new directory structure

Breaking changes:
- Module import path changed: use 'from cuda_llm_ops import ...'
- Extension module renamed: cuda_llm_ops._cuda_llm_ops (internal)

Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
1 parent ecefca0 commit 60b7ee0

52 files changed

Lines changed: 1361 additions & 606 deletions


.github/workflows/ci.yml

Lines changed: 2 additions & 5 deletions
```diff
@@ -31,14 +31,11 @@ jobs:
       - name: Install linters
         run: pip install ruff==0.15.10
 
-      - name: Debug ruff version
-        run: ruff --version
-
       - name: Ruff lint
-        run: ruff check python/ tests/ benchmarks/ || true
+        run: ruff check cuda_llm_ops/ tests/ benchmarks/
 
       - name: Ruff format check
-        run: ruff format python/ tests/ benchmarks/
+        run: ruff format --check cuda_llm_ops/ tests/ benchmarks/
 
   test-cpu:
     name: CPU Tests
```

.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -80,6 +80,9 @@ ipython_config.py
 # Ruff
 .ruff_cache/
 
+# Pre-commit
+.pre-commit-cache/
+
 # mypy
 .mypy_cache/
 .dmypy.json
```

.pre-commit-config.yaml

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@

```yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.15.10
    hooks:
      - id: ruff
        args: [--fix]
        files: ^(cuda_llm_ops|tests|benchmarks)/.*\.py$
      - id: ruff-format
        files: ^(cuda_llm_ops|tests|benchmarks)/.*\.py$

  - repo: https://github.com/pre-commit/mirrors-clang-format
    rev: v18.1.8
    hooks:
      - id: clang-format
        files: \.(cu|cuh|cpp|h|hpp)$

  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
        args: [--allow-multiple-documents]
      - id: check-json
      - id: check-merge-conflict
      - id: check-added-large-files
        args: [--maxkb=1024]
```
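
With this configuration in place, the hooks can be activated locally with `pre-commit install` (so they run on every `git commit`) and run against the entire tree with `pre-commit run --all-files`.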

AGENTS.md

Lines changed: 256 additions & 0 deletions
@@ -0,0 +1,256 @@

# AGENTS.md — AI Agent Workflow Instructions

This file provides instructions for AI coding assistants (Claude Code, Cursor, GitHub Copilot, etc.) working on this repository.

---

## Project Philosophy: Spec-Driven Development (SDD)

This project strictly follows the **Spec-Driven Development (SDD)** paradigm. All code implementations must use the specification documents in the `/specs` directory as the **Single Source of Truth (SSOT)**.

**Core Principle**: Specs first, code second. Never write code without a corresponding spec.

---
## Directory Context

| Directory | Purpose |
|-----------|---------|
| `/specs/product/` | Product feature definitions, user stories, and acceptance criteria |
| `/specs/rfc/` | Technical design documents, architecture decisions, and implementation plans |
| `/specs/api/` | API interface definitions (OpenAPI, GraphQL schemas, etc.) |
| `/specs/testing/` | BDD test case specifications (Gherkin `.feature` files) |
| `/docs/` | User guides, tutorials, setup guides, and developer documentation |

---
## AI Agent Workflow Instructions

When you (the AI) are asked to develop a new feature, modify an existing one, or fix a bug, **you must strictly follow this workflow without skipping any steps**:

### Step 1: Review Specs (Review First)

- **Before writing any code**, read the relevant spec documents in `/specs`:
  - Product specs: `/specs/product/*.md`
  - RFC/Architecture: `/specs/rfc/*.md`
  - API definitions: `/specs/api/*.yaml` or `/specs/api/*.md`
- If the user's request **conflicts with existing specs**:
  - **Stop coding immediately**
  - Point out the conflict clearly
  - Ask the user whether to update the spec first

### Step 2: Spec-First Update

- If this is a **new feature**, or if it requires changes to existing interfaces or database structures:
  - **You must first propose modifications** to the corresponding spec documents
  - Examples: update `openapi.yaml`, create a new RFC, or modify product requirements
- **Wait for user confirmation** of the spec changes before entering the code-writing phase
- Never assume spec changes are approved without explicit user acknowledgment

### Step 3: Code Implementation

- When writing code, **comply 100% with the spec definitions**:
  - Variable naming conventions
  - API paths and HTTP methods
  - Data types and validation rules
  - HTTP status codes and error responses
- **Do not add features not defined in the spec** (No Gold-Plating)
- If you need to make a technical decision not covered by the spec, document it and ask the user whether to add it to the spec

### Step 4: Test Against Spec

- Write unit and integration tests based on the **acceptance criteria** in `/specs`
- Ensure test cases cover all **boundary conditions** described in the specs
- For property-based tests, ensure properties align with the correctness criteria defined in `/specs/product/`
- If a test fails, reference the specific spec requirement that is not met

---
## Code Generation Rules

| Rule | Description |
|------|-------------|
| **API Changes** | Any externally exposed API changes must update `/specs/api/` |
| **Architecture Decisions** | Consult `/specs/rfc/` for conventions; do not invent design patterns |
| **No Spec, No Code** | Never write code without a corresponding spec or spec update proposal |
| **No Gold-Plating** | Do not implement features beyond what the spec defines |
| **Traceability** | Reference spec requirements in commit messages (e.g., `feat: implement REQ-3 FlashAttention`) |

---

## Why This Matters

1. **Prevent AI Hallucination**: Forcing the AI to read `/specs` first anchors its thinking to the project's actual requirements and constraints.

2. **Document-Code Synchronization**: "Modify spec first, then code" ensures documentation and code are always in sync.

3. **PR Quality**: When the AI generates Pull Requests, the implementation will be highly aligned with business logic because it was developed against the acceptance criteria you defined.

---
## Quick Reference: Common Commands

| Task | Command |
|------|---------|
| Install dependencies | `pip install -r requirements.txt` |
| Build CUDA extension | `pip install -e .` |
| Run all tests | `pytest tests/ -v` |
| Run property tests | `pytest tests/ -v -m property` |
| Run CPU-safe tests | `pytest tests/ -v -m "not cuda"` |
| Lint check | `ruff check cuda_llm_ops/ tests/ benchmarks/` |
| Format code | `ruff format cuda_llm_ops/ tests/ benchmarks/` |
| Benchmark attention | `python benchmarks/benchmark_attention.py` |
| Benchmark GEMM | `python benchmarks/benchmark_gemm.py` |

---
## Current Specifications

| Spec | Location | Description |
|------|----------|-------------|
| Product Requirements | [`/specs/product/cuda-llm-kernel-optimization.md`](specs/product/cuda-llm-kernel-optimization.md) | Feature requirements and acceptance criteria |
| Core Architecture RFC | [`/specs/rfc/0001-core-architecture.md`](specs/rfc/0001-core-architecture.md) | Technical design and architecture |
| Implementation Tasks RFC | [`/specs/rfc/0002-implementation-tasks.md`](specs/rfc/0002-implementation-tasks.md) | Implementation plan and task breakdown |

---
## Project Overview

This is a **CUDA kernel optimization library for LLM inference**, providing:

- **FlashAttention**: O(N) memory complexity with online softmax algorithm
- **Tensor Core GEMM**: Hardware-accelerated matrix multiplication (FP16/INT8)
- **High-Performance GEMM**: Register tiling and double buffering
- **PyTorch Integration**: Python bindings via pybind11

### Optimization Roadmap

```
Naive → Tiled → FlashAttention → Tensor Core
  │       │           │               │
  │       │           │               └─ Hardware acceleration
  │       │           └─ O(N) memory, online softmax
  │       └─ Shared memory tiling
  └─ Baseline (O(N²) memory)
```
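
For a rough sense of why this matters, the back-of-the-envelope sketch below (illustrative only, not library code; the batch, heads, and seq_len values are assumptions) compares the FP16 score matrix a naive kernel materializes with the per-row running max/sum statistics kept by online softmax:

```python
# Illustrative memory comparison only; shapes and dtypes are assumptions.
def naive_score_bytes(batch, heads, seq_len, bytes_per_elem=2):
    """Naive attention materializes a [batch, heads, seq_len, seq_len] FP16 score matrix."""
    return batch * heads * seq_len * seq_len * bytes_per_elem

def online_softmax_stat_bytes(batch, heads, seq_len, bytes_per_elem=4):
    """Online softmax keeps one running max and one running sum per query row (FP32)."""
    return batch * heads * seq_len * 2 * bytes_per_elem

b, h, n = 1, 32, 8192
print(f"naive scores  : {naive_score_bytes(b, h, n) / 2**30:.1f} GiB")        # 4.0 GiB
print(f"running stats : {online_softmax_stat_bytes(b, h, n) / 2**20:.1f} MiB")  # 2.0 MiB
```

At seq_len = 8192 that is roughly 4 GiB of scores versus about 2 MiB of running statistics, which is the gap the roadmap above is chasing.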
### Core Components

**CUDA Kernels (`src/`):**

| File | Description | Key Features |
|------|-------------|--------------|
| `naive_attention.cu` | Baseline attention | O(N²) memory, correctness reference |
| `tiled_attention.cu` | Tiled optimization | Shared memory, bank conflict padding |
| `flash_attention.cu` | FlashAttention | O(N) memory, online softmax, double buffering |
| `tensor_core_gemm.cu` | Tensor Core GEMM | WMMA API, FP16/INT8, tiled version |
| `hgemm_kernel.cu` | High-perf GEMM | Register tiling, double buffering, layout support |

**Header Primitives (`include/`):**

| File | Description |
|------|-------------|
| `common.cuh` | Core types (`AttentionConfig`, `GemmConfig`, `KernelMetrics`), `CUDA_CHECK` macro |
| `online_softmax.cuh` | Online softmax algorithm for FlashAttention |
| `warp_primitives.cuh` | Warp-level operations (reduce_sum, reduce_max, broadcast) |
| `shared_memory.cuh` | Shared memory management, padding utilities |
| `pipeline.cuh` | Double buffering, async copy (Ampere+), software pipeline |
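
As background for `online_softmax.cuh`, here is a NumPy sketch of the textbook online-softmax recurrence (running max, running sum, and a rescaled accumulator); it illustrates the idea, not the CUDA implementation:

```python
# NumPy illustration of the online softmax recurrence (not the CUDA code).
import numpy as np

def online_softmax_weighted_sum(scores, values):
    """Single pass over (score, value) pairs; returns softmax(scores) @ values
    without ever materializing the normalized probabilities."""
    running_max = -np.inf
    running_sum = 0.0
    acc = np.zeros(values.shape[1])
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        scale = np.exp(running_max - new_max)   # rescale old stats to the new max
        running_sum = running_sum * scale + np.exp(s - new_max)
        acc = acc * scale + np.exp(s - new_max) * v
        running_max = new_max
    return acc / running_sum

# Sanity check against the two-pass reference
rng = np.random.default_rng(0)
scores, values = rng.normal(size=16), rng.normal(size=(16, 8))
ref = (np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()) @ values
assert np.allclose(online_softmax_weighted_sum(scores, values), ref)
```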
159+
160+
**Python Bindings (`python/`):**
161+
162+
| File | Description |
163+
|------|-------------|
164+
| `bindings.cpp` | pybind11 bindings exposing all kernel functions |
165+
| `__init__.py` | Module interface, exports all functions |
166+
| `profiler.py` | Performance profiling utilities |
167+
168+
**Module name:** `cuda_llm_ops`
169+
170+
---
171+
172+
## API Reference

### Attention Functions

```python
from cuda_llm_ops import naive_attention, tiled_attention, flash_attention

# All functions share the same signature:
output = flash_attention(q, k, v, scale=0.0, is_causal=False)

# Input shape:  [batch, heads, seq_len, head_dim]
# Output shape: [batch, heads, seq_len, head_dim]
# dtype: float32 or float16
# device: CUDA, contiguous
```
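
A minimal usage sketch that follows the contract above; the shapes and the causal setting are illustrative, not requirements:

```python
import torch
from cuda_llm_ops import flash_attention

# Shapes follow the documented [batch, heads, seq_len, head_dim] layout.
q = torch.randn(2, 8, 512, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Inputs must be contiguous CUDA tensors; is_causal enables the causal mask.
out = flash_attention(q.contiguous(), k.contiguous(), v.contiguous(), is_causal=True)
assert out.shape == q.shape
```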

### GEMM Functions

```python
from cuda_llm_ops import gemm, tensor_core_gemm, tensor_core_gemm_int8

# Standard GEMM: C = alpha * A @ B + beta * C
c = gemm(a, b, alpha=1.0, beta=0.0, trans_a=False, trans_b=False)

# Tensor Core GEMM: FP16 input, FP32 output
c = tensor_core_gemm(a, b, alpha=1.0, beta=0.0)

# INT8 GEMM: INT8 input, INT32 output (requires Turing+, SM ≥ 7.2)
c = tensor_core_gemm_int8(a, b)
```
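
A hedged sanity-check sketch for the Tensor Core path; the sizes and tolerances are assumptions rather than documented requirements:

```python
import torch
from cuda_llm_ops import tensor_core_gemm

M, N, K = 256, 512, 128   # illustrative sizes (multiples of 16 for WMMA tiles)
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)

c = tensor_core_gemm(a, b)       # FP16 inputs, FP32 output per the contract above
ref = a.float() @ b.float()      # FP32 reference on the same operands
torch.testing.assert_close(c, ref, rtol=1e-2, atol=1e-2)
```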

---

## Testing

### Test Categories

| Marker | Purpose | Command |
|--------|---------|---------|
| `cuda` | Requires GPU | `pytest -m cuda` |
| `property` | Hypothesis tests | `pytest -m property` |
| `slow` | Long-running | `pytest -m "not slow"` |

### Test Structure

```python
import pytest
import torch
from hypothesis import given, settings, strategies as st
from torch.testing import assert_close

from cuda_llm_ops import flash_attention

class TestFlashAttention:
    @pytest.mark.cuda
    def test_correctness(self, device):
        """Verify output matches the PyTorch reference."""
        q, k, v = (torch.randn(2, 4, 64, 32, device=device) for _ in range(3))
        output = flash_attention(q, k, v)
        reference = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        assert_close(output, reference)

    @pytest.mark.cuda
    @pytest.mark.property
    @settings(max_examples=100, deadline=None)
    @given(batch=st.integers(1, 4), seq_len=st.integers(16, 256))
    def test_property(self, batch, seq_len, device):
        """Property-based testing with Hypothesis."""
        pass
```
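
The example above receives a `device` fixture; a hypothetical sketch of such a fixture (the project's actual `conftest.py` may differ) could look like this:

```python
# Hypothetical tests/conftest.py sketch; the real fixture may differ.
import pytest
import torch

@pytest.fixture
def device():
    if not torch.cuda.is_available():
        pytest.skip("CUDA device required for this test")
    return torch.device("cuda")
```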

---

## Code Style

- **C++/CUDA**: Follow `.clang-format`, use `snake_case` for functions
- **Python**: PEP 8, 4 spaces, max 100 chars, use f-strings
- **Commits**: Conventional Commits (`feat:`, `fix:`, `perf:`, etc.)

---

## Related Documentation

| Document | Description |
|----------|-------------|
| [API Reference](docs/api/api-en.md) | Detailed API documentation |
| [Architecture](docs/architecture/architecture-en.md) | Technical deep dive |
| [Performance Guide](docs/tutorials/performance-en.md) | Optimization strategies |
| [Contributing](CONTRIBUTING.md) | Development workflow |
