# AGENTS.md — AI Agent Workflow Instructions

This file provides instructions for AI coding assistants (Claude Code, Cursor, GitHub Copilot, etc.) working on this repository.

---

## Project Philosophy: Spec-Driven Development (SDD)

This project strictly follows the **Spec-Driven Development (SDD)** paradigm. All code must be implemented against the specification documents in the `/specs` directory, which serve as the **Single Source of Truth (SSOT)**.

**Core Principle**: Specs first, code second. Never write code without a corresponding spec.

---

## Directory Context

| Directory | Purpose |
|-----------|---------|
| `/specs/product/` | Product feature definitions, user stories, and acceptance criteria |
| `/specs/rfc/` | Technical design documents, architecture decisions, and implementation plans |
| `/specs/api/` | API interface definitions (OpenAPI, GraphQL schemas, etc.) |
| `/specs/db/` | Database and schema specifications |
| `/specs/testing/` | BDD test case specifications (Gherkin `.feature` files) |
| `/docs/` | User guides, tutorials, setup guides, and developer documentation |

---

## AI Agent Workflow Instructions

When you (the AI agent) are asked to develop a new feature, modify an existing one, or fix a bug, **you must strictly follow this workflow without skipping any steps**:

### Step 1: Review Specs First

- **Before writing any code**, read the relevant spec documents in `/specs`:
  - Product specs: `/specs/product/*.md`
  - RFC/Architecture: `/specs/rfc/*.md`
  - API definitions: `/specs/api/*.yaml` or `/specs/api/*.md`
- If the user's request **conflicts with existing specs**:
  - **Stop coding immediately**
  - Point out the conflict clearly
  - Ask the user whether to update the spec first

### Step 2: Spec-First Update

- If this is a **new feature**, or if it requires changes to existing interfaces/database structures:
  - **You must first propose modifications** to the corresponding spec documents
  - Examples: update `openapi.yaml`, create a new RFC, or modify product requirements
- **Wait for user confirmation** of the spec changes before entering the code-writing phase
- Never assume spec changes are approved without explicit user acknowledgment

### Step 3: Code Implementation

- When writing code, **comply 100% with the spec definitions** (see the sketch after this list):
  - Variable naming conventions
  - API paths and HTTP methods
  - Data types and validation rules
  - HTTP status codes and error responses
- **Do not add features not defined in the spec** (No Gold-Plating)
- If you need to make a technical decision not covered by the spec, document it and ask the user whether to add it to the spec
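
A minimal sketch of what spec compliance can look like at a function boundary. The helper name and error messages are hypothetical; the dtype, device, and contiguity rules mirror the input constraints documented in the API Reference below:

```python
import torch

def validate_attention_inputs(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> None:
    """Hypothetical guard enforcing the documented input constraints:
    [batch, heads, seq_len, head_dim], float32/float16, CUDA, contiguous."""
    for name, t in (("q", q), ("k", k), ("v", v)):
        if t.dim() != 4:
            raise ValueError(f"{name} must be [batch, heads, seq_len, head_dim], got {t.dim()}D")
        if t.dtype not in (torch.float32, torch.float16):
            raise TypeError(f"{name} must be float32 or float16, got {t.dtype}")
        if not t.is_cuda or not t.is_contiguous():
            raise ValueError(f"{name} must be a contiguous CUDA tensor")
    if not (q.shape == k.shape == v.shape):
        raise ValueError("q, k, v must share the same shape")
```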

### Step 4: Test Against Spec

- Write unit and integration tests based on the **acceptance criteria** in `/specs`
- Ensure test cases cover all **boundary conditions** described in the specs (illustrated below)
- For property-based tests, ensure properties align with the correctness criteria defined in `/specs/product/`
- If a test fails, reference the specific spec requirement that is not met
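
For example, a spec-derived test might check an analytic property at the smallest sequence length the property tests use. The `REQ-3` ID below follows the traceability convention from the Code Generation Rules, but the criterion itself is illustrative:

```python
import pytest
import torch
from cuda_llm_ops import flash_attention

@pytest.mark.cuda
def test_flash_attention_linear_in_v():
    """Illustrative spec-derived check (hypothetical REQ-3): softmax weights
    do not depend on V, so attention output must be linear in V."""
    q = torch.randn(1, 2, 16, 32, device="cuda")
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    out1 = flash_attention(q, k, v)
    out2 = flash_attention(q, k, 2.0 * v)
    torch.testing.assert_close(out2, 2.0 * out1)
```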

---

## Code Generation Rules

| Rule | Description |
|------|-------------|
| **API Changes** | Any externally exposed API changes must update `/specs/api/` |
| **Architecture Decisions** | Consult `/specs/rfc/` for conventions; do not invent design patterns |
| **No Spec, No Code** | Never write code without a corresponding spec or spec update proposal |
| **No Gold-Plating** | Do not implement features beyond what the spec defines |
| **Traceability** | Reference spec requirements in commit messages (e.g., `feat: implement REQ-3 FlashAttention`) |

---

## Why This Matters

1. **Prevent AI Hallucination**: Forcing the AI to read `/specs` first anchors its thinking to the project's actual requirements and constraints.

2. **Document-Code Synchronization**: "Modify spec first, then code" ensures documentation and code are always in sync.

3. **PR Quality**: When the AI generates Pull Requests, the implementation will be highly aligned with business logic because it was developed against the acceptance criteria you defined.

---

## Quick Reference: Common Commands

| Task | Command |
|------|---------|
| Install dependencies | `pip install -r requirements.txt` |
| Build CUDA extension | `pip install -e .` |
| Run all tests | `pytest tests/ -v` |
| Run property tests | `pytest tests/ -v -m property` |
| Run CPU-safe tests | `pytest tests/ -v -m "not cuda"` |
| Lint check | `ruff check python/ tests/ benchmarks/` |
| Format code | `ruff format python/ tests/ benchmarks/` |
| Benchmark attention | `python benchmarks/benchmark_attention.py` |
| Benchmark GEMM | `python benchmarks/benchmark_gemm.py` |

---

## Current Specifications

| Spec | Location | Description |
|------|----------|-------------|
| Product Requirements | [`/specs/product/cuda-llm-kernel-optimization.md`](specs/product/cuda-llm-kernel-optimization.md) | Feature requirements and acceptance criteria |
| Core Architecture RFC | [`/specs/rfc/0001-core-architecture.md`](specs/rfc/0001-core-architecture.md) | Technical design and architecture |
| Implementation Tasks RFC | [`/specs/rfc/0002-implementation-tasks.md`](specs/rfc/0002-implementation-tasks.md) | Implementation plan and task breakdown |

---

## Project Overview

This is a **CUDA kernel optimization library for LLM inference**, providing:

- **FlashAttention**: O(N) memory complexity with online softmax algorithm
- **Tensor Core GEMM**: Hardware-accelerated matrix multiplication (FP16/INT8)
- **High-Performance GEMM**: Register tiling and double buffering
- **PyTorch Integration**: Python bindings via pybind11

### Optimization Roadmap

```
Naive → Tiled → FlashAttention → Tensor Core
  │       │           │              │
  │       │           │              └─ Hardware acceleration
  │       │           └─ O(N) memory, online softmax
  │       └─ Shared memory tiling
  └─ Baseline (O(N²) memory)
```
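
The jump from tiled attention to FlashAttention rests on the online softmax recurrence: a running max and a running normalizer let softmax be computed in a single streaming pass, so the full N×N score matrix never has to be materialized. A minimal NumPy sketch of the recurrence (illustrative only; the CUDA version lives in `online_softmax.cuh`):

```python
import numpy as np

def online_softmax(scores: np.ndarray) -> np.ndarray:
    """Streaming softmax over one row of attention scores.

    Keeps only a running max (m) and running normalizer (d), the same
    O(1)-state recurrence FlashAttention applies per query row.
    """
    m, d = -np.inf, 0.0
    for x in scores:
        m_new = max(m, x)
        # Rescale the old normalizer when the max changes, then add the new term.
        d = d * np.exp(m - m_new) + np.exp(x - m_new)
        m = m_new
    return np.exp(scores - m) / d

# Agrees with the standard two-pass softmax:
s = np.random.randn(16)
ref = np.exp(s - s.max()) / np.exp(s - s.max()).sum()
assert np.allclose(online_softmax(s), ref)
```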

### Core Components

**CUDA Kernels (`src/`):**

| File | Description | Key Features |
|------|-------------|--------------|
| `naive_attention.cu` | Baseline attention | O(N²) memory, correctness reference |
| `tiled_attention.cu` | Tiled optimization | Shared memory, bank conflict padding |
| `flash_attention.cu` | FlashAttention | O(N) memory, online softmax, double buffering |
| `tensor_core_gemm.cu` | Tensor Core GEMM | WMMA API, FP16/INT8, tiled version |
| `hgemm_kernel.cu` | High-perf GEMM | Register tiling, double buffering, layout support |

**Header Primitives (`include/`):**

| File | Description |
|------|-------------|
| `common.cuh` | Core types (`AttentionConfig`, `GemmConfig`, `KernelMetrics`), `CUDA_CHECK` macro |
| `online_softmax.cuh` | Online softmax algorithm for FlashAttention |
| `warp_primitives.cuh` | Warp-level operations (reduce_sum, reduce_max, broadcast) |
| `shared_memory.cuh` | Shared memory management, padding utilities |
| `pipeline.cuh` | Double buffering, async copy (Ampere+), software pipeline |

**Python Bindings (`python/`):**

| File | Description |
|------|-------------|
| `bindings.cpp` | pybind11 bindings exposing all kernel functions |
| `__init__.py` | Module interface, exports all functions |
| `profiler.py` | Performance profiling utilities |

**Module name:** `cuda_llm_ops`

---

## API Reference

### Attention Functions

```python
from cuda_llm_ops import naive_attention, tiled_attention, flash_attention

# All functions share the same signature:
output = flash_attention(q, k, v, scale=0.0, is_causal=False)

# Input shape: [batch, heads, seq_len, head_dim]
# Output shape: [batch, heads, seq_len, head_dim]
# dtype: float32 or float16
# device: CUDA, contiguous
```
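
For instance, a causal decoder-style call might look like this (shapes and dtype are illustrative):

```python
import torch
from cuda_llm_ops import flash_attention

# [batch=2, heads=8, seq_len=512, head_dim=64], FP16, CUDA, contiguous
q = torch.randn(2, 8, 512, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attention(q, k, v, is_causal=True)  # causal mask for autoregressive decoding
assert out.shape == q.shape
```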

### GEMM Functions

```python
from cuda_llm_ops import gemm, tensor_core_gemm, tensor_core_gemm_int8

# Standard GEMM: C = alpha * A @ B + beta * C
c = gemm(a, b, alpha=1.0, beta=0.0, trans_a=False, trans_b=False)

# Tensor Core GEMM: FP16 input, FP32 output
c = tensor_core_gemm(a, b, alpha=1.0, beta=0.0)

# INT8 GEMM: INT8 input, INT32 output (requires Turing+ SM≥7.2)
c = tensor_core_gemm_int8(a, b)
```
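
A quick way to sanity-check results against PyTorch (shapes and tolerances are illustrative; FP16 Tensor Core accumulation will not match an FP32 reference bit-for-bit):

```python
import torch
from cuda_llm_ops import tensor_core_gemm

a = torch.randn(256, 128, device="cuda", dtype=torch.float16)
b = torch.randn(128, 512, device="cuda", dtype=torch.float16)

c = tensor_core_gemm(a, b)   # FP16 inputs, FP32 output
ref = a.float() @ b.float()  # FP32 reference on the same device
torch.testing.assert_close(c, ref, rtol=1e-2, atol=1e-2)
```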

---

## Testing

### Test Categories

| Marker | Purpose | Command |
|--------|---------|---------|
| `cuda` | Requires GPU | `pytest -m cuda` |
| `property` | Hypothesis tests | `pytest -m property` |
| `slow` | Long-running | `pytest -m "not slow"` |

### Test Structure

```python
import pytest
import torch
from torch.testing import assert_close
from hypothesis import given, settings, strategies as st

from cuda_llm_ops import flash_attention

class TestFlashAttention:
    @pytest.mark.cuda
    def test_correctness(self, device):
        """Verify output matches the PyTorch reference."""
        q = torch.randn(2, 4, 64, 32, device=device)
        k = torch.randn_like(q)
        v = torch.randn_like(q)
        output = flash_attention(q, k, v)
        reference = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        assert_close(output, reference)

    @pytest.mark.cuda
    @pytest.mark.property
    @settings(max_examples=100, deadline=None)
    @given(batch=st.integers(1, 4), seq_len=st.integers(16, 256))
    def test_property(self, batch, seq_len, device):
        """Property-based testing with Hypothesis."""
        pass
```

---

## Code Style

- **C++/CUDA**: Follow `.clang-format`, use `snake_case` for functions
- **Python**: PEP 8, 4 spaces, max 100 chars, use f-strings
- **Commits**: Conventional Commits (`feat:`, `fix:`, `perf:`, etc.)

---

## Related Documentation

| Document | Description |
|----------|-------------|
| [API Reference](docs/api/api-en.md) | Detailed API documentation |
| [Architecture](docs/architecture/architecture-en.md) | Technical deep dive |
| [Performance Guide](docs/tutorials/performance-en.md) | Optimization strategies |
| [Contributing](CONTRIBUTING.md) | Development workflow |