
Commit eb04f93 (parent: a57a64c)
Author: shijiashuai

feat: align runtime surface with GEMM baseline implementation

- Update README/docs to match actual Python binding surface (elementwise, reduction, gemm)
- Implement CUTLASS GEMM baseline for supported configurations
- Replace placeholder benchmark with real GEMM execution
- Add comprehensive runtime surface contract tests
- Sync OpenSpec delta specs to main specs
- Archive completed change

22 files changed: 1315 additions, 168 deletions

README.md — 32 additions, 15 deletions

@@ -156,7 +156,7 @@ ctest --output-on-failure
 ./examples/gemm/gemm_benchmark

 # Python example (if bindings enabled)
-python ../examples/python/basic_usage.py
+python3 ../examples/python/basic_usage.py
 ```

 <details>

@@ -279,34 +279,49 @@ hpc-ai-optimization-lab/
 #include "common/tensor.cuh"

 // Allocate GPU tensors
-auto A = hpc::common::make_tensor<float>({1024, 1024});
-auto B = hpc::common::make_tensor<float>({1024, 1024});
-auto C = hpc::common::make_tensor<float>({1024, 1024});
+constexpr int M = 1024;
+constexpr int N = 1024;
+constexpr int K = 1024;

-// Launch optimized kernel
-hpc::gemm::gemm<float, hpc::gemm::OptLevel::Advanced>(
-    A.data(), B.data(), C.data(), 1024, 1024, 1024);
+hpc::Tensor<float> A(M * K);
+hpc::Tensor<float> B(K * N);
+hpc::Tensor<float> C(M * N);
+C.zero();
+
+// Launch the current shared-memory-tiling GEMM path
+hpc::gemm::gemm<float, hpc::gemm::GemmOpt::SharedMemTiling>(
+    A.data(), B.data(), C.data(), M, N, K);

 // Automatic memory cleanup when tensors go out of scope
 ```

 ### Python API

+Current Python bindings expose `elementwise`, `reduction`, and `gemm`.
+
 ```python
 import hpc_ai_opt
-import numpy as np
+import torch
+
+# Create CUDA tensors
+a = torch.randn(128, 64, device="cuda", dtype=torch.float32)
+b = torch.randn(64, 96, device="cuda", dtype=torch.float32)
+c = torch.zeros(128, 96, device="cuda", dtype=torch.float32)
+
+# Execute the currently shipped GEMM binding
+hpc_ai_opt.gemm.matmul(a, b, c, 128, 96, 64, 1.0, 0.0)

-# Create input data
-A = np.random.randn(1024, 1024).astype(np.float32)
-B = np.random.randn(1024, 1024).astype(np.float32)
+print(c.shape)
+```

-# Execute optimized GEMM
-C = hpc_ai_opt.gemm(A, B)
+Current phase benchmark CLI:

-print(f"Result shape: {C.shape}")
-print(f"Performance: {hpc_ai_opt.last_tflops:.1f} TFLOPS")
+```bash
+python3 python/benchmark/benchmark.py --suite gemm --sizes 256,512 --output results.json
+```

+The Python benchmark entrypoint currently wires the GEMM suite by default and emits reports only from measured result sets.
+
 ---

 ## Testing

@@ -378,6 +393,8 @@ The repository is in a finishing-and-hardening phase.
 | Attention ||| - | - | - | Stable |
 | Quantization ||| - || 🚧 | Stable |

+The support matrix describes the C++/CUDA core. In this phase, Python bindings cover `elementwise`, `reduction`, and `gemm` only.
+
 🚧 = Partial support / In development

 ---
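The `--sizes 256,512` flag in the new benchmark command implies a size-list parsing step somewhere in the CLI. A minimal sketch of how such a flag could be handled; the `parse_sizes` helper and the assumption that each size denotes a square M=N=K GEMM shape are illustrative, not taken from the repository:

```python
import argparse


def parse_sizes(spec: str) -> list[tuple[int, int, int]]:
    """Parse a comma-separated size list like '256,512' into (M, N, K)
    GEMM shapes. Square shapes are an assumption of this sketch."""
    shapes = []
    for token in spec.split(","):
        n = int(token.strip())
        if n <= 0:
            raise ValueError(f"benchmark size must be positive, got {n}")
        shapes.append((n, n, n))
    return shapes


parser = argparse.ArgumentParser()
parser.add_argument("--suite", default="gemm")
parser.add_argument("--sizes", type=parse_sizes, default=parse_sizes("256,512"))
args = parser.parse_args(["--suite", "gemm", "--sizes", "256,512"])
print(args.suite, args.sizes)  # gemm [(256, 256, 256), (512, 512, 512)]
```

Rejecting non-positive sizes at parse time keeps the later "reports only from measured result sets" contract honest: a malformed request fails before any result object exists.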

README.zh-CN.md — 32 additions, 15 deletions

@@ -156,7 +156,7 @@ ctest --output-on-failure
 ./examples/gemm/gemm_benchmark

 # Python example (if bindings enabled)
-python ../examples/python/basic_usage.py
+python3 ../examples/python/basic_usage.py
 ```

 <details>

@@ -279,13 +279,18 @@ hpc-ai-optimization-lab/
 #include "common/tensor.cuh"

 // Allocate GPU tensors
-auto A = hpc::common::make_tensor<float>({1024, 1024});
-auto B = hpc::common::make_tensor<float>({1024, 1024});
-auto C = hpc::common::make_tensor<float>({1024, 1024});
+constexpr int M = 1024;
+constexpr int N = 1024;
+constexpr int K = 1024;

-// Launch the optimized kernel
-hpc::gemm::gemm<float, hpc::gemm::OptLevel::Advanced>(
-    A.data(), B.data(), C.data(), 1024, 1024, 1024);
+hpc::Tensor<float> A(M * K);
+hpc::Tensor<float> B(K * N);
+hpc::Tensor<float> C(M * N);
+C.zero();
+
+// Launch the current shared-memory-tiling GEMM path
+hpc::gemm::gemm<float, hpc::gemm::GemmOpt::SharedMemTiling>(
+    A.data(), B.data(), C.data(), M, N, K);

 // Memory is released automatically when tensors go out of scope
 ```

@@ -294,19 +299,29 @@ hpc::gemm::gemm<float, hpc::gemm::OptLevel::Advanced>(

 ```python
 import hpc_ai_opt
-import numpy as np
+import torch
+
+# Create CUDA tensors
+a = torch.randn(128, 64, device="cuda", dtype=torch.float32)
+b = torch.randn(64, 96, device="cuda", dtype=torch.float32)
+c = torch.zeros(128, 96, device="cuda", dtype=torch.float32)
+
+# Call the currently shipped GEMM binding
+hpc_ai_opt.gemm.matmul(a, b, c, 128, 96, 64, 1.0, 0.0)
+
+print(c.shape)
+```

-# Create input data
-A = np.random.randn(1024, 1024).astype(np.float32)
-B = np.random.randn(1024, 1024).astype(np.float32)
+Current Python bindings expose `elementwise`, `reduction`, and `gemm`.

-# Execute the optimized GEMM
-C = hpc_ai_opt.gemm(A, B)
+Benchmark CLI for the current phase:

-print(f"Result shape: {C.shape}")
-print(f"Performance: {hpc_ai_opt.last_tflops:.1f} TFLOPS")
+```bash
+python3 python/benchmark/benchmark.py --suite gemm --sizes 256,512 --output results.json
+```

+The Python benchmark entrypoint currently wires only the GEMM suite by default, and generates reports only from real measured results.
+
 ---

 ## Testing

@@ -380,6 +395,8 @@ git push origin feature/my-optimization

 🚧 = Partial support / In development

+The support matrix describes the C++/CUDA core capabilities. In the current phase, Python bindings cover only `elementwise`, `reduction`, and `gemm`.
+
 ---

 ## License

docs/404.md — 1 addition, 1 deletion

@@ -39,7 +39,7 @@ onMounted(() => {
 <ul>
 <li><a href="/en/guide/installation">Installation Guide</a></li>
 <li><a href="/en/guide/quick-start">Quick Start</a></li>
-<li><a href="/en/API_REFERENCE">API Reference</a></li>
+<li><a href="/en/api/index">API Reference</a></li>
 <li><a href="/en/guide/gemm">GEMM Optimization</a></li>
 <li><a href="/en/guide/profiling">Profiling Guide</a></li>
 </ul>

examples/python/basic_usage.py — 2 additions, 1 deletion

@@ -63,7 +63,7 @@ def example_gemm(device: torch.device) -> None:
     m, n, k = 128, 96, 64
     a = torch.randn(m, k, device=device, dtype=torch.float32)
     b = torch.randn(k, n, device=device, dtype=torch.float32)
-    c = torch.empty(m, n, device=device, dtype=torch.float32)
+    c = torch.zeros(m, n, device=device, dtype=torch.float32)

     opt.gemm.matmul(a, b, c, m, n, k, 1.0, 0.0)
     torch.testing.assert_close(c, a @ b, rtol=1e-4, atol=1e-4)

@@ -73,6 +73,7 @@ def example_gemm(device: torch.device) -> None:
 def main() -> None:
     device = require_cuda()
     print("Running hpc_ai_opt examples on", torch.cuda.get_device_name(device))
+    print("Current shipped modules: elementwise, reduction, gemm")
     example_elementwise(device)
     example_reduction(device)
     example_gemm(device)
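The switch from `torch.empty` to `torch.zeros` matters because the `matmul(a, b, c, m, n, k, alpha, beta)` signature follows BLAS-style `C = alpha*A@B + beta*C` semantics, under which `c` is read as well as written whenever `beta != 0`. A pure-Python reference of that contract, as a sketch for illustration rather than the repository's kernel:

```python
def gemm_ref(a, b, c, m, n, k, alpha=1.0, beta=0.0):
    """BLAS-style GEMM reference:
    c[i][j] = alpha * sum_p(a[i][p] * b[p][j]) + beta * c[i][j]
    a is m*k, b is k*n, c is m*n, all as nested lists."""
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = alpha * acc + beta * c[i][j]
    return c


# With beta = 0 the prior contents of c are multiplied by zero, so zeroing is
# not strictly required -- unless c held NaN/Inf garbage from an uninitialized
# buffer, since 0.0 * nan is still nan. Zero-initializing c avoids that hazard.
a = [[1.0, 2.0], [3.0, 4.0]]  # 2x2
b = [[5.0, 6.0], [7.0, 8.0]]  # 2x2
c = [[0.0, 0.0], [0.0, 0.0]]
gemm_ref(a, b, c, 2, 2, 2)
print(c)  # [[19.0, 22.0], [43.0, 50.0]]
```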
New file — 2 additions, 0 deletions

+schema: spec-driven
+created: 2026-04-27
New file — 95 additions, 0 deletions

## Context

The repository already contains the ingredients for a credible runtime-facing CUDA lab, but its public surface currently overstates what is actually wired together. The README shows a simplified Python API that the nanobind module does not export, the Python benchmark framework contains device modeling and report generation but still stops at placeholder execution, and the existing GEMM API advertises a CUTLASS comparison hook without implementing it. This creates a trust gap in exactly the area the repository now wants to harden: what users can actually run, compare, and learn from.

This change is intentionally bounded. It targets the first runtime-facing slice that can be made truthful and demonstrable without expanding into a broad new product surface: shipped Python API alignment, executable benchmark entrypoints, and a GEMM baseline path using the CUTLASS dependency that is already fetched by CMake.

## Goals / Non-Goals

**Goals:**
- Make the documented Python surface match the nanobind module and shipped examples.
- Turn the benchmark entrypoint into a real, bounded execution path for GEMM-first comparisons.
- Implement a CUTLASS-backed GEMM baseline/fallback path that can be consumed by benchmarks and comparison tests.
- Add enough validation that docs, bindings, examples, and benchmark behavior stay aligned.
- Express the above as OpenSpec capabilities so later runtime work can build on a stable contract.

**Non-Goals:**
- Expanding Python bindings to every existing CUDA module in this change.
- Implementing Hopper/TMA/FP8 systemization, paged attention, or broader inference-runtime features.
- Rewriting the project into a packaging-first Python distribution in this phase.
- Replacing the educational GEMM path with CUTLASS; CUTLASS is introduced as baseline/fallback, not as the primary pedagogical implementation.

## Decisions

### 1. Split the change into a truthfulness surface and a benchmark baseline within one bounded runtime phase

**Decision:** This change will combine the public-surface corrections (README, examples, support descriptions, benchmark entrypoint) with the minimum kernel-side work needed to make those corrections meaningful: a real GEMM benchmark path and a CUTLASS baseline.

**Rationale:** Doing only documentation cleanup would leave the benchmark surface weak, while doing only CUTLASS work would still leave the repository telling an inaccurate story. Bundling these two pieces creates one coherent "trust restoration + baseline establishment" change.

**Alternatives considered:**
- **Docs-only first:** lower implementation cost, but still leaves the benchmark path unconvincing.
- **CUTLASS-only first:** improves internals, but does not fix the current public mismatch.

### 2. Treat the current nanobind module as the source of truth, then expand only what this change explicitly adds

**Decision:** The shipped Python surface will be defined by the actual exported module structure (`elementwise`, `reduction`, `gemm`) plus any deliberate additions made in this bounded change. Documentation and examples will be rewritten to that surface rather than to aspirational convenience APIs.

**Rationale:** The current trust issue comes from docs getting ahead of implementation. Reversing that relationship makes later growth safer.

**Alternatives considered:**
- **Add convenience APIs just to preserve the README shape:** rejected because it would expand scope unnecessarily.
- **Keep aspirational examples with disclaimers:** rejected because they still blur what is shipped.

### 3. Implement CUTLASS as a comparison path, not the default GEMM teaching path

**Decision:** `gemm_cutlass()` will be implemented as a dedicated baseline/fallback path that sits alongside the existing staged GEMM implementations, and benchmarks will compare against it explicitly.

**Rationale:** The repository's educational value depends on keeping the 7-step GEMM progression visible. CUTLASS should strengthen credibility and benchmarking, not collapse the learning narrative into a single black-box kernel.

**Alternatives considered:**
- **Replace advanced GEMM paths with CUTLASS:** rejected because it weakens the lab's pedagogical structure.
- **Leave CUTLASS unused:** rejected because the dependency already exists and the missing baseline is currently a notable gap.

### 4. Make the benchmark CLI execute real workloads with explicit bounded scope

**Decision:** The benchmark framework entrypoint will execute real kernels for the suites supported by this phase, starting with GEMM and any already-wired adjacent comparisons, and it will emit JSON/HTML/chart outputs only from real result sets.

**Rationale:** The repository already has result formatting, roofline analysis, and report generation. The missing value is not more framework code but actual kernel invocation and a truthful execution contract.

**Alternatives considered:**
- **Keep the framework as a library and defer the CLI:** rejected because the user-facing gap is specifically at the entrypoint.
- **Support every suite immediately:** rejected because it would turn a bounded change into a broad expansion.

### 5. Use tests and examples to lock the public/runtime contract together

**Decision:** This change will add or adjust tests and examples so README snippets, Python examples, benchmark execution, and the CUTLASS path are all validated against the same bounded contract.

**Rationale:** Public-surface drift is likely to recur unless it is enforced through executable checks close to the changed code.

**Alternatives considered:**
- **Rely on manual review only:** rejected because the mismatch already survived without machine-enforced checks.

## Risks / Trade-offs

- **[Risk] CUTLASS integration adds complexity or architecture-specific constraints** → Mitigation: keep the initial path limited to bounded GEMM shapes/types already supported by the repository and document any constraints explicitly.
- **[Risk] Benchmark CLI grows into a large product surface** → Mitigation: scope this phase to real execution for a bounded suite rather than to universal benchmark coverage.
- **[Risk] README truthfulness work looks like a feature reduction** → Mitigation: pair the doc alignment with an actually improved benchmark and baseline story so the change clearly increases credibility.
- **[Risk] Python ergonomics remain limited after alignment** → Mitigation: document that broader Python-surface expansion is a later sequential change and avoid overloading this one.
- **[Risk] Existing tests do not fully cover public-surface drift** → Mitigation: add focused tests or executable example coverage around the changed API and benchmark paths.

## Migration Plan

1. Define the runtime-facing OpenSpec requirements for the Python surface and GEMM baseline/benchmark contract.
2. Align README, support descriptions, and Python examples with the bounded shipped surface.
3. Implement `gemm_cutlass()` and any supporting benchmark wiring needed for real comparisons.
4. Turn the benchmark entrypoint into a real execution path for the supported suite(s).
5. Add or update validation around bindings, examples, and benchmark outputs.
6. Leave later Python-surface expansion and advanced Hopper/attention work for subsequent changes.

Rollback is straightforward at the repository level: the public-surface edits and benchmark wiring can be reverted together if the CUTLASS or benchmark path proves too unstable, restoring the previous bounded behavior while keeping the OpenSpec history.

## Open Questions

- Whether the first benchmarked GEMM baseline should be limited to `float` or immediately include the repository's current half/int8 support envelope.
- Whether benchmark result artifacts should be committed only as documentation/examples or remain purely generated output.
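Decision 4's "explicit bounded scope" can be sketched as a registry that dispatches only to suites wired to real execution and raises on anything else; the names `SUITES`, `run_gemm`, and `run_suite` are illustrative assumptions, not the repository's actual CLI code:

```python
import time


def run_gemm(m: int, n: int, k: int) -> dict:
    """Stand-in workload; the real CLI would launch and synchronize the
    CUDA kernel and its CUTLASS baseline here, then record timings."""
    start = time.perf_counter()
    # ... kernel launch + device synchronize would happen here ...
    elapsed = time.perf_counter() - start
    return {"suite": "gemm", "shape": (m, n, k), "seconds": elapsed}


# Only suites wired to real execution in this phase appear in the registry.
SUITES = {"gemm": run_gemm}


def run_suite(name: str, shapes: list[tuple[int, int, int]]) -> list[dict]:
    if name not in SUITES:
        # Fail clearly instead of printing placeholder-success guidance.
        raise ValueError(
            f"suite '{name}' is not wired to real execution; "
            f"supported: {sorted(SUITES)}"
        )
    return [SUITES[name](*shape) for shape in shapes]


results = run_suite("gemm", [(256, 256, 256), (512, 512, 512)])
print(len(results))  # 2
```

Keeping the registry as the single source of truth means "supported suite" is defined by code, so the CLI cannot drift back into advertising suites it does not run.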
New file — 27 additions, 0 deletions

## Why

The repository currently presents a more complete Python and benchmarking surface than it actually ships: the README shows a Python API shape that does not match the current nanobind module, support claims blur the line between C++-only and Python-accessible functionality, and the benchmark framework entrypoint is still placeholder-driven. This change is needed now to restore trust in the public surface while establishing a concrete GEMM baseline path that can support credible performance work.

## What Changes

- Align README Python examples, support claims, and user-facing docs with the bindings and kernels that are actually shipped.
- Define and implement the supported Python kernel surface for the first bounded runtime-facing phase, including how examples and docs must reflect that surface.
- Turn the benchmark framework from a placeholder entrypoint into a real executable path for bounded kernel suites, starting with GEMM and adjacent comparison hooks.
- Add a CUTLASS-backed GEMM baseline/fallback path and make it available to benchmark and comparison workflows.
- Update tests and validation surfaces so documentation, bindings, and benchmark behavior remain consistent after the change.

## Capabilities

### New Capabilities
- `python-kernel-surface`: Define the shipped Python-facing kernel API, its documented examples, and the consistency rules between bindings, examples, and public claims.
- `gemm-baseline-benchmarking`: Define a real GEMM comparison and benchmarking path, including a CUTLASS baseline/fallback and reproducible benchmark outputs.

### Modified Capabilities
- `documentation-rationalization`: Tighten active documentation requirements so runtime-facing README examples and support descriptions reflect the actually shipped API surface rather than aspirational or placeholder behavior.
- `stabilization-sweep`: Capture runtime-surface trust issues discovered in docs, bindings, and benchmark entrypoints as first-class stabilization work rather than leaving them as informal follow-up.

## Impact

- Affected areas include `README.md`, Python examples and bindings under `python/` and `examples/python/`, GEMM implementation files under `src/gemm/`, benchmark tooling, tests, and the new OpenSpec capability set for runtime-facing behavior.
- This change introduces a CUTLASS-backed baseline path on top of the existing dependency already fetched through CMake, but it does not broaden the project into a new runtime family beyond the bounded GEMM and public-surface scope.
- Public-facing behavior will become stricter: documentation and examples may be reduced or rewritten where the current repository overstates shipped functionality.
New file — 12 additions, 0 deletions

## ADDED Requirements

### Requirement: Runtime-facing examples reflect the shipped API surface
Active runtime-facing documentation SHALL show only examples, signatures, and support descriptions that match the repository's shipped bindings and executable benchmark surface for the current change.

#### Scenario: Python API documentation is reviewed
- **WHEN** README or active guides present Python usage or performance-reporting examples
- **THEN** the examples reflect the actual binding structure, callable signatures, and supported reporting outputs shipped by the repository

#### Scenario: Runtime support is described
- **WHEN** active documentation summarizes module or benchmark availability
- **THEN** it avoids aspirational claims and describes unsupported or C++-only surfaces in a way that does not mislead Python users
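The scenarios above lend themselves to an executable check: extract the module names a README sentence claims and compare them against the binding's actual submodules. A sketch using a stand-in namespace object; the README string and the `SimpleNamespace` stub are illustrative, and a real contract test would import `hpc_ai_opt` directly:

```python
import re
from types import SimpleNamespace

# The README sentence under test (here inlined; a real test would read the file).
readme_line = "Current Python bindings expose `elementwise`, `reduction`, and `gemm`."
claimed = set(re.findall(r"`(\w+)`", readme_line))

# Stand-in for the real nanobind module; a real test would use the import.
hpc_ai_opt = SimpleNamespace(elementwise=object(), reduction=object(), gemm=object())
shipped = {name for name in vars(hpc_ai_opt) if not name.startswith("_")}

# Symmetric difference surfaces both over-claimed and undocumented modules.
assert claimed == shipped, f"docs/bindings drift: {sorted(claimed ^ shipped)}"
print("surface aligned:", sorted(shipped))
```

Because the check compares sets in both directions, it catches a README that over-claims a module as well as a shipped module the docs forgot to mention.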
New file — 30 additions, 0 deletions

## ADDED Requirements

### Requirement: GEMM baseline path is executable through a CUTLASS-backed implementation
The repository SHALL provide an executable GEMM baseline path backed by CUTLASS so benchmark and comparison workflows can run against a non-placeholder reference implementation.

#### Scenario: GEMM baseline is requested
- **WHEN** benchmark or comparison code requests the CUTLASS GEMM path
- **THEN** the repository executes a real CUTLASS-backed GEMM implementation for the supported type and shape envelope of this change

#### Scenario: CUTLASS baseline is unavailable for a requested case
- **WHEN** a benchmark or test requests an unsupported CUTLASS configuration
- **THEN** the repository fails clearly or scopes the request explicitly rather than silently pretending to run a reference baseline

### Requirement: Benchmark entrypoints execute real workloads for supported suites
The benchmark framework SHALL run real kernel workloads for the suites supported by this change and SHALL emit reports only from real measured result sets.

#### Scenario: Benchmark CLI is invoked for a supported suite
- **WHEN** a user runs the benchmark entrypoint for a supported benchmark suite in this change
- **THEN** the CLI executes the actual kernel and baseline functions, measures results, and can emit the configured JSON, HTML, or chart outputs from those measurements

#### Scenario: Benchmark CLI is invoked for an unsupported suite
- **WHEN** the benchmark entrypoint is asked to run a suite that this change does not wire to real execution
- **THEN** it reports the unsupported state explicitly instead of printing placeholder-success guidance

### Requirement: Public benchmark examples align with the executable benchmark contract
Examples and documentation SHALL describe the benchmark surface in terms of the workloads and outputs that the repository can actually execute in this change.

#### Scenario: Benchmark documentation is updated
- **WHEN** active docs or examples describe benchmark usage
- **THEN** they point to supported benchmark commands and output shapes that match the implemented CLI behavior
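The throughput figures in those measured result sets follow from the timing alone: a GEMM of shape M×N×K performs 2·M·N·K floating-point operations (one multiply and one add per inner-product term). A hedged sketch of that reduction; the function name is illustrative, not the repository's reporting code:

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved throughput of one GEMM run: 2*M*N*K flops over measured time."""
    if seconds <= 0:
        # Reports must come from real, positive measurements.
        raise ValueError(f"invalid measured time: {seconds}")
    return (2.0 * m * n * k) / seconds / 1e12


# Example: a 1024^3 GEMM measured at 2 ms -> ~1.07 TFLOPS.
print(round(gemm_tflops(1024, 1024, 1024, 2e-3), 2))  # 1.07
```

Guarding against non-positive times is one small way the "only from real measured result sets" rule can be enforced at the reporting boundary rather than by convention.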
