Commit 454e114

LessUp and Copilot committed
feat(docs): rebuild whitepaper pages
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 5fe6ec0 commit 454e114

73 files changed

Lines changed: 1330 additions & 936 deletions

README.md

Lines changed: 23 additions & 22 deletions
@@ -8,15 +8,14 @@

 English | [简体中文](README.zh-CN.md)

-A CUDA SGEMM engineering notebook designed for both deep learning and interview presentation: from readable FP32 baselines to guarded Tensor Core WMMA, with cuBLAS-backed verification and explicit benchmark boundaries.
+This repository is a CUDA SGEMM case study presented as a technical whitepaper and kernel academy. It starts from readable FP32 baselines, climbs through tiled, bank-conflict-aware, double-buffer, and guarded Tensor Core WMMA paths, then frames every performance claim with explicit validation boundaries.

-## Why this project stands out
+## Why it stands out

-- **Progressive kernel ladder**: naive -> tiled -> bank-conflict-free -> double-buffer -> Tensor Core.
-- **Evidence-first reporting**: performance claims are paired with correctness policy and scope labels.
-- **Comparable interfaces**: FP32 kernels share a unified `(A, B, C, M, K, N, stream)` launcher contract.
-- **Interview-ready narrative**: architecture, methodology, validation, and references reinforce one public story.
-- **Bilingual mirrored docs**: English and Chinese public pages stay aligned.
+- **Readable optimization ladder**: every kernel stage exists to expose one bottleneck shift.
+- **Evidence-first public story**: correctness policy, benchmark scope, and local-versus-CI trust boundaries stay attached to every claim.
+- **Interview-grade positioning**: the Pages site is written so the project can be explained, defended, and audited under technical pressure.
+- **Bilingual mirrored docs**: English and Chinese routes stay structurally aligned across the full public site.

 ## Quick start
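
Note on the kernel ladder in this hunk: the first rung (naive to tiled) can be illustrated on the host without a GPU. The sketch below is not the repository's code. It is a CPU analogue that borrows the `(A, B, C, M, K, N)` argument order from the launcher contract named in the removed bullet and drops `stream`, since nothing here runs on a device. The row-major layout, the function names `sgemm_naive` and `sgemm_tiled`, and the `TILE` value are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side analogue of the (A, B, C, M, K, N, stream) contract.
// Assumed layout: row-major, A is M x K, B is K x N, C is M x N.
void sgemm_naive(const std::vector<float>& A, const std::vector<float>& B,
                 std::vector<float>& C,
                 std::size_t M, std::size_t K, std::size_t N) {
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}

// Illustrative tile edge; a real kernel would tune this per architecture.
constexpr std::size_t TILE = 4;

// Same contract, but the loops are blocked so that each TILE x TILE patch of
// A and B is reused while it is still hot in cache. On a GPU the analogous
// step stages those patches in shared memory.
void sgemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                 std::vector<float>& C,
                 std::size_t M, std::size_t K, std::size_t N) {
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    std::fill(C.begin(), C.end(), 0.0f);
    for (std::size_t i0 = 0; i0 < M; i0 += TILE) {
        for (std::size_t j0 = 0; j0 < N; j0 += TILE) {
            for (std::size_t k0 = 0; k0 < K; k0 += TILE) {
                const std::size_t i1 = std::min(i0 + TILE, M);
                const std::size_t j1 = std::min(j0 + TILE, N);
                const std::size_t k1 = std::min(k0 + TILE, K);
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t j = j0; j < j1; ++j) {
                        float acc = C[i * N + j];  // partial sum from earlier k-tiles
                        for (std::size_t k = k0; k < k1; ++k) {
                            acc += A[i * K + k] * B[k * N + j];
                        }
                        C[i * N + j] = acc;
                    }
                }
            }
        }
    }
}
```

Because both functions share one signature, a test or benchmark harness can swap them freely, which is presumably the point of a unified contract. The `std::min` bounds handle dimensions that are not multiples of `TILE`.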

@@ -30,29 +29,31 @@ cmake --build build -j$(nproc)
 ctest --test-dir build
 ```

-Runtime tests and benchmarks require a CUDA-capable local machine. Hosted CI is limited to formatting, repository-structure, OpenSpec/governance, and Pages checks.
+Runtime tests and benchmarks require a local CUDA-capable machine. Hosted CI covers repository integrity, documentation, OpenSpec validation, and Pages buildability.

-## Start here (GitHub Pages)
+## GitHub Pages entry points
+
+The README is the executive summary. The long-form technical narrative lives on Pages.

 | Goal | Entry point |
 |------|-------------|
-| Open English home | [Docs Home](https://lessup.github.io/sgemm-optimization/en/) |
+| Open English home | [English Home](https://lessup.github.io/sgemm-optimization/en/) |
 | Open Chinese home | [中文首页](https://lessup.github.io/sgemm-optimization/zh/) |
-| Build and run once | [Getting Started](https://lessup.github.io/sgemm-optimization/en/getting-started) |
-| Understand differentiation | [Architecture Overview](https://lessup.github.io/sgemm-optimization/en/architecture/) |
-| Prepare interview explanation | [Methodology](https://lessup.github.io/sgemm-optimization/en/methodology/) |
-| Check trust boundaries | [Validation Overview](https://lessup.github.io/sgemm-optimization/en/validation/) |
-| Trace technical lineage | [References](https://lessup.github.io/sgemm-optimization/en/references) |
-| Read normative specs | [OpenSpec Specs](openspec/specs/) |
+| Get oriented quickly | [Project Guide](https://lessup.github.io/sgemm-optimization/en/overview/) |
+| Inspect system structure | [Architecture](https://lessup.github.io/sgemm-optimization/en/architecture/) |
+| Study the kernel ladder | [Academy](https://lessup.github.io/sgemm-optimization/en/academy/) |
+| Check what the evidence proves | [Validation](https://lessup.github.io/sgemm-optimization/en/validation/) |
+| Trace papers and related repos | [Research Desk](https://lessup.github.io/sgemm-optimization/en/research/) |
+| Read normative repository requirements | [OpenSpec Specs](openspec/specs/) |

 ## Validation boundary

-| Environment | What to trust |
-|-------------|---------------|
-| Hosted CI | Formatting, docs/structure checks, OpenSpec governance, Pages buildability |
-| Local CUDA GPU | Runtime correctness verification and benchmark performance |
+| Environment | What it can prove |
+|-------------|-------------------|
+| Hosted CI | Docs structure, route integrity, OpenSpec consistency, Pages buildability |
+| Local CUDA GPU | Runtime correctness, fallback behavior, benchmark performance |

-This split is deliberate. CI keeps repository health; real GPU hardware validates runtime behavior and speed claims.
+This split is deliberate. CI keeps the repository coherent, but only local GPU execution can validate runtime behavior and speed claims.

 ## Source map

@@ -61,7 +62,7 @@ src/kernels/ CUDA SGEMM implementations
 src/utils/ CUDA RAII, verification, benchmark helpers
 src/main.cu benchmark CLI
 tests/ Google Test coverage against cuBLAS
-docs/ learning documentation mirrored on Pages
+docs/ VitePress whitepaper and academy, mirrored under /en and /zh
 openspec/ stable specs and change workflow
 ```
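
Note on "runtime correctness" in the validation boundary: the README says tests compare against cuBLAS but does not spell out the comparison rule in this diff. One common shape for such a rule is a mixed absolute and relative tolerance check, sketched below on host vectors. The struct `VerifyResult`, the function `verify_against_reference`, and any tolerance values a caller passes are hypothetical and are not taken from the repository. In a real test, `ref` would hold the cuBLAS result copied back to the host.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical report type: pass/fail plus the largest finite absolute error.
struct VerifyResult {
    bool ok;
    float max_abs_err;
    std::size_t worst_index;
};

// Element-wise check: |got - ref| <= atol + rtol * |ref|.
// A NaN in either input makes the comparison false, so it counts as a failure.
VerifyResult verify_against_reference(const std::vector<float>& got,
                                      const std::vector<float>& ref,
                                      float atol, float rtol) {
    assert(got.size() == ref.size());
    VerifyResult result{true, 0.0f, 0};
    for (std::size_t i = 0; i < got.size(); ++i) {
        const float err = std::fabs(got[i] - ref[i]);
        if (err > result.max_abs_err) {
            result.max_abs_err = err;
            result.worst_index = i;
        }
        if (!(err <= atol + rtol * std::fabs(ref[i]))) {
            result.ok = false;
        }
    }
    return result;
}
```

A Tensor Core path that rounds its inputs to reduced precision would likely need looser `atol` and `rtol` than the FP32 paths. Keeping the tolerances as explicit parameters lets each kernel state its own correctness policy next to its performance claim.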

README.zh-CN.md

Lines changed: 23 additions & 22 deletions
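@@ -8,15 +8,14 @@

 [English](README.md) | 简体中文

-This is an engineering-focused CUDA SGEMM project for learning and interview presentation: it evolves from readable FP32 baseline kernels to Tensor Core WMMA with a guarded fallback, and builds trustworthy verification through comparison against cuBLAS.
+This is a CUDA SGEMM case-study repository packaged as a technical whitepaper and Kernel Academy. It starts from a readable FP32 baseline, advances step by step through tiled, bank-conflict-aware, double-buffer, and guarded Tensor Core WMMA paths, and explains every performance conclusion inside explicit validation boundaries.

-## Why it is more competitive
+## Why it is stronger

-- **Complete optimization chain**: naive -> tiled -> bank-conflict-free -> double-buffer -> Tensor Core.
-- **Evidence-first presentation**: performance conclusions are presented together with the correctness policy and measurement scope.
-- **Consistent interfaces**: FP32 kernels use a unified `(A, B, C, M, K, N, stream)` launcher contract.
-- **Interview-friendly narrative**: architecture, methodology, validation, and references jointly support one public narrative.
-- **Mirrored Chinese and English docs**: public pages keep the same structure, which makes them easy to share and reuse.
+- **An optimization ladder that can be explained clearly**: each kernel stage corresponds to one explicit bottleneck shift.
+- **Evidence-first public narrative**: the correctness policy, benchmark scope, and the trust boundary between local GPU and hosted CI always travel with the conclusions.
+- **Interview-friendly presentation**: the Pages site is written as a technical narrative that can be explained, defended, and audited.
+- **Complete Chinese-English mirroring**: English and Chinese public routes stay structurally consistent across the whole site.

 ## Quick start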

@@ -30,29 +29,31 @@ cmake --build build -j$(nproc)
 ctest --test-dir build
 ```

-Runtime tests and benchmarks require a local CUDA GPU. Hosted CI covers only formatting, repository structure, OpenSpec / governance, and Pages build checks.
+Runtime tests and benchmarks require a local CUDA GPU. Hosted CI is mainly responsible for repository integrity, documentation structure, OpenSpec validation, and Pages buildability.

-## Recommended entry points (GitHub Pages)
+## GitHub Pages entry points
+
+The README is the executive summary. The long-form technical narrative lives on GitHub Pages.

 | Goal | Entry point |
 |------|------|
+| Open English home | [English Home](https://lessup.github.io/sgemm-optimization/en/) |
 | Open Chinese home | [中文首页](https://lessup.github.io/sgemm-optimization/zh/) |
-| Open English home | [Docs Home](https://lessup.github.io/sgemm-optimization/en/) |
-| Build and run once | [Getting Started](https://lessup.github.io/sgemm-optimization/zh/getting-started) |
-| Understand what differentiates the project | [Architecture Overview](https://lessup.github.io/sgemm-optimization/zh/architecture/) |
-| Prepare an interview explanation | [Methodology](https://lessup.github.io/sgemm-optimization/zh/methodology/) |
-| Check trust boundaries | [Validation Overview](https://lessup.github.io/sgemm-optimization/zh/validation/) |
-| Trace technical sources | [References](https://lessup.github.io/sgemm-optimization/zh/references) |
-| Read the normative source | [OpenSpec Specs](openspec/specs/) |
+| Build a global picture quickly | [Project Guide](https://lessup.github.io/sgemm-optimization/zh/overview/) |
+| Inspect system structure | [Architecture](https://lessup.github.io/sgemm-optimization/zh/architecture/) |
+| Study the kernel ladder systematically | [Academy](https://lessup.github.io/sgemm-optimization/zh/academy/) |
+| Check what the evidence actually proves | [Validation](https://lessup.github.io/sgemm-optimization/zh/validation/) |
+| Trace papers and related repositories | [Research Desk](https://lessup.github.io/sgemm-optimization/zh/research/) |
+| Read the source of the repository's norms | [OpenSpec Specs](openspec/specs/) |

 ## Validation boundary

-| Environment | What can be trusted |
-|------|--------------|
-| Hosted CI | Formatting, docs/structure checks, OpenSpec governance, Pages buildability |
-| Local CUDA GPU | Runtime correctness and benchmark performance |
+| Environment | What it can prove |
+|------|------------|
+| Hosted CI | Docs structure, route integrity, OpenSpec consistency, Pages buildability |
+| Local CUDA GPU | Runtime correctness, fallback behavior, benchmark performance |

-This split is deliberate. CI is responsible for repository health; real GPUs are responsible for runtime and performance conclusions.
+This split is deliberate. CI keeps the repository coherent, and only local GPU execution can validate runtime behavior and speed conclusions.

 ## Source map

@@ -61,7 +62,7 @@ src/kernels/ CUDA SGEMM kernel implementations
 src/utils/ CUDA RAII, verification, and benchmark utilities
 src/main.cu benchmark CLI
 tests/ cuBLAS-based Google Test coverage
-docs/ Chinese and English Pages docs (including /en and /zh)
+docs/ VitePress whitepaper and academy, publicly mirrored under /en and /zh
 openspec/ stable specs and change workflow
 ```
