Releases: LessUp/hpc-ai-optimization-lab

v0.3.0 - Documentation Internationalization

21 Apr 19:02

Overview

This release makes the documentation accessible to both English and Chinese readers through a complete bilingual documentation suite.

English Documentation Suite

  • GEMM Optimization - 7-step matrix multiplication optimization journey (0.5 → 70+ TFLOPS)
  • Memory Optimization - Coalesced access, vectorization, shared memory patterns
  • Reduction Optimization - Warp shuffle, online softmax, LayerNorm algorithms
  • FlashAttention - IO-aware attention mechanism with tiling strategy
  • CUDA 13 Features - Hopper architecture: TMA, Clusters, FP8 support
  • API Reference - Complete C++/CUDA API documentation
  • Architecture Overview - Project design patterns and module organization

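Chinese Documentation Suite

  • GEMM 优化 (GEMM Optimization) - 7-step matrix multiplication optimization journey (0.5 → 70+ TFLOPS)
  • 访存优化 (Memory Optimization) - Coalesced access, vectorization, shared memory patterns
  • 归约优化 (Reduction Optimization) - Warp shuffle, online softmax, LayerNorm algorithms
  • FlashAttention - IO-aware attention mechanism with tiling strategy
  • CUDA 13 特性 (CUDA 13 Features) - Hopper architecture: TMA, Clusters, FP8 support
  • API 参考 (API Reference) - Complete C++/CUDA API documentation
  • 架构概览 (Architecture Overview) - Project design patterns and module organization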

Features

  • ✅ Complete bilingual documentation (English + Chinese)
  • ✅ VitePress-powered documentation site
  • ✅ 7-step GEMM optimization journey
  • ✅ FlashAttention with online softmax
  • ✅ Tensor Core support (WMMA + MMA PTX)
  • ✅ CUDA 13 Hopper features (experimental)
  • ✅ Python bindings (nanobind)
  • ✅ Comprehensive test coverage (GoogleTest + RapidCheck)

Installation

# Clone
git clone https://github.com/LessUp/hpc-ai-optimization-lab.git
cd hpc-ai-optimization-lab

# Build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Test
ctest --test-dir build --output-on-failure

Documentation


Full Changelog: v0.2.0...v0.3.0