| layout | default |
|---|---|
| title | Changelog |
| nav_order | 10 |
| permalink | /CHANGELOG/ |
All notable changes to this project are documented here.
The format follows Keep a Changelog and the project aims to follow Semantic Versioning.
- Integer overflow risk in `verify.cuh` and `tensor_core_sgemm.cuh` for large matrices (use `size_t`)
- Command-line parsing now uses `strtol()` with proper error handling instead of `atoi()`
- Benchmark startup now fails cleanly instead of crashing when no CUDA device is available
- Compute-only Tensor Core benchmark results now flow through the shared summary/export path
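
The first two fixes above can be illustrated with a minimal host-side sketch. The helper names `parse_positive_int` and `flat_index` are hypothetical and are not taken from the project; the sketch only assumes C++17 and shows the general pattern of checked `strtol()` parsing and `size_t` index arithmetic.

```cpp
#include <cerrno>
#include <climits>
#include <cstddef>
#include <cstdlib>
#include <optional>

// Hypothetical helper: parse a strictly positive int from a CLI argument.
// Unlike atoi(), this rejects empty input, trailing characters, and
// out-of-range values instead of silently returning 0 or a wrapped number.
std::optional<int> parse_positive_int(const char* text) {
    if (text == nullptr || *text == '\0') return std::nullopt;
    errno = 0;
    char* end = nullptr;
    const long value = std::strtol(text, &end, 10);
    if (end == text || *end != '\0') return std::nullopt;  // no digits, or trailing junk
    if (errno == ERANGE || value <= 0 || value > INT_MAX) return std::nullopt;
    return static_cast<int>(value);
}

// Hypothetical helper: row-major offset computed in size_t, so that
// row * leading_dim + col cannot wrap a 32-bit int for large matrices.
std::size_t flat_index(int row, int col, int leading_dim) {
    return static_cast<std::size_t>(row) * static_cast<std::size_t>(leading_dim)
         + static_cast<std::size_t>(col);
}
```

A caller would typically print a usage message and exit with a non-zero status when `parse_positive_int` returns `std::nullopt`.
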
- Simplified repository maintenance around README, CONTRIBUTING, root workflows, and the docs site.
- Reworked README, GitHub Pages content, and supporting docs into clearer SGEMM-focused entry surfaces.
- Standardized on a single CMake build path and removed duplicate maintenance surfaces.
- Duplicate `LICENSE` file (kept `LICENSE.md` with third-party info)
- OpenSpec, Claude-specific command/skill files, and other repository-resident AI workflow scaffolding
- Duplicate Makefile-based build path and unused performance-baseline scaffolding
- `.clang-tidy` configuration for static analysis
- Tensor Core WMMA SGEMM kernel with guarded FP32 fallback for unsupported dimensions
- Benchmark enhancements, including roofline data export and configurable warmup/benchmark iterations
- Google Test coverage for standard kernels, Tensor Core fast path, fallback behavior, and edge cases
- Bilingual documentation and a GitHub Pages documentation site
- Consolidated source code into `src/kernels/`, `src/utils/`, and `tests/`
- Adopted CMake as the primary build system while retaining the Makefile for quick local runs
- Expanded supported CUDA architecture targets to cover Volta through Hopper generation GPUs
- Tensor Core path memory management issues
- Double-buffer synchronization issues
- Grid dimension handling for non-square matrices
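
The changelog does not say what the grid-dimension bug was. As an illustration of the usual pattern for non-square outputs, the sketch below rounds each grid dimension up independently and maps `x` to columns and `y` to rows. `Dim2`, `ceil_div`, and `grid_for` are hypothetical names; `Dim2` stands in for CUDA's `dim3` so the snippet compiles without the CUDA toolkit.

```cpp
// Hypothetical stand-in for CUDA's dim3 (x = columns, y = rows).
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Round-up integer division written to avoid overflow in a + b - 1.
constexpr unsigned ceil_div(unsigned a, unsigned b) {
    return a / b + (a % b != 0 ? 1u : 0u);
}

// Grid for an M x N output matrix (M rows, N columns) tiled by
// tile_m x tile_n blocks. A common mistake for non-square matrices is
// deriving both dimensions from the same extent, or swapping M and N.
constexpr Dim2 grid_for(unsigned m, unsigned n, unsigned tile_m, unsigned tile_n) {
    return Dim2{ceil_div(n, tile_n), ceil_div(m, tile_m)};
}
```

Because the grid is rounded up, a kernel launched this way still needs bounds checks on `row < M` and `col < N` for the partial tiles at the edges.
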
- Bank-conflict-free and double-buffer SGEMM kernels
- CUDA Events-based benchmark infrastructure
- Nsight-oriented profiling support
- Migrated from an earlier single-file layout to the current modular structure
- Standardized on CUDA 11.0+ and C++17
- Legacy single-file benchmark script
- SM 6.x support
- Initial naive and tiled SGEMM kernels
- Basic cuBLAS correctness verification
- First benchmark CLI
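
As a rough illustration of what correctness verification involves, the sketch below compares a result against a reference with a relative tolerance. It substitutes a naive CPU reference for cuBLAS so that it runs without a GPU; the function names and the row-major layout are assumptions, not the project's actual verification code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Naive row-major reference: C = A * B with A (m x k), B (k x n), C (m x n).
std::vector<float> sgemm_reference(const std::vector<float>& a,
                                   const std::vector<float>& b,
                                   std::size_t m, std::size_t n, std::size_t k) {
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    return c;
}

// Largest error of `actual` against `expected`, scaled by max(|expected|, 1)
// so that values near zero do not inflate the result. Returns infinity on a
// size mismatch.
float max_relative_error(const std::vector<float>& expected,
                         const std::vector<float>& actual) {
    if (expected.size() != actual.size()) {
        return std::numeric_limits<float>::infinity();
    }
    float worst = 0.0f;
    for (std::size_t i = 0; i < expected.size(); ++i) {
        const float scale = std::max(std::fabs(expected[i]), 1.0f);
        worst = std::max(worst, std::fabs(actual[i] - expected[i]) / scale);
    }
    return worst;
}
```

A kernel result would then be accepted when `max_relative_error` falls below a chosen tolerance; the right threshold depends on matrix size and on whether a reduced-precision Tensor Core path is in use.
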