layout: default
title: Changelog
nav_order: 10
permalink: /CHANGELOG/

Changelog

All notable changes to this project are documented here.

The format follows Keep a Changelog and the project aims to follow Semantic Versioning.

[Unreleased]

Fixed

  • Integer overflow risk in verify.cuh and tensor_core_sgemm.cuh for large matrices (now uses size_t)
  • Command-line parsing now uses strtol() with proper error handling instead of atoi()
  • Benchmark startup now fails cleanly instead of crashing when no CUDA device is available
  • Compute-only Tensor Core benchmark results now flow through the shared summary/export path
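The first two fixes above follow a common pattern, sketched below in host-side C++. This is an illustration only, not the project's actual code: `parse_dim` and `element_count` are hypothetical names, and the sketch assumes a 64-bit `size_t`. It shows `strtol()` reporting malformed or out-of-range input that `atoi()` would accept silently, and an element count widened to `size_t` before the multiplication so it cannot wrap in `int` arithmetic.

```cpp
#include <cerrno>
#include <climits>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: parse a positive matrix dimension from a CLI argument.
// Returns true and writes *out on success; returns false on malformed input.
bool parse_dim(const char* text, int* out) {
    if (text == nullptr || *text == '\0') return false;
    errno = 0;
    char* end = nullptr;
    const long value = std::strtol(text, &end, 10);
    if (end == text || *end != '\0') return false;    // no digits, or trailing junk
    if (errno == ERANGE) return false;                // does not fit in a long
    if (value <= 0 || value > INT_MAX) return false;  // not a usable dimension
    *out = static_cast<int>(value);
    return true;
}

// Hypothetical helper: element count of a rows x cols matrix.
// Both operands are widened first, so the product is computed in size_t.
std::size_t element_count(int rows, int cols) {
    return static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols);
}
```

With `int` arithmetic, a 65536 x 65536 matrix would overflow when computing its element count. The widened version returns the full value on platforms where `size_t` is 64 bits.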

Changed

  • Simplified repository maintenance around README, CONTRIBUTING, root workflows, and the docs site
  • Reworked README, GitHub Pages content, and supporting docs into clearer SGEMM-focused entry surfaces
  • Standardized on a single CMake build path and removed duplicate maintenance surfaces

Removed

  • Duplicate LICENSE file (kept LICENSE.md with third-party info)
  • OpenSpec, Claude-specific command/skill files, and other repository-resident AI workflow scaffolding
  • Duplicate Makefile-based build path and unused performance-baseline scaffolding

Added

  • .clang-tidy configuration for static analysis

[2.1.0] - 2026-04-16

Added

  • Tensor Core WMMA SGEMM kernel with guarded FP32 fallback for unsupported dimensions
  • Benchmark enhancements, including roofline data export and configurable warmup/benchmark iterations
  • Google Test coverage for standard kernels, Tensor Core fast path, fallback behavior, and edge cases
  • Bilingual documentation and a GitHub Pages documentation site

Changed

  • Consolidated source code into src/kernels/, src/utils/, and tests/
  • Adopted CMake as the primary build system while retaining the Makefile for quick local runs
  • Expanded supported CUDA architecture targets to cover Volta through Hopper GPUs

Fixed

  • Tensor Core path memory management issues
  • Double-buffer synchronization issues
  • Grid dimension handling for non-square matrices
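The changelog does not describe the grid-dimension bug itself. A frequent pitfall with non-square matrices is swapping the row and column extents, or truncating when the size is not a multiple of the tile. The sketch below shows the usual ceiling-division layout under those assumptions. `GridDims`, `ceil_div`, and `grid_for` are hypothetical names, and `GridDims` stands in for CUDA's `dim3` so the example compiles without the CUDA toolkit.

```cpp
// Hypothetical stand-in for CUDA's dim3, used so this sketch builds on the host.
struct GridDims {
    unsigned x;  // blocks along the column dimension (N)
    unsigned y;  // blocks along the row dimension (M)
};

// Round up so a partial tile at the edge still gets a block.
constexpr unsigned ceil_div(unsigned a, unsigned b) {
    return (a + b - 1) / b;
}

// Grid for an M x N output matrix covered by square tiles of side `tile`.
// x follows columns (N) and y follows rows (M); swapping them only goes
// unnoticed when M == N.
constexpr GridDims grid_for(unsigned M, unsigned N, unsigned tile) {
    return GridDims{ceil_div(N, tile), ceil_div(M, tile)};
}
```

For a 100 x 300 output with 32 x 32 tiles this gives a 10 x 4 grid. The kernel then needs bounds checks for the partial tiles on the right and bottom edges.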

[2.0.0] - 2026-03-13

Added

  • Bank-conflict-free and double-buffer SGEMM kernels
  • CUDA Events-based benchmark infrastructure
  • Nsight-oriented profiling support

Changed

  • Migrated from an earlier single-file layout to the current modular structure
  • Standardized on CUDA 11.0+ and C++17

Removed

  • Legacy single-file benchmark script
  • SM 6.x support

[1.0.0] - 2025-02-13

Added

  • Initial naive and tiled SGEMM kernels
  • Basic cuBLAS correctness verification
  • First benchmark CLI