Commit 826bc00

TimDettmers and claude committed
Add launch_bounds, fast benchmarking suite, and consolidated kernel spec
- Add __launch_bounds__(128, 12) to MMA kernel (TILE_N<=64) to hint compiler for higher occupancy (targets 12 blocks/SM, 100% theoretical)
- Add single-process ncu benchmark suite (bench_ncu.sh + ncu_driver.py) that profiles MMA + scalar across all shapes×k×M in ~30s
- Add cuBLAS fp16 baseline benchmark (bench_fp16.py) with pre-allocated I/O for fair comparison
- Consolidate all scattered docs into kbit-kernel-spec.md covering the four-kernel strategy, architecture details, and benchmarking workflow
- Remove obsolete documentation files

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 11f4b32 commit 826bc00
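
The "12 blocks/SM, 100% theoretical" figure in the message can be checked with a little arithmetic. The sketch below is not from the commit. It assumes an SM that holds at most 1536 resident threads and 65536 32-bit registers (the limits on compute capability 8.6/8.9 parts), and the helper names `occupancy` and `register_budget` are hypothetical.

```python
# Assumed hardware limits (not stated in the commit): per-SM maximums for
# compute capability 8.6/8.9. Other architectures use different values.
MAX_THREADS_PER_SM = 1536
REGISTERS_PER_SM = 65536


def occupancy(threads_per_block, blocks_per_sm,
              max_threads_per_sm=MAX_THREADS_PER_SM):
    """Theoretical occupancy: resident threads divided by the SM thread limit."""
    resident = threads_per_block * blocks_per_sm
    return min(resident, max_threads_per_sm) / max_threads_per_sm


def register_budget(threads_per_block, blocks_per_sm,
                    registers_per_sm=REGISTERS_PER_SM):
    """Upper bound on registers per thread if all requested blocks are resident."""
    return registers_per_sm // (threads_per_block * blocks_per_sm)


if __name__ == "__main__":
    # Values taken from __launch_bounds__(128, 12) in the commit message.
    print(f"occupancy: {occupancy(128, 12):.0%}")
    print(f"register upper bound per thread: {register_budget(128, 12)}")
```

Under these assumptions, 128 threads × 12 blocks fills all 1536 thread slots, and the compiler has to keep the kernel to at most 42 registers per thread to make that possible. Register allocation granularity may push the practical cap lower, and shared memory use per block can also limit the number of resident blocks.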


12 files changed: +546 -4100 lines changed


agents/scalar_gemv_guide.md

Lines changed: 0 additions & 383 deletions
This file was deleted.
