Commit 826bc00
Add launch_bounds, fast benchmarking suite, and consolidated kernel spec
- Add __launch_bounds__(128, 12) to MMA kernel (TILE_N<=64) to hint the
compiler toward higher occupancy (targets 12 blocks/SM, 100% theoretical
occupancy)
- Add single-process ncu benchmark suite (bench_ncu.sh + ncu_driver.py)
that profiles MMA + scalar across all shapes×k×M in ~30s
- Add cuBLAS fp16 baseline benchmark (bench_fp16.py) with pre-allocated
I/O for fair comparison
- Consolidate all scattered docs into kbit-kernel-spec.md covering the
four-kernel strategy, architecture details, and benchmarking workflow
- Remove obsolete documentation files
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 11f4b32 commit 826bc00
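The occupancy figure in the first bullet can be checked with simple arithmetic. The sketch below is illustrative only and is not part of the commit. It assumes a GPU with a limit of 1536 resident threads per SM and 65536 32-bit registers per SM (as on compute capability 8.6/8.9 parts); the commit message does not name the target GPU, and the function names are hypothetical.

```python
# Illustrative arithmetic for __launch_bounds__(128, 12); not code from the commit.
# ASSUMPTION: hardware limits are passed in by the caller. The defaults below
# (1536 threads/SM, 65536 registers/SM) match compute capability 8.6/8.9.

WARP_SIZE = 32


def theoretical_occupancy(threads_per_block, blocks_per_sm, max_threads_per_sm=1536):
    """Fraction of an SM's warp slots filled if blocks_per_sm blocks are resident."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    resident_warps = warps_per_block * blocks_per_sm
    max_warps = max_threads_per_sm // WARP_SIZE
    return min(resident_warps, max_warps) / max_warps


def register_budget_per_thread(threads_per_block, blocks_per_sm, regs_per_sm=65536):
    """Upper bound on registers per thread that still lets blocks_per_sm blocks fit.

    Ignores register allocation granularity, so the real limit can be lower.
    """
    return regs_per_sm // (threads_per_block * blocks_per_sm)


if __name__ == "__main__":
    print(theoretical_occupancy(128, 12))        # 1536-thread SM
    print(theoretical_occupancy(128, 12, 2048))  # 2048-thread SM
    print(register_budget_per_thread(128, 12))
```

Under these assumptions, 12 blocks of 128 threads give 48 resident warps, which fills a 1536-thread SM completely but only three quarters of a 2048-thread SM. The register budget is the constraint the launch bound actually asks the compiler to respect.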
12 files changed (+546, -4100 lines)
File tree
- agents
- benchmarks
- csrc