Commit 4fc1d8e
Increase k_splits occupancy targets on datacenter GPUs
- MMA launcher: raise TARGET_BLOCKS_PER_SM from 4 to 6 (TN=64) and
from 1 to 2 (TN=128) when num_sms > 130 (H100 SXM has 132 SMs).
- Grouped MMA launcher: same occupancy target adjustment.
- Detection is runtime (num_sms > 130), not compile-time, so consumer
GPUs with ≤130 SMs keep the existing targets unchanged.
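
The runtime selection described in the bullets above could look roughly like the following host-side sketch. The function names `target_blocks_per_sm` and `pick_k_splits`, the constant name `kDatacenterSmThreshold`, the fallback for other tile widths, and the whole k_splits formula are assumptions for illustration. The commit itself only states the target values (4 to 6 for TN=64, 1 to 2 for TN=128) and the `num_sms > 130` condition. In a real launcher, `num_sms` would typically be queried once at runtime, for example with `cudaDeviceGetAttribute` and `cudaDevAttrMultiProcessorCount`.

```cpp
#include <algorithm>

// Hypothetical constant name; the commit only states the condition num_sms > 130.
constexpr int kDatacenterSmThreshold = 130;

// Occupancy target for a given tile width TN, chosen at runtime from the SM count.
// Values follow the commit message: 4 -> 6 (TN=64) and 1 -> 2 (TN=128).
constexpr int target_blocks_per_sm(int tn, int num_sms) {
    const bool many_sms = num_sms > kDatacenterSmThreshold;
    if (tn == 64)  return many_sms ? 6 : 4;
    if (tn == 128) return many_sms ? 2 : 1;
    return 1;  // assumption: conservative default for any other tile width
}

// Assumed heuristic, NOT taken from the commit: choose the smallest k_splits
// so that base_blocks * k_splits reaches target_blocks_per_sm * num_sms,
// clamped to the range [1, max_splits].
constexpr int pick_k_splits(int base_blocks, int tn, int num_sms, int max_splits) {
    const int blocks = std::max(base_blocks, 1);            // guard against division by zero
    const int wanted = target_blocks_per_sm(tn, num_sms) * num_sms;
    const int splits = (wanted + blocks - 1) / blocks;      // ceiling division
    return std::min(std::max(splits, 1), max_splits);
}
```

Under these assumptions, a device reporting 128 SMs (RTX 4090) takes the unchanged 4/1 targets, while one reporting 132 SMs (H100 SXM) takes 6/2, so the same binary behaves differently without any compile-time switch.
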
H100 SXM benchmark improvement (CUDA graph, 500 iters × 5 trials):
- gateup k=4 M=1: 30.6→28.8 µs (-5.9%)
- gateup k=4 M=16: 34.9→29.5 µs (-15.5%)
- Q k=2 M=1: 23.0→21.4 µs (-7.0%)
- O k=4 M=1: 26.4→24.1 µs (-8.7%)
- KV: neutral (small shape, already fully occupied)
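
The percentages above are relative changes against the old latency. A minimal sketch of that arithmetic, using the four measured pairs from this message (the struct, array, and function names are hypothetical):

```cpp
#include <cstdio>

// Relative change in percent; negative means the new latency is lower.
constexpr double percent_change(double before_us, double after_us) {
    return (after_us - before_us) / before_us * 100.0;
}

// Hypothetical container for the measurements quoted in the commit message.
struct BenchRow {
    const char* name;
    double before_us;
    double after_us;
};

constexpr BenchRow kRows[] = {
    {"gateup k=4 M=1",  30.6, 28.8},
    {"gateup k=4 M=16", 34.9, 29.5},
    {"Q k=2 M=1",       23.0, 21.4},
    {"O k=4 M=1",       26.4, 24.1},
};

// Prints one line per shape with the change rounded to one decimal place.
inline void print_benchmark_deltas() {
    for (const BenchRow& row : kRows) {
        std::printf("%-16s %.1f -> %.1f us (%+.1f%%)\n", row.name,
                    row.before_us, row.after_us,
                    percent_change(row.before_us, row.after_us));
    }
}
```
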
Consumer regression: none (174/174 tests pass; the RTX 4090 has 128 SMs,
so the higher targets are never activated).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 2cc0224 commit 4fc1d8e
1 file changed, 15 insertions(+), 7 deletions(-)
@@ -1449,11 +1449,15 @@ (5 lines removed, 9 added; diff content not preserved)
@@ -1924,8 +1928,12 @@ (2 lines removed, 6 added; diff content not preserved)