Commit 23f92e5
Grouped MMA: TILE_N=64 + k_splits for small-N MoE shapes
For moe_gu (K=2048, N=512), the old TILE_N=128 gave only 4 N-tiles
per expert × 8 experts = 32 blocks, i.e. 25% SM utilization on a
128-SM GPU. The kernel now uses TILE_N=64 (128 threads, 4 warps) when
m_blocks==1, doubling the N-tiles to 8 per expert (64 blocks).
Combined with automatic k_splits, which splits K into chunks when
there are too few MN-tiles to fill the GPU, this achieves full SM
occupancy.
Results: moe_gu drops from a constant 26 us to 9-14 us (2.6-2.9x
faster). Per-block total at k=4, M=5-8 improves from 1.03x to
1.24-1.36x vs fp16.
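
The block-count arithmetic above can be reproduced with a minimal Python sketch. It is not the code from this commit: the function name `plan_grid`, the `tile_k` granularity, and the exact rule for choosing `k_splits` are assumptions made for illustration. Only the TILE_N selection on `m_blocks == 1` and the moe_gu shape come from the commit message.

```python
import math


def plan_grid(n, k, num_experts, m_blocks, num_sms, tile_k=64):
    """Hypothetical launch heuristic; returns (tile_n, k_splits, total_blocks).

    tile_k and the k_splits rule are illustrative assumptions, not the
    values used by the actual kernel.
    """
    # Narrow N-tile only when each expert has a single M-block.
    tile_n = 64 if m_blocks == 1 else 128
    n_tiles = math.ceil(n / tile_n)
    mn_blocks = n_tiles * m_blocks * num_experts

    # Split K only when the MN-tiles alone cannot cover every SM.
    max_splits = max(1, k // tile_k)
    k_splits = 1
    if mn_blocks < num_sms:
        k_splits = min(max_splits, num_sms // mn_blocks)
    return tile_n, k_splits, mn_blocks * k_splits


# moe_gu shape from the commit message: K=2048, N=512, 8 experts, 128 SMs.
old_blocks = math.ceil(512 / 128) * 8       # TILE_N=128: 4 tiles x 8 experts
new_plan = plan_grid(n=512, k=2048, num_experts=8, m_blocks=1, num_sms=128)

print(old_blocks, old_blocks / 128)         # 32 0.25
print(new_plan)                             # (64, 2, 128)
```

Under these assumptions, TILE_N=64 alone yields 64 blocks (half the SMs), and a 2-way K split brings the grid to 128 blocks, one per SM.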
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent df55daf, commit 23f92e5
3 files changed, +188 -90 lines changed
- bitsandbytes/backends/cuda
- csrc