Skip configurations with fewer than 4 warps in tuning #188
thomasfaingnaert wants to merge 1 commit into
Given the fact that SMs in Volta, Turing, Ampere, and Hopper have four processing blocks, each with one warp scheduler, I don't think it makes sense to try configurations during tuning where the number of warps per CTA is less than 4. This reduces the search space by 18.75% (well, assuming that each of the options of WARPS_M and WARPS_N amounts to the same number of valid kernels, which is probably not true...).
We could also bump the limit to 8, so we allocate at least 2 warps per processing block. That allows the SM to switch to another warp if one warp stalls. This would reduce the search space by another 18.75%.
We might even want to restrict this further. For example, I don't think a configuration like WARPS_M = 1, WARPS_N = 8 makes sense, as that has reduced data reuse across the M dimension compared to the configuration WARPS_M = 2, WARPS_N = 4, so we might also only want to try the following configurations:
- 2 x 4
- 4 x 2
- 4 x 4
- 8 x 4
- 4 x 8

That would reduce the search space by 68.75% in total.
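
For reference, the percentages above can be reproduced with a quick count. This is only a sketch: it assumes WARPS_M and WARPS_N each range over {1, 2, 4, 8} (an assumption chosen because it matches the quoted numbers, not something taken from the tuning script), and it counts every combination as one equally weighted configuration, which, as noted above, is probably not true in practice.

```python
from itertools import product

# Assumption: each of WARPS_M and WARPS_N is tuned over {1, 2, 4, 8}.
# This set is hypothetical; it is chosen because it reproduces the quoted percentages.
WARP_OPTIONS = (1, 2, 4, 8)
ALL_CONFIGS = list(product(WARP_OPTIONS, WARP_OPTIONS))  # 16 (WARPS_M, WARPS_N) pairs

# The five configurations listed in the description.
SHORTLIST = {(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)}


def reduction(keep):
    """Percentage of configurations dropped when only `keep(m, n)` ones are tried."""
    kept = [cfg for cfg in ALL_CONFIGS if keep(*cfg)]
    return 100.0 * (len(ALL_CONFIGS) - len(kept)) / len(ALL_CONFIGS)


# At least 4 warps per CTA: one warp per processing block.
at_least_4 = reduction(lambda m, n: m * n >= 4)
# At least 8 warps per CTA: two warps per processing block.
at_least_8 = reduction(lambda m, n: m * n >= 8)
# Only the shortlisted shapes.
shortlist = reduction(lambda m, n: (m, n) in SHORTLIST)

print(f"limit of 4 warps: -{at_least_4}%")
print(f"limit of 8 warps: another -{at_least_8 - at_least_4}%")
print(f"shortlist only:   -{shortlist}% in total")
```

Under that assumption, the limit of 4 drops 1 x 1, 1 x 2 and 2 x 1, and the limit of 8 additionally drops 1 x 4, 2 x 2 and 4 x 1.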
FWIW, for the 7 GPUs we ran the tuning on (V100, RTX 4070, V100S, RTX6000, A100, RTX 2080 Ti, H100), this is the distribution of and the amount of optimal configurations that would no longer be tested for different changes:
Maybe we should hold off on this for now...
Maybe we could gate the extended coverage behind a |
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##           master     #188      +/-   ##
==========================================
- Coverage   34.94%   32.04%   -2.90%
==========================================
  Files          11       11
  Lines         933      958      +25
==========================================
- Hits          326      307      -19
- Misses        607      651      +44
@maleadt Thoughts?