Skip configurations with fewer than 4 warps in tuning by thomasfaingnaert · Pull Request #188 · JuliaGPU/GemmKernels.jl

thomasfaingnaert · 2024-01-25T22:28:23Z

Since SMs in Volta, Turing, Ampere, and Hopper have four processing blocks, each with its own warp scheduler, I don't think it makes sense for the tuning to try configurations with fewer than 4 warps per CTA. Skipping them reduces the search space by 18.75% (well, assuming that each combination of WARPS_M and WARPS_N yields the same number of valid kernels, which is probably not true...).

We could also bump the limit to 8, so we allocate at least 2 warps per processing block. That allows each warp scheduler to switch to another warp when one warp stalls. This would reduce the search space by another 18.75%.

We might even want to restrict this further. For example, I don't think a configuration like WARPS_M = 1, WARPS_N = 8 makes sense, as that has reduced data reuse across the N dimension compared to the configuration WARPS_M = 2, WARPS_N = 4, so we might also only want to try the following configurations:

2 x 4
4 x 2
4 x 4
8 x 4
4 x 8

That would reduce the search space by 68.75% in total.
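
The three percentages can be checked with a quick sketch. It assumes that WARPS_M and WARPS_N each range over 1, 2, 4, 8 (which is consistent with 18.75% being 3 of 16 combinations) and that every combination counts equally. The names `options`, `grid`, `reduction`, and `allowed` are illustrative and not taken from the tuning script.

```julia
# Assumption: WARPS_M and WARPS_N are each drawn from these options.
options = (1, 2, 4, 8)
grid = [(m, n) for m in options for n in options]   # 16 combinations

# Percentage of the grid that is dropped when only configurations
# satisfying `keep` are tuned.
reduction(keep) = 100 * count(!keep, grid) / length(grid)

# The five configurations listed above.
allowed = [(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)]

at_least_4 = reduction(c -> prod(c) >= 4)   # drops (1,1), (1,2), (2,1)
at_least_8 = reduction(c -> prod(c) >= 8)   # additionally drops (1,4), (4,1), (2,2)
allowlist  = reduction(c -> c in allowed)   # keeps only 5 of 16

println((at_least_4, at_least_8, allowlist))
```

Under these assumptions the result is 18.75, 37.5, and 68.75, matching the figures quoted in this comment.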

@maleadt Thoughts?

thomasfaingnaert · 2024-01-25T22:47:25Z

FWIW, for the 7 GPUs we ran the tuning on (V100, RTX 4070, V100S, RTX6000, A100, RTX 2080 Ti, H100), this is the distribution of (WARPS_M, WARPS_N) for the optimal configurations the tuning script found:

julia> counters = countmap(sizes)
Dict{Any, Int64} with 13 entries:
  (1, 2) => 4
  (8, 4) => 1
  (1, 4) => 21
  (4, 1) => 35
  (2, 1) => 5
  (2, 8) => 5
  (4, 2) => 44
  (2, 2) => 39
  (4, 4) => 21
  (8, 1) => 10
  (2, 4) => 31
  (1, 8) => 3
  (8, 2) => 5

and the percentage of those optimal configurations that would no longer be tested under each of the proposed restrictions:

julia> 100 * sum(v for (k, v) in counters if prod(k) < 4) / sum(v for (k, v) in counters)
4.017857142857143

julia> 100 * sum(v for (k, v) in counters if prod(k) < 8) / sum(v for (k, v) in counters)
46.42857142857143

julia> 100 * sum(v for (k, v) in counters if k ∉ [(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)]) / sum(v for (k, v) in counters)
56.69642857142857

Maybe we should hold off on this for now...
Though, the question remains: are those configurations truly better than configurations where WARPS_M and WARPS_N are sufficiently large, or did they just happen to be selected while they are truly similar to those configurations?
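
For anyone who wants to re-run these numbers without the original `sizes` vector, here is a minimal sketch that starts from the counts printed above and uses only Base Julia. The helper names `lost` and `by_warps` are made up for this example. It also tallies the optima by total warps per CTA, which may help when judging the restrictions.

```julia
# Counts copied from the countmap output above:
# (WARPS_M, WARPS_N) => number of optimal configurations.
counters = Dict(
    (1, 2) => 4,  (8, 4) => 1,  (1, 4) => 21, (4, 1) => 35, (2, 1) => 5,
    (2, 8) => 5,  (4, 2) => 44, (2, 2) => 39, (4, 4) => 21, (8, 1) => 10,
    (2, 4) => 31, (1, 8) => 3,  (8, 2) => 5,
)

total = sum(values(counters))

# Percentage of optimal configurations excluded when only
# configurations satisfying `keep` are tuned.
lost(keep) = 100 * sum(v for (k, v) in counters if !keep(k)) / total

allowed = [(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)]

println(lost(k -> prod(k) >= 4))
println(lost(k -> prod(k) >= 8))
println(lost(k -> k in allowed))

# Number of optimal configurations per total warp count (WARPS_M * WARPS_N).
by_warps = Dict{Int, Int}()
for (k, v) in counters
    by_warps[prod(k)] = get(by_warps, prod(k), 0) + v
end
println(sort(collect(by_warps)))
```

This only counts how often each shape won. It says nothing about how close the runner-up configurations were, which is the open question above.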

maleadt · 2024-01-26T11:39:53Z

> Though, the question remains: are those configurations truly better than configurations where WARPS_M and WARPS_N are sufficiently large, or did they just happen to be selected while they are truly similar to those configurations?

Maybe we could gate the extended coverage behind a --slow arg or so (or the limited one behind --fast) and evaluate both? Once we figure out the other tuning script issues, that is.
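
A rough sketch of what such gating could look like, purely as an illustration. The function `warp_configs`, the constant `WARP_OPTIONS`, the `min_warps` keyword, and the way `--fast` is read from `ARGS` are all hypothetical and do not reflect how the tuning script is actually structured.

```julia
# Hypothetical sketch: none of these names come from the tuning script.
# Assumed candidate values for WARPS_M and WARPS_N.
const WARP_OPTIONS = (1, 2, 4, 8)

# Return the (WARPS_M, WARPS_N) pairs to tune. With `fast = true`,
# configurations with fewer than `min_warps` warps per CTA are skipped.
function warp_configs(; fast::Bool = false, min_warps::Int = 4)
    configs = [(m, n) for m in WARP_OPTIONS for n in WARP_OPTIONS]
    return fast ? filter(c -> prod(c) >= min_warps, configs) : configs
end

# Treat a `--fast` command-line argument as opting in to the reduced search.
fast = "--fast" in ARGS
configs = warp_configs(fast = fast)
println("Tuning ", length(configs), " warp configurations")
```

Keeping the full search as the default means existing tuning runs stay comparable, and the two modes can be evaluated side by side before deciding which one to keep.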

codecov · 2026-04-20T09:19:02Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 32.04%. Comparing base (3052b52) to head (6776fe1).
⚠️ Report is 22 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #188      +/-   ##
==========================================
- Coverage   34.94%   32.04%   -2.90%     
==========================================
  Files          11       11              
  Lines         933      958      +25     
==========================================
- Hits          326      307      -19     
- Misses        607      651      +44

thomasfaingnaert requested a review from maleadt January 25, 2024 22:28

thomasfaingnaert wants to merge 1 commit into master from tf/reduce-warps