Skip configurations with fewer than 4 warps in tuning#188

Open: thomasfaingnaert wants to merge 1 commit into master from tf/reduce-warps

@thomasfaingnaert
Member

Given the fact that SMs in Volta, Turing, Ampere, and Hopper have four processing blocks, each with one warp scheduler, I don't think it makes sense to try configurations during tuning where the number of warps per CTA is less than 4. This reduces the search space by 18.75% (well, assuming that each of the options of WARPS_M and WARPS_N amounts to the same number of valid kernels, which is probably not true...).

We could also bump the limit to 8, so we allocate at least 2 warps per processing block. That allows the SM to switch to another warp if one warp stalls. This would reduce the search space by another 18.75%.

We might even want to restrict this further. For example, I don't think a configuration like WARPS_M = 1, WARPS_N = 8 makes sense, as that has reduced data reuse across the N dimension compared to the configuration WARPS_M = 2, WARPS_N = 4, so we might also only want to try the following configurations:

  • 2 x 4
  • 4 x 2
  • 4 x 4
  • 8 x 4
  • 4 x 8

That would reduce the search space by 68.75% in total.
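
The percentages above can be checked with a short sketch. It assumes that WARPS_M and WARPS_N each take a value in (1, 2, 4, 8), which is not stated here but matches the 3/16 = 18.75% figure, and that every combination counts equally. The names `warp_options`, `configs`, and `reduction` are illustrative only and are not taken from the tuning script.

```julia
# Assumption: each of WARPS_M and WARPS_N is one of 1, 2, 4, 8 (16 combinations).
warp_options = (1, 2, 4, 8)
configs = [(m, n) for m in warp_options for n in warp_options]

# Percentage of the configurations that the filter `keep` would drop.
reduction(keep) = 100 * count(c -> !keep(c), configs) / length(configs)

at_least_4 = c -> prod(c) >= 4      # at least 4 warps per CTA
at_least_8 = c -> prod(c) >= 8      # at least 8 warps per CTA
shortlist  = c -> c in [(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)]

println(reduction(at_least_4))                           # 18.75
println(reduction(at_least_8) - reduction(at_least_4))   # 18.75
println(reduction(shortlist))                            # 68.75
```

This only counts (WARPS_M, WARPS_N) pairs; as noted above, the number of valid kernels per pair probably differs, so the real reduction may deviate from these numbers.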

@maleadt Thoughts?

@thomasfaingnaert
Member Author

thomasfaingnaert commented Jan 25, 2024

FWIW, for the 7 GPUs we ran the tuning on (V100, RTX 4070, V100S, RTX6000, A100, RTX 2080 Ti, H100), this is the distribution of (WARPS_M, WARPS_N) for the optimal configurations the tuning script found:

julia> counters = countmap(sizes)
Dict{Any, Int64} with 13 entries:
  (1, 2) => 4
  (8, 4) => 1
  (1, 4) => 21
  (4, 1) => 35
  (2, 1) => 5
  (2, 8) => 5
  (4, 2) => 44
  (2, 2) => 39
  (4, 4) => 21
  (8, 1) => 10
  (2, 4) => 31
  (1, 8) => 3
  (8, 2) => 5

and the percentage of optimal configurations that would no longer be tested under each proposed restriction:

julia> 100 * sum(v for (k, v) in counters if prod(k) < 4) / sum(v for (k, v) in counters)
4.017857142857143

julia> 100 * sum(v for (k, v) in counters if prod(k) < 8) / sum(v for (k, v) in counters)
46.42857142857143

julia> 100 * sum(v for (k, v) in counters if k ∉ [(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)]) / sum(v for (k, v) in counters)
56.69642857142857
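
For reference, a self-contained sketch that rebuilds the same numbers from the counts above, and additionally groups the optimal configurations by total warps per CTA. The helper name `excluded_share` and the `by_warps` grouping are illustrative additions, not part of the tuning script; `init` on `sum` assumes Julia 1.6 or later.

```julia
# Counts copied from the countmap output above.
counters = Dict(
    (1, 2) => 4,  (8, 4) => 1,  (1, 4) => 21, (4, 1) => 35, (2, 1) => 5,
    (2, 8) => 5,  (4, 2) => 44, (2, 2) => 39, (4, 4) => 21, (8, 1) => 10,
    (2, 4) => 31, (1, 8) => 3,  (8, 2) => 5,
)
total = sum(values(counters))

# Percentage of optimal configurations that the filter `keep` would exclude.
excluded_share(keep) =
    100 * sum((v for (k, v) in counters if !keep(k)); init = 0) / total

shortlist = [(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)]
println(excluded_share(k -> prod(k) >= 4))     # about 4.02
println(excluded_share(k -> prod(k) >= 8))     # about 46.43
println(excluded_share(k -> k in shortlist))   # about 56.70

# How often each total warp count (WARPS_M * WARPS_N) was optimal.
by_warps = Dict{Int, Int}()
for (k, v) in counters
    by_warps[prod(k)] = get(by_warps, prod(k), 0) + v
end
for w in sort(collect(keys(by_warps)))
    println(w, " warps: ", by_warps[w])
end
```

Grouped this way, configurations with exactly 4 warps account for 95 of the 224 optima, which is why raising the limit to 8 excludes so many of them.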

Maybe we should hold off on this for now...
Though, the question remains: are those configurations truly better than configurations where WARPS_M and WARPS_N are sufficiently large, or did they just happen to be selected while they are truly similar to those configurations?

@maleadt
Member

maleadt commented Jan 26, 2024

> Though, the question remains: are those configurations truly better than configurations where WARPS_M and WARPS_N are sufficiently large, or did they just happen to be selected while they are truly similar to those configurations?

Maybe we could gate the extended coverage behind a --slow arg or so (or the limited one behind --fast) and evaluate both? Once we figure out the other tuning script issues, that is.
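
A minimal sketch of what such a gate might look like, assuming the `--fast` variant and the 4-warp threshold from the PR title. Everything here is hypothetical: `WARP_OPTIONS`, `warp_configs`, and the flag handling are made up for illustration and do not reflect how the actual tuning script parses arguments or enumerates configurations.

```julia
# Hypothetical: assumed option set for both WARPS_M and WARPS_N.
const WARP_OPTIONS = (1, 2, 4, 8)

# Return the (WARPS_M, WARPS_N) pairs to tune. With "--fast" in `args`,
# configurations with fewer than 4 warps per CTA are skipped; otherwise
# the full space is kept.
function warp_configs(args::AbstractVector{<:AbstractString})
    min_warps = "--fast" in args ? 4 : 1
    return [(m, n) for m in WARP_OPTIONS for n in WARP_OPTIONS
            if m * n >= min_warps]
end

configs = warp_configs(ARGS)
println("Tuning over ", length(configs), " warp configurations")
```

Running both modes on the same GPU and comparing the best timings would show whether the skipped configurations are actually faster or merely comparable.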

@codecov

codecov Bot commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 32.04%. Comparing base (3052b52) to head (6776fe1).
⚠️ Report is 22 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #188      +/-   ##
==========================================
- Coverage   34.94%   32.04%   -2.90%     
==========================================
  Files          11       11              
  Lines         933      958      +25     
==========================================
- Hits          326      307      -19     
- Misses        607      651      +44     
