From 6776fe115a6ddf5f4fc8026fa4c70b741319bcac Mon Sep 17 00:00:00 2001
From: Thomas Faingnaert
Date: Thu, 25 Jan 2024 23:05:28 +0100
Subject: [PATCH] Skip configurations with fewer than 4 warps in tuning

Given the fact that SMs in Volta, Turing, Ampere, and Hopper have four
processing blocks, each with one warp scheduler, I don't think it makes
sense to try configurations during tuning where the number of warps per
CTA is less than 4.

This reduces the search space by 18.75% (well, assuming that each of the
options of WARPS_M and WARPS_N amounts to the same number of valid
kernels, which is probably not true...).

We could also bump the limit to 8, so we allocate at least 2 warps per
processing block. That allows the SM to switch to another warp if one
warp stalls. This would reduce the search space by another 18.75%.

We might even want to restrict this further. For example, I don't think
a configuration like WARPS_M = 1, WARPS_N = 8 makes sense, as that has
reduced data reuse across the M dimension compared to the configuration
WARPS_M = 2, WARPS_N = 4, so we might also only want to try the
following configurations:

- 2 x 4
- 4 x 2
- 4 x 4
- 8 x 4
- 4 x 8

That would reduce the search space by 68.75% in total.
---
 tuning/tune-wmma.jl | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/tuning/tune-wmma.jl b/tuning/tune-wmma.jl
index 7c05a39c..8c319b59 100644
--- a/tuning/tune-wmma.jl
+++ b/tuning/tune-wmma.jl
@@ -115,6 +115,10 @@ function generate_configs()
         ],
         kernel_str in ["singlestage", "pipelined"]
 
+        if WARPS_M * WARPS_N < 4
+            continue
+        end
+
         push!(all_configs, Dict(
             :transpose_a => transpose_a,
             :transpose_b => transpose_b,
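
The percentages quoted in the commit message can be checked with a small sketch. It assumes that WARPS_M and WARPS_N each range over 1, 2, 4, and 8. That list is not shown in this patch; it is inferred from the 18.75% figure, which equals 3 out of 16. The names `warp_options`, `all_shapes`, and `percent_skipped` are hypothetical and do not come from tune-wmma.jl. Like the commit message, the sketch counts warp shapes only and ignores how many valid kernels each shape yields.

```julia
# Assumption: WARPS_M and WARPS_N each take these values. The real lists in
# tuning/tune-wmma.jl are not visible in this patch.
warp_options = [1, 2, 4, 8]

# Every (WARPS_M, WARPS_N) combination, 16 in total under the assumption above.
all_shapes = [(m, n) for m in warp_options for n in warp_options]

# Percentage of warp shapes that a `keep` predicate would skip.
percent_skipped(keep) = 100 * count(!keep, all_shapes) / length(all_shapes)

# The filter added by this patch: at least 4 warps per CTA.
at_least_4(shape) = shape[1] * shape[2] >= 4

# The stricter variant discussed in the message: at least 8 warps per CTA.
at_least_8(shape) = shape[1] * shape[2] >= 8

# The five shapes listed at the end of the message.
shortlist = [(2, 4), (4, 2), (4, 4), (8, 4), (4, 8)]
in_shortlist(shape) = shape in shortlist

println(percent_skipped(at_least_4))    # 18.75
println(percent_skipped(at_least_8))    # 37.5, i.e. another 18.75 on top
println(percent_skipped(in_shortlist))  # 68.75
```

Under that assumption, the limit of 4 drops the shapes 1 x 1, 1 x 2, and 2 x 1, and the limit of 8 additionally drops 1 x 4, 4 x 1, and 2 x 2.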