autotuner: cap tile size for imbalanced 2D grid dims #2102
umechand-amd wants to merge 3 commits into
@choijon5 does your new dashboard support benchmark runs with comparisons? I'd like to get some perf data before merging this.
Yes, @umechand-amd, please go to helionlang.com/dashboard -> Compare tab -> specify your branch/commit as Target to compare against main.
Okay, this branch is in my forked repo; let me push it to a branch on the main Helion repo.
Any updated data on the perf for this? Can we tweak the heuristics to avoid those regressions?

Some feedback from ./scripts/autoreview.py:

Correctness
1.
2.
3. Heuristic applied unconditionally for every backend (

Test Bugs
4. New test class is unreachable when the file is run as a script (
5. Tautological assertion in
    m_max_after = spec.block_sizes.block_id_lookup(0).max_size
    ...
    self.assertEqual(m_max_after, spec.block_sizes.block_id_lookup(0).max_size)
   Compares a value to its own source, so it always passes. The intent is to verify the M-tile is unchanged; needs a
6. Hardware-dependent test fragility.

Code Quality
7. Redundant clamp at
8. Local imports inside test method (
9.

Minor / Stylistic
10. Comment-vs-code drift in the docstring (
11.
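For the tautological assertion in item 5, one common way to make such a check meaningful is to snapshot the value before the heuristic runs and compare afterwards. The sketch below is only an illustration: `BlockSpec` and `cap_n_tile` are hypothetical stand-ins, not Helion classes or the function under test in this PR.

```python
from dataclasses import dataclass


@dataclass
class BlockSpec:
    """Hypothetical stand-in for a per-dimension block-size spec."""

    max_size: int


def cap_n_tile(m_spec: BlockSpec, n_spec: BlockSpec, cap: int) -> None:
    """Hypothetical heuristic: lower only the N-tile maximum."""
    n_spec.max_size = min(n_spec.max_size, cap)


def check_m_tile_unchanged() -> bool:
    m_spec, n_spec = BlockSpec(1024), BlockSpec(8192)
    m_max_before = m_spec.max_size  # snapshot BEFORE the heuristic runs
    cap_n_tile(m_spec, n_spec, cap=512)
    # Comparing against the snapshot fails if the M-tile is ever modified,
    # unlike comparing a value read after the call with itself.
    return m_spec.max_size == m_max_before and n_spec.max_size == 512
```

Reading the value after the call and comparing it to a second read of the same attribute can never fail, which is why the snapshot has to be taken first.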
Thanks for the review. I have fixed all the items. Code quality + minor (5): ✅ all fixed (redundant clamp dropped, local imports moved to top of file, Verification:
@jansel I think the last time I looked at the dashboard was when the Helion CI for M30 was broken, so we did not get a complete run for all kernels. I am running the benchmark again with all the latest changes.
@umechand-amd @jansel @choijon5 I wonder whether, instead of implementing this as a hard constraint on the search space, we could encode this heuristic by providing seed configs to the initial population, i.e. for imbalanced shapes this heuristic would insert balanced tile configs like This could make use of compiler seed configs in #2250. I tested your heuristic in #2276 for imbalanced matmul on H100, and indeed found a 1.34x improvement when seeding with
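A rough sketch of the seeding idea, as opposed to a hard cap, might look like the following. It is not the #2250 or #2276 implementation: `sample_config`, `initial_population`, and the power-of-two bounds are assumptions made for illustration, and the seed `[64, 64, 256]` is borrowed from the PR description.

```python
import random
from typing import List, Sequence

Config = List[int]


def sample_config(rng: random.Random, max_sizes: Sequence[int], min_size: int = 16) -> Config:
    """Sample each block size independently, uniform in log2 space.

    Assumes min_size and every entry of max_sizes are powers of two.
    """
    lo = min_size.bit_length() - 1
    return [1 << rng.randint(lo, m.bit_length() - 1) for m in max_sizes]


def initial_population(
    rng: random.Random,
    max_sizes: Sequence[int],
    size: int,
    seeds: Sequence[Config] = (),
) -> List[Config]:
    """Start from the seed configs, then fill the rest with random samples."""
    population = [list(seed) for seed in seeds][:size]
    while len(population) < size:
        population.append(sample_config(rng, max_sizes))
    return population


# Hypothetical usage: seed a balanced tile config for an imbalanced shape.
population = initial_population(
    random.Random(0), max_sizes=[1024, 8192, 1024], size=8, seeds=[[64, 64, 256]]
)
```

With seeding, the full search space stays available to the tuner, while the balanced config is guaranteed to be evaluated at least once.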
Thanks for the feedback. Let me take a look at #2276.




For skinny GEMM shapes (e.g. M=1024, N=8192), the random sampler rarely explores small balanced tile configs like [64, 64, 256] that Inductor uses, because block sizes are sampled independently in log2 space. Add lower_max_for_imbalanced_grid_dims(), which caps the larger grid dim's tile max at max(64, next_power_of_2(min_dim)//2) when max(M,N) >= 4*min(M,N), removing provably bad large tiles from the search space.
Validated on MI350X: avg helion gemm speedup 0.59x -> 0.87x.
Worst case (M=1024, N=8192): 0.39x -> 0.97x (vs torch_compile 0.96x).
All 1D-grid kernels unaffected. flash_attention/int4_gemm accuracy failures are pre-existing (confirmed in CI run #24748289616).
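
The capping rule stated in the description can be sketched in a few lines. This only illustrates the formula max(64, next_power_of_2(min_dim)//2) and the 4x imbalance test; the function name `capped_tile_max`, its parameters, and the `next_power_of_2` helper shown here are assumptions for illustration, not the code in this PR.

```python
from typing import Optional


def next_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def capped_tile_max(m: int, n: int, ratio: int = 4, floor: int = 64) -> Optional[int]:
    """Cap for the larger dim's tile max, or None when the shape is balanced.

    A shape counts as imbalanced when max(m, n) >= ratio * min(m, n).
    """
    small, large = min(m, n), max(m, n)
    if large < ratio * small:
        return None  # balanced enough: leave the search space untouched
    return max(floor, next_power_of_2(small) // 2)
```

For the worst case quoted above (M=1024, N=8192), this sketch gives a cap of 512 on the N tile, while a shape such as M=1024, N=2048 is left uncapped.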