
autotuner: cap tile size for imbalanced 2D grid dims #2102

Closed
umechand-amd wants to merge 3 commits into
pytorch:mainfrom
umechand-amd:umechand/update_gemm_serach_space

Conversation

@umechand-amd
Collaborator

For skinny GEMM shapes (e.g. M=1024, N=8192), the random sampler rarely explores small balanced tile configs like [64, 64, 256] that Inductor uses, because block sizes are sampled independently in log2 space. Add lower_max_for_imbalanced_grid_dims() which caps the larger grid dim's tile max at max(64, next_power_of_2(min_dim)//2) when max(M,N) >= 4*min(M,N), removing provably bad large tiles from the search space.
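The capping rule described above can be sketched as follows; the function name and signature here are illustrative stand-ins, not the actual `lower_max_for_imbalanced_grid_dims()` implementation:

```python
# Sketch of the tile-size cap for imbalanced 2D grids; names are
# illustrative, not the real Helion API.
def next_power_of_2(n: int) -> int:
    return 1 << (n - 1).bit_length()

def capped_tile_max(m: int, n: int, current_max: int) -> int:
    """Cap the larger grid dim's tile max when the 2D grid is imbalanced."""
    lo, hi = min(m, n), max(m, n)
    if hi < 4 * lo:
        return current_max  # balanced enough: leave the search space alone
    # Provably bad large tiles on the long dim are removed from the space.
    cap = max(64, next_power_of_2(lo) // 2)
    return min(current_max, cap)
```

For the worst-case shape quoted below (M=1024, N=8192), this would lower the N-tile max from 1024 to 512 while leaving balanced shapes untouched.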

Validated on MI350X: avg helion gemm speedup 0.59x -> 0.87x.
Worst case (M=1024, N=8192): 0.39x -> 0.97x (vs torch_compile 0.96x).
All 1D-grid kernels unaffected. flash_attention/int4_gemm accuracy failures are pre-existing (confirmed in CI run #24748289616).

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Apr 24, 2026
@umechand-amd umechand-amd marked this pull request as ready for review April 24, 2026 18:25
@jansel
Contributor

jansel commented Apr 25, 2026

@choijon5 does your new dashboard support benchmark runs with comparisons? I'd like to get some perf data before merging this.

@choijon5
Contributor

choijon5 commented Apr 25, 2026

@choijon5 does your new dashboard support benchmark runs with comparisons? I'd like to get some perf data before merging this.

Yes, @umechand-amd please go to helionlang.com/dashboard -> Compare tab -> specify your branch/commit on Target to compare against main.
Let me know if you have issues as the dashboard is pretty new.

@umechand-amd
Collaborator Author

@choijon5 does your new dashboard support benchmark runs with comparisons? I'd like to get some perf data before merging this.

Yes, @umechand-amd please go to helionlang.com/dashboard -> Compare tab -> specify your branch/commit on Target to compare against main. Let me know if you have issues as the dashboard is pretty new.

Okay this branch is in my forked repo, let me push this to a branch on the main Helion repo.

@umechand-amd umechand-amd force-pushed the umechand/update_gemm_serach_space branch 2 times, most recently from 22b3630 to 89ae4ad on April 26, 2026 18:14
@umechand-amd
Collaborator Author

umechand-amd commented Apr 27, 2026

I was able to run the benchmarks and saw 3 improvements and 3 regressions. On the surface, all but 1 improvement seem related to this PR, but let me debug and find out.

@umechand-amd umechand-amd force-pushed the umechand/update_gemm_serach_space branch from 89ae4ad to 97ae502 on May 5, 2026 07:02
@umechand-amd umechand-amd enabled auto-merge (squash) May 5, 2026 07:02
@umechand-amd umechand-amd requested a review from jansel May 5, 2026 07:12
@umechand-amd umechand-amd force-pushed the umechand/update_gemm_serach_space branch from 97ae502 to 4879c3f on May 5, 2026 16:46
@umechand-amd umechand-amd force-pushed the umechand/update_gemm_serach_space branch 3 times, most recently from dab0994 to 654e5ca on May 6, 2026 17:22
@jansel
Contributor

jansel commented May 6, 2026

Any updated data on the perf for this? Can we tweak the heuristics to avoid those regressions?

Some feedback from ./scripts/autoreview.py:

Correctness

1. update_max doesn't preserve min_size / autotuner_min invariants (helion/autotuner/config_spec.py:1262).
spec.update_max(...) only clamps against the current max_size; it does not check spec.min_size or spec.autotuner_min. If lower_max_for_imbalanced_grid_dims lowers max_size below either of those, the invariant breaks and _fragment() (line 1615) computes low = min(max(min, autotuner_min), max_size) == max_size, collapsing the search range to a single value. This can silently undo previously-applied constraints (e.g., dot/tcgen05 minimum tile sizes) and let autotuning sample tiles below the intended floor. Cap should not be applied below max(spec.min_size, spec.autotuner_min).

2. config_spec.grid_block_ids is a flat union, not a single 2-D grid (helion/autotuner/config_spec.py:1230).
device_ir.grid_block_ids is list[list[int]] (per top-level grid root), but config_spec.grid_block_ids is the flattened concatenation populated by every hl.tile(...)/grid loop (see helion/language/loops.py:412-413). A kernel with two independent 1-D grid roots will have len(config_spec.grid_block_ids) == 2 and be erroneously capped as if it were a single skinny 2-D grid. The check should consult device_ir.grid_block_ids (which preserves per-root grouping) or otherwise verify the two ids belong to the same root grid.

3. Heuristic applied unconditionally for every backend (helion/_compiler/device_ir.py:1960).
The hook is wired into lower_to_device_ir for all backends, but num_compute_units() (helion/_compat.py:420) only inspects CUDA and falls back to a hard-coded 128 otherwise. Non-CUDA backends (Pallas/TPU, XPU) get a CUDA-shaped cap with no backend-specific validation or tests. Either gate this on the Triton/CUDA backend or thread a backend-aware compute-unit count through.

Test Bugs

4. New test class is unreachable when the file is run as a script (test/test_best_available.py:1144-1222).
TestLowerMaxForImbalancedGridDims is defined after the existing if __name__ == "__main__": unittest.main() block at line 1144, so python test_best_available.py runs unittest.main() before the class is defined and never registers it. Pytest still discovers it, but the class should be moved above the if __name__ == "__main__": block.

5. Tautological assertion in test_skinny_n_caps_n_tile (test/test_best_available.py:1172-1176).

m_max_after = spec.block_sizes.block_id_lookup(0).max_size
...
self.assertEqual(m_max_after, spec.block_sizes.block_id_lookup(0).max_size)

Compares a value to its own source — always passes. The intent is to verify the M-tile is unchanged; needs a m_max_before capture and comparison against that.

6. Hardware-dependent test fragility.
test_skinny_n_caps_n_tile and test_skinny_m_caps_m_tile assert assertLess(n_max_after, n_max_before), which only holds when num_compute_units() > 1 (since min_blocks_per_dim = ceil(sqrt(n_cus)) must be ≥ 2 to lower the max). On hardware where num_compute_units() returns 1, these tests would fail without a skip marker. Either monkeypatch num_compute_units or document/enforce the hardware floor.

Code Quality

7. Redundant clamp at helion/autotuner/config_spec.py:1262.
spec.update_max(min(spec.max_size, max_tile)): update_max already does min(value, self.max_size) internally (line 1588), so spec.update_max(max_tile) is equivalent.

8. Local imports inside test method (test/test_best_available.py:1164-1166).
import math and from helion._compat import num_compute_units are inside test_skinny_n_caps_n_tile. CLAUDE.md says avoid local scope imports when possible; move to top-of-file imports.

9. or-default pattern in test helper (test/test_best_available.py:1151-1158).
max_size=m_max or m would replace a falsy 0 with m. Probably benign here, but m_max if m_max is not None else m is the safer idiom.

Minor / Stylistic

10. Comment-vs-code drift in the docstring (config_spec.py:1232).
The phrase "(mirroring raise_grid_block_minimums which uses n_cus*64)" hard-codes a magic factor from a sibling method into prose; if raise_grid_block_minimums is later retuned the comment will go stale.

11. if max_tile < 2: continue quietly drops the cap (config_spec.py:1259).
On extreme hint/CU ratios, the cap may resolve to 1 and silently leave spec.max_size untouched. Probably fine in practice, but worth a brief inline note so a future reader doesn't think it's a bug.

@umechand-amd
Copy link
Copy Markdown
Collaborator Author


Thanks for the review. I have fixed all the items.
Issue 1: update_max floor — added floor = max(spec.min_size, spec.autotuner_min) check; skip the dim if cap would collapse below floor.
Issue 2: per-root grid grouping — the method now accepts grid_root_block_ids: list[list[int]] from device_ir.grid_block_ids. A new regression test, test_two_independent_1d_grids_unchanged, guards against the original bug.
Issue 3: backend gate — restricted to triton/cute. Investigated tileir/pallas/metal: no shipped kernel uses a 2-D grid meeting the 8x imbalance threshold, and num_compute_units() was returning a fictional 128 fallback for non-CUDA hosts anyway, so this is a correctness improvement. The floor-respect logic safely no-ops when the cap would conflict with tcgen05 hard minimums.
Test bugs (3): all fixed (4: class moved above the __main__ block, 5: tautology fixed with a captured m_max_before, 6: num_compute_units monkeypatched to 128 for hardware independence).

Code quality + minor (5): ✅ all fixed (redundant clamp dropped, local imports moved to top of file, or-default replaced with is None check, docstring magic factor detuned, inline comment added for max_tile < 2).

Verification:

  • pytest test/test_best_available.py::TestLowerMaxForImbalancedGridDims -vv — 8/8 pass (6 original + 2 new regression tests).
  • Full pytest test/ -n4 on MI350X (gfx950) — 1355 passed, 0 failed, 555 skipped (all skips are environmental: no TPU, no
    CUTLASS, ROCm hw caps).
  • Re-ran ./scripts/autoreview.py — no remaining correctness issues on the staged diff.
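The hardware-independence fix for item 6 can be sketched like this; the `Compat` class is a stub standing in for `helion._compat` (in the real test one would patch the actual module attribute):

```python
import math
from unittest import mock

class Compat:
    """Stub for helion._compat; the real test patches the module attribute."""
    @staticmethod
    def num_compute_units() -> int:
        return 1  # pretend worst-case hardware

def min_blocks_per_dim(n_cus: int) -> int:
    # The cap only lowers max_size when ceil(sqrt(n_cus)) >= 2,
    # which fails on a 1-CU host: hence the monkeypatch.
    return math.ceil(math.sqrt(n_cus))

# Pin the CU count so assertions hold on any machine.
with mock.patch.object(Compat, "num_compute_units", return_value=128):
    assert Compat.num_compute_units() == 128
    assert min_blocks_per_dim(Compat.num_compute_units()) >= 2

assert Compat.num_compute_units() == 1  # patch undone outside the context
```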

@umechand-amd
Collaborator Author


@jansel I think when I last looked at the dashboard, the Helion CI for M30 was broken and we did not get a complete run for all kernels. I am running the benchmark again with all the latest changes.

@umechand-amd umechand-amd force-pushed the umechand/update_gemm_serach_space branch from 654e5ca to 3ab40d6 on May 7, 2026 06:06
@ethche
Contributor

ethche commented May 7, 2026

@umechand-amd @jansel @choijon5 I wonder if instead of implementing this as a hard constraint on the search space, could we instead encode this heuristic by providing seed configs to the initial population, i.e. for imbalanced shapes this heuristic will insert balanced tile configs like block_sizes = [64, 64, 256]. This way, we are very likely to see performance gains with little risk of regressions because we are not constraining the autotuner for kernels where the heuristic is sub-optimal.

This could make use of compiler seed configs in #2250. I tested your heuristic in #2276 for imbalanced matmul on H100, and indeed found a 1.34x improvement when seeding with block_sizes = [64, 64, 256]
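The seeding approach might look roughly like this; `seed_population` and the config-dict shape are hypothetical illustrations, not the actual #2250 API:

```python
# Sketch of the seeding idea: instead of shrinking the search space,
# prepend known-good balanced tiles to the initial autotuner population.
def seed_population(m: int, n: int, random_configs: list[dict]) -> list[dict]:
    seeds: list[dict] = []
    if max(m, n) >= 4 * min(m, n):  # imbalanced 2D grid
        # Inductor-style balanced tile for skinny GEMMs.
        seeds.append({"block_sizes": [64, 64, 256]})
    # Seeds merely compete with random samples; a bad seed just loses
    # the race, so there is little risk of regressions.
    return seeds + random_configs

pop = seed_population(1024, 8192, [{"block_sizes": [256, 256, 64]}])
```

The design trade-off versus a hard cap is exactly the one raised above: the heuristic biases the search rather than constraining it.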

@umechand-amd
Collaborator Author

umechand-amd commented May 7, 2026

@umechand-amd @jansel @choijon5 I wonder if instead of implementing this as a hard constraint on the search space, could we instead encode this heuristic by providing seed configs to the initial population, i.e. for imbalanced shapes this heuristic will insert balanced tile configs like block_sizes = [64, 64, 256]. This way, we are very likely to see performance gains with little risk of regressions because we are not constraining the autotuner for kernels where the heuristic is sub-optimal.

This could make use of compiler seed configs in #2250. I tested your heuristic in #2276 for imbalanced matmul on H100, and indeed found a 1.34x improvement when seeding with block_sizes = [64, 64, 256]

Thanks for the feedback. Let me take a look at #2276
Okay, this makes sense: the skinny matmuls can be handled better by seeding the autotuner hints with the right configs. I am closing this PR and have opened another PR to handle this.

auto-merge was automatically disabled May 11, 2026 05:50

Pull request was closed
