[AMD][RDNA4]Fix RDNA4 (gfx1201 / Wave32) CI Failures#2210
Conversation
|
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀 |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds gfx12 support alongside gfx11, strengthens RDNA generation rejection tests, makes HIP reductions and CumSum dispatch on runtime wavefront size, adds FP8 handling in the TVM FFI, and updates WMMA shared-memory indexing. ChangesRDNA Generation 11/12 Support Expansion
HIP CumSum & Warp-size Dispatch
TVM FFI FP8 handling
WMMA RDNA ldmatrix indexing changes
Language test adjustments
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/tl_templates/hip/reduce.h`:
- Around line 151-152: The static_assert allowing threads==32 must also enforce
that threads is at least the hardware wavefront size to avoid wave64/threads32
mismatches; update the check that currently lists allowed thread counts (the
static_assert containing "threads == 1024 or ... or threads == 32") to
additionally require threads >= __builtin_amdgcn_wavefrontsize(), or replace it
with a combined condition (allowed sizes AND threads >=
__builtin_amdgcn_wavefrontsize()) so that configurations used by run_seg<T, 64>
never run with fewer active lanes than the wavefront.
- Around line 232-239: The template CumSum2D currently allows thread counts
smaller than the hardware wavefront which leads to TILE_H = threads/SEG becoming
zero and broken shuffles; add a compile-time constraint to enforce threads >=
hardware wavefront size by augmenting the existing static_assert in struct
CumSum2D (affecting run_seg and the TILE_H calculation) to require threads >=
the platform wavefront constant (e.g., TL_WAVE_SZ or equivalent), so invalid
instantiations (like threads=32 on wave64) fail to compile instead of producing
TILE_H==0 and corrupted shuffle behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2b7e4095-87e2-4fc9-a871-03f9485e7bcd
📒 Files selected for processing (3)
src/tl_templates/hip/reduce.htilelang/jit/adapter/tvm_ffi.pytilelang/rocm/intrinsics/wmma_macro_generator.py
| static_assert(threads == 1024 or threads == 512 or threads == 256 or | ||
| threads == 128 or threads == 64); | ||
| template <typename T, int SEG = 64> | ||
| static TL_DEVICE void run(const T *__restrict__ src, T *__restrict__ dst, | ||
| int N) { | ||
| if (N <= 0) | ||
| return; | ||
| threads == 128 or threads == 64 or threads == 32); |
There was a problem hiding this comment.
Add constraint: threads must be >= wavefront size.
Allowing threads == 32 without enforcing threads >= __builtin_amdgcn_wavefrontsize() causes a critical mismatch on CDNA (wave64). When compiled for CDNA with threads=32, the dispatch at lines 224–228 calls run_seg<T, 64>, but only 32 threads are active. Shuffle operations in run_seg then expect 64 active lanes while only lanes 0–31 are populated, producing incorrect cumsum results.
🔒 Proposed fix: enforce threads >= wavefront size
template <int threads, bool reverse = false> struct CumSum1D {
static_assert(threads == 1024 or threads == 512 or threads == 256 or
- threads == 128 or threads == 64 or threads == 32);
+ threads == 128 or threads == 64 or threads == 32);
+ static_assert(threads >= __builtin_amdgcn_wavefrontsize(),
+ "CumSum1D: threads must be >= hardware wavefront size");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/tl_templates/hip/reduce.h` around lines 151 - 152, The static_assert
allowing threads==32 must also enforce that threads is at least the hardware
wavefront size to avoid wave64/threads32 mismatches; update the check that
currently lists allowed thread counts (the static_assert containing "threads ==
1024 or ... or threads == 32") to additionally require threads >=
__builtin_amdgcn_wavefrontsize(), or replace it with a combined condition
(allowed sizes AND threads >= __builtin_amdgcn_wavefrontsize()) so that
configurations used by run_seg<T, 64> never run with fewer active lanes than the
wavefront.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/tl_templates/hip/reduce.h (1)
51-57:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winEnforce the full-wavefront precondition in
SharedReduceWarp.This path now derives
kWarpSizefrom the target, but it still treatsThreads / kWarpSizeas a valid warp count. On wave64 targets, anySharedReduceWarp<..., 32, ...>instantiation makesnum_warps == 0, so the loop at Line 59 never progresses (dest_idx += num_warps). Please either guard this template against sub-wavefront thread counts or confirm the ROCm lowering path never emits such instantiations.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/tl_templates/hip/reduce.h` around lines 51 - 57, SharedReduceWarp currently derives kWarpSize at runtime and computes num_warps = Threads / kWarpSize, which can be zero on wave64 hardware for templates instantiated with Threads < hardware wave size (e.g., SharedReduceWarp<...,32,...>), so add a guard to prevent num_warps==0: in SharedReduceWarp (and where kWarpSize, tid, warp_id, lane, num_warps are computed) enforce either a compile-time check (static_assert) when __builtin_amdgcn_wavefrontsize() is a constant to require Threads >= kWarpSize and Threads % kWarpSize == 0, or add a runtime fallback that sets num_warps = max(1, Threads / kWarpSize) or returns/handles the single-partition case so the dest_idx loop (which increments by num_warps) never stalls; reference the symbols SharedReduceWarp, kWarpSize, Threads, num_warps, and dest_idx when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@testing/python/language/test_tilelang_language_eager_jit.py`:
- Around line 77-83: The try/except around determine_target("auto",
return_object=True) currently catches Exception and defaults _is_cuda = True,
which masks unexpected errors and incorrectly enables CUDA-only float32 paths;
update the block to catch only specific expected errors (e.g., ValueError,
RuntimeError, OSError) and use a conservative fallback of _is_cuda = False when
detection fails; locate the code around determine_target and target_is_cuda and
change the except Exception to except (ValueError, RuntimeError, OSError) (or
the specific exceptions your detection can raise) and set _is_cuda = False so
in_dtypes selects the non-CUDA-safe list when detection cannot confirm CUDA.
---
Outside diff comments:
In `@src/tl_templates/hip/reduce.h`:
- Around line 51-57: SharedReduceWarp currently derives kWarpSize at runtime and
computes num_warps = Threads / kWarpSize, which can be zero on wave64 hardware
for templates instantiated with Threads < hardware wave size (e.g.,
SharedReduceWarp<...,32,...>), so add a guard to prevent num_warps==0: in
SharedReduceWarp (and where kWarpSize, tid, warp_id, lane, num_warps are
computed) enforce either a compile-time check (static_assert) when
__builtin_amdgcn_wavefrontsize() is a constant to require Threads >= kWarpSize
and Threads % kWarpSize == 0, or add a runtime fallback that sets num_warps =
max(1, Threads / kWarpSize) or returns/handles the single-partition case so the
dest_idx loop (which increments by num_warps) never stalls; reference the
symbols SharedReduceWarp, kWarpSize, Threads, num_warps, and dest_idx when
making the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fcb4a65b-119a-4e0d-ad97-3333a6d265c1
📒 Files selected for processing (3)
src/backend/rocm/op/finalize_reducer.ccsrc/tl_templates/hip/reduce.htesting/python/language/test_tilelang_language_eager_jit.py
This PR fixes several correctness bugs that caused CI failures on AMD RDNA4 (gfx1201, Wave32) hardware.
Changes
Impact: No breaking changes. RDNA3 (gfx11) behavior is unchanged.
Summary by CodeRabbit
New Features
Improvements
Tests