[Codegen][AMDGPU] Add --iree-llvmgpu-use-global-transpose-load flag and ConfigUtils wiring #24469
Draft
nirvedhmeshram wants to merge 3 commits into
…nspose_load for RDNA4

Add TransferReadTransposeToGlobalTransposeLoad pattern to
ROCDLLoadToTransposeLoad that matches:

  vector.transfer_read %src[row, col] : memref<..., global>, vector<1x8xT>
  vector.transpose %read, [1, 0] : vector<1x8xT> to vector<8x1xT>
  vector.transfer_write %transposed, %dst[n, k] : ..., workgroup

and replaces it with amdgpu.global_transpose_load on gfx1200+ (RDNA4).

The hardware 8x8 wave-level transpose (global_load_tr_b128 for bf16/f16)
means each lane's result is written at a different N position within the
K group. The write indices are corrected to produce contiguous K writes:

  N_new = N_base + K_single % N
  K_new = (K_single floordiv N) * N

The pass now dispatches separately for gfx950 (LDS transpose, existing) and
gfx1200+ (global transpose, new), so neither path interferes with the other.

Part of: iree-org#24454
Co-authored-by: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
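The write-index correction in this commit message can be checked with a small scalar model. The sketch below is illustrative only: `WriteIndex`, `correctWriteIndex`, and `groupWidth` are hypothetical names, and it assumes the `N` used as modulus and divisor in the formulas is the transpose group width (8 for the `vector<1x8xT>` case), which the commit message does not state explicitly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for illustration; not a type from the PR.
struct WriteIndex {
  int64_t n;  // corrected N index
  int64_t k;  // corrected K index
};

// Applies N_new = N_base + K_single % G and K_new = (K_single floordiv G) * G.
// Assumption: G (groupWidth) is the transpose group width, and kSingle is
// non-negative, so C++ '/' and '%' agree with floordiv and mod.
inline WriteIndex correctWriteIndex(int64_t nBase, int64_t kSingle,
                                    int64_t groupWidth) {
  assert(groupWidth > 0 && kSingle >= 0);
  return WriteIndex{nBase + kSingle % groupWidth,
                    (kSingle / groupWidth) * groupWidth};
}
```

For example, with `nBase = 16`, `kSingle = 11`, and `groupWidth = 8`, the corrected write lands at `n = 19`, `k = 8`: the offset within the K group moves into the N index, and K is rounded down to the start of its group.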
…nd promotion

Introduces IREEGPU_UseGlobalTransposeLoad, a promotion attribute for RDNA4
(gfx1200+) that drives matmul operands to be loaded via global_load_tr.

The attribute implements both IREEGPU_PromotionAttr and
IREECodegen_LoweringConfigAttrInterface, with tiling sizes [N=8, K=1]
(vectorSize x 1) derived from the element type.

A new transposePromoteOperand path in GPUPromoteMatmulOperands creates a
linalg.generic copy with K-inner thread mapping:
- input map: reads B[K, N] (K-outer, N-inner in memory)
- output map: writes alloc[N, K] (N-outer, K-inner, for contiguous K writes)

This K-inner tiling aligns with global_load_tr's 8x8 wave-level transpose
semantics: 8 consecutive lanes each read 8 contiguous N-elements, and the
hardware transposes so each lane holds a K-direction slice.

The copy op is tagged with UseGlobalTransposeLoadAttr as its lowering config
so the ROCDLLoadToTransposeLoad pass (PR 2) can recognise and lower it.

amdgpu.fat_raw_buffer_cast is stripped before creating the copy because
global_load_tr requires a flat global pointer, not a fat buffer descriptor.

Part of: iree-org#24454
Co-authored-by: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
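The input and output maps of the copy described in this commit amount to a transposing copy from a K-outer, N-inner source into an N-outer, K-inner allocation. The scalar reference below only models that index relationship. It is not the linalg.generic op or its GPU lowering, and `transposeCopy` is a hypothetical name; the `float` element type is chosen purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference model of the copy's index maps (illustrative only).
// src is K x N in row-major order (N-inner), i.e. B[k, n] = src[k * N + n].
// The result is N x K in row-major order (K-inner), i.e. alloc[n, k].
inline std::vector<float> transposeCopy(const std::vector<float>& src,
                                        std::size_t K, std::size_t N) {
  assert(src.size() == K * N);
  std::vector<float> dst(N * K);
  for (std::size_t k = 0; k < K; ++k) {
    for (std::size_t n = 0; n < N; ++n) {
      // Input map reads B[k, n]; output map writes alloc[n, k].
      dst[n * K + k] = src[k * N + n];
    }
  }
  return dst;
}
```

With `K = 2`, `N = 3`, and a source of `{1, 2, 3, 4, 5, 6}`, the result is `{1, 4, 2, 5, 3, 6}`: elements that were adjacent along K in the logical matrix become contiguous in the destination.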
…nd ConfigUtils wiring
Wire UseGlobalTransposeLoadAttr into the kernel configuration pipeline for
RDNA4 (gfx1200+), behind a new hidden flag that defaults to off:
--iree-llvmgpu-use-global-transpose-load
When enabled, ConfigUtils selects UseGlobalTransposeLoadAttr as the promotion
type for:
- RHS operand when !transposedRhs (B matrix is N-inner in memory, so MMA
needs a transpose from K-inner layout)
- LHS operand when transposedLhs (A matrix is K-outer in memory)
Supported element types: bf16, f16, i16, i8, and fp8 variants (f8e4m3fnuz,
f8e5m2fnuz, f8e4m3fn, f8e5m2).
The flag is threaded as a new useGlobalTransposeLoad bool parameter through
setMatmulLoweringConfig, setIGEMMConvolutionLoweringConfig, and
getMatmulOrIGEMMLoweringConfigAndWorkgroupSize so the isRDNA4 promotion
block only fires when the flag is enabled.
Part of: iree-org#24454
Co-authored-by: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
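The selection rule described in this commit can be summarised as a small decision function. This is a minimal sketch of the stated conditions, not the ConfigUtils code: `ElemType`, `Promotion`, `OperandPromotions`, and `selectPromotions` are hypothetical names, `Promotion::Default` stands in for whatever promotion type would otherwise be chosen, and checking the element type per operand is an assumption.

```cpp
// Hypothetical enums for illustration; F32 stands in for any unsupported type.
enum class ElemType {
  BF16, F16, I16, I8, F8E4M3FNUZ, F8E5M2FNUZ, F8E4M3FN, F8E5M2, F32
};
enum class Promotion { Default, UseGlobalTransposeLoad };

// Element types listed as supported in the commit message.
inline bool isSupportedElemType(ElemType t) {
  switch (t) {
  case ElemType::BF16:
  case ElemType::F16:
  case ElemType::I16:
  case ElemType::I8:
  case ElemType::F8E4M3FNUZ:
  case ElemType::F8E5M2FNUZ:
  case ElemType::F8E4M3FN:
  case ElemType::F8E5M2:
    return true;
  default:
    return false;
  }
}

struct OperandPromotions {
  Promotion lhs;
  Promotion rhs;
};

// Picks the promotion type per operand under the conditions stated above.
inline OperandPromotions selectPromotions(bool isRDNA4,
                                          bool useGlobalTransposeLoad,
                                          bool transposedLhs,
                                          bool transposedRhs,
                                          ElemType lhsType, ElemType rhsType) {
  OperandPromotions result{Promotion::Default, Promotion::Default};
  // The block only fires on RDNA4 with the flag enabled.
  if (!isRDNA4 || !useGlobalTransposeLoad)
    return result;
  // LHS qualifies when transposedLhs is set (A is K-outer in memory).
  if (transposedLhs && isSupportedElemType(lhsType))
    result.lhs = Promotion::UseGlobalTransposeLoad;
  // RHS qualifies when transposedRhs is not set (B is N-inner in memory).
  if (!transposedRhs && isSupportedElemType(rhsType))
    result.rhs = Promotion::UseGlobalTransposeLoad;
  return result;
}
```

Under this model, a plain bf16 matmul with neither operand transposed selects the global transpose load for the RHS only, and turning the flag off leaves both operands on their default promotion.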
Part of #24454. Stacks on #24457, #24467, #24468.
Summary
Wires UseGlobalTransposeLoadAttr into the kernel configuration pipeline for
RDNA4 (gfx1200+), behind a new hidden flag that defaults to off:
--iree-llvmgpu-use-global-transpose-load
What the flag does
When enabled, ConfigUtils selects UseGlobalTransposeLoadAttr as the
promotion type for matmul operands that require a transpose relative to the
MMA register layout:
- RHS (!transposedRhs): B matrix is N-inner in memory; MMA expects K-inner
- LHS (transposedLhs): A matrix is K-outer in memory; MMA expects M-inner
Supported element types: bf16, f16, i16, i8, and fp8 variants (f8e4m3fnuz,
f8e5m2fnuz, f8e4m3fn, f8e5m2).
Implementation
The flag flows as a new useGlobalTransposeLoad bool parameter through:
- setMatmulLoweringConfig
- setIGEMMConvolutionLoweringConfig
- getMatmulOrIGEMMLoweringConfigAndWorkgroupSize
The isRDNA4 promotion block only fires when both isRDNA4 and
useGlobalTransposeLoad are true.
This is entirely separate from the existing useDirectLoad (DMA) path: the two
cannot be active simultaneously, because DMA requires promotionArray.empty().
Test
config_tile_and_fuse_gfx1201.mlir gains a new --check-prefix=GTL run that
verifies promotion_types = [..., #iree_gpu.use_global_transpose_load] appears
on the RHS for a large bf16/f16 matmul when the flag is enabled.