[Codegen][AMDGPU] Add UseGlobalTransposeLoad promotion attr and operand promotion for RDNA4 #24468
Draft
nirvedhmeshram wants to merge 2 commits into
…nspose_load for RDNA4

Add TransferReadTransposeToGlobalTransposeLoad pattern to ROCDLLoadToTransposeLoad that matches:

  vector.transfer_read %src[row, col] : memref<..., global>, vector<1x8xT>
  vector.transpose %read, [1, 0] : vector<1x8xT> to vector<8x1xT>
  vector.transfer_write %transposed, %dst[n, k] : ..., workgroup

and replaces it with amdgpu.global_transpose_load on gfx1200+ (RDNA4).

The hardware 8x8 wave-level transpose (global_load_tr_b128 for bf16/f16) means each lane's result is written at a different N position within the K group. The write indices are corrected to produce contiguous K writes:

  N_new = N_base + K_single % N
  K_new = (K_single floordiv N) * N

The pass now dispatches separately for gfx950 (LDS transpose, existing) and gfx1200+ (global transpose, new), so neither path interferes with the other.

Part of: iree-org#24454

Co-authored-by: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
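The write-index correction in the commit message above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the pass's code: the function name and the `group` parameter are hypothetical, and it assumes the modulus written as `N` in the formula is the 8-element transpose group width.

```python
def corrected_write_indices(n_base, k_single, group=8):
    """Illustrative model of the corrected (N, K) write position.

    Assumption: `group` stands for the modulus written as `N` in the
    commit message, taken here to be the 8-wide transpose group.
    """
    # N_new = N_base + (K_single % N): % binds tighter than + in Python too.
    n_new = n_base + k_single % group
    # K_new = (K_single floordiv N) * N: round K down to its group base.
    k_new = (k_single // group) * group
    return n_new, k_new


# Original K indices 8..15 fall in one group: they share a K base of 8
# and fan out over 8 consecutive N positions.
positions = [corrected_write_indices(0, k) for k in range(8, 16)]
```

Under this reading, every lane in a group writes at the same K base, which is what lets each lane's 8-element result land as a contiguous run along K.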
…nd promotion

Introduces IREEGPU_UseGlobalTransposeLoad, a promotion attribute for RDNA4 (gfx1200+) that drives matmul operands to be loaded via global_load_tr. The attribute implements both IREEGPU_PromotionAttr and IREECodegen_LoweringConfigAttrInterface, with tiling sizes [N=8, K=1] (vectorSize x 1) derived from the element type.

A new transposePromoteOperand path in GPUPromoteMatmulOperands creates a linalg.generic copy with K-inner thread mapping:
 - input map: reads B[K, N] (K-outer, N-inner in memory)
 - output map: writes alloc[N, K] (N-outer, K-inner, for contiguous K writes)

This K-inner tiling aligns with global_load_tr's 8x8 wave-level transpose semantics: 8 consecutive lanes each read 8 contiguous N-elements, and the hardware transposes so each lane holds a K-direction slice.

The copy op is tagged with UseGlobalTransposeLoadAttr as its lowering config so the ROCDLLoadToTransposeLoad pass (PR 2) can recognise and lower it.

amdgpu.fat_raw_buffer_cast is stripped before creating the copy because global_load_tr requires a flat global pointer, not a fat buffer descriptor.

Part of: iree-org#24454

Co-authored-by: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
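As a rough picture of what the promoted copy computes, the sketch below models the linalg.generic as two nested Python loops over (d0, d1) = (N, K). It is a semantic model only: the function name is hypothetical, and it says nothing about how threads or lanes are actually assigned.

```python
def transpose_promote_copy(b):
    """Model of the copy: b is B[K][N]; the result is alloc[N][K]."""
    k_size = len(b)
    n_size = len(b[0])
    alloc = [[None] * k_size for _ in range(n_size)]
    for d0 in range(n_size):        # N: outer dimension of the alloc
        for d1 in range(k_size):    # K: inner dimension, contiguous writes
            # Input is read at (d1, d0), i.e. B[K, N];
            # output is written at (d0, d1), i.e. alloc[N, K].
            alloc[d0][d1] = b[d1][d0]
    return alloc


b = [[1, 2, 3],
     [4, 5, 6]]                      # K = 2 rows, N = 3 columns
alloc = transpose_promote_copy(b)    # N = 3 rows, K = 2 columns
```

Because K is the inner loop, consecutive iterations write consecutive K positions of one alloc row while striding through B.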
Part of #24454. Stacks on #24457 and #24467.
Summary
Introduces IREEGPU_UseGlobalTransposeLoad, a promotion attribute that drives matmul B-operands (and transposed A-operands) to be loaded via global_load_tr on RDNA4 (gfx1200+).
New attribute
UseGlobalTransposeLoadAttr implements:
- IREEGPU_PromotionAttr: hooks into GPUPromoteMatmulOperands
- IREECodegen_LoweringConfigAttrInterface: provides tiling sizes [N=8, K=1] (vectorSize × 1, where vectorSize=8 for bf16/f16)
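One way to read "vectorSize=8 for bf16/f16" is as a 128-bit per-lane load divided by the element width. The sketch below encodes that reading. The 128-bit constant is an assumption taken from the global_load_tr_b128 mnemonic, the function name is hypothetical, and the real helper may handle other element types differently or reject them.

```python
# Assumption: each lane loads 128 bits, as the global_load_tr_b128 name suggests.
LOAD_WIDTH_BITS = 128


def global_transpose_load_tile_sizes(element_bits):
    """Return [N, K] tile sizes as [vectorSize, 1] for a given element width."""
    if element_bits <= 0 or LOAD_WIDTH_BITS % element_bits != 0:
        raise ValueError("element width must evenly divide the load width")
    vector_size = LOAD_WIDTH_BITS // element_bits
    return [vector_size, 1]


tile_sizes_16bit = global_transpose_load_tile_sizes(16)  # bf16 / f16
```

For 16-bit elements this reproduces the [N=8, K=1] sizes stated above.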
K-inner operand promotion
A new transposePromoteOperand path in GPUPromoteMatmulOperands creates a linalg.generic copy with K-inner thread mapping:
- input map (d0, d1) → (d1, d0): reads B[K, N] (K-outer, N-inner in memory)
- output map (d0, d1) → (d0, d1): writes alloc[N, K] (N-outer, K-inner)
This tiling aligns with global_load_tr's 8×8 wave-level hardware transpose: 8 consecutive lanes each read 8 contiguous N-elements, and the hardware transposes so each lane holds a K-direction slice. Consecutive lanes therefore map to K, which is exactly what K-inner thread assignment gives.
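The lane-level effect of the 8×8 transpose can be modelled with a small Python sketch. It is a simplified model under stated assumptions: lane i is taken to read row K = i of an 8×8 tile, the function name is hypothetical, and the exact register layout on hardware is not represented.

```python
def wave_transpose_8x8(loaded):
    """loaded[lane] holds the 8 contiguous N-elements that lane read.

    Returns, per lane, the 8 values held after the transpose.
    """
    if len(loaded) != 8 or any(len(row) != 8 for row in loaded):
        raise ValueError("expected an 8x8 tile")
    # Lane n ends up with element n from each of the 8 source lanes,
    # i.e. a K-direction slice at a fixed N.
    return [[loaded[k][n] for k in range(8)] for n in range(8)]


# Label each element with its (k, n) coordinate to make the slices visible.
tile = [[(k, n) for n in range(8)] for k in range(8)]
after = wave_transpose_8x8(tile)
```

In this model `after[3]` holds `(0, 3), (1, 3), ..., (7, 3)`: a fixed N with K running over the 8 consecutive lanes.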
The copy is tagged with UseGlobalTransposeLoadAttr as its lowering config so ROCDLLoadToTransposeLoad (PR 2) can pattern-match and lower it.
amdgpu.fat_raw_buffer_cast is stripped before creating the copy because global_load_tr requires a flat global pointer, not a fat buffer descriptor.
New helpers
- DerivedConfigUtils: globalTransposeLoadTileSizes(op) returns {vectorSize, 1}
- IREEGPUAttrs: UseGlobalTransposeLoadAttr tiling level implementation
Test
gpu_promote_matmul_operands_global_transpose.mlir verifies:
- [N, K] (transposed) with use_global_transpose_load config