[Codegen][AMDGPU] Add --iree-llvmgpu-use-global-transpose-load flag and ConfigUtils wiring by nirvedhmeshram · Pull Request #24469 · iree-org/iree

nirvedhmeshram · 2026-05-13T01:04:19Z

Part of #24454. Stacks on #24457, #24467, #24468.

Summary

Wires UseGlobalTransposeLoadAttr into the kernel configuration pipeline for
RDNA4 (gfx1200+), behind a new hidden flag that defaults to off:

--iree-llvmgpu-use-global-transpose-load

What the flag does

When enabled, ConfigUtils selects UseGlobalTransposeLoadAttr as the
promotion type for matmul operands that require a transpose relative to the
MMA register layout:

- RHS (!transposedRhs): B matrix is N-inner in memory; MMA expects K-inner
- LHS (transposedLhs): A matrix is K-outer in memory; MMA expects M-inner

Supported element types: bf16, f16, i16, i8, and fp8 variants
(f8e4m3fnuz, f8e5m2fnuz, f8e4m3fn, f8e5m2).
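
The selection rule above can be sketched as a small predicate. This is a Python model for illustration, not the C++ in ConfigUtils: the function name needs_transpose_load, its parameters, and the string spellings of the element types are all assumptions. Flag and target gating is deliberately left out.

```python
# Hypothetical model of the operand selection described above.
# Names and signatures are illustrative only; they are not IREE APIs.
SUPPORTED_ELEMENT_TYPES = frozenset({
    "bf16", "f16", "i16", "i8",
    "f8e4m3fnuz", "f8e5m2fnuz", "f8e4m3fn", "f8e5m2",
})


def needs_transpose_load(operand, elem_type, *, transposed_lhs, transposed_rhs):
    """Return True if this operand would be promoted with the transpose load."""
    if elem_type not in SUPPORTED_ELEMENT_TYPES:
        return False
    if operand == "rhs":
        # RHS qualifies when it is NOT already transposed.
        return not transposed_rhs
    if operand == "lhs":
        # LHS qualifies only when it IS transposed.
        return transposed_lhs
    return False
```

For example, under these assumptions a plain bf16 matmul (neither operand transposed) would select the attribute for the RHS only, and an f32 operand would never select it.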

Implementation

The flag flows as a new useGlobalTransposeLoad bool parameter through:

- setMatmulLoweringConfig
- setIGEMMConvolutionLoweringConfig
- getMatmulOrIGEMMLoweringConfigAndWorkgroupSize

The isRDNA4 promotion block fires only when both isRDNA4 and
useGlobalTransposeLoad are true. This is entirely separate from the existing
useDirectLoad (DMA) path: the two cannot be active simultaneously, because DMA
requires promotionArray.empty().
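
The gating and the mutual exclusion with the DMA path can be modelled roughly as follows. This is a Python sketch, not the actual ConfigUtils code: the function name and return shape are hypothetical, and the PR does not say how the real code behaves when both flags are set. The sketch only encodes the stated condition that DMA requires an empty promotion array.

```python
# Hypothetical model of the gating described above; not IREE code.
def select_paths(is_rdna4, use_global_transpose_load, use_direct_load):
    """Return (promotion_array, dma_active) for one matmul configuration."""
    promotion_array = []
    # The promotion block fires only when both conditions hold.
    if is_rdna4 and use_global_transpose_load:
        promotion_array.append("use_global_transpose_load")
    # Assumption: DMA is taken only when nothing was promoted, mirroring
    # the stated requirement promotionArray.empty().
    dma_active = use_direct_load and not promotion_array
    return promotion_array, dma_active
```

With the flag off (the default), the promotion array stays empty in this model, so the existing DMA path is unaffected.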

Test

config_tile_and_fuse_gfx1201.mlir gains a new --check-prefix=GTL run that
verifies promotion_types = [..., #iree_gpu.use_global_transpose_load] appears
on the RHS for a large bf16/f16 matmul when the flag is enabled.

…nspose_load for RDNA4

Add TransferReadTransposeToGlobalTransposeLoad pattern to ROCDLLoadToTransposeLoad that matches:

  vector.transfer_read %src[row, col] : memref<..., global>, vector<1x8xT>
  vector.transpose %read, [1, 0] : vector<1x8xT> to vector<8x1xT>
  vector.transfer_write %transposed, %dst[n, k] : ..., workgroup

and replaces it with amdgpu.global_transpose_load on gfx1200+ (RDNA4).

The hardware 8x8 wave-level transpose (global_load_tr_b128 for bf16/f16) means each lane's result is written at a different N position within the K group. The write indices are corrected to produce contiguous K writes:

  N_new = N_base + K_single % N
  K_new = (K_single floordiv N) * N

The pass now dispatches separately for gfx950 (LDS transpose, existing) and gfx1200+ (global transpose, new), so neither path interferes with the other.

Part of: iree-org#24454
Co-authored-by: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
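
The write-index correction in this commit message can be checked with a small worked example. The sketch below is a Python model, not the pass itself. The commit message writes the modulus as N; the sketch assumes that this means the transpose group size, taken as 8 to match the 8x8 transpose, and names it group. The function name is hypothetical.

```python
# Hypothetical model of the corrected write indices; not the IREE pass.
def corrected_write_indices(n_base, k_single, group=8):
    """Map a lane's original (n_base, k_single) to the corrected (n, k).

    Assumption: the modulus in the formulas is the transpose group size.
    """
    n_new = n_base + k_single % group
    k_new = (k_single // group) * group  # floordiv, then back to group start
    return n_new, k_new


# Lanes with k_single in 0..7 spread across N and share one K start.
for k_single in range(8):
    print(k_single, corrected_write_indices(16, k_single))
```

Under that assumption, the eight K positions inside a group map to eight consecutive N positions, and every lane in the group writes starting at the same K offset.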

…nd promotion

Introduces IREEGPU_UseGlobalTransposeLoad, a promotion attribute for RDNA4 (gfx1200+) that drives matmul operands to be loaded via global_load_tr. The attribute implements both IREEGPU_PromotionAttr and IREECodegen_LoweringConfigAttrInterface, with tiling sizes [N=8, K=1] (vectorSize x 1) derived from the element type.

A new transposePromoteOperand path in GPUPromoteMatmulOperands creates a linalg.generic copy with K-inner thread mapping:
- input map: reads B[K, N] (K-outer, N-inner in memory)
- output map: writes alloc[N, K] (N-outer, K-inner, for contiguous K writes)

This K-inner tiling aligns with global_load_tr's 8x8 wave-level transpose semantics: 8 consecutive lanes each read 8 contiguous N-elements, and the hardware transposes so each lane holds a K-direction slice.

The copy op is tagged with UseGlobalTransposeLoadAttr as its lowering config so the ROCDLLoadToTransposeLoad pass (PR 2) can recognise and lower it. amdgpu.fat_raw_buffer_cast is stripped before creating the copy because global_load_tr requires a flat global pointer, not a fat buffer descriptor.

Part of: iree-org#24454
Co-authored-by: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
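
The effect of the two indexing maps on the copy can be illustrated with plain nested lists. This is a Python sketch of the data movement only: it models neither the thread mapping nor the hardware transpose, and the function name is hypothetical.

```python
# Hypothetical model of the transposing copy; not the linalg.generic itself.
def transposed_copy(b, k_size, n_size):
    """Copy B[K, N] into alloc[N, K], so K becomes the inner dimension."""
    alloc = [[None] * k_size for _ in range(n_size)]
    for k in range(k_size):
        for n in range(n_size):
            # input map reads B[k, n]; output map writes alloc[n, k]
            alloc[n][k] = b[k][n]
    return alloc


b = [[1, 2, 3],
     [4, 5, 6]]  # K = 2 rows, N = 3 columns
print(transposed_copy(b, k_size=2, n_size=3))
```

Each row of the result holds the K-direction values for one N index, which is the K-inner layout the commit message describes for the allocation.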

…nd ConfigUtils wiring

Wire UseGlobalTransposeLoadAttr into the kernel configuration pipeline for RDNA4 (gfx1200+), behind a new hidden flag that defaults to off:

  --iree-llvmgpu-use-global-transpose-load

When enabled, ConfigUtils selects UseGlobalTransposeLoadAttr as the promotion type for:
- RHS operand when !transposedRhs (B matrix is N-inner in memory, so MMA needs a transpose from K-inner layout)
- LHS operand when transposedLhs (A matrix is K-outer in memory)

Supported element types: bf16, f16, i16, i8, and fp8 variants (f8e4m3fnuz, f8e5m2fnuz, f8e4m3fn, f8e5m2).

The flag is threaded as a new useGlobalTransposeLoad bool parameter through setMatmulLoweringConfig, setIGEMMConvolutionLoweringConfig, and getMatmulOrIGEMMLoweringConfigAndWorkgroupSize so the isRDNA4 promotion block only fires when the flag is enabled.

Part of: iree-org#24454
Co-authored-by: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>

nirvedhmeshram force-pushed the amdgpu-global-transpose-load-pr4-config branch 11 times, most recently from 9ad8231 to d8e552d Compare May 13, 2026 16:23

nirvedhmeshram and others added 3 commits May 13, 2026 13:20

nirvedhmeshram force-pushed the amdgpu-global-transpose-load-pr4-config branch from d8e552d to 3250d24 Compare May 13, 2026 19:47

nirvedhmeshram wants to merge 3 commits into iree-org:main from nirvedhmeshram:amdgpu-global-transpose-load-pr4-config
