
[Codegen][AMDGPU] Add --iree-llvmgpu-use-global-transpose-load flag and ConfigUtils wiring#24469

Draft
nirvedhmeshram wants to merge 3 commits into iree-org:main from nirvedhmeshram:amdgpu-global-transpose-load-pr4-config


@nirvedhmeshram
Contributor

Part of #24454. Stacks on #24457, #24467, #24468.

Summary

Wires UseGlobalTransposeLoadAttr into the kernel configuration pipeline for
RDNA4 (gfx1200+), behind a new hidden flag that defaults to off:

--iree-llvmgpu-use-global-transpose-load

What the flag does

When enabled, ConfigUtils selects UseGlobalTransposeLoadAttr as the
promotion type for matmul operands that require a transpose relative to the
MMA register layout:

  • RHS (!transposedRhs): B matrix is N-inner in memory; MMA expects K-inner
  • LHS (transposedLhs): A matrix is K-outer in memory; MMA expects K-inner

Supported element types: bf16, f16, i16, i8, and fp8 variants
(f8e4m3fnuz, f8e5m2fnuz, f8e4m3fn, f8e5m2).
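
The selection rule above can be sketched as a small predicate. This is an illustrative Python model only, not the ConfigUtils code: the function name `wants_global_transpose_load`, the `"lhs"`/`"rhs"` strings, and the `SUPPORTED_TYPES` set are hypothetical names introduced here, and the real implementation is C++ operating on MLIR types.

```python
# Hypothetical model of the operand-selection rule described above.
# Element type names are copied from the PR description.
SUPPORTED_TYPES = {
    "bf16", "f16", "i16", "i8",
    "f8e4m3fnuz", "f8e5m2fnuz", "f8e4m3fn", "f8e5m2",
}


def wants_global_transpose_load(operand, transposed_lhs, transposed_rhs, elem_type):
    """Return True if the operand's memory layout needs a transpose for the MMA."""
    if elem_type not in SUPPORTED_TYPES:
        return False
    if operand == "lhs":
        # A needs the transpose only when it is stored K-outer.
        return transposed_lhs
    if operand == "rhs":
        # B needs the transpose when it is stored N-inner (not already transposed).
        return not transposed_rhs
    raise ValueError(f"unknown operand: {operand!r}")
```

For a plain `A[M, K] x B[K, N]` matmul with neither operand transposed, this model selects only the RHS, which matches the case the new test checks.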

Implementation

The flag flows as a new useGlobalTransposeLoad bool parameter through:

  • setMatmulLoweringConfig
  • setIGEMMConvolutionLoweringConfig
  • getMatmulOrIGEMMLoweringConfigAndWorkgroupSize

The isRDNA4 promotion block fires only when both isRDNA4 and useGlobalTransposeLoad
are true. This path is entirely separate from the existing useDirectLoad (DMA) path:
the two cannot be active simultaneously, because DMA requires promotionArray.empty().
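
A rough model of the gating, assuming the behaviour is exactly as stated in this description. The helper names `select_promotion` and `direct_load_allowed` are hypothetical and exist only for illustration; the actual logic is C++ inside ConfigUtils.

```python
def select_promotion(is_rdna4, use_global_transpose_load, needs_transpose):
    """Return the promotion type for one operand, or None for the default path."""
    # Both the target check and the flag must hold before the block fires.
    if is_rdna4 and use_global_transpose_load and needs_transpose:
        return "use_global_transpose_load"
    return None


def direct_load_allowed(use_direct_load, promotion_array):
    """Model of the DMA precondition: the promotion array must be empty."""
    return use_direct_load and len(promotion_array) == 0
```

Under this model, once an operand has been given the transpose-load promotion type the promotion array is non-empty, so the DMA path is ruled out.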

Test

config_tile_and_fuse_gfx1201.mlir gains a new --check-prefix=GTL run that
verifies promotion_types = [..., #iree_gpu.use_global_transpose_load] appears
on the RHS for a large bf16/f16 matmul when the flag is enabled.

nirvedhmeshram force-pushed the amdgpu-global-transpose-load-pr4-config branch 11 times, most recently from 9ad8231 to d8e552d (May 13, 2026 16:23)
nirvedhmeshram and others added 3 commits May 13, 2026 13:20
…nspose_load for RDNA4

Add TransferReadTransposeToGlobalTransposeLoad pattern to
ROCDLLoadToTransposeLoad that matches:
  vector.transfer_read %src[row, col] : memref<..., global>, vector<1x8xT>
  vector.transpose %read, [1, 0] : vector<1x8xT> to vector<8x1xT>
  vector.transfer_write %transposed, %dst[n, k] : ..., workgroup

and replaces it with amdgpu.global_transpose_load on gfx1200+ (RDNA4).

The hardware 8x8 wave-level transpose (global_load_tr_b128 for bf16/f16)
means each lane's result is written at a different N position within the
K group. The write indices are corrected to produce contiguous K writes:
  N_new = N_base + K_single % N
  K_new = (K_single floordiv N) * N
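
As a worked illustration of the two formulas, here is a minimal Python sketch. It assumes that `N` in the formulas is the width of the transpose group (8 for bf16/f16); that reading is an assumption, and the names `corrected_write_indices` and `group` are hypothetical.

```python
def corrected_write_indices(n_base, k_single, group=8):
    """Apply the write-index correction from the commit message.

    Assumes `group` is the width of the hardware transpose group
    (8 for 16-bit element types); this is an illustrative assumption.
    """
    n_new = n_base + k_single % group      # lane's offset along N within the group
    k_new = (k_single // group) * group    # start of the contiguous K group
    return n_new, k_new


# k_single = 11 falls in the second K group (starting at 8), offset 3.
print(corrected_write_indices(0, 11))  # (3, 8)
```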

The pass now dispatches separately for gfx950 (LDS transpose, existing)
and gfx1200+ (global transpose, new), so neither path interferes with the
other.

Part of: iree-org#24454

Co-authored-by: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
…nd promotion

Introduces IREEGPU_UseGlobalTransposeLoad, a promotion attribute for RDNA4
(gfx1200+) that drives matmul operands to be loaded via global_load_tr.

The attribute implements both IREEGPU_PromotionAttr and
IREECodegen_LoweringConfigAttrInterface, with tiling sizes [N=8, K=1]
(vectorSize x 1) derived from the element type.

A new transposePromoteOperand path in GPUPromoteMatmulOperands creates a
linalg.generic copy with K-inner thread mapping:
  - input  map: reads B[K, N] (K-outer, N-inner in memory)
  - output map: writes alloc[N, K] (N-outer, K-inner, for contiguous K writes)

This K-inner tiling aligns with global_load_tr's 8x8 wave-level transpose
semantics: 8 consecutive lanes each read 8 contiguous N-elements, and the
hardware transposes so each lane holds a K-direction slice.
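
The transpose semantics described above can be modelled on a single 8x8 tile. This is a plain-Python sketch for intuition only: the exact lane-to-element mapping of the hardware instruction is not spelled out here, so the mapping below (lane k reads row k, lane n ends up with column n) is an assumption, and `wave_transpose_8x8` is a hypothetical name.

```python
def wave_transpose_8x8(tile):
    """Model an 8x8 transpose across 8 lanes.

    tile[k][n] is what lane k reads: 8 contiguous N-elements at row k.
    The result gives lane n the 8 values along K at that fixed n.
    """
    assert len(tile) == 8 and all(len(row) == 8 for row in tile)
    return [[tile[k][n] for k in range(8)] for n in range(8)]


# Label each element by its (k, n) coordinate so the effect is visible.
tile = [[(k, n) for n in range(8)] for k in range(8)]
out = wave_transpose_8x8(tile)
# Lane 3 now holds a K-direction slice: (0, 3), (1, 3), ..., (7, 3).
```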

The copy op is tagged with UseGlobalTransposeLoadAttr as its lowering config
so the ROCDLLoadToTransposeLoad pass (PR 2) can recognise and lower it.

amdgpu.fat_raw_buffer_cast is stripped before creating the copy because
global_load_tr requires a flat global pointer, not a fat buffer descriptor.

Part of: iree-org#24454

Co-authored-by: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
…nd ConfigUtils wiring

Wire UseGlobalTransposeLoadAttr into the kernel configuration pipeline for
RDNA4 (gfx1200+), behind a new hidden flag that defaults to off:

  --iree-llvmgpu-use-global-transpose-load

When enabled, ConfigUtils selects UseGlobalTransposeLoadAttr as the promotion
type for:
  - RHS operand when !transposedRhs (B matrix is N-inner in memory, so MMA
    needs a transpose to the K-inner layout)
  - LHS operand when transposedLhs (A matrix is K-outer in memory)

Supported element types: bf16, f16, i16, i8, and fp8 variants (f8e4m3fnuz,
f8e5m2fnuz, f8e4m3fn, f8e5m2).

The flag is threaded as a new useGlobalTransposeLoad bool parameter through
setMatmulLoweringConfig, setIGEMMConvolutionLoweringConfig, and
getMatmulOrIGEMMLoweringConfigAndWorkgroupSize so the isRDNA4 promotion
block only fires when the flag is enabled.

Part of: iree-org#24454

Co-authored-by: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
nirvedhmeshram force-pushed the amdgpu-global-transpose-load-pr4-config branch from d8e552d to 3250d24 (May 13, 2026 19:47)
