[Codegen][AMDGPU] Add UseGlobalTransposeLoad promotion attr and operand promotion for RDNA4 #24468

Draft
nirvedhmeshram wants to merge 2 commits into iree-org:main from nirvedhmeshram:amdgpu-global-transpose-load-pr3-attr

@nirvedhmeshram
Contributor

Part of #24454. Stacks on #24457 and #24467.

Summary

Introduces IREEGPU_UseGlobalTransposeLoad, a promotion attribute that
drives matmul B-operands (and transposed A-operands) to be loaded via
global_load_tr on RDNA4 (gfx1200+).

New attribute

UseGlobalTransposeLoadAttr implements:

  • IREEGPU_PromotionAttr — hooks into GPUPromoteMatmulOperands
  • IREECodegen_LoweringConfigAttrInterface — provides tiling sizes [N=8, K=1]
    (vectorSize × 1, where vectorSize=8 for bf16/f16)
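As a rough illustration of where [N=8, K=1] comes from, the sketch below assumes that vectorSize is the number of elements that fit in a single 128-bit load. The function name and the 128-bit constant are hypothetical and chosen for illustration; this is not the actual IREE implementation, and only the 16-bit (bf16/f16) case is stated in this PR.

```python
# Hypothetical helper: mirrors the described {vectorSize, 1} tile shape only.
LOAD_BITS = 128  # assumption: one 128-bit transpose load per lane


def global_transpose_load_tile_sizes(element_bits: int) -> list[int]:
    """Return [N, K] tile sizes: vector_size along N, 1 along K."""
    if element_bits <= 0 or LOAD_BITS % element_bits != 0:
        raise ValueError(f"unsupported element width: {element_bits}")
    vector_size = LOAD_BITS // element_bits
    return [vector_size, 1]


# bf16/f16 are 16 bits wide, which gives the [8, 1] tile from the description.
tile = global_transpose_load_tile_sizes(16)
print(tile)
```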

K-inner operand promotion

A new transposePromoteOperand path in GPUPromoteMatmulOperands creates
a linalg.generic copy with K-inner thread mapping:

  • input map: (d0, d1) → (d1, d0) — reads B[K, N] (K-outer, N-inner in memory)
  • output map: (d0, d1) → (d0, d1) — writes alloc[N, K] (N-outer, K-inner)
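The effect of the two indexing maps can be sketched in plain Python. This models only the semantics of the copy (the iteration space is N × K, and the result is a transposed copy of B); the function name and the nested-list representation are illustrative assumptions, not code from the pass.

```python
def transpose_promote_copy(b, K, N):
    """Copy B[K, N] into alloc[N, K] using the maps described above."""
    alloc = [[None] * K for _ in range(N)]
    for d0 in range(N):        # outer iteration dim, N
        for d1 in range(K):    # inner iteration dim, K (the "K-inner" mapping)
            # input map  (d0, d1) -> (d1, d0): read  B[d1][d0]
            # output map (d0, d1) -> (d0, d1): write alloc[d0][d1]
            alloc[d0][d1] = b[d1][d0]
    return alloc


# B has K=2 rows and N=3 columns; each value encodes its (k, n) position.
b = [[k * 10 + n for n in range(3)] for k in range(2)]
alloc = transpose_promote_copy(b, K=2, N=3)
print(alloc)
```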

This tiling aligns with global_load_tr's 8×8 wave-level hardware transpose:
8 consecutive lanes each read 8 contiguous N-elements, and the hardware
transposes so each lane holds a K-direction slice. Consecutive lanes therefore
map to consecutive K positions, which is exactly what the K-inner thread
assignment gives.
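A simplified model of the 8×8 transpose described above is sketched here. It assumes lane l loads row k0 + l and ends up holding column n0 + l; the real instruction's register and lane layout may differ in detail, so treat this as an illustration of the data movement rather than a specification of global_load_tr.

```python
def wave_transpose_8x8(b, k0, n0):
    """Model 8 lanes cooperatively transposing an 8x8 tile of B[K, N]."""
    # Before the transpose: lane l holds 8 contiguous N-elements of row k0 + l.
    loaded = [b[k0 + lane][n0:n0 + 8] for lane in range(8)]
    # After the transpose: lane l holds the K-direction slice at column n0 + l.
    return [[loaded[k][lane] for k in range(8)] for lane in range(8)]


# Each element records its own (k, n) coordinate so the movement is visible.
b = [[(k, n) for n in range(8)] for k in range(8)]
held = wave_transpose_8x8(b, 0, 0)
print(held[3])  # lane 3 holds (0, 3), (1, 3), ..., (7, 3)
```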

The copy is tagged with UseGlobalTransposeLoadAttr as its lowering config
so ROCDLLoadToTransposeLoad (PR 2) can pattern-match and lower it.

amdgpu.fat_raw_buffer_cast is stripped before creating the copy because
global_load_tr requires a flat global pointer, not a fat buffer descriptor.

New helpers

  • DerivedConfigUtils: globalTransposeLoadTileSizes(op) returns {vectorSize, 1}
  • IREEGPUAttrs: UseGlobalTransposeLoadAttr tiling level implementation

Test

gpu_promote_matmul_operands_global_transpose.mlir verifies:

  • RHS promotion: copy gets output shape [N, K] (transposed) with use_global_transpose_load config
  • LHS promotion (transposedLhs): same attribute on the copy

nirvedhmeshram and others added 2 commits May 13, 2026 13:20
…nspose_load for RDNA4

Add TransferReadTransposeToGlobalTransposeLoad pattern to
ROCDLLoadToTransposeLoad that matches:
  vector.transfer_read %src[row, col] : memref<..., global>, vector<1x8xT>
  vector.transpose %read, [1, 0] : vector<1x8xT> to vector<8x1xT>
  vector.transfer_write %transposed, %dst[n, k] : ..., workgroup

and replaces it with amdgpu.global_transpose_load on gfx1200+ (RDNA4).

The hardware 8x8 wave-level transpose (global_load_tr_b128 for bf16/f16)
means each lane's result is written at a different N position within the
K group. The write indices are corrected to produce contiguous K writes:
  N_new = N_base + K_single % N
  K_new = (K_single floordiv N) * N
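The index correction can be sketched as follows. The sketch assumes the N used as modulus and divisor in the formulas is the transpose group width (8 for bf16/f16), and that indices are non-negative; the function and parameter names are hypothetical.

```python
def corrected_write_indices(n_base: int, k_single: int, group: int = 8):
    """Apply N_new = N_base + K_single % N and K_new = (K_single floordiv N) * N.

    `group` stands in for N in the formulas (assumed to be the 8-wide group).
    """
    n_new = n_base + k_single % group
    k_new = (k_single // group) * group
    return n_new, k_new


# K index 11 falls in the second group of 8 (k_new = 8) at offset 3 within it.
print(corrected_write_indices(n_base=16, k_single=11))
```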

The pass now dispatches separately for gfx950 (LDS transpose, existing)
and gfx1200+ (global transpose, new), so neither path interferes with the
other.

Part of: iree-org#24454

Co-authored-by: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
…nd promotion

Introduces IREEGPU_UseGlobalTransposeLoad, a promotion attribute for RDNA4
(gfx1200+) that drives matmul operands to be loaded via global_load_tr.

The attribute implements both IREEGPU_PromotionAttr and
IREECodegen_LoweringConfigAttrInterface, with tiling sizes [N=8, K=1]
(vectorSize x 1) derived from the element type.

A new transposePromoteOperand path in GPUPromoteMatmulOperands creates a
linalg.generic copy with K-inner thread mapping:
  - input  map: reads B[K, N] (K-outer, N-inner in memory)
  - output map: writes alloc[N, K] (N-outer, K-inner, for contiguous K writes)

This K-inner tiling aligns with global_load_tr's 8x8 wave-level transpose
semantics: 8 consecutive lanes each read 8 contiguous N-elements, and the
hardware transposes so each lane holds a K-direction slice.

The copy op is tagged with UseGlobalTransposeLoadAttr as its lowering config
so the ROCDLLoadToTransposeLoad pass (PR 2) can recognise and lower it.

amdgpu.fat_raw_buffer_cast is stripped before creating the copy because
global_load_tr requires a flat global pointer, not a fat buffer descriptor.

Part of: iree-org#24454

Co-authored-by: Claude Sonnet 4 (1M context) <noreply@anthropic.com>
Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
@nirvedhmeshram nirvedhmeshram force-pushed the amdgpu-global-transpose-load-pr3-attr branch from 2e748ec to 0e9569d Compare May 13, 2026 19:47