Implement asynchronous LDS loads for MI350 by avbokovoy · Pull Request #5348 · pytorch/FBGEMM

avbokovoy · 2026-01-26T15:27:14Z

This PR implements direct HBM->LDS loads in the TBE inference kernel. There are two major changes:

1. Row data is no longer loaded in place. Instead, we keep pointers to global memory and store the actual data into LDS according to the predicate. When the predicate is false, we fall back to a small 16B chunk of static device memory that is allocated once and filled with zeros.
2. HBM->LDS 16B loads are implemented for ROCm >= 7.0 on MI350. Support could be extended to MI30* through 4B loads, but that brings no performance benefit because it adds the overhead of address transposition and 4x more load operations. The reference implementation is in fe52557.
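As a rough illustration of the first change, here is a host-side C++ sketch of the pointer-selection idea: each slot picks either the real row or a shared zero-filled 16B chunk, and the copy itself is unconditional. The names `Chunk16`, `kZeroChunk`, `select_row_source`, and `stage_rows` are hypothetical and do not come from the PR, and `std::memcpy` only stands in for the actual HBM->LDS load.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 16-byte chunk standing in for the kernel's 16B (uint4) loads.
struct alignas(16) Chunk16 {
  std::uint32_t v[4];
};

// Zero-filled fallback chunk, created once. Static host storage here; the PR
// describes a small static device-memory allocation instead.
static const Chunk16 kZeroChunk = {{0, 0, 0, 0}};

// Choose the source address for one row: the real row when the predicate
// holds, otherwise the shared zero chunk.
inline const Chunk16* select_row_source(const Chunk16* row, bool valid) {
  return valid ? row : &kZeroChunk;
}

// Host-side stand-in for the HBM->LDS copy. Every slot performs the same
// 16-byte copy from whichever pointer was selected, so invalid rows end up
// as zeros without a branch around the copy.
template <std::size_t N>
void stage_rows(const std::array<const Chunk16*, N>& rows,
                const std::array<bool, N>& valid,
                std::array<Chunk16, N>& staging) {
  for (std::size_t i = 0; i < N; ++i) {
    const Chunk16* src = select_row_source(rows[i], valid[i]);
    std::memcpy(&staging[i], src, sizeof(Chunk16));
  }
}
```

In this sketch an invalid slot never dereferences its row pointer, so that pointer may even be null.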

Due to limitations in ROCm releases before 7.2, we are forced to use inline assembly to get 16B loads to work, so manual synchronization was added. For ROCm >= 7.2, we use proper intrinsics to handle memory synchronization.

This change brings a ~10% performance boost on average for both weighted and unweighted cases. We may try to push it further by doing async loads for indices weights.

meta-codesync · 2026-01-26T19:14:56Z

@q10 has imported this pull request. If you are a Meta employee, you can view this in D91496421.

spcyppt · 2026-02-13T04:51:01Z


  asm volatile("cp.async.wait_group %0;\n" ::"n"(N));
+#elif defined(USE_ROCM) &&                                                     \
+    (ROCM_VERSION_MAJOR <= 7 && ROCM_VERSION_MINOR < 2) && defined(__gfx950__)


Is this supposed to be supported for ROCm versions < 7.2?

If so, should the condition be
(ROCM_VERSION_MAJOR < 7 || (ROCM_VERSION_MAJOR == 7 && ROCM_VERSION_MINOR < 2))?

Indeed. Your comment, plus some further tweaks to adapt to the future LDS intrinsic API, is addressed in 2c739ab.
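To make the difference between the two conditions concrete, the sketch below mirrors them as constexpr functions and checks a few version pairs at compile time. The function names are hypothetical; only the boolean expressions are taken from the thread. Any version with major < 7 and minor >= 2 (for example 6.4) is older than 7.2 but fails the original guard.

```cpp
// Hypothetical helper mirroring the guard as originally written in the diff.
constexpr bool guard_as_written(int major, int minor) {
  return major <= 7 && minor < 2;
}

// Hypothetical helper mirroring the condition suggested in review.
constexpr bool guard_as_suggested(int major, int minor) {
  return major < 7 || (major == 7 && minor < 2);
}

// 7.1: both forms select the pre-7.2 path.
static_assert(guard_as_written(7, 1) && guard_as_suggested(7, 1), "7.1");

// 6.4 is older than 7.2, but the original form rejects it because it tests
// the minor version without regard to the major version.
static_assert(!guard_as_written(6, 4), "original guard misses 6.4");
static_assert(guard_as_suggested(6, 4), "suggested guard accepts 6.4");

// 7.2 and later are excluded by both forms.
static_assert(!guard_as_written(7, 2) && !guard_as_suggested(7, 2), "7.2");
static_assert(!guard_as_written(8, 0) && !guard_as_suggested(8, 0), "8.0");
```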

spcyppt · 2026-02-13T04:51:16Z


  asm volatile("cp.async.wait_all;\n" ::);
+#elif defined(USE_ROCM) &&                                                     \
+    (ROCM_VERSION_MAJOR <= 7 && ROCM_VERSION_MINOR < 2) && defined(__gfx950__)


Same as above.

Addressed in 2c739ab

spcyppt · 2026-02-24T19:22:07Z

        const uint4* row_v[kRowUnroll];
        int32_t idx_v[kRowUnroll];
        int32_t cache_idx_v[kRowUnroll];
+        bool row_valid_v[kRowUnroll];


Could you ensure all the changes only affect ROCm?

The whole block (lines 162-228) is under an "if is_rocm" Jinja guard.

spcyppt · 2026-02-24T19:23:54Z

+        }
+        {% if weighted %}
+        #pragma unroll
+        for (uint32_t inner_i = 0; inner_i < kRowUnroll; inner_i++) {


Could you please guard the changes to be ROCm only? We see a small regression on NVIDIA.

The whole block (lines 162-228) is under an "if is_rocm" Jinja guard.

cp_async_zfill_cg is async on Ampere+ and gfx950 but synchronous elsewhere. Inlining the sync fallback into the per-iteration row-load loop kills load pipelining (load->store dependency forces N waitcnts instead of one) and adds wave divergence on mixed-validity warps. Measured up to -19% BW on MI300 (gfx942) for weighted L=20/L=50.

Wrap the row-store section in a #if matching the helper's dispatch: gfx950/Ampere keep the fused cp_async_zfill_cg loop; everything else gets the original two-loop pattern (load all -> masked store). Helper and gfx950 paths untouched.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
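The two loop shapes described in this commit message can be sketched in host-side C++ as follows. This is only a structural illustration under stated assumptions: `SKETCH_HAS_ASYNC_COPY`, `fused_rows`, `two_loop_rows`, and `store_rows` are hypothetical names, the `kRowUnroll` value is made up, and plain loads and stores stand in for the real memory operations, so it cannot reproduce the pipelining effect itself. Every row pointer is assumed to be dereferenceable.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kRowUnroll = 4;  // illustrative value, not from the PR

// Fused shape: each iteration loads one row and stores it right away. With a
// synchronous copy, every store waits on the load issued just before it.
inline void fused_rows(const std::uint32_t* const* rows, const bool* valid,
                       std::uint32_t* out) {
  for (std::size_t i = 0; i < kRowUnroll; ++i) {
    out[i] = valid[i] ? *rows[i] : 0u;
  }
}

// Two-loop shape: issue all loads first, then do the masked stores, so the
// loads are independent of each other and a single wait can cover them.
inline void two_loop_rows(const std::uint32_t* const* rows, const bool* valid,
                          std::uint32_t* out) {
  std::uint32_t tmp[kRowUnroll];
  for (std::size_t i = 0; i < kRowUnroll; ++i) {
    tmp[i] = *rows[i];  // load all
  }
  for (std::size_t i = 0; i < kRowUnroll; ++i) {
    out[i] = valid[i] ? tmp[i] : 0u;  // masked store
  }
}

// Compile-time dispatch in the spirit of the commit: targets with a truly
// asynchronous copy keep the fused loop, everything else uses two loops.
inline void store_rows(const std::uint32_t* const* rows, const bool* valid,
                       std::uint32_t* out) {
#if defined(SKETCH_HAS_ASYNC_COPY)  // hypothetical macro
  fused_rows(rows, valid, out);
#else
  two_loop_rows(rows, valid, out);
#endif
}
```

Both shapes produce the same output; only the order in which loads and stores are issued differs.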


avbokovoy added 2 commits December 19, 2025 10:34

Implement asynchronous LDS loads for MI350

856f1af

Hardcode size value in __builtin_amdgcn_global_load_lds intrinsic

dc3b15b

meta-cla Bot added the cla signed label Jan 26, 2026

spcyppt reviewed Feb 13, 2026


Fix ROCm version and arch guards

2c739ab

avbokovoy requested a review from spcyppt February 19, 2026 11:26

spcyppt reviewed Feb 24, 2026


aryaman-gupta and others added 2 commits May 18, 2026 11:49

embedding_forward_quantized_split_nbit_kernel_template: shortens comments about pipelining of memory ops

50d6822

avbokovoy wants to merge 5 commits into pytorch:main from ROCm:abokovoi/async-lds-inference-opt
