Gemm-allreduce loop optimization by erieaton-amd · Pull Request #170 · ByteDance-Seed/Triton-distributed

erieaton-amd · 2026-05-13T22:56:46Z

Removes some unnecessary loop iterations in the allreduce loop. This cuts the execution time roughly in half (7 ms down to 3 ms) on MI350. I also added the allreduce test to the CI.

drprajap · 2026-05-13T22:59:30Z

+        remote_c_ptr = dl.consume_token(remote_c_ptr, token)
+        remote_c_ptrs = remote_c_ptr + offs_cm[:, None] * stride_cm + offs_cn[None, :] * stride_cn
+        final_acc = tl.load(remote_c_ptrs, mask=c_mask, other=0.0).to(tl.float32)
+        for i in range(1, world_size):


This is still wait-all, reduce-all. If we decouple the wait and the reduce into separate loops, that can help as well.
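For readers without the full diff, the loop shape in the snippet above (seed `final_acc` from the first buffer, then iterate `range(1, world_size)`) can be sketched in plain Python. This is only an illustration under stated assumptions, not the kernel code: the lists stand in for the per-rank tiles that the Triton kernel reads with `tl.load`, and the names `reduce_zero_init`, `reduce_first_unrolled`, and `partials` are hypothetical.

```python
def reduce_zero_init(partials):
    """Baseline shape: start from a zero accumulator and add every rank's tile."""
    acc = [0.0] * len(partials[0])
    for rank in range(len(partials)):
        acc = [a + p for a, p in zip(acc, partials[rank])]
    return acc


def reduce_first_unrolled(partials):
    """Unrolled shape: seed the accumulator with rank 0's tile, then add ranks 1..N-1."""
    acc = list(partials[0])  # plays the role of the initial load into final_acc
    for rank in range(1, len(partials)):  # mirrors `for i in range(1, world_size)`
        acc = [a + p for a, p in zip(acc, partials[rank])]
    return acc


# Hypothetical data: four ranks, each holding a two-element partial tile.
partials = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0], [1000.0, 2000.0]]
print(reduce_zero_init(partials))
print(reduce_first_unrolled(partials))
```

Both functions return the same sums; the unrolled form skips the zero initialization and one add per tile. The sketch does not model the wait or token handling that the review comment refers to.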

erieaton-amd added 3 commits May 13, 2026 22:30

Drop unnecessary loop iterations in allreduce

febb792

Unroll first iteration of allreduce accumulation

3e81e44

Add GEMM AR test to CI

5aaef2b

drprajap reviewed May 13, 2026

erieaton-amd wants to merge 3 commits into ByteDance-Seed:main from erieaton-amd:arg-4
