
Port the low latency allgather kernels to AMD #162

Draft
erieaton-amd wants to merge 7 commits into ByteDance-Seed:main from erieaton-amd:fast-ag


@erieaton-amd
Collaborator

This builds on two other PRs which should be reviewed and merged first.

The kernels in low_latency_allgather.py are copied and adapted for AMD devices. They need the signal_op function, which rocshmem doesn't provide. It's a fairly simple function, so I added an implementation to the wrapper file.
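
For readers unfamiliar with the operation, here is a minimal host-side Python model of what a signal_op-style call does, assuming NVSHMEM-like semantics (atomically set or add to a signal word owned by a target PE). It is not the code added to the wrapper file; `SignalOp`, `SignalTable`, and their methods are hypothetical names used only for illustration.

```python
import threading
from enum import Enum


class SignalOp(Enum):
    """Hypothetical stand-ins for the set/add signal operations."""
    SET = "set"
    ADD = "add"


class SignalTable:
    """Toy model: one signal word per PE, updated atomically under a lock."""

    def __init__(self, num_pes: int) -> None:
        self._signals = [0] * num_pes
        self._lock = threading.Lock()

    def signal_op(self, pe: int, value: int, op: SignalOp) -> None:
        """Atomically apply `op` with `value` to the signal word of `pe`."""
        with self._lock:
            if op is SignalOp.SET:
                self._signals[pe] = value
            elif op is SignalOp.ADD:
                self._signals[pe] += value
            else:
                raise ValueError(f"unsupported signal op: {op}")

    def read(self, pe: int) -> int:
        """Return the current signal value for `pe`."""
        with self._lock:
            return self._signals[pe]


table = SignalTable(num_pes=2)
table.signal_op(pe=1, value=5, op=SignalOp.SET)  # overwrite the signal word
table.signal_op(pe=1, value=2, op=SignalOp.ADD)  # accumulate onto it
```

On a device, the lock would be replaced by a single atomic store or atomic add on the remote signal address; the model only captures the set-versus-add behavior.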

This work ran into an issue: there seems to be a race condition between benchmark iterations that stops the program from terminating. I've worked around this by changing the signal wait condition from EQ to GE. This allows the kernels to pass the correctness test and finish the benchmark, but it slightly changes how they are expected to be used.
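
The difference between the two wait conditions can be sketched with a small simulation. It assumes one possible failure mode, in which the signal advances past the expected value before the waiter samples it; the cause of the hang has not been confirmed to be this. `first_satisfying_sample` and the sample values are hypothetical and exist only for illustration.

```python
import operator
from typing import Callable, Optional, Sequence

# Stand-ins for the EQ and GE comparison modes of a signal wait.
CMP_EQ: Callable[[int, int], bool] = operator.eq
CMP_GE: Callable[[int, int], bool] = operator.ge


def first_satisfying_sample(
    samples: Sequence[int],
    cmp: Callable[[int, int], bool],
    expected: int,
) -> Optional[int]:
    """Return the index of the first polled value that satisfies the wait,
    or None if the wait never completes over the given samples."""
    for index, value in enumerate(samples):
        if cmp(value, expected):
            return index
    return None


# Values the waiter happens to observe while polling. The signal went
# 0 -> 1 -> 2 between two polls, so the value 1 was never sampled.
observed = [0, 0, 2, 2]

eq_result = first_satisfying_sample(observed, CMP_EQ, expected=1)
ge_result = first_satisfying_sample(observed, CMP_GE, expected=1)

print("EQ wait completes at sample:", eq_result)
print("GE wait completes at sample:", ge_result)
```

With GE, a value left over from an earlier iteration can also satisfy the wait, so callers would presumably need to use increasing expected values or reset the signal between runs.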

Signed-off-by: Eric Eaton <erieaton@amd.com>
Signed-off-by: Eric Eaton <erieaton@amd.com>
This file was not included in the CI and wasn't updated with some recent
changes.

Signed-off-by: Eric Eaton <erieaton@amd.com>
Signed-off-by: Eric Eaton <erieaton@amd.com>
There was a change in ROCm 7+ that makes it harder to match up the torch
and amdsmi devices. This change to use KFD makes operations like
sleep_async work again.

Signed-off-by: Eric Eaton <erieaton@amd.com>
Signed-off-by: Eric Eaton <erieaton@amd.com>
Signed-off-by: Eric Eaton <erieaton@amd.com>
@erieaton-amd erieaton-amd marked this pull request as draft March 5, 2026 19:03