
Replace rocm-smi with amd-smi across ROCm build, CI, and docs #5597

Open

adam360x wants to merge 3 commits into pytorch:main from adam360x:rocm-smi-to-amd-smi

Conversation


adam360x commented Apr 8, 2026

Summary

  • fbgemm_gpu/src/topology_utils.cpp: Replace the legacy rocm_smi C API with amd_smi. This updates the header include, macros, init/shutdown calls, and device enumeration, which now uses the socket → processor handle pattern required by AMD SMI. A HIP device index → handle mapping (hip_device_to_handle) replaces the old integer-index map.
  • ci/utils/gpu_detect.bash: Replace all rocm-smi CLI calls with amd-smi equivalents: vendor detection, GPU model detection (amd-smi static --asic / TARGET_GRAPHICS_VERSION), device count (amd-smi list), and utilization check (amd-smi metric --usage). A sketch of the resulting detection logic follows this list.
  • .github/scripts/fbgemm_gpu_test.bash: Replace rocm-smi --showproductname | grep GUID | wc -l with amd-smi list | grep -c "^GPU" for GPU count.
  • .github/scripts/utils_system.bash: Replace rocm-smi and rocminfo diagnostic calls with amd-smi.
  • .github/scripts/utils_rocm.bash: Replace post-install rocm-smi smoke check with amd-smi.
  • cmake/modules/GpuCppLibrary.cmake, fbgemm_gpu/cmake/Hip.cmake: Remove never-defined ROCM_SMI_INCLUDE and ROCRAND_INCLUDE variables (both were always empty and had no effect).
  • fbgemm_gpu/docs/: Update InstallationInstructions.rst and TestInstructions.rst to reference amd-smi commands and output format.
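For reference, here is a minimal sketch of what the amd-smi-based logic in ci/utils/gpu_detect.bash might look like; it is not the PR's actual diff. detect_gpu_vendor and detect_amd_gpu_model are the entry points named in the test plan below, while gpu_count and check_gpu_utilization are hypothetical helper names, and amd-smi output fields can vary between ROCm releases:

```bash
#!/usr/bin/env bash
# Sketch only: exact amd-smi output formats vary across ROCm releases.

detect_gpu_vendor () {
  # Assume an AMD GPU when amd-smi is installed and enumerates at least one device
  if command -v amd-smi >/dev/null 2>&1 &&
     [ "$(amd-smi list | grep -c '^GPU')" -gt 0 ]; then
    echo "amd"
  fi
}

detect_amd_gpu_model () {
  # ASIC info, including the target graphics version (e.g. a gfx* identifier)
  amd-smi static --asic | grep -i 'TARGET_GRAPHICS_VERSION'
}

gpu_count () {
  # Same pipeline the PR uses in fbgemm_gpu_test.bash
  amd-smi list | grep -c '^GPU'
}

check_gpu_utilization () {
  # Per-GPU usage/activity metrics
  amd-smi metric --usage
}
```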

rocm-smi has been in maintenance mode since ROCm 7.0, receiving only critical bug fixes; amd-smi is the supported replacement.

Test plan

  • Verify topology_utils.cpp compiles under USE_ROCM=1 with amd-smi-lib installed
  • Run source ci/utils/gpu_detect.bash && detect_gpu_vendor && detect_amd_gpu_model on an AMD GPU host
  • Confirm amd-smi list | grep -c "^GPU" returns correct GPU count on target hardware


meta-cla Bot commented Apr 8, 2026

Hi @adam360x!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

Signed-off-by: Adam360x <Adam.pryor@amd.com>
adam360x force-pushed the rocm-smi-to-amd-smi branch from e20406b to 012740a on April 8, 2026 at 14:50
meta-cla Bot added the cla signed label on Apr 8, 2026

meta-codesync Bot commented Apr 8, 2026

@q10 has imported this pull request. If you are a Meta employee, you can view this in D100035089.


q10 commented Apr 8, 2026

@adam360x Looks like one of the runs is failing with:

https://github.com/pytorch/FBGEMM/actions/runs/24141873048/job/70483178549?pr=5597

OSError: /github/home/miniconda/envs/build_binary/lib/python3.14/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: amdsmi_is_P2P_accessible

I think it may have to do with the host machine itself though.

(Review comment thread on cmake/modules/GpuCppLibrary.cmake)
Signed-off-by: Adam360x <Adam.pryor@amd.com>
adam360x (Author) commented:

> @adam360x Looks like one of the runs is failing with:
>
> https://github.com/pytorch/FBGEMM/actions/runs/24141873048/job/70483178549?pr=5597
>
> OSError: /github/home/miniconda/envs/build_binary/lib/python3.14/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: undefined symbol: amdsmi_is_P2P_accessible
>
> I think it may have to do with the host machine itself though.

One of the issues was with linking; that is now fixed.
The other, upload-manywheel-py3_10-cuda13_2, failed while downloading an artifact...
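For anyone triaging a similar undefined-symbol error, two standard binutils checks can narrow down whether the extension is missing a link dependency or the host library is stale; the paths below are illustrative, not the actual CI layout:

```bash
# Does the extension record a runtime dependency on libamd_smi?
# (path is illustrative -- point it at the installed wheel)
ldd /path/to/site-packages/fbgemm_gpu/fbgemm_gpu_py.so | grep -i amd_smi

# Does the installed AMD SMI library actually export the missing symbol?
nm -D /opt/rocm/lib/libamd_smi.so | grep amdsmi_is_P2P_accessible
```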


adam360x commented May 1, 2026

@q10 Looks like a CI EADDRINUSE failure. The amd-smi-related changes appear to be passing as of now.

