Replace rocm-smi with amd-smi across ROCm build, CI, and docs #5597
adam360x wants to merge 3 commits into pytorch:main
Hi @adam360x! Thank you for your pull request and welcome to our community.
Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged. If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Signed-off-by: Adam360x <Adam.pryor@amd.com>
e20406b to 012740a
@q10 has imported this pull request. If you are a Meta employee, you can view this in D100035089.
@adam360x Looks like one of the runs is failing: https://github.com/pytorch/FBGEMM/actions/runs/24141873048/job/70483178549?pr=5597 I think it may have to do with the host machine itself, though.
Signed-off-by: Adam360x <Adam.pryor@amd.com>
One of the issues was with linking - fixed.
@q10 looks like ci
Summary
- `fbgemm_gpu/src/topology_utils.cpp`: Replace legacy `rocm_smi` C API with `amd_smi`. Updates include, macro, init/shutdown calls, and device enumeration (now uses the socket → processor handle pattern required by AMD SMI). The HIP device index → handle mapping (`hip_device_to_handle`) replaces the old integer-index map.
- `ci/utils/gpu_detect.bash`: Replace all `rocm-smi` CLI calls with `amd-smi` equivalents: vendor detection, GPU model detection (`amd-smi static --asic` / `TARGET_GRAPHICS_VERSION`), device count (`amd-smi list`), and utilization check (`amd-smi metric --usage`).
- `.github/scripts/fbgemm_gpu_test.bash`: Replace `rocm-smi --showproductname | grep GUID | wc -l` with `amd-smi list | grep -c "^GPU"` for GPU count.
- `.github/scripts/utils_system.bash`: Replace `rocm-smi` and `rocminfo` diagnostic calls with `amd-smi`.
- `.github/scripts/utils_rocm.bash`: Replace post-install `rocm-smi` smoke check with `amd-smi`.
- `cmake/modules/GpuCppLibrary.cmake`, `fbgemm_gpu/cmake/Hip.cmake`: Remove never-defined `ROCM_SMI_INCLUDE` and `ROCRAND_INCLUDE` variables (both were always empty and had no effect).
- `fbgemm_gpu/docs/`: Update `InstallationInstructions.rst` and `TestInstructions.rst` to reference `amd-smi` commands and output format.

`rocm-smi` has been in maintenance mode since ROCm 7.0 with only critical bug fixes. `amd-smi` is the supported replacement.

Test plan

- `topology_utils.cpp` compiles under `USE_ROCM=1` with `amd-smi-lib` installed
- `source ci/utils/gpu_detect.bash && detect_gpu_vendor && detect_amd_gpu_model` on an AMD GPU host
- `amd-smi list | grep -c "^GPU"` returns correct GPU count on target hardware
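
The GPU-count check in the test plan can be tried without AMD hardware by feeding canned text through the same `grep -c "^GPU"` filter. The following is a minimal sketch, not code from this PR: the helper name `count_gpus_from_list` is hypothetical, and the sample text only approximates what `amd-smi list` prints (one unindented `GPU: <index>` line per device, followed by indented fields), so the real layout should be confirmed on target hardware.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): count GPUs from `amd-smi list`
# output supplied on stdin.
count_gpus_from_list() {
  # grep -c prints the number of matching lines. It exits with status 1 when
  # that number is 0, so `|| true` keeps callers running under `set -e` from
  # aborting on a host with no GPUs.
  grep -c "^GPU" || true
}

# Assumed sample of `amd-smi list` output for a two-GPU host (illustrative only).
sample_output='GPU: 0
    BDF: 0000:03:00.0
    UUID: 00000000-0000-0000-0000-000000000000
GPU: 1
    BDF: 0000:83:00.0
    UUID: 11111111-1111-1111-1111-111111111111'

gpu_count=$(printf '%s\n' "$sample_output" | count_gpus_from_list)
echo "Detected ${gpu_count} GPU(s)"
```

On a real host the same helper would be used as `amd-smi list | count_gpus_from_list`. Anchoring the pattern with `^` matters because only the per-device header lines start at column 0 in the assumed layout, so indented fields are not counted.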