[ROCm] Add AMD ROCm/HIP build for the CUDA particle/SPH solvers#145
[ROCm] Add AMD ROCm/HIP build for the CUDA particle/SPH solvers#145jeffdaily wants to merge 4 commits into
Conversation
This adds an AMD ROCm/HIP build path for CubbyFlow's CUDA particle and SPH solvers (SPH, WCSPH, PCISPH, the particle-system solvers, the point-hash-grid neighbor searcher, and the CUDA array/container layer) so they run on AMD GPUs in addition to NVIDIA GPUs, while keeping the existing CUDA build unchanged when USE_HIP is off. The CUDA sources are compiled with hipcc rather than rewritten. A small compatibility shim, Includes/Core/CUDA/cuda_to_hip.h, is force-included on every HIP translation unit and maps the handful of CUDA runtime symbols the project uses (cudaMalloc, cudaMemcpy, error helpers, and the memcpy-kind enumerators) onto their hip* equivalents. The shim also reconciles the preprocessor idioms the codebase relies on: the project gates its kernel and host/device attribute macros on __CUDACC__ and selects device-vs-host accessor return types on __CUDA_ARCH__, so those guards are extended in place to also accept the HIP compile passes. __CUDACC__ is deliberately left undefined under HIP because rocThrust keys its backend selection on it; instead Thrust is pinned to its HIP backend so it does not pull a CUDA-only CUB header. To review: start with cuda_to_hip.h and Includes/Core/Utils/Macros.hpp (the guard extensions), then the CMake USE_HIP option (enables the HIP language, sets CMAKE_HIP_ARCHITECTURES, force-includes the shim, marks the CUDA directory's .cu and device-touching .cpp as LANGUAGE HIP), then the mechanical __CUDACC__/__CUDA_ARCH__ guard widening across Includes/Core/CUDA and Sources/Core/CUDA. NVIDIA CUDA remains the default; configure with -DUSE_HIP=ON to build for AMD. Test Plan: ``` git submodule update --init --recursive cmake -B build -S . -DCMAKE_BUILD_TYPE=Release -DUSE_HIP=ON \ -DCMAKE_HIP_ARCHITECTURES=gfx90a \ -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ \ -DBUILD_TESTS=ON -DBUILD_EXAMPLES=ON cmake --build build -j$(nproc) ./build/bin/CUDAUnitTests ``` Built and run on AMD Instinct MI250X (gfx90a) and Radeon (gfx1100) on Linux, and on gfx1201 on Windows (ROCm/TheRock SDK). CUDATests pass (35/35 cases, 3168/3168 assertions: CUDA array/vector/stdarray, particle system data and solver, point hash grid searcher cross-checked against the CPU searcher), the CPU UnitTests remain green (722/722), and the CUDASPHSim example runs the full GPU pipeline end to end. The CUDA build and behavior are unchanged when USE_HIP is off. This work was authored with the assistance of Claude (Anthropic's AI assistant).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📜 Recent review details🔇 Additional comments (1)
📝 WalkthroughWalkthroughThe PR adds an AMD ROCm/HIP GPU backend as a mutually exclusive alternative to the existing CUDA backend. Changes include a new ChangesAMD ROCm/HIP Backend Support
Sequence Diagram(s)sequenceDiagram
participant CMake as CMake Configure
participant hipcc as hipcc Compiler
participant cuda_to_hip as cuda_to_hip.h
participant hip_runtime as hip/hip_runtime.h
participant CUDAHeader as CUDAArray/SPH Headers
CMake->>CMake: USE_HIP=ON → set USE_CUDA=OFF
CMake->>CMake: enable_language(HIP), set CMAKE_HIP_FLAGS -include cuda_to_hip.h
CMake->>hipcc: compile *.cu with LANGUAGE HIP
hipcc->>cuda_to_hip: force-include cuda_to_hip.h
cuda_to_hip->>hip_runtime: include hip/hip_runtime.h
cuda_to_hip->>cuda_to_hip: define __CUDA_ARCH__=1 on device pass
cuda_to_hip->>cuda_to_hip: remap cudaMalloc→hipMalloc, cudaMemcpy→hipMemcpy, etc.
hipcc->>CUDAHeader: compile with __HIP__ defined
CUDAHeader->>CUDAHeader: select __HIP__-aware accessor branches
CUDAHeader->>CUDAHeader: compile CUBBYFLOW_CUDA_HOST_DEVICE functions for host+device
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
CMakeLists.txt (1)
70-71: ⚡ Quick winAvoid hard-coding Thrust’s HIP backend numeric ID.
Line 71 uses
THRUST_DEVICE_SYSTEM=5; that value is an internal constant and can drift with Thrust revisions. Prefer the symbolic macro to keep this resilient.Suggested fix
- # THRUST_DEVICE_SYSTEM_HIP == 5 in thrust/detail/config/device_system.h. - add_compile_definitions(THRUST_DEVICE_SYSTEM=5) + add_compile_definitions(THRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@CMakeLists.txt` around lines 70 - 71, In the CMakeLists.txt file, the add_compile_definitions call on line 71 hard-codes the numeric value 5 for THRUST_DEVICE_SYSTEM, which is an internal constant that may change across Thrust versions. Replace the hard-coded numeric value 5 with the symbolic macro THRUST_DEVICE_SYSTEM_HIP to ensure the configuration remains resilient to Thrust revisions and is more maintainable.Tests/CUDATests/CUDAArray2Tests.cu (1)
302-302: ⚡ Quick winAssert
cudaDeviceSynchronize()success instead of discarding it.Line 302 currently suppresses the return value; this can hide backend/runtime failures in the resize test.
Suggested fix
- static_cast<void>(cudaDeviceSynchronize()); + CHECK_EQ(cudaSuccess, cudaDeviceSynchronize());🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@Tests/CUDATests/CUDAArray2Tests.cu` at line 302, The call to cudaDeviceSynchronize() is being cast to void, which suppresses its return value and can hide backend/runtime failures. Replace the static_cast that discards the return value with an assertion that checks whether cudaDeviceSynchronize() returns successfully, ensuring that any CUDA runtime errors during device synchronization in the resize test are properly caught and reported.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@CMakeLists.txt`:
- Around line 64-65: The CUBBYFLOW_CUDA_TO_HIP_HEADER variable reference in the
CMAKE_HIP_FLAGS set command is not quoted, which causes word-splitting issues if
the path contains spaces. Add quotes around the ${CUBBYFLOW_CUDA_TO_HIP_HEADER}
variable substitution to ensure the absolute path is treated as a single
argument to the -include compiler flag, protecting it from being split on
whitespace during argument parsing.
In `@Documents/Install.md`:
- Around line 237-242: The fenced code block containing the mkdir, cd, cmake,
and make commands is missing a language specification after the opening triple
backticks, which triggers the MD040 linting rule. Add "bash" as the language
identifier immediately after the opening triple backticks to specify that this
is a bash shell command block and resolve the linting violation.
---
Nitpick comments:
In `@CMakeLists.txt`:
- Around line 70-71: In the CMakeLists.txt file, the add_compile_definitions
call on line 71 hard-codes the numeric value 5 for THRUST_DEVICE_SYSTEM, which
is an internal constant that may change across Thrust versions. Replace the
hard-coded numeric value 5 with the symbolic macro THRUST_DEVICE_SYSTEM_HIP to
ensure the configuration remains resilient to Thrust revisions and is more
maintainable.
In `@Tests/CUDATests/CUDAArray2Tests.cu`:
- Line 302: The call to cudaDeviceSynchronize() is being cast to void, which
suppresses its return value and can hide backend/runtime failures. Replace the
static_cast that discards the return value with an assertion that checks whether
cudaDeviceSynchronize() returns successfully, ensuring that any CUDA runtime
errors during device synchronization in the resize test are properly caught and
reported.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 587d8c24-6b54-4254-a701-3b4c5a3099b0
📒 Files selected for processing (28)
Builds/CMake/CompileOptions.cmakeCMakeLists.txtDocuments/Install.mdExamples/CUDASPHSim/CMakeLists.txtExamples/CUDASPHSim/CUDAPCISPHSolver3Example.cuIncludes/Core/CUDA/CUDAAlgorithms.hppIncludes/Core/CUDA/CUDAArray-Impl.hppIncludes/Core/CUDA/CUDAArrayBase-Impl.hppIncludes/Core/CUDA/CUDAArrayBase.hppIncludes/Core/CUDA/CUDAArrayView-Impl.hppIncludes/Core/CUDA/CUDAParticleSystemData2.hppIncludes/Core/CUDA/CUDAParticleSystemData3.hppIncludes/Core/CUDA/CUDAPointHashGridSearcher2.hppIncludes/Core/CUDA/CUDAPointHashGridSearcher3.hppIncludes/Core/CUDA/CUDASPHKernels2-Impl.hppIncludes/Core/CUDA/CUDASPHKernels3-Impl.hppIncludes/Core/CUDA/CUDAStdArray-Impl.hppIncludes/Core/CUDA/CUDAStdVector-Impl.hppIncludes/Core/CUDA/CUDAStdVector.hppIncludes/Core/CUDA/CUDAUtils.hppIncludes/Core/CUDA/ThrustUtils.hppIncludes/Core/CUDA/cuda_to_hip.hIncludes/Core/Utils/Macros.hppSources/Core/CMakeLists.txtSources/Core/CUDA/CUDASPHSystemData2Func.hppSources/Core/CUDA/CUDASPHSystemData3Func.hppTests/CUDATests/CMakeLists.txtTests/CUDATests/CUDAArray2Tests.cu
📜 Review details
🧰 Additional context used
🪛 markdownlint-cli2 (0.22.1)
Documents/Install.md
[warning] 237-237: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (25)
Includes/Core/CUDA/cuda_to_hip.h (1)
14-50: LGTM!Includes/Core/CUDA/ThrustUtils.hpp (1)
18-22: LGTM!Includes/Core/CUDA/CUDAParticleSystemData2.hpp (1)
21-25: LGTM!Includes/Core/CUDA/CUDAParticleSystemData3.hpp (1)
21-25: LGTM!Includes/Core/CUDA/CUDAPointHashGridSearcher2.hpp (1)
19-23: LGTM!Includes/Core/CUDA/CUDAPointHashGridSearcher3.hpp (1)
19-23: LGTM!Includes/Core/Utils/Macros.hpp (1)
17-23: LGTM!Also applies to: 27-35, 38-52
Includes/Core/CUDA/CUDAUtils.hpp (1)
18-22: LGTM!Also applies to: 44-48, 353-353
Sources/Core/CUDA/CUDASPHSystemData2Func.hpp (1)
20-24: LGTM!Sources/Core/CUDA/CUDASPHSystemData3Func.hpp (1)
20-24: LGTM!Examples/CUDASPHSim/CUDAPCISPHSolver3Example.cu (1)
18-22: LGTM!Includes/Core/CUDA/CUDAAlgorithms.hpp (1)
22-22: LGTM!Includes/Core/CUDA/CUDAArray-Impl.hpp (1)
18-18: LGTM!Also applies to: 143-143, 357-357
Includes/Core/CUDA/CUDASPHKernels2-Impl.hpp (1)
21-148: LGTM!Includes/Core/CUDA/CUDASPHKernels3-Impl.hpp (1)
21-144: LGTM!Includes/Core/CUDA/CUDAArrayView-Impl.hpp (1)
19-184: LGTM!Also applies to: 186-188
Includes/Core/CUDA/CUDAStdArray-Impl.hpp (1)
19-27: LGTM!Also applies to: 36-37, 45-46, 54-55, 64-65, 74-75, 86-87, 98-99, 120-123, 127-130, 134-135, 149-153, 157-163, 167-168
Includes/Core/CUDA/CUDAStdVector.hpp (1)
68-80: LGTM!Also applies to: 125-131, 160-168
Includes/Core/CUDA/CUDAStdVector-Impl.hpp (1)
98-123: LGTM!Also applies to: 238-263
Includes/Core/CUDA/CUDAArrayBase.hpp (1)
59-115: LGTM!Includes/Core/CUDA/CUDAArrayBase-Impl.hpp (1)
19-38: LGTM!Also applies to: 41-80, 92-95, 185-268, 271-351, 359-359
Builds/CMake/CompileOptions.cmake (1)
128-133: LGTM!Also applies to: 143-163
Sources/Core/CMakeLists.txt (1)
21-31: LGTM!Examples/CUDASPHSim/CMakeLists.txt (1)
8-9: LGTM!Also applies to: 22-25
Tests/CUDATests/CMakeLists.txt (1)
10-18: LGTM!
Use the symbolic THRUST_DEVICE_SYSTEM_HIP macro instead of its numeric
value when pinning rocThrust's backend, quote the force-included shim
path so a source tree containing spaces still passes a single -include
argument, assert the cudaDeviceSynchronize() result in the CUDAArray2
resize test instead of discarding it, and tag the HIP build's fenced
code block in the install docs with a language.
These are behavior-preserving: the Thrust macro expands to the same
value and the quoted path resolves identically, so the compiled device
code is unchanged on both the CUDA and HIP paths.
Test Plan:
cmake .. -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a
make
./bin/CUDATests
Authored with the assistance of Claude (Anthropic's AI assistant).
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@Tests/CUDATests/CUDAArray2Tests.cu`:
- Line 302: At line 330, there is a similar arr.Resize(1, 9, 3.f) operation with
a fill value that performs async kernel operations, but it lacks a
synchronization check unlike the one added at line 302. Add a
CHECK_EQ(cudaSuccess, cudaDeviceSynchronize()) call immediately after the
arr.Resize operation at line 330 to ensure the async resize-with-fill kernel
completes before proceeding with value verification, maintaining consistency
with line 302 and preventing potential flaky tests on the HIP backend.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2f7a2d4d-7779-4244-9741-3dfb3fba28ee
📒 Files selected for processing (3)
CMakeLists.txtDocuments/Install.mdTests/CUDATests/CUDAArray2Tests.cu
✅ Files skipped from review due to trivial changes (1)
- Documents/Install.md
🚧 Files skipped from review as they are similar to previous changes (1)
- CMakeLists.txt
Mirror the cudaDeviceSynchronize() success check onto the second
resize-with-fill case in the CUDAArray2 resize test so both async
fill kernels are confirmed complete before their results are read,
keeping the two cases consistent.
Test Plan:
cmake .. -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a
make
./bin/CUDATests
Authored with the assistance of Claude (Anthropic's AI assistant).
The default-gfx90a block preempted and duplicated CMake's own host-GPU detection. Removing it lets enable_language(HIP) honor an explicit -DCMAKE_HIP_ARCHITECTURES, auto-detect the host GPU, or error on a no-GPU build host. The block was also dead code: it ran after enable_language(HIP), by which point CMAKE_HIP_ARCHITECTURES was already set, so its guard was always false. This work was authored with the Claude AI assistant.
This adds an AMD ROCm/HIP build path for CubbyFlow's CUDA particle and SPH solvers (SPH, WCSPH, PCISPH, the particle-system solvers, the point-hash-grid neighbor searcher, and the CUDA array/container layer) so they run on AMD GPUs in addition to NVIDIA GPUs, while keeping the existing CUDA build unchanged when
USE_HIPis off.What changed
The CUDA sources are compiled with hipcc rather than rewritten. A small compatibility shim,
Includes/Core/CUDA/cuda_to_hip.h, is force-included on every HIP translation unit and maps the handful of CUDA runtime symbols the project uses (cudaMalloc,cudaMemcpy, error helpers, and the memcpy-kind enumerators) onto theirhip*equivalents. The shim also reconciles the preprocessor idioms the codebase relies on: the project gates its kernel and host/device attribute macros on__CUDACC__and selects device-vs-host accessor return types on__CUDA_ARCH__, so those guards are extended in place to also accept the HIP compile passes.__CUDACC__is deliberately left undefined under HIP because rocThrust keys its backend selection on it; instead Thrust is pinned to its HIP backend so it does not pull a CUDA-only CUB header.The remaining edits are confined to the CUDA headers and sources under
Includes/Core/CUDAandSources/Core/CUDA(extending the__CUDACC__/__CUDA_ARCH__guards), the CMake build (a mutually exclusiveUSE_HIPoption that enables the HIP language, setsCMAKE_HIP_ARCHITECTURES, and force-includes the shim), the CUDA examples and tests so they build under HIP, and the install documentation.How to build the ROCm/HIP path
NVIDIA CUDA remains the default (
USE_CUDA=ON). To build for AMD GPUs instead, configure withUSE_HIP=ON:CMAKE_HIP_ARCHITECTURESaccepts a semicolon-separated list of AMD GPU targets and defaults togfx90awhen omitted. A working ROCm installation with HIP is required. The new section inDocuments/Install.mddocuments this alongside the existing build instructions.Validation
Built and run on AMD Instinct MI250X (gfx90a) and Radeon (gfx1100) on Linux, and on gfx1201 on Windows (ROCm/TheRock SDK). The
Tests/CUDATestssuite passes (35/35 cases, 3168/3168 assertions), the CPUUnitTestsremain green (722/722), and theCUDASPHSimexample runs the full GPU pipeline end to end. The CUDA build and behavior are unchanged whenUSE_HIPis off.This work was authored with the assistance of Claude (Anthropic's AI assistant).
Summary by CodeRabbit
Release Notes
New Features
Documentation
USE_HIP=ONand optional HIP architecture configuration (defaults togfx90awhen unset).Build System
USE_HIPbuild option that switches to HIP compilation and disables CUDA for HIP builds.Bug Fixes