Add a HIP/ROCm device backend for KV-transfer and Hamming kernels by jeffdaily · Pull Request #1021 · ModelEngine-Group/unified-cache-management

jeffdaily · 2026-06-11T22:34:38Z

This adds a new RUNTIME_ENVIRONMENT=rocm device backend (PLATFORM=rocm in
setup.py) alongside the existing cuda/ascend/maca/musa/simu backends, so the
KV block-transfer kernels, the H2D/D2H store path, and the sparse Hamming
scoring kernel build and run on AMD GPUs via HIP. The backend is purely
additive: the existing backends are not modified. We have made every effort
to leave the NVIDIA build unchanged -- every source change to a shared file is
behind a __CUDA_ARCH__ or USE_ROCM guard that the CUDA build does not
compile, and the compat shim is only placed on the include path for the rocm
build, never the cuda one. Select it with export PLATFORM=rocm (or
-DRUNTIME_ENVIRONMENT=rocm when invoking CMake directly).

Review order: start with ucm/shared/vendor/hip_compat/ (the compat shim), then
the three new rocm/CMakeLists.txt arms (trans, store, sparse ham_dist), then
the two guarded kernel sources, then the docs and Windows host-build guards.

The compat shim resolves the project's #include <cuda_runtime.h>/<cuda.h>
to <hip/hip_runtime.h> and aliases the small cuda* runtime surface in use
(Malloc/Free/Memcpy[Async]/HostMalloc/HostRegister/Stream*/Event*) to hip*.
Each rocm CMake arm calls enable_language(HIP)/find_package(hip), reuses the
existing cuda .cc/.cu sources marked LANGUAGE HIP, and reads
CMAKE_HIP_ARCHITECTURES (defaulting to gfx90a only when unset) so other AMD
targets need no source edit.

Root cause of the one non-mechanical change: the two grid-stride copy kernels
used inline PTX (ld.global.cs / st.volatile.global vectorized loads/stores)
that does not exist on AMD. The PTX is now guarded by
#if defined(__CUDA_ARCH__) with a HIP branch doing a plain vectorized uint4
load/store (32-byte and 16-byte units). ROCm 7.2.1 does not provide the
__ldcs/__stcg/__stcs cache-streaming builtins, and those PTX qualifiers are
cache-policy hints rather than visibility semantics for this memcpy (each
thread writes a disjoint unit and the only consumer is the host after a stream
sync), so the plain copy is functionally equivalent. FlashInfer's cp_async.cuh
already selects its portable non-PTX fallback under hipcc, so it needed no
change.

The sparse Hamming module links libtorch. operator.h now includes
<ATen/hip/HIPContext.h> under USE_ROCM (the cuda-spelled context header pulls
in NVIDIA-only cuda_runtime_api.h/cusparse.h, while the hipified header exposes
the same c10::cuda::getCurrentCUDAStream backed by HIP). It builds at C++20
(torch 2.x headers use requires-clauses), without -ffast-math (the kernel uses
INFINITY as a masking sentinel that finite-math would drop), and with
pybind11_add_module(NO_EXTRAS) to avoid pybind's default LTO+strip dropping the
module init symbol under -fvisibility=hidden.

A set of WIN32-guarded host-build fixes let the backend also compile with the
clang-cl toolchain on Windows ROCm (Linux-only compiler/linker flags guarded,
three header-only infra sub-libraries changed from OBJECT to INTERFACE so the
linker language is determinable under Ninja+clang-cl, a getpid shim, metrics
symbol export, and excluding a POSIX-only thread test). The GPU device code is
unchanged by those guards.

Docs: the supported-platform matrix gains a ROCm/AMD row, and the vLLM and
SGLang quickstarts document PLATFORM=rocm beside PLATFORM=cuda.

This work was authored with the assistance of Claude, an AI assistant.

Test Plan:
Built and validated on real GPUs: Linux gfx90a (MI250X) and gfx1100 (Radeon
Pro W7800), and Windows gfx1201 (RX 9070 XT), gfx1101 (Radeon PRO V710), and
gfx1151 (Radeon 8060S), all on ROCm. All GPU tests run serially with one GPU
visible (HIP_VISIBLE_DEVICES=0).

Store/trans C++ surface plus unit tests:

cmake -S . -B build_rocm -DRUNTIME_ENVIRONMENT=rocm -DBUILD_UCM_STORE=ON \
  -DBUILD_UNIT_TESTS=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \
  -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CXX_FLAGS="-Wno-error=unused-result"
cmake --build build_rocm -j16
HIP_VISIBLE_DEVICES=0 ctest --test-dir build_rocm -j1

79/80 pass on Linux. The copy-kernel correctness gates all pass:
UCTransUnitTest.{CopyDataWithCE,CopyDataWithSM,CopyDataBatchWithSM} (byte-exact
host<->device round-trip) and the UCPosixTrans*/UCCacheTransBuffer cases
(store H2D/D2H batch copy with readback). The one failure,
UCMetricsUT.ConcurrentUpdateAndCollect, is a pre-existing CPU-only
multi-threaded metrics counter test (untouched ucm/shared/metrics), unrelated
to the GPU backend.

Hamming kernel (against a ROCm PyTorch):

cmake -S . -B build_sparse -DRUNTIME_ENVIRONMENT=rocm -DBUILD_UCM_STORE=OFF \
  -DBUILD_UCM_SPARSE=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \
  -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_BUILD_TYPE=Release \
  -DPython_EXECUTABLE=<rocm-torch-python> -DCMAKE_CXX_FLAGS="-Wno-error=unused-result"
cmake --build build_sparse -j16 --target hamming
HIP_VISIBLE_DEVICES=0 HAMMING_DIR=<dir with hamming*.so> \
  python ucm/sparse/test/gsa/test_hamming_rocm_ref.py

The new test computes an independent CPU popcount reference for the paged
block-mode score: mla (no kv reduction) matches within fp16 rounding, gqa
(min over kv heads) matches exactly, two-run output is bit-identical.

The NVIDIA path was checked to be unaffected: with PLATFORM=cuda the guards
select the original inline-PTX branch (compile-checked with nvcc 12.8 at
sm_80; the emitted PTX still contains the ld.global.cs / st.volatile.global
streaming ops), so the CUDA build compiles the same device code as before.

This adds a new RUNTIME_ENVIRONMENT=rocm device backend (PLATFORM=rocm in setup.py) alongside the existing cuda/ascend/maca/musa/simu backends, so the KV block-transfer kernels, the H2D/D2H store path, and the sparse Hamming scoring kernel build and run on AMD GPUs via HIP. The backend is purely additive: the existing backends are not modified. We have made every effort to leave the NVIDIA build unchanged -- every source change to a shared file is behind a `__CUDA_ARCH__` or `USE_ROCM` guard that the CUDA build does not compile, and the compat shim is only placed on the include path for the rocm build, never the cuda one. Select it with `export PLATFORM=rocm` (or `-DRUNTIME_ENVIRONMENT=rocm` when invoking CMake directly). Review order: start with ucm/shared/vendor/hip_compat/ (the compat shim), then the three new rocm/CMakeLists.txt arms (trans, store, sparse ham_dist), then the two guarded kernel sources, then the docs and Windows host-build guards. The compat shim resolves the project's `#include <cuda_runtime.h>`/`<cuda.h>` to <hip/hip_runtime.h> and aliases the small cuda* runtime surface in use (Malloc/Free/Memcpy[Async]/HostMalloc/HostRegister/Stream*/Event*) to hip*. Each rocm CMake arm calls enable_language(HIP)/find_package(hip), reuses the existing cuda .cc/.cu sources marked LANGUAGE HIP, and reads CMAKE_HIP_ARCHITECTURES (defaulting to gfx90a only when unset) so other AMD targets need no source edit. Root cause of the one non-mechanical change: the two grid-stride copy kernels used inline PTX (ld.global.cs / st.volatile.global vectorized loads/stores) that does not exist on AMD. The PTX is now guarded by `#if defined(__CUDA_ARCH__)` with a HIP branch doing a plain vectorized uint4 load/store (32-byte and 16-byte units). ROCm 7.2.1 does not provide the __ldcs/__stcg/__stcs cache-streaming builtins, and those PTX qualifiers are cache-policy hints rather than visibility semantics for this memcpy (each thread writes a disjoint unit and the only consumer is the host after a stream sync), so the plain copy is functionally equivalent. FlashInfer's cp_async.cuh already selects its portable non-PTX fallback under hipcc, so it needed no change. The sparse Hamming module links libtorch. operator.h now includes <ATen/hip/HIPContext.h> under USE_ROCM (the cuda-spelled context header pulls in NVIDIA-only cuda_runtime_api.h/cusparse.h, while the hipified header exposes the same c10::cuda::getCurrentCUDAStream backed by HIP). It builds at C++20 (torch 2.x headers use requires-clauses), without -ffast-math (the kernel uses INFINITY as a masking sentinel that finite-math would drop), and with pybind11_add_module(NO_EXTRAS) to avoid pybind's default LTO+strip dropping the module init symbol under -fvisibility=hidden. A set of WIN32-guarded host-build fixes let the backend also compile with the clang-cl toolchain on Windows ROCm (Linux-only compiler/linker flags guarded, three header-only infra sub-libraries changed from OBJECT to INTERFACE so the linker language is determinable under Ninja+clang-cl, a getpid shim, metrics symbol export, and excluding a POSIX-only thread test). The GPU device code is unchanged by those guards. Docs: the supported-platform matrix gains a ROCm/AMD row, and the vLLM and SGLang quickstarts document `PLATFORM=rocm` beside `PLATFORM=cuda`. This work was authored with the assistance of Claude, an AI assistant. Test Plan: Built and validated on real GPUs: Linux gfx90a (MI250X) and gfx1100 (Radeon Pro W7800), and Windows gfx1201 (RX 9070 XT), gfx1101 (Radeon PRO V710), and gfx1151 (Radeon 8060S), all on ROCm. All GPU tests run serially with one GPU visible (HIP_VISIBLE_DEVICES=0). Store/trans C++ surface plus unit tests: ``` cmake -S . -B build_rocm -DRUNTIME_ENVIRONMENT=rocm -DBUILD_UCM_STORE=ON \ -DBUILD_UNIT_TESTS=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \ -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_BUILD_TYPE=Release \ -DCMAKE_CXX_FLAGS="-Wno-error=unused-result" cmake --build build_rocm -j16 HIP_VISIBLE_DEVICES=0 ctest --test-dir build_rocm -j1 ``` 79/80 pass on Linux. The copy-kernel correctness gates all pass: UCTransUnitTest.{CopyDataWithCE,CopyDataWithSM,CopyDataBatchWithSM} (byte-exact host<->device round-trip) and the UCPosixTrans*/UCCacheTransBuffer cases (store H2D/D2H batch copy with readback). The one failure, UCMetricsUT.ConcurrentUpdateAndCollect, is a pre-existing CPU-only multi-threaded metrics counter test (untouched ucm/shared/metrics), unrelated to the GPU backend. Hamming kernel (against a ROCm PyTorch): ``` cmake -S . -B build_sparse -DRUNTIME_ENVIRONMENT=rocm -DBUILD_UCM_STORE=OFF \ -DBUILD_UCM_SPARSE=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \ -DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_BUILD_TYPE=Release \ -DPython_EXECUTABLE=<rocm-torch-python> -DCMAKE_CXX_FLAGS="-Wno-error=unused-result" cmake --build build_sparse -j16 --target hamming HIP_VISIBLE_DEVICES=0 HAMMING_DIR=<dir with hamming*.so> \ python ucm/sparse/test/gsa/test_hamming_rocm_ref.py ``` The new test computes an independent CPU popcount reference for the paged block-mode score: mla (no kv reduction) matches within fp16 rounding, gqa (min over kv heads) matches exactly, two-run output is bit-identical. The NVIDIA path was checked to be unaffected: with PLATFORM=cuda the guards select the original inline-PTX branch (compile-checked with nvcc 12.8 at sm_80; the emitted PTX still contains the ld.global.cs / st.volatile.global streaming ops), so the CUDA build compiles the same device code as before.

flesher0813 · 2026-06-12T01:36:31Z

        match PLATFORM:
            case "cuda":
                cmake_args += ["-DRUNTIME_ENVIRONMENT=cuda"]
+            case "rocm":


setup.py adds PLATFORM=rocm support, but the warning message still lists only cuda/ascend/ascend-a3/musa/maca. Please include rocm there as well.

Done in 53f8b25 -- added rocm to the PLATFORM list in that warning.

ygwpz · 2026-06-13T06:27:02Z

+    // streaming builtins; those are NVIDIA cache-policy hints, not semantics.
+    // A plain vectorized 32-byte copy is the portable equivalent.
+    const uint4* src4 = reinterpret_cast<const uint4*>(src);
+    uint4* dst4 = reinterpret_cast<uint4*>(const_cast<uint8_t*>(dst));


⚠️ Warning: The const_cast<uint8_t*>(dst) removes the volatile qualifier from the destination pointer. The original CUDA implementation uses st.volatile.global PTX which ensures proper memory visibility semantics for device-to-host transfers. While the comment correctly notes that cache-policy hints are NVIDIA-specific, the volatile qualifier itself has semantic meaning - it prevents compiler optimizations that could reorder or eliminate memory operations. On HIP, consider using __builtin_nontemporal_store or ensuring proper memory fence semantics to maintain equivalent visibility guarantees. The current plain copy may be functionally correct for this use case (disjoint writes with stream sync), but the removal of volatile should be explicitly justified in the comment.

I kept the plain copy and documented why in 53f8b25 rather than reintroducing volatile, because on AMD it doesn't do what st.volatile does on NVIDIA here. Measured on gfx90a (ROCm 7.2): a volatile access compiles to flat addressing plus the glc bit on loads (an L1 bypass at GPU-L2 scope) and no cache bit on stores -- i.e. GPU-L2 coherence, not host/system visibility. Host visibility for these H2D/D2H copies instead comes from the transfer buffers being fine-grained-coherent host registrations (hipHostRegister default) plus the hipStreamSynchronize after each copy (GPU caches flush at kernel completion). So the plain copy is visible to the host; volatile would be neither necessary nor sufficient. System-scope ordering, if it were ever needed, would be __threadfence_system, not the qualifier.

ygwpz · 2026-06-13T06:27:04Z

+    // A plain vectorized 32-byte copy is the portable equivalent.
+    const uint4* src4 = reinterpret_cast<const uint4*>(src);
+    uint4* dst4 = reinterpret_cast<uint4*>(const_cast<uint8_t*>(dst));
+    dst4[0] = src4[0];


💡 Suggestion: The reinterpret_cast<uint4*> assumes the source pointer is properly aligned to 16-byte boundary for uint4 access. While this is likely true for the KV cache blocks (which are typically allocated with proper alignment), consider adding an assertion or documentation noting the alignment requirement. Misaligned access could cause undefined behavior or performance degradation on some AMD architectures.

The 16-byte alignment requirement is unchanged from the original CUDA path -- its ld.global.cs.v4.b32 vectorized access requires the same 16-byte alignment, which the KV blocks satisfy. So the HIP path introduces no new constraint; keeping it as-is for parity with the CUDA code.

ygwpz · 2026-06-13T06:27:05Z

    asm volatile("ld.global.cs.v2.u64 {%0, %1}, [%2];" : "=l"(a), "=l"(b) : "l"(src));
    asm volatile("st.global.cg.v2.u64 [%0], {%1, %2};" ::"l"(dst), "l"(a), "l"(b));
+#else
+    *reinterpret_cast<uint4*>(dst) =


⚠️ Warning: Similar to cuda_sm_kernel.cu, the reinterpret_cast<uint4*>(dst) removes both const and volatile qualifiers. The volatile qualifier on the destination pointer in D2HUnit was intentional to ensure proper visibility of writes to host memory. While the stream synchronization after the kernel provides the necessary fence, the explicit volatile in the original code served as a compiler barrier. On HIP, this plain copy should work correctly due to the subsequent sync, but the removal of volatile semantics should be documented as a deliberate design decision rather than an incidental consequence of the shim.

I kept the plain copy and documented why in 53f8b25 rather than reintroducing volatile, because on AMD it doesn't do what st.volatile does on NVIDIA here. Measured on gfx90a (ROCm 7.2): a volatile access compiles to flat addressing plus the glc bit on loads (an L1 bypass at GPU-L2 scope) and no cache bit on stores -- i.e. GPU-L2 coherence, not host/system visibility. Host visibility for these H2D/D2H copies instead comes from the transfer buffers being fine-grained-coherent host registrations (hipHostRegister default) plus the hipStreamSynchronize after each copy (GPU caches flush at kernel completion). So the plain copy is visible to the host; volatile would be neither necessary nor sufficient. System-scope ordering, if it were ever needed, would be __threadfence_system, not the qualifier.

ygwpz · 2026-06-13T06:27:07Z

+#endif
 }

 inline __device__ void D2HUnit(volatile uint8_t* __restrict__ dst, const uint8_t* __restrict__ src)


⚠️ Warning: The const_cast<uint8_t*>(dst) removes the volatile qualifier. This mirrors the issue in H2DUnit above. The D2H path writes to host-registered memory, and the volatile qualifier ensured the compiler wouldn't optimize away or reorder these writes. The HIP fallback relies on stream synchronization for correctness, which is valid, but the comment should clarify that volatile semantics are intentionally replaced by stream sync barriers.

I kept the plain copy and documented why in 53f8b25 rather than reintroducing volatile, because on AMD it doesn't do what st.volatile does on NVIDIA here. Measured on gfx90a (ROCm 7.2): a volatile access compiles to flat addressing plus the glc bit on loads (an L1 bypass at GPU-L2 scope) and no cache bit on stores -- i.e. GPU-L2 coherence, not host/system visibility. Host visibility for these H2D/D2H copies instead comes from the transfer buffers being fine-grained-coherent host registrations (hipHostRegister default) plus the hipStreamSynchronize after each copy (GPU caches flush at kernel completion). So the plain copy is visible to the host; volatile would be neither necessary nor sufficient. System-scope ordering, if it were ever needed, would be __threadfence_system, not the qualifier.

ygwpz · 2026-06-13T06:27:08Z

+#endif
 #include <cuda_runtime.h>
 #include <torch/script.h>



💡 Suggestion: The #include <cuda_runtime.h> after the conditional block relies on the hip_compat shim being on the include path for ROCm builds. This works correctly when the CMake configuration is proper, but could cause confusing build failures if the shim directory is not added to include paths. Consider adding a comment here noting this dependency, or adding an #ifdef USE_ROCM guard with an #error directive if the shim is missing (e.g., checking for UNIFIEDCACHE_HIP_COMPAT_CUDA_RUNTIME_H define).

The rocm CMake arm puts the hip_compat shim dir on the include path (that's where cuda_runtime.h resolves for ROCm builds), so a misconfiguration fails loudly at compile time rather than silently. I'd rather not add an #error probe keyed on a shim-specific define, since it couples the source to the build layout. Leaving as-is.

ygwpz · 2026-06-13T06:27:10Z

+static constexpr hipError_t cudaSuccess = hipSuccess;
+static constexpr hipMemcpyKind cudaMemcpyHostToDevice = hipMemcpyHostToDevice;
+static constexpr hipMemcpyKind cudaMemcpyDeviceToHost = hipMemcpyDeviceToHost;
+static constexpr unsigned int cudaStreamNonBlocking = hipStreamNonBlocking;


💡 Suggestion: Using #define macros for function aliases prevents proper function overload resolution and could cause issues if CUDA APIs add new signatures in future versions. For this limited set of APIs, the macro approach is acceptable. However, consider adding inline wrapper functions instead of macros for type safety and better debugging support. Example: inline cudaError_t cudaMalloc(void** ptr, size_t size) { return hipMalloc(ptr, size); } would preserve the cudaError_t return type explicitly.

The #define approach is intentional: it matches the cuda_to_hip compatibility-header convention used across these ports, and it also has to cover the type aliases (cudaStream_t, cudaError_t, ...) that inline wrappers can't. The aliased surface here is small and unambiguous (Malloc/Free/Memcpy/Stream/Event), so the macros don't create overload-resolution hazards in practice. Leaving as-is.

ygwpz · 2026-06-13T06:27:11Z

@@ -0,0 +1,30 @@
+if(NOT DEFINED CMAKE_HIP_ARCHITECTURES OR CMAKE_HIP_ARCHITECTURES STREQUAL "")
+    set(CMAKE_HIP_ARCHITECTURES "gfx90a")


💡 Suggestion: The architecture defaults to gfx90a (MI250X/MI210) which is appropriate for datacenter AMD GPUs, but may cause confusion for users with consumer Radeon GPUs (gfx1100, gfx1201, etc.). Consider adding a comment or CMake warning message when defaulting, suggesting users set CMAKE_HIP_ARCHITECTURES explicitly. The PR description mentions gfx1100 validation but the default doesn't reflect this broader support.

Fixed in b697824. These arms pinned gfx90a before enable_language(HIP), which preempted CMake's host-GPU auto-detection (rocm_agent_enumerator) -- a non-gfx90a AMD user who didn't pass -DCMAKE_HIP_ARCHITECTURES would have silently built gfx90a binaries that fail to load on their card. Now enable_language(HIP) auto-detects the host arch (deduped), with gfx90a used only as a fallback when nothing is detected (e.g. a CPU-only build host); -DCMAKE_HIP_ARCHITECTURES still overrides.

Refined in 57b2443: I dropped the explicit gfx90a fallback. enable_language(HIP) already errors clearly when no GPU is detected and no arch is set, so a no-GPU build host now sets -DCMAKE_HIP_ARCHITECTURES explicitly rather than getting a silent gfx90a default.

ygwpz · 2026-06-13T06:27:13Z

+string(REPLACE "-s " "" CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}")
+string(REGEX REPLACE "(^| )-s$" "" CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}")
+
+if(NOT DEFINED CMAKE_HIP_ARCHITECTURES OR CMAKE_HIP_ARCHITECTURES STREQUAL "")


💡 Suggestion: The string(REPLACE "-s " "" ...) and string(REGEX REPLACE ...) pattern for removing the -s flag is somewhat fragile. It assumes -s appears with a trailing space or at the end of the string. If -s appears without a space (e.g., -s -O2), the first REPLACE handles it, but other edge cases might slip through. Consider using a more robust approach like string(REGEX REPLACE "-s( |$)" "" ...) to handle all positions uniformly.

The two statements already cover both positions -- string(REPLACE "-s " ...) handles -s anywhere with a trailing space, and string(REGEX REPLACE "(^| )-s$" ...) handles it at end-of-string -- and both leave -shared/-static untouched. So it's robust for the flag strings in play; leaving as-is.

ygwpz · 2026-06-13T06:27:15Z

+            # element (kv*block_size + offset)*num_chunk + chunk. Index a flat
+            # view of the key tensor the same way so the reference matches the
+            # kernel's layout regardless of the host tensor's nominal shape.
+            block = key[phys].reshape(-1)


💡 Suggestion: The reference implementation uses a Python loop to compute popcount32, which is correct but slow. For test correctness verification this is acceptable. However, consider using torch.bincount or numpy's built-in popcount for better performance if this test needs to run on larger tensors. The current implementation is fine for the test sizes used (hd=576, block_size=64).

This is the reference oracle for the test, so I've kept it as the simplest obviously-correct implementation rather than optimizing it; as you note, it's fine at the test sizes. Leaving as-is.

ygwpz · 2026-06-13T06:27:16Z

@@ -0,0 +1,18 @@
+if(NOT DEFINED CMAKE_HIP_ARCHITECTURES OR CMAKE_HIP_ARCHITECTURES STREQUAL "")
+    set(CMAKE_HIP_ARCHITECTURES "gfx90a")


💡 Suggestion: Same as the trans/rocm CMakeLists.txt - the gfx90a default should be documented or warned about. Users building on consumer AMD GPUs will need to override this.

Fixed in b697824. These arms pinned gfx90a before enable_language(HIP), which preempted CMake's host-GPU auto-detection (rocm_agent_enumerator) -- a non-gfx90a AMD user who didn't pass -DCMAKE_HIP_ARCHITECTURES would have silently built gfx90a binaries that fail to load on their card. Now enable_language(HIP) auto-detects the host arch (deduped), with gfx90a used only as a fallback when nothing is detected (e.g. a CPU-only build host); -DCMAKE_HIP_ARCHITECTURES still overrides.

Refined in 57b2443: I dropped the explicit gfx90a fallback. enable_language(HIP) already errors clearly when no GPU is detected and no arch is set, so a no-GPU build host now sets -DCMAKE_HIP_ARCHITECTURES explicitly rather than getting a silent gfx90a default.

…e volatile drop Two review follow-ups, both non-functional: - setup.py: the "PLATFORM not set" warning listed cuda/ascend/ascend-a3/musa/maca but not rocm, even though the build supports PLATFORM=rocm. Add rocm. - The HIP copy paths in cuda_sm_kernel.cu and cuda_device.cu drop the `volatile` qualifier the CUDA path carries (st.volatile.global). Document why that is correct on AMD rather than reintroducing it: host visibility for these H2D/D2H copies comes from the fine-grained-coherent host registration plus the per-transfer hipStreamSynchronize (GPU caches flush at kernel completion), not from the qualifier. On AMD `volatile` only forces an L1 bypass (glc, GPU-L2 scope) -- neither necessary nor sufficient for host visibility; system-scope ordering, if it were ever needed, is __threadfence_system. Comments and a warning string only; no generated code changes. Authored with assistance from Claude.

The three ROCm CMake arms set CMAKE_HIP_ARCHITECTURES to gfx90a before enable_language(HIP) whenever it was unset, which preempted CMake's own host GPU auto-detection (via rocm_agent_enumerator). A user on a non-gfx90a AMD GPU (e.g. gfx1100) who did not pass -DCMAKE_HIP_ARCHITECTURES would silently build gfx90a code objects that fail to load on their card. Let enable_language(HIP) auto-detect the host architecture, dedup the result (it can list one entry per agent), and fall back to gfx90a only when nothing is detected (e.g. a CPU-only build host). Passing -DCMAKE_HIP_ARCHITECTURES still overrides, so explicit-arch builds are unchanged. Authored with assistance from Claude. Test Plan: configured the ROCm store build on gfx90a without -DCMAKE_HIP_ARCHITECTURES; auto-detection resolves to gfx90a and the kernels compile with --offload-arch=gfx90a (single, deduped). Explicit -DCMAKE_HIP_ARCHITECTURES=gfx90a is byte-identical to before.

Follow-up to b697824. enable_language(HIP) already honors an explicit -DCMAKE_HIP_ARCHITECTURES, otherwise auto-detects the host GPU(s) via rocm_agent_enumerator, and errors if none is found. So the explicit gfx90a-on-no-GPU fallback was dead code (enable_language has already errored by then), and deduping its detection output is unnecessary. Drop both; the no-GPU build host now gets CMake's clear "set the architecture" error instead of a silently-wrong gfx90a default. Authored with assistance from Claude.

jeffdaily requested review from FangRun2, Infinite666, Tarrei, flesher0813, mag1c-h, wangwenxin0312, wuhuxiao and ygwpz as code owners June 11, 2026 22:34

flesher0813 reviewed Jun 12, 2026

View reviewed changes

ygwpz reviewed Jun 13, 2026

View reviewed changes

jeffdaily added 3 commits June 23, 2026 18:03

		@@ -0,0 +1,30 @@
		if(NOT DEFINED CMAKE_HIP_ARCHITECTURES OR CMAKE_HIP_ARCHITECTURES STREQUAL "")
		set(CMAKE_HIP_ARCHITECTURES "gfx90a")

		@@ -0,0 +1,18 @@
		if(NOT DEFINED CMAKE_HIP_ARCHITECTURES OR CMAKE_HIP_ARCHITECTURES STREQUAL "")
		set(CMAKE_HIP_ARCHITECTURES "gfx90a")

Uh oh!

Conversation

jeffdaily commented Jun 11, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants