Add a HIP/ROCm device backend for KV-transfer and Hamming kernels#1021
Add a HIP/ROCm device backend for KV-transfer and Hamming kernels#1021jeffdaily wants to merge 4 commits into
Conversation
This adds a new RUNTIME_ENVIRONMENT=rocm device backend (PLATFORM=rocm in
setup.py) alongside the existing cuda/ascend/maca/musa/simu backends, so the
KV block-transfer kernels, the H2D/D2H store path, and the sparse Hamming
scoring kernel build and run on AMD GPUs via HIP. The backend is purely
additive: the existing backends are not modified. We have made every effort
to leave the NVIDIA build unchanged -- every source change to a shared file is
behind a `__CUDA_ARCH__` or `USE_ROCM` guard that the CUDA build does not
compile, and the compat shim is only placed on the include path for the rocm
build, never the cuda one. Select it with `export PLATFORM=rocm` (or
`-DRUNTIME_ENVIRONMENT=rocm` when invoking CMake directly).
Review order: start with ucm/shared/vendor/hip_compat/ (the compat shim), then
the three new rocm/CMakeLists.txt arms (trans, store, sparse ham_dist), then
the two guarded kernel sources, then the docs and Windows host-build guards.
The compat shim resolves the project's `#include <cuda_runtime.h>`/`<cuda.h>`
to <hip/hip_runtime.h> and aliases the small cuda* runtime surface in use
(Malloc/Free/Memcpy[Async]/HostMalloc/HostRegister/Stream*/Event*) to hip*.
Each rocm CMake arm calls enable_language(HIP)/find_package(hip), reuses the
existing cuda .cc/.cu sources marked LANGUAGE HIP, and reads
CMAKE_HIP_ARCHITECTURES (defaulting to gfx90a only when unset) so other AMD
targets need no source edit.
Root cause of the one non-mechanical change: the two grid-stride copy kernels
used inline PTX (ld.global.cs / st.volatile.global vectorized loads/stores)
that does not exist on AMD. The PTX is now guarded by
`#if defined(__CUDA_ARCH__)` with a HIP branch doing a plain vectorized uint4
load/store (32-byte and 16-byte units). ROCm 7.2.1 does not provide the
__ldcs/__stcg/__stcs cache-streaming builtins, and those PTX qualifiers are
cache-policy hints rather than visibility semantics for this memcpy (each
thread writes a disjoint unit and the only consumer is the host after a stream
sync), so the plain copy is functionally equivalent. FlashInfer's cp_async.cuh
already selects its portable non-PTX fallback under hipcc, so it needed no
change.
The sparse Hamming module links libtorch. operator.h now includes
<ATen/hip/HIPContext.h> under USE_ROCM (the cuda-spelled context header pulls
in NVIDIA-only cuda_runtime_api.h/cusparse.h, while the hipified header exposes
the same c10::cuda::getCurrentCUDAStream backed by HIP). It builds at C++20
(torch 2.x headers use requires-clauses), without -ffast-math (the kernel uses
INFINITY as a masking sentinel that finite-math would drop), and with
pybind11_add_module(NO_EXTRAS) to avoid pybind's default LTO+strip dropping the
module init symbol under -fvisibility=hidden.
A set of WIN32-guarded host-build fixes let the backend also compile with the
clang-cl toolchain on Windows ROCm (Linux-only compiler/linker flags guarded,
three header-only infra sub-libraries changed from OBJECT to INTERFACE so the
linker language is determinable under Ninja+clang-cl, a getpid shim, metrics
symbol export, and excluding a POSIX-only thread test). The GPU device code is
unchanged by those guards.
Docs: the supported-platform matrix gains a ROCm/AMD row, and the vLLM and
SGLang quickstarts document `PLATFORM=rocm` beside `PLATFORM=cuda`.
This work was authored with the assistance of Claude, an AI assistant.
Test Plan:
Built and validated on real GPUs: Linux gfx90a (MI250X) and gfx1100 (Radeon
Pro W7800), and Windows gfx1201 (RX 9070 XT), gfx1101 (Radeon PRO V710), and
gfx1151 (Radeon 8060S), all on ROCm. All GPU tests run serially with one GPU
visible (HIP_VISIBLE_DEVICES=0).
Store/trans C++ surface plus unit tests:
```
cmake -S . -B build_rocm -DRUNTIME_ENVIRONMENT=rocm -DBUILD_UCM_STORE=ON \
-DBUILD_UNIT_TESTS=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \
-DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_FLAGS="-Wno-error=unused-result"
cmake --build build_rocm -j16
HIP_VISIBLE_DEVICES=0 ctest --test-dir build_rocm -j1
```
79/80 pass on Linux. The copy-kernel correctness gates all pass:
UCTransUnitTest.{CopyDataWithCE,CopyDataWithSM,CopyDataBatchWithSM} (byte-exact
host<->device round-trip) and the UCPosixTrans*/UCCacheTransBuffer cases
(store H2D/D2H batch copy with readback). The one failure,
UCMetricsUT.ConcurrentUpdateAndCollect, is a pre-existing CPU-only
multi-threaded metrics counter test (untouched ucm/shared/metrics), unrelated
to the GPU backend.
Hamming kernel (against a ROCm PyTorch):
```
cmake -S . -B build_sparse -DRUNTIME_ENVIRONMENT=rocm -DBUILD_UCM_STORE=OFF \
-DBUILD_UCM_SPARSE=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \
-DCMAKE_HIP_COMPILER=/opt/rocm/llvm/bin/clang++ -DCMAKE_BUILD_TYPE=Release \
-DPython_EXECUTABLE=<rocm-torch-python> -DCMAKE_CXX_FLAGS="-Wno-error=unused-result"
cmake --build build_sparse -j16 --target hamming
HIP_VISIBLE_DEVICES=0 HAMMING_DIR=<dir with hamming*.so> \
python ucm/sparse/test/gsa/test_hamming_rocm_ref.py
```
The new test computes an independent CPU popcount reference for the paged
block-mode score: mla (no kv reduction) matches within fp16 rounding, gqa
(min over kv heads) matches exactly, two-run output is bit-identical.
The NVIDIA path was checked to be unaffected: with PLATFORM=cuda the guards
select the original inline-PTX branch (compile-checked with nvcc 12.8 at
sm_80; the emitted PTX still contains the ld.global.cs / st.volatile.global
streaming ops), so the CUDA build compiles the same device code as before.
| match PLATFORM: | ||
| case "cuda": | ||
| cmake_args += ["-DRUNTIME_ENVIRONMENT=cuda"] | ||
| case "rocm": |
There was a problem hiding this comment.
setup.py adds PLATFORM=rocm support, but the warning message still lists only cuda/ascend/ascend-a3/musa/maca. Please include rocm there as well.
There was a problem hiding this comment.
Done in 53f8b25 -- added rocm to the PLATFORM list in that warning.
| // streaming builtins; those are NVIDIA cache-policy hints, not semantics. | ||
| // A plain vectorized 32-byte copy is the portable equivalent. | ||
| const uint4* src4 = reinterpret_cast<const uint4*>(src); | ||
| uint4* dst4 = reinterpret_cast<uint4*>(const_cast<uint8_t*>(dst)); |
There was a problem hiding this comment.
const_cast<uint8_t*>(dst) removes the volatile qualifier from the destination pointer. The original CUDA implementation uses st.volatile.global PTX which ensures proper memory visibility semantics for device-to-host transfers. While the comment correctly notes that cache-policy hints are NVIDIA-specific, the volatile qualifier itself has semantic meaning - it prevents compiler optimizations that could reorder or eliminate memory operations. On HIP, consider using __builtin_nontemporal_store or ensuring proper memory fence semantics to maintain equivalent visibility guarantees. The current plain copy may be functionally correct for this use case (disjoint writes with stream sync), but the removal of volatile should be explicitly justified in the comment.
There was a problem hiding this comment.
I kept the plain copy and documented why in 53f8b25 rather than reintroducing volatile, because on AMD it doesn't do what st.volatile does on NVIDIA here. Measured on gfx90a (ROCm 7.2): a volatile access compiles to flat addressing plus the glc bit on loads (an L1 bypass at GPU-L2 scope) and no cache bit on stores -- i.e. GPU-L2 coherence, not host/system visibility. Host visibility for these H2D/D2H copies instead comes from the transfer buffers being fine-grained-coherent host registrations (hipHostRegister default) plus the hipStreamSynchronize after each copy (GPU caches flush at kernel completion). So the plain copy is visible to the host; volatile would be neither necessary nor sufficient. System-scope ordering, if it were ever needed, would be __threadfence_system, not the qualifier.
| // A plain vectorized 32-byte copy is the portable equivalent. | ||
| const uint4* src4 = reinterpret_cast<const uint4*>(src); | ||
| uint4* dst4 = reinterpret_cast<uint4*>(const_cast<uint8_t*>(dst)); | ||
| dst4[0] = src4[0]; |
There was a problem hiding this comment.
💡 Suggestion: The reinterpret_cast<uint4*> assumes the source pointer is properly aligned to 16-byte boundary for uint4 access. While this is likely true for the KV cache blocks (which are typically allocated with proper alignment), consider adding an assertion or documentation noting the alignment requirement. Misaligned access could cause undefined behavior or performance degradation on some AMD architectures.
There was a problem hiding this comment.
The 16-byte alignment requirement is unchanged from the original CUDA path -- its ld.global.cs.v4.b32 vectorized access requires the same 16-byte alignment, which the KV blocks satisfy. So the HIP path introduces no new constraint; keeping it as-is for parity with the CUDA code.
| asm volatile("ld.global.cs.v2.u64 {%0, %1}, [%2];" : "=l"(a), "=l"(b) : "l"(src)); | ||
| asm volatile("st.global.cg.v2.u64 [%0], {%1, %2};" ::"l"(dst), "l"(a), "l"(b)); | ||
| #else | ||
| *reinterpret_cast<uint4*>(dst) = |
There was a problem hiding this comment.
reinterpret_cast<uint4*>(dst) removes both const and volatile qualifiers. The volatile qualifier on the destination pointer in D2HUnit was intentional to ensure proper visibility of writes to host memory. While the stream synchronization after the kernel provides the necessary fence, the explicit volatile in the original code served as a compiler barrier. On HIP, this plain copy should work correctly due to the subsequent sync, but the removal of volatile semantics should be documented as a deliberate design decision rather than an incidental consequence of the shim.
There was a problem hiding this comment.
I kept the plain copy and documented why in 53f8b25 rather than reintroducing volatile, because on AMD it doesn't do what st.volatile does on NVIDIA here. Measured on gfx90a (ROCm 7.2): a volatile access compiles to flat addressing plus the glc bit on loads (an L1 bypass at GPU-L2 scope) and no cache bit on stores -- i.e. GPU-L2 coherence, not host/system visibility. Host visibility for these H2D/D2H copies instead comes from the transfer buffers being fine-grained-coherent host registrations (hipHostRegister default) plus the hipStreamSynchronize after each copy (GPU caches flush at kernel completion). So the plain copy is visible to the host; volatile would be neither necessary nor sufficient. System-scope ordering, if it were ever needed, would be __threadfence_system, not the qualifier.
| #endif | ||
| } | ||
|
|
||
| inline __device__ void D2HUnit(volatile uint8_t* __restrict__ dst, const uint8_t* __restrict__ src) |
There was a problem hiding this comment.
const_cast<uint8_t*>(dst) removes the volatile qualifier. This mirrors the issue in H2DUnit above. The D2H path writes to host-registered memory, and the volatile qualifier ensured the compiler wouldn't optimize away or reorder these writes. The HIP fallback relies on stream synchronization for correctness, which is valid, but the comment should clarify that volatile semantics are intentionally replaced by stream sync barriers.
There was a problem hiding this comment.
I kept the plain copy and documented why in 53f8b25 rather than reintroducing volatile, because on AMD it doesn't do what st.volatile does on NVIDIA here. Measured on gfx90a (ROCm 7.2): a volatile access compiles to flat addressing plus the glc bit on loads (an L1 bypass at GPU-L2 scope) and no cache bit on stores -- i.e. GPU-L2 coherence, not host/system visibility. Host visibility for these H2D/D2H copies instead comes from the transfer buffers being fine-grained-coherent host registrations (hipHostRegister default) plus the hipStreamSynchronize after each copy (GPU caches flush at kernel completion). So the plain copy is visible to the host; volatile would be neither necessary nor sufficient. System-scope ordering, if it were ever needed, would be __threadfence_system, not the qualifier.
| #endif | ||
| #include <cuda_runtime.h> | ||
| #include <torch/script.h> | ||
|
|
There was a problem hiding this comment.
💡 Suggestion: The #include <cuda_runtime.h> after the conditional block relies on the hip_compat shim being on the include path for ROCm builds. This works correctly when the CMake configuration is proper, but could cause confusing build failures if the shim directory is not added to include paths. Consider adding a comment here noting this dependency, or adding an #ifdef USE_ROCM guard with an #error directive if the shim is missing (e.g., checking for UNIFIEDCACHE_HIP_COMPAT_CUDA_RUNTIME_H define).
There was a problem hiding this comment.
The rocm CMake arm puts the hip_compat shim dir on the include path (that's where cuda_runtime.h resolves for ROCm builds), so a misconfiguration fails loudly at compile time rather than silently. I'd rather not add an #error probe keyed on a shim-specific define, since it couples the source to the build layout. Leaving as-is.
| static constexpr hipError_t cudaSuccess = hipSuccess; | ||
| static constexpr hipMemcpyKind cudaMemcpyHostToDevice = hipMemcpyHostToDevice; | ||
| static constexpr hipMemcpyKind cudaMemcpyDeviceToHost = hipMemcpyDeviceToHost; | ||
| static constexpr unsigned int cudaStreamNonBlocking = hipStreamNonBlocking; |
There was a problem hiding this comment.
💡 Suggestion: Using #define macros for function aliases prevents proper function overload resolution and could cause issues if CUDA APIs add new signatures in future versions. For this limited set of APIs, the macro approach is acceptable. However, consider adding inline wrapper functions instead of macros for type safety and better debugging support. Example: inline cudaError_t cudaMalloc(void** ptr, size_t size) { return hipMalloc(ptr, size); } would preserve the cudaError_t return type explicitly.
There was a problem hiding this comment.
The #define approach is intentional: it matches the cuda_to_hip compatibility-header convention used across these ports, and it also has to cover the type aliases (cudaStream_t, cudaError_t, ...) that inline wrappers can't. The aliased surface here is small and unambiguous (Malloc/Free/Memcpy/Stream/Event), so the macros don't create overload-resolution hazards in practice. Leaving as-is.
| @@ -0,0 +1,30 @@ | |||
| if(NOT DEFINED CMAKE_HIP_ARCHITECTURES OR CMAKE_HIP_ARCHITECTURES STREQUAL "") | |||
| set(CMAKE_HIP_ARCHITECTURES "gfx90a") | |||
There was a problem hiding this comment.
💡 Suggestion: The architecture defaults to gfx90a (MI250X/MI210) which is appropriate for datacenter AMD GPUs, but may cause confusion for users with consumer Radeon GPUs (gfx1100, gfx1201, etc.). Consider adding a comment or CMake warning message when defaulting, suggesting users set CMAKE_HIP_ARCHITECTURES explicitly. The PR description mentions gfx1100 validation but the default doesn't reflect this broader support.
There was a problem hiding this comment.
Fixed in b697824. These arms pinned gfx90a before enable_language(HIP), which preempted CMake's host-GPU auto-detection (rocm_agent_enumerator) -- a non-gfx90a AMD user who didn't pass -DCMAKE_HIP_ARCHITECTURES would have silently built gfx90a binaries that fail to load on their card. Now enable_language(HIP) auto-detects the host arch (deduped), with gfx90a used only as a fallback when nothing is detected (e.g. a CPU-only build host); -DCMAKE_HIP_ARCHITECTURES still overrides.
There was a problem hiding this comment.
Refined in 57b2443: I dropped the explicit gfx90a fallback. enable_language(HIP) already errors clearly when no GPU is detected and no arch is set, so a no-GPU build host now sets -DCMAKE_HIP_ARCHITECTURES explicitly rather than getting a silent gfx90a default.
| string(REPLACE "-s " "" CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}") | ||
| string(REGEX REPLACE "(^| )-s$" "" CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}") | ||
|
|
||
| if(NOT DEFINED CMAKE_HIP_ARCHITECTURES OR CMAKE_HIP_ARCHITECTURES STREQUAL "") |
There was a problem hiding this comment.
💡 Suggestion: The string(REPLACE "-s " "" ...) and string(REGEX REPLACE ...) pattern for removing the -s flag is somewhat fragile. It assumes -s appears with a trailing space or at the end of the string. If -s appears without a space (e.g., -s -O2), the first REPLACE handles it, but other edge cases might slip through. Consider using a more robust approach like string(REGEX REPLACE "-s( |$)" "" ...) to handle all positions uniformly.
There was a problem hiding this comment.
The two statements already cover both positions -- string(REPLACE "-s " ...) handles -s anywhere with a trailing space, and string(REGEX REPLACE "(^| )-s$" ...) handles it at end-of-string -- and both leave -shared/-static untouched. So it's robust for the flag strings in play; leaving as-is.
| # element (kv*block_size + offset)*num_chunk + chunk. Index a flat | ||
| # view of the key tensor the same way so the reference matches the | ||
| # kernel's layout regardless of the host tensor's nominal shape. | ||
| block = key[phys].reshape(-1) |
There was a problem hiding this comment.
💡 Suggestion: The reference implementation uses a Python loop to compute popcount32, which is correct but slow. For test correctness verification this is acceptable. However, consider using torch.bincount or numpy's built-in popcount for better performance if this test needs to run on larger tensors. The current implementation is fine for the test sizes used (hd=576, block_size=64).
There was a problem hiding this comment.
This is the reference oracle for the test, so I've kept it as the simplest obviously-correct implementation rather than optimizing it; as you note, it's fine at the test sizes. Leaving as-is.
| @@ -0,0 +1,18 @@ | |||
| if(NOT DEFINED CMAKE_HIP_ARCHITECTURES OR CMAKE_HIP_ARCHITECTURES STREQUAL "") | |||
| set(CMAKE_HIP_ARCHITECTURES "gfx90a") | |||
There was a problem hiding this comment.
💡 Suggestion: Same as the trans/rocm CMakeLists.txt - the gfx90a default should be documented or warned about. Users building on consumer AMD GPUs will need to override this.
There was a problem hiding this comment.
Fixed in b697824. These arms pinned gfx90a before enable_language(HIP), which preempted CMake's host-GPU auto-detection (rocm_agent_enumerator) -- a non-gfx90a AMD user who didn't pass -DCMAKE_HIP_ARCHITECTURES would have silently built gfx90a binaries that fail to load on their card. Now enable_language(HIP) auto-detects the host arch (deduped), with gfx90a used only as a fallback when nothing is detected (e.g. a CPU-only build host); -DCMAKE_HIP_ARCHITECTURES still overrides.
There was a problem hiding this comment.
Refined in 57b2443: I dropped the explicit gfx90a fallback. enable_language(HIP) already errors clearly when no GPU is detected and no arch is set, so a no-GPU build host now sets -DCMAKE_HIP_ARCHITECTURES explicitly rather than getting a silent gfx90a default.
…e volatile drop Two review follow-ups, both non-functional: - setup.py: the "PLATFORM not set" warning listed cuda/ascend/ascend-a3/musa/maca but not rocm, even though the build supports PLATFORM=rocm. Add rocm. - The HIP copy paths in cuda_sm_kernel.cu and cuda_device.cu drop the `volatile` qualifier the CUDA path carries (st.volatile.global). Document why that is correct on AMD rather than reintroducing it: host visibility for these H2D/D2H copies comes from the fine-grained-coherent host registration plus the per-transfer hipStreamSynchronize (GPU caches flush at kernel completion), not from the qualifier. On AMD `volatile` only forces an L1 bypass (glc, GPU-L2 scope) -- neither necessary nor sufficient for host visibility; system-scope ordering, if it were ever needed, is __threadfence_system. Comments and a warning string only; no generated code changes. Authored with assistance from Claude.
The three ROCm CMake arms set CMAKE_HIP_ARCHITECTURES to gfx90a before enable_language(HIP) whenever it was unset, which preempted CMake's own host GPU auto-detection (via rocm_agent_enumerator). A user on a non-gfx90a AMD GPU (e.g. gfx1100) who did not pass -DCMAKE_HIP_ARCHITECTURES would silently build gfx90a code objects that fail to load on their card. Let enable_language(HIP) auto-detect the host architecture, dedup the result (it can list one entry per agent), and fall back to gfx90a only when nothing is detected (e.g. a CPU-only build host). Passing -DCMAKE_HIP_ARCHITECTURES still overrides, so explicit-arch builds are unchanged. Authored with assistance from Claude. Test Plan: configured the ROCm store build on gfx90a without -DCMAKE_HIP_ARCHITECTURES; auto-detection resolves to gfx90a and the kernels compile with --offload-arch=gfx90a (single, deduped). Explicit -DCMAKE_HIP_ARCHITECTURES=gfx90a is byte-identical to before.
Follow-up to b697824. enable_language(HIP) already honors an explicit -DCMAKE_HIP_ARCHITECTURES, otherwise auto-detects the host GPU(s) via rocm_agent_enumerator, and errors if none is found. So the explicit gfx90a-on-no-GPU fallback was dead code (enable_language has already errored by then), and deduping its detection output is unnecessary. Drop both; the no-GPU build host now gets CMake's clear "set the architecture" error instead of a silently-wrong gfx90a default. Authored with assistance from Claude.
This adds a new RUNTIME_ENVIRONMENT=rocm device backend (PLATFORM=rocm in
setup.py) alongside the existing cuda/ascend/maca/musa/simu backends, so the
KV block-transfer kernels, the H2D/D2H store path, and the sparse Hamming
scoring kernel build and run on AMD GPUs via HIP. The backend is purely
additive: the existing backends are not modified. We have made every effort
to leave the NVIDIA build unchanged -- every source change to a shared file is
behind a
__CUDA_ARCH__orUSE_ROCMguard that the CUDA build does notcompile, and the compat shim is only placed on the include path for the rocm
build, never the cuda one. Select it with
export PLATFORM=rocm(or-DRUNTIME_ENVIRONMENT=rocmwhen invoking CMake directly).Review order: start with ucm/shared/vendor/hip_compat/ (the compat shim), then
the three new rocm/CMakeLists.txt arms (trans, store, sparse ham_dist), then
the two guarded kernel sources, then the docs and Windows host-build guards.
The compat shim resolves the project's
#include <cuda_runtime.h>/<cuda.h>to <hip/hip_runtime.h> and aliases the small cuda* runtime surface in use
(Malloc/Free/Memcpy[Async]/HostMalloc/HostRegister/Stream*/Event*) to hip*.
Each rocm CMake arm calls enable_language(HIP)/find_package(hip), reuses the
existing cuda .cc/.cu sources marked LANGUAGE HIP, and reads
CMAKE_HIP_ARCHITECTURES (defaulting to gfx90a only when unset) so other AMD
targets need no source edit.
Root cause of the one non-mechanical change: the two grid-stride copy kernels
used inline PTX (ld.global.cs / st.volatile.global vectorized loads/stores)
that does not exist on AMD. The PTX is now guarded by
#if defined(__CUDA_ARCH__)with a HIP branch doing a plain vectorized uint4load/store (32-byte and 16-byte units). ROCm 7.2.1 does not provide the
__ldcs/__stcg/__stcs cache-streaming builtins, and those PTX qualifiers are
cache-policy hints rather than visibility semantics for this memcpy (each
thread writes a disjoint unit and the only consumer is the host after a stream
sync), so the plain copy is functionally equivalent. FlashInfer's cp_async.cuh
already selects its portable non-PTX fallback under hipcc, so it needed no
change.
The sparse Hamming module links libtorch. operator.h now includes
<ATen/hip/HIPContext.h> under USE_ROCM (the cuda-spelled context header pulls
in NVIDIA-only cuda_runtime_api.h/cusparse.h, while the hipified header exposes
the same c10::cuda::getCurrentCUDAStream backed by HIP). It builds at C++20
(torch 2.x headers use requires-clauses), without -ffast-math (the kernel uses
INFINITY as a masking sentinel that finite-math would drop), and with
pybind11_add_module(NO_EXTRAS) to avoid pybind's default LTO+strip dropping the
module init symbol under -fvisibility=hidden.
A set of WIN32-guarded host-build fixes let the backend also compile with the
clang-cl toolchain on Windows ROCm (Linux-only compiler/linker flags guarded,
three header-only infra sub-libraries changed from OBJECT to INTERFACE so the
linker language is determinable under Ninja+clang-cl, a getpid shim, metrics
symbol export, and excluding a POSIX-only thread test). The GPU device code is
unchanged by those guards.
Docs: the supported-platform matrix gains a ROCm/AMD row, and the vLLM and
SGLang quickstarts document
PLATFORM=rocmbesidePLATFORM=cuda.This work was authored with the assistance of Claude, an AI assistant.
Test Plan:
Built and validated on real GPUs: Linux gfx90a (MI250X) and gfx1100 (Radeon
Pro W7800), and Windows gfx1201 (RX 9070 XT), gfx1101 (Radeon PRO V710), and
gfx1151 (Radeon 8060S), all on ROCm. All GPU tests run serially with one GPU
visible (HIP_VISIBLE_DEVICES=0).
Store/trans C++ surface plus unit tests:
79/80 pass on Linux. The copy-kernel correctness gates all pass:
UCTransUnitTest.{CopyDataWithCE,CopyDataWithSM,CopyDataBatchWithSM} (byte-exact
host<->device round-trip) and the UCPosixTrans*/UCCacheTransBuffer cases
(store H2D/D2H batch copy with readback). The one failure,
UCMetricsUT.ConcurrentUpdateAndCollect, is a pre-existing CPU-only
multi-threaded metrics counter test (untouched ucm/shared/metrics), unrelated
to the GPU backend.
Hamming kernel (against a ROCm PyTorch):
The new test computes an independent CPU popcount reference for the paged
block-mode score: mla (no kv reduction) matches within fp16 rounding, gqa
(min over kv heads) matches exactly, two-run output is bit-identical.
The NVIDIA path was checked to be unaffected: with PLATFORM=cuda the guards
select the original inline-PTX branch (compile-checked with nvcc 12.8 at
sm_80; the emitted PTX still contains the ld.global.cs / st.volatile.global
streaming ops), so the CUDA build compiles the same device code as before.