Problem
Today fastsafetensors ships two separate builds — one for CUDA, one for ROCm. This means:
- ROCm users need a separate index URL or a git reference to get the right wheel
- Downstream projects like vLLM have to special-case fastsafetensors in their ROCm packaging
- ROCm ends up as a second-class citizen requiring extra install steps that CUDA users never see
There is no fundamental reason this has to be the case.
Observation
The C++ extension already loads the GPU runtime entirely at runtime via dlopen() — nothing is linked at compile time. The only reason two separate builds exist today is that the symbol names passed to dlsym() differ between CUDA ("cudaMemcpy") and ROCm ("hipMemcpy"), and those strings are currently baked in at compile time.
Proposal
Move the CUDA/ROCm selection from compile time to runtime:
- load_library_functions() tries dlopen("libcudart.so") first, then falls back to dlopen("libamdhip64.so")
- Both sets of symbol name strings are compiled into the binary
- At runtime, whichever library loads successfully determines which symbol names are used
The result is a single universal wheel that works on both CUDA and ROCm systems with no user configuration. One PyPI entry, no extra index URL, no special-casing in downstream projects.
Relationship to PR #67
PR #67 lays the groundwork by moving symbol names into cuda_compat.h as GPU_SYM_* macros. The runtime detection idea builds naturally on top of that — instead of selecting CUDA or ROCm symbol names at compile time via #ifdef, load_library_functions() selects them at runtime based on which library is present.
Impact
- Single wheel on PyPI works for all users
- vLLM and other downstream projects drop ROCm-specific fastsafetensors packaging entirely
- No behavior change for existing CUDA or ROCm users