Problem
Today fastsafetensors ships two separate builds — one for CUDA, one for ROCm. This means:
- ROCm users need a separate index URL or a git reference to get the right wheel
- Downstream projects like vLLM have to special-case fastsafetensors in their ROCm packaging
- ROCm ends up as a second-class citizen requiring extra install steps that CUDA users never see
There is no fundamental reason this has to be the case.
Observation
The C++ extension already loads the GPU runtime entirely at runtime via dlopen() — nothing is linked at compile time. The only reason two separate builds exist today is that the symbol names passed to dlsym() differ between CUDA ("cudaMemcpy") and ROCm ("hipMemcpy"), and those strings are currently baked in at compile time.
Proposal
Move the CUDA/ROCm selection from compile time to runtime:
- load_library_functions() tries dlopen("libcudart.so") first, then falls back to dlopen("libamdhip64.so")
- Both sets of symbol name strings are compiled into the binary
- At runtime, whichever library loads successfully determines which symbol names are used
The result is a single universal wheel that works on both CUDA and ROCm systems with no user configuration. One PyPI entry, no extra index URL, no special-casing in downstream projects.
Relationship to PR #67
PR #67 lays the groundwork by moving symbol names into cuda_compat.h as GPU_SYM_* macros. The runtime detection idea builds naturally on top of that — instead of selecting CUDA or ROCm symbol names at compile time via #ifdef, load_library_functions() selects them at runtime based on which library is present.
Impact
- Single wheel on PyPI works for all users
- vLLM and other downstream projects drop ROCm-specific fastsafetensors packaging entirely
- No behavior change for existing CUDA or ROCm users