@@ -31,7 +31,8 @@ The model has three components:
 huggingface-cli download mistralai/Voxtral-4B-TTS-2603 \
   --local-dir ~/models/Voxtral-4B-TTS-2603
 ```
-- For CUDA: NVIDIA GPU with CUDA 12.8 toolkit (tested on A100 80GB).
+- For CUDA: NVIDIA GPU with CUDA 12.8 or 12.9 toolkit (tested on A100 80GB
+  / sm_80 and RTX 5080 / sm_120).
   Note: CUDA 13 is not supported (CUB 3.0 incompatibility in
   `backends/cuda/runtime/shims/sort.cu`).
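+  To check which toolkit is active (assumes `nvcc` is on your PATH):
+
+  ```bash
+  nvcc --version   # look for "release 12.8" or "release 12.9", not 13.x
+  ```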
 
@@ -186,6 +187,10 @@ directory).
 - **`aoti_cuda_backend` target not found at link time**: the parent
   ExecuTorch was built without CUDA. Use `make voxtral_tts-cuda` (which
   builds with `EXECUTORCH_BUILD_CUDA=ON`) instead of running cmake by hand.
+- **`cannot find -lcuda` during `pip install -e .` or export (WSL2)**: the
+  CUDA toolkit doesn't ship `libcuda.so`; on WSL2 the driver lib lives at
+  `/usr/lib/wsl/lib/`. Prepend it (or `/usr/local/cuda/lib64/stubs`) to
+  `LIBRARY_PATH` before invoking pip / the export script.
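+  For example, before running `pip install -e .` (the stubs dir is the
+  fallback if the WSL lib dir is absent):
+
+  ```bash
+  # WSL2 ships the driver's libcuda.so here, outside the CUDA toolkit
+  export LIBRARY_PATH=/usr/lib/wsl/lib:$LIBRARY_PATH
+  # alternative: link against the toolkit stub
+  # export LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LIBRARY_PATH
+  pip install -e .
+  ```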
 - **First call takes ~30–50 s**: Triton autotunes the LM matmul kernels on
   first run, then caches per-process. The runner's `warmup()` amortizes
   this so the first user-visible synth pays the cost once.
@@ -197,5 +202,20 @@ directory).
 ## Pre-exported artifacts
 
 For users who want to skip the export step, ready-to-run CUDA artifacts
-are available on the HuggingFace hub:
+are available on the HuggingFace hub at
 [`younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA`](https://huggingface.co/younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA).
+
+ExecuTorch's CUDA backend uses AOTInductor, which bakes pre-compiled
+cubins for the export-time GPU's compute capability into `*.ptd`. Cubins
+are not compatible across architectures, so the repo ships per-arch
+subfolders:
+
+| Folder | Compute capability | Example GPUs |
+|---|---|---|
+| `sm80/` | `sm_80` (Ampere) | A100, A30 |
+| `sm120/` | `sm_120` (Blackwell) | RTX 5080, RTX 5090 |
+
+Find your GPU's arch with `nvidia-smi --query-gpu=compute_cap --format=csv`,
+then `hf download ... --include 'sm80/*'` (or `sm120`). If your arch isn't
+shipped, re-export on the target GPU with the command above: the AOTI
+compile step writes cubins for the local arch.
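+
+For example, on an A100 (compute capability 8.0; use `sm120/*` on an RTX
+5080/5090, and pick any `--local-dir` you like):
+
+```bash
+nvidia-smi --query-gpu=compute_cap --format=csv,noheader   # "8.0" -> sm80/
+hf download younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA \
+  --include 'sm80/*' --local-dir ~/models/Voxtral-4B-TTS-2603-ExecuTorch-CUDA
+```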