
Commit f78c02e

voxtral_tts: README — note CUDA 12.9, WSL2 libcuda gotcha, multi-arch HF artifacts
1 parent edaba97 · commit f78c02e

1 file changed: examples/models/voxtral_tts/README.md
Lines changed: 22 additions & 2 deletions
@@ -31,7 +31,8 @@ The model has three components:
 huggingface-cli download mistralai/Voxtral-4B-TTS-2603 \
   --local-dir ~/models/Voxtral-4B-TTS-2603
 ```
-- For CUDA: NVIDIA GPU with CUDA 12.8 toolkit (tested on A100 80GB).
+- For CUDA: NVIDIA GPU with CUDA 12.8 or 12.9 toolkit (tested on A100 80GB
+  / sm_80 and RTX 5080 / sm_120).
   Note: CUDA 13 is not supported (CUB 3.0 incompatibility in
   `backends/cuda/runtime/shims/sort.cu`).
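
The updated requirement pins the toolkit to 12.8 or 12.9. A quick sanity check, assuming `nvcc` from the CUDA toolkit is on `PATH`:

```bash
# Should report "release 12.8" or "release 12.9"; CUDA 13 is not supported.
nvcc --version | grep -i "release"

# Confirm the driver sees the GPU and its compute capability (e.g. 8.0 or 12.0).
nvidia-smi --query-gpu=name,compute_cap --format=csv
```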

@@ -186,6 +187,10 @@ directory).
 - **`aoti_cuda_backend` target not found at link time**: the parent
   ExecuTorch was built without CUDA. Use `make voxtral_tts-cuda` (which
   builds with `EXECUTORCH_BUILD_CUDA=ON`) instead of running cmake by hand.
+- **`cannot find -lcuda` during `pip install -e .` or export (WSL2)**: the
+  CUDA toolkit doesn't ship `libcuda.so` — on WSL2 the driver lib lives at
+  `/usr/lib/wsl/lib/`. Prepend it (or `/usr/local/cuda/lib64/stubs`) to
+  `LIBRARY_PATH` before invoking pip / the export script.
 - **First call takes ~30–50 s**: Triton autotunes the LM matmul kernels on
   first run, then caches per-process. The runner's `warmup()` amortizes
   this so the first user-visible synth pays the cost once.
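
A minimal sketch of the `LIBRARY_PATH` workaround described in the new WSL2 bullet (paths assume a stock WSL2 + CUDA toolkit install):

```bash
# WSL2 keeps the driver's libcuda.so under /usr/lib/wsl/lib, not in the toolkit tree;
# the toolkit's linker stub is a fallback if you only need to satisfy -lcuda at link time.
export LIBRARY_PATH="/usr/lib/wsl/lib:/usr/local/cuda/lib64/stubs:${LIBRARY_PATH}"

pip install -e .   # or invoke the export script the same way
```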
@@ -197,5 +202,20 @@ directory).
 ## Pre-exported artifacts
 
 For users who want to skip the export step, ready-to-run CUDA artifacts
-are available on the HuggingFace hub:
+are available on the HuggingFace hub at
 [`younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA`](https://huggingface.co/younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA).
+
+ExecuTorch's CUDA backend uses AOTInductor, which bakes pre-compiled
+cubins for the export-time GPU's compute capability into `*.ptd`. Cubins
+are not compatible across architectures, so the repo ships per-arch
+subfolders:
+
+| Folder | Compute capability | Example GPUs |
+|---|---|---|
+| `sm80/` | `sm_80` (Ampere) | A100, A30 |
+| `sm120/` | `sm_120` (Blackwell) | RTX 5080, RTX 5090 |
+
+Find your GPU's arch with `nvidia-smi --query-gpu=compute_cap --format=csv`,
+then `hf download ... --include 'sm80/*'` (or `sm120`). If your arch isn't
+shipped, re-export on the target GPU with the command above — the AOTI
+compile step writes cubins for the local arch.
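
A sketch of the per-arch download flow the new section describes, assuming an Ampere (sm_80) GPU and an illustrative local directory:

```bash
# Print the GPU's compute capability, e.g. "8.0" (sm_80) or "12.0" (sm_120).
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# Fetch only the matching subfolder from the hub repo.
hf download younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA \
  --include 'sm80/*' \
  --local-dir ~/models/voxtral-tts-cuda-sm80
```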
