Commit 0a4756a

voxtral_tts: highlight streaming RTF 0.31x (3x real-time) in README intro and new Streaming section
1 parent 37e3972 commit 0a4756a

1 file changed

Lines changed: 35 additions & 1 deletion

examples/models/voxtral_tts/README.md

@@ -4,7 +4,7 @@ Self-contained ExecuTorch implementation of
 [Voxtral-4B-TTS-2603](https://huggingface.co/mistralai/Voxtral-4B-TTS-2603),
 a ~4B parameter text-to-speech model that produces 24 kHz mono audio from
 text. Weights are loaded directly from the HuggingFace safetensors
-checkpoint. Supports CPU (portable + XNNPACK) and CUDA backends.
+checkpoint. Supports CPU (portable + XNNPACK) and CUDA backends. With `--streaming`, the CUDA 4w export runs at **RTF 0.31x on RTX 5080 (about 3.2× real-time)** with ~2.6 s time-to-first-audio.
 
 ## Overview
 
@@ -87,6 +87,40 @@ Validated on A100, `seed=42`, `"Hello, how are you today?"`:
 | `--backend cuda` | 15.8 GB | 11.5 s | 178 s | 51x | FP32 weights, codec on portable CPU |
 | **`--backend cuda --qlinear 4w`** | **3.4 GB** | **2.1 s** | **3.7 s** | **0.88x**| int4 weights, codec on CUDA |
 
+
+### Streaming
+
+`--streaming` emits codec chunks as they are decoded rather than batching the
+full audio at the end. The first chunk carries only ~0.4 s of audio, so it is
+ready shortly after prefill; 2 s chunks then follow continuously. This decouples
+time-to-first-audio from total synthesis length and enables live piped playback.
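
To make the chunking described above concrete, here is a minimal sketch of the sample counts it implies at 24 kHz. It reads the ~0.4 s figure as the length of the first chunk, which is an assumption, and `chunk_schedule` is a hypothetical helper, not part of the runner. How the runner actually sizes its final chunk is not stated in the README.

```python
SAMPLE_RATE = 24_000  # Hz, mono, as stated in the README intro


def chunk_schedule(total_s, first_s=0.4, rest_s=2.0, rate=SAMPLE_RATE):
    """Return per-chunk sample counts: one short chunk, then fixed-size chunks.

    Hypothetical illustration only; the last chunk simply holds the remainder.
    """
    total = round(total_s * rate)
    sizes = [min(round(first_s * rate), total)]
    remaining = total - sizes[0]
    step = round(rest_s * rate)
    while remaining > 0:
        sizes.append(min(step, remaining))
        remaining -= sizes[-1]
    return sizes


# 10.3 s of audio, the duration reported for the 24-token prompt
print(chunk_schedule(10.3))  # [9600, 48000, 48000, 48000, 48000, 45600]
```

Under these assumptions a listener would wait for only 9,600 samples to be decoded before playback can begin, instead of all 247,200.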
+
+Measured on RTX 5080 (sm_120, warm Triton autotune cache):
+
+| Prompt | Audio duration | Wall clock | **RTF** | Time-to-first-audio |
+|---|---|---|---|---|
+| 24 tokens | 10.3 s | 3.85 s | **0.31x** ⚡⚡ (~3.2× real-time) | ~2.6 s |
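
For readers unfamiliar with the metric, the sketch below uses the conventional definition of real-time factor (processing time divided by audio duration) and shows how 0.31x maps to the "~3.2× real-time" figure. The helper names are illustrative. The README does not say which interval the 0.31x value was measured over; dividing the wall-clock column by the audio duration gives a somewhat larger ratio, so the two columns may cover different intervals.

```python
def rtf(processing_s, audio_s):
    """Real-time factor: seconds of compute per second of audio produced."""
    return processing_s / audio_s


def speedup(rtf_value):
    """How many times faster than playback speed a given RTF is."""
    return 1.0 / rtf_value


print(f"{speedup(0.31):.2f}x real-time")    # 3.23x real-time
print(f"{rtf(3.85, 10.3):.2f}")             # 0.37 (wall clock / audio duration)
```

An RTF below 1.0 means audio is produced faster than it plays back, which is the condition needed for gap-free streaming.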
+
+Live playback (pipe raw f32le PCM to `ffplay` or `aplay`):
+
+```bash
+unset CPATH
+export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
+
+cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \
+  --model voxtral_tts_exports_cuda_4w/model.pte \
+  --data_path voxtral_tts_exports_cuda_4w/aoti_cuda_blob.ptd \
+  --codec voxtral_tts_exports_cuda_4w/codec_decoder.pte \
+  --codec_data_path voxtral_tts_exports_cuda_4w/codec_aoti_cuda_blob.ptd \
+  --tokenizer ~/models/Voxtral-4B-TTS-2603/tekken.json \
+  --voice ~/models/Voxtral-4B-TTS-2603/voice_embedding/neutral_female.pt \
+  --text "Hello, how are you today?" \
+  --streaming --speaker \
+  | ffplay -f f32le -ar 24000 -ac 1 -nodisp -autoexit -
+```
+
+Or `aplay`: replace `| ffplay ...` with `| aplay -f FLOAT_LE -r 24000 -c 1`.
123+
90124
### XNNPACK quantization configs
91125

92126
| Config | Scope | model.pte | RTF (long prompt) |
