
Commit 5545395

docs(voxtral_realtime): simplify and reorganize README (#18186)
Reduce duplication by showing one export/run example per backend instead of separate offline and streaming blocks. Make streaming the default mode in all examples. Collapse non-default backends (Metal, CUDA, CUDA-Windows) behind details tags in the Export section. Restore the Linux/macOS run example that was accidentally replaced by Windows-only commands in 1b0c10b.
1 parent 1b0c10b commit 5545395

1 file changed

Lines changed: 63 additions & 177 deletions

examples/models/voxtral_realtime/README.md

@@ -14,9 +14,10 @@ The pipeline has two stages: **export** (Python, once) and **inference**
1414
conversion. At inference time, the C++ runner loads both `.pte` files
1515
and the Tekken tokenizer, then transcribes audio to text.
1616

17-
Two modes are supported: **offline** (encode full audio, then decode)
18-
and **streaming** (process 80ms chunks in real time, including live
19-
microphone input).
17+
Two modes are supported: **streaming** (process 80ms chunks in real time,
18+
including live microphone input) and **offline** (encode full audio, then
19+
decode). The examples below use streaming mode. Omit `--streaming` from
20+
export and run commands for offline mode.
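To make the 80ms chunk size concrete, here is a minimal sketch of the arithmetic. It assumes the 16kHz sample rate and float32 PCM format that this README gives for runner input. The helper name `samples_per_chunk` is hypothetical and not part of the runner or the export script.

```python
# Hypothetical helper: size of one streaming chunk.
# Assumes 16 kHz float32 PCM, as described for runner input in this README.
SAMPLE_RATE_HZ = 16_000   # 16 kHz mono audio
CHUNK_MS = 80             # streaming step size
BYTES_PER_SAMPLE = 4      # float32


def samples_per_chunk(sample_rate_hz: int = SAMPLE_RATE_HZ,
                      chunk_ms: int = CHUNK_MS) -> int:
    """Return the number of audio samples covered by one chunk."""
    return sample_rate_hz * chunk_ms // 1000


chunk_samples = samples_per_chunk()
chunk_bytes = chunk_samples * BYTES_PER_SAMPLE
print(chunk_samples, chunk_bytes)  # 1280 5120
```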
2021

2122
## Demo: streaming on Metal backend with microphone input
2223

@@ -37,24 +38,22 @@ https://github.com/user-attachments/assets/6d6089fc-5feb-458b-a60b-08379855976a
3738
## Preprocessor
3839

3940
Export a preprocessor `.pte` to convert raw audio into the format the
40-
model expects. `--max_audio_len 300` supports audio up to 5 minutes
41-
(300 seconds):
41+
model expects:
4242

4343
```bash
4444
python -m executorch.extension.audio.mel_spectrogram \
4545
--feature_size 128 \
46-
--max_audio_len 300 \
46+
--streaming \
4747
--output_file ./voxtral_rt_exports/preprocessor.pte
4848
```
4949

50-
For streaming, use a separate preprocessor with `--streaming` (no audio
51-
length limit):
50+
For offline mode:
5251

5352
```bash
5453
python -m executorch.extension.audio.mel_spectrogram \
5554
--feature_size 128 \
56-
--streaming \
57-
--output_file ./voxtral_streaming_exports/preprocessor.pte
55+
--max_audio_len 300 \
56+
--output_file ./voxtral_rt_exports/preprocessor.pte
5857
```
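Since `--max_audio_len 300` caps offline input at 300 seconds, it may help to check a file's duration before running. The sketch below uses only Python's standard `wave` module and assumes a PCM WAV file. Both function names are hypothetical and not part of ExecuTorch.

```python
import wave

MAX_AUDIO_LEN_S = 300  # value passed to --max_audio_len in the command above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def fits_offline_preprocessor(path: str, max_len_s: float = MAX_AUDIO_LEN_S) -> bool:
    """True if the file is no longer than the offline preprocessor limit."""
    return wav_duration_seconds(path) <= max_len_s
```

For longer recordings, streaming mode is an option, since the streaming preprocessor is exported without `--max_audio_len`.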
5958

6059
## Export
@@ -65,19 +64,7 @@ and token embedding.
6564
> [!TIP]
6665
> Mistral has already published pre-exported `.pte` files for select backends, including macOS Metal, on their [HuggingFace Hub](https://huggingface.co/mistral-labs/Voxtral-Mini-4B-Realtime-2602-Executorch).
6766
68-
69-
```bash
70-
python export_voxtral_rt.py \
71-
--model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
72-
--backend xnnpack \
73-
--output-dir ./voxtral_rt_exports \
74-
--qlinear-encoder 8da4w \
75-
--qlinear 8da4w \
76-
--qembedding 8w
77-
```
78-
79-
For streaming, add `--streaming` to export the encoder for incremental
80-
processing (80ms audio chunks):
67+
### XNNPACK (default)
8168

8269
```bash
8370
python export_voxtral_rt.py \
@@ -90,65 +77,8 @@ python export_voxtral_rt.py \
9077
--qembedding 8w
9178
```
9279

93-
### Backend support
94-
95-
| Backend | Offline | Streaming | Quantization |
96-
|---------|---------|-----------|--------------|
97-
| `xnnpack` ||| `4w`, `8w`, `8da4w`, `8da8w` |
98-
| `metal` ||| none (fp32) or `fpa4w` (Metal-specific 4-bit) |
99-
| `cuda` ||| `4w`, `8w` |
100-
| `cuda-windows` ||| `4w`, `8w` |
101-
102-
Metal backend provides Apple GPU acceleration. CUDA backend provides NVIDIA GPU
103-
acceleration via AOTInductor.
104-
105-
#### CUDA export examples
106-
107-
Offline with int4 quantization:
108-
109-
```bash
110-
python export_voxtral_rt.py \
111-
--model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
112-
--backend cuda \
113-
--dtype bf16 \
114-
--output-dir ./voxtral_rt_exports \
115-
--qlinear-encoder 4w \
116-
--qlinear-encoder-packing-format tile_packed_to_4d \
117-
--qlinear 4w \
118-
--qlinear-packing-format tile_packed_to_4d \
119-
--qembedding 8w
120-
```
121-
122-
Streaming with int4 quantization:
123-
124-
```bash
125-
python export_voxtral_rt.py \
126-
--model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
127-
--backend cuda \
128-
--dtype bf16 \
129-
--streaming \
130-
--output-dir ./voxtral_rt_exports \
131-
--qlinear-encoder 4w \
132-
--qlinear-encoder-packing-format tile_packed_to_4d \
133-
--qlinear 4w \
134-
--qlinear-packing-format tile_packed_to_4d \
135-
--qembedding 8w
136-
```
137-
138-
#### Metal export examples
139-
140-
Offline:
141-
142-
```bash
143-
python export_voxtral_rt.py \
144-
--model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
145-
--backend metal \
146-
--output-dir ./voxtral_rt_exports \
147-
--qlinear-encoder fpa4w \
148-
--qlinear fpa4w
149-
```
150-
151-
Streaming:
80+
<details>
81+
<summary><strong>Metal</strong></summary>
15282

15383
```bash
15484
python export_voxtral_rt.py \
@@ -160,38 +90,27 @@ python export_voxtral_rt.py \
16090
--qlinear fpa4w
16191
```
16292

163-
**Note:** Metal 4-bit quantization requires torchao built with experimental MPS (Metal) ops.
93+
Metal 4-bit quantization (`fpa4w`) requires torchao built with experimental MPS ops:
16494

165-
You can install torchao with Metal support from the `ao` repo (in third-party/ao/)
16695
```bash
96+
# From the ao repo (third-party/ao/)
16797
USE_CPP=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 pip install . --no-build-isolation
168-
```
16998

170-
Alternatively, you can build torchao with Metal support while installing ExecuTorch from source
171-
```bash
99+
# Or while installing ExecuTorch from source
172100
EXECUTORCH_BUILD_KERNELS_TORCHAO=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 ./install_executorch.sh
173101
```
174102

175-
### CUDA-Windows Export
103+
</details>
176104

177-
Before running `cuda-windows` export, make sure these requirements are set up:
178-
- `x86_64-w64-mingw32-g++` is installed and on `PATH` (mingw-w64 cross-compiler).
179-
- `WINDOWS_CUDA_HOME` points to the extracted Windows CUDA package directory.
180-
181-
Example setup on Ubuntu (refer to [Parakeet README](../parakeet/README.md) for detailed extraction steps):
182-
183-
```bash
184-
# Ensure the WINDOWS_CUDA_HOME environment variable is set
185-
export WINDOWS_CUDA_HOME=/opt/cuda-windows/extracted/cuda_cudart/cudart
186-
```
187-
188-
Export the model for Windows CUDA (example with int4 quantization):
105+
<details>
106+
<summary><strong>CUDA</strong></summary>
189107

190108
```bash
191109
python export_voxtral_rt.py \
192110
--model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
193-
--backend cuda-windows \
111+
--backend cuda \
194112
--dtype bf16 \
113+
--streaming \
195114
--output-dir ./voxtral_rt_exports \
196115
--qlinear-encoder 4w \
197116
--qlinear-encoder-packing-format tile_packed_to_4d \
@@ -200,9 +119,18 @@ python export_voxtral_rt.py \
200119
--qembedding 8w
201120
```
202121

203-
For streaming, add `--streaming`:
122+
</details>
123+
124+
<details>
125+
<summary><strong>CUDA-Windows</strong></summary>
126+
127+
Requires `x86_64-w64-mingw32-g++` on `PATH` (mingw-w64 cross-compiler) and
128+
`WINDOWS_CUDA_HOME` pointing to the extracted Windows CUDA package directory.
129+
See [Parakeet README](../parakeet/README.md#cuda-windows-export) for detailed extraction steps.
204130

205131
```bash
132+
export WINDOWS_CUDA_HOME=/opt/cuda-windows/extracted/cuda_cudart/cudart
133+
206134
python export_voxtral_rt.py \
207135
--model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
208136
--backend cuda-windows \
@@ -216,16 +144,18 @@ python export_voxtral_rt.py \
216144
--qembedding 8w
217145
```
218146

219-
Both offline and streaming exports generate:
220-
- `model.pte`
221-
- `aoti_cuda_blob.ptd`
147+
</details>
148+
149+
> [!NOTE]
150+
> Omit `--streaming` from any export command above for offline mode.
151+
> CUDA and CUDA-Windows exports also produce an `aoti_cuda_blob.ptd` file alongside `model.pte`.
222152
223153
### Options
224154

225155
| Flag | Default | Description |
226156
|------|---------|-------------|
227157
| `--model-path` | (required) | Directory with `params.json` + `consolidated.safetensors` |
228-
| `--backend` | `xnnpack` | `xnnpack`, `metal`, `cuda`, or `portable` |
158+
| `--backend` | `xnnpack` | `xnnpack`, `metal`, `cuda`, `cuda-windows`, or `portable` |
229159
| `--dtype` | `fp32` | Model dtype: `fp32` or `bf16` |
230160
| `--output-dir` | `./voxtral_rt_exports` | Output directory |
231161
| `--max-seq-len` | `4096` | KV cache length |
@@ -250,36 +180,21 @@ ExecuTorch must be installed from source first (see
250180
[Prerequisites](#prerequisites)). The `make` targets below handle
251181
building core libraries and the runner binary.
252182

253-
### XNNPACK (CPU)
254-
255183
```bash
256-
make voxtral_realtime-cpu
184+
make voxtral_realtime-cpu # XNNPACK (CPU)
185+
make voxtral_realtime-metal # Metal (Apple GPU)
186+
make voxtral_realtime-cuda # CUDA (NVIDIA GPU)
257187
```
258188

259-
This builds ExecuTorch core libraries with XNNPACK, then the runner binary
260-
at `cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner`.
261-
262-
### CUDA (NVIDIA GPU)
263-
264-
```bash
265-
make voxtral_realtime-cuda
266-
```
267-
268-
This builds ExecuTorch with CUDA backend support. The runner binary is at
269-
the same path as above. Requires NVIDIA GPU with CUDA toolkit installed.
270-
271-
### Metal (Apple GPU)
272-
273-
```bash
274-
make voxtral_realtime-metal
275-
```
276-
277-
This builds ExecuTorch with Metal backend support. The runner binary is at
278-
the same path as above. Metal exports can only run on macOS with Apple Silicon.
189+
All targets produce the runner binary at
190+
`cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner`.
279191

280192
### CUDA-Windows
281193

282-
On Windows (PowerShell), use CMake workflow presets directly from the executorch root directory. Note that if you exported the model with 4-bit quantization, you may need to specify your GPU's compute capability (e.g., `80;86;89;90;120` for Ampere, Lovelace, Hopper, and Blackwell) to avoid "invalid device function" errors at runtime, as the `int4mm` kernels require SM 80 or newer.
194+
On Windows (PowerShell), use CMake workflow presets from the executorch root
195+
directory. If you exported with 4-bit quantization, specify your GPU's compute
196+
capability to avoid "invalid device function" errors (the `int4mm` kernels
197+
require SM 80+).
283198

284199
```powershell
285200
$env:CMAKE_CUDA_ARCHITECTURES="80;86;89;90;120"
@@ -296,43 +211,24 @@ The runner requires:
296211
- `tekken.json` — tokenizer from the model weights directory
297212
- `preprocessor.pte` — mel spectrogram preprocessor (see [Preprocessor](#preprocessor))
298213
- A 16kHz mono WAV audio file (or live audio via `--mic`)
299-
- For CUDA: `aoti_cuda_blob.ptd` — delegate data file (pass via `--data_path`)
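The audio requirement in the list above (16kHz, mono, WAV) can be checked before invoking the runner. This is a minimal sketch using Python's standard `wave` module, which reads PCM WAV files. The function name `is_16khz_mono` is hypothetical and not part of the runner.

```python
import wave


def is_16khz_mono(path: str) -> bool:
    """True if the WAV file is single-channel and sampled at 16 kHz."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16_000 and wav.getnchannels() == 1
```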
300214

301-
### Windows (PowerShell)
215+
### Basic usage
302216

303-
```powershell
304-
.\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe `
305-
--model_path voxtral_rt_exports\model.pte `
306-
--tokenizer_path C:\path\to\tekken.json `
307-
--preprocessor_path voxtral_rt_exports\preprocessor.pte `
308-
--audio_path input.wav
309-
```
310-
311-
For CUDA, include the `.ptd` data file:
312-
313-
```powershell
314-
.\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe `
315-
--model_path voxtral_rt_exports\model.pte `
316-
--data_path voxtral_rt_exports\aoti_cuda_blob.ptd `
317-
--tokenizer_path C:\path\to\tekken.json `
318-
--preprocessor_path voxtral_rt_exports\preprocessor.pte `
319-
--audio_path input.wav
217+
```bash
218+
cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner \
219+
--model_path voxtral_rt_exports/model.pte \
220+
--tokenizer_path ~/models/Voxtral-Mini-4B-Realtime-2602/tekken.json \
221+
--preprocessor_path voxtral_rt_exports/preprocessor.pte \
222+
--audio_path input.wav \
223+
--streaming
320224
```
321225

322-
For streaming, add `--streaming`. This requires a model exported with
323-
`--streaming`. The runner processes audio in 80ms steps, computing mel
324-
and running the encoder+decoder incrementally.
226+
Omit `--streaming` for offline transcription (requires an offline-exported
227+
model and offline preprocessor).
325228

326-
```powershell
327-
.\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe `
328-
--model_path voxtral_rt_exports\model.pte `
329-
--tokenizer_path C:\path\to\tekken.json `
330-
--preprocessor_path voxtral_rt_exports\preprocessor.pte `
331-
--audio_path input.wav `
332-
--streaming
333-
```
229+
For CUDA backends (Linux and Windows), add `--data_path voxtral_rt_exports/aoti_cuda_blob.ptd`.
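If you script the runner, the optional flags can be assembled in one place. The sketch below builds the argument list from the flags shown in this section. `build_runner_cmd` is a hypothetical helper, and the default binary path assumes the Linux/macOS build location given earlier.

```python
from typing import List, Optional

# Assumed Linux/macOS location of the runner binary from the Build section.
RUNNER = "cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner"


def build_runner_cmd(model_path: str,
                     tokenizer_path: str,
                     preprocessor_path: str,
                     audio_path: str,
                     streaming: bool = True,
                     data_path: Optional[str] = None) -> List[str]:
    """Assemble the runner argument list; data_path is only needed for CUDA."""
    cmd = [
        RUNNER,
        "--model_path", model_path,
        "--tokenizer_path", tokenizer_path,
        "--preprocessor_path", preprocessor_path,
        "--audio_path", audio_path,
    ]
    if streaming:
        cmd.append("--streaming")
    if data_path is not None:
        cmd += ["--data_path", data_path]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing `streaming=False` drops the flag for offline transcription.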
334230

335-
For CUDA streaming, include the `.ptd` data file:
231+
**Windows (PowerShell):**
336232

337233
```powershell
338234
.\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe `
@@ -344,9 +240,11 @@ For CUDA streaming, include the `.ptd` data file:
344240
--streaming
345241
```
346242

347-
For live microphone input, use `--mic` to read raw 16kHz float32 PCM from
348-
stdin. This requires a model exported with `--streaming` and a streaming
349-
preprocessor. Pipe from any audio capture tool:
243+
### Live microphone input
244+
245+
Use `--mic` to read raw 16kHz float32 PCM from stdin. Requires a
246+
streaming-exported model and streaming preprocessor. Pipe from any audio
247+
capture tool:
350248

351249
```bash
352250
# macOS
@@ -360,19 +258,7 @@ ffmpeg -f avfoundation -i ":0" -ar 16000 -ac 1 -f f32le -nostats -loglevel error
360258

361259
Ctrl+C stops recording and flushes remaining text.
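If your capture tool produces 16-bit integer samples instead of float32, they need converting before being piped to `--mic`. The sketch below shows one way to do that with Python's standard `struct` module. It assumes little-endian output, matching the `f32le` format in the ffmpeg command above. `int16_to_f32le` is a hypothetical helper, not part of the runner.

```python
import struct
from typing import Sequence


def int16_to_f32le(samples: Sequence[int]) -> bytes:
    """Convert 16-bit integer samples to raw little-endian float32 bytes.

    Each sample is scaled by 1/32768 so the result lies in [-1.0, 1.0).
    """
    return struct.pack(f"<{len(samples)}f", *(s / 32768.0 for s in samples))
```

Writing the returned bytes to `sys.stdout.buffer` lets a small script sit between a capture tool and the runner's stdin.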
362260

363-
**Windows (PowerShell):**
364-
365-
```powershell
366-
.\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe `
367-
--model_path C:\path\to\voxtral_rt_exports\model.pte `
368-
--data_path C:\path\to\voxtral_rt_exports\aoti_cuda_blob.ptd `
369-
--tokenizer_path C:\path\to\tekken.json `
370-
--preprocessor_path C:\path\to\voxtral_rt_exports\preprocessor.pte `
371-
--audio_path C:\path\to\input.wav
372-
```
373-
374-
**CUDA:** Add `--data_path voxtral_rt_exports/aoti_cuda_blob.ptd` to all
375-
run commands above when using the CUDA backend.
261+
### Options
376262

377263
| Flag | Default | Description |
378264
|------|---------|-------------|
