@@ -14,9 +14,10 @@ The pipeline has two stages: **export** (Python, once) and **inference**
1414conversion. At inference time, the C++ runner loads both ` .pte ` files
1515and the Tekken tokenizer, then transcribes audio to text.
1616
17- Two modes are supported: **offline** (encode full audio, then decode)
18- and **streaming** (process 80ms chunks in real time, including live
19- microphone input).
17+ Two modes are supported: **streaming** (process 80ms chunks in real time,
18+ including live microphone input) and **offline** (encode full audio, then
19+ decode). The examples below use streaming mode. Omit `--streaming` from
20+ export and run commands for offline mode.
2021
2122## Demo: streaming on Metal backend with microphone input
2223
@@ -37,24 +38,22 @@ https://github.com/user-attachments/assets/6d6089fc-5feb-458b-a60b-08379855976a
3738## Preprocessor
3839
3940Export a preprocessor ` .pte ` to convert raw audio into the format the
40- model expects. ` --max_audio_len 300 ` supports audio up to 5 minutes
41- (300 seconds):
41+ model expects:
4242
4343``` bash
4444python -m executorch.extension.audio.mel_spectrogram \
4545 --feature_size 128 \
46- --max_audio_len 300 \
46+ --streaming \
4747 --output_file ./voxtral_rt_exports/preprocessor.pte
4848```
4949
50- For streaming, use a separate preprocessor with ` --streaming ` (no audio
51- length limit):
50+ For offline mode:
5251
5352``` bash
5453python -m executorch.extension.audio.mel_spectrogram \
5554 --feature_size 128 \
56- --streaming \
57- --output_file ./voxtral_streaming_exports/preprocessor.pte
55+ --max_audio_len 300 \
56+ --output_file ./voxtral_rt_exports/preprocessor.pte
5857```
5958
6059## Export
@@ -65,19 +64,7 @@ and token embedding.
6564> [!TIP]
6665> Mistral has already published pre-exported `.pte` files for select backends, including macOS Metal, on their [HuggingFace Hub](https://huggingface.co/mistral-labs/Voxtral-Mini-4B-Realtime-2602-Executorch).
6766
68-
69- ``` bash
70- python export_voxtral_rt.py \
71- --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
72- --backend xnnpack \
73- --output-dir ./voxtral_rt_exports \
74- --qlinear-encoder 8da4w \
75- --qlinear 8da4w \
76- --qembedding 8w
77- ```
78-
79- For streaming, add ` --streaming ` to export the encoder for incremental
80- processing (80ms audio chunks):
67+ ### XNNPACK (default)
8168
8269``` bash
8370python export_voxtral_rt.py \
@@ -90,65 +77,8 @@ python export_voxtral_rt.py \
9077 --qembedding 8w
9178```
9279
93- ### Backend support
94-
95- | Backend | Offline | Streaming | Quantization |
96- | ---------| ---------| -----------| --------------|
97- | ` xnnpack ` | ✓ | ✓ | ` 4w ` , ` 8w ` , ` 8da4w ` , ` 8da8w ` |
98- | ` metal ` | ✓ | ✓ | none (fp32) or ` fpa4w ` (Metal-specific 4-bit) |
99- | ` cuda ` | ✓ | ✓ | ` 4w ` , ` 8w ` |
100- | ` cuda-windows ` | ✓ | ✓ | ` 4w ` , ` 8w ` |
101-
102- Metal backend provides Apple GPU acceleration. CUDA backend provides NVIDIA GPU
103- acceleration via AOTInductor.
104-
105- #### CUDA export examples
106-
107- Offline with int4 quantization:
108-
109- ``` bash
110- python export_voxtral_rt.py \
111- --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
112- --backend cuda \
113- --dtype bf16 \
114- --output-dir ./voxtral_rt_exports \
115- --qlinear-encoder 4w \
116- --qlinear-encoder-packing-format tile_packed_to_4d \
117- --qlinear 4w \
118- --qlinear-packing-format tile_packed_to_4d \
119- --qembedding 8w
120- ```
121-
122- Streaming with int4 quantization:
123-
124- ``` bash
125- python export_voxtral_rt.py \
126- --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
127- --backend cuda \
128- --dtype bf16 \
129- --streaming \
130- --output-dir ./voxtral_rt_exports \
131- --qlinear-encoder 4w \
132- --qlinear-encoder-packing-format tile_packed_to_4d \
133- --qlinear 4w \
134- --qlinear-packing-format tile_packed_to_4d \
135- --qembedding 8w
136- ```
137-
138- #### Metal export examples
139-
140- Offline:
141-
142- ``` bash
143- python export_voxtral_rt.py \
144- --model-path ~ /models/Voxtral-Mini-4B-Realtime-2602 \
145- --backend metal \
146- --output-dir ./voxtral_rt_exports \
147- --qlinear-encoder fpa4w \
148- --qlinear fpa4w
149- ```
150-
151- Streaming:
80+ <details>
81+ <summary><strong>Metal</strong></summary>
15282
15383``` bash
15484python export_voxtral_rt.py \
@@ -160,38 +90,27 @@ python export_voxtral_rt.py \
16090 --qlinear fpa4w
16191```
16292
163- **Note:** Metal 4-bit quantization requires torchao built with experimental MPS (Metal) ops.
93+ Metal 4-bit quantization (`fpa4w`) requires torchao built with experimental MPS ops:
16494
165- You can install torchao with Metal support from the ` ao ` repo (in third-party/ao/)
16695``` bash
96+ # From the ao repo (third-party/ao/)
16797USE_CPP=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 pip install . --no-build-isolation
168- ```
16998
170- Alternatively, you can build torchao with Metal support while installing ExecuTorch from source
171- ``` bash
99+ # Or while installing ExecuTorch from source
172100EXECUTORCH_BUILD_KERNELS_TORCHAO=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 ./install_executorch.sh
173101```
174102
175- ### CUDA-Windows Export
103+ </details>
176104
177- Before running ` cuda-windows ` export, make sure these requirements are set up:
178- - ` x86_64-w64-mingw32-g++ ` is installed and on ` PATH ` (mingw-w64 cross-compiler).
179- - ` WINDOWS_CUDA_HOME ` points to the extracted Windows CUDA package directory.
180-
181- Example setup on Ubuntu (refer to [Parakeet README](../parakeet/README.md) for detailed extraction steps):
182-
183- ``` bash
184- # Ensure the WINDOWS_CUDA_HOME environment variable is set
185- export WINDOWS_CUDA_HOME=/opt/cuda-windows/extracted/cuda_cudart/cudart
186- ```
187-
188- Export the model for Windows CUDA (example with int4 quantization):
105+ <details>
106+ <summary><strong>CUDA</strong></summary>
189107
190108``` bash
191109python export_voxtral_rt.py \
192110 --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
193- --backend cuda-windows \
111+ --backend cuda \
194112 --dtype bf16 \
113+ --streaming \
195114 --output-dir ./voxtral_rt_exports \
196115 --qlinear-encoder 4w \
197116 --qlinear-encoder-packing-format tile_packed_to_4d \
@@ -200,9 +119,18 @@ python export_voxtral_rt.py \
200119 --qembedding 8w
201120```
202121
203- For streaming, add ` --streaming ` :
122+ </details>
123+
124+ <details>
125+ <summary><strong>CUDA-Windows</strong></summary>
126+
127+ Requires ` x86_64-w64-mingw32-g++ ` on ` PATH ` (mingw-w64 cross-compiler) and
128+ ` WINDOWS_CUDA_HOME ` pointing to the extracted Windows CUDA package directory.
129+ See [Parakeet README](../parakeet/README.md#cuda-windows-export) for detailed extraction steps.
204130
205131``` bash
132+ export WINDOWS_CUDA_HOME=/opt/cuda-windows/extracted/cuda_cudart/cudart
133+
206134python export_voxtral_rt.py \
207135 --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
208136 --backend cuda-windows \
@@ -216,16 +144,18 @@ python export_voxtral_rt.py \
216144 --qembedding 8w
217145```
218146
219- Both offline and streaming exports generate:
220- - ` model.pte `
221- - ` aoti_cuda_blob.ptd `
147+ </details>
148+
149+ > [!NOTE]
150+ > Omit `--streaming` from any export command above for offline mode.
151+ > CUDA and CUDA-Windows exports also produce an `aoti_cuda_blob.ptd` file alongside `model.pte`.
222152
223153### Options
224154
225155| Flag | Default | Description |
226156| ------| ---------| -------------|
227157| ` --model-path ` | (required) | Directory with ` params.json ` + ` consolidated.safetensors ` |
228- | ` --backend ` | ` xnnpack ` | ` xnnpack ` , ` metal ` , ` cuda ` , or ` portable ` |
158+ | ` --backend ` | ` xnnpack ` | ` xnnpack ` , ` metal ` , ` cuda ` , ` cuda-windows ` , or ` portable ` |
229159| ` --dtype ` | ` fp32 ` | Model dtype: ` fp32 ` or ` bf16 ` |
230160| ` --output-dir ` | ` ./voxtral_rt_exports ` | Output directory |
231161| ` --max-seq-len ` | ` 4096 ` | KV cache length |
@@ -250,36 +180,21 @@ ExecuTorch must be installed from source first (see
250180[Prerequisites](#prerequisites)). The `make` targets below handle
251181building core libraries and the runner binary.
252182
253- ### XNNPACK (CPU)
254-
255183``` bash
256- make voxtral_realtime-cpu
184+ make voxtral_realtime-cpu # XNNPACK (CPU)
185+ make voxtral_realtime-metal # Metal (Apple GPU)
186+ make voxtral_realtime-cuda # CUDA (NVIDIA GPU)
257187```
258188
259- This builds ExecuTorch core libraries with XNNPACK, then the runner binary
260- at ` cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner ` .
261-
262- ### CUDA (NVIDIA GPU)
263-
264- ``` bash
265- make voxtral_realtime-cuda
266- ```
267-
268- This builds ExecuTorch with CUDA backend support. The runner binary is at
269- the same path as above. Requires NVIDIA GPU with CUDA toolkit installed.
270-
271- ### Metal (Apple GPU)
272-
273- ``` bash
274- make voxtral_realtime-metal
275- ```
276-
277- This builds ExecuTorch with Metal backend support. The runner binary is at
278- the same path as above. Metal exports can only run on macOS with Apple Silicon.
189+ All targets produce the runner binary at
190+ ` cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner ` .
279191
280192### CUDA-Windows
281193
282- On Windows (PowerShell), use CMake workflow presets directly from the executorch root directory. Note that if you exported the model with 4-bit quantization, you may need to specify your GPU's compute capability (e.g., ` 80;86;89;90;120 ` for Ampere, Lovelace, Hopper, and Blackwell) to avoid "invalid device function" errors at runtime, as the ` int4mm ` kernels require SM 80 or newer.
194+ On Windows (PowerShell), use CMake workflow presets from the executorch root
195+ directory. If you exported with 4-bit quantization, specify your GPU's compute
196+ capability to avoid "invalid device function" errors (the ` int4mm ` kernels
197+ require SM 80+).
283198
284199``` powershell
285200$env:CMAKE_CUDA_ARCHITECTURES="80;86;89;90;120"
@@ -296,43 +211,24 @@ The runner requires:
296211- ` tekken.json ` — tokenizer from the model weights directory
297212- `preprocessor.pte` — mel spectrogram preprocessor (see [Preprocessor](#preprocessor))
298213- A 16kHz mono WAV audio file (or live audio via ` --mic ` )
299- - For CUDA: ` aoti_cuda_blob.ptd ` — delegate data file (pass via ` --data_path ` )
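The audio requirement above (16kHz, mono) can be checked before invoking the runner. The following is a minimal sketch using Python's standard `wave` module. The helper name `check_wav` is hypothetical and not part of this repository, and the check only reads the WAV header; it does not resample or convert anything.

```python
import wave


def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return a list of header mismatches for a WAV file (empty if it looks fine)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"found {channels} channel(s), expected {expected_channels}")
    return problems
```

An empty list means the header matches what the runner expects; otherwise, re-encode the file with an audio tool of your choice before transcribing.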
300214
301- ### Windows (PowerShell)
215+ ### Basic usage
302216
303- ``` powershell
304- .\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe `
305- --model_path voxtral_rt_exports\model.pte `
306- --tokenizer_path C:\path\to\tekken.json `
307- --preprocessor_path voxtral_rt_exports\preprocessor.pte `
308- --audio_path input.wav
309- ```
310-
311- For CUDA, include the ` .ptd ` data file:
312-
313- ``` powershell
314- .\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe `
315- --model_path voxtral_rt_exports\model.pte `
316- --data_path voxtral_rt_exports\aoti_cuda_blob.ptd `
317- --tokenizer_path C:\path\to\tekken.json `
318- --preprocessor_path voxtral_rt_exports\preprocessor.pte `
319- --audio_path input.wav
217+ ``` bash
218+ cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner \
219+ --model_path voxtral_rt_exports/model.pte \
220+ --tokenizer_path ~/models/Voxtral-Mini-4B-Realtime-2602/tekken.json \
221+ --preprocessor_path voxtral_rt_exports/preprocessor.pte \
222+ --audio_path input.wav \
223+ --streaming
320224```
321225
322- For streaming, add ` --streaming ` . This requires a model exported with
323- ` --streaming ` . The runner processes audio in 80ms steps, computing mel
324- and running the encoder+decoder incrementally.
226+ Omit ` --streaming ` for offline transcription (requires an offline-exported
227+ model and offline preprocessor).
325228
326- ``` powershell
327- .\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe `
328- --model_path voxtral_rt_exports\model.pte `
329- --tokenizer_path C:\path\to\tekken.json `
330- --preprocessor_path voxtral_rt_exports\preprocessor.pte `
331- --audio_path input.wav `
332- --streaming
333- ```
229+ For CUDA backends (Linux and Windows), add ` --data_path voxtral_rt_exports/aoti_cuda_blob.ptd ` .
334230
335- For CUDA streaming, include the ` .ptd ` data file:
231+ **Windows (PowerShell):**
336232
337233``` powershell
338234.\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe `
@@ -344,9 +240,11 @@ For CUDA streaming, include the `.ptd` data file:
344240 --streaming
345241```
346242
347- For live microphone input, use ` --mic ` to read raw 16kHz float32 PCM from
348- stdin. This requires a model exported with ` --streaming ` and a streaming
349- preprocessor. Pipe from any audio capture tool:
243+ ### Live microphone input
244+
245+ Use ` --mic ` to read raw 16kHz float32 PCM from stdin. Requires a
246+ streaming-exported model and streaming preprocessor. Pipe from any audio
247+ capture tool:
350248
351249``` bash
352250# macOS
@@ -360,19 +258,7 @@ ffmpeg -f avfoundation -i ":0" -ar 16000 -ac 1 -f f32le -nostats -loglevel error
360258
361259Ctrl+C stops recording and flushes remaining text.
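To illustrate the stdin format that `--mic` reads (raw 16kHz float32 PCM), here is a minimal sketch that converts a 16-bit PCM, 16kHz mono WAV file into little-endian float32 bytes. The function name `wav_to_f32le` is hypothetical, and scaling samples to the range [-1.0, 1.0) is an assumption chosen to match what ffmpeg's `-f f32le` output produces. How the runner behaves when stdin reaches end-of-file is not covered by the text above, so treat this as a way to inspect the format rather than a supported workflow.

```python
import struct
import sys
import wave


def wav_to_f32le(path):
    """Convert a 16-bit PCM, 16 kHz mono WAV file to raw little-endian float32 bytes."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        if wav.getnchannels() != 1 or wav.getframerate() != 16000:
            raise ValueError("expected 16 kHz mono audio")
        frames = wav.readframes(wav.getnframes())
    count = len(frames) // 2
    samples = struct.unpack(f"<{count}h", frames)
    # Scale the int16 range [-32768, 32767] to floats in [-1.0, 1.0).
    return struct.pack(f"<{count}f", *(s / 32768.0 for s in samples))


if __name__ == "__main__" and len(sys.argv) == 2:
    # Write raw bytes to stdout so the output can be piped to another process.
    sys.stdout.buffer.write(wav_to_f32le(sys.argv[1]))
```

Each 80ms step at 16kHz corresponds to 1280 samples, or 5120 bytes in this format.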
362260
363- ** Windows (PowerShell):**
364-
365- ``` powershell
366- .\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe `
367- --model_path C:\path\to\voxtral_rt_exports\model.pte `
368- --data_path C:\path\to\voxtral_rt_exports\aoti_cuda_blob.ptd `
369- --tokenizer_path C:\path\to\tekken.json `
370- --preprocessor_path C:\path\to\voxtral_rt_exports\preprocessor.pte `
371- --audio_path C:\path\to\input.wav
372- ```
373-
374- ** CUDA:** Add ` --data_path voxtral_rt_exports/aoti_cuda_blob.ptd ` to all
375- run commands above when using the CUDA backend.
261+ ### Options
376262
377263| Flag | Default | Description |
378264| ------| ---------| -------------|