@@ -88,8 +88,9 @@ python export_voxtral_rt.py \
 |---------|---------|-----------|--------------|
 | `xnnpack` | ✓ | ✓ | `4w`, `8w`, `8da4w`, `8da8w` |
 | `metal` | ✓ | ✓ | none (fp32) or `fpa4w` (Metal-specific 4-bit) |
+| `mlx` | ✓ | ✓ | `4w`, `8w` |
 
-Metal backend provides Apple GPU acceleration.
+Metal and MLX backends provide Apple GPU acceleration.
 
 #### Metal export examples
 
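The quantization codes in the table follow torchao's shorthand: a trailing `<N>w` gives the weight bit-width, and an optional leading `<M>da` marks M-bit dynamically quantized activations (so `8da4w` is 8-bit dynamic activations with 4-bit weights, and `4w` is 4-bit weight-only). A small decoder sketch, assuming that convention (`decode_quant` is a hypothetical helper, not part of the exporter):

```python
import re

# Decode torchao-style quantization codes such as "4w", "8w", "8da4w",
# "8da8w". Assumed convention: "<M>da" = M-bit dynamic activations,
# "<N>w" = N-bit weights; weight-only when the "da" part is absent.
def decode_quant(code: str) -> str:
    m = re.fullmatch(r"(?:(\d+)da)?(\d+)w", code)
    if not m:
        raise ValueError(f"unrecognized quantization code: {code}")
    act_bits, weight_bits = m.groups()
    if act_bits:
        return f"{act_bits}-bit dynamic activations, {weight_bits}-bit weights"
    return f"{weight_bits}-bit weight-only"

print(decode_quant("8da4w"))  # 8-bit dynamic activations, 4-bit weights
print(decode_quant("4w"))     # 4-bit weight-only
```

The Metal-specific `fpa4w` scheme does not follow this pattern and is handled separately by the Metal backend.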
@@ -128,12 +129,48 @@ Alternatively, you can build torchao with Metal support while installing ExecuTo
 EXECUTORCH_BUILD_KERNELS_TORCHAO=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 ./install_executorch.sh
 ```
 
+#### MLX export examples
+
+MLX backend uses the MLX delegate for Apple Silicon GPU acceleration.
+
+Offline:
+
+```bash
+python export_voxtral_rt.py \
+  --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
+  --backend mlx \
+  --output-dir ./voxtral_rt_exports \
+  --qlinear-encoder 4w \
+  --qlinear 4w \
+  --qembedding 8w \
+  --qembedding-group-size 128 \
+  --export-preprocessor
+```
+
+Streaming:
+
+```bash
+python export_voxtral_rt.py \
+  --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
+  --backend mlx \
+  --streaming \
+  --output-dir ./voxtral_rt_exports \
+  --qlinear-encoder 4w \
+  --qlinear 4w \
+  --qembedding 8w \
+  --qembedding-group-size 128 \
+  --export-preprocessor
+```
+
+`--export-preprocessor` bundles the mel preprocessor into the output directory
+using the MLX partitioner, so no separate preprocessor export step is needed.
+
 ### Options
 
 | Flag | Default | Description |
 |------|---------|-------------|
 | `--model-path` | (required) | Directory with `params.json` + `consolidated.safetensors` |
-| `--backend` | `xnnpack` | `xnnpack`, `metal`, or `portable` |
+| `--backend` | `xnnpack` | `xnnpack`, `metal`, `mlx`, or `portable` |
 | `--output-dir` | `./voxtral_rt_exports` | Output directory |
 | `--max-seq-len` | `4096` | KV cache length |
 | `--delay-tokens` | `6` | Transcription delay in tokens (6 = 480 ms) |
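The `--delay-tokens` default of 6 mapping to 480 ms implies an 80 ms audio hop per token. Assuming that rate is constant, converting a token count to a delay is simple arithmetic:

```python
# Assumed from the options table: 6 delay tokens correspond to 480 ms,
# i.e. 80 ms of audio per token.
MS_PER_TOKEN = 480 // 6  # 80

def delay_to_ms(delay_tokens: int) -> int:
    """Milliseconds of transcription delay for a --delay-tokens value."""
    return delay_tokens * MS_PER_TOKEN

print(delay_to_ms(6))   # 480
print(delay_to_ms(12))  # 960
```

So halving or doubling `--delay-tokens` trades 240 ms of latency against the amount of future context the model sees before committing to a transcription.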
@@ -142,6 +179,8 @@ EXECUTORCH_BUILD_KERNELS_TORCHAO=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 ./install_ex
 | `--qlinear-encoder` | (none) | Encoder linear layer quantization (`4w`, `8w`, `8da4w`, `8da8w`, `fpa4w`) |
 | `--qlinear-encoder-group-size` | `32` | Group size for encoder linear quantization |
 | `--qembedding` | (none) | Embedding layer quantization (`8w`) |
+| `--qembedding-group-size` | `0` | Group size for embedding quantization (0 = per-channel) |
+| `--export-preprocessor` | off | Export `preprocessor.pte` alongside the model |
 | `--streaming` | off | Export streaming encoder with KV cache |
 | `--max-enc-len` | `750` | Encoder sliding window size (streaming only) |
 
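The group-size semantics added above (one quantization scale per group of weights, with `0` meaning a single scale per channel/row) can be illustrated with a minimal symmetric int8 sketch. This shows the general scheme only, not the exporter's actual implementation; `quantize_row` is a hypothetical helper:

```python
# Minimal sketch of symmetric int8 quantization of one weight row,
# with a scale per group of `group_size` values, or a single scale for
# the whole row ("per-channel") when group_size == 0.
def quantize_row(row, group_size):
    gs = group_size if group_size > 0 else len(row)  # 0 -> per-channel
    quants, scales = [], []
    for start in range(0, len(row), gs):
        group = row[start:start + gs]
        scale = (max(abs(v) for v in group) / 127) or 1.0  # avoid 0 scale
        scales.append(scale)
        quants.extend(round(v / scale) for v in group)
    return quants, scales

q, scales = quantize_row([0.5, -1.0, 0.25, 2.0], group_size=2)
print(scales)  # two scales: one per group of two weights
q, scales = quantize_row([0.5, -1.0, 0.25, 2.0], group_size=0)
print(scales)  # one scale for the whole row (per-channel)
```

Smaller groups track local weight magnitudes more closely (better accuracy) at the cost of storing more scales, which is the trade-off behind the `128` group size used in the MLX examples above.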
@@ -173,6 +212,15 @@ make voxtral_realtime-metal
 This builds ExecuTorch with Metal backend support. The runner binary is at
 the same path as above. Metal exports can only run on macOS with Apple Silicon.
 
+### MLX (Apple GPU)
+
+```bash
+make voxtral_realtime-mlx
+```
+
+This builds ExecuTorch with MLX backend support. MLX provides GPU acceleration
+on Apple Silicon via the MLX delegate.
+
 ## Run
 
 The runner requires: