# Parakeet TDT Export for ExecuTorch

Export the [nvidia/parakeet-tdt-0.6b-v3](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3) speech recognition model to ExecuTorch.

## Installation

```bash
pip install -r install_requirements.txt
```

## Export

Export the model:

```bash
python export_parakeet_tdt.py
```

Test transcription on an audio file and compare eager vs lowered results:

```bash
python export_parakeet_tdt.py --audio /path/to/audio.wav
```
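
One way to quantify how closely the lowered transcript matches the eager one is word error rate (WER). The helper below is an illustrative, repo-independent sketch — it is not part of `export_parakeet_tdt.py`:

```python
# Illustrative helper (not from this repo): word error rate between the
# eager PyTorch transcript and the lowered ExecuTorch transcript.
def word_error_rate(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    # Levenshtein distance over words, single-row dynamic programming.
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
    return d[len(h)] / max(len(r), 1)

print(word_error_rate("it was the best of times", "it was the best of times"))  # 0.0
```

A WER of 0.0 means the two transcripts agree word for word; small nonzero values usually point at quantization or dtype effects.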

### Export Arguments

| Argument | Description |
| -------- | ----------- |
| `--output-dir` | Output directory for exports (default: `./parakeet_tdt_exports`) |
| `--backend` | Backend for acceleration: `portable`, `xnnpack`, `vulkan`, `metal`, `mlx`, `cuda`, `cuda-windows` (default: `xnnpack`) |
| `--dtype` | Data type: `fp32`, `bf16`, `fp16` (default: `fp32`). The Metal backend supports `fp32` and `bf16` only (no `fp16`). |
| `--audio` | Path to an audio file for a transcription test |

**Note:** The preprocessor is always lowered with the portable backend regardless of the `--backend` setting.

### Quantization

The export script supports quantizing encoder and decoder linear layers using [torchao](https://github.com/pytorch/ao).

#### Quantization Arguments

| Argument | Description |
| -------- | ----------- |
| `--qlinear_encoder` | Quantization config for encoder linear layers: `4w`, `8w`, `8da4w`, `8da8w`, `fpa4w`, `nvfp4` |
| `--qlinear_encoder_group_size` | Group size for encoder linear quantization (default: auto) |
| `--qlinear_encoder_packing_format` | Packing format for encoder: `tile_packed_to_4d` |
| `--qlinear` | Quantization config for decoder linear layers: `4w`, `8w`, `8da4w`, `8da8w`, `fpa4w`, `nvfp4` |
| `--qlinear_group_size` | Group size for decoder linear quantization (default: auto) |
| `--qlinear_packing_format` | Packing format for decoder: `tile_packed_to_4d` |
| `--qembedding` | Quantization config for decoder embedding layer: `4w`, `8w`, `nvfp4` |
| `--qembedding_group_size` | Group size for embedding quantization (default: auto) |

#### Quantization Configs

| Config | Description | Backends |
| ------ | ----------- | -------- |
| `4w` | 4-bit weight-only quantization | CUDA, MLX, XNNPACK (embedding only) |
| `8w` | 8-bit weight-only quantization | CUDA, MLX, XNNPACK (embedding only) |
| `8da4w` | 8-bit dynamic activation, 4-bit weight | Vulkan, XNNPACK |
| `8da8w` | 8-bit dynamic activation, 8-bit weight | XNNPACK |
| `fpa4w` | Floating-point activation, 4-bit weight | Metal |
| `nvfp4` | 4-bit weight-only quantization using NVIDIA's FP4 dtype | MLX |
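
The group-size flags above control how many weights share one quantization scale: smaller groups track the local weight distribution more closely at the cost of more scale metadata. The sketch below is a toy, stdlib-only illustration of group-wise 4-bit weight quantization — it is not torchao's actual implementation:

```python
# Toy sketch of group-wise 4-bit weight quantization (what a group_size flag
# controls). Illustrative only; torchao's real kernels work on tensors.
def quantize_groupwise_4bit(weights, group_size=32):
    assert len(weights) % group_size == 0
    groups = []
    for g in range(0, len(weights), group_size):
        block = weights[g:g + group_size]
        # One floating-point scale per group; signed 4-bit ints span [-8, 7].
        scale = max(abs(w) for w in block) / 7 or 1.0
        groups.append((scale, [max(-8, min(7, round(w / scale))) for w in block]))
    return groups

def dequantize(groups):
    return [scale * q for scale, qs in groups for q in qs]
```

With `group_size=32`, every 32 weights share one scale, so the worst-case rounding error inside a group is half of that group's scale.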

#### Example: Dynamic Quantization for XNNPACK

```bash
python export_parakeet_tdt.py \
  ...
  --output-dir ./parakeet_quantized_xnnpack
```

#### Example: Dynamic Quantization for Vulkan

```bash
python export_parakeet_tdt.py \
  --backend vulkan \
  --qlinear_encoder 8da4w \
  --qlinear_encoder_group_size 32 \
  --qlinear 8da4w \
  --qlinear_group_size 32 \
  --vulkan_force_fp16 \
  --output-dir ./parakeet_quantized_vulkan
```

The optional `--vulkan_force_fp16` flag has the Vulkan backend internally downcast FP32 tensors to FP16, forcing half-precision computation. Input and output tensors remain FP32; the delegate automatically converts them to and from FP16 on entry and exit. This significantly improves latency but may slightly reduce transcription accuracy.
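
The accuracy cost comes from FP16's limited precision (roughly three decimal digits). A stdlib-only sketch of the round trip implied by the delegate's boundary conversion — an illustration, not the Vulkan backend's actual code path:

```python
import struct

def fp16_roundtrip(x: float) -> float:
    # Pack to IEEE 754 half precision and back, approximating the precision
    # loss of an FP32 -> FP16 -> FP32 round trip. Sketch only; the real
    # conversion happens inside the Vulkan delegate.
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(fp16_roundtrip(0.1))  # 0.0999755859375
```

Values exactly representable in FP16 (like 0.5) survive unchanged; most others pick up a small rounding error.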

#### Example: 4-bit Weight Quantization with Tile Packing for CUDA

```bash
python export_parakeet_tdt.py \
  ...
```

#### Example: 4-bit Weight Quantization for Metal

```bash
python export_parakeet_tdt.py \
  ...
  --output-dir ./parakeet_metal_quantized
```

**Note:** Metal 4-bit quantization requires torchao built with experimental MPS (Metal) ops.

You can install torchao with Metal support from the `ao` repo:

```bash
USE_CPP=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 pip install . --no-build-isolation
```

Alternatively, you can build torchao with Metal support while installing ExecuTorch:

```bash
EXECUTORCH_BUILD_KERNELS_TORCHAO=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 ./install_executorch.sh
```

### Metal Export

```bash
python export_parakeet_tdt.py --backend metal --output-dir ./parakeet_metal
```

This generates:

- `model.pte` - The compiled Parakeet TDT model (includes Metal kernel blob)
- `tokenizer.model` - SentencePiece tokenizer

### CUDA Export

```bash
python export_parakeet_tdt.py --backend cuda --output-dir ./parakeet_cuda
```

This generates:

- `model.pte` - The compiled Parakeet TDT model
- `aoti_cuda_blob.ptd` - CUDA kernel blob required at runtime
- `tokenizer.model` - SentencePiece tokenizer

### CUDA-Windows Export

Before running `cuda-windows` export, make sure these requirements are set up:

- `x86_64-w64-mingw32-g++` is installed and on `PATH` (mingw-w64 cross-compiler).
- `WINDOWS_CUDA_HOME` points to the extracted Windows CUDA package directory.

Example setup on Ubuntu:

```bash
...
```

Then run the export:

```bash
python export_parakeet_tdt.py --backend cuda-windows --output-dir ./parakeet_cuda_windows
```

This generates:

- `model.pte` - The compiled Parakeet TDT model
- `aoti_cuda_blob.ptd` - CUDA kernel blob required at runtime

### MLX Export

Export with MLX backend (bf16, int4 quantized, group size 128):

```bash
python export_parakeet_tdt.py \
  --backend mlx \
  ...
```

Export with MLX backend (bf16, NVFP4 quantized):

```bash
python export_parakeet_tdt.py \
  --backend mlx \
  ...
  --output-dir ./parakeet_mlx_nvfp4
```

> **Note:** Although MLX supports NVFP4 embedding quantization, Parakeet's embedding layer has dimensions not divisible by 16, which is incompatible with NVFP4. Use `4w` for embeddings instead.

This generates:

- `model.pte` - The compiled model with MLX delegate (~470 MB)
- `tokenizer.model` - SentencePiece tokenizer

## Runner

Build the runner:

```bash
# CPU build
make parakeet-cpu

# Metal build (macOS)
make parakeet-metal

# Vulkan build (Linux / Android)
make parakeet-vulkan

# CUDA build (Linux)
make parakeet-cuda
```

Run the exported model:

```bash
# Metal
DYLD_LIBRARY_PATH=/usr/lib ./cmake-out/examples/models/parakeet/parakeet_runner \
  ...
  --audio_path /path/to/audio.wav \
  --tokenizer_path examples/models/parakeet/parakeet_metal/tokenizer.model

# Vulkan
./cmake-out/examples/models/parakeet/parakeet_runner \
  --model_path examples/models/parakeet/parakeet_vulkan/model.pte \
  --audio_path /path/to/audio.wav \
  --tokenizer_path examples/models/parakeet/parakeet_vulkan/tokenizer.model

# CUDA (include .ptd data file)
./cmake-out/examples/models/parakeet/parakeet_runner \
  --model_path examples/models/parakeet/parakeet_cuda/model.pte \
  ...
```

Windows (PowerShell):

```powershell
...
  --tokenizer_path C:\path\to\parakeet_cuda_windows\tokenizer.model
```

If your generator is single-config, the runner may be at `.\cmake-out\examples\models\parakeet\parakeet_runner.exe` instead.

### Runner Arguments

| Argument | Description |
| -------- | ----------- |
| `--model_path` | Path to Parakeet model (.pte) |
| `--audio_path` | Path to input audio file (.wav) |
| `--tokenizer_path` | Path to tokenizer file (default: `tokenizer.json`) |
| `--data_path` | Path to data file (.ptd) for delegate data (required for CUDA/CUDA-Windows) |
| `--timestamps` | Timestamp output mode: `none\|token\|word\|segment\|all` (default: `segment`) |
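
Parakeet models are trained on 16 kHz speech, so it can save a debugging round trip to sanity-check a WAV before handing it to `--audio_path`. The checker below is an illustrative stdlib sketch, not part of `parakeet_runner`, and it assumes the runner expects 16 kHz mono 16-bit PCM without resampling for you:

```python
import wave

def looks_like_parakeet_input(path: str) -> bool:
    # Illustrative pre-flight check (not part of parakeet_runner). Assumes the
    # runner wants 16 kHz mono 16-bit PCM and does not resample on its own.
    with wave.open(path, "rb") as w:
        return (w.getnchannels() == 1
                and w.getframerate() == 16000
                and w.getsampwidth() == 2)
```

If the check fails, convert the file first, e.g. `ffmpeg -i in.wav -ac 1 -ar 16000 out.wav`.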

### Mobile App

Check out a [demo Android app](https://github.com/meta-pytorch/executorch-examples/tree/main/parakeet/android/ParakeetApp) for Parakeet in the separate `executorch-examples` repository.

https://github.com/user-attachments/assets/9793d2d0-0d23-4627-a8dc-4334b97b07ab