@@ -25,7 +25,7 @@ python export_parakeet_tdt.py --audio /path/to/audio.wav
 | Argument | Description |
 |----------|-------------|
 | `--output-dir` | Output directory for exports (default: `./parakeet_tdt_exports`) |
-| `--backend` | Backend for acceleration: `portable`, `xnnpack`, `metal`, `mlx`, `cuda`, `cuda-windows` (default: `xnnpack`) |
+| `--backend` | Backend for acceleration: `portable`, `xnnpack`, `vulkan`, `metal`, `mlx`, `cuda`, `cuda-windows` (default: `xnnpack`) |
 | `--dtype` | Data type: `fp32`, `bf16`, `fp16` (default: `fp32`). Metal backend supports `fp32` and `bf16` only (no `fp16`). |
 | `--audio` | Path to audio file for transcription test |
 
@@ -54,7 +54,7 @@ The export script supports quantizing encoder and decoder linear layers using [t
 |--------|-------------|----------|
 | `4w` | 4-bit weight only quantization | CUDA, MLX, XNNPACK (embedding only) |
 | `8w` | 8-bit weight only quantization | CUDA, MLX, XNNPACK (embedding only) |
-| `8da4w` | 8-bit dynamic activation, 4-bit weight | XNNPACK |
+| `8da4w` | 8-bit dynamic activation, 4-bit weight | Vulkan, XNNPACK |
 | `8da8w` | 8-bit dynamic activation, 8-bit weight | XNNPACK |
 | `fpa4w` | Floating point activation, 4-bit weight | Metal |
 | `nvfp4` | 4-bit weight only quantization using NVIDIA's FP4 dtype | MLX |
@@ -71,6 +71,21 @@ python export_parakeet_tdt.py \
   --output-dir ./parakeet_quantized_xnnpack
 ```
 
+#### Example: Dynamic Quantization for Vulkan
+
+```bash
+python export_parakeet_tdt.py \
+  --backend vulkan \
+  --qlinear_encoder 8da4w \
+  --qlinear_encoder_group_size 32 \
+  --qlinear 8da4w \
+  --qlinear_group_size 32 \
+  --vulkan_force_fp16 \
+  --output-dir ./parakeet_quantized_vulkan
+```
+
+The additional `--vulkan_force_fp16` flag instructs the Vulkan backend to downcast FP32 tensors to FP16 internally, forcing half-precision computation. Input/output tensors remain FP32; the delegate converts them to and from FP16 automatically on entry and exit. This significantly improves latency but may slightly reduce transcription accuracy.
+
 #### Example: 4-bit Weight Quantization with Tile Packing for CUDA
 
 ```bash
@@ -217,6 +232,9 @@ make parakeet-cpu
 # Metal build (macOS)
 make parakeet-metal
 
+# Vulkan build (Linux / Android)
+make parakeet-vulkan
+
 # CUDA build (Linux)
 make parakeet-cuda
 
@@ -250,6 +268,12 @@ DYLD_LIBRARY_PATH=/usr/lib ./cmake-out/examples/models/parakeet/parakeet_runner
   --audio_path /path/to/audio.wav \
   --tokenizer_path examples/models/parakeet/parakeet_metal/tokenizer.model
 
+# Vulkan
+./cmake-out/examples/models/parakeet/parakeet_runner \
+  --model_path examples/models/parakeet/parakeet_vulkan/model.pte \
+  --audio_path /path/to/audio.wav \
+  --tokenizer_path examples/models/parakeet/parakeet_vulkan/tokenizer.model
+
 # CUDA (include .ptd data file)
 ./cmake-out/examples/models/parakeet/parakeet_runner \
   --model_path examples/models/parakeet/parakeet_cuda/model.pte \
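The `--vulkan_force_fp16` behavior documented in the diff (FP32 tensors at the delegate boundary, FP16 computation inside) can be illustrated with a minimal NumPy sketch. This is an independent illustration of the boundary cast and its precision cost, not the Vulkan delegate's actual implementation:

```python
import numpy as np

# Illustrative sketch only: emulate the delegate boundary described above.
# FP32 inputs are downcast to FP16 on entering the delegate, computed in
# half precision, then upcast back to FP32 on exit.
def delegate_boundary(x_fp32: np.ndarray) -> np.ndarray:
    x_fp16 = x_fp32.astype(np.float16)   # entering the delegate
    y_fp16 = x_fp16 * np.float16(2.0)    # stand-in for FP16 computation
    return y_fp16.astype(np.float32)     # exiting the delegate

x = np.array([0.1, 1.0 / 3.0], dtype=np.float32)
y = delegate_boundary(x)

# FP16 keeps ~11 bits of mantissa, so results drift slightly from pure FP32.
print(np.abs(y - np.float32(2.0) * x).max())
```

The caller only ever sees FP32 arrays, which is why the flag trades a small accuracy loss for latency without changing the model's I/O contract.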