| 1 | +# Voxtral TTS ExecuTorch Benchmark Results |
| 2 | + |
| 3 | +Date: 2026-04-16 |
| 4 | +Machine: Meta devserver (CPU-only, no GPU) |
| 5 | +Backend: ExecuTorch XNNPACK (CPU) + portable |
| 6 | +Model: `mistralai/Voxtral-4B-TTS-2603` |
| 7 | +Voice: `neutral_female`, seed `42` |
| 8 | + |
| 9 | +## Short prompt — "Hello, how are you today?" (5 words) |
| 10 | + |
| 11 | +| Config | model.pte | codec.pte | Frames | Audio | Wall time | RTF | Parakeet transcript | |
| 12 | +|--------|-----------|-----------|--------|-------|-----------|-----|---------------------| |
| 13 | +| FP32 XNNPACK | 15.5 GB | 610 MB | 40 | 3.20s | 15.3s | 4.8x | Hello, how are you today? | |
| 14 | +| FP32 portable | 15.5 GB | 748 MB | 40 | 3.20s | 278s | 87x | Hello, how are you today? | |
| 15 | +| 8da4w (feed_forward) | 7.0 GB | 610 MB | 43 | 3.44s | ~12s | ~3.5x | Hello, how are you today? | |
| 16 | +| 8da8w (all) | 5.7 GB | 610 MB | 44 | 3.52s | ~10s | ~2.8x | Hello, how are you today? | |
| 17 | +| 8da4w (all) | 4.3 GB | 610 MB | 33 | 2.64s | ~10s | ~3.8x | Ah hello. How are you today? | |
| 18 | +| C reference (OpenBLAS) | N/A | N/A | 40 | 3.20s | ~300s | 94x | Hello, how are you today? | |
| 19 | + |
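The RTF column above is consistent with wall time divided by audio duration, where audio duration is the frame count divided by the 12.5 Hz frame rate, so lower is better. A minimal sketch of that arithmetic, assuming this definition; the helper names are illustrative and not part of the runner:

```python
FRAME_RATE_HZ = 12.5  # codec frame rate quoted in these benchmark notes


def audio_seconds(frames: int, frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Audio duration implied by a number of generated frames."""
    return frames / frame_rate_hz


def rtf(wall_seconds: float, frames: int) -> float:
    """Real-time factor, assumed here to be wall time / audio duration."""
    return wall_seconds / audio_seconds(frames)


if __name__ == "__main__":
    # (config, wall time in seconds, frames) taken from the short-prompt table
    rows = [("FP32 XNNPACK", 15.3, 40), ("FP32 portable", 278.0, 40)]
    for name, wall, frames in rows:
        print(f"{name}: {audio_seconds(frames):.2f}s audio, RTF {rtf(wall, frames):.1f}x")
```

The same helper can be used to sanity-check any row before it goes into the table.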
| 20 | +## Long prompt — 541 chars / 90 words (paragraph) |
| 21 | + |
| 22 | +Input text: |
| 23 | +> The quick brown fox jumps over the lazy dog near the old stone bridge that |
| 24 | +> crosses the winding river. Birds sing melodiously in the tall oak trees as |
| 25 | +> the morning sun casts golden rays across the peaceful meadow. A gentle breeze |
| 26 | +> carries the sweet scent of wildflowers through the valley, while distant |
| 27 | +> church bells chime softly in the background. Children laugh and play in the |
| 28 | +> nearby park, their joyful voices echoing through the neighborhood. The world |
| 29 | +> feels calm and beautiful on this perfect spring morning, filled with warmth |
| 30 | +> and wonder. |
| 31 | + |
| 32 | +ExecuTorch configs ran with `--max_new_tokens 300` (300 frames = 24 s of audio at 12.5 Hz). |
| 33 | +The C reference ran uncapped and produced 403 frames (32.2 s), covering the |
| 34 | +full text. The ExecuTorch runs hit the 300-frame cap and cut off roughly the last |
| 35 | +two sentences. Use `--max_new_tokens 500` (40 s) to avoid truncation for texts of this length. |
| 36 | + |
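The relationship between the frame cap and audio length can be sketched as follows. This assumes one generated token corresponds to one 12.5 Hz frame, as the note above implies; the function names are hypothetical, and treating "hit the cap" as "truncated" is a heuristic rather than something the runner reports.

```python
import math

FRAME_RATE_HZ = 12.5  # frames per second of audio, per the note above


def frames_for_seconds(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Smallest frame budget that covers the requested audio duration."""
    return math.ceil(seconds * frame_rate_hz)


def hit_frame_cap(frames_generated: int, max_new_tokens: int) -> bool:
    """Heuristic: a run that used its whole budget was probably cut off."""
    return frames_generated >= max_new_tokens


if __name__ == "__main__":
    print(frames_for_seconds(32.2))  # budget needed for the uncapped C reference run
    print(hit_frame_cap(300, 300))   # the ExecuTorch runs used the full 300-frame cap
    print(hit_frame_cap(403, 500))   # 403 frames would fit under a 500-frame cap
```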
| 37 | +| Config | model.pte | Frames | Audio | Wall time | RTF | Parakeet transcript | |
| 38 | +|--------|-----------|--------|-------|-----------|-----|---------------------| |
| 39 | +| FP32 XNNPACK | 15.5 GB | 300 | 24.0s | 77s | 3.2x | Perfect through "Children laugh and play." | |
| 40 | +| 8da4w (feed_forward) | 7.0 GB | 300 | 24.0s | 64s | 2.6x | Perfect through "...in the nearby park." | |
| 41 | +| 8da8w (all) | 5.7 GB | 300 | 24.0s | 45s | 1.9x | "One" for "The" at start; otherwise perfect | |
| 42 | +| 8da4w (all) | 4.3 GB | 300 | 24.0s | 49s | 2.0x | Perfect through "...in the background." | |
| 43 | +| C reference (OpenBLAS) | N/A | 403 | 32.2s | 2508s | 77.9x | Full text perfect (no frame cap) | |
| 44 | + |
| 45 | +### Audio quality metrics (long prompt) |
| 46 | + |
| 47 | +| Config | RMS | Peak amplitude | |
| 48 | +|--------|-----|----------------| |
| 49 | +| FP32 XNNPACK | 0.0136 | [-0.182, 0.215] | |
| 50 | +| 8da4w (feed_forward) | 0.0130 | [-0.142, 0.140] | |
| 51 | +| 8da8w (all) | 0.0104 | [-0.127, 0.156] | |
| 52 | +| 8da4w (all) | 0.0117 | [-0.120, 0.119] | |
| 53 | + |
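The notes do not say how the RMS and peak values were computed. One plausible way to obtain comparable numbers, assuming the runner writes 16-bit PCM WAV and that samples are normalized to [-1, 1] by dividing by 32768, is sketched below using only the standard library. Both assumptions are mine; a float32 WAV would need a different reader.

```python
import array
import math
import sys
import wave


def wav_stats(path: str):
    """Return (rms, min_sample, max_sample) with samples scaled to [-1, 1].

    Assumes 16-bit PCM. Multi-channel files are treated as one interleaved stream.
    """
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        raw = wf.readframes(wf.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        raise ValueError("WAV file contains no samples")

    scaled = [s / 32768.0 for s in samples]
    rms = math.sqrt(sum(x * x for x in scaled) / len(scaled))
    return rms, min(scaled), max(scaled)


if __name__ == "__main__" and len(sys.argv) > 1:
    rms, lo, hi = wav_stats(sys.argv[1])
    print(f"RMS {rms:.4f}, peak [{lo:.3f}, {hi:.3f}]")
```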
| 54 | +## Key observations |
| 55 | + |
| 56 | +1. **XNNPACK is roughly 18–24x faster than the C reference and portable backend** at FP32 on |
| 57 | + the same CPU (up to ~41x for the quantized configs), thanks to optimized XNNPACK kernels for matmul and convolution. |
| 58 | + |
| 59 | +2. **Quantization reduces model size 2.2–3.6x** (15.5 GB down to 7.0–4.3 GB) with minimal quality impact: |
| 60 | + - `8da4w feed_forward` is the recommended config (2.2x smaller, perfect transcript) |
| 61 | + - `8da8w` is the fastest (RTF 1.9x on the long prompt, 2.7x smaller) with good quality |
| 62 | + - `8da4w all` is the smallest (3.6x smaller) but can alter words (short prompt transcribed as "Ah hello.") |
| 63 | + |
| 64 | +3. **RTF (wall time / audio duration; lower is better) improves with longer texts** on the XNNPACK configs, due to amortized model loading and warmup: |
| 65 | + - Short prompt: RTF 2.8–4.8x |
| 66 | + - Long prompt: RTF 1.9–3.2x |
| 67 | + |
| 68 | +4. **FP32 produces bit-identical codes to the C reference** when using the |
| 69 | + matching xorshift64+Box-Muller RNG (verified by `diff -q` on per-frame code |
| 70 | + dumps for the short prompt). |
| 71 | + |
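The `diff -q` check on per-frame code dumps can also be scripted. The sketch below assumes each implementation writes one file per frame into its own directory, with matching file names; that layout is hypothetical, since the dump format is not described here. It uses `filecmp.cmpfiles` with `shallow=False` so file contents are compared byte for byte.

```python
import filecmp
import os


def compare_code_dumps(dir_a: str, dir_b: str) -> dict:
    """Byte-compare same-named files in two dump directories.

    Returns file names grouped as identical, differing, or present on one side only.
    """
    names_a = set(os.listdir(dir_a))
    names_b = set(os.listdir(dir_b))
    common = sorted(names_a & names_b)

    # shallow=False forces a content comparison rather than an os.stat() check
    match, mismatch, errors = filecmp.cmpfiles(dir_a, dir_b, common, shallow=False)

    return {
        "match": match,
        "mismatch": sorted(mismatch + errors),  # unreadable files count as differing
        "missing": sorted(names_a ^ names_b),
    }


def is_bit_identical(dir_a: str, dir_b: str) -> bool:
    """True only if both directories hold the same files with identical bytes."""
    result = compare_code_dumps(dir_a, dir_b)
    return not result["mismatch"] and not result["missing"]
```

An empty `mismatch` and `missing` list corresponds to `diff -q` printing nothing for the two directories.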
| 72 | +## vllm-omni comparison (not runnable on this machine) |
| 73 | + |
| 74 | +This benchmark was run on a CPU-only devserver. The [vllm-omni](https://github.com/vllm-project/vllm-omni) |
| 75 | +reference implementation requires a CUDA GPU (A100/H100 recommended) and typically |
| 76 | +runs faster than real time (RTF below 1x). To run its demo for comparison on a GPU machine: |
| 77 | + |
| 78 | +```bash |
| 79 | +git clone https://github.com/vllm-project/vllm-omni.git |
| 80 | +cd vllm-omni |
| 81 | +uv pip install gradio==5.50 |
| 82 | +python examples/online_serving/voxtral_tts/gradio_demo.py \ |
| 83 | + --host <your-server-url> --port 8000 |
| 84 | +``` |
| 85 | + |
| 86 | +ExecuTorch's value proposition is **on-device inference without GPU dependency**, |
| 87 | +achieving 1.9–3.2x RTF on the long prompt on CPU alone. |
| 88 | + |
| 89 | +## Reproducing |
| 90 | + |
| 91 | +```bash |
| 92 | +conda activate executorch |
| 93 | +VOXTRAL_DIR=~/.cache/huggingface/hub/models--mistralai--Voxtral-4B-TTS-2603/snapshots/<sha> |
| 94 | + |
| 95 | +# Export (pick one) |
| 96 | +python export_voxtral_tts.py --model-path $VOXTRAL_DIR --backend xnnpack --output-dir ./exports |
| 97 | +python export_voxtral_tts.py --model-path $VOXTRAL_DIR --backend xnnpack --qlinear 8da4w --decoder-qlinear-scope feed_forward --output-dir ./exports |
| 98 | +python export_voxtral_tts.py --model-path $VOXTRAL_DIR --backend xnnpack --qlinear 8da8w --output-dir ./exports |
| 99 | + |
| 100 | +# Build |
| 101 | +cmake --workflow --preset llm-release |
| 102 | +cd examples/models/voxtral_tts && cmake --workflow --preset voxtral-tts-xnnpack && cd ../../.. |
| 103 | + |
| 104 | +# Run |
| 105 | +./cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \ |
| 106 | + --model ./exports/model.pte \ |
| 107 | + --codec ./exports/codec_decoder.pte \ |
| 108 | + --tokenizer $VOXTRAL_DIR/tekken.json \ |
| 109 | + --voice $VOXTRAL_DIR/voice_embedding/neutral_female.pt \ |
| 110 | + --text "Hello, how are you today?" \ |
| 111 | + --output output.wav --seed 42 --max_new_tokens 300 |
| 112 | + |
| 113 | +# Verify with parakeet STT |
| 114 | +python examples/models/voxtral_tts/transcribe_parakeet.py \ |
| 115 | + --audio output.wav \ |
| 116 | + --parakeet-runner ./cmake-out/examples/models/parakeet/parakeet_runner \ |
| 117 | + --parakeet-model examples/models/parakeet/parakeet_tdt_exports/model.pte \ |
| 118 | + --parakeet-tokenizer examples/models/parakeet/parakeet_tdt_exports/tokenizer.model |
| 119 | +``` |