
Commit 5a7b515

Author: Young Han
examples: fix Voxtral TTS to produce intelligible speech on CPU and XNNPACK
Three bugs fixed: codec reshape order (P*T to T*P), flow-matching RNG (mt19937 to xorshift64+BoxMuller matching C ref), ALiBi slopes off-by-one. Adds --speaker for live PCM output, parakeet STT gate, quantization docs and benchmarks. Authored with Claude.
1 parent a42f1fb commit 5a7b515

11 files changed

Lines changed: 535 additions & 656 deletions

CLAUDE.md

Lines changed: 4 additions & 0 deletions
@@ -27,6 +27,10 @@ pip install -e . --no-build-isolation # subsequent installs
Details: [docs/source/using-executorch-building-from-source.md](docs/source/using-executorch-building-from-source.md)

## Long-running commands

ExecuTorch model exports and large builds (CMake configure+build of LLM runners, AOT lowering, NeMo restore, big HF downloads) can hang silently and may not surface an exit code through pipes like `tail`. For those long jobs only, poll progress every ~120s rather than waiting indefinitely on the original Bash invocation: check the process state (`ps`, `py-spy dump`), output file growth, and network/file activity. Avoid wrapping long jobs with `| tail`, since it buffers and hides progress; tee to a log file or run unwrapped. Normal short commands don't need this: run them directly and trust the exit code.
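
The polling advice above could be sketched roughly as follows. This is only an illustration, not part of the repository: `run_with_polling` and its parameters are hypothetical names, and it checks log growth only (not `py-spy` or network activity).

```python
import os
import subprocess
import sys
import time


def run_with_polling(cmd, log_path, poll_seconds=120.0):
    """Run cmd with stdout/stderr written to log_path and report on every poll.

    Hypothetical helper: returns the command's real exit code instead of
    relying on a pipe such as `| tail` to surface it.
    """
    started = time.monotonic()
    with open(log_path, "wb") as log:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        last_size = 0
        while True:
            try:
                # wait() returns the exit code as soon as the process ends.
                return proc.wait(timeout=poll_seconds)
            except subprocess.TimeoutExpired:
                size = os.path.getsize(log_path)
                status = "growing" if size > last_size else "no new output"
                elapsed = time.monotonic() - started
                print(
                    f"pid {proc.pid} running for {elapsed:.0f}s; "
                    f"log is {size} bytes ({status})",
                    file=sys.stderr,
                )
                last_size = size
```

For example, `run_with_polling(["cmake", "--workflow", "--preset", "llm-release"], "build.log")` would write the build output to `build.log` and print one status line per poll interval.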

## Naming

- Use "executorch" (lowercase) or "ExecuTorch" (camel case)
Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# Voxtral TTS ExecuTorch Benchmark Results

Date: 2026-04-16
Machine: Meta devserver (CPU-only, no GPU)
Backend: ExecuTorch XNNPACK (CPU) + portable
Model: `mistralai/Voxtral-4B-TTS-2603`
Voice: `neutral_female`, seed `42`

## Short prompt: "Hello, how are you today?" (5 words)

RTF (real-time factor) is wall time divided by audio duration, so lower is faster and 1x is real time.

| Config | model.pte | codec.pte | Frames | Audio | Wall time | RTF | Parakeet transcript |
|--------|-----------|-----------|--------|-------|-----------|-----|---------------------|
| FP32 XNNPACK | 15.5 GB | 610 MB | 40 | 3.20s | 15.3s | 4.8x | Hello, how are you today? |
| FP32 portable | 15.5 GB | 748 MB | 40 | 3.20s | 278s | 87x | Hello, how are you today? |
| 8da4w (feed_forward) | 7.0 GB | 610 MB | 43 | 3.44s | ~12s | ~3.5x | Hello, how are you today? |
| 8da8w (all) | 5.7 GB | 610 MB | 44 | 3.52s | ~10s | ~2.8x | Hello, how are you today? |
| 8da4w (all) | 4.3 GB | 610 MB | 33 | 2.64s | ~10s | ~3.8x | Ah hello. How are you today? |
| C reference (OpenBLAS) | N/A | N/A | 40 | 3.20s | ~300s | 94x | Hello, how are you today? |
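
The RTF column can be reproduced from the wall-time and audio columns. A minimal sketch, assuming RTF means wall time over audio duration (which is consistent with the rows above); `real_time_factor` is a name chosen here for illustration:

```python
def real_time_factor(wall_seconds, audio_seconds):
    """Wall-clock time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds


# (wall time, audio duration) in seconds, copied from two rows of the table.
short_prompt_runs = {
    "FP32 XNNPACK": (15.3, 3.20),
    "FP32 portable": (278.0, 3.20),
}

for config, (wall, audio) in short_prompt_runs.items():
    print(f"{config}: RTF {real_time_factor(wall, audio):.1f}x")
```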

## Long prompt: 541 chars / 90 words (paragraph)

Input text:
> The quick brown fox jumps over the lazy dog near the old stone bridge that
> crosses the winding river. Birds sing melodiously in the tall oak trees as
> the morning sun casts golden rays across the peaceful meadow. A gentle breeze
> carries the sweet scent of wildflowers through the valley, while distant
> church bells chime softly in the background. Children laugh and play in the
> nearby park, their joyful voices echoing through the neighborhood. The world
> feels calm and beautiful on this perfect spring morning, filled with warmth
> and wonder.

ExecuTorch configs ran with `--max_new_tokens 300` (= 24s audio at 12.5 Hz).
The C reference ran uncapped and produced 403 frames (32.2s), capturing the
full text. The ExecuTorch runs hit the 300-frame cap and truncated the last
~2 sentences. This text needs about 403 frames, so use `--max_new_tokens 500`
to avoid truncation for texts of this length.
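
The frame arithmetic behind the cap can be checked with a few lines. This sketch assumes one generated token per codec frame at 12.5 frames per second, as the paragraph above implies; the function names are illustrative only.

```python
import math

FRAME_RATE_HZ = 12.5  # codec frame rate stated in the text above


def frames_to_seconds(frames):
    """Audio duration produced by a given number of frames."""
    return frames / FRAME_RATE_HZ


def frames_for_seconds(seconds):
    """Smallest frame budget that covers the requested duration."""
    return math.ceil(seconds * FRAME_RATE_HZ)


print(frames_to_seconds(300))    # duration allowed by --max_new_tokens 300
print(frames_to_seconds(403))    # duration of the uncapped C reference run
print(frames_for_seconds(32.2))  # frame budget needed for the full paragraph
```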

| Config | model.pte | Frames | Audio | Wall time | RTF | Parakeet transcript |
|--------|-----------|--------|-------|-----------|-----|---------------------|
| FP32 XNNPACK | 15.5 GB | 300 | 24.0s | 77s | 3.2x | Perfect through "Children laugh and play." |
| 8da4w (feed_forward) | 7.0 GB | 300 | 24.0s | 64s | 2.6x | Perfect through "...in the nearby park." |
| 8da8w (all) | 5.7 GB | 300 | 24.0s | 45s | 1.9x | "One" for "The" at start; otherwise perfect |
| 8da4w (all) | 4.3 GB | 300 | 24.0s | 49s | 2.0x | Perfect through "...in the background." |
| C reference (OpenBLAS) | N/A | 403 | 32.2s | 2508s | 77.9x | Full text perfect (no frame cap) |
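
Comparing the short- and long-prompt rows for the same config gives a rough split between fixed cost and per-frame cost. The sketch below assumes wall time is approximately `fixed + per_frame * frames`, which is a simplification; it uses only the two FP32 XNNPACK runs (40 frames in 15.3s, 300 frames in 77s), so the result is indicative at best.

```python
def fit_fixed_and_per_frame(run_a, run_b):
    """Two-point linear fit of wall time against frame count.

    Each run is a (frames, wall_seconds) pair. Returns (fixed, per_frame).
    """
    (frames_a, wall_a), (frames_b, wall_b) = run_a, run_b
    if frames_a == frames_b:
        raise ValueError("runs must have different frame counts")
    per_frame = (wall_b - wall_a) / (frames_b - frames_a)
    fixed = wall_a - per_frame * frames_a
    return fixed, per_frame


# FP32 XNNPACK: short prompt and long prompt rows.
fixed, per_frame = fit_fixed_and_per_frame((40, 15.3), (300, 77.0))
print(f"fixed overhead ~{fixed:.1f}s, per-frame cost ~{per_frame:.3f}s")
# At 12.5 frames per second, per-frame cost alone would give this RTF:
print(f"steady-state RTF ~{per_frame * 12.5:.1f}x")
```

Under that assumption the fixed part is a larger share of a 40-frame run than of a 300-frame run, which would explain the lower RTF on the long prompt.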

### Audio quality metrics (long prompt)

| Config | RMS | Peak amplitude [min, max] |
|--------|-----|---------------------------|
| FP32 XNNPACK | 0.0136 | [-0.182, 0.215] |
| 8da4w (feed_forward) | 0.0130 | [-0.142, 0.140] |
| 8da8w (all) | 0.0104 | [-0.127, 0.156] |
| 8da4w (all) | 0.0117 | [-0.120, 0.119] |
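
How these figures were computed is not stated. One plausible way to get comparable numbers from an output WAV, assuming 16-bit PCM with samples scaled to [-1, 1], is sketched below; `wav_stats` is a hypothetical helper, not a script from the repository.

```python
import array
import math
import sys
import wave


def wav_stats(path):
    """Return (rms, min_sample, max_sample) with samples scaled to [-1, 1].

    Assumes 16-bit PCM; all channels are pooled together.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        raw = wav.readframes(wav.getnframes())
    samples = array.array("h", raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        raise ValueError("no audio samples")
    scaled = [s / 32768.0 for s in samples]
    rms = math.sqrt(sum(x * x for x in scaled) / len(scaled))
    return rms, min(scaled), max(scaled)
```

Calling `wav_stats("output.wav")` on each config's output would yield one row of a table like the one above.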

## Key observations

1. **XNNPACK is roughly 18–40x faster than the C reference and the portable backend**
   on the same CPU. At FP32 on the short prompt it takes 15.3s versus 278s (portable)
   and ~300s (C reference), about 18–20x. By RTF on the long prompt the gap is ~24x at
   FP32 (3.2x vs. 77.9x) and ~41x for `8da8w` (1.9x vs. 77.9x). The speedup is
   attributed to optimized XNNPACK kernels for matmul and convolution.

2. **Quantization reduces model size 2.2–3.6x** with minimal quality impact:
   - `8da4w feed_forward` is the recommended config (2.2x smaller, transcripts correct on both prompts)
   - `8da8w` is the fastest (RTF 1.9x on the long prompt, 2.7x smaller) with good quality; it substituted one word ("One" for "The")
   - `8da4w all` is the smallest (3.6x smaller) but inserted a spurious "Ah" on the short prompt

3. **RTF improves (drops) with longer texts** due to amortized model loading and warmup:
   - Short prompt: RTF ~2.8–4.8x across the XNNPACK configs
   - Long prompt: RTF 1.9–3.2x

4. **FP32 produces bit-identical codes to the C reference** when using the
   matching xorshift64+Box-Muller RNG (verified by `diff -q` on per-frame code
   dumps for the short prompt).
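
For readers unfamiliar with the RNG named in observation 4, here is a generic xorshift64 generator feeding a Box-Muller transform. It is a sketch only: the shift triple (13, 7, 17), the zero-seed fallback, and the mapping from 64-bit integers to uniforms are assumptions made here, and bit-identical output with the C reference would require copying that implementation's exact choices.

```python
import math

MASK64 = (1 << 64) - 1


class XorShift64Gaussian:
    """Generic xorshift64 + Box-Muller sampler (illustrative, not the C reference)."""

    def __init__(self, seed):
        # xorshift state must be nonzero; the fallback constant is arbitrary.
        self.state = (seed & MASK64) or 0x9E3779B97F4A7C15

    def next_u64(self):
        x = self.state
        x ^= (x << 13) & MASK64
        x ^= x >> 7
        x ^= (x << 17) & MASK64
        self.state = x
        return x

    def uniform(self):
        # Map the top 53 bits to (0, 1] so that log(u) is always defined.
        return ((self.next_u64() >> 11) + 1) / (1 << 53)

    def normal(self):
        # Box-Muller: two uniforms in, one standard normal out.
        u1 = self.uniform()
        u2 = self.uniform()
        return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


rng = XorShift64Gaussian(seed=42)
noise = [rng.normal() for _ in range(4)]
print(noise)
```

The relevant property for matching a reference implementation is determinism: the same seed and the same arithmetic yield the same noise sequence on every platform, which a library RNG such as `mt19937` paired with a library normal distribution does not guarantee across implementations.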

## vllm-omni comparison (not runnable on this machine)

This benchmark was run on a CPU-only devserver. The [vllm-omni](https://github.com/vllm-project/vllm-omni)
reference implementation requires a CUDA GPU (A100/H100 recommended) and typically
achieves sub-1x RTF (real-time or faster). To compare:

```bash
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv pip install gradio==5.50
python examples/online_serving/voxtral_tts/gradio_demo.py \
  --host <your-server-url> --port 8000
```

ExecuTorch's value proposition is **on-device inference without GPU dependency**:
1.9–3.2x RTF on CPU alone for the long prompt.

## Reproducing

```bash
conda activate executorch
# Replace <sha> with the snapshot hash of your local download.
VOXTRAL_DIR="$HOME/.cache/huggingface/hub/models--mistralai--Voxtral-4B-TTS-2603/snapshots/<sha>"

# Export (pick one): FP32, 8da4w (feed_forward), or 8da8w (all)
python export_voxtral_tts.py --model-path "$VOXTRAL_DIR" --backend xnnpack --output-dir ./exports
python export_voxtral_tts.py --model-path "$VOXTRAL_DIR" --backend xnnpack --qlinear 8da4w --decoder-qlinear-scope feed_forward --output-dir ./exports
python export_voxtral_tts.py --model-path "$VOXTRAL_DIR" --backend xnnpack --qlinear 8da8w --output-dir ./exports

# Build
cmake --workflow --preset llm-release
cd examples/models/voxtral_tts && cmake --workflow --preset voxtral-tts-xnnpack && cd ../../..

# Run
./cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \
  --model ./exports/model.pte \
  --codec ./exports/codec_decoder.pte \
  --tokenizer "$VOXTRAL_DIR/tekken.json" \
  --voice "$VOXTRAL_DIR/voice_embedding/neutral_female.pt" \
  --text "Hello, how are you today?" \
  --output output.wav --seed 42 --max_new_tokens 300

# Verify with parakeet STT
python examples/models/voxtral_tts/transcribe_parakeet.py \
  --audio output.wav \
  --parakeet-runner ./cmake-out/examples/models/parakeet/parakeet_runner \
  --parakeet-model examples/models/parakeet/parakeet_tdt_exports/model.pte \
  --parakeet-tokenizer examples/models/parakeet/parakeet_tdt_exports/tokenizer.model
```
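
The transcript check can be turned into a numeric gate by scoring the STT output against the input text. The internals of `transcribe_parakeet.py` are not shown here, so the following is only a hypothetical sketch: `normalize`, `word_error_rate`, and any pass threshold are assumptions, not the script's actual behavior.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and split into words."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference text has no words")
    # Levenshtein distance over words, keeping one previous row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


prompt = "Hello, how are you today?"
print(word_error_rate(prompt, "Hello, how are you today?"))
print(word_error_rate(prompt, "Ah hello. How are you today?"))
```

With this scoring, the extra "Ah" from the `8da4w (all)` short-prompt run counts as one inserted word against a five-word reference.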
