a ~4B parameter text-to-speech model that produces 24 kHz mono audio from
text. Weights are loaded directly from the HuggingFace safetensors
checkpoint. Supports CPU (portable + XNNPACK) and CUDA backends. With `--streaming`, the CUDA 4w export runs at **RTF 0.31x on RTX 5080 — 3× faster than real-time** with 2.6 s time-to-first-audio.
## Overview
Validated on A100, `seed=42`, `"Hello, how are you today?"`:
|`--backend cuda`| 15.8 GB | 11.5 s | 178 s | 51x | FP32 weights, codec on portable CPU |
|**`--backend cuda --qlinear 4w`**|**3.4 GB**|**2.1 s**|**3.7 s**|**0.88x** ⚡ | int4 weights, codec on CUDA |
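The RTF column is the real-time factor: wall-clock synthesis time divided by the duration of the generated audio, so values below 1x are faster than real time. A small sketch of that arithmetic (the ~4.2 s clip length is inferred from the table row, not stated in the source):

```python
def rtf(synthesis_s: float, audio_s: float) -> float:
    """Real-time factor: wall-clock synthesis time over audio duration.
    Values < 1.0 mean synthesis outruns playback."""
    return synthesis_s / audio_s

# The cuda + 4w row: 3.7 s of synthesis at RTF 0.88x implies a clip of
# roughly 3.7 / 0.88 ≈ 4.2 s (inferred, not stated in the table).
print(round(rtf(3.7, 4.2), 2))
```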
### Streaming
`--streaming` emits codec chunks as they are decoded rather than batching the
full audio at the end. The first chunk covers only ~0.4 s of audio (keeping the
prefill delay short), then 2 s chunks follow continuously. This decouples
time-to-first-audio from total synthesis length and enables live piped playback.
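The consumer side of this can be sketched as a loop that pulls fixed-size PCM chunks off a pipe and times the first one separately from the whole stream. A minimal sketch, assuming 16-bit mono PCM at 24 kHz and a hypothetical `play` callback (neither is specified by the source):

```python
import io
import time

CHUNK_SAMPLES = 2 * 24_000      # 2 s chunks at 24 kHz
BYTES_PER_SAMPLE = 2            # assuming 16-bit mono PCM output

def consume(stream, play=lambda pcm: None):
    """Read PCM chunks as they arrive; return (ttfa, total) wall-clock times."""
    start = time.monotonic()
    ttfa = None
    while chunk := stream.read(CHUNK_SAMPLES * BYTES_PER_SAMPLE):
        if ttfa is None:
            ttfa = time.monotonic() - start  # time-to-first-audio
        play(chunk)                          # e.g. hand off to an audio device
    return ttfa, time.monotonic() - start

# Toy run over an in-memory buffer standing in for the decoder's output pipe.
fake_pcm = io.BytesIO(b"\x00" * CHUNK_SAMPLES * BYTES_PER_SAMPLE * 3)
ttfa, total = consume(fake_pcm)
```

Because playback can start as soon as `ttfa` is reached, total synthesis length only affects how long the stream keeps flowing, not how soon audio begins.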
Measured on RTX 5080 (sm_120, warm Triton autotune cache):