Cake supports text-to-speech synthesis with voice cloning.
| Model | Parameters | VRAM | Speed | Architecture | Voice Ref |
|---|---|---|---|---|---|
| LuxTTS | ~123M | <1 GB | 150x realtime (GPU) | Zipformer + flow matching | .wav file (optional) |
| VibeVoice-1.5B | ~3B (BF16) | ~7 GB | ~1x realtime | Qwen2.5 LM + diffusion | .wav file |
| VibeVoice-Realtime-0.5B | ~0.5B (F32) | ~3 GB | ~3x realtime | Qwen2.5 LM + diffusion | .safetensors preset |
LuxTTS is a lightweight ZipVoice-based TTS model using a Zipformer encoder and a flow matching decoder with a 4-step Euler solver. It produces 48kHz audio at 150x realtime on GPU and supports distributed inference: the 16 FM decoder layers can be sharded across workers.
Original model: YatharthS/LuxTTS (PyTorch). Pre-converted safetensors for Cake: evilsocket/luxtts.
Pull the pre-converted weights:

```bash
cake pull evilsocket/luxtts
```

If you prefer to convert from the original PyTorch weights (YatharthS/LuxTTS):

```bash
cake pull YatharthS/LuxTTS

python scripts/convert_luxtts.py \
    --model-dir ~/.cache/huggingface/hub/models--YatharthS--LuxTTS/snapshots/*/ \
    --output-dir /path/to/luxtts-converted
```

The conversion script produces `model.safetensors`, `vocos.safetensors`, `config.json`, and `tokens.txt`.
LuxTTS uses IPA phoneme tokens. The built-in tokenizer provides a basic rule-based English-to-IPA fallback. For best quality, pre-compute IPA token IDs using an external phonemizer (e.g. espeak-ng via piper_phonemize) and pass them via --tts-token-ids:
```bash
# Generate IPA token IDs (example using Python)
python -c "
from piper_phonemize import phonemize_espeak

# Build the IPA-token -> id map from tokens.txt
tokens_map = {}
with open('tokens.txt') as f:
    for line in f:
        tok, idx = line.strip().rsplit('\t', 1)
        tokens_map[tok] = int(idx)

# Phonemize the text and wrap it in the begin/end markers
ipa = ''.join(phonemize_espeak('Hello world', 'en-us')[0])
ids = [tokens_map.get(c, 0) for c in '^' + ipa + '\$']
print(' '.join(str(i) for i in ids))
" > token_ids.txt
```
```bash
# Generate audio using the pre-computed tokens
cake run evilsocket/luxtts \
    "Hello world" \
    --tts-token-ids token_ids.txt \
    --tts-speed 0.16 \
    --tts-diffusion-steps 4 \
    --audio-output output.wav \
    --dtype f32
```

Or synthesize directly from text using the built-in fallback tokenizer:

```bash
cake run evilsocket/luxtts \
    "Hello world, this is a test." \
    --audio-output output.wav \
    --tts-speed 0.16 \
    --tts-diffusion-steps 4 \
    --dtype f32
```

Provide a 24kHz mono WAV reference audio for voice cloning:
```bash
cake run evilsocket/luxtts \
    "Hello world" \
    --tts-reference-audio voice_sample.wav \
    --audio-output output.wav \
    --tts-speed 0.16 --tts-diffusion-steps 4 --dtype f32
```

The FM decoder layers can be split across workers:
```yaml
# topology-luxtts.yml
worker1:
  host: "192.168.1.100:10128"
  layers:
    - "fm_decoder.layers.0"
    - "fm_decoder.layers.1"
    - "fm_decoder.layers.2"
    - "fm_decoder.layers.3"
    - "fm_decoder.layers.4"
    - "fm_decoder.layers.5"
    - "fm_decoder.layers.6"
    - "fm_decoder.layers.7"
# Master keeps fm_decoder.layers.8-15 + text encoder + vocoder
```
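For a different number of workers, the layer assignments can be generated instead of written by hand. The sketch below is a hypothetical helper (not part of Cake): it splits the 16 `fm_decoder.layers.*` entries evenly between the listed workers and the master, with placeholder worker names and hosts.

```python
# Hypothetical helper: write a topology YAML that spreads the 16 LuxTTS
# FM decoder layers across workers, keeping one share on the master.
workers = {
    "worker1": "192.168.1.100:10128",
    "worker2": "192.168.1.101:10128",
}

TOTAL_LAYERS = 16                                # LuxTTS FM decoder depth
per_worker = TOTAL_LAYERS // (len(workers) + 1)  # +1 leaves a share on the master

with open("topology-luxtts.yml", "w") as f:
    for i, (name, host) in enumerate(workers.items()):
        f.write(f"{name}:\n")
        f.write(f'  host: "{host}"\n')
        f.write("  layers:\n")
        for layer in range(i * per_worker, (i + 1) * per_worker):
            f.write(f'    - "fm_decoder.layers.{layer}"\n')
# The remaining layers, the text encoder and the vocoder stay on the master.
```

Then start the worker and the master: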
```bash
# Worker
cake run evilsocket/luxtts --name worker1 \
    --topology topology-luxtts.yml --address 0.0.0.0:10128

# Master
cake run evilsocket/luxtts \
    "Hello world" \
    --topology topology-luxtts.yml \
    --audio-output output.wav \
    --tts-speed 0.16 --tts-diffusion-steps 4 --dtype f32
```

LuxTTS accepts the following arguments:

| Argument | Default | Description |
|---|---|---|
| `<prompt>` | (required) | Text to synthesize (positional argument to `cake run`) |
| `--audio-output` | `output.wav` | Output WAV file path |
| `--tts-diffusion-steps` | 10 | Euler solver steps (4 recommended for distilled LuxTTS) |
| `--tts-speed` | 1.0 | Speed factor (lower = longer audio; 0.16 ≈ 6 frames/token) |
| `--tts-t-shift` | 1.0 | Time shift for Euler solver schedule |
| `--tts-reference-audio` | - | Path to 24kHz mono WAV for voice cloning |
| `--tts-token-ids` | - | Pre-computed IPA token IDs file (space-separated) |
| `--dtype` | `f16` | Use `f32` for best quality on CPU |
LuxTTS consists of three components:
- Text encoder (4 Zipformer layers, dim=192) — converts IPA phoneme tokens to acoustic features. Always runs on master.
- FM decoder (16 Zipformer layers across 5 multi-resolution stacks, dim=512) — flow matching decoder that denoises random noise into mel features over 4 Euler steps. Shardable across workers.
- Vocos vocoder (ConvNeXt backbone + ISTFT head) — converts mel features to 24kHz waveform, then upsampled to 48kHz. Always runs on master.
The FM decoder uses a U-Net-style multi-resolution structure with downsampling factors [1, 2, 4, 2, 1] and layer counts [2, 2, 4, 4, 4] = 16 total layers.
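As a mental model of those 4 Euler steps, here is a minimal, self-contained sketch of a flow matching sampler. `velocity` is only a stub standing in for the FM decoder, the shapes are placeholders, and the uniform time schedule ignores `--tts-t-shift`.

```python
import numpy as np

def velocity(x, t, cond):
    # Stub for the FM decoder: in LuxTTS this is the 16-layer Zipformer
    # predicting the velocity field conditioned on text-encoder features.
    return cond - x

def euler_sample(cond, steps=4):
    # Integrate dx/dt = v(x, t) from t=0 (Gaussian noise) to t=1 (mel features).
    x = np.random.randn(*cond.shape)
    ts = np.linspace(0.0, 1.0, steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * velocity(x, t0, cond)
    return x  # in the real model these mel features go to the Vocos vocoder

mel = euler_sample(np.zeros((100, 80)), steps=4)  # placeholder shape: 100 frames x 80 bins
```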
VibeVoice-1.5B is the recommended model. It supports multi-speaker voice cloning from raw .wav files.
```bash
cake run evilsocket/VibeVoice-1.5B \
    "Hello world, this is a test of the voice cloning system." \
    --model-type audio-model \
    --voice-prompt voice_reference.wav \
    --audio-output output.wav
```

The model and tokenizer are downloaded automatically from HuggingFace on first run.
Provide a .wav file of the target speaker (any sample rate, mono or stereo — automatically converted to 24kHz mono). Longer references (10-30 seconds) produce better voice cloning. The audio is encoded through the acoustic tokenizer to create a voice embedding.
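Cake handles the resampling itself, but if you want to inspect or trim a reference beforehand, a small script along these lines produces a clean 24kHz mono clip (librosa and soundfile are assumptions here, not Cake dependencies):

```python
import librosa
import soundfile as sf

# Load any audio file, downmix to mono and resample to 24 kHz
audio, sr = librosa.load("raw_reference.wav", sr=24000, mono=True)

# Keep at most ~30 seconds; 10-30 second references tend to clone best
audio = audio[: 30 * sr]

sf.write("voice_reference.wav", audio, sr)
print(f"wrote {len(audio) / sr:.1f}s at {sr} Hz")
```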
Pre-made voice presets are available in the VibeVoice community repo:
```bash
# Download a voice preset
wget https://raw.githubusercontent.com/vibevoice-community/VibeVoice/main/demo/voices/en-Alice_woman.wav

# Use it
cake run evilsocket/VibeVoice-1.5B \
    "Your text here" \
    --model-type audio-model \
    --voice-prompt en-Alice_woman.wav \
    --audio-output output.wav
```

VibeVoice-1.5B accepts the following arguments:

| Argument | Default | Description |
|---|---|---|
| `--voice-prompt` | (required) | Path to .wav voice reference file |
| `<prompt>` | (required) | Text to synthesize (positional argument to `cake run`) |
| `--audio-output` | `output.wav` | Output WAV file path |
| `--tts-diffusion-steps` | 10 | Diffusion steps per speech frame (higher = better quality, slower) |
| `--tts-cfg-scale` | 1.5 | Classifier-free guidance scale (1.0-3.0) |
| `--max-audio-frames` | 150 | Maximum speech frames (~133ms each at 7.5Hz) |
VibeVoice-1.5B uses autoregressive next-token prediction with three special tokens:
- `speech_start`: marks the beginning of a speech segment
- `speech_diffusion`: triggers diffusion-based acoustic latent generation
- `speech_end`: marks the end of a speech segment
Each speech_diffusion token generates one acoustic frame (64-dim latent) which is decoded to ~133ms of 24kHz audio. The model uses classifier-free guidance with a negative prompt for quality.
The feedback loop encodes each generated audio chunk back through both the acoustic and semantic encoders, combining the features as input for the next token prediction.
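Putting those pieces together, the generation step looks roughly like the pseudocode below. Every name in it (`lm`, `diffusion_head`, `acoustic_decoder`, and so on) is illustrative, not Cake's actual API; it is only meant to show where the special tokens, the CFG-guided diffusion, and the feedback loop fit.

```python
def generate_speech(lm, diffusion_head, acoustic_decoder, acoustic_encoder,
                    semantic_encoder, tokens, cfg_scale=1.5,
                    diffusion_steps=10, max_frames=150):
    # Illustrative pseudocode for the VibeVoice-1.5B loop described above.
    audio_chunks = []
    for _ in range(max_frames):
        next_tok = lm.predict_next(tokens)
        if next_tok == "speech_end":              # speech segment finished
            break
        if next_tok == "speech_diffusion":
            # Denoise one 64-dim acoustic latent conditioned on the LM state,
            # blending conditional and negative-prompt predictions (CFG).
            latent = diffusion_head.sample(lm.hidden_state(),
                                           steps=diffusion_steps,
                                           cfg_scale=cfg_scale)
            chunk = acoustic_decoder.decode(latent)   # ~133 ms of 24 kHz audio
            audio_chunks.append(chunk)
            # Feedback loop: re-encode the chunk through both encoders and
            # feed the combined features back in for the next prediction.
            feedback = acoustic_encoder(chunk) + semantic_encoder(chunk)
            tokens = tokens + [("speech_diffusion", feedback)]
        else:
            tokens = tokens + [next_tok]
    return audio_chunks
```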
VibeVoice-Realtime-0.5B is a lightweight streaming variant optimized for real-time TTS. It uses a split LM architecture (4-layer base + 20-layer TTS) and pre-computed voice prompt KV caches instead of raw audio encoding.
```bash
cake run evilsocket/VibeVoice-Realtime-0.5B \
    "Hello world" \
    --model-type audio-model \
    --voice-prompt voice_preset.safetensors \
    --audio-output output.wav \
    --tts-cfg-scale 1.5
```

Voice presets for the 0.5B model are .safetensors files containing pre-computed KV caches. These can be found in the VibeVoice community repo as .pt files (PyTorch format) and converted to safetensors.
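A minimal conversion sketch, assuming the .pt preset loads as a flat dict of tensors (nested entries would need flattening first); the file names are placeholders:

```python
import torch
from safetensors.torch import save_file

# Load the PyTorch voice preset (assumed to be a {name: tensor} dict)
preset = torch.load("voice_preset.pt", map_location="cpu")

# safetensors needs a flat mapping of contiguous tensors
tensors = {k: v.contiguous() for k, v in preset.items() if torch.is_tensor(v)}

save_file(tensors, "voice_preset.safetensors")
print(f"wrote {len(tensors)} tensors")
```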
When using cake serve, the audio endpoint is available at /v1/audio/speech:
```bash
# Start the API server
cake serve evilsocket/VibeVoice-1.5B --model-type audio-model \
    --voice-prompt voice_reference.wav

# Generate speech
curl http://localhost:8080/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"input": "Hello from the API"}' \
    -o output.wav
```

Voice cloning via API (base64-encoded WAV reference):
```bash
VOICE_B64=$(base64 -w0 voice_reference.wav)

curl http://localhost:8080/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d "{\"input\": \"Hello with a cloned voice.\", \"voice_data\": \"$VOICE_B64\"}" \
    -o output.wav
```

See the full REST API Reference for all parameters and response formats.
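The same request can be made from Python with the requests library, using the fields documented above:

```python
import base64
import requests

# Base64-encode the voice reference, as in the curl example above
with open("voice_reference.wav", "rb") as f:
    voice_b64 = base64.b64encode(f.read()).decode("ascii")

resp = requests.post(
    "http://localhost:8080/v1/audio/speech",
    json={"input": "Hello with a cloned voice.", "voice_data": voice_b64},
    timeout=120,
)
resp.raise_for_status()

with open("output.wav", "wb") as f:
    f.write(resp.content)
```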
The model variant (1.5B vs 0.5B) is auto-detected from config.json:

- `model_type: "vibevoice"` selects VibeVoice-1.5B
- `model_type: "vibevoice_streaming"` selects VibeVoice-Realtime-0.5B