Commit 2b7d6fe

Add speed priors, text-length embedding selection, and README improvements
Adopted from PR #9 by @BenjaminKobjolke:
- Speed priors from model config.json for per-voice speed correction
- Text-length-based embedding row selection for ONNX2 models (varied prosody)
- Accept both named voices and raw expr-voice-* identifiers via API
- Added .gitattributes for line ending normalization

README updates:
- Added emojis to all section headers matching Chatterbox README style
- Removed Raspberry Pi 4 references (unsupported)
1 parent d9b2d34 commit 2b7d6fe
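
The commit message names three behaviours (speed priors, text-length-based embedding row selection, and acceptance of both named voices and raw `expr-voice-*` identifiers), but the code file that implements them is not shown on this page. The following is only a minimal sketch of how such logic could look, not the commit's actual implementation. The prior values, the alias table, the function names, and the rule "row index = text length, clamped to the last row" are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for per-voice speed priors read from a model's config.json.
SPEED_PRIORS: Dict[str, float] = {"expr-voice-2-f": 0.8}

# Hypothetical mapping from friendly voice names to raw identifiers.
VOICE_ALIASES: Dict[str, str] = {"Bella": "expr-voice-2-f"}


def resolve_voice(voice: str) -> str:
    """Accept either a friendly name or a raw expr-voice-* identifier."""
    if voice.startswith("expr-voice-"):
        return voice
    if voice in VOICE_ALIASES:
        return VOICE_ALIASES[voice]
    raise ValueError(f"unknown voice: {voice!r}")


def corrected_speed(voice_id: str, requested: float) -> float:
    """Scale the requested speed by the voice's prior (1.0 when none is set)."""
    return requested * SPEED_PRIORS.get(voice_id, 1.0)


def select_embedding_row(text: str, rows: Sequence[List[float]]) -> List[float]:
    """Pick one embedding row by text length, clamped to the last row (assumed rule)."""
    if not rows:
        raise ValueError("voice has no embedding rows")
    index = min(len(text), len(rows) - 1)
    return rows[index]
```

Under these assumptions, short and long inputs land on different rows of the same voice, which is one way the "varied prosody" mentioned in the message could arise.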

3 files changed

Lines changed: 88 additions & 60 deletions

.gitattributes

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Normalize all text files to LF in the repository
+* text=auto eol=lf
+
+# Explicitly mark binary files
+*.png binary
+*.jpg binary
+*.jpeg binary
+*.gif binary
+*.ico binary
+*.pdf binary
+*.zip binary
+*.gz binary
+*.tar binary

README.md

Lines changed: 36 additions & 49 deletions
@@ -2,7 +2,7 @@

 **Self-host the ultra-lightweight [KittenTTS model](https://github.com/KittenML/KittenTTS) with this enhanced API server. Now supports all 7 KittenTTS models — from the tiny 15M Nano to the 80M Mini — with hot-swappable model switching, an intuitive Web UI, a flexible API, large text processing for audiobooks, and high-performance GPU acceleration.**

-This server provides a robust, user-friendly, and powerful interface for the KittenTTS engine family, an open-source, realistic text-to-speech system. This project significantly enhances the original model by adding a full-featured server, an easy-to-use UI, and an optimized inference pipeline for hardware ranging from NVIDIA GPUs to CPUs and even the Raspberry Pi 5 (RP5) and Raspberry Pi 4 (RP4).
+This server provides a robust, user-friendly, and powerful interface for the KittenTTS engine family, an open-source, realistic text-to-speech system. This project significantly enhances the original model by adding a full-featured server, an easy-to-use UI, and an optimized inference pipeline for hardware ranging from NVIDIA GPUs to CPUs and even the Raspberry Pi 5.

 [![Project Link](https://img.shields.io/badge/GitHub-devnen/Kitten--TTS--Server-blue?style=for-the-badge&logo=github)](https://github.com/devnen/Kitten-TTS-Server)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](LICENSE)
@@ -21,30 +21,30 @@ This server provides a robust, user-friendly, and powerful interface for the Kit

 ---

-## What's New
+## 🆕 What's New

-### Complete KittenTTS model family support (new)
+### 🎯 Complete KittenTTS model family support (new)

 - Added full support for **all 7 KittenTTS models** across three model sizes (Nano, Micro, Mini) and two generations (v0.1/v0.2 and v0.8).
 - Models range from the ultra-compact **15M-parameter Nano** to the high-quality **80M-parameter Mini**, all running on ONNX for maximum portability.
 - v0.8 models feature improved expressivity, quantized INT8 variants for minimal footprint, and named voices (Bella, Jasper, Luna, Bruno, Rosie, Hugo, Kiki, Leo).

-### Hot-swappable model switching (new)
+### 🔁 Hot-swappable model switching (new)

 - Added a new **model selector** dropdown at the top of the Web UI.
 - All 7 models are **hot-swappable** — select from the dropdown, click "Apply & Restart", and the backend automatically downloads (if needed), unloads the current model, and loads your choice. No server restart required.
 - A **progress modal** with real-time status updates shows download and loading progress, so you always know what's happening.
 - Models are **downloaded automatically** from Hugging Face on first use and cached locally in the project's `model_cache` directory for instant subsequent loads.
 - **Cancellation support** — if you change your mind during a download, select a different model and the current load is cancelled automatically.

-### Named voices for all models (new)
+### 🎤 Named voices for all models (new)

 - All models now use **human-friendly voice names** instead of technical identifiers.
 - v0.1/v0.2 models: **Amber**, **Felix**, **Clara**, **Marcus**, **Ivy**, **Oscar**, **Nora**, **Reed** (4 female, 4 male).
 - v0.8 models: **Bella**, **Jasper**, **Luna**, **Bruno**, **Rosie**, **Hugo**, **Kiki**, **Leo** (4 female, 4 male).
 - The voice dropdown automatically updates when you switch models.

-### Complete model lineup
+### Complete model lineup

 **You now have access to the entire KittenTTS family:**

@@ -62,7 +62,7 @@ This server provides a robust, user-friendly, and powerful interface for the Kit

 ---

-## Overview: Enhanced KittenTTS Generation
+## 🗣️ Overview: Enhanced KittenTTS Generation

 The [KittenTTS model by KittenML](https://github.com/KittenML/KittenTTS) provides a foundation for generating high-quality speech from models smaller than 80MB. This project elevates that foundation into a production-ready service by providing a robust [FastAPI](https://fastapi.tiangolo.com/) server that makes KittenTTS significantly easier to use, more powerful, and drastically faster.

@@ -77,17 +77,15 @@ We solve the complexity of setting up and running the model by offering:
 * **Cross-platform support** for Windows and Linux, with clear setup instructions.
 * **Docker support** for easy, reproducible containerized deployment.

-## Raspberry Pi & Edge Device Support
+## 🍓 Raspberry Pi & Edge Device Support

 The ultra-lightweight nature of the KittenTTS model and the efficiency of this server make it a perfect candidate for running on single-board computers (SBCs) and other edge devices.

-* **Raspberry Pi 5 (RP5):** Confirmed to run with **excellent performance**. The server is fast and responsive, easily handling requests from other devices on the same local network (LAN). This makes it ideal for local network services, home automation, and other DIY projects.
-
-* **Raspberry Pi 4 (RP4):** Testing is currently in progress. Not working on the 32-bit Raspberry Pi OS.
+* **Raspberry Pi 5 (RP5):** Confirmed to run with **excellent performance**. The server is fast and responsive, easily handling requests from other devices on the same local network (LAN). This makes it ideal for local network services, home automation, and other DIY projects.

 To install, simply follow the standard **Linux installation guide** provided in this README.

-## GPU Acceleration included
+## 🔥 GPU Acceleration included

 A standout feature of this server is the implementation of **high-performance GPU acceleration**, a capability not available in the original KittenTTS project. While the base model is CPU-only, this server unlocks the full potential of your hardware.

@@ -97,7 +95,7 @@ A standout feature of this server is the implementation of **high-performance GP

 This enhancement transforms KittenTTS from a lightweight-but-modest engine into a high-speed synthesis powerhouse.

-## Alternative to Piper TTS
+## 🔄 Alternative to Piper TTS

 The [KittenTTS model](https://github.com/KittenML/KittenTTS) serves as an excellent alternative to [Piper TTS](https://github.com/rhasspy/piper) for fast generation on limited compute and edge devices like Raspberry Pi 5.

@@ -112,34 +110,34 @@ While KittenTTS provides the ultra-lightweight foundation, this server transform

 Perfect for users seeking Piper's offline capabilities with better performance on limited hardware and modern server infrastructure.

-## Key Features of This Server
+## Key Features of This Server

-* **Multi-Model Support:** All 7 KittenTTS models (Nano, Micro, Mini across v0.1/v0.2/v0.8) with hot-swappable switching from the UI.
-* **Automatic Model Download:** Models are downloaded from Hugging Face on first use and cached locally.
-* **True GPU Acceleration:** Full support for **NVIDIA (CUDA)** via an optimized `onnxruntime-gpu` pipeline with I/O Binding for maximum performance.
-* **Large Text & Audiobook Generation:**
+* 🔁 **Multi-Model Support:** All 7 KittenTTS models (Nano, Micro, Mini across v0.1/v0.2/v0.8) with hot-swappable switching from the UI.
+* 📦 **Automatic Model Download:** Models are downloaded from Hugging Face on first use and cached locally.
+* **True GPU Acceleration:** Full support for **NVIDIA (CUDA)** via an optimized `onnxruntime-gpu` pipeline with I/O Binding for maximum performance.
+* 📚 **Large Text & Audiobook Generation:**
 * Automatically handles long texts by intelligently splitting them based on sentence boundaries.
 * Processes each chunk individually and seamlessly concatenates the resulting audio.
 * **Ideal for audiobooks** - paste entire books and get professional-quality audio.
-* **Modern Web Interface:**
+* 🖥️ **Modern Web Interface:**
 * Intuitive UI for text input, model selection, voice selection, and parameter adjustment.
 * Real-time waveform visualization of generated audio.
 * Progress modal for model downloads with real-time status updates.
-* **Named Voices:**
+* 🎤 **Named Voices:**
 * Up to 8 named voices per model (4 male, 4 female).
 * Voice list updates automatically when switching models.
-* **Dual API Endpoints:**
+* ⚙️ **Dual API Endpoints:**
 * A primary `/tts` endpoint offering full control over all generation parameters.
 * An OpenAI-compatible `/v1/audio/speech` endpoint for seamless integration into existing workflows.
-* **Easy Configuration:**
+* 🔧 **Easy Configuration:**
 * All settings are managed through a single `config.yaml` file.
 * The server automatically creates a default config on the first run.
-* **UI State Persistence:** The web interface remembers your last-used text, voice, and settings to streamline your workflow.
-* **Docker Support:** Easy, reproducible deployment for both CPU and GPU via Docker Compose.
+* 💾 **UI State Persistence:** The web interface remembers your last-used text, voice, and settings to streamline your workflow.
+* 🐳 **Docker Support:** Easy, reproducible deployment for both CPU and GPU via Docker Compose.

 ---

-## System Prerequisites
+## 🔩 System Prerequisites

 * **Operating System:** Windows 10/11 (64-bit) or Linux (Debian/Ubuntu recommended).
 * **Python:** Version 3.10 or later.
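
The "Large Text & Audiobook Generation" bullets in this hunk describe splitting long text on sentence boundaries and processing each chunk separately. The server's own splitter is not part of this diff, so the following is only a rough sketch of the idea. The function name `split_into_chunks`, the `max_chars` limit, and the punctuation-based sentence regex are assumptions, not the project's code.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    # Split after '.', '!' or '?' when followed by whitespace (a simplification).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk, so start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be synthesized on its own and the resulting audio segments concatenated in order, as the bullets describe. A splitter aimed at real books would also need to handle abbreviations and quoted speech, which this regex does not.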
@@ -149,14 +147,13 @@ Perfect for users seeking Piper's offline capabilities with better performance o
 * **Linux:** `sudo apt install espeak-ng`
 * **Raspberry Pi:**
 * Raspberry Pi 5
-* Raspberry Pi 4
 * **(For GPU Acceleration):**
 * An **NVIDIA GPU** with CUDA support.
 * **(For Linux Only):**
 * `libsndfile1`: Audio library needed by `soundfile`. Install via `sudo apt install libsndfile1`.
 * `ffmpeg`: For robust audio operations. Install via `sudo apt install ffmpeg`.

-## Installation and Setup
+## 💻 Installation and Setup

 This project uses specific dependency files and a clear process to ensure a smooth, one-command installation for your hardware.

@@ -281,7 +278,7 @@ pip install -r requirements-nvidia.txt

 ---

-## Running the Server
+## ▶️ Running the Server

 **Important: First-Run Model Download**
 The first time you start the server, it will automatically download the default KittenTTS Nano model (~25MB) from Hugging Face. This is a one-time process. Subsequent launches will be instant. Additional models are downloaded automatically when selected from the Web UI.
@@ -301,11 +298,11 @@ The first time you start the server, it will automatically download the default

 4. **To stop the server:** Press `CTRL+C` in the terminal.

-### **Raspberry Pi 4 & 5 Installation (CPU-Only)**
+### **Raspberry Pi 5 Installation (CPU-Only)**

-KittenTTS runs excellently on Raspberry Pi devices, making it ideal for local network services and DIY projects. However, installation requirements vary significantly between Pi models due to CPU architecture differences.
+KittenTTS runs excellently on Raspberry Pi 5, making it ideal for local network services and DIY projects.

-#### **Raspberry Pi 5 - Full Support**
+#### **Raspberry Pi 5 - Full Support**

 **Raspberry Pi 5 works out-of-the-box** with the standard Linux installation guide above. No special steps required!

@@ -336,17 +333,7 @@ python server.py

 > **Important:** During the `pip install -r requirements.txt` step, some Python packages (especially audio processing libraries like `librosa`, `praat-parselmouth`, and others) may need to be compiled from source on ARM architecture. This process can take **15-30 minutes** depending on your SD card speed and system load. This is normal - let it complete without interruption.

-#### **Raspberry Pi 4 - Limited Support**
-
-**Raspberry Pi 4 support is currently in development** due to complex dependency compilation issues on 32-bit ARM architecture.
-
-**For Raspberry Pi 4 Users:**
-We recommend upgrading to **64-bit Raspberry Pi OS** if possible, as this significantly improves compatibility with modern Python packages. For users requiring 32-bit support, please check our [GitHub Issues](https://github.com/devnen/Kitten-TTS-Server/issues) for the latest progress updates and community-contributed solutions.
-
-**Alternative Recommendation:**
-For the best Raspberry Pi TTS experience, we strongly recommend using a **Raspberry Pi 5** with the standard 64-bit OS, which provides excellent performance and full compatibility.
-
-## Docker Installation
+## 🐳 Docker Installation

 Run Kitten-TTS-Server easily using Docker. The recommended method uses Docker Compose, which is pre-configured for both CPU and NVIDIA GPU deployment.

@@ -428,7 +415,7 @@ docker compose exec kitten-tts-server python -c "import torch; print(f'CUDA avai
 ```
 If `CUDA available:` prints `True`, your GPU setup is working correctly

-## Usage Guide
+## 💡 Usage Guide

 ### Generate Your First Audio

@@ -456,7 +443,7 @@ If `CUDA available:` prints `True`, your GPU setup is working correctly
 5. Click **"Generate Speech"**. The server will process the entire text and stitch the audio together seamlessly.
 6. Download your complete audiobook file.

-## API Documentation
+## 📖 API Documentation

 The server exposes two main endpoints for TTS. See `http://localhost:8005/docs` for an interactive playground.

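
The README text in this diff names an OpenAI-compatible `/v1/audio/speech` endpoint on port 8005, but the request schema falls outside the hunks shown here. As a hedged illustration only, the sketch below builds such a request with Python's standard library. The JSON field names follow OpenAI's speech API and are assumed to be accepted by this server. The `model` value is a placeholder, and `build_speech_request` is a hypothetical helper, not part of the project.

```python
import json
import urllib.request

# Port taken from the README's http://localhost:8005/docs link.
BASE_URL = "http://localhost:8005"


def build_speech_request(
    text: str,
    voice: str = "Bella",
    speed: float = 1.0,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the OpenAI-style endpoint."""
    payload = {
        "model": "kitten-tts",  # placeholder value, an assumption
        "input": text,
        "voice": voice,
        "response_format": "wav",
        "speed": speed,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With a server running, passing the returned object to `urllib.request.urlopen` and writing the response bytes to a file would save the audio. The interactive page at `/docs` is the place to confirm the actual field names.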
@@ -502,7 +489,7 @@ Use this for drop-in compatibility with scripts expecting OpenAI's TTS API struc
 * `POST /restart_server` — Triggers an async model hot-swap based on current config.
 * `POST /api/cancel-loading` — Cancels an in-progress model download/load.

-## Configuration
+## ⚙️ Configuration

 All server settings are managed in the `config.yaml` file. It's created automatically on first launch if it doesn't exist.

@@ -513,7 +500,7 @@ All server settings are managed in the `config.yaml` file. It's created automati
 * `generation_defaults.speed`: Default speech speed (1.0 is normal).
 * `audio_output.format`: Default audio format (`wav`, `mp3`, `opus`).

-## Troubleshooting
+## 🛠️ Troubleshooting

 * **Phonemizer / eSpeak Errors:**
 * This is the most common issue. Ensure you have installed **eSpeak NG** correctly for your OS and **restarted your terminal** afterward. The server includes auto-detection logic for common install paths.
@@ -529,16 +516,16 @@ All server settings are managed in the `config.yaml` file. It's created automati
 * Try clearing the `model_cache` directory and restarting.
 * Large models (Mini 0.1 at ~170MB) may take several minutes on slower connections.

-## Acknowledgements & Credits
+## 🙏 Acknowledgements & Credits

 * **Core Model:** This project is powered by the **[KittenTTS model](https://github.com/KittenML/KittenTTS)** created by **[KittenML](https://github.com/KittenML)**. Our work adds a high-performance server and UI layer on top of their excellent lightweight model.
 * **Core Libraries:** FastAPI, Uvicorn, ONNX Runtime, PyTorch, Hugging Face Hub, Phonemizer.
 * **UI Inspiration:** The UI/server architecture is inspired by our previous work on the [Chatterbox-TTS-Server](https://github.com/devnen/Chatterbox-TTS-Server).

-## License
+## 📄 License

 This project is licensed under the **MIT License**. See the [LICENSE](LICENSE) file for details.

-## Contributing
+## 🤝 Contributing

 Contributions, issues, and feature requests are welcome! Please feel free to open an issue or submit a pull request.
