Skip to content

dingausmwald/Qwen3-TTS-Openai-Fastapi

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

66 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Qwen3-TTS OpenAI-compatible Server

An OpenAI-compatible TTS server for Qwen3-TTS with real-time audio streaming, voice cloning, and torch.compile optimizations.

Based on groxaxo/Qwen3-TTS-Openai-Fastapi and dffdeeq/Qwen3-TTS-streaming.

What this fork adds

  • Real-time audio streamingstream: true in the request body yields PCM chunks as the model generates them, instead of waiting for the full audio. Works for both built-in voices and voice cloning.
  • Voice library — Save voice profiles and use them with voice: "clone:MyVoice". The server auto-switches between CustomVoice and Base models as needed.
  • Voice prompt caching — Speaker embeddings are computed once per profile and reused across requests. Saves ~0.7s per voice clone request.
  • torch.compile + CUDA graphs — Configurable via config.yaml. The optimized backend compiles both streaming and non-streaming code paths on startup.
  • GPU keepalive — Periodic matmul to prevent AMD DPM from downclocking the GPU after idle. Keeps TTFB consistent at ~0.3s instead of 0.85s after idle.
  • Pipecat voice agent — Full voice assistant pipeline with WebRTC, Parakeet STT, LLM, and streaming TTS. End-to-end latency ~2s (down from ~7s). See install/pipecat/.

Compatibility

Platform Status
AMD ROCm (tested: Strix Halo gfx1151) Tested, optimized
NVIDIA CUDA (Ampere, Ada, Blackwell) Should work, not yet verified
CPU Use groxaxo's pytorch_cpu backend instead

The core features (streaming API, voice library, caching) are platform-agnostic. torch.compile and CUDA graphs work on both CUDA and ROCm.

NVIDIA users: if you encounter issues, please open an issue. The streaming library (dffdeeq) was originally developed and benchmarked on RTX 5090.

Quick Start

# NVIDIA
docker compose up -d

# AMD ROCm
docker compose -f docker-compose.rocm.yml up -d

Models are downloaded from HuggingFace automatically on first start (~2.4 GB each). First startup takes ~75 seconds for torch.compile warmup.

The server listens on port 8880 by default.

Usage

OpenAI-compatible API

# Non-streaming (compatible with OpenWebUI, SillyTavern, etc.)
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello world!", "voice": "alloy", "model": "tts-1"}' \
  --output speech.mp3

# Real-time streaming (for low-latency applications)
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello world!", "voice": "alloy", "model": "tts-1", "stream": true, "response_format": "pcm"}' \
  --output speech.pcm

# Voice cloning (auto-switches to Base model)
curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello world!", "voice": "clone:MyVoice", "model": "tts-1-en"}' \
  --output speech.mp3

Language hints

Append a language code to the model name for language-specific synthesis:

Model Language
tts-1-en English
tts-1-de German
tts-1-zh Chinese
tts-1-ja Japanese
tts-1-ko Korean
tts-1-fr French
tts-1-es Spanish
tts-1-ru Russian
tts-1-pt Portuguese
tts-1-it Italian

Voice mapping

OpenAI voice Qwen3-TTS speaker
alloy Vivian
echo Ryan
fable Serena
nova Aiden
onyx Eric
shimmer Dylan

Configuration

The server reads ~/qwen3-tts/config.yaml (or $TTS_CONFIG).

Key settings:

default_model: 0.6B-CustomVoice

optimization:
  compile_mode: max-autotune     # torch.compile mode
  use_cuda_graphs: true          # CUDA graph capture
  streaming:
    decode_window_frames: 72     # AMD: use 72 (not 64 or 80). NVIDIA: 64 or 80 also work.
    emit_every_frames: 6         # Lower = lower TTFB, higher RTF cost. See tuning table below.

See config.yaml for all options with comments.

Performance

Benchmarked on AMD Radeon 8060S (Strix Halo), 0.6B model, after warmup:

Mode RTF TTFB Notes
Non-streaming, CustomVoice 0.53-0.54x full generation time Best for OpenWebUI
Non-streaming, Voice Clone 0.55-0.56x full generation time With cached prompt
Streaming, CustomVoice 0.57x ~0.3s 6-8 chunks per request
Streaming, Voice Clone 0.57x ~0.3s With cached prompt

RTF = Real-Time Factor. Lower is faster. 0.54x means audio is generated ~1.85x faster than real-time.

Streaming tuning

emit_every_frames controls the trade-off between latency (TTFB) and throughput (RTF), with decode_window_frames fixed at 72:

Config (window/emit) TTFB RTF RTF overhead vs best
72/4 0.225s 0.655 +16%
72/6 ~0.3s ~0.62 ~10%
72/8 0.385s 0.606 +7%
72/12 0.546s 0.576 +2%
72/24 1.028s 0.564 baseline

Lower emit_every = lower TTFB but higher RTF cost due to more redundant decoding. 72/6 is a good balance for real-time voice assistants (Pipecat).

AMD ROCm note: decode_window_frames values 66, 67, and 71 trigger a CUDA graph capture bug (5-10x slowdown). Use 72. Values 64, 80 also problematic on some setups.

Voice Assistant Pipeline

The install/pipecat/ directory contains a complete voice assistant that combines STT, LLM, and TTS into a real-time WebRTC pipeline:

Browser (WebRTC)
    |
    v
Pipecat Pipeline
    |
    ├── STT: Parakeet TDT 0.6B v3 (ONNX INT8, CPU, ~0.2s)
    ├── LLM: any OpenAI-compatible API (llamaswap, vLLM, etc.)
    └── TTS: Qwen3-TTS streaming (GPU, TTFB ~0.3s)

End-to-end latency: ~2 seconds (user stops speaking → bot starts speaking), down from ~7 seconds through:

  • Parakeet STT on CPU via groxaxo's FastAPI wrapper (replaces local Whisper, 20x+ realtime)
  • Small LLM for fast TTFB (Ministral 3B Q8_0: ~0.5s TTFB)
  • Streaming TTS with GPU keepalive (consistent 0.3s TTFB)
  • SanitizedLLMService: merges consecutive user messages after voice interruptions (prevents template errors with strict chat templates like Ministral)

See install/pipecat/INSTALL.md for setup instructions.

Architecture

Client (OpenWebUI, Pipecat, curl, ...)
    |
    v
FastAPI Router (openai_compatible.py)
    |  - OpenAI API compatibility
    |  - Voice library (clone: prefix)
    |  - Ref audio caching
    |  - Stream/non-stream routing
    v
Optimized Backend (optimized_backend.py)
    |  - Model switching (CustomVoice <-> Base)
    |  - torch.compile + CUDA graphs
    |  - Voice prompt caching
    |  - GPU keepalive (prevents idle downclocking)
    v
dffdeeq/Qwen3-TTS-streaming
    |  - stream_generate_pcm()
    |  - stream_generate_custom_voice() (our patch)
    |  - create_voice_clone_prompt()
    v
Qwen3-TTS Model (HuggingFace)

Project structure

api/
  backends/
    optimized_backend.py    # Optimized backend: compile, cache, stream, model switch
    factory.py              # Backend selection (optimized / official / vllm / pytorch / openvino)
  routers/
    openai_compatible.py    # OpenAI API endpoints with streaming + voice library
  structures/
    schemas.py              # Request/response schemas
  services/                 # From groxaxo: text processing, audio encoding
install/                    # Canonical deployment files
  INSTALL.md                # Detailed installation guide
  config.yaml               # Server configuration reference
  groxaxo/                  # Modified API files (optimized backend, router, etc.)
  dffdeeq/                  # Patched fork files (CUDA graph fixes, streaming)
  pipecat/                  # Voice agent (app, config, requirements, start script)
config.yaml                 # Server configuration
start_server.sh             # Auto-detects AMD/NVIDIA, sets env vars
Dockerfile                  # NVIDIA CUDA
Dockerfile.rocm             # AMD ROCm
docker-compose.yml          # NVIDIA
docker-compose.rocm.yml     # AMD

Files in api/services/, api/main.py, and groxaxo's other backends come from groxaxo upstream. The qwen_tts/ library is from dffdeeq with patches for CustomVoice streaming and CUDA graph fixes applied directly.

Manual installation (without Docker)

# 1. Clone this repo
git clone https://github.com/dingausmwald/Qwen3-TTS-Openai-Fastapi.git
cd Qwen3-TTS-Openai-Fastapi

# 2. Install dependencies
pip install -r requirements.txt
pip install -e .

# 3. Install flash attention
# NVIDIA:
pip install flash-attn --no-build-isolation
# AMD:
FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE pip install flash-attn --no-build-isolation

# 4. Copy config
mkdir -p ~/qwen3-tts/voice_library
cp config.yaml ~/qwen3-tts/config.yaml

# 5. Start
./start_server.sh

Voice Library

To use voice cloning with saved profiles, create a profile directory:

~/qwen3-tts/voice_library/profiles/
  my_voice/
    meta.json
    reference.wav

meta.json format:

{
  "name": "MyVoice",
  "profile_id": "my_voice",
  "ref_audio_filename": "reference.wav",
  "ref_text": "Optional transcript of the reference audio.",
  "x_vector_only_mode": false,
  "language": "English",
  "task_type": "Base"
}

Then use voice: "clone:MyVoice" in API requests. The server auto-switches to the Base model for voice cloning.

Acknowledgments

License

Apache-2.0, same as the upstream projects.

About

Qwen3-TTS is an open-source series of TTS models developed by the Qwen team at Alibaba Cloud, supporting stable, expressive, and streaming speech generation, free-form voice design, and vivid voice cloning.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 96.1%
  • HTML 3.5%
  • Other 0.4%