
Add llama-cpp-python backend for CosyVoice3#1872

Open
Ferraronp wants to merge 3 commits into FunAudioLLM:main from Ferraronp:main

Conversation

@Ferraronp

Summary

Adds an optional llama-cpp-python inference backend for CosyVoice3, enabling CPU and low-VRAM inference with GGUF-quantized models.

Changes

  • cosyvoice/cli/cosyvoice.py: Added load_llama_cpp and gguf_model_path parameters to CosyVoice3.__init__. When load_llama_cpp=True, inference_zero_shot, inference_cross_lingual, and inference_instruct2 are overridden to run the LLM through llama.cpp. Both streaming and non-streaming modes are supported.
  • cosyvoice/cli/model.py: Added tts_with_external_tokens and tts_stream_external_llm methods, so the rest of the synthesis pipeline can consume speech tokens generated outside the PyTorch LLM (see the sketch after this list).
  • README.md: Added llama-cpp-python backend installation and usage instructions.
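
For orientation, here is a minimal sketch of what the llama.cpp token-generation path can look like using llama-cpp-python's public API. The sampling values, the placeholder prompt, and the hand-off to tts_with_external_tokens are illustrative assumptions, not the PR's actual code:

```python
from llama_cpp import Llama

# Load the GGUF-quantized CosyVoice3 LLM; n_gpu_layers=-1 offloads all
# layers to the GPU when one is available, 0 keeps inference on the CPU.
llm = Llama(model_path='/path/to/model.gguf', n_gpu_layers=-1,
            n_ctx=4096, verbose=False)

text = 'prompt with text and speech-token context'  # illustrative placeholder
prompt_ids = llm.tokenize(text.encode('utf-8'))

# Sample speech tokens autoregressively; the sampling values are illustrative.
speech_token_ids = []
for tok in llm.generate(prompt_ids, temp=1.0, top_p=0.8, top_k=25):
    if tok == llm.token_eos():
        break
    speech_token_ids.append(tok)

# The PR's tts_with_external_tokens would then turn these tokens into audio
# via the flow-matching and vocoder stages (exact signature not shown here).
```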

Usage

```python
cosyvoice = AutoModel(
    model_dir='pretrained_models/Fun-CosyVoice3-0.5B',
    load_llama_cpp=True,
    gguf_model_path='/path/to/model.gguf'
)
```

All existing inference methods (inference_zero_shot, inference_cross_lingual, inference_instruct2) work unchanged.
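
For example, zero-shot synthesis follows the same pattern as the repository README (the texts are placeholders; this assumes the cosyvoice object from the snippet above and the sample_rate attribute the CosyVoice classes expose):

```python
import torchaudio
from cosyvoice.utils.file_utils import load_wav

# Zero-shot synthesis runs unchanged; the llama.cpp backend is transparent.
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)
for i, out in enumerate(cosyvoice.inference_zero_shot(
        'Text to synthesize.', 'Transcript of the prompt audio.',
        prompt_speech_16k, stream=False)):
    torchaudio.save(f'zero_shot_{i}.wav', out['tts_speech'], cosyvoice.sample_rate)
```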

Performance (NVIDIA T4, fp16)

| Backend | Avg RTF |
| --- | --- |
| PyTorch fp16 (original) | ~1.17 |
| llama-cpp-python F16 GGUF | ~0.45 |

~2.6× faster inference on a T4 GPU (1.17 / 0.45 ≈ 2.6).
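
Here RTF is wall-clock synthesis time divided by the duration of the generated audio, so lower is better. A minimal sketch of the measurement, reusing the objects from the snippets above:

```python
import time
import torch

# Measure RTF = synthesis wall-clock time / generated audio duration.
start = time.perf_counter()
chunks = [out['tts_speech'] for out in cosyvoice.inference_zero_shot(
    'Text to synthesize.', 'Transcript of the prompt audio.',
    prompt_speech_16k, stream=False)]
elapsed = time.perf_counter() - start

audio_seconds = torch.cat(chunks, dim=1).shape[1] / cosyvoice.sample_rate
print(f'RTF = {elapsed / audio_seconds:.2f}')
```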

Pre-converted GGUF models

Available on Hugging Face: Ferraronp/CosyVoice3-qwen2.5-0.5b-speech-gguf
Converter: Ferraronp/CosyVoice-gguf-converter

Notes

  • Only supported for CosyVoice3 / Fun-CosyVoice3-0.5B
  • Requires pip install llama-cpp-python
  • When load_llama_cpp=True, PyTorch LLM weights are not loaded to save VRAM

