E2E Voice - End-to-End Voice Dialogue System

A C++17 end-to-end intelligent voice dialogue system integrating ASR (Speech Recognition), LLM (Large Language Model), TTS (Text-to-Speech), VAD (Voice Activity Detection), and AEC (Full-Duplex Echo Cancellation).

中文

Important

Branch Notice

master branch: Retains the original monolithic codebase and is no longer updated
refactor branch: The current active development branch — all future development is based on this branch

Key Improvements (master → refactor)

Based on 288 file changes (net reduction of ~1.36 million lines of code):

Modular Architecture — 6 core modules (audio / stt / tts / vad / llm / mcp) split into independent Git submodules, enabling standalone development, testing, and reuse
Full-Duplex Voice Conversation — New WebRTC AEC echo cancellation pipeline (voice_chat_aec) with Barge-in support, replacing the previous half-duplex approach
Removed Bloat — Eliminated embedded third_party sources (cppjieba, cpp-pinyin, cpp-mcp, etc.), replaced with CMake FetchContent for on-demand fetching
Unified API Convention — Each module provides a standardized *_api.hpp public header and API.md documentation
Python Bindings — audio / stt / tts / vad provide pybind11 interfaces, installable via pip install -e .
Legacy Cleanup — Removed old Python scripts, legacy C++ examples, build.sh, and other obsolete files

Features

Full-Duplex Conversation — WebRTC AEC echo cancellation + noise suppression with Barge-in support
Offline Speech Recognition — SenseVoice ONNX, supporting Chinese / English / Japanese / Korean / Cantonese
Multi-Backend TTS — Matcha-TTS (Chinese / English / Mixed) + Kokoro (multi-voice)
LLM Integration — Ollama local inference / OpenAI-compatible cloud APIs, streaming output
MCP Tool Calling — Extend LLM capabilities via Model Context Protocol (optional)
Modular Architecture — Independent audio / stt / tts / vad / llm / mcp modules, usable standalone
Python Bindings — pybind11 Python interfaces for audio / stt / tts / vad

Architecture

Full-Duplex AEC Pipeline

Modular Architecture

Module	Path	Description	Python Package
Audio	`modules/audio/`	Audio capture / playback / full-duplex / resampling	`evo_audio`
STT	`modules/stt/`	SenseVoice speech recognition	`evo_asr`
TTS	`modules/tts/`	Matcha / Kokoro speech synthesis	`evo_tts`
VAD	`modules/vad/`	Silero voice activity detection	`evo_vad`
LLM	`modules/llm/`	Ollama / OpenAI-compatible API client	—
MCP	`modules/mcp/`	MCP client SDK (stdio / socket / HTTP)	—

Dependencies

Ubuntu / Debian	macOS
sudo apt install gcc g++ cmake pkg-config \ libportaudio-dev libsndfile1-dev \ libcurl4-openssl-dev libfftw3-dev \ libssl-dev espeak-ng libabsl-dev ONNX Runtime must be installed manually, see ONNX Runtime Releases: wget https://github.com/microsoft/onnxruntime/releases/download/v1.20.0/onnxruntime-linux-x64-1.20.0.tgz tar -xzf onnxruntime-linux-x64-1.20.0.tgz sudo cp -r onnxruntime-linux-x64-1.20.0/include/* /usr/local/include/ sudo cp -r onnxruntime-linux-x64-1.20.0/lib/* /usr/local/lib/ sudo ldconfig	brew install gcc cmake pkg-config \ portaudio libsndfile curl fftw \ onnxruntime espeak openssl abseil

Ubuntu / Debian

macOS

sudo apt install gcc g++ cmake pkg-config \
  libportaudio-dev libsndfile1-dev \
  libcurl4-openssl-dev libfftw3-dev \
  libssl-dev espeak-ng libabsl-dev

ONNX Runtime must be installed manually, see ONNX Runtime Releases:

wget https://github.com/microsoft/onnxruntime/releases/download/v1.20.0/onnxruntime-linux-x64-1.20.0.tgz
tar -xzf onnxruntime-linux-x64-1.20.0.tgz
sudo cp -r onnxruntime-linux-x64-1.20.0/include/* /usr/local/include/
sudo cp -r onnxruntime-linux-x64-1.20.0/lib/* /usr/local/lib/
sudo ldconfig

brew install gcc cmake pkg-config \
  portaudio libsndfile curl fftw \
  onnxruntime espeak openssl abseil

Local LLM (Ollama):

curl -fsSL https://ollama.ai/install.sh | sh
ollama pull qwen2.5:0.5b

Build & Run

Build

git clone --recursive https://github.com/muggle-stack/e2e_Voice.git
cd e2e_Voice
mkdir -p build && cd build
cmake .. && make -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)

CMake Options:

Option	Default	Description
`USE_AEC`	`ON`	WebRTC echo cancellation (full-duplex barge-in)
`USE_MCP`	`OFF`	Model Context Protocol tool calling

# Enable MCP tool calling
cmake -DUSE_MCP=ON .. && make -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)

Note: AEC requires building the WebRTC module first: cd modules/webrtc-audio-processing && meson build && ninja -C build

Run

./build/bin/voice_chat_aec                                     # Default configuration
./build/bin/voice_chat_aec --tts kokoro --model qwen2.5:7b     # Kokoro TTS + larger model
./build/bin/voice_chat_aec -l                                  # List audio devices
./build/bin/voice_chat_aec --mcp-config mcp.json               # Enable tool calling
./build/bin/voice_chat_aec --llm-url https://api.example.com/v1/chat/completions  # Cloud LLM

Runtime Options

Argument	Description	Default
`-i`, `--input-device <id>`	Input device index	System default
`-o`, `--output-device <id>`	Output device index	System default
`-l`, `--list-devices`	List available audio devices	—
`--tts <engine>`	TTS backend (`matcha:zh` / `matcha:en` / `matcha:zh-en` / `kokoro` / `kokoro:<voice>`)	`matcha:zh`
`--model <name>`	LLM model name	`qwen2.5:0.5b`
`--llm-url <url>`	LLM API URL (OpenAI-compatible)	Ollama local
`--no-aec`	Disable echo cancellation	—
`--no-ns`	Disable noise suppression	—
`--agc`	Enable automatic gain control	Disabled
`--aec-delay <ms>`	AEC delay compensation	`50`
`--buffer-frames <n>`	Audio buffer frame count	macOS `480` / Linux `960`
`--sample-rate <hz>`	Audio sample rate	`48000`
`--save-audio [file]`	Save AEC-processed audio	`aec_debug.wav`
`--mcp-config <path>`	MCP config file (enable tool calling)	—

MCP Tool Calling (Optional)

Extend LLM tool-calling capabilities via MCP (Model Context Protocol). Requires USE_MCP=ON at build time.

Quick Start:

# Install Python dependencies
pip install mcp starlette uvicorn psutil flask

# Start example MCP services (registry + Calculator + TimeService + SystemMonitor)
cd modules/mcp/examples && bash start_all_services.sh

# Start voice chat, auto-discover tools from the registry
./build/bin/voice_chat_aec --mcp-config modules/mcp/examples/config_registry.json

Edit config_registry.json to switch the LLM backend used for MCP tool calling:

{
  "backend": "ollama",                    // "ollama" (local) or "llama" (OpenAI-compatible API)
  "url": "http://localhost:11434",        // LLM API URL
  "model": "qwen2.5:7b",                 // Model name
  "registry_url": "http://127.0.0.1:9000/mcp/services"
}

Custom Extensions:

You can write your own MCP services and register them with the registry server (default port 9000). voice_chat_aec will auto-discover new tools without restarting:

Write a Python service using FastMCP
Register on startup: --registry http://127.0.0.1:9000
The LLM can now call your tools during voice conversation

See MCP module documentation for config file format and transport options.

Component Demos

Each module provides standalone executable demos:

./build/bin/stt_test audio.wav       # File-based speech recognition
./build/bin/simple_demo --text "你好" # TTS speech synthesis
./build/bin/vad_simple_demo          # VAD voice activity detection
./build/bin/audio_demo -l            # Audio device listing

Python Bindings

Four core modules provide pybind11 Python interfaces:

cd modules/audio/python && pip install -e .   # evo_audio
cd modules/stt/python   && pip install -e .   # evo_asr
cd modules/tts/python   && pip install -e .   # evo_tts
cd modules/vad/python   && pip install -e .   # evo_vad

See each module's README for detailed usage.

Model Cache

Models are automatically downloaded to ~/.cache/ on first run:

~/.cache/
├── sensevoice/                          # ASR (SenseVoice ONNX)
├── matcha-tts/
│   ├── matcha-icefall-zh-baker/         # Chinese TTS (22050Hz)
│   ├── matcha-icefall-en_US-ljspeech/   # English TTS (22050Hz)
│   ├── matcha-icefall-zh-en/            # Chinese-English mixed TTS (16000Hz)
│   ├── vocos-22khz-univ.onnx           # Vocoder (Chinese/English)
│   └── vocos-16khz-univ.onnx           # Vocoder (Chinese-English mixed)
└── silero_vad.onnx                      # VAD (Silero)

License

MIT License — See LICENSE

Acknowledgements

ONNX Runtime — High-performance inference engine
llama.cpp — High-performance LLM inference engine
Ollama — Local LLM runtime and management
SenseVoice — Speech recognition model
Matcha-TTS — Text-to-speech model
Kokoro — Multi-voice text-to-speech
Vocos — Vocoder
Silero VAD — Voice activity detection
WebRTC Audio Processing — Echo cancellation / noise suppression
PortAudio — Cross-platform audio I/O
cppjieba — C++ Chinese word segmentation
cpp-pinyin — Chinese to Pinyin conversion
espeak-ng — English phonemization
nlohmann/json — JSON library
pybind11 — Python bindings
dengcunqin — Chinese-English Matcha model

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

E2E Voice - End-to-End Voice Dialogue System

Features

Architecture

Full-Duplex AEC Pipeline

Modular Architecture

Dependencies

Build & Run

Build

Run

Runtime Options

MCP Tool Calling (Optional)

Component Demos

Python Bindings

Model Cache

License

Acknowledgements

FilesExpand file tree

README_EN.md

Latest commit

History

README_EN.md

File metadata and controls

E2E Voice - End-to-End Voice Dialogue System

Features

Architecture

Full-Duplex AEC Pipeline

Modular Architecture

Dependencies

Build & Run

Build

Run

Runtime Options

MCP Tool Calling (Optional)

Component Demos

Python Bindings

Model Cache

License

Acknowledgements