A C++17 end-to-end intelligent voice dialogue system integrating ASR (Automatic Speech Recognition), LLM (Large Language Model), TTS (Text-to-Speech), VAD (Voice Activity Detection), and AEC (Acoustic Echo Cancellation) for full-duplex conversation.
Important
Branch Notice
- `master` branch: Retains the original monolithic codebase and is no longer updated
- `refactor` branch: The current active development branch; all future development is based on this branch
Key Improvements (master → refactor)
Based on 288 file changes (net reduction of ~1.36 million lines of code):
- Modular Architecture: 6 core modules (audio / stt / tts / vad / llm / mcp) split into independent Git submodules, enabling standalone development, testing, and reuse
- Full-Duplex Voice Conversation: New WebRTC AEC echo cancellation pipeline (`voice_chat_aec`) with barge-in support, replacing the previous half-duplex approach
- Removed Bloat: Eliminated embedded third_party sources (cppjieba, cpp-pinyin, cpp-mcp, etc.), replaced with CMake FetchContent for on-demand fetching
- Unified API Convention: Each module provides a standardized `*_api.hpp` public header and `API.md` documentation
- Python Bindings: audio / stt / tts / vad provide pybind11 interfaces, installable via `pip install -e .`
- Legacy Cleanup: Removed old Python scripts, legacy C++ examples, build.sh, and other obsolete files
- Full-Duplex Conversation — WebRTC AEC echo cancellation + noise suppression with Barge-in support
- Offline Speech Recognition — SenseVoice ONNX, supporting Chinese / English / Japanese / Korean / Cantonese
- Multi-Backend TTS — Matcha-TTS (Chinese / English / Mixed) + Kokoro (multi-voice)
- LLM Integration — Ollama local inference / OpenAI-compatible cloud APIs, streaming output
- MCP Tool Calling — Extend LLM capabilities via Model Context Protocol (optional)
- Modular Architecture — Independent audio / stt / tts / vad / llm / mcp modules, usable standalone
- Python Bindings — pybind11 Python interfaces for audio / stt / tts / vad
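To make the barge-in feature concrete, here is a minimal Python sketch of a turn-taking state machine. It is an illustration only and not the project's implementation: the state names, the event names, and the `next_state` and `run` helpers are all hypothetical. The general idea is that echo cancellation removes the assistant's own playback from the microphone signal, so a detected speech start during playback can be treated as the user interrupting.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()  # waiting for, or capturing, user speech
    THINKING = auto()   # transcript sent to the LLM, waiting for a reply
    SPEAKING = auto()   # synthesized audio is being played back


# Hypothetical event names. A real system would derive them from
# VAD, LLM, and TTS callbacks.
TRANSITIONS = {
    (State.LISTENING, "speech_end"): State.THINKING,
    (State.THINKING, "reply_ready"): State.SPEAKING,
    (State.SPEAKING, "playback_done"): State.LISTENING,
    # Barge-in: the user starts talking while the assistant is still speaking.
    (State.SPEAKING, "speech_start"): State.LISTENING,
}


def next_state(state: State, event: str) -> State:
    """Return the next state; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events) -> State:
    """Feed a sequence of events through the machine, starting in LISTENING."""
    state = State.LISTENING
    for event in events:
        state = next_state(state, event)
    return state
```

In a half-duplex design the microphone would typically be ignored while in `SPEAKING`, so the last transition in the table would not exist.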
| Module | Path | Description | Python Package |
|---|---|---|---|
| Audio | `modules/audio/` | Audio capture / playback / full-duplex / resampling | `evo_audio` |
| STT | `modules/stt/` | SenseVoice speech recognition | `evo_asr` |
| TTS | `modules/tts/` | Matcha / Kokoro speech synthesis | `evo_tts` |
| VAD | `modules/vad/` | Silero voice activity detection | `evo_vad` |
| LLM | `modules/llm/` | Ollama / OpenAI-compatible API client | - |
| MCP | `modules/mcp/` | MCP client SDK (stdio / socket / HTTP) | - |
Ubuntu / Debian:
sudo apt install gcc g++ cmake pkg-config \
libportaudio-dev libsndfile1-dev \
libcurl4-openssl-dev libfftw3-dev \
libssl-dev espeak-ng libabsl-dev
On Ubuntu / Debian, ONNX Runtime must be installed manually (see ONNX Runtime Releases):
wget https://github.com/microsoft/onnxruntime/releases/download/v1.20.0/onnxruntime-linux-x64-1.20.0.tgz
tar -xzf onnxruntime-linux-x64-1.20.0.tgz
sudo cp -r onnxruntime-linux-x64-1.20.0/include/* /usr/local/include/
sudo cp -r onnxruntime-linux-x64-1.20.0/lib/* /usr/local/lib/
sudo ldconfig
macOS:
brew install gcc cmake pkg-config \
portaudio libsndfile curl fftw \
onnxruntime espeak openssl abseil
Local LLM (Ollama):
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull qwen2.5:0.5b
Clone and build:
git clone --recursive https://github.com/muggle-stack/e2e_Voice.git
cd e2e_Voice
mkdir -p build && cd build
cmake .. && make -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)
CMake Options:
| Option | Default | Description |
|---|---|---|
| `USE_AEC` | `ON` | WebRTC echo cancellation (full-duplex barge-in) |
| `USE_MCP` | `OFF` | Model Context Protocol tool calling |
# Enable MCP tool calling
cmake -DUSE_MCP=ON .. && make -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)
Note: AEC requires building the WebRTC module first:
cd modules/webrtc-audio-processing && meson build && ninja -C build
./build/bin/voice_chat_aec # Default configuration
./build/bin/voice_chat_aec --tts kokoro --model qwen2.5:7b # Kokoro TTS + larger model
./build/bin/voice_chat_aec -l # List audio devices
./build/bin/voice_chat_aec --mcp-config mcp.json # Enable tool calling
./build/bin/voice_chat_aec --llm-url https://api.example.com/v1/chat/completions # Cloud LLM
| Argument | Description | Default |
|---|---|---|
| `-i, --input-device <id>` | Input device index | System default |
| `-o, --output-device <id>` | Output device index | System default |
| `-l, --list-devices` | List available audio devices | - |
| `--tts <engine>` | TTS backend (`matcha:zh` / `matcha:en` / `matcha:zh-en` / `kokoro` / `kokoro:<voice>`) | `matcha:zh` |
| `--model <name>` | LLM model name | `qwen2.5:0.5b` |
| `--llm-url <url>` | LLM API URL (OpenAI-compatible) | Ollama local |
| `--no-aec` | Disable echo cancellation | - |
| `--no-ns` | Disable noise suppression | - |
| `--agc` | Enable automatic gain control | Disabled |
| `--aec-delay <ms>` | AEC delay compensation (ms) | 50 |
| `--buffer-frames <n>` | Audio buffer frame count | macOS 480 / Linux 960 |
| `--sample-rate <hz>` | Audio sample rate (Hz) | 48000 |
| `--save-audio [file]` | Save AEC-processed audio | `aec_debug.wav` |
| `--mcp-config <path>` | MCP config file (enable tool calling) | - |
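When `voice_chat_aec` is launched from a script, the arguments above can be assembled programmatically. The following Python sketch shows one way to do that, together with the arithmetic that relates `--buffer-frames` to buffer duration. The `build_command` and `buffer_ms` helpers are hypothetical names introduced here; only the flags themselves come from the table. The duration formula assumes that one "frame" is one sample per channel, which is the usual PortAudio convention but is not confirmed by this README.

```python
import shlex


def build_command(binary="./build/bin/voice_chat_aec", *, tts=None, model=None,
                  llm_url=None, aec=True, ns=True, agc=False,
                  aec_delay_ms=None, buffer_frames=None, sample_rate=None,
                  mcp_config=None):
    """Assemble an argument list from the documented command-line options."""
    cmd = [binary]
    # Value-carrying options are emitted only when set, so the
    # program's own defaults apply otherwise.
    options = [
        ("--tts", tts),
        ("--model", model),
        ("--llm-url", llm_url),
        ("--aec-delay", aec_delay_ms),
        ("--buffer-frames", buffer_frames),
        ("--sample-rate", sample_rate),
        ("--mcp-config", mcp_config),
    ]
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    # Boolean switches.
    if not aec:
        cmd.append("--no-aec")
    if not ns:
        cmd.append("--no-ns")
    if agc:
        cmd.append("--agc")
    return cmd


def buffer_ms(frames: int, sample_rate: int) -> float:
    """Duration of one audio buffer in milliseconds (assumes 1 frame = 1 sample per channel)."""
    return 1000.0 * frames / sample_rate


if __name__ == "__main__":
    print(shlex.join(build_command(tts="kokoro", model="qwen2.5:7b")))
    print(buffer_ms(480, 48000), buffer_ms(960, 48000))
```

Under that assumption, the default buffer sizes at 48000 Hz correspond to 10 ms on macOS (480 frames) and 20 ms on Linux (960 frames). The returned list can be passed directly to `subprocess.run`.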
Extend LLM tool-calling capabilities via MCP (Model Context Protocol). Requires USE_MCP=ON at build time.
Quick Start:
# Install Python dependencies
pip install mcp starlette uvicorn psutil flask
# Start example MCP services (registry + Calculator + TimeService + SystemMonitor)
cd modules/mcp/examples && bash start_all_services.sh
# Start voice chat, auto-discover tools from the registry
./build/bin/voice_chat_aec --mcp-config modules/mcp/examples/config_registry.json
Edit config_registry.json to switch the LLM backend used for MCP tool calling:
{
  "backend": "ollama",                 // "ollama" (local) or "llama" (OpenAI-compatible API)
  "url": "http://localhost:11434",     // LLM API URL
  "model": "qwen2.5:7b",               // Model name
  "registry_url": "http://127.0.0.1:9000/mcp/services"
}
Custom Extensions:
You can write your own MCP services and register them with the registry server (default port 9000). voice_chat_aec will auto-discover new tools without restarting:
- Write a Python service using FastMCP
- Register on startup: `--registry http://127.0.0.1:9000`
- The LLM can now call your tools during voice conversation
See MCP module documentation for config file format and transport options.
Each module provides standalone executable demos:
./build/bin/stt_test audio.wav # File-based speech recognition
./build/bin/simple_demo --text "你好" # TTS speech synthesis
./build/bin/vad_simple_demo # VAD voice activity detection
./build/bin/audio_demo -l # Audio device listing
Four core modules provide pybind11 Python interfaces:
# Run from the repository root
pip install -e ./modules/audio/python # evo_audio
pip install -e ./modules/stt/python # evo_asr
pip install -e ./modules/tts/python # evo_tts
pip install -e ./modules/vad/python # evo_vad
See each module's README for detailed usage.
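After installing, a quick way to confirm which bindings are visible to the current Python environment is to look up each package without importing it. This sketch uses only the standard library; the package names come from the module table, while the `installed_bindings` helper is a hypothetical name. It deliberately does not call into the bindings, since their APIs are documented in each module's README rather than here.

```python
import importlib.util

# Python package names as listed in the module table.
PACKAGES = ("evo_audio", "evo_asr", "evo_tts", "evo_vad")


def installed_bindings(packages=PACKAGES):
    """Map each package name to True if Python can locate it, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    for name, found in installed_bindings().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```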
Models are automatically downloaded to ~/.cache/ on first run:
~/.cache/
├── sensevoice/ # ASR (SenseVoice ONNX)
├── matcha-tts/
│ ├── matcha-icefall-zh-baker/ # Chinese TTS (22050Hz)
│ ├── matcha-icefall-en_US-ljspeech/ # English TTS (22050Hz)
│ ├── matcha-icefall-zh-en/ # Chinese-English mixed TTS (16000Hz)
│ ├── vocos-22khz-univ.onnx # Vocoder (Chinese/English)
│ └── vocos-16khz-univ.onnx # Vocoder (Chinese-English mixed)
└── silero_vad.onnx # VAD (Silero)
MIT License — See LICENSE
- ONNX Runtime — High-performance inference engine
- llama.cpp — High-performance LLM inference engine
- Ollama — Local LLM runtime and management
- SenseVoice — Speech recognition model
- Matcha-TTS — Text-to-speech model
- Kokoro — Multi-voice text-to-speech
- Vocos — Vocoder
- Silero VAD — Voice activity detection
- WebRTC Audio Processing — Echo cancellation / noise suppression
- PortAudio — Cross-platform audio I/O
- cppjieba — C++ Chinese word segmentation
- cpp-pinyin — Chinese to Pinyin conversion
- espeak-ng — English phonemization
- nlohmann/json — JSON library
- pybind11 — Python bindings
- dengcunqin — Chinese-English Matcha model
