Skip to content

Latest commit

 

History

History
252 lines (194 loc) · 10.3 KB

File metadata and controls

252 lines (194 loc) · 10.3 KB

E2E Voice - End-to-End Voice Dialogue System

A C++17 end-to-end intelligent voice dialogue system integrating ASR (Speech Recognition), LLM (Large Language Model), TTS (Text-to-Speech), VAD (Voice Activity Detection), and AEC (Full-Duplex Echo Cancellation).

e2e_voice

中文

Important

Branch Notice

  • master branch: Retains the original monolithic codebase and is no longer updated
  • refactor branch: The current active development branch — all future development is based on this branch

Key Improvements (master → refactor)

Based on 288 file changes (net reduction of ~1.36 million lines of code):

  1. Modular Architecture — 6 core modules (audio / stt / tts / vad / llm / mcp) split into independent Git submodules, enabling standalone development, testing, and reuse
  2. Full-Duplex Voice Conversation — New WebRTC AEC echo cancellation pipeline (voice_chat_aec) with Barge-in support, replacing the previous half-duplex approach
  3. Removed Bloat — Eliminated embedded third_party sources (cppjieba, cpp-pinyin, cpp-mcp, etc.), replaced with CMake FetchContent for on-demand fetching
  4. Unified API Convention — Each module provides a standardized *_api.hpp public header and API.md documentation
  5. Python Bindings — audio / stt / tts / vad provide pybind11 interfaces, installable via pip install -e .
  6. Legacy Cleanup — Removed old Python scripts, legacy C++ examples, build.sh, and other obsolete files

Features

  • Full-Duplex Conversation — WebRTC AEC echo cancellation + noise suppression with Barge-in support
  • Offline Speech Recognition — SenseVoice ONNX, supporting Chinese / English / Japanese / Korean / Cantonese
  • Multi-Backend TTS — Matcha-TTS (Chinese / English / Mixed) + Kokoro (multi-voice)
  • LLM Integration — Ollama local inference / OpenAI-compatible cloud APIs, streaming output
  • MCP Tool Calling — Extend LLM capabilities via Model Context Protocol (optional)
  • Modular Architecture — Independent audio / stt / tts / vad / llm / mcp modules, usable standalone
  • Python Bindings — pybind11 Python interfaces for audio / stt / tts / vad

Architecture

Full-Duplex AEC Pipeline

voice_chat_aec Pipeline

Modular Architecture

Module Path Description Python Package
Audio modules/audio/ Audio capture / playback / full-duplex / resampling evo_audio
STT modules/stt/ SenseVoice speech recognition evo_asr
TTS modules/tts/ Matcha / Kokoro speech synthesis evo_tts
VAD modules/vad/ Silero voice activity detection evo_vad
LLM modules/llm/ Ollama / OpenAI-compatible API client
MCP modules/mcp/ MCP client SDK (stdio / socket / HTTP)

Dependencies

Ubuntu / DebianmacOS
sudo apt install gcc g++ cmake pkg-config \
  libportaudio-dev libsndfile1-dev \
  libcurl4-openssl-dev libfftw3-dev \
  libssl-dev espeak-ng libabsl-dev

ONNX Runtime must be installed manually, see ONNX Runtime Releases:

wget https://github.com/microsoft/onnxruntime/releases/download/v1.20.0/onnxruntime-linux-x64-1.20.0.tgz
tar -xzf onnxruntime-linux-x64-1.20.0.tgz
sudo cp -r onnxruntime-linux-x64-1.20.0/include/* /usr/local/include/
sudo cp -r onnxruntime-linux-x64-1.20.0/lib/* /usr/local/lib/
sudo ldconfig
brew install gcc cmake pkg-config \
  portaudio libsndfile curl fftw \
  onnxruntime espeak openssl abseil

Local LLM (Ollama):

curl -fsSL https://ollama.ai/install.sh | sh
ollama pull qwen2.5:0.5b

Build & Run

Build

git clone --recursive https://github.com/muggle-stack/e2e_Voice.git
cd e2e_Voice
mkdir -p build && cd build
cmake .. && make -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)

CMake Options:

Option Default Description
USE_AEC ON WebRTC echo cancellation (full-duplex barge-in)
USE_MCP OFF Model Context Protocol tool calling
# Enable MCP tool calling
cmake -DUSE_MCP=ON .. && make -j$(nproc 2>/dev/null || sysctl -n hw.ncpu)

Note: AEC requires building the WebRTC module first: cd modules/webrtc-audio-processing && meson build && ninja -C build

Run

./build/bin/voice_chat_aec                                     # Default configuration
./build/bin/voice_chat_aec --tts kokoro --model qwen2.5:7b     # Kokoro TTS + larger model
./build/bin/voice_chat_aec -l                                  # List audio devices
./build/bin/voice_chat_aec --mcp-config mcp.json               # Enable tool calling
./build/bin/voice_chat_aec --llm-url https://api.example.com/v1/chat/completions  # Cloud LLM

Runtime Options

Argument Description Default
-i, --input-device <id> Input device index System default
-o, --output-device <id> Output device index System default
-l, --list-devices List available audio devices
--tts <engine> TTS backend (matcha:zh / matcha:en / matcha:zh-en / kokoro / kokoro:<voice>) matcha:zh
--model <name> LLM model name qwen2.5:0.5b
--llm-url <url> LLM API URL (OpenAI-compatible) Ollama local
--no-aec Disable echo cancellation
--no-ns Disable noise suppression
--agc Enable automatic gain control Disabled
--aec-delay <ms> AEC delay compensation 50
--buffer-frames <n> Audio buffer frame count macOS 480 / Linux 960
--sample-rate <hz> Audio sample rate 48000
--save-audio [file] Save AEC-processed audio aec_debug.wav
--mcp-config <path> MCP config file (enable tool calling)

MCP Tool Calling (Optional)

Extend LLM tool-calling capabilities via MCP (Model Context Protocol). Requires USE_MCP=ON at build time.

Quick Start:

# Install Python dependencies
pip install mcp starlette uvicorn psutil flask

# Start example MCP services (registry + Calculator + TimeService + SystemMonitor)
cd modules/mcp/examples && bash start_all_services.sh

# Start voice chat, auto-discover tools from the registry
./build/bin/voice_chat_aec --mcp-config modules/mcp/examples/config_registry.json

Edit config_registry.json to switch the LLM backend used for MCP tool calling:

{
  "backend": "ollama",                    // "ollama" (local) or "llama" (OpenAI-compatible API)
  "url": "http://localhost:11434",        // LLM API URL
  "model": "qwen2.5:7b",                 // Model name
  "registry_url": "http://127.0.0.1:9000/mcp/services"
}

Custom Extensions:

You can write your own MCP services and register them with the registry server (default port 9000). voice_chat_aec will auto-discover new tools without restarting:

  1. Write a Python service using FastMCP
  2. Register on startup: --registry http://127.0.0.1:9000
  3. The LLM can now call your tools during voice conversation

See MCP module documentation for config file format and transport options.

Component Demos

Each module provides standalone executable demos:

./build/bin/stt_test audio.wav       # File-based speech recognition
./build/bin/simple_demo --text "你好" # TTS speech synthesis
./build/bin/vad_simple_demo          # VAD voice activity detection
./build/bin/audio_demo -l            # Audio device listing

Python Bindings

Four core modules provide pybind11 Python interfaces:

cd modules/audio/python && pip install -e .   # evo_audio
cd modules/stt/python   && pip install -e .   # evo_asr
cd modules/tts/python   && pip install -e .   # evo_tts
cd modules/vad/python   && pip install -e .   # evo_vad

See each module's README for detailed usage.

Model Cache

Models are automatically downloaded to ~/.cache/ on first run:

~/.cache/
├── sensevoice/                          # ASR (SenseVoice ONNX)
├── matcha-tts/
│   ├── matcha-icefall-zh-baker/         # Chinese TTS (22050Hz)
│   ├── matcha-icefall-en_US-ljspeech/   # English TTS (22050Hz)
│   ├── matcha-icefall-zh-en/            # Chinese-English mixed TTS (16000Hz)
│   ├── vocos-22khz-univ.onnx           # Vocoder (Chinese/English)
│   └── vocos-16khz-univ.onnx           # Vocoder (Chinese-English mixed)
└── silero_vad.onnx                      # VAD (Silero)

License

MIT License — See LICENSE

Acknowledgements