Thank you for your interest in contributing to SenseVoice! This guide will help you get started.
- Python 3.8 or higher
- Git
- (Optional) CUDA-compatible GPU for faster inference
- Fork and clone the repository:
git clone https://github.com/<your-username>/SenseVoice.git
cd SenseVoice
- Create a virtual environment:
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Verify the installation:
python -c "from funasr import AutoModel; print('Installation successful')"
If you don't have a GPU, you can run SenseVoice on CPU by setting the device to "cpu":
from funasr import AutoModel

model = AutoModel(
    model="iic/SenseVoiceSmall",
    trust_remote_code=True,
    device="cpu",  # run inference on CPU instead of a GPU
)
For the FastAPI server:
export SENSEVOICE_DEVICE=cpu
fastapi run --port 50000
- Search existing issues to avoid duplicates.
- Use the Bug Report template.
- Include your environment details (OS, Python version, PyTorch version, GPU, CUDA version).
- Provide a minimal code sample to reproduce the issue.
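The environment details requested above can be gathered with a short script. This is a minimal sketch, not a tool shipped with the repository: the function name `collect_environment` is hypothetical, and it assumes PyTorch may or may not be installed in the current environment.

```python
import platform


def collect_environment() -> dict:
    """Gather the environment details that a bug report asks for."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "pytorch": "not installed",
        "gpu": "none detected",
        "cuda": "n/a",
    }
    try:
        import torch  # optional: only needed to report PyTorch, GPU, and CUDA details
    except ImportError:
        return info
    info["pytorch"] = torch.__version__
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
        # torch.version.cuda is None on CPU-only builds of PyTorch
        info["cuda"] = torch.version.cuda or "n/a"
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into the issue covers the OS, Python, PyTorch, GPU, and CUDA fields in one step.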
- Fork the repository and create a new branch from main:
git checkout -b your-branch-name
- Make your changes. Keep commits focused and atomic.
- Test your changes to make sure nothing is broken.
- Push to your fork and open a Pull Request against FunAudioLLM/SenseVoice:main.
- Describe your changes clearly in the PR description. Explain what changed and why.
- Bug fixes: Check open issues labeled "bug" for known problems.
- Documentation improvements: Typo fixes, clarifications, additional examples, translations.
- New examples: Demo scripts showing different use cases (emotion detection, event detection, multilingual transcription).
- Performance improvements: Optimizations for inference speed or memory usage.
- Test coverage: Unit tests and integration tests are very welcome.
- Follow existing code patterns and conventions in the repository.
- Use type hints where applicable.
- Add docstrings to new functions and classes.
- Keep lines under 120 characters.
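As an illustration of the style points above, here is a small function with type hints, a docstring, and lines under 120 characters. It is a hypothetical helper, not code from the repository; the 16000 Hz default and the Google-style docstring layout are assumptions, so match whatever convention the surrounding file already uses.

```python
def audio_duration_seconds(num_samples: int, sample_rate: int = 16000) -> float:
    """Return the duration of an audio clip in seconds.

    Args:
        num_samples: Number of samples in the clip. Must be non-negative.
        sample_rate: Samples per second. Must be positive. The default is an
            illustrative assumption, not a value taken from the repository.

    Returns:
        The clip duration in seconds.

    Raises:
        ValueError: If ``num_samples`` is negative or ``sample_rate`` is not positive.
    """
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate
```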
SenseVoice/
├── model.py # Core SenseVoiceSmall model (encoder, CTC decoder, emotion/event embeddings)
├── api.py # FastAPI server for inference
├── webui.py # Gradio web interface
├── demo1.py # Inference example using FunASR AutoModel
├── demo2.py # Direct model inference with timestamp support
├── export.py # ONNX model export
├── export_meta.py # Export utilities for model rebuilding
├── finetune.sh # Fine-tuning script with DeepSpeed support
├── requirements.txt # Python dependencies
├── Dockerfile # Docker build configuration
├── data/ # Example training and validation data
├── utils/ # Utilities (frontend, ONNX inference, CTC alignment, export)
└── image/ # Documentation images
SenseVoice is a non-autoregressive encoder-only model that outputs:
- Speech transcription (ASR) across 50+ languages
- Emotion labels: HAPPY, SAD, ANGRY, NEUTRAL, FEARFUL, DISGUSTED, SURPRISED
- Audio event labels: BGM, Speech, Applause, Laughter, Cry, Sneeze, Breath, Cough
- Language identification for Mandarin, English, Cantonese, Japanese, and Korean
The model uses a SANM (memory-equipped self-attention) encoder architecture with CTC decoding. Emotion and event labels are predicted from the first 4 encoder output tokens, while the remaining tokens produce the transcription.
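The split between label positions and transcription positions can be sketched on plain lists of token ids. This is an illustrative sketch of standard greedy CTC decoding, not the repository's decoding code: the function names are hypothetical, the token ids are made up, and treating id 0 as the CTC blank is an assumption.

```python
from typing import List, Tuple

NUM_LABEL_POSITIONS = 4  # per the description above: the first 4 outputs carry labels
BLANK_ID = 0  # assumption: id 0 is the CTC blank; the real vocabulary may differ


def ctc_greedy_collapse(frame_ids: List[int], blank_id: int = BLANK_ID) -> List[int]:
    """Merge consecutive repeated ids, then drop blanks (greedy CTC decoding)."""
    collapsed: List[int] = []
    previous = None
    for token_id in frame_ids:
        # Keep an id only when it starts a new run and is not the blank.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


def split_outputs(frame_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Separate the leading label positions from the CTC-decoded transcription ids."""
    label_ids = frame_ids[:NUM_LABEL_POSITIONS]
    text_ids = ctc_greedy_collapse(frame_ids[NUM_LABEL_POSITIONS:])
    return label_ids, text_ids
```

For example, `split_outputs([101, 102, 103, 104, 7, 7, 0, 7, 9, 9, 0])` returns `([101, 102, 103, 104], [7, 7, 9])`: the blank between the two runs of 7 keeps them as separate tokens, while the repeated 9 collapses to one.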
You can also run SenseVoice using Docker:
# Build
docker build -t sensevoice .
# Run with GPU
docker run --gpus all -p 50000:50000 sensevoice
# Run on CPU
docker run -e SENSEVOICE_DEVICE=cpu -p 50000:50000 sensevoice
- Open an issue using the Questions template.
- Join the community via the DingTalk group (see README).
By contributing to SenseVoice, you agree that your contributions will be licensed under the same license as the project. See FunASR License for details.