This guide explains how to use the Hugging Face Transformers-based Whisper service for speech-to-text transcription in NeuralNote.
NeuralNote supports two backends for speech transcription:
- ONNX Runtime (local, embedded models) - Default, runs entirely in C++
- HTTP Service (Hugging Face Transformers via Python) - Better accuracy, more features, easier model selection
The HTTP Service option provides access to the full ecosystem of Whisper models on Hugging Face, including:
- All official OpenAI Whisper variants (tiny, base, small, medium, large-v2, large-v3, large-v3-turbo)
- Distil-Whisper models (faster, smaller alternatives)
- Fine-tuned models for specific languages/domains
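The service accepts any compatible checkpoint id from the Hub. As a sketch of how to browse what's available programmatically (this assumes huggingface_hub is importable, which it should be since transformers depends on it):

```python
# Sketch: browse Whisper checkpoints on the Hugging Face Hub.
# Assumes huggingface_hub is installed (it is a dependency of transformers).
from huggingface_hub import HfApi

api = HfApi()
# List model repos matching "whisper", most-downloaded first.
for model in api.list_models(search="whisper", sort="downloads", limit=10):
    print(model.id)
```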
First, set up a Python environment and install the service dependencies:

```bash
# Create and activate virtual environment
python3 -m venv venv-whisper
source venv-whisper/bin/activate  # On Windows: venv-whisper\Scripts\activate

# Install requirements
pip install -r Scripts/requirements-whisper-service.txt
```

Then start the service in one of two ways.

Option A: Using the launcher script (recommended)

```bash
./Scripts/start_whisper_service.sh
```

Option B: Manual start

```bash
python3 Scripts/whisper_service.py --model openai/whisper-large-v3-turbo
```

The plugin will automatically detect and use the service if it is running on http://127.0.0.1:8765.

To list available models:

```bash
python3 Scripts/whisper_service.py --list-models
```

Start the service with a specific model:
```bash
# Latest turbo model (fastest large model)
python3 Scripts/whisper_service.py --model openai/whisper-large-v3-turbo

# Smaller, faster models
python3 Scripts/whisper_service.py --model openai/whisper-small
python3 Scripts/whisper_service.py --model openai/whisper-tiny

# Distil-Whisper (6x faster)
python3 Scripts/whisper_service.py --model distil-whisper/distil-large-v3

# English-only models (more accurate for English)
python3 Scripts/whisper_service.py --model openai/whisper-medium.en
```

If you already have Hugging Face models downloaded:

```bash
# Use models from custom cache directory
python3 Scripts/whisper_service.py \
    --model openai/whisper-large-v3-turbo \
    --model-dir ~/.cache/huggingface/hub
```
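If a model is not cached yet, it can be fetched ahead of time with huggingface_hub, for example before going offline. A minimal sketch:

```python
# Sketch: pre-download a Whisper checkpoint into the standard HF cache
# (~/.cache/huggingface/hub by default, matching the --model-dir example above).
from huggingface_hub import snapshot_download

path = snapshot_download("openai/whisper-large-v3-turbo")
print("model cached at:", path)
```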
```bash
# Or use a local model directory
python3 Scripts/whisper_service.py \
    --model /path/to/my/local/whisper/model
```

To run the service on a custom port:

```bash
python3 Scripts/whisper_service.py --port 9000
```

If using a custom port, set the NEURALNOTE_WHISPER_SERVICE_URL environment variable so the plugin can find the service:

```bash
export NEURALNOTE_WHISPER_SERVICE_URL=http://127.0.0.1:9000
```

Select the inference device with --device:

```bash
# Automatic (default: GPU if available, else CPU)
python3 Scripts/whisper_service.py --device auto

# Force CPU
python3 Scripts/whisper_service.py --device cpu

# Specific GPU
python3 Scripts/whisper_service.py --device cuda:0
```
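Before forcing a CUDA device, it can help to confirm that PyTorch actually sees a GPU (this assumes torch was installed with the service requirements):

```python
# Sanity check: does PyTorch see a CUDA-capable GPU?
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Index 0 corresponds to --device cuda:0 above.
    print("GPU 0:", torch.cuda.get_device_name(0))
```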
For faster inference, install Flash Attention 2 (if your GPU supports it):

```bash
pip install flash-attn --no-build-isolation
```

The service will automatically use Flash Attention if available.
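To confirm the install succeeded, check that the package imports (the import name is flash_attn, with an underscore):

```python
# Verify Flash Attention 2 installed correctly on this machine.
import flash_attn

print("flash-attn version:", flash_attn.__version__)
```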
The service provides the following HTTP endpoints:
Health check endpoint (/health).
Response:

```json
{
  "status": "healthy",
  "model": {
    "model_id": "openai/whisper-large-v3-turbo",
    "device": "cuda:0",
    "dtype": "torch.float16",
    "sample_rate": 16000
  }
}
```

Transcribe audio to text.
Request:
```json
{
  "audio": [/* float array of audio samples at 16 kHz */],
  "language": "en",     // Optional, auto-detect if omitted
  "task": "transcribe"  // Or "translate" for translation to English
}
```

Response:
```json
{
  "text": "full transcription",
  "words": [
    {
      "text": "hello",
      "start": 0.0,
      "end": 0.5,
      "confidence": 1.0
    },
    // ...
  ]
}
```

Get model information.
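To exercise these endpoints end to end without the plugin, a small stdlib-only Python client is sketched below. The /health path matches the one the plugin probes; the /transcribe path is an assumption and should be verified against Scripts/whisper_service.py:

```python
"""Minimal client sketch for the Whisper HTTP service (stdlib only).

Assumption: the transcription endpoint is mounted at /transcribe --
verify the actual path in Scripts/whisper_service.py.
"""
import json
import math
import urllib.request

BASE_URL = "http://127.0.0.1:8765"

def call(path, payload=None):
    """GET if payload is None, otherwise POST payload as JSON; return parsed JSON."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"} if data else {},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)

# Health check; reports the loaded model, device, and expected sample rate.
health = call("/health")
print(health["model"])

# One second of a 440 Hz sine as stand-in input. Real audio must be mono
# float samples at the service's sample rate (16 kHz per /health above).
sr = health["model"]["sample_rate"]
audio = [0.1 * math.sin(2 * math.pi * 440.0 * n / sr) for n in range(sr)]

# Assumed path -- see the note above.
result = call("/transcribe", {"audio": audio, "task": "transcribe"})
print(result["text"])
print(result["words"][:3])
```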
Problem: ModuleNotFoundError: No module named 'transformers'
Solution: Install dependencies:
```bash
pip install -r Scripts/requirements-whisper-service.txt
```

Problem: Model not found or download errors

Solution: Ensure you have an internet connection for the first-time model download, or specify a local model with --model-dir.
Problem: Plugin does not detect the service

Solution:

- Verify the service is running:

  ```bash
  curl http://127.0.0.1:8765/health
  ```

- Check the service logs for errors
- Ensure no firewall is blocking port 8765
To speed up transcription:

- Use smaller models (tiny, base, small) for faster inference
- Consider Distil-Whisper models (6x faster)
- Ensure the GPU is being used: check the service logs for device: cuda:0
- Install Flash Attention 2 for additional speedup
NeuralNote uses automatic backend selection:
- If HTTP service is running → Use HTTP service (Hugging Face Transformers)
- If ONNX models are available → Use ONNX Runtime
- If neither available → Show placeholder message
You can check which backend is active in the NeuralNote debug output.
| Model | Params | Speed | Accuracy | Memory |
|---|---|---|---|---|
| whisper-tiny | 39M | 32x | Good | 1GB |
| whisper-base | 74M | 16x | Better | 1GB |
| whisper-small | 244M | 6x | Very Good | 2GB |
| whisper-medium | 769M | 2x | Excellent | 5GB |
| whisper-large-v3-turbo | 809M | 1.5x | Best | 6GB |
| distil-large-v3 | 756M | 6x | Excellent | 4GB |
Speed is relative to whisper-large-v3. Distil-Whisper provides near-large accuracy at small model speed.