Whisper Speech-to-Text Setup

Whisper is a local speech recognition service that converts audio to text for VoiceMode using OpenAI's Whisper model. It provides offline speech-to-text (STT) in several model sizes, so you can trade speed against accuracy.

Quick Start

# Install whisper service with default base model (includes Core ML on Apple Silicon!)
voicemode whisper service install

# Install with a different model
voicemode whisper service install --model large-v3

# List available models and their status
voicemode whisper model --all

# Switch to a different model (auto-installs if needed)
voicemode whisper model large-v2

# Start the service
voicemode whisper service start

Apple Silicon Bonus: On M1/M2/M3/M4 Macs, VoiceMode automatically downloads pre-built Core ML models for 2-3x faster transcription than Metal acceleration alone. Core ML support requires no Xcode or Python dependencies.

Default endpoint: http://127.0.0.1:2022/v1

Installation Methods

Automatic Installation (Recommended)

VoiceMode includes an installation tool that sets up whisper.cpp automatically:

# Install with default base model (142MB) - good balance of speed and accuracy
voicemode whisper service install

# Install with a specific model
voicemode whisper service install --model small

This will:

Clone and build whisper.cpp with GPU support (if available)
Download the specified model (default: base)
On Apple Silicon: Automatically download pre-built Core ML models for 2-3x faster performance
Create a start script with environment variable support
Set up automatic startup (launchd on macOS, systemd on Linux)

Manual Installation

macOS

# Install via Homebrew
brew install whisper-cpp

# Download model
mkdir -p ~/.voicemode/models/whisper
cd ~/.voicemode/models/whisper
curl -LO https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v2.bin

Linux

# Clone and build whisper.cpp
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make

# Download model
mkdir -p ~/.voicemode/models/whisper
cd ~/.voicemode/models/whisper
wget https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v2.bin
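
Both manual download commands above use the same URL pattern, so fetching a different model should only change the file name. A minimal sketch, assuming the other models in the Hugging Face repository follow the ggml-<model>.bin naming shown above (only the large-v2 URL is confirmed by this guide, and the helper names are illustrative, not part of VoiceMode):

```shell
# Hypothetical helpers; the directory default mirrors the manual steps above.
WHISPER_MODEL_DIR="${WHISPER_MODEL_DIR:-$HOME/.voicemode/models/whisper}"

# Print the download URL for a model name such as "base" or "large-v2".
whisper_model_url() {
    printf 'https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-%s.bin\n' "$1"
}

# Download a model into WHISPER_MODEL_DIR unless a non-empty file already exists.
download_whisper_model() {
    model="$1"
    target="$WHISPER_MODEL_DIR/ggml-$model.bin"
    mkdir -p "$WHISPER_MODEL_DIR" || return 1
    if [ -s "$target" ]; then
        echo "Already present: $target"
        return 0
    fi
    # -f makes curl fail on HTTP errors; the .part suffix keeps an interrupted
    # download from being mistaken for a complete model file.
    curl -fL -o "$target.part" "$(whisper_model_url "$model")" \
        && mv "$target.part" "$target"
}
```

For example, download_whisper_model small would place ggml-small.bin next to the other models.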

Prerequisites

macOS:

Xcode Command Line Tools (xcode-select --install), needed only when building whisper.cpp from source
Homebrew (https://brew.sh)
cmake (brew install cmake)

Note for Apple Silicon users: Core ML models are pre-built and downloaded automatically, so Core ML support itself requires no Xcode, PyTorch, or coremltools.

Linux:

Build essentials (sudo apt install build-essential on Ubuntu/Debian)

Core ML Acceleration (Apple Silicon)

On Apple Silicon Macs (M1/M2/M3/M4), VoiceMode automatically downloads pre-built Core ML models from Hugging Face for 2-3x faster transcription:

Automatic: Core ML models download alongside regular models
No Dependencies: No PyTorch, Xcode, or coremltools needed
Pre-built: Models are pre-compiled and ready to use
Performance: 2-3x faster than Metal acceleration alone

Core ML models are included automatically when available. The installation process handles this transparently.

Model Management

Available Models

Model	Size	RAM Usage	Accuracy	Speed	Language Support
tiny	39 MB	~390 MB	Low	Fastest	Multilingual
tiny.en	39 MB	~390 MB	Low	Fastest	English only
base	142 MB	~500 MB	Fair	Fast	Multilingual
base.en	142 MB	~500 MB	Fair	Fast	English only
small	466 MB	~1 GB	Good	Moderate	Multilingual
small.en	466 MB	~1 GB	Good	Moderate	English only
medium	1.5 GB	~2.6 GB	Very Good	Slow	Multilingual
medium.en	1.5 GB	~2.6 GB	Very Good	Slow	English only
large-v1	2.9 GB	~3.9 GB	Excellent	Slower	Multilingual
large-v2	2.9 GB	~3.9 GB	Excellent	Slower	Multilingual (recommended)
large-v3	3.1 GB	~3.9 GB	Best	Slowest	Multilingual
large-v3-turbo	1.6 GB	~2.5 GB	Very Good	Moderate	Multilingual
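
The RAM column can drive a simple automatic choice. A minimal sketch that picks a multilingual model whose approximate RAM usage from the table fits a given memory budget. The function name and the notion of a budget in MB are assumptions for illustration, not a VoiceMode feature. It prefers large-v2 at the top because the table marks it as recommended, and large-v3-turbo over medium because the table lists it as faster at similar accuracy and memory:

```shell
# Print a multilingual model that fits within the given RAM budget in MB.
# Thresholds are the approximate RAM figures from the table (1 GB = 1000 MB).
pick_whisper_model() {
    budget_mb="$1"
    if   [ "$budget_mb" -ge 3900 ]; then echo "large-v2"
    elif [ "$budget_mb" -ge 2500 ]; then echo "large-v3-turbo"
    elif [ "$budget_mb" -ge 1000 ]; then echo "small"
    elif [ "$budget_mb" -ge 500 ];  then echo "base"
    elif [ "$budget_mb" -ge 390 ];  then echo "tiny"
    else
        echo "not enough memory for any model" >&2
        return 1
    fi
}
```

The result can be passed straight to the model command, for example: voicemode whisper model "$(pick_whisper_model 4000)".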

Model Commands

# List all models with installation status
voicemode whisper model --all

# Show current active model
voicemode whisper model

# Switch to a model (auto-installs if not present)
voicemode whisper model small.en

# Switch model without auto-installing (fails if model not installed)
voicemode whisper model medium --no-install

# Switch model without restarting service
voicemode whisper model large-v2 --no-restart

Note: After changing the active model with --no-restart, restart the whisper service manually for changes to take effect.

Service Configuration

Environment Variables

Configure in ~/.voicemode/voicemode.env:

VOICEMODE_WHISPER_MODEL=large-v2
VOICEMODE_WHISPER_PORT=2022
VOICEMODE_WHISPER_THREADS=          # Auto-detected if not set
VOICEMODE_WHISPER_LANGUAGE=auto
VOICEMODE_WHISPER_MODEL_PATH=~/.voicemode/models/whisper

Thread Configuration: By default, VoiceMode auto-detects the number of CPU cores and configures threads accordingly. You can override this by setting VOICEMODE_WHISPER_THREADS to a specific number.
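
To illustrate how these settings and the thread auto-detection could fit together, here is a minimal sketch of a start-script fragment. It is not VoiceMode's actual start script: the function names are hypothetical, and it assumes the env file contains simple KEY=value lines like those shown above.

```shell
# Print the value of KEY from an env file, or DEFAULT if it is unset or empty.
# Trailing "# comments" and surrounding whitespace are stripped, so values
# that contain "#" are not supported by this sketch.
env_value() {
    key="$1"; file="$2"; default="$3"
    value=""
    if [ -f "$file" ]; then
        value=$(sed -n "s/^[[:space:]]*$key=//p" "$file" | tail -n 1 \
            | sed -e 's/[[:space:]]*#.*$//' -e 's/[[:space:]]*$//')
    fi
    printf '%s\n' "${value:-$default}"
}

# Print the number of online CPU cores, falling back to 1 if unknown.
detect_threads() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) \
        || n=$(sysctl -n hw.ncpu 2>/dev/null) \
        || n=1
    printf '%s\n' "${n:-1}"
}

# Resolve the thread count: an explicit setting wins, otherwise auto-detect.
resolve_threads() {
    env_value VOICEMODE_WHISPER_THREADS "$1" "$(detect_threads)"
}
```

Calling resolve_threads ~/.voicemode/voicemode.env prints the configured value when VOICEMODE_WHISPER_THREADS is set and the detected core count when it is left empty.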

Running the Server

OpenAI-Compatible Server Mode

whisper-server \
  --model models/ggml-large-v2.bin \
  --host 127.0.0.1 \
  --port 2022 \
  --inference-path "/v1/audio/transcriptions" \
  --threads 4 \
  --processors 1 \
  --convert \
  --print-progress

Key options:

--model: Path to model file
--host: Server host (default: 127.0.0.1)
--port: Server port (VoiceMode expects 2022)
--inference-path: OpenAI-compatible endpoint path
--threads: Number of threads for processing (auto-detected by VoiceMode)
--processors: Number of parallel processors
--convert: Convert audio to required format automatically (required for VoiceMode)
--print-progress: Show transcription progress

Note: When using VoiceMode's managed service, threads are auto-detected based on your CPU cores. The --convert flag is required for VoiceMode to work correctly with various audio formats.
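
With the server running, a quick way to confirm the endpoint responds is to post an audio file to it. A minimal sketch, assuming the server accepts a multipart upload in a form field named file, as the whisper.cpp server example and the OpenAI transcription API do. The helper names and the WHISPER_BASE_URL variable are placeholders introduced here:

```shell
WHISPER_BASE_URL="${WHISPER_BASE_URL:-http://127.0.0.1:2022/v1}"

# Print the full transcription URL for the configured base URL,
# tolerating a trailing slash on the base.
transcription_url() {
    printf '%s/audio/transcriptions\n' "${WHISPER_BASE_URL%/}"
}

# Send an audio file to the server and print the JSON response.
transcribe() {
    audio_file="$1"
    [ -r "$audio_file" ] || { echo "cannot read: $audio_file" >&2; return 1; }
    curl -sS --fail \
        -F "file=@$audio_file" \
        -F "response_format=json" \
        "$(transcription_url)"
}
```

Running transcribe sample.wav (with any short recording in place of sample.wav) should print a JSON document containing the recognized text if the server was started with the options above.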

Service Management

macOS (LaunchAgent)

# Start/stop service
launchctl load ~/Library/LaunchAgents/com.voicemode.whisper.plist
launchctl unload ~/Library/LaunchAgents/com.voicemode.whisper.plist

# Enable/disable at startup
launchctl load -w ~/Library/LaunchAgents/com.voicemode.whisper.plist
launchctl unload -w ~/Library/LaunchAgents/com.voicemode.whisper.plist

# Check status
launchctl list | grep whisper

Linux (systemd)

# Start/stop service
systemctl --user start whisper
systemctl --user stop whisper

# Enable/disable at startup
systemctl --user enable whisper
systemctl --user disable whisper

# Check status and logs
systemctl --user status whisper
journalctl --user -u whisper -f

Hardware Acceleration

Apple Silicon (Core ML)

Core ML provides 2-3x faster transcription than Metal alone on Apple Silicon Macs. Approximate speeds relative to CPU-only transcription:

# Performance comparison
# CPU only: ~1x (baseline)
# Metal: ~3-4x faster
# Core ML + Metal: ~8-12x faster

Core ML models are downloaded automatically when installing Whisper on Apple Silicon. No additional configuration needed.

GPU Acceleration

The installation tool automatically detects and enables:

Mac (Apple Silicon): Metal acceleration
NVIDIA GPU: CUDA acceleration
CPU: Optimized CPU builds

Integration with VoiceMode

VoiceMode automatically detects Whisper when available:

First: Checks for a whisper.cpp server at http://127.0.0.1:2022/v1
Fallback: Uses the OpenAI API (requires OPENAI_API_KEY)
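
The detection order above can be sketched as follows. This illustrates the described behavior and is not VoiceMode's implementation: the reachability probe (a plain curl request with a short timeout), the function names, and the use of https://api.openai.com/v1 as the fallback base URL are assumptions.

```shell
LOCAL_STT_URL="http://127.0.0.1:2022/v1"
OPENAI_STT_URL="https://api.openai.com/v1"

# Return success if something answers HTTP at the given URL within 2 seconds.
# Any HTTP response counts, including 404, because only reachability matters.
endpoint_reachable() {
    curl -s -o /dev/null --max-time 2 "$1"
}

# Print the STT base URL to use: local Whisper first, then the OpenAI API.
select_stt_base_url() {
    if endpoint_reachable "$LOCAL_STT_URL"; then
        echo "$LOCAL_STT_URL"
    elif [ -n "${OPENAI_API_KEY:-}" ]; then
        echo "$OPENAI_STT_URL"
    else
        echo "no STT endpoint available: start Whisper or set OPENAI_API_KEY" >&2
        return 1
    fi
}
```

Keeping the probe in its own function makes the selection logic easy to exercise without a running server.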

Custom Configuration

To use a different endpoint, or to force VoiceMode to use the local Whisper server, set STT_BASE_URL:

export STT_BASE_URL=http://127.0.0.1:2022/v1

Or in MCP configuration:

"voicemode": {
  ...
  "env": {
    "STT_BASE_URL": "http://127.0.0.1:2022/v1"
  }
}

Fully Local Setup

For completely offline voice processing, combine Whisper with Kokoro:

export STT_BASE_URL=http://127.0.0.1:2022/v1  # Whisper for STT
export TTS_BASE_URL=http://127.0.0.1:8880/v1  # Kokoro for TTS
export TTS_VOICE=af_sky                       # Kokoro voice

Troubleshooting

Service Won't Start

Check whether port 2022 is already in use: lsof -i :2022
Verify that the model file exists in the configured model path (VOICEMODE_WHISPER_MODEL_PATH)
Check the service logs in ~/.voicemode/services/whisper/logs/ for error messages
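
The first two checks can be combined into a small preflight script. A minimal sketch, assuming model files are named ggml-<model>.bin as in the manual installation steps; the function names are hypothetical, and the port check is skipped when lsof is not installed:

```shell
# Return success if the model file for MODEL exists and is non-empty in DIR.
model_file_present() {
    [ -s "$2/ggml-$1.bin" ]
}

# Report problems that would keep the service from starting.
# Usage: preflight MODEL MODEL_DIR [PORT]
preflight() {
    model="$1"; dir="$2"; port="${3:-2022}"
    status=0
    if ! model_file_present "$model" "$dir"; then
        echo "missing model file: $dir/ggml-$model.bin"
        status=1
    fi
    # Only listening sockets matter; outbound connections are ignored.
    if command -v lsof >/dev/null 2>&1 \
        && lsof -iTCP:"$port" -sTCP:LISTEN >/dev/null 2>&1; then
        echo "port $port is already in use"
        status=1
    fi
    return "$status"
}
```

For example, preflight large-v2 ~/.voicemode/models/whisper prints nothing and returns 0 when both checks pass.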

Poor Transcription Quality

Try a larger model (base → small → medium → large)
Ensure audio input quality is good
Set VOICEMODE_WHISPER_LANGUAGE to a specific language instead of auto if the spoken language is known

High CPU Usage

Use a smaller model for better performance
Consider English-only models (.en) for English content
Enable GPU acceleration if available

Model Installation Issues

Verify adequate disk space (models range from 39 MB to 3.1 GB)
Check network connectivity to Hugging Face
Delete corrupted model files from ~/.voicemode/models/whisper/ and re-run the model command
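
The disk space check can be scripted before a large download. A minimal sketch, assuming a POSIX df; the function names are illustrative, and the required size should be taken from the model table:

```shell
# Print the free space in MB on the filesystem that holds DIR.
free_space_mb() {
    # -P forces single-line POSIX output; -k reports 1024-byte blocks.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# Return success if DIR has at least NEEDED_MB of free space.
has_space_for() {
    [ "$(free_space_mb "$1")" -ge "$2" ]
}
```

For example, has_space_for ~/.voicemode/models/whisper 3100 || echo "not enough space for large-v3" warns before an attempt to install the largest model.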

Performance Monitoring

# Check service status
voicemode whisper service status

# Monitor real-time processing
tail -f ~/.voicemode/services/whisper/logs/performance.log

# List available models
voicemode whisper model --all

File Locations

Models: ~/.voicemode/models/whisper/ or ~/.voicemode/services/whisper/models/
Service Config: ~/.voicemode/services/whisper/config.json
Model Preferences: ~/.voicemode/whisper-models.txt
Logs: ~/.voicemode/services/whisper/logs/
LaunchAgent (macOS): ~/Library/LaunchAgents/com.voicemode.whisper.plist
Systemd Service (Linux): ~/.config/systemd/user/whisper.service
