Real-time audio transcription with a beautiful terminal UI. Stream from your microphone or audio files and see transcripts appear instantly with color-coded confidence scores, speaker labels, and latency metrics.
uv sync
export DEEPGRAM_API_KEY="your_api_key_here"To use --vad for EOT latency measurement:
export HF_TOKEN="your_huggingface_token"You must first accept the licence at https://huggingface.co/pyannote/segmentation-3.0
Then launch the UI and start talking:
uv run stream_audio_file.py --ui --live \
--url "wss://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&interim_results=true"Live microphone streaming:
uv run stream_audio_file.py --ui --live \
--url "wss://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&interim_results=true"Transcribe an audio file:
uv run stream_audio_file.py --ui -f audio.wav \
--url "wss://api.deepgram.com/v1/listen?model=nova-3&smart_format=true"Try Flux for ultra-low latency:
uv run stream_audio_file.py --ui --live \
--url "wss://api.deepgram.com/v2/listen?model=flux-general-en&eot_threshold=0.7&encoding=linear16&sample_rate=16000"Stream and save JSON output:
uv run stream_audio_file.py -f audio.wav \
--url "wss://api.deepgram.com/v1/listen?model=nova-3&interim_results=true"Output is automatically saved to audio.json (derived from input filename).
Specify a custom output file:
uv run stream_audio_file.py -o output.json -f audio.wav \
--url "wss://api.deepgram.com/v1/listen?model=nova-3&interim_results=true"Live recording saves with timestamp:
uv run stream_audio_file.py --live \
--url "wss://api.deepgram.com/v1/listen?model=nova-3&interim_results=true"
# Saves to recording_20250114_153022.json (or similar)Print basic transcript:
uv run print_transcript.py -f output.json[00:00:00.56 - 00:00:03.29]: The missile knows where it is at all times.
[00:00:03.75 - 00:00:06.17]: It knows this because it knows where it isn't.
Print with all the details:
uv run print_transcript.py -f output.json \
--print-speakers --print-channels --print-interim --print-latency --colorize[18:30:24.066 (0.665s since EOS)] [00:00:00.00 - 00:00:03.48] [Speaker 0] [Channel 0] [IsFinal]: The missile knows where it is at all times.
Print just the text:
uv run print_transcript.py -f output.json --only-transcriptThe missile knows where it is at all times.
It knows this because it knows where it isn't.
| Option | Description |
|---|---|
--url, -u |
Deepgram websocket URL (required) |
--ui |
Interactive terminal UI with live updates |
-f, --audio |
Audio file to stream |
-l, --live |
Stream from microphone |
-o, --output |
Save JSON messages to file (defaults to input filename or timestamped name) |
-v, -vv, -vvv |
Increase verbosity |
--vad/--no-vad |
Run PyAnnote VAD concurrently to enable EOT latency measurement (requires HF_TOKEN env var) |
--play-audio |
Play audio through the speaker as it streams (file mode only) |
| Option | Description |
|---|---|
--print-speakers |
Show speaker labels |
--print-channels |
Show audio channels |
--print-interim |
Include interim results |
--print-received |
Show received timestamp for streamed messages |
--print-latency |
Show latency metrics (TTFT, update frequency, message latency, EOT latency) |
--print-entities |
Show detected entities |
--colorize |
Color words by confidence |
--only-transcript |
Just the text, no metadata |
Run either script with --help for full options.
Generate shell completions for your preferred shell:
uv run stream_audio_file.py completion bash # or zsh, fishWhen using --print-latency, the following metrics are computed:
Session-level:
- TTFT (Time To First Transcript): Wall-clock time from when audio streaming begins to when the first transcript message is received. Measures initial responsiveness.
- Update Frequency: Number of interim transcript updates per second of audio. Higher values mean a more fluid, responsive transcription experience.
Per-message:
- Message Latency: How far behind the transcription is from the audio being sent, calculated as
audio_cursor - transcript_cursor. Measured on interim results only, per Deepgram's methodology. - EOT Latency (End-of-Turn Latency): Time from when the user finished speaking to when the finalizing transcript event is received (e.g.,
speech_final,UtteranceEnd,EndOfTurn). Requires--vadflag — uses PyAnnote VAD to detect when speech actually ended, and measures wall-clock time from when that audio was sent over the websocket to when the response arrived. Critical for voice agents — they can't respond until they know the user stopped speaking.
The UI mode shows transcription speed in real-time—watch words appear as you speak and see exactly how fast Deepgram processes your audio. The --print-latency option reveals latency metrics, perfect for testing different models and configurations.