just install # create venv and install dependencies
just run <source> # transcribe a YouTube URL, video ID, or local file
just config # show config file path
just config --show # print current config values
just config --edit # open config in $EDITOR
just models # list available Whisper model sizes
just test # run unit tests
just smoke # run smoke tests (requires network)# Full URL
just run "https://youtube.com/watch?v=VIDEO_ID"
# Short URL
just run "https://youtu.be/VIDEO_ID"
# Bare video ID
just run VIDEO_IDFor YouTube, captions are fetched first (instant, no download). Whisper is used as fallback when captions are unavailable.
just run recording.mp3
just run /path/to/interview.mp4
just run ~/Downloads/lecture.m4aAny format FFmpeg handles is accepted. See supported formats.
| Flag | Short | Default | Description |
|---|---|---|---|
--output |
-o |
— | Save to this exact path (overrides output_dir in config) |
--model |
-m |
from config | Whisper model size |
--language |
-l |
auto-detect | Force language code, e.g. en, fr, de |
--force-whisper |
-w |
off | Skip caption lookup, always use Whisper (ignored for local files) |
--print |
-p |
off | Print to stdout instead of saving; suppresses all status output |
CLI flags always override config values.
By default transcripts are saved to ~/Downloads/<video title>.txt. Three ways to control this:
--output path/to/file.txt— save to an exact pathoutput_dirin config — change the default save directory--print— write to stdout instead of a file
If a file with the target name already exists, (1), (2), … is appended automatically.
# macOS
just run "https://youtube.com/watch?v=VIDEO_ID" --print | pbcopy
# Linux
just run "https://youtube.com/watch?v=VIDEO_ID" --print | xclip -selection clipboard
# Windows
just run "https://youtube.com/watch?v=VIDEO_ID" --print | clipConfig is stored in a platform-specific TOML file:
| Platform | Path |
|---|---|
| macOS | ~/Library/Application Support/yt-transcribe/config.toml |
| Linux | ~/.config/yt-transcribe/config.toml |
| Windows | %LOCALAPPDATA%\yt-transcribe\config.toml |
Default config:
[defaults]
model = "turbo" # tiny | base | small | medium | turbo | large-v3
language = "" # empty = auto-detect per video
output_dir = "~/Downloads" # transcripts saved here (video title as filename)
output_extension = "txt"
[whisper]
device = "cpu" # cpu | cuda (use cuda if you have a GPU)
compute_type = "int8" # int8 (fast CPU) | float16 (GPU) | float32 (precise)
beam_size = 5 # higher = more accurate, slower (1–10)
vad_filter = true # skip silent segments (recommended)output_dir — every transcription is auto-saved to <output_dir>/<video title>.<output_extension>. Supports ~ expansion. Useful for batch use.
device = "cuda" — if you have an NVIDIA GPU, set this and compute_type = "float16" for much faster transcription.
language — set a default language to skip auto-detection. Useful if you primarily transcribe content in one language.
Model weights are downloaded from HuggingFace on first use and cached at:
- macOS/Linux:
~/.cache/huggingface/hub/ - Windows:
%USERPROFILE%\.cache\huggingface\hub\
Override with the HF_HUB_CACHE environment variable.
| Model | Size | Speed | Accuracy |
|---|---|---|---|
tiny |
~75 MB | fastest | lowest |
base |
~140 MB | fast | decent |
small |
~460 MB | moderate | good |
medium |
~1.5 GB | slow | better |
turbo |
~800 MB | fast | best for size — default |
large-v3 |
~3 GB | slowest | highest |
turbo is the default: good accuracy across languages at reasonable speed. For quick drafts or batch jobs, base is a solid middle ground. See performance benchmarks for timing data.
If YouTube rate-limits caption requests, you can supply cookies from a logged-in browser session:
YOUTUBE_COOKIES_FILE=/path/to/cookies.txt just run "https://youtube.com/watch?v=VIDEO_ID"The file must be in Netscape cookie format (exportable via browser extensions like "Get cookies.txt").