|
| 1 | +# yt-transcribe |
| 2 | + |
| 3 | +CLI tool that extracts transcripts from YouTube videos. Fetches existing captions when available; otherwise downloads audio and transcribes locally with [Whisper](https://github.com/SYSTRAN/faster-whisper). |
| 4 | + |
| 5 | +## Requirements |
| 6 | + |
| 7 | +- Python 3.11+ |
| 8 | +- [uv](https://docs.astral.sh/uv/) — `curl -LsSf https://astral.sh/uv/install.sh | sh` |
| 9 | +- [just](https://just.systems/) — `brew install just` |
| 10 | + |
| 11 | +## Setup |
| 12 | + |
| 13 | +```bash |
| 14 | +cd transcribe |
| 15 | +just install |
| 16 | +``` |
| 17 | + |
| 18 | +## Usage |
| 19 | + |
| 20 | +```bash |
| 21 | +# Fetch captions if available, otherwise run Whisper — saves to current dir by default |
| 22 | +just run "https://youtube.com/watch?v=VIDEO_ID" |
| 23 | + |
| 24 | +# Save to a specific file |
| 25 | +just run "https://youtube.com/watch?v=VIDEO_ID" --output transcript.txt |
| 26 | + |
| 27 | +# Print to stdout (all status output suppressed — safe to pipe) |
| 28 | +just run "https://youtube.com/watch?v=VIDEO_ID" --print |
| 29 | +just run "https://youtube.com/watch?v=VIDEO_ID" --print | pbcopy |
| 30 | + |
| 31 | +# Force Whisper even if captions exist |
| 32 | +just run "https://youtube.com/watch?v=VIDEO_ID" --force-whisper |
| 33 | + |
| 34 | +# Use a specific Whisper model |
| 35 | +just run "https://youtube.com/watch?v=VIDEO_ID" --model large-v3 |
| 36 | + |
| 37 | +# Force language (auto-detected by default) |
| 38 | +just run "https://youtube.com/watch?v=VIDEO_ID" --language de |
| 39 | + |
| 40 | +# See all model options |
| 41 | +just models |
| 42 | +``` |
| 43 | + |
| 44 | +## Config |
| 45 | + |
| 46 | +Defaults are stored in `~/.config/yt-transcribe/config.toml`. On first run with `just config --edit` the file is created with all options documented inline. |
| 47 | + |
| 48 | +```bash |
| 49 | +just config # show config file path |
| 50 | +just config --show # print current config |
| 51 | +just config --edit # open in $EDITOR |
| 52 | +``` |
| 53 | + |
| 54 | +Default config: |
| 55 | + |
| 56 | +```toml |
| 57 | +[defaults] |
| 58 | +model = "turbo" # tiny | base | small | medium | turbo | large-v3 |
| 59 | +language = "" # empty = auto-detect per video |
| 60 | +output_dir = "" # if set, transcripts are auto-saved here (uses video title as filename) |
| 61 | +output_extension = "txt" |
| 62 | + |
| 63 | +[whisper] |
| 64 | +device = "cpu" # cpu | cuda (use cuda if you have a GPU) |
| 65 | +compute_type = "int8" # int8 (fast CPU) | float16 (GPU) | float32 (precise) |
| 66 | +beam_size = 5 # higher = more accurate, slower (1–10) |
| 67 | +vad_filter = true # skip silent segments (recommended) |
| 68 | +``` |
| 69 | + |
| 70 | +**`output_dir`** — when set, every transcription is auto-saved to `<output_dir>/<video title>.<output_extension>` without needing `--output`. Useful for batch use. |
| 71 | + |
| 72 | +## Options |
| 73 | + |
| 74 | +| Flag | Short | Default | Description | |
| 75 | +|---|---|---|---| |
| 76 | +| `--print` | `-p` | off | Print to stdout instead of saving; suppresses all status output (safe to pipe) | |
| 77 | +| `--output` | `-o` | — | Save to this exact path (overrides `output_dir` in config) | |
| 78 | +| `--model` | `-m` | from config | Whisper model size | |
| 79 | +| `--language` | `-l` | auto-detect | Override language, e.g. `en`, `fr`, `de`. Omit to auto-detect — useful only when detection gets it wrong or the video has mixed-language content. | |
| 80 | +| `--force-whisper` | `-w` | off | Skip caption lookup, always use Whisper | |
| 81 | + |
| 82 | +By default the transcript is **saved to a file** — to `output_dir` from config if set, otherwise to the current directory, using the video title as the filename. Use `--print` to get stdout behaviour instead. |
| 83 | + |
| 84 | +CLI flags always override config values. |
| 85 | + |
| 86 | +## Whisper models |
| 87 | + |
| 88 | +Model weights are downloaded from HuggingFace on first use and cached at `~/.cache/huggingface/hub/`. Subsequent runs use the cached copy — no re-download. Override the location with the `HF_HUB_CACHE` environment variable. |
| 89 | + |
| 90 | +| Model | Size | Speed | Accuracy | |
| 91 | +|---|---|---|---| |
| 92 | +| `tiny` | ~75 MB | fastest | lowest | |
| 93 | +| `base` | ~140 MB | fast | decent | |
| 94 | +| `small` | ~460 MB | moderate | good | |
| 95 | +| `medium` | ~1.5 GB | slow | better | |
| 96 | +| `turbo` | ~800 MB | fast | best for size — **default** | |
| 97 | +| `large-v3` | ~3 GB | slowest | highest | |
| 98 | + |
| 99 | +## How it works |
| 100 | + |
| 101 | +``` |
| 102 | + ┌─────────────────────┐ |
| 103 | + │ YouTube URL / ID │ |
| 104 | + └──────────┬──────────┘ |
| 105 | + │ extract video ID |
| 106 | + ▼ |
| 107 | + ┌────────────────────────┐ |
| 108 | + │ Fetch YouTube captions │ ◄── youtube-transcript-api |
| 109 | + │ (any available lang) │ prefers manual over auto-generated |
| 110 | + └────────────┬───────────┘ |
| 111 | + │ |
| 112 | + ┌───────────────┴──────────────┐ |
| 113 | + │ Captions found? │ |
| 114 | + ▼ ▼ |
| 115 | + Yes: done No: fallback |
| 116 | + │ |
| 117 | + ┌──────▼──────┐ |
| 118 | + │ Download │ ◄── yt-dlp |
| 119 | + │ audio │ best quality stream |
| 120 | + └──────┬──────┘ |
| 121 | + │ |
| 122 | + ┌──────▼──────┐ |
| 123 | + │ Whisper │ ◄── faster-whisper |
| 124 | + │ transcribe │ CTranslate2, CPU/GPU |
| 125 | + └──────┬──────┘ |
| 126 | + │ audio deleted from temp dir |
| 127 | + ▼ |
| 128 | + ┌─────────────────────────────┐ |
| 129 | + │ --output / output_dir / stdout │ |
| 130 | + └─────────────────────────────┘ |
| 131 | +``` |
| 132 | + |
| 133 | +### Caption lookup |
| 134 | + |
| 135 | +Uses [`youtube-transcript-api`](https://github.com/jdepoix/youtube-transcript-api) to fetch captions from YouTube's internal API — no audio download needed, instant. Prefers manually uploaded captions over auto-generated ones. Falls back to Whisper when: |
| 136 | + |
| 137 | +- The video owner disabled captions |
| 138 | +- The video is too new for auto-generation to finish |
| 139 | +- YouTube rate-limits the request |
| 140 | + |
| 141 | +### Whisper transcription |
| 142 | + |
| 143 | +1. **Download** — `yt-dlp` fetches the best available audio stream to a temp directory. |
| 144 | +2. **Transcribe** — `faster-whisper` runs the Whisper model with `vad_filter=True` to skip silent segments. Language is auto-detected unless overridden. |
| 145 | +3. **Cleanup** — temp audio file deleted automatically. |
| 146 | + |
| 147 | +`faster-whisper` uses [CTranslate2](https://github.com/OpenNMT/CTranslate2) under the hood — 4–8× faster than OpenAI's original Whisper on CPU using `int8` quantization. |
| 148 | + |
| 149 | +## Dependencies |
| 150 | + |
| 151 | +| Package | Purpose | |
| 152 | +|---|---| |
| 153 | +| [`youtube-transcript-api`](https://github.com/jdepoix/youtube-transcript-api) | Fetch YouTube captions | |
| 154 | +| [`yt-dlp`](https://github.com/yt-dlp/yt-dlp) | Download audio | |
| 155 | +| [`faster-whisper`](https://github.com/SYSTRAN/faster-whisper) | Local speech-to-text | |
| 156 | +| [`typer`](https://typer.tiangolo.com/) | CLI framework | |
| 157 | +| [`rich`](https://rich.readthedocs.io/) | Terminal output | |
0 commit comments