Commit eea41ca

Transcribe files
1 parent ad847e7 commit eea41ca

3 files changed

Lines changed: 201 additions & 130 deletions

README.md

Lines changed: 68 additions & 33 deletions
@@ -1,13 +1,13 @@
 # yt-transcribe
 
-CLI tool that extracts transcripts from YouTube videos. Fetches existing captions when available; otherwise downloads audio and transcribes locally with [Whisper](https://github.com/SYSTRAN/faster-whisper).
+CLI tool that transcribes YouTube videos and local audio/video files. For YouTube, it fetches existing captions when available; otherwise downloads audio and transcribes locally with [Whisper](https://github.com/SYSTRAN/faster-whisper). Local files always go through Whisper directly.
 
 ## Requirements
 
 - Python 3.11+
 - [uv](https://docs.astral.sh/uv/)
 - [just](https://just.systems/)
-- [ffmpeg](https://ffmpeg.org/) — required when Whisper is used (no captions available)
+- [ffmpeg](https://ffmpeg.org/) — required by `yt-dlp` when downloading YouTube audio (not needed for local files, which are decoded by faster-whisper's bundled FFmpeg)
 
 ### macOS
 
@@ -43,7 +43,7 @@ Then restart your terminal so the new PATH entries take effect.
 > **Note:** `faster-whisper` requires the [Microsoft Visual C++ Redistributable](https://aka.ms/vs/17/release/vc_redist.x64.exe) (x64). Install it if you see a DLL error on first Whisper run.
 
 ## Setup
-
+""
 ```bash
 cd transcribe
 just install
@@ -52,6 +52,11 @@ just install
 ## Usage
 
 ```bash
+# Transcribe a local audio or video file (any format FFmpeg supports)
+just run recording.mp3
+just run /path/to/interview.mp4
+just run ~/Downloads/lecture.m4a
+
 # Fetch captions if available, otherwise run Whisper — saves to ~/Downloads by default
 just run "https://youtube.com/watch?v=VIDEO_ID"
 
@@ -121,12 +126,29 @@ vad_filter = true # skip silent segments (recommended)
 | `--output` | `-o` || Save to this exact path (overrides `output_dir` in config) |
 | `--model` | `-m` | from config | Whisper model size |
 | `--language` | `-l` | auto-detect | Override language, e.g. `en`, `fr`, `de`. Omit to auto-detect — useful only when detection gets it wrong or the video has mixed-language content. |
-| `--force-whisper` | `-w` | off | Skip caption lookup, always use Whisper |
+| `--force-whisper` | `-w` | off | Skip caption lookup, always use Whisper (ignored for local files — Whisper is always used) |
 
 By default the transcript is **saved to `~/Downloads`** using the video title as the filename. Change `output_dir` in config to save elsewhere. Use `--print` to get stdout behaviour instead.
 
 CLI flags always override config values.
 
+## Whisper performance
+
+Measured on a 4m 52s audio clip (Polish speech, CPU, `int8`):
+
+| Model | Elapsed | Seconds per audio-minute | Time for 1h video |
+|---|---|---|---|
+| `tiny` | 15.4s | 3.2s | ~3 min |
+| `base` | 25.4s | 5.2s | ~5 min |
+| `small` | 70.1s | 14.4s | ~14 min |
+| `turbo` | 90.8s | 18.7s | ~19 min |
+
+`medium` and `large-v3` skipped — extrapolate the trend.
+
+- `tiny` is nearly 6× faster than `turbo` on CPU
+- `turbo` is the default because it trades that speed for much better accuracy — especially on non-English audio
+- For quick drafts or batch jobs where accuracy matters less, `base` is a good middle ground
+
 ## Whisper models
 
 Model weights are downloaded from HuggingFace on first use and cached at `~/.cache/huggingface/hub/` (macOS/Linux) or `%USERPROFILE%\.cache\huggingface\hub\` (Windows). Subsequent runs use the cached copy — no re-download. Override with the `HF_HUB_CACHE` environment variable.
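
The derived columns in the Whisper performance table added by this hunk follow from simple arithmetic on the measured elapsed times. The sketch below reproduces them. It assumes transcription time scales linearly with audio length, which the table implies but which was measured on a single clip only. The function and constant names are illustrative and not part of the project.

```python
# Measured values copied from the README table (4m 52s clip, CPU, int8).
CLIP_SECONDS = 4 * 60 + 52
ELAPSED_SECONDS = {"tiny": 15.4, "base": 25.4, "small": 70.1, "turbo": 90.8}


def seconds_per_audio_minute(model: str) -> float:
    """Processing seconds needed per minute of audio for a measured model."""
    return ELAPSED_SECONDS[model] / (CLIP_SECONDS / 60)


def estimated_minutes(model: str, audio_minutes: float) -> float:
    """Estimated wall-clock minutes, assuming time grows linearly with audio length."""
    return seconds_per_audio_minute(model) * audio_minutes / 60


if __name__ == "__main__":
    for model in ELAPSED_SECONDS:
        rate = seconds_per_audio_minute(model)
        hour = estimated_minutes(model, 60)
        print(f"{model:>5}: {rate:.1f} s per audio-minute, ~{hour:.0f} min for 1 h")
```

The same functions can be used to sanity-check the "nearly 6×" claim: `ELAPSED_SECONDS["turbo"] / ELAPSED_SECONDS["tiny"]` is about 5.9.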
@@ -143,35 +165,37 @@ Model weights are downloaded from HuggingFace on first use and cached at `~/.cac
 ## How it works
 
 ```
-┌─────────────────────┐
-│ YouTube URL / ID │
-└──────────┬──────────┘
-│ extract video ID
-
-┌────────────────────────┐
-│ Fetch YouTube captions │ ◄── youtube-transcript-api
-│ (any available lang) │ prefers manual over auto-generated
-└────────────┬───────────┘
-
-┌───────────────┴──────────────┐
-│ Captions found? │
-▼ ▼
-Yes: done No: fallback
-
-┌──────▼──────┐
-│ Download │ ◄── yt-dlp + ffmpeg
-│ audio │ best quality stream
-└──────┬──────┘
-
-┌──────▼──────┐
-│ Whisper │ ◄── faster-whisper
-│ transcribe │ CTranslate2, CPU/GPU
-└──────┬──────┘
-│ audio deleted from temp dir
-
-┌─────────────────────────────┐
-│ --output / output_dir / stdout │
-└─────────────────────────────┘
+┌─────────────────────┐ ┌──────────────────────┐
+│ YouTube URL / ID │ │ Local audio/video │
+└──────────┬──────────┘ └───────────┬──────────┘
+│ extract video ID │
+▼ │
+┌────────────────────────┐ │
+│ Fetch YouTube captions │ ◄── youtube-transcript-api
+│ (any available lang) │ prefers manual over auto-generated
+└────────────┬───────────┘ │
+│ │
+┌──────────────┴─────────────┐ │
+│ Captions found? │ │
+▼ ▼ │
+Yes: done No: fallback │
+│ │
+┌──────▼──────┐ │
+│ Download │ ◄── yt-dlp + ffmpeg
+│ audio │ best quality stream
+└──────┬──────┘ │
+│ │
+└──────────┬──────────┘
+
+┌──────▼──────┐
+│ Whisper │ ◄── faster-whisper
+│ transcribe │ CTranslate2, CPU/GPU
+└──────┬──────┘
+│ temp audio deleted (YouTube only)
+
+┌─────────────────────────────┐
+│ --output / output_dir / stdout │
+└─────────────────────────────┘
 ```
 
 ### Caption lookup
@@ -182,6 +206,17 @@ Uses [`youtube-transcript-api`](https://github.com/jdepoix/youtube-transcript-ap
 - The video is too new for auto-generation to finish
 - YouTube rate-limits the request
 
+### Supported file formats
+
+faster-whisper uses [PyAV](https://github.com/PyAV-Org/PyAV) (bundled FFmpeg) to decode audio, so any format FFmpeg handles is accepted — no system ffmpeg installation required for local files.
+
+Common formats:
+
+| Type | Extensions |
+|---|---|
+| Audio | `.mp3` `.m4a` `.aac` `.wav` `.flac` `.ogg` `.opus` `.wma` `.aiff` |
+| Video (audio extracted) | `.mp4` `.mkv` `.mov` `.avi` `.webm` `.ts` |
+
 ### Whisper transcription
 
 1. **Download**: `yt-dlp` fetches the best available audio stream to a temp directory.
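
The updated "How it works" diagram and the `--force-whisper` flag description in this README diff describe which stages run for each kind of input. The sketch below restates that decision logic in code. It is an illustration based only on the README text: `Plan`, `plan_transcription`, and the stage names are hypothetical, and the project's actual implementation is not shown in this diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    steps: tuple[str, ...]    # stages in execution order
    uses_system_ffmpeg: bool  # README: yt-dlp needs ffmpeg when downloading audio
    deletes_temp_audio: bool  # README: temp audio is deleted for YouTube only


def plan_transcription(kind: str, captions_found: bool = False,
                       force_whisper: bool = False) -> Plan:
    """Return the stages the README describes for a given input kind."""
    if kind == "local":
        # Local files go straight to Whisper; --force-whisper has no effect here.
        return Plan(("whisper",), False, False)
    if kind != "youtube":
        raise ValueError(f"unknown input kind: {kind!r}")
    if force_whisper:
        # Caption lookup is skipped entirely.
        return Plan(("download", "whisper"), True, True)
    if captions_found:
        return Plan(("caption_lookup",), False, False)
    # No captions: fall back to downloading audio and running Whisper.
    return Plan(("caption_lookup", "download", "whisper"), True, True)
```

For example, `plan_transcription("local", force_whisper=True)` and `plan_transcription("local")` return the same plan, which mirrors the "ignored for local files" note in the flag table.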

justfile

Lines changed: 2 additions & 2 deletions
@@ -2,8 +2,8 @@
 install:
     uv sync
 
-# Transcribe a YouTube video
-# Usage: just run <url> [--model turbo] [--force-whisper] [--output file.txt]
+# Transcribe a YouTube URL, video ID, or local audio/video file
+# Usage: just run <url-or-path> [--model turbo] [--force-whisper] [--output file.txt]
 [positional-arguments]
 run *args:
     @uv run transcribe run "$@"
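
The recipe now accepts a single `<url-or-path>` argument, so something has to decide whether that argument is a YouTube URL, a bare video ID, or a local file. One possible way to make that decision with only the standard library is sketched below. `classify_input` is a hypothetical helper, not the project's code, and the 11-character ID pattern is an assumption about the usual shape of YouTube video IDs.

```python
import re
from pathlib import Path
from urllib.parse import parse_qs, urlparse

# Assumption: YouTube video IDs are 11 characters from [A-Za-z0-9_-].
VIDEO_ID_RE = re.compile(r"[A-Za-z0-9_-]{11}")


def classify_input(arg: str) -> tuple[str, str]:
    """Return ("local", path) or ("youtube", video_id); raise ValueError otherwise."""
    path = Path(arg).expanduser()
    if path.is_file():
        # Existing files win, so a file named like a video ID is still treated as local.
        return "local", str(path)

    parsed = urlparse(arg)
    host = (parsed.hostname or "").lower().removeprefix("www.")
    if host in {"youtube.com", "m.youtube.com"}:
        candidates = parse_qs(parsed.query).get("v", [])
    elif host == "youtu.be":
        candidates = [parsed.path.lstrip("/")]
    else:
        candidates = [arg]  # possibly a bare video ID

    for candidate in candidates:
        if VIDEO_ID_RE.fullmatch(candidate):
            return "youtube", candidate
    raise ValueError(f"not a local file, YouTube URL, or video ID: {arg!r}")
```

Checking for an existing file first keeps the local path unambiguous; a missing file such as a mistyped `recording.mp3` ends in a `ValueError` rather than being passed on as a video ID.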
