layout	minimal
title	Transcript Import
description	Runbook for importing transcript files and mapping them to canonical video assets.
breadcrumb	Transcript Import
breadcrumb_parent_name	Docs
breadcrumb_parent_url	/backlog/docs/
id	doc-008

{% include breadcrumbs.html %}

Transcript Import

Automated YouTube Transcription (Async)

For missing transcripts of videos hosted on YouTube, we use a local asynchronous pipeline powered by zdots-ctx and whisper.cpp. This process is designed to be highly resilient and fully resumable.

Enqueue Pending Videos: Identify all pending video_assets with a YouTube ID and enqueue them to the local zdots-ctx worker.
```
./bin/batch_ztranscribe.rb
```
Start the Background Worker (Safely): Ensure the worker is running to process the queue asynchronously. CRITICAL: When running in the background, you MUST redirect stdin from /dev/null. Otherwise the underlying ffmpeg process (used by yt-transcribe to convert audio) attempts to read from the terminal, and the terminal driver stops the entire background job with SIGTTIN, which leaves the worker suspended until it is brought to the foreground.
```
# Run in background safely:
zdots-ctx worker --type transcription < /dev/null &
```
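The same pattern works for any command, which makes it easy to try without the worker. The sketch below is an illustration only: `run_detached`, `BG_PID`, and the log path in the usage comment are hypothetical names, not part of zdots-ctx or this repository.

```shell
# Hypothetical helper (not part of zdots-ctx): start a command in the
# background with stdin detached from the terminal.
# Usage: run_detached LOGFILE COMMAND [ARGS...]
run_detached() {
  log_file=$1
  shift
  # stdin comes from /dev/null, so a child that tries to read the terminal
  # (ffmpeg, for example) sees end-of-file instead of being stopped.
  "$@" < /dev/null >> "$log_file" 2>&1 &
  BG_PID=$!
}

# Example (the log path is an assumption):
# run_detached tmp/transcription-worker.log zdots-ctx worker --type transcription
# echo "worker started with PID $BG_PID"
```

Sending output to a log file also keeps a record of the worker's messages once the terminal is closed.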
Pausing and Resuming (Interruption Recovery): If the process is interrupted (e.g., power loss, internet outage, or manual kill):
- Clear any jobs that were stuck in the "running" state when the interruption occurred:
```
zdots-ctx clear-stale-jobs
```
- Restart the worker using the safe background command from the "Start the Background Worker" step. It will automatically pick up where it left off.
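The two recovery steps can be wrapped in a single helper. This is a minimal sketch under stated assumptions: `recover_worker`, the `ZDOTS_CTX` override variable, `WORKER_PID`, and the default log path are hypothetical; only the `clear-stale-jobs` and `worker --type transcription` invocations come from this runbook.

```shell
# Hypothetical wrapper around the two recovery steps in this runbook.
# ZDOTS_CTX lets you point at a different binary; it defaults to zdots-ctx.
# Usage: recover_worker [LOGFILE]
recover_worker() {
  ctx=${ZDOTS_CTX:-zdots-ctx}
  log_file=${1:-tmp/transcription-worker.log}   # assumed log location

  # 1. Release jobs left in the "running" state by the interruption.
  "$ctx" clear-stale-jobs || return 1

  # 2. Restart the worker with stdin detached so ffmpeg cannot stop it.
  "$ctx" worker --type transcription < /dev/null >> "$log_file" 2>&1 &
  WORKER_PID=$!
}
```

The function returns early if clearing stale jobs fails, so a worker is never restarted on top of a queue that still holds stuck jobs.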
Stage Completed Transcripts: As the worker writes finished transcripts to ~/Downloads/transcripts/, map each one back to its canonical video_asset_id and stage it for ingestion.
```
./bin/stage_completed_transcripts.rb
```
Ingest Staged Transcripts: Use the standard pipeline to ingest the staged files into _data/transcripts/.
```
./bin/transcripts ingest --source-dir tmp/transcript-id-staging --min-confidence 0.9 --auto-commit
```
Audit and Extract Insights: Once ingested, prepare the newly available transcripts for the conversational audit to extract SEO metadata, separate speakers, and pull durable insights.
```
bundle exec rake audit:prepare_wave
```
Then activate the transcript-conversational-audit skill to process the generated prompts.

Canonical Model

Transcript files live in _data/transcripts/*.yml.
Video assets reference transcripts through the transcript_id field in _data/video_assets.yml.
_data/transcripts.yml is legacy and not used for active content.
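To illustrate the reference relationship, the sketch below lists transcript IDs that are referenced but have no file. It is not a substitute for `./bin/transcripts audit`, and it rests on two assumptions this runbook does not confirm: that `video_assets.yml` contains plain `transcript_id: <id>` lines, and that each transcript is stored as `<id>.yml`. The function name `missing_transcripts` is hypothetical.

```shell
# Hypothetical integrity check. Assumptions (not confirmed by this runbook):
#   - video_assets.yml has lines of the form "transcript_id: <id>"
#   - each transcript is stored as <transcripts_dir>/<id>.yml
# Prints every referenced transcript_id that has no matching file.
# Usage: missing_transcripts ASSETS_FILE TRANSCRIPTS_DIR
missing_transcripts() {
  assets_file=$1
  transcripts_dir=$2
  # Strip trailing whitespace, then keep only the value of transcript_id lines.
  sed -n -e 's/[[:space:]]*$//' \
         -e 's/^[[:space:]-]*transcript_id:[[:space:]]*//p' "$assets_file" |
    tr -d "\"'" |
    while IFS= read -r id; do
      [ -n "$id" ] || continue
      [ -f "$transcripts_dir/$id.yml" ] || printf '%s\n' "$id"
    done
}

# Example:
# missing_transcripts _data/video_assets.yml _data/transcripts
```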

Commands

Audit current repository transcript integrity:
- ./bin/transcripts audit
Build ID-suffixed staging files (recommended for ambiguous filenames):
- ./bin/transcripts prepare --source-dir /Volumes/Dock_1TB/vimeo/outbox --output-dir tmp/transcript-id-staging --min-confidence 0.8 --clean-output
Run import in dry-run mode:
- ./bin/transcripts dry-run --source-dir tmp/transcript-id-staging --min-confidence 0.9
Review output reports:
- tmp/transcript-import-report.json
- tmp/transcript-import-report.md
Apply high-confidence mappings:
- ./bin/transcripts ingest --source-dir tmp/transcript-id-staging --min-confidence 0.9
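Chained together, the sequence above might look like the following sketch. `staged_import` and the `TRANSCRIPTS_BIN` override are hypothetical names; the subcommands, flags, paths, and thresholds are copied from the commands listed above. The function deliberately stops after the dry run so the reports can be reviewed before `ingest` is run by hand.

```shell
# Hypothetical helper: run the non-destructive part of the staged import.
# Usage: staged_import SOURCE_DIR
staged_import() {
  bin=${TRANSCRIPTS_BIN:-./bin/transcripts}
  source_dir=${1:-}
  staging_dir=tmp/transcript-id-staging

  [ -n "$source_dir" ] || { echo "usage: staged_import SOURCE_DIR" >&2; return 2; }

  # Each step must succeed before the next one runs.
  "$bin" audit || return 1
  "$bin" prepare --source-dir "$source_dir" --output-dir "$staging_dir" \
    --min-confidence 0.8 --clean-output || return 1
  "$bin" dry-run --source-dir "$staging_dir" --min-confidence 0.9 || return 1

  echo "Review tmp/transcript-import-report.md, then run:"
  echo "  $bin ingest --source-dir $staging_dir --min-confidence 0.9"
}

# Example:
# staged_import /Volumes/Dock_1TB/vimeo/outbox
```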

Direct Import Mode

If filenames already include explicit IDs and do not need staging:

- ./bin/transcripts dry-run --source-dir /Volumes/Dock_1TB/vimeo/outbox --min-confidence 0.9
- ./bin/transcripts ingest --source-dir /Volumes/Dock_1TB/vimeo/outbox --min-confidence 0.9

Report Files

Mapping report: tmp/transcript-import-report.json
Human-readable summary: tmp/transcript-import-report.md

Legacy sequence (kept for reference)

Run import in dry-run mode:
- ./bin/transcripts dry-run --source-dir /Volumes/Dock_1TB/vimeo/outbox --min-confidence 0.9
Review output reports:
- tmp/transcript-import-report.json
- tmp/transcript-import-report.md
Apply high-confidence mappings:
- ./bin/transcripts ingest --source-dir /Volumes/Dock_1TB/vimeo/outbox --min-confidence 0.9
Re-run pipeline validation:
- ./bin/transcripts validate

One-Command Batch Mode

Ingest + audit + validate + commit:
- ./bin/transcripts ingest --source-dir /Volumes/Dock_1TB/vimeo/outbox --auto-commit
Ingest + audit + validate + commit + push:
- ./bin/transcripts ingest --source-dir /Volumes/Dock_1TB/vimeo/outbox --auto-commit --auto-push

Notes

Supported source file formats: .txt, .md, .srt, .vtt.
Existing transcript files are not overwritten unless --force is supplied.
Low-confidence mappings are never auto-applied; review those in the report first.
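A quick pre-flight check can catch files with other extensions before an import is attempted. The sketch below is hypothetical (`is_supported_transcript` and `list_unsupported` are not part of `./bin/transcripts`) and assumes case-sensitive extension matching, which the real tool may not share.

```shell
# Hypothetical pre-flight check: succeed only for the extensions listed above.
# Assumption: matching is case-sensitive; the real tool may be more lenient.
is_supported_transcript() {
  case "$1" in
    *.txt|*.md|*.srt|*.vtt) return 0 ;;
    *) return 1 ;;
  esac
}

# Print every regular file in a directory whose extension is not in the list.
# Usage: list_unsupported DIR
list_unsupported() {
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    is_supported_transcript "$f" || printf '%s\n' "$f"
  done
}

# Example:
# list_unsupported /Volumes/Dock_1TB/vimeo/outbox
```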
