| layout | minimal |
|---|---|
| title | Transcript Import |
| description | Runbook for importing transcript files and mapping them to canonical video assets. |
| breadcrumb | Transcript Import |
| breadcrumb_parent_name | Docs |
| breadcrumb_parent_url | /backlog/docs/ |
| id | doc-008 |
{% include breadcrumbs.html %}
For missing transcripts of videos hosted on YouTube, we use a local asynchronous pipeline powered by zdots-ctx and whisper.cpp. This process is designed to be highly resilient and fully resumable.
- Enqueue Pending Videos:
Identify all pending video_assets with a YouTube ID and enqueue them to the local zdots-ctx worker.
./bin/batch_ztranscribe.rb
- Start the Background Worker (Safely):
Ensure the worker is running to process the queue asynchronously.
CRITICAL: When running in the background, you MUST redirect stdin from /dev/null. If you do not, the underlying ffmpeg process (used by yt-transcribe to convert audio) will attempt to read from the terminal, immediately suspending the entire worker process.
# Run in background safely:
zdots-ctx worker --type transcription < /dev/null &
- Pausing and Resuming (Interruption Recovery):
If the process is interrupted (e.g., power loss, internet outage, or manual kill):
- Clear any jobs that were stuck in the "running" state when the interruption occurred:
zdots-ctx clear-stale-jobs
- Restart the worker using the safe background command from Step 2 (Start the Background Worker). It will automatically pick up where it left off.
- Stage Completed Transcripts:
As the worker finishes downloading and transcribing to ~/Downloads/transcripts/, map the completed transcripts back to their canonical video_asset_id and stage them for ingestion.
./bin/stage_completed_transcripts.rb
- Ingest Staged Transcripts:
Use the standard pipeline to ingest the staged files into _data/transcripts/.
./bin/transcripts ingest --source-dir tmp/transcript-id-staging --min-confidence 0.9 --auto-commit
- Audit and Extract Insights:
Once ingested, prepare the newly available transcripts for the conversational audit to extract SEO metadata, separate speakers, and pull durable insights.
bundle exec rake audit:prepare_wave
Then activate the transcript-conversational-audit skill to process the generated prompts.
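The background-worker and recovery steps above can be wrapped in a small shell helper. This is a minimal sketch, not part of the repository: the function names `start_worker_detached` and `resume_after_interruption`, the `WORKER_PID` variable, and the default log file name are hypothetical. Only the `zdots-ctx worker --type transcription` and `zdots-ctx clear-stale-jobs` commands come from this runbook.

```shell
#!/bin/sh
# Hypothetical helpers; only the zdots-ctx commands are taken from the runbook.

# Start a command in the background with stdin detached from the terminal.
# Usage: start_worker_detached LOG_FILE COMMAND [ARGS...]
start_worker_detached() {
  log_file=$1
  shift
  # With stdin redirected from /dev/null, a child such as ffmpeg sees EOF
  # instead of trying to read the terminal and getting suspended.
  "$@" < /dev/null >> "$log_file" 2>&1 &
  WORKER_PID=$!
  echo "Worker started with PID $WORKER_PID (log: $log_file)"
}

# Recovery after an interruption: clear stale jobs, then restart the worker.
# The log file argument is optional and defaults to worker.log (an assumption).
resume_after_interruption() {
  zdots-ctx clear-stale-jobs || return 1
  start_worker_detached "${1:-worker.log}" zdots-ctx worker --type transcription
}
```

After a power loss or manual kill, calling `resume_after_interruption "$HOME/zdots-worker.log"` would run both recovery steps in order. Appending output to a log file is a design choice of this sketch, so that worker output is not lost once the terminal closes.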
- Transcript files live in _data/transcripts/*.yml.
- Video assets reference transcripts via _data/video_assets.yml transcript_id.
- _data/transcripts.yml is legacy and not used for active content.
- Audit current repository transcript integrity:
./bin/transcripts audit
- Build ID-suffixed staging files (recommended for ambiguous filenames):
./bin/transcripts prepare --source-dir /Volumes/Dock_1TB/vimeo/outbox --output-dir tmp/transcript-id-staging --min-confidence 0.8 --clean-output
- Run import in dry-run mode:
./bin/transcripts dry-run --source-dir tmp/transcript-id-staging --min-confidence 0.9
- Review output reports:
tmp/transcript-import-report.json
tmp/transcript-import-report.md
- Apply high-confidence mappings:
./bin/transcripts ingest --source-dir tmp/transcript-id-staging --min-confidence 0.9
If filenames already include explicit IDs and do not need staging:
./bin/transcripts dry-run --source-dir /Volumes/Dock_1TB/vimeo/outbox --min-confidence 0.9
./bin/transcripts ingest --source-dir /Volumes/Dock_1TB/vimeo/outbox --min-confidence 0.9
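The staged workflow (audit, prepare, dry-run, review, ingest) could be chained into one helper that stops at the first failure and pauses for a manual review before anything is applied. This is a sketch under stated assumptions: the `run_staged_import` function, the `TRANSCRIPTS_BIN` override variable, and the interactive confirmation prompt are hypothetical additions. The subcommands and flags are the ones listed in this runbook.

```shell
#!/bin/sh
# Hypothetical wrapper around the documented ./bin/transcripts subcommands.
# TRANSCRIPTS_BIN is an assumed override hook, useful for testing.
TRANSCRIPTS_BIN=${TRANSCRIPTS_BIN:-./bin/transcripts}

# Usage: run_staged_import SOURCE_DIR [STAGING_DIR]
run_staged_import() {
  source_dir=$1
  staging_dir=${2:-tmp/transcript-id-staging}

  # Each step must succeed before the next one runs.
  "$TRANSCRIPTS_BIN" audit || return 1
  "$TRANSCRIPTS_BIN" prepare --source-dir "$source_dir" \
    --output-dir "$staging_dir" --min-confidence 0.8 --clean-output || return 1
  "$TRANSCRIPTS_BIN" dry-run --source-dir "$staging_dir" --min-confidence 0.9 || return 1

  # Manual gate: the reports should be read before mappings are applied.
  echo "Review tmp/transcript-import-report.md before continuing."
  printf 'Apply high-confidence mappings? [y/N] '
  read -r answer
  if [ "$answer" != "y" ]; then
    echo "Aborted before ingest."
    return 1
  fi

  "$TRANSCRIPTS_BIN" ingest --source-dir "$staging_dir" --min-confidence 0.9
}
```

A typical call would be `run_staged_import /Volumes/Dock_1TB/vimeo/outbox`. Any answer other than `y` leaves the repository untouched, since only the `ingest` step writes transcript files.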
- Mapping report: tmp/transcript-import-report.json
- Human-readable summary: tmp/transcript-import-report.md
- Run import in dry-run mode:
./bin/transcripts dry-run --source-dir /Volumes/Dock_1TB/vimeo/outbox --min-confidence 0.9
- Review output reports:
tmp/transcript-import-report.json
tmp/transcript-import-report.md
- Apply high-confidence mappings:
./bin/transcripts ingest --source-dir /Volumes/Dock_1TB/vimeo/outbox --min-confidence 0.9
- Re-run pipeline validation:
./bin/transcripts validate
- Ingest + audit + validate + commit:
./bin/transcripts ingest --source-dir /Volumes/Dock_1TB/vimeo/outbox --auto-commit
- Ingest + audit + validate + commit + push:
./bin/transcripts ingest --source-dir /Volumes/Dock_1TB/vimeo/outbox --auto-commit --auto-push
- Supported source file formats: .txt, .md, .srt, .vtt.
- Existing transcript files are not overwritten unless --force is supplied.
- Low-confidence mappings are never auto-applied; review those in the report first.
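Because only four source formats are supported, it may help to check a source directory for stray files before running an import. The helper below is a hypothetical pre-flight check, not a feature of `./bin/transcripts`. It assumes extensions are matched case-sensitively, which the runbook does not specify.

```shell
#!/bin/sh
# Hypothetical pre-flight check: list files whose extension is not one of the
# supported formats named in the notes above (.txt, .md, .srt, .vtt).
# Usage: list_unsupported_files SOURCE_DIR
list_unsupported_files() {
  find "$1" -type f \
    ! -name '*.txt' ! -name '*.md' ! -name '*.srt' ! -name '*.vtt' | sort
}
```

Running `list_unsupported_files /Volumes/Dock_1TB/vimeo/outbox` prints one path per line and prints nothing when every file has a supported extension. Hidden files such as `.DS_Store` are listed as well, since they do not match any supported pattern.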