---
layout: minimal
title: Transcript Import
description: Runbook for importing transcript files and mapping them to canonical video assets.
breadcrumb: Transcript Import
breadcrumb_parent_name: Docs
breadcrumb_parent_url: /backlog/docs/
id: doc-008
---

{% include breadcrumbs.html %}

Transcript Import

Automated YouTube Transcription (Async)

For videos hosted on YouTube that are missing transcripts, we use a local asynchronous pipeline powered by zdots-ctx and whisper.cpp. The pipeline is resumable: after an interruption, stale jobs are cleared and the worker continues from the remaining queue.

  1. Enqueue Pending Videos: Identify all pending video_assets with a YouTube ID and enqueue them to the local zdots-ctx worker.
    ./bin/batch_ztranscribe.rb
  2. Start the Background Worker (Safely): Ensure the worker is running to process the queue asynchronously. CRITICAL: When running the worker in the background, you MUST redirect stdin from /dev/null. Otherwise the underlying ffmpeg process (used by yt-transcribe to convert audio) will try to read from the terminal; a background job that reads from the terminal receives SIGTTIN, which immediately suspends the entire worker process.
    # Run in background safely:
    zdots-ctx worker --type transcription < /dev/null &
  3. Pausing and Resuming (Interruption Recovery): If the process is interrupted (e.g., power loss, internet outage, or manual kill):
    • Clear any jobs that were stuck in the "running" state when the interruption occurred:
      zdots-ctx clear-stale-jobs
    • Restart the worker using the safe command from Step 2. It will automatically pick up where it left off.
  4. Stage Completed Transcripts: As the worker writes finished transcripts to ~/Downloads/transcripts/, map each one back to its canonical video_asset_id and stage it for ingestion.
    ./bin/stage_completed_transcripts.rb
  5. Ingest Staged Transcripts: Use the standard pipeline to ingest the staged files into _data/transcripts/.
    ./bin/transcripts ingest --source-dir tmp/transcript-id-staging --min-confidence 0.9 --auto-commit
  6. Audit and Extract Insights: Once ingested, prepare the newly available transcripts for the conversational audit to extract SEO metadata, separate speakers, and pull durable insights.
    bundle exec rake audit:prepare_wave
    Then activate the transcript-conversational-audit skill to process the generated prompts.
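
The recovery procedure in steps 2 and 3 can be wrapped in a small shell helper. The following is a minimal sketch, assuming zdots-ctx is on the PATH. The function names, the DETACHED_PID and WORKER_LOG variables, and the log path are hypothetical and not part of the tooling; only the two zdots-ctx invocations come from this runbook.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the log location are illustrative assumptions.

# Assumed log location; the runbook does not specify one.
WORKER_LOG="${WORKER_LOG:-tmp/transcription-worker.log}"

# Run any command in the background with stdin detached from the terminal,
# so child processes such as ffmpeg see EOF instead of attempting a tty read.
# The PID of the background job is stored in DETACHED_PID.
run_detached() {
  mkdir -p "$(dirname "$WORKER_LOG")"
  "$@" < /dev/null >> "$WORKER_LOG" 2>&1 &
  DETACHED_PID=$!
}

# Recovery after an interruption: clear jobs stuck in "running", then restart.
resume_transcription_worker() {
  zdots-ctx clear-stale-jobs || return 1
  run_detached zdots-ctx worker --type transcription
  echo "worker restarted (pid ${DETACHED_PID}); log: ${WORKER_LOG}"
}
```

After sourcing the file, calling resume_transcription_worker performs the two recovery actions in order. Appending output to a log file is a convenience added here, not something the runbook requires; drop the output redirection if you prefer the worker to write to the terminal.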

Canonical Model

  • Transcript files live in _data/transcripts/*.yml.
  • Video assets reference transcripts via the transcript_id field in _data/video_assets.yml.
  • _data/transcripts.yml is legacy and not used for active content.

Commands

  1. Audit current repository transcript integrity:
    • ./bin/transcripts audit
  2. Build ID-suffixed staging files (recommended for ambiguous filenames):
    • ./bin/transcripts prepare --source-dir /Volumes/Dock_1TB/vimeo/outbox --output-dir tmp/transcript-id-staging --min-confidence 0.8 --clean-output
  3. Run import in dry-run mode:
    • ./bin/transcripts dry-run --source-dir tmp/transcript-id-staging --min-confidence 0.9
  4. Review output reports:
    • tmp/transcript-import-report.json
    • tmp/transcript-import-report.md
  5. Apply high-confidence mappings:
    • ./bin/transcripts ingest --source-dir tmp/transcript-id-staging --min-confidence 0.9
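
The first three commands above can be chained so that a failure stops the sequence before anything is ingested. This is a minimal sketch, assuming each ./bin/transcripts subcommand exits non-zero on failure (the runbook does not confirm this). The function name and the TRANSCRIPTS_BIN and STAGING_DIR variables are hypothetical; the subcommands and flags are the ones listed above.

```shell
#!/usr/bin/env bash
# Sketch only: chains audit, prepare, and dry-run, then stops before ingest.

# Overridable so the sequence can be exercised without the real CLI.
TRANSCRIPTS_BIN="${TRANSCRIPTS_BIN:-./bin/transcripts}"
STAGING_DIR="${STAGING_DIR:-tmp/transcript-id-staging}"

# Run the read-only and staging steps in order; return 1 at the first failure.
# Assumption: each subcommand signals failure through its exit status.
stage_and_dry_run() {
  local source_dir="$1"
  "$TRANSCRIPTS_BIN" audit || return 1
  "$TRANSCRIPTS_BIN" prepare --source-dir "$source_dir" \
    --output-dir "$STAGING_DIR" --min-confidence 0.8 --clean-output || return 1
  "$TRANSCRIPTS_BIN" dry-run --source-dir "$STAGING_DIR" --min-confidence 0.9 || return 1
  # Ingest is deliberately left to the operator, after reviewing the reports.
  echo "Review tmp/transcript-import-report.md, then run:"
  echo "  $TRANSCRIPTS_BIN ingest --source-dir $STAGING_DIR --min-confidence 0.9"
}
```

For example, stage_and_dry_run /Volumes/Dock_1TB/vimeo/outbox runs the three steps against the outbox and prints the ingest command to run once the reports have been reviewed. Keeping ingest out of the function preserves the manual review step.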

Direct Import Mode

If filenames already include explicit IDs and do not need staging:

  • ./bin/transcripts dry-run --source-dir /Volumes/Dock_1TB/vimeo/outbox --min-confidence 0.9
  • ./bin/transcripts ingest --source-dir /Volumes/Dock_1TB/vimeo/outbox --min-confidence 0.9

Report Files

  • Mapping report: tmp/transcript-import-report.json
  • Human-readable summary: tmp/transcript-import-report.md

Legacy sequence (kept for reference)

  1. Run import in dry-run mode:
    • ./bin/transcripts dry-run --source-dir /Volumes/Dock_1TB/vimeo/outbox --min-confidence 0.9
  2. Review output reports:
    • tmp/transcript-import-report.json
    • tmp/transcript-import-report.md
  3. Apply high-confidence mappings:
    • ./bin/transcripts ingest --source-dir /Volumes/Dock_1TB/vimeo/outbox --min-confidence 0.9
  4. Re-run pipeline validation:
    • ./bin/transcripts validate

One-Command Batch Mode

  • Ingest + audit + validate + commit:
    • ./bin/transcripts ingest --source-dir /Volumes/Dock_1TB/vimeo/outbox --auto-commit
  • Ingest + audit + validate + commit + push:
    • ./bin/transcripts ingest --source-dir /Volumes/Dock_1TB/vimeo/outbox --auto-commit --auto-push

Notes

  • Supported source file formats: .txt, .md, .srt, .vtt.
  • Existing transcript files are not overwritten unless --force is supplied.
  • Low-confidence mappings are never auto-applied; review those in the report first.
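
Before pointing the importer at a directory, it can help to see which files have a supported extension. The helper below is a minimal sketch; list_supported_transcripts is a hypothetical name, and the sketch assumes the importer matches extensions case-sensitively and does not descend into subdirectories, neither of which is confirmed by this runbook.

```shell
#!/usr/bin/env bash
# Sketch only: list files in a source directory with a supported extension.
# Assumptions: case-sensitive extensions, top-level files only.

list_supported_transcripts() {
  local source_dir="$1"
  find "$source_dir" -maxdepth 1 -type f \
    \( -name '*.txt' -o -name '*.md' -o -name '*.srt' -o -name '*.vtt' \) \
    | sort
}
```

For example, list_supported_transcripts /Volumes/Dock_1TB/vimeo/outbox prints one path per line; any file in the directory that does not appear in the output has an extension outside the supported set.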