YouTube visual OCR output uses index-based timestamps instead of real frame timing #2099

@shaun0927

Description

The YouTube visual extraction feature currently emits subtitle-like timestamps based on the OCR result index instead of the actual frame time.

That means the output can look temporally precise while actually pointing to the wrong time in the source video.

Affected area

  • internal/tools/youtube/youtube.go
  • upstream commit I validated locally: 9743e10273c45e1c377c10ec0ffaf0de8fefb3c8 (v1.4.448)

Why this happens

GrabVisual() chooses frames using either:

  • fps=%d, or
  • select='gt(scene,%f)'

but later renders timestamps using:

secs := i

So the nth OCR result is labeled as second n, regardless of when that frame actually occurred.
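
To make the mismatch concrete, here is a minimal Go sketch that compares index-based labeling with a time derived from the sampling rate. It is not the code in youtube.go: formatTimestamp, indexTimestamp, and fpsTimestamp are hypothetical names, the HH:MM:SS.mmm layout is an assumption, and it assumes that with fps=N the frames are spaced 1/N seconds apart starting at zero. Scene-based selection has no fixed spacing, so it cannot be computed this way at all.

```go
package main

import (
	"fmt"
	"time"
)

// formatTimestamp renders a duration as HH:MM:SS.mmm.
// The layout is an assumption; the issue does not show the real format.
func formatTimestamp(d time.Duration) string {
	d = d.Round(time.Millisecond)
	h := d / time.Hour
	d -= h * time.Hour
	m := d / time.Minute
	d -= m * time.Minute
	s := d / time.Second
	d -= s * time.Second
	ms := d / time.Millisecond
	return fmt.Sprintf("%02d:%02d:%02d.%03d", int64(h), int64(m), int64(s), int64(ms))
}

// indexTimestamp mirrors the reported behavior (secs := i):
// the i-th OCR result is labeled as second i.
func indexTimestamp(i int) string {
	return formatTimestamp(time.Duration(i) * time.Second)
}

// fpsTimestamp estimates when the i-th sampled frame occurred,
// assuming fixed-rate sampling at fps frames per second from t=0.
func fpsTimestamp(i, fps int) string {
	if fps <= 0 {
		return formatTimestamp(0)
	}
	return formatTimestamp(time.Duration(i) * time.Second / time.Duration(fps))
}
```

Under these assumptions, with fps=2 the second result (index 1) sits at about half a second into the video, while the index-based label places it at one full second. The gap grows with every frame.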

Why this matters

This feature is meant to extract meaningful visual text from video. Incorrect timestamps reduce trust in the output and can mislead downstream prompts or users into attributing OCR text to the wrong moment in the source video.

Local validation evidence

I reproduced this locally with a focused test that mocks yt-dlp, ffmpeg, and tesseract:

go test ./internal/tools/youtube -run 'TestGrabVisual_UsesFrameTimesFromFFmpeg' -v

In the repro, GrabVisual(..., fps=2) produces two OCR frames, and current upstream labels the second frame by its index rather than by its actual frame time.

I have a small fix prepared that preserves ffmpeg-derived frame times and includes regression coverage.
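
For readers wondering how ffmpeg-derived frame times could be recovered, one possible approach is sketched below. It is not necessarily what the prepared fix does. It assumes the showinfo filter is appended to the ffmpeg filter chain, which logs one line per emitted frame containing a pts_time: field in seconds. parseFrameTimes is a hypothetical helper, and the sample log format in the comment is abbreviated.

```go
package main

import (
	"bufio"
	"math"
	"regexp"
	"strconv"
	"strings"
	"time"
)

// ptsTimeRe matches the pts_time field in a showinfo log line, e.g.
// "[Parsed_showinfo_1 @ 0x...] n:   1 pts:  12250 pts_time:12.25 ...".
var ptsTimeRe = regexp.MustCompile(`pts_time:\s*([0-9]+(?:\.[0-9]+)?)`)

// parseFrameTimes scans captured ffmpeg log output and returns one
// timestamp per line that carries a numeric pts_time field, in order.
// Lines without the field (progress output, banners) are skipped.
func parseFrameTimes(ffmpegLog string) []time.Duration {
	var times []time.Duration
	sc := bufio.NewScanner(strings.NewReader(ffmpegLog))
	for sc.Scan() {
		m := ptsTimeRe.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		secs, err := strconv.ParseFloat(m[1], 64)
		if err != nil {
			continue
		}
		// Round to the nearest nanosecond to avoid float truncation drift.
		times = append(times, time.Duration(math.Round(secs*float64(time.Second))))
	}
	return times
}
```

The i-th returned duration can then be paired with the i-th extracted frame, as long as the number of parsed times matches the number of image files. A caller would want to check that invariant and decide how to handle a mismatch, rather than silently falling back to the index.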
