Description
The YouTube visual extraction feature currently emits subtitle-like timestamps based on the OCR result index instead of the actual frame time.
That means the output can look temporally precise while actually pointing to the wrong time in the source video.
Affected area
- internal/tools/youtube/youtube.go
- current upstream I validated locally: 9743e10273c45e1c377c10ec0ffaf0de8fefb3c8 (v1.4.448)
Why this happens
GrabVisual() chooses frames using either:
- fps=%d, or
- select='gt(scene,%f)'
but later renders each timestamp from the OCR result's index rather than from the time of the selected frame.
So the nth OCR result is labeled as second n, regardless of when that frame actually occurred.
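As a rough illustration of the mismatch, the sketch below compares an index-based label with the time implied by a constant sampling rate. The function names, the zero-based indexing, and the HH:MM:SS.mmm format are assumptions made for this example; they are not taken from youtube.go.

```go
package main

import (
	"fmt"
	"time"
)

// indexTimestamp mirrors the reported behavior: the i-th OCR result
// (zero-based here, which is an assumption) is labeled as second i.
func indexTimestamp(i int) time.Duration {
	return time.Duration(i) * time.Second
}

// fpsTimestamp derives the time of the i-th sampled frame when frames are
// taken at a constant rate of fps frames per second. fps must be positive.
func fpsTimestamp(i, fps int) time.Duration {
	return time.Duration(i) * time.Second / time.Duration(fps)
}

// formatTimestamp renders a duration as HH:MM:SS.mmm (illustrative format).
func formatTimestamp(d time.Duration) string {
	ms := d.Milliseconds()
	return fmt.Sprintf("%02d:%02d:%02d.%03d", ms/3600000, ms/60000%60, ms/1000%60, ms%1000)
}

func main() {
	const fps = 2
	for i := 0; i < 4; i++ {
		fmt.Printf("frame %d: index label %s, frame time %s\n",
			i, formatTimestamp(indexTimestamp(i)), formatTimestamp(fpsTimestamp(i, fps)))
	}
}
```

With fps=2, the second frame sits at 0.5 s in the source but receives a 1 s label under index-based numbering, and the gap grows with every frame. In scene-detection mode the frames are irregularly spaced, so no formula over the index can recover the real time.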
Why this matters
This feature is meant to extract meaningful visual text from video. Incorrect timestamps reduce trust in the output and can mislead downstream prompts or users into attributing OCR text to the wrong moment in the source video.
Local validation evidence
I reproduced this locally with a focused test that mocks yt-dlp, ffmpeg, and tesseract:
go test ./internal/tools/youtube -run 'TestGrabVisual_UsesFrameTimesFromFFmpeg' -v
In the repro, GrabVisual(..., fps=2) produces two OCR frames, and current upstream labels the second frame by its index instead of by its actual frame time.
I have a small fix prepared that preserves ffmpeg-derived frame times and includes regression coverage.