whisper : map token timestamps to original time when VAD is enabled #3910
Open
buxuku wants to merge 1 commit into
danbev (Member) approved these changes on Jun 25, 2026 and left a comment:
Optional, but perhaps we could extend test-vad-full.cpp to exercise these new functions.
whisper_full_get_token_data().t0/t1 are in the VAD-processed timeline (silences removed), so they don't line up with the original audio. Add whisper_full_get_token_t0/t1 that map them back. A token inside a speech segment is interpolated within it; a token that falls in a removed inter-segment silence snaps to the nearest boundary, so it never lands in the middle of a cut-out gap. Without VAD the raw times are returned unchanged.
c008fa5 to e14e08b
buxuku (Contributor, Author) commented:
@danbev pushed an update on top of the approved version: the token times are now mapped segment by segment instead of one interpolation over the whole mapping table, and a token that lands in a removed silence snaps to the nearest speech boundary rather than somewhere in the middle of a gap that isn't in the original audio. Also extended test-vad-full.cpp to cover the new getters as you suggested. PTAL when you have a moment, thanks!
When VAD is enabled, the segment getters (whisper_full_get_segment_t0/t1) already map timestamps back to the original audio timeline, but the per-token timestamps in whisper_full_get_token_data() stay in the VAD-processed timeline with the silence removed. So if you build word-level timing on top of the token times while VAD is on, the words drift off by however much silence VAD stripped out, and there's no public getter that applies the mapping.

This adds whisper_full_get_token_t0/t1 (plus the _from_state variants) that map the token times back. A token inside a speech segment is interpolated within that segment; a token that falls in the silence removed between two segments is snapped to the nearest boundary, so it doesn't end up in the middle of a gap that isn't in the original audio. With VAD off, or when there's no segment info, the stored token times are returned unchanged, so existing callers aren't affected.

I hit this doing word-level re-segmentation with VAD enabled: the segment times lined up with the original audio but the token times didn't. Also extended tests/test-vad-full.cpp to exercise the new getters. Built and ran it on macOS.
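For anyone reviewing the mapping logic, here is a minimal standalone sketch of the interpolate-or-snap rule described above. It is not the code in this PR and does not use the whisper.cpp API: vad_span and map_to_original are hypothetical names, the spans are assumed to be sorted and non-overlapping with both timelines in the same integer unit, and clamping times that fall before the first span or after the last one is an assumption of the sketch, not something the description specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical: one speech span kept by VAD, expressed in both timelines.
struct vad_span {
    int64_t vad_start;  // start in the VAD-processed (silence-removed) timeline
    int64_t vad_end;    // end in the VAD-processed timeline
    int64_t orig_start; // start in the original audio
    int64_t orig_end;   // end in the original audio
};

// Map a time t from the VAD-processed timeline back to the original timeline.
// Assumes spans are sorted by vad_start and do not overlap.
int64_t map_to_original(const std::vector<vad_span> & spans, int64_t t) {
    if (spans.empty()) {
        return t; // VAD off or no segment info: return the time unchanged
    }
    // Assumption of this sketch: clamp times outside the covered range.
    if (t <= spans.front().vad_start) {
        return spans.front().orig_start;
    }
    if (t >= spans.back().vad_end) {
        return spans.back().orig_end;
    }
    for (size_t i = 0; i < spans.size(); ++i) {
        const vad_span & s = spans[i];
        assert(s.vad_start <= s.vad_end);
        if (t >= s.vad_start && t <= s.vad_end) {
            // Inside a speech span: interpolate linearly within this span only.
            const int64_t vad_len = s.vad_end - s.vad_start;
            if (vad_len == 0) {
                return s.orig_start;
            }
            const int64_t orig_len = s.orig_end - s.orig_start;
            return s.orig_start + (t - s.vad_start) * orig_len / vad_len;
        }
        if (i + 1 < spans.size() && t < spans[i + 1].vad_start) {
            // Between two spans: snap to whichever boundary is nearer in the
            // processed timeline, so the result never lands inside the gap.
            const vad_span & n = spans[i + 1];
            return (t - s.vad_end <= n.vad_start - t) ? s.orig_end : n.orig_start;
        }
    }
    return t; // not reached for sorted, non-overlapping spans
}
```

As an illustration with made-up numbers: given two spans, processed 0..100 mapping to original 50..150 and processed 110..210 mapping to original 400..500, a token at processed time 160 maps to 450, one at 103 snaps to 150, and one at 108 snaps to 400. Interpolating once across the whole table would instead place the last two somewhere inside the 150..400 gap.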