whisper : expose internal VAD speech segments #3916
Merged
Member
@buxuku Could you take a look at the conflicts and resolve them?
When transcribing with params.vad = true, whisper already computes the speech segments and keeps them in the state. Expose them so callers can reuse those boundaries (for example to align or clip subtitles to speech) instead of running a second, separate VAD pass. Times are on the original audio timeline in centiseconds; the count is 0 when VAD was not used. test-vad-full.cpp checks the segments are ordered and non-empty.
99d2d51 to 6809480
Contributor
Author
Done! The conflicts were just my changes sitting right next to the #3910 token_t0/t1 changes, so I kept both. Rebased on master, so it should be happy now.
danbev
approved these changes
Jul 1, 2026
When transcribing with vad = true, whisper already detects the speech segments internally and keeps them in the state, but there's no way to read them back. This exposes them through whisper_full_n_vad_segments() and whisper_full_get_vad_segment_t0/t1() (with _from_state variants). The times are on the original audio timeline in centiseconds, and the count is 0 when VAD wasn't used.
The point is to let callers reuse whisper's own speech boundaries — for instance to clip or align subtitles to speech — instead of running a separate VAD pass over the same audio.
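As a rough illustration of the subtitle use case, here is a minimal sketch of clipping a subtitle interval to the speech it overlaps. The `vad_segment` struct and `clip_to_speech` helper are hypothetical and not part of this PR. The sketch assumes the segments have already been copied out through the new accessors, that times are 64-bit centisecond values, and that segments arrive in ascending order.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical container for one speech segment (not part of the PR).
// Times are in centiseconds (1 cs = 10 ms) on the original audio timeline.
struct vad_segment {
    int64_t t0;
    int64_t t1;
};

// Assumed way to fill the vector; the exact signatures are not shown in
// this PR description, so treat this as a guess at the call shape:
//
//   for (int i = 0; i < whisper_full_n_vad_segments(ctx); ++i) {
//       segs.push_back({ whisper_full_get_vad_segment_t0(ctx, i),
//                        whisper_full_get_vad_segment_t1(ctx, i) });
//   }

// Clip the subtitle interval [sub_t0, sub_t1] to the speech it overlaps.
// The result starts at the first overlapping speech start and ends at the
// last overlapping speech end, never extending past the subtitle itself.
// If no segment overlaps, the subtitle is returned unchanged.
static vad_segment clip_to_speech(const std::vector<vad_segment> & segs,
                                  int64_t sub_t0, int64_t sub_t1) {
    vad_segment out = { sub_t0, sub_t1 };
    bool found = false;
    for (const auto & s : segs) {
        if (s.t1 <= sub_t0 || s.t0 >= sub_t1) {
            continue; // this segment does not overlap the subtitle
        }
        const int64_t lo = std::max(s.t0, sub_t0);
        const int64_t hi = std::min(s.t1, sub_t1);
        if (!found) {
            out.t0 = lo; // first overlapping segment sets the start
            found = true;
        }
        out.t1 = hi;     // last overlapping segment sets the end
    }
    return out;
}
```

With speech at 100-250 cs and 400-600 cs, a subtitle spanning 50-300 cs would be clipped to 100-250 cs, while a subtitle that falls entirely in the silence between the two segments is left as it is.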
Extended tests/test-vad-full.cpp to check the segments come back non-empty and in order. Built and ran it on macOS.
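For readers who want the same sanity check in their own code, the sketch below shows one way to read "non-empty and in order". It is not the code from tests/test-vad-full.cpp. The `vad_segment` struct and `segments_ordered` helper are hypothetical, and the exact invariant (each segment ends no earlier than it starts, and starts no earlier than the previous one) is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for one speech segment, times in centiseconds.
struct vad_segment {
    int64_t t0;
    int64_t t1;
};

// Returns true when the list is non-empty, every segment has t0 <= t1,
// and segment start times never decrease from one segment to the next.
static bool segments_ordered(const std::vector<vad_segment> & segs) {
    if (segs.empty()) {
        return false; // a VAD run over audio with speech should yield segments
    }
    for (std::size_t i = 0; i < segs.size(); ++i) {
        if (segs[i].t0 > segs[i].t1) {
            return false; // segment ends before it starts
        }
        if (i > 0 && segs[i].t0 < segs[i - 1].t0) {
            return false; // segment starts before the previous one
        }
    }
    return true;
}
```

Since the count is 0 when VAD was not used, an empty list fails this check, so it only makes sense after a run with VAD enabled.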