whisper : add resumable / streaming transcription API #3869
Open
AlexCherrypi wants to merge 4 commits into
Add a resumable transcription path that lets audio be fed incrementally
and decoded consistently with a single batch run: the seek position and
rolling text context are persisted in whisper_state across calls, and
audio is only ever decoded once (no sliding-window re-decoding / output
divergence between iterations).
New public API:
- whisper_resumable_reset[_with_state]
- whisper_append_audio[_with_state] (incremental log-mel)
- whisper_full_resumable[_with_state] (decode complete 30s windows,
defer the trailing window until finalize)
The existing whisper_full_with_state loop gains a few guarded hooks
(active only in resumable mode); whisper_full() behavior is unchanged.
Log-mel normalization is configurable via whisper_full_params:
- WHISPER_MEL_NORM_GLOBAL: global-max, matches batch (default)
- WHISPER_MEL_NORM_WINDOW: per-window reference with an envelope
follower (instant attack, EMA release fed the raw peak) so a brief
silence does not amplify background noise and a loud transient does
not wreck the rest of the stream.
Verified that in GLOBAL mode the resumable path feeds byte-identical mel
windows to the encoder at every seek offset (including the finalize tail
with batch-equivalent straddle frames), so decoded output matches batch
for any model.
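The "decode complete 30s windows, defer the trailing window until finalize" rule above can be illustrated with a small bookkeeping sketch. This is not the PR's implementation: `stream_sketch`, `append`, and `decode` are hypothetical names, and the fixed 3000-frame stride is a simplifying assumption (in whisper the seek advance is chosen by the decoder).

```cpp
#include <cassert>
#include <vector>

// 30 s of audio at 100 mel frames per second (10 ms hop).
constexpr int FRAMES_PER_WINDOW = 3000;

// Hypothetical bookkeeping only; no audio or model is involved.
struct stream_sketch {
    int n_frames = 0; // mel frames accumulated so far
    int seek     = 0; // first frame not yet decoded; persists across calls

    void append(int n_new_frames) {
        assert(n_new_frames >= 0);
        n_frames += n_new_frames;
    }

    // Returns the start frame of every window decoded by this call.
    std::vector<int> decode(bool finalize) {
        std::vector<int> starts;
        // Decode only windows that are fully backed by audio.
        while (seek + FRAMES_PER_WINDOW <= n_frames) {
            starts.push_back(seek);
            seek += FRAMES_PER_WINDOW; // assumption: fixed stride
        }
        // The trailing partial window is deferred until finalize.
        if (finalize && seek < n_frames) {
            starts.push_back(seek);
            seek = n_frames;
        }
        return starts;
    }
};
```

Because `seek` lives in the struct rather than in the call, every frame is handed to the decoder exactly once regardless of how the audio is chunked across `append` calls.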
The reference-level release was an EMA coefficient applied once per decode window, so its effective time constant depended on the model-chosen seek_delta (variable in seconds). Replace it with an exponential decay parameterized by a half-life in audio seconds: decay = 0.5^(dt / T_half), where dt is the seek advance since the last reference update. This makes the "forgetting" of a stopped background source predictable in wall-clock terms and independent of window stride or pauses between calls.
- mel_norm_release -> mel_norm_half_life (seconds; <= 0 = instantaneous)
- mel_norm_max_drop is now a drop-rate cap (log10 power per second)
Attack stays instantaneous/free; the decay is still fed the raw per-window peak. GLOBAL mode is unaffected.
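A minimal sketch of the half-life release described in this commit, under stated assumptions. `update_reference` is a hypothetical helper, not a function from the PR; levels are taken to be in log10 power, and treating `max_drop_per_sec <= 0` as "no cap" is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper: returns the new reference level given the previous
// reference `ref`, the raw peak of the current window `peak` (both in log10
// power), and the seek advance `dt_sec` in audio seconds.
float update_reference(float ref, float peak, float dt_sec,
                       float half_life_sec, float max_drop_per_sec) {
    assert(dt_sec >= 0.0f);

    // Attack is instantaneous: a louder window raises the reference at once.
    if (peak >= ref) {
        return peak;
    }

    // Release: decay = 0.5^(dt / T_half); half-life <= 0 means instantaneous.
    const float decay = half_life_sec > 0.0f
        ? std::pow(0.5f, dt_sec / half_life_sec)
        : 0.0f;
    float next = peak + (ref - peak) * decay;

    // Optional drop-rate cap in log10 power per second (assumed: <= 0 = off).
    if (max_drop_per_sec > 0.0f) {
        next = std::max(next, ref - max_drop_per_sec * dt_sec);
    }
    return next;
}
```

With a constant peak, two updates of 5 s each land on the same reference as one update of 10 s, which is the stride independence the commit message is after.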
Reference example for the resumable streaming API. A producer thread feeds PCM into a queue while a worker thread, owning a dedicated whisper_state, runs whisper_append_audio + whisper_full_resumable alongside "recording", decoding each 30s window once and persisting seek + context across calls. Reads a WAV file (optionally paced in real time to simulate a live source); swap the producer for a microphone in a real application. Demonstrates both mel normalization modes (global / per-window envelope follower). Builds without SDL.
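The producer/worker hand-off described above could be built on a small blocking queue along these lines. This is a sketch, not the code from the example directory: `pcm_queue` and its methods are hypothetical names, and the whisper calls themselves are deliberately left out.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical single-producer / single-consumer queue of PCM chunks.
class pcm_queue {
public:
    // Producer side: hand over one chunk of float samples.
    void push(std::vector<float> chunk) {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            assert(!closed_ && "push after close");
            chunks_.push_back(std::move(chunk));
        }
        cv_.notify_one();
    }

    // Producer side: signal that no more audio will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mtx_);
            closed_ = true;
        }
        cv_.notify_all();
    }

    // Worker side: blocks until a chunk is available or the queue is closed.
    // Returns false once the queue is closed and fully drained.
    bool pop(std::vector<float> & out) {
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [this] { return closed_ || !chunks_.empty(); });
        if (chunks_.empty()) {
            return false;
        }
        out = std::move(chunks_.front());
        chunks_.pop_front();
        return true;
    }

private:
    std::mutex mtx_;
    std::condition_variable cv_;
    std::deque<std::vector<float>> chunks_;
    bool closed_ = false;
};
```

In such a design the worker would loop on `pop()`, append each chunk and run a non-finalizing decode, then run one finalizing decode after `pop()` returns false.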
Review follow-ups on the resumable/streaming API:
- Document that WHISPER_MEL_NORM_GLOBAL matches batch whisper_full() only when all audio is appended before the first decode; incremental decoding uses the running max (whisper.h + stream-resumable README).
- Reject whisper_append_audio() after a finalize pass (new rs_finalized flag) so a misuse can't interleave real frames behind the trailing pad and desync the mel layout; reset clears the flag.
- Warn on finalize when audio was fed but too short to produce any frame.
- Fill the language-detection mel window at offset 0 to match the encode offset.
- Correct the WINDOW envelope-follower dt comment (stride is decoder-chosen).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
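The first follow-up (GLOBAL matches batch only when all audio is appended before the first decode) comes down to which peak the clamp is taken against. The sketch below assumes the usual Whisper log-mel normalization, which clamps to 8 below the reference peak and then maps with (x + 4) / 4; `normalize_log_mel` and `peak_of` are hypothetical helpers, not functions from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Largest log10 mel value in a buffer.
float peak_of(const std::vector<float> & log_mel) {
    assert(!log_mel.empty());
    return *std::max_element(log_mel.begin(), log_mel.end());
}

// Clamp every value to (ref_max - 8) and rescale; `ref_max` is the reference
// peak (running max when decoding incrementally, global max in batch).
std::vector<float> normalize_log_mel(const std::vector<float> & log_mel,
                                     float ref_max) {
    std::vector<float> out;
    out.reserve(log_mel.size());
    for (float v : log_mel) {
        out.push_back((std::max(v, ref_max - 8.0f) + 4.0f) / 4.0f);
    }
    return out;
}
```

For example, take an early window with log-mel values {-9, -2} followed later by a louder value of 1. Decoded incrementally, the early window is normalized against the running max -2, so -9 survives the clamp and maps to -1.25. In batch, the global max is 1, so -9 is clamped to -7 and maps to -0.75. The two runs therefore feed different features to the encoder for that window.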
What
Adds an incremental, resumable transcription API so audio can be fed to whisper while it is still being captured, instead of only as one complete buffer via `whisper_full()`.

New public API (`whisper.h`):
- `whisper_append_audio[_with_state]()`: feed PCM incrementally; mel frames are computed and accumulated, PCM is not retained.
- `whisper_full_resumable[_with_state]()`: decode the windows fully backed by audio so far; `finalize=true` flushes the trailing (<30s) window with batch-equivalent zero padding.
- `whisper_resumable_reset[_with_state]()`: reset the accumulated state to start a new stream.
- `whisper_full_params` fields: `mel_norm_mode` (`GLOBAL` = running max, batch-compatible; `WINDOW` = envelope-follower AGC for live mic input), `mel_norm_half_life`, `mel_norm_max_drop`.

Plus a self-contained `examples/stream-resumable/` demo (single producer/consumer, one `whisper_state` per worker).

Why
Enables "transcribe while recording" / near-instant-on-stop UX (when the user stops, only the trailing window remains to decode) at full model quality, and without the re-decode overhead of the sliding-window `stream` example.

Notes
- `whisper_full()` behavior is unchanged; the resumable path is only active via the new entry points.
- `GLOBAL` mel-norm matches batch exactly only when the whole signal is appended before the first decode (documented inline); incremental decoding uses the running max.
- `whisper_full_params` gains new fields; the Go/Java/Ruby bindings are not yet regenerated.

Happy to adjust scope or split this up; feedback very welcome.