whisper : add resumable / streaming transcription API by AlexCherrypi · Pull Request #3869 · ggml-org/whisper.cpp

AlexCherrypi · 2026-06-09T14:46:45Z

What

Adds an incremental, resumable transcription API so audio can be fed to whisper while it is still being captured, instead of only as one complete buffer via whisper_full().

New public API (whisper.h):

whisper_append_audio[_with_state]() — feed PCM incrementally; mel frames are computed and accumulated, PCM is not retained.
whisper_full_resumable[_with_state]() — decode the windows fully backed by audio so far; finalize=true flushes the trailing (<30s) window with batch-equivalent zero padding.
whisper_resumable_reset[_with_state]() — reset the accumulated state to start a new stream.
New whisper_full_params fields: mel_norm_mode (GLOBAL = running max, batch-compatible; WINDOW = envelope-follower AGC for live mic input), mel_norm_half_life, mel_norm_max_drop.
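
The split between "windows fully backed by audio" and the trailing window flushed by finalize can be illustrated with a toy bookkeeping model. This is a sketch only, not the PR's implementation: `ResumableModel`, `append` and `decode` are hypothetical stand-ins for the new entry points, the constants assume 16 kHz input with a 160-sample hop (3000 mel frames per 30 s window), and seek is advanced by a full window here whereas the real decoder chooses its own stride.

```cpp
#include <cstddef>

// Toy model of the resumable bookkeeping; NOT the PR's implementation.
// Assumed constants: 16 kHz input, 160-sample hop (100 mel frames per second),
// 30 s decode window = 3000 frames.
constexpr int kSampleRate   = 16000;
constexpr int kHopLength    = 160;
constexpr int kWindowFrames = 30 * kSampleRate / kHopLength; // 3000

struct ResumableModel {
    long long n_samples = 0; // total PCM samples appended so far
    long long seek      = 0; // next frame to decode, persisted across calls

    // Hypothetical stand-in for whisper_append_audio(): only counts samples.
    void append(std::size_t n) { n_samples += static_cast<long long>(n); }

    // Mel frames available so far (simplified: one frame per full hop).
    long long frames() const { return n_samples / kHopLength; }

    // Hypothetical stand-in for whisper_full_resumable(): decodes every window
    // that is fully backed by audio and, with finalize, the trailing partial
    // window as well. Returns the number of windows decoded in this call.
    int decode(bool finalize) {
        int n = 0;
        while (seek + kWindowFrames <= frames()) { seek += kWindowFrames; ++n; }
        if (finalize && seek < frames()) { seek = frames(); ++n; }
        return n;
    }
};
```

In this model, 20 s of audio decodes nothing, 35 s decodes one window, and the remaining tail is only decoded once `decode(true)` is called, which mirrors the "only the trailing window remains on stop" behavior described in this PR.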

Plus a self-contained examples/stream-resumable/ demo (single producer/consumer, one whisper_state per worker).

Why

Enables "transcribe while recording" / near-instant-on-stop UX — when the user stops, only the trailing window remains to decode — at full model quality, and without the re-decode overhead of the sliding-window stream example.

Notes

Additive: existing whisper_full() behavior is unchanged; the resumable path is only active via the new entry points.
GLOBAL mel-norm matches batch exactly only when the whole signal is appended before the first decode (documented inline); incremental decoding uses the running max.
New params are appended at the end of whisper_full_params; the Go/Java/Ruby bindings are not yet regenerated.
Used in a real macOS app (live dictation) on Apple Silicon / Metal.
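
The GLOBAL mel-norm note above can be made concrete with a small sketch. It assumes the usual whisper-style clamp-and-scale step on log10 mel values (floor at reference max minus 8, then `(x + 4) / 4`); the helper names `normalize_log_mel` and `running_max` are hypothetical and not part of the PR.

```cpp
#include <algorithm>
#include <vector>

// Clamp-and-scale step applied to log10 mel values (assumed form):
// values are floored at (ref_max - 8) and mapped by (x + 4) / 4.
inline float normalize_log_mel(float x, float ref_max) {
    return (std::max(x, ref_max - 8.0f) + 4.0f) / 4.0f;
}

// Running max over the per-chunk peaks seen so far (all an incremental
// decode can know), as opposed to the global max over the whole signal.
inline std::vector<float> running_max(const std::vector<float> & peaks) {
    std::vector<float> out;
    float m = -1e30f;
    for (float p : peaks) {
        m = std::max(m, p);
        out.push_back(m);
    }
    return out;
}
```

With per-chunk peaks of -2, -1 and 1, a quiet value of -9 in the first chunk normalizes to -1.25 against the running max (-2) but to -0.75 against the global max (1). That gap is why incremental decoding can only match batch when all audio is appended before the first decode.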

Happy to adjust scope or split this up — feedback very welcome.

Add a resumable transcription path that lets audio be fed incrementally and decoded consistently with a single batch run: the seek position and rolling text context are persisted in whisper_state across calls, and audio is only ever decoded once (no sliding-window re-decoding / output divergence between iterations).

New public API:
- whisper_resumable_reset[_with_state]
- whisper_append_audio[_with_state] (incremental log-mel)
- whisper_full_resumable[_with_state] (decode complete 30s windows, defer the trailing window until finalize)

The existing whisper_full_with_state loop gains a few guarded hooks (active only in resumable mode); whisper_full() behavior is unchanged.

Log-mel normalization is configurable via whisper_full_params:
- WHISPER_MEL_NORM_GLOBAL: global-max, matches batch (default)
- WHISPER_MEL_NORM_WINDOW: per-window reference with an envelope follower (instant attack, EMA release fed the raw peak) so a brief silence does not amplify background noise and a loud transient does not wreck the rest of the stream.

Verified that in GLOBAL mode the resumable path feeds byte-identical mel windows to the encoder at every seek offset (including the finalize tail with batch-equivalent straddle frames), so decoded output matches batch for any model.

The reference-level release was an EMA coefficient applied once per decode window, so its effective time constant depended on the model-chosen seek_delta (variable in seconds). Replace it with an exponential decay parameterized by a half-life in audio seconds: decay = 0.5^(dt / T_half), where dt is the seek advance since the last reference update. This makes the "forgetting" of a stopped background source predictable in wall-clock terms and independent of window stride or pauses between calls.

- mel_norm_release -> mel_norm_half_life (seconds; <= 0 = instantaneous)
- mel_norm_max_drop is now a drop-rate cap (log10 power per second)

Attack stays instantaneous/free; the decay is still fed the raw per-window peak. GLOBAL mode is unaffected.
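
A minimal sketch of the half-life release described in this commit message. The decay formula is taken from the text; everything else is an assumption for illustration: the function names `half_life_decay` and `update_reference` are hypothetical, and the choice to relax the reference toward the raw peak in the log10 domain and then apply the cap is one plausible reading, not necessarily what the PR does.

```cpp
#include <algorithm>
#include <cmath>

// decay = 0.5^(dt / T_half); a non-positive half-life means "forget instantly".
inline float half_life_decay(float dt, float half_life) {
    if (half_life <= 0.0f) return 0.0f;
    return std::pow(0.5f, dt / half_life);
}

// One reference update in the log10-power domain (assumed form).
//   ref      : current reference level
//   peak     : raw peak of the window just decoded
//   dt       : audio seconds advanced since the last update
//   max_drop : cap on how fast ref may fall, in log10 power per second
inline float update_reference(float ref, float peak, float dt,
                              float half_life, float max_drop) {
    if (peak >= ref) return peak; // instantaneous attack
    if (dt <= 0.0f)  return ref;  // no audio time elapsed, nothing to release
    const float decay  = half_life_decay(dt, half_life);
    const float target = peak + (ref - peak) * decay; // relax toward the raw peak
    return std::max(target, ref - max_drop * dt);     // drop-rate cap
}
```

Because the decay depends only on `dt` in audio seconds, two updates of 1 s each and one update of 2 s leave the same fraction of the gap (0.25 with a 1 s half-life), which is the stride independence the commit message describes.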

Reference example for the resumable streaming API. A producer thread feeds PCM into a queue while a worker thread, owning a dedicated whisper_state, runs whisper_append_audio + whisper_full_resumable alongside the "recording", decoding each 30s window once and persisting seek + context across calls. Reads a WAV file (optionally paced in real time to simulate a live source); swap the producer for a microphone in a real application. Demonstrates both mel normalization modes (global / per-window envelope follower). Builds without SDL.
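
The producer/worker layout described here can be sketched without any whisper dependency. This is an illustration under stated assumptions rather than the example's actual code: `PcmQueue` and `run_pipeline` are hypothetical names, the producer pushes silent chunks instead of reading a WAV file, and the places where the real worker would call the new API are marked only as comments.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Minimal blocking queue of PCM chunks shared by one producer and one worker.
struct PcmQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<std::vector<float>> chunks;
    bool done = false;

    void push(std::vector<float> c) {
        { std::lock_guard<std::mutex> lock(m); chunks.push_back(std::move(c)); }
        cv.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_one();
    }
    // Returns false once the queue is closed and drained.
    bool pop(std::vector<float> & out) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return done || !chunks.empty(); });
        if (chunks.empty()) return false;
        out = std::move(chunks.front());
        chunks.pop_front();
        return true;
    }
};

// Feeds n_chunks chunks of chunk_size samples through the queue and returns
// the number of samples the worker consumed.
inline std::size_t run_pipeline(std::size_t n_chunks, std::size_t chunk_size) {
    PcmQueue q;
    std::size_t consumed = 0;
    std::thread worker([&] {
        std::vector<float> chunk;
        while (q.pop(chunk)) {
            // Real code would call whisper_append_audio + whisper_full_resumable here.
            consumed += chunk.size();
        }
        // Real code would run one last decode with finalize = true here.
    });
    for (std::size_t i = 0; i < n_chunks; ++i) {
        q.push(std::vector<float>(chunk_size, 0.0f)); // producer: silent PCM
    }
    q.close();
    worker.join();
    return consumed;
}
```

Keeping all decoding on the single worker thread matches the "one whisper_state per worker" arrangement: the state is never touched from the producer side, so the queue is the only shared object that needs locking.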

Review follow-ups on the resumable/streaming API:
- Document that WHISPER_MEL_NORM_GLOBAL matches batch whisper_full() only when all audio is appended before the first decode; incremental decoding uses the running max (whisper.h + stream-resumable README).
- Reject whisper_append_audio() after a finalize pass (new rs_finalized flag) so a misuse can't interleave real frames behind the trailing pad and desync the mel layout; reset clears the flag.
- Warn on finalize when audio was fed but too short to produce any frame.
- Fill the language-detection mel window at offset 0 to match the encode offset.
- Correct the WINDOW envelope-follower dt comment (stride is decoder-chosen).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
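
The append-after-finalize guard from the follow-ups above amounts to a small state machine. The sketch below is a toy model, not the PR's code: only the `rs_finalized` name comes from the commit message, while `StreamGuard`, `n_samples` and the method names are hypothetical.

```cpp
#include <cstddef>

// Toy state with the guard described above; names other than rs_finalized
// are hypothetical.
struct StreamGuard {
    std::size_t n_samples    = 0;
    bool        rs_finalized = false;

    // Returns false (and ignores the audio) once a finalize pass has run,
    // so no real frames can land behind the trailing pad.
    bool append(std::size_t n) {
        if (rs_finalized) return false;
        n_samples += n;
        return true;
    }
    void finalize() { rs_finalized = true; }
    void reset()    { n_samples = 0; rs_finalized = false; } // new stream
};
```

Rejecting the call outright, rather than silently queuing the audio, keeps the misuse visible to the caller, who must reset before starting the next stream.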

claude and others added 4 commits June 8, 2026 21:34

AlexCherrypi mentioned this pull request Jun 9, 2026

Live (resumable) streaming transcription Starmel/OpenSuperWhisper#147

Open

AlexCherrypi wants to merge 4 commits into ggml-org:master from AlexCherrypi:resumable-streaming-api
