whisper : add resumable / streaming transcription API #3869

Open
AlexCherrypi wants to merge 4 commits into ggml-org:master from AlexCherrypi:resumable-streaming-api

@AlexCherrypi

What

Adds an incremental, resumable transcription API so audio can be fed to whisper while it is still being captured, instead of only as one complete buffer via whisper_full().

New public API (whisper.h):

  • whisper_append_audio[_with_state]() — feed PCM incrementally; mel frames are computed and accumulated, PCM is not retained.
  • whisper_full_resumable[_with_state]() — decode the windows fully backed by audio so far; finalize=true flushes the trailing (<30s) window with batch-equivalent zero padding.
  • whisper_resumable_reset[_with_state]() — reset the accumulated state to start a new stream.
  • New whisper_full_params fields: mel_norm_mode (GLOBAL = running max, batch-compatible; WINDOW = envelope-follower AGC for live mic input), mel_norm_half_life, mel_norm_max_drop.

Plus a self-contained examples/stream-resumable/ demo (single producer/consumer, one whisper_state per worker).
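To make the windowing rule concrete, here is a minimal sketch of how "windows fully backed by audio so far" could be decided. It is not the code in this PR: `kFramesPerWindow` and `window_ready` are illustrative names, and the sketch assumes 100 mel frames per second (16 kHz input, 160-sample hop), so a 30 s window spans 3000 frames.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 100 mel frames per second, so one 30 s window is 3000 frames.
constexpr int64_t kFramesPerWindow = 3000;

// Hypothetical helper: may the window starting at `seek` (in mel frames) be decoded?
// Without finalize, the whole 30 s window must already be backed by appended audio.
// With finalize, any remaining frames are decoded and the tail is zero-padded.
inline bool window_ready(int64_t seek, int64_t frames_available, bool finalize) {
    assert(seek >= 0 && frames_available >= 0);
    if (seek >= frames_available) {
        return false; // nothing left to decode
    }
    if (finalize) {
        return true;  // flush the trailing (<30 s) window
    }
    return seek + kFramesPerWindow <= frames_available;
}
```

With 2999 frames appended, `window_ready(0, 2999, false)` is false and the window is deferred; the same call with `finalize = true` decodes it as the trailing window.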

Why

Enables "transcribe while recording" / near-instant-on-stop UX — when the user stops, only the trailing window remains to decode — at full model quality, and without the re-decode overhead of the sliding-window stream example.

Notes

  • Additive: existing whisper_full() behavior is unchanged; the resumable path is only active via the new entry points.
  • GLOBAL mel-norm matches batch exactly only when the whole signal is appended before the first decode (documented inline); incremental decoding uses the running max.
  • New params are appended at the end of whisper_full_params; the Go/Java/Ruby bindings are not yet regenerated.
  • Used in a real macOS app (live dictation) on Apple Silicon / Metal.
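The GLOBAL caveat above can be illustrated with a small model of the normalization. This is a simplified sketch, not the PR's code: `normalize_logmel` and `RunningMax` are hypothetical names, and the clamp-to-(max - 8) and (x + 4) / 4 rescale is assumed to follow the usual whisper log-mel convention.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Simplified model: clamp a log10 mel value to (ref_max - 8), then rescale.
inline float normalize_logmel(float v, float ref_max) {
    return (std::max(v, ref_max - 8.0f) + 4.0f) / 4.0f;
}

// Running maximum over all log-mel values appended so far (hypothetical helper).
struct RunningMax {
    float value = -1e20f;
    void update(const std::vector<float> & chunk) {
        for (float v : chunk) {
            value = std::max(value, v);
        }
    }
};
```

If a quiet chunk is decoded before a louder chunk arrives, the running max is lower than the eventual global max, so the clamp floor differs and the normalized values are not the ones a batch run would produce. Appending everything before the first decode makes the running max equal to the global max.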

Happy to adjust scope or split this up — feedback very welcome.

claude and others added 4 commits June 8, 2026 21:34
Add a resumable transcription path that lets audio be fed incrementally
and decoded consistently with a single batch run: the seek position and
rolling text context are persisted in whisper_state across calls, and
audio is only ever decoded once (no sliding-window re-decoding / output
divergence between iterations).

New public API:
  - whisper_resumable_reset[_with_state]
  - whisper_append_audio[_with_state]   (incremental log-mel)
  - whisper_full_resumable[_with_state] (decode complete 30s windows,
    defer the trailing window until finalize)

The existing whisper_full_with_state loop gains a few guarded hooks
(active only in resumable mode); whisper_full() behavior is unchanged.

Log-mel normalization is configurable via whisper_full_params:
  - WHISPER_MEL_NORM_GLOBAL: global-max, matches batch (default)
  - WHISPER_MEL_NORM_WINDOW: per-window reference with an envelope
    follower (instant attack, EMA release fed the raw peak) so a brief
    silence does not amplify background noise and a loud transient does
    not wreck the rest of the stream.

Verified that in GLOBAL mode the resumable path feeds byte-identical mel
windows to the encoder at every seek offset (including the finalize tail
with batch-equivalent straddle frames), so decoded output matches batch
for any model.

The reference-level release was an EMA coefficient applied once per decode
window, so its effective time constant depended on the model-chosen
seek_delta (variable in seconds). Replace it with an exponential decay
parameterized by a half-life in audio seconds: decay = 0.5^(dt / T_half),
where dt is the seek advance since the last reference update. This makes
the "forgetting" of a stopped background source predictable in wall-clock
terms and independent of window stride or pauses between calls.

  - mel_norm_release  -> mel_norm_half_life (seconds; <= 0 = instantaneous)
  - mel_norm_max_drop is now a drop-rate cap (log10 power per second)

Attack stays instantaneous/free; the decay is still fed the raw per-window
peak. GLOBAL mode is unaffected.
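A minimal sketch of such a half-life release, for illustration only. `update_reference` is a hypothetical function, not the one in this change; it assumes the reference and peak are in log10 power, that the release blends the old reference toward the raw peak by `decay`, and that a non-positive `max_drop_per_s` disables the cap (that last convention is an assumption of this sketch).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical envelope-follower update for the WINDOW reference level.
// ref, peak      : log10 power
// dt_s           : audio seconds advanced (seek delta) since the last update
// half_life_s    : release half-life in audio seconds; <= 0 means instantaneous
// max_drop_per_s : cap on how fast ref may fall (log10 power per second); <= 0 disables it here
inline float update_reference(float ref, float peak, float dt_s,
                              float half_life_s, float max_drop_per_s) {
    assert(dt_s >= 0.0f);
    if (peak >= ref) {
        return peak; // instant attack
    }
    // decay = 0.5^(dt / T_half)
    const float decay  = half_life_s > 0.0f ? std::pow(0.5f, dt_s / half_life_s) : 0.0f;
    float       target = peak + (ref - peak) * decay;
    if (max_drop_per_s > 0.0f) {
        target = std::max(target, ref - max_drop_per_s * dt_s);
    }
    return target;
}
```

Because the decay depends only on elapsed audio time, two 1 s steps against a constant peak land on the same reference as one 2 s step, which is the stride independence described above.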

Reference example for the resumable streaming API. A producer thread feeds
PCM into a queue while a worker thread, owning a dedicated whisper_state,
runs whisper_append_audio + whisper_full_resumable alongside "recording",
decoding each 30s window once and persisting seek + context across calls.

Reads a WAV file (optionally paced in real time to simulate a live source);
swap the producer for a microphone in a real application. Demonstrates both
mel normalization modes (global / per-window envelope follower). Builds
without SDL.
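The producer/worker split can be sketched roughly as follows. This is not the example's source: `PcmQueue` and `run_stream` are hypothetical names, and the whisper calls are only marked in comments so the sketch compiles without the library.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Thread-safe queue of PCM chunks; close() signals end of stream.
class PcmQueue {
public:
    void push(std::vector<float> chunk) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            chunks_.push_back(std::move(chunk));
        }
        cv_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            closed_ = true;
        }
        cv_.notify_one();
    }
    // Blocks until a chunk is available; returns false once closed and drained.
    bool pop(std::vector<float> & out) {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return !chunks_.empty() || closed_; });
        if (chunks_.empty()) {
            return false;
        }
        out = std::move(chunks_.front());
        chunks_.pop_front();
        return true;
    }
private:
    std::mutex                     mu_;
    std::condition_variable        cv_;
    std::deque<std::vector<float>> chunks_;
    bool                           closed_ = false;
};

// Feeds `pcm` in chunks from the calling thread (the producer) to a worker thread.
// Returns the number of samples the worker consumed.
inline std::size_t run_stream(const std::vector<float> & pcm, std::size_t chunk_size) {
    assert(chunk_size > 0);
    PcmQueue    queue;
    std::size_t consumed = 0;
    std::thread worker([&] {
        std::vector<float> chunk;
        while (queue.pop(chunk)) {
            // A real worker would append the chunk and run a non-finalizing decode here.
            consumed += chunk.size();
        }
        // A real worker would run one last decode with finalize=true here.
    });
    for (std::size_t i = 0; i < pcm.size(); i += chunk_size) {
        const std::size_t n = std::min(chunk_size, pcm.size() - i);
        queue.push(std::vector<float>(pcm.data() + i, pcm.data() + i + n));
    }
    queue.close();
    worker.join();
    return consumed;
}
```

Keeping all decoding on one worker that owns its state means the seek position and text context are only ever touched from a single thread.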

Review follow-ups on the resumable/streaming API:
- Document that WHISPER_MEL_NORM_GLOBAL matches batch whisper_full() only when
  all audio is appended before the first decode; incremental decoding uses the
  running max (whisper.h + stream-resumable README).
- Reject whisper_append_audio() after a finalize pass (new rs_finalized flag)
  so a misuse can't interleave real frames behind the trailing pad and desync
  the mel layout; reset clears the flag.
- Warn on finalize when audio was fed but too short to produce any frame.
- Fill the language-detection mel window at offset 0 to match the encode offset.
- Correct the WINDOW envelope-follower dt comment (stride is decoder-chosen).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
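The append-after-finalize guard from the follow-ups above amounts to a small state machine. A minimal sketch, assuming nothing about the real struct layout: `ResumableState` and its members are hypothetical stand-ins, with `finalized` playing the role of the rs_finalized flag.

```cpp
#include <cstddef>

// Hypothetical stand-in for the resumable bookkeeping; not the PR's struct.
struct ResumableState {
    std::size_t n_samples = 0;
    bool        finalized = false; // plays the role of rs_finalized

    // Rejects new audio once a finalize pass has written the trailing pad.
    bool append(std::size_t n) {
        if (finalized) {
            return false;
        }
        n_samples += n;
        return true;
    }
    void finalize() { finalized = true; }
    // Starting a new stream clears both the audio count and the flag.
    void reset() {
        n_samples = 0;
        finalized = false;
    }
};
```

Rejecting the append, rather than silently accepting it, keeps real frames from landing behind the zero padding; a caller that wants to continue has to reset first.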