Problem
The abort_callback field in whisper_full_params does not actually interrupt an in-progress transcription. It is only checked after each encode/decode step completes, which makes it ineffective for real-time cancellation and makes it impossible to build a responsive stop feature in applications that use whisper_full(). In some applications (e.g. on macOS) it can also leave ghost processes behind if we rely purely on the operating system to clean up.
Root Cause
There are two overloads of the internal ggml_graph_compute_helper:
- The ggml_cgraph * version (line 169): correctly accepts abort_callback and wires it into the ggml backend.
- The ggml_backend_sched_t version (line 191): has no abort_callback parameter at all.
All actual encoder and decoder compute calls inside whisper_encode_internal and whisper_decode_internal use the second (sched) overload exclusively. As a result, abort_callback is never passed to the ggml backend during computation. The only places it fires are the post-hoc checks at the very end of those functions (lines 2447 and 2977), after all the work is already done.
Additionally, the main token sampling loop (whisper_full_with_state, line 6783) has no abort check at all. It runs up to n_text_ctx / 2 iterations with no opportunity to exit early.
encoder_begin_callback does work correctly. It fires before each audio chunk, but this only helps with multi-chunk audio. For short clips processed as a single chunk, that chunk is already being processed by the time a user requests a stop, and encoder_begin_callback will not fire again.
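To make the difference concrete, here is a minimal, self-contained model of the two behaviours. It does not use ggml: the "graph" is just a loop over a node count, and stop_request, compute_post_hoc and compute_per_node are hypothetical names invented for this illustration. Only the callback shape (a function taking void * and returning true to abort) matches ggml_abort_callback. It also assumes that a backend which is given the callback polls it between graph nodes.

```cpp
#include <cassert>

// Same shape as ggml_abort_callback: return true to request an abort.
typedef bool (*abort_cb_t)(void * user_data);

// Hypothetical model of "the user presses stop once stop_at nodes are done".
struct stop_request {
    const int * nodes_done; // progress counter owned by the caller
    int         stop_at;    // progress value at which the stop is requested
};

bool should_abort(void * user_data) {
    const stop_request * req = static_cast<const stop_request *>(user_data);
    return *req->nodes_done >= req->stop_at;
}

// Models the behaviour described above: the whole graph runs, and the
// callback is consulted only once, afterwards. Returns false if aborted.
bool compute_post_hoc(int n_nodes, int & nodes_done, abort_cb_t cb, void * user_data) {
    assert(n_nodes >= 0);
    for (int i = 0; i < n_nodes; ++i) {
        ++nodes_done; // placeholder for evaluating one graph node
    }
    return !(cb && cb(user_data));
}

// Models a backend that has been handed the callback: it is polled before
// every node, so the computation stops at the next node boundary.
bool compute_per_node(int n_nodes, int & nodes_done, abort_cb_t cb, void * user_data) {
    assert(n_nodes >= 0);
    for (int i = 0; i < n_nodes; ++i) {
        if (cb && cb(user_data)) {
            return false;
        }
        ++nodes_done; // placeholder for evaluating one graph node
    }
    return true;
}
```

In this model, with 10 nodes and a stop requested after 3, both variants report an abort, but the post-hoc variant has already evaluated all 10 nodes while the per-node variant stops at 3.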
Proposed Fix
I propose 3 changes to whisper.cpp with no API changes:
- Add abort_callback support to the sched overload of ggml_graph_compute_helper, using the same ggml_backend_set_abort_callback pattern already present in the non-sched overload.
- Pass abort_callback and abort_callback_user_data through the ggml_graph_compute_helper(sched, ...) calls inside whisper_encode_internal (lines 2406, 2431, 2447) and inside whisper_decode_internal (line 2944). Note that whisper_decode_internal is called from 4 external sites, but lines 3940 and 8847 already pass nullptr and are unrelated to user-initiated abort.
- Add an abort_callback check at the top of the token sampling loop in whisper_full_with_state so it can exit between token generations.
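As a rough sketch of what the third change amounts to, the loop below checks the callback at the top of every iteration. This is a toy model, not the whisper.cpp code: sample_tokens, its on_token hook and stop_flag_is_set are hypothetical names for this illustration, and pushing the loop index stands in for the real decode and sampling work. The application side is assumed to expose a std::atomic<bool> that another thread (for example a UI thread) sets when the user asks to stop.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <vector>

// Same shape as ggml_abort_callback: return true to request an abort.
typedef bool (*abort_cb_t)(void * user_data);

// Application-side callback: user_data is assumed to point at an atomic flag
// that some other thread sets when the user asks to stop.
bool stop_flag_is_set(void * user_data) {
    return static_cast<const std::atomic<bool> *>(user_data)->load();
}

// Toy stand-in for the token sampling loop. max_tokens plays the role of the
// iteration limit; on_token is an illustrative hook so a caller can react
// after each generated token.
std::vector<int> sample_tokens(int max_tokens,
                               abort_cb_t abort_callback,
                               void * abort_callback_user_data,
                               const std::function<void(int)> & on_token) {
    assert(max_tokens >= 0);
    std::vector<int> tokens;
    for (int i = 0; i < max_tokens; ++i) {
        // Proposed check: allow an early exit between token generations.
        if (abort_callback && abort_callback(abort_callback_user_data)) {
            break;
        }
        tokens.push_back(i); // placeholder for decode + sample
        if (on_token) {
            on_token(i);
        }
    }
    return tokens;
}
```

If the flag is set while token index 4 is being handled, the loop exits at the top of the next iteration and returns the 5 tokens produced so far. With a null callback the loop runs to its limit, so callers that pass nullptr see no change in behaviour.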
Discussion
Before implementing these, I wanted to check whether this is the right layer to fix it, or whether you would prefer the abort mechanism to live deeper in ggml. Also, do you have any concerns with the proposed approach?