fix: allow flash attention in encoder when DTW is enabled #3704
Open
Acelogic wants to merge 1 commit into ggml-org:master from
Conversation
Previously, enabling DTW token timestamps with flash attention caused DTW to be silently disabled entirely.

DTW only needs the explicit cross-attention weights (`KQ_soft_max`) from the decoder, so flash attention can remain enabled for:

- encoder self-attention
- decoder self-attention

Only the cross-attention path in both the encoder (KV storage) and the decoder (KQ computation) needs to fall back to the non-flash path when DTW is active, since flash attention fuses the entire attention operation and never materializes `KQ_soft_max`.

This allows DTW timestamps to work alongside flash attention with no encoder performance penalty.

Fixes ggml-org#3662
Summary
Details
Previously, `whisper_init_with_params_no_state` disabled DTW entirely when `flash_attn` was set.

DTW only needs the explicit cross-attention weights (`KQ_soft_max`) from the decoder, which the flash attention path doesn't produce (it fuses QKV into one operation). The encoder self-attention and decoder self-attention don't interact with DTW at all.

The fix introduces a `flash_cross` flag (`flash_attn && !dtw_token_timestamps`) used at two matching locations: the encoder-side `kv_cross` storage and the decoder-side cross-attention computation. Both locations must use the same condition to keep the KV cache layout consistent.
Test plan
Fixes #3662