Attempt to fix #2220: Enable Flash Attention in KV-cache path by woaiwang · Pull Request #2248 · Lightning-AI/litgpt

woaiwang · 2026-05-09T12:04:18Z

What I did:
In #2220 I noticed that passing an explicit attn_mask prevents PyTorch SDPA from selecting its Flash Attention fast path. To re-enable it, this PR drops the mask during the decoding phase (q.size(2) == 1) in CausalSelfAttention.scaled_dot_product_attention.

The Issue:
Running pytest tests/test_model.py fails with AssertionError: Tensor-likes are not close! in a few tests (e.g., test_against_gpt_neox_model).
I believe the cause is that, without the mask, the query also attends to the uninitialized/padded slots of the k and v tensors in the KVCache whenever input_pos_maxp1 does not slice them down to the filled positions.
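
A minimal sketch of the failure mode described above, independent of litgpt's actual code. The shapes, the `filled` count, and the randomly filled cache are assumptions for illustration only; random values stand in for whatever stale or padded data the unused cache slots hold.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, n_head, max_len, head_size = 1, 2, 8, 4
filled = 5  # hypothetical: only positions 0..4 hold real keys/values

# Preallocated cache; slots >= filled stand in for stale or padded entries.
k_cache = torch.randn(B, n_head, max_len, head_size)
v_cache = torch.randn(B, n_head, max_len, head_size)
q = torch.randn(B, n_head, 1, head_size)  # a single decode-step query

# Boolean mask: True marks the cache slots the query may attend to.
mask = torch.zeros(1, 1, 1, max_len, dtype=torch.bool)
mask[..., :filled] = True

y_masked = F.scaled_dot_product_attention(q, k_cache, v_cache, attn_mask=mask)
# Dropping the mask lets the softmax spread weight over the unused slots too.
y_unmasked = F.scaled_dot_product_attention(q, k_cache, v_cache)

mismatch = not torch.allclose(y_masked, y_unmasked)
```

Here `mismatch` ends up true because the unmasked call mixes the unused slots into the output, which is consistent with the "Tensor-likes are not close!" failures.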

Question for reviewers:
Could you guide me on the safest way to slice the KV-cache or formulate the is_causal flag here so that we can safely drop the explicit mask without reading uninitialized memory? I would love to finish this PR with your guidance!
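
One possible direction, sketched under assumptions rather than taken from litgpt: slice the cache to the filled prefix before calling SDPA and pass neither a mask nor is_causal=True. The helper name `decode_attention_sliced` and the `filled` argument are hypothetical, and the sketch assumes every sequence in the batch has the same number of filled positions. Note that PyTorch aligns the is_causal mask to the top-left corner, so for a single query against several keys is_causal=True would keep only key 0.

```python
import torch
import torch.nn.functional as F

def decode_attention_sliced(q, k_cache, v_cache, filled):
    """Hypothetical helper: attend over the first `filled` cache slots, no mask."""
    k = k_cache[:, :, :filled]
    v = v_cache[:, :, :filled]
    # With one query at the last filled position, every remaining key is
    # visible, so no mask is needed once the unused slots are sliced away.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=None, is_causal=False)

torch.manual_seed(0)
B, n_head, max_len, head_size = 1, 2, 8, 4
filled = 5  # hypothetical number of valid cache positions
k_cache = torch.randn(B, n_head, max_len, head_size)
v_cache = torch.randn(B, n_head, max_len, head_size)
q = torch.randn(B, n_head, 1, head_size)

# Reference: full cache with an explicit boolean mask (True = may attend).
mask = torch.zeros(1, 1, 1, max_len, dtype=torch.bool)
mask[..., :filled] = True
y_ref = F.scaled_dot_product_attention(q, k_cache, v_cache, attn_mask=mask)

y_sliced = decode_attention_sliced(q, k_cache, v_cache, filled)

# is_causal=True with a 1 x filled score matrix keeps only the first key,
# so the output collapses to the value stored at position 0.
y_causal = F.scaled_dot_product_attention(
    q, k_cache[:, :, :filled], v_cache[:, :, :filled], is_causal=True
)
```

In this sketch `y_sliced` matches `y_ref`, while `y_causal` does not. Whether the Flash Attention kernel is actually selected for the mask-free call still depends on device, dtype, and PyTorch version, and batches whose sequences sit at different positions would need per-sequence handling that plain slicing does not provide.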

fix: attempt to enable flash attn in decoding by dropping mask (Lightning-AI#2220)

f3504b2

woaiwang requested review from andyland, k223kim, lianakoleva and t-vi as code owners May 9, 2026 12:04

woaiwang wants to merge 1 commit into Lightning-AI:main from woaiwang:fix-issue-2220
