[Qwen3_5]Remove unnecessary masked_fill_ in torch_chunk_gated_delta_rule attention computation: "attn = (q_i @ k_i.transpose(-1, -2) * decay_mask[:, :, i]).masked_fill_(mask, 0)" by ENg-122 · Pull Request #45215 · huggingface/transformers

ENg-122 · 2026-04-03T09:08:28Z

What does this PR do?

Remove unnecessary masked_fill_(mask, 0) call in torch_chunk_gated_delta_rule.

The decay_mask computed earlier already encodes the causal (lower-triangular) structure: its entries strictly above the diagonal are zero. Multiplying the attention scores by decay_mask[:, :, i] therefore zeroes the same positions that mask selects, so the extra masked_fill_(mask, 0) is redundant and adds unnecessary overhead.
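
The claim above depends on how decay_mask is built. Here is a minimal sketch of that structure, under assumptions that are not quoted from this PR: the mask is assumed to come from a cumulative log-decay `g` as a lower-triangular matrix of `exp(g_t - g_s)`, and `mask` is assumed to be the strictly upper-triangular boolean matrix. All shapes and values are hypothetical.

```python
import torch

torch.manual_seed(0)

# Hypothetical shapes: (batch, heads, num_chunks, chunk_size).
batch, heads, num_chunks, chunk_size = 1, 2, 3, 4

# Assumed: g holds per-position log-decay (non-positive), accumulated within each chunk.
g = -torch.rand(batch, heads, num_chunks, chunk_size)
g = g.cumsum(dim=-1)

# Assumed construction: pairwise differences g_t - g_s, kept on and below the diagonal.
# The second tril() matters: after the first tril() the upper entries are 0, and exp(0) = 1.
decay_mask = (g.unsqueeze(-1) - g.unsqueeze(-2)).tril().exp().tril()

# Assumed causal mask: True strictly above the diagonal.
mask = torch.triu(torch.ones(chunk_size, chunk_size, dtype=torch.bool), diagonal=1)

# Every position the causal mask would zero out is already zero in decay_mask.
upper_is_zero = bool((decay_mask[..., mask] == 0).all())
print(upper_is_zero)
```

If the real construction differs (for example, a mask with a different diagonal offset), the same check can be rerun against the actual tensors before relying on the redundancy argument.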

Before:
attn = (q_i @ k_i.transpose(-1, -2) * decay_mask[:, :, i]).masked_fill_(mask, 0)

After:
attn = (q_i @ k_i.transpose(-1, -2) * decay_mask[:, :, i])
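
One way to sanity-check the Before/After equivalence is to compare both expressions on random inputs. The sketch below assumes only that the decay factor is zero wherever `mask` is True; `decay_i` is a made-up stand-in for `decay_mask[:, :, i]`, and the sizes are hypothetical. It also illustrates the one case where the two forms can differ: if a raw score above the diagonal is `inf`, multiplying by zero yields NaN, whereas `masked_fill_` writes an exact 0.

```python
import torch

torch.manual_seed(0)

chunk_size, head_dim = 4, 8  # hypothetical sizes
q_i = torch.randn(1, 2, chunk_size, head_dim)
k_i = torch.randn(1, 2, chunk_size, head_dim)

# Stand-in for decay_mask[:, :, i]: any matrix that is zero strictly above the diagonal.
decay_i = torch.rand(1, 2, chunk_size, chunk_size).tril()
mask = torch.triu(torch.ones(chunk_size, chunk_size, dtype=torch.bool), diagonal=1)

# "After" form, then the "Before" form applied to a copy of the same scores.
after = q_i @ k_i.transpose(-1, -2) * decay_i
before = after.clone().masked_fill_(mask, 0)
same_for_finite = torch.equal(before, after)

# Caveat: a non-finite raw score above the diagonal is not cleaned up by the multiply.
scores = q_i @ k_i.transpose(-1, -2)
scores[..., 0, 1] = float("inf")
after_inf = scores * decay_i
before_inf = after_inf.clone().masked_fill_(mask, 0)
nan_without_fill = bool(torch.isnan(after_inf).any())
nan_with_fill = bool(torch.isnan(before_inf).any())
print(same_for_finite, nan_without_fill, nan_with_fill)
```

For finite scores the two forms agree element-wise, which is the situation the PR description addresses. Whether non-finite scores can occur in practice depends on the inputs and dtype and is not discussed in this PR.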

[√] I confirm that this is not a pure code agent PR.

Before submitting

[×] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
[√] Did you read the contributor guideline, Pull Request section?
[×] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
[√] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
[×] Did you write any new necessary tests?

Who can review?

@ArthurZucker @Cyrilvallez

Signed-off-by: zj <2716634506@qq.com>

github-actions · 2026-04-03T09:46:22Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: olmo_hybrid, qwen3_5, qwen3_5_moe, qwen3_next

ENg-122 and others added 3 commits April 3, 2026 16:48

[Qwen3_5]Remove excess mask

b76f595

Signed-off-by: zj <2716634506@qq.com>

Merge branch 'main' into test_main

061b7ee

[Qwen3_5]Remove unnecessary masked_fill_ in torch_chunk_gated_delta_r…

ce554fb

…ule attention computation Signed-off-by: zj <2716634506@qq.com>

Fix: remove brackets to match generated code format

c8ace4d

ENg-122 wants to merge 4 commits into huggingface:main from ENg-122:test_main