
[ET-VK][sdpa] Use numerically-stable softmax in attention weights #18407

Merged
meta-codesync[bot] merged 1 commit into gh/SS-JIA/500/base from gh/SS-JIA/500/head on Mar 24, 2026

Conversation

@SS-JIA
Contributor

@SS-JIA SS-JIA commented Mar 23, 2026

Stack from ghstack (oldest at bottom):

The SDPA attention weights softmax shader computed naive softmax:
exp(x) / sum(exp(x)). When attention weights are large (e.g., 151.29 for
Phi-4-mini with head_dim=128), exp(x) overflows float32 (threshold ~88.7),
producing Infinity and then NaN from inf/inf in the normalization step.
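
For illustration only (not part of the PR), a minimal NumPy sketch of this failure mode, assuming a single row of attention logits containing a value near the ~151 reported above:

```python
# Hypothetical repro of the naive-softmax overflow, in float32 as in the shader.
import numpy as np

logits = np.array([151.29, 150.0, 10.0], dtype=np.float32)

exps = np.exp(logits)      # exp(151.29) exceeds float32 max (~3.4e38) -> inf
naive = exps / exps.sum()  # inf / inf -> NaN in the normalization step
print(naive)               # [nan nan  0.]
```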

This replaces the naive softmax with the standard numerically-stable variant:
exp(x - max(x)) / sum(exp(x - max(x))). The implementation adds a cooperative
max-finding pass (same workgroup reduction pattern as the existing exp_sum pass)
before the exp_sum and normalization passes. The max subtraction ensures that the
largest exponent is 0, preventing overflow.
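
A sequential NumPy sketch of the three-pass structure described above (max-finding, exp_sum, normalization); the actual implementation performs these passes as cooperative workgroup reductions in the Vulkan compute shader, so this is only an illustration of the math:

```python
import numpy as np

def stable_softmax(logits: np.ndarray) -> np.ndarray:
    # Pass 1: find the row maximum (the shader does this cooperatively per workgroup).
    row_max = logits.max()
    # Pass 2: exp_sum over the shifted values; the largest exponent is now exp(0) = 1.
    exps = np.exp(logits - row_max)
    denom = exps.sum()
    # Pass 3: normalization.
    return exps / denom

logits = np.array([151.29, 150.0, 10.0], dtype=np.float32)
print(stable_softmax(logits))  # finite probabilities, sums to 1, no overflow
```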

This fixes Phi-4-mini Vulkan inference, which previously produced garbage output
due to NaN propagation from the first transformer layer's attention.

On-device A/B benchmarks on Samsung Galaxy S24 (Adreno 750) with Llama 3.2 1B
(8da4w g128 q4emb, 677 MB) confirm no performance regression:

Llama 3.2 1B (short prompt, 4 tokens, --warmup):
Prefill: 67.2 tok/s | Decode: 59.4 tok/s | TTFT: 60 ms

Llama 3.2 1B (medium prompt, 197 tokens, --warmup):
Prefill: 723.5 tok/s | Decode: 53.3 tok/s | TTFT: 273 ms

These numbers are within run-to-run variance of the baseline (no fix) measurements,
confirming the additional max-finding pass has negligible overhead.

Differential Revision: [D97757920](https://our.internmc.facebook.com/intern/diff/D97757920/)

@pytorch-bot

pytorch-bot Bot commented Mar 23, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18407

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 2 Unrelated Failures

As of commit 4f2bffe with merge base 60d57e5:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 23, 2026
@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-codesync meta-codesync Bot merged commit 3ba9dec into gh/SS-JIA/500/base Mar 24, 2026
137 of 145 checks passed
@meta-codesync meta-codesync Bot deleted the gh/SS-JIA/500/head branch March 24, 2026 19:59
@meta-codesync meta-codesync Bot temporarily deployed to cherry-pick-bot March 24, 2026 19:59 Inactive
SS-JIA pushed a commit that referenced this pull request Mar 24, 2026

Labels

CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.), fb-exported, meta-exported
