[ExecuTorch][WebGPU] Register-tile the SDPA QK/AV kernels by JulianCloudNTH · Pull Request #20405 · pytorch/executorch

JulianCloudNTH · 2026-06-18T22:57:28Z

Stack from ghstack (oldest at bottom):

+32% on SDPA attention compute (+40% on AV) from register-tiling the QK and AV kernels, measured as an isolated GPU-timestamp A/B (decode S=1, Chrome Canary / M4 Pro). This is a kernel-time win, not a wall-clock forward() win: forward() stays bound by the submit/sync/readback floor, which is the separate fusion axis.

Problem: The naive QK/AV kernels compute one output element per thread, so each thread reloads Q/K/V and accumulates its dot product one scalar at a time. The result is poor register reuse, and the kernels end up ALU/latency-bound.

Solution: Each thread computes a 4×4 output tile with the dot products vec4-packed in registers:

Before: one thread per output element; scalar accumulate over D (QK) / context (AV).
After: one thread per (head, S-tile, {ctx,D}-tile); 4×4 register tile, vec4 dot products. This only reorders the floating-point accumulation of the same products; the algorithm is unchanged.
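The before/after difference can be modeled on the CPU. The following is a minimal sketch for the QK side only, not the WGSL kernel itself: the names `qk_naive`, `qk_tiled`, `load4`, and `dot4` are hypothetical, a single head is assumed, Q is taken as row-major [S x D] and K as row-major [ctx x D], D is assumed to be a multiple of 4, and scaling and masking are left out.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec4 = std::array<float, 4>;
constexpr std::size_t kTile = 4;  // 4x4 output tile, as in the PR description

// Dot product of two 4-wide chunks (stands in for a WGSL vec4 dot).
inline float dot4(const Vec4& a, const Vec4& b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Loads 4 consecutive floats of `row` starting at column d. Rows past the end
// load zeros, so tail tiles need no special casing in the inner loop.
inline Vec4 load4(const std::vector<float>& m, std::size_t row, std::size_t rows,
                  std::size_t D, std::size_t d) {
  Vec4 v{};
  if (row < rows) {
    for (std::size_t i = 0; i < 4; ++i) v[i] = m[row * D + d + i];
  }
  return v;
}

// Before: one "thread" per output element, scalar accumulate over D.
std::vector<float> qk_naive(const std::vector<float>& q, const std::vector<float>& k,
                            std::size_t S, std::size_t ctx, std::size_t D) {
  std::vector<float> out(S * ctx, 0.0f);
  for (std::size_t s = 0; s < S; ++s) {
    for (std::size_t c = 0; c < ctx; ++c) {
      float acc = 0.0f;
      for (std::size_t d = 0; d < D; ++d) acc += q[s * D + d] * k[c * D + d];
      out[s * ctx + c] = acc;
    }
  }
  return out;
}

// After: one "thread" per (S-tile, ctx-tile). Each loaded Q chunk is reused
// against 4 K rows and each K chunk against 4 Q rows.
std::vector<float> qk_tiled(const std::vector<float>& q, const std::vector<float>& k,
                            std::size_t S, std::size_t ctx, std::size_t D) {
  std::vector<float> out(S * ctx, 0.0f);
  for (std::size_t s0 = 0; s0 < S; s0 += kTile) {
    for (std::size_t c0 = 0; c0 < ctx; c0 += kTile) {
      float acc[kTile][kTile] = {};  // the register tile
      for (std::size_t d = 0; d < D; d += 4) {
        Vec4 qt[kTile], kt[kTile];
        for (std::size_t i = 0; i < kTile; ++i) qt[i] = load4(q, s0 + i, S, D, d);
        for (std::size_t j = 0; j < kTile; ++j) kt[j] = load4(k, c0 + j, ctx, D, d);
        for (std::size_t i = 0; i < kTile; ++i)
          for (std::size_t j = 0; j < kTile; ++j) acc[i][j] += dot4(qt[i], kt[j]);
      }
      // Store only the in-range part of the tile.
      for (std::size_t i = 0; i < kTile; ++i)
        for (std::size_t j = 0; j < kTile; ++j)
          if (s0 + i < S && c0 + j < ctx) out[(s0 + i) * ctx + (c0 + j)] = acc[i][j];
    }
  }
  return out;
}

// Largest element-wise difference between two equally sized outputs.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  float m = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) m = std::max(m, std::fabs(a[i] - b[i]));
  return m;
}
```

In this model the tiled version issues 8 chunk loads per 16 dot products, where the naive version would load both operands for every product. Comparing the two outputs with `max_abs_diff` on shapes that are not multiples of 4 (for example S=5, ctx=7) also exercises the tail handling.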

Implementation:

sdpa_compute_attn_weights.wgsl (QK): one thread per (head, S-tile, ctx-tile), grid Hq · ceil(S/4) · ceil(ctx/4); tile registers are array<vec4<f32>, TM/TN> loaded via for loops.
sdpa_compute_out.wgsl (AV): one thread per (head, S-tile, D-tile), grid Hq · ceil(S/4) · ceil(D/4).
Sdpa.cpp: dispatch math moves from an element count to a tile count (kSdpaTileM/N=4, shared utils::div_up), keeping the uint32 scratch-overflow guard.
Mirrors the Vulkan register-tiled SDPA kernels; the shared utils::div_up mirrors Vulkan's utils::div_up.
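The move from an element count to a tile count can be sketched as follows. This is an illustration under assumptions, not the code in Sdpa.cpp: `div_up` here is a stand-in for the shared utils::div_up (whose real signature is not shown in this PR), `checked_mul3`, `qk_tile_count`, and `av_tile_count` are hypothetical helpers, and the overflow check only approximates what a uint32 guard might look like.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

constexpr std::uint32_t kSdpaTileM = 4;  // tile height, value from the PR description
constexpr std::uint32_t kSdpaTileN = 4;  // tile width, value from the PR description

// Ceiling division: number of size-d tiles needed to cover n elements.
constexpr std::uint64_t div_up(std::uint64_t n, std::uint64_t d) {
  return (n + d - 1) / d;
}

// Multiplies three values that each fit in uint32 and reports whether the
// product also fits in uint32. The intermediate math is done in uint64.
inline bool checked_mul3(std::uint64_t a, std::uint64_t b, std::uint64_t c,
                         std::uint32_t* out) {
  const std::uint64_t kMax = std::numeric_limits<std::uint32_t>::max();
  const std::uint64_t ab = a * b;    // cannot wrap: a and b are below 2^32
  if (ab > kMax) return false;
  const std::uint64_t abc = ab * c;  // cannot wrap: ab and c are below 2^32
  if (abc > kMax) return false;
  *out = static_cast<std::uint32_t>(abc);
  return true;
}

// QK grid: Hq * ceil(S/4) * ceil(ctx/4) threads.
inline bool qk_tile_count(std::uint32_t Hq, std::uint32_t S, std::uint32_t ctx,
                          std::uint32_t* out) {
  return checked_mul3(Hq, div_up(S, kSdpaTileM), div_up(ctx, kSdpaTileN), out);
}

// AV grid: Hq * ceil(S/4) * ceil(D/4) threads.
inline bool av_tile_count(std::uint32_t Hq, std::uint32_t S, std::uint32_t D,
                          std::uint32_t* out) {
  return checked_mul3(Hq, div_up(S, kSdpaTileM), div_up(D, kSdpaTileN), out);
}
```

For a decode step with Hq=32, S=1, and ctx=1024, the QK thread count in this sketch is 32 · 1 · 256 = 8192, against 32 · 1 · 1024 = 32768 for one thread per element. With S=1 only one of the four tile rows holds real data, so the 4× reduction comes entirely from the ctx dimension.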

Constraints:

softmax, update_cache, the bind-group layouts, and the scratch-buffer sizes (Hq*S*ctx) are unchanged.
Scope is tiling only: causal tile-skip, V-cache coalescing, and branchless aligned/tail loads are separate follow-ups. This diff intentionally omits the Vulkan causal tile-skip; leaving it out is correctness-neutral because the per-element mask in store_qk is identical. See DESIGN_DECISIONS.md.
Output matches the naive kernels within fp tolerance (accumulation reorder only).
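Why an accumulation reorder calls for a tolerance rather than exact equality can be shown with a small sketch. The grouping in `sum_grouped4` is one plausible ordering for a vec4-packed accumulate, not necessarily the order a given GPU uses, and the `close` helper with its thresholds is a hypothetical example rather than the comparison used by the PR's tests.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar order: ((p0 + p1) + p2) + ... one product at a time.
float sum_sequential(const std::vector<float>& p) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < p.size(); ++i) acc += p[i];
  return acc;
}

// Grouped order: each block of four products is summed first, then folded
// into the running accumulator. Same terms, different rounding points.
float sum_grouped4(const std::vector<float>& p) {
  float acc = 0.0f;
  std::size_t i = 0;
  for (; i + 4 <= p.size(); i += 4) {
    acc += ((p[i] + p[i + 1]) + p[i + 2]) + p[i + 3];
  }
  for (; i < p.size(); ++i) acc += p[i];  // leftover terms
  return acc;
}

// Mixed absolute/relative comparison; the default thresholds are illustrative.
bool close(float a, float b, float rtol = 1e-5f, float atol = 1e-6f) {
  return std::fabs(a - b) <= atol + rtol * std::fabs(b);
}
```

Both functions add exactly the same terms, so any difference between them comes from where rounding happens. That is why a check like `close` is the appropriate comparison for the tiled kernels against the naive ones.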
@exported-using-ghexport

Differential Revision: D109081409

[ghstack-poisoned]

pytorch-bot · 2026-06-18T22:57:33Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20405

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

[ROCm] MI350 CI runner label rename: rebase PRs using old linux.rocm.gpu.gfx950.* labels

❌ 3 New Failures, 3 Unrelated Failures

As of commit 3ce91e0 with merge base 0e65ba6:

NEW FAILURES - The following jobs have failed:

pull / test-qnn-models-linux (dl3) / linux-job (gh)
RuntimeError: Command docker exec -t 0b397036a6da1ad82cd122e4e3ff8882cc4ae787364218a71d306a46878724e2 /exec failed with exit code 92
pull / unittest / linux / linux-job (gh)
RuntimeError: Command docker exec -t dfdcf549e8d5e9e3c654b3452c863f088fe01ad54a009bf5c1eaecb829db88e5 /exec failed with exit code 1
pull / unittest-editable / linux / linux-job (gh)
RuntimeError: Command docker exec -t 605504f7e799186cd77efb2118ca02795a2ea7041508e7287cff19b290454eae /exec failed with exit code 1

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (trunk failure)
pull / unittest / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
pull / unittest-editable / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-18T22:58:42Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

[ghstack-poisoned]

SS-JIA

Review automatically exported from Phabricator review in Meta.

Update

b0e2d6b

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 18, 2026 22:57 — with GitHub Actions Inactive

meta-cla Bot added the CLA Signed label Jun 18, 2026

Update

3ce91e0

[ghstack-poisoned]

This was referenced Jun 24, 2026

[ExecuTorch][WebGPU] SDPA: skip QK contraction for fully-masked causal tiles #20492

Open

[ExecuTorch][WebGPU] SDPA: branchless aligned/tail loads in the QK/AV kernels #20493

Open

JulianCloudNTH temporarily deployed to cadence June 24, 2026 19:54 — with GitHub Actions Inactive

meta-codesync Bot added the meta-exported label Jun 24, 2026

SS-JIA approved these changes Jun 24, 2026

JulianCloudNTH wants to merge 2 commits into gh/JulianCloudNTH/49/base from gh/JulianCloudNTH/49/head

3 participants
