Commit 2fdc6d5
committed
[ExecuTorch][WebGPU] SDPA: branchless aligned/tail loads in the QK/AV kernels
Pull Request resolved: #20493
**Branchless aligned/tail loads + vec4 storage bindings** — drop the always-true per-lane bounds checks in the tiled QK/AV hot loops, split the AV context contraction into a branch-free aligned body plus a checked tail, and declare the head-dim-indexed SDPA storage buffers as `array<vec4<f32>>` so the loads/stores are forced-vectorized (addresses review feedback to mirror Vulkan's vec4 bindings).
**Problem**: The tiled QK/AV vec4 loaders run 4 per-lane `if` bounds checks on every load, every contraction iteration (8 loads/iter). But `head_dim` is always a multiple of 4, so the D-axis checks never fire, and the AV context axis only needs a bounds check on the last ragged chunk. Separately the storage buffers were declared `array<f32>`, so the 4-lane loads/stores were not guaranteed to compile to aligned 128-bit vector accesses.
**Solution**: Remove the dead checks, split the ragged axis, and vectorize the bindings:
- **Before**: `load_q_vec4`/`load_k_vec4` (and AV `load_a_vec4`/`load_v_d4`) do 4 per-lane bounds `if`s per call; the AV `c4` loop runs checked loads for every chunk; `t_q`/`t_k_cache`/`t_v_cache`/`t_out` are `array<f32>` accessed element-by-element.
- **After**: QK loads are a plain unchecked `vec4` (D%4==0, host-guarded); AV runs a branch-free aligned body over `c4 in [0, context_len - context_len%4)` then a 0-or-1 checked tail; the head-dim-indexed buffers `t_q`/`t_k_cache`/`t_v_cache`/`t_out` are `array<vec4<f32>>` indexed `[base/4u]`, and AV writes a single aligned `store_out_vec4`.
**Implementation**:
- QK: `load_q_vec4`/`load_k_vec4` drop the per-lane D checks and return `t_q[base/4u]` / `t_k_cache[base/4u]`.
- AV: branch-free `load_a_vec4_nc`/`load_v_d4_nc` for the aligned body; checked `load_a_vec4`/`load_v_d4` for the tail; V reads `t_v_cache[base/4u]`; output is one aligned `store_out_vec4`.
- Bindings: `t_q`, `t_k_cache` (QK) and `t_v_cache`, `t_out` (AV) are `array<vec4<f32>>`. `t_attn_weights` and the softmax buffer stay `array<f32>` — they are `context_len`-indexed (row stride not 4-aligned) and written per-element under the causal mask, so a `vec4` binding there would need a padded scratch row.
- Host: add a `D % 4 == 0` guard in `Sdpa.cpp` — WGSL has no `SDPA_PAD_D` pad-load, so fail loud rather than read past the row; this guard also makes every `[base/4u]` index 4-aligned and every buffer a 16-byte multiple.
- Test: add a `reject_d6` (head_dim=6) config + an `expect_reject` harness branch asserting the guard rejects a non-aligned head_dim at load.
- Mirrors Vulkan `sdpa_compute_out_tiled.glsl` (aligned/tail split) and Vulkan's `array<vec4>` SDPA bindings.
**Constraints**:
- Requires `head_dim % 4 == 0` (true for every Llama config, D=64); enforced by a loud host throw, not a silent narrowing.
- Bit-identical output: the aligned body processes the same chunks in the same accumulation order as the scalar loop, the tail's out-of-range lanes contribute 0, and the `vec4` bindings read/write the same bytes as the scalar version.
- No KV-cache layout, dispatch, or uniform change.
Co-authored with Claude Code.
ghstack-source-id: 396717582
@exported-using-ghexport
Differential Revision: [D109521069](https://our.internmc.facebook.com/intern/diff/D109521069/)1 parent 26bd1c1 commit 2fdc6d5
7 files changed
Lines changed: 132 additions & 78 deletions
File tree
- backends/webgpu
- runtime/ops/sdpa
- test
- ops/sdpa
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
339 | 339 | | |
340 | 340 | | |
341 | 341 | | |
| 342 | + | |
| 343 | + | |
| 344 | + | |
| 345 | + | |
| 346 | + | |
342 | 347 | | |
343 | 348 | | |
344 | 349 | | |
| |||
Lines changed: 9 additions & 18 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | 1 | | |
2 | | - | |
3 | | - | |
| 2 | + | |
| 3 | + | |
4 | 4 | | |
5 | 5 | | |
6 | 6 | | |
| |||
22 | 22 | | |
23 | 23 | | |
24 | 24 | | |
| 25 | + | |
25 | 26 | | |
26 | | - | |
27 | 27 | | |
28 | | - | |
| 28 | + | |
29 | 29 | | |
30 | | - | |
31 | | - | |
32 | | - | |
33 | | - | |
34 | | - | |
35 | | - | |
| 30 | + | |
| 31 | + | |
36 | 32 | | |
37 | 33 | | |
38 | 34 | | |
39 | | - | |
40 | 35 | | |
41 | | - | |
| 36 | + | |
42 | 37 | | |
43 | | - | |
44 | | - | |
45 | | - | |
46 | | - | |
47 | | - | |
48 | | - | |
| 38 | + | |
| 39 | + | |
49 | 40 | | |
50 | 41 | | |
51 | 42 | | |
| |||
Lines changed: 10 additions & 19 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
13 | 13 | | |
14 | 14 | | |
15 | 15 | | |
16 | | - | |
| 16 | + | |
17 | 17 | | |
18 | 18 | | |
19 | | - | |
20 | | - | |
| 19 | + | |
| 20 | + | |
21 | 21 | | |
22 | 22 | | |
23 | 23 | | |
| |||
39 | 39 | | |
40 | 40 | | |
41 | 41 | | |
| 42 | + | |
42 | 43 | | |
43 | | - | |
44 | 44 | | |
45 | | - | |
| 45 | + | |
46 | 46 | | |
47 | | - | |
48 | | - | |
49 | | - | |
50 | | - | |
51 | | - | |
52 | | - | |
| 47 | + | |
| 48 | + | |
53 | 49 | | |
54 | 50 | | |
55 | 51 | | |
56 | | - | |
57 | 52 | | |
58 | | - | |
| 53 | + | |
59 | 54 | | |
60 | | - | |
61 | | - | |
62 | | - | |
63 | | - | |
64 | | - | |
65 | | - | |
| 55 | + | |
| 56 | + | |
66 | 57 | | |
67 | 58 | | |
68 | 59 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | | - | |
| 1 | + | |
2 | 2 | | |
3 | | - | |
| 3 | + | |
4 | 4 | | |
5 | 5 | | |
6 | 6 | | |
| |||
19 | 19 | | |
20 | 20 | | |
21 | 21 | | |
| 22 | + | |
22 | 23 | | |
23 | 24 | | |
24 | 25 | | |
| |||
33 | 34 | | |
34 | 35 | | |
35 | 36 | | |
36 | | - | |
37 | 37 | | |
38 | | - | |
| 38 | + | |
39 | 39 | | |
40 | 40 | | |
41 | | - | |
42 | | - | |
43 | | - | |
44 | | - | |
45 | | - | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
46 | 56 | | |
47 | 57 | | |
48 | | - | |
49 | | - | |
| 58 | + | |
| 59 | + | |
50 | 60 | | |
51 | 61 | | |
52 | | - | |
53 | | - | |
| 62 | + | |
| 63 | + | |
54 | 64 | | |
55 | 65 | | |
56 | 66 | | |
| |||
77 | 87 | | |
78 | 88 | | |
79 | 89 | | |
| 90 | + | |
| 91 | + | |
80 | 92 | | |
81 | 93 | | |
82 | | - | |
| 94 | + | |
83 | 95 | | |
84 | 96 | | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
85 | 112 | | |
86 | 113 | | |
87 | 114 | | |
| |||
94 | 121 | | |
95 | 122 | | |
96 | 123 | | |
97 | | - | |
98 | 124 | | |
99 | 125 | | |
100 | 126 | | |
101 | 127 | | |
102 | 128 | | |
103 | 129 | | |
104 | 130 | | |
105 | | - | |
106 | | - | |
107 | | - | |
108 | | - | |
109 | | - | |
| 131 | + | |
110 | 132 | | |
111 | 133 | | |
112 | 134 | | |
Lines changed: 43 additions & 21 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
13 | 13 | | |
14 | 14 | | |
15 | 15 | | |
16 | | - | |
| 16 | + | |
17 | 17 | | |
18 | | - | |
| 18 | + | |
19 | 19 | | |
20 | | - | |
| 20 | + | |
21 | 21 | | |
22 | 22 | | |
23 | 23 | | |
| |||
36 | 36 | | |
37 | 37 | | |
38 | 38 | | |
| 39 | + | |
39 | 40 | | |
40 | 41 | | |
41 | 42 | | |
| |||
50 | 51 | | |
51 | 52 | | |
52 | 53 | | |
53 | | - | |
54 | 54 | | |
55 | | - | |
| 55 | + | |
56 | 56 | | |
57 | 57 | | |
58 | | - | |
59 | | - | |
60 | | - | |
61 | | - | |
62 | | - | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
63 | 73 | | |
64 | 74 | | |
65 | | - | |
66 | | - | |
| 75 | + | |
| 76 | + | |
67 | 77 | | |
68 | 78 | | |
69 | | - | |
70 | | - | |
| 79 | + | |
| 80 | + | |
71 | 81 | | |
72 | 82 | | |
73 | 83 | | |
| |||
94 | 104 | | |
95 | 105 | | |
96 | 106 | | |
| 107 | + | |
| 108 | + | |
97 | 109 | | |
98 | 110 | | |
99 | | - | |
| 111 | + | |
100 | 112 | | |
101 | 113 | | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
102 | 129 | | |
103 | 130 | | |
104 | 131 | | |
| |||
111 | 138 | | |
112 | 139 | | |
113 | 140 | | |
114 | | - | |
115 | 141 | | |
116 | 142 | | |
117 | 143 | | |
118 | 144 | | |
119 | 145 | | |
120 | 146 | | |
121 | 147 | | |
122 | | - | |
123 | | - | |
124 | | - | |
125 | | - | |
126 | | - | |
| 148 | + | |
127 | 149 | | |
128 | 150 | | |
129 | 151 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
59 | 59 | | |
60 | 60 | | |
61 | 61 | | |
| 62 | + | |
| 63 | + | |
62 | 64 | | |
63 | 65 | | |
64 | 66 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
435 | 435 | | |
436 | 436 | | |
437 | 437 | | |
| 438 | + | |
438 | 439 | | |
439 | 440 | | |
440 | 441 | | |
| |||
454 | 455 | | |
455 | 456 | | |
456 | 457 | | |
| 458 | + | |
| 459 | + | |
| 460 | + | |
| 461 | + | |
| 462 | + | |
| 463 | + | |
| 464 | + | |
| 465 | + | |
| 466 | + | |
| 467 | + | |
| 468 | + | |
457 | 469 | | |
458 | 470 | | |
459 | 471 | | |
| |||
507 | 519 | | |
508 | 520 | | |
509 | 521 | | |
| 522 | + | |
| 523 | + | |
| 524 | + | |
| 525 | + | |
| 526 | + | |
| 527 | + | |
| 528 | + | |
| 529 | + | |
| 530 | + | |
510 | 531 | | |
511 | 532 | | |
512 | 533 | | |
| |||
0 commit comments