Commit 79c8119
SYCL: add BF16 to DMMV kernel path (~4x tg speedup on Intel Arc) (ggml-org#21580)
* SYCL: add BF16 to DMMV kernel path for ~4x token generation speedup
BF16 models had no dedicated token generation kernel — they fell through
to the generic full-GEMM path, resulting in ~14% memory bandwidth
utilization on Intel Arc GPUs. This adds BF16 support to the DMMV
(dequantize mul-mat-vec) path, matching the existing F16 implementation.
Fixes ggml-org#20478
* SYCL: fix BF16 DMMV out-of-bounds when ncols % 64 != 0
The qk=1 kernel (used for F16 and BF16) iterates with stride
2*GGML_SYCL_DMMV_X (= 64 on Intel targets where WARP_SIZE=16). When
ncols is a multiple of DMMV_X (32) but not of 2*DMMV_X (64), the last
warp iteration accesses elements at col >= ncols, producing NaN for the
final row and wrong values for interior rows.
Fix: tighten can_use_dequantize_mul_mat_vec to require ne[0] %
(2*DMMV_X) == 0 for F16/BF16 types, and update the ASSERT in the BF16
launcher to match. Quantized types use block-structured kernels with
different access patterns and keep the existing DMMV_X check.
Verified: test-backend-ops MUL_MAT passes 913/913 on Intel Arc Pro B70.
Previously failing: m=128/129 n=1 k=1056 cases (NaN and ERR > 0.0005).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>1 parent 5a098a3 commit 79c8119
2 files changed
Lines changed: 53 additions & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
3 | 3 | | |
4 | 4 | | |
5 | 5 | | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
6 | 13 | | |
7 | 14 | | |
8 | 15 | | |
| |||
11 | 18 | | |
12 | 19 | | |
13 | 20 | | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
14 | 31 | | |
15 | 32 | | |
16 | 33 | | |
| |||
217 | 234 | | |
218 | 235 | | |
219 | 236 | | |
| 237 | + | |
| 238 | + | |
| 239 | + | |
| 240 | + | |
| 241 | + | |
| 242 | + | |
| 243 | + | |
| 244 | + | |
| 245 | + | |
| 246 | + | |
| 247 | + | |
| 248 | + | |
| 249 | + | |
| 250 | + | |
| 251 | + | |
| 252 | + | |
| 253 | + | |
| 254 | + | |
| 255 | + | |
| 256 | + | |
| 257 | + | |
| 258 | + | |
220 | 259 | | |
221 | 260 | | |
222 | 261 | | |
| |||
1497 | 1536 | | |
1498 | 1537 | | |
1499 | 1538 | | |
1500 | | - | |
| 1539 | + | |
| 1540 | + | |
1501 | 1541 | | |
1502 | 1542 | | |
1503 | 1543 | | |
| |||
1565 | 1605 | | |
1566 | 1606 | | |
1567 | 1607 | | |
| 1608 | + | |
| 1609 | + | |
| 1610 | + | |
| 1611 | + | |
| 1612 | + | |
1568 | 1613 | | |
1569 | 1614 | | |
1570 | 1615 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
3455 | 3455 | | |
3456 | 3456 | | |
3457 | 3457 | | |
| 3458 | + | |
3458 | 3459 | | |
3459 | 3460 | | |
3460 | 3461 | | |
| |||
3818 | 3819 | | |
3819 | 3820 | | |
3820 | 3821 | | |
| 3822 | + | |
| 3823 | + | |
| 3824 | + | |
| 3825 | + | |
| 3826 | + | |
3821 | 3827 | | |
3822 | | - | |
| 3828 | + | |
3823 | 3829 | | |
3824 | 3830 | | |
3825 | 3831 | | |
| |||
0 commit comments