Commit aeb2964
opencl: WIP GGML_OP_GATED_DELTA_NET — autoregressive (n_tokens==1) only
For Qwen3-Next / Qwen3.6-35B-A3B / kimi-linear etc., llama.cpp builds the
DeltaNet recurrence either as a fused ggml_gated_delta_net op (when the
backend supports it) or as a sequence of primitive ggml ops (chunked or
recurrent). ggml-opencl had no GATED_DELTA_NET support, so even at
decode (n_tokens==1) it used build_delta_net_chunking with chunk_size=64
and n_tokens=1. That produced the "soup" of ~260 tiny generic-elementwise
dispatches per token, which accounted for ~30% of decode GPU time in the
cl_profiling trace.
This commit adds the autoregressive (n_tokens==1) path:
- kernels/gated_delta_net.cl: stream-from-global kernel; one thread per
(column j, head h, seq s). Thread owns column j of the per-head state
matrix (transposed: s_out[j*S_v + i] = S[i][j]). Reads input state +
k/q/g/v/beta from global, writes decayed/updated state back to global,
writes attn_out[j]. Math directly mirrors
ggml_compute_forward_gated_delta_net_one_chunk for n_tokens==1.
- ggml_backend_opencl_device_supports_op: only true for n_tokens==1, so
prefill keeps the chunked-primitive path (cparams.fused_gdn_ch
auto-disables on the chunked-graph reservation; fused_gdn_ar stays on).
- ggml_cl_gated_delta_net: 6-input dispatch (q,k,v,g,beta,state) reading
v/g/beta/state from dst->src[2..5], following the FLASH_ATTN_EXT pattern.
- supports_op + op routing + kernel compile + CMake registration done.
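For reference, the per-thread work described in the kernel bullet above can be sketched on the CPU as follows. This is a minimal sketch, not the kernel from this commit: it assumes the standard gated delta rule (decay, delta correction, rank-1 update, readout), a square S_v x S_v state, a scalar g (log-decay) and scalar beta per head, and a q that is already scaled. The names gdn_step_column and gdn_step are hypothetical. Only the transposed layout s[j*S_v + i] = S[i][j] is taken from the commit message.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One "thread": updates column j of a single head's state and returns
// attn_out[j]. Layout from the commit message: s[j*S_v + i] = S[i][j].
// Assumptions (not taken from the kernel): scalar g and beta per head,
// square state, q pre-scaled.
float gdn_step_column(std::size_t j, std::size_t S_v,
                      const std::vector<float>& state_in,
                      std::vector<float>& state_out,
                      const std::vector<float>& q,
                      const std::vector<float>& k,
                      const std::vector<float>& v,
                      float g, float beta) {
    assert(state_in.size() == S_v * S_v && state_out.size() == S_v * S_v);
    assert(q.size() == S_v && k.size() == S_v && v.size() == S_v && j < S_v);

    const float decay = std::exp(g);
    const float* col_in  = &state_in[j * S_v];
    float*       col_out = &state_out[j * S_v];

    // 1) Decay the column and accumulate kv_mem = sum_i S[i][j] * k[i].
    float kv_mem = 0.0f;
    for (std::size_t i = 0; i < S_v; ++i) {
        col_out[i] = decay * col_in[i];
        kv_mem += col_out[i] * k[i];
    }

    // 2) Delta correction for this column.
    const float delta = (v[j] - kv_mem) * beta;

    // 3) Rank-1 update S[i][j] += k[i] * delta, then read out with q.
    float out = 0.0f;
    for (std::size_t i = 0; i < S_v; ++i) {
        col_out[i] += k[i] * delta;
        out += col_out[i] * q[i];
    }
    return out;
}

// Serial stand-in for the (column j) dimension of the dispatch, one head.
std::vector<float> gdn_step(std::size_t S_v,
                            const std::vector<float>& state_in,
                            std::vector<float>& state_out,
                            const std::vector<float>& q,
                            const std::vector<float>& k,
                            const std::vector<float>& v,
                            float g, float beta) {
    std::vector<float> out(S_v);
    for (std::size_t j = 0; j < S_v; ++j) {
        out[j] = gdn_step_column(j, S_v, state_in, state_out, q, k, v, g, beta);
    }
    return out;
}
```

Because column j of the state depends only on q, k, g, beta and v[j], the columns are independent, which is what allows one thread per (j, h, s) with no cross-thread communication. The transposed layout makes each thread's column contiguous in memory.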
Confirmed:
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) not supported, set to disabled
**Status: BLOCKED on end-to-end validation.** With this kernel enabled
the model now hits a pre-existing -54 (CL_INVALID_WORK_GROUP_SIZE) in
kernel_moe_histogram for Qwen3.6-35B-A3B's n_experts=256 routing:
histogram_local_size[] = {64, ne20, 1} where ne20 == n_experts (256)
-> total local size = 16384 > device max 1024
This bug doesn't fire pre-change because the original CPU GDN fallback
puts post-attention ops on different graph splits; on-device GDN keeps
the MoE block on OpenCL and exposes the bad dispatch (ggml-opencl.cpp
near line 14684 -- size_t histogram_local_size[] = {64, ne20, 1}).
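The arithmetic behind the failure, and one possible shape for a fix, can be sketched as below. This is a hypothetical helper, not code from ggml-opencl.cpp: the names HistogramDispatch and clamp_histogram_local_size are made up for illustration, and the sketch assumes the kernel could be adapted to process the experts dimension in several passes (or groups).

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Hypothetical result type: a legal local size plus the number of
// passes (or work-groups) needed to cover all experts.
struct HistogramDispatch {
    std::array<std::size_t, 3> local;
    std::size_t expert_passes;
};

// Clamp {lanes, n_experts, 1} so that lanes * experts_per_group does not
// exceed the device's maximum work-group size.
HistogramDispatch clamp_histogram_local_size(std::size_t n_experts,
                                             std::size_t max_wg_size,
                                             std::size_t lanes = 64) {
    lanes = std::min(lanes, max_wg_size);
    const std::size_t max_experts_per_group =
        std::max<std::size_t>(1, max_wg_size / lanes);
    const std::size_t experts_per_group =
        std::min(n_experts, max_experts_per_group);
    // Ceiling division: passes needed to cover every expert.
    const std::size_t passes =
        (n_experts + experts_per_group - 1) / experts_per_group;
    return {{lanes, experts_per_group, 1}, passes};
}
```

With n_experts = 256 and a device maximum of 1024, the original {64, 256, 1} requests 64 * 256 = 16384 work-items per group, while the clamped version yields {64, 16, 1} (1024 work-items) over 16 passes. On a real device the limit would also need to respect CL_KERNEL_WORK_GROUP_SIZE and CL_DEVICE_MAX_WORK_ITEM_SIZES, which can be lower than the device-wide maximum.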
Next session: fix the histogram dispatch (split work along the experts
dim so total local size stays <= 1024) then run test-backend-ops -o
GATED_DELTA_NET and the Qwen3.6-35B decode bench A/B.
1 parent 443c16a commit aeb2964
3 files changed
Lines changed: 220 additions & 0 deletions
| File (in page order) | Added lines (new-file numbering) | Lines added |
|---|---|---|
| 1 | 171 | 1 |
| 2 | 771; 2739-2755; 5909-5918; 10469-10548; 20445-20450 | 114 |
| 3 (all lines added) | 1-105 | 105 |