Extend the OpenCL gated_delta_net kernel to support K>1 input/output
state slots, matching the CUDA / Metal / Vulkan / SYCL implementations
landed by upstream PR ggml-org#22673 ("llama + spec: MTP Support") and
PR ggml-org#23174 (SYCL K>1).
MTP draft heads predict K tokens ahead; the verify batch then rolls
back any rejected draft tokens by reading from the K snapshot slots
the forward pass writes during the n_tokens loop. K==1 is the legacy
backwards-compatible single-slot final-state-only behaviour.

Layout:
- Input state: (S_v*S_v*H, K, n_seqs); only slot 0 carries the seed.
- Output state: K slots stacked as the outermost dim, each
  S_v*S_v*H*n_seqs floats. With shift = n_tokens - K and
  target_slot = t - shift, the kernel writes the state after token t
  to slot target_slot when 0 <= target_slot < K.
- For K>n_tokens (cold spec restart), only the last n_tokens slots
  are written; earlier slots are caller-owned and left untouched.
- For K==1 the per-t write condition fires once on the last iteration
  (slot 0 = final state), preserving prior semantics.
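
The slot rule above can be sketched on the host side as a small C helper. This is only an illustration of the arithmetic, not the kernel code itself; the function name gdn_target_slot is hypothetical and the -1 "no write" return value is a convention chosen for this sketch.

```c
#include <assert.h>

/* Hypothetical helper (not part of the kernel): returns the output slot
 * that token index t snapshots into, or -1 when this token's state is
 * not written. Follows the rule in the text:
 *   shift = n_tokens - K, target_slot = t - shift,
 *   write iff 0 <= target_slot < K. */
static int gdn_target_slot(int t, int n_tokens, int K) {
    assert(K >= 1 && n_tokens >= 1 && t >= 0 && t < n_tokens);
    const int shift = n_tokens - K;      /* negative when K > n_tokens */
    const int target_slot = t - shift;
    return (target_slot >= 0 && target_slot < K) ? target_slot : -1;
}
```

With n_tokens = 5 and K = 3, tokens 2, 3 and 4 map to slots 0, 1 and 2. With n_tokens = 2 and K = 4, the two tokens map to slots 2 and 3, so slots 0 and 1 stay untouched, as the cold-restart bullet describes.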

Both kernels updated:
- kernel_gated_delta_net_f32 (generic, any S_v <= 128): adopts a
  private working column s_col[GDN_GENERIC_MAX_SV] so the per-t slot
  write doesn't have to read back from global between tokens. Replaces
  the previous in-place global s_out modification.
- kernel_gated_delta_net_f32_sv128 (Qwen3-Next / Qwen3.6-A3B fast
  path): state was already kept in per-lane private s_shard[4]; just
  added the per-t slot write loop using the same target_slot rule.
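
The private-working-column pattern can be illustrated with a CPU-side C sketch. Everything here is an assumption made for illustration: the names sketch_update and sketch_run_column are hypothetical, SKETCH_MAX_SV = 128 is inferred from the "S_v <= 128" note, and the per-token update is a placeholder rather than the real gated delta rule. Only the snapshot logic (keep the column private, copy it to slot target_slot when in range) reflects the description above.

```c
#include <stddef.h>

#define SKETCH_MAX_SV 128  /* stand-in for GDN_GENERIC_MAX_SV (value assumed) */

/* Placeholder per-token update; NOT the real gated delta recurrence. */
static void sketch_update(float *s_col, int S_v, float decay, float x) {
    for (int i = 0; i < S_v; ++i) {
        s_col[i] = decay * s_col[i] + x;
    }
}

/* Processes n_tokens for one state column held in a private array and
 * snapshots it into out_slots. out_slots holds K slots of slot_stride
 * floats each; col_off is this column's offset inside a slot. */
static void sketch_run_column(const float *seed, float *out_slots,
                              size_t slot_stride, size_t col_off,
                              int S_v, int n_tokens, int K,
                              const float *x, float decay) {
    float s_col[SKETCH_MAX_SV];          /* private working column */
    for (int i = 0; i < S_v; ++i) {
        s_col[i] = seed[i];              /* seed comes from input slot 0 */
    }
    const int shift = n_tokens - K;
    for (int t = 0; t < n_tokens; ++t) {
        sketch_update(s_col, S_v, decay, x[t]);
        const int target_slot = t - shift;
        if (target_slot >= 0 && target_slot < K) {
            /* Write-only access to global memory: no read-back needed. */
            float *dst = out_slots + (size_t)target_slot * slot_stride + col_off;
            for (int i = 0; i < S_v; ++i) {
                dst[i] = s_col[i];
            }
        }
    }
}
```

Because the running state lives in s_col, the output buffer is only ever written, never read, between tokens. That is the property the generic kernel change relies on.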
Dispatch derives K from src_state->ne[1] and forwards it as the last
kernel arg. supports_op needed no change — the existing f32-only gate
already accepts both K==1 and K>1 ops.
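
As a rough sketch of how K falls out of the state shape, the C snippet below uses a mock struct in place of the real tensor type. The struct sketch_tensor, the function sketch_state_slots, and the shape checks are all hypothetical; only the convention that ne[0] = S_v*S_v*H, ne[1] = K and ne[2] = n_seqs is taken from the layout described above.

```c
#include <stdint.h>

/* Hypothetical stand-in for a tensor; only the ne[] shape array matters. */
struct sketch_tensor {
    int64_t ne[4];
};

/* Reads K from ne[1] of the input state. Returns -1 if the shape does
 * not match (S_v*S_v*H, K, n_seqs); these checks are an assumption of
 * this sketch, not something the commit claims the dispatch does. */
static int sketch_state_slots(const struct sketch_tensor *src_state,
                              int64_t S_v, int64_t H, int64_t n_seqs) {
    if (src_state->ne[0] != S_v * S_v * H) return -1;
    if (src_state->ne[2] != n_seqs)        return -1;
    if (src_state->ne[1] < 1)              return -1;
    return (int)src_state->ne[1];          /* K, forwarded as the last kernel arg */
}
```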
test-backend-ops -o GATED_DELTA_NET: 36/36 pass (was 28/36 — the 8
K∈{2,3,4} cases now green). FLASH_ATTN_EXT regression check: 2564/2564.
Perf: feature-correctness commit; further tuning (cluster-32 ALU
optimisations, k_img staging for slot writes, etc.) deferred.