Prerequisites
Feature Description
Add SuffixDecoding ([arXiv 2411.04975](https://arxiv.org/abs/2411.04975), NeurIPS 2025) as a speculative decoding backend in ik_llama.cpp.
SuffixDecoding is a model-free speculative decoding method: it maintains a suffix tree over prompt + generated tokens (and optionally a static corpus of prior outputs), matches the trailing n-gram of the current sequence against the tree, and proposes an adaptive-depth tree of draft tokens for parallel verification by the main model. No draft model, no extra trained heads, ~20 µs/token draft cost, output distribution exactly preserved.
Request: a --suffix-decoding path with adaptive tree-structured speculation, optional offline corpus via --suffix-corpus, and reuse of the existing batched verification machinery.
Motivation
Speculative decoding matters far more on CPU than on GPU because decode is hard memory-bandwidth-bound and there is idle compute the verify pass can absorb essentially for free. But on CPU the existing options are all bad in different ways:
- Draft model (
-md): the draft eats the same memory bandwidth as the main model on the same socket, killing most of the speedup. For large MoE bases (Qwen3-235B-A22B, Qwen3-Next, DeepSeek-V3, GLM-4.6, Kimi-K2) there is also no good dense draft.
- MTP heads: only available for models trained with them, per-architecture work, head still has to run each step.
- Prompt-lookup decoding: works, but fixed-length linear speculation with a flat hash table leaves a lot of acceptance length on the table compared to a proper suffix tree with adaptive depth.
SuffixDecoding fixes all three: zero draft cost, architecture-agnostic, adaptive tree-structured speculation. Published numbers: 5.3× on multi-agent SQL, 2.8× faster than EAGLE-2/3, 1.9× faster than Token Recycling, 1.8–4.5× end-to-end on SWE-Bench. Already shipping in vLLM via Snowflake's ArcticInference, so this is a validated technique with a reference implementation, not a research prototype.
Why this is especially relevant to ik_llama.cpp: the project's center of gravity is large MoE models running on CPU or hybrid CPU+GPU with IQ quantization, where TG is severely bandwidth-bound and current speculative options are weakest. Concrete data point: Xeon Gold 5318 (Ice Lake, 8ch DDR4) running a Qwen3 35B-A3B-class MoE under ik_llama.cpp with IQ4_K_R4 — PP512 ≈ 190 tok/s, TG128 ≈ 33 tok/s. Decode is firmly memory-bound; a 2× acceptance length on coding/agent workloads would translate almost linearly into a 2× TG speedup. The exact same argument applies to users running Qwen3-235B / DeepSeek-V3 / Kimi-K2 IQ-quants on dual-socket Xeon or EPYC boxes — these are precisely the deployments ik_llama.cpp is built for, and they are the ones that benefit most from a draft-free speculator.
Possible Implementation
Most of the infrastructure already exists — only the proposer changes; verification reuses the current batched decode + rollback path that lookup decoding already uses.
-
Two suffix trees per request:
- Online tree: Ukkonen-style, built incrementally from
prompt_tokens + generated_tokens, cleared at end of request.
- Global tree (optional): built offline from a user-supplied corpus of historical (prompt, response) pairs, mmap'd read-only, shared across requests. Suffix array + LCP is fine here since it's static.
-
Speculation tree construction: take trailing k tokens (k ≈ 4–8), walk the tree(s), greedily expand the most frequent continuations into a small speculation tree (≤32 nodes), score each path by empirical frequency × depth, stop expanding when score drops below threshold τ. This adaptive-depth scoring is the key difference from existing prompt-lookup.
-
Verification: pack the speculation tree into a single batch with tree-causal mask, one llama_decode, commit longest accepted prefix, repeat. Same logic as the existing lookup decoding example.
-
CLI:
--suffix-decoding
--suffix-max-depth N (default 16)
--suffix-max-nodes N (default 32)
--suffix-pattern-len N (default 4)
--suffix-score-threshold F (τ, tuned)
--suffix-corpus PATH (optional offline corpus, jsonl or tokenized .bin)
--suffix-tree-mode {tree,linear}
-
Future hybrid mode (not required for initial PR): SuffixDecoding as primary proposer, fall back to -md draft model only when the suffix score is below τ. Composes cleanly with the existing draft-model path.
References:
Happy to provide reproducible benchmarks on Xeon (Cascade Lake / Ice Lake) with Qwen3 MoE IQ-quants once an initial implementation lands, and to compare against the existing baselines on the same hardware.
Prerequisites
Feature Description
Add SuffixDecoding ([arXiv 2411.04975](https://arxiv.org/abs/2411.04975), NeurIPS 2025) as a speculative decoding backend in
ik_llama.cpp.SuffixDecoding is a model-free speculative decoding method: it maintains a suffix tree over
prompt + generated tokens(and optionally a static corpus of prior outputs), matches the trailing n-gram of the current sequence against the tree, and proposes an adaptive-depth tree of draft tokens for parallel verification by the main model. No draft model, no extra trained heads, ~20 µs/token draft cost, output distribution exactly preserved.Request: a
--suffix-decodingpath with adaptive tree-structured speculation, optional offline corpus via--suffix-corpus, and reuse of the existing batched verification machinery.Motivation
Speculative decoding matters far more on CPU than on GPU because decode is hard memory-bandwidth-bound and there is idle compute the verify pass can absorb essentially for free. But on CPU the existing options are all bad in different ways:
-md): the draft eats the same memory bandwidth as the main model on the same socket, killing most of the speedup. For large MoE bases (Qwen3-235B-A22B, Qwen3-Next, DeepSeek-V3, GLM-4.6, Kimi-K2) there is also no good dense draft.SuffixDecoding fixes all three: zero draft cost, architecture-agnostic, adaptive tree-structured speculation. Published numbers: 5.3× on multi-agent SQL, 2.8× faster than EAGLE-2/3, 1.9× faster than Token Recycling, 1.8–4.5× end-to-end on SWE-Bench. Already shipping in vLLM via Snowflake's ArcticInference, so this is a validated technique with a reference implementation, not a research prototype.
Why this is especially relevant to
ik_llama.cpp: the project's center of gravity is large MoE models running on CPU or hybrid CPU+GPU with IQ quantization, where TG is severely bandwidth-bound and current speculative options are weakest. Concrete data point: Xeon Gold 5318 (Ice Lake, 8ch DDR4) running a Qwen3 35B-A3B-class MoE underik_llama.cppwith IQ4_K_R4 — PP512 ≈ 190 tok/s, TG128 ≈ 33 tok/s. Decode is firmly memory-bound; a 2× acceptance length on coding/agent workloads would translate almost linearly into a 2× TG speedup. The exact same argument applies to users running Qwen3-235B / DeepSeek-V3 / Kimi-K2 IQ-quants on dual-socket Xeon or EPYC boxes — these are precisely the deploymentsik_llama.cppis built for, and they are the ones that benefit most from a draft-free speculator.Possible Implementation
Most of the infrastructure already exists — only the proposer changes; verification reuses the current batched decode + rollback path that lookup decoding already uses.
Two suffix trees per request:
prompt_tokens + generated_tokens, cleared at end of request.Speculation tree construction: take trailing k tokens (k ≈ 4–8), walk the tree(s), greedily expand the most frequent continuations into a small speculation tree (≤32 nodes), score each path by empirical frequency × depth, stop expanding when score drops below threshold τ. This adaptive-depth scoring is the key difference from existing prompt-lookup.
Verification: pack the speculation tree into a single batch with tree-causal mask, one
llama_decode, commit longest accepted prefix, repeat. Same logic as the existing lookup decoding example.CLI:
Future hybrid mode (not required for initial PR): SuffixDecoding as primary proposer, fall back to
-mddraft model only when the suffix score is below τ. Composes cleanly with the existing draft-model path.References:
vllm/v1/spec_decode/suffix_decoding.py)Happy to provide reproducible benchmarks on Xeon (Cascade Lake / Ice Lake) with Qwen3 MoE IQ-quants once an initial implementation lands, and to compare against the existing baselines on the same hardware.