Feature Request: SuffixDecoding (model-free speculative decoding via suffix trees)

### Prerequisites

- [x] I am running the latest code. Mention the version if possible as well.
- [x] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new and useful enhancement to share.

### Feature Description


Add **SuffixDecoding** ([[arXiv 2411.04975](https://arxiv.org/abs/2411.04975)](https://arxiv.org/abs/2411.04975), NeurIPS 2025) as a speculative decoding backend in `ik_llama.cpp`.

SuffixDecoding is a model-free speculative decoding method: it maintains a suffix tree over `prompt + generated tokens` (and optionally a static corpus of prior outputs), matches the trailing n-gram of the current sequence against the tree, and proposes an adaptive-depth tree of draft tokens for parallel verification by the main model. No draft model, no extra trained heads, ~20 µs/token draft cost, output distribution exactly preserved.

Request: a `--suffix-decoding` path with adaptive tree-structured speculation, optional offline corpus via `--suffix-corpus`, and reuse of the existing batched verification machinery.

### Motivation


Speculative decoding matters far more on CPU than on GPU because decode is hard memory-bandwidth-bound and there is idle compute the verify pass can absorb essentially for free. But on CPU the existing options are all bad in different ways:

- **Draft model (`-md`)**: the draft eats the same memory bandwidth as the main model on the same socket, killing most of the speedup. For large MoE bases (Qwen3-235B-A22B, Qwen3-Next, DeepSeek-V3, GLM-4.6, Kimi-K2) there is also no good dense draft.
- **MTP heads**: only available for models trained with them, per-architecture work, head still has to run each step.
- **Prompt-lookup decoding**: works, but fixed-length linear speculation with a flat hash table leaves a lot of acceptance length on the table compared to a proper suffix tree with adaptive depth.

SuffixDecoding fixes all three: zero draft cost, architecture-agnostic, adaptive tree-structured speculation. Published numbers: **5.3× on multi-agent SQL, 2.8× faster than EAGLE-2/3, 1.9× faster than Token Recycling, 1.8–4.5× end-to-end on SWE-Bench**. Already shipping in vLLM via Snowflake's ArcticInference, so this is a validated technique with a reference implementation, not a research prototype.

**Why this is especially relevant to `ik_llama.cpp`:** the project's center of gravity is large MoE models running on CPU or hybrid CPU+GPU with IQ quantization, where TG is severely bandwidth-bound and current speculative options are weakest. Concrete data point: Xeon Gold 5318 (Ice Lake, 8ch DDR4) running a Qwen3 35B-A3B-class MoE under `ik_llama.cpp` with IQ4_K_R4 — PP512 ≈ 190 tok/s, TG128 ≈ 33 tok/s. Decode is firmly memory-bound; a 2× acceptance length on coding/agent workloads would translate almost linearly into a 2× TG speedup. The exact same argument applies to users running Qwen3-235B / DeepSeek-V3 / Kimi-K2 IQ-quants on dual-socket Xeon or EPYC boxes — these are precisely the deployments `ik_llama.cpp` is built for, and they are the ones that benefit most from a draft-free speculator.


### Possible Implementation


Most of the infrastructure already exists — only the proposer changes; verification reuses the current batched decode + rollback path that lookup decoding already uses.

1. **Two suffix trees per request**:
   - **Online tree**: Ukkonen-style, built incrementally from `prompt_tokens + generated_tokens`, cleared at end of request.
   - **Global tree** *(optional)*: built offline from a user-supplied corpus of historical (prompt, response) pairs, mmap'd read-only, shared across requests. Suffix array + LCP is fine here since it's static.

2. **Speculation tree construction**: take trailing k tokens (k ≈ 4–8), walk the tree(s), greedily expand the most frequent continuations into a small speculation tree (≤32 nodes), score each path by empirical frequency × depth, stop expanding when score drops below threshold τ. This adaptive-depth scoring is the key difference from existing prompt-lookup.

3. **Verification**: pack the speculation tree into a single batch with tree-causal mask, one `llama_decode`, commit longest accepted prefix, repeat. Same logic as the existing lookup decoding example.

4. **CLI**:
   ```
   --suffix-decoding
   --suffix-max-depth N           (default 16)
   --suffix-max-nodes N           (default 32)
   --suffix-pattern-len N         (default 4)
   --suffix-score-threshold F     (τ, tuned)
   --suffix-corpus PATH           (optional offline corpus, jsonl or tokenized .bin)
   --suffix-tree-mode {tree,linear}
   ```

5. **Future hybrid mode** *(not required for initial PR)*: SuffixDecoding as primary proposer, fall back to `-md` draft model only when the suffix score is below τ. Composes cleanly with the existing draft-model path.

**References:**
- Paper: https://arxiv.org/abs/2411.04975 (NeurIPS 2025 Spotlight)
- Project page: https://suffix-decoding.github.io/
- Reference implementation (Snowflake ArcticInference): https://github.com/snowflakedb/ArcticInference
- **vLLM integration PR**: [vllm-project/vllm#25784 — \[Spec Decode\] Integrate Suffix Decoding from Arctic Inference](https://github.com/vllm-project/vllm/pull/25784) (merged; now lives at `vllm/v1/spec_decode/suffix_decoding.py`)
- **vLLM RFC** discussing the integration design: [[vllm-project/vllm#18037](https://github.com/vllm-project/vllm/issues/18037)](https://github.com/vllm-project/vllm/issues/18037)
- **SGLang feature request** (with a minimal Python-only reference implementation linked in-thread): [[sgl-project/sglang#13165](https://github.com/sgl-project/sglang/issues/13165)](https://github.com/sgl-project/sglang/issues/13165)
- Snowflake engineering blogs with benchmarks: [[initial release (May 2025)](https://www.snowflake.com/en/engineering-blog/fast-speculative-decoding-vllm-arctic/)](https://www.snowflake.com/en/engineering-blog/fast-speculative-decoding-vllm-arctic/), [[production-scale CPU optimizations (Jan 2026)](https://www.snowflake.com/en/engineering-blog/suffixdecoding-arctic-inference-vllm/)](https://www.snowflake.com/en/engineering-blog/suffixdecoding-arctic-inference-vllm/)

Happy to provide reproducible benchmarks on Xeon (Cascade Lake / Ice Lake) with Qwen3 MoE IQ-quants once an initial implementation lands, and to compare against the existing baselines on the same hardware.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature Request: SuffixDecoding (model-free speculative decoding via suffix trees) #1602

Prerequisites

Feature Description

Motivation

Possible Implementation

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Feature Request: SuffixDecoding (model-free speculative decoding via suffix trees) #1602

Description

Prerequisites

Feature Description

Motivation

Possible Implementation

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions