Skip to content

Feature Request: SuffixDecoding (model-free speculative decoding via suffix trees) #1602

@w1ida

Description

@w1ida

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Add SuffixDecoding ([arXiv 2411.04975](https://arxiv.org/abs/2411.04975), NeurIPS 2025) as a speculative decoding backend in ik_llama.cpp.

SuffixDecoding is a model-free speculative decoding method: it maintains a suffix tree over prompt + generated tokens (and optionally a static corpus of prior outputs), matches the trailing n-gram of the current sequence against the tree, and proposes an adaptive-depth tree of draft tokens for parallel verification by the main model. No draft model, no extra trained heads, ~20 µs/token draft cost, output distribution exactly preserved.

Request: a --suffix-decoding path with adaptive tree-structured speculation, optional offline corpus via --suffix-corpus, and reuse of the existing batched verification machinery.

Motivation

Speculative decoding matters far more on CPU than on GPU because decode is hard memory-bandwidth-bound and there is idle compute the verify pass can absorb essentially for free. But on CPU the existing options are all bad in different ways:

  • Draft model (-md): the draft eats the same memory bandwidth as the main model on the same socket, killing most of the speedup. For large MoE bases (Qwen3-235B-A22B, Qwen3-Next, DeepSeek-V3, GLM-4.6, Kimi-K2) there is also no good dense draft.
  • MTP heads: only available for models trained with them, per-architecture work, head still has to run each step.
  • Prompt-lookup decoding: works, but fixed-length linear speculation with a flat hash table leaves a lot of acceptance length on the table compared to a proper suffix tree with adaptive depth.

SuffixDecoding fixes all three: zero draft cost, architecture-agnostic, adaptive tree-structured speculation. Published numbers: 5.3× on multi-agent SQL, 2.8× faster than EAGLE-2/3, 1.9× faster than Token Recycling, 1.8–4.5× end-to-end on SWE-Bench. Already shipping in vLLM via Snowflake's ArcticInference, so this is a validated technique with a reference implementation, not a research prototype.

Why this is especially relevant to ik_llama.cpp: the project's center of gravity is large MoE models running on CPU or hybrid CPU+GPU with IQ quantization, where TG is severely bandwidth-bound and current speculative options are weakest. Concrete data point: Xeon Gold 5318 (Ice Lake, 8ch DDR4) running a Qwen3 35B-A3B-class MoE under ik_llama.cpp with IQ4_K_R4 — PP512 ≈ 190 tok/s, TG128 ≈ 33 tok/s. Decode is firmly memory-bound; a 2× acceptance length on coding/agent workloads would translate almost linearly into a 2× TG speedup. The exact same argument applies to users running Qwen3-235B / DeepSeek-V3 / Kimi-K2 IQ-quants on dual-socket Xeon or EPYC boxes — these are precisely the deployments ik_llama.cpp is built for, and they are the ones that benefit most from a draft-free speculator.

Possible Implementation

Most of the infrastructure already exists — only the proposer changes; verification reuses the current batched decode + rollback path that lookup decoding already uses.

  1. Two suffix trees per request:

    • Online tree: Ukkonen-style, built incrementally from prompt_tokens + generated_tokens, cleared at end of request.
    • Global tree (optional): built offline from a user-supplied corpus of historical (prompt, response) pairs, mmap'd read-only, shared across requests. Suffix array + LCP is fine here since it's static.
  2. Speculation tree construction: take trailing k tokens (k ≈ 4–8), walk the tree(s), greedily expand the most frequent continuations into a small speculation tree (≤32 nodes), score each path by empirical frequency × depth, stop expanding when score drops below threshold τ. This adaptive-depth scoring is the key difference from existing prompt-lookup.

  3. Verification: pack the speculation tree into a single batch with tree-causal mask, one llama_decode, commit longest accepted prefix, repeat. Same logic as the existing lookup decoding example.

  4. CLI:

    --suffix-decoding
    --suffix-max-depth N           (default 16)
    --suffix-max-nodes N           (default 32)
    --suffix-pattern-len N         (default 4)
    --suffix-score-threshold F     (τ, tuned)
    --suffix-corpus PATH           (optional offline corpus, jsonl or tokenized .bin)
    --suffix-tree-mode {tree,linear}
    
  5. Future hybrid mode (not required for initial PR): SuffixDecoding as primary proposer, fall back to -md draft model only when the suffix score is below τ. Composes cleanly with the existing draft-model path.

References:

Happy to provide reproducible benchmarks on Xeon (Cascade Lake / Ice Lake) with Qwen3 MoE IQ-quants once an initial implementation lands, and to compare against the existing baselines on the same hardware.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions