Status: Stable. Two delivery paths, both verified working. On integrated GPUs the runtime defaults to `n_max=1`; published 1.5–2.5× speedups are the discrete-GPU regime (see §2).
MTP (Multi-Token Prediction) speculative decode lets a model predict multiple future tokens cheaply from an auxiliary head, then verify them in a single forward pass of the full model. Accepted drafts increase the number of tokens generated per step without changing output quality.
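The draft-then-verify idea can be sketched with toy models. This is a minimal illustration under stated assumptions, not the runtime's implementation: `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names, each "model" is a plain function from a token prefix to the next token, and acceptance is greedy (a draft token is kept only if it equals the target's own choice).

```python
from typing import Callable, List

# Hypothetical stand-in: a "model" maps a token prefix to its next token (greedy).
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], n_max: int) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step."""
    # 1. Draft up to n_max tokens cheaply.
    drafted: List[int] = []
    for _ in range(n_max):
        drafted.append(draft(prefix + drafted))

    # 2. Verify each draft position against the target. A real runtime does
    #    this in one batched forward pass; a loop is used here for clarity.
    emitted: List[int] = []
    for tok in drafted:
        expected = target(prefix + emitted)
        if tok != expected:
            emitted.append(expected)  # first mismatch: keep the target's token, stop
            return emitted
        emitted.append(tok)

    # 3. Every draft was accepted: the verify pass also yields one bonus token.
    emitted.append(target(prefix + emitted))
    return emitted


def toy_target(prefix: List[int]) -> int:
    return len(prefix)  # the "correct" next token is the current position


def toy_draft(prefix: List[int]) -> int:
    return len(prefix) if len(prefix) < 2 else -1  # goes wrong from position 2 on


print(speculative_step(toy_target, toy_target, [], 3))  # perfect drafter
print(speculative_step(toy_target, toy_draft, [], 3))   # drafter fails at position 2
```

In both calls the emitted tokens match what the target alone would have produced one token at a time, which is why accepted drafts raise tokens per step without changing the output.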
Two delivery mechanisms, both Stable:
| | Internal NextN-tail | External assistant |
|---|---|---|
| Models | Qwen3.5 / Qwen3.6 (dense + MoE) | Gemma 4 26B-A4B |
| Draft source | MTP head bundled inside the same GGUF | Separate assistant GGUF (`-md`) |
| Accept rate (measured) | 75.56% (Qwen3.5/3.6 9B bundled) | 47.3% |
| Key CLI flags | `--spec-type draft-mtp` | `-md assistant.gguf --spec-type draft-mtp --spec-draft-n-max 4` |
| Backends | ROCm, Vulkan | ROCm verified |
Quick start (internal NextN-tail, Qwen3.5/3.6):

```bash
llama-speculative-simple \
  -m qwen3.5-9b-mtp.gguf \
  --spec-type draft-mtp
```

Quick start (external assistant, Gemma 4):

```bash
llama-speculative-simple \
  -m gemma4-27b.gguf \
  -md gemma4-assistant.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 4
```

Server quick start for Qwen3.5/3.6 (read the gotchas in §2 first):

```bash
llama-server \
  -m qwen3.5-9b-mtp.gguf \
  --spec-type draft-mtp \
  --parallel 1 \
  --reasoning-budget 0
```

The internal NextN-tail path is mainline-aligned. The MTP head is bundled as extra layers in the target GGUF and loaded as a dedicated draft context (LLAMA_CONTEXT_TYPE_MTP, include/llama.h:236). The speculative loop runs through the shared driver in common/speculative.cpp.
The external-assistant path for Gemma 4 is a guided port of mainline PR #23398. The assistant model uses LLM_ARCH_GEMMA4_ASSISTANT and runs through the NEXTN_PRE/POST projection path defined in src/models/gemma4-assistant.cpp (203 LOC). The older D1/ASSIST_ code path has been retired.
The Qwen3.5/3.6 converter that creates or splits MTP-bundled GGUFs at conversion time is a separate feature — see qwen35-mtp-converter.md. This doc covers runtime speculative decode only.
The canonical flag is --spec-type draft-mtp. The string mtp is retained as an alias (common/speculative.cpp:67). The older --mtp / --multi-token-prediction flags are deprecated (common/arg.cpp:1387–1389) — they still function but log a warning; update scripts to use --spec-type draft-mtp.
| Flag | Description |
|---|---|
| `--spec-type draft-mtp` | Enable MTP speculative decode |
| `--spec-draft-n-max N` | Maximum draft depth per step (common/arg.cpp:3701). Default 3; Gemma 4 requires `--spec-draft-n-max 4` to draft at all. On iGPUs the runtime auto-clamps to 1 unless you set this explicitly. |
| `--reasoning-budget 0` | Forces immediate `</think>` on Qwen3.5/3.6 thinking models (see gotcha below) |
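For scripts that still pass the deprecated flags, a small migration helper might look like the sketch below. It assumes the deprecated `--mtp` / `--multi-token-prediction` flags are bare switches that take no value (the text above does not say either way), and `migrate_args` is a hypothetical helper name, not part of the runtime.

```python
from typing import List

# Assumption: both deprecated flags are bare switches with no argument.
DEPRECATED_SWITCHES = {"--mtp", "--multi-token-prediction"}


def migrate_args(argv: List[str]) -> List[str]:
    """Rewrite deprecated MTP flags and the `mtp` alias to --spec-type draft-mtp."""
    out: List[str] = []
    seen_spec_type = "--spec-type" in argv
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in DEPRECATED_SWITCHES:
            # Replace the first deprecated switch; drop it if --spec-type is already set.
            if not seen_spec_type:
                out += ["--spec-type", "draft-mtp"]
                seen_spec_type = True
        elif arg == "--spec-type" and i + 1 < len(argv):
            value = argv[i + 1]
            # Normalize the retained alias to the canonical spelling.
            out += ["--spec-type", "draft-mtp" if value == "mtp" else value]
            i += 1  # skip the value that was just consumed
        else:
            out.append(arg)
        i += 1
    return out


print(migrate_args(["llama-server", "-m", "model.gguf", "--mtp"]))
```

Running the rewritten command line through the runtime should then avoid the deprecation warning, since only the canonical flag remains.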
llama-cli --spec-type draft-mtp loads the NextN tail but runs plain autoregressive decode — it does not speculate. Speculative decode is triggered by:
- `llama-speculative-simple --spec-type draft-mtp`: recommended lightweight CLI speculator
- `llama-server --spec-type draft-mtp`: server path (see gotchas below)
llama-server requires --parallel 1 — the MTP path is single-sequence only. The server logs a reminder at startup (common/arg.cpp:3777).
Qwen3.5/3.6 thinking-mode trap — per-request "thinking":{"type":"disabled"} is insufficient to suppress thinking traces. Pass --reasoning-budget 0 at server startup to force immediate </think> termination. Without it, the server emits 10–20 minute thinking traces and subsequent requests time out.
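Both server gotchas can be caught before launch with a pre-flight check on the command line. The sketch below is a hypothetical wrapper-side helper (`check_server_args` is not a runtime function) and, as a simplifying assumption, only recognizes the long-form flags named in this document.

```python
from typing import List, Optional


def check_server_args(argv: List[str], thinking_model: bool) -> List[str]:
    """Return warnings for a llama-server command line that enables MTP."""

    def value_of(flag: str) -> Optional[str]:
        # Value that follows `flag`, or None if the flag is absent or last.
        if flag in argv:
            idx = argv.index(flag)
            if idx + 1 < len(argv):
                return argv[idx + 1]
        return None

    warnings: List[str] = []
    if value_of("--spec-type") not in ("draft-mtp", "mtp"):
        return warnings  # MTP is not enabled, nothing to check

    if value_of("--parallel") != "1":
        warnings.append("MTP is single-sequence only: pass --parallel 1")
    if thinking_model and value_of("--reasoning-budget") != "0":
        warnings.append("thinking model: pass --reasoning-budget 0 at server startup")
    return warnings


cmd = ["llama-server", "-m", "qwen3.5-9b-mtp.gguf", "--spec-type", "draft-mtp"]
for warning in check_server_args(cmd, thinking_model=True):
    print("warning:", warning)
```

A launcher script could refuse to start the server while this list is non-empty, which avoids discovering the thinking-mode problem through request timeouts.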
iGPU throughput — on integrated GPUs, MTP is tuned to n_max=1 by default after the C1 catch-up-batching optimization. At that depth it reaches ~1.16× pure decode throughput (gfx1150 ROCm and Vulkan; CHANGELOG.md:33–34 and :74). Setting n_max≥2 on an iGPU remains a net slowdown due to unified-memory sync overhead — the default is correct and should not be overridden on iGPU hardware. The runtime logs a one-time notice explaining this when an integrated GPU is detected.
PPL is unchanged by construction — perplexity evaluation only exercises the prefill pass; the speculative path never fires. MTP is a decode-driver feature only.
Benefits:
- Real throughput gain on discrete GPUs: 1.5–2.5× token generation throughput at high accept rates in discrete-GPU deployments.
- Modest iGPU win: C1 catch-up batching delivers ~1.16× at `n_max=1` even on integrated GPUs (Vulkan parity confirmed).
- No model-quality cost: PPL is identical-by-construction; speculation is transparent to the output distribution.
- High accept rates: 75.56% measured (Qwen3.5/3.6 9B bundled); 47.3% (Gemma 4 external, near the CUDA reference at 0.588).
- Cross-backend parity (Vulkan): gfx1150 RADV Vulkan reaches the same 32.4 t/s as ROCm at `n_max=1`.

Limitations:
- iGPU net-negative at `n_max≥2`: unified-memory sync cost dominates any draft-accept gain; stay at the default.
- `--parallel 1` restriction: MTP cannot serve multiple simultaneous sequences in `llama-server`.
- Thinking-mode footgun: Qwen3.5/3.6 requires a server-level `--reasoning-budget 0`; per-request thinking disable is not enough.
- Gemma 4 needs a separate assistant GGUF: the external assistant path requires downloading and providing the matching assistant model file.
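To relate accept rate and draft depth to tokens per step, a common back-of-the-envelope model treats each draft token as accepted independently with probability `a`. Under that assumption (which real models only approximate, and which may not match how the accept rates above were measured), one verify step at depth `n` emits `(1 - a^(n+1)) / (1 - a)` tokens on average. The helper name `expected_tokens_per_step` is hypothetical.

```python
def expected_tokens_per_step(accept_prob: float, n_max: int) -> float:
    """Expected tokens emitted per verify step under i.i.d. acceptance."""
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if n_max < 0:
        raise ValueError("n_max must be non-negative")
    if accept_prob == 1.0:
        return float(n_max + 1)  # every draft accepted, plus the bonus token
    # Geometric series: 1 + a + a^2 + ... + a^n_max
    return (1.0 - accept_prob ** (n_max + 1)) / (1.0 - accept_prob)


for depth in (1, 2, 3, 4):
    print(depth, round(expected_tokens_per_step(0.7556, depth), 3))
```

This counts tokens only. It ignores the cost of drafting and of the larger verify batch, so it is an upper bound on speedup rather than a prediction, which is consistent with deeper drafts being a net slowdown on iGPUs.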
TBD (pending benchmark)
Metric: accept-rate (%) + tokens/s vs MTP-OFF baseline at the same prompt and n_predict. Perplexity is not the relevant metric here (PPL is identical-by-construction).
| Config | MTP-OFF baseline (t/s) | MTP accept-% | MTP t/s | Speedup |
|---|---|---|---|---|
| Qwen3.5/3.6 internal — iGPU, n_max=1 (measured anchor) | 28.0 | 100% | 32.4 | 1.16× |
| Qwen3.5/3.6 internal — iGPU, n_max≥2 | — | — | TBD | net slowdown |
| Qwen3.5/3.6 internal — dGPU | TBD | TBD | TBD | TBD |
| Gemma 4 external assistant — dGPU | TBD | 47.3% (anchor) | TBD | TBD |
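The two derived quantities in the table are simple ratios. The sketch below shows the arithmetic, using the measured iGPU anchor row for the speedup; the function names are illustrative, and the accept-rate counts are made-up inputs rather than benchmark data.

```python
def accept_rate_pct(accepted: int, drafted: int) -> float:
    """Accepted draft tokens as a percentage of all drafted tokens."""
    if drafted <= 0:
        raise ValueError("drafted must be positive")
    return 100.0 * accepted / drafted


def speedup(mtp_tps: float, baseline_tps: float) -> float:
    """Throughput ratio of MTP-on to MTP-off at the same prompt and n_predict."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return mtp_tps / baseline_tps


# Anchor row from the table: 32.4 t/s with MTP vs 28.0 t/s baseline.
print(f"{speedup(32.4, 28.0):.2f}x")
# Illustrative counts only: 3 of 4 drafted tokens accepted.
print(f"{accept_rate_pct(3, 4):.1f}%")
```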
MTP is registered as COMMON_SPECULATIVE_TYPE_DRAFT_MTP (common/common.h:168). The factory at common/speculative.cpp:2424 dispatches to a unified implementation class common_speculative_impl_draft_mtp (:610), which holds the draft-context state in common_speculative_state_draft_mtp (:1058).
mtp_any_igpu() (speculative.cpp:52–60) queries every registered backend device for GGML_BACKEND_DEVICE_TYPE_IGPU — the check is backend-agnostic and covers both CUDA/HIP and Vulkan builds. When an iGPU is found and the user has not explicitly set --spec-draft-n-max, the constructor clamps n_max to 1 at speculative.cpp:681.
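The clamp rule described above reduces to a three-way decision. The following is a sketch of that decision only, not the C++ code: `effective_n_max` is a hypothetical name, and device types are modeled as plain strings in place of the `GGML_BACKEND_DEVICE_TYPE_*` enum values.

```python
from typing import Optional, Sequence

DEFAULT_N_MAX = 3  # documented default for --spec-draft-n-max


def effective_n_max(user_n_max: Optional[int], device_types: Sequence[str]) -> int:
    """Pick the draft depth: an explicit flag wins, otherwise clamp on iGPU."""
    if user_n_max is not None:
        return user_n_max  # the user set --spec-draft-n-max explicitly
    if any(dev == "igpu" for dev in device_types):
        return 1  # integrated GPU detected on some backend: clamp to 1
    return DEFAULT_N_MAX


print(effective_n_max(None, ["cpu", "igpu"]))  # clamped
print(effective_n_max(None, ["cpu", "gpu"]))   # default depth
print(effective_n_max(4, ["cpu", "igpu"]))     # explicit override is respected
```

Because the explicit value always wins, passing `--spec-draft-n-max` on iGPU hardware opts out of the clamp.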
Two hooks manage KV cache state across draft and verify steps:
- `mtp_update_kv_cache()` (speculative.cpp:2719): appends draft tokens to the target context's KV cache after each draft step
- `mtp_accept_tokens()` (speculative.cpp:2753): commits accepted tokens and rolls back rejected ones
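The append-then-commit-or-roll-back pattern behind these two hooks can be illustrated with a toy cache. This is a conceptual sketch only: `ToyKVCache` and its methods are hypothetical, and each cache cell is represented by a token id in place of real key/value tensors.

```python
from typing import List


class ToyKVCache:
    """Toy model of a KV cache that holds committed and speculative entries."""

    def __init__(self) -> None:
        self.cells: List[int] = []  # one entry per cached token position
        self.committed = 0          # number of positions that survived verification

    def append_draft(self, tokens: List[int]) -> None:
        # Speculative entries go after the committed region.
        self.cells.extend(tokens)

    def accept(self, n_accepted: int) -> None:
        pending = len(self.cells) - self.committed
        if not 0 <= n_accepted <= pending:
            raise ValueError("n_accepted out of range")
        self.committed += n_accepted
        del self.cells[self.committed:]  # roll back the rejected tail


cache = ToyKVCache()
cache.append_draft([10, 11, 12])  # three drafted tokens
cache.accept(2)                   # verification accepted the first two
print(cache.cells, cache.committed)
```

Truncating the tail is what keeps the cache consistent with the tokens actually emitted, so the next draft step starts from verified state.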
The target context is connected to the MTP draft context through three API calls:
- `llama_set_mtp_op_type()` (llama-context.cpp:4033): marks the context's MTP operation role
- `llama_set_embeddings_nextn()` (llama-context.cpp:4041; internal declaration llama-context.h:143): activates the masked next-token (nextn) embedding path that feeds the MTP head
- `llama_set_mtp_source()` (llama-context.cpp:4037): registers the target context as the source for MTP draft generation
The assistant model reads LLM_KV_NEXTN_PREDICT_LAYERS at load time to determine how many draft steps to generate (src/models/gemma4-assistant.cpp:13). Draft logits are computed via learned projection tensors LLM_TENSOR_NEXTN_PRE_PROJ / LLM_TENSOR_NEXTN_POST_PROJ (:44–45), allowing the assistant to produce drafts entirely from the target backbone's hidden states (Q-only drafter; borrows target KV).
- Related speculative decode docs: NLD diffusion self-spec · DFlash drafter spec-decode · PHANTOM-X n-gram drafter
- Converter (separate feature): Qwen3.5/3.6 MTP converter — bundling or splitting the MTP draft head at GGUF-creation time
- Upstream references: mainline MTP spine (`LLAMA_CONTEXT_TYPE_MTP`) · PR #23398 (Gemma 4 external assistant port) · PR #22673 (Qwen3.5/3.6 converter)
- Concept primer: Feature maturity levels & backend support