
llama + spec: MTP Support #22673

Draft
am17an wants to merge 11 commits into ggml-org:master from am17an:mtp-clean

Conversation

@am17an
Contributor

@am17an am17an commented May 4, 2026

Overview

This PR adds support for MTP (Multi Token Prediction) heads. I tested this on Qwen3.6 27B and Qwen3.6 35BA3B, but in principle it should work for any MTP model. I've posted the detailed results below, but typically I see a steady-state acceptance of around 75% with 3 draft tokens, which is more than a 2x speed-up over the baseline. The design decisions I took to get to this stage are as follows:

Note: --mmproj and parallel sequences (-np > 1) don't work at the moment; they will get fixed

Next Steps

Performance

A simple bench for testing various prompts is here: https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090. Posting the results below:
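For context, here is a rough sketch of the kind of bench client used for the tables below (the real script is in the gist above). The /completion payload follows the llama-server API; the timings field names used for the draft/accept counts (draft_n, draft_n_accepted) are assumptions and may differ between server versions.

```python
# Minimal bench-client sketch (not the actual gist script).
import requests

PROMPTS = {
    "code_python": "Write a Python function that parses a CSV file.",
    "qa_factual":  "What is the tallest mountain on Earth?",
    # ... one entry per benchmark prompt
}

def run_prompt(name: str, prompt: str, url: str = "http://localhost:8080/completion") -> None:
    # Same sampling settings reported elsewhere in this thread.
    r = requests.post(url, json={
        "prompt": prompt,
        "n_predict": 192,
        "temperature": 0,
        "seed": 42,
        "cache_prompt": False,
    })
    r.raise_for_status()
    t = r.json().get("timings", {})
    pred  = int(t.get("predicted_n", 0))
    draft = int(t.get("draft_n", 0))            # assumed field name
    acc   = int(t.get("draft_n_accepted", 0))   # assumed field name
    tok_s = 1000.0 * pred / float(t.get("predicted_ms", 1.0) or 1.0)
    rate  = f"{acc / draft:.3f}" if draft else "n/a"
    print(f"{name:<18} pred={pred:4d} draft={draft:4d} acc={acc:4d} rate={rate} tok/s={tok_s:.1f}")

for name, prompt in PROMPTS.items():
    run_prompt(name, prompt)
```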

Performance on DGX Spark 🧵

No MTP (baseline)

./llama-server -m ../qwen3.6-q8_0.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.0
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.3
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.3
  summarize          pred=  53 draft=   0 acc=   0 rate=n/a tok/s=7.1
  qa_factual         pred= 177 draft=   0 acc=   0 rate=n/a tok/s=7.0
  translation        pred=  22 draft=   0 acc=   0 rate=n/a tok/s=7.7
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.1
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.2
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1404,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 201.07
}

MTP --spec-draft-n-max 3

./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 3

  code_python        pred= 192 draft= 153 acc= 139 rate=0.908 tok/s=21.6
  code_cpp           pred= 192 draft= 176 acc= 132 rate=0.750 tok/s=18.7
  explain_concept    pred= 192 draft= 191 acc= 126 rate=0.660 tok/s=16.3
  summarize          pred=  55 draft=  51 acc=  37 rate=0.726 tok/s=17.9
  qa_factual         pred= 177 draft= 174 acc= 118 rate=0.678 tok/s=16.5
  translation        pred=  22 draft=  24 acc=  13 rate=0.542 tok/s=13.9
  creative_short     pred= 192 draft= 200 acc= 123 rate=0.615 tok/s=15.8
  stepwise_math      pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=19.3
  long_code_review   pred= 192 draft= 179 acc= 131 rate=0.732 tok/s=18.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1406,
  "total_draft": 1319,
  "total_draft_accepted": 952,
  "aggregate_accept_rate": 0.7218,
  "wall_s_total": 83.8
}

MTP --spec-draft-n-max 2

./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 2

  code_python        pred= 192 draft= 134 acc= 123 rate=0.918 tok/s=17.4
  code_cpp           pred= 192 draft= 145 acc= 118 rate=0.814 tok/s=16.5
  explain_concept    pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=16.1
  summarize          pred=  55 draft=  44 acc=  32 rate=0.727 tok/s=15.6
  qa_factual         pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=18.2
  translation        pred=  22 draft=  18 acc=  12 rate=0.667 tok/s=15.2
  creative_short     pred= 192 draft= 149 acc= 116 rate=0.778 tok/s=16.1
  stepwise_math      pred= 192 draft= 139 acc= 121 rate=0.871 tok/s=17.2
  long_code_review   pred= 192 draft= 153 acc= 114 rate=0.745 tok/s=15.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1421,
  "total_draft": 1062,
  "total_draft_accepted": 877,
  "aggregate_accept_rate": 0.8258,
  "wall_s_total": 90.44
}

Draft model (Qwen3.5 0.8B) with --spec-draft-n-max 16, with partial rollback

llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 16 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft= 188 acc= 156 rate=0.830 tok/s=26.4
  code_cpp           pred= 192 draft= 201 acc= 126 rate=0.627 tok/s=16.8
  explain_concept    pred= 192 draft= 263 acc= 112 rate=0.426 tok/s=12.7
  summarize          pred=  57 draft=  63 acc=  39 rate=0.619 tok/s=16.9
  qa_factual         pred= 192 draft= 178 acc= 177 rate=0.994 tok/s=47.7
  translation        pred=  23 draft=  18 acc=  15 rate=0.833 tok/s=18.7
  creative_short     pred= 192 draft= 189 acc= 120 rate=0.635 tok/s=15.4
  stepwise_math      pred= 192 draft= 190 acc= 148 rate=0.779 tok/s=22.3
  long_code_review   pred= 192 draft= 207 acc= 120 rate=0.580 tok/s=14.5

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1424,
  "total_draft": 1497,
  "total_draft_accepted": 1013,
  "aggregate_accept_rate": 0.6767,
  "wall_s_total": 81.39
}

Master with draft model, --spec-draft-n-max 64, no partial rollback

llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 64 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft= 174 acc= 159 rate=0.914 tok/s=27.2
  code_cpp           pred= 192 draft= 138 acc= 120 rate=0.870 tok/s=15.0
  explain_concept    pred= 192 draft= 170 acc= 101 rate=0.594 tok/s=11.4
  summarize          pred=  55 draft=  48 acc=  36 rate=0.750 tok/s=14.6
  qa_factual         pred= 177 draft= 126 acc= 106 rate=0.841 tok/s=13.9
  translation        pred=  22 draft=  13 acc=  13 rate=1.000 tok/s=16.5
  creative_short     pred= 192 draft= 136 acc= 104 rate=0.765 tok/s=12.8
  stepwise_math      pred= 192 draft= 172 acc= 147 rate=0.855 tok/s=22.0
  long_code_review   pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=13.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1406,
  "total_draft": 1137,
  "total_draft_accepted": 897,
  "aggregate_accept_rate": 0.7889,
  "wall_s_total": 97.13
}

How to use

I've uploaded the GGUF, which I made using the convert_hf_to_gguf.py changes in this PR. Here is another GGUF for the MoE (35BA3B) model.
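To enable it at runtime, point llama-server at the MTP GGUF and pass the speculative flags used in the benchmark runs above, for example:

./llama-server -m qwen3.6-q8_0-mtp.gguf --spec-type mtp --spec-draft-n-max 3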

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, for debugging and reviewing; also for the convert_hf_to_gguf.py changes + model definitions, and for writing the bench used for validation against vLLM.

@github-actions bot added labels on May 4, 2026: model (Model specific), testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), Vulkan (Issues specific to the Vulkan backend), examples, python (python script changes), server, ggml (changes relating to the ggml tensor library for machine learning)
@ngxson
Contributor

ngxson commented May 4, 2026

Nice, I think this is a fresher start than my WIP #18886 (which I still haven't found the time to continue)

There were some other attempts to add MTP support, but they all rely heavily on host <--> device data copies. I assume you tried to address this, right? (Maybe there was a discussion somewhere that I wasn't aware of.)

Contributor

@ngxson ngxson left a comment


(not a review, but opening some discussions)

Comment thread src/llama-memory-recurrent.h
Comment thread src/models/qwen35.cpp

for (int il = 0; il < n_layer; ++il) {
// MTP/NextN layers are loaded as extra decoder blocks but not executed in the main pass.
const int n_transformer_layers = n_layer - (int)hparams.nextn_predict_layers;
Contributor


Nit, but maybe call it n_main_layers, as technically the nextn layer is also a transformer layer

Comment on lines +811 to +823
//TODO: generalize if this is ok, we should load <arch_name>_mtp arch?
if (params_base.speculative.type == COMMON_SPECULATIVE_TYPE_MTP) {
    SRV_INF("loading MTP head from '%s' (override_arch=qwen35_mtp)\n",
            params_base.model.path.c_str());

    auto mparams_mtp = common_model_params_to_llama(params_base);
    mparams_mtp.override_arch = "qwen35_mtp";

    model_mtp.reset(llama_model_load_from_file(params_base.model.path.c_str(), mparams_mtp));
    if (model_mtp == nullptr) {
        SRV_ERR("failed to load MTP head from '%s'\n", params_base.model.path.c_str());
        return false;
    }
Contributor


if you look at #18886, the better way is to move llama_graph_type to the public API, then load the context with the appropriate graph type

Contributor Author


Yes that seems like the correct way to do this if we want to support MTP in a generic way

@am17an
Contributor Author

am17an commented May 4, 2026

@ngxson yes, the h2d copy was discussed with GG; he's working on a refactor which will allow us to share tensors between two llama contexts

@pwilkin
Member

pwilkin commented May 4, 2026

Great work, this should massively bridge the TG gap with vLLM, or maybe even surpass it together with tensor-parallel.

am17an added 4 commits May 4, 2026 20:15
Currently speculative decoding needs to restart from a checkpoint after some draft tokens are not accepted, which leads to some waste in running the target again. This PR adds the ability to roll back up to `draft_max` by storing the GDN intermediates.
@cmp-nct
Contributor

cmp-nct commented May 4, 2026

In my opinion, Qwen 3.6 is the most important thing that has happened in open-source models in a long time; this is going to be so valuable.
I wonder if this, once merged, could be combined with ngram drafting?
So MTP is used until ngram is triggered, switching to ngram until rejection and then back to MTP.

ngram could be set to match only very strong and long candidates (for large repetitive paraphrasing)
and MTP fills the gap

@Dampfinchen

Dampfinchen commented May 4, 2026

" idea is that MTP should automatically start and we shouldn't need to distribute the MTP gguf separately but also it has it's own context/kv-cache etc." -> Does this mean MTP needs additional resources (RAM/VRAM?)

If so, there should always be an option to remain to disable it. Right now on my system (6 GB VRAM, 32 GB RAM), speculative decoding just makes things much slower even on very small draft models because of that exact reason, they need own context and kv-cache. Such low to midrange systems already operate on the edge in terms of memory.

@mbednarek360

I'm getting garbage responses running this PR on the Vulkan backend with an R9700 using llama-server. I'm using the GGUF you linked above. Interestingly, draft acceptance is only 0.01282.

Prompt: "Hello!"
Response:

The from,

;::...

... on;srible威风to{ islitor

\ ...

• We
&eq和chn ***, on
Prompt (:
mouth

“ ? forM� P 

@am17an
Contributor Author

am17an commented May 4, 2026

@cmp-nct I'm not sure, but it could be possible

@Dampfinchen as of right now it is opt-in via --spec-type mtp, but in terms of memory it should be < 10% of the overall memory used (it's just a single-layer transformer + KV cache, much lighter than a draft model)

@mbednarek360 I've only tested this on a small number of CUDA devices so far; once it's ready for review I will have tested more devices/backends. In particular, this PR relies on #22400, which is not implemented for Vulkan yet; if you ask an LLM to add support for that you might get a little further. (Update: Vulkan and Metal have also been tested.)

@nawoa

nawoa commented May 4, 2026

Might it be possible/useful to run the draft model on a second GPU? Given that the MTP weights are relatively small, this might provide a useful speedup on systems with a dedicated high-VRAM "AI" GPU and a cheaper low-VRAM "normal" GPU used for display output, etc., and possibly prevent some degree of resource contention.

@cturan

cturan commented May 4, 2026

Thank you, we are eagerly awaiting this to become stable. Here are the automated test results for my machine:

Qwen3.6-27B Q6_K benchmark on llama.cpp b9025-10829dbcc / PR #22673 branch
Hardware: RTX 3090 24GB + RTX 3060 12GB
Runtime flags: -fa on -c 10000 -np 1 -ngl 99 --no-mmap --no-cache-prompt
Endpoint: /completion, raw text prompt
Prompt: 6978 tokens
Generation: 256 tokens
Runs: 3 measured runs after warmup

| Mode | Model | Prefill tok/s avg | Generation tok/s avg | MTP acceptance | Loaded VRAM |
|---|---|---|---|---|---|
| MTP enabled | Qwen3.6-27B-MTP-Q6_K.gguf + --spec-type mtp --spec-draft-n-max 3 | 665.14 | 42.45 | 76.0% | 24.96 GiB |
| MTP disabled, same GGUF | Qwen3.6-27B-MTP-Q6_K.gguf, no spec | 1315.46 | 22.97 | n/a | 22.47 GiB |
| Existing non-MTP Q6 | Qwen3.6-27B-Q6_K.gguf, no spec | 1260.12 | 22.39 | n/a | 22.59 GiB |

Result:

  • MTP improves decode from 22.97 tok/s to 42.45 tok/s on the same GGUF: ~1.85x speedup.
  • Against the existing non-MTP Q6 file, decode improves from 22.39 tok/s to 42.45 tok/s: ~1.90x speedup.
  • Prefill is slower with MTP enabled in this PR path: 665 tok/s vs 1315 tok/s on the same GGUF (~0.51x).
  • MTP adds about 2.49 GiB loaded VRAM in this setup.

@am17an
Contributor Author

am17an commented May 4, 2026

@cturan Thanks for testing. I'm aware of the prefill issue and will work on a fix.

@iiLaurens

Might be a long shot, but any chance of supporting MTP with a reduced vocabulary? MTP layers are rather chonky and reducing token embeddings might help users with less VRAM by filtering out certain languages. Obviously the full model will still be able to produce those tokens if need be so it won't be gimped.

@nybblr

nybblr commented May 4, 2026

Working on taking this for a spin with the Q4_K_M quant of Qwen3.6-35BA3B. I was gonna try to start from unsloth's quant since they already perform really well, but of course they don't have any mtp layers.

@am17an Think it would work if I just "steal" the layers from your q8 quant and merge them into the unsloth quant? (add blk.40 and bump some top-level config like block_count and kv_count)

@volkermauel

only a quick test run, 1x 5090 qwen3.6-27b mtp 3, q4_0 quantized, kv also q4_0

slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 532 | processing task, is_child = 0
slot update_slots: id  0 | task 532 | new prompt, n_ctx_slot = 200192, n_keep = 0, task.n_tokens = 16
slot update_slots: id  0 | task 532 | n_past = 3, slot.prompt.tokens.size() = 1327, seq_id = 0, pos_min = 1326, n_swa = 0
slot update_slots: id  0 | task 532 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 532 | n_tokens = 0, memory_seq_rm [0, end)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.178.49 200
slot update_slots: id  0 | task 532 | prompt processing progress, n_tokens = 12, batch.n_tokens = 12, progress = 0.750000
slot update_slots: id  0 | task 532 | n_tokens = 12, memory_seq_rm [12, end)
slot init_sampler: id  0 | task 532 | init sampler, took 0.01 ms, tokens: text = 16, total = 16
slot update_slots: id  0 | task 532 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id  0 | task 532 |
prompt eval time =      63.16 ms /    16 tokens (    3.95 ms per token,   253.34 tokens per second)
       eval time =   56063.04 ms /  5913 tokens (    9.48 ms per token,   105.47 tokens per second)
      total time =   56126.20 ms /  5929 tokens
draft acceptance rate = 0.79728 ( 4169 accepted /  5229 generated)
statistics mtp: #calls(b,g,a) = 2 2272 1976, #gen drafts = 2272, #acc drafts = 1976, #gen tokens = 6816, #acc tokens = 4950, dur(b,g,a) = 0.007, 15393.656, 64.921 ms
slot      release: id  0 | task 532 | stop processing: n_tokens = 5928, truncated = 0
srv  update_slots: all slots are idle

same model, same config (except mtp)

slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id  0 | task 0 | 
prompt eval time =      91.85 ms /    16 tokens (    5.74 ms per token,   174.20 tokens per second)
       eval time =  103127.94 ms /  6571 tokens (   15.69 ms per token,    63.72 tokens per second)
      total time =  103219.79 ms /  6587 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 6586, truncated = 0
srv  update_slots: all slots are idle

Prompt: "create a flappy bird clone"

(I'm not creative, sorry)

Great Speedup!

@chris-hatton

macOS / Apple Metal results - M1 Max 64GB, Qwen3.6-35B-A3B-MTP

Tested commit 5d5f1b4 on Apple M1 Max (unified memory, recommendedMaxWorkingSetSize ~53 GiB).

Baseline (no MTP), Q8_K_XL (37 GiB):

  • Prompt: 117 t/s, Generation: 40.8 t/s
  • Model: 37,515 MiB · Context: 105 MiB · Compute: 489 MiB · Unaccounted: 0 · Free: 14,973 MiB

MTP enabled (--spec-type mtp --spec-draft-n-max 2), Q8_K_XL (37 GiB):

  • Generation drops to ~1 t/s (MTP draft runs on CPU with -ngld defaulting to 0)
  • With -ngld 99: GPU OOM (kIOGPUCommandBufferCallbackErrorOutOfMemory)
  • Memory breakdown reveals the issue - Model: 37,515 MiB · Context: 230 MiB · Compute: 489 MiB · Unaccounted: 38,009 MiB

⚠️ The Metal backend allocates an "unaccounted" buffer roughly equal to the full model size for the MTP draft context. On CUDA this overhead is ~2.5 GiB (per @cturan's RTX 3090 report). On Metal it scales to the model size, pushing the Q8 (37 GiB) well past the 53 GiB working set.

Workaround - use Q4_K_XL (22 GiB):

| Config | Prompt t/s | Generation t/s |
|---|---|---|
| Q4, no MTP | 117.2 | 40.5 |
| Q4, MTP draft-n-max 2 | 100.7 | 52.7 |
| Q4, MTP draft-n-max 3 | 94.6 | 49.0 |

With Q4 the 22 GiB "unaccounted" buffer fits (7.5 GiB free), no OOM. MTP gives a 30% generation speedup at draft-n-max 2. Draft-n-max 2 outperforms 3 on this hardware, likely because the acceptance rate doesn't compensate for the extra draft overhead at Apple Metal's compute throughput.

Happy to run additional configurations if helpful.


@polhdez

polhdez commented May 10, 2026

Hi, there's a double-free corruption when the llama.cpp server is run in router mode and configured with sleep-idle-seconds to unload the model after N seconds (using the latest llama.cpp head and the latest commit from this PR, just pulled and re-built).

[59931] draft acceptance rate = 0.65574 (  120 accepted /   183 generated)
[59931] statistics mtp: #calls(b,g,a) = 131 27854 22160, #gen drafts = 27854, #acc drafts = 22160, #gen tokens = 83562, #acc tokens = 51155, dur(b,g,a) = 0.227, 151318.420, 471.852 ms
[59931] slot      release: id  0 | task 28235 | stop processing: n_tokens = 98624, truncated = 0
[59931] srv  update_slots: all slots are idle
[59931] que    start_loop: entering sleeping state
[59931] cmd_child_to_router:sleep
[59931] srv  handle_sleep: server is entering sleeping state
[59931] double free or corruption (!prev)

and the server is unable to fulfill any requests beyond this point.

I'm hitting the double-free error after exiting the server too. I'll try to debug it.

@ulyazen

ulyazen commented May 10, 2026

> 5060ti 16G 60t/s Bravo!!!

What model are you running? I have the same GPU.

@netlooker

I compiled llama.cpp based on this PR on an NVIDIA DGX Spark-compatible machine. I downloaded the Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf model from the https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF repo. I confirm that it works and that generation speed bumped from 7 t/s to over 20 t/s, which I find incredible (with 256k ctx size).
MTP is a game changer. I also confirm that parallelism must be 1 and that mmproj doesn't work, so vision is down.

@frozename

Data point from M4 Pro 48 GB (Apple10 / A18-class GPU, ~273 GB/s mem BW). PR head pinned to 5d5f1b46. Built cmake -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DLLAMA_CURL=ON -DCMAKE_BUILD_TYPE=Release. Server flags follow the recipe being recommended downstream: --ctx-size 8192 --no-warmup -np 1 -ngl 99 --cache-type-k q8_0 --cache-type-v q8_0 --spec-type mtp --spec-draft-n-max 3. Bench prompt set is am17an's gist; bench client uses temperature=0 seed=42 n_predict=192.

Q5_K_M and Q8_0 MTP both OOM the Metal working set mid-decode. Final memory breakdown for Q8_0:

MTL0 (Apple M4 Pro) | 38338 = ... + (29056 = 27690 + 870 + 495) + 28213
                                          model    ctx   compute   unaccounted

unaccounted ≈ model size is the smoking gun — an extra ~model-sized buffer is being attributed to the Metal allocation outside the formally-tracked self. Tried with -ub 512 / default -ub, with --flash-attn on / off — neither moves the breakdown. The doubled footprint appears structural to the MTP draft path on Apple Silicon, not flag-driven.

Q4_K_M (16 GB) fits the doubled footprint (32 GB) under the 38 GB Metal cap, and runs cleanly — but is slower than vanilla:

Aggregate vanilla: 11.9 tok/s   wall 127.3 s   accept=n/a
Aggregate MTP:     10.0 tok/s   wall 150.0 s   accept=0.701
Ratio: 0.84x

70% draft acceptance, so the speculative path is working; the per-token MTP overhead just exceeds the saved forward passes on this hardware.

All positive Apple-Silicon reports in this thread are Apple9 family (M1 Ultra, M2 Max, M3 Ultra — A17 GPU architecture, 38–80 cores, 400+ GB/s). M3 Pro/Max and the M4 line are Apple10 (A18, dynamic caching, fewer GPU cores per tier). This may be the first M4-class data point — possibly relevant to #22787 ("single draft context instead of one per each slot — less memory"), which from the description seems to target exactly this symptom. Independent confirmation of the VRAM increase even outside MTP mode is in this thread already (the earlier note about needing to drop context from 256k → 200k just from checking out this branch).

Re-piloting once #22787 lands on master will be straightforward — happy to post follow-up numbers from the same M4 Pro.


@cpietsch

AMD ROCm data point for Qwen3.6-27B-MTP-UD-GGUF on an older 16 GiB card.

Tested PR head 5d5f1b46e4f5 (mtp-pr-22673) on a single AMD Instinct MI50/MI60 (gfx906) with ROCm/HIP. I used only ROCm0 for these numbers and required the logs to show full GPU offload:

  • baseline: offloaded 66/66 layers to GPU
  • MTP: base model offloaded 66/66 layers to GPU and MTP head offloaded 66/66 layers to GPU
  • MTP head registration confirmed with set_mtp: MTP draft head registered
  • no CPU layer offload was used
  • ROCm environment note: this is not a stock distro ROCm install. I am using a self-compiled ROCm 7.1/rocBLAS setup so gfx906 works with current ROCm. The build follows the notes from [Issue]: The rocblas of 6.4.0 release doesn't ship TensileLibrary files for gfx906 ROCm/ROCm#4625 (comment): rocBLAS release/rocm-rel-7.1, TENSILE_VERSION=4.45.0, libmsgpack-cxx-dev, and ./install.sh -a gfx906:xnack-.

Build:

export LLAMACPP_ROCM_ARCH=gfx906
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build-mtp \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=$LLAMACPP_ROCM_ARCH \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=ON
cmake --build build-mtp --config Release --target llama-server -j20

Runtime settings used for all measured runs:

--device ROCm0 \
-ctk q4_0 -ctv q4_0 \
-b 64 -ub 64 -c 2048 -np 1 -ngl 99 \
--flash-attn on --jinja --no-mmproj \
--no-warmup --no-cache-prompt --cache-ram 0

MTP runs additionally used:

--spec-type mtp --spec-draft-n-max 3 --spec-draft-ngl 99

Method:

  • Endpoint: /completion, raw prompt.
  • 9 prompt set similar to the benchmark prompts being used in this thread.
  • One warmup request before measured prompts for each config.
  • Request settings: temperature=0, seed=42, n_predict=192, cache_prompt=false.
  • Temperature gating: only waited when GPU edge temp was above 70 C.
  • Note: with the larger settings I initially tried (-c 8192, larger batch), Q3 MTP did not fit fully on a 16 GiB MI50. It failed while allocating the MTP head (1425.06 MiB) after the fully-offloaded base model. The smaller -c 2048 -b 64 -ub 64 --no-cache-prompt settings are what let Q3 MTP remain fully GPU-resident.

Aggregate results:

| Quant | Mode | Avg generation tok/s | Speedup vs baseline | Prompt tok/s avg | MTP acceptance | VRAM avg | Temp avg start -> end |
|---|---|---|---|---|---|---|---|
| Q2_K_XL | baseline | 19.76 | 1.00x | 40.39 | n/a | 71.1% | 54.2 -> 60.4 C |
| Q2_K_XL | MTP n=3 | 23.43 | 1.19x | 38.11 | 67.96% | 83.7% | 68.1 -> 73.3 C |
| Q3_K_XL | baseline | 18.74 | 1.00x | 68.56 | n/a | 86.0% | 68.6 -> 74.6 C |
| Q3_K_XL | MTP n=3 | 22.15 | 1.18x | 62.06 | 73.00% | 98.0% | 68.7 -> 73.8 C |

Per-prompt generation rates:

| Prompt | Q2 baseline | Q2 MTP n=3 | Q2 accept | Q3 baseline | Q3 MTP n=3 | Q3 accept |
|---|---|---|---|---|---|---|
| code_python | 19.7 | 24.7 | 70.5% | 18.6 | 25.6 | 86.2% |
| code_cpp | 19.7 | 24.2 | 75.9% | 19.0 | 24.8 | 85.4% |
| explain_concept | 19.7 | 20.1 | 57.6% | 18.6 | 20.7 | 66.1% |
| summarize | 19.9 | 23.8 | 66.7% | 19.0 | 22.1 | 72.6% |
| qa_factual | 19.7 | 25.1 | 72.8% | 18.6 | 21.7 | 71.0% |
| translation | 20.4 | 21.2 | 54.2% | 19.5 | 18.8 | 54.2% |
| creative_short | 19.7 | 21.1 | 57.1% | 18.7 | 19.9 | 61.7% |
| stepwise_math | 19.7 | 27.9 | 86.2% | 18.5 | 23.5 | 80.4% |
| long_code_review | 19.4 | 22.7 | 64.1% | 18.0 | 22.3 | 75.9% |

Takeaways from this MI50 run:

  • MTP works on ROCm/gfx906 when the MTP head is explicitly kept on GPU with --spec-draft-ngl 99.
  • On this 16 GiB card, Q2 has comfortable headroom; Q3 is very tight but can run fully on GPU with reduced context/batch and prompt cache disabled.
  • The speedup is modest but consistent in this constrained setup: about 1.18-1.19x average generation tok/s.
  • Acceptance varies a lot by prompt; translation and creative prompts were the weakest cases here, while code/math prompts benefited more.

@chatton2-coles

chatton2-coles commented May 11, 2026

Just to TL;DR the conclusion of the verbose macOS reports so far...

⚠️ While MTP works, there's some buggy interaction with the Metal backend that causes it to allocate twice the memory of the model. This is obviously a severe usability issue for this common hardware platform.

@Geramy

Geramy commented May 11, 2026

> (quoting @frozename's M4 Pro 48 GB report above in full)

I have an M5 Max 128 GB and it works great with MTP, period.

@chatton2-coles

@Geramy Does that mean your M5 configuration is not erroneously allocating twice the model memory?

@Geramy

Geramy commented May 11, 2026

> @Geramy Does that mean your M5 configuration is not erroneously allocating twice the model memory?

Not that I noticed? But I could double check. I see more memory usage with MLX from LM Studio than llama.cpp with MTP. Maybe it's the model?

@danielattilasimon

> @Geramy Does that mean your M5 configuration is not erroneously allocating twice the model memory?

> Not that I noticed? But I could double check. I see more memory usage with MLX from LM Studio than llama.cpp with MTP. Maybe it's the model?

Could it be that it's working for you because you have a lot more memory to spare (128GB)? How big is the model you tried?

On my end (M3 Pro 36GB) it goes OOM very quickly with a 21GB model.

grimlee added a commit to grimlee/llama.cpp that referenced this pull request May 11, 2026
@Geramy

Geramy commented May 11, 2026

> @Geramy Does that mean your M5 configuration is not erroneously allocating twice the model memory?

> Not that I noticed? But I could double check. I see more memory usage with MLX from LM Studio than llama.cpp with MTP. Maybe it's the model?

> Could it be that it's working for you because you have a lot more memory to spare (128GB)? How big is the model you tried?

> On my end (M3 Pro 36GB) it goes OOM very quickly with a 21GB model.

I run the Qwen3.6-35B-A3B Q6 model, about 32 GB I think, but I can run it for a long time and never hit OOM, even at 250k context. Maybe it's the different architecture and the extra RAM, or the compile options I'm using?

@chatton2-coles

chatton2-coles commented May 11, 2026

> @Geramy Does that mean your M5 configuration is not erroneously allocating twice the model memory?

> Not that I noticed? But I could double check. I see more memory usage with MLX from LM Studio than llama.cpp with MTP. Maybe it's the model?

> Could it be that it's working for you because you have a lot more memory to spare (128GB)? How big is the model you tried?
> On my end (M3 Pro 36GB) it goes OOM very quickly with a 21GB model.

> I run the Qwen3.6-35B-A3B Q6 model, about 32 GB I think, but I can run it for a long time and never hit OOM, even at 250k context. Maybe it's the different architecture and the extra RAM, or the compile options I'm using?

This isn't about context. It's about the Metal backend seemingly allocating twice the memory for the loaded model.
With a 128GB system, even a 'double allocated' 32GB model (so ~64GB) would still fit comfortably.
That result doesn't rule out the likelihood that this issue affects all MacOS users.

@frozename

frozename commented May 11, 2026

So I spent today digging in and found the root cause for the memory doubling problem.

The MTP head gets loaded by reopening the same gguf with override_arch=qwen35_mtp. That arch registers tensors at both ends of the file (tok_embd near the start, output/nextn.*/last layer at the end), so the mmap-backed buffer in llama-model.cpp ends up covering pretty much the whole file. Apple Metal then uploads that whole range to a Metal-resident buffer = a full duplicate of the main model in VRAM. Two MTL0_Mapped model buffer size = 18760 MiB lines in the server log, side by side. Shows up as unaccounted ~= model_size in the breakdown because the MTP context lives in a sibling llama_context.

Fix is one line in tools/server/server-context.cpp — force use_mmap=false on the MTP load so the non-mmap allocator sizes the buffer to the registered tensors only. MTP buffer drops 18760 → 1425 MiB at Q5, 28213 → 1719 MiB at Q8.

auto mparams_mtp = common_model_params_to_llama(params_base);
mparams_mtp.override_arch = mtp_arch;
mparams_mtp.use_mmap = false;   // <-- this is the whole fix

model_mtp.reset(llama_model_load_from_file(params_base.model.path.c_str(), mparams_mtp));

Post-fix on the same box, exact server flags, am17an's 9-prompt suite:

Q4_K_M  vanilla 11.9  mtp 10.0  ratio 0.85  accept 0.70
Q5_K_M  vanilla  9.5  mtp  8.6  ratio 0.91  accept 0.71
Q8_0    vanilla  7.4  mtp 11.1  ratio 1.49  accept 0.73

Q8 wins clean (1.26x–1.82x per prompt, code/math at the top). Q4/Q5 still lose — speculative win scales with main-pass cost so the largest quant you can fit is where MTP pays off on this hardware. Not 2.5x but I'll take 1.5x at Q8_0.

params_base.model.path.c_str(), mtp_arch);

auto mparams_mtp = common_model_params_to_llama(params_base);
mparams_mtp.override_arch = mtp_arch;

On M4-class Macs the MTP head load duplicates the main model in Metal VRAM (MTL0_Mapped model buffer size = 18760 MiB twice in the server log). The MTP arch registers tok_embd near the start of the gguf and output/nextn.* near the end, so the mmap-backed buffer in llama-model.cpp covers [first, last) ≈ the whole file. Forcing use_mmap = false here drops the MTP buffer 13–16× (Q5: 18 760 → 1 425 MiB; Q8: 28 213 → 1 719 MiB) and unblocks Q8_0 at 1.49× decode on M4 Pro.

Suggested change:
-    mparams_mtp.override_arch = mtp_arch;
+    mparams_mtp.override_arch = mtp_arch;
+    mparams_mtp.use_mmap = false;


Can't we use --no-mmap as a switch in the launch args of llama.cpp, or does that not apply?

@cpietsch

Did a retest with ROCm 7.2.3 and rocBLAS-7.2.3 on my AMD Instinct MI50 16 GiB (gfx906)

Bench settings: -ctk q4_0 -ctv q4_0 -b 64 -ub 64 -c 2048 -np 1 -ngl 99 --flash-attn on --no-cache-prompt --cache-ram 0

MTP settings: --spec-type mtp --spec-draft-n-max 3 --spec-draft-ngl 99

| Quant | Baseline tok/s | MTP tok/s | Speedup | MTP acceptance | VRAM avg |
|---|---|---|---|---|---|
| Q2_K_XL | 19.58 | 23.48 | 1.20x | 65.24% | 84.0% |
| Q3_K_XL | 18.86 | 21.61 | 1.15x | 70.70% | 98.7% |

Overall comparable speedup: 1.17x.

yrougy added a commit to yrougy/llama.cpp that referenced this pull request May 11, 2026
…n35/qwen35moe)

Adds llama_model_qwen35_mtp / llama_model_qwen35moe_mtp architectures
that the server auto-loads from the same GGUF with override_arch when
--spec-type mtp is requested. The MTP block runs as a separate draft
context for speculative decoding, yielding ~2-3× throughput increase.

Conflict resolutions vs our local changes:

- qwen35.cpp / qwen35moe.cpp: removed our duplicate nextn_predict_layers
  block (now handled in the merged load_arch_hparams); kept TENSOR_SKIP
  for MTP-layer tensors in the base model to avoid loading ~200 MiB of
  unused weights into VRAM; extended TENSOR_SKIP block with the two new
  nextn tensors (embed_tokens, shared_head_head) using
  TENSOR_NOT_REQUIRED|TENSOR_SKIP so GGUFs without them still work.

- convert_hf_to_gguf.py: kept both _Qwen35MRopeMixin (mrope_section
  default) and _Qwen35MtpMixin (MTP block count/tensor remapping) as
  separate mixins; both classes now inherit from both.

- tests/test-backend-ops.cpp: merged ggml_set_name + ggml_l2_norm from
  our side with the new keep_intermediates parameter from the PR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@happydog-bot

Sharing some production benchmark data on this MTP implementation — might be useful for prioritizing the -np > 1 follow-up work.

Setup: llama-server with this PR merged locally, Qwen3.6-35B-A3B-MTP (Q4_K_XL UD), AMD Ryzen AI Max+ 395 / Radeon 8060S / Vulkan backend, -c 262144 -fa on -ctk q8_0 -ctv q8_0 --no-mmap --mlock -ngl 99. 200-token decode, ignore_eos=true, temperature=0, seed=42, 3 warm runs.

| Config | Per-stream | 5-job end-to-end | Aggregate at saturation |
|---|---|---|---|
| MTP, -np 1 | 70.46 tok/s | ~14.2s (serialized) | 70.46 tok/s |
| no MTP, -np 5 -kvu, 1 active | 51.84 tok/s | | 51.84 tok/s |
| no MTP, -np 5 -kvu, 5 active | 25.26 tok/s | 8.27s | 120.86 tok/s |

MTP gives a clean 1.36× per-stream speedup (draft acceptance ~72%, ~136/189 drafted tokens kept). Parallel slots give a 1.72× end-to-end speedup on a 5-agent burst. Both are large wins.

The MTP implementation here is excellent work — really appreciate it. Wanted to flag this from an agentic-workflow perspective specifically: the choice between MTP and -np > 1 is currently strictly binary, but agent deployments hit both axes simultaneously. A single interactive agent benefits enormously from MTP's 36% per-token win; a fleet of 5–10 concurrent agents needs parallel slots or they serialize and time out. With the n_parallel > 1 guard in server-context.cpp we're forced to drop MTP entirely to keep agents alive, which costs 36% per stream on every interactive flow.

If MTP and parallel slots were composable, the same workload could plausibly land near 1.36 × 1.72 ≈ 2.3× over baseline. That would be a major deployment-side win for anyone running agent fleets on llama.cpp and would make this PR even more impactful when it lands.

Happy to test patches against this hardware (Strix Halo / Vulkan is somewhat under-represented in benchmark data) if useful — just ping.

@am17an am17an mentioned this pull request May 11, 2026

Comment thread convert_hf_to_gguf.py
n_layer = self.hparams["num_hidden_layers"]
if name.find("layers.") != -1:
    assert bid is not None
    name = name.replace(f"mtp.layers.{bid}", f"model.layers.{bid + n_layer}")

@kauffman12 kauffman12 May 11, 2026


When converting Qwen3.5 122B I get this error:
KeyError: 'model.layers.0.mlp.experts.0.down_proj.weight'

I changed this code to update the bid value as well as the name and it seems to have fixed the problem. I've been using the MTP version of 122B Q8_0 for a few days and it's been great. I also built a Qwen 3.6 35B with this code change and it still worked fine.

new_bid = bid + n_layer
name = name.replace(f"mtp.layers.{bid}", f"model.layers.{new_bid}")
bid = new_bid
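For anyone hitting the same KeyError, here is a hypothetical, self-contained illustration of what the suggested change does; the layer count and tensor name below are made up, and only the bid handling is the point:

```python
# Standalone illustration of the bid remap (values are illustrative only).
n_layer = 48  # main decoder layer count for the model being converted (made up here)
name = "mtp.layers.0.mlp.experts.0.down_proj.weight"
bid = 0

if name.find("layers.") != -1:
    assert bid is not None
    new_bid = bid + n_layer  # shift the MTP block past the main decoder layers
    name = name.replace(f"mtp.layers.{bid}", f"model.layers.{new_bid}")
    bid = new_bid            # keep bid in sync so later per-block lookups use the shifted index

print(name, bid)  # model.layers.48.mlp.experts.0.down_proj.weight 48
```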


Yes, that's right, I also modified that part, and then 122B can be converted correctly.

@curvedinf

curvedinf commented May 11, 2026

Here are some results on my 7900 XTX.

Environment:

  • PR commit: 5d5f1b46e
  • GPU: Radeon RX 7900 XTX, gfx1100

Benchmark flags:
-ctk q4_0 -ctv q4_0 -b 64 -ub 64 -c 2048 -np 1 -ngl 99 --flash-attn on --no-cache-prompt --cache-ram 0 --chat-template-kwargs '{"preserve_thinking": true}'

MTP flags:
--spec-type mtp --spec-draft-n-max 3 --spec-draft-ngl 99

ROCm

| Model | Quant | Baseline tok/s | MTP tok/s | Speedup | MTP acceptance | VRAM base | VRAM MTP | VRAM delta |
|---|---|---|---|---|---|---|---|---|
| Qwen3.6-27B | Q2_K | 36.15 | 41.17 | 1.14x | 54.01% | 46.0% | 53.2% | +7.2 pp |
| Qwen3.6-27B | Q3_K_M | 30.00 | 45.92 | 1.53x | 68.49% | 55.6% | 63.0% | +7.4 pp |
| Qwen3.6-27B | Q3_K_S | 30.94 | 46.83 | 1.51x | 68.09% | 51.0% | 58.0% | +7.0 pp |
| Qwen3.6-27B | Q4_K_M | 27.51 | 47.07 | 1.71x | 70.73% | 68.0% | 75.1% | +7.1 pp |
| Qwen3.6-27B | Q4_K_S | 27.52 | 44.32 | 1.61x | 66.04% | 64.0% | 71.2% | +7.2 pp |
| Qwen3.6-27B | Q5_K_M | 29.79 | 45.25 | 1.52x | 69.74% | 78.0% | 85.2% | +7.2 pp |
| Qwen3.6-35B-A3B | Q2_K | 102.85 | 116.45 | 1.13x | 53.26% | 55.2% | 60.0% | +4.8 pp |
| Qwen3.6-35B-A3B | Q3_K_M | 87.03 | 131.16 | 1.51x | 70.50% | 70.6% | 75.0% | +4.4 pp |
| Qwen3.6-35B-A3B | Q4_K_M | 86.69 | 131.02 | 1.51x | 69.89% | 88.0% | 93.0% | +5.0 pp |

Vulkan

| Model | Quant | Baseline tok/s | MTP tok/s | Speedup | MTP acceptance | VRAM base | VRAM MTP | VRAM delta |
|---|---|---|---|---|---|---|---|---|
| Qwen3.6-27B | Q2_K | 50.36 | 56.00 | 1.11x | 53.26% | 43.5% | 50.0% | +6.5 pp |
| Qwen3.6-27B | Q3_K_M | 43.48 | 64.46 | 1.48x | 70.22% | 53.0% | 60.0% | +7.0 pp |
| Qwen3.6-27B | Q3_K_S | 45.20 | 64.62 | 1.43x | 69.74% | 48.0% | 55.0% | +7.0 pp |
| Qwen3.6-27B | Q4_K_M | 38.46 | 60.86 | 1.58x | 69.40% | 66.0% | 73.0% | +7.0 pp |
| Qwen3.6-27B | Q4_K_S | 41.11 | 60.60 | 1.47x | 69.89% | 61.7% | 69.0% | +7.3 pp |
| Qwen3.6-27B | Q5_K_M | 34.54 | 65.91 | 1.91x | 70.27% | 76.0% | 83.0% | +7.0 pp |
| Qwen3.6-35B-A3B | Q2_K | 143.67 | 163.87 | 1.14x | 55.96% | 54.0% | 57.0% | +3.0 pp |
| Qwen3.6-35B-A3B | Q3_K_M | 140.64 | 188.14 | 1.34x | 67.66% | 69.0% | 73.0% | +4.0 pp |
| Qwen3.6-35B-A3B | Q4_K_M | 139.87 | 187.93 | 1.34x | 69.61% | 86.0% | 91.0% | +5.0 pp |
| Qwen3.6-35B-A3B | Q5_K_M | 42.73 | 37.65 | 0.88x | 70.65% | 99.0% | 99.0% | +0.0 pp |

@syzhizhu

am17an#6 (comment)

--split-mode tensor becomes invalid and affects MTP speed. Removing --split-mode tensor restores normal MTP speed.

However, --split-mode tensor worked previously.
