
llama + spec: MTP Support #22673

Draft
am17an wants to merge 11 commits into ggml-org:master from am17an:mtp-clean

Conversation

@am17an
Contributor

@am17an am17an commented May 4, 2026

Overview

This PR adds support for MTP (Multi Token Prediction) heads. I tested this on Qwen3.6 27B and Qwen3.6 35BA3B, but in principle it should work for any MTP model. I've posted the detailed results below, but typically I see a steady-state acceptance of around 75% with 3 draft tokens, which is more than a 2x speed-up over the baseline. The design decisions I took to get to this stage are as follows:

Note: --mmproj and parallel sequences (-np > 1) don't work at the moment; they will get fixed

Next Steps

Performance

A simple bench for testing various prompts is here: https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090. Posting the results below:
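For context, here is a rough sketch of the kind of bench client used for the tables below (the real script is in the gist above). The /completion payload follows the llama-server API; the timings field names used for the draft/accept counts (draft_n, draft_n_accepted) are assumptions and may differ between server versions.

```python
# Minimal bench-client sketch (not the actual gist script).
import requests

PROMPTS = {
    "code_python": "Write a Python function that parses a CSV file.",
    "qa_factual":  "What is the tallest mountain on Earth?",
    # ... one entry per benchmark prompt
}

def run_prompt(name: str, prompt: str, url: str = "http://localhost:8080/completion") -> None:
    # Same sampling settings reported elsewhere in this thread.
    r = requests.post(url, json={
        "prompt": prompt,
        "n_predict": 192,
        "temperature": 0,
        "seed": 42,
        "cache_prompt": False,
    })
    r.raise_for_status()
    t = r.json().get("timings", {})
    pred  = int(t.get("predicted_n", 0))
    draft = int(t.get("draft_n", 0))            # assumed field name
    acc   = int(t.get("draft_n_accepted", 0))   # assumed field name
    tok_s = 1000.0 * pred / float(t.get("predicted_ms", 1.0) or 1.0)
    rate  = f"{acc / draft:.3f}" if draft else "n/a"
    print(f"{name:<18} pred={pred:4d} draft={draft:4d} acc={acc:4d} rate={rate} tok/s={tok_s:.1f}")

for name, prompt in PROMPTS.items():
    run_prompt(name, prompt)
```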

Performance on DGX Spark 🧵

No MTP (baseline)

./llama-server -m ../qwen3.6-q8_0.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.0
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.3
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.3
  summarize          pred=  53 draft=   0 acc=   0 rate=n/a tok/s=7.1
  qa_factual         pred= 177 draft=   0 acc=   0 rate=n/a tok/s=7.0
  translation        pred=  22 draft=   0 acc=   0 rate=n/a tok/s=7.7
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.1
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.2
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1404,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 201.07
}

MTP --spec-draft-n-max 3

./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 3

  code_python        pred= 192 draft= 153 acc= 139 rate=0.908 tok/s=21.6
  code_cpp           pred= 192 draft= 176 acc= 132 rate=0.750 tok/s=18.7
  explain_concept    pred= 192 draft= 191 acc= 126 rate=0.660 tok/s=16.3
  summarize          pred=  55 draft=  51 acc=  37 rate=0.726 tok/s=17.9
  qa_factual         pred= 177 draft= 174 acc= 118 rate=0.678 tok/s=16.5
  translation        pred=  22 draft=  24 acc=  13 rate=0.542 tok/s=13.9
  creative_short     pred= 192 draft= 200 acc= 123 rate=0.615 tok/s=15.8
  stepwise_math      pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=19.3
  long_code_review   pred= 192 draft= 179 acc= 131 rate=0.732 tok/s=18.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1406,
  "total_draft": 1319,
  "total_draft_accepted": 952,
  "aggregate_accept_rate": 0.7218,
  "wall_s_total": 83.8
}

MTP --spec-draft-n-max 2

./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 2

  code_python        pred= 192 draft= 134 acc= 123 rate=0.918 tok/s=17.4
  code_cpp           pred= 192 draft= 145 acc= 118 rate=0.814 tok/s=16.5
  explain_concept    pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=16.1
  summarize          pred=  55 draft=  44 acc=  32 rate=0.727 tok/s=15.6
  qa_factual         pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=18.2
  translation        pred=  22 draft=  18 acc=  12 rate=0.667 tok/s=15.2
  creative_short     pred= 192 draft= 149 acc= 116 rate=0.778 tok/s=16.1
  stepwise_math      pred= 192 draft= 139 acc= 121 rate=0.871 tok/s=17.2
  long_code_review   pred= 192 draft= 153 acc= 114 rate=0.745 tok/s=15.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1421,
  "total_draft": 1062,
  "total_draft_accepted": 877,
  "aggregate_accept_rate": 0.8258,
  "wall_s_total": 90.44
}

Draft model (Qwen3.5 0.8B) with --spec-draft-n-max 16, with partial rollback

llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 16 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft= 188 acc= 156 rate=0.830 tok/s=26.4
  code_cpp           pred= 192 draft= 201 acc= 126 rate=0.627 tok/s=16.8
  explain_concept    pred= 192 draft= 263 acc= 112 rate=0.426 tok/s=12.7
  summarize          pred=  57 draft=  63 acc=  39 rate=0.619 tok/s=16.9
  qa_factual         pred= 192 draft= 178 acc= 177 rate=0.994 tok/s=47.7
  translation        pred=  23 draft=  18 acc=  15 rate=0.833 tok/s=18.7
  creative_short     pred= 192 draft= 189 acc= 120 rate=0.635 tok/s=15.4
  stepwise_math      pred= 192 draft= 190 acc= 148 rate=0.779 tok/s=22.3
  long_code_review   pred= 192 draft= 207 acc= 120 rate=0.580 tok/s=14.5

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1424,
  "total_draft": 1497,
  "total_draft_accepted": 1013,
  "aggregate_accept_rate": 0.6767,
  "wall_s_total": 81.39
}

Master with draft model, --spec-draft-n-max 64, no partial rollback

llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 64 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft= 174 acc= 159 rate=0.914 tok/s=27.2
  code_cpp           pred= 192 draft= 138 acc= 120 rate=0.870 tok/s=15.0
  explain_concept    pred= 192 draft= 170 acc= 101 rate=0.594 tok/s=11.4
  summarize          pred=  55 draft=  48 acc=  36 rate=0.750 tok/s=14.6
  qa_factual         pred= 177 draft= 126 acc= 106 rate=0.841 tok/s=13.9
  translation        pred=  22 draft=  13 acc=  13 rate=1.000 tok/s=16.5
  creative_short     pred= 192 draft= 136 acc= 104 rate=0.765 tok/s=12.8
  stepwise_math      pred= 192 draft= 172 acc= 147 rate=0.855 tok/s=22.0
  long_code_review   pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=13.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1406,
  "total_draft": 1137,
  "total_draft_accepted": 897,
  "aggregate_accept_rate": 0.7889,
  "wall_s_total": 97.13
}

How to use

I've uploaded the GGUF, which I made using the convert_hf_to_gguf.py changes in this PR. Here is another GGUF for the MoE (35BA3B) model.
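To enable it at runtime, point llama-server at the MTP GGUF and pass the speculative flags used in the benchmark runs above, for example:

./llama-server -m qwen3.6-q8_0-mtp.gguf --spec-type mtp --spec-draft-n-max 3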

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, for debugging and reviewing; also for the convert_hf_to_gguf.py changes + model definitions, and for writing the bench used for validation against vLLM.

@github-actions bot added labels on May 4, 2026: model (Model specific), testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), Vulkan (Issues specific to the Vulkan backend), examples, python (python script changes), server, ggml (changes relating to the ggml tensor library for machine learning)
@ngxson
Contributor

ngxson commented May 4, 2026

Nice, I think this is a fresher start than my WIP #18886 (which I still haven't found the time to continue)

There were some other attempts to add MTP support, but they all rely heavily on host <--> device data copies. I assume you tried to address this, right? (Maybe there was a discussion somewhere that I wasn't aware of.)

Contributor

@ngxson ngxson left a comment


(not a review, but opening some discussions)

Comment thread src/llama-memory-recurrent.h
Comment thread src/models/qwen35.cpp

for (int il = 0; il < n_layer; ++il) {
// MTP/NextN layers are loaded as extra decoder blocks but not executed in the main pass.
const int n_transformer_layers = n_layer - (int)hparams.nextn_predict_layers;
Contributor


Nit, but maybe call it n_main_layers, as technically the nextn layer is also a transformer layer

Comment on lines +811 to +823
//TODO: generalize if this is ok, we should load <arch_name>_mtp arch?
if (params_base.speculative.type == COMMON_SPECULATIVE_TYPE_MTP) {
    SRV_INF("loading MTP head from '%s' (override_arch=qwen35_mtp)\n",
            params_base.model.path.c_str());

    auto mparams_mtp = common_model_params_to_llama(params_base);
    mparams_mtp.override_arch = "qwen35_mtp";

    model_mtp.reset(llama_model_load_from_file(params_base.model.path.c_str(), mparams_mtp));
    if (model_mtp == nullptr) {
        SRV_ERR("failed to load MTP head from '%s'\n", params_base.model.path.c_str());
        return false;
    }
Contributor


if you look at #18886, the better way is to move llama_graph_type to the public API, then load the context with the appropriate graph type

Contributor Author


Yes that seems like the correct way to do this if we want to support MTP in a generic way

@am17an
Contributor Author

am17an commented May 4, 2026

@ngxson yes, the h2d copy was discussed with GG; he's working on a refactor which will allow us to share tensors between two llama contexts

@pwilkin
Member

pwilkin commented May 4, 2026

Great work, this should massively bridge the TG gap with vLLM, or maybe even surpass it together with tensor-parallel.

am17an added 4 commits May 4, 2026 20:15
Currently speculative decoding needs to restart from a checkpoint after some draft tokens are not accepted, which leads to some waste in running the target again. This PR adds the ability to roll back up to `draft_max` by storing the GDN intermediates.
@cmp-nct
Contributor

cmp-nct commented May 4, 2026

In my opinion, Qwen 3.6 is the most important thing that has happened in open-source models in a long time; this is going to be so valuable.
I wonder if this, once merged, could be combined with ngram drafting?
So MTP is used until ngram is triggered, switching to ngram until rejection and then back to MTP.

ngram could be set to match only very strong and long candidates (for large repetitive paraphrasing)
and MTP fills the gap

@Dampfinchen

Dampfinchen commented May 4, 2026

" idea is that MTP should automatically start and we shouldn't need to distribute the MTP gguf separately but also it has it's own context/kv-cache etc." -> Does this mean MTP needs additional resources (RAM/VRAM?)

If so, there should always be an option to remain to disable it. Right now on my system (6 GB VRAM, 32 GB RAM), speculative decoding just makes things much slower even on very small draft models because of that exact reason, they need own context and kv-cache. Such low to midrange systems already operate on the edge in terms of memory.

@mbednarek360

I'm getting garbage responses running this PR on the Vulkan backend with an R9700 using llama-server. I'm using the GGUF you linked above. Interestingly, draft acceptance is only 0.01282.

Prompt: "Hello!"
Response:

The from,

;::...

... on;srible威风to{ islitor

\ ...

• We
&eq和chn ***, on
Prompt (:
mouth

“ ? forM� P 

@am17an
Contributor Author

am17an commented May 4, 2026

@cmp-nct I'm not sure, but it could be possible

@Dampfinchen as of right now it is opt-in via --spec-type mtp, but in terms of memory it should be < 10% of the overall memory used (it's just a single-layer transformer + KV cache, much lighter than a draft model)

@mbednarek360 I've only tested this on a small number of CUDA devices so far; once it's ready for review I will have tested more devices/backends. In particular, this PR relies on #22400, which is not implemented for Vulkan yet; if you ask an LLM to add support for that you might get a little further. (Update: Vulkan and Metal have also been tested.)

@nawoa

nawoa commented May 4, 2026

Might it be possible/useful to run the draft model on a second GPU? Given that the MTP weights are relatively small, this might provide a useful speedup on systems with a dedicated high-VRAM "AI" GPU and a cheaper low-VRAM "normal" GPU used for display output, etc., and possibly prevent some degree of resource contention.

@cturan

cturan commented May 4, 2026

Thank you, we are eagerly awaiting this to become stable. Here are the automated test results for my machine:

Qwen3.6-27B Q6_K benchmark on llama.cpp b9025-10829dbcc / PR #22673 branch
Hardware: RTX 3090 24GB + RTX 3060 12GB
Runtime flags: -fa on -c 10000 -np 1 -ngl 99 --no-mmap --no-cache-prompt
Endpoint: /completion, raw text prompt
Prompt: 6978 tokens
Generation: 256 tokens
Runs: 3 measured runs after warmup

| Mode | Model | Prefill tok/s avg | Generation tok/s avg | MTP acceptance | Loaded VRAM |
|---|---|---|---|---|---|
| MTP enabled | Qwen3.6-27B-MTP-Q6_K.gguf + --spec-type mtp --spec-draft-n-max 3 | 665.14 | 42.45 | 76.0% | 24.96 GiB |
| MTP disabled, same GGUF | Qwen3.6-27B-MTP-Q6_K.gguf, no spec | 1315.46 | 22.97 | n/a | 22.47 GiB |
| Existing non-MTP Q6 | Qwen3.6-27B-Q6_K.gguf, no spec | 1260.12 | 22.39 | n/a | 22.59 GiB |

Result:

  • MTP improves decode from 22.97 tok/s to 42.45 tok/s on the same GGUF: ~1.85x speedup.
  • Against the existing non-MTP Q6 file, decode improves from 22.39 tok/s to 42.45 tok/s: ~1.90x speedup.
  • Prefill is slower with MTP enabled in this PR path: 665 tok/s vs 1315 tok/s on the same GGUF (~0.51x).
  • MTP adds about 2.49 GiB loaded VRAM in this setup.

@am17an
Contributor Author

am17an commented May 4, 2026

@cturan Thanks for testing. I'm aware of the prefill issue and will work on a fix.

@iiLaurens

Might be a long shot, but any chance of supporting MTP with a reduced vocabulary? MTP layers are rather chonky and reducing token embeddings might help users with less VRAM by filtering out certain languages. Obviously the full model will still be able to produce those tokens if need be so it won't be gimped.

@nybblr

nybblr commented May 4, 2026

Working on taking this for a spin with the Q4_K_M quant of Qwen3.6-35BA3B. I was gonna try to start from unsloth's quant since they already perform really well, but of course they don't have any mtp layers.

@am17an Think it would work if I just "steal" the layers from your q8 quant and merge them into the unsloth quant? (add blk.40 and bump some top-level config like block_count and kv_count)

@volkermauel

only a quick test run, 1x 5090 qwen3.6-27b mtp 3, q4_0 quantized, kv also q4_0

slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 532 | processing task, is_child = 0
slot update_slots: id  0 | task 532 | new prompt, n_ctx_slot = 200192, n_keep = 0, task.n_tokens = 16
slot update_slots: id  0 | task 532 | n_past = 3, slot.prompt.tokens.size() = 1327, seq_id = 0, pos_min = 1326, n_swa = 0
slot update_slots: id  0 | task 532 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 532 | n_tokens = 0, memory_seq_rm [0, end)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.178.49 200
slot update_slots: id  0 | task 532 | prompt processing progress, n_tokens = 12, batch.n_tokens = 12, progress = 0.750000
slot update_slots: id  0 | task 532 | n_tokens = 12, memory_seq_rm [12, end)
slot init_sampler: id  0 | task 532 | init sampler, took 0.01 ms, tokens: text = 16, total = 16
slot update_slots: id  0 | task 532 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id  0 | task 532 |
prompt eval time =      63.16 ms /    16 tokens (    3.95 ms per token,   253.34 tokens per second)
       eval time =   56063.04 ms /  5913 tokens (    9.48 ms per token,   105.47 tokens per second)
      total time =   56126.20 ms /  5929 tokens
draft acceptance rate = 0.79728 ( 4169 accepted /  5229 generated)
statistics mtp: #calls(b,g,a) = 2 2272 1976, #gen drafts = 2272, #acc drafts = 1976, #gen tokens = 6816, #acc tokens = 4950, dur(b,g,a) = 0.007, 15393.656, 64.921 ms
slot      release: id  0 | task 532 | stop processing: n_tokens = 5928, truncated = 0
srv  update_slots: all slots are idle

same model, same config (except mtp)

slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id  0 | task 0 | 
prompt eval time =      91.85 ms /    16 tokens (    5.74 ms per token,   174.20 tokens per second)
       eval time =  103127.94 ms /  6571 tokens (   15.69 ms per token,    63.72 tokens per second)
      total time =  103219.79 ms /  6587 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 6586, truncated = 0
srv  update_slots: all slots are idle

Prompt: "create a flappy bird clone"

(I'm not creative, sorry)

Great Speedup!

@chris-hatton

macOS / Apple Metal results - M1 Max 64GB, Qwen3.6-35B-A3B-MTP

Tested commit 5d5f1b4 on Apple M1 Max (unified memory, recommendedMaxWorkingSetSize ~53 GiB).

Baseline (no MTP), Q8_K_XL (37 GiB):

  • Prompt: 117 t/s, Generation: 40.8 t/s
  • Model: 37,515 MiB · Context: 105 MiB · Compute: 489 MiB · Unaccounted: 0 · Free: 14,973 MiB

MTP enabled (--spec-type mtp --spec-draft-n-max 2), Q8_K_XL (37 GiB):

  • Generation drops to ~1 t/s (MTP draft runs on CPU with -ngld defaulting to 0)
  • With -ngld 99: GPU OOM (kIOGPUCommandBufferCallbackErrorOutOfMemory)
  • Memory breakdown reveals the issue - Model: 37,515 MiB · Context: 230 MiB · Compute: 489 MiB · Unaccounted: 38,009 MiB

⚠️ The Metal backend allocates an "unaccounted" buffer roughly equal to the full model size for the MTP draft context. On CUDA this overhead is ~2.5 GiB (per @cturan's RTX 3090 report). On Metal it scales to the model size, pushing the Q8 (37 GiB) well past the 53 GiB working set.

Workaround - use Q4_K_XL (22 GiB):

| Config | Prompt t/s | Generation t/s |
|---|---|---|
| Q4, no MTP | 117.2 | 40.5 |
| Q4, MTP draft-n-max 2 | 100.7 | 52.7 |
| Q4, MTP draft-n-max 3 | 94.6 | 49.0 |

With Q4 the 22 GiB "unaccounted" buffer fits (7.5 GiB free), no OOM. MTP gives a 30% generation speedup at draft-n-max 2. Draft-n-max 2 outperforms 3 on this hardware, likely because the acceptance rate doesn't compensate for the extra draft overhead at Apple Metal's compute throughput.

Happy to run additional configurations if helpful.


@polhdez

polhdez commented May 10, 2026

Hi, there's a double-free corruption when the llama.cpp server is run in router mode and configured with sleep-idle-seconds to unload the model after N seconds (using the latest llama.cpp head and the latest commit from this PR, just pulled and re-built).

[59931] draft acceptance rate = 0.65574 (  120 accepted /   183 generated)
[59931] statistics mtp: #calls(b,g,a) = 131 27854 22160, #gen drafts = 27854, #acc drafts = 22160, #gen tokens = 83562, #acc tokens = 51155, dur(b,g,a) = 0.227, 151318.420, 471.852 ms
[59931] slot      release: id  0 | task 28235 | stop processing: n_tokens = 98624, truncated = 0
[59931] srv  update_slots: all slots are idle
[59931] que    start_loop: entering sleeping state
[59931] cmd_child_to_router:sleep
[59931] srv  handle_sleep: server is entering sleeping state
[59931] double free or corruption (!prev)

and the server is unable to fulfill any requests beyond this point.

I'm hitting the double-free error after exiting the server too. I'll try to debug it.

@ulyazen

ulyazen commented May 10, 2026

> 5060ti 16G 60t/s Bravo!!!

What model are you running? I have the same GPU.

@netlooker

I compiled llama.cpp based on this PR on an NVIDIA DGX Spark-compatible machine. I downloaded the Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf model from the https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF repo. I confirm that it works and that generation speed bumped from 7 t/s to over 20 t/s, which I find incredible (with 256k ctx size).
MTP is a game changer. I also confirm that parallelism must be 1 and that mmproj doesn't work, so vision is down.

@frozename

Data point from M4 Pro 48 GB (Apple10 / A18-class GPU, ~273 GB/s mem BW). PR head pinned to 5d5f1b46. Built cmake -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DLLAMA_CURL=ON -DCMAKE_BUILD_TYPE=Release. Server flags follow the recipe being recommended downstream: --ctx-size 8192 --no-warmup -np 1 -ngl 99 --cache-type-k q8_0 --cache-type-v q8_0 --spec-type mtp --spec-draft-n-max 3. Bench prompt set is am17an's gist; bench client uses temperature=0 seed=42 n_predict=192.

Q5_K_M and Q8_0 MTP both OOM the Metal working set mid-decode. Final memory breakdown for Q8_0:

MTL0 (Apple M4 Pro) | 38338 = ... + (29056 = 27690 + 870 + 495) + 28213
                                          model    ctx   compute   unaccounted

unaccounted ≈ model size is the smoking gun — an extra ~model-sized buffer is being attributed to the Metal allocation outside the formally-tracked self. Tried with -ub 512 / default -ub, with --flash-attn on / off — neither moves the breakdown. The doubled footprint appears structural to the MTP draft path on Apple Silicon, not flag-driven.

Q4_K_M (16 GB) fits the doubled footprint (32 GB) under the 38 GB Metal cap, and runs cleanly — but is slower than vanilla:

Aggregate vanilla: 11.9 tok/s   wall 127.3 s   accept=n/a
Aggregate MTP:     10.0 tok/s   wall 150.0 s   accept=0.701
Ratio: 0.84x

70% draft acceptance, so the speculative path is working; the per-token MTP overhead just exceeds the saved forward passes on this hardware.

All positive Apple-Silicon reports in this thread are Apple9 family (M1 Ultra, M2 Max, M3 Ultra — A17 GPU architecture, 38–80 cores, 400+ GB/s). M3 Pro/Max and the M4 line are Apple10 (A18, dynamic caching, fewer GPU cores per tier). This may be the first M4-class data point — possibly relevant to #22787 ("single draft context instead of one per each slot — less memory"), which from the description seems to target exactly this symptom. Independent confirmation of the VRAM increase even outside MTP mode is in this thread already (the earlier note about needing to drop context from 256k → 200k just from checking out this branch).

Re-piloting once #22787 lands on master will be straightforward — happy to post follow-up numbers from the same M4 Pro.


@cpietsch

AMD ROCm data point for Qwen3.6-27B-MTP-UD-GGUF on an older 16 GiB card.

Tested PR head 5d5f1b46e4f5 (mtp-pr-22673) on a single AMD Instinct MI50/MI60 (gfx906) with ROCm/HIP. I used only ROCm0 for these numbers and required the logs to show full GPU offload:

  • baseline: offloaded 66/66 layers to GPU
  • MTP: base model offloaded 66/66 layers to GPU and MTP head offloaded 66/66 layers to GPU
  • MTP head registration confirmed with set_mtp: MTP draft head registered
  • no CPU layer offload was used
  • ROCm environment note: this is not a stock distro ROCm install. I am using a self-compiled ROCm 7.1/rocBLAS setup so gfx906 works with current ROCm. The build follows the notes from [Issue]: The rocblas of 6.4.0 release doesn't ship TensileLibrary files for gfx906 ROCm/ROCm#4625 (comment): rocBLAS release/rocm-rel-7.1, TENSILE_VERSION=4.45.0, libmsgpack-cxx-dev, and ./install.sh -a gfx906:xnack-.

Build:

export LLAMACPP_ROCM_ARCH=gfx906
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build-mtp \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=$LLAMACPP_ROCM_ARCH \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=ON
cmake --build build-mtp --config Release --target llama-server -j20

Runtime settings used for all measured runs:

--device ROCm0 \
-ctk q4_0 -ctv q4_0 \
-b 64 -ub 64 -c 2048 -np 1 -ngl 99 \
--flash-attn on --jinja --no-mmproj \
--no-warmup --no-cache-prompt --cache-ram 0

MTP runs additionally used:

--spec-type mtp --spec-draft-n-max 3 --spec-draft-ngl 99

Method:

  • Endpoint: /completion, raw prompt.
  • 9 prompt set similar to the benchmark prompts being used in this thread.
  • One warmup request before measured prompts for each config.
  • Request settings: temperature=0, seed=42, n_predict=192, cache_prompt=false.
  • Temperature gating: only waited when GPU edge temp was above 70 C.
  • Note: with the larger settings I initially tried (-c 8192, larger batch), Q3 MTP did not fit fully on a 16 GiB MI50. It failed while allocating the MTP head (1425.06 MiB) after the fully-offloaded base model. The smaller -c 2048 -b 64 -ub 64 --no-cache-prompt settings are what let Q3 MTP remain fully GPU-resident.

Aggregate results:

| Quant | Mode | Avg generation tok/s | Speedup vs baseline | Prompt tok/s avg | MTP acceptance | VRAM avg | Temp avg start -> end |
|---|---|---|---|---|---|---|---|
| Q2_K_XL | baseline | 19.76 | 1.00x | 40.39 | n/a | 71.1% | 54.2 -> 60.4 C |
| Q2_K_XL | MTP n=3 | 23.43 | 1.19x | 38.11 | 67.96% | 83.7% | 68.1 -> 73.3 C |
| Q3_K_XL | baseline | 18.74 | 1.00x | 68.56 | n/a | 86.0% | 68.6 -> 74.6 C |
| Q3_K_XL | MTP n=3 | 22.15 | 1.18x | 62.06 | 73.00% | 98.0% | 68.7 -> 73.8 C |

Per-prompt generation rates:

| Prompt | Q2 baseline | Q2 MTP n=3 | Q2 accept | Q3 baseline | Q3 MTP n=3 | Q3 accept |
|---|---|---|---|---|---|---|
| code_python | 19.7 | 24.7 | 70.5% | 18.6 | 25.6 | 86.2% |
| code_cpp | 19.7 | 24.2 | 75.9% | 19.0 | 24.8 | 85.4% |
| explain_concept | 19.7 | 20.1 | 57.6% | 18.6 | 20.7 | 66.1% |
| summarize | 19.9 | 23.8 | 66.7% | 19.0 | 22.1 | 72.6% |
| qa_factual | 19.7 | 25.1 | 72.8% | 18.6 | 21.7 | 71.0% |
| translation | 20.4 | 21.2 | 54.2% | 19.5 | 18.8 | 54.2% |
| creative_short | 19.7 | 21.1 | 57.1% | 18.7 | 19.9 | 61.7% |
| stepwise_math | 19.7 | 27.9 | 86.2% | 18.5 | 23.5 | 80.4% |
| long_code_review | 19.4 | 22.7 | 64.1% | 18.0 | 22.3 | 75.9% |

Takeaways from this MI50 run:

  • MTP works on ROCm/gfx906 when the MTP head is explicitly kept on GPU with --spec-draft-ngl 99.
  • On this 16 GiB card, Q2 has comfortable headroom; Q3 is very tight but can run fully on GPU with reduced context/batch and prompt cache disabled.
  • The speedup is modest but consistent in this constrained setup: about 1.18-1.19x average generation tok/s.
  • Acceptance varies a lot by prompt; translation and creative prompts were the weakest cases here, while code/math prompts benefited more.

@chatton2-coles

chatton2-coles commented May 11, 2026

Just to TL;DR the conclusion of the verbose macOS reports so far...

⚠️ While MTP works, there's some buggy interaction with the Metal backend that causes it to allocate twice the memory of the model. This is obviously a severe usability issue for this common hardware platform.

@Geramy

Geramy commented May 11, 2026

> (quoting @frozename's M4 Pro 48 GB report above in full)

I have an M5 Max 128 GB and it works great with MTP, period.

@chatton2-coles

@Geramy Does that mean your M5 configuration is not erroneously allocating twice the model memory?

@Geramy

Geramy commented May 11, 2026

> @Geramy Does that mean your M5 configuration is not erroneously allocating twice the model memory?

Not that I noticed? But I could double check. I see more memory usage with MLX from LM Studio than llama.cpp with MTP. Maybe it's the model?

@danielattilasimon

> @Geramy Does that mean your M5 configuration is not erroneously allocating twice the model memory?

> Not that I noticed? But I could double check. I see more memory usage with MLX from LM Studio than llama.cpp with MTP. Maybe it's the model?

Could it be that it's working for you because you have a lot more memory to spare (128GB)? How big is the model you tried?

On my end (M3 Pro 36GB) it goes OOM very quickly with a 21GB model.

grimlee added a commit to grimlee/llama.cpp that referenced this pull request May 11, 2026
@Geramy

Geramy commented May 11, 2026

> @Geramy Does that mean your M5 configuration is not erroneously allocating twice the model memory?

> Not that I noticed? But I could double check. I see more memory usage with MLX from LM Studio than llama.cpp with MTP. Maybe it's the model?

> Could it be that it's working for you because you have a lot more memory to spare (128GB)? How big is the model you tried?

> On my end (M3 Pro 36GB) it goes OOM very quickly with a 21GB model.

I run the Qwen3.6-35B-A3B Q6 model, about 32 GB I think, but I can run it for a long time and never hit OOM, even at 250k context. Maybe it's the different architecture and the extra RAM, or the compile options I'm using?

@chatton2-coles

chatton2-coles commented May 11, 2026

> @Geramy Does that mean your M5 configuration is not erroneously allocating twice the model memory?

> Not that I noticed? But I could double check. I see more memory usage with MLX from LM Studio than llama.cpp with MTP. Maybe it's the model?

> Could it be that it's working for you because you have a lot more memory to spare (128GB)? How big is the model you tried?
> On my end (M3 Pro 36GB) it goes OOM very quickly with a 21GB model.

> I run the Qwen3.6-35B-A3B Q6 model, about 32 GB I think, but I can run it for a long time and never hit OOM, even at 250k context. Maybe it's the different architecture and the extra RAM, or the compile options I'm using?

This isn't about context. It's about the Metal backend seemingly allocating twice the memory for the loaded model.
With a 128GB system, even a 'double allocated' 32GB model (so ~64GB) would still fit comfortably.
That result doesn't rule out the likelihood that this issue affects all MacOS users.

@frozename

frozename commented May 11, 2026

So I spent today digging in and found the root cause for the memory doubling problem.

The MTP head gets loaded by reopening the same gguf with override_arch=qwen35_mtp. That arch registers tensors at both ends of the file (tok_embd near the start, output/nextn.*/last layer at the end), so the mmap-backed buffer in llama-model.cpp ends up covering pretty much the whole file. Apple Metal then uploads that whole range to a Metal-resident buffer = a full duplicate of the main model in VRAM. Two MTL0_Mapped model buffer size = 18760 MiB lines in the server log, side by side. Shows up as unaccounted ~= model_size in the breakdown because the MTP context lives in a sibling llama_context.

Fix is one line in tools/server/server-context.cpp — force use_mmap=false on the MTP load so the non-mmap allocator sizes the buffer to the registered tensors only. MTP buffer drops 18760 → 1425 MiB at Q5, 28213 → 1719 MiB at Q8.

auto mparams_mtp = common_model_params_to_llama(params_base);
mparams_mtp.override_arch = mtp_arch;
mparams_mtp.use_mmap = false;   // <-- this is the whole fix

model_mtp.reset(llama_model_load_from_file(params_base.model.path.c_str(), mparams_mtp));

Post-fix on the same box, exact server flags, am17an's 9-prompt suite:

Q4_K_M  vanilla 11.9  mtp 10.0  ratio 0.85  accept 0.70
Q5_K_M  vanilla  9.5  mtp  8.6  ratio 0.91  accept 0.71
Q8_0    vanilla  7.4  mtp 11.1  ratio 1.49  accept 0.73

Q8 wins clean (1.26x–1.82x per prompt, code/math at the top). Q4/Q5 still lose — speculative win scales with main-pass cost so the largest quant you can fit is where MTP pays off on this hardware. Not 2.5x but I'll take 1.5x at Q8_0.

params_base.model.path.c_str(), mtp_arch);

auto mparams_mtp = common_model_params_to_llama(params_base);
mparams_mtp.override_arch = mtp_arch;

On M4-class Macs the MTP head load duplicates the main model in Metal VRAM (MTL0_Mapped model buffer size = 18760 MiB twice in the server log). The MTP arch registers tok_embd near the start of the gguf and output/nextn.* near the end, so the mmap-backed buffer in llama-model.cpp covers [first, last) ≈ the whole file. Forcing use_mmap = false here drops the MTP buffer 13–16× (Q5: 18 760 → 1 425 MiB; Q8: 28 213 → 1 719 MiB) and unblocks Q8_0 at 1.49× decode on M4 Pro.

Suggested change:
-    mparams_mtp.override_arch = mtp_arch;
+    mparams_mtp.override_arch = mtp_arch;
+    mparams_mtp.use_mmap = false;


Can't we use --no-mmap as a switch in the launch args of llama.cpp, or does that not apply?

@cpietsch

Did a retest with ROCm 7.2.3 and rocBLAS-7.2.3 on my AMD Instinct MI50 16 GiB (gfx906)

Bench settings: -ctk q4_0 -ctv q4_0 -b 64 -ub 64 -c 2048 -np 1 -ngl 99 --flash-attn on --no-cache-prompt --cache-ram 0

MTP settings: --spec-type mtp --spec-draft-n-max 3 --spec-draft-ngl 99

| Quant | Baseline tok/s | MTP tok/s | Speedup | MTP acceptance | VRAM avg |
|---|---|---|---|---|---|
| Q2_K_XL | 19.58 | 23.48 | 1.20x | 65.24% | 84.0% |
| Q3_K_XL | 18.86 | 21.61 | 1.15x | 70.70% | 98.7% |

Overall comparable speedup: 1.17x.

yrougy added a commit to yrougy/llama.cpp that referenced this pull request May 11, 2026
…n35/qwen35moe)

Adds llama_model_qwen35_mtp / llama_model_qwen35moe_mtp architectures
that the server auto-loads from the same GGUF with override_arch when
--spec-type mtp is requested. The MTP block runs as a separate draft
context for speculative decoding, yielding ~2-3× throughput increase.

Conflict resolutions vs our local changes:

- qwen35.cpp / qwen35moe.cpp: removed our duplicate nextn_predict_layers
  block (now handled in the merged load_arch_hparams); kept TENSOR_SKIP
  for MTP-layer tensors in the base model to avoid loading ~200 MiB of
  unused weights into VRAM; extended TENSOR_SKIP block with the two new
  nextn tensors (embed_tokens, shared_head_head) using
  TENSOR_NOT_REQUIRED|TENSOR_SKIP so GGUFs without them still work.

- convert_hf_to_gguf.py: kept both _Qwen35MRopeMixin (mrope_section
  default) and _Qwen35MtpMixin (MTP block count/tensor remapping) as
  separate mixins; both classes now inherit from both.

- tests/test-backend-ops.cpp: merged ggml_set_name + ggml_l2_norm from
  our side with the new keep_intermediates parameter from the PR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@happydog-bot

Sharing some production benchmark data on this MTP implementation — might be useful for prioritizing the -np > 1 follow-up work.

Setup: llama-server with this PR merged locally, Qwen3.6-35B-A3B-MTP (Q4_K_XL UD), AMD Ryzen AI Max+ 395 / Radeon 8060S / Vulkan backend, -c 262144 -fa on -ctk q8_0 -ctv q8_0 --no-mmap --mlock -ngl 99. 200-token decode, ignore_eos=true, temperature=0, seed=42, 3 warm runs.

| Config | Per-stream | 5-job end-to-end | Aggregate at saturation |
|---|---|---|---|
| MTP, -np 1 | 70.46 tok/s | ~14.2s (serialized) | 70.46 tok/s |
| no MTP, -np 5 -kvu, 1 active | 51.84 tok/s | | 51.84 tok/s |
| no MTP, -np 5 -kvu, 5 active | 25.26 tok/s | 8.27s | 120.86 tok/s |

MTP gives a clean 1.36× per-stream speedup (draft acceptance ~72%, ~136/189 drafted tokens kept). Parallel slots give a 1.72× end-to-end speedup on a 5-agent burst. Both are large wins.

The MTP implementation here is excellent work — really appreciate it. Wanted to flag this from an agentic-workflow perspective specifically: the choice between MTP and -np > 1 is currently strictly binary, but agent deployments hit both axes simultaneously. A single interactive agent benefits enormously from MTP's 36% per-token win; a fleet of 5–10 concurrent agents needs parallel slots or they serialize and time out. With the n_parallel > 1 guard in server-context.cpp we're forced to drop MTP entirely to keep agents alive, which costs 36% per stream on every interactive flow.

If MTP and parallel slots were composable, the same workload could plausibly land near 1.36 × 1.72 ≈ 2.3× over baseline. That would be a major deployment-side win for anyone running agent fleets on llama.cpp and would make this PR even more impactful when it lands.

Happy to test patches against this hardware (Strix Halo / Vulkan is somewhat under-represented in benchmark data) if useful — just ping.

@am17an am17an mentioned this pull request May 11, 2026

Comment thread convert_hf_to_gguf.py
n_layer = self.hparams["num_hidden_layers"]
if name.find("layers.") != -1:
    assert bid is not None
    name = name.replace(f"mtp.layers.{bid}", f"model.layers.{bid + n_layer}")

@kauffman12 kauffman12 May 11, 2026


When converting Qwen3.5 122B I get this error:
KeyError: 'model.layers.0.mlp.experts.0.down_proj.weight'

I changed this code to update the bid value as well as the name and it seems to have fixed the problem. I've been using the MTP version of 122B Q8_0 for a few days and it's been great. I also built a Qwen 3.6 35B with this code change and it still worked fine.

new_bid = bid + n_layer
name = name.replace(f"mtp.layers.{bid}", f"model.layers.{new_bid}")
bid = new_bid
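For anyone hitting the same KeyError, here is a hypothetical, self-contained illustration of what the suggested change does; the layer count and tensor name below are made up, and only the bid handling is the point:

```python
# Standalone illustration of the bid remap (values are illustrative only).
n_layer = 48  # main decoder layer count for the model being converted (made up here)
name = "mtp.layers.0.mlp.experts.0.down_proj.weight"
bid = 0

if name.find("layers.") != -1:
    assert bid is not None
    new_bid = bid + n_layer  # shift the MTP block past the main decoder layers
    name = name.replace(f"mtp.layers.{bid}", f"model.layers.{new_bid}")
    bid = new_bid            # keep bid in sync so later per-block lookups use the shifted index

print(name, bid)  # model.layers.48.mlp.experts.0.down_proj.weight 48
```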


Yes, that's right, I also modified that part, and then 122B can be converted correctly.

@curvedinf

curvedinf commented May 11, 2026

Here are some results on my 7900 XTX.

Environment:

  • PR commit: 5d5f1b46e
  • GPU: Radeon RX 7900 XTX, gfx1100

Benchmark flags:
-ctk q4_0 -ctv q4_0 -b 64 -ub 64 -c 2048 -np 1 -ngl 99 --flash-attn on --no-cache-prompt --cache-ram 0 --chat-template-kwargs '{"preserve_thinking": true}'

MTP flags:
--spec-type mtp --spec-draft-n-max 3 --spec-draft-ngl 99

ROCm

| Model | Quant | Baseline tok/s | MTP tok/s | Speedup | MTP acceptance | VRAM base | VRAM MTP | VRAM delta |
|---|---|---|---|---|---|---|---|---|
| Qwen3.6-27B | Q2_K | 36.15 | 41.17 | 1.14x | 54.01% | 46.0% | 53.2% | +7.2 pp |
| Qwen3.6-27B | Q3_K_M | 30.00 | 45.92 | 1.53x | 68.49% | 55.6% | 63.0% | +7.4 pp |
| Qwen3.6-27B | Q3_K_S | 30.94 | 46.83 | 1.51x | 68.09% | 51.0% | 58.0% | +7.0 pp |
| Qwen3.6-27B | Q4_K_M | 27.51 | 47.07 | 1.71x | 70.73% | 68.0% | 75.1% | +7.1 pp |
| Qwen3.6-27B | Q4_K_S | 27.52 | 44.32 | 1.61x | 66.04% | 64.0% | 71.2% | +7.2 pp |
| Qwen3.6-27B | Q5_K_M | 29.79 | 45.25 | 1.52x | 69.74% | 78.0% | 85.2% | +7.2 pp |
| Qwen3.6-35B-A3B | Q2_K | 102.85 | 116.45 | 1.13x | 53.26% | 55.2% | 60.0% | +4.8 pp |
| Qwen3.6-35B-A3B | Q3_K_M | 87.03 | 131.16 | 1.51x | 70.50% | 70.6% | 75.0% | +4.4 pp |
| Qwen3.6-35B-A3B | Q4_K_M | 86.69 | 131.02 | 1.51x | 69.89% | 88.0% | 93.0% | +5.0 pp |

Vulkan

| Model | Quant | Baseline tok/s | MTP tok/s | Speedup | MTP acceptance | VRAM base | VRAM MTP | VRAM delta |
|---|---|---|---|---|---|---|---|---|
| Qwen3.6-27B | Q2_K | 50.36 | 56.00 | 1.11x | 53.26% | 43.5% | 50.0% | +6.5 pp |
| Qwen3.6-27B | Q3_K_M | 43.48 | 64.46 | 1.48x | 70.22% | 53.0% | 60.0% | +7.0 pp |
| Qwen3.6-27B | Q3_K_S | 45.20 | 64.62 | 1.43x | 69.74% | 48.0% | 55.0% | +7.0 pp |
| Qwen3.6-27B | Q4_K_M | 38.46 | 60.86 | 1.58x | 69.40% | 66.0% | 73.0% | +7.0 pp |
| Qwen3.6-27B | Q4_K_S | 41.11 | 60.60 | 1.47x | 69.89% | 61.7% | 69.0% | +7.3 pp |
| Qwen3.6-27B | Q5_K_M | 34.54 | 65.91 | 1.91x | 70.27% | 76.0% | 83.0% | +7.0 pp |
| Qwen3.6-35B-A3B | Q2_K | 143.67 | 163.87 | 1.14x | 55.96% | 54.0% | 57.0% | +3.0 pp |
| Qwen3.6-35B-A3B | Q3_K_M | 140.64 | 188.14 | 1.34x | 67.66% | 69.0% | 73.0% | +4.0 pp |
| Qwen3.6-35B-A3B | Q4_K_M | 139.87 | 187.93 | 1.34x | 69.61% | 86.0% | 91.0% | +5.0 pp |
| Qwen3.6-35B-A3B | Q5_K_M | 42.73 | 37.65 | 0.88x | 70.65% | 99.0% | 99.0% | +0.0 pp |

@syzhizhu

am17an#6 (comment)

--split-mode tensor becomes invalid and affects MTP speed. Removing --split-mode tensor restores normal MTP speed.

However, --split-mode tensor worked previously.
