Support Step3.5/3.7 flash mtp3 by forforever73 · Pull Request #24340 · ggml-org/llama.cpp

forforever73 · 2026-06-09T08:50:49Z

Overview

📜 Full data-flow trace — couldn't think of a good way to draw this, so I wrote it all down instead. It's long, but every byte is load-bearing.

Notation:

token@pos / h(pos) — positions are explicit (0-indexed)
h_tgt(p) — target NextN hidden at p (before the output norm)
h45(p) / h46(p) — head 45/46 output hidden, chained between heads while drafting
pending_h = h_tgt(pos of id_last − 1) — always the trunk h, regardless of chaining

Example: a 4-token prompt at positions 0–3.

Time ─────────────────────────────────────────────────────────────────────▶

═══ Round 0: prompt bootstrap ══════════════════════════════════════════════
  prompt: [t0@0, t1@1, t2@2, t3@3]   (need_embd → all logits=true)

  ② target decode → verify_h = [h_tgt(0), h_tgt(1), h_tgt(2), h_tgt(3)]

  ③ process (mirror once per head, all logits=0, h_tgt shifted right by 1):
       for head in {45, 46, 47}:
         seq_rm(ctx_dft, seq, ≥ this ubatch's first pos)   // = 0 here; reset before each head
         token = [t0@0, t1@1,    t2@2,    t3@3]
         embd  = [0,    h_tgt(0), h_tgt(1), h_tgt(2)]   ← row 0 = pending_h = 0 (sentinel)
         → write KV_head@0..3 (teacher-forced, logits=0)
                              h_tgt(3) pushed out → pending_h default = h_tgt(3)
       (no draft → accept not called → pending_h stays = h_tgt(3))

  sample logits at pos 3 → T_first@4
       slot.sampled = T_first@4 ;  pending_h = h_tgt(3)
       invariant: id_last=T_first@4, pending_h=h_tgt(3)=h_tgt(4-1)  ✓

═══ Round 1: generation (id_last=T_first@4, pending_h=h_tgt(3)) ═════════════
  ⑥ draft (accumulated batch; switch head each step, 1 decode/step, batch grows):
       round_start = pos of id_last = 4; seq_rm(≥4) before switching to each head

       set_mtp_layer_offset(ctx_dft, 0)   // head 45
       step0 [h45]  seq_rm(≥4); batch=[T_first@4],            embd=[h_tgt(3)]
                    → decode → write KV45@4; sample draft1@5; output h45(4)

       set_mtp_layer_offset(ctx_dft, 1)   // head 46
       step1 [h46]  seq_rm(≥4); batch=[T_first@4, draft1@5],  embd=[h_tgt(3), h45(4)]
                    → decode → write KV46@4,5; sample draft2@6; output h46(5)

       set_mtp_layer_offset(ctx_dft, 2)   // head 47
       step2 [h47]  seq_rm(≥4); batch=[T_first@4, draft1@5, draft2@6],
                    embd=[h_tgt(3), h45(4), h46(5)]
                    → decode → write KV47@4,5,6; sample draft3@7
                    (draft length capped at n_mtp_layers = 3)

       spec_draft = [draft1, draft2, draft3]

  ── post-draft seq_rm (server, cross-round cleanup; unchanged) ──
       ckpt.pos_max = 3 (ctx_tgt only covers prompt @0..3 here; T_first not decoded yet)
       seq_rm(ctx_dft, slot.id, ≥4)  → every head's KV trimmed back to @0..3

  ① target batch (4 tokens, all logits=true):
       token = [T_first@4, draft1@5, draft2@6, draft3@7]

  ② target decode → verify_h = [h_tgt(4), h_tgt(5), h_tgt(6), h_tgt(7)] + logits[0..3]

  ③ process (once per head, logits=0; embd = h_tgt right-shift, NOT inter-head h):
       for head in {45, 46, 47}:
         seq_rm(ctx_dft, seq, ≥4)   // reset before each head; all heads rewrite from @4
         token = [T_first@4, draft1@5, draft2@6, draft3@7]
         embd  = [h_tgt(3),  h_tgt(4), h_tgt(5), h_tgt(6)]
                  ↑ pending_h           h_tgt(7) pushed out → pending_h default
         → rewrite KV_head@4..7 (teacher-forced, logits=0)

  ④ verify (sample from ctx_tgt logits, identical to the single-block path):
       logits[0] → real tok@5 == draft1  ✓
       logits[1] → real tok@6 == draft2  ✓
       logits[2] → real tok@7 != draft3  ✗ → resample T_new@7
       accepted = [draft1, draft2, T_new] ;  n_accepted = 2

  ⑤ accept (identical to the single-block path):
       pending_h    = verify_h[n_accepted] = verify_h[2] = h_tgt(6)
       slot.sampled = accepted.back()      = T_new (pos 7)
       invariant: id_last=T_new@7, pending_h=h_tgt(6)=h_tgt(7-1)  ✓

═══ Round 2: generation (id_last=T_new@7, pending_h=h_tgt(6)) ═══════════════
  ⑥ step0 [h45]  seq_rm(≥7); batch=[T_new@7],                  embd=[h_tgt(6)]
                 → write KV45@7; draft1'@8, h45(7)
     step1 [h46]  seq_rm(≥7); batch=[T_new@7, draft1'@8],       embd=[h_tgt(6), h45(7)]
                 → write KV46@7,8; draft2'@9, h46(8)
     step2 [h47]  seq_rm(≥7); batch=[T_new@7, draft1'@8, draft2'@9],
                 embd=[h_tgt(6), h45(7), h46(8)]
                 → write KV47@7,8,9; draft3'@10
     ── post-draft seq_rm(≥8); ① target batch; ② target decode; ③ process ×3
        (process resets seq_rm(≥7) per head, then rewrites @7..10)
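
The verify and accept steps (④ and ⑤) in the trace can be sketched as a small standalone function. This is a toy model under stated assumptions, not the PR's code: `VerifyResult` and `verify_and_accept` are hypothetical names, tokens are plain integers, and hidden states are tracked only by the position they belong to.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch only.
struct VerifyResult {
    std::vector<int> accepted;  // matched drafts followed by one target-sampled token
    std::size_t n_accepted;     // number of draft tokens that matched the target
    int id_last;                // token the next round starts from
    int id_last_pos;            // position of id_last
    int pending_h_pos;          // p such that pending_h = h_tgt(p)
};

// round_start is the position of the current id_last (first row of the target batch).
// target_tok[i] is the token sampled from target logits row i, i.e. the real
// token at position round_start + i + 1, so it has drafts.size() + 1 entries.
VerifyResult verify_and_accept(int round_start,
                               const std::vector<int> & drafts,
                               const std::vector<int> & target_tok) {
    assert(target_tok.size() == drafts.size() + 1);

    VerifyResult r{};
    std::size_t n = 0;
    while (n < drafts.size() && drafts[n] == target_tok[n]) {
        r.accepted.push_back(drafts[n]);
        ++n;
    }
    // First mismatch (or the extra row when every draft matched): take the target's token.
    r.accepted.push_back(target_tok[n]);

    r.n_accepted    = n;
    r.id_last       = r.accepted.back();
    r.id_last_pos   = round_start + static_cast<int>(n) + 1;
    r.pending_h_pos = round_start + static_cast<int>(n);  // verify_h[n] = h_tgt(round_start + n)
    return r;
}
```

With the Round 1 values (`round_start = 4`, two of three drafts matching), this gives `n_accepted = 2`, `id_last` at position 7 and `pending_h = h_tgt(6)`, which agrees with the invariant line in the trace.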

Core strategy
Each MTP head is its own decoder layer with its own KV, and the driver runs one head per llama_decode. A seq_rm before each head clears the range it re-decodes, so it reuses the same slots (find_slot is deterministic) instead of stacking duplicate positions; find_slot / apply_ubatch are untouched. The two phases differ only in what feeds the heads:

| phase | embd fed to each head | purpose |
| --- | --- | --- |
| `process()` | trunk `h_tgt`, right-shifted by one (not the inter-head hidden) | re-anchor each head's committed-prefix KV to the target's real hidden → next round starts target-aligned |
| `draft()` | the previous head's output, chained (slot 0 = trunk `pending_h`) | generate the draft tokens; each head rebuilds its own layer's KV on the growing prefix |

Only the trunk h_tgt crosses rounds, so pending_h / verify_h stay single-layer.
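
The right shift used by `process()` can be illustrated with a minimal sketch. It assumes hidden states are represented by string labels so the layout is visible; `process_embd` is a hypothetical helper name and not a function from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds the embd rows fed to every head in process():
//   row 0     = pending_h (carried over from the previous round)
//   row i > 0 = verify_h[i - 1]
// The last verify_h row is "pushed out" and becomes the default pending_h.
std::vector<std::string> process_embd(const std::string & pending_h,
                                      const std::vector<std::string> & verify_h,
                                      std::string & next_pending_h) {
    std::vector<std::string> embd;
    if (verify_h.empty()) {
        return embd;  // nothing was decoded, so there is nothing to mirror
    }
    embd.push_back(pending_h);
    for (std::size_t i = 0; i + 1 < verify_h.size(); ++i) {
        embd.push_back(verify_h[i]);
    }
    next_pending_h = verify_h.back();
    return embd;
}
```

Feeding it `pending_h = "h_tgt(3)"` and `verify_h = [h_tgt(4) .. h_tgt(7)]` reproduces the Round 1 `process` row of the trace: `embd = [h_tgt(3), h_tgt(4), h_tgt(5), h_tgt(6)]`, with `h_tgt(7)` left over as the default `pending_h`.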

MTP Block Selection Strategy

cparams.mtp_layer_offset (src/llama-cparams.h) — picks which appended MTP block the DECODER_MTP graph runs: il = n_layer() + offset. Default 0.
graph_mtp selects the head by offset (il = n_layer() + cparams.mtp_layer_offset, was a hardcoded n_layer()).
graph_mtp now gathers its output rows via build_inp_out_ids(), like the trunk graph. The fix that makes chaining work: from step 1 on, the output is the last batch row, not row 0, so without it heads 46/47 read the wrong row. Identity gather when n_outputs == n_tokens, so the single-head path is unchanged.
Loader now requires all n_layer_nextn MTP blocks.
n_max is clamped to the head count when chaining (each head used once).
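
As a rough illustration of the selection rules listed above, the sketch below computes the block index, the clamped draft length, and the output row for a chained step. `ToyHparams` and the three helper functions are hypothetical names for this example; the value 45 for the trunk layer count is inferred from the head numbering (45/46/47) used in the trace.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the model hyperparameters used in this sketch.
struct ToyHparams {
    int32_t n_layer;        // trunk layers (45 in the trace's numbering)
    int32_t n_layer_nextn;  // appended NextN/MTP blocks (3 in the trace)
};

// il = n_layer + offset: offset 0 selects the first appended MTP block.
int32_t nextn_block_index(const ToyHparams & hp, int32_t offset) {
    assert(offset >= 0 && offset < hp.n_layer_nextn);
    return hp.n_layer + offset;
}

// When chaining, each head is used once, so the draft length cannot exceed the head count.
int32_t clamp_draft_n_max(const ToyHparams & hp, int32_t n_max) {
    return std::min(n_max, hp.n_layer_nextn);
}

// In a chained step the batch holds the whole accumulated prefix,
// and the row to sample from is the last one, not row 0.
int32_t draft_output_row(int32_t n_tokens) {
    return n_tokens - 1;
}
```

For `ToyHparams{45, 3}`, offsets 0, 1 and 2 map to blocks 45, 46 and 47, and a requested `n_max` of 8 is clamped to 3.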

Results

./llama-server \
    -m Step-3.7-IQ4_XS.gguf \
    --spec-type draft-mtp \
    --spec-draft-model Step3.7-flash-mtp-Q8_0.gguf \
    -ngl all \
    --spec-draft-ngl all \
    -c 35000 \
    -np 1 \
    -b 2048 \
    -ub 1024 \
    --temp 0 \
    --spec-draft-n-max {n} \
    --spec-draft-p-min 0.0 \
    --host 127.0.0.1 \
    --port 8080

The command is identical on both machines; only --spec-draft-n-max and the build change. Before = single-block MTP on master (one head, looped when n-max > 1); after = the three-layer chain.

DGX Spark GB10

Before (single-block MTP, master)

--spec-draft-n-max 2

  code_python        pred= 192 draft= 164 acc= 108 rate=0.658 tok/s=30.8
  code_cpp           pred= 192 draft= 172 acc= 104 rate=0.605 tok/s=30.5
  explain_concept    pred= 192 draft= 171 acc= 105 rate=0.614 tok/s=30.8
  summarize          pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=33.2
  qa_factual         pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=36.6
  translation        pred= 192 draft= 173 acc= 104 rate=0.601 tok/s=30.8
  creative_short     pred= 192 draft= 189 acc=  95 rate=0.503 tok/s=28.1
  stepwise_math      pred= 192 draft= 156 acc= 113 rate=0.724 tok/s=34.6
  long_code_review   pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=31.7

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1492,
  "total_draft_accepted": 966,
  "aggregate_accept_rate": 0.6475,
  "wall_s_total": 60.19
}

--spec-draft-n-max 3

  code_python        pred= 192 draft= 233 acc= 112 rate=0.481 tok/s=28.8
  code_cpp           pred= 192 draft= 243 acc= 109 rate=0.449 tok/s=27.7
  explain_concept    pred= 192 draft= 252 acc= 106 rate=0.421 tok/s=26.3
  summarize          pred= 192 draft= 233 acc= 112 rate=0.481 tok/s=27.7
  qa_factual         pred= 192 draft= 196 acc= 125 rate=0.638 tok/s=30.5
  translation        pred= 192 draft= 242 acc= 109 rate=0.450 tok/s=25.7
  creative_short     pred= 192 draft= 271 acc=  99 rate=0.365 tok/s=24.9
  stepwise_math      pred= 192 draft= 226 acc= 115 rate=0.509 tok/s=30.0
  long_code_review   pred= 192 draft= 235 acc= 112 rate=0.477 tok/s=27.3

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 2131,
  "total_draft_accepted": 999,
  "aggregate_accept_rate": 0.4688,
  "wall_s_total": 70.1
}

After (three-layer MTP, this PR)

--spec-draft-n-max 2

  code_python        pred= 192 draft= 154 acc= 113 rate=0.734 tok/s=33.9
  code_cpp           pred= 192 draft= 156 acc= 112 rate=0.718 tok/s=32.6
  explain_concept    pred= 192 draft= 153 acc= 114 rate=0.745 tok/s=34.9
  summarize          pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=35.5
  qa_factual         pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=36.9
  translation        pred= 192 draft= 164 acc= 108 rate=0.658 tok/s=29.8
  creative_short     pred= 192 draft= 171 acc= 104 rate=0.608 tok/s=30.0
  stepwise_math      pred= 192 draft= 139 acc= 121 rate=0.871 tok/s=39.4
  long_code_review   pred= 192 draft= 143 acc= 118 rate=0.825 tok/s=36.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1376,
  "total_draft_accepted": 1022,
  "aggregate_accept_rate": 0.7427,
  "wall_s_total": 56.95
}

--spec-draft-n-max 3

  code_python        pred= 192 draft= 188 acc= 128 rate=0.681 tok/s=35.1
  code_cpp           pred= 192 draft= 213 acc= 119 rate=0.559 tok/s=31.4
  explain_concept    pred= 192 draft= 208 acc= 121 rate=0.582 tok/s=32.2
  summarize          pred= 192 draft= 203 acc= 122 rate=0.601 tok/s=33.0
  qa_factual         pred= 192 draft= 181 acc= 130 rate=0.718 tok/s=38.0
  translation        pred= 192 draft= 198 acc= 124 rate=0.626 tok/s=34.7
  creative_short     pred= 192 draft= 244 acc= 108 rate=0.443 tok/s=27.4
  stepwise_math      pred= 192 draft= 170 acc= 134 rate=0.788 tok/s=40.6
  long_code_review   pred= 192 draft= 199 acc= 124 rate=0.623 tok/s=33.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1804,
  "total_draft_accepted": 1110,
  "aggregate_accept_rate": 0.6153,
  "wall_s_total": 56.57
}

Mac Studio M4 Max

Before (single-block MTP, master)

--spec-draft-n-max 2

  code_python        pred= 192 draft= 162 acc= 110 rate=0.679 tok/s=42.6
  code_cpp           pred= 192 draft= 179 acc= 101 rate=0.564 tok/s=38.4
  explain_concept    pred= 192 draft= 171 acc= 105 rate=0.614 tok/s=40.2
  summarize          pred= 192 draft= 162 acc= 110 rate=0.679 tok/s=42.5
  qa_factual         pred= 161 draft= 122 acc= 101 rate=0.828 tok/s=47.4
  translation        pred= 192 draft= 171 acc= 104 rate=0.608 tok/s=40.0
  creative_short     pred= 192 draft= 188 acc=  97 rate=0.516 tok/s=36.7
  stepwise_math      pred= 192 draft= 155 acc= 113 rate=0.729 tok/s=44.3
  long_code_review   pred= 192 draft= 165 acc= 108 rate=0.654 tok/s=41.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1697,
  "total_draft": 1475,
  "total_draft_accepted": 949,
  "aggregate_accept_rate": 0.6434,
  "wall_s_total": 46.38
}

--spec-draft-n-max 3

  code_python        pred= 192 draft= 233 acc= 113 rate=0.485 tok/s=33.7
  code_cpp           pred= 192 draft= 259 acc= 104 rate=0.402 tok/s=30.4
  explain_concept    pred= 192 draft= 249 acc= 107 rate=0.430 tok/s=31.7
  summarize          pred= 192 draft= 226 acc= 115 rate=0.509 tok/s=34.7
  qa_factual         pred= 161 draft= 165 acc= 105 rate=0.636 tok/s=39.9
  translation        pred= 192 draft= 244 acc= 109 rate=0.447 tok/s=32.3
  creative_short     pred= 192 draft= 278 acc=  98 rate=0.352 tok/s=28.4
  stepwise_math      pred= 192 draft= 220 acc= 117 rate=0.532 tok/s=35.8
  long_code_review   pred= 192 draft= 232 acc= 113 rate=0.487 tok/s=33.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1697,
  "total_draft": 2106,
  "total_draft_accepted": 981,
  "aggregate_accept_rate": 0.4658,
  "wall_s_total": 56.7
}

After (three-layer MTP, this PR)

--spec-draft-n-max 2

  code_python        pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=45.3
  code_cpp           pred= 192 draft= 165 acc= 108 rate=0.654 tok/s=41.2
  explain_concept    pred= 192 draft= 155 acc= 113 rate=0.729 tok/s=43.8
  summarize          pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=45.0
  qa_factual         pred= 161 draft= 114 acc= 104 rate=0.912 tok/s=49.7
  translation        pred= 192 draft= 153 acc= 113 rate=0.739 tok/s=43.8
  creative_short     pred= 192 draft= 172 acc= 104 rate=0.605 tok/s=39.1
  stepwise_math      pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=46.8
  long_code_review   pred= 192 draft= 153 acc= 114 rate=0.745 tok/s=43.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1697,
  "total_draft": 1356,
  "total_draft_accepted": 1007,
  "aggregate_accept_rate": 0.7426,
  "wall_s_total": 43.83
}

--spec-draft-n-max 3

  code_python        pred= 192 draft= 188 acc= 128 rate=0.681 tok/s=40.8
  code_cpp           pred= 192 draft= 205 acc= 122 rate=0.595 tok/s=37.2
  explain_concept    pred= 192 draft= 207 acc= 121 rate=0.585 tok/s=36.8
  summarize          pred= 192 draft= 194 acc= 126 rate=0.649 tok/s=39.4
  qa_factual         pred= 161 draft= 141 acc= 114 rate=0.808 tok/s=45.6
  translation        pred= 192 draft= 199 acc= 124 rate=0.623 tok/s=38.3
  creative_short     pred= 192 draft= 242 acc= 109 rate=0.450 tok/s=31.5
  stepwise_math      pred= 192 draft= 170 acc= 134 rate=0.788 tok/s=44.9
  long_code_review   pred= 192 draft= 208 acc= 121 rate=0.582 tok/s=36.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1697,
  "total_draft": 1754,
  "total_draft_accepted": 1099,
  "aggregate_accept_rate": 0.6266,
  "wall_s_total": 49.38
}
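
The aggregate blocks above can be cross-checked with two small helpers. This is only a sketch of the arithmetic: `accept_rate` and `wall_ratio` are hypothetical names, and the benchmark script that produced the logs is not shown in this PR.

```cpp
#include <cassert>
#include <cmath>

// aggregate_accept_rate = total_draft_accepted / total_draft
double accept_rate(int total_draft_accepted, int total_draft) {
    return total_draft > 0 ? static_cast<double>(total_draft_accepted) / total_draft : 0.0;
}

// Ratio of wall-clock totals; a value above 1 means the "after" build was faster.
double wall_ratio(double before_s, double after_s) {
    return before_s / after_s;
}
```

For the DGX Spark run with `--spec-draft-n-max 3`, `accept_rate(1110, 1804)` reproduces the reported 0.6153, and `wall_ratio(70.1, 56.57)` comes out at roughly 1.24, meaning the chained build finished the nine requests in about 81% of the wall time.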

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: CC for code and tests, but the core data flow was designed by a human

pwilkin · 2026-06-09T10:54:59Z

I think we want at least @ggerganov and @am17an here for the discussion about how to solve multi-layer MTP in core.

forforever73 · 2026-06-09T11:10:04Z

Yes, I propose an initial approach here. I think it's semantically correct while keeping the changes relatively small.

pwilkin · 2026-06-09T12:23:25Z

Yeah, I purposefully wanted the original StepFun MTP PR to be small because I did a little foray into implementing the full MTP and then saw it would be quite a challenging task. I think it's good to discuss this :)

forforever73 · 2026-06-11T07:16:49Z

@ggerganov @am17an Would you have some time to take a look?

am17an · 2026-06-13T07:41:31Z

From what I understand, this can be achieved if we fix the draft length (via --spec-draft-n-max/min to be the same and --spec-draft-n-min 0) and we pass some state in the ctx_dft for selecting which nextn layer to use

forforever73 · 2026-06-13T08:40:05Z

@am17an Yeah, I guess the ctx_dft state you mean is the mtp_layer_offset I added. Getting 3-layer MTP to run is the easy part; the rest of the diff is all about keeping the KV cache correct.

Unlike gemma4 (all nextn layers in one graph, shared target KV, single position), step35's heads are chained with a sample in between and run on their own KV. So I can't just bump the layer per step on the single-token loop, or head 46 attends back to a cell that only head 45 ever wrote on its layer and reads garbage. That's why each step has to seq_rm the round and re-decode the accumulated prefix on the current head's layer.

For the correct semantics, vLLM's mtp3 implementation can serve as a reference. I think it can be done more simply under llama.cpp's architecture.

am17an · 2026-06-13T08:49:51Z

I think you can still optionally add the llama_memory_seq_rm for this architecture after every decode step if I understand correctly.

forforever73 · 2026-06-13T08:57:36Z

Right, seq_rm is part of it. Each step also needs to re-decode the whole accumulated prefix [id_last, draft_1, …] on the current head's layer, so that the head writes its own layer's KV for every position.
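
The point about per-layer KV can be made concrete with a toy model that only records which positions each layer has written. This is an illustrative simplification, not how llama.cpp stores its cache: `Written`, `write_kv` and `prefix_complete` are hypothetical names, and layers 45/46 follow the head numbering from the PR description.

```cpp
#include <cassert>
#include <map>
#include <set>

// layer index -> set of positions for which that layer has written K/V
using Written = std::map<int, std::set<int>>;

// Record that `layer` decoded n consecutive positions starting at p0.
void write_kv(Written & w, int layer, int p0, int n) {
    for (int i = 0; i < n; ++i) {
        w[layer].insert(p0 + i);
    }
}

// True if `layer` has written K/V for every position in [0, last_pos].
bool prefix_complete(const Written & w, int layer, int last_pos) {
    auto it = w.find(layer);
    if (it == w.end()) {
        return false;
    }
    for (int p = 0; p <= last_pos; ++p) {
        if (it->second.count(p) == 0) {
            return false;
        }
    }
    return true;
}
```

In this model, a single-token loop that lets head 45 decode position 4 and head 46 decode only position 5 leaves layer 46 with no entry for position 4, so `prefix_complete(w, 46, 5)` is false. Re-decoding positions 4 and 5 on layer 46 closes the gap.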

am17an · 2026-06-13T09:24:40Z

I see, so you have to keep the draft_tokens and embeddings to copy them in each subsequent draft round. I think you can keep these two vectors host side and add them while rebuilding the batch. As an aside, this would not be able to use CUDA graphs as the topology for the draft will keep changing (i.e. batch size goes from 1 to 2 to 3 etc).

forforever73 · 2026-06-13T09:50:55Z

> I think you can keep these two vectors host side and add them while rebuilding the batch

Yep, that's exactly what the current implementation does.

> this would not be able to use CUDA graphs as the topology for the draft will keep changing

I'm aware. But each head needs the token the previous head sampled, so it is hard to avoid. And the perf cost is contained: the draft is at most n_nextn tiny decodes.
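
A minimal sketch of the host-side accumulation discussed above, assuming a callback stands in for the per-head decode. `DraftStep`, `HeadDecode` and `draft_chain` are hypothetical names; hidden states are string labels, and the real driver would also call seq_rm on the draft context before each step.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Snapshot of what one draft step submits (for inspection in this sketch).
struct DraftStep {
    int head_offset;
    std::vector<int> tokens;
    std::vector<std::string> embd;
};

// Stand-in for "decode on head k": returns {sampled token, hidden of the last row}.
using HeadDecode = std::function<std::pair<int, std::string>(
    int, const std::vector<int> &, const std::vector<std::string> &)>;

std::vector<DraftStep> draft_chain(int id_last, const std::string & pending_h, int n_heads,
                                   const HeadDecode & decode, std::vector<int> & drafts) {
    // Host-side vectors that grow by one entry per head.
    std::vector<int> tokens{id_last};
    std::vector<std::string> embd{pending_h};

    std::vector<DraftStep> log;
    for (int k = 0; k < n_heads; ++k) {
        // The batch for head k is the whole accumulated prefix: k + 1 rows.
        log.push_back({k, tokens, embd});
        std::pair<int, std::string> out = decode(k, tokens, embd);
        drafts.push_back(out.first);
        tokens.push_back(out.first);   // next head also sees this draft token
        embd.push_back(out.second);    // paired with this head's output hidden
    }
    return log;
}
```

With three heads the submitted batches have 1, 2 and 3 rows, which is the changing topology that rules out CUDA graph reuse for the draft steps.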

am17an · 2026-06-13T10:13:33Z

Yes but I think the current implementation can be simplified.

am17an · 2026-06-13T13:37:17Z

+    // Each slot's embd is the hidden produced by the PREVIOUS head for that token
+    // (slot 0 is always pending_h = trunk h). Per-step seq_rm keeps each head's KV
+    // on a clean, position-aligned slot set.
+    void draft_multi_head(common_speculative_draft_params_vec & dparams) {


This should be part of draft rather than a separate function. The only difference is that you need to add the embd + token of the last sampled head to the batch.

right, have merged

am17an

You can clean up the comments a bit to follow the rest of the repo. If the code is self-explanatory, we prefer not to add comments in cpp files (in .h files they are encouraged). If something is a bit non-intuitive (like seq_rm in this PR), then it makes sense to add a comment to explain. You should also check MTP performance/correctness of Qwen3.6 and Gemma4.

am17an · 2026-06-14T06:01:04Z

    LLAMA_API int32_t llama_model_n_embd_inp (const struct llama_model * model);
    LLAMA_API int32_t llama_model_n_embd_out (const struct llama_model * model);
    LLAMA_API int32_t llama_model_n_layer    (const struct llama_model * model);
+    // Number of appended NextN/MTP prediction blocks (0 if the model has none)


Suggested change

// Number of appended NextN/MTP prediction blocks (0 if the model has none)

Also need to fix the alignment

am17an · 2026-06-14T06:02:04Z

+    // MTP (multi-token prediction): which appended NextN/MTP block the
+    // DECODER_MTP graph runs, as an offset past the trunk (il = n_layer() + offset).
+    // 0 selects the first MTP head; the speculative driver bumps it per draft step.
+    int32_t  mtp_layer_offset = 0;


Replace occurrences of "mtp" with "nextn" to make it more consistent.

am17an · 2026-06-14T06:02:15Z

    void set_embeddings (bool value);
    void set_embeddings_nextn(bool value, bool masked);
    void set_embeddings_layer_inp(uint32_t lid, bool enable);
+    void set_mtp_layer_offset(int32_t offset);


Suggested change

-    void set_mtp_layer_offset(int32_t offset);
+    void set_nextn_layer_offset(int32_t offset);

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

forforever73 · 2026-06-14T09:42:18Z

Tested MTP performance/correctness of Qwen3.6 and Gemma4 on an H800 with the following parameters:

CA = -ngl 999 -fa on -ctk bf16 -ctv bf16 --no-mmap -c 8192 -b 4096 -ub 2048 -np 1 -t 32 --jinja
# Qwen3.6 MTP
<bin> -m Qwen3.6-27B-MTP-Q8_0.gguf $CA --spec-type draft-mtp --spec-draft-n-max 3
# Gemma4 MTP
<bin> -m Gemma4-31B-Q8_0.gguf -md mtp-gemma-4-31B-it.gguf -ngld 999 $CA --spec-type draft-mtp --spec-draft-n-max 3

📜 performance of Qwen3.6 and Gemma4.

Qwen3.6 master

  code_python        pred= 192 draft= 163 acc= 136 rate=0.834 tok/s=135.4
  code_cpp           pred= 192 draft= 181 acc= 130 rate=0.718 tok/s=123.1
  explain_concept    pred= 192 draft= 181 acc= 130 rate=0.718 tok/s=122.9
  summarize          pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=128.8
  qa_factual         pred= 192 draft= 176 acc= 132 rate=0.750 tok/s=127.0
  translation        pred= 192 draft= 174 acc= 132 rate=0.759 tok/s=127.0
  creative_short     pred= 192 draft= 198 acc= 125 rate=0.631 tok/s=113.9
  stepwise_math      pred= 192 draft= 168 acc= 134 rate=0.798 tok/s=131.4
  long_code_review   pred= 192 draft= 195 acc= 126 rate=0.646 tok/s=114.2
  AGG  accept=0.733 draft=1607 acc=1178 pred=1728 wall=17.13s

Qwen3.6 new

  code_python        pred= 192 draft= 163 acc= 136 rate=0.834 tok/s=136.5
  code_cpp           pred= 192 draft= 181 acc= 130 rate=0.718 tok/s=123.7
  explain_concept    pred= 192 draft= 181 acc= 130 rate=0.718 tok/s=123.2
  summarize          pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=129.4
  qa_factual         pred= 192 draft= 176 acc= 132 rate=0.750 tok/s=127.4
  translation        pred= 192 draft= 174 acc= 132 rate=0.759 tok/s=128.0
  creative_short     pred= 192 draft= 198 acc= 125 rate=0.631 tok/s=114.4
  stepwise_math      pred= 192 draft= 168 acc= 134 rate=0.798 tok/s=132.2
  long_code_review   pred= 192 draft= 195 acc= 126 rate=0.646 tok/s=115.2
  AGG  accept=0.733 draft=1607 acc=1178 pred=1728 wall=17.13s

Gemma4 master

  code_python        pred= 192 draft= 166 acc= 135 rate=0.813 tok/s=127.0
  code_cpp           pred= 192 draft= 168 acc= 135 rate=0.804 tok/s=126.7
  explain_concept    pred= 192 draft= 207 acc= 121 rate=0.585 tok/s=102.9
  summarize          pred= 192 draft= 167 acc= 135 rate=0.808 tok/s=126.8
  qa_factual         pred= 192 draft= 191 acc= 127 rate=0.665 tok/s=112.0
  translation        pred= 192 draft= 163 acc= 136 rate=0.834 tok/s=129.1
  creative_short     pred= 192 draft= 223 acc= 115 rate=0.516 tok/s=95.3
  stepwise_math      pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=118.9
  long_code_review   pred= 192 draft= 189 acc= 127 rate=0.672 tok/s=104.9
  AGG  accept=0.7034 draft=1652 acc=1162 pred=1728 wall=17.27s

Gemma4 new

  code_python        pred= 192 draft= 166 acc= 135 rate=0.813 tok/s=127.5
  code_cpp           pred= 192 draft= 168 acc= 135 rate=0.804 tok/s=127.1
  explain_concept    pred= 192 draft= 207 acc= 121 rate=0.585 tok/s=103.3
  summarize          pred= 192 draft= 167 acc= 135 rate=0.808 tok/s=127.3
  qa_factual         pred= 192 draft= 191 acc= 127 rate=0.665 tok/s=112.4
  translation        pred= 192 draft= 163 acc= 136 rate=0.834 tok/s=129.4
  creative_short     pred= 192 draft= 223 acc= 115 rate=0.516 tok/s=95.5
  stepwise_math      pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=119.2
  long_code_review   pred= 192 draft= 189 acc= 127 rate=0.672 tok/s=105.2
  AGG  accept=0.7034 draft=1652 acc=1162 pred=1728 wall=17.21s

📜 correctness of Qwen3.6 and Gemma4.

Qwen3.6 master

{
  "results": [
    {
      "name": "code_python",
      "text": "Here's a thinking process:\n\n1.  **Understand User Request:**\n   - **Task:** Write a Python function to return the n-th Fibonacci number.\n   - **Requirement:** Use memoization.\n   - **Requirement:** Include a docstring.\n\n2.  **Define Fibonacci Sequence:**\n   - F(0) = 0\n   - F(1) = 1\n   - F(n) = F(n-1) + F(n-2) for n >= 2\n   - Note: Sometimes F(1)=1, F(2)=1 is used, but\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "code_cpp",
      "text": "Here's a thinking process:\n\n1.  **Understand the User Request:**\n   - **Goal:** Write a C++ template function `clamp(x, lo, hi)`\n   - **Behavior:** Returns `x` clamped to the range `[lo, hi]` (i.e., if `x < lo` return `lo`, if `x > hi` return `hi`, else return `x`)\n   - **Constraint:** Do NOT use `std::clamp`\n   - **Language:** C++\n\n2.  **Identify Key Requirements for a Template Function:**\n   - Needs to\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "explain_concept",
      "text": "Here's a thinking process:\n\n1.  **Analyze User Input:**\n   - **Topic:** Speculative decoding in large language model (LLM) inference\n   - **Format:** Three short paragraphs\n   - **Goal:** Explain how it works clearly and concisely\n\n2.  **Deconstruct Speculative Decoding:**\n   - What is it? A technique to speed up LLM inference.\n   - Core idea: Use a smaller, faster \"draft\" model to propose multiple tokens, then verify them with the larger, slower \"target\" model in parallel.\n   - Process: \n    \n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "summarize",
      "text": "Here's a thinking process:\n\n1.  **Analyze User Input:**\n   - **Input Text:** \"The Industrial Revolution began in Britain in the late 18th century, transforming manufacturing through mechanization, steam power, and the factory system. It spread to continental Europe and North America during the 19th century.\"\n   - **Task:** Summarize in exactly two sentences.\n\n2.  **Identify Key Information:**\n   - Origin: Britain, late 18th century\n   - Core changes: Mechanization, steam power, factory system transformed manufacturing\n   - Spread: Continental\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "qa_factual",
      "text": "Here's a thinking process:\n\n1.  **Analyze User Input:**\n   - **Question:** \"What are the four fundamental forces of physics?\"\n   - **Expected Answer:** A clear, concise list of the four fundamental forces with brief explanations if needed.\n\n2.  **Identify Key Concepts:**\n   - The four fundamental forces in physics are:\n     1. Gravitational force\n     2. Electromagnetic force\n     3. Strong nuclear force (or strong interaction)\n     4. Weak nuclear force (or weak interaction)\n\n3.  **Verify Accuracy:**\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "translation",
      "text": "Here's a thinking process:\n\n1.  **Analyze the Input:**\n   - Source text: \"The quick brown fox jumps over the lazy dog.\"\n   - Target language: French\n   - Note: This is a famous pangram (contains every letter of the alphabet in English). In French, the equivalent pangram is \"Portez ce vieux whisky au juge blond qui fume.\" But the user specifically asked to translate the given English sentence, not to provide the French pangram equivalent.\n\n2.  **Identify Key Components:**\n   - The quick brown fox -> Le renard brun rapide / Le\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "creative_short",
      "text": "Thinking Process:\n\n1.  **Deconstruct the request:**\n    *   Topic: Old lighthouse.\n    *   Format: Four-line poem (quatrain).\n\n2.  **Brainstorming imagery and themes:**\n    *   *Old:* Weathered, stone, salt, time, standing guard, silent, gray, moss, cracks.\n    *   *Lighthouse:* Light, beam, sea, waves, ships, guiding, warning, tower, lens, rotation.\n    *   *Action:* Standing, watching, shining, enduring.\n\n3.  **Drafting - Attempt \n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "stepwise_math",
      "text": "Here's a thinking process:\n\n1.  **Understand the User's Request:**\n   - Problem: A train leaves station A at 60 km/h. Two hours later, a second train leaves the same station on the same track at 90 km/h. How long until the second train catches the first?\n   - Requirement: Solve step by step.\n\n2.  **Identify Key Variables & Concepts:**\n   - Train 1 speed: $v_1 = 60$ km/h\n   - Train 2 speed: $v_2 = 90$ km/h\n\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "long_code_review",
      "text": "Here's a thinking process:\n\n1.  **Understand User Request:**\n   - **Context:** Backend service with intermittent latency spikes in production.\n   - **Tech Stack:** Python FastAPI, Postgres 15, 4 replicas behind nginx load balancer.\n   - **Input:** The user mentions \"Below is the relevant code and a description of the system\" but *didn't actually provide any code*. I need to handle this gracefully.\n   - **Output Requirements:** Structured review with three sections:\n     1. Likely root causes ranked by probability\n     2. Concrete code/configuration\n====CONTENT====\n",
      "n": 128
    }
  ]
}

Qwen3.6 new

{
  "results": [
    {
      "name": "code_python",
      "text": "Here's a thinking process:\n\n1.  **Understand User Request:**\n   - **Task:** Write a Python function to return the n-th Fibonacci number.\n   - **Requirement:** Use memoization.\n   - **Requirement:** Include a docstring.\n\n2.  **Define Fibonacci Sequence:**\n   - F(0) = 0\n   - F(1) = 1\n   - F(n) = F(n-1) + F(n-2) for n >= 2\n   - Note: Sometimes F(1)=1, F(2)=1 is used, but\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "code_cpp",
      "text": "Here's a thinking process:\n\n1.  **Understand the User Request:**\n   - **Goal:** Write a C++ template function `clamp(x, lo, hi)`\n   - **Behavior:** Returns `x` clamped to the range `[lo, hi]` (i.e., if `x < lo` return `lo`, if `x > hi` return `hi`, else return `x`)\n   - **Constraint:** Do NOT use `std::clamp`\n   - **Language:** C++\n\n2.  **Identify Key Requirements for a Template Function:**\n   - Needs to\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "explain_concept",
      "text": "Here's a thinking process:\n\n1.  **Analyze User Input:**\n   - **Topic:** Speculative decoding in large language model (LLM) inference\n   - **Format:** Three short paragraphs\n   - **Goal:** Explain how it works clearly and concisely\n\n2.  **Deconstruct Speculative Decoding:**\n   - What is it? A technique to speed up LLM inference.\n   - Core idea: Use a smaller, faster \"draft\" model to propose multiple tokens, then verify them with the larger, slower \"target\" model in parallel.\n   - Process: \n    \n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "summarize",
      "text": "Here's a thinking process:\n\n1.  **Analyze User Input:**\n   - **Input Text:** \"The Industrial Revolution began in Britain in the late 18th century, transforming manufacturing through mechanization, steam power, and the factory system. It spread to continental Europe and North America during the 19th century.\"\n   - **Task:** Summarize in exactly two sentences.\n\n2.  **Identify Key Information:**\n   - Origin: Britain, late 18th century\n   - Core changes: Mechanization, steam power, factory system transformed manufacturing\n   - Spread: Continental\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "qa_factual",
      "text": "Here's a thinking process:\n\n1.  **Analyze User Input:**\n   - **Question:** \"What are the four fundamental forces of physics?\"\n   - **Expected Answer:** A clear, concise list of the four fundamental forces with brief explanations if needed.\n\n2.  **Identify Key Concepts:**\n   - The four fundamental forces in physics are:\n     1. Gravitational force\n     2. Electromagnetic force\n     3. Strong nuclear force (or strong interaction)\n     4. Weak nuclear force (or weak interaction)\n\n3.  **Verify Accuracy:**\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "translation",
      "text": "Here's a thinking process:\n\n1.  **Analyze the Input:**\n   - Source text: \"The quick brown fox jumps over the lazy dog.\"\n   - Target language: French\n   - Note: This is a famous pangram (contains every letter of the alphabet in English). In French, the equivalent pangram is \"Portez ce vieux whisky au juge blond qui fume.\" But the user specifically asked to translate the given English sentence, not to provide the French pangram equivalent.\n\n2.  **Identify Key Components:**\n   - The quick brown fox -> Le renard brun rapide / Le\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "creative_short",
      "text": "Thinking Process:\n\n1.  **Deconstruct the request:**\n    *   Topic: Old lighthouse.\n    *   Format: Four-line poem (quatrain).\n\n2.  **Brainstorming imagery and themes:**\n    *   *Old:* Weathered, stone, salt, time, standing guard, silent, gray, moss, cracks.\n    *   *Lighthouse:* Light, beam, sea, waves, ships, guiding, warning, tower, lens, rotation.\n    *   *Action:* Standing, watching, shining, enduring.\n\n3.  **Drafting - Attempt \n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "stepwise_math",
      "text": "Here's a thinking process:\n\n1.  **Understand the User's Request:**\n   - Problem: A train leaves station A at 60 km/h. Two hours later, a second train leaves the same station on the same track at 90 km/h. How long until the second train catches the first?\n   - Requirement: Solve step by step.\n\n2.  **Identify Key Variables & Concepts:**\n   - Train 1 speed: $v_1 = 60$ km/h\n   - Train 2 speed: $v_2 = 90$ km/h\n\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "long_code_review",
      "text": "Here's a thinking process:\n\n1.  **Understand User Request:**\n   - **Context:** Backend service with intermittent latency spikes in production.\n   - **Tech Stack:** Python FastAPI, Postgres 15, 4 replicas behind nginx load balancer.\n   - **Input:** The user mentions \"Below is the relevant code and a description of the system\" but *didn't actually provide any code*. I need to handle this gracefully.\n   - **Output Requirements:** Structured review with three sections:\n     1. Likely root causes ranked by probability\n     2. Concrete code/configuration\n====CONTENT====\n",
      "n": 128
    }
  ]
}

Gemma4 master

{
  "results": [
    {
      "name": "code_python",
      "text": "The objective is to write a Python function that calculates the $n$-th Fibonacci number using *memoization*.\n\n    *   $F(0) = 0$\n    *   $F(1) = 1$\n    *   $F(n) = F(n-1) + F(n-2)$ for $n > 1$.\n\n    *   Standard recursion is $O(2^n)$ because it recalculates the same values many times.\n    *   Memoization stores the results of expensive function calls and returns the cached result when the same inputs occur\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "code_cpp",
      "text": "*   Goal: Write a C++ template function `clamp(x, lo, hi)`.\n    *   Functionality: Return `x` if it's between `lo` and `hi` (inclusive). If `x < lo`, return `lo`. If `x > hi`, return `hi`.\n    *   Constraint: Do not use `std::clamp` (which was introduced in C++17).\n    *   Requirement: Must be a template function to handle different numeric types.\n\n    *   Input: `x` (value to clamp), `lo` (\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "explain_concept",
      "text": "*   Topic: Speculative Decoding in LLM inference.\n*   Constraint 1: Explain how it works.\n*   Constraint 2: Three short paragraphs.\n\n    *   *What is it?* A technique to speed up LLM generation.\n    *   *Why is it needed?* LLMs are memory-bound; generating one token at a time is slow because the whole model must be loaded for every single token.\n    *   *How does it work?* Use a small, fast \"draft\" model to predict several tokens ahead, then use the large \"target\"\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "summarize",
      "text": "*   Input text: \"The Industrial Revolution began in Britain in the late 18th century, transforming manufacturing through mechanization, steam power, and the factory system. It spread to continental Europe and North America during the 19th century.\"\n    *   Constraint: Summarize in exactly two sentences.\n\n    *   Origin: Britain, late 18th century.\n    *   Key changes: Mechanization, steam power, factory system (manufacturing transformation).\n    *   Expansion: Continental Europe and North America, 19th century.\n\n    *   *Draft 1:*\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "qa_factual",
      "text": "*   Question: \"What are the four fundamental forces of physics?\"\n    *   Goal: Provide a clear, accurate, and concise answer.\n\n    *   Gravity (Gravitational force)\n    *   Electromagnetism (Electromagnetic force)\n    *   Strong Nuclear Force (Strong interaction)\n    *   Weak Nuclear Force (Weak interaction)\n\n    *   *Gravity:* Attracts masses. Long range. Weakest.\n    *   *Electromagnetism:* Attracts/repels charges. Long range. Stronger than gravity.\n    *   *Strong Nuclear Force\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "translation",
      "text": "*   Source sentence: \"The quick brown fox jumps over the lazy dog.\"\n    *   Target language: French.\n    *   Context: This is a famous English pangram (contains every letter of the alphabet).\n\n    *   The: Le (masculine) / La (feminine)\n    *   quick: rapide (adj)\n    *   brown: brun / marron (adj)\n    *   fox: renard (masculine noun)\n    *   jumps: saute (verb *sauter*)\n    *   over: par-dessus\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "creative_short",
      "text": "*   Topic: An old lighthouse.\n    *   Format: Four-line poem (quatrain).\n\n    *   Lighthouse: beam, light, tower, stone, salt, spray, ocean, waves, guide, warning, rust, weathered, lonely, sentinel, night, storm.\n    *   Old: crumbling, ancient, faded, tired, timeless, forgotten.\n\n    A tower of stone by the crashing sea,\n    Guiding the ships so they can be free.\n    Its light is dim but it still glows bright,\n    Watching the ocean through the dark night.\n\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "stepwise_math",
      "text": "*   Train 1 (T1) speed: $60\\text{ km/h}$.\n    *   Train 2 (T2) speed: $90\\text{ km/h}$.\n    *   T2 starts $2\\text{ hours}$ after T1.\n    *   Goal: Find the time it takes for T2 to catch T1.\n\n    *   T1 has a head start of $2\\text{ hours}$.\n    *   Distance = $\\text{Speed} \\times \\text{Time}$.\n    *   Distance T1 traveled\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "long_code_review",
      "text": "Backend service with intermittent latency spikes.\nPython FastAPI, Postgres 15, 4 replicas, Nginx load balancer.\nStructured review with three sections:\n        1.  Likely root causes (ranked by probability).\n        2.  Concrete code/config changes.\n        3.  Telemetry to add.\nThe user *hasn't provided the code yet*.\n\n    *   Wait, the prompt says \"Below is the relevant code,\" but there is no code provided.\n    *   *Self-Correction:* I cannot analyze specific code that isn't there. However, as\n====CONTENT====\n",
      "n": 128
    }
  ]
}

Gemma4 new

{
  "results": [
    {
      "name": "code_python",
      "text": "The objective is to write a Python function that calculates the $n$-th Fibonacci number using *memoization*.\n\n    *   $F(0) = 0$\n    *   $F(1) = 1$\n    *   $F(n) = F(n-1) + F(n-2)$ for $n > 1$.\n\n    *   Standard recursion is $O(2^n)$ because it recalculates the same values many times.\n    *   Memoization stores the results of expensive function calls and returns the cached result when the same inputs occur\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "code_cpp",
      "text": "*   Goal: Write a C++ template function `clamp(x, lo, hi)`.\n    *   Functionality: Return `x` if it's between `lo` and `hi` (inclusive). If `x < lo`, return `lo`. If `x > hi`, return `hi`.\n    *   Constraint: Do not use `std::clamp` (which was introduced in C++17).\n    *   Requirement: Must be a template function to handle different numeric types.\n\n    *   Input: `x` (value to clamp), `lo` (\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "explain_concept",
      "text": "*   Topic: Speculative Decoding in LLM inference.\n*   Constraint 1: Explain how it works.\n*   Constraint 2: Three short paragraphs.\n\n    *   *What is it?* A technique to speed up LLM generation.\n    *   *Why is it needed?* LLMs are memory-bound; generating one token at a time is slow because the whole model must be loaded for every single token.\n    *   *How does it work?* Use a small, fast \"draft\" model to predict several tokens ahead, then use the large \"target\"\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "summarize",
      "text": "*   Input text: \"The Industrial Revolution began in Britain in the late 18th century, transforming manufacturing through mechanization, steam power, and the factory system. It spread to continental Europe and North America during the 19th century.\"\n    *   Constraint: Summarize in exactly two sentences.\n\n    *   Origin: Britain, late 18th century.\n    *   Key changes: Mechanization, steam power, factory system (manufacturing transformation).\n    *   Expansion: Continental Europe and North America, 19th century.\n\n    *   *Draft 1:*\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "qa_factual",
      "text": "*   Question: \"What are the four fundamental forces of physics?\"\n    *   Goal: Provide a clear, accurate, and concise answer.\n\n    *   Gravity (Gravitational force)\n    *   Electromagnetism (Electromagnetic force)\n    *   Strong Nuclear Force (Strong interaction)\n    *   Weak Nuclear Force (Weak interaction)\n\n    *   *Gravity:* Attracts masses. Long range. Weakest.\n    *   *Electromagnetism:* Attracts/repels charges. Long range. Stronger than gravity.\n    *   *Strong Nuclear Force\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "translation",
      "text": "*   Source sentence: \"The quick brown fox jumps over the lazy dog.\"\n    *   Target language: French.\n    *   Context: This is a famous English pangram (contains every letter of the alphabet).\n\n    *   The: Le (masculine) / La (feminine)\n    *   quick: rapide (adj)\n    *   brown: brun / marron (adj)\n    *   fox: renard (masculine noun)\n    *   jumps: saute (verb *sauter*)\n    *   over: par-dessus\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "creative_short",
      "text": "*   Topic: An old lighthouse.\n    *   Format: Four-line poem (quatrain).\n\n    *   Lighthouse: beam, light, tower, stone, salt, spray, ocean, waves, guide, warning, rust, weathered, lonely, sentinel, night, storm.\n    *   Old: crumbling, ancient, faded, tired, timeless, forgotten.\n\n    A tower of stone by the crashing sea,\n    Guiding the ships so they can be free.\n    Its light is dim but it still glows bright,\n    Watching the ocean through the dark night.\n\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "stepwise_math",
      "text": "*   Train 1 (T1) speed: $60\\text{ km/h}$.\n    *   Train 2 (T2) speed: $90\\text{ km/h}$.\n    *   T2 starts $2\\text{ hours}$ after T1.\n    *   Goal: Find the time it takes for T2 to catch T1.\n\n    *   T1 has a head start of $2\\text{ hours}$.\n    *   Distance = $\\text{Speed} \\times \\text{Time}$.\n    *   Distance T1 traveled\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "long_code_review",
      "text": "Backend service with intermittent latency spikes.\nPython FastAPI, Postgres 15, 4 replicas, Nginx load balancer.\nStructured review with three sections:\n        1.  Likely root causes (ranked by probability).\n        2.  Concrete code/config changes.\n        3.  Telemetry to add.\nThe user *hasn't provided the code yet*.\n\n    *   Wait, the prompt says \"Below is the relevant code,\" but there is no code provided.\n    *   *Self-Correction:* I cannot analyze specific code that isn't there. However, as\n====CONTENT====\n",
      "n": 128
    }
  ]
}

forforever73 · 2026-06-14T11:59:52Z

@CISC could you take a look :)

CISC · 2026-06-14T12:19:15Z

LGTM, but @ggerganov should sign off on the API.

ggerganov

Looks incorrect with multiple sequences.

forforever73 · 2026-06-15T11:55:27Z

You're right. I've fixed it (matching the eagle3 pattern) and verified that single-sequence output is unchanged and that concurrent runs now hit 0 decode failures on both unified and non-unified caches.

ggerganov · 2026-06-15T12:40:50Z

+            auto * mem_dft = llama_get_memory(ctx_dft);
+
+            bool ok = true;
+            for (int head = 0; head < n_mtp_layers; ++head) {
+                if (chain_heads) {
+                    for (llama_seq_id seq_id = 0; seq_id < (llama_seq_id) n_seq; ++seq_id) {
+                        if (i_batch_beg[seq_id] < 0) {
+                            continue;
+                        }
+                        llama_memory_seq_rm(mem_dft, seq_id, batch_in.pos[i_batch_beg[seq_id]], -1);
+                    }
+                    llama_set_nextn_layer_offset(ctx_dft, head);
+                }
+
+                const int32_t rc = llama_decode(ctx_dft, batch);
+                if (rc != 0) {
+                    LOG_ERR("%s: llama_decode(ctx_dft) head=%d failed rc=%d (pos=%d)\n",
+                            __func__, head, (int) rc, (int) batch_in.pos[0]);
+                    ok = false;
+                    break;
+                }
+            }
+
+            if (chain_heads) {
+                llama_set_nextn_layer_offset(ctx_dft, 0); // restore default for non-draft decodes
+            }


I don't understand the logic here - seems incorrect. Every head iteration will basically erase the result of the previous iteration.

Each head runs a different layer: set_nextn_layer_offset(head) makes graph_mtp build layer n_layer()+head, i.e. 45/46/47, so each head writes its own k_l[il]/v_l[il].
seq_rm here doesn't drop any KV data; it just clears the cell metadata so find_slot hands back the same cells at the same positions for every head. Without it, heads 46/47 would land on fresh cells and we'd get duplicate positions in v_cells.
So after the loop those cells hold valid KV for all three MTP layers at once. It's the teacher-forcing catch-up that seeds each head's layer so the next draft() round attends to a correct, target-aligned cache.

Ok, got it. That's interesting.
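
The point made in this thread can be illustrated with a toy model. The sketch below is not the llama.cpp KV cache; `toy_kv_cache`, `toy_process` and the fake payload values are hypothetical names introduced only for illustration. It assumes what the explanation above states: cell metadata is shared across layers, each layer keeps its own payload per cell, and `seq_rm` clears only the metadata.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a KV cache (hypothetical, not the llama.cpp implementation):
// cell metadata (which position a cell holds) is shared by all layers, while
// each layer owns its own payload per cell.
struct toy_kv_cache {
    std::vector<int>              cell_pos; // -1 = free cell
    std::vector<std::vector<int>> layer_kv; // [layer][cell], 0 = never written

    toy_kv_cache(int n_layers, int n_cells)
        : cell_pos(n_cells, -1), layer_kv(n_layers, std::vector<int>(n_cells, 0)) {}

    // Clear metadata for positions >= p0; per-layer payloads are left untouched.
    void seq_rm(int p0) {
        for (int & pos : cell_pos) {
            if (pos >= p0) {
                pos = -1;
            }
        }
    }

    // Take the first free cell for `pos` and write this layer's payload into it.
    int decode(int layer, int pos) {
        for (std::size_t i = 0; i < cell_pos.size(); ++i) {
            if (cell_pos[i] == -1) {
                cell_pos[i] = pos;
                layer_kv[layer][i] = 1000 * (layer + 1) + pos; // fake KV value
                return (int) i;
            }
        }
        return -1; // cache full
    }
};

// Run the catch-up for 3 heads over positions [0, n_pos) and return the number
// of cells in use afterwards.
int toy_process(toy_kv_cache & kv, int n_pos, bool rm_before_each_head) {
    for (int head = 0; head < 3; ++head) {
        if (rm_before_each_head) {
            kv.seq_rm(0);
        }
        for (int pos = 0; pos < n_pos; ++pos) {
            kv.decode(head, pos);
        }
    }
    int used = 0;
    for (int pos : kv.cell_pos) {
        used += (pos != -1) ? 1 : 0;
    }
    return used;
}
```

In this toy, clearing the metadata before each head leaves every position in a single cell with payloads for all three layers, whereas skipping the clear spreads the same positions over three times as many cells.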

forforever73 · 2026-06-20T10:51:28Z

@ggerganov Hey, just a quick ping on this PR when you have a chance :)

ggerganov · 2026-06-21T06:52:11Z

Here is a minor patch I wanted to push, but don't have the permission:

diff --git a/common/speculative.cpp b/common/speculative.cpp
index fd0cf138f..d7a177b7b 100644
--- a/common/speculative.cpp
+++ b/common/speculative.cpp
@@ -1027,6 +1027,7 @@ struct common_speculative_impl_draft_mtp : public common_speculative_impl {
             bool ok = true;
             for (int head = 0; head < n_mtp_layers; ++head) {
                 if (chain_heads) {
+                    // ref: https://github.com/ggml-org/llama.cpp/pull/24340/changes#r3413498544
                     for (llama_seq_id seq_id = 0; seq_id < (llama_seq_id) n_seq; ++seq_id) {
                         if (i_batch_beg[seq_id] < 0) {
                             continue;
@@ -1837,7 +1838,7 @@ common_speculative * common_speculative_init(common_params_speculative & params,
 
         bool has_draft_simple = (enabled_configs & (1u << COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE));
         bool has_draft_eagle3 = (enabled_configs & (1u << COMMON_SPECULATIVE_TYPE_DRAFT_EAGLE3)) && params.draft.ctx_dft != nullptr;
-        bool has_mtp = (enabled_configs & (1u << COMMON_SPECULATIVE_TYPE_DRAFT_MTP)) && params.draft.ctx_dft != nullptr;
+        bool has_draft_mtp    = (enabled_configs & (1u << COMMON_SPECULATIVE_TYPE_DRAFT_MTP))    && params.draft.ctx_dft != nullptr;
 
 
 
@@ -1875,7 +1876,7 @@ common_speculative * common_speculative_init(common_params_speculative & params,
         if (has_draft_eagle3) {
             configs.push_back(common_speculative_config(COMMON_SPECULATIVE_TYPE_DRAFT_EAGLE3, params));
         }
-        if (has_mtp) {
+        if (has_draft_mtp) {
             configs.push_back(common_speculative_config(COMMON_SPECULATIVE_TYPE_DRAFT_MTP, params));
         }
     }

ggerganov · 2026-06-21T06:53:48Z

+                if (chain_heads) {
+                    chain_h[seq_id].insert(chain_h[seq_id].end(), h_row, h_row + n_embd);
+
+                    const int n_rows = (int) result.size() + 1; // id_last + tokens drafted so far
+                    for (int t = 0; t < n_rows; ++t) {
+                        const llama_token tok = (t == 0) ? dp.id_last : result[t - 1];
+                        common_batch_add(batch, tok, dp.n_past + t, { seq_id }, t == n_rows - 1);
+                        std::memcpy(batch.embd + (size_t) (batch.n_tokens - 1) * n_embd,
+                                    chain_h[seq_id].data() + (size_t) t * n_embd, row_bytes);
+                    }


This seems incorrect - every next draft, we decode all previous tokens again. Why is that? Normally, we should decode just the latest token.

You're right that every draft step re-decodes the whole prefix — that's intentional. This path isn't a normal AR draft; it's an accumulating-batch + per-step-seq_rm flow, because each step runs a different head.

After process(), every head's KV is already filled for the prompt + accepted prefix (positions < n_past), and the per-step seq_rm lower bound is n_past, so we never re-decode the prompt. What gets replayed each step is only the draft region — id_last plus the few tokens drafted so far (capped at n_max=3).

Why replay it under each head: since we switch heads per step, the current head hasn't written its own KV for any draft-region position yet this round. If we only decoded the latest token, its attention over the earlier draft positions would read cells that only another head ever wrote → garbage. So each step we seq_rm the draft region (so find_slot reuses the same slots and positions stay aligned), switch the head, and replay the accumulated prefix so this head fills its own KV.

For contrast, the other two branches reuse a single head across steps, so their prefix KV is already valid and they just append the latest token.

Ah yes. I'm still not used to that approach, but it seems correct. Add a reference to this explanation in the code.

Agreed it's not intuitive, it took me quite a while to design this too :) Added a comment and committed the other suggestions as well.

Add a reference:

diff --git a/common/speculative.cpp b/common/speculative.cpp
index d7a177b7b..f8a6287c2 100644
--- a/common/speculative.cpp
+++ b/common/speculative.cpp
@@ -1184,6 +1184,7 @@ struct common_speculative_impl_draft_mtp : public common_speculative_impl {
                 }
 
                 if (chain_heads) {
+                    // ref: https://github.com/ggml-org/llama.cpp/pull/24340#discussion_r3448031546
                     chain_h[seq_id].insert(chain_h[seq_id].end(), h_row, h_row + n_embd);
 
                     const int n_rows = (int) result.size() + 1; // id_last + tokens drafted so far
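
To make the cost and the correctness argument from this thread concrete, here is a small simulation. It is only a sketch: `toy_draft` and `draft_stats` are hypothetical names, the draft region is numbered from 0 (position 0 stands for id_last), and it assumes, as described above, that step s runs under head s and that a head can only attend correctly to positions it wrote itself.

```cpp
#include <cassert>
#include <vector>

struct draft_stats {
    int  rows_decoded; // total batch rows sent through decode in one round
    bool attention_ok; // every step only read KV written by its own head
};

// Simulate n_max draft steps where step s uses head s (hypothetical toy model).
// replay_prefix = true  : each step re-decodes the whole draft region 0..s
// replay_prefix = false : each step decodes only the newest position s
draft_stats toy_draft(int n_max, bool replay_prefix) {
    // written[head][pos] is true once `head` has written its own KV for `pos`
    std::vector<std::vector<bool>> written(n_max, std::vector<bool>(n_max, false));

    draft_stats st = { 0, true };
    for (int step = 0; step < n_max; ++step) {
        const int head  = step;
        const int first = replay_prefix ? 0 : step;

        for (int pos = first; pos <= step; ++pos) {
            written[head][pos] = true;
            st.rows_decoded++;
        }

        // the newest row attends over all draft positions 0..step under `head`
        for (int pos = 0; pos <= step; ++pos) {
            if (!written[head][pos]) {
                st.attention_ok = false;
            }
        }
    }
    return st;
}
```

With n_max = 3, replaying costs 1 + 2 + 3 = 6 decoded rows per round instead of 3, which is the bounded overhead the discussion refers to; without the replay, the second and third heads in this model would read positions they never wrote.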

ggerganov · 2026-06-21T06:55:34Z


+        std::vector<int> i_last(n_seq, -1);
+
+        std::vector<std::vector<float>> chain_h;


Should be allocated and reserved once at construction time.
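
A minimal sketch of what this suggestion could look like, assuming the buffers become members of the draft implementation. `toy_draft_state`, `begin_round` and the constructor parameters are hypothetical and are not taken from the PR; only `i_last` and `chain_h` mirror the names in the snippet above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical holder for per-sequence scratch buffers. Sizes are fixed at
// construction, so the per-round path only resets contents.
struct toy_draft_state {
    std::vector<int>                i_last;
    std::vector<std::vector<float>> chain_h;

    toy_draft_state(int n_seq, int n_max, int n_embd)
        : i_last(n_seq, -1), chain_h(n_seq) {
        for (auto & h : chain_h) {
            // room for up to n_max hidden rows per sequence (assumed bound)
            h.reserve((std::size_t) n_max * n_embd);
        }
    }

    // Called at the start of every draft round: resets state, keeps capacity.
    void begin_round() {
        std::fill(i_last.begin(), i_last.end(), -1);
        for (auto & h : chain_h) {
            h.clear(); // size becomes 0, reserved storage is retained
        }
    }
};
```

Because `clear()` leaves the capacity unchanged, appending rows after `begin_round()` stays within the reserved storage and does not reallocate as long as the assumed bound holds.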

ggerganov · 2026-06-21T06:57:52Z

+            cparams.embeddings              == other.cparams.embeddings              &&
+            cparams.embeddings_nextn        == other.cparams.embeddings_nextn        &&
+            cparams.embeddings_nextn_masked == other.cparams.embeddings_nextn_masked &&
+            cparams.nextn_layer_offset      == other.cparams.nextn_layer_offset      &&


This is correct, but effectively it will disable graph reuse during drafting. However, there isn't a better way to do it for now as we don't have a mechanism to do layer selection at compute time. It's something to think about in the future.

Add TODO to not forget:

diff --git a/src/llama-graph.h b/src/llama-graph.h
index d2a1b39d4..ac00d6cc6 100644
--- a/src/llama-graph.h
+++ b/src/llama-graph.h
@@ -682,11 +682,15 @@ struct llm_graph_params {
             }
         }
 
+        // TODO: https://github.com/ggml-org/llama.cpp/pull/24340#discussion_r3448035248
+        if (cparams.nextn_layer_offset != other.cparams.nextn_layer_offset) {
+            return false;
+        }
+
         return
             cparams.embeddings == other.cparams.embeddings &&
             cparams.embeddings_nextn == other.cparams.embeddings_nextn &&
             cparams.embeddings_nextn_masked == other.cparams.embeddings_nextn_masked &&
-            cparams.nextn_layer_offset == other.cparams.nextn_layer_offset &&
             cparams.causal_attn == other.cparams.causal_attn &&
             arch == other.arch &&
             gtype == other.gtype &&
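
The effect on graph reuse can be sketched with a toy comparison. This is an illustration under assumptions, not the real `llm_graph_params`: `toy_graph_params` keeps only the layer offset, and `count_graph_reuses` is a hypothetical helper that counts how many consecutive decodes would be allowed to reuse the previous graph.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy version of the reuse check: two parameter sets may share a graph only
// if they select the same NextN layer offset (all other fields omitted).
struct toy_graph_params {
    int32_t nextn_layer_offset;

    bool allow_reuse(const toy_graph_params & other) const {
        return nextn_layer_offset == other.nextn_layer_offset;
    }
};

// Count how many decodes in the sequence can reuse the graph of the previous one.
int count_graph_reuses(const std::vector<int32_t> & offsets) {
    int n_reused = 0;
    for (std::size_t i = 1; i < offsets.size(); ++i) {
        const toy_graph_params prev = { offsets[i - 1] };
        const toy_graph_params cur  = { offsets[i] };
        n_reused += prev.allow_reuse(cur) ? 1 : 0;
    }
    return n_reused;
}
```

A drafting pattern that cycles offsets 0, 1, 2 never repeats the previous offset, so in this model no decode can reuse the preceding graph, while a constant offset would allow reuse on every decode after the first.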

ggerganov

Not sure why the EditorConfig check is failing - seems like a false-positive. Should be good to merge.

CISC · 2026-06-21T09:45:57Z

For some reason the line number is the would-be-merged line number, so sometimes it's a bit off.

ggerganov · 2026-06-21T11:30:14Z

Yes, though I merged master into my working copy of this branch and it didn't result in any whitespace issues. Not sure what caused this.

remeh · 2026-06-22T12:15:36Z

A big thank you for this PR, folks! I've gone from ~18 tok/s to ~30 tok/s on coding tasks on a Strix Halo (with ROCm, the IQ4_XS quant and --spec-draft-n-max 3). I'll now see whether it still hallucinates typos with bigger contexts.
Thanks a bunch @forforever73!

github-actions Bot added the model Model specific label Jun 9, 2026

forforever73 marked this pull request as ready for review June 11, 2026 07:06

forforever73 requested review from a team, CISC and ggerganov as code owners June 11, 2026 07:06

am17an reviewed Jun 13, 2026

forforever73 added 10 commits June 14, 2026 13:09

add mtp_layer_offset + include nextn flags in graph reuse

e98e0b3

add llama_set_mtp_layer_offset + llama_model_n_nextn_layer API

1208d99

offset head select + require all MTP blocks

0b8aa51

speculative multi-head process()

e0fb9ff

speculative multi-head draft()

34f68f5

gather outputs via inp_out_ids

ae013e3

cleanup

1885d8f

fix core

f4a2c12

minor cleanup

9a0ff26

merged draft_multi_head into draft()

2952d83

forforever73 force-pushed the step35-multilayer-mtp-rebase branch from cffdd9a to 2952d83 Compare June 14, 2026 05:23

am17an reviewed Jun 14, 2026

mtp rename nextn

48a7484

forforever73 and others added 2 commits June 14, 2026 15:14

Apply suggestions from code review

2ce24fb

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

clean-up comments

9b8f3b6

am17an approved these changes Jun 14, 2026

ggerganov requested changes Jun 15, 2026

fix for multi seq

7a0a247

ggerganov reviewed Jun 15, 2026

ggerganov self-assigned this Jun 16, 2026

ggerganov reviewed Jun 21, 2026

forforever73 added 2 commits June 21, 2026 15:59

apply suggestions && chain-heads comment

9858fd2

add a reference for chain_heads discussion

60b04d3

forforever73 mentioned this pull request Jun 21, 2026

feat(step3.7): support NextN/MTP heads for speculative decoding stepfun-ai/llama.cpp#1

Open

5 tasks

ggerganov approved these changes Jun 21, 2026

ggerganov merged commit d789527 into ggml-org:master Jun 21, 2026
24 of 25 checks passed

Pento95 mentioned this pull request Jun 23, 2026

Cap n_outputs_max on MTP draft contexts LostRuins/koboldcpp#2287

Merged

deadprogram mentioned this pull request Jun 24, 2026

pkg/llama: add support for llama_model_n_layer_nextn hybridgroup/yzma#264

Merged
