Support Step3.5/3.7 flash mtp3 #24340

Merged
ggerganov merged 16 commits into ggml-org:master from stepfun-ai:step35-multilayer-mtp-rebase
Jun 21, 2026

@forforever73

Contributor

Overview

Follow-up to #23274. (cc @pwilkin)

📜 Full data-flow trace — couldn't think of a good way to draw this, so I wrote it all down instead. It's long, but every byte is load-bearing.

Notation:

  • token@pos / h(pos) — positions are explicit (0-indexed)
  • h_tgt(p) — target NextN hidden at p (before the output norm)
  • h45(p) / h46(p) — head 45/46 output hidden, chained between heads while drafting
  • pending_h = h_tgt(pos of id_last − 1) — always the trunk h, regardless of chaining

Example: a 4-token prompt at positions 0–3.

Time ─────────────────────────────────────────────────────────────────────▶

═══ Round 0: prompt bootstrap ══════════════════════════════════════════════
  prompt: [t0@0, t1@1, t2@2, t3@3]   (need_embd → all logits=true)

  ② target decode → verify_h = [h_tgt(0), h_tgt(1), h_tgt(2), h_tgt(3)]

  ③ process (mirror once per head, all logits=0, h_tgt shifted right by 1):
       for head in {45, 46, 47}:
         seq_rm(ctx_dft, seq, ≥ this ubatch's first pos)   // = 0 here; reset before each head
         token = [t0@0, t1@1,    t2@2,    t3@3]
         embd  = [0,    h_tgt(0), h_tgt(1), h_tgt(2)]   ← row 0 = pending_h = 0 (sentinel)
         → write KV_head@0..3 (teacher-forced, logits=0)
                              h_tgt(3) pushed out → pending_h default = h_tgt(3)
       (no draft → accept not called → pending_h stays = h_tgt(3))

  sample logits at pos 3 → T_first@4
       slot.sampled = T_first@4 ;  pending_h = h_tgt(3)
       invariant: id_last=T_first@4, pending_h=h_tgt(3)=h_tgt(4-1)  ✓

═══ Round 1: generation (id_last=T_first@4, pending_h=h_tgt(3)) ═════════════
  ⑥ draft (accumulated batch; switch head each step, 1 decode/step, batch grows):
       round_start = pos of id_last = 4; seq_rm(≥4) before switching to each head

       set_mtp_layer_offset(ctx_dft, 0)   // head 45
       step0 [h45]  seq_rm(≥4); batch=[T_first@4],            embd=[h_tgt(3)]
                    → decode → write KV45@4; sample draft1@5; output h45(4)

       set_mtp_layer_offset(ctx_dft, 1)   // head 46
       step1 [h46]  seq_rm(≥4); batch=[T_first@4, draft1@5],  embd=[h_tgt(3), h45(4)]
                    → decode → write KV46@4,5; sample draft2@6; output h46(5)

       set_mtp_layer_offset(ctx_dft, 2)   // head 47
       step2 [h47]  seq_rm(≥4); batch=[T_first@4, draft1@5, draft2@6],
                    embd=[h_tgt(3), h45(4), h46(5)]
                    → decode → write KV47@4,5,6; sample draft3@7
                    (draft length capped at n_mtp_layers = 3)

       spec_draft = [draft1, draft2, draft3]

  ── post-draft seq_rm (server, cross-round cleanup; unchanged) ──
       ckpt.pos_max = 3 (ctx_tgt only covers prompt @0..3 here; T_first not decoded yet)
       seq_rm(ctx_dft, slot.id, ≥4)  → every head's KV trimmed back to @0..3

  ① target batch (4 tokens, all logits=true):
       token = [T_first@4, draft1@5, draft2@6, draft3@7]

  ② target decode → verify_h = [h_tgt(4), h_tgt(5), h_tgt(6), h_tgt(7)] + logits[0..3]

  ③ process (once per head, logits=0; embd = h_tgt right-shift, NOT inter-head h):
       for head in {45, 46, 47}:
         seq_rm(ctx_dft, seq, ≥4)   // reset before each head; all heads rewrite from @4
         token = [T_first@4, draft1@5, draft2@6, draft3@7]
         embd  = [h_tgt(3),  h_tgt(4), h_tgt(5), h_tgt(6)]
                  ↑ pending_h           h_tgt(7) pushed out → pending_h default
         → rewrite KV_head@4..7 (teacher-forced, logits=0)

  ④ verify (sample from ctx_tgt logits, identical to the single-block path):
       logits[0] → real tok@5 == draft1  ✓
       logits[1] → real tok@6 == draft2  ✓
       logits[2] → real tok@7 != draft3  ✗ → resample T_new@7
       accepted = [draft1, draft2, T_new] ;  n_accepted = 2

  ⑤ accept (identical to the single-block path):
       pending_h    = verify_h[n_accepted] = verify_h[2] = h_tgt(6)
       slot.sampled = accepted.back()      = T_new (pos 7)
       invariant: id_last=T_new@7, pending_h=h_tgt(6)=h_tgt(7-1)  ✓

═══ Round 2: generation (id_last=T_new@7, pending_h=h_tgt(6)) ═══════════════
  ⑥ step0 [h45]  seq_rm(≥7); batch=[T_new@7],                  embd=[h_tgt(6)]
                 → write KV45@7; draft1'@8, h45(7)
     step1 [h46]  seq_rm(≥7); batch=[T_new@7, draft1'@8],       embd=[h_tgt(6), h45(7)]
                 → write KV46@7,8; draft2'@9, h46(8)
     step2 [h47]  seq_rm(≥7); batch=[T_new@7, draft1'@8, draft2'@9],
                 embd=[h_tgt(6), h45(7), h46(8)]
                 → write KV47@7,8,9; draft3'@10
     ── post-draft seq_rm(≥8); ① target batch; ② target decode; ③ process ×3
        (process resets seq_rm(≥7) per head, then rewrites @7..10)
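
The right-shift in ③, the verify step ④, and the accept step ⑤ can be sketched in isolation as follows. This is a minimal standalone illustration rather than code from the PR: `Hidden`, `Token`, `shift_right`, and `verify_greedy` are hypothetical names, a hidden state h_tgt(p) is represented only by its position p (with -1 standing in for the zero sentinel), and verification is assumed to be a plain greedy token comparison.

```cpp
#include <cstddef>
#include <vector>

// Stand-ins for illustration only: the value p represents h_tgt(p),
// and -1 represents the zero sentinel used for the very first row.
using Hidden = int;
using Token  = int;

struct Shifted {
    std::vector<Hidden> embd;  // rows fed to every head in process()
    Hidden next_pending_h;     // trunk hidden pushed out of the window
};

// embd = [pending_h, verify_h[0], ..., verify_h[n-2]];
// verify_h[n-1] becomes the default pending_h for the next round.
Shifted shift_right(Hidden pending_h, const std::vector<Hidden> & verify_h) {
    Shifted out{ {}, pending_h };
    for (std::size_t i = 0; i < verify_h.size(); ++i) {
        out.embd.push_back(i == 0 ? pending_h : verify_h[i - 1]);
    }
    if (!verify_h.empty()) {
        out.next_pending_h = verify_h.back();
    }
    return out;
}

struct Verified {
    std::vector<Token> accepted;  // accepted drafts plus one target token
    std::size_t n_accepted;       // number of drafts that matched
};

// target[i] is the token the target model picks from logits[i].
// Drafts are accepted until the first mismatch; the target's own token at
// that index is appended (a resample on mismatch, a bonus token otherwise).
Verified verify_greedy(const std::vector<Token> & drafts,
                       const std::vector<Token> & target) {
    Verified out{ {}, 0 };
    while (out.n_accepted < drafts.size() &&
           out.n_accepted < target.size() &&
           drafts[out.n_accepted] == target[out.n_accepted]) {
        out.accepted.push_back(drafts[out.n_accepted]);
        ++out.n_accepted;
    }
    if (out.n_accepted < target.size()) {
        out.accepted.push_back(target[out.n_accepted]);
    }
    return out;
}
```

With the Round 1 values from the trace, `shift_right(3, {4, 5, 6, 7})` yields the embd rows for h_tgt(3..6), and indexing `verify_h` with the returned `n_accepted` picks the next `pending_h`, which keeps the invariant pending_h = h_tgt(pos of id_last − 1).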

Core strategy
Each MTP head is its own decoder layer with its own KV, and the driver runs one head per llama_decode. A seq_rm before each head clears the range it re-decodes, so it reuses the same slots (find_slot is deterministic) instead of stacking duplicate positions; find_slot / apply_ubatch are untouched. The two phases differ only in what feeds the heads:

| phase | embd fed to each head | purpose |
| --- | --- | --- |
| process() | trunk h_tgt, right-shifted by one (not the inter-head hidden) | re-anchor each head's committed-prefix KV to the target's real hidden → next round starts target-aligned |
| draft() | the previous head's output, chained (slot 0 = trunk pending_h) | generate the draft tokens; each head rebuilds its own layer's KV on the growing prefix |

Only the trunk h_tgt crosses rounds, so pending_h / verify_h stay single-layer.
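
The draft() side of this can be illustrated with a small host-side sketch of how the accumulated batch grows from head to head. It is not the PR's implementation: `DraftStep`, `HeadOut`, `HeadFn`, and `draft_chain` are hypothetical names, tokens and hidden states are plain strings, and the actual decode (plus the per-head seq_rm and head selection) is abstracted into a callback.

```cpp
#include <functional>
#include <string>
#include <vector>

// One draft step: the batch that a given head decodes.
struct DraftStep {
    int head;                         // 0-based head index (offset)
    std::vector<std::string> tokens;  // [id_last, draft1, ...]
    std::vector<std::string> embd;    // [pending_h, h_head0, ...]
};

// What a head produces: the sampled draft token and the hidden of the
// last batch row, which is chained into the next head's input.
struct HeadOut {
    std::string token;
    std::string hidden;
};

// Hypothetical stand-in for "select head, seq_rm the round, llama_decode".
using HeadFn = std::function<HeadOut(int head,
                                     const std::vector<std::string> & tokens,
                                     const std::vector<std::string> & embd)>;

// Run each head once on the growing prefix. Row 0 always carries the
// trunk pending_h; every later row carries the previous head's output.
std::vector<DraftStep> draft_chain(const std::string & id_last,
                                   const std::string & pending_h,
                                   int n_heads,
                                   const HeadFn & run_head) {
    std::vector<std::string> tokens = { id_last };
    std::vector<std::string> embd   = { pending_h };
    std::vector<DraftStep> steps;
    for (int head = 0; head < n_heads; ++head) {
        steps.push_back({ head, tokens, embd });
        const HeadOut out = run_head(head, tokens, embd);
        tokens.push_back(out.token);   // becomes the next batch row
        embd.push_back(out.hidden);    // chained inter-head hidden
    }
    return steps;
}
```

Keeping both vectors on the host and rebuilding the batch each step means every head sees the full round prefix, so it can write its own layer's KV for each position.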

MTP Block Selection Strategy

  • cparams.mtp_layer_offset (src/llama-cparams.h) — picks which appended MTP block the DECODER_MTP graph runs: il = n_layer() + offset. Default 0.
  • graph_mtp selects the head by offset (il = n_layer() + cparams.mtp_layer_offset; previously a hardcoded n_layer()).
  • graph_mtp now gathers its output rows via build_inp_out_ids(), like the trunk graph. This is the fix that makes chaining work: from step 1 on, the output is the last batch row, not row 0, so without the gather heads 46/47 would read the wrong row. The gather is an identity when n_outputs == n_tokens, so the single-head path is unchanged.
  • Loader now requires all n_layer_nextn MTP blocks.
  • n_max is clamped to the head count when chaining (each head used once).
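
The selection rules in this list can be sketched as three small helpers. These are illustrative only: the function names are hypothetical, the range check is an addition that the description does not mention, and the values in the usage note assume a 45-layer trunk so that the heads land on layers 45 to 47 as in the trace.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Layer index of the MTP block selected by `offset`: il = n_layer + offset.
// The bounds check is an illustrative addition, not taken from the PR.
std::int32_t mtp_layer_index(std::int32_t n_layer,
                             std::int32_t n_layer_nextn,
                             std::int32_t offset) {
    if (offset < 0 || offset >= n_layer_nextn) {
        throw std::out_of_range("MTP layer offset outside [0, n_layer_nextn)");
    }
    return n_layer + offset;
}

// When chaining, each head is used once, so the draft length is capped
// at the number of heads.
std::int32_t clamp_n_max(std::int32_t n_max, std::int32_t n_layer_nextn) {
    return std::min(n_max, n_layer_nextn);
}

// Gather the rows named by out_ids. With out_ids = {n - 1} this returns the
// last batch row; with out_ids = {0, ..., n - 1} it is an identity.
std::vector<int> gather_rows(const std::vector<int> & rows,
                             const std::vector<std::size_t> & out_ids) {
    std::vector<int> out;
    out.reserve(out_ids.size());
    for (std::size_t id : out_ids) {
        out.push_back(rows.at(id));
    }
    return out;
}
```

For example, `mtp_layer_index(45, 3, 2)` selects layer 47, `clamp_n_max(5, 3)` limits the draft to three tokens, and `gather_rows(rows, {rows.size() - 1})` picks the last row, which is the one a chained head needs from step 1 on.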

Results

./llama-server \
    -m Step-3.7-IQ4_XS.gguf \
    --spec-type draft-mtp \
    --spec-draft-model Step3.7-flash-mtp-Q8_0.gguf \
    -ngl all \
    --spec-draft-ngl all \
    -c 35000 \
    -np 1 \
    -b 2048 \
    -ub 1024 \
    --temp 0 \
    --spec-draft-n-max {n} \
    --spec-draft-p-min 0.0 \
    --host 127.0.0.1 \
    --port 8080

The command is identical on both machines; only --spec-draft-n-max and the build change. Before = single-block MTP on master (one head, looped when n-max > 1); after = the three-layer chain.

DGX Spark GB10

Before (single-block MTP, master)

--spec-draft-n-max 2

  code_python        pred= 192 draft= 164 acc= 108 rate=0.658 tok/s=30.8
  code_cpp           pred= 192 draft= 172 acc= 104 rate=0.605 tok/s=30.5
  explain_concept    pred= 192 draft= 171 acc= 105 rate=0.614 tok/s=30.8
  summarize          pred= 192 draft= 160 acc= 110 rate=0.688 tok/s=33.2
  qa_factual         pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=36.6
  translation        pred= 192 draft= 173 acc= 104 rate=0.601 tok/s=30.8
  creative_short     pred= 192 draft= 189 acc=  95 rate=0.503 tok/s=28.1
  stepwise_math      pred= 192 draft= 156 acc= 113 rate=0.724 tok/s=34.6
  long_code_review   pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=31.7

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1492,
  "total_draft_accepted": 966,
  "aggregate_accept_rate": 0.6475,
  "wall_s_total": 60.19
}

--spec-draft-n-max 3

  code_python        pred= 192 draft= 233 acc= 112 rate=0.481 tok/s=28.8
  code_cpp           pred= 192 draft= 243 acc= 109 rate=0.449 tok/s=27.7
  explain_concept    pred= 192 draft= 252 acc= 106 rate=0.421 tok/s=26.3
  summarize          pred= 192 draft= 233 acc= 112 rate=0.481 tok/s=27.7
  qa_factual         pred= 192 draft= 196 acc= 125 rate=0.638 tok/s=30.5
  translation        pred= 192 draft= 242 acc= 109 rate=0.450 tok/s=25.7
  creative_short     pred= 192 draft= 271 acc=  99 rate=0.365 tok/s=24.9
  stepwise_math      pred= 192 draft= 226 acc= 115 rate=0.509 tok/s=30.0
  long_code_review   pred= 192 draft= 235 acc= 112 rate=0.477 tok/s=27.3

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 2131,
  "total_draft_accepted": 999,
  "aggregate_accept_rate": 0.4688,
  "wall_s_total": 70.1
}

After (three-layer MTP, this PR)

--spec-draft-n-max 2

  code_python        pred= 192 draft= 154 acc= 113 rate=0.734 tok/s=33.9
  code_cpp           pred= 192 draft= 156 acc= 112 rate=0.718 tok/s=32.6
  explain_concept    pred= 192 draft= 153 acc= 114 rate=0.745 tok/s=34.9
  summarize          pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=35.5
  qa_factual         pred= 192 draft= 146 acc= 117 rate=0.801 tok/s=36.9
  translation        pred= 192 draft= 164 acc= 108 rate=0.658 tok/s=29.8
  creative_short     pred= 192 draft= 171 acc= 104 rate=0.608 tok/s=30.0
  stepwise_math      pred= 192 draft= 139 acc= 121 rate=0.871 tok/s=39.4
  long_code_review   pred= 192 draft= 143 acc= 118 rate=0.825 tok/s=36.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1376,
  "total_draft_accepted": 1022,
  "aggregate_accept_rate": 0.7427,
  "wall_s_total": 56.95
}

--spec-draft-n-max 3

  code_python        pred= 192 draft= 188 acc= 128 rate=0.681 tok/s=35.1
  code_cpp           pred= 192 draft= 213 acc= 119 rate=0.559 tok/s=31.4
  explain_concept    pred= 192 draft= 208 acc= 121 rate=0.582 tok/s=32.2
  summarize          pred= 192 draft= 203 acc= 122 rate=0.601 tok/s=33.0
  qa_factual         pred= 192 draft= 181 acc= 130 rate=0.718 tok/s=38.0
  translation        pred= 192 draft= 198 acc= 124 rate=0.626 tok/s=34.7
  creative_short     pred= 192 draft= 244 acc= 108 rate=0.443 tok/s=27.4
  stepwise_math      pred= 192 draft= 170 acc= 134 rate=0.788 tok/s=40.6
  long_code_review   pred= 192 draft= 199 acc= 124 rate=0.623 tok/s=33.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1804,
  "total_draft_accepted": 1110,
  "aggregate_accept_rate": 0.6153,
  "wall_s_total": 56.57
}

Mac Studio M4 Max

Before (single-block MTP, master)

--spec-draft-n-max 2

  code_python        pred= 192 draft= 162 acc= 110 rate=0.679 tok/s=42.6
  code_cpp           pred= 192 draft= 179 acc= 101 rate=0.564 tok/s=38.4
  explain_concept    pred= 192 draft= 171 acc= 105 rate=0.614 tok/s=40.2
  summarize          pred= 192 draft= 162 acc= 110 rate=0.679 tok/s=42.5
  qa_factual         pred= 161 draft= 122 acc= 101 rate=0.828 tok/s=47.4
  translation        pred= 192 draft= 171 acc= 104 rate=0.608 tok/s=40.0
  creative_short     pred= 192 draft= 188 acc=  97 rate=0.516 tok/s=36.7
  stepwise_math      pred= 192 draft= 155 acc= 113 rate=0.729 tok/s=44.3
  long_code_review   pred= 192 draft= 165 acc= 108 rate=0.654 tok/s=41.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1697,
  "total_draft": 1475,
  "total_draft_accepted": 949,
  "aggregate_accept_rate": 0.6434,
  "wall_s_total": 46.38
}

--spec-draft-n-max 3

  code_python        pred= 192 draft= 233 acc= 113 rate=0.485 tok/s=33.7
  code_cpp           pred= 192 draft= 259 acc= 104 rate=0.402 tok/s=30.4
  explain_concept    pred= 192 draft= 249 acc= 107 rate=0.430 tok/s=31.7
  summarize          pred= 192 draft= 226 acc= 115 rate=0.509 tok/s=34.7
  qa_factual         pred= 161 draft= 165 acc= 105 rate=0.636 tok/s=39.9
  translation        pred= 192 draft= 244 acc= 109 rate=0.447 tok/s=32.3
  creative_short     pred= 192 draft= 278 acc=  98 rate=0.352 tok/s=28.4
  stepwise_math      pred= 192 draft= 220 acc= 117 rate=0.532 tok/s=35.8
  long_code_review   pred= 192 draft= 232 acc= 113 rate=0.487 tok/s=33.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1697,
  "total_draft": 2106,
  "total_draft_accepted": 981,
  "aggregate_accept_rate": 0.4658,
  "wall_s_total": 56.7
}

After (three-layer MTP, this PR)

--spec-draft-n-max 2

  code_python        pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=45.3
  code_cpp           pred= 192 draft= 165 acc= 108 rate=0.654 tok/s=41.2
  explain_concept    pred= 192 draft= 155 acc= 113 rate=0.729 tok/s=43.8
  summarize          pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=45.0
  qa_factual         pred= 161 draft= 114 acc= 104 rate=0.912 tok/s=49.7
  translation        pred= 192 draft= 153 acc= 113 rate=0.739 tok/s=43.8
  creative_short     pred= 192 draft= 172 acc= 104 rate=0.605 tok/s=39.1
  stepwise_math      pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=46.8
  long_code_review   pred= 192 draft= 153 acc= 114 rate=0.745 tok/s=43.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1697,
  "total_draft": 1356,
  "total_draft_accepted": 1007,
  "aggregate_accept_rate": 0.7426,
  "wall_s_total": 43.83
}

--spec-draft-n-max 3

  code_python        pred= 192 draft= 188 acc= 128 rate=0.681 tok/s=40.8
  code_cpp           pred= 192 draft= 205 acc= 122 rate=0.595 tok/s=37.2
  explain_concept    pred= 192 draft= 207 acc= 121 rate=0.585 tok/s=36.8
  summarize          pred= 192 draft= 194 acc= 126 rate=0.649 tok/s=39.4
  qa_factual         pred= 161 draft= 141 acc= 114 rate=0.808 tok/s=45.6
  translation        pred= 192 draft= 199 acc= 124 rate=0.623 tok/s=38.3
  creative_short     pred= 192 draft= 242 acc= 109 rate=0.450 tok/s=31.5
  stepwise_math      pred= 192 draft= 170 acc= 134 rate=0.788 tok/s=44.9
  long_code_review   pred= 192 draft= 208 acc= 121 rate=0.582 tok/s=36.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1697,
  "total_draft": 1754,
  "total_draft_accepted": 1099,
  "aggregate_accept_rate": 0.6266,
  "wall_s_total": 49.38
}

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: CC for code and tests, but the core data flow was designed by a human

github-actions bot added the model (Model specific) label Jun 9, 2026
@pwilkin

pwilkin commented Jun 9, 2026

Member

I think we want at least @ggerganov and @am17an here for the discussion about how to solve multi-layer MTP in core.

@forforever73

Contributor Author

Yes, I propose an initial approach here. I think it's semantically correct while keeping the changes relatively small.

@pwilkin

pwilkin commented Jun 9, 2026

Member

Yeah, I purposefully wanted the original StepFun MTP PR to be small: I did a little foray into implementing the full MTP and saw it would be quite a challenging task, so I think it's good to discuss this :)

@forforever73 forforever73 marked this pull request as ready for review June 11, 2026 07:06
@forforever73 forforever73 requested review from a team, CISC and ggerganov as code owners June 11, 2026 07:06
@forforever73

Contributor Author

@ggerganov @am17an Would you have some time to take a look?

@am17an

am17an commented Jun 13, 2026

Contributor

From what I understand, this can be achieved if we fix the draft length (via --spec-draft-n-max/min to be the same and --spec-draft-n-min 0) and we pass some state in the ctx_dft for selecting which nextn layer to use

@forforever73

Contributor Author

@am17an Yeah, I guess the ctx_dft state you mean is the mtp_layer_offset I added. Getting 3-layer MTP to run is the easy part; the rest of the diff is all about keeping the KV cache correct.

Unlike gemma4 (all nextn layers in one graph, shared target KV, single position), step35's heads are chained with a sample in between and run on their own KV. So I can't just bump the layer per step in the single-token loop: head 46 would attend back to a cell that only head 45 ever wrote on its layer, and read garbage. That's why each step has to seq_rm the round and re-decode the accumulated prefix on the current head's layer.

For the correct semantics, vLLM's mtp3 implementation can serve as a reference. I think it can be done more simply under llama.cpp's architecture.

@am17an

am17an commented Jun 13, 2026

Contributor

I think you can still optionally add the llama_memory_seq_rm for this architecture after every decode step if I understand correctly.

@forforever73

Contributor Author

Right, seq_rm is part of it. We also need to re-decode the whole accumulated prefix [id_last, draft_1, …] on the current head's layer each step, so that the head writes its own layer's KV for every position.
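
A toy model may help show why bumping the layer alone is not enough. The sketch below only tracks which positions hold valid K/V data on each head's layer. `ToyKV` and its methods are hypothetical, and unlike the real llama_memory_seq_rm, which operates on the whole context, this `seq_rm` is per layer purely to keep the illustration small.

```cpp
#include <map>
#include <set>

// Toy bookkeeping: for each head's layer, the positions whose K/V entries
// were actually written by a decode on that layer.
struct ToyKV {
    std::map<int, std::set<int>> valid;

    // Drop every position >= p0 on the given layer.
    void seq_rm(int layer, int p0) {
        std::set<int> & s = valid[layer];
        s.erase(s.lower_bound(p0), s.end());
    }

    // Decode positions [p_begin, p_end) on the given layer.
    void decode(int layer, int p_begin, int p_end) {
        for (int p = p_begin; p < p_end; ++p) {
            valid[layer].insert(p);
        }
    }

    // True if every position in [p_begin, p_end) is valid on the layer,
    // i.e. attention over that range reads only data the layer wrote.
    bool covers(int layer, int p_begin, int p_end) const {
        const auto it = valid.find(layer);
        for (int p = p_begin; p < p_end; ++p) {
            if (it == valid.end() || it->second.count(p) == 0) {
                return false;
            }
        }
        return true;
    }
};
```

In this model, if head 45 decodes position 4 and head 46 then decodes only position 5, `covers(46, 0, 6)` is false because layer 46 never wrote position 4. Calling `seq_rm(46, 4)` followed by `decode(46, 4, 6)` restores full coverage, which is the per-step pattern described above.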

@am17an

am17an commented Jun 13, 2026

Contributor

I see, so you have to keep the draft_tokens and embeddings to copy them in each subsequent draft round. I think you can keep these two vectors host side and add them while rebuilding the batch. As an aside, this would not be able to use CUDA graphs as the topology for the draft will keep changing (i.e. batch size goes from 1 to 2 to 3 etc).

@forforever73

Contributor Author

> I think you can keep these two vectors host side and add them while rebuilding the batch

Yep, that's exactly what the current implementation does.

> this would not be able to use CUDA graphs as the topology for the draft will keep changing

I'm aware, but each head needs the token the previous head sampled, so it's hard to avoid. The perf cost is contained: the draft is at most n_nextn tiny decodes.
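
To make the cost concrete, the batch shapes of one draft round can be enumerated with a short sketch. The helper names are hypothetical and the sketch assumes exactly one decode per head, with the batch growing by one row per step as in the trace.

```cpp
#include <numeric>
#include <vector>

// Batch size of each draft step when every head re-decodes the whole
// accumulated prefix: step k decodes k + 1 rows.
std::vector<int> draft_batch_sizes(int n_heads) {
    std::vector<int> sizes;
    for (int step = 0; step < n_heads; ++step) {
        sizes.push_back(step + 1);
    }
    return sizes;
}

// Total rows decoded by the draft heads in one round: 1 + 2 + ... + n.
int total_draft_rows(int n_heads) {
    const std::vector<int> sizes = draft_batch_sizes(n_heads);
    return std::accumulate(sizes.begin(), sizes.end(), 0);
}
```

For three heads the steps decode 1, 2, and 3 rows, six rows in total, and every step has a different batch size, which is why a single captured graph cannot be replayed across steps.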

@am17an

am17an commented Jun 13, 2026

Contributor

Yes but I think the current implementation can be simplified.

Comment thread common/speculative.cpp Outdated
// Each slot's embd is the hidden produced by the PREVIOUS head for that token
// (slot 0 is always pending_h = trunk h). Per-step seq_rm keeps each head's KV
// on a clean, position-aligned slot set.
void draft_multi_head(common_speculative_draft_params_vec & dparams) {

Contributor

This should be part of draft rather than a separate function. The only difference is that you need to add the embd + token of the last sampled head to the batch.

Contributor Author

Right, merged.

@forforever73 forforever73 force-pushed the step35-multilayer-mtp-rebase branch from cffdd9a to 2952d83 Compare June 14, 2026 05:23

@am17an am17an left a comment

Contributor

You can clean up the comments a bit to follow the rest of the repo. If the code is self-explanatory, we prefer not to add comments in cpp files (in .h files they are encouraged). If something is a bit non-intuitive (like seq_rm in this PR), then it makes sense to add a comment to explain. You should also check MTP performance/correctness of Qwen3.6 and Gemma4.

Comment thread include/llama.h Outdated
LLAMA_API int32_t llama_model_n_embd_inp (const struct llama_model * model);
LLAMA_API int32_t llama_model_n_embd_out (const struct llama_model * model);
LLAMA_API int32_t llama_model_n_layer (const struct llama_model * model);
// Number of appended NextN/MTP prediction blocks (0 if the model has none)

Contributor

Suggested change
// Number of appended NextN/MTP prediction blocks (0 if the model has none)

Also need to fix the alignment

Comment thread src/llama-cparams.h Outdated
// MTP (multi-token prediction): which appended NextN/MTP block the
// DECODER_MTP graph runs, as an offset past the trunk (il = n_layer() + offset).
// 0 selects the first MTP head; the speculative driver bumps it per draft step.
int32_t mtp_layer_offset = 0;

Contributor

Replace occurrences of "mtp" with "nextn" to make it more consistent.

Comment thread src/llama-context.h Outdated
void set_embeddings (bool value);
void set_embeddings_nextn(bool value, bool masked);
void set_embeddings_layer_inp(uint32_t lid, bool enable);
void set_mtp_layer_offset(int32_t offset);

Contributor

Suggested change
- void set_mtp_layer_offset(int32_t offset);
+ void set_nextn_layer_offset(int32_t offset);

Comment thread common/speculative.cpp Outdated
Comment thread common/speculative.cpp Outdated
Comment thread common/speculative.cpp Outdated
Comment thread common/speculative.cpp Outdated
forforever73 and others added 2 commits June 14, 2026 15:14
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
@forforever73

Contributor Author

Tested MTP performance/correctness of Qwen3.6 and Gemma4 on H800, with these parameters:

CA="-ngl 999 -fa on -ctk bf16 -ctv bf16 --no-mmap -c 8192 -b 4096 -ub 2048 -np 1 -t 32 --jinja"
# Qwen3.6
<bin> -m Qwen3.6-27B-MTP-Q8_0.gguf $CA --spec-type draft-mtp --spec-draft-n-max 3
# Gemma4 MTP
<bin> -m Gemma4-31B-Q8_0.gguf -md mtp-gemma-4-31B-it.gguf -ngld 999 $CA --spec-type draft-mtp --spec-draft-n-max 3

📜 Performance of Qwen3.6 and Gemma4

Qwen3.6 master

  code_python        pred= 192 draft= 163 acc= 136 rate=0.834 tok/s=135.4
  code_cpp           pred= 192 draft= 181 acc= 130 rate=0.718 tok/s=123.1
  explain_concept    pred= 192 draft= 181 acc= 130 rate=0.718 tok/s=122.9
  summarize          pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=128.8
  qa_factual         pred= 192 draft= 176 acc= 132 rate=0.750 tok/s=127.0
  translation        pred= 192 draft= 174 acc= 132 rate=0.759 tok/s=127.0
  creative_short     pred= 192 draft= 198 acc= 125 rate=0.631 tok/s=113.9
  stepwise_math      pred= 192 draft= 168 acc= 134 rate=0.798 tok/s=131.4
  long_code_review   pred= 192 draft= 195 acc= 126 rate=0.646 tok/s=114.2
  AGG  accept=0.733 draft=1607 acc=1178 pred=1728 wall=17.13s

Qwen3.6 new

  code_python        pred= 192 draft= 163 acc= 136 rate=0.834 tok/s=136.5
  code_cpp           pred= 192 draft= 181 acc= 130 rate=0.718 tok/s=123.7
  explain_concept    pred= 192 draft= 181 acc= 130 rate=0.718 tok/s=123.2
  summarize          pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=129.4
  qa_factual         pred= 192 draft= 176 acc= 132 rate=0.750 tok/s=127.4
  translation        pred= 192 draft= 174 acc= 132 rate=0.759 tok/s=128.0
  creative_short     pred= 192 draft= 198 acc= 125 rate=0.631 tok/s=114.4
  stepwise_math      pred= 192 draft= 168 acc= 134 rate=0.798 tok/s=132.2
  long_code_review   pred= 192 draft= 195 acc= 126 rate=0.646 tok/s=115.2
  AGG  accept=0.733 draft=1607 acc=1178 pred=1728 wall=17.13s

Gemma4 master

  code_python        pred= 192 draft= 166 acc= 135 rate=0.813 tok/s=127.0
  code_cpp           pred= 192 draft= 168 acc= 135 rate=0.804 tok/s=126.7
  explain_concept    pred= 192 draft= 207 acc= 121 rate=0.585 tok/s=102.9
  summarize          pred= 192 draft= 167 acc= 135 rate=0.808 tok/s=126.8
  qa_factual         pred= 192 draft= 191 acc= 127 rate=0.665 tok/s=112.0
  translation        pred= 192 draft= 163 acc= 136 rate=0.834 tok/s=129.1
  creative_short     pred= 192 draft= 223 acc= 115 rate=0.516 tok/s=95.3
  stepwise_math      pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=118.9
  long_code_review   pred= 192 draft= 189 acc= 127 rate=0.672 tok/s=104.9
  AGG  accept=0.7034 draft=1652 acc=1162 pred=1728 wall=17.27s

Gemma4 new

  code_python        pred= 192 draft= 166 acc= 135 rate=0.813 tok/s=127.5
  code_cpp           pred= 192 draft= 168 acc= 135 rate=0.804 tok/s=127.1
  explain_concept    pred= 192 draft= 207 acc= 121 rate=0.585 tok/s=103.3
  summarize          pred= 192 draft= 167 acc= 135 rate=0.808 tok/s=127.3
  qa_factual         pred= 192 draft= 191 acc= 127 rate=0.665 tok/s=112.4
  translation        pred= 192 draft= 163 acc= 136 rate=0.834 tok/s=129.4
  creative_short     pred= 192 draft= 223 acc= 115 rate=0.516 tok/s=95.5
  stepwise_math      pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=119.2
  long_code_review   pred= 192 draft= 189 acc= 127 rate=0.672 tok/s=105.2
  AGG  accept=0.7034 draft=1652 acc=1162 pred=1728 wall=17.21s

📜 Correctness of Qwen3.6 and Gemma4

Qwen3.6 master

{
  "results": [
    {
      "name": "code_python",
      "text": "Here's a thinking process:\n\n1.  **Understand User Request:**\n   - **Task:** Write a Python function to return the n-th Fibonacci number.\n   - **Requirement:** Use memoization.\n   - **Requirement:** Include a docstring.\n\n2.  **Define Fibonacci Sequence:**\n   - F(0) = 0\n   - F(1) = 1\n   - F(n) = F(n-1) + F(n-2) for n >= 2\n   - Note: Sometimes F(1)=1, F(2)=1 is used, but\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "code_cpp",
      "text": "Here's a thinking process:\n\n1.  **Understand the User Request:**\n   - **Goal:** Write a C++ template function `clamp(x, lo, hi)`\n   - **Behavior:** Returns `x` clamped to the range `[lo, hi]` (i.e., if `x < lo` return `lo`, if `x > hi` return `hi`, else return `x`)\n   - **Constraint:** Do NOT use `std::clamp`\n   - **Language:** C++\n\n2.  **Identify Key Requirements for a Template Function:**\n   - Needs to\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "explain_concept",
      "text": "Here's a thinking process:\n\n1.  **Analyze User Input:**\n   - **Topic:** Speculative decoding in large language model (LLM) inference\n   - **Format:** Three short paragraphs\n   - **Goal:** Explain how it works clearly and concisely\n\n2.  **Deconstruct Speculative Decoding:**\n   - What is it? A technique to speed up LLM inference.\n   - Core idea: Use a smaller, faster \"draft\" model to propose multiple tokens, then verify them with the larger, slower \"target\" model in parallel.\n   - Process: \n    \n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "summarize",
      "text": "Here's a thinking process:\n\n1.  **Analyze User Input:**\n   - **Input Text:** \"The Industrial Revolution began in Britain in the late 18th century, transforming manufacturing through mechanization, steam power, and the factory system. It spread to continental Europe and North America during the 19th century.\"\n   - **Task:** Summarize in exactly two sentences.\n\n2.  **Identify Key Information:**\n   - Origin: Britain, late 18th century\n   - Core changes: Mechanization, steam power, factory system transformed manufacturing\n   - Spread: Continental\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "qa_factual",
      "text": "Here's a thinking process:\n\n1.  **Analyze User Input:**\n   - **Question:** \"What are the four fundamental forces of physics?\"\n   - **Expected Answer:** A clear, concise list of the four fundamental forces with brief explanations if needed.\n\n2.  **Identify Key Concepts:**\n   - The four fundamental forces in physics are:\n     1. Gravitational force\n     2. Electromagnetic force\n     3. Strong nuclear force (or strong interaction)\n     4. Weak nuclear force (or weak interaction)\n\n3.  **Verify Accuracy:**\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "translation",
      "text": "Here's a thinking process:\n\n1.  **Analyze the Input:**\n   - Source text: \"The quick brown fox jumps over the lazy dog.\"\n   - Target language: French\n   - Note: This is a famous pangram (contains every letter of the alphabet in English). In French, the equivalent pangram is \"Portez ce vieux whisky au juge blond qui fume.\" But the user specifically asked to translate the given English sentence, not to provide the French pangram equivalent.\n\n2.  **Identify Key Components:**\n   - The quick brown fox -> Le renard brun rapide / Le\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "creative_short",
      "text": "Thinking Process:\n\n1.  **Deconstruct the request:**\n    *   Topic: Old lighthouse.\n    *   Format: Four-line poem (quatrain).\n\n2.  **Brainstorming imagery and themes:**\n    *   *Old:* Weathered, stone, salt, time, standing guard, silent, gray, moss, cracks.\n    *   *Lighthouse:* Light, beam, sea, waves, ships, guiding, warning, tower, lens, rotation.\n    *   *Action:* Standing, watching, shining, enduring.\n\n3.  **Drafting - Attempt \n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "stepwise_math",
      "text": "Here's a thinking process:\n\n1.  **Understand the User's Request:**\n   - Problem: A train leaves station A at 60 km/h. Two hours later, a second train leaves the same station on the same track at 90 km/h. How long until the second train catches the first?\n   - Requirement: Solve step by step.\n\n2.  **Identify Key Variables & Concepts:**\n   - Train 1 speed: $v_1 = 60$ km/h\n   - Train 2 speed: $v_2 = 90$ km/h\n\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "long_code_review",
      "text": "Here's a thinking process:\n\n1.  **Understand User Request:**\n   - **Context:** Backend service with intermittent latency spikes in production.\n   - **Tech Stack:** Python FastAPI, Postgres 15, 4 replicas behind nginx load balancer.\n   - **Input:** The user mentions \"Below is the relevant code and a description of the system\" but *didn't actually provide any code*. I need to handle this gracefully.\n   - **Output Requirements:** Structured review with three sections:\n     1. Likely root causes ranked by probability\n     2. Concrete code/configuration\n====CONTENT====\n",
      "n": 128
    }
  ]
}

Qwen3.6 new

{
  "results": [
    {
      "name": "code_python",
      "text": "Here's a thinking process:\n\n1.  **Understand User Request:**\n   - **Task:** Write a Python function to return the n-th Fibonacci number.\n   - **Requirement:** Use memoization.\n   - **Requirement:** Include a docstring.\n\n2.  **Define Fibonacci Sequence:**\n   - F(0) = 0\n   - F(1) = 1\n   - F(n) = F(n-1) + F(n-2) for n >= 2\n   - Note: Sometimes F(1)=1, F(2)=1 is used, but\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "code_cpp",
      "text": "Here's a thinking process:\n\n1.  **Understand the User Request:**\n   - **Goal:** Write a C++ template function `clamp(x, lo, hi)`\n   - **Behavior:** Returns `x` clamped to the range `[lo, hi]` (i.e., if `x < lo` return `lo`, if `x > hi` return `hi`, else return `x`)\n   - **Constraint:** Do NOT use `std::clamp`\n   - **Language:** C++\n\n2.  **Identify Key Requirements for a Template Function:**\n   - Needs to\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "explain_concept",
      "text": "Here's a thinking process:\n\n1.  **Analyze User Input:**\n   - **Topic:** Speculative decoding in large language model (LLM) inference\n   - **Format:** Three short paragraphs\n   - **Goal:** Explain how it works clearly and concisely\n\n2.  **Deconstruct Speculative Decoding:**\n   - What is it? A technique to speed up LLM inference.\n   - Core idea: Use a smaller, faster \"draft\" model to propose multiple tokens, then verify them with the larger, slower \"target\" model in parallel.\n   - Process: \n    \n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "summarize",
      "text": "Here's a thinking process:\n\n1.  **Analyze User Input:**\n   - **Input Text:** \"The Industrial Revolution began in Britain in the late 18th century, transforming manufacturing through mechanization, steam power, and the factory system. It spread to continental Europe and North America during the 19th century.\"\n   - **Task:** Summarize in exactly two sentences.\n\n2.  **Identify Key Information:**\n   - Origin: Britain, late 18th century\n   - Core changes: Mechanization, steam power, factory system transformed manufacturing\n   - Spread: Continental\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "qa_factual",
      "text": "Here's a thinking process:\n\n1.  **Analyze User Input:**\n   - **Question:** \"What are the four fundamental forces of physics?\"\n   - **Expected Answer:** A clear, concise list of the four fundamental forces with brief explanations if needed.\n\n2.  **Identify Key Concepts:**\n   - The four fundamental forces in physics are:\n     1. Gravitational force\n     2. Electromagnetic force\n     3. Strong nuclear force (or strong interaction)\n     4. Weak nuclear force (or weak interaction)\n\n3.  **Verify Accuracy:**\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "translation",
      "text": "Here's a thinking process:\n\n1.  **Analyze the Input:**\n   - Source text: \"The quick brown fox jumps over the lazy dog.\"\n   - Target language: French\n   - Note: This is a famous pangram (contains every letter of the alphabet in English). In French, the equivalent pangram is \"Portez ce vieux whisky au juge blond qui fume.\" But the user specifically asked to translate the given English sentence, not to provide the French pangram equivalent.\n\n2.  **Identify Key Components:**\n   - The quick brown fox -> Le renard brun rapide / Le\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "creative_short",
      "text": "Thinking Process:\n\n1.  **Deconstruct the request:**\n    *   Topic: Old lighthouse.\n    *   Format: Four-line poem (quatrain).\n\n2.  **Brainstorming imagery and themes:**\n    *   *Old:* Weathered, stone, salt, time, standing guard, silent, gray, moss, cracks.\n    *   *Lighthouse:* Light, beam, sea, waves, ships, guiding, warning, tower, lens, rotation.\n    *   *Action:* Standing, watching, shining, enduring.\n\n3.  **Drafting - Attempt \n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "stepwise_math",
      "text": "Here's a thinking process:\n\n1.  **Understand the User's Request:**\n   - Problem: A train leaves station A at 60 km/h. Two hours later, a second train leaves the same station on the same track at 90 km/h. How long until the second train catches the first?\n   - Requirement: Solve step by step.\n\n2.  **Identify Key Variables & Concepts:**\n   - Train 1 speed: $v_1 = 60$ km/h\n   - Train 2 speed: $v_2 = 90$ km/h\n\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "long_code_review",
      "text": "Here's a thinking process:\n\n1.  **Understand User Request:**\n   - **Context:** Backend service with intermittent latency spikes in production.\n   - **Tech Stack:** Python FastAPI, Postgres 15, 4 replicas behind nginx load balancer.\n   - **Input:** The user mentions \"Below is the relevant code and a description of the system\" but *didn't actually provide any code*. I need to handle this gracefully.\n   - **Output Requirements:** Structured review with three sections:\n     1. Likely root causes ranked by probability\n     2. Concrete code/configuration\n====CONTENT====\n",
      "n": 128
    }
  ]
}

Gemma4 master

{
  "results": [
    {
      "name": "code_python",
      "text": "The objective is to write a Python function that calculates the $n$-th Fibonacci number using *memoization*.\n\n    *   $F(0) = 0$\n    *   $F(1) = 1$\n    *   $F(n) = F(n-1) + F(n-2)$ for $n > 1$.\n\n    *   Standard recursion is $O(2^n)$ because it recalculates the same values many times.\n    *   Memoization stores the results of expensive function calls and returns the cached result when the same inputs occur\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "code_cpp",
      "text": "*   Goal: Write a C++ template function `clamp(x, lo, hi)`.\n    *   Functionality: Return `x` if it's between `lo` and `hi` (inclusive). If `x < lo`, return `lo`. If `x > hi`, return `hi`.\n    *   Constraint: Do not use `std::clamp` (which was introduced in C++17).\n    *   Requirement: Must be a template function to handle different numeric types.\n\n    *   Input: `x` (value to clamp), `lo` (\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "explain_concept",
      "text": "*   Topic: Speculative Decoding in LLM inference.\n*   Constraint 1: Explain how it works.\n*   Constraint 2: Three short paragraphs.\n\n    *   *What is it?* A technique to speed up LLM generation.\n    *   *Why is it needed?* LLMs are memory-bound; generating one token at a time is slow because the whole model must be loaded for every single token.\n    *   *How does it work?* Use a small, fast \"draft\" model to predict several tokens ahead, then use the large \"target\"\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "summarize",
      "text": "*   Input text: \"The Industrial Revolution began in Britain in the late 18th century, transforming manufacturing through mechanization, steam power, and the factory system. It spread to continental Europe and North America during the 19th century.\"\n    *   Constraint: Summarize in exactly two sentences.\n\n    *   Origin: Britain, late 18th century.\n    *   Key changes: Mechanization, steam power, factory system (manufacturing transformation).\n    *   Expansion: Continental Europe and North America, 19th century.\n\n    *   *Draft 1:*\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "qa_factual",
      "text": "*   Question: \"What are the four fundamental forces of physics?\"\n    *   Goal: Provide a clear, accurate, and concise answer.\n\n    *   Gravity (Gravitational force)\n    *   Electromagnetism (Electromagnetic force)\n    *   Strong Nuclear Force (Strong interaction)\n    *   Weak Nuclear Force (Weak interaction)\n\n    *   *Gravity:* Attracts masses. Long range. Weakest.\n    *   *Electromagnetism:* Attracts/repels charges. Long range. Stronger than gravity.\n    *   *Strong Nuclear Force\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "translation",
      "text": "*   Source sentence: \"The quick brown fox jumps over the lazy dog.\"\n    *   Target language: French.\n    *   Context: This is a famous English pangram (contains every letter of the alphabet).\n\n    *   The: Le (masculine) / La (feminine)\n    *   quick: rapide (adj)\n    *   brown: brun / marron (adj)\n    *   fox: renard (masculine noun)\n    *   jumps: saute (verb *sauter*)\n    *   over: par-dessus\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "creative_short",
      "text": "*   Topic: An old lighthouse.\n    *   Format: Four-line poem (quatrain).\n\n    *   Lighthouse: beam, light, tower, stone, salt, spray, ocean, waves, guide, warning, rust, weathered, lonely, sentinel, night, storm.\n    *   Old: crumbling, ancient, faded, tired, timeless, forgotten.\n\n    A tower of stone by the crashing sea,\n    Guiding the ships so they can be free.\n    Its light is dim but it still glows bright,\n    Watching the ocean through the dark night.\n\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "stepwise_math",
      "text": "*   Train 1 (T1) speed: $60\\text{ km/h}$.\n    *   Train 2 (T2) speed: $90\\text{ km/h}$.\n    *   T2 starts $2\\text{ hours}$ after T1.\n    *   Goal: Find the time it takes for T2 to catch T1.\n\n    *   T1 has a head start of $2\\text{ hours}$.\n    *   Distance = $\\text{Speed} \\times \\text{Time}$.\n    *   Distance T1 traveled\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "long_code_review",
      "text": "Backend service with intermittent latency spikes.\nPython FastAPI, Postgres 15, 4 replicas, Nginx load balancer.\nStructured review with three sections:\n        1.  Likely root causes (ranked by probability).\n        2.  Concrete code/config changes.\n        3.  Telemetry to add.\nThe user *hasn't provided the code yet*.\n\n    *   Wait, the prompt says \"Below is the relevant code,\" but there is no code provided.\n    *   *Self-Correction:* I cannot analyze specific code that isn't there. However, as\n====CONTENT====\n",
      "n": 128
    }
  ]
}

Gemma4 new

{
  "results": [
    {
      "name": "code_python",
      "text": "The objective is to write a Python function that calculates the $n$-th Fibonacci number using *memoization*.\n\n    *   $F(0) = 0$\n    *   $F(1) = 1$\n    *   $F(n) = F(n-1) + F(n-2)$ for $n > 1$.\n\n    *   Standard recursion is $O(2^n)$ because it recalculates the same values many times.\n    *   Memoization stores the results of expensive function calls and returns the cached result when the same inputs occur\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "code_cpp",
      "text": "*   Goal: Write a C++ template function `clamp(x, lo, hi)`.\n    *   Functionality: Return `x` if it's between `lo` and `hi` (inclusive). If `x < lo`, return `lo`. If `x > hi`, return `hi`.\n    *   Constraint: Do not use `std::clamp` (which was introduced in C++17).\n    *   Requirement: Must be a template function to handle different numeric types.\n\n    *   Input: `x` (value to clamp), `lo` (\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "explain_concept",
      "text": "*   Topic: Speculative Decoding in LLM inference.\n*   Constraint 1: Explain how it works.\n*   Constraint 2: Three short paragraphs.\n\n    *   *What is it?* A technique to speed up LLM generation.\n    *   *Why is it needed?* LLMs are memory-bound; generating one token at a time is slow because the whole model must be loaded for every single token.\n    *   *How does it work?* Use a small, fast \"draft\" model to predict several tokens ahead, then use the large \"target\"\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "summarize",
      "text": "*   Input text: \"The Industrial Revolution began in Britain in the late 18th century, transforming manufacturing through mechanization, steam power, and the factory system. It spread to continental Europe and North America during the 19th century.\"\n    *   Constraint: Summarize in exactly two sentences.\n\n    *   Origin: Britain, late 18th century.\n    *   Key changes: Mechanization, steam power, factory system (manufacturing transformation).\n    *   Expansion: Continental Europe and North America, 19th century.\n\n    *   *Draft 1:*\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "qa_factual",
      "text": "*   Question: \"What are the four fundamental forces of physics?\"\n    *   Goal: Provide a clear, accurate, and concise answer.\n\n    *   Gravity (Gravitational force)\n    *   Electromagnetism (Electromagnetic force)\n    *   Strong Nuclear Force (Strong interaction)\n    *   Weak Nuclear Force (Weak interaction)\n\n    *   *Gravity:* Attracts masses. Long range. Weakest.\n    *   *Electromagnetism:* Attracts/repels charges. Long range. Stronger than gravity.\n    *   *Strong Nuclear Force\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "translation",
      "text": "*   Source sentence: \"The quick brown fox jumps over the lazy dog.\"\n    *   Target language: French.\n    *   Context: This is a famous English pangram (contains every letter of the alphabet).\n\n    *   The: Le (masculine) / La (feminine)\n    *   quick: rapide (adj)\n    *   brown: brun / marron (adj)\n    *   fox: renard (masculine noun)\n    *   jumps: saute (verb *sauter*)\n    *   over: par-dessus\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "creative_short",
      "text": "*   Topic: An old lighthouse.\n    *   Format: Four-line poem (quatrain).\n\n    *   Lighthouse: beam, light, tower, stone, salt, spray, ocean, waves, guide, warning, rust, weathered, lonely, sentinel, night, storm.\n    *   Old: crumbling, ancient, faded, tired, timeless, forgotten.\n\n    A tower of stone by the crashing sea,\n    Guiding the ships so they can be free.\n    Its light is dim but it still glows bright,\n    Watching the ocean through the dark night.\n\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "stepwise_math",
      "text": "*   Train 1 (T1) speed: $60\\text{ km/h}$.\n    *   Train 2 (T2) speed: $90\\text{ km/h}$.\n    *   T2 starts $2\\text{ hours}$ after T1.\n    *   Goal: Find the time it takes for T2 to catch T1.\n\n    *   T1 has a head start of $2\\text{ hours}$.\n    *   Distance = $\\text{Speed} \\times \\text{Time}$.\n    *   Distance T1 traveled\n====CONTENT====\n",
      "n": 128
    },
    {
      "name": "long_code_review",
      "text": "Backend service with intermittent latency spikes.\nPython FastAPI, Postgres 15, 4 replicas, Nginx load balancer.\nStructured review with three sections:\n        1.  Likely root causes (ranked by probability).\n        2.  Concrete code/config changes.\n        3.  Telemetry to add.\nThe user *hasn't provided the code yet*.\n\n    *   Wait, the prompt says \"Below is the relevant code,\" but there is no code provided.\n    *   *Self-Correction:* I cannot analyze specific code that isn't there. However, as\n====CONTENT====\n",
      "n": 128
    }
  ]
}

@forforever73

Contributor Author

@CISC could you take a look :)

@CISC

CISC commented Jun 14, 2026

Member

> @CISC could you take a look :)

LGTM, but @ggerganov should sign off on the API.

@ggerganov ggerganov left a comment

Member

Looks incorrect with multiple sequences.

@forforever73

Contributor Author

You're right. I've fixed it (matching the eagle3 pattern) and verified that single-sequence output is unchanged and that concurrent runs now hit 0 decode failures on both unified and non-unified caches.

Comment thread common/speculative.cpp
Comment on lines +1025 to +1050
auto * mem_dft = llama_get_memory(ctx_dft);

bool ok = true;
for (int head = 0; head < n_mtp_layers; ++head) {
    if (chain_heads) {
        for (llama_seq_id seq_id = 0; seq_id < (llama_seq_id) n_seq; ++seq_id) {
            if (i_batch_beg[seq_id] < 0) {
                continue;
            }
            llama_memory_seq_rm(mem_dft, seq_id, batch_in.pos[i_batch_beg[seq_id]], -1);
        }
        llama_set_nextn_layer_offset(ctx_dft, head);
    }

    const int32_t rc = llama_decode(ctx_dft, batch);
    if (rc != 0) {
        LOG_ERR("%s: llama_decode(ctx_dft) head=%d failed rc=%d (pos=%d)\n",
                __func__, head, (int) rc, (int) batch_in.pos[0]);
        ok = false;
        break;
    }
}

if (chain_heads) {
    llama_set_nextn_layer_offset(ctx_dft, 0); // restore default for non-draft decodes
}

Member

I don't understand the logic here - seems incorrect. Every head iteration will basically erase the result of the previous iteration.

Contributor Author

Each head runs a different layer: set_nextn_layer_offset(head) makes graph_mtp build layer n_layer()+head, i.e. 45/46/47, so each head writes its own k_l[il]/v_l[il].
seq_rm here doesn't drop any KV data; it just clears the cell metadata so that find_slot hands back the same cells at the same positions for every head. Without it, heads 46/47 would land on fresh cells and we'd get duplicate positions in v_cells.
So after the loop those cells hold valid KV for all three MTP layers at once. It's the teacher-forcing catch-up that seeds each head's layer so the next draft() round attends to a correct, target-aligned cache.
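
The cell-reuse argument above can be illustrated with a toy model. This is only a sketch under stated assumptions, not llama.cpp's actual cache: `ToyCache`, its `seq_rm`/`find_slot`/`decode` members, and `catch_up` are hypothetical names that mimic the described behavior (one shared array of cell metadata, separate storage per layer).

```cpp
#include <cassert>
#include <vector>

// Toy KV cache: one shared array of cell metadata (the position stored in each
// cell, -1 = free) and separate per-layer data. All names here are hypothetical.
struct ToyCache {
    std::vector<int> cell_pos;                 // metadata shared by all layers
    std::vector<std::vector<int>> layer_data;  // layer_data[layer][cell]

    ToyCache(int n_layers, int n_cells)
        : cell_pos(n_cells, -1), layer_data(n_layers, std::vector<int>(n_cells, 0)) {}

    // Clear metadata for positions >= p0; per-layer data is left untouched.
    void seq_rm(int p0) {
        for (int & pos : cell_pos) {
            if (pos >= p0) pos = -1;
        }
    }

    // Return the first free cell, or -1 if the cache is full.
    int find_slot() const {
        for (int i = 0; i < (int) cell_pos.size(); ++i) {
            if (cell_pos[i] == -1) return i;
        }
        return -1;
    }

    // "Decode" n tokens starting at pos0, writing only into `layer`.
    bool decode(int layer, int pos0, int n) {
        for (int t = 0; t < n; ++t) {
            const int cell = find_slot();
            if (cell < 0) return false;
            cell_pos[cell] = pos0 + t;
            layer_data[layer][cell] = 100 * (layer + 1) + pos0 + t; // stand-in for K/V
        }
        return true;
    }

    // Number of cells currently holding a position.
    int n_used() const {
        int n = 0;
        for (int pos : cell_pos) n += (pos != -1);
        return n;
    }
};

// Teacher-forced catch-up: every head decodes the same batch into its own layer.
inline bool catch_up(ToyCache & cache, int n_heads, int pos0, int n_tokens, bool clear_first) {
    for (int head = 0; head < n_heads; ++head) {
        if (clear_first) cache.seq_rm(pos0);
        if (!cache.decode(head, pos0, n_tokens)) return false;
    }
    return true;
}
```

With `clear_first = true`, three heads decoding four tokens occupy the same four cells and each layer keeps its own data. With `clear_first = false`, the same positions end up spread over twelve cells, which is the duplicate-position situation described above.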

Member

Ok, got it. That's interesting.

@ggerganov ggerganov self-assigned this Jun 16, 2026
@forforever73

Contributor Author

@ggerganov Hey, just a quick ping on this PR when you have a chance :)

@ggerganov

Member

Here is a minor patch I wanted to push, but don't have the permission:

diff --git a/common/speculative.cpp b/common/speculative.cpp
index fd0cf138f..d7a177b7b 100644
--- a/common/speculative.cpp
+++ b/common/speculative.cpp
@@ -1027,6 +1027,7 @@ struct common_speculative_impl_draft_mtp : public common_speculative_impl {
             bool ok = true;
             for (int head = 0; head < n_mtp_layers; ++head) {
                 if (chain_heads) {
+                    // ref: https://github.com/ggml-org/llama.cpp/pull/24340/changes#r3413498544
                     for (llama_seq_id seq_id = 0; seq_id < (llama_seq_id) n_seq; ++seq_id) {
                         if (i_batch_beg[seq_id] < 0) {
                             continue;
@@ -1837,7 +1838,7 @@ common_speculative * common_speculative_init(common_params_speculative & params,
 
         bool has_draft_simple = (enabled_configs & (1u << COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE));
         bool has_draft_eagle3 = (enabled_configs & (1u << COMMON_SPECULATIVE_TYPE_DRAFT_EAGLE3)) && params.draft.ctx_dft != nullptr;
-        bool has_mtp = (enabled_configs & (1u << COMMON_SPECULATIVE_TYPE_DRAFT_MTP)) && params.draft.ctx_dft != nullptr;
+        bool has_draft_mtp    = (enabled_configs & (1u << COMMON_SPECULATIVE_TYPE_DRAFT_MTP))    && params.draft.ctx_dft != nullptr;
 
 
 
@@ -1875,7 +1876,7 @@ common_speculative * common_speculative_init(common_params_speculative & params,
         if (has_draft_eagle3) {
             configs.push_back(common_speculative_config(COMMON_SPECULATIVE_TYPE_DRAFT_EAGLE3, params));
         }
-        if (has_mtp) {
+        if (has_draft_mtp) {
             configs.push_back(common_speculative_config(COMMON_SPECULATIVE_TYPE_DRAFT_MTP, params));
         }
     }

Comment thread common/speculative.cpp
Comment on lines +1185 to +1194
if (chain_heads) {
    chain_h[seq_id].insert(chain_h[seq_id].end(), h_row, h_row + n_embd);

    const int n_rows = (int) result.size() + 1; // id_last + tokens drafted so far
    for (int t = 0; t < n_rows; ++t) {
        const llama_token tok = (t == 0) ? dp.id_last : result[t - 1];
        common_batch_add(batch, tok, dp.n_past + t, { seq_id }, t == n_rows - 1);
        std::memcpy(batch.embd + (size_t) (batch.n_tokens - 1) * n_embd,
                    chain_h[seq_id].data() + (size_t) t * n_embd, row_bytes);
    }

Member

This seems incorrect - every next draft, we decode all previous tokens again. Why is that? Normally, we should decode just the latest token.

Contributor Author

You're right that every draft step re-decodes the whole prefix — that's intentional. This path isn't a normal AR draft; it's an accumulating-batch + per-step-seq_rm flow, because each step runs a different head.

After process(), every head's KV is already filled for the prompt + accepted prefix (positions < n_past), and the per-step seq_rm lower bound is n_past, so we never re-decode the prompt. What gets replayed each step is only the draft region — id_last plus the few tokens drafted so far (capped at n_max=3).

Why replay it under each head: since we switch heads per step, the current head hasn't written its own KV for any draft-region position yet this round. If we only decoded the latest token, its attention over the earlier draft positions would read cells that only another head ever wrote → garbage. So each step we seq_rm the draft region (so find_slot reuses the same slots and positions stay aligned), switch the head, and replay the accumulated prefix so this head fills its own KV.

For contrast, the other two branches reuse a single head across steps, so their prefix KV is already valid and they just append the latest token.
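
The replay argument can be checked with a small bookkeeping sketch. It is an illustration only: `DraftRound` and `run_step` are hypothetical names, and the model tracks nothing but which draft-region positions each head has written itself, assuming draft step `s` runs head `s`.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy bookkeeping for one draft round with chained heads. valid[h] is the set
// of draft-region positions for which head h has written its own KV.
struct DraftRound {
    int n_past;                        // positions < n_past were filled by process()
    std::vector<std::set<int>> valid;  // per-head valid draft-region positions
    std::vector<int> batch_sizes;      // tokens decoded at each draft step

    DraftRound(int n_heads, int n_past_) : n_past(n_past_), valid(n_heads) {}

    // Draft step s runs head s. With replay, it re-decodes id_last plus
    // everything drafted so far; without replay, only the newest token.
    // Returns true if head s has its own KV for every draft-region position
    // that the newest token attends to.
    bool run_step(int s, bool replay) {
        const int last  = n_past + s;               // position of the newest token
        const int first = replay ? n_past : last;
        for (int pos = first; pos <= last; ++pos) {
            valid[s].insert(pos);
        }
        batch_sizes.push_back(last - first + 1);
        for (int pos = n_past; pos <= last; ++pos) {
            if (valid[s].count(pos) == 0) return false;
        }
        return true;
    }
};
```

With replay and three heads, the per-step batch sizes are 1, 2 and 3 tokens, so the replay cost is bounded by the draft length. Without replay, the second step already fails the check because head 1 never wrote KV for the position of `id_last`.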

Member

Ah yes. I'm still not used to that approach, but it seems correct. Add a reference to this explanation in the code.

Contributor Author

Agreed, it's not intuitive; it took me quite a while to design this too :) Added a comment and committed the other suggestions as well.

Member

Add a reference:

diff --git a/common/speculative.cpp b/common/speculative.cpp
index d7a177b7b..f8a6287c2 100644
--- a/common/speculative.cpp
+++ b/common/speculative.cpp
@@ -1184,6 +1184,7 @@ struct common_speculative_impl_draft_mtp : public common_speculative_impl {
                 }
 
                 if (chain_heads) {
+                    // ref: https://github.com/ggml-org/llama.cpp/pull/24340#discussion_r3448031546
                     chain_h[seq_id].insert(chain_h[seq_id].end(), h_row, h_row + n_embd);
 
                     const int n_rows = (int) result.size() + 1; // id_last + tokens drafted so far

Comment thread common/speculative.cpp Outdated

std::vector<int> i_last(n_seq, -1);

std::vector<std::vector<float>> chain_h;

Member

Should be allocated and reserved once at construction time.
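
One way to follow this suggestion is sketched below. It is not the code that was merged: `ChainBuffers`, `reset` and `push_row` are hypothetical names, and sizing each buffer as `n_max * n_embd` floats is an assumption about the largest draft.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical holder for the per-sequence chained hidden states.
struct ChainBuffers {
    std::vector<std::vector<float>> chain_h;

    // Allocate one buffer per sequence and reserve room for the largest draft.
    ChainBuffers(int n_seq, int n_max, int n_embd) : chain_h(n_seq) {
        for (auto & h : chain_h) {
            h.reserve((size_t) n_max * n_embd);
        }
    }

    // Start a new draft round: clear() drops the elements but keeps the capacity.
    void reset() {
        for (auto & h : chain_h) {
            h.clear();
        }
    }

    // Append one hidden-state row of n_embd floats for the given sequence.
    void push_row(int seq_id, const float * h_row, int n_embd) {
        chain_h[seq_id].insert(chain_h[seq_id].end(), h_row, h_row + n_embd);
    }
};
```

Because `clear()` leaves the capacity untouched, repeated draft rounds append into the storage reserved in the constructor instead of reallocating on every call.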

Comment thread src/llama-graph.h Outdated
cparams.embeddings == other.cparams.embeddings &&
cparams.embeddings_nextn == other.cparams.embeddings_nextn &&
cparams.embeddings_nextn_masked == other.cparams.embeddings_nextn_masked &&
cparams.nextn_layer_offset == other.cparams.nextn_layer_offset &&

Member

This is correct, but effectively it will disable graph reuse during drafting. However, there isn't a better way to do it for now as we don't have a mechanism to do layer selection at compute time. It's something to think about in the future.
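
To make the cost concrete, here is a small sketch of the effect. `ToyGraphParams` and `count_builds` are hypothetical and far simpler than the real `llm_graph_params`; the sketch only assumes that a graph is rebuilt whenever the reuse check fails.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, trimmed-down stand-in for the graph parameters compared on reuse.
struct ToyGraphParams {
    bool    embeddings;
    int32_t nextn_layer_offset;

    // A different offset means a different layer is built, so the previous
    // graph cannot be reused.
    bool allow_reuse(const ToyGraphParams & other) const {
        if (nextn_layer_offset != other.nextn_layer_offset) {
            return false;
        }
        return embeddings == other.embeddings;
    }
};

// Count how many graphs have to be built for a sequence of decode calls.
inline int count_builds(const std::vector<int32_t> & offsets) {
    int n_builds = 0;
    bool have_prev = false;
    ToyGraphParams prev{};
    for (int32_t off : offsets) {
        const ToyGraphParams cur{true, off};
        if (!have_prev || !prev.allow_reuse(cur)) {
            ++n_builds;
        }
        prev = cur;
        have_prev = true;
    }
    return n_builds;
}
```

Four decodes at a fixed offset need a single build, while cycling the offset through 0, 1, 2 twice needs six, one per decode. That is the sense in which chaining heads disables graph reuse during drafting.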

Member

Add TODO to not forget:

diff --git a/src/llama-graph.h b/src/llama-graph.h
index d2a1b39d4..ac00d6cc6 100644
--- a/src/llama-graph.h
+++ b/src/llama-graph.h
@@ -682,11 +682,15 @@ struct llm_graph_params {
             }
         }
 
+        // TODO: https://github.com/ggml-org/llama.cpp/pull/24340#discussion_r3448035248
+        if (cparams.nextn_layer_offset != other.cparams.nextn_layer_offset) {
+            return false;
+        }
+
         return
             cparams.embeddings              == other.cparams.embeddings              &&
             cparams.embeddings_nextn        == other.cparams.embeddings_nextn        &&
             cparams.embeddings_nextn_masked == other.cparams.embeddings_nextn_masked &&
-            cparams.nextn_layer_offset      == other.cparams.nextn_layer_offset      &&
             cparams.causal_attn             == other.cparams.causal_attn             &&
             arch  == other.arch  &&
             gtype == other.gtype &&

@ggerganov ggerganov left a comment

Member

Not sure why the EditorConfig check is failing - seems like a false-positive. Should be good to merge.

@ggerganov ggerganov merged commit d789527 into ggml-org:master Jun 21, 2026
24 of 25 checks passed
@CISC

CISC commented Jun 21, 2026

Member

> Not sure why the EditorConfig check is failing - seems like a false-positive. Should be good to merge.

For some reason the line number is the would-be-merged line number, so sometimes it's a bit off.

@ggerganov

Member

> > Not sure why the EditorConfig check is failing - seems like a false-positive. Should be good to merge.
>
> For some reason the line number is the would-be-merged line number, so sometimes it's a bit off.

Yes, though I merged master into my working copy of this branch and it didn't result in any whitespace issues. Not sure what caused this.

@remeh

remeh commented Jun 22, 2026

Contributor

A big thank you for this PR, folks! I've gone from ~18 tok/s to ~30 tok/s on coding tasks on a Strix Halo (with ROCm, the IQ4_XS quant and --spec-draft-n-max 3). I'll see now if it still hallucinates typos with bigger contexts.
Thanks a bunch, @forforever73!


Labels

model Model specific


6 participants