spec: support eagle3 for qwen3.5 & 3.6 #24593
Conversation
|
GGUFs are in demand XD |
You can convert it using the |
|
How much faster is this method than MTP? |
|
I converted Ex0bit/Qwen3.6-27B-PRISM-EAGLE3 to GGUF and ran it with Qwen3.6-27B-Q4_K_M_MTP.gguf. I'm getting about 16 t/s, whereas the base is 22.5 t/s. I'll be trying rjmalagon/specdrift-qwen3.6-27b-eagle3 shortly. Update: couldn't convert rjmalagon/specdrift-qwen3.6-27b-eagle3 |
It’s hard to define “how much faster,” since speculative decoding speedups depend heavily on the use cases and prompts. Different models and methods perform better in different scenarios. We have introduced |
|
"specdrift-qwen3.6-27b-eagle3" is a different eagle3 model AFAIK, see https://vllm.ai/blog/2026-05-26-eagle-3-1 |
yes i just discovered that it won't convert and as of now i cannot find a compatible eagle3 for unsloth/Qwen3.6-27B-Q4_K_M_MTP.gguf |
|
Thanks, I gave this branch a try with the Ex0bit GGUF generated as per the EAGLE3 PR doc. Text generation crashes instantly, as reported in #24541. After patching, if I write something as a second turn, this crash occurs: |
|
On top of your branch yes |
ebf46af to fd50e23
fd50e23 to df3bc6d
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
|
@ggerganov Fixed, let me know if any other changes are needed :) |
|
Looks like all checks passed. @ggerganov @CISC |
|
Is this expected to work with -sm tensor like MTP now does? I get an assert on 2xMI50 0.22.383.898 W set_sampler: backend sampling not supported with SPLIT_MODE_TENSOR; using CPU Note that with -sm layer I do get a speed-up from about 16 t/s to 24 t/s or so with 2xMI50 and a Q8_K_M model + Q8 draft |
Check this comment #18039 (comment) |

Overview
Support third-party eagle3 draft models for Qwen3.5 & 3.6
Fix issue: #24541
Running steps:
Performance on DGX Spark with SpeedBench:
Deferred boundary in eagle3 across context checkpoints for hybrid models
Eagle's draft trails the target by one position (the input at P is (token[P+1], g_embd[P])), and g_embd from the eagle3 encoder is a transient target activation that is not stored in any KV cache. On recurrent/hybrid targets a checkpoint is single-position, so on restore the draft is at pos_max-1, the target resumes at pos_max+1, and g_embd[pos_max] is lost → the draft can't fill pos_max → llama_decode(ctx_dft) fails (rc=-1).

Solution: stash that one g_embd[pos_max] row with the checkpoint and restore it on load, so the existing bridge fills the boundary and the draft keeps full context.

It's the cheapest fix: recomputing g_embd needs an extra target decode and eagle3 encoder pass (and is impossible on a restored recurrent state), full re-processing re-runs the prefix, and re-seeding the draft loses context, whereas stashing one row (~20 KB/checkpoint, recurrent/hybrid only) adds no decode and no quality loss.

Example (pos_max = 13)
At checkpoint creation:
- Target has processed 0..13.
- Draft has processed 0..12, one behind, because decoding draft pos 13 needs token[14], which isn't available yet (the deferred boundary).
- g_embd[13] exists only as a transient activation at this moment.

On restore:
- Draft is at 12; target resumes at 14 (pos_max+1, since the checkpoint already holds 13).
- Reprocessing 14, 15, … produces g_embd[14], g_embd[15], …, but not g_embd[13] (13 isn't reprocessed, and it was never saved).
- The draft at 12 tries to decode 13: it has token[14] but not g_embd[13] → can't. So the draft batch jumps 12 → 14, leaving a hole at 13 → llama_decode(ctx_dft) returns rc=-1.
- Fix: stash g_embd[13] in the checkpoint, restore it, and the bridge decodes draft pos 13 from (token[14], g_embd[13]) → draft goes 12 → 13 → 14 …, contiguous, full context preserved.

Requirements
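The pos_max = 13 walkthrough in the deferred-boundary section can be traced with a small toy model. This is only a sketch of the bookkeeping, not llama.cpp code: the names Checkpoint, make_checkpoint and resume_draft are hypothetical, activations are stand-in strings, and a RuntimeError stands in for llama_decode(ctx_dft) returning rc=-1.

```python
# Toy model of the eagle3 deferred boundary. NOT llama.cpp code:
# every name below is hypothetical and exists only for illustration.
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Checkpoint:
    pos_max: int                      # last position held by the target state
    g_embd_row: Optional[str] = None  # stashed g_embd[pos_max], if any


def make_checkpoint(pos_max: int, stash: bool) -> Checkpoint:
    # g_embd[pos_max] is transient, so it survives only if stashed here.
    row = f"g_embd[{pos_max}]" if stash else None
    return Checkpoint(pos_max, row)


def resume_draft(ckpt: Checkpoint, n_new: int) -> List[int]:
    """Return the draft positions decoded after restoring `ckpt`.

    The target reprocesses `n_new` positions starting at pos_max + 1.
    Raises RuntimeError if the draft batch would skip a position.
    """
    draft_pos = ckpt.pos_max - 1      # draft trails the target by one
    first_new = ckpt.pos_max + 1      # target resumes here
    last = first_new + n_new - 1      # last token position known so far

    # Activations available after restore: reprocessed positions only...
    g_embd: Dict[int, str] = {
        p: f"g_embd[{p}]" for p in range(first_new, last + 1)
    }
    # ...plus the boundary row, if it was stashed with the checkpoint.
    if ckpt.g_embd_row is not None:
        g_embd[ckpt.pos_max] = ckpt.g_embd_row

    # Draft position p needs (token[p+1], g_embd[p]); token[p+1] is
    # known for p <= last - 1.
    ready = [p for p in range(ckpt.pos_max, last) if p in g_embd]

    expected = draft_pos + 1
    for p in ready:
        if p != expected:
            raise RuntimeError(
                f"hole at draft pos {expected}: batch jumps {draft_pos} -> {p}"
            )
        expected += 1
    return ready


if __name__ == "__main__":
    # With the stash, the draft fills 13 and continues contiguously.
    print(resume_draft(make_checkpoint(13, stash=True), n_new=2))
    # Without it, the batch would jump 12 -> 14 and the decode is rejected.
    try:
        resume_draft(make_checkpoint(13, stash=False), n_new=2)
    except RuntimeError as err:
        print(err)
```

With the row stashed the draft decodes positions 13 and 14 in order; without it the only decodable position is 14, which leaves the hole at 13 described above.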