Generalize the MTP draft path to support chain length K > 1, where each
chain step after the first conditions on the previous step's MTP block
output instead of the target's pre-output-norm hidden state.

Pieces:
- llama_graph_result gains t_mtp_out: the MTP block's post-FFN hidden
  state (pre-LM-head). qwen35_mtp's graph builder sets it.
- llama_context::get_t_mtp_out() exposes the value from the most recent
  decode.
- llama_mtp_relay_h_self(ctx_mtp, n_rows): on-device copy of the LAST
  n_rows of t_mtp_out into the FIRST n_rows of t_inp_h. Same machinery
  as llama_mtp_relay_h, except that the source is ctx_mtp itself.
- common_speculative_state_mtp::draft chains n_max calls. Step 0 relays
  from ctx_target's t_h_pre_norm (existing behavior). Steps 1..K-1
  self-relay from ctx_mtp's previous t_mtp_out. Each step takes the
  argmax of its logits and feeds the resulting token to the next step.
- accept(n_accepted) trims any rejected trailing draft positions from
  ctx_mtp's KV cache via seq_rm so that the next draft writes K/V at the
  right slots. It tracks last_n_drafted to know how many positions may
  need to be dropped.
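
The chain described above can be sketched in isolation as follows. This is a
toy model, not the code in this commit: hidden_t, mtp_step_out, mtp_step_fn,
draft_chain and n_to_trim are hypothetical names, the "model" is any callable
supplied by the caller, and the trim count is assumed to be
last_n_drafted - n_accepted.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; none of these are llama.cpp types.
using hidden_t = std::vector<float>;
using token_t  = int;

struct mtp_step_out {
    std::vector<float> logits; // one score per vocabulary entry
    hidden_t           h_out;  // post-FFN hidden state (the role t_mtp_out plays)
};

// One MTP forward pass: (input hidden state, previous token) -> logits + output hidden state.
using mtp_step_fn = std::function<mtp_step_out(const hidden_t &, token_t)>;

// Index of the largest logit (first one wins on ties).
static token_t argmax(const std::vector<float> & v) {
    assert(!v.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < v.size(); ++i) {
        if (v[i] > v[best]) {
            best = i;
        }
    }
    return static_cast<token_t>(best);
}

// Draft n_max tokens greedily.
// Step 0 conditions on the target's hidden state; every later step conditions
// on the previous step's own output hidden state (the "self-relay").
static std::vector<token_t> draft_chain(const mtp_step_fn & step,
                                        const hidden_t    & h_target,
                                        token_t             last_tok,
                                        int                 n_max) {
    std::vector<token_t> drafts;
    hidden_t h   = h_target;
    token_t  tok = last_tok;
    for (int k = 0; k < n_max; ++k) {
        mtp_step_out out = step(h, tok);
        tok = argmax(out.logits);  // greedy pick, fed to the next step
        drafts.push_back(tok);
        h = std::move(out.h_out);  // relay this step's output as the next input
    }
    return drafts;
}

// ASSUMPTION: the number of trailing draft positions to remove from the
// draft-side KV cache is simply the number of drafted tokens that were rejected.
static int n_to_trim(int last_n_drafted, int n_accepted) {
    assert(0 <= n_accepted && n_accepted <= last_n_drafted);
    return last_n_drafted - n_accepted;
}
```

With n_max = 1 the loop reduces to the existing single-step draft, so the
chain is a strict generalization of the K = 1 path.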

Smoke results on Qwen3.6-q8_0-mtp.gguf, --spec-draft-n-max 2:

  fibonacci: K=1 → 13.17 tok/s, 100% accept
             K=2 → 15.40 tok/s, 75% accept (12/16)
    K=2 wins (about 17% faster) because the prompt is highly canonical
    and even chain step 1 stays accepted most of the time.

  send_req:  K=1 → 11.44 tok/s, 83.9% accept (182/217)
             K=2 → 9.48 tok/s, 29.7% accept (148/499)
    K=2 loses (about 17% slower) on dense code: chain step 1 acceptance
    falls off a cliff because Qwen3.6's MTP head is trained
    one-step-ahead, and feeding it its own previous output is
    out-of-distribution (the FastMTP problem; also discussed in the
    DeepSeek V3 paper). The infrastructure works correctly; the model
    doesn't benefit without retraining.

Practical guidance: keep --spec-draft-n-max 1 for code/dense workloads.
K > 1 only helps either when the head was trained for chain prediction
(FastMTP-style) or when the workload is canonical enough that vanilla
self-rolling stays in-distribution.