mtp: stage h relay through set_input for graph-rebuild safety
Refactor the h relay path so the device-to-device copy from ctx_target's
t_h_pre_norm into ctx_mtp's t_inp_h happens during set_input on the next
decode, rather than immediately at the relay call. This prepares the
ground for batched MTP prompt prefill, where ctx_mtp has to switch
between n_tokens=N (prefill) and n_tokens=1 (chain step) graphs and the
old "copy now" path could not survive the rebuild — t_inp_h's tensor
identity changes when the graph rebuilds, but the relay had already
written to the prior graph's tensor.
Mechanism:
- llama_context gains an mtp_h_source_t staging slot (ctx_src + src
tensor + row range), set by llama_mtp_relay_h{,_self} and consumed
during set_inputs on the next decode.
- llm_graph_input_h_pre_norm now holds a llama_context* and reads the
staged source in its set_input. The actual
ggml_backend_tensor_copy_async call lives there (it synchronizes
ctx_src, builds row views with manually wired buffers, resolves the
backend for each side through the scheduler, then copies
asynchronously). After the copy the staging slot is cleared so a stray
decode without a fresh relay call doesn't replay stale data.
- llm_graph_params carries a llama_context* so the graph builder can
wire it onto the input class. graph_params() in llama-context.cpp
passes `this`.
- llama_mtp_relay_h gains an n_rows parameter (1 by default for
per-step drafting; N for an upcoming batched-prefill caller).
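
A minimal sketch of the stage-then-consume pattern described above, written
against toy types rather than the real llama.cpp/ggml structures. Every name
here (toy_tensor, toy_h_source, toy_context, relay_h, set_input) is
hypothetical and only mirrors the roles named in this message; a plain
std::copy stands in for the async device-to-device copy.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a device tensor: n_rows x n_cols floats, row-major.
struct toy_tensor {
    std::size_t n_rows;
    std::size_t n_cols;
    std::vector<float> data;

    toy_tensor(std::size_t rows, std::size_t cols)
        : n_rows(rows), n_cols(cols), data(rows * cols, 0.0f) {}
};

// Hypothetical analogue of the staging slot: source tensor plus row range.
struct toy_h_source {
    const toy_tensor * src = nullptr; // nullptr means "nothing staged"
    std::size_t row_begin = 0;
    std::size_t n_rows    = 0;
};

struct toy_context {
    toy_h_source staged;

    // Analogue of the relay call: only records what to copy, copies nothing.
    // The source tensor must stay alive until the next set_input call.
    void relay_h(const toy_tensor & src, std::size_t row_begin, std::size_t n_rows = 1) {
        staged.src       = &src;
        staged.row_begin = row_begin;
        staged.n_rows    = n_rows;
    }

    // Analogue of set_input: copies the staged rows into whichever input
    // tensor the current graph owns, then clears the slot.
    // Returns false when nothing was staged.
    bool set_input(toy_tensor & inp) {
        if (staged.src == nullptr) {
            return false;
        }
        const toy_tensor & src = *staged.src;
        assert(src.n_cols == inp.n_cols);
        assert(staged.row_begin + staged.n_rows <= src.n_rows);
        assert(staged.n_rows <= inp.n_rows);

        const auto first = src.data.begin()
            + static_cast<std::ptrdiff_t>(staged.row_begin * src.n_cols);
        const auto last = first
            + static_cast<std::ptrdiff_t>(staged.n_rows * src.n_cols);
        std::copy(first, last, inp.data.begin());

        staged = toy_h_source{}; // clear so a later call cannot replay stale rows
        return true;
    }
};
```

Because the destination is only chosen inside set_input, the input tensor can
be replaced between the relay call and the next decode without the copy
landing in the old one.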
No behavior change at K=1/K=2: relay still fires every draft step and
still copies the same rows. Verified with send_req on Qwen3.6-q8_0-mtp:
K=1: 88.2% accept (187/212), 12.0 tok/s (was 88%, 12.5 tok/s)
K=2: 85.7% accept (252/294), 16.2 tok/s (was 86%, 16.9 tok/s)
Within noise. The slight tok/s dip is the extra synchronize + view
allocation per set_input call; trivially recoverable later.
Why this matters: with the relay flowing through set_input, the next
commit can do batched MTP prompt prefill (single n_tokens=N decode)
followed by the existing single-token chain steps without the t_inp_h
identity gymnastics. That fixes the long-context issue where MTP's KV
currently holds only [BOS, draft_1, ..., draft_M] and MTP attention
cannot see prompt context, plus the position drift where MTP applies
RoPE at local positions 1..M+1 while the trunk is at absolute positions
N..N+M (for a 4K prompt those rotations diverge enough to wreck
attention quality).
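
To give a rough sense of the position drift, the sketch below computes the
standard RoPE rotation angle, pos * base^(-2i/d), for one dimension pair and
the gap between a local and an absolute position. The function names are
hypothetical, and the head dimension and base used in the usage note are
illustrative assumptions, not values taken from this commit or the model. It
only shows the size of the angle offset, not its effect on attention scores.

```cpp
#include <cassert>
#include <cmath>

// Standard RoPE angle for dimension pair `pair_idx` at position `pos`:
//   angle = pos * base^(-2 * pair_idx / head_dim)
double rope_angle(long pos, int pair_idx, int head_dim, double base) {
    assert(head_dim > 0);
    const double freq = std::pow(base, -2.0 * pair_idx / head_dim);
    return static_cast<double>(pos) * freq;
}

// Angle difference (radians) between rotating at the absolute position
// and rotating at the local position, for the same dimension pair.
double rope_angle_gap(long local_pos, long abs_pos,
                      int pair_idx, int head_dim, double base) {
    return rope_angle(abs_pos, pair_idx, head_dim, base)
         - rope_angle(local_pos, pair_idx, head_dim, base);
}
```

For example, with an assumed head_dim of 128 and base of 10000,
rope_angle_gap(1, 4096, 0, 128, 10000.0) gives the offset for the
fastest-rotating pair when local position 1 stands in for absolute
position 4096.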