mtp: prefix-cache WARM hit (perfect + partial via range-warm)#221
Open
dusterbloom wants to merge 12 commits into
Contributor
4 issues found across 34 files
0170c40 to
0d4a531
Ports the Qwen3.6 MTP head onto the qwen35 backbone (same arch, NextN
block at layer n_layer-1). Speculation runs through a new common chain
runner; the existing DFlashTarget adapter handles verify/snapshot/restore.
- common/mtp_interface.h: flavor-tagged IMtpModule + INativeMtp /
IExternalDrafterMtp mixins. Future Gemma4 drafter plugs in via
IExternalDrafterMtp without touching the chain runner.
- common/mtp_chain_runner.{h,cpp}: γ-chain propose/verify/accept loop,
hoisted out of the backend. Three KV-reconciliation paths
(accept-all / fast rollback / recommit) share a single post-iter
invariant so AR equivalence holds under recommit.
- common/mtp_orchestrator.{h,cpp}: chunked prefill + warm + dispatch
to chain runner. Owns only control flow; all compute lives in
DFlashTarget::verify_batch and INativeMtp::step_batch graphs on the
backend device.
- qwen36/qwen36_mtp.{h,cpp,_graph.cpp,_loader.cpp}: GGUF tensor
inventory for Qwen3.6 -MTP-GGUF, GPU warm graph, GPU step graph
cached on (head_idx, fa_window, fused_lm_head, topk_k). γ is bound
at attach time as the single source of truth.
- qwen35: supports_mtp()/mtp() exposed through ModelBackend;
generate() delegates to common::mtp::warm_and_decode when MTP is
configured. Cache sized for max(γ+1, ddtree_budget+1) verify tokens.
- server.py: --mtp-gguf and --mtp-gamma flags routed through; daemon
command surface unchanged.
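The propose/verify/accept step that the chain runner hoists out of the backend can be illustrated with a generic greedy speculative-decoding sketch. This is not the code in common/mtp_chain_runner.cpp: `accept_chain` and `accept_rate` are hypothetical names, and the sketch assumes greedy verification in which the target returns γ+1 argmax tokens for a γ-token draft, which is consistent with the cache being sized for at least γ+1 verify tokens.

```python
from typing import List, Tuple


def accept_chain(draft: List[int], target: List[int]) -> Tuple[int, List[int]]:
    """Greedy accept step for one speculation round (illustrative only).

    draft  -- the gamma tokens proposed by the MTP head chain
    target -- gamma + 1 argmax tokens from the verify pass; target[i] is
              the target model's choice given the context plus draft[:i]
    Returns (n_accepted, tokens_to_commit).
    """
    if len(target) != len(draft) + 1:
        raise ValueError("verify pass must return gamma + 1 tokens")

    n_accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n_accepted += 1

    # Commit the accepted draft prefix plus one token from the target:
    # the correction at the first mismatch, or the bonus token when the
    # whole chain was accepted.
    return n_accepted, target[: n_accepted + 1]


def accept_rate(accepted: int, proposed: int) -> float:
    """Fraction of proposed draft tokens that were accepted."""
    return accepted / proposed if proposed else 0.0
```

How the KV cache is reconciled after a partial accept (fast rollback versus recommit) is not modeled here; the sketch only shows which tokens a round commits.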
Tests: 4/4 test_common_mtp_orchestrator. Full build green; harness probe
7/7 (claude_code, codex, opencode, openwebui, pi, hermes, openclaw) at
--max-ctx 65536; MTP decode reports accept_rate 0.43-0.88 on short
agentic prompts.
Plugs MTP requests into the existing server.py prefix-cache protocol so
agent loops with repeated prefixes skip the cold prefill + warm path.
- common/prefix_snap.h: single source for the inline snap ack format
("[snap] inline slot=N cur_pos=M") that server.py's bus.await_reply
matches. DFlash do_prefill and the MTP orchestrator both funnel
through emit_inline_snap_ack so the format can't drift.
- mtp_orchestrator: chunked prefill clips to snap_at when the server
requests a mid-prompt cut (prepare_inline_snap picks the
second-to-last <|im_start|> boundary). Partial warm_head_kv fills
slots [1..snap_at], snapshot_save fires through ModelBackend, ack
emits. Full warm runs after to cover [1..prompt_len). End-of-prompt
snap supported but rarely picked by the server.
- snapshot_save captures h_{snap_pos-1} via target->last_hidden() and a
shape contract (γ / n_head_kv / n_ctx / n_embd). Restore rejects
mismatches before touching state.
- INativeMtp gains warm_head_kv_range(prompt, n_prompt, start_slot,
n_chunk, prefill_next, hiddens) — same warm graph as warm_head_kv,
with caller-controlled slot_start so partial-WARM restore can fill
[snap_pos..prompt_len] using pre_warm_hidden + delta_hiddens.
- restore_and_generate unifies perfect-WARM and partial-WARM under a
single tok/shape/hidden contract. Perfect-WARM (prompt_len ==
snap_pos): restore head_kv, no prefill. Partial-WARM (prompt_len >
snap_pos): restore head_kv, prefill delta (kv_offset=snap_pos),
range-warm [snap_pos..prompt_len]. Slot snap_pos is overwritten with
the new prompt[snap_pos] so cross-request first-after-cut divergence
cannot corrupt decode. Contract failures fall through to a
cold-restart fallback that discards the snapshot's head_kv and runs
full cold prefill — byte-correct against the chain runner's
[0..prompt_len) requirement.
- qwen36_mtp: thin head_kv snapshot ([0..head_kv_pos+1] slice instead
of full [key_len, n_ctx, n_head_kv]); ~10× payload reduction at
max_ctx=65536. Lazy GPU alloc grows head_kv tensors from 8K initial
to n_ctx_max on first warm that needs more (saves ~256 MiB VRAM on a
daemon serving short prompts). F16 dtype guard on snapshot+restore.
- qwen35_backend: head_kv_warm_ flag gates snapshot_save's MTP capture
so a pre-warm snapshot can't round-trip as valid. init_mtp_ passes
cfg_.device.max_ctx to the module so head_kv tracks the backbone
(previous hardcoded 8192 broke any prompt > 8K).
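The inline snap ack format quoted in the first bullet is simple enough to pin down with a round-trip sketch. The helper names below are hypothetical Python mirrors, not the contents of common/prefix_snap.h or of server.py's matcher; only the literal `[snap] inline slot=N cur_pos=M` layout comes from the text above.

```python
import re
from typing import Optional, Tuple

# Layout taken from the commit message: "[snap] inline slot=N cur_pos=M".
_SNAP_ACK = re.compile(r"^\[snap\] inline slot=(\d+) cur_pos=(\d+)$")


def format_inline_snap_ack(slot: int, cur_pos: int) -> str:
    """Render the ack line (hypothetical counterpart of emit_inline_snap_ack)."""
    return f"[snap] inline slot={slot} cur_pos={cur_pos}"


def parse_inline_snap_ack(line: str) -> Optional[Tuple[int, int]]:
    """Return (slot, cur_pos) if the line is an inline snap ack, else None."""
    m = _SNAP_ACK.match(line.strip())
    return (int(m.group(1)), int(m.group(2))) if m else None
```

Keeping the formatter and the matcher next to each other is one way to get the "can't drift" property the commit describes: any change to the layout breaks the round trip immediately.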
Tests: 4/4 test_prefix_cache_mtp (perfect bit-equal round-trip, prefill
mismatch fallback, int32 boundary, shape-contract rejection) + 4/4
test_common_mtp_orchestrator. Full build green. Harness probe 7/7 at
--max-ctx 65536; zero chain failures / ack-missing / capacity overflows.
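The restore cases exercised by these tests (perfect-WARM, partial-WARM, and the cold-restart fallback on a contract failure) can be summarized as a small decision function. This is a sketch under stated assumptions: `ShapeContract` and `plan_restore` are hypothetical names, only the shape part of the tok/shape/hidden contract is modeled, and treating a prompt shorter than the snapshot as a cold restart is an assumption the source does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShapeContract:
    """Fields named in the commit message: gamma / n_head_kv / n_ctx / n_embd."""
    gamma: int
    n_head_kv: int
    n_ctx: int
    n_embd: int


def plan_restore(snap: ShapeContract, live: ShapeContract,
                 snap_pos: int, prompt_len: int) -> str:
    """Classify a restore request as 'perfect', 'partial', or 'cold'."""
    if snap != live:
        # Shape mismatch: reject before touching state, then cold prefill.
        return "cold"
    if prompt_len < snap_pos:
        # Assumption: a prompt shorter than the snapshot cannot reuse it.
        return "cold"
    if prompt_len == snap_pos:
        # Restore head_kv, no prefill.
        return "perfect"
    # Restore head_kv, prefill the delta from kv_offset=snap_pos,
    # then range-warm [snap_pos..prompt_len].
    return "partial"
```

The token and hidden-state checks that the real contract also covers would slot in as additional early returns to "cold".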
R3 integration bench (dflash/bench/bench_mtp_warm_hit_2turn.py) drives
a 2-turn warm-hit gate against a live daemon: prefill_ratio < 0.3 +
bit-equal first 16 tokens between turns.
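A minimal sketch of such a gate follows. It is not the contents of bench_mtp_warm_hit_2turn.py: the function name is hypothetical, and defining `prefill_ratio` as turn-2 prefill time divided by turn-1 prefill time is an assumption, since the commit message does not spell out the definition.

```python
from typing import Sequence


def warm_hit_gate(prefill_s_turn1: float, prefill_s_turn2: float,
                  tokens_turn1: Sequence[int], tokens_turn2: Sequence[int],
                  max_ratio: float = 0.3, n_compare: int = 16) -> bool:
    """Pass when turn 2 prefills much faster and starts with the same tokens."""
    if prefill_s_turn1 <= 0:
        raise ValueError("turn-1 prefill time must be positive")

    # Assumed definition: warm prefill time relative to cold prefill time.
    prefill_ratio = prefill_s_turn2 / prefill_s_turn1

    # Compare the first n_compare generated tokens of each turn exactly.
    same_head = list(tokens_turn1[:n_compare]) == list(tokens_turn2[:n_compare])

    return prefill_ratio < max_ratio and same_head
```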
`n_layer = block_count - nextn_predict_layers` is correct for backbone-graph
iteration and the divisibility check, but `plan.layer_end` was also defaulting
to this reduced value — silently filtering blk.{n_layer}.* out of the GPU
load. The MTP loader's `find_tensor(meta_ctx, ...)` then resolved the
descriptor with `data==nullptr` and failed with "14 required NextN tensor(s)
missing".
Fix: default `plan.layer_end` to `n_block_raw` so MTP head blocks are loaded
alongside backbone. Validation upper-bound widened to match. No-op for
non-MTP GGUFs where `nextn_predict_layers=0` (n_block_raw == n_layer).
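The effect of the default can be reproduced with a toy layer filter. This is an illustration, not the loader's code: `tensors_in_plan` is a hypothetical helper, the block count is made up, and the tensor names are only examples of the `blk.{i}.*` pattern.

```python
import re
from typing import List

_BLK = re.compile(r"^blk\.(\d+)\.")


def tensors_in_plan(names: List[str], layer_end: int) -> List[str]:
    """Keep non-block tensors and blocks whose index is below layer_end."""
    kept = []
    for name in names:
        m = _BLK.match(name)
        if m is None or int(m.group(1)) < layer_end:
            kept.append(name)
    return kept


block_count = 5                  # illustrative n_block_raw
nextn_predict_layers = 1
n_layer = block_count - nextn_predict_layers   # 4 backbone layers

names = [f"blk.{i}.attn_q.weight" for i in range(block_count)]
names.append("blk.4.nextn.eh_proj.weight")     # example NextN tensor name

old_plan = tensors_in_plan(names, layer_end=n_layer)       # before the fix
new_plan = tensors_in_plan(names, layer_end=block_count)   # after the fix
```

With `layer_end = n_layer` every `blk.4.*` tensor is dropped, which matches the "required NextN tensor(s) missing" symptom; with `layer_end = block_count` they are kept, and the two plans are identical whenever `nextn_predict_layers` is 0.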
`n_last_chunk = committed % PREFILL_UBATCH` only equals the last prefill chunk's actual size when prefill started at kv_offset=0. With prefix-cache partial restore, `restore_and_generate` runs delta-prefill from kv_offset>0, so the last chunk's `n_tokens` is `prompt_len - kv_offset`, not the modulo of `committed` over PREFILL_UBATCH. The read offset was then larger than sg_.argmax_tokens->ne[0], firing the "tensor read out of bounds" assert on the first DFlash spec-decode request against any prompt the cache had already seen. Read the actual last-chunk size from sg_.argmax_tokens->ne[0], which the graph builder sized to match the bound chunk. No-op when kv_offset==0 (`committed % UBATCH == ne[0]`).
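A short worked example makes the mismatch concrete. The sketch assumes `PREFILL_UBATCH = 512` purely for illustration (the real constant is not stated here), takes `committed` to equal `prompt_len` after prefill, and uses hypothetical function names.

```python
PREFILL_UBATCH = 512  # assumed value, for illustration only


def last_chunk_by_modulo(committed: int, ubatch: int = PREFILL_UBATCH) -> int:
    """The old computation: only valid when prefill started at kv_offset=0."""
    return committed % ubatch


def last_chunk_actual(prompt_len: int, kv_offset: int,
                      ubatch: int = PREFILL_UBATCH) -> int:
    """Size of the final chunk when prefilling prompt[kv_offset:prompt_len]."""
    delta = prompt_len - kv_offset
    if delta <= 0:
        raise ValueError("nothing to prefill")
    rem = delta % ubatch
    return rem if rem else ubatch


# Cold prefill of a 1500-token prompt: both computations agree.
cold_old = last_chunk_by_modulo(1500)          # 1500 % 512
cold_new = last_chunk_actual(1500, 0)

# Partial restore with 1400 tokens already cached: only 100 tokens are
# prefilled, but the modulo still reports the cold-path value.
warm_old = last_chunk_by_modulo(1500)
warm_new = last_chunk_actual(1500, 1400)
```

In the warm case the old value (476) exceeds the real chunk size (100), so any read indexed by it runs past a buffer sized for the bound chunk, which is the out-of-bounds assert described above.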
0d4a531 to
5e7594c
When --prefill-skip-park is set, compress_text_via_daemon correctly skips its own `park target` / `park draft` sends, but the C++ handle_compress parses an independent `nopark` trailing token and parks target+draft itself when it's absent. The two paths were out of sync: Python honored skip_park but the daemon ignored it.
This is a no-op for the PFlash-only path (unpark target rebinds nothing that has stale references). But the MTP path holds tensor pointers into the backbone's ggml_context across requests: the internal park frees those tensors, and the immediate unpark rebuilds the context with new addresses, which makes the MTP graph crash with GGML_ASSERT(ggml_can_repeat(b, a)) on the next forward.
Pass " nopark" through to the C++ command when skip_park is true, so both layers agree.
Extends yesterday's 2026-05-17_f031f08 matrix with the workload classes that
matrix didn't cover:
• Agent suite (2k/8k/24k buckets) across 4 configs (MTP+DFlash × q8/tq3 KV)
plus the stacked PFlash+MTP+TQ3 path. MTP wins agent (53.98) by 4-14%
over DFlash because the small drafter's accept collapses from 70-90% on
code/math to 28-29% on chat/tool-use prompts; MTP's accept stays at 0.69.
• PFlash + MTP + TQ3 stack verified end-to-end for the first time:
36K NIAH passes in 22.8s wall (20.3× compression, decode 52 tok/s, needle
recall correct) on a single 3090. All 7 OpenAI-compatible clients
(claude_code, codex, hermes, openclaw, openwebui, opencode, pi) pass
every probe against the stacked server.
• he/gsm/math reproduced under today's branch — DFlash AL is unchanged
from f031f08 (speculator is healthy); absolute tok/s are lower because
today's bench harness wraps the full streaming HTTP response while
f031f08's parsed the daemon's internal tok/s timer.
Three branch fixes that made the matrix runnable:
- 230c303 MTP loader: include NextN block in GPU load by default
- 5e7594c do_spec_decode: use sg_.argmax_tokens->ne[0] for last-chunk size
- af05a23 prefill_hook: propagate skip_park to daemon compress (MTP+PFlash)
Raw per-suite JSONs included alongside summary.md.
…fter-arch protocol
Option A from thoughts/2026-05-19/dual-speculator-architecture.md.
Backend keeps BOTH speculators resident when --draft and --mtp-gguf are
both provided. GenerateRequest grows a `speculator` field (string):
- "dflash" forces the DFlash drafter + DDTree spec-decode path
- "mtp" forces the MTP γ-chain native-heads path
- "auto" (default) selects based on prompt size
Rationale for the auto threshold (4096 tokens), pulled from today's bench:
- DFlash beats MTP 1.75–2.6× on code/math under ~16K ctx (HE 173 vs 64,
Math 115 vs 61, GSM 102 vs 59 tok/s on bench_matrix)
- MTP beats DFlash on agent prompts (DFlash drafter accept rate 0.29
vs MTP 0.69)
- MTP beats DFlash 2.4–5.7× on PFlash-compressed long context (DFlash
drafter accept collapses to 0.14–0.21 on gapped compressed sequences;
MTP heads consume backbone hidden states and are unaffected)
VRAM on a 24 GB 3090 with dual-load: ~19.9 GB (15.3 target + 1 DFlash
drafter + 0.5 MTP heads + 2 KV TQ3 + 1 activations). Fits.
Boot banner now prints `[speculators] dual-mode: DFlash+MTP both loaded
(auto_select threshold=4096 tokens)` when both are configured.
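The routing rule for the `speculator` field can be sketched as a small function. The names are hypothetical and this is not the backend's implementation; in particular, sending a prompt of exactly 4096 tokens to MTP is an assumption, since the text only describes behavior below and above the threshold.

```python
AUTO_THRESHOLD_TOKENS = 4096  # auto_select threshold from the boot banner


def select_speculator(requested: str, n_prompt_tokens: int,
                      threshold: int = AUTO_THRESHOLD_TOKENS) -> str:
    """Resolve a request's `speculator` field to 'dflash' or 'mtp'."""
    if requested in ("dflash", "mtp"):
        # Explicit values force the corresponding path.
        return requested
    if requested != "auto":
        raise ValueError(f"unknown speculator: {requested!r}")
    # Assumption: the boundary case (== threshold) is routed to MTP.
    return "dflash" if n_prompt_tokens < threshold else "mtp"
```

Short prompts go to DFlash because of its code/math advantage at small context, while longer prompts go to MTP, matching the rationale listed above.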
Same commit also extends the PFlash compress protocol to carry an
optional drafter_arch token so the server can route `qwen3-0.6b` vs
`qwen35-0.8b` per request. New CLI flag --prefill-drafter-arch on the
Python side; daemon's handle_compress parses an optional 4th positional
token between drafter_gguf and the trailing "nopark" marker. Backward
compatible: omitting the token defaults to qwen3-0.6b, which matches the
previous hard-coded behavior.
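The backward-compatible tail of the compress command (an optional drafter_arch token followed by an optional trailing "nopark") can be sketched as below. This is a Python illustration of the parsing rule, not the daemon's C++ handle_compress; `parse_compress_tail` is a hypothetical name, and the tokens before drafter_gguf are treated as opaque because their layout is not given here.

```python
from typing import List, Tuple

DEFAULT_DRAFTER_ARCH = "qwen3-0.6b"  # default named in the commit message


def parse_compress_tail(tail: List[str]) -> Tuple[str, bool]:
    """Parse the tokens that follow drafter_gguf in a compress command.

    Returns (drafter_arch, nopark).
    """
    tokens = list(tail)

    # "nopark" is only recognized as the trailing marker.
    nopark = bool(tokens) and tokens[-1] == "nopark"
    if nopark:
        tokens.pop()

    if len(tokens) > 1:
        raise ValueError("unexpected extra tokens in compress command")

    # Omitting the arch token keeps the previous hard-coded behavior.
    arch = tokens[0] if tokens else DEFAULT_DRAFTER_ARCH
    return arch, nopark
```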
Smoke tested:
- speculator=dflash → DFlash path confirmed via [spec-decode] log line
- speculator=mtp → MTP path via [mtp_decode] log line
- speculator=auto on 16-token prompt → routed to DFlash (< 4096)
- speculator=auto on 5415-token prompt → routed to MTP (> 4096)
- All 9 existing MTP test binaries still pass
bench_matrix.py + the matrix/ subpackage were added in dflash commit
f031f08 (bench: matrix orchestrator + power sweep + DFlash optimality
audit) but landed on a feature branch that never made it to main, so
they're absent from our HEAD. Restored verbatim from f031f08 (with the
HumanEval/GSM8K/Math500 workloads from follow-up commit 59cd0fa) so
today's apples-to-apples re-runs against yesterday's matrix have an
in-tree harness.
This is the orchestrator that drives test_dflash directly (positional
args, no server), parses the daemon's own decode-only tok/s figure
([dflash] generated N tokens in T s -> X tok/s), and emits per-cell
JSON artifacts with bootstrap CI 95% (1000 resamples, seed=42).
Subsumes the older scattered bench scripts (bench_agent.py,
bench_agent_mtp.py, bench_llm.py) for cross-comparison.
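For readers unfamiliar with the statistic, a 95% bootstrap confidence interval with 1000 resamples and a fixed seed can be computed as follows. The commit message does not say which bootstrap variant bench_matrix.py uses; this sketch uses the percentile method, and `bootstrap_ci95` is a hypothetical name.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_ci95(samples: Sequence[float], n_resamples: int = 1000,
                   seed: int = 42) -> Tuple[float, float, float]:
    """Return (sample_mean, ci_low, ci_high) via the percentile bootstrap."""
    if not samples:
        raise ValueError("need at least one sample")

    rng = random.Random(seed)      # fixed seed makes the interval reproducible
    n = len(samples)

    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(samples, k=n)) for _ in range(n_resamples))

    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return mean(samples), lo, hi
```

Fixing the seed means two runs over the same per-run tok/s figures report the same interval, which keeps matrix artifacts comparable across re-runs.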
Layout:
bench_matrix.py — entry point
render_matrix.py — markdown summary writer
matrix/workload.py — Workload base
matrix/speculator.py — Speculator base
matrix/speculators/{ar,dflash,mtp}.py
matrix/workloads/{humaneval,gsm8k,math500,swe_bench}.py
Usage:
python3 dflash/scripts/bench_matrix.py \
--workloads humaneval,gsm8k,math500 \
--speculators ar,dflash_b22,mtp_d3 \
--n-gen 256 --n-runs 8 --n-sample 8
…ary update
Three new run directories plus a yesterday-vs-today comparison snapshot:
2026-05-17T17-40-56_f031f08/ — preserved yesterday's reference matrix
(n_sample=8, n_runs=8, bootstrap CI 95%). DFlash b22: HE 169.40,
GSM 104.32, Math 119.36 tok/s. MTP d3: 65.62 / 61.00 / 61.89. AR
baseline ~34 tok/s across suites. Was untracked in tree because the
bench_matrix orchestrator landed on a stale branch.
2026-05-19T11-43-13_83e19d9/ — first matrix re-run on HEAD (HE only,
MTP_GGUF env unset → mtp_d3 cell empty). DFlash b22: 173.81 tok/s.
Confirms no DFlash kernel regression vs f031f08.
2026-05-19T11-54-32_83e19d9/ — full apples-to-apples re-run on HEAD
with MTP_GGUF set. Result: all 9 cells (3 suites × {AR, DFlash b22,
MTP d3}) within ±5% of f031f08 mean tok/s. DFlash HE +2.6%, MTP HE
−2.0%, etc. No regression.
2026-05-19_mtp-prefix-warm-ghost/summary.md — updated with the
apples-to-apples table above, an agent bucket-label-vs-actual-token
audit (agent_24k prompts are actually ~2.6K), a known-gaps section
documenting what is NOT yet tested (real CLI agentic loops, NIAH
> 131K, concurrent sessions, sustained throughput, PR Luce-Org#195 merge).
…resume plan
Three working docs that capture what we learned shipping Option A:
dual-speculator-architecture.md — full pipeline diagrams (PFlash →
target prefill → auto_select → DFlash/MTP → verify → stream),
component cost matrix on RTX 3090, workload-to-config cheat sheet,
synergy/conflict grid (the one real conflict is DFlash drafter
accept collapsing on PFlash-compressed prompts), recommended 3090
default for agentic coding via hermes/opencode/pi.
pflash-drafter-unification-plan.md — experiment plan for "can we
use a smaller / different / shared PFlash drafter." Today's E1
swap to Qwen3.5-0.8B blocked on two loader gaps (DFlash-draft
variant strips lm_head; full base GGUF has tied embeddings the
target loader rejects). Documents the structural finding that a
DFlash draft and a PFlash scorer can't be the same file by
construction.
dual-monster-resume-plan.md — milestone-based roadmap (M1..M8)
organising what's committed, what's uncommitted, what's tested,
and what to ship next. M1 (this commit set) → M2 real CLI
sessions → M3 PR → M4 mid-stream Option B → ...
Contributor
1 issue found across 48 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/scripts/bench_matrix.py">
<violation number="1" location="dflash/scripts/bench_matrix.py:446">
P2: --no-ar-cache flag prevents speedup computation for all speculators despite help text promising per-pair AR re-runs</violation>
</file>
Contributor
P2: --no-ar-cache flag prevents speedup computation for all speculators despite help text promising per-pair AR re-runs
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/bench_matrix.py, line 446:
<comment>--no-ar-cache flag prevents speedup computation for all speculators despite help text promising per-pair AR re-runs</comment>
<file context>
@@ -0,0 +1,468 @@
+ ar_cache=ar_cache.get(wname),
+ tmpdir=tmpdir,
+ )
+ ar_res = ar_cache.get(wname) if not args.no_ar_cache else None
+ ap = _write_artifact(
+ run_dir, wl, spec, meta,
</file context>
Reproducible per-CLI setup for driving the dual-speculator server (feat/mtp-prefix-warm-ghost) from real client binaries instead of synthetic POSTs. Tested 2026-05-19 against the server running on http://127.0.0.1:18080 with PFlash + MTP γ=3 + DFlash b=22 + Q8 KV + prefix-cache.
Per-CLI working configurations:
- claude: --bare is mandatory; without it OAuth keychain wins and ANTHROPIC_BASE_URL is ignored. Set ANTHROPIC_AUTH_TOKEN, ANTHROPIC_MODEL, DISABLE_AUTOUPDATER, DISABLE_TELEMETRY. Real 3-turn session confirmed auto-route DFlash → MTP crossing the 4096-token threshold mid-conversation; 1815-token long response streamed at 49.3 tok/s via MTP.
- codex: CODEX_HOME must be outside /tmp (codex refuses to write helper binaries there). wire_api = "responses" in config.toml (chat is deprecated). Falls back to chat endpoint automatically when responses is unavailable.
- hermes: context_length: 65536 required (hermes refuses < 64K). Server must launch with --max-ctx 65536 to match. config.yaml + .env in an isolated HOME dir; provider named "lucebox" with api_mode "chat_completions".
- pi: Existing ~/.pi/agent/models.json had lucebox pre-wired but pointing at stale port 8000. Patch to port 18080 + api "openai-completions" (NOT openai-chat; pi doesn't register that name). Pi's --provider lucebox path then works directly with --mode text.
- opencode: Old install (0.5.x on Node 22) silently failed at provider load. Fix: nvm use --lts (Node 24) then npm install -g opencode-ai (1.15.5+). Provider lucebox registered in ~/.config/opencode/opencode.json with @ai-sdk/openai-compatible. Provider config is loaded (opencode models lucebox prints the model) but the parallel title+main request pattern stalls the second request due to a server-side daemon stdin serialization limit (orthogonal to wiring).
Includes server-log signatures so users can verify each CLI is hitting the right speculator path: [generate] speculator=dflash|mtp lines and [spec-decode] / [mtp_decode] accept rate lines.
Engineering memo on porting sapientinc/HRM-Text-1B as a Luce target. HRM is a dual-timescale recurrent transformer (1B params, two stacks H and L iterated H_cycles × (L_cycles+1) = 8 passes per token, with state injection z_L + z_H). Different forward contract from anything in luce today: it needs a new graph builder, 128 effective KV cache slots (16 layers × 8 invocations), embedding scaling, prefix-LM bidirectional mask.
Bottom line: don't port now. ~6 engineer-days for AR-only HRM via a new qwen35-style backend; spec decode (DFlash drafter, MTP heads) is multi-week + needs training compute since no aligned drafter exists. Pre-alignment checkpoint, useless for agentic coding without SFT.
The interesting cross-pollination is the OPPOSITE direction: use HRM as a SPECULATOR for a larger target. The dual-timescale recurrence may produce drafts that align with target hidden-state evolution better than a single-pass Qwen3-0.6B does. Worth a separate research spike.
Model is downloaded to /home/peppi/models/hrm-text-1b (2.3 GB safetensors + custom modeling_hrm_text.py + tokenizer). Memo includes a 12-line reproducer that runs HRM via transformers main branch for evaluation without any luce work.
Stacks on top of #214 (feat/mtp-via-daemon). The two new commits in this PR are:
- 8409b01 mtp: native-heads MTP speculator (Qwen3.6 NextN, γ-chain): identical to the head of #214 (feat(mtp): MTP-via-daemon end-to-end (incl. MTP infrastructure)); merges via #214.
- 0170c40 mtp: prefix-cache WARM hit (perfect + partial via range-warm): the new work in this PR. After #214 merges, this PR's effective diff collapses to just this commit.
What's in the WARM-hit commit
Plugs MTP requests into the existing `server.py` prefix-cache protocol so agent loops with repeated prefixes skip the cold prefill + warm path.
- Inline snap ack format (`common/prefix_snap.h`): single source for `[snap] inline slot=N cur_pos=M`. DFlash `do_prefill` and the MTP orchestrator both funnel through `emit_inline_snap_ack` so the format can't drift.
- Chunked prefill clips to `snap_at` (server's `prepare_inline_snap` picks the second-to-last `<|im_start|>` boundary, almost always mid-prompt). Partial `warm_head_kv` fills slots `[1..snap_at]`, `snapshot_save` fires through `ModelBackend`, ack emits. Full warm runs after to cover `[1..prompt_len)`.
- `snapshot_save` captures `h_{snap_pos-1}` via `target->last_hidden()` + a shape contract (γ / n_head_kv / n_ctx / n_embd). Restore rejects mismatches before touching state.
- `INativeMtp::warm_head_kv_range`: same warm graph as `warm_head_kv`, caller-controlled `slot_start`, used by partial-WARM restore.
- `restore_and_generate` unifies perfect-WARM and partial-WARM:
  - Perfect-WARM (`prompt_len == snap_pos`): restore head_kv, no prefill.
  - Partial-WARM (`prompt_len > snap_pos`): restore head_kv, prefill delta, range-warm `[snap_pos..prompt_len]` using `pre_warm_hidden + delta_hiddens`. Slot `snap_pos` is overwritten with the new request's `prompt[snap_pos]` so cross-request first-after-cut divergence cannot corrupt decode.
  - Contract failures fall through to a cold-restart fallback that is byte-correct against the chain runner's `[0..prompt_len)` requirement.
- Thin head_kv snapshot: `[0..head_kv_pos+1]` slice instead of full `[key_len, n_ctx, n_head_kv]`. ~10× payload reduction at `max_ctx=65536`.
- Lazy GPU alloc grows head_kv tensors from 8K initial to `n_ctx_max` on the first warm that needs more (saves ~256 MiB VRAM on a daemon serving short prompts).
- `Qwen35Backend::init_mtp_` passes `cfg_.device.max_ctx` to the module so head_kv tracks the backbone (previous hardcoded 8192 broke any prompt > 8K).
Tests
- 4/4 `test_prefix_cache_mtp` (perfect bit-equal round-trip, prefill_next mismatch fallback, int32 boundary, shape-contract rejection).
- 4/4 `test_common_mtp_orchestrator` (unchanged from #214).
- Full build: `cmake --build` green.
- Harness probe 7/7 at `--max-ctx 65536` (claude_code, codex, opencode, openwebui, pi, hermes, openclaw). Zero chain failures, zero ack-missing warnings, zero capacity overflows.
Bench
`dflash/bench/bench_mtp_warm_hit_2turn.py` is a 2-turn warm-hit gate that drives a live daemon: `prefill_ratio < 0.3` + bit-equal first 16 tokens between turns.
Follow-ups (not in this PR)