mtp: prefix-cache WARM hit (perfect + partial via range-warm) by dusterbloom · Pull Request #221 · Luce-Org/lucebox-hub

dusterbloom · 2026-05-18T13:22:28Z

Stacks on top of #214 (feat/mtp-via-daemon). The two new commits in this PR are:

8409b01 mtp: native-heads MTP speculator (Qwen3.6 NextN, γ-chain). Identical to the head of #214 (feat(mtp): MTP-via-daemon end-to-end (incl. MTP infrastructure)); merges via #214.
0170c40 mtp: prefix-cache WARM hit (perfect + partial via range-warm). The new work in this PR. After #214 merges, this PR's effective diff collapses to just this commit.

What's in the WARM-hit commit

Plugs MTP requests into the existing server.py prefix-cache protocol so agent loops with repeated prefixes skip the cold prefill + warm path.

Inline snap ack (common/prefix_snap.h) — single source for [snap] inline slot=N cur_pos=M. DFlash do_prefill and the MTP orchestrator both funnel through emit_inline_snap_ack so the format can't drift.
Mid-prompt snap in the MTP orchestrator — chunked prefill clips to snap_at (server's prepare_inline_snap picks the second-to-last <|im_start|> boundary, almost always mid-prompt). Partial warm_head_kv fills slots [1..snap_at], snapshot_save fires through ModelBackend, ack emits. Full warm runs after to cover [1..prompt_len).
Snapshot captures h_{snap_pos-1} via target->last_hidden() + a shape contract (γ / n_head_kv / n_ctx / n_embd). Restore rejects mismatches before touching state.
INativeMtp::warm_head_kv_range — same warm graph as warm_head_kv, caller-controlled slot_start, used by partial-WARM restore.
restore_and_generate unifies perfect-WARM and partial-WARM:
- Perfect-WARM (prompt_len == snap_pos): restore head_kv, no prefill.
- Partial-WARM (prompt_len > snap_pos): restore head_kv, prefill delta, range-warm [snap_pos..prompt_len] using pre_warm_hidden + delta_hiddens. Slot snap_pos is overwritten with the new request's prompt[snap_pos] so cross-request first-after-cut divergence cannot corrupt decode.
- Contract failures fall through to a cold-restart fallback that discards the snapshot's head_kv and runs full cold prefill — byte-correct against the chain runner's [0..prompt_len) requirement.
Thin head_kv snapshot — [0..head_kv_pos+1] slice instead of full [key_len, n_ctx, n_head_kv]. ~10× payload reduction at max_ctx=65536.
Lazy GPU head_kv alloc — grows from 8K initial slots to n_ctx_max on the first warm that needs more (saves ~256 MiB VRAM on a daemon serving short prompts).
Qwen35Backend::init_mtp_ passes cfg_.device.max_ctx to the module so head_kv tracks the backbone (previous hardcoded 8192 broke any prompt > 8K).
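
The perfect/partial/cold split described above can be sketched as a small decision function. This is only an illustration under stated assumptions: `ShapeContract`, `Snapshot` and `plan_restore` are hypothetical names, and the token check (the prompt must start with the tokens the snapshot was taken over) is an assumed stand-in for the PR's actual tok/hidden contract, which is not shown here.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ShapeContract:
    # Fields named after the shape contract listed in the PR text.
    gamma: int
    n_head_kv: int
    n_ctx: int
    n_embd: int


@dataclass(frozen=True)
class Snapshot:
    snap_pos: int                    # number of prompt tokens covered by the snapshot
    prefix_tokens: Tuple[int, ...]   # assumption: tokens [0..snap_pos) kept for validation
    shape: ShapeContract


def plan_restore(snap: Snapshot, prompt: Sequence[int],
                 live_shape: ShapeContract) -> Tuple[str, int]:
    """Return (mode, n_tokens_to_prefill) for a new request."""
    prompt_len = len(prompt)
    contract_ok = (
        snap.shape == live_shape
        and prompt_len >= snap.snap_pos
        and tuple(prompt[:snap.snap_pos]) == snap.prefix_tokens
    )
    if not contract_ok:
        # Cold-restart fallback: discard the snapshot, prefill everything.
        return ("cold", prompt_len)
    if prompt_len == snap.snap_pos:
        # Perfect-WARM: restore head_kv, no prefill.
        return ("perfect", 0)
    # Partial-WARM: restore head_kv, prefill only the delta.
    return ("partial", prompt_len - snap.snap_pos)
```

The point of checking the contract before choosing a mode is the same as in the description: a mismatch is rejected before any state is touched, and the fallback is always the full cold path.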

Tests

4/4 test_prefix_cache_mtp (perfect bit-equal round-trip, prefill_next mismatch fallback, int32 boundary, shape-contract rejection).
4/4 test_common_mtp_orchestrator (unchanged from feat(mtp): MTP-via-daemon end-to-end (incl. MTP infrastructure) #214).
Full cmake --build green.
Harness probe 7/7 at --max-ctx 65536 (claude_code, codex, opencode, openwebui, pi, hermes, openclaw). Zero chain failures, zero ack-missing warnings, zero capacity overflows.

Bench

dflash/bench/bench_mtp_warm_hit_2turn.py is a 2-turn warm-hit gate that drives a live daemon: prefill_ratio < 0.3 + bit-equal first 16 tokens between turns.
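
A rough sketch of what such a gate checks, not the bench script itself. It assumes `prefill_ratio` means turn-2 prefill time divided by turn-1 prefill time, and that "bit-equal" compares the first 16 generated token ids of each turn; both are assumptions, and `warm_hit_gate` is a hypothetical name.

```python
from typing import Sequence, Tuple


def warm_hit_gate(cold_prefill_s: float, warm_prefill_s: float,
                  turn1_tokens: Sequence[int], turn2_tokens: Sequence[int],
                  max_ratio: float = 0.3, n_equal: int = 16) -> Tuple[bool, float]:
    """Return (passed, prefill_ratio) for a 2-turn warm-hit run."""
    if cold_prefill_s <= 0:
        raise ValueError("cold prefill time must be positive")
    # Assumed definition: warm (turn 2) prefill time over cold (turn 1) prefill time.
    prefill_ratio = warm_prefill_s / cold_prefill_s
    # Compare only the first n_equal generated tokens of each turn.
    same_head = list(turn1_tokens[:n_equal]) == list(turn2_tokens[:n_equal])
    return (prefill_ratio < max_ratio and same_head, prefill_ratio)
```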

Follow-ups (not in this PR)

Tree-MTP arena (B≥2 sibling drafts) is being prototyped on a separate branch.

cubic-dev-ai

4 issues found across 34 files


Ports the Qwen3.6 MTP head onto the qwen35 backbone (same arch, NextN block at layer n_layer-1). Speculation runs through a new common chain runner; the existing DFlashTarget adapter handles verify/snapshot/restore.

- common/mtp_interface.h: flavor-tagged IMtpModule + INativeMtp / IExternalDrafterMtp mixins. Future Gemma4 drafter plugs in via IExternalDrafterMtp without touching the chain runner.
- common/mtp_chain_runner.{h,cpp}: γ-chain propose/verify/accept loop, hoisted out of the backend. Three KV-reconciliation paths (accept-all / fast rollback / recommit) share a single post-iter invariant so AR equivalence holds under recommit.
- common/mtp_orchestrator.{h,cpp}: chunked prefill + warm + dispatch to chain runner. Owns only control flow; all compute lives in DFlashTarget::verify_batch and INativeMtp::step_batch graphs on the backend device.
- qwen36/qwen36_mtp.{h,cpp,_graph.cpp,_loader.cpp}: GGUF tensor inventory for Qwen3.6 -MTP-GGUF, GPU warm graph, GPU step graph cached on (head_idx, fa_window, fused_lm_head, topk_k). γ is bound at attach time as the single source of truth.
- qwen35: supports_mtp()/mtp() exposed through ModelBackend; generate() delegates to common::mtp::warm_and_decode when MTP is configured. Cache sized for max(γ+1, ddtree_budget+1) verify tokens.
- server.py: --mtp-gguf and --mtp-gamma flags routed through; daemon command surface unchanged.

Tests: 4/4 test_common_mtp_orchestrator. Full build green; harness probe 7/7 (claude_code, codex, opencode, openwebui, pi, hermes, openclaw) at --max-ctx 65536; MTP decode reports accept_rate 0.43-0.88 on short agentic prompts.
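
For readers unfamiliar with a propose/verify/accept loop, here is a minimal sketch of the textbook greedy acceptance rule for a γ-token draft chain. It is not the PR's chain runner and says nothing about its three KV-reconciliation paths; `accept_chain` is a hypothetical name, and the γ+1 target predictions per verify pass are an assumption consistent with the "γ+1 verify tokens" cache sizing mentioned above.

```python
from typing import List, Sequence, Tuple


def accept_chain(draft: Sequence[int],
                 target_argmax: Sequence[int]) -> Tuple[List[int], int]:
    """Greedy acceptance for one speculation step.

    draft:         the γ tokens proposed by the speculator.
    target_argmax: γ+1 target predictions; entry i is the target's next token
                   after the committed context plus draft[:i].
    Returns (tokens_to_commit, n_accepted).
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("expected γ+1 target predictions for γ draft tokens")
    n_accepted = 0
    for d, t in zip(draft, target_argmax):
        if d != t:
            break
        n_accepted += 1
    # Commit the accepted prefix plus one target token: the correction at the
    # first mismatch, or the bonus token when every draft token was accepted.
    return list(draft[:n_accepted]) + [target_argmax[n_accepted]], n_accepted
```

Because every committed token is one the target itself would have chosen, the output matches plain autoregressive greedy decoding; the speculator only changes how many tokens are committed per target pass.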

Plugs MTP requests into the existing server.py prefix-cache protocol so agent loops with repeated prefixes skip the cold prefill + warm path.

- common/prefix_snap.h: single source for the inline snap ack format ("[snap] inline slot=N cur_pos=M") that server.py's bus.await_reply matches. DFlash do_prefill and the MTP orchestrator both funnel through emit_inline_snap_ack so the format can't drift.
- mtp_orchestrator: chunked prefill clips to snap_at when the server requests a mid-prompt cut (prepare_inline_snap picks the second-to-last <|im_start|> boundary). Partial warm_head_kv fills slots [1..snap_at], snapshot_save fires through ModelBackend, ack emits. Full warm runs after to cover [1..prompt_len). End-of-prompt snap supported but rarely picked by the server.
- snapshot_save captures h_{snap_pos-1} via target->last_hidden() and a shape contract (γ / n_head_kv / n_ctx / n_embd). Restore rejects mismatches before touching state.
- INativeMtp gains warm_head_kv_range(prompt, n_prompt, start_slot, n_chunk, prefill_next, hiddens): same warm graph as warm_head_kv, with caller-controlled slot_start so partial-WARM restore can fill [snap_pos..prompt_len] using pre_warm_hidden + delta_hiddens.
- restore_and_generate unifies perfect-WARM and partial-WARM under a single tok/shape/hidden contract. Perfect-WARM (prompt_len == snap_pos): restore head_kv, no prefill. Partial-WARM (prompt_len > snap_pos): restore head_kv, prefill delta (kv_offset=snap_pos), range-warm [snap_pos..prompt_len]. Slot snap_pos is overwritten with the new prompt[snap_pos] so cross-request first-after-cut divergence cannot corrupt decode. Contract failures fall through to a cold-restart fallback that discards the snapshot's head_kv and runs full cold prefill, byte-correct against the chain runner's [0..prompt_len) requirement.
- qwen36_mtp: thin head_kv snapshot ([0..head_kv_pos+1] slice instead of full [key_len, n_ctx, n_head_kv]); ~10× payload reduction at max_ctx=65536. Lazy GPU alloc grows head_kv tensors from 8K initial to n_ctx_max on first warm that needs more (saves ~256 MiB VRAM on a daemon serving short prompts). F16 dtype guard on snapshot+restore.
- qwen35_backend: head_kv_warm_ flag gates snapshot_save's MTP capture so a pre-warm snapshot can't round-trip as valid. init_mtp_ passes cfg_.device.max_ctx to the module so head_kv tracks the backbone (previous hardcoded 8192 broke any prompt > 8K).

Tests: 4/4 test_prefix_cache_mtp (perfect bit-equal round-trip, prefill mismatch fallback, int32 boundary, shape-contract rejection) + 4/4 test_common_mtp_orchestrator. Full build green. Harness probe 7/7 at --max-ctx 65536; zero chain failures / ack-missing / capacity overflows.

R3 integration bench (dflash/bench/bench_mtp_warm_hit_2turn.py) drives a 2-turn warm-hit gate against a live daemon: prefill_ratio < 0.3 + bit-equal first 16 tokens between turns.
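
To make the "single source for the ack format" idea concrete, here is a small Python sketch that formats and parses the ack line quoted in the commit message. The helper names and the regular expression are hypothetical; the real matcher in server.py and the C++ emit_inline_snap_ack are not shown in this PR text.

```python
import re
from typing import Optional, Tuple

# Pattern derived from the format quoted above: "[snap] inline slot=N cur_pos=M".
SNAP_ACK_RE = re.compile(r"^\[snap\] inline slot=(\d+) cur_pos=(\d+)$")


def format_inline_snap_ack(slot: int, cur_pos: int) -> str:
    """Build the ack line (hypothetical Python mirror of the C++ emitter)."""
    return f"[snap] inline slot={slot} cur_pos={cur_pos}"


def parse_inline_snap_ack(line: str) -> Optional[Tuple[int, int]]:
    """Return (slot, cur_pos) if the line is an inline snap ack, else None."""
    m = SNAP_ACK_RE.match(line.strip())
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))
```

Keeping the formatter and the parser next to each other is one way to get the same "can't drift" property on the Python side.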

`n_layer = block_count - nextn_predict_layers` is correct for backbone-graph iteration and the divisibility check, but `plan.layer_end` was also defaulting to this reduced value — silently filtering blk.{n_layer}.* out of the GPU load. The MTP loader's `find_tensor(meta_ctx, ...)` then resolved the descriptor with `data==nullptr` and failed with "14 required NextN tensor(s) missing". Fix: default `plan.layer_end` to `n_block_raw` so MTP head blocks are loaded alongside backbone. Validation upper-bound widened to match. No-op for non-MTP GGUFs where `nextn_predict_layers=0` (n_block_raw == n_layer).
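
The off-by-one-block effect can be shown with plain arithmetic. The sketch below is an illustration only: `resident_blocks` is a hypothetical helper, `layer_end` is assumed to be an exclusive upper bound, and the block counts in the usage note are made up.

```python
from typing import List, Optional, Tuple


def resident_blocks(block_count: int, nextn_predict_layers: int,
                    layer_end: Optional[int] = None,
                    default_to_raw: bool = True) -> Tuple[int, List[int]]:
    """Return (n_layer, block indices that get loaded to the GPU).

    default_to_raw=False models the old default (layer_end = n_layer);
    default_to_raw=True models the fix (layer_end = n_block_raw).
    """
    n_block_raw = block_count
    n_layer = block_count - nextn_predict_layers   # backbone layers only
    if layer_end is None:
        layer_end = n_block_raw if default_to_raw else n_layer
    # Validation upper bound widened to the raw block count.
    if not 0 < layer_end <= n_block_raw:
        raise ValueError("layer_end out of range")
    return n_layer, list(range(layer_end))
```

With a made-up GGUF of 65 blocks and one NextN layer, the old default loads blocks 0..63 and drops block 64 (the MTP head), while the new default loads all 65. With `nextn_predict_layers=0` both defaults give the same list.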

`n_last_chunk = committed % PREFILL_UBATCH` only equals the last prefill chunk's actual size when prefill started at kv_offset=0. With prefix-cache partial restore, `restore_and_generate` runs delta-prefill from kv_offset>0, so the last chunk's `n_tokens` is `prompt_len - kv_offset`, not the modulo of `committed` over PREFILL_UBATCH. The read offset was then larger than sg_.argmax_tokens->ne[0], firing the "tensor read out of bounds" assert on the first DFlash spec-decode request against any prompt the cache had already seen. Read the actual last-chunk size from sg_.argmax_tokens->ne[0], which the graph builder sized to match the bound chunk. No-op when kv_offset==0 (`committed % UBATCH == ne[0]`).
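
A worked example of the mismatch, as a sketch. The ubatch size of 512 is a hypothetical value (the real PREFILL_UBATCH is not stated here), and `last_chunk_actual` assumes the delta prefill is split into full ubatches starting at kv_offset.

```python
PREFILL_UBATCH = 512  # hypothetical value, for illustration only


def last_chunk_modulo(committed: int, ubatch: int = PREFILL_UBATCH) -> int:
    """The old estimate: only right when prefill started at kv_offset=0."""
    return committed % ubatch


def last_chunk_actual(prompt_len: int, kv_offset: int,
                      ubatch: int = PREFILL_UBATCH) -> int:
    """Size of the final prefill chunk when prefill starts at kv_offset."""
    delta = prompt_len - kv_offset
    if delta <= 0:
        raise ValueError("nothing to prefill")
    rem = delta % ubatch
    return rem if rem else ubatch


# Cold prefill of a 5415-token prompt: both agree.
cold_estimate = last_chunk_modulo(5415)        # 5415 - 10*512 = 295
cold_actual = last_chunk_actual(5415, 0)       # 295

# Partial restore from kv_offset=5300: only 115 tokens are prefilled, but the
# modulo estimate still says 295, which overruns a 115-entry tensor.
warm_estimate = last_chunk_modulo(5415)        # 295
warm_actual = last_chunk_actual(5415, 5300)    # 115
```

Reading the size from the tensor itself, as the fix does, avoids having to recompute either quantity.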

When --prefill-skip-park is set, compress_text_via_daemon correctly skips its own `park target` / `park draft` sends, but the C++ handle_compress parses an independent `nopark` trailing token and parks target+draft itself when it's absent. The two paths were out of sync: Python honored skip_park but the daemon ignored it. This is a no-op for the PFlash-only path (unpark target rebinds nothing that has stale references). But the MTP path holds tensor pointers into the backbone's ggml_context across requests — the internal park frees those tensors, and the immediate unpark rebuilds the context with new addresses, which makes the MTP graph crash with GGML_ASSERT(ggml_can_repeat(b, a)) on the next forward. Pass " nopark" through to the C++ command when skip_park is true, so both layers agree.
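
The agreement between the two layers can be sketched as a pair of helpers. Both function names are hypothetical and the command string in the usage note is a placeholder, since the real compress command layout is not spelled out here; the only detail taken from the text is the trailing " nopark" token.

```python
def with_park_policy(cmd: str, skip_park: bool) -> str:
    """Sender side: append the trailing marker when parking is skipped."""
    return cmd + " nopark" if skip_park else cmd


def receiver_should_park(cmd: str) -> bool:
    """Receiver side: park unless the last token is the nopark marker."""
    tokens = cmd.split()
    return not tokens or tokens[-1] != "nopark"
```

For a placeholder command such as `"compress <args>"`, `receiver_should_park(with_park_policy(cmd, skip_park))` is always `not skip_park`, which is the invariant the fix restores.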

Extends yesterday's 2026-05-17_f031f08 matrix with the workload classes that matrix didn't cover:

- Agent suite (2k/8k/24k buckets) across 4 configs (MTP+DFlash × q8/tq3 KV) plus the stacked PFlash+MTP+TQ3 path. MTP wins agent (53.98) by 4-14% over DFlash because the small drafter's accept collapses from 70-90% on code/math to 28-29% on chat/tool-use prompts; MTP's accept stays at 0.69.
- PFlash + MTP + TQ3 stack verified end-to-end for the first time: 36K NIAH passes in 22.8s wall (20.3× compression, decode 52 tok/s, needle recall correct) on a single 3090. All 7 OpenAI-compatible clients (claude_code, codex, hermes, openclaw, openwebui, opencode, pi) pass every probe against the stacked server.
- he/gsm/math reproduced under today's branch. DFlash AL is unchanged from f031f08 (speculator is healthy); absolute tok/s are lower because today's bench harness wraps the full streaming HTTP response while f031f08's parsed the daemon's internal tok/s timer.

Three branch fixes that made the matrix runnable:

- 230c303 MTP loader: include NextN block in GPU load by default
- 5e7594c do_spec_decode: use sg_.argmax_tokens->ne[0] for last-chunk size
- af05a23 prefill_hook: propagate skip_park to daemon compress (MTP+PFlash)

Raw per-suite JSONs included alongside summary.md.

…fter-arch protocol

Option A from thoughts/2026-05-19/dual-speculator-architecture.md. Backend keeps BOTH speculators resident when --draft and --mtp-gguf are both provided. GenerateRequest grows a `speculator` field (string):

- "dflash" forces the DFlash drafter + DDTree spec-decode path
- "mtp" forces the MTP γ-chain native-heads path
- "auto" (default) selects based on prompt size

Rationale for the auto threshold (4096 tokens), pulled from today's bench:

- DFlash beats MTP 1.75–2.6× on code/math under ~16K ctx (HE 173 vs 64, Math 115 vs 61, GSM 102 vs 59 tok/s on bench_matrix)
- MTP beats DFlash on agent prompts (DFlash drafter accept rate 0.29 vs MTP 0.69)
- MTP beats DFlash 2.4–5.7× on PFlash-compressed long context (DFlash drafter accept collapses to 0.14–0.21 on gapped compressed sequences; MTP heads consume backbone hidden states and are unaffected)

VRAM on a 24 GB 3090 with dual-load: ~19.9 GB (15.3 target + 1 DFlash drafter + 0.5 MTP heads + 2 KV TQ3 + 1 activations). Fits.

Boot banner now prints `[speculators] dual-mode: DFlash+MTP both loaded (auto_select threshold=4096 tokens)` when both are configured.

Same commit also extends the PFlash compress protocol to carry an optional drafter_arch token so the server can route `qwen3-0.6b` vs `qwen35-0.8b` per request. New CLI flag --prefill-drafter-arch on the Python side; daemon's handle_compress parses an optional 4th positional token between drafter_gguf and the trailing "nopark" marker. Backward compatible: omitting the token defaults to qwen3-0.6b, which matches the previous hard-coded behavior.

Smoke tested:

- speculator=dflash → DFlash path confirmed via [spec-decode] log line
- speculator=mtp → MTP path via [mtp_decode] log line
- speculator=auto on 16-token prompt → routed to DFlash (< 4096)
- speculator=auto on 5415-token prompt → routed to MTP (> 4096)
- All 9 existing MTP test binaries still pass
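
The routing rule reads naturally as a small function. This is a sketch, not the backend code: `select_speculator` is a hypothetical name, and the behaviour at exactly 4096 tokens is an assumption, since the text only states the results for 16 and 5415 tokens.

```python
AUTO_SELECT_THRESHOLD = 4096  # tokens, from the boot banner quoted above


def select_speculator(requested: str, n_prompt_tokens: int,
                      threshold: int = AUTO_SELECT_THRESHOLD) -> str:
    """Map the request's `speculator` field to the path that will run."""
    if requested in ("dflash", "mtp"):
        return requested              # explicit choice always wins
    if requested != "auto":
        raise ValueError(f"unknown speculator: {requested!r}")
    # Assumption: prompts at or above the threshold go to MTP.
    return "dflash" if n_prompt_tokens < threshold else "mtp"
```

The two smoke-test cases above correspond to `select_speculator("auto", 16)` and `select_speculator("auto", 5415)`.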

bench_matrix.py + the matrix/ subpackage were added in dflash commit f031f08 (bench: matrix orchestrator + power sweep + DFlash optimality audit) but landed on a feature branch that never made it to main, so they're absent from our HEAD. Restored verbatim from f031f08 (with the HumanEval/GSM8K/Math500 workloads from follow-up commit 59cd0fa) so today's apples-to-apples re-runs against yesterday's matrix have an in-tree harness.

This is the orchestrator that drives test_dflash directly (positional args, no server), parses the daemon's own decode-only tok/s figure ([dflash] generated N tokens in T s -> X tok/s), and emits per-cell JSON artifacts with bootstrap CI 95% (1000 resamples, seed=42). Subsumes the older scattered bench scripts (bench_agent.py, bench_agent_mtp.py, bench_llm.py) for cross-comparison.

Layout:

- bench_matrix.py: entry point
- render_matrix.py: markdown summary writer
- matrix/workload.py: Workload base
- matrix/speculator.py: Speculator base
- matrix/speculators/{ar,dflash,mtp}.py
- matrix/workloads/{humaneval,gsm8k,math500,swe_bench}.py

Usage:

    python3 dflash/scripts/bench_matrix.py \
        --workloads humaneval,gsm8k,math500 \
        --speculators ar,dflash_b22,mtp_d3 \
        --n-gen 256 --n-runs 8 --n-sample 8
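
For reference, a 95% bootstrap interval with 1000 resamples and seed 42 can be computed as below. This sketch uses the percentile method on the mean, which is an assumption; the commit message does not say which bootstrap variant bench_matrix.py uses, and `bootstrap_ci_mean` is a hypothetical name.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_ci_mean(samples: Sequence[float], n_resamples: int = 1000,
                      seed: int = 42, level: float = 0.95) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of `samples`."""
    if not samples:
        raise ValueError("need at least one sample")
    rng = random.Random(seed)          # fixed seed makes the interval reproducible
    n = len(samples)
    # Resample with replacement, record the mean of each resample, then sort.
    means = sorted(mean(rng.choices(samples, k=n)) for _ in range(n_resamples))
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = int((1.0 + level) / 2.0 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]
```

With a fixed seed the interval is identical across runs on the same per-run tok/s values, which is what makes cell-by-cell comparisons between two matrix runs meaningful.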

…ary update

Three new run directories plus a yesterday-vs-today comparison snapshot:

- 2026-05-17T17-40-56_f031f08/: preserved yesterday's reference matrix (n_sample=8, n_runs=8, bootstrap CI 95%). DFlash b22: HE 169.40, GSM 104.32, Math 119.36 tok/s. MTP d3: 65.62 / 61.00 / 61.89. AR baseline ~34 tok/s across suites. Was untracked in tree because the bench_matrix orchestrator landed on a stale branch.
- 2026-05-19T11-43-13_83e19d9/: first matrix re-run on HEAD (HE only, MTP_GGUF env unset → mtp_d3 cell empty). DFlash b22: 173.81 tok/s. Confirms no DFlash kernel regression vs f031f08.
- 2026-05-19T11-54-32_83e19d9/: full apples-to-apples re-run on HEAD with MTP_GGUF set. Result: all 9 cells (3 suites × {AR, DFlash b22, MTP d3}) within ±5% of f031f08 mean tok/s. DFlash HE +2.6%, MTP HE −2.0%, etc. No regression.
- 2026-05-19_mtp-prefix-warm-ghost/summary.md: updated with the apples-to-apples table above, an agent bucket-label-vs-actual-token audit (agent_24k prompts are actually ~2.6K), a known-gaps section documenting what is NOT yet tested (real CLI agentic loops, NIAH > 131K, concurrent sessions, sustained throughput, PR Luce-Org#195 merge).
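
The ±5% comparison is a relative-delta check against the reference mean. A short sketch, using the two DFlash b22 HumanEval figures quoted in the text (169.40 for f031f08, 173.81 for the re-run on HEAD); the function names are hypothetical.

```python
def rel_delta_pct(ref: float, new: float) -> float:
    """Relative change of `new` versus `ref`, in percent."""
    if ref == 0:
        raise ValueError("reference must be non-zero")
    return (new - ref) / ref * 100.0


def within_tolerance(ref: float, new: float, tol_pct: float = 5.0) -> bool:
    """True when `new` is within ±tol_pct percent of `ref`."""
    return abs(rel_delta_pct(ref, new)) <= tol_pct


dflash_he_delta = rel_delta_pct(169.40, 173.81)   # about +2.6 percent
```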

…resume plan

Three working docs that capture what we learned shipping Option A:

- dual-speculator-architecture.md: full pipeline diagrams (PFlash → target prefill → auto_select → DFlash/MTP → verify → stream), component cost matrix on RTX 3090, workload-to-config cheat sheet, synergy/conflict grid (the one real conflict is DFlash drafter accept collapsing on PFlash-compressed prompts), recommended 3090 default for agentic coding via hermes/opencode/pi.
- pflash-drafter-unification-plan.md: experiment plan for "can we use a smaller / different / shared PFlash drafter." Today's E1 swap to Qwen3.5-0.8B blocked on two loader gaps (DFlash-draft variant strips lm_head; full base GGUF has tied embeddings the target loader rejects). Documents the structural finding that a DFlash draft and a PFlash scorer can't be the same file by construction.
- dual-monster-resume-plan.md: milestone-based roadmap (M1..M8) organising what's committed, what's uncommitted, what's tested, and what to ship next. M1 (this commit set) → M2 real CLI sessions → M3 PR → M4 mid-stream Option B → ...

cubic-dev-ai

1 issue found across 48 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/scripts/bench_matrix.py">

<violation number="1" location="dflash/scripts/bench_matrix.py:446">
P2: --no-ar-cache flag prevents speedup computation for all speculators despite help text promising per-pair AR re-runs</violation>
</file>


cubic-dev-ai · 2026-05-19T16:30:37Z

+                ar_cache=ar_cache.get(wname),
+                tmpdir=tmpdir,
+            )
+            ar_res = ar_cache.get(wname) if not args.no_ar_cache else None


P2: --no-ar-cache flag prevents speedup computation for all speculators despite help text promising per-pair AR re-runs

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At dflash/scripts/bench_matrix.py, line 446:
<comment>--no-ar-cache flag prevents speedup computation for all speculators despite help text promising per-pair AR re-runs</comment>
<file context>
@@ -0,0 +1,468 @@
+                ar_cache=ar_cache.get(wname),
+                tmpdir=tmpdir,
+            )
+            ar_res = ar_cache.get(wname) if not args.no_ar_cache else None
+            ap = _write_artifact(
+                run_dir, wl, spec, meta,
</file context>
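
One possible shape for a fix, sketched under assumptions: if --no-ar-cache is meant to re-run the AR baseline per (workload, speculator) pair, the baseline should come from a fresh run rather than being set to None. `resolve_ar_baseline` and the `run_ar` callable are hypothetical; the real orchestrator's helpers are not shown in this review.

```python
from typing import Callable, Dict, Optional


def resolve_ar_baseline(wname: str,
                        ar_cache: Dict[str, dict],
                        no_ar_cache: bool,
                        run_ar: Callable[[str], dict]) -> Optional[dict]:
    """Pick the AR result that speedup is computed against.

    With caching enabled, reuse the cached AR run for this workload.
    With --no-ar-cache, run AR again for this pair instead of returning
    None, so a speedup can still be computed.
    """
    if no_ar_cache:
        return run_ar(wname)
    return ar_cache.get(wname)
```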

Reproducible per-CLI setup for driving the dual-speculator server (feat/mtp-prefix-warm-ghost) from real client binaries instead of synthetic POSTs. Tested 2026-05-19 against the server running on http://127.0.0.1:18080 with PFlash + MTP γ=3 + DFlash b=22 + Q8 KV + prefix-cache.

Per-CLI working configurations:

- claude: --bare is mandatory; without it OAuth keychain wins and ANTHROPIC_BASE_URL is ignored. Set ANTHROPIC_AUTH_TOKEN, ANTHROPIC_MODEL, DISABLE_AUTOUPDATER, DISABLE_TELEMETRY. Real 3-turn session confirmed auto-route DFlash → MTP crossing the 4096-token threshold mid-conversation; 1815-token long response streamed at 49.3 tok/s via MTP.
- codex: CODEX_HOME must be outside /tmp (codex refuses to write helper binaries there). wire_api = "responses" in config.toml (chat is deprecated). Falls back to chat endpoint automatically when responses is unavailable.
- hermes: context_length: 65536 required (hermes refuses < 64K). Server must launch with --max-ctx 65536 to match. config.yaml + .env in an isolated HOME dir; provider named "lucebox" with api_mode "chat_completions".
- pi: Existing ~/.pi/agent/models.json had lucebox pre-wired but pointing at stale port 8000. Patch to port 18080 + api "openai-completions" (NOT openai-chat; pi doesn't register that name). Pi's --provider lucebox path then works directly with --mode text.
- opencode: Old install (0.5.x on Node 22) silently failed at provider load. Fix: nvm use --lts (Node 24) then npm install -g opencode-ai (1.15.5+). Provider lucebox registered in ~/.config/opencode/opencode.json with @ai-sdk/openai-compatible. Provider config is loaded (opencode models lucebox prints the model) but the parallel title+main request pattern stalls the second request due to a server-side daemon stdin serialization limit, which is orthogonal to wiring.

Includes server-log signatures so users can verify each CLI is hitting the right speculator path: [generate] speculator=dflash|mtp lines and [spec-decode] / [mtp_decode] accept rate lines.
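
A quick way to check those signatures is to count the routing lines in a captured server log. The sketch below assumes only what the text states, namely that routing lines contain `[generate]` followed by `speculator=dflash` or `speculator=mtp`; any other fields on the line are unknown, and `count_speculator_routes` is a hypothetical helper.

```python
import re
from collections import Counter
from typing import Iterable

# Matches "[generate] ... speculator=dflash" or "... speculator=mtp".
GENERATE_RE = re.compile(r"\[generate\].*?speculator=(dflash|mtp)\b")


def count_speculator_routes(log_lines: Iterable[str]) -> Counter:
    """Count how many requests were routed to each speculator."""
    routes: Counter = Counter()
    for line in log_lines:
        m = GENERATE_RE.search(line)
        if m:
            routes[m.group(1)] += 1
    return routes
```

For a multi-turn session that crosses the auto-select threshold, the counter should show at least one `dflash` entry followed later by `mtp` entries.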

Engineering memo on porting sapientinc/HRM-Text-1B as a Luce target. HRM is a dual-timescale recurrent transformer (1B params, two stacks H and L iterated H_cycles × (L_cycles+1) = 8 passes per token, with state injection z_L + z_H). Different forward contract from anything in luce today: it needs a new graph builder, 128 effective KV cache slots (16 layers × 8 invocations), embedding scaling, prefix-LM bidirectional mask.

Bottom line: don't port now. ~6 engineer-days for AR-only HRM via a new qwen35-style backend; spec decode (DFlash drafter, MTP heads) is multi-week + needs training compute since no aligned drafter exists. Pre-alignment checkpoint, useless for agentic coding without SFT.

The interesting cross-pollination is the OPPOSITE direction: use HRM as a SPECULATOR for a larger target. The dual-timescale recurrence may produce drafts that align with target hidden-state evolution better than a single-pass Qwen3-0.6B does. Worth a separate research spike.

Model is downloaded to /home/peppi/models/hrm-text-1b (2.3 GB safetensors + custom modeling_hrm_text.py + tokenizer). Memo includes a 12-line reproducer that runs HRM via transformers main branch for evaluation without any luce work.

cubic-dev-ai Bot reviewed May 18, 2026

Comment thread dflash/src/common/gguf_mmap.h

Comment thread dflash/src/common/gguf_mmap.h

Comment thread dflash/bench/bench_mtp_warm_hit_2turn.py

Comment thread dflash/scripts/server.py

dusterbloom force-pushed the feat/mtp-prefix-warm-ghost branch from 0170c40 to 0d4a531 Compare May 18, 2026 16:05

dusterbloom added 4 commits May 19, 2026 09:34

dusterbloom force-pushed the feat/mtp-prefix-warm-ghost branch from 0d4a531 to 5e7594c Compare May 19, 2026 08:47

dusterbloom added 6 commits May 19, 2026 13:31

cubic-dev-ai Bot reviewed May 19, 2026

dusterbloom added 2 commits May 19, 2026 22:17

dusterbloom wants to merge 12 commits into Luce-Org:main from dusterbloom:feat/mtp-prefix-warm-ghost
