feat(mtp): MTP-via-daemon end-to-end (incl. MTP infrastructure) by dusterbloom · Pull Request #214 · Luce-Org/lucebox

dusterbloom · 2026-05-17T23:10:38Z

Scope: full MTP stack + daemon wiring + PR #213 daemon fixes

This PR brings two things to main that didn't exist there before, and bundles a third so it applies cleanly:

The Qwen3.6 MTP module (dflash/src/qwen36/qwen36_mtp.{cpp,h} and qwen36_mtp_graph.{cpp,h}, ~2.8k LOC). This was developed on feature/mtp-foundation-v2 over the past weeks but never landed on main.
MTP-via-daemon wiring (~500 LOC) — what was originally scoped here: server.py --mtp-gguf, test_dflash dispatch, Qwen35Backend MTP integration.
PR #213 daemon fixes, included so this PR is self-contained against main (those fixes are also up as PR #213 for narrower review).

Total: +3324 / -60 LOC. Above the 3k cap. Recommend reviewers process this in two passes:

First: read PR #213's narrower diff (the 4 daemon fixes). That's ~140 LOC and already reviewable.
Then: this PR's diff minus #213, i.e. ~500 LOC of MTP daemon wiring on top of the ~2.8k LOC MTP module (already on the feature branch but new to main).

If preferred, this PR can be split into:

PR-A: MTP module land (~2.8k LOC, just the infrastructure, no daemon use)
PR-B: MTP daemon wiring (~500 LOC, atop PR-A)
PR #213: separate daemon fixes (already up)

Say the word and I'll re-split.

Headline measurements

Claude Code on 24K-token system prompt:

Speculator	decode	accept	comment
DFlash (DDTree b=22)	22.0 tok/s	n/a	PR #213 baseline
MTP (γ=3, daemon)	35.8 tok/s	0.41	+63% vs DFlash
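
For reference, the +63% figure in the table follows directly from the two decode rates listed there; the worked arithmetic below uses only those two numbers.

```latex
\frac{35.8 - 22.0}{22.0} = \frac{13.8}{22.0} \approx 0.627 \quad\Rightarrow\quad +63\%\ \text{decode throughput vs. DFlash}
```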

Mean across 7 harness clients (3 back-to-back probes, 21 client-cells):

Speculator	mean decode	median	max	out=0 cells
DFlash	22.0 tok/s	—	—	0 / 57
MTP	29.3 tok/s	29.6	50.6	0 / 57

vs Lucebox blog (RTX 3090, Qwen3.6 Q4_K_M, DDTree, published 22.6-29.6 tok/s range):

Our MTP mean 29.3, max 50.6 → at or above the top of the published range

Architecture summary

Same daemon binary, same bare-prompt protocol. MTP is a startup mode (--mtp-gguf flag). Backend extends Qwen35Backend with optional Qwen36MtpModule (no new backend class — keeps target loader/snapshot/park/cache-reset shared). Per-request mtp_module_->reset_chain() mirrors PR #213's reset_target_cache(cache_) so MTP also avoids the state-leak cascade under sustained load.
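
The "optional module plus per-request reset" shape described above can be sketched as follows. This is a minimal illustration, not the PR's code: `SketchBackend`, `SketchMtpModule`, and their members are hypothetical stand-ins, and only the names `reset_chain()` and `supports_mtp()` are taken from this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the MTP module: only its per-request chain state.
struct SketchMtpModule {
    std::vector<int> chain;              // draft-chain state carried between steps
    void reset_chain() { chain.clear(); }
};

// Hypothetical stand-in for the backend: one class, MTP is an optional member.
class SketchBackend {
public:
    explicit SketchBackend(bool with_mtp) {
        if (with_mtp) mtp_module_ = std::make_unique<SketchMtpModule>();
    }

    bool supports_mtp() const { return mtp_module_ != nullptr; }

    // Every request starts from clean state so nothing leaks across requests.
    std::size_t generate(const std::vector<int> & prompt) {
        target_cache_.clear();                        // plays the role of reset_target_cache(cache_)
        if (mtp_module_) mtp_module_->reset_chain();  // same rule for the MTP head
        assert(chain_len() == 0);                     // invariant: no chain state survives a reset

        target_cache_ = prompt;                       // placeholder for prefill
        if (mtp_module_) mtp_module_->chain.push_back(static_cast<int>(prompt.size()));
        return target_cache_.size();
    }

    std::size_t chain_len() const { return mtp_module_ ? mtp_module_->chain.size() : 0; }

private:
    std::vector<int> target_cache_;
    std::unique_ptr<SketchMtpModule> mtp_module_;     // null when MTP is not configured
};
```

Constructing `SketchBackend(false)` gives the plain DFlash-style path with no MTP state at all, which is the point of keeping a single backend class.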

Validated

pytest dflash/scripts/test_server.py — 56/56 pass
harness/client_test_runner.py probe × 3 back-to-back × 7 clients = 21/21 cells green on MTP
Same probe set on DFlash (no --mtp-gguf) — 7/7 (no regression from PR #213 baseline)
bench_agent.py --bucket 2k --n-sample 8 --skip-ar --budget 22 — mean 53.37 tok/s vs canonical 46.70 = +14% (no regression on file-mode harness path)
claude --print --model luce-dflash real coding task on MTP — valid Heron's formula output
DFlash regression smoke — "lucebox lucebox lucebox" identical to PR #213

Known follow-ups (separate PR after this lands)

Pre-norm hidden capture path warning ("hidden_at_pos_pre_norm returned null"). Currently degrades accept-rate but functional. Fix would push accept from ~0.3 → model-card-recommended 0.83 for γ≤2, lifting MTP decode mean toward 50+ tok/s.
Prefix-cache RESTORE in MTP mode — disabled because MTP head KV isn't snapshotted. Adding to PrefixSnapshot would unlock TTFT savings (DFlash already gets 22s → 6s warm).
mtp_topk draft-source path plumbed but not implemented in do_mtp_decode_ — chain path is production default per matrix bench D=3.

Test plan

All tests pass
Harness probes green on MTP + DFlash (no regression)
Matrix bench no regression
Real client end-to-end (Claude Code generates valid code)
Beat blog decode (35.8 tok/s Claude Code vs 22.6-29.6 published)
(review) Confirm 2.8k LOC MTP module review acceptable in one pass, or request split
(next PR) Pre-norm fix → accept_rate → 0.83+
(next PR) MTP head KV snapshot → prefix-cache works in MTP mode

cubic-dev-ai

No issues found across 4 files

davide221 · 2026-05-17T23:56:35Z

@dusterbloom thanks for the great feature! Can you fix the merge conflicts?

howard0su

The PR is not ready for check-in:

Please leverage ggml for the CPU code (or make it generic so that it supports any ggml backend).
MTP should be as simple as an additional set of weights on the modelbackend.
If a model contains MTP support (whether gemma4 or qwen3.5), the logic should handle it. In other words, the logic should live in /common, where any modelbackend that supports MTP can leverage it.
You may want to introduce an additional interface inside the modelbackend class.

howard0su · 2026-05-18T00:39:36Z

+}
+
+// Argmax over a float vector; returns index of max element.
+static int32_t argmax(const float * logits, int n) {


This duplicates ggml code; ggml already has a CPU version of these helpers.
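
Whichever ggml primitive ends up replacing the helper, its contract can be pinned down with the standard library so the two can be compared. The sketch below is such a reference, not ggml code: the name `argmax_ref` and the tie-breaking rule (lowest index wins) are assumptions, since the diff above shows only the signature.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>

// Reference argmax: index of the largest element.
// std::max_element returns the first maximum, so ties resolve to the lowest index.
static int32_t argmax_ref(const float * logits, int n) {
    assert(logits != nullptr && n > 0);
    const float * best = std::max_element(logits, logits + n);
    return static_cast<int32_t>(std::distance(logits, best));
}
```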

howard0su · 2026-05-18T00:40:55Z

@@ -0,0 +1,2100 @@
+// qwen36_mtp.cpp — see qwen36_mtp.h for contract.
+//
+// PR 2d-bis (Shape B): implements the full DeepSeek-V3 NextN per-head forward


Please use ggml ops for the CPU side of the calculation. The reason is not just to reduce duplicate code but also to enable the multi-GPU scenario, in which we can offload path A to GPU-B.

howard0su · 2026-05-18T00:44:26Z

    std::unique_ptr<DFlashTarget> dflash_target_;

+    // ── MTP speculator (optional, set when cfg_.mtp_gguf_path != nullptr) ──
+    std::unique_ptr<mtp::Qwen36MtpModule> mtp_module_;


What is the concept of Module here? I would prefer MTP to be just another set of weights. The current MTP code should be generic enough to handle gemma4 as well.

howard0su · 2026-05-18T00:47:54Z

Also, after the RoPE change, please retest the 24k-context baseline (that was the bug).

…and_decode (step 3.1)

Rebase of the MTP-via-daemon work onto latest main (PRs Luce-Org#213, Luce-Org#210, Luce-Org#208, Luce-Org#207 already merged) plus the first slice of howard0su's PR Luce-Org#214 review request: move MTP orchestration into dflash/src/common/ behind a generic entry point any ModelBackend can call.

## What landed

### Foundation (rebase port, ~5k LOC)

- `dflash/src/qwen36/qwen36_mtp.{cpp,h}` (2.3k LOC): Qwen3.6 native-heads MTP module (Qwen36MtpModule, implements INativeMtp)
- `dflash/src/qwen36/qwen36_mtp_graph.{cpp,h}`: MTP head forward graph
- `dflash/src/qwen36/qwen36_mtp_loader.cpp`: NextN tensor loader from GGUF
- `dflash/src/common/mtp_interface.h`: abstract IMtpModule + flavor mixins
- `dflash/src/common/mtp_chain_runner.{cpp,h}`: generic γ-loop runner
- `dflash/src/common/{gguf_metadata,gguf_mmap,step_graph,model_backend}.h` + `attn_masks.h` + `dflash_target.h` updates: shared infrastructure
- `dflash/src/qwen35/qwen35_backend.{cpp,h}`: extended with optional Qwen36MtpModule, init_mtp_, warm_mtp_for_prompt_, do_mtp_prefill_, do_mtp_decode_ (will be slimmed once orchestrator absorbs them, step 3.3)
- `dflash/src/qwen35/qwen35_daemon.{cpp,h}`: DaemonArgs carry MTP fields
- `dflash/src/qwen35/qwen35_dflash_target.{cpp,h}` + `qwen35_target_graph.cpp`: hidden-sequence capture path for MTP head warming
- `dflash/test/test_dflash.cpp`: daemon dispatch routes `--daemon --mtp-gguf` to run_qwen35_daemon (file-mode harness preserved)
- `dflash/scripts/server.py`: `--mtp-gguf`/`--mtp-gamma`/`--mtp-draft-source` CLI flags, MTP-mode spawn-cmd branch, layered on top of mrciffa's thinking-default fixes (commit 998b280) without conflict

### Step 3.1: common::mtp::warm_and_decode entry point (TDD red→green)

Howard's review:

> "MTP should be simple as additional weights of modelbackend. If a model
> contains MTP support (gemma4 or qwen3.5), the logic can handle it. In
> other words, the logic should be in /common which can potentially
> leverage by any modelbackend if they support mtp."

Carved out the public surface for the future orchestrator:

    GenerateResult dflash27b::common::mtp::warm_and_decode(
        ModelBackend * backend,
        const GenerateRequest & req,
        const DaemonIO & io);

New files:

- `dflash/src/common/mtp_orchestrator.{cpp,h}`: header pins the signature, cpp is a minimal stub that only handles guard cases (null backend, no MTP support, empty prompt). Real warm + decode body lands in step 3.2, driven by additional red→green tests.
- `dflash/test/test_common_mtp_orchestrator.cpp`: three guard tests written and watched fail BEFORE the stub existed (compile-time RED: "common/mtp_orchestrator.h: No such file or directory"), then GREEN after the stub returned matching error strings.

Test results:

- T1 null_backend PASS
- T2 backend_without_mtp PASS
- T3 empty_prompt PASS
- ALL PASS

## Steps 3.2-3.5 (separate commits, this PR)

- 3.2 fill warm_and_decode body (chunked prefill via DFlashTarget::verify_batch + hidden capture + MtpChainRunner.run); red test = identical token IDs vs reference run_qwen36_mtp_harness on a fixed prompt.
- 3.3 replace Qwen35Backend::do_mtp_decode_/do_mtp_prefill_ with calls to common::mtp::warm_and_decode; delete the qwen35-local helpers.
- 3.4 stub Gemma4Backend MTP override using the same common entry point to prove the interface is generic (not Qwen35-specific).
- 3.5 audit common/mtp_orchestrator + mtp_chain_runner for any hand-rolled CPU loops; replace with ggml primitives per howard's point #1. Then retest 24K baseline post-RoPE-fix (howard's other comment) and update PR description with current numbers.

Addresses:

- davide221 Luce-Org#214#issuecomment-4472910706 (merge conflicts): rebased
- howard0su Luce-Org#214#review (changes requested points 2, 3, 4): first slice
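
The three guard cases named in the commit message above can be sketched as a stub of the same shape. This is an illustration under assumptions: the types below are hypothetical stand-ins for the real ones in dflash/src/common/, the `sketch` namespace replaces `dflash27b::common::mtp`, and the error strings are invented because the commit message does not quote them.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real request/result/IO types.
struct GenerateRequest { std::vector<int> prompt_tokens; };
struct GenerateResult  { bool ok = false; std::string error; };
struct DaemonIO        {};

// Minimal backend interface: MTP support is opt-in per backend.
struct ModelBackend {
    virtual ~ModelBackend() = default;
    virtual bool supports_mtp() const { return false; }
};

// Test double for a backend that does carry MTP weights.
struct SketchMtpBackend : ModelBackend {
    bool supports_mtp() const override { return true; }
};

namespace sketch {
// Guard-only stub: rejects invalid calls, leaves warm + decode unimplemented.
GenerateResult warm_and_decode(ModelBackend * backend,
                               const GenerateRequest & req,
                               const DaemonIO & /*io*/) {
    GenerateResult r;
    if (backend == nullptr)        { r.error = "null backend";               return r; }
    if (!backend->supports_mtp())  { r.error = "backend has no MTP support"; return r; }
    if (req.prompt_tokens.empty()) { r.error = "empty prompt";               return r; }
    r.error = "not implemented";   // the real body is described as step 3.2
    return r;
}
}  // namespace sketch
```

Because the entry point only sees the `ModelBackend` interface, a second backend would opt in by overriding `supports_mtp()`, which is the genericity the review asked for.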

cubic-dev-ai

2 issues found across 7 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="dflash/src/qwen35/qwen35_backend.cpp:361">
P2: MTP generate ignores the bool result of `migrate_prefill_cache`, so a cache promotion/allocation failure can fall through into MTP decode with incomplete state.</violation>
</file>

<file name="dflash/src/common/mtp_orchestrator.cpp">

<violation number="1" location="dflash/src/common/mtp_orchestrator.cpp:77">
P1: Clearing `all_prefill_hidden` as a sentinel makes later `memcpy` writes undefined behavior if a subsequent chunk still returns hidden rows.</violation>
</file>

…ate to init

P1: capture-invariant violation now fails loud instead of clearing all_prefill_hidden mid-loop, which let the next chunk's memcpy write past freed memory (heap UB).

P2: migrate_prefill_cache moved out of generate() into init_mtp_(); max_ctx and gamma are config-time constants, so checking the bool return where backend init can fail cleanly removes the OOM-on-first-request → null ssm_intermediate → segfault path.

PR Luce-Org#214 review-4308296776 (cubic-bot P1+P2).

dusterbloom · 2026-05-18T09:17:11Z

@cubic-dev-ai addressed both review comments in 186bccc:

P1 (common/mtp_orchestrator.cpp:77 heap UB) — the all_prefill_hidden.clear() branch is gone. Capture is enabled+pinned at MTP attach, so the only way verify_batch returns a short chunk is a contract break; we now error out via result.error = "hidden seq capture invariant violated" and bail before the next chunk's memcpy can write past freed memory.
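
The P1 fix pattern, erroring out on a short chunk instead of clearing the destination buffer mid-loop, can be sketched as below. This is not the orchestrator's code: `gather_hidden`, `CaptureResult`, and the flat `std::vector<float>` chunk layout are assumptions for illustration; only the error string and the `all_prefill_hidden` name come from this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

struct CaptureResult { bool ok = true; std::string error; };

// Copies each chunk's hidden rows into one preallocated buffer.
// chunks[i] holds the rows returned for chunk i, flattened row-major;
// expected_rows[i] is how many rows that chunk was asked to produce.
CaptureResult gather_hidden(const std::vector<std::vector<float>> & chunks,
                            const std::vector<std::size_t> & expected_rows,
                            std::size_t hidden_dim,
                            std::vector<float> & all_prefill_hidden) {
    assert(chunks.size() == expected_rows.size());

    std::size_t total_rows = 0;
    for (std::size_t rows : expected_rows) total_rows += rows;
    all_prefill_hidden.assign(total_rows * hidden_dim, 0.0f);  // sized once, never shrunk

    CaptureResult result;
    std::size_t offset = 0;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        const std::size_t want = expected_rows[i] * hidden_dim;
        if (chunks[i].size() != want) {
            // Contract break: stop here rather than clear the buffer and keep copying.
            result.ok = false;
            result.error = "hidden seq capture invariant violated";
            return result;
        }
        if (want > 0) {
            std::memcpy(all_prefill_hidden.data() + offset, chunks[i].data(),
                        want * sizeof(float));
        }
        offset += want;
    }
    return result;
}
```

The destination keeps its full size on the error path, so no later write can land in freed memory even if a caller ignores the error.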

P2 (qwen35_backend.cpp:361 ignored return) — hoisted the migrate_prefill_cache call out of generate() into init_mtp_(). max_ctx and γ are config-time constants, so a single call at attach is correct by construction; the bool return is now checked and a failure tears the module down and fails backend init cleanly (no per-request OOM path → no null ssm_intermediate → no segfault).
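
The P2 fix, checking the allocation result once at attach time and failing init cleanly, might look like the following sketch. All names here (`InitSketchBackend`, `migrate_prefill_cache_sketch`, the slot-count parameters) are hypothetical; the real `migrate_prefill_cache` signature is not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

struct SketchCache   { std::size_t slots = 0; };
struct SketchMtpHead {};

// Hypothetical stand-in: growing the cache fails when the budget is exceeded.
bool migrate_prefill_cache_sketch(SketchCache & cache, std::size_t needed, std::size_t budget) {
    if (needed > budget) return false;
    cache.slots = needed;
    return true;
}

class InitSketchBackend {
public:
    // Called once at attach; both sizes are config-time constants.
    bool init_mtp(std::size_t needed_slots, std::size_t budget_slots) {
        mtp_ = std::make_unique<SketchMtpHead>();
        if (!migrate_prefill_cache_sketch(cache_, needed_slots, budget_slots)) {
            mtp_.reset();   // tear the module down
            return false;   // backend init fails before any request is served
        }
        return true;
    }

    bool mtp_attached() const { return mtp_ != nullptr; }

    // The request path can rely on a fully sized cache instead of re-checking.
    std::size_t generate_slots() const {
        assert(mtp_attached() && cache_.slots > 0);
        return cache_.slots;
    }

private:
    SketchCache cache_;
    std::unique_ptr<SketchMtpHead> mtp_;
};
```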

γ=3 re-bench unchanged: 49.7 tok/s decode, accept 0.78, zero pre_norm warnings, 7/7 harness clients green. γ=2 also verified: 49.5 tok/s, accept 0.81, 7/7 green.

dusterbloom · 2026-05-18T09:17:20Z

@howard0su 24K Heron retest with γ=3 on the refactored stack (MTP orchestrator in common/, abstract ModelBackend, capture-mode pinned at attach):

decode 49.7 tok/s
accept rate 0.78 (model card ceiling 0.83)
zero pre_norm null warnings (correct by construction — capture mode is pinned at attach, runtime toggle is a no-op)
7/7 harness clients green (claude_code, codex, opencode, openwebui, pi, openclaw, hermes)

γ=2 also tested: 49.5 tok/s, accept 0.81 (within noise of γ=3 on throughput, slightly higher accept). Keeping γ=3 default — tie is in noise, no reason to churn config.

Ports the Qwen3.6 MTP head onto the qwen35 backbone (same arch, NextN block at layer n_layer-1). Speculation runs through a new common chain runner; the existing DFlashTarget adapter handles verify/snapshot/restore.

- common/mtp_interface.h: flavor-tagged IMtpModule + INativeMtp / IExternalDrafterMtp mixins. Future Gemma4 drafter plugs in via IExternalDrafterMtp without touching the chain runner.
- common/mtp_chain_runner.{h,cpp}: γ-chain propose/verify/accept loop, hoisted out of the backend. Three KV-reconciliation paths (accept-all / fast rollback / recommit) share a single post-iter invariant so AR equivalence holds under recommit.
- common/mtp_orchestrator.{h,cpp}: chunked prefill + warm + dispatch to chain runner. Owns only control flow; all compute lives in DFlashTarget::verify_batch and INativeMtp::step_batch graphs on the backend device.
- qwen36/qwen36_mtp.{h,cpp,_graph.cpp,_loader.cpp}: GGUF tensor inventory for Qwen3.6 -MTP-GGUF, GPU warm graph, GPU step graph cached on (head_idx, fa_window, fused_lm_head, topk_k). γ is bound at attach time as the single source of truth.
- qwen35: supports_mtp()/mtp() exposed through ModelBackend; generate() delegates to common::mtp::warm_and_decode when MTP is configured. Cache sized for max(γ+1, ddtree_budget+1) verify tokens.
- server.py: --mtp-gguf and --mtp-gamma flags routed through; daemon command surface unchanged.

Tests: 4/4 test_common_mtp_orchestrator. Full build green; harness probe 7/7 (claude_code, codex, opencode, openwebui, pi, hermes, openclaw) at --max-ctx 65536; MTP decode reports accept_rate 0.43-0.88 on short agentic prompts.
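
The token accounting of one propose/verify/accept iteration, and the cache-sizing rule quoted above, can be sketched as follows. This assumes greedy (argmax) verification, which the commit message does not state, and it leaves out the three KV-reconciliation paths entirely; `accept_greedy` and `verify_slots` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One chain iteration under greedy verification.
// draft:  the γ tokens proposed by the MTP head.
// target: the target model's own choice at each of the γ+1 verified positions.
// Returns the tokens committed this iteration: the matching draft prefix plus
// one target token (a correction on mismatch, a bonus token on full accept).
std::vector<int32_t> accept_greedy(const std::vector<int32_t> & draft,
                                   const std::vector<int32_t> & target) {
    assert(target.size() == draft.size() + 1);
    std::vector<int32_t> committed;
    std::size_t i = 0;
    while (i < draft.size() && draft[i] == target[i]) {
        committed.push_back(draft[i]);
        ++i;
    }
    committed.push_back(target[i]);
    return committed;
}

// Verify-token capacity shared by both speculators: max(γ+1, ddtree_budget+1).
int verify_slots(int gamma, int ddtree_budget) {
    return std::max(gamma + 1, ddtree_budget + 1);
}
```

With γ=3, each iteration commits between 1 and 4 tokens, and the committed sequence always equals what plain autoregressive decoding with the target would have produced under the same greedy rule.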

dusterbloom · 2026-05-20T18:57:09Z

superseded by #237

dusterbloom force-pushed the feat/mtp-via-daemon branch from 6e53b68 to 129dbe7 Compare May 17, 2026 23:11

dusterbloom changed the title ~~feat(mtp): MTP-via-daemon — beat blog decode by +33% via real agents~~ feat(mtp): MTP-via-daemon end-to-end (incl. MTP infrastructure, ~3.3k LOC) May 17, 2026

cubic-dev-ai Bot reviewed May 17, 2026

howard0su suggested changes May 18, 2026

dusterbloom force-pushed the feat/mtp-via-daemon branch from 129dbe7 to 274d41e Compare May 18, 2026 07:32

cubic-dev-ai Bot reviewed May 18, 2026

dusterbloom force-pushed the feat/mtp-via-daemon branch from 186bccc to 2b92878 Compare May 18, 2026 10:28

dusterbloom changed the title ~~feat(mtp): MTP-via-daemon end-to-end (incl. MTP infrastructure, ~3.3k LOC)~~ feat(mtp): MTP-via-daemon end-to-end (incl. MTP infrastructure) May 18, 2026

dusterbloom force-pushed the feat/mtp-via-daemon branch from 2b92878 to 8409b01 Compare May 18, 2026 13:21

dusterbloom mentioned this pull request May 18, 2026

mtp: prefix-cache WARM hit (perfect + partial via range-warm) #221

Closed

dusterbloom force-pushed the feat/mtp-via-daemon branch from 8409b01 to e9cd58f Compare May 18, 2026 15:51

dusterbloom closed this May 20, 2026
