feat(mtp): MTP-via-daemon end-to-end (incl. MTP infrastructure) #214

Closed
dusterbloom wants to merge 1 commit into Luce-Org:main from dusterbloom:feat/mtp-via-daemon

Conversation

Contributor

@dusterbloom dusterbloom commented May 17, 2026

Scope: full MTP stack + daemon wiring + PR #213 daemon fixes

This PR brings THREE things to main that didn't exist there before:

  1. The Qwen3.6 MTP module (dflash/src/qwen36/qwen36_mtp.{cpp,h} and qwen36_mtp_graph.{cpp,h}, ~2.8k LOC). This was developed on feature/mtp-foundation-v2 over the past weeks but never landed on main.
  2. MTP-via-daemon wiring (~500 LOC) — what was originally scoped here: server.py --mtp-gguf, test_dflash dispatch, Qwen35Backend MTP integration.
  3. PR #213 daemon fixes (fix(qwen35,server): HTTP daemon path generates content + prefix cache works + state reset between requests), included so this PR is self-contained against main (those fixes are also up as PR #213 for narrower review).

Total: +3324 / -60 LOC. Above the 3k cap. Recommend reviewers process this in two passes:

If preferred, this PR can be split into:

Say the word and I'll re-split.

Headline measurements

Claude Code on 24K-token system prompt:

| Speculator | decode | accept | comment |
|---|---|---|---|
| DFlash (DDTree b=22) | 22.0 tok/s | n/a | PR #213 baseline |
| MTP (γ=3, daemon) | 35.8 tok/s | 0.41 | +63% vs DFlash |

Mean across 7 harness clients (3 back-to-back probes, 21 client-cells):

| Speculator | mean decode | median | max | out=0 cells |
|---|---|---|---|---|
| DFlash | 22.0 tok/s | | | 0 / 57 |
| MTP | 29.3 tok/s | 29.6 | 50.6 | 0 / 57 |
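
For reference, both relative gains follow from the decode rates reported above, taking DFlash at 22.0 tok/s as the baseline. This is plain arithmetic on the published numbers, not an additional measurement:

$$
\frac{35.8}{22.0} \approx 1.63 \;\; (+63\%\ \text{on Claude Code}), \qquad \frac{29.3}{22.0} \approx 1.33 \;\; (+33\%\ \text{on the 7-client mean})
$$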

vs Lucebox blog (RTX 3090, Qwen3.6 Q4_K_M, DDTree, published 22.6-29.6 tok/s range):

  • Our MTP mean 29.3, max 50.6 → at or above the top of the published range

Architecture summary

Same daemon binary, same bare-prompt protocol. MTP is a startup mode (--mtp-gguf flag). Backend extends Qwen35Backend with optional Qwen36MtpModule (no new backend class — keeps target loader/snapshot/park/cache-reset shared). Per-request mtp_module_->reset_chain() mirrors PR #213's reset_target_cache(cache_) so MTP also avoids the state-leak cascade under sustained load.
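
A minimal sketch of the shape described above, assuming a single backend class that owns an optional MTP module and clears both the target cache and the MTP chain at the start of every request. `BackendSketch` and `MtpModuleSketch` are hypothetical stand-ins, not the PR's `Qwen35Backend` or `Qwen36MtpModule`; only the name `reset_chain()` comes from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the MTP head: it only tracks the draft chain.
struct MtpModuleSketch {
    std::vector<int> chain;                 // per-request draft state
    void reset_chain() { chain.clear(); }
};

// Hypothetical stand-in for the backend: one class, MTP is optional.
class BackendSketch {
public:
    explicit BackendSketch(const std::string & mtp_gguf_path) {
        // MTP is a startup mode: the module exists only if a path was given.
        if (!mtp_gguf_path.empty()) mtp_ = std::make_unique<MtpModuleSketch>();
    }

    bool has_mtp() const { return mtp_ != nullptr; }
    std::size_t mtp_chain_len() const { return mtp_ ? mtp_->chain.size() : 0; }

    // Per-request entry point: drop all per-request state before any work.
    std::size_t generate(const std::vector<int> & prompt) {
        target_cache_.clear();              // stands in for reset_target_cache(cache_)
        if (mtp_) mtp_->reset_chain();      // the MTP side gets the same treatment
        // Nothing from the previous request may survive past this point.
        assert(target_cache_.empty() && (!mtp_ || mtp_->chain.empty()));

        target_cache_ = prompt;             // placeholder for prefill
        if (mtp_ && !prompt.empty()) mtp_->chain.push_back(prompt.back());
        return target_cache_.size();
    }

private:
    std::vector<int> target_cache_;
    std::unique_ptr<MtpModuleSketch> mtp_;
};
```

Constructing `BackendSketch("")` gives a DFlash-only backend, while any non-empty path enables the MTP branch; two back-to-back `generate()` calls leave only the second request's chain state behind.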

Validated

Known follow-ups (separate PR after this lands)

  1. Pre-norm hidden capture path warning ("hidden_at_pos_pre_norm returned null"). Currently degrades accept-rate but functional. Fix would push accept from ~0.3 → model-card-recommended 0.83 for γ≤2, lifting MTP decode mean toward 50+ tok/s.
  2. Prefix-cache RESTORE in MTP mode — disabled because MTP head KV isn't snapshotted. Adding to PrefixSnapshot would unlock TTFT savings (DFlash already gets 22s → 6s warm).
  3. mtp_topk draft-source path plumbed but not implemented in do_mtp_decode_ — chain path is production default per matrix bench D=3.

Test plan

  • All tests pass
  • Harness probes green on MTP + DFlash (no regression)
  • Matrix bench no regression
  • Real client end-to-end (Claude Code generates valid code)
  • Beat blog decode (35.8 tok/s Claude Code vs 22.6-29.6 published)
  • (review) Confirm 2.8k LOC MTP module review acceptable in one pass, or request split
  • (next PR) Pre-norm fix → accept_rate → 0.83+
  • (next PR) MTP head KV snapshot → prefix-cache works in MTP mode

@dusterbloom dusterbloom force-pushed the feat/mtp-via-daemon branch from 6e53b68 to 129dbe7 Compare May 17, 2026 23:11
@dusterbloom dusterbloom changed the title feat(mtp): MTP-via-daemon — beat blog decode by +33% via real agents feat(mtp): MTP-via-daemon end-to-end (incl. MTP infrastructure, ~3.3k LOC) May 17, 2026
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

No issues found across 4 files

@davide221
Contributor

@dusterbloom thanks for the great feature! Can you fix the merge conflicts?

Contributor

@howard0su howard0su left a comment

The PR is not ready for check-in:

  1. Please leverage ggml for the CPU code (or make it generic enough to support any ggml backend).
  2. MTP should be as simple as additional weights on the modelbackend.
  3. If a model contains MTP support (whether gemma4 or qwen3.5), the logic can handle it. In other words, the logic should live in /common, where any modelbackend that supports MTP can leverage it.
  4. You may want to introduce some additional interfaces inside the modelbackend class.

Comment thread dflash/src/qwen36/qwen36_mtp.cpp
}

// Argmax over a float vector; returns index of max element.
static int32_t argmax(const float * logits, int n) {

This duplicates code in ggml, which already has CPU versions of these.
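
For context, a helper with the signature shown in the hunk would presumably look like the stand-alone sketch below. This is an assumption about its shape, not the PR's code: `argmax_sketch` is a hypothetical name and the body uses only the standard library. The reviewer's point is that ggml already provides an equivalent op (`ggml_argmax`), so a hand-written version like this is redundant inside the project.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>

// Index of the largest element in logits[0..n); ties resolve to the lowest index.
inline int32_t argmax_sketch(const float * logits, int n) {
    assert(logits != nullptr && n > 0);
    // std::max_element returns the first of several equal maxima.
    const float * best = std::max_element(logits, logits + n);
    return static_cast<int32_t>(std::distance(logits, best));
}
```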

@@ -0,0 +1,2100 @@
// qwen36_mtp.cpp — see qwen36_mtp.h for contract.
//
// PR 2d-bis (Shape B): implements the full DeepSeek-V3 NextN per-head forward

Please use ggml ops for the CPU side of the calculation. The reason is not just to reduce duplicate code but also to enable multi-GPU scenarios, where we can offload path A to GPU-B.

std::unique_ptr<DFlashTarget> dflash_target_;

// ── MTP speculator (optional, set when cfg_.mtp_gguf_path != nullptr) ──
std::unique_ptr<mtp::Qwen36MtpModule> mtp_module_;

what's the concept of Module here? I prefer that MTP is another set of weights. The current MTP code should be generic enough to handle gemma4 as well.

@howard0su
Contributor

Also, after the RoPE change, please retest the 24k-context baseline (that was the bug).

@dusterbloom dusterbloom force-pushed the feat/mtp-via-daemon branch from 129dbe7 to 274d41e Compare May 18, 2026 07:32
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 18, 2026
…and_decode (step 3.1)

Rebase of the MTP-via-daemon work onto latest main (PRs Luce-Org#213, Luce-Org#210, Luce-Org#208,
Luce-Org#207 already merged) plus the first slice of howard0su's PR Luce-Org#214 review
request: move MTP orchestration into dflash/src/common/ behind a generic
entry point any ModelBackend can call.

## What landed

### Foundation (rebase port, ~5k LOC)

- `dflash/src/qwen36/qwen36_mtp.{cpp,h}` (2.3k LOC) — Qwen3.6 native-heads
  MTP module (Qwen36MtpModule, implements INativeMtp)
- `dflash/src/qwen36/qwen36_mtp_graph.{cpp,h}` — MTP head forward graph
- `dflash/src/qwen36/qwen36_mtp_loader.cpp` — NextN tensor loader from GGUF
- `dflash/src/common/mtp_interface.h` — abstract IMtpModule + flavor mixins
- `dflash/src/common/mtp_chain_runner.{cpp,h}` — generic γ-loop runner
- `dflash/src/common/{gguf_metadata,gguf_mmap,step_graph,model_backend}.h`
  + `attn_masks.h` + `dflash_target.h` updates: shared infrastructure
- `dflash/src/qwen35/qwen35_backend.{cpp,h}` — extended with optional
  Qwen36MtpModule, init_mtp_, warm_mtp_for_prompt_, do_mtp_prefill_,
  do_mtp_decode_ (will be slimmed once orchestrator absorbs them, step 3.3)
- `dflash/src/qwen35/qwen35_daemon.{cpp,h}` — DaemonArgs carry MTP fields
- `dflash/src/qwen35/qwen35_dflash_target.{cpp,h}` + `qwen35_target_graph.cpp`
  — hidden-sequence capture path for MTP head warming
- `dflash/test/test_dflash.cpp` — daemon dispatch routes
  `--daemon --mtp-gguf` to run_qwen35_daemon (file-mode harness preserved)
- `dflash/scripts/server.py` — `--mtp-gguf`/`--mtp-gamma`/`--mtp-draft-source`
  CLI flags, MTP-mode spawn-cmd branch, layered on top of mrciffa's
  thinking-default fixes (commit 998b280) without conflict

### Step 3.1 — common::mtp::warm_and_decode entry point (TDD red→green)

Howard's review:
> "MTP should be simple as additional weights of modelbackend. If a model
>  contains MTP support (gemma4 or qwen3.5), the logic can handle it. In
>  other words, the logic should be in /common which can potentially
>  leverage by any modelbackend if they support mtp."

Carved out the public surface for the future orchestrator:

  GenerateResult dflash27b::common::mtp::warm_and_decode(
      ModelBackend * backend, const GenerateRequest & req, const DaemonIO & io);

New files:
- `dflash/src/common/mtp_orchestrator.{cpp,h}` — header pins the signature,
  cpp is a minimal stub that only handles guard cases (null backend, no
  MTP support, empty prompt). Real warm + decode body lands in step 3.2,
  driven by additional red→green tests.
- `dflash/test/test_common_mtp_orchestrator.cpp` — three guard tests
  written and watched fail BEFORE the stub existed (compile-time RED:
  "common/mtp_orchestrator.h: No such file or directory"), then GREEN
  after the stub returned matching error strings.

Test results:
  T1 null_backend PASS
  T2 backend_without_mtp PASS
  T3 empty_prompt PASS
  ALL PASS
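
The three guard cases can be sketched as follows. This is a pared-down illustration under stated assumptions: the types are hypothetical stand-ins for the repository's `ModelBackend`, `GenerateRequest` and `GenerateResult`, the `DaemonIO` parameter is omitted, and the error strings are invented for the example rather than taken from the stub.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, minimal stand-ins for the repository's types.
struct GenerateRequest { std::vector<int> prompt; };
struct GenerateResult  { bool ok = false; std::string error; };

struct ModelBackend {
    virtual ~ModelBackend() = default;
    // Backends without an MTP head keep the default.
    virtual bool supports_mtp() const { return false; }
};
struct MtpCapableBackend : ModelBackend {
    bool supports_mtp() const override { return true; }
};

// Guard-only stub: rejects bad inputs and does no real work yet.
inline GenerateResult warm_and_decode_stub(ModelBackend * backend,
                                           const GenerateRequest & req) {
    GenerateResult r;
    if (backend == nullptr)       { r.error = "null backend";        return r; }
    if (!backend->supports_mtp()) { r.error = "backend without mtp"; return r; }
    if (req.prompt.empty())       { r.error = "empty prompt";        return r; }
    r.error = "not implemented";  // the real warm + decode body would go here
    return r;
}

// Mirrors T1..T3: each guard must trip with its own error string.
inline void run_guard_tests() {
    GenerateRequest req;
    req.prompt = {1, 2, 3};
    ModelBackend plain;
    MtpCapableBackend mtp;
    assert(warm_and_decode_stub(nullptr, req).error == "null backend");          // T1
    assert(warm_and_decode_stub(&plain, req).error == "backend without mtp");    // T2
    assert(warm_and_decode_stub(&mtp, GenerateRequest{}).error == "empty prompt"); // T3
}
```

Checking the guards in this order means a null pointer is never dereferenced, and a backend without MTP is rejected before the request is inspected.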

## Steps 3.2-3.5 (separate commits, this PR)

3.2 fill warm_and_decode body (chunked prefill via DFlashTarget::verify_batch
    + hidden capture + MtpChainRunner.run); red test = identical token IDs
    vs reference run_qwen36_mtp_harness on a fixed prompt.
3.3 replace Qwen35Backend::do_mtp_decode_/do_mtp_prefill_ with calls to
    common::mtp::warm_and_decode; delete the qwen35-local helpers.
3.4 stub Gemma4Backend MTP override using the same common entry point to
    prove the interface is generic (not Qwen35-specific).
3.5 audit common/mtp_orchestrator + mtp_chain_runner for any hand-rolled
    CPU loops; replace with ggml primitives per howard's point #1.

Then retest 24K baseline post-RoPE-fix (howard's other comment) and update
PR description with current numbers.

Addresses:
- davide221 Luce-Org#214#issuecomment-4472910706 (merge conflicts) — rebased
- howard0su Luce-Org#214#review (changes requested points 2, 3, 4) — first slice

Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 7 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="dflash/src/qwen35/qwen35_backend.cpp:361">
P2: MTP generate ignores the bool result of `migrate_prefill_cache`, so a cache promotion/allocation failure can fall through into MTP decode with incomplete state.</violation>
</file>

<file name="dflash/src/common/mtp_orchestrator.cpp">

<violation number="1" location="dflash/src/common/mtp_orchestrator.cpp:77">
P1: Clearing `all_prefill_hidden` as a sentinel makes later `memcpy` writes undefined behavior if a subsequent chunk still returns hidden rows.</violation>
</file>


Comment thread dflash/src/common/mtp_orchestrator.cpp Outdated
Comment thread dflash/src/qwen35/qwen35_backend.cpp Outdated
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 18, 2026
…ate to init

P1: capture-invariant violation now fails loud instead of clearing
all_prefill_hidden mid-loop, which let the next chunk's memcpy
write past freed memory (heap UB).

P2: migrate_prefill_cache moved out of generate() into init_mtp_();
max_ctx and gamma are config-time constants, so checking the bool
return where backend init can fail cleanly removes the
OOM-on-first-request → null ssm_intermediate → segfault path.

PR Luce-Org#214 review-4308296776 (cubic-bot P1+P2).
@dusterbloom
Contributor Author

@cubic-dev-ai addressed both review comments in 186bccc:

P1 (common/mtp_orchestrator.cpp:77 heap UB) — the all_prefill_hidden.clear() branch is gone. Capture is enabled+pinned at MTP attach, so the only way verify_batch returns a short chunk is a contract break; we now error out via result.error = "hidden seq capture invariant violated" and bail before the next chunk's memcpy can write past freed memory.
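
A minimal sketch of the fail-loud pattern described in P1, assuming a buffer that is sized once up front and a per-chunk row count that must match what was requested. `gather_hidden` and `PrefillResult` are hypothetical names, and `std::copy` stands in for the `memcpy` in the real code; only the error string is taken from the comment above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct PrefillResult { bool ok = true; std::string error; };

// Gathers each chunk's hidden rows into one buffer.
// expected_rows[i] is the number of rows chunk i must contain.
inline PrefillResult gather_hidden(const std::vector<std::vector<float>> & chunks,
                                   const std::vector<std::size_t> & expected_rows,
                                   std::size_t row_width,
                                   std::vector<float> & all_hidden) {
    assert(chunks.size() == expected_rows.size());
    PrefillResult r;

    std::size_t total_rows = 0;
    for (std::size_t n : expected_rows) total_rows += n;
    // Sized once; never cleared or shrunk while the loop is running.
    all_hidden.assign(total_rows * row_width, 0.0f);

    std::size_t row = 0;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        if (chunks[i].size() != expected_rows[i] * row_width) {
            // Contract break: report it and stop before any further write.
            r.ok = false;
            r.error = "hidden seq capture invariant violated";
            return r;
        }
        std::copy(chunks[i].begin(), chunks[i].end(),
                  all_hidden.begin() + static_cast<std::ptrdiff_t>(row * row_width));
        row += expected_rows[i];
    }
    return r;
}
```

Because the destination is never resized inside the loop, every write lands inside the allocation made up front, and a short chunk surfaces as an error result instead of corrupting memory on the next iteration.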

P2 (qwen35_backend.cpp:361 ignored return) — hoisted the migrate_prefill_cache call out of generate() into init_mtp_(). max_ctx and γ are config-time constants, so a single call at attach is correct by construction; the bool return is now checked and a failure tears the module down and fails backend init cleanly (no per-request OOM path → no null ssm_intermediate → no segfault).
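
The P2 change can be pictured as moving a fallible allocation from the request path to init and checking its result there. The sketch below is an illustration under assumptions: `CacheSketch`, `InitCheckedBackend`, the `budget` parameter that simulates an allocation failure, and the `max_ctx * (gamma + 1)` size are all invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cache whose growth can fail and reports that via bool.
struct CacheSketch {
    std::vector<float> buf;
    bool migrate(std::size_t n_floats, std::size_t budget) {
        if (n_floats > budget) return false;   // simulated allocation failure
        buf.assign(n_floats, 0.0f);
        return true;
    }
};

struct InitCheckedBackend {
    CacheSketch cache;
    bool mtp_attached = false;

    // max_ctx and gamma are config-time constants, so sizing happens once here.
    bool init_mtp(std::size_t max_ctx, std::size_t gamma, std::size_t budget) {
        const std::size_t needed = max_ctx * (gamma + 1);  // illustrative size only
        if (!cache.migrate(needed, budget)) {
            mtp_attached = false;   // tear down; the caller fails backend init
            return false;
        }
        mtp_attached = true;
        return true;
    }

    // The request path can rely on init having succeeded.
    std::size_t generate() const {
        assert(mtp_attached);
        return cache.buf.size();
    }
};
```

A caller that sees `init_mtp` return false refuses to start the backend, so no request ever reaches `generate()` with a half-built cache.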

γ=3 re-bench unchanged: 49.7 tok/s decode, accept 0.78, zero pre_norm warnings, 7/7 harness clients green. γ=2 also verified: 49.5 tok/s, accept 0.81, 7/7 green.

@dusterbloom
Contributor Author

@howard0su 24K Heron retest with γ=3 on the refactored stack (MTP orchestrator in common/, abstract ModelBackend, capture-mode pinned at attach):

  • decode 49.7 tok/s
  • accept rate 0.78 (model card ceiling 0.83)
  • zero pre_norm null warnings (correct by construction — capture mode is pinned at attach, runtime toggle is a no-op)
  • 7/7 harness clients green (claude_code, codex, opencode, openwebui, pi, openclaw, hermes)

γ=2 also tested: 49.5 tok/s, accept 0.81 (within noise of γ=3 on throughput, slightly higher accept). Keeping γ=3 default — tie is in noise, no reason to churn config.

@dusterbloom dusterbloom force-pushed the feat/mtp-via-daemon branch from 186bccc to 2b92878 Compare May 18, 2026 10:28
@dusterbloom dusterbloom changed the title feat(mtp): MTP-via-daemon end-to-end (incl. MTP infrastructure, ~3.3k LOC) feat(mtp): MTP-via-daemon end-to-end (incl. MTP infrastructure) May 18, 2026
@dusterbloom dusterbloom force-pushed the feat/mtp-via-daemon branch from 2b92878 to 8409b01 Compare May 18, 2026 13:21
Ports the Qwen3.6 MTP head onto the qwen35 backbone (same arch, NextN
block at layer n_layer-1). Speculation runs through a new common chain
runner; the existing DFlashTarget adapter handles verify/snapshot/restore.

- common/mtp_interface.h: flavor-tagged IMtpModule + INativeMtp /
  IExternalDrafterMtp mixins. Future Gemma4 drafter plugs in via
  IExternalDrafterMtp without touching the chain runner.
- common/mtp_chain_runner.{h,cpp}: γ-chain propose/verify/accept loop,
  hoisted out of the backend. Three KV-reconciliation paths
  (accept-all / fast rollback / recommit) share a single post-iter
  invariant so AR equivalence holds under recommit.
- common/mtp_orchestrator.{h,cpp}: chunked prefill + warm + dispatch
  to chain runner. Owns only control flow; all compute lives in
  DFlashTarget::verify_batch and INativeMtp::step_batch graphs on the
  backend device.
- qwen36/qwen36_mtp.{h,cpp,_graph.cpp,_loader.cpp}: GGUF tensor
  inventory for Qwen3.6 -MTP-GGUF, GPU warm graph, GPU step graph
  cached on (head_idx, fa_window, fused_lm_head, topk_k). γ is bound
  at attach time as the single source of truth.
- qwen35: supports_mtp()/mtp() exposed through ModelBackend;
  generate() delegates to common::mtp::warm_and_decode when MTP is
  configured. Cache sized for max(γ+1, ddtree_budget+1) verify tokens.
- server.py: --mtp-gguf and --mtp-gamma flags routed through; daemon
  command surface unchanged.
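
The propose/verify/accept step named in the chain-runner bullet can be sketched with the standard greedy acceptance rule for speculative decoding. This is an assumption about the general technique, not a copy of `mtp_chain_runner`: `accept_greedy` and `AcceptResult` are hypothetical names, and the three KV-reconciliation paths are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct AcceptResult {
    std::vector<int> emitted;   // tokens committed this iteration
    std::size_t accepted = 0;   // number of draft tokens that matched
};

// draft:  the gamma tokens proposed by the speculator.
// target: the gamma + 1 tokens the target model predicts at the same positions
//         (one per draft position, plus one after the last draft token).
inline AcceptResult accept_greedy(const std::vector<int> & draft,
                                  const std::vector<int> & target) {
    assert(target.size() == draft.size() + 1);
    AcceptResult r;
    // Keep draft tokens for as long as they agree with the target.
    while (r.accepted < draft.size() && draft[r.accepted] == target[r.accepted]) {
        r.emitted.push_back(draft[r.accepted]);
        ++r.accepted;
    }
    // Always emit one target token: the correction on a mismatch,
    // or the bonus token when every draft token was accepted.
    r.emitted.push_back(target[r.accepted]);
    return r;
}
```

With γ=3, a draft of {5, 7, 9} checked against target predictions {5, 7, 2, 4} commits {5, 7, 2}: two accepted draft tokens plus the target's correction. Each iteration therefore emits between 1 and γ+1 tokens, and the output matches what the target alone would have produced.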

Tests: 4/4 test_common_mtp_orchestrator. Full build green; harness probe
7/7 (claude_code, codex, opencode, openwebui, pi, hermes, openclaw) at
--max-ctx 65536; MTP decode reports accept_rate 0.43-0.88 on short
agentic prompts.
@dusterbloom dusterbloom force-pushed the feat/mtp-via-daemon branch from 8409b01 to e9cd58f Compare May 18, 2026 15:51
@dusterbloom
Contributor Author

superseded by #237
