feat(mtp): MTP-via-daemon end-to-end (incl. MTP infrastructure)#214
feat(mtp): MTP-via-daemon end-to-end (incl. MTP infrastructure)#214dusterbloom wants to merge 1 commit into
Conversation
6e53b68 to
129dbe7
Compare
|
@dusterbloom thanks for the great feature! Can you fix merge conflics? |
howard0su
left a comment
There was a problem hiding this comment.
The PR is not ready for checkin:
- Please leverage ggml for CPU code (or make it generic to support any ggml backend)
- MTP should be simple as additional weights of modelbackend
- If a model contains MTP support (no matter gemma4 or qwen3.5), the logic can handle it. In other word, the logic should be in /common which can potentially leverage by any modelbackend if they support mtp.
- You may want to introduce some additional interface inside modelbackend class.
| } | ||
|
|
||
| // Argmax over a float vector; returns index of max element. | ||
| static int32_t argmax(const float * logits, int n) { |
There was a problem hiding this comment.
this is dup code of ggml, which has cpu version of these.
| @@ -0,0 +1,2100 @@ | |||
| // qwen36_mtp.cpp — see qwen36_mtp.h for contract. | |||
| // | |||
| // PR 2d-bis (Shape B): implements the full DeepSeek-V3 NextN per-head forward | |||
There was a problem hiding this comment.
please leverage ggml ops to CPU side of calculation. the reason is not just reduce duplicate code but also enable multi-gpu scenerio that we can offload path A to GPU-B
| std::unique_ptr<DFlashTarget> dflash_target_; | ||
|
|
||
| // ── MTP speculator (optional, set when cfg_.mtp_gguf_path != nullptr) ── | ||
| std::unique_ptr<mtp::Qwen36MtpModule> mtp_module_; |
There was a problem hiding this comment.
what's the concept of Module here? I prefer that MTP is another set of weights. The current MTP code should be generic enough to handle gemma4 as well.
|
also, after the change of RoPE, please retest 24k context baseline (it was the bug). |
129dbe7 to
274d41e
Compare
…and_decode (step 3.1) Rebase of the MTP-via-daemon work onto latest main (PRs Luce-Org#213, Luce-Org#210, Luce-Org#208, Luce-Org#207 already merged) plus the first slice of howard0su's PR Luce-Org#214 review request: move MTP orchestration into dflash/src/common/ behind a generic entry point any ModelBackend can call. ## What landed ### Foundation (rebase port, ~5k LOC) - `dflash/src/qwen36/qwen36_mtp.{cpp,h}` (2.3k LOC) — Qwen3.6 native-heads MTP module (Qwen36MtpModule, implements INativeMtp) - `dflash/src/qwen36/qwen36_mtp_graph.{cpp,h}` — MTP head forward graph - `dflash/src/qwen36/qwen36_mtp_loader.cpp` — NextN tensor loader from GGUF - `dflash/src/common/mtp_interface.h` — abstract IMtpModule + flavor mixins - `dflash/src/common/mtp_chain_runner.{cpp,h}` — generic γ-loop runner - `dflash/src/common/{gguf_metadata,gguf_mmap,step_graph,model_backend}.h` + `attn_masks.h` + `dflash_target.h` updates: shared infrastructure - `dflash/src/qwen35/qwen35_backend.{cpp,h}` — extended with optional Qwen36MtpModule, init_mtp_, warm_mtp_for_prompt_, do_mtp_prefill_, do_mtp_decode_ (will be slimmed once orchestrator absorbs them, step 3.3) - `dflash/src/qwen35/qwen35_daemon.{cpp,h}` — DaemonArgs carry MTP fields - `dflash/src/qwen35/qwen35_dflash_target.{cpp,h}` + `qwen35_target_graph.cpp` — hidden-sequence capture path for MTP head warming - `dflash/test/test_dflash.cpp` — daemon dispatch routes `--daemon --mtp-gguf` to run_qwen35_daemon (file-mode harness preserved) - `dflash/scripts/server.py` — `--mtp-gguf`/`--mtp-gamma`/`--mtp-draft-source` CLI flags, MTP-mode spawn-cmd branch, layered on top of mrciffa's thinking-default fixes (commit 998b280) without conflict ### Step 3.1 — common::mtp::warm_and_decode entry point (TDD red→green) Howard's review: > "MTP should be simple as additional weights of modelbackend. If a model > contains MTP support (gemma4 or qwen3.5), the logic can handle it. In > other words, the logic should be in /common which can potentially > leverage by any modelbackend if they support mtp." Carved out the public surface for the future orchestrator: GenerateResult dflash27b::common::mtp::warm_and_decode( ModelBackend * backend, const GenerateRequest & req, const DaemonIO & io); New files: - `dflash/src/common/mtp_orchestrator.{cpp,h}` — header pins the signature, cpp is a minimal stub that only handles guard cases (null backend, no MTP support, empty prompt). Real warm + decode body lands in step 3.2, driven by additional red→green tests. - `dflash/test/test_common_mtp_orchestrator.cpp` — three guard tests written and watched fail BEFORE the stub existed (compile-time RED: "common/mtp_orchestrator.h: No such file or directory"), then GREEN after the stub returned matching error strings. Test results: T1 null_backend PASS T2 backend_without_mtp PASS T3 empty_prompt PASS ALL PASS ## Steps 3.2-3.5 (separate commits, this PR) 3.2 fill warm_and_decode body (chunked prefill via DFlashTarget::verify_batch + hidden capture + MtpChainRunner.run); red test = identical token IDs vs reference run_qwen36_mtp_harness on a fixed prompt. 3.3 replace Qwen35Backend::do_mtp_decode_/do_mtp_prefill_ with calls to common::mtp::warm_and_decode; delete the qwen35-local helpers. 3.4 stub Gemma4Backend MTP override using the same common entry point to prove the interface is generic (not Qwen35-specific). 3.5 audit common/mtp_orchestrator + mtp_chain_runner for any hand-rolled CPU loops; replace with ggml primitives per howard's point #1. Then retest 24K baseline post-RoPE-fix (howard's other comment) and update PR description with current numbers. Addresses: - davide221 Luce-Org#214#issuecomment-4472910706 (merge conflicts) — rebased - howard0su Luce-Org#214#review (changes requested points 2, 3, 4) — first slice
There was a problem hiding this comment.
2 issues found across 7 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/src/qwen35/qwen35_backend.cpp">
<violation number="1" location="dflash/src/qwen35/qwen35_backend.cpp:361">
P2: MTP generate ignores the bool result of `migrate_prefill_cache`, so a cache promotion/allocation failure can fall through into MTP decode with incomplete state.</violation>
</file>
<file name="dflash/src/common/mtp_orchestrator.cpp">
<violation number="1" location="dflash/src/common/mtp_orchestrator.cpp:77">
P1: Clearing `all_prefill_hidden` as a sentinel makes later `memcpy` writes undefined behavior if a subsequent chunk still returns hidden rows.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
…ate to init P1: capture-invariant violation now fails loud instead of clearing all_prefill_hidden mid-loop, which let the next chunk's memcpy write past freed memory (heap UB). P2: migrate_prefill_cache moved out of generate() into init_mtp_(); max_ctx and gamma are config-time constants, so checking the bool return where backend init can fail cleanly removes the OOM-on-first-request → null ssm_intermediate → segfault path. PR Luce-Org#214 review-4308296776 (cubic-bot P1+P2).
|
@cubic-dev-ai addressed both review comments in 186bccc: P1 ( P2 ( γ=3 re-bench unchanged: 49.7 tok/s decode, accept 0.78, zero |
|
@howard0su 24K Heron retest with γ=3 on the refactored stack (MTP orchestrator in
γ=2 also tested: 49.5 tok/s, accept 0.81 (within noise of γ=3 on throughput, slightly higher accept). Keeping γ=3 default — tie is in noise, no reason to churn config. |
186bccc to
2b92878
Compare
…and_decode (step 3.1) Rebase of the MTP-via-daemon work onto latest main (PRs Luce-Org#213, Luce-Org#210, Luce-Org#208, request: move MTP orchestration into dflash/src/common/ behind a generic entry point any ModelBackend can call. - `dflash/src/qwen36/qwen36_mtp.{cpp,h}` (2.3k LOC) — Qwen3.6 native-heads MTP module (Qwen36MtpModule, implements INativeMtp) - `dflash/src/qwen36/qwen36_mtp_graph.{cpp,h}` — MTP head forward graph - `dflash/src/qwen36/qwen36_mtp_loader.cpp` — NextN tensor loader from GGUF - `dflash/src/common/mtp_interface.h` — abstract IMtpModule + flavor mixins - `dflash/src/common/mtp_chain_runner.{cpp,h}` — generic γ-loop runner - `dflash/src/common/{gguf_metadata,gguf_mmap,step_graph,model_backend}.h` + `attn_masks.h` + `dflash_target.h` updates: shared infrastructure - `dflash/src/qwen35/qwen35_backend.{cpp,h}` — extended with optional Qwen36MtpModule, init_mtp_, warm_mtp_for_prompt_, do_mtp_prefill_, do_mtp_decode_ (will be slimmed once orchestrator absorbs them, step 3.3) - `dflash/src/qwen35/qwen35_daemon.{cpp,h}` — DaemonArgs carry MTP fields - `dflash/src/qwen35/qwen35_dflash_target.{cpp,h}` + `qwen35_target_graph.cpp` — hidden-sequence capture path for MTP head warming - `dflash/test/test_dflash.cpp` — daemon dispatch routes `--daemon --mtp-gguf` to run_qwen35_daemon (file-mode harness preserved) - `dflash/scripts/server.py` — `--mtp-gguf`/`--mtp-gamma`/`--mtp-draft-source` CLI flags, MTP-mode spawn-cmd branch, layered on top of mrciffa's thinking-default fixes (commit 998b280) without conflict Howard's review: > "MTP should be simple as additional weights of modelbackend. If a model > contains MTP support (gemma4 or qwen3.5), the logic can handle it. In > other words, the logic should be in /common which can potentially > leverage by any modelbackend if they support mtp." Carved out the public surface for the future orchestrator: GenerateResult dflash27b::common::mtp::warm_and_decode( ModelBackend * backend, const GenerateRequest & req, const DaemonIO & io); New files: - `dflash/src/common/mtp_orchestrator.{cpp,h}` — header pins the signature, cpp is a minimal stub that only handles guard cases (null backend, no MTP support, empty prompt). Real warm + decode body lands in step 3.2, driven by additional red→green tests. - `dflash/test/test_common_mtp_orchestrator.cpp` — three guard tests written and watched fail BEFORE the stub existed (compile-time RED: "common/mtp_orchestrator.h: No such file or directory"), then GREEN after the stub returned matching error strings. Test results: T1 null_backend PASS T2 backend_without_mtp PASS T3 empty_prompt PASS ALL PASS 3.2 fill warm_and_decode body (chunked prefill via DFlashTarget::verify_batch + hidden capture + MtpChainRunner.run); red test = identical token IDs vs reference run_qwen36_mtp_harness on a fixed prompt. 3.3 replace Qwen35Backend::do_mtp_decode_/do_mtp_prefill_ with calls to common::mtp::warm_and_decode; delete the qwen35-local helpers. 3.4 stub Gemma4Backend MTP override using the same common entry point to prove the interface is generic (not Qwen35-specific). 3.5 audit common/mtp_orchestrator + mtp_chain_runner for any hand-rolled CPU loops; replace with ggml primitives per howard's point #1. Then retest 24K baseline post-RoPE-fix (howard's other comment) and update PR description with current numbers. Addresses: - davide221 Luce-Org#214#issuecomment-4472910706 (merge conflicts) — rebased - howard0su Luce-Org#214#review (changes requested points 2, 3, 4) — first slice
…ate to init P1: capture-invariant violation now fails loud instead of clearing all_prefill_hidden mid-loop, which let the next chunk's memcpy write past freed memory (heap UB). P2: migrate_prefill_cache moved out of generate() into init_mtp_(); max_ctx and gamma are config-time constants, so checking the bool return where backend init can fail cleanly removes the OOM-on-first-request → null ssm_intermediate → segfault path. PR Luce-Org#214 review-4308296776 (cubic-bot P1+P2).
2b92878 to
8409b01
Compare
Ports the Qwen3.6 MTP head onto the qwen35 backbone (same arch, NextN
block at layer n_layer-1). Speculation runs through a new common chain
runner; the existing DFlashTarget adapter handles verify/snapshot/restore.
- common/mtp_interface.h: flavor-tagged IMtpModule + INativeMtp /
IExternalDrafterMtp mixins. Future Gemma4 drafter plugs in via
IExternalDrafterMtp without touching the chain runner.
- common/mtp_chain_runner.{h,cpp}: γ-chain propose/verify/accept loop,
hoisted out of the backend. Three KV-reconciliation paths
(accept-all / fast rollback / recommit) share a single post-iter
invariant so AR equivalence holds under recommit.
- common/mtp_orchestrator.{h,cpp}: chunked prefill + warm + dispatch
to chain runner. Owns only control flow; all compute lives in
DFlashTarget::verify_batch and INativeMtp::step_batch graphs on the
backend device.
- qwen36/qwen36_mtp.{h,cpp,_graph.cpp,_loader.cpp}: GGUF tensor
inventory for Qwen3.6 -MTP-GGUF, GPU warm graph, GPU step graph
cached on (head_idx, fa_window, fused_lm_head, topk_k). γ is bound
at attach time as the single source of truth.
- qwen35: supports_mtp()/mtp() exposed through ModelBackend;
generate() delegates to common::mtp::warm_and_decode when MTP is
configured. Cache sized for max(γ+1, ddtree_budget+1) verify tokens.
- server.py: --mtp-gguf and --mtp-gamma flags routed through; daemon
command surface unchanged.
Tests: 4/4 test_common_mtp_orchestrator. Full build green; harness probe
7/7 (claude_code, codex, opencode, openwebui, pi, hermes, openclaw) at
--max-ctx 65536; MTP decode reports accept_rate 0.43-0.88 on short
agentic prompts.
8409b01 to
e9cd58f
Compare
|
superseded by #237 |
Scope: full MTP stack + daemon wiring + PR #213 daemon fixes
This PR brings TWO things to main that didn't exist there before:
dflash/src/qwen36/qwen36_mtp.{cpp,h}andqwen36_mtp_graph.{cpp,h}, ~2.8k LOC). This was developed onfeature/mtp-foundation-v2over the past weeks but never landed onmain.--mtp-gguf, test_dflash dispatch, Qwen35Backend MTP integration.main(those fixes are also up as PR fix(qwen35,server): HTTP daemon path generates content + prefix cache works + state reset between requests #213 for narrower review).Total: +3324 / -60 LOC. Above the 3k cap. Recommend reviewers process this in two passes:
If preferred, this PR can be split into:
Say the word and I'll re-split.
Headline measurements
Claude Code on 24K-token system prompt:
Mean across 7 harness clients (3 back-to-back probes, 21 client-cells):
vs Lucebox blog (RTX 3090, Qwen3.6 Q4_K_M, DDTree, published 22.6-29.6 tok/s range):
Architecture summary
Same daemon binary, same bare-prompt protocol. MTP is a startup mode (
--mtp-ggufflag). Backend extendsQwen35Backendwith optionalQwen36MtpModule(no new backend class — keeps target loader/snapshot/park/cache-reset shared). Per-requestmtp_module_->reset_chain()mirrors PR #213'sreset_target_cache(cache_)so MTP also avoids the state-leak cascade under sustained load.Validated
pytest dflash/scripts/test_server.py— 56/56 passharness/client_test_runner.py probe× 3 back-to-back × 7 clients = 21/21 cells green on MTP--mtp-gguf) — 7/7 (no regression from PR fix(qwen35,server): HTTP daemon path generates content + prefix cache works + state reset between requests #213 baseline)bench_agent.py --bucket 2k --n-sample 8 --skip-ar --budget 22— mean 53.37 tok/s vs canonical 46.70 = +14% (no regression on file-mode harness path)claude --print --model luce-dflashreal coding task on MTP — valid Heron's formula outputKnown follow-ups (separate PR after this lands)
PrefixSnapshotwould unlock TTFT savings (DFlash already gets 22s → 6s warm).do_mtp_decode_— chain path is production default per matrix bench D=3.Test plan