feat(mtp): port MTP foundation onto main + extract common::mtp::warm_and_decode (step 3.1)
Rebase of the MTP-via-daemon work onto latest main (PRs #213, #210, #208,
#207 already merged) plus the first slice of howard0su's PR #214 review
request: move MTP orchestration into dflash/src/common/ behind a generic
entry point any ModelBackend can call.
## What landed
### Foundation (rebase port, ~5k LOC)
- `dflash/src/qwen36/qwen36_mtp.{cpp,h}` (2.3k LOC) — Qwen3.6 native-heads
MTP module (Qwen36MtpModule, implements INativeMtp)
- `dflash/src/qwen36/qwen36_mtp_graph.{cpp,h}` — MTP head forward graph
- `dflash/src/qwen36/qwen36_mtp_loader.cpp` — NextN tensor loader from GGUF
- `dflash/src/common/mtp_interface.h` — abstract IMtpModule + flavor mixins
- `dflash/src/common/mtp_chain_runner.{cpp,h}` — generic γ-loop runner
- `dflash/src/common/{gguf_metadata,gguf_mmap,step_graph,model_backend}.h`
+ `attn_masks.h` + `dflash_target.h` updates: shared infrastructure
- `dflash/src/qwen35/qwen35_backend.{cpp,h}` — extended with optional
Qwen36MtpModule, init_mtp_, warm_mtp_for_prompt_, do_mtp_prefill_,
do_mtp_decode_ (will be slimmed once orchestrator absorbs them, step 3.3)
- `dflash/src/qwen35/qwen35_daemon.{cpp,h}` — DaemonArgs carry MTP fields
- `dflash/src/qwen35/qwen35_dflash_target.{cpp,h}` + `qwen35_target_graph.cpp`
— hidden-sequence capture path for MTP head warming
- `dflash/test/test_dflash.cpp` — daemon dispatch routes
`--daemon --mtp-gguf` to run_qwen35_daemon (file-mode harness preserved)
- `dflash/scripts/server.py` — `--mtp-gguf`/`--mtp-gamma`/`--mtp-draft-source`
CLI flags, MTP-mode spawn-cmd branch, layered on top of mrciffa's
thinking-default fixes (commit 998b280) without conflict
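For readers unfamiliar with a γ-loop, here is a minimal sketch of one greedy draft-and-verify step. It is not the actual `MtpChainRunner`: `Token`, `DraftFn`, `TargetFn`, and `speculative_step` are hypothetical names, and verification is done with sequential target calls for clarity, whereas a real runner would presumably verify all γ drafts in one batched forward pass.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Token = int;

// Hypothetical callbacks: each returns the greedy next token for a context.
using DraftFn  = std::function<Token(const std::vector<Token> &)>;  // cheap MTP-style head
using TargetFn = std::function<Token(const std::vector<Token> &)>;  // full target model

// One speculative step: draft `gamma` tokens, then keep the longest prefix
// the target agrees with. Returns the number of tokens appended to `ctx`.
std::size_t speculative_step(std::vector<Token> & ctx, std::size_t gamma,
                             const DraftFn & draft, const TargetFn & target) {
    // Draft phase: extend a scratch copy of the context gamma times.
    std::vector<Token> scratch = ctx;
    std::vector<Token> drafted;
    for (std::size_t i = 0; i < gamma; ++i) {
        const Token t = draft(scratch);
        drafted.push_back(t);
        scratch.push_back(t);
    }

    // Verify phase: the target's token is always the one that is kept.
    std::size_t appended = 0;
    for (const Token t : drafted) {
        const Token want = target(ctx);
        ctx.push_back(want);
        ++appended;
        if (want != t) {
            return appended;  // first mismatch: discard the remaining drafts
        }
    }

    // Every draft was accepted, so the target contributes one bonus token.
    ctx.push_back(target(ctx));
    return appended + 1;
}
```

Because only target tokens are ever committed, the output of this sketch matches plain greedy decoding with the target; the draft head only changes how many tokens are committed per step.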
### Step 3.1 — common::mtp::warm_and_decode entry point (TDD red→green)
Howard's review:
> "MTP should be simple as additional weights of modelbackend. If a model
> contains MTP support (gemma4 or qwen3.5), the logic can handle it. In
> other words, the logic should be in /common which can potentially
> leverage by any modelbackend if they support mtp."
Carved out the public surface for the future orchestrator:
    GenerateResult dflash27b::common::mtp::warm_and_decode(
        ModelBackend * backend, const GenerateRequest & req, const DaemonIO & io);
New files:
- `dflash/src/common/mtp_orchestrator.{cpp,h}` — header pins the signature,
cpp is a minimal stub that only handles guard cases (null backend, no
MTP support, empty prompt). Real warm + decode body lands in step 3.2,
driven by additional red→green tests.
- `dflash/test/test_common_mtp_orchestrator.cpp` — three guard tests
written and watched fail BEFORE the stub existed (compile-time RED:
"common/mtp_orchestrator.h: No such file or directory"), then GREEN
after the stub returned matching error strings.
Test results:
T1 null_backend PASS
T2 backend_without_mtp PASS
T3 empty_prompt PASS
ALL PASS
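
The three guard cases can be sketched roughly as below. This is an illustration under stated assumptions, not the code in `mtp_orchestrator.cpp`: the struct fields, the `supports_mtp()` query, `FakeMtpBackend`, and the error strings are all hypothetical stand-ins, and the namespace is omitted for brevity.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins; the real types in the dflash tree will differ.
struct GenerateRequest { std::vector<int> prompt_tokens; };
struct GenerateResult  { bool ok = false; std::string error; std::vector<int> tokens; };
struct DaemonIO {};

struct ModelBackend {
    virtual ~ModelBackend() = default;
    // Assumed capability query; the real interface may expose this differently.
    virtual bool supports_mtp() const { return false; }
};

// Example backend that advertises MTP support (hypothetical).
struct FakeMtpBackend : ModelBackend {
    bool supports_mtp() const override { return true; }
};

// Stub entry point: handles only the guard cases, in the order the tests list them.
GenerateResult warm_and_decode(ModelBackend * backend,
                               const GenerateRequest & req,
                               const DaemonIO & /*io*/) {
    GenerateResult result;
    if (backend == nullptr) {
        result.error = "null backend";                // T1
        return result;
    }
    if (!backend->supports_mtp()) {
        result.error = "backend has no MTP support";  // T2
        return result;
    }
    if (req.prompt_tokens.empty()) {
        result.error = "empty prompt";                // T3
        return result;
    }
    // The warm + decode body is out of scope for this sketch.
    result.error = "not implemented";
    return result;
}
```

Checking the null pointer first keeps the later guards free to dereference `backend` without further checks.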
## Steps 3.2-3.5 (separate commits, this PR)
3.2 fill warm_and_decode body (chunked prefill via DFlashTarget::verify_batch
+ hidden capture + MtpChainRunner.run); red test = identical token IDs
vs reference run_qwen36_mtp_harness on a fixed prompt.
3.3 replace Qwen35Backend::do_mtp_decode_/do_mtp_prefill_ with calls to
common::mtp::warm_and_decode; delete the qwen35-local helpers.
3.4 stub Gemma4Backend MTP override using the same common entry point to
prove the interface is generic (not Qwen35-specific).
3.5 audit common/mtp_orchestrator + mtp_chain_runner for any hand-rolled
CPU loops; replace with ggml primitives per howard's point #1.
Then retest 24K baseline post-RoPE-fix (howard's other comment) and update
PR description with current numbers.
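
Two small helpers of the kind step 3.2 implies, sketched under assumptions: one splits a prompt into prefill chunks, the other finds the first differing token ID between a run and a reference. `chunk_ranges` and `first_mismatch` are hypothetical names, and the chunk size is a free parameter here, not a value taken from the codebase.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split [0, n_tokens) into half-open [begin, end) ranges of at most `chunk` tokens.
std::vector<std::pair<std::size_t, std::size_t>>
chunk_ranges(std::size_t n_tokens, std::size_t chunk) {
    std::vector<std::pair<std::size_t, std::size_t>> out;
    if (chunk == 0) {
        return out;  // avoid an infinite loop on a zero chunk size
    }
    for (std::size_t begin = 0; begin < n_tokens; begin += chunk) {
        out.emplace_back(begin, std::min(begin + chunk, n_tokens));
    }
    return out;
}

// Index of the first position where the two token sequences differ,
// or -1 when they are identical (same length and same IDs).
std::ptrdiff_t first_mismatch(const std::vector<int> & got,
                              const std::vector<int> & want) {
    const auto its = std::mismatch(got.begin(), got.end(), want.begin(), want.end());
    if (its.first == got.end() && its.second == want.end()) {
        return -1;
    }
    return its.first - got.begin();
}
```

A length difference counts as a mismatch at the end of the shorter sequence, so a truncated run cannot pass as identical.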
Addresses:
- davide221 #214#issuecomment-4472910706 (merge conflicts) — rebased
- howard0su #214#review (changes requested points 2, 3, 4) — first slice