feat(dFlash): MTP foundation#237
Conversation
There was a problem hiding this comment.
2 issues found across 34 files
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
|
Addressed cubic review in commit 74f708a:
Build clean. All 19 unit tests pass. T7 (partial_accept_rollback) exercises the NativeHeads rollback path only — the ExternalDrafter hidden-sync invariant is not covered by the existing suite (noted for follow-up). E2E HumanEval-10 AR-vs-MTP same-text rate: 8/10 post-fix (unchanged from 8/10 pre-fix). F2 is confined to ExternalDrafter and does not affect the Qwen3.6/NativeHeads path tested here. he_01/he_02 divergences remain — consistent with prior AR-vs-AR control showing GPU non-determinism on those two prompts. |
|
Two more fixes from real-workload hermes testing (commit A. Silent server death on long-prompt MTP: B. Verified: 25K-token prompt → server stays alive, MTP ran (iters=16 proposed=32 accepted=16 accept_rate=0.50), real response returned. Conflict invocation prints the warning and starts cleanly with |
There was a problem hiding this comment.
6 issues found across 34 files
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
|
Addressed cubic re-review in commit f6e8e94: P1:
P2:
P3:
Build clean, 20/20 unit tests pass. HumanEval-10 re-bench: 8/10 byte-identical. |
|
|
||
| // MTP (Multi-Token Prediction) speculator — mutually exclusive with --draft. | ||
| // When mtp_gguf_path is set, the backend ignores draft_path. | ||
| const char * mtp_gguf_path = nullptr; |
There was a problem hiding this comment.
shall we use https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF which contains MTP in the main model?
| // When mtp_gguf_path is set, the backend ignores draft_path. | ||
| const char * mtp_gguf_path = nullptr; | ||
| int mtp_gamma = 0; // 0 = MTP loaded but not active; >0 = chain depth | ||
| const char * mtp_draft_source = nullptr; // "chain" (default) | "mtp_topk" |
| @@ -0,0 +1,219 @@ | |||
| // common/gguf_mmap.h — RAII wrapper for platform-conditional mmap of GGUF files. | |||
There was a problem hiding this comment.
this worth a separate PR
| // supports_mtp() returns true. Default returns nullptr. | ||
| // Forward-declared to avoid a header dependency from model_backend.h | ||
| // on mtp_interface.h — backends include both as needed. | ||
| virtual mtp::IMtpModule * mtp() { return nullptr; } |
There was a problem hiding this comment.
why not return nullptr indicate this model doesn't support mtp?
| // matching mixin (IExternalDrafterMtp / INativeMtp). | ||
| enum class MtpFlavor { | ||
| ExternalDrafter, // Gemma4-style: separate drafter, h_prev chain | ||
| NativeHeads, // Qwen3.6-style: MTP heads in the backbone |
There was a problem hiding this comment.
this should hide into ModelBackend.
| const int n = std::min(prefill_ubatch, prompt_len - start); | ||
| std::vector<int32_t> chunk(req.prompt.begin() + start, | ||
| req.prompt.begin() + start + n); | ||
| if (!target->verify_batch(chunk, start, last_tok, nullptr)) { |
There was a problem hiding this comment.
mtp should only change how we generate the batch (v.s. draft), can we normalize the code with the existing logic?
| @@ -0,0 +1,133 @@ | |||
| // qwen35_mtp_graph.h — CUDA cgraph for Qwen3.6 MTP head step forward. | |||
There was a problem hiding this comment.
we should use the existing graph API not a new API unless we want to put MTP header into another ggml backend.
| @@ -0,0 +1,225 @@ | |||
| // qwen35_mtp_loader.cpp — Discovery loader for Qwen3.6 -MTP-GGUF files. | |||
There was a problem hiding this comment.
check the previous comment about GGUF loading.
|
I’m working on a shared hardware/backend/runtime placement foundation for the cpp native server path. #236 Could we align the MTP backend/device selection with that shared placement layer, instead of keeping a separate MTP-specific selection path? This would avoid duplicated CUDA/HIP/device handling across MTP,DFlash, PFlash, and future speculative paths. |
Mechanical rename per weicj's review on PR Luce-Org#237 — the legacy namespace name baked the first backend (Qwen3-27B) into shared code. Renaming to dflash::common removes the backend leak from the substrate so future backends plug into a neutral namespace. Scope: - namespace dflash27b → namespace dflash::common - dflash27b::* → dflash::common::* - CMake static lib dflash27b → dflash_common - CMake project(dflash27b) → project(dflash) - Private CMake vars _dflash27b_* → _dflash_* - Stale comment references Out of scope (deferred to a follow-up): - Public C header dflash/include/dflash27b.h - C symbol dflash27b_last_error() - Preprocessor macros DFLASH27B_* - Env vars DFLASH27B_* Build note: CUDA 12.6 + GCC 13.3 has a known _Float128 conflict during CUDA host-compiler ID detection. Workaround: -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-11 No behavior change. Clean build green. Symbol mangling confirmed as N6dflash6common* via nm; zero residual dflash27b symbols outside the deferred public C ABI.
Mechanical rename per @weicj's review on Luce-Org#237. Renames the legacy backend-baked namespace to a neutral one so future backends plug into shared code without name leakage. Scope: - namespace dflash27b → namespace dflash::common (sources + tests) - dflash27b::* → dflash::common::* - CMake static lib dflash27b → dflash_common - CMake project(dflash27b) → project(dflash) - Private CMake vars _dflash27b_* → _dflash_* - Stale comment references Out of scope (deferred to a follow-up): - Public C header dflash/include/dflash27b.h - C symbol dflash27b_last_error() - Preprocessor macros DFLASH27B_* - Env vars DFLASH27B_* Build note: CUDA 12.6 + GCC 13.3 has a known _Float128 conflict during CUDA host-compiler ID detection. Workaround: -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-11 No behavior change. Symbol mangling confirmed via nm as N6dflash6common*; no residual dflash27b mangled symbols outside the deferred public C ABI.
There was a problem hiding this comment.
1 issue found across 8 files (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
Per @howard0su's review on Luce-Org#237: gguf_mmap.h/.cpp are platform abstraction (POSIX mmap / Windows MapViewOfFile) that will be reused by multiple loaders. Extracting into a standalone PR ahead of the loader refactor (target/draft/MTP heads all benefit). Includes the cubic P1 fixes from f6e8e94: - open() is now idempotent (releases prior mapping before re-opening, leaves object in default state on any failure path) - Windows release() now calls CloseHandle() before clearing handle_ New: test_gguf_mmap unit test covers open/close, idempotency, missing-file path, RAII destructor. Stacked on Luce-Org#241 (uses dflash::common namespace from the start).
Per cubic P1 on PR Luce-Org#237: --mtp-source=none was silently overridden by the legacy-flag inference because MtpSource::None served both as the default and as the explicit 'none' value. Adds an internal MtpSource::Unset sentinel as the new default. Legacy-flag inference only fires when the field is still Unset. After inference (or if no legacy flag matched), Unset is resolved to None before any backend code sees it. User-visible CLI surface unchanged: --mtp-source still accepts exactly {none, native, external, auto}. Unset is internal-only and never escapes arg parsing. Defensive assert in create_backend() enforces this.
Ports the Qwen3.6 MTP head onto the qwen35 backbone (same arch, NextN
block at layer n_layer-1). Speculation runs through a new common chain
runner; the existing DFlashTarget adapter handles verify/snapshot/restore.
- common/mtp_interface.h: flavor-tagged IMtpModule + INativeMtp /
IExternalDrafterMtp mixins. Future Gemma4 drafter plugs in via
IExternalDrafterMtp without touching the chain runner.
- common/mtp_chain_runner.{h,cpp}: γ-chain propose/verify/accept loop,
hoisted out of the backend. Three KV-reconciliation paths
(accept-all / fast rollback / recommit) share a single post-iter
invariant so AR equivalence holds under recommit.
- common/mtp_orchestrator.{h,cpp}: chunked prefill + warm + dispatch
to chain runner. Owns only control flow; all compute lives in
DFlashTarget::verify_batch and INativeMtp::step_batch graphs on the
backend device.
- qwen36/qwen36_mtp.{h,cpp,_graph.cpp,_loader.cpp}: GGUF tensor
inventory for Qwen3.6 -MTP-GGUF, GPU warm graph, GPU step graph
cached on (head_idx, fa_window, fused_lm_head, topk_k). γ is bound
at attach time as the single source of truth.
- qwen35: supports_mtp()/mtp() exposed through ModelBackend;
generate() delegates to common::mtp::warm_and_decode when MTP is
configured. Cache sized for max(γ+1, ddtree_budget+1) verify tokens.
- server.py: --mtp-gguf and --mtp-gamma flags routed through; daemon
command surface unchanged.
Tests: 4/4 test_common_mtp_orchestrator. Full build green; harness probe
7/7 (claude_code, codex, opencode, openwebui, pi, hermes, openclaw) at
--max-ctx 65536; MTP decode reports accept_rate 0.43-0.88 on short
agentic prompts.
… paths Adds unit tests exercising MtpChainRunner state machine (gamma propagation, EOS termination, partial-accept rollback, n_gen termination), MtpOrchestrator lifecycle (reset_chain ordering, set_initial_hidden plumbing, gamma derivation), and Qwen36MtpModule input-validation guards (attach(nullptr), out-of-range gamma, pre-attach calls, shutdown idempotency). All new tests are mock-based and require no GGUF.
Both qwen35_backend and the test_dflash harness called the 4-arg overload of Qwen36MtpModule::init() with backend.tensor_context() — the BACKBONE ggml_ctx. That ctx contains only blk.0..63; MTP head tensors live in blk.64.* in the separate MTP GGUF, so the loader could never resolve them and init failed silently before chain decode. Switched both call sites to the 3-arg overload init(gguf_path, target, err) which calls load_gguf_tensor_context(mtp_gguf_path) internally to build the correct context for tensor lookup. Caught by E2E sweep on real Qwen3.6-27B Q4_K_M + MTP-Q4_K_M GGUFs.
… is qwen35)
The GGUF general.architecture field on both 'Qwen3.5-0.8B-Q4_K_M.gguf' and
'Qwen3.6-27B-MTP-Q4_K_M.gguf' is 'qwen35'. The 'Qwen3.6' label lives only in
general.name (model display name). The MTP module is a NextN extension of the
qwen35 architecture, not a separate arch.
Aligns naming with Qwen35Backend (the backbone) and the loader's arch check
in gguf_target_loader.cpp:271. Pure rename — no functional change.
Files renamed (git mv, blame preserved):
src/qwen36/qwen36_mtp.{h,cpp} → src/qwen35/qwen35_mtp.{h,cpp}
src/qwen36/qwen36_mtp_graph.{h,cpp} → src/qwen35/qwen35_mtp_graph.{h,cpp}
src/qwen36/qwen36_mtp_loader.cpp → src/qwen35/qwen35_mtp_loader.cpp
dflash/src/qwen36/ directory removed
E2E re-verified on Qwen3.6-27B Q4_K_M + MTP-Q4_K_M (RTX 3090):
AR (γ=0): 35.6 tok/s
Chain γ=2: 57.8 tok/s = 1.62× over AR (accept_rate=0.83)
Chain γ=4: 50.9 tok/s = 1.43× over AR (accept_rate=0.58)
Greedy token output byte-identical across all three modes.
Mechanical fixes from style review:
- qwen35_mtp_loader.cpp: rewrite header from stale 'PR 2 skeleton'
narrative to current behavior
- mtp_chain_runner.cpp: remove dead if-block at the post-propose
size check (comment-only body)
- Qwen35DaemonArgs::n renamed to Qwen35DaemonArgs::mtp_draft_source
to match Qwen35Config::mtp_draft_source
- mtp_orchestrator namespace: drop unused 'common' layer
(dflash27b::common::mtp -> dflash27b::mtp)
- mtp_orchestrator: dynamic_cast -> static_cast (flavor() already
discriminates the concrete type)
- qwen35_mtp.cpp: drop redundant 'static' on TU-local helpers inside
anonymous namespace
- qwen35_mtp.{h,cpp}: convert Phase-A/B roadmap comments to TODOs or
remove (completed phases)
- qwen35_backend.h: label ensure_decode_cache/tensor_context as
test-only (they remain public for harness access)
- dflash_target.h: forward-declare DDTree; move ddtree.h include to
.cpp files that use it concretely
- mtp_chain_runner.h: clean stale PR-internal cross-reference
- qwen35_backend.h: terminology consistency (MTP module)
- mtp_orchestrator.cpp: comment that enable_hidden_seq_capture is
no-op for non-MTP targets
- test_common_mtp_orchestrator.cpp: remove PR-review narrative
- qwen35_backend.cpp: add fflush after [mtp] loaded printf
- No functional change. E2E sweep re-verified greedy byte-identical
output across AR / chain γ=2 / chain γ=4.
Adds --mtp-gguf, --mtp-gamma, --mtp-draft-source, --mtp-draft-topk to the native C++ HTTP server (dflash/src/server/dflash_server). Threads them through BackendArgs into Qwen35Config so the existing Qwen35Backend MTP path (already wired) is reachable through the production HTTP endpoint, not just the test_dflash harness. Mirror of the existing flags in dflash/scripts/server.py. Verified via HTTP /v1/chat/completions: AR baseline and chain γ=2 produce greedy-identical text under temperature=0 on Qwen3.6-27B Q4_K_M + MTP-Q4_K_M. Also confirmed with unsloth single-file combined GGUF (Qwen3.6-27B-Q4_K_M-mtp.gguf) passed as --mtp-gguf — identical output, greedy equivalent.
F2 (P1) mtp_chain_runner.cpp: on partial accept (accept_n < g_actual) running_hidden was set from the drafter-internal next_hidden produced after the full rejected suffix. For ExternalDrafter flavor, call set_capture_row(accept_n) + consume_captured_hidden() to thread the target-captured hidden at the committed boundary instead. NativeHeads (Qwen3.6) is unaffected — next_hidden is always empty for that flavor and the new branch is not entered. F1 (P2) mtp_orchestrator.cpp: warm_and_decode chain-runner failure path (line 151) returned without emitting io.emit(-1), leaving the stream without a termination sentinel unlike every other error return in the function. Add the missing io.emit(-1) before return. Cubic review: "On runner failure, stream termination sentinel io.emit(-1) is not emitted, unlike all other error paths in this function" (P2). "External-drafter prev_hidden desynchronizes on partial accept: running_hidden is updated unconditionally from the full draft chain even when suffix drafts are rejected, so the next iteration threads a hidden state from a rejected token path instead of the committed boundary" (P1). Build: clean. All 19 unit tests pass. E2E HumanEval-10 AR-vs-MTP same-text rate: 8/10 post-fix (unchanged from pre-fix; F2 not active on Qwen3.6/NativeHeads path — he_01/he_02 divergences confirmed GPU noise).
T7 used a NativeHeads stub so it never exercised the IExternalDrafterMtp partial-accept path that commit 74f708a fixed. Add StubExternalDrafter + T20 asserting set_capture_row(accept_n) is called before consume_captured_hidden() when accept_n < gamma_actual.
…--draft/--mtp-gguf conflict Real-workload testing with agentic CLI clients (hermes) exposed two regressions not caught by HumanEval bench or unit tests: A. DFLASH27B_MTP_CTX default was 8192; any prompt longer than that would print '[qwen35_mtp] warm_head_kv: n_prompt=N exceeds head_kv capacity n_ctx=8192' as a warning, then SIGKILL the server during the next prefill operation (CUDA assertion / memory corruption from writing past the buffer). VRAM spiked from 5.8GB to 20GB then process died. Silent — no error logged. Fix: default MTP_CTX to backbone max_ctx + warm_head_kv returns false on overflow so the chain runner falls back to AR cleanly. B. dflash_server accepted --draft and --mtp-gguf together; the harness 'start_dflash_native_server' passes both by default which produced inconsistent state. Fix: at startup, log a WARNING and clear bargs.draft_path when both are set (MTP wins). Verified: server now stays alive when handed a 15K-token prompt under default MTP_CTX, falls back to AR for that request, returns response. --draft+--mtp-gguf invocation logs the conflict warning and loads only MTP.
P1 - gguf_mmap.h open() idempotency: call release() first if already open; on failure paths, reset state to default before returning. P1 - gguf_mmap.h Windows release() HANDLE leak: CloseHandle() before clearing handle_. (Fix by inspection — Linux build does not exercise.) P1 - qwen35_mtp.cpp GPU warmup off-by-one: n_prompt > n_ctx -> >= n_ctx matching CPU warmup. With slot_start=1, exactly-n_ctx-length prompts would have written slot n_ctx (out of bounds). MTP now declines that edge case cleanly and the chain runner falls through to AR. P2 - test_dflash.cpp: missing hidden_seq capture now aborts cleanly (all_prefill_hidden.clear() + break) instead of running warm_head_kv on undefined buffer. P2 - T7 (chain-runner rollback) now asserts restore_kv_at_chain was called > 0 times on partial accept. P3 - T6 (EOS termination) tightened: total_emitted == 1, not <= 10. Build clean, 20/20 unit tests pass. HumanEval-10 re-bench: 8/10 byte-identical (within AR-vs-AR noise baseline). Throughput in this run reflects a pre-existing unstaged flashprefill.cpp change (BSA opt-in -> opt-out) unrelated to these four files.
Per @howard0su's review on Luce-Org#237 (lines 57, 59): - 57: 'shall we use the unsloth single-file MTP-in-target GGUF?' - 59: 'why not a enum?' Replaces: const char * mtp_gguf_path = nullptr; const char * mtp_draft_source = nullptr; // "chain" | "mtp_topk" with: enum class MtpSource { None, Native, ExternalDrafter, Auto }; MtpSource mtp_source = MtpSource::None; const char * mtp_gguf_path = nullptr; // only for ExternalDrafter bool mtp_use_topk = false; // false=chain, true=mtp_topk Adds gguf_contains_mtp_tensors() probe (keyed on qwen35.nextn_predict_layers metadata) so --mtp-gguf becomes optional when the primary GGUF embeds MTP tensors (unsloth single-file case). Stacked on Luce-Org#237. dflash_server arg parsing updated to: - --mtp-source [none|native|external|auto] (new explicit flag) - --mtp-gguf PATH (now optional; only needed for ExternalDrafter) - Old --mtp-draft-source string flag warns + ignored (migration aid) - --mtp-gamma alone triggers Auto detection All test_common_mtp_orchestrator tests still pass (mock-based, unaffected by the config-surface change).
Per cubic P1 on PR Luce-Org#237: --mtp-source=none was silently overridden by the legacy-flag inference because MtpSource::None served both as the default and as the explicit 'none' value. Adds an internal MtpSource::Unset sentinel as the new default. Legacy-flag inference only fires when the field is still Unset. After inference (or if no legacy flag matched), Unset is resolved to None before any backend code sees it. User-visible CLI surface unchanged: --mtp-source still accepts exactly {none, native, external, auto}. Unset is internal-only and never escapes arg parsing. Defensive assert in create_backend() enforces this.
1051e71 to
8ec9a95
Compare
- server_main.cpp: gate --draft suppression and DFLASH27B_MTP_CTX on concrete MtpSource (Native|ExternalDrafter) instead of !None. Auto defers to backend GGUF probe; preserves --draft as fallback when Auto resolves to None. - mtp_orchestrator.cpp: wrap io with req.on_token via DaemonIO::with_token_callback (mirrors laguna_backend.cpp:151 and gemma4_backend.cpp:172). MTP requests now get streaming disconnect cancellation and per-token callbacks. 20/20 MTP test suite still passes.
Summary
Adds Qwen3.5/3.6 NextN MTP support:
IMtpModuleinterface + chain runner incommon/,Qwen35MtpModuleconcrete impl, qwen35 backend wiring, and--mtp-ggufflags indflash_server.Works with both GGUF layouts:
--mtp-ggufQwen3.6-27B-Q4_K_M-mtp.gguf): pass the same file as both the primary and--mtp-gguf; the loader reads backbone tensors and MTP head tensors from it. Verified greedy-equivalent + same 1.82× speedup as the separate-files setup.Validation
Single-prompt HTTP (
/v1/chat/completions, temp=0)HumanEval-10 via
harness/benchmarks/generation_benchmark.pyon Qwen3.6-27B Q4_K_M + MTP-Q4_K_M (RTX 3090):All 10 prompts pass functional checks. Greedy text byte-identical on 8/10; remaining 2 within the AR-vs-AR non-determinism baseline (verified via control run: same prompts produced 9/10 identical text across two AR-only runs of the same server).
Confirmed equivalent between separate-files and unsloth single-file layouts: 1.82× → 1.82× (statistically identical), 8/10 byte-identical text in both, same 95–100% accept rate range.
Agentic-CLI smoke (OpenAI Responses API)
pi(v0.55.3) viaharness/clients/run_pi.sh: server starts with MTP, SSE streaming delivers tokens,[mtp_decode] iters=2 proposed=4 accepted=4 accept_rate=1.00, marker received exact-match. Proves the full HTTP → Responses API → SSE → MTP chain → tokenizer → client round-trip.Long-prompt resilience
25K-token prompt under
--mtp-gguf+--max-ctx 65536: server stays alive, MTP fires (iters=16, 50% accept), real response returned. Earlier silent SIGKILL on long prompts (DFLASH27B_MTP_CTX default 8192 not tracking max_ctx) fixed in `ada7bb0`;warm_head_kvoverflow now triggers graceful AR fallback.Unit tests
test_common_mtp_orchestrator: 20/20 pass — chain runner state machine (γ propagation, EOS termination, partial-accept rollback, n_gen termination), orchestrator lifecycle, Qwen35MtpModule input-validation guards, T20 ExternalDrafter partial-accept hidden threading (asserts theset_capture_row+consume_captured_hiddeninvariant the `74f708a` fix added). Mock-based, no GGUF required.Scope
IMtpModule+INativeMtp+IExternalDrafterMtpmixinsQwen35MtpModule(INativeMtpimpl) + GGUF loader + step graphMtpChainRunner+MtpOrchestratordflash_server:--mtp-gguf,--mtp-gamma,--mtp-draft-source,--mtp-draft-topkNot in this PR (follow-ups)
IExternalDrafterMtpinterface declared here)IDrafterabstraction unifying DFlash + MTP under one runner