feat(dFlash): MTP foundation by dusterbloom · Pull Request #237 · Luce-Org/lucebox-hub

dusterbloom · 2026-05-20T18:53:07Z

Summary

Adds Qwen3.5/3.6 NextN MTP support: IMtpModule interface + chain runner in common/, Qwen35MtpModule concrete impl, qwen35 backend wiring, and --mtp-gguf flags in dflash_server.

Works with both GGUF layouts:

Separate files: backbone GGUF as primary + MTP-head GGUF via --mtp-gguf
Unsloth single-file combined GGUF (Qwen3.6-27B-Q4_K_M-mtp.gguf): pass the same file as both the primary and --mtp-gguf; the loader reads backbone tensors and MTP head tensors from it. Verified greedy-equivalent + same 1.82× speedup as the separate-files setup.

Validation

Single-prompt HTTP (`/v1/chat/completions`, temp=0)

HumanEval-10 via harness/benchmarks/generation_benchmark.py on Qwen3.6-27B Q4_K_M + MTP-Q4_K_M (RTX 3090):

Mode	mean tok/s	speedup	accept rate
AR	32.46	1.00x	—
chain γ=2	59.24	1.82x	0.93–1.00

All 10 prompts pass functional checks. Greedy text byte-identical on 8/10; remaining 2 within the AR-vs-AR non-determinism baseline (verified via control run: same prompts produced 9/10 identical text across two AR-only runs of the same server).

Confirmed equivalent between separate-files and unsloth single-file layouts: 1.82× → 1.82× (statistically identical), 8/10 byte-identical text in both, same 95–100% accept rate range.

Agentic-CLI smoke (OpenAI Responses API)

pi (v0.55.3) via harness/clients/run_pi.sh: server starts with MTP, SSE streaming delivers tokens, [mtp_decode] iters=2 proposed=4 accepted=4 accept_rate=1.00, marker received exact-match. Proves the full HTTP → Responses API → SSE → MTP chain → tokenizer → client round-trip.

Long-prompt resilience

25K-token prompt under --mtp-gguf + --max-ctx 65536: server stays alive, MTP fires (iters=16, 50% accept), real response returned. Earlier silent SIGKILL on long prompts (DFLASH27B_MTP_CTX default 8192 not tracking max_ctx) fixed in `ada7bb0`; warm_head_kv overflow now triggers graceful AR fallback.

Unit tests

test_common_mtp_orchestrator: 20/20 pass — chain runner state machine (γ propagation, EOS termination, partial-accept rollback, n_gen termination), orchestrator lifecycle, Qwen35MtpModule input-validation guards, T20 ExternalDrafter partial-accept hidden threading (asserts the set_capture_row + consume_captured_hidden invariant the `74f708a` fix added). Mock-based, no GGUF required.

Scope

IMtpModule + INativeMtp + IExternalDrafterMtp mixins
Qwen35MtpModule (INativeMtp impl) + GGUF loader + step graph
MtpChainRunner + MtpOrchestrator
dflash_server: --mtp-gguf, --mtp-gamma, --mtp-draft-source, --mtp-draft-topk

Not in this PR (follow-ups)

Gemma4 ExternalDrafter MTP module (uses the IExternalDrafterMtp interface declared here)
Per-request DFlash/MTP/auto dispatcher (commit `3b65ac1` on `feat/mtp-prefix-warm-ghost`)
IDrafter abstraction unifying DFlash + MTP under one runner
Tree-mode runner + entropy-adaptive selector
Prefix-cache WARM hit for MTP

cubic-dev-ai

2 issues found across 34 files

_{Reply with feedback, questions, or to request a fix.

Re-trigger cubic}

dusterbloom · 2026-05-20T19:15:39Z

Addressed cubic review in commit 74f708a:

F1 (P2): add missing io.emit(-1) on chain-runner failure path in warm_and_decode (mtp_orchestrator.cpp:152)
F2 (P1): call set_capture_row(accept_n) + consume_captured_hidden() before threading running_hidden on partial accept for ExternalDrafter flavor; NativeHeads path is unaffected (next_hidden is always empty for that flavor, new branch not entered)

Build clean. All 19 unit tests pass. T7 (partial_accept_rollback) exercises the NativeHeads rollback path only — the ExternalDrafter hidden-sync invariant is not covered by the existing suite (noted for follow-up).

E2E HumanEval-10 AR-vs-MTP same-text rate: 8/10 post-fix (unchanged from 8/10 pre-fix). F2 is confined to ExternalDrafter and does not affect the Qwen3.6/NativeHeads path tested here. he_01/he_02 divergences remain — consistent with prior AR-vs-AR control showing GPU non-determinism on those two prompts.

dusterbloom · 2026-05-20T19:26:51Z

Coverage gap closed in bf2d3e9: T20 now exercises the ExternalDrafter partial-accept path that the 74f708a fix addressed (T7 used a NativeHeads stub, so it never reached the fixed branch). 20/20 unit tests pass.

dusterbloom · 2026-05-20T19:42:54Z

Two more fixes from real-workload hermes testing (commit ada7bb0):

A. Silent server death on long-prompt MTP: DFLASH27B_MTP_CTX defaulted to 8192 but the server never propagated --max-ctx to that env var. With --max-ctx 65536, the backbone KV was allocated at 65536 slots while the MTP head_kv stayed at 8192 — VRAM mismatch caused OOM on the next prefill. (The warm_head_kv overflow guard already returned false cleanly — the bug was upstream at init time.) Fix: 5-line propagation in server_main.cpp using setenv(..., overwrite=0) so user env still wins.

B. --draft + --mtp-gguf conflict: server now logs a WARNING and clears draft_path when both are set (MTP wins).

Verified: 25K-token prompt → server stays alive, MTP ran (iters=16 proposed=32 accepted=16 accept_rate=0.50), real response returned. Conflict invocation prints the warning and starts cleanly with draft = (none) in the config banner.

cubic-dev-ai

6 issues found across 34 files

_{Reply with feedback, questions, or to request a fix.

Re-trigger cubic}

dusterbloom · 2026-05-20T22:08:28Z

Addressed cubic re-review in commit f6e8e94:

P1:

gguf_mmap.h open() now idempotent (releases prior mapping first via placement-destroy+reconstruct; leaves object in default state on any failure path)
gguf_mmap.h Windows release() now calls CloseHandle() before clearing handle_ (fix by inspection — Linux build cannot exercise the Windows path)
qwen35_mtp.cpp GPU warmup off-by-one (> -> >=): a prompt of exactly n_ctx tokens would have written slot n_ctx (out-of-bounds; valid range is [0, n_ctx-1] with slot_start=1). Now declines MTP cleanly and chain runner falls through to AR.

P2:

test_dflash.cpp missing hidden_seq capture now clears all_prefill_hidden and breaks out of the prefill loop instead of running warm_head_kv on an undefined buffer
T7 now asserts restore_kv_at_chain_calls >= 1 — rollback path is confirmed to fire on partial accept

P3:

T6 EOS termination assertion tightened from total_emitted <= 10 to total_emitted == 1 (EOS is the first emitted token; runner must stop immediately)

Build clean, 20/20 unit tests pass. HumanEval-10 re-bench: 8/10 byte-identical.

howard0su · 2026-05-20T23:06:07Z

+
+    // MTP (Multi-Token Prediction) speculator — mutually exclusive with --draft.
+    // When mtp_gguf_path is set, the backend ignores draft_path.
+    const char * mtp_gguf_path    = nullptr;


shall we use https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF which contains MTP in the main model?

howard0su · 2026-05-20T23:06:16Z

+    // When mtp_gguf_path is set, the backend ignores draft_path.
+    const char * mtp_gguf_path    = nullptr;
+    int          mtp_gamma        = 0;        // 0 = MTP loaded but not active; >0 = chain depth
+    const char * mtp_draft_source = nullptr;  // "chain" (default) | "mtp_topk"


why not a enum?

howard0su · 2026-05-20T23:08:00Z

@@ -0,0 +1,219 @@
+// common/gguf_mmap.h — RAII wrapper for platform-conditional mmap of GGUF files.


this worth a separate PR

howard0su · 2026-05-20T23:08:29Z

+    // supports_mtp() returns true. Default returns nullptr.
+    // Forward-declared to avoid a header dependency from model_backend.h
+    // on mtp_interface.h — backends include both as needed.
+    virtual mtp::IMtpModule * mtp() { return nullptr; }


why not return nullptr indicate this model doesn't support mtp?

howard0su · 2026-05-20T23:09:17Z

+// matching mixin (IExternalDrafterMtp / INativeMtp).
+enum class MtpFlavor {
+    ExternalDrafter,   // Gemma4-style: separate drafter, h_prev chain
+    NativeHeads,       // Qwen3.6-style: MTP heads in the backbone


this should hide into ModelBackend.

howard0su · 2026-05-20T23:10:55Z

+        const int n = std::min(prefill_ubatch, prompt_len - start);
+        std::vector<int32_t> chunk(req.prompt.begin() + start,
+                                   req.prompt.begin() + start + n);
+        if (!target->verify_batch(chunk, start, last_tok, nullptr)) {


mtp should only change how we generate the batch (v.s. draft), can we normalize the code with the existing logic?

howard0su · 2026-05-20T23:11:49Z

@@ -0,0 +1,133 @@
+// qwen35_mtp_graph.h — CUDA cgraph for Qwen3.6 MTP head step forward.


we should use the existing graph API not a new API unless we want to put MTP header into another ggml backend.

howard0su · 2026-05-20T23:12:03Z

@@ -0,0 +1,225 @@
+// qwen35_mtp_loader.cpp — Discovery loader for Qwen3.6 -MTP-GGUF files.


check the previous comment about GGUF loading.

weicj · 2026-05-21T05:57:28Z

I’m working on a shared hardware/backend/runtime placement foundation for the cpp native server path. #236

Could we align the MTP backend/device selection with that shared placement layer, instead of keeping a separate MTP-specific selection path?

This would avoid duplicated CUDA/HIP/device handling across MTP,DFlash, PFlash, and future speculative paths.

Mechanical rename per weicj's review on PR Luce-Org#237 — the legacy namespace name baked the first backend (Qwen3-27B) into shared code. Renaming to dflash::common removes the backend leak from the substrate so future backends plug into a neutral namespace. Scope: - namespace dflash27b → namespace dflash::common - dflash27b::* → dflash::common::* - CMake static lib dflash27b → dflash_common - CMake project(dflash27b) → project(dflash) - Private CMake vars _dflash27b_* → _dflash_* - Stale comment references Out of scope (deferred to a follow-up): - Public C header dflash/include/dflash27b.h - C symbol dflash27b_last_error() - Preprocessor macros DFLASH27B_* - Env vars DFLASH27B_* Build note: CUDA 12.6 + GCC 13.3 has a known _Float128 conflict during CUDA host-compiler ID detection. Workaround: -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-11 No behavior change. Clean build green. Symbol mangling confirmed as N6dflash6common* via nm; zero residual dflash27b symbols outside the deferred public C ABI.

@weicj

Mechanical rename per @weicj's review on Luce-Org#237. Renames the legacy backend-baked namespace to a neutral one so future backends plug into shared code without name leakage. Scope: - namespace dflash27b → namespace dflash::common (sources + tests) - dflash27b::* → dflash::common::* - CMake static lib dflash27b → dflash_common - CMake project(dflash27b) → project(dflash) - Private CMake vars _dflash27b_* → _dflash_* - Stale comment references Out of scope (deferred to a follow-up): - Public C header dflash/include/dflash27b.h - C symbol dflash27b_last_error() - Preprocessor macros DFLASH27B_* - Env vars DFLASH27B_* Build note: CUDA 12.6 + GCC 13.3 has a known _Float128 conflict during CUDA host-compiler ID detection. Workaround: -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-11 No behavior change. Symbol mangling confirmed via nm as N6dflash6common*; no residual dflash27b mangled symbols outside the deferred public C ABI.

cubic-dev-ai

1 issue found across 8 files (changes from recent commits).

_{Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic}

@howard0su

Per @howard0su's review on Luce-Org#237: gguf_mmap.h/.cpp are platform abstraction (POSIX mmap / Windows MapViewOfFile) that will be reused by multiple loaders. Extracting into a standalone PR ahead of the loader refactor (target/draft/MTP heads all benefit). Includes the cubic P1 fixes from f6e8e94: - open() is now idempotent (releases prior mapping before re-opening, leaves object in default state on any failure path) - Windows release() now calls CloseHandle() before clearing handle_ New: test_gguf_mmap unit test covers open/close, idempotency, missing-file path, RAII destructor. Stacked on Luce-Org#241 (uses dflash::common namespace from the start).

Per cubic P1 on PR Luce-Org#237: --mtp-source=none was silently overridden by the legacy-flag inference because MtpSource::None served both as the default and as the explicit 'none' value. Adds an internal MtpSource::Unset sentinel as the new default. Legacy-flag inference only fires when the field is still Unset. After inference (or if no legacy flag matched), Unset is resolved to None before any backend code sees it. User-visible CLI surface unchanged: --mtp-source still accepts exactly {none, native, external, auto}. Unset is internal-only and never escapes arg parsing. Defensive assert in create_backend() enforces this.

Ports the Qwen3.6 MTP head onto the qwen35 backbone (same arch, NextN block at layer n_layer-1). Speculation runs through a new common chain runner; the existing DFlashTarget adapter handles verify/snapshot/restore. - common/mtp_interface.h: flavor-tagged IMtpModule + INativeMtp / IExternalDrafterMtp mixins. Future Gemma4 drafter plugs in via IExternalDrafterMtp without touching the chain runner. - common/mtp_chain_runner.{h,cpp}: γ-chain propose/verify/accept loop, hoisted out of the backend. Three KV-reconciliation paths (accept-all / fast rollback / recommit) share a single post-iter invariant so AR equivalence holds under recommit. - common/mtp_orchestrator.{h,cpp}: chunked prefill + warm + dispatch to chain runner. Owns only control flow; all compute lives in DFlashTarget::verify_batch and INativeMtp::step_batch graphs on the backend device. - qwen36/qwen36_mtp.{h,cpp,_graph.cpp,_loader.cpp}: GGUF tensor inventory for Qwen3.6 -MTP-GGUF, GPU warm graph, GPU step graph cached on (head_idx, fa_window, fused_lm_head, topk_k). γ is bound at attach time as the single source of truth. - qwen35: supports_mtp()/mtp() exposed through ModelBackend; generate() delegates to common::mtp::warm_and_decode when MTP is configured. Cache sized for max(γ+1, ddtree_budget+1) verify tokens. - server.py: --mtp-gguf and --mtp-gamma flags routed through; daemon command surface unchanged. Tests: 4/4 test_common_mtp_orchestrator. Full build green; harness probe 7/7 (claude_code, codex, opencode, openwebui, pi, hermes, openclaw) at --max-ctx 65536; MTP decode reports accept_rate 0.43-0.88 on short agentic prompts.

… paths Adds unit tests exercising MtpChainRunner state machine (gamma propagation, EOS termination, partial-accept rollback, n_gen termination), MtpOrchestrator lifecycle (reset_chain ordering, set_initial_hidden plumbing, gamma derivation), and Qwen36MtpModule input-validation guards (attach(nullptr), out-of-range gamma, pre-attach calls, shutdown idempotency). All new tests are mock-based and require no GGUF.

Both qwen35_backend and the test_dflash harness called the 4-arg overload of Qwen36MtpModule::init() with backend.tensor_context() — the BACKBONE ggml_ctx. That ctx contains only blk.0..63; MTP head tensors live in blk.64.* in the separate MTP GGUF, so the loader could never resolve them and init failed silently before chain decode. Switched both call sites to the 3-arg overload init(gguf_path, target, err) which calls load_gguf_tensor_context(mtp_gguf_path) internally to build the correct context for tensor lookup. Caught by E2E sweep on real Qwen3.6-27B Q4_K_M + MTP-Q4_K_M GGUFs.

… is qwen35) The GGUF general.architecture field on both 'Qwen3.5-0.8B-Q4_K_M.gguf' and 'Qwen3.6-27B-MTP-Q4_K_M.gguf' is 'qwen35'. The 'Qwen3.6' label lives only in general.name (model display name). The MTP module is a NextN extension of the qwen35 architecture, not a separate arch. Aligns naming with Qwen35Backend (the backbone) and the loader's arch check in gguf_target_loader.cpp:271. Pure rename — no functional change. Files renamed (git mv, blame preserved): src/qwen36/qwen36_mtp.{h,cpp} → src/qwen35/qwen35_mtp.{h,cpp} src/qwen36/qwen36_mtp_graph.{h,cpp} → src/qwen35/qwen35_mtp_graph.{h,cpp} src/qwen36/qwen36_mtp_loader.cpp → src/qwen35/qwen35_mtp_loader.cpp dflash/src/qwen36/ directory removed E2E re-verified on Qwen3.6-27B Q4_K_M + MTP-Q4_K_M (RTX 3090): AR (γ=0): 35.6 tok/s Chain γ=2: 57.8 tok/s = 1.62× over AR (accept_rate=0.83) Chain γ=4: 50.9 tok/s = 1.43× over AR (accept_rate=0.58) Greedy token output byte-identical across all three modes.

Mechanical fixes from style review: - qwen35_mtp_loader.cpp: rewrite header from stale 'PR 2 skeleton' narrative to current behavior - mtp_chain_runner.cpp: remove dead if-block at the post-propose size check (comment-only body) - Qwen35DaemonArgs::n renamed to Qwen35DaemonArgs::mtp_draft_source to match Qwen35Config::mtp_draft_source - mtp_orchestrator namespace: drop unused 'common' layer (dflash27b::common::mtp -> dflash27b::mtp) - mtp_orchestrator: dynamic_cast -> static_cast (flavor() already discriminates the concrete type) - qwen35_mtp.cpp: drop redundant 'static' on TU-local helpers inside anonymous namespace - qwen35_mtp.{h,cpp}: convert Phase-A/B roadmap comments to TODOs or remove (completed phases) - qwen35_backend.h: label ensure_decode_cache/tensor_context as test-only (they remain public for harness access) - dflash_target.h: forward-declare DDTree; move ddtree.h include to .cpp files that use it concretely - mtp_chain_runner.h: clean stale PR-internal cross-reference - qwen35_backend.h: terminology consistency (MTP module) - mtp_orchestrator.cpp: comment that enable_hidden_seq_capture is no-op for non-MTP targets - test_common_mtp_orchestrator.cpp: remove PR-review narrative - qwen35_backend.cpp: add fflush after [mtp] loaded printf - No functional change. E2E sweep re-verified greedy byte-identical output across AR / chain γ=2 / chain γ=4.

Adds --mtp-gguf, --mtp-gamma, --mtp-draft-source, --mtp-draft-topk to the native C++ HTTP server (dflash/src/server/dflash_server). Threads them through BackendArgs into Qwen35Config so the existing Qwen35Backend MTP path (already wired) is reachable through the production HTTP endpoint, not just the test_dflash harness. Mirror of the existing flags in dflash/scripts/server.py. Verified via HTTP /v1/chat/completions: AR baseline and chain γ=2 produce greedy-identical text under temperature=0 on Qwen3.6-27B Q4_K_M + MTP-Q4_K_M. Also confirmed with unsloth single-file combined GGUF (Qwen3.6-27B-Q4_K_M-mtp.gguf) passed as --mtp-gguf — identical output, greedy equivalent.

F2 (P1) mtp_chain_runner.cpp: on partial accept (accept_n < g_actual) running_hidden was set from the drafter-internal next_hidden produced after the full rejected suffix. For ExternalDrafter flavor, call set_capture_row(accept_n) + consume_captured_hidden() to thread the target-captured hidden at the committed boundary instead. NativeHeads (Qwen3.6) is unaffected — next_hidden is always empty for that flavor and the new branch is not entered. F1 (P2) mtp_orchestrator.cpp: warm_and_decode chain-runner failure path (line 151) returned without emitting io.emit(-1), leaving the stream without a termination sentinel unlike every other error return in the function. Add the missing io.emit(-1) before return. Cubic review: "On runner failure, stream termination sentinel io.emit(-1) is not emitted, unlike all other error paths in this function" (P2). "External-drafter prev_hidden desynchronizes on partial accept: running_hidden is updated unconditionally from the full draft chain even when suffix drafts are rejected, so the next iteration threads a hidden state from a rejected token path instead of the committed boundary" (P1). Build: clean. All 19 unit tests pass. E2E HumanEval-10 AR-vs-MTP same-text rate: 8/10 post-fix (unchanged from pre-fix; F2 not active on Qwen3.6/NativeHeads path — he_01/he_02 divergences confirmed GPU noise).

T7 used a NativeHeads stub so it never exercised the IExternalDrafterMtp partial-accept path that commit 74f708a fixed. Add StubExternalDrafter + T20 asserting set_capture_row(accept_n) is called before consume_captured_hidden() when accept_n < gamma_actual.

…--draft/--mtp-gguf conflict Real-workload testing with agentic CLI clients (hermes) exposed two regressions not caught by HumanEval bench or unit tests: A. DFLASH27B_MTP_CTX default was 8192; any prompt longer than that would print '[qwen35_mtp] warm_head_kv: n_prompt=N exceeds head_kv capacity n_ctx=8192' as a warning, then SIGKILL the server during the next prefill operation (CUDA assertion / memory corruption from writing past the buffer). VRAM spiked from 5.8GB to 20GB then process died. Silent — no error logged. Fix: default MTP_CTX to backbone max_ctx + warm_head_kv returns false on overflow so the chain runner falls back to AR cleanly. B. dflash_server accepted --draft and --mtp-gguf together; the harness 'start_dflash_native_server' passes both by default which produced inconsistent state. Fix: at startup, log a WARNING and clear bargs.draft_path when both are set (MTP wins). Verified: server now stays alive when handed a 15K-token prompt under default MTP_CTX, falls back to AR for that request, returns response. --draft+--mtp-gguf invocation logs the conflict warning and loads only MTP.

P1 - gguf_mmap.h open() idempotency: call release() first if already open; on failure paths, reset state to default before returning. P1 - gguf_mmap.h Windows release() HANDLE leak: CloseHandle() before clearing handle_. (Fix by inspection — Linux build does not exercise.) P1 - qwen35_mtp.cpp GPU warmup off-by-one: n_prompt > n_ctx -> >= n_ctx matching CPU warmup. With slot_start=1, exactly-n_ctx-length prompts would have written slot n_ctx (out of bounds). MTP now declines that edge case cleanly and the chain runner falls through to AR. P2 - test_dflash.cpp: missing hidden_seq capture now aborts cleanly (all_prefill_hidden.clear() + break) instead of running warm_head_kv on undefined buffer. P2 - T7 (chain-runner rollback) now asserts restore_kv_at_chain was called > 0 times on partial accept. P3 - T6 (EOS termination) tightened: total_emitted == 1, not <= 10. Build clean, 20/20 unit tests pass. HumanEval-10 re-bench: 8/10 byte-identical (within AR-vs-AR noise baseline). Throughput in this run reflects a pre-existing unstaged flashprefill.cpp change (BSA opt-in -> opt-out) unrelated to these four files.

@howard0su

Per @howard0su's review on Luce-Org#237 (lines 57, 59): - 57: 'shall we use the unsloth single-file MTP-in-target GGUF?' - 59: 'why not a enum?' Replaces: const char * mtp_gguf_path = nullptr; const char * mtp_draft_source = nullptr; // "chain" | "mtp_topk" with: enum class MtpSource { None, Native, ExternalDrafter, Auto }; MtpSource mtp_source = MtpSource::None; const char * mtp_gguf_path = nullptr; // only for ExternalDrafter bool mtp_use_topk = false; // false=chain, true=mtp_topk Adds gguf_contains_mtp_tensors() probe (keyed on qwen35.nextn_predict_layers metadata) so --mtp-gguf becomes optional when the primary GGUF embeds MTP tensors (unsloth single-file case). Stacked on Luce-Org#237. dflash_server arg parsing updated to: - --mtp-source [none|native|external|auto] (new explicit flag) - --mtp-gguf PATH (now optional; only needed for ExternalDrafter) - Old --mtp-draft-source string flag warns + ignored (migration aid) - --mtp-gamma alone triggers Auto detection All test_common_mtp_orchestrator tests still pass (mock-based, unaffected by the config-surface change).

Per cubic P1 on PR Luce-Org#237: --mtp-source=none was silently overridden by the legacy-flag inference because MtpSource::None served both as the default and as the explicit 'none' value. Adds an internal MtpSource::Unset sentinel as the new default. Legacy-flag inference only fires when the field is still Unset. After inference (or if no legacy flag matched), Unset is resolved to None before any backend code sees it. User-visible CLI surface unchanged: --mtp-source still accepts exactly {none, native, external, auto}. Unset is internal-only and never escapes arg parsing. Defensive assert in create_backend() enforces this.

- server_main.cpp: gate --draft suppression and DFLASH27B_MTP_CTX on concrete MtpSource (Native|ExternalDrafter) instead of !None. Auto defers to backend GGUF probe; preserves --draft as fallback when Auto resolves to None. - mtp_orchestrator.cpp: wrap io with req.on_token via DaemonIO::with_token_callback (mirrors laguna_backend.cpp:151 and gemma4_backend.cpp:172). MTP requests now get streaming disconnect cancellation and per-token callbacks. 20/20 MTP test suite still passes.

dusterbloom mentioned this pull request May 20, 2026

feat(mtp): MTP-via-daemon end-to-end (incl. MTP infrastructure) #214

Closed

8 tasks

cubic-dev-ai Bot reviewed May 20, 2026

View reviewed changes

Comment thread dflash/src/common/mtp_chain_runner.cpp Outdated

Comment thread dflash/src/common/mtp_orchestrator.cpp

dusterbloom marked this pull request as draft May 20, 2026 19:36

dusterbloom marked this pull request as ready for review May 20, 2026 19:53

cubic-dev-ai Bot reviewed May 20, 2026

View reviewed changes

howard0su suggested changes May 20, 2026

View reviewed changes

dusterbloom mentioned this pull request May 21, 2026

refactor(dflash): rename namespace dflash27b → dflash::common #241

Merged

This was referenced May 21, 2026

refactor(common): extract gguf_mmap RAII wrapper as standalone PR #243

Merged

refactor(mtp): MtpSource enum + auto-detect MTP tensors dusterbloom/lucebox-hub#2

Merged

cubic-dev-ai Bot reviewed May 21, 2026

View reviewed changes

Comment thread dflash/src/server/server_main.cpp Outdated

dusterbloom added 11 commits May 21, 2026 21:53

dusterbloom force-pushed the feat/dflash-mtp-foundation branch from 1051e71 to 8ec9a95 Compare May 21, 2026 20:10

davide221 requested a review from howard0su May 22, 2026 09:07

		@@ -0,0 +1,219 @@
		// common/gguf_mmap.h — RAII wrapper for platform-conditional mmap of GGUF files.

		@@ -0,0 +1,133 @@
		// qwen35_mtp_graph.h — CUDA cgraph for Qwen3.6 MTP head step forward.

		@@ -0,0 +1,225 @@
		// qwen35_mtp_loader.cpp — Discovery loader for Qwen3.6 -MTP-GGUF files.

Conversation

dusterbloom commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Single-prompt HTTP (/v1/chat/completions, temp=0)

Agentic-CLI smoke (OpenAI Responses API)

Long-prompt resilience

Unit tests

Scope

Not in this PR (follow-ups)

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

dusterbloom commented May 20, 2026

Uh oh!

dusterbloom commented May 20, 2026

Uh oh!

dusterbloom commented May 20, 2026

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dusterbloom commented May 20, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

weicj commented May 21, 2026

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dusterbloom commented May 20, 2026 •

edited

Loading

Single-prompt HTTP (`/v1/chat/completions`, temp=0)

cubic-dev-ai Bot left a comment •

edited

Loading

cubic-dev-ai Bot left a comment •

edited

Loading

cubic-dev-ai Bot left a comment •

edited

Loading