Skip to content

feat(dFlash): MTP foundation#237

Open
dusterbloom wants to merge 13 commits into
Luce-Org:mainfrom
dusterbloom:feat/dflash-mtp-foundation
Open

feat(dFlash): MTP foundation#237
dusterbloom wants to merge 13 commits into
Luce-Org:mainfrom
dusterbloom:feat/dflash-mtp-foundation

Conversation

@dusterbloom
Copy link
Copy Markdown
Contributor

@dusterbloom dusterbloom commented May 20, 2026

Summary

Adds Qwen3.5/3.6 NextN MTP support: IMtpModule interface + chain runner in common/, Qwen35MtpModule concrete impl, qwen35 backend wiring, and --mtp-gguf flags in dflash_server.

Works with both GGUF layouts:

  • Separate files: backbone GGUF as primary + MTP-head GGUF via --mtp-gguf
  • Unsloth single-file combined GGUF (Qwen3.6-27B-Q4_K_M-mtp.gguf): pass the same file as both the primary and --mtp-gguf; the loader reads backbone tensors and MTP head tensors from it. Verified greedy-equivalent + same 1.82× speedup as the separate-files setup.

Validation

Single-prompt HTTP (/v1/chat/completions, temp=0)

HumanEval-10 via harness/benchmarks/generation_benchmark.py on Qwen3.6-27B Q4_K_M + MTP-Q4_K_M (RTX 3090):

Mode mean tok/s speedup accept rate
AR 32.46 1.00x
chain γ=2 59.24 1.82x 0.93–1.00

All 10 prompts pass functional checks. Greedy text byte-identical on 8/10; remaining 2 within the AR-vs-AR non-determinism baseline (verified via control run: same prompts produced 9/10 identical text across two AR-only runs of the same server).

Confirmed equivalent between separate-files and unsloth single-file layouts: 1.82× → 1.82× (statistically identical), 8/10 byte-identical text in both, same 95–100% accept rate range.

Agentic-CLI smoke (OpenAI Responses API)

pi (v0.55.3) via harness/clients/run_pi.sh: server starts with MTP, SSE streaming delivers tokens, [mtp_decode] iters=2 proposed=4 accepted=4 accept_rate=1.00, marker received exact-match. Proves the full HTTP → Responses API → SSE → MTP chain → tokenizer → client round-trip.

Long-prompt resilience

25K-token prompt under --mtp-gguf + --max-ctx 65536: server stays alive, MTP fires (iters=16, 50% accept), real response returned. Earlier silent SIGKILL on long prompts (DFLASH27B_MTP_CTX default 8192 not tracking max_ctx) fixed in `ada7bb0`; warm_head_kv overflow now triggers graceful AR fallback.

Unit tests

test_common_mtp_orchestrator: 20/20 pass — chain runner state machine (γ propagation, EOS termination, partial-accept rollback, n_gen termination), orchestrator lifecycle, Qwen35MtpModule input-validation guards, T20 ExternalDrafter partial-accept hidden threading (asserts the set_capture_row + consume_captured_hidden invariant the `74f708a` fix added). Mock-based, no GGUF required.

Scope

  • IMtpModule + INativeMtp + IExternalDrafterMtp mixins
  • Qwen35MtpModule (INativeMtp impl) + GGUF loader + step graph
  • MtpChainRunner + MtpOrchestrator
  • dflash_server: --mtp-gguf, --mtp-gamma, --mtp-draft-source, --mtp-draft-topk

Not in this PR (follow-ups)

  • Gemma4 ExternalDrafter MTP module (uses the IExternalDrafterMtp interface declared here)
  • Per-request DFlash/MTP/auto dispatcher (commit `3b65ac1` on `feat/mtp-prefix-warm-ghost`)
  • IDrafter abstraction unifying DFlash + MTP under one runner
  • Tree-mode runner + entropy-adaptive selector
  • Prefix-cache WARM hit for MTP

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 34 files

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread dflash/src/common/mtp_chain_runner.cpp Outdated
Comment thread dflash/src/common/mtp_orchestrator.cpp
@dusterbloom
Copy link
Copy Markdown
Contributor Author

Addressed cubic review in commit 74f708a:

  • F1 (P2): add missing io.emit(-1) on chain-runner failure path in warm_and_decode (mtp_orchestrator.cpp:152)
  • F2 (P1): call set_capture_row(accept_n) + consume_captured_hidden() before threading running_hidden on partial accept for ExternalDrafter flavor; NativeHeads path is unaffected (next_hidden is always empty for that flavor, new branch not entered)

Build clean. All 19 unit tests pass. T7 (partial_accept_rollback) exercises the NativeHeads rollback path only — the ExternalDrafter hidden-sync invariant is not covered by the existing suite (noted for follow-up).

E2E HumanEval-10 AR-vs-MTP same-text rate: 8/10 post-fix (unchanged from 8/10 pre-fix). F2 is confined to ExternalDrafter and does not affect the Qwen3.6/NativeHeads path tested here. he_01/he_02 divergences remain — consistent with prior AR-vs-AR control showing GPU non-determinism on those two prompts.

@dusterbloom
Copy link
Copy Markdown
Contributor Author

Coverage gap closed in bf2d3e9: T20 now exercises the ExternalDrafter partial-accept path that the 74f708a fix addressed (T7 used a NativeHeads stub, so it never reached the fixed branch). 20/20 unit tests pass.

@dusterbloom dusterbloom marked this pull request as draft May 20, 2026 19:36
@dusterbloom
Copy link
Copy Markdown
Contributor Author

Two more fixes from real-workload hermes testing (commit ada7bb0):

A. Silent server death on long-prompt MTP: DFLASH27B_MTP_CTX defaulted to 8192 but the server never propagated --max-ctx to that env var. With --max-ctx 65536, the backbone KV was allocated at 65536 slots while the MTP head_kv stayed at 8192 — VRAM mismatch caused OOM on the next prefill. (The warm_head_kv overflow guard already returned false cleanly — the bug was upstream at init time.) Fix: 5-line propagation in server_main.cpp using setenv(..., overwrite=0) so user env still wins.

B. --draft + --mtp-gguf conflict: server now logs a WARNING and clears draft_path when both are set (MTP wins).

Verified: 25K-token prompt → server stays alive, MTP ran (iters=16 proposed=32 accepted=16 accept_rate=0.50), real response returned. Conflict invocation prints the warning and starts cleanly with draft = (none) in the config banner.

@dusterbloom dusterbloom marked this pull request as ready for review May 20, 2026 19:53
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 issues found across 34 files

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread dflash/src/common/gguf_mmap.h
Comment thread dflash/src/common/gguf_mmap.h
Comment thread dflash/src/qwen35/qwen35_mtp.cpp Outdated
Comment thread dflash/test/test_dflash.cpp
Comment thread dflash/test/test_common_mtp_orchestrator.cpp
Comment thread dflash/test/test_common_mtp_orchestrator.cpp Outdated
@dusterbloom
Copy link
Copy Markdown
Contributor Author

Addressed cubic re-review in commit f6e8e94:

P1:

  • gguf_mmap.h open() now idempotent (releases prior mapping first via placement-destroy+reconstruct; leaves object in default state on any failure path)
  • gguf_mmap.h Windows release() now calls CloseHandle() before clearing handle_ (fix by inspection — Linux build cannot exercise the Windows path)
  • qwen35_mtp.cpp GPU warmup off-by-one (> -> >=): a prompt of exactly n_ctx tokens would have written slot n_ctx (out-of-bounds; valid range is [0, n_ctx-1] with slot_start=1). Now declines MTP cleanly and chain runner falls through to AR.

P2:

  • test_dflash.cpp missing hidden_seq capture now clears all_prefill_hidden and breaks out of the prefill loop instead of running warm_head_kv on an undefined buffer
  • T7 now asserts restore_kv_at_chain_calls >= 1 — rollback path is confirmed to fire on partial accept

P3:

  • T6 EOS termination assertion tightened from total_emitted <= 10 to total_emitted == 1 (EOS is the first emitted token; runner must stop immediately)

Build clean, 20/20 unit tests pass. HumanEval-10 re-bench: 8/10 byte-identical.

Comment thread dflash/src/common/backend_factory.h Outdated

// MTP (Multi-Token Prediction) speculator — mutually exclusive with --draft.
// When mtp_gguf_path is set, the backend ignores draft_path.
const char * mtp_gguf_path = nullptr;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we use https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF which contains MTP in the main model?

Comment thread dflash/src/common/backend_factory.h Outdated
// When mtp_gguf_path is set, the backend ignores draft_path.
const char * mtp_gguf_path = nullptr;
int mtp_gamma = 0; // 0 = MTP loaded but not active; >0 = chain depth
const char * mtp_draft_source = nullptr; // "chain" (default) | "mtp_topk"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not a enum?

@@ -0,0 +1,219 @@
// common/gguf_mmap.h — RAII wrapper for platform-conditional mmap of GGUF files.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this worth a separate PR

// supports_mtp() returns true. Default returns nullptr.
// Forward-declared to avoid a header dependency from model_backend.h
// on mtp_interface.h — backends include both as needed.
virtual mtp::IMtpModule * mtp() { return nullptr; }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not return nullptr indicate this model doesn't support mtp?

// matching mixin (IExternalDrafterMtp / INativeMtp).
enum class MtpFlavor {
ExternalDrafter, // Gemma4-style: separate drafter, h_prev chain
NativeHeads, // Qwen3.6-style: MTP heads in the backbone
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should hide into ModelBackend.

const int n = std::min(prefill_ubatch, prompt_len - start);
std::vector<int32_t> chunk(req.prompt.begin() + start,
req.prompt.begin() + start + n);
if (!target->verify_batch(chunk, start, last_tok, nullptr)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mtp should only change how we generate the batch (v.s. draft), can we normalize the code with the existing logic?

@@ -0,0 +1,133 @@
// qwen35_mtp_graph.h — CUDA cgraph for Qwen3.6 MTP head step forward.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should use the existing graph API not a new API unless we want to put MTP header into another ggml backend.

@@ -0,0 +1,225 @@
// qwen35_mtp_loader.cpp — Discovery loader for Qwen3.6 -MTP-GGUF files.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check the previous comment about GGUF loading.

@weicj
Copy link
Copy Markdown
Contributor

weicj commented May 21, 2026

I’m working on a shared hardware/backend/runtime placement foundation for the cpp native server path. #236

Could we align the MTP backend/device selection with that shared placement layer, instead of keeping a separate MTP-specific selection path?

This would avoid duplicated CUDA/HIP/device handling across MTP,DFlash, PFlash, and future speculative paths.

dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 21, 2026
Mechanical rename per weicj's review on PR Luce-Org#237 — the legacy namespace
name baked the first backend (Qwen3-27B) into shared code. Renaming to
dflash::common removes the backend leak from the substrate so future
backends plug into a neutral namespace.

Scope:
- namespace dflash27b → namespace dflash::common
- dflash27b::* → dflash::common::*
- CMake static lib dflash27b → dflash_common
- CMake project(dflash27b) → project(dflash)
- Private CMake vars _dflash27b_* → _dflash_*
- Stale comment references

Out of scope (deferred to a follow-up):
- Public C header dflash/include/dflash27b.h
- C symbol dflash27b_last_error()
- Preprocessor macros DFLASH27B_*
- Env vars DFLASH27B_*

Build note: CUDA 12.6 + GCC 13.3 has a known _Float128 conflict during
CUDA host-compiler ID detection. Workaround:
  -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-11

No behavior change. Clean build green. Symbol mangling confirmed as
N6dflash6common* via nm; zero residual dflash27b symbols outside the
deferred public C ABI.
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 21, 2026
Mechanical rename per @weicj's review on Luce-Org#237. Renames the legacy
backend-baked namespace to a neutral one so future backends plug into
shared code without name leakage.

Scope:
- namespace dflash27b → namespace dflash::common (sources + tests)
- dflash27b::* → dflash::common::*
- CMake static lib dflash27b → dflash_common
- CMake project(dflash27b) → project(dflash)
- Private CMake vars _dflash27b_* → _dflash_*
- Stale comment references

Out of scope (deferred to a follow-up):
- Public C header dflash/include/dflash27b.h
- C symbol dflash27b_last_error()
- Preprocessor macros DFLASH27B_*
- Env vars DFLASH27B_*

Build note: CUDA 12.6 + GCC 13.3 has a known _Float128 conflict during
CUDA host-compiler ID detection. Workaround:
  -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-11

No behavior change. Symbol mangling confirmed via nm as N6dflash6common*;
no residual dflash27b mangled symbols outside the deferred public C ABI.
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 8 files (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread dflash/src/server/server_main.cpp Outdated
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 21, 2026
Per @howard0su's review on Luce-Org#237: gguf_mmap.h/.cpp are platform
abstraction (POSIX mmap / Windows MapViewOfFile) that will be reused
by multiple loaders. Extracting into a standalone PR ahead of the
loader refactor (target/draft/MTP heads all benefit).

Includes the cubic P1 fixes from f6e8e94:
- open() is now idempotent (releases prior mapping before re-opening,
  leaves object in default state on any failure path)
- Windows release() now calls CloseHandle() before clearing handle_

New: test_gguf_mmap unit test covers open/close, idempotency,
missing-file path, RAII destructor.

Stacked on Luce-Org#241 (uses dflash::common namespace from the start).
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 21, 2026
Per cubic P1 on PR Luce-Org#237: --mtp-source=none was silently overridden by
the legacy-flag inference because MtpSource::None served both as the
default and as the explicit 'none' value.

Adds an internal MtpSource::Unset sentinel as the new default.
Legacy-flag inference only fires when the field is still Unset. After
inference (or if no legacy flag matched), Unset is resolved to None
before any backend code sees it.

User-visible CLI surface unchanged: --mtp-source still accepts exactly
{none, native, external, auto}. Unset is internal-only and never
escapes arg parsing. Defensive assert in create_backend() enforces this.
Ports the Qwen3.6 MTP head onto the qwen35 backbone (same arch, NextN
block at layer n_layer-1). Speculation runs through a new common chain
runner; the existing DFlashTarget adapter handles verify/snapshot/restore.

- common/mtp_interface.h: flavor-tagged IMtpModule + INativeMtp /
  IExternalDrafterMtp mixins. Future Gemma4 drafter plugs in via
  IExternalDrafterMtp without touching the chain runner.
- common/mtp_chain_runner.{h,cpp}: γ-chain propose/verify/accept loop,
  hoisted out of the backend. Three KV-reconciliation paths
  (accept-all / fast rollback / recommit) share a single post-iter
  invariant so AR equivalence holds under recommit.
- common/mtp_orchestrator.{h,cpp}: chunked prefill + warm + dispatch
  to chain runner. Owns only control flow; all compute lives in
  DFlashTarget::verify_batch and INativeMtp::step_batch graphs on the
  backend device.
- qwen36/qwen36_mtp.{h,cpp,_graph.cpp,_loader.cpp}: GGUF tensor
  inventory for Qwen3.6 -MTP-GGUF, GPU warm graph, GPU step graph
  cached on (head_idx, fa_window, fused_lm_head, topk_k). γ is bound
  at attach time as the single source of truth.
- qwen35: supports_mtp()/mtp() exposed through ModelBackend;
  generate() delegates to common::mtp::warm_and_decode when MTP is
  configured. Cache sized for max(γ+1, ddtree_budget+1) verify tokens.
- server.py: --mtp-gguf and --mtp-gamma flags routed through; daemon
  command surface unchanged.

Tests: 4/4 test_common_mtp_orchestrator. Full build green; harness probe
7/7 (claude_code, codex, opencode, openwebui, pi, hermes, openclaw) at
--max-ctx 65536; MTP decode reports accept_rate 0.43-0.88 on short
agentic prompts.
… paths

Adds unit tests exercising MtpChainRunner state machine (gamma propagation,
EOS termination, partial-accept rollback, n_gen termination), MtpOrchestrator
lifecycle (reset_chain ordering, set_initial_hidden plumbing, gamma
derivation), and Qwen36MtpModule input-validation guards (attach(nullptr),
out-of-range gamma, pre-attach calls, shutdown idempotency).

All new tests are mock-based and require no GGUF.
Both qwen35_backend and the test_dflash harness called the 4-arg overload of
Qwen36MtpModule::init() with backend.tensor_context() — the BACKBONE ggml_ctx.
That ctx contains only blk.0..63; MTP head tensors live in blk.64.* in the
separate MTP GGUF, so the loader could never resolve them and init failed
silently before chain decode.

Switched both call sites to the 3-arg overload init(gguf_path, target, err)
which calls load_gguf_tensor_context(mtp_gguf_path) internally to build the
correct context for tensor lookup.

Caught by E2E sweep on real Qwen3.6-27B Q4_K_M + MTP-Q4_K_M GGUFs.
… is qwen35)

The GGUF general.architecture field on both 'Qwen3.5-0.8B-Q4_K_M.gguf' and
'Qwen3.6-27B-MTP-Q4_K_M.gguf' is 'qwen35'. The 'Qwen3.6' label lives only in
general.name (model display name). The MTP module is a NextN extension of the
qwen35 architecture, not a separate arch.

Aligns naming with Qwen35Backend (the backbone) and the loader's arch check
in gguf_target_loader.cpp:271. Pure rename — no functional change.

Files renamed (git mv, blame preserved):
  src/qwen36/qwen36_mtp.{h,cpp}         → src/qwen35/qwen35_mtp.{h,cpp}
  src/qwen36/qwen36_mtp_graph.{h,cpp}   → src/qwen35/qwen35_mtp_graph.{h,cpp}
  src/qwen36/qwen36_mtp_loader.cpp      → src/qwen35/qwen35_mtp_loader.cpp
  dflash/src/qwen36/ directory removed

E2E re-verified on Qwen3.6-27B Q4_K_M + MTP-Q4_K_M (RTX 3090):
  AR (γ=0):   35.6 tok/s
  Chain γ=2: 57.8 tok/s = 1.62× over AR  (accept_rate=0.83)
  Chain γ=4: 50.9 tok/s = 1.43× over AR  (accept_rate=0.58)
Greedy token output byte-identical across all three modes.
Mechanical fixes from style review:
- qwen35_mtp_loader.cpp: rewrite header from stale 'PR 2 skeleton'
  narrative to current behavior
- mtp_chain_runner.cpp: remove dead if-block at the post-propose
  size check (comment-only body)
- Qwen35DaemonArgs::n renamed to Qwen35DaemonArgs::mtp_draft_source
  to match Qwen35Config::mtp_draft_source
- mtp_orchestrator namespace: drop unused 'common' layer
  (dflash27b::common::mtp -> dflash27b::mtp)
- mtp_orchestrator: dynamic_cast -> static_cast (flavor() already
  discriminates the concrete type)
- qwen35_mtp.cpp: drop redundant 'static' on TU-local helpers inside
  anonymous namespace
- qwen35_mtp.{h,cpp}: convert Phase-A/B roadmap comments to TODOs or
  remove (completed phases)
- qwen35_backend.h: label ensure_decode_cache/tensor_context as
  test-only (they remain public for harness access)
- dflash_target.h: forward-declare DDTree; move ddtree.h include to
  .cpp files that use it concretely
- mtp_chain_runner.h: clean stale PR-internal cross-reference
- qwen35_backend.h: terminology consistency (MTP module)
- mtp_orchestrator.cpp: comment that enable_hidden_seq_capture is
  no-op for non-MTP targets
- test_common_mtp_orchestrator.cpp: remove PR-review narrative
- qwen35_backend.cpp: add fflush after [mtp] loaded printf
- No functional change. E2E sweep re-verified greedy byte-identical
  output across AR / chain γ=2 / chain γ=4.
Adds --mtp-gguf, --mtp-gamma, --mtp-draft-source, --mtp-draft-topk to
the native C++ HTTP server (dflash/src/server/dflash_server). Threads
them through BackendArgs into Qwen35Config so the existing Qwen35Backend
MTP path (already wired) is reachable through the production HTTP
endpoint, not just the test_dflash harness.

Mirror of the existing flags in dflash/scripts/server.py.

Verified via HTTP /v1/chat/completions: AR baseline and chain γ=2
produce greedy-identical text under temperature=0 on Qwen3.6-27B
Q4_K_M + MTP-Q4_K_M. Also confirmed with unsloth single-file combined
GGUF (Qwen3.6-27B-Q4_K_M-mtp.gguf) passed as --mtp-gguf — identical
output, greedy equivalent.
F2 (P1) mtp_chain_runner.cpp: on partial accept (accept_n < g_actual)
running_hidden was set from the drafter-internal next_hidden produced
after the full rejected suffix.  For ExternalDrafter flavor, call
set_capture_row(accept_n) + consume_captured_hidden() to thread the
target-captured hidden at the committed boundary instead.  NativeHeads
(Qwen3.6) is unaffected — next_hidden is always empty for that flavor
and the new branch is not entered.

F1 (P2) mtp_orchestrator.cpp: warm_and_decode chain-runner failure path
(line 151) returned without emitting io.emit(-1), leaving the stream
without a termination sentinel unlike every other error return in the
function.  Add the missing io.emit(-1) before return.

Cubic review: "On runner failure, stream termination sentinel io.emit(-1)
is not emitted, unlike all other error paths in this function" (P2).
"External-drafter prev_hidden desynchronizes on partial accept: running_hidden
is updated unconditionally from the full draft chain even when suffix drafts
are rejected, so the next iteration threads a hidden state from a rejected
token path instead of the committed boundary" (P1).

Build: clean.  All 19 unit tests pass.  E2E HumanEval-10 AR-vs-MTP
same-text rate: 8/10 post-fix (unchanged from pre-fix; F2 not active on
Qwen3.6/NativeHeads path — he_01/he_02 divergences confirmed GPU noise).
T7 used a NativeHeads stub so it never exercised the IExternalDrafterMtp
partial-accept path that commit 74f708a fixed. Add StubExternalDrafter +
T20 asserting set_capture_row(accept_n) is called before
consume_captured_hidden() when accept_n < gamma_actual.
…--draft/--mtp-gguf conflict

Real-workload testing with agentic CLI clients (hermes) exposed two
regressions not caught by HumanEval bench or unit tests:

A. DFLASH27B_MTP_CTX default was 8192; any prompt longer than that
   would print '[qwen35_mtp] warm_head_kv: n_prompt=N exceeds
   head_kv capacity n_ctx=8192' as a warning, then SIGKILL the server
   during the next prefill operation (CUDA assertion / memory
   corruption from writing past the buffer). VRAM spiked from 5.8GB
   to 20GB then process died. Silent — no error logged.
   Fix: default MTP_CTX to backbone max_ctx + warm_head_kv returns
   false on overflow so the chain runner falls back to AR cleanly.

B. dflash_server accepted --draft and --mtp-gguf together; the
   harness 'start_dflash_native_server' passes both by default which
   produced inconsistent state. Fix: at startup, log a WARNING and
   clear bargs.draft_path when both are set (MTP wins).

Verified: server now stays alive when handed a 15K-token prompt
under default MTP_CTX, falls back to AR for that request, returns
response. --draft+--mtp-gguf invocation logs the conflict warning
and loads only MTP.
P1 - gguf_mmap.h open() idempotency: call release() first if already
  open; on failure paths, reset state to default before returning.
P1 - gguf_mmap.h Windows release() HANDLE leak: CloseHandle() before
  clearing handle_. (Fix by inspection — Linux build does not exercise.)
P1 - qwen35_mtp.cpp GPU warmup off-by-one: n_prompt > n_ctx -> >= n_ctx
  matching CPU warmup. With slot_start=1, exactly-n_ctx-length prompts
  would have written slot n_ctx (out of bounds). MTP now declines that
  edge case cleanly and the chain runner falls through to AR.
P2 - test_dflash.cpp: missing hidden_seq capture now aborts cleanly
  (all_prefill_hidden.clear() + break) instead of running warm_head_kv
  on undefined buffer.
P2 - T7 (chain-runner rollback) now asserts restore_kv_at_chain was
  called > 0 times on partial accept.
P3 - T6 (EOS termination) tightened: total_emitted == 1, not <= 10.

Build clean, 20/20 unit tests pass. HumanEval-10 re-bench: 8/10
byte-identical (within AR-vs-AR noise baseline). Throughput in this
run reflects a pre-existing unstaged flashprefill.cpp change (BSA
opt-in -> opt-out) unrelated to these four files.
Per @howard0su's review on Luce-Org#237 (lines 57, 59):
- 57: 'shall we use the unsloth single-file MTP-in-target GGUF?'
- 59: 'why not a enum?'

Replaces:
  const char * mtp_gguf_path = nullptr;
  const char * mtp_draft_source = nullptr;  // "chain" | "mtp_topk"
with:
  enum class MtpSource { None, Native, ExternalDrafter, Auto };
  MtpSource    mtp_source     = MtpSource::None;
  const char * mtp_gguf_path  = nullptr;  // only for ExternalDrafter
  bool         mtp_use_topk   = false;    // false=chain, true=mtp_topk

Adds gguf_contains_mtp_tensors() probe (keyed on qwen35.nextn_predict_layers
metadata) so --mtp-gguf becomes optional when the primary GGUF embeds
MTP tensors (unsloth single-file case).

Stacked on Luce-Org#237. dflash_server arg parsing updated to:
- --mtp-source [none|native|external|auto] (new explicit flag)
- --mtp-gguf PATH (now optional; only needed for ExternalDrafter)
- Old --mtp-draft-source string flag warns + ignored (migration aid)
- --mtp-gamma alone triggers Auto detection

All test_common_mtp_orchestrator tests still pass (mock-based,
unaffected by the config-surface change).
Per cubic P1 on PR Luce-Org#237: --mtp-source=none was silently overridden by
the legacy-flag inference because MtpSource::None served both as the
default and as the explicit 'none' value.

Adds an internal MtpSource::Unset sentinel as the new default.
Legacy-flag inference only fires when the field is still Unset. After
inference (or if no legacy flag matched), Unset is resolved to None
before any backend code sees it.

User-visible CLI surface unchanged: --mtp-source still accepts exactly
{none, native, external, auto}. Unset is internal-only and never
escapes arg parsing. Defensive assert in create_backend() enforces this.
@dusterbloom dusterbloom force-pushed the feat/dflash-mtp-foundation branch from 1051e71 to 8ec9a95 Compare May 21, 2026 20:10
- server_main.cpp: gate --draft suppression and DFLASH27B_MTP_CTX on
  concrete MtpSource (Native|ExternalDrafter) instead of !None. Auto
  defers to backend GGUF probe; preserves --draft as fallback when
  Auto resolves to None.

- mtp_orchestrator.cpp: wrap io with req.on_token via
  DaemonIO::with_token_callback (mirrors laguna_backend.cpp:151 and
  gemma4_backend.cpp:172). MTP requests now get streaming disconnect
  cancellation and per-token callbacks.

20/20 MTP test suite still passes.
@davide221 davide221 requested a review from howard0su May 22, 2026 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants