Merge pull request #14 from AtomicBot-ai/b1-mtp-qwen-rebase

Ooooze · web-flow · commit 0a635dcd92ba · 2026-05-13T20:26:59.000+03:00
Enhance multimodal support and speculative decoding in atomic-llama-c…
diff --git a/NEXTN.md b/NEXTN.md
@@ -235,3 +235,35 @@ What's actually in each repo, and why it's a bit unusual for a quant drop:
 - **Apache-2.0**, attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), [@TheTom](https://github.com/TheTom) (TurboQuant), AtomicChat (UDT masks + packaging). Fork: [`AtomicBot-ai/atomic-llama-cpp-turboquant`](https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant).
 
 The whole pipeline (download → quantize on H100 → bench on M4 Max → upload) is scripted in [`docs/qwen-udt/RUNBOOK.md`](../docs/qwen-udt/RUNBOOK.md); re-running it on the same Unsloth sources reproduces the published files bit-identical.
+
+---
+
+## 10. Multimodal (`--mmproj`) + speculative decoding (this fork)
+
+Upstream `llama-server` used to disable **all** speculative modes whenever a projector was loaded, so a single Qwen 3.6 / Gemma 4 server could not host vision and a draft head at the same time. In **atomic-llama-cpp-turboquant** the load-time and slot-init gates accept `--mmproj` together with:
+
+- **`--spec-type mtp`** (Gemma 4 assistant)
+- **`--spec-type nextn`** (Qwen3 NextN draft context)
+- **`--spec-type eagle3`** (stub impl; same contract)
+
+These three never read the flattened `prompt_tgt` token stream; they read target hidden states / KV directly, so they can coexist with mtmd image chunks. The remaining speculative modes (separate **`draft`** models and all **`ngram_*`** modes) stay disabled with a warning, as do **`ctx_shift`** and **`cache_reuse`**.
+
+### What is and is not accelerated today
+
+- **Text-only turns on a multimodal slot** — draft head runs as usual. Same acceptance rates as the no-mmproj configuration.
+- **Turns that contain an image chunk** — server logs `skipping speculative prime for multimodal prompt` and falls back to plain target decoding for **that turn only**. The slot keeps generating text correctly, just without draft speedup.
+
+The reason for the fallback: NextN / MTP `begin()` needs the target's pre-norm hidden state at every prompt position, but the mtmd image-decode path only writes outputs for the last row of an image batch (`get_embeddings_pre_norm_ith` returns `null` for image-pad positions, see `tools/server/server-context.cpp`). Until image chunks emit per-token outputs, priming on a mixed token stream would leave the draft KV partially seeded and desynced from the target by image-expanded positions. Skipping the prime keeps the slot stable and lets the next pure-text turn re-enable drafting from scratch.
+
+### Verified configurations
+
+| Model | Spec | KV | mmproj | Image | Text reply | Decode |
+|---|---|---|---|---|---|---|
+| Qwen 3.6-35B-A3B-UDT-Q4_K_XL_MTP | `nextn` | turbo3 | F16 | recognised | OK | ~69 t/s |
+| Gemma 4-26B-A4B-it-UD-Q4_K_XL    | `mtp`   | turbo3 | F16 | recognised | OK | ~55 t/s |
+
+Both runs were validated on M4 Max with a single shared model file (no second mmap), `-c 4096`, `-fa on`.
+
+### Roadmap
+
+Real draft acceleration on the vision turn itself requires making mtmd image batches emit per-token outputs (or a teacher-forced replay through the target). This is tracked as a follow-up and does not block this fork's release.
diff --git a/README.md b/README.md
@@ -26,6 +26,7 @@ LLM inference in C/C++
 - [[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗](https://github.com/ggml-org/llama.cpp/discussions/15313)
 - Support for the `gpt-oss` model with native MXFP4 format has been added | [PR](https://github.com/ggml-org/llama.cpp/pull/15091) | [Collaboration with NVIDIA](https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss) | [Comment](https://github.com/ggml-org/llama.cpp/discussions/15095)
 - Multimodal support arrived in `llama-server`: [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) | [documentation](./docs/multimodal.md)
+- **This fork:** `--mmproj` can be loaded **alongside** `mtp` / `nextn` / `eagle3` speculative decoding on a single slot (validated on Qwen 3.6-35B-A3B-UDT + NextN + turbo3 KV and Gemma 4-26B-A4B + MTP + turbo3 KV). Draft acceleration applies to **text-only turns**; image-bearing turns fall back to plain target decoding (image still recognised). Other spec types remain disabled with multimodal. Details: [docs/speculative.md](./docs/speculative.md) and [NEXTN.md](./NEXTN.md) §10.
 - VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
 - Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
 - Hugging Face Inference Endpoints now support GGUF out of the box! https://github.com/ggml-org/llama.cpp/discussions/9669
diff --git a/common/speculative.cpp b/common/speculative.cpp
@@ -1700,6 +1700,29 @@ enum common_speculative_type common_speculative_type_from_name(const std::string
     return it->second;
 }
 
+bool common_speculative_is_mtmd_safe(enum common_speculative_type type) {
+    switch (type) {
+        case COMMON_SPECULATIVE_TYPE_MTP:
+        case COMMON_SPECULATIVE_TYPE_NEXTN:
+        case COMMON_SPECULATIVE_TYPE_EAGLE3:
+            return true;
+        default:
+            return false;
+    }
+}
+
+bool common_speculative_all_impls_mtmd_safe(const common_speculative * spec) {
+    if (!spec) {
+        return true;
+    }
+    for (const auto & impl : spec->impls) {
+        if (!common_speculative_is_mtmd_safe(impl->type)) {
+            return false;
+        }
+    }
+    return true;
+}
+
 bool common_speculative_is_compat(llama_context * ctx_tgt) {
     auto * mem = llama_get_memory(ctx_tgt);
     if (mem == nullptr) {
diff --git a/common/speculative.h b/common/speculative.h
@@ -18,6 +18,13 @@ std::string common_speculative_type_to_str(enum common_speculative_type type);
 // note: clears the memory of the context
 bool common_speculative_is_compat(llama_context * ctx_tgt);
 
+// True for speculative modes that do not consume prompt_tgt in common_speculative_draft()
+// (MTP / NextN / EAGLE3 read target KV / hidden states instead). Safe to combine with --mmproj.
+bool common_speculative_is_mtmd_safe(enum common_speculative_type type);
+
+// True iff every registered impl is mtmd-safe (rejects mixed chains e.g. ngram + draft model).
+bool common_speculative_all_impls_mtmd_safe(const common_speculative * spec);
+
 common_speculative * common_speculative_init(
         common_params_speculative & params,
         llama_context             * ctx_tgt);
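
The two predicates declared above can be exercised in isolation. The following is a minimal sketch, not the fork's code: `spec_type`, `spec_chain`, `is_mtmd_safe` and `all_impls_mtmd_safe` are hypothetical stand-ins for `common_speculative_type`, `common_speculative` and the two new functions, and the chain is modelled as a plain vector of types rather than a list of impl objects.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for common_speculative_type (only the values used here).
enum class spec_type { none, draft, ngram_simple, mtp, nextn, eagle3 };

// Hypothetical stand-in for common_speculative: just the types of the registered impls.
struct spec_chain {
    std::vector<spec_type> impls;
};

// Same decision as common_speculative_is_mtmd_safe in the diff:
// only the modes that read target KV / hidden states are allowed with --mmproj.
bool is_mtmd_safe(spec_type type) {
    switch (type) {
        case spec_type::mtp:
        case spec_type::nextn:
        case spec_type::eagle3:
            return true;
        default:
            return false;
    }
}

// Same decision as common_speculative_all_impls_mtmd_safe in the diff:
// a missing chain is treated as safe, and one unsafe impl rejects the whole chain.
bool all_impls_mtmd_safe(const spec_chain * spec) {
    if (!spec) {
        return true;
    }
    for (const spec_type type : spec->impls) {
        if (!is_mtmd_safe(type)) {
            return false;
        }
    }
    return true;
}
```

Under these assumptions, a chain holding only `nextn` passes, while a chain that also contains `ngram_simple` is rejected as a whole, which is the behaviour the slot-init gate relies on.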
diff --git a/docs/speculative.md b/docs/speculative.md
@@ -8,6 +8,19 @@ llama.cpp supports speculative decoding, a technique that can significantly acce
 
 The `llama-server` application supports several implementations of speculative decoding. An implementation with draft model can be mixed with an implementation without draft model.
 
+### Multimodal (`--mmproj`) compatibility (atomic-llama-cpp-turboquant)
+
+When `--mmproj` is set, **`mtp`**, **`nextn`**, and **`eagle3`** speculative types remain **enabled at load**: their draft paths do not depend on `get_text_tokens()` / `prompt_tgt` the way `draft` and `ngram_*` do. Other types are auto-disabled at load with a warning. Mixed speculative chains (e.g. `ngram_simple` + `draft`) are rejected at slot init if any impl is not multimodal-safe.
+
+**Per-turn behaviour:**
+
+- **Text-only turns on a multimodal slot** — draft head runs as usual (same acceptance as without `--mmproj`).
+- **Turns containing an image chunk** — the slot logs `skipping speculative prime for multimodal prompt` and falls back to plain target decoding for that turn only. The image is still recognised correctly, just without draft speedup.
+
+The fallback is required because NextN / MTP prime needs per-token target hidden states for every prompt position, but mtmd image-decode currently only emits an output row for the last token of each image batch. Lifting this restriction is on the roadmap (mtmd batches need to mark every position with `logits[i] = true`, or be replayed teacher-forced).
+
+For the implementation, see `common_speculative_is_mtmd_safe` / `common_speculative_all_impls_mtmd_safe` in [`common/speculative.cpp`](../common/speculative.cpp), and the `mmproj` gates plus the per-turn `skip_draft_mtmd` gate in [`tools/server/server-context.cpp`](../tools/server/server-context.cpp). Validated configurations and the end-to-end recipe are documented in [`NEXTN.md` §10](../NEXTN.md#10-multimodal---mmproj--speculative-decoding-this-fork).
+
 ### Draft Model (`draft`)
 
 A much smaller model (called the _draft model_) generates drafts.
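
The per-turn behaviour described in the new docs section can be summarised as a small pure function. This is a sketch under stated assumptions: `turn_state`, `gate_result` and `gate_draft` are hypothetical names, and `prompt_has_media` models the documented condition "the turn contains an image chunk" rather than any particular server field.

```cpp
#include <cassert>

// Hypothetical snapshot of the inputs the per-turn gate looks at.
struct turn_state {
    bool has_mmproj;       // a projector is loaded
    bool prompt_has_media; // the slot's prompt contains at least one image chunk
    bool spec_mtmd_safe;   // every speculative impl is mtp / nextn / eagle3
    int  n_draft_max_raw;  // draft budget before the multimodal gate is applied
};

// Hypothetical result: either the combination is unsupported, or a draft budget.
struct gate_result {
    bool unsupported; // the server aborts in this case
    int  n_draft_max; // 0 means plain target decoding for this turn
};

gate_result gate_draft(const turn_state & s) {
    // An unsafe speculative type on a multimodal server should have been
    // rejected at load or slot init; reaching it here is treated as fatal.
    if (s.has_mmproj && s.n_draft_max_raw > 0 && !s.spec_mtmd_safe) {
        return { true, 0 };
    }
    // Image-bearing turns skip drafting; text-only turns keep the full budget.
    const bool skip_draft_mtmd = s.has_mmproj && s.prompt_has_media;
    return { false, skip_draft_mtmd ? 0 : s.n_draft_max_raw };
}
```

A text-only turn on a multimodal slot keeps its draft budget, an image-bearing turn gets a budget of zero, and a server without `--mmproj` is unaffected by the gate.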
diff --git a/scripts/run-gemma4-e4b-mtp-mmproj-server.sh b/scripts/run-gemma4-e4b-mtp-mmproj-server.sh
@@ -0,0 +1,21 @@
+#!/usr/bin/env bash
+# Gemma 4 E4B + MTP draft + multimodal projector (vision) + TurboQuant3 KV.
+# Thin wrapper around run-gemma4-e4b-mtp-server.sh; override MMPROJ_GGUF if your mmproj lives elsewhere.
+#
+# Behaviour: text-only turns get the full MTP draft speedup; turns that include
+# an image chunk fall back to plain target decoding for that turn only and the
+# server logs "skipping speculative prime for multimodal prompt". See NEXTN.md §10.
+
+set -euo pipefail
+
+ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+MMPROJ_GGUF="${MMPROJ_GGUF:-${ROOT}/.scratch/gemma-4-e4b/mmproj-F16.gguf}"
+
+if [[ ! -f "$MMPROJ_GGUF" ]]; then
+  echo "error: mmproj not found: ${MMPROJ_GGUF}" >&2
+  echo "hint: place mmproj-F16.gguf next to your E4B GGUF or export MMPROJ_GGUF=/path/to/mmproj.gguf" >&2
+  exit 1
+fi
+
+exec env SPEC="${SPEC:-mtp}" \
+  bash "${ROOT}/scripts/run-gemma4-e4b-mtp-server.sh" --mmproj "$MMPROJ_GGUF" "$@"
diff --git a/tools/server/server-common.cpp b/tools/server/server-common.cpp
@@ -388,6 +388,15 @@ const llama_tokens & server_tokens::get_text_tokens() const {
     return tokens;
 }
 
+llama_tokens server_tokens::replace_media_null_tokens(llama_token replacement) const {
+    llama_tokens out;
+    out.reserve(tokens.size());
+    for (const llama_token t : tokens) {
+        out.push_back(t == LLAMA_TOKEN_NULL ? replacement : t);
+    }
+    return out;
+}
+
 void server_tokens::set_token(llama_pos pos, llama_token id) {
     GGML_ASSERT(!has_mtmd); // only allow this if mtmd is disabled
     tokens[pos] = id;
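
To illustrate what `replace_media_null_tokens` returns, here is a standalone sketch of the same substitution on a plain vector. `token`, `TOKEN_NULL` and `replace_null_tokens` are hypothetical stand-ins for `llama_token`, `LLAMA_TOKEN_NULL` and the new member function; the value `-1` for the placeholder is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using token = std::int32_t;       // stand-in for llama_token
constexpr token TOKEN_NULL = -1;  // assumed stand-in for LLAMA_TOKEN_NULL

// Returns a copy of `tokens` in which every media placeholder is replaced by
// `replacement`. Text tokens and the overall length are left unchanged, so
// positions in the flat stream still line up with the original prompt.
std::vector<token> replace_null_tokens(const std::vector<token> & tokens, token replacement) {
    std::vector<token> out;
    out.reserve(tokens.size());
    for (const token t : tokens) {
        out.push_back(t == TOKEN_NULL ? replacement : t);
    }
    return out;
}
```

For example, with a replacement of `7`, the stream `{10, -1, -1, 42}` becomes `{10, 7, 7, 42}`; a stream without placeholders is returned unchanged.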
diff --git a/tools/server/server-common.h b/tools/server/server-common.h
@@ -189,6 +189,10 @@ struct server_tokens {
     // for compatibility with speculative decoding, ctx shift, slot save/load
     const llama_tokens & get_text_tokens() const;
 
+    // Replace LLAMA_TOKEN_NULL (mtmd media placeholders) for callers that need a flat token stream
+    // (e.g. Qwen NextN speculative begin / prime). Text tokens are unchanged.
+    llama_tokens replace_media_null_tokens(llama_token replacement) const;
+
     // for compatibility with speculative decoding
     void set_token(llama_pos pos, llama_token id);
 
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
@@ -16,6 +16,7 @@
 #include <cstddef>
 #include <cinttypes>
 #include <cstdio>
+#include <cstring>
 #include <vector>
 #include <exception>
 #include <memory>
@@ -33,6 +34,34 @@
 
 using json = nlohmann::ordered_json;
 
+namespace {
+
+// Token used to replace LLAMA_TOKEN_NULL placeholders when priming Qwen NextN draft KV (see common_speculative_begin).
+static llama_token server_nextn_mtmd_fill_token(const llama_model * model) {
+    const llama_vocab * vocab = llama_model_get_vocab(model);
+    if (!vocab) {
+        return 0;
+    }
+    static const char * const k_candidates[] = {
+        "<|image_pad|>",
+        "<|IMAGE_PAD|>",
+        "<|vision_pad|>",
+    };
+    std::vector<llama_token> buf(32);
+    for (const char * piece : k_candidates) {
+        const int32_t n = llama_tokenize(
+                vocab, piece, (int32_t) std::strlen(piece),
+                buf.data(), (int32_t) buf.size(), false, true);
+        if (n == 1) {
+            return buf[0];
+        }
+    }
+    const llama_token pad = llama_vocab_pad(vocab);
+    return pad != LLAMA_TOKEN_NULL ? pad : 0;
+}
+
+} // namespace
+
 constexpr int HTTP_POLLING_SECONDS = 1;
 
 // state diagram: https://github.com/ggml-org/llama.cpp/pull/9283
@@ -820,9 +849,10 @@ struct server_context_impl {
                 SRV_WRN("%s\n", "cache_reuse is not supported by multimodal, it will be disabled");
             }
 
-            if (params_base.speculative.type != COMMON_SPECULATIVE_TYPE_NONE) {
-                params_base.speculative.type =  COMMON_SPECULATIVE_TYPE_NONE;
-                SRV_WRN("%s\n", "speculative decoding is not supported by multimodal, it will be disabled");
+            if (params_base.speculative.type != COMMON_SPECULATIVE_TYPE_NONE &&
+                    !common_speculative_is_mtmd_safe(params_base.speculative.type)) {
+                params_base.speculative.type = COMMON_SPECULATIVE_TYPE_NONE;
+                SRV_WRN("%s\n", "speculative decoding with this type is not supported by multimodal, it will be disabled");
             }
         }
 
@@ -888,8 +918,8 @@ struct server_context_impl {
             if (can_spec) {
                 slot.spec = common_speculative_init(params_base.speculative, slot.ctx);
                 if (slot.spec) {
-                    if (mctx) {
-                        SRV_ERR("%s\n", "speculative decoding is not supported with multimodal");
+                    if (mctx && !common_speculative_all_impls_mtmd_safe(slot.spec)) {
+                        SRV_ERR("%s\n", "speculative decoding with this type is not supported with multimodal");
                         return false;
                     }
                     // MTP reads target's KV memory by sequence id; bind to slot.id (server uses slot.id as seq_id).
@@ -2205,14 +2235,22 @@ struct server_context_impl {
             // generate draft tokens in speculative decoding mode
             // TODO: rework to have a single draft llama_context shared across all slots [TAG_SERVER_SPEC_REWORK]
             //       perform the speculative drafting for all sequences at the same time in a single batch
-            const int n_draft_max = slot.get_n_draft_max();
-            if (n_draft_max > 0) {
-                if (mctx) {
-                    // we should never reach this, as speculative is automatically disabled if mmproj is loaded
-                    GGML_ABORT("not supported by multimodal");
-                }
+            const int n_draft_max_raw = slot.get_n_draft_max();
+            const bool mtmd_safe_spec = slot.spec && common_speculative_all_impls_mtmd_safe(slot.spec);
+            if (mctx && n_draft_max_raw > 0 && !mtmd_safe_spec) {
+                GGML_ABORT("not supported by multimodal");
+            }
 
-                const llama_tokens & cached_text_tokens = slot.prompt.tokens.get_text_tokens();
+            // NextN/MTP prime requires per-token target hidden states which the mtmd image-decode
+            // path does not produce. Until that is wired in, skip drafting for slots whose prompt
+            // contains image chunks - the slot still works as a normal (non-speculative) decode.
+            const bool skip_draft_mtmd = mctx && slot.prompt.tokens.has_mtmd;
+            const int  n_draft_max     = skip_draft_mtmd ? 0 : n_draft_max_raw;
+
+            if (n_draft_max > 0) {
+                static const llama_tokens k_empty_prompt_tgt;
+                const llama_tokens & cached_text_tokens =
+                        (mctx && mtmd_safe_spec) ? k_empty_prompt_tgt : slot.prompt.tokens.get_text_tokens();
 
                 const auto & params_spec = slot.task->params.speculative;
 
@@ -3008,7 +3046,15 @@ struct server_context_impl {
                     slot.state = SLOT_STATE_GENERATING;
 
                     if (slot.can_speculate()) {
-                        common_speculative_begin(slot.spec, slot.prompt.tokens.get_text_tokens());
+                        if (slot.prompt.tokens.has_mtmd) {
+                            // Skip spec begin/prime for mtmd prompts: the per-token target hidden
+                            // states for image positions are not currently produced, which makes
+                            // NextN prime partial and could desync RoPE positions on later drafts.
+                            // The slot will still generate correctly via the non-speculative path.
+                            SLT_INF(slot, "%s", "skipping speculative prime for multimodal prompt\n");
+                        } else {
+                            common_speculative_begin(slot.spec, slot.prompt.tokens.get_text_tokens());
+                        }
                     }
                 } else if (slot.state != SLOT_STATE_GENERATING) {
                     continue; // continue loop of slots
@@ -3099,7 +3145,11 @@ struct server_context_impl {
                 slot.prompt.tokens.keep_first(slot.prompt.n_tokens() - n_draft);
 
                 // add accepted tokens to the prompt
-                slot.prompt.tokens.insert({ids.begin(), ids.end() - 1});
+                // note: use push_back loop instead of insert() so mtmd prompts work too
+                //       (server_tokens::insert asserts !has_mtmd; push_back is mtmd-safe).
+                for (auto it = ids.begin(); it != ids.end() - 1; ++it) {
+                    slot.prompt.tokens.push_back(*it);
+                }
                 slot.sampled = ids.back(); // last accepted token
 
                 llama_context_nextn_seq_rm(ctx, slot.id, slot.prompt.n_tokens(), -1);
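
The fallback order used by `server_nextn_mtmd_fill_token` (first candidate piece that tokenizes to exactly one token, then the vocab's pad token, then `0`) can be shown without a loaded model. This is a sketch only: `tokenize_fn` and `pick_fill_token` are hypothetical, and the tokenizer is injected as a callback instead of calling `llama_tokenize`.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

using token = std::int32_t;       // stand-in for llama_token
constexpr token TOKEN_NULL = -1;  // assumed stand-in for LLAMA_TOKEN_NULL

// Hypothetical tokenizer callback: maps a text piece to its token ids.
using tokenize_fn = std::function<std::vector<token>(const std::string &)>;

// Picks the token used to fill media placeholders:
//   1. the first candidate piece that maps to exactly one token,
//   2. otherwise the pad token, if the vocab has one,
//   3. otherwise token 0.
token pick_fill_token(const tokenize_fn & tokenize, token pad) {
    static const char * const k_candidates[] = {
        "<|image_pad|>",
        "<|IMAGE_PAD|>",
        "<|vision_pad|>",
    };
    for (const char * piece : k_candidates) {
        const std::vector<token> ids = tokenize(piece);
        if (ids.size() == 1) {
            return ids[0];
        }
    }
    return pad != TOKEN_NULL ? pad : 0;
}
```

Injecting the tokenizer keeps the selection order testable: a vocab that knows `<|image_pad|>` as a single special token yields that id, and a vocab that splits every candidate into several tokens falls through to the pad token or to `0`.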