
Commit 0a635dc

Merge pull request #14 from AtomicBot-ai/b1-mtp-qwen-rebase
Enhance multimodal support and speculative decoding in atomic-llama-c…
2 parents 8893692 + ead60fb commit 0a635dc

9 files changed

Lines changed: 174 additions & 14 deletions

NEXTN.md

Lines changed: 32 additions & 0 deletions
@@ -235,3 +235,35 @@ What's actually in each repo, and why it's a bit unusual for a quant drop:
 - **Apache-2.0**, attribution: Qwen team (weights), Unsloth (imatrix + BF16 sources), [@TheTom](https://github.com/TheTom) (TurboQuant), AtomicChat (UDT masks + packaging). Fork: [`AtomicBot-ai/atomic-llama-cpp-turboquant`](https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant).
 
 The whole pipeline (download → quantize on H100 → bench on M4 Max → upload) is scripted in [`docs/qwen-udt/RUNBOOK.md`](../docs/qwen-udt/RUNBOOK.md); re-running it on the same Unsloth sources reproduces the published files bit-identical.
+
+---
+
+## 10. Multimodal (`--mmproj`) + speculative decoding (this fork)
+
+Upstream `llama-server` used to disable **all** speculative modes whenever a projector was loaded, so a single Qwen 3.6 / Gemma 4 server could not host vision and a draft head at the same time. In **atomic-llama-cpp-turboquant** the load-time and slot-init gates accept `--mmproj` together with:
+
+- **`--spec-type mtp`** (Gemma 4 assistant)
+- **`--spec-type nextn`** (Qwen3 NextN draft context)
+- **`--spec-type eagle3`** (stub impl; same contract)
+
+These three never look at the flattened `prompt_tgt` token stream; they read target hidden states / KV directly, so they can coexist with mtmd image chunks. Other speculative modes stay disabled with a warning (separate **`draft`** models and all **`ngram_*`** modes), as do **`ctx_shift`** and **`cache_reuse`**.
+
+### What is and is not accelerated today
+
+- **Text-only turns on a multimodal slot**: the draft head runs as usual, with the same acceptance rates as the no-mmproj configuration.
+- **Turns that contain an image chunk**: the server logs `skipping speculative prime for multimodal prompt` and falls back to plain target decoding for **that turn only**. The slot keeps generating text correctly, just without the draft speedup.
+
+The reason for the fallback: NextN / MTP `begin()` needs the target's pre-norm hidden state at every prompt position, but the mtmd image-decode path only writes outputs for the last row of an image batch (`get_embeddings_pre_norm_ith` returns `null` for image-pad positions; see `tools/server/server-context.cpp`). Until image chunks emit per-token outputs, priming on a mixed token stream would leave the draft KV only partially seeded and desynced from the target by the image-expanded positions. Skipping the prime keeps the slot stable and lets the next pure-text turn re-enable drafting from scratch.
+
+### Verified configurations
+
+| Model | Spec | KV | mmproj | Image | Text reply | Decode |
+|---|---|---|---|---|---|---|
+| Qwen 3.6-35B-A3B-UDT-Q4_K_XL_MTP | `nextn` | turbo3 | F16 | recognised | OK | ~69 t/s |
+| Gemma 4-26B-A4B-it-UD-Q4_K_XL | `mtp` | turbo3 | F16 | recognised | OK | ~55 t/s |
+
+Both runs were validated on an M4 Max with a single shared model file (no second mmap), `-c 4096`, `-fa on`.
+
+### Roadmap
+
+Real draft acceleration on the vision turn itself requires making mtmd image batches emit per-token outputs (or a teacher-forced replay through the target). This is tracked as a follow-up and does not block this fork's release.
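
Reviewer note: the per-turn fallback documented in NEXTN.md §10 comes down to zeroing the draft budget whenever a projector is loaded and the slot's prompt holds an image chunk. The sketch below is a minimal standalone illustration of that decision, not the fork's code. The `turn_state` struct and the `effective_n_draft_max` helper are hypothetical names introduced here; the fork computes `skip_draft_mtmd` inline in `tools/server/server-context.cpp`.

```cpp
#include <cassert>

// Hypothetical, simplified view of what the server consults on each turn.
struct turn_state {
    bool mmproj_loaded;    // stands in for `mctx != nullptr` in the server
    bool prompt_has_media; // stands in for `slot.prompt.tokens.has_mtmd`
    int  n_draft_max_raw;  // draft budget the slot would use without the gate
};

// Returns the draft budget that applies to this turn: zero when the prompt
// carries an image chunk on a multimodal server, the raw budget otherwise.
int effective_n_draft_max(const turn_state & t) {
    const bool skip_draft_mtmd = t.mmproj_loaded && t.prompt_has_media;
    return skip_draft_mtmd ? 0 : t.n_draft_max_raw;
}
```

With a budget of zero the existing `if (n_draft_max > 0)` branch is simply not taken, so the turn runs through the ordinary target-decode path without any separate code path for vision turns.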

README.md

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ LLM inference in C/C++
 - [[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗](https://github.com/ggml-org/llama.cpp/discussions/15313)
 - Support for the `gpt-oss` model with native MXFP4 format has been added | [PR](https://github.com/ggml-org/llama.cpp/pull/15091) | [Collaboration with NVIDIA](https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss) | [Comment](https://github.com/ggml-org/llama.cpp/discussions/15095)
 - Multimodal support arrived in `llama-server`: [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) | [documentation](./docs/multimodal.md)
+- **This fork:** `--mmproj` can be loaded **alongside** `mtp` / `nextn` / `eagle3` speculative decoding on a single slot (validated on Qwen 3.6-35B-A3B-UDT + NextN + turbo3 KV and Gemma 4-26B-A4B + MTP + turbo3 KV). Draft acceleration applies to **text-only turns**; image-bearing turns fall back to plain target decoding (image still recognised). Other spec types remain disabled with multimodal. Details: [docs/speculative.md](./docs/speculative.md) and [NEXTN.md](./NEXTN.md) §10.
 - VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
 - Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
 - Hugging Face Inference Endpoints now support GGUF out of the box! https://github.com/ggml-org/llama.cpp/discussions/9669

common/speculative.cpp

Lines changed: 23 additions & 0 deletions
@@ -1700,6 +1700,29 @@ enum common_speculative_type common_speculative_type_from_name(const std::string
     return it->second;
 }
 
+bool common_speculative_is_mtmd_safe(enum common_speculative_type type) {
+    switch (type) {
+        case COMMON_SPECULATIVE_TYPE_MTP:
+        case COMMON_SPECULATIVE_TYPE_NEXTN:
+        case COMMON_SPECULATIVE_TYPE_EAGLE3:
+            return true;
+        default:
+            return false;
+    }
+}
+
+bool common_speculative_all_impls_mtmd_safe(const common_speculative * spec) {
+    if (!spec) {
+        return true;
+    }
+    for (const auto & impl : spec->impls) {
+        if (!common_speculative_is_mtmd_safe(impl->type)) {
+            return false;
+        }
+    }
+    return true;
+}
+
 bool common_speculative_is_compat(llama_context * ctx_tgt) {
     auto * mem = llama_get_memory(ctx_tgt);
     if (mem == nullptr) {
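
Reviewer note: the two predicates above can be exercised without the rest of llama.cpp. The following is a rough standalone sketch under stated assumptions: `spec_type`, `spec_impl`, `spec_chain` and `make_chain` are stand-ins invented for this example (the real `common_speculative_type` enum and `common_speculative` struct have more members), and only the control flow mirrors the diff.

```cpp
#include <cassert>
#include <initializer_list>
#include <memory>
#include <vector>

// Stand-in enum; only the three "safe" names correspond to types in the diff.
enum spec_type { SPEC_NONE, SPEC_DRAFT, SPEC_NGRAM_SIMPLE, SPEC_MTP, SPEC_NEXTN, SPEC_EAGLE3 };

// Stand-ins for one registered implementation and the per-slot chain of them.
struct spec_impl  { spec_type type; };
struct spec_chain { std::vector<std::unique_ptr<spec_impl>> impls; };

// True for modes that read target KV / hidden states instead of prompt tokens.
bool is_mtmd_safe(spec_type type) {
    switch (type) {
        case SPEC_MTP:
        case SPEC_NEXTN:
        case SPEC_EAGLE3:
            return true;
        default:
            return false;
    }
}

// True iff every implementation in the chain is safe; a null chain counts as safe.
bool all_impls_mtmd_safe(const spec_chain * spec) {
    if (!spec) {
        return true;
    }
    for (const auto & impl : spec->impls) {
        if (!is_mtmd_safe(impl->type)) {
            return false;
        }
    }
    return true;
}

// Hypothetical helper that builds a chain for experimentation.
spec_chain make_chain(std::initializer_list<spec_type> types) {
    spec_chain chain;
    for (spec_type t : types) {
        chain.impls.push_back(std::make_unique<spec_impl>(spec_impl{t}));
    }
    return chain;
}
```

A chain such as `make_chain({SPEC_NEXTN, SPEC_NGRAM_SIMPLE})` is rejected as a whole, which matches the "mixed chains" wording in the header comment: one unsafe implementation is enough to fail the check.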

common/speculative.h

Lines changed: 7 additions & 0 deletions
@@ -18,6 +18,13 @@ std::string common_speculative_type_to_str(enum common_speculative_type type);
 // note: clears the memory of the context
 bool common_speculative_is_compat(llama_context * ctx_tgt);
 
+// True for speculative modes that do not consume prompt_tgt in common_speculative_draft()
+// (MTP / NextN / EAGLE3 read target KV / hidden states instead). Safe to combine with --mmproj.
+bool common_speculative_is_mtmd_safe(enum common_speculative_type type);
+
+// True iff every registered impl is mtmd-safe (rejects mixed chains e.g. ngram + draft model).
+bool common_speculative_all_impls_mtmd_safe(const common_speculative * spec);
+
 common_speculative * common_speculative_init(
         common_params_speculative & params,
         llama_context * ctx_tgt);

docs/speculative.md

Lines changed: 13 additions & 0 deletions
@@ -8,6 +8,19 @@ llama.cpp supports speculative decoding, a technique that can significantly acce
 
 The `llama-server` application supports several implementations of speculative decoding. An implementation with draft model can be mixed with an implementation without draft model.
 
+### Multimodal (`--mmproj`) compatibility (atomic-llama-cpp-turboquant)
+
+When `--mmproj` is set, the **`mtp`**, **`nextn`**, and **`eagle3`** speculative types remain **enabled at load**: their draft paths do not depend on `get_text_tokens()` / `prompt_tgt` the way `draft` and `ngram_*` do. Other types are auto-disabled at load with a warning. Mixed speculative chains (e.g. `ngram_simple` + `draft`) are rejected at slot init if any impl is not multimodal-safe.
+
+**Per-turn behaviour:**
+
+- **Text-only turns on a multimodal slot**: the draft head runs as usual (same acceptance as without `--mmproj`).
+- **Turns containing an image chunk**: the slot logs `skipping speculative prime for multimodal prompt` and falls back to plain target decoding for that turn only. The image is still recognised correctly, just without the draft speedup.
+
+The fallback is required because the NextN / MTP prime needs per-token target hidden states for every prompt position, but mtmd image-decode currently only emits an output row for the last token of each image batch. Lifting this restriction is on the roadmap (mtmd batches need to mark every position with `logits[i] = true`, or be replayed teacher-forced).
+
+See `common_speculative_is_mtmd_safe` / `common_speculative_all_impls_mtmd_safe` in [`common/speculative.cpp`](../common/speculative.cpp), and the `mmproj` gates and the `skip_draft_mtmd` per-turn gate in [`tools/server/server-context.cpp`](../tools/server/server-context.cpp). Validated configurations and the end-to-end recipe are documented in [`NEXTN.md` §10](../NEXTN.md#10-multimodal---mmproj--speculative-decoding-this-fork).
+
 ### Draft Model (`draft`)
 
 A much smaller model (called the _draft model_) generates drafts.
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+#!/usr/bin/env bash
+# Gemma 4 E4B + MTP draft + multimodal projector (vision) + TurboQuant3 KV.
+# Thin wrapper around run-gemma4-e4b-mtp-server.sh; override MMPROJ_GGUF if your mmproj lives elsewhere.
+#
+# Behaviour: text-only turns get the full MTP draft speedup; turns that include
+# an image chunk fall back to plain target decoding for that turn only and the
+# server logs "skipping speculative prime for multimodal prompt". See NEXTN.md §10.
+
+set -euo pipefail
+
+ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+MMPROJ_GGUF="${MMPROJ_GGUF:-${ROOT}/.scratch/gemma-4-e4b/mmproj-F16.gguf}"
+
+if [[ ! -f "$MMPROJ_GGUF" ]]; then
+  echo "error: mmproj not found: ${MMPROJ_GGUF}" >&2
+  echo "hint: place mmproj-F16.gguf next to your E4B GGUF or export MMPROJ_GGUF=/path/to/mmproj.gguf" >&2
+  exit 1
+fi
+
+exec env SPEC="${SPEC:-mtp}" \
+  bash "${ROOT}/scripts/run-gemma4-e4b-mtp-server.sh" --mmproj "$MMPROJ_GGUF" "$@"

tools/server/server-common.cpp

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -388,6 +388,15 @@ const llama_tokens & server_tokens::get_text_tokens() const {
388388
return tokens;
389389
}
390390

391+
llama_tokens server_tokens::replace_media_null_tokens(llama_token replacement) const {
392+
llama_tokens out;
393+
out.reserve(tokens.size());
394+
for (const llama_token t : tokens) {
395+
out.push_back(t == LLAMA_TOKEN_NULL ? replacement : t);
396+
}
397+
return out;
398+
}
399+
391400
void server_tokens::set_token(llama_pos pos, llama_token id) {
392401
GGML_ASSERT(!has_mtmd); // only allow this if mtmd is disabled
393402
tokens[pos] = id;
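
Reviewer note: `replace_media_null_tokens` is a plain element-wise substitution that returns a copy and leaves the slot's own token list untouched. A minimal standalone sketch follows, assuming `token` is a 32-bit integer and `TOKEN_NULL` is `-1` (the value `LLAMA_TOKEN_NULL` is defined as in `llama.h`); the free-function form is an adaptation for illustration, since the real method is a member of `server_tokens`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using token = int32_t;
constexpr token TOKEN_NULL = -1; // assumed to match LLAMA_TOKEN_NULL

// Returns a copy of `tokens` in which every media placeholder (TOKEN_NULL)
// is replaced by `replacement`; ordinary text tokens are copied unchanged.
std::vector<token> replace_media_null_tokens(const std::vector<token> & tokens, token replacement) {
    std::vector<token> out;
    out.reserve(tokens.size());
    for (const token t : tokens) {
        out.push_back(t == TOKEN_NULL ? replacement : t);
    }
    return out;
}
```

Because the output has the same length as the input, token positions stay aligned with the original prompt, which is what a caller needs if it wants a flat token stream for priming.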

tools/server/server-common.h

Lines changed: 4 additions & 0 deletions
@@ -189,6 +189,10 @@ struct server_tokens {
     // for compatibility with speculative decoding, ctx shift, slot save/load
     const llama_tokens & get_text_tokens() const;
 
+    // Replace LLAMA_TOKEN_NULL (mtmd media placeholders) for callers that need a flat token stream
+    // (e.g. Qwen NextN speculative begin / prime). Text tokens are unchanged.
+    llama_tokens replace_media_null_tokens(llama_token replacement) const;
+
     // for compatibility with speculative decoding
     void set_token(llama_pos pos, llama_token id);

tools/server/server-context.cpp

Lines changed: 64 additions & 14 deletions
@@ -16,6 +16,7 @@
 #include <cstddef>
 #include <cinttypes>
 #include <cstdio>
+#include <cstring>
 #include <vector>
 #include <exception>
 #include <memory>
@@ -33,6 +34,34 @@
 
 using json = nlohmann::ordered_json;
 
+namespace {
+
+// Token used to replace LLAMA_TOKEN_NULL placeholders when priming Qwen NextN draft KV (see common_speculative_begin).
+static llama_token server_nextn_mtmd_fill_token(const llama_model * model) {
+    const llama_vocab * vocab = llama_model_get_vocab(model);
+    if (!vocab) {
+        return 0;
+    }
+    static const char * const k_candidates[] = {
+        "<|image_pad|>",
+        "<|IMAGE_PAD|>",
+        "<|vision_pad|>",
+    };
+    std::vector<llama_token> buf(32);
+    for (const char * piece : k_candidates) {
+        const int32_t n = llama_tokenize(
+            vocab, piece, (int32_t) std::strlen(piece),
+            buf.data(), (int32_t) buf.size(), false, true);
+        if (n == 1) {
+            return buf[0];
+        }
+    }
+    const llama_token pad = llama_vocab_pad(vocab);
+    return pad != LLAMA_TOKEN_NULL ? pad : 0;
+}
+
+} // namespace
+
 constexpr int HTTP_POLLING_SECONDS = 1;
 
 // state diagram: https://github.com/ggml-org/llama.cpp/pull/9283
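
Reviewer note: `server_nextn_mtmd_fill_token` follows a three-step fallback: the first candidate piece that tokenizes to exactly one token wins, then the vocabulary's pad token, then token `0`. The sketch below isolates that selection logic so it can be checked without a model. The `tokenize_fn` callback type and the `pick_fill_token` name are hypothetical, introduced only for this example; the real helper calls `llama_tokenize` and `llama_vocab_pad` directly.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

using token = int32_t;
constexpr token TOKEN_NULL = -1; // assumed to match LLAMA_TOKEN_NULL

// Hypothetical tokenizer hook: maps a text piece to token ids with special
// tokens parsed. It replaces the llama_tokenize call for illustration only.
using tokenize_fn = std::function<std::vector<token>(const std::string &)>;

// Picks the token used to fill media placeholders:
//   1. the first candidate piece that maps to exactly one token,
//   2. otherwise the vocabulary's pad token if it has one,
//   3. otherwise token 0.
token pick_fill_token(const tokenize_fn & tokenize, token pad) {
    static const char * const k_candidates[] = {
        "<|image_pad|>",
        "<|IMAGE_PAD|>",
        "<|vision_pad|>",
    };
    for (const char * piece : k_candidates) {
        const std::vector<token> ids = tokenize(piece);
        if (ids.size() == 1) {
            return ids[0];
        }
    }
    return pad != TOKEN_NULL ? pad : 0;
}
```

Requiring exactly one token is the important detail: a piece that the vocabulary splits into several ordinary tokens is not a dedicated placeholder, so it is skipped instead of being used as a fill value.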
@@ -820,9 +849,10 @@ struct server_context_impl {
                 SRV_WRN("%s\n", "cache_reuse is not supported by multimodal, it will be disabled");
             }
 
-            if (params_base.speculative.type != COMMON_SPECULATIVE_TYPE_NONE) {
-                params_base.speculative.type = COMMON_SPECULATIVE_TYPE_NONE;
-                SRV_WRN("%s\n", "speculative decoding is not supported by multimodal, it will be disabled");
+            if (params_base.speculative.type != COMMON_SPECULATIVE_TYPE_NONE &&
+                !common_speculative_is_mtmd_safe(params_base.speculative.type)) {
+                params_base.speculative.type = COMMON_SPECULATIVE_TYPE_NONE;
+                SRV_WRN("%s\n", "speculative decoding with this type is not supported by multimodal, it will be disabled");
             }
         }
 
@@ -888,8 +918,8 @@ struct server_context_impl {
             if (can_spec) {
                 slot.spec = common_speculative_init(params_base.speculative, slot.ctx);
                 if (slot.spec) {
-                    if (mctx) {
-                        SRV_ERR("%s\n", "speculative decoding is not supported with multimodal");
+                    if (mctx && !common_speculative_all_impls_mtmd_safe(slot.spec)) {
+                        SRV_ERR("%s\n", "speculative decoding with this type is not supported with multimodal");
                         return false;
                     }
                     // MTP reads target's KV memory by sequence id; bind to slot.id (server uses slot.id as seq_id).
@@ -2205,14 +2235,22 @@ struct server_context_impl {
                 // generate draft tokens in speculative decoding mode
                 // TODO: rework to have a single draft llama_context shared across all slots [TAG_SERVER_SPEC_REWORK]
                 // perform the speculative drafting for all sequences at the same time in a single batch
-                const int n_draft_max = slot.get_n_draft_max();
-                if (n_draft_max > 0) {
-                    if (mctx) {
-                        // we should never reach this, as speculative is automatically disabled if mmproj is loaded
-                        GGML_ABORT("not supported by multimodal");
-                    }
+                const int n_draft_max_raw = slot.get_n_draft_max();
+                const bool mtmd_safe_spec = slot.spec && common_speculative_all_impls_mtmd_safe(slot.spec);
+                if (mctx && n_draft_max_raw > 0 && !mtmd_safe_spec) {
+                    GGML_ABORT("not supported by multimodal");
+                }
 
-                    const llama_tokens & cached_text_tokens = slot.prompt.tokens.get_text_tokens();
+                // NextN/MTP prime requires per-token target hidden states which the mtmd image-decode
+                // path does not produce. Until that is wired in, skip drafting for slots whose prompt
+                // contains image chunks - the slot still works as a normal (non-speculative) decode.
+                const bool skip_draft_mtmd = mctx && slot.prompt.tokens.has_mtmd;
+                const int n_draft_max = skip_draft_mtmd ? 0 : n_draft_max_raw;
+
+                if (n_draft_max > 0) {
+                    static const llama_tokens k_empty_prompt_tgt;
+                    const llama_tokens & cached_text_tokens =
+                        (mctx && mtmd_safe_spec) ? k_empty_prompt_tgt : slot.prompt.tokens.get_text_tokens();
 
                     const auto & params_spec = slot.task->params.speculative;
 
@@ -3008,7 +3046,15 @@ struct server_context_impl {
                     slot.state = SLOT_STATE_GENERATING;
 
                     if (slot.can_speculate()) {
-                        common_speculative_begin(slot.spec, slot.prompt.tokens.get_text_tokens());
+                        if (slot.prompt.tokens.has_mtmd) {
+                            // Skip spec begin/prime for mtmd prompts: the per-token target hidden
+                            // states for image positions are not currently produced, which makes
+                            // NextN prime partial and could desync RoPE positions on later drafts.
+                            // The slot will still generate correctly via the non-speculative path.
+                            SLT_INF(slot, "%s", "skipping speculative prime for multimodal prompt\n");
+                        } else {
+                            common_speculative_begin(slot.spec, slot.prompt.tokens.get_text_tokens());
+                        }
                     }
                 } else if (slot.state != SLOT_STATE_GENERATING) {
                     continue; // continue loop of slots
@@ -3099,7 +3145,11 @@ struct server_context_impl {
                 slot.prompt.tokens.keep_first(slot.prompt.n_tokens() - n_draft);
 
                 // add accepted tokens to the prompt
-                slot.prompt.tokens.insert({ids.begin(), ids.end() - 1});
+                // note: use push_back loop instead of insert() so mtmd prompts work too
+                // (server_tokens::insert asserts !has_mtmd; push_back is mtmd-safe).
+                for (auto it = ids.begin(); it != ids.end() - 1; ++it) {
+                    slot.prompt.tokens.push_back(*it);
+                }
                 slot.sampled = ids.back(); // last accepted token
 
                 llama_context_nextn_seq_rm(ctx, slot.id, slot.prompt.n_tokens(), -1);
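
Reviewer note: the last hunk swaps a bulk `insert` for a `push_back` loop over every accepted id except the last, which becomes the sampled token. Both the old and the new form evaluate `ids.end() - 1` and `ids.back()`, so they rely on `ids` being non-empty. The sketch below makes that precondition explicit; `commit_accepted` is a hypothetical helper name and a plain `std::vector<int>` stands in for `server_tokens`.

```cpp
#include <cassert>
#include <vector>

using token = int;

// Appends all accepted ids except the last to `prompt` and returns the last
// id, which the caller keeps as the freshly sampled token.
// Precondition: `ids` is non-empty (otherwise `ids.end() - 1` is undefined).
token commit_accepted(std::vector<token> & prompt, const std::vector<token> & ids) {
    assert(!ids.empty());
    for (auto it = ids.begin(); it != ids.end() - 1; ++it) {
        prompt.push_back(*it);
    }
    return ids.back();
}
```

When only one id is accepted the loop body never runs, so the prompt is left unchanged and that single id is returned as the sampled token.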
