llama + spec: MTP support by am17an · Pull Request #3 · am17an/llama.cpp

am17an · 2026-05-03T16:52:01Z

Overview

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:

…ogic (ggml-org#22456) * Refactor buffer aliasing to be part of shader lib decisions * cleanup * formatting

Some SPIR-V compilers (notably mesa) don't handle the current vulkan Q4_K/Q5_K scale load pattern in mul_mat particularly well. While reading three `u8`s from the 12-byte scale array should (at least on some hardware) result in loading the full 12 bytes in a single LOAD followed by whatever extraction is needed, at least the ANV Intel driver really can't practically perform this optimization. `mesa`'s unsigned upper bound logic doesn't handle tracking bounds through ternary, resulting in the `(is < 4) ? ... : is - 4` having an infinite upper bound (as it cannot prove `is - 4` doesn't underflow). While this could still be rectified if mesa looked at the array bounds, it currently doesn't and `glslc` currently emits SPIR-V that doesn't allow for this optimization anyway (though maybe it will at some point, see KhronosGroup/glslang#4206). In mul_mat_vecq we took a different approach to loading the same fields. We read the first two bytes we needed from `scale` then took a branch before deciding whether we needed to read a third byte. In mesa this did, indeed, lead to a top-level branch with conditional loads. As such these loads ended up not being coalesced either (at least in the ANV driver) resulting in additional instructions in our hot loop. Instead, here, we go ahead and force loading the full 12 bytes and extract the bits we need from the packed-u32s instead. In mul_mat there's a few less ternaries and only one extra shift, so even on drivers that did optimize the previous loads properly the only material change should be pulling a few extra bytes into registers (which on most hardware won't cost anything anyway, though ironically on Intel it theoretically could). In mul_mat_vecq this requires a bit of extra math and may read bytes from the u32 that weren't needed, but it seems likely avoiding the branch is a win on most platforms. On Intel Xe2/mesa 26.0.4 with the optimizations from https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15162, for shader matmul_id_subgroup_q4_k_f32_f16acc_aligned_l: * Instruction Count: 2753 -> 2688 * SEND Count: 269 -> 261 * Cycle Count: 273976 -> 266138 * Max live registers: 248 -> 246 * Non SSA regs after NIR: 381 -> 382 for shader matmul_id_subgroup_q5_k_f32_f16acc_aligned_l: * Instruction Count: 2767 -> 2702 * SEND Count: 271 -> 263 * Cycle Count: 274140 -> 268144 * Max live registers: 248 -> 246 * Non SSA regs after NIR: 381 -> 382 for shader mul_mat_vec_id_q4_k_q8_1_f32: * Instruction Count: 1930 -> 1646 * SEND Count: 116 -> 71 * Cycle Count: 1348306 -> 843350 * Max live registers: 78 -> 84 * Non SSA regs after NIR: 300 -> 135 for shader mul_mat_vec_id_q5_k_q8_1_f32: * Instruction Count: 2207 -> 1922 * SEND Count: 131 -> 86 * Cycle Count: 1392012 -> 1037836 * Max live registers: 90 -> 90 * Non SSA regs after NIR: 300 -> 135 for shader mul_mat_vec_q4_k_q8_1_f32: * Instruction Count: 2029 -> 1749 * SEND Count: 111 -> 66 * Cycle Count: 1347278 -> 840118 * Max live registers: 74 -> 80 * Non SSA regs after NIR: 299 -> 134 for shader mul_mat_vec_q5_k_q8_1_f32: * Instruction Count: 2307 -> 2022 * SEND Count: 126 -> 81 * Cycle Count: 1379820 -> 954042 * Max live registers: 86 -> 86 * Non SSA regs after NIR: 299 -> 134 On one Arc Pro B60, unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL: * pp512: 907.34 ± 9.28 -> 941.94 ± 10.53 (+4%) * pp2048: 897.95 ± 1.82 -> 931.55 ± 1.79 (+4%) * tg128: 49.49 ± 0.02 -> 49.86 ± 0.05 (+ <1%) On one Arc Pro B60, unsloth/Qwen3.5-27B-GGUF:Q4_K_S: * pp512: 324.13 ± 10.52 -> 354.33 ± 6.81 (+9%) * pp2048: 329.80 ± 0.25 -> 357.10 ± 0.06 (+8%) * tg128: 17.11 ± 0.01 -> 18.11 ± 0.01 (+6%) On four Arc Pro B60s, unsloth/Qwen3.5-122B-A10B-GGUF:Q5_K_S with -sm layer (note that -sm tensor improvements will naturally be less): * pp512: 264.55 ± 2.81 -> 280.45 ± 3.94 (+6%) * pp2048: 319.32 ± 2.72 -> 335.70 ± 3.48 (+5%) * tg128: 26.39 ± 0.01 -> 26.67 ± 0.01 (+1%)

…22323) DONE state absorbs all tokens including a new start tag, causing any think blocks after the first to run unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF which interleaves multiple <think> blocks per response. Fixed by advancing start_matcher in DONE branch and re-arming to COUNTING with a fresh budget on match. Adds regression test (test-reasoning-budget: test 6).

This commit adds support for NVIDIA Nemotron Nano 3 Omni model enabling this model to be converted to GGUF.

ggml-org#22286) * ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (GQA=32) Adds MMA-f16 and tile kernel configs, dispatch logic, template instances, and tile .cu file for Mistral Small 4 (head sizes 320/256), restricting to ncols2=32 to support GQA ratio 32 only. * Adding check to return BEST_FATTN_KERNEL_NONE in case GQA!=32 * Apply suggestions from code review Address review comments Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Address review comments and making kernel config default to DQK=512, DV=512 instead of DQK=256,DV=256 * Fixed bug with sinks=1, with ncols=32, there are two warp-groups created but sinks index is same(0,...,15) for both the groups hence with sinks=1, output is not matching with CPU output. Added sink_base which will be base index for each warp_group (threadIdx.y / np) * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/template-instances/generate_cu_files.py Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

…22196)

…1916) * Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel * Change arrays to static const in repack.cpp --------- Co-authored-by: Vithulep <prashant.vithule@fujitsu.com>

…gml-org#22273) * Changed to leak logger singleton to prevent hanging on Windows * Fix comment * Stopped using static vector Using std::vector will cause g_col to be released before the logger thread exits, causing the logger thread to touch freed memory causing a crash * Change so all logs are output before exit * Added debug logging * added more logging * Added logging * Explicitly free logger to avoid hanging on Win * Reverted to leak logger instance again * Removed debug log and fixed comment * Fixed comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix flashattention support check for devices that don't support subgroups * set path to none if kv_tile doesn't fit

…2317) * ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME When GGML_CPU_RISCV64_SPACEMIT=ON is set, ime1_kernels.cpp contains inline asm for the vmadot family which requires the xsmtvdotii custom extension.(problem can see in some blogs and make sure in K3 platform) The current CMakeLists does not include xsmtvdotii, so any toolchain that honours the explicit -march (tested with SpacemiT GCC 15.2) fails at the assembler stage: Error: unrecognized opcode `vmadot v16,v14,v0', extension `xsmtvdotii' required Append _xsmtvdotii to MARCH_STR when GGML_CPU_RISCV64_SPACEMIT is enabled so the IME path can actually build with a capable toolchain. No effect on builds that leave GGML_CPU_RISCV64_SPACEMIT off. toolchain from https://www.spacemit.com/community/resources-download/Tools * Update ggml/src/ggml-cpu/CMakeLists.txt Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com> --------- Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com>

* ggml-cuda: refactor fusion code * apply formatting + make env variable truthy

…gml-org#22293) * ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault vec_xst operations in the tiled path crash on AIX when writing near 4KB page boundaries due to strict memory protection. Fall back to mnpack implementation on AIX for stable execution. Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com> * Update ggml/src/ggml-cpu/llamafile/sgemm.cpp Co-authored-by: Aaron Teo <taronaeo@gmail.com> * Update sgemm.cpp * Update sgemm.cpp --------- Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com> Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* webui: instant mic stop, race-free recorder restart * webui: faster WAV PCM encode via hoisted channels and Int16Array * chore: update webui build output * webui: drop setTimeout(0) hack and harden cancelRecording * chore: update webui build output

* hexagon: allow host to set max vmem size We use a sane default but it's helpful to allow for an override if needed. * hexagon: add support for measuring vmem space and move pinned mmaping management to host * hexagon: update vmem checks to use uint64 * hexagon: bump op buffers to 16 (matches max mmaps) * hexagon: bump default vmem to 3.2GB * hexagon: add support for autodetecting vmem space and some logging cleanup in that area * hexagon: fix whitespace warnings * Update scripts/snapdragon/adb/run-cli.sh Co-authored-by: Pascal <admin@serveurperso.com> * hex-adb: fix run-completion script --------- Co-authored-by: Pascal <admin@serveurperso.com>

…org#22488)

* port ggml-org#22358 PR to examples/speculative/speculative.cpp * use vocab_[tgt,dft] instead of ctx_[tgt,dft] when logging on draft model / target model vocabulary mismatch Co-authored-by: Petros Sideris <petros.sideris@nokia.com>

* spec : fix draft model checkpoints * cont : clean-up * cont : gate the ngram-mod reset warning behind verbose flag

…22513) * scripts : add wc2wt.sh - create worktree from current HEAD Add a script to create a git worktree on a new branch from the current HEAD. Similar to pr2wt.sh but for local development branches instead of PRs. Usage: ./scripts/wc2wt.sh gg/new-feature ./scripts/wc2wt.sh gg/new-feature "bash -l" Assisted-by: llama.cpp:local pi * cont : no need to try to delete the branch

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* bump ty to 0.0.33 * update typings

* vulkan: add get/set_tensor_2d functions * fix backend interface comments * Update ggml/src/ggml-metal/ggml-metal.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

…ggml-org#22004) * git-friendly migration * add build_graph * nits * exclude old code from build * wip * add llm_arch_model_i * prepare downstream functions * nits * nits * wip * wip * add back create_tensor_qkv * fix files missing include * enforce one llm_build per arch * cmake: use glob * missing model params * nits * wip * wip (2) * wip (3) * test-llama-archs is happy * improve switch case * move more stuff into llm_arch_model_i * fix downstream code * nits * nits (2) * fix order * llama_model_base * LLAMA_LOAD_LOCALS * small fix * fix build errors * auto * rm migration script and ifdef

…ml-org#22654) * chat/autoparser: the fixes * Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls. * Trim whitespace on apply instead

…elte.ts (ggml-org#22625)

Currently speculative checkpoint needs to restart from a checkpoint after some draft tokens are not accepted, this leads to some wastage in running the target again. This PR adds the ability to rollback upto `draft_max` by storing the GDN intermediates.

reeselevine and others added 30 commits April 28, 2026 07:27

ggml-webgpu: fix buffer aliasing for ssm_scan and refactor aliasing l…

98bb579

…ogic (ggml-org#22456) * Refactor buffer aliasing to be part of shader lib decisions * cleanup * formatting

convert : add support for Nemotron Nano 3 Omni (ggml-org#22481)

5d56eff

This commit adds support for NVIDIA Nemotron Nano 3 Omni model enabling this model to be converted to GGUF.

ggml-cuda: Repost of 21896: Blackwell native NVFP4 support (ggml-org#…

fc2b005

…22196)

TP: fix delayed AllReduce + zero-sized slices (ggml-org#22489)

739393b

ggml : add sve tuned code for gemm_q8_0_4x8_q8_0() kernel (ggml-org#2…

bdc9c74

…1916) * Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel * Change arrays to static const in repack.cpp --------- Co-authored-by: Vithulep <prashant.vithule@fujitsu.com>

ggml-webgpu: Fix bug in FlashAttention support check (ggml-org#22492)

d6a5094

* Fix flashattention support check for devices that don't support subgroups * set path to none if kv_tile doesn't fit

ggml-cuda: refactor fusion code (ggml-org#22468)

3142f1d

* ggml-cuda: refactor fusion code * apply formatting + make env variable truthy

ggml : bump version to 0.10.1 (ggml/1469)

4b221b7

sync : ggml

b1d5f5b

spec : disacard last drafted token with low prob (ggml-org#22506)

683c5ac

CUDA: fuse SSM_CONV + ADD(bias) + SILU (ggml-org#22478)

098705a

common : do not pass prompt tokens to reasoning budget sampler (ggml-…

d775992

…org#22488)

spec : fix draft model checkpoints (ggml-org#22521)

80afa33

* spec : fix draft model checkpoints * cont : clean-up * cont : gate the ngram-mod reset warning behind verbose flag

add fast matmul iquants (ggml-org#22504)

4515559

CUDA: fix tile FA kernel on Pascal (ggml-org#22541)

e82aaf2

vendor : update cpp-httplib to 0.43.2 (ggml-org#22548)

5f0ab72

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

ci : bump ty to 0.0.33 (ggml-org#22535)

6118c04

* bump ty to 0.0.33 * update typings

spec: fix argument typo (ggml-org#22552)

c20c445

vulkan: add get/set tensor 2d functions (ggml-org#22514)

660b1b4

* vulkan: add get/set_tensor_2d functions * fix backend interface comments * Update ggml/src/ggml-metal/ggml-metal.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

common : check for null getpwuid in hf-cache (ggml-org#22550)

beb42ff

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

webui: restore missing settings (ggml-org#22666)

fa8feae

am17an force-pushed the mtp-clean branch 3 times, most recently from 02a65ab to 6b40a9f Compare May 4, 2026 08:57

eapache and others added 8 commits May 4, 2026 12:19

server: Add a simple get_datetime server tool (ggml-org#22649)

c84e6d6

common/autoparser: fixes for newline handling / forced tool calls (gg…

a4701c9

…ml-org#22654) * chat/autoparser: the fixes * Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls. * Trim whitespace on apply instead

webui : fix circular dependency between chat.service.ts and models.sv…

36a694c

…elte.ts (ggml-org#22625)

add enum for part sequence removal to enable checkpoints

589490f

review: rename rollback to rs_seq and remove public API

c5e0227

llama + spec: MTP support

10829db

am17an force-pushed the mtp-clean branch from 6b40a9f to 10829db Compare May 4, 2026 12:33

github-actions Bot added documentation Improvements or additions to documentation Nvidia GPU testing ggml Vulkan Apple Metal SYCL IBM zDNN AMD ZenDNN devops server/webui Ascend NPU OpenCL Hexagon WebGPU labels May 4, 2026

add qwen35moe_mtp

f8c6b03

am17an closed this May 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llama + spec: MTP support#3

llama + spec: MTP support#3
am17an wants to merge 65 commits into
hybrid-mem-slot-rollbackfrom
mtp-clean

am17an commented May 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Conversation

am17an commented May 3, 2026

Overview

Additional information

Requirements

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants