Skip to content

llama + spec: MTP support#3

Closed
am17an wants to merge 65 commits into
hybrid-mem-slot-rollbackfrom
mtp-clean
Closed

llama + spec: MTP support#3
am17an wants to merge 65 commits into
hybrid-mem-slot-rollbackfrom
mtp-clean

Conversation

@am17an
Copy link
Copy Markdown
Owner

@am17an am17an commented May 3, 2026

Overview

Additional information

Requirements

reeselevine and others added 30 commits April 28, 2026 07:27
…ogic (ggml-org#22456)

* Refactor buffer aliasing to be part of shader lib decisions

* cleanup

* formatting
Some SPIR-V compilers (notably mesa) don't handle the current
vulkan Q4_K/Q5_K scale load pattern in mul_mat particularly well.
While reading three `u8`s from the 12-byte scale array should (at
least on some hardware) result in loading the full 12 bytes in a
single LOAD followed by whatever extraction is needed, at least
the ANV Intel driver really can't practically perform this
optimization.

`mesa`'s unsigned upper bound logic doesn't handle tracking bounds
through ternary, resulting in the `(is < 4) ? ... : is - 4` having
an infinite upper bound (as it cannot prove `is - 4` doesn't
underflow). While this could still be rectified if mesa looked at
the array bounds, it currently doesn't and `glslc` currently emits
SPIR-V that doesn't allow for this optimization anyway (though
maybe it will at some point, see
KhronosGroup/glslang#4206).

In mul_mat_vecq we took a different approach to loading the same
fields. We read the first two bytes we needed from `scale` then
took a branch before deciding whether we needed to read a third
byte. In mesa this did, indeed, lead to a top-level branch with
conditional loads. As such these loads ended up not being
coalesced either (at least in the ANV driver) resulting in
additional instructions in our hot loop.

Instead, here, we go ahead and force loading the full 12 bytes and
extract the bits we need from the packed-u32s instead. In mul_mat
there's a few less ternaries and only one extra shift, so even on
drivers that did optimize the previous loads properly the only
material change should be pulling a few extra bytes into registers
(which on most hardware won't cost anything anyway, though
ironically on Intel it theoretically could). In mul_mat_vecq this
requires a bit of extra math and may read bytes from the u32 that
weren't needed, but it seems likely avoiding the branch is a win
on most platforms.

On Intel Xe2/mesa 26.0.4 with the optimizations from
https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15162,

for shader matmul_id_subgroup_q4_k_f32_f16acc_aligned_l:
 * Instruction Count: 2753 -> 2688
 * SEND Count: 269 -> 261
 * Cycle Count: 273976 -> 266138
 * Max live registers: 248 -> 246
 * Non SSA regs after NIR: 381 -> 382

for shader matmul_id_subgroup_q5_k_f32_f16acc_aligned_l:
 * Instruction Count: 2767 -> 2702
 * SEND Count: 271 -> 263
 * Cycle Count: 274140 -> 268144
 * Max live registers: 248 -> 246
 * Non SSA regs after NIR: 381 -> 382

for shader mul_mat_vec_id_q4_k_q8_1_f32:
 * Instruction Count: 1930 -> 1646
 * SEND Count: 116 -> 71
 * Cycle Count: 1348306 -> 843350
 * Max live registers: 78 -> 84
 * Non SSA regs after NIR: 300 -> 135

for shader mul_mat_vec_id_q5_k_q8_1_f32:
 * Instruction Count: 2207 -> 1922
 * SEND Count: 131 -> 86
 * Cycle Count: 1392012 -> 1037836
 * Max live registers: 90 -> 90
 * Non SSA regs after NIR: 300 -> 135

for shader mul_mat_vec_q4_k_q8_1_f32:
 * Instruction Count: 2029 -> 1749
 * SEND Count: 111 -> 66
 * Cycle Count: 1347278 -> 840118
 * Max live registers: 74 -> 80
 * Non SSA regs after NIR: 299 -> 134

for shader mul_mat_vec_q5_k_q8_1_f32:
 * Instruction Count: 2307 -> 2022
 * SEND Count: 126 -> 81
 * Cycle Count: 1379820 -> 954042
 * Max live registers: 86 -> 86
 * Non SSA regs after NIR: 299 -> 134

On one Arc Pro B60, unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL:
 * pp512: 907.34 ± 9.28 -> 941.94 ± 10.53 (+4%)
 * pp2048: 897.95 ± 1.82 -> 931.55 ± 1.79 (+4%)
 * tg128: 49.49 ± 0.02 -> 49.86 ± 0.05 (+ <1%)

On one Arc Pro B60, unsloth/Qwen3.5-27B-GGUF:Q4_K_S:
 * pp512: 324.13 ± 10.52 -> 354.33 ± 6.81 (+9%)
 * pp2048: 329.80 ± 0.25 -> 357.10 ± 0.06 (+8%)
 * tg128: 17.11 ± 0.01 -> 18.11 ± 0.01 (+6%)

On four Arc Pro B60s, unsloth/Qwen3.5-122B-A10B-GGUF:Q5_K_S with
-sm layer (note that -sm tensor improvements will naturally be
less):
 * pp512: 264.55 ± 2.81 -> 280.45 ± 3.94 (+6%)
 * pp2048: 319.32 ± 2.72 -> 335.70 ± 3.48 (+5%)
 * tg128: 26.39 ± 0.01 -> 26.67 ± 0.01 (+1%)
…22323)

DONE state absorbs all tokens including a new start tag, causing any think blocks after the first to run unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF which interleaves multiple <think> blocks per response.

Fixed by advancing start_matcher in DONE branch and re-arming to COUNTING with a fresh budget on match. Adds regression test (test-reasoning-budget: test 6).
This commit adds support for NVIDIA Nemotron Nano 3 Omni model enabling
this model to be converted to GGUF.
ggml-org#22286)

* ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (GQA=32)

Adds MMA-f16 and tile kernel configs, dispatch logic, template instances,
and tile .cu file for Mistral Small 4 (head sizes 320/256), restricting to
ncols2=32 to support GQA ratio 32 only.

* Adding check to return BEST_FATTN_KERNEL_NONE in case GQA!=32

* Apply suggestions from code review

Address review comments

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments and making kernel config default to DQK=512, DV=512 instead of DQK=256,DV=256

* Fixed bug with sinks=1, with ncols=32, there are two warp-groups created but sinks index is same(0,...,15) for both the groups hence with sinks=1, output is not matching with CPU output. Added sink_base which will be base index for each warp_group (threadIdx.y / np)

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/template-instances/generate_cu_files.py

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
…1916)

* Added sve tuned code for gemm_q8_0_4x8_q8_0() kernel

* Change arrays to static const in repack.cpp

---------

Co-authored-by: Vithulep <prashant.vithule@fujitsu.com>
…gml-org#22273)

* Changed to leak logger singleton to prevent hanging on Windows

* Fix comment

* Stopped using static vector

Using std::vector will cause g_col to be released before the logger thread exits, causing the logger thread to touch freed memory causing a crash

* Change so all logs are output before exit

* Added debug logging

* added more logging

* Added logging

* Explicitly free logger to avoid hanging on Win

* Reverted to leak logger instance again

* Removed debug log and fixed comment

* Fixed comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Fix flashattention support check for devices that don't support subgroups

* set path to none if kv_tile doesn't fit
…2317)

* ggml-cpu: cmake: append xsmtvdotii march for SpacemiT IME

When GGML_CPU_RISCV64_SPACEMIT=ON is set, ime1_kernels.cpp contains
inline asm for the vmadot family which requires the xsmtvdotii custom
extension.(problem can see in some blogs and make sure in K3 platform)
The current CMakeLists does not include xsmtvdotii, so any toolchain
that honours the explicit -march (tested with SpacemiT GCC 15.2) fails
at the assembler stage:

  Error: unrecognized opcode `vmadot v16,v14,v0',
         extension `xsmtvdotii' required

Append _xsmtvdotii to MARCH_STR when GGML_CPU_RISCV64_SPACEMIT is
enabled so the IME path can actually build with a capable toolchain.
No effect on builds that leave GGML_CPU_RISCV64_SPACEMIT off.

toolchain from https://www.spacemit.com/community/resources-download/Tools

* Update ggml/src/ggml-cpu/CMakeLists.txt

Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com>

---------

Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com>
* ggml-cuda: refactor fusion code

* apply formatting + make env variable truthy
…gml-org#22293)

* ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault

vec_xst operations in the tiled path crash on AIX when writing
near 4KB page boundaries due to strict memory protection. Fall
back to mnpack implementation on AIX for stable execution.

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>

* Update ggml/src/ggml-cpu/llamafile/sgemm.cpp

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* Update sgemm.cpp

* Update sgemm.cpp

---------

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* webui: instant mic stop, race-free recorder restart

* webui: faster WAV PCM encode via hoisted channels and Int16Array

* chore: update webui build output

* webui: drop setTimeout(0) hack and harden cancelRecording

* chore: update webui build output
* hexagon: allow host to set max vmem size

We use a sane default but it's helpful to allow for an override if needed.

* hexagon: add support for measuring vmem space and move pinned mmaping management to host

* hexagon: update vmem checks to use uint64

* hexagon: bump op buffers to 16 (matches max mmaps)

* hexagon: bump default vmem to 3.2GB

* hexagon: add support for autodetecting vmem space and some logging cleanup in that area

* hexagon: fix whitespace warnings

* Update scripts/snapdragon/adb/run-cli.sh

Co-authored-by: Pascal <admin@serveurperso.com>

* hex-adb: fix run-completion script

---------

Co-authored-by: Pascal <admin@serveurperso.com>
* port ggml-org#22358 PR to examples/speculative/speculative.cpp
* use vocab_[tgt,dft] instead of ctx_[tgt,dft] when logging on draft
  model / target model vocabulary mismatch

Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
* spec : fix draft model checkpoints

* cont : clean-up

* cont : gate the ngram-mod reset warning behind verbose flag
…22513)

* scripts : add wc2wt.sh - create worktree from current HEAD

Add a script to create a git worktree on a new branch from the current
HEAD. Similar to pr2wt.sh but for local development branches instead of
PRs.

Usage:
  ./scripts/wc2wt.sh gg/new-feature
  ./scripts/wc2wt.sh gg/new-feature "bash -l"

Assisted-by: llama.cpp:local pi

* cont : no need to try to delete the branch
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* bump ty to 0.0.33

* update typings
* vulkan: add get/set_tensor_2d functions

* fix backend interface comments

* Update ggml/src/ggml-metal/ggml-metal.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
@am17an am17an force-pushed the mtp-clean branch 3 times, most recently from 02a65ab to 6b40a9f Compare May 4, 2026 08:57
eapache and others added 8 commits May 4, 2026 12:19
…ggml-org#22004)

* git-friendly migration

* add build_graph

* nits

* exclude old code from build

* wip

* add llm_arch_model_i

* prepare downstream functions

* nits

* nits

* wip

* wip

* add back create_tensor_qkv

* fix files missing include

* enforce one llm_build per arch

* cmake: use glob

* missing model params

* nits

* wip

* wip (2)

* wip (3)

* test-llama-archs is happy

* improve switch case

* move more stuff into llm_arch_model_i

* fix downstream code

* nits

* nits (2)

* fix order

* llama_model_base

* LLAMA_LOAD_LOCALS

* small fix

* fix build errors

* auto

* rm migration script and ifdef
…ml-org#22654)

* chat/autoparser: the fixes

* Move optspace() to chat-peg-parser, comment out server tests invalidated due to content now allowed with forced tool calls.

* Trim whitespace on apply instead
Currently speculative checkpoint needs to restart from a checkpoint
after some draft tokens are not accepted, this leads to some wastage in
running the target again. This PR adds the ability to rollback upto
`draft_max` by storing the GDN intermediates.
@am17an am17an closed this May 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.