llama + spec: MTP support #3
Closed
am17an wants to merge 65 commits into
Conversation
…ogic (ggml-org#22456)
* Refactor buffer aliasing to be part of shader lib decisions
* cleanup
* formatting
Some SPIR-V compilers (notably mesa) don't handle the current vulkan Q4_K/Q5_K scale load pattern in mul_mat particularly well. While reading three `u8`s from the 12-byte scale array should (at least on some hardware) result in loading the full 12 bytes in a single LOAD followed by whatever extraction is needed, at least the ANV Intel driver really can't practically perform this optimization. `mesa`'s unsigned upper bound logic doesn't track bounds through ternaries, so the `(is < 4) ? ... : is - 4` pattern ends up with an infinite upper bound (mesa cannot prove `is - 4` doesn't underflow). While this could still be rectified if mesa looked at the array bounds, it currently doesn't, and `glslc` currently emits SPIR-V that doesn't allow for this optimization anyway (though maybe it will at some point, see KhronosGroup/glslang#4206).

In mul_mat_vecq we took a different approach to loading the same fields: we read the first two bytes we needed from `scale`, then took a branch before deciding whether we needed to read a third byte. In mesa this did, indeed, lead to a top-level branch with conditional loads. As such, these loads ended up not being coalesced either (at least in the ANV driver), resulting in additional instructions in our hot loop.

Instead, here, we force loading the full 12 bytes and extract the bits we need from the packed u32s. In mul_mat there are a few fewer ternaries and only one extra shift, so even on drivers that did optimize the previous loads properly, the only material change should be pulling a few extra bytes into registers (which on most hardware won't cost anything anyway, though ironically on Intel it theoretically could). In mul_mat_vecq this requires a bit of extra math and may read bytes from the u32 that weren't needed, but avoiding the branch is likely a win on most platforms.
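As a rough illustration of the packed-load idea (host-side C++ following ggml's get_scale_min_k4 layout and assuming little-endian, not the actual GLSL), the 12 scale bytes can be pulled in once as three u32s and every field derived with shifts and masks:

```cpp
#include <cstdint>
#include <cstring>

// Extract byte i from the three packed words: pure register math,
// no per-byte memory access and no bound the compiler can't prove.
static inline uint8_t byte_at(const uint32_t v[3], int i) {
    return uint8_t(v[i >> 2] >> (8 * (i & 3)));
}

// Sketch of the Q4_K/Q5_K 6-bit scale/min extraction after forcing a
// single 12-byte load (layout as in ggml's get_scale_min_k4).
static inline void get_scale_min(const uint8_t scales[12], int is,
                                 uint8_t & sc, uint8_t & mn) {
    uint32_t v[3];
    std::memcpy(v, scales, 12); // one coalesced load of the whole array

    if (is < 4) {
        sc = byte_at(v, is)     & 63;
        mn = byte_at(v, is + 4) & 63;
    } else {
        sc = (byte_at(v, is + 4) & 0x0F) | ((byte_at(v, is - 4) >> 6) << 4);
        mn = (byte_at(v, is + 4) >>   4) | ((byte_at(v, is    ) >> 6) << 4);
    }
}
```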
On Intel Xe2/mesa 26.0.4 with the optimizations from https://gitlab.freedesktop.org/mesa/mesa/-/work_items/15162:

for shader matmul_id_subgroup_q4_k_f32_f16acc_aligned_l:
* Instruction Count: 2753 -> 2688
* SEND Count: 269 -> 261
* Cycle Count: 273976 -> 266138
* Max live registers: 248 -> 246
* Non SSA regs after NIR: 381 -> 382

for shader matmul_id_subgroup_q5_k_f32_f16acc_aligned_l:
* Instruction Count: 2767 -> 2702
* SEND Count: 271 -> 263
* Cycle Count: 274140 -> 268144
* Max live registers: 248 -> 246
* Non SSA regs after NIR: 381 -> 382

for shader mul_mat_vec_id_q4_k_q8_1_f32:
* Instruction Count: 1930 -> 1646
* SEND Count: 116 -> 71
* Cycle Count: 1348306 -> 843350
* Max live registers: 78 -> 84
* Non SSA regs after NIR: 300 -> 135

for shader mul_mat_vec_id_q5_k_q8_1_f32:
* Instruction Count: 2207 -> 1922
* SEND Count: 131 -> 86
* Cycle Count: 1392012 -> 1037836
* Max live registers: 90 -> 90
* Non SSA regs after NIR: 300 -> 135

for shader mul_mat_vec_q4_k_q8_1_f32:
* Instruction Count: 2029 -> 1749
* SEND Count: 111 -> 66
* Cycle Count: 1347278 -> 840118
* Max live registers: 74 -> 80
* Non SSA regs after NIR: 299 -> 134

for shader mul_mat_vec_q5_k_q8_1_f32:
* Instruction Count: 2307 -> 2022
* SEND Count: 126 -> 81
* Cycle Count: 1379820 -> 954042
* Max live registers: 86 -> 86
* Non SSA regs after NIR: 299 -> 134

On one Arc Pro B60, unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL:
* pp512: 907.34 ± 9.28 -> 941.94 ± 10.53 (+4%)
* pp2048: 897.95 ± 1.82 -> 931.55 ± 1.79 (+4%)
* tg128: 49.49 ± 0.02 -> 49.86 ± 0.05 (+ <1%)

On one Arc Pro B60, unsloth/Qwen3.5-27B-GGUF:Q4_K_S:
* pp512: 324.13 ± 10.52 -> 354.33 ± 6.81 (+9%)
* pp2048: 329.80 ± 0.25 -> 357.10 ± 0.06 (+8%)
* tg128: 17.11 ± 0.01 -> 18.11 ± 0.01 (+6%)

On four Arc Pro B60s, unsloth/Qwen3.5-122B-A10B-GGUF:Q5_K_S with -sm layer (note that -sm tensor improvements will naturally be less):
* pp512: 264.55 ± 2.81 -> 280.45 ± 3.94 (+6%)
* pp2048: 319.32 ± 2.72 -> 335.70 ± 3.48 (+5%)
* tg128: 26.39 ± 0.01 -> 26.67 ± 0.01 (+1%)
…22323) The DONE state absorbs all tokens, including a new start tag, causing any think blocks after the first to run unbudgeted. Observed on unsloth/Qwen3.6-27B-GGUF, which interleaves multiple <think> blocks per response. Fixed by advancing start_matcher in the DONE branch and re-arming to COUNTING with a fresh budget on a match. Adds a regression test (test-reasoning-budget: test 6).
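A minimal sketch of the re-arm logic (hypothetical types and names, not the actual llama.cpp code): the DONE branch keeps feeding tokens to the start-tag matcher, and a complete match flips the state back to COUNTING with a fresh budget:

```cpp
#include <string>

// Toy incremental matcher for a literal tag such as "<think>".
struct tag_matcher {
    std::string tag = "<think>";
    size_t      pos = 0;
    bool advance(char c) {                 // true on a complete match
        pos = (c == tag[pos]) ? pos + 1 : (c == tag[0] ? 1 : 0);
        if (pos == tag.size()) { pos = 0; return true; }
        return false;
    }
};

enum class budget_state { COUNTING, DONE };

struct reasoning_budget {
    budget_state state = budget_state::COUNTING;
    int          n_left;         // tokens left in the current think block
    int          n_budget;       // budget granted per think block
    tag_matcher  start_matcher;

    explicit reasoning_budget(int budget) : n_left(budget), n_budget(budget) {}

    void accept(char c) {
        switch (state) {
            case budget_state::COUNTING:
                if (--n_left <= 0) state = budget_state::DONE;
                break;
            case budget_state::DONE:
                // previously tokens were absorbed here unconditionally;
                // the fix: watch for a new start tag and re-arm
                if (start_matcher.advance(c)) {
                    n_left = n_budget;
                    state  = budget_state::COUNTING;
                }
                break;
        }
    }
};
```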
This commit adds support for the NVIDIA Nemotron Nano 3 Omni model, enabling it to be converted to GGUF.
ggml-org#22286)
* ggml-cuda: add flash-attn support for DKQ=320/DV=256 with ncols2=32 (GQA=32). Adds MMA-f16 and tile kernel configs, dispatch logic, template instances, and a tile .cu file for Mistral Small 4 (head sizes 320/256), restricted to ncols2=32 to support a GQA ratio of 32 only.
* Add a check to return BEST_FATTN_KERNEL_NONE when GQA != 32
* Apply suggestions from code review (address review comments)

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Address review comments and make the kernel config default to DKQ=512, DV=512 instead of DKQ=256, DV=256
* Fix a bug with sinks: with ncols=32 two warp groups are created, but the sinks index was the same (0,...,15) for both groups, so with sinks=1 the output did not match the CPU output. Added sink_base, the base index for each warp group (threadIdx.y / np).
* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/template-instances/generate_cu_files.py

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
…1916)
* Added SVE-tuned code for the gemm_q8_0_4x8_q8_0() kernel
* Changed arrays to static const in repack.cpp

---------

Co-authored-by: Vithulep <prashant.vithule@fujitsu.com>
…gml-org#22273)
* Changed to leak the logger singleton to prevent hanging on Windows
* Fix comment
* Stopped using a static vector: a static std::vector caused g_col to be released before the logger thread exited, letting the logger thread touch freed memory and crash
* Change so all logs are output before exit
* Added debug logging
* Added more logging
* Added logging
* Explicitly free the logger to avoid hanging on Windows
* Reverted to leaking the logger instance again
* Removed debug log and fixed comment
* Fixed comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
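For reference, a sketch of the "leak the singleton" pattern the commit settles on (illustrative, not the actual llama.cpp logger): static-duration objects are destroyed at exit in reverse order of creation, so a background thread that still logs during shutdown can touch a destroyed logger; never running the destructor sidesteps the ordering problem, and the OS reclaims the memory anyway.

```cpp
#include <cstdio>

struct logger {
    void log(const char * msg) { std::fprintf(stderr, "%s\n", msg); }
};

logger & get_logger() {
    // heap-allocated and intentionally never freed: no destructor runs
    // at exit, so late log calls from worker threads remain safe
    static logger * inst = new logger();
    return *inst;
}

int main() {
    get_logger().log("still valid no matter when this runs");
}
```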
* Fix flashattention support check for devices that don't support subgroups
* Set path to none if kv_tile doesn't fit
…2317)
* ggml-cpu: cmake: append xsmtvdotii to -march for SpacemiT IME

When GGML_CPU_RISCV64_SPACEMIT=ON is set, ime1_kernels.cpp contains inline asm for the vmadot family, which requires the xsmtvdotii custom extension (the problem is described in several blog posts and was confirmed on the K3 platform). The current CMakeLists does not include xsmtvdotii, so any toolchain that honours the explicit -march (tested with SpacemiT GCC 15.2) fails at the assembler stage:

Error: unrecognized opcode `vmadot v16,v14,v0', extension `xsmtvdotii' required

Append _xsmtvdotii to MARCH_STR when GGML_CPU_RISCV64_SPACEMIT is enabled so the IME path can actually build with a capable toolchain. No effect on builds that leave GGML_CPU_RISCV64_SPACEMIT off. Toolchain from https://www.spacemit.com/community/resources-download/Tools

* Update ggml/src/ggml-cpu/CMakeLists.txt

Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com>

---------

Co-authored-by: alex-spacemit <jinghui.huang@spacemit.com>
* ggml-cuda: refactor fusion code
* Apply formatting + make env variable truthy
…gml-org#22293)
* ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault

vec_xst operations in the tiled path crash on AIX when writing near 4KB page boundaries due to strict memory protection. Fall back to the mnpack implementation on AIX for stable execution.

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>

* Update ggml/src/ggml-cpu/llamafile/sgemm.cpp

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* Update sgemm.cpp
* Update sgemm.cpp

---------

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
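A hedged sketch of the platform guard described above (hypothetical function name, not the actual sgemm.cpp code): on AIX the vec_xst-based tiled path is skipped wholesale, so callers fall back to the mnpack path.

```cpp
// _AIX is the standard predefined macro on AIX toolchains.
static bool can_use_tiled_matmul() {
#if defined(_AIX)
    // vec_xst near 4KB page boundaries faults under AIX's strict
    // memory protection, so the tiled path is disabled entirely
    return false;
#else
    return true;
#endif
}
```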
* webui: instant mic stop, race-free recorder restart
* webui: faster WAV PCM encode via hoisted channels and Int16Array
* chore: update webui build output
* webui: drop setTimeout(0) hack and harden cancelRecording
* chore: update webui build output
* hexagon: allow host to set max vmem size. We use a sane default, but it's helpful to allow an override if needed.
* hexagon: add support for measuring vmem space and move pinned mmap management to the host
* hexagon: update vmem checks to use uint64
* hexagon: bump op buffers to 16 (matches max mmaps)
* hexagon: bump default vmem to 3.2GB
* hexagon: add support for autodetecting vmem space, plus some logging cleanup in that area
* hexagon: fix whitespace warnings
* Update scripts/snapdragon/adb/run-cli.sh

Co-authored-by: Pascal <admin@serveurperso.com>

* hex-adb: fix run-completion script

---------

Co-authored-by: Pascal <admin@serveurperso.com>
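A hypothetical sketch of the override knob described above (the real mechanism and variable name in the hexagon backend may differ): a sane default VMEM budget that the host can override at runtime.

```cpp
#include <cstdint>
#include <cstdlib>

static uint64_t get_max_vmem_bytes() {
    const uint64_t def = 3276ull * 1024 * 1024;   // ~3.2 GiB default
    // GGML_HEXAGON_MAX_VMEM is an assumed name for illustration only
    if (const char * s = std::getenv("GGML_HEXAGON_MAX_VMEM")) {
        char * end = nullptr;
        const uint64_t v = std::strtoull(s, &end, 10);
        if (end != s && v > 0) {
            return v;   // host override wins over the default
        }
    }
    return def;
}
```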
* port ggml-org#22358 PR to examples/speculative/speculative.cpp
* use vocab_[tgt,dft] instead of ctx_[tgt,dft] when logging on draft model / target model vocabulary mismatch

Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
* spec : fix draft model checkpoints
* cont : clean-up
* cont : gate the ngram-mod reset warning behind verbose flag
…22513)
* scripts : add wc2wt.sh - create worktree from current HEAD

Add a script to create a git worktree on a new branch from the current HEAD. Similar to pr2wt.sh, but for local development branches instead of PRs.

Usage:
./scripts/wc2wt.sh gg/new-feature
./scripts/wc2wt.sh gg/new-feature "bash -l"

Assisted-by: llama.cpp:local pi

* cont : no need to try to delete the branch
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* bump ty to 0.0.33
* update typings
* vulkan: add get/set_tensor_2d functions
* fix backend interface comments
* Update ggml/src/ggml-metal/ggml-metal.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
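A hypothetical sketch of what a 2D tensor transfer adds over the flat get/set_tensor (names and signature are illustrative, not the actual ggml backend API): n_rows rows of row_size bytes with independent source and destination strides, so a non-contiguous view can be copied without staging it contiguously first.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

static void tensor_set_2d(uint8_t * dst, size_t dst_stride,
                          const uint8_t * src, size_t src_stride,
                          size_t row_size, size_t n_rows) {
    for (size_t r = 0; r < n_rows; ++r) {
        // a backend would map this loop to one strided device copy
        std::memcpy(dst + r * dst_stride, src + r * src_stride, row_size);
    }
}
```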
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
…ggml-org#22004)
* git-friendly migration
* add build_graph
* nits
* exclude old code from build
* wip
* add llm_arch_model_i
* prepare downstream functions
* nits
* nits
* wip
* wip (2)
* wip (3)
* add back create_tensor_qkv
* fix files missing include
* enforce one llm_build per arch
* cmake: use glob
* missing model params
* nits
* wip
* test-llama-archs is happy
* improve switch case
* move more stuff into llm_arch_model_i
* fix downstream code
* nits
* nits (2)
* fix order
* llama_model_base
* LLAMA_LOAD_LOCALS
* small fix
* fix build errors
* auto
* rm migration script and ifdef
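A very rough sketch of the "one llm_build per arch" shape this commit list gestures at; llm_arch_model_i appears in the list above, but the members here are guesses, not the actual interface:

```cpp
struct llm_graph_context; // opaque for this sketch

struct llm_arch_model_i {
    virtual ~llm_arch_model_i() = default;
    // each architecture supplies exactly one graph builder
    virtual void build_graph(llm_graph_context & ctx) const = 0;
};

struct llm_build_llama : llm_arch_model_i {
    void build_graph(llm_graph_context & ctx) const override {
        (void) ctx; // arch-specific layer wiring would go here
    }
};
```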
…ml-org#22654)
* chat/autoparser: the fixes
* Move optspace() to chat-peg-parser; comment out server tests invalidated now that content is allowed with forced tool calls
* Trim whitespace on apply instead
Currently, speculative decoding has to restart from a checkpoint whenever some draft tokens are rejected, which wastes work by running the target model again. This PR adds the ability to roll back up to `draft_max` tokens by storing the GDN intermediates.
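To sketch what "storing the GDN intermediates" could look like (hypothetical structures, not the actual implementation): keep the recurrent state for each of the last `draft_max` positions instead of a single checkpoint, so decoding can resume exactly at the first rejected draft token.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct gdn_state {
    std::vector<float> data; // flattened recurrent state at one position
};

struct spec_rollback {
    size_t                 draft_max;
    std::vector<gdn_state> saved; // one entry per drafted position

    explicit spec_rollback(size_t n) : draft_max(n), saved(n) {}

    // record the state produced after drafting token draft_idx
    void save(size_t draft_idx, gdn_state s) {
        saved[draft_idx % draft_max] = std::move(s);
    }

    // after n_accepted tokens are kept, restore the state recorded just
    // before the first rejected token rather than a much older checkpoint
    const gdn_state & restore(size_t n_accepted) const {
        return saved[n_accepted % draft_max];
    }
};
```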