[pull] master from ggml-org:master by pull[bot] · Pull Request #119 · CrazyForks/llama.cpp

pull · 2026-06-01T15:42:34Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

) Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even though they're only 2-byte aligned, and Q3_K still wins on NVIDIA as well. mesa isn't all that great at coalescing back-to-back loads from alternating arrays, so we force it instead. Further, we can do subtraction directly on a full int32_t rather than an i8vec4 with bit twiddling because the high bit is always free to start. On Intel BMG on mesa, the switch to MMVQ provides an immediate ~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and ~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K. The futher switch to block loads leads to a ~24% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K. Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA override for K quants on Xe2 as well.

* Add EXAONE 4.5 and Add GQA for MMproj * mtmd: EXAONE 4.5 vision markers and projector path EXAONE 4.5 uses <vision> and </vision> for image boundaries; Qwen keeps <|vision_start|> and <|vision_end|>. Route EXAONE 4.5 through the Qwen2.5-VL-style encode path (window attention pattern, optional mmproj input norm). Update exaone4_5 projector weights and convert_hf_to_gguf for mmproj export. * mtmd: load EXAONE4 nextn tensors correctly Align EXAONE4 tensor registration with EXAONE_MOE for NextN/MTP slots and avoid skip-flag propagation on duplicated rope_freqs so model loading succeeds for EXAONE 4.5 GGUF. * Minor fixes * Address PR feedback * Address PR feedback * Fix EXAONE after merge * Fix EXAONE 4.5 conversion * Address PR feedback * Refactor EXAONE 4.5 conversion * Address PR feedback * Fix unintended deletion * Minor fix --------- Co-authored-by: LG-AI-EXAONE <exaonemodels@lgresearch.ai>

* TP: quantized KV cache support * fix partial view * remove overly strict assert

* vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer) * vocab : add normalizer.lowercase support to WPM * vocab : default normalizer.lowercase to false for whitespace pre-tokenizer

* vulkan: reduces lock contention * replace unique_lock with lock_guard

* vulkan: don't hold the device mutex while compiling pipelines We need to hold a lock while we traverse all pipelines and lazily initialize them, but we don't need to hold it while the pipeline is being compiled. And it doesn't need to be the same lock as the device mutex. We call load_shaders each time a pipeline is needed, so we only need to compile that one pipeline (and, for example, don't want to end up compiling a pipeline that another thread should be compiling). * remove 'needed'

Drops the hardcoded f32 GLU kernels in favor of a single template. We now load/store in the native tensor type (half or float) to save memory bandwidth, but keep the actual ALU compute in float to avoid exploding math in geglu/swiglu. Also opened up the dispatch gate to allow f16 inputs.

* llama: save more VRAM by reserving n_outputs == n_seqs when possible * add n_outputs_per_seq * move n_outputs_max to server-context * change ubatch to batch everywhere

winstonma and others added 10 commits June 1, 2026 11:46

vulkan: Removed unused functions (#23175)

f8c0a19

security : disable private disclosures (#23963)

02a5701

TP: quantized KV cache support (#23792)

8e6fff8

* TP: quantized KV cache support * fix partial view * remove overly strict assert

vocab: add normalizer.lowercase support to WPM (#23899)

5aba536

* vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer) * vocab : add normalizer.lowercase support to WPM * vocab : default normalizer.lowercase to false for whitespace pre-tokenizer

vulkan: reduce host memory lock contention (#23376)

bef69f1

* vulkan: reduces lock contention * replace unique_lock with lock_guard

llama: limit max outputs of llama_context (#23861)

de6f727

* llama: save more VRAM by reserving n_outputs == n_seqs when possible * add n_outputs_per_seq * move n_outputs_max to server-context * change ubatch to batch everywhere

pull Bot locked and limited conversation to collaborators Jun 1, 2026

pull Bot added the ⤵️ pull label Jun 1, 2026

pull Bot merged commit de6f727 into CrazyForks:master Jun 1, 2026

github-actions Bot added examples python server ggml Apple Metal Vulkan model labels Jun 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[pull] master from ggml-org:master#119

[pull] master from ggml-org:master#119
pull[bot] merged 10 commits into
CrazyForks:masterfrom
ggml-org:master

pull Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

Conversation

pull Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

pull Bot commented Jun 1, 2026 •

edited

Loading