Skip to content

[pull] master from ggml-org:master#119

Merged
pull[bot] merged 10 commits into
CrazyForks:masterfrom
ggml-org:master
Jun 1, 2026
Merged

[pull] master from ggml-org:master#119
pull[bot] merged 10 commits into
CrazyForks:masterfrom
ggml-org:master

Conversation

@pull
Copy link
Copy Markdown

@pull pull Bot commented Jun 1, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

winstonma and others added 10 commits June 1, 2026 11:46
)

Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even
though they're only 2-byte aligned, and Q3_K still wins on
NVIDIA as well.

mesa isn't all that great at coalescing back-to-back loads from
alternating arrays, so we force it instead. Further, we can do
subtraction directly on a full int32_t rather than an i8vec4
with bit twiddling because the high bit is always free to start.

On Intel BMG on mesa, the switch to MMVQ provides an immediate
~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and
~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

The futher switch to block loads leads to a ~24% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in
tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA
override for K quants on Xe2 as well.
* Add EXAONE 4.5 and Add GQA for MMproj

* mtmd: EXAONE 4.5 vision markers and projector path

EXAONE 4.5 uses <vision> and </vision> for image boundaries; Qwen keeps
<|vision_start|> and <|vision_end|>.

Route EXAONE 4.5 through the Qwen2.5-VL-style encode path (window attention
pattern, optional mmproj input norm). Update exaone4_5 projector weights and
convert_hf_to_gguf for mmproj export.

* mtmd: load EXAONE4 nextn tensors correctly

Align EXAONE4 tensor registration with EXAONE_MOE for NextN/MTP slots and avoid skip-flag propagation on duplicated rope_freqs so model loading succeeds for EXAONE 4.5 GGUF.

* Minor fixes

* Address PR feedback

* Address PR feedback

* Fix EXAONE after merge

* Fix EXAONE 4.5 conversion

* Address PR feedback

* Refactor EXAONE 4.5 conversion

* Address PR feedback

* Fix unintended deletion

* Minor fix

---------

Co-authored-by: LG-AI-EXAONE <exaonemodels@lgresearch.ai>
* TP: quantized KV cache support

* fix partial view

* remove overly strict assert
* vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer)

* vocab : add normalizer.lowercase support to WPM

* vocab : default normalizer.lowercase to false for whitespace pre-tokenizer
* vulkan: reduces lock contention

* replace unique_lock with lock_guard
* vulkan: don't hold the device mutex while compiling pipelines

We need to hold a lock while we traverse all pipelines and lazily initialize
them, but we don't need to hold it while the pipeline is being compiled. And
it doesn't need to be the same lock as the device mutex. We call load_shaders
each time a pipeline is needed, so we only need to compile that one pipeline
(and, for example, don't want to end up compiling a pipeline that another
thread should be compiling).

* remove 'needed'
Drops the hardcoded f32 GLU kernels in favor of a single template. We now load/store in the native tensor type (half or float) to save memory bandwidth, but keep the actual ALU compute in float to avoid exploding math in geglu/swiglu. Also opened up the dispatch gate to allow f16 inputs.
* llama: save more VRAM by reserving n_outputs == n_seqs when possible

* add n_outputs_per_seq

* move n_outputs_max to server-context

* change ubatch to batch everywhere
@pull pull Bot locked and limited conversation to collaborators Jun 1, 2026
@pull pull Bot added the ⤵️ pull label Jun 1, 2026
@pull pull Bot merged commit de6f727 into CrazyForks:master Jun 1, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants