[pull] master from ggml-org:master by pull[bot] · Pull Request #107 · CrazyForks/llama.cpp

pull · 2026-05-28T15:42:41Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

* allow caching of ui elements in llama-server
* use fnv_hash
* Update tools/server/server-http.cpp

etag has to be set always

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
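As a rough illustration of the caching idea in the commit above, the sketch below derives an ETag from a static asset with an FNV-1a hash and compares it against an `If-None-Match` value. The names `fnv1a_64`, `make_etag`, and `is_not_modified` are hypothetical, and the 64-bit FNV-1a variant is an assumption; the actual `fnv_hash` helper and the handling in `tools/server/server-http.cpp` may differ.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// FNV-1a, 64-bit variant, with the standard offset basis and prime.
// Assumption: the hash width and variant used by llama-server may differ.
static uint64_t fnv1a_64(const std::string & data) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (unsigned char c : data) {
        h ^= c;
        h *= 0x100000001b3ULL;
    }
    return h;
}

// Hypothetical helper: format the hash as a quoted, strong ETag value.
static std::string make_etag(const std::string & body) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "\"%016llx\"", (unsigned long long) fnv1a_64(body));
    return std::string(buf);
}

// Hypothetical helper: true when the client's cached copy is still valid,
// in which case a server could answer 304 Not Modified without a body.
static bool is_not_modified(const std::string & if_none_match, const std::string & etag) {
    return !if_none_match.empty() && if_none_match == etag;
}
```

Because the ETag depends only on the asset bytes, it can be computed once per asset and sent on every response, which matches the note that the etag has to be set always.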

* hex-fa: clean up qf32/fp32 handling and stride handling
* hex-fa: fix corner case fp NaN issues that were causing bad output from gemma4 on v79
* hex-fa: vectorize leftover handling
* hex-fa: avoid HVX fallback during token gen

HMX has more FP16 compute capacity

* hmx-mm: remove dead code
* hmx-mm: use fastdiv in x4x2 dequant
* hmx-mm: sandwich dequant and scatter to improve perf
* hmx-mm: fixed rebase conflicts
* hmx-mm: further improve weight dequant by doing early type dispatch and precomputing fastdiv
* hmx-mm: an even earlier dispatch for per-type dequant
* hmx-mm: dequant linear types like q4_0 and q4_1 without the LUTs

This is a bit faster than LUT.

* hex-cmake: one more tweak for lto

---------

Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
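For readers unfamiliar with the "fastdiv" mentioned above, the following is a minimal scalar sketch of the general technique: division by a fixed divisor is replaced by a precomputed multiplier and a shift. The names `fastdiv_values`, `init_fastdiv`, and `fastdiv` are illustrative, and the layout is an assumption; the Hexagon code in this commit may precompute and apply the values differently.

```cpp
#include <cstdint>

// Precomputed constants for dividing by a fixed divisor d (hypothetical layout).
struct fastdiv_values {
    uint32_t mp;    // magic multiplier
    uint32_t shift; // ceil(log2(d))
};

// Precompute the multiplier and shift once per divisor. d must be non-zero.
static fastdiv_values init_fastdiv(uint32_t d) {
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) {
        L++;
    }
    const uint32_t mp = (uint32_t) (((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1);
    return { mp, L };
}

// Compute n / d using one multiply, one add, and one shift.
// The add is done in 64 bits so the sketch is valid for every 32-bit n.
static uint32_t fastdiv(uint32_t n, fastdiv_values v) {
    const uint64_t hi = ((uint64_t) n * v.mp) >> 32;
    return (uint32_t) ((hi + n) >> v.shift);
}
```

Precomputing the values outside the hot loop is what makes this pay off: the per-element cost no longer includes an integer division.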

* misc(server): add default port to impl RAII
* misc(server): register_gcp_compat() can be const
* misc(server): use proper cpp const/auto methods
* misc(server): do not reset a unique_ptr, use make_unique instead to be exception safe
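The last item above can be illustrated with a small sketch. The `listener` type, the `rebind` helper, and the port values are hypothetical and are not taken from the server code; the point is only that assigning from `std::make_unique` avoids a naked `new` and leaves the existing object untouched if construction throws.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical resource whose constructor can fail.
struct listener {
    int port;
    explicit listener(int p) : port(p) {
        if (p <= 0) {
            throw std::invalid_argument("invalid port");
        }
    }
};

// Replace the current listener. If construction throws, the assignment never
// happens, so `current` keeps owning the previous, still valid object.
static bool rebind(std::unique_ptr<listener> & current, int port) {
    try {
        current = std::make_unique<listener>(port);
        return true;
    } catch (const std::invalid_argument &) {
        return false;
    }
}
```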

…3227)

* CUDA: per-quant MMVQ/MMQ batch threshold on AMD MFMA hardware

The dispatcher uses a single global threshold (MMVQ_MAX_BATCH_SIZE = 8) to choose between mul_mat_vec_q (per-row GEMV) and mul_mat_q (MFMA-tiled GEMM) for quantized matmul. On AMD CDNA, the optimal crossover differs substantially by quant family because the per-row GEMV cost is dominated by dequantisation, not the dot-product itself: K-quants pay a heavier super-block decode and so MMQ wins sooner; legacy and IQ quants have lean decode and stay ahead until the batch fully populates an MFMA tile.

This patch introduces ggml_cuda_should_use_mmvq(type, cc, ne11) -> bool, mirroring the existing ggml_cuda_should_use_mmq, and gates per-quant thresholds on amd_mfma_available(cc):

    Q3_K, Q4_K, Q5_K : MMVQ <= 3  (MMQ wins from batch=4: +5% .. +76%)
    Q2_K, Q6_K       : MMVQ <= 5  (MMQ wins from batch=6: +8% .. +35%)
    others           : MMVQ <= 8  (legacy & IQ regress under MMQ; unchanged)

Non-AMD-MFMA paths (NVIDIA, RDNA, CDNA1 without MFMA) are byte-identical to master. GGML_CUDA_FORCE_MMVQ=1 restores the original global threshold for A/B testing.

Measured on MI250X (gfx90a, ROCm 7.2.1) with Llama-3.2-3B-Instruct, llama-bench pp512 across all 20 supported quants, ubatch 1..8, 10 reps. Full table in PR description.

Selected pp512 throughput (tok/s, ub=8):

    Q4_K_S: 559 -> 940 (+68%)
    Q5_K_S: 503 -> 884 (+76%)
    Q3_K_S: 629 -> 879 (+40%)
    Q2_K  : 615 -> 809 (+32%)
    Q6_K  : 582 -> 776 (+33%)

Selected pp512 throughput (tok/s, ub=4):

    Q4_K_S: 444 -> 480 (+8%)
    Q4_0  : 682 -> 685 (+0%)  (no regression - retains MMVQ)
    IQ4_XS: 706 -> 698 (-1%)  (no regression - retains MMVQ)

* CUDA: address review: inline MMVQ batch table, drop env hatch & doc block
* tune kernel selection logic for CDNA1

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
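The threshold table in the first commit message can be sketched as a small host-side predicate. This is only an illustration of the described dispatch rule: `quant_type`, `mmvq_max_batch`, and `should_use_mmvq` are hypothetical stand-ins for the ggml types and for `ggml_cuda_should_use_mmvq`, the boolean `amd_mfma` stands in for `amd_mfma_available(cc)`, and the follow-up commits (inlined table, CDNA1 tuning) mean the merged code may look different.

```cpp
#include <cstdint>

// Hypothetical stand-in for ggml_type; only a few representative quants are modelled.
enum class quant_type { Q4_0, Q4_1, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, IQ4_XS };

// Largest batch size for which the per-row GEMV kernel (MMVQ) is preferred.
static int64_t mmvq_max_batch(quant_type type, bool amd_mfma) {
    if (!amd_mfma) {
        return 8; // global threshold, unchanged off AMD MFMA hardware
    }
    switch (type) {
        case quant_type::Q3_K:
        case quant_type::Q4_K:
        case quant_type::Q5_K:
            return 3; // MMQ wins from batch=4
        case quant_type::Q2_K:
        case quant_type::Q6_K:
            return 5; // MMQ wins from batch=6
        default:
            return 8; // legacy and IQ quants keep the global threshold
    }
}

// ne11 is the batch dimension of the activation tensor.
static bool should_use_mmvq(quant_type type, bool amd_mfma, int64_t ne11) {
    return ne11 <= mmvq_max_batch(type, amd_mfma);
}
```

Under these assumptions, a Q4_K matmul with batch size 4 on MFMA-capable hardware would go to MMQ, while the same call on any other device keeps using MMVQ up to batch size 8.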

…#23729)

* mmvq Optim: add MMVQ_PARAMETERS_TURING(mmvq_parameter_table_id) for SM75 TURING
* avoid a mismatch for JIT compilation of Turing device code for Ampere or newer

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

mtavenrath and others added 18 commits May 28, 2026 12:21

vulkan: Fix memory logger unsafe iterator access (#23667)

91eb8f4

vulkan: fix wrong index variable in inner loop (#23665)

7c48fb8

chat : add Granite 4.1 chat template (#23518)

bb771cb

vulkan: fast path for walsh-hadamard transform (#23687)

48e7078

* vulkan: fast path for walsh-hadamard transform
* disable for intel due to segfault

ggml: auto apply iGPU flag CUDA/HIP if integrated device (#23007)

30af6e2

test-llama-archs: fix table format [no release] (#23810)

d374e71

arg: Add LLAMA_ARG_API_KEY_FILE environment variable for --api-key-file (#23167)

7fb1e70

ci : change Vulkan builds to Release to reduce ccache (#23820)

dd15579

* ci : disable all CPU variant builds for Vulkan workflow
* cont : change cache key
* cont : change build type

mtmd: fix gemma 4 audio rms norm eps (#23815)

d6be315

* mtmd: fix gemma 4 audio rms norm eps
* Update tools/mtmd/clip.cpp

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

mtmd: n_head_kv defaults to n_head (#23782)

0b56d28

removed AI-generated comment

app : improve help output (#23805)

479a9a1

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

ci : releases use Github-hosted builds for the UI (#23823)

445b7ce

* ci : releases use Github-hosted builds for the UI
* cont : fix name

ui: fix audio and video modality detection (#23756)

2f6c815

When model props are fetched asynchronously from the server, modelPropsVersion is incremented to trigger reactivity, but only the vision effect was listening to it.

pull Bot locked and limited conversation to collaborators May 28, 2026

pull Bot added the ⤵️ pull label May 28, 2026

pull Bot merged commit 2f6c815 into CrazyForks:master May 28, 2026

github-actions Bot added the Nvidia GPU, testing, examples, python, server, ggml, Vulkan, devops, and Hexagon labels May 28, 2026

github-actions Bot added the server/ui label May 28, 2026


16 participants
