Sync master with upstream release b9093 by jan-service-account · Pull Request #511 · janhq/llama.cpp

jan-service-account · 2026-05-10T01:10:17Z

Updates dev branch with latest release (b9093) from ggml-org/llama.cpp

Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>

Implement the Gated Delta Net recurrence on HVX with: - 4-row fused kernels for PP (prompt processing) path - 8-row fused kernels for TG (token generation) path, reducing K/Q/gate vector reload overhead by 2x - Separate PP/TG thread functions for I-cache isolation - VTCM state scratchpad with DMA in/out for TG single-cycle access - Vectorized gate exp via hvx_exp_f32

* mimo-v2.5: add flash attention mma/tiles for for d_kq=192 d_v=128 * mimo-v2.5: follow (256, 256) fattn templates * mimo-v2.5: cleanup comments * mimo-v2.5: further comment cleanup * mimo-v2.5: address PR feedback fix GQA handling check for other dangling 320/576 carveouts and mirror them for 192 Add to backend ops test so new paths are covered

…gml-org#22147) * sycl: Battlemage AOT build via spir64_gen + MMQ subgroup annotations Signed-off-by: Chun Tao <chun.tao@intel.com> * Remove unneeded/unnecessary comments and annotations The MMQ subgroup annotations added are on functions gated behind ggml_sycl_supports_mmq(). Revisit the need for these annotations when that function changes. --------- Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>

) * sycl: Q5_K reorder MMVQ/dequant + Q8_0 reorder MMVQ path Signed-off-by: Chun Tao <chun.tao@intel.com> * Remove duplicate definitions --------- Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>

Add GGML_TYPE_BF16 to the SYCL backend's GET_ROWS operation, both in supports_op and in the kernel dispatch. This fixes a performance regression where models using BF16 embedding tensors (e.g., Gemma4's per_layer_token_embd.weight) fall back to CPU for the GET_ROWS op, causing a full GPU-to-CPU tensor transfer every token. The fix reuses the existing get_rows_sycl_float template with sycl::ext::oneapi::bfloat16, matching the pattern already used for sycl::half (F16) and float (F32).

* SYCL: reduce allocation overhead during flash attention * tidy up whitespace * add a note about the flag * move ggml_sycl_fattn_* into fattn-buffers.hpp * refactor implementation into fattn-buffers.cpp * move new_fattn_kv_buffers back into ggml-sycl.cpp

…#22567)

aicss-genai and others added 11 commits May 9, 2026 08:05

sycl: support non-contiguous input in PAD op (ggml-org#22148)

c5703e0

Signed-off-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Chun Tao <chun.tao@intel.com> Co-authored-by: Todd Malsbary <todd.malsbary@intel.com>

cmake : update BoringSSL to 0.20260508.0 (ggml-org#22839)

5757c4d

docker : upgraded the default intel compute-runtime version (ggml-org…

00d56b1

…#22567)

devops : updated Nix systems (ggml-org#22869)

65d7a8b

model : add sarvam_moe architecture support (ggml-org#20275)

1e5ad35

jan-service-account merged commit 1094164 into dev May 10, 2026
17 checks passed

jan-service-account deleted the update-dev-from-master-2026-05-10-01-10 branch May 10, 2026 01:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sync master with upstream release b9093#511

Sync master with upstream release b9093#511
jan-service-account merged 11 commits into
devfrom
update-dev-from-master-2026-05-10-01-10

jan-service-account commented May 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

10 participants

Conversation

jan-service-account commented May 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

10 participants