Sync Fork 18-04-2026 by OsamaMazhar · Pull Request #4 · OsamaMazhar/llama.cpp

OsamaMazhar · 2026-04-17T22:41:50Z

Overview

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:

* restore SYCL build and release, remove github cache
* modify for test only
* verify the ccache is used
* remove debug code change
* rm duplicate action, update key in ccache
* add action ccache-clear after building in both ubuntu and windows
* set %NUMBER_OF_PROCESSORS% in windows build


* llama : enable layer input extraction
* spec: support eagle3
* eagle3: fix params bug
* eagle3: support Gemma4 eagle3 from RedHatAI
* eagle3: set sync when get features from target
  Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com>
* eagle3 : fix ubatch handling in embd_layer_inp extraction and encoder
  Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>
* eagle3: adapt to upstream changes
* eagle3: fix rebase issues and adapt to upstream changes
* eagle3:exclude the eagle3 arch from test-llama-archs
* eagle3: fix editorconfig check failures
* eagle3: fix multi-seq issue in d2t vocab mapping
* cont : minor style / clean-up
* spec : remove `common_speculative_setup_draft_model()`
* llama : clean-up unused API
* eagle3: set d2t vocab mapping in decode graph
* cont : assert layer inputs are configured
* hparams : use n_embd_inp instead of n_embd_target_features
* eagle3: make output.weight optional and inherit from target model when needed
* hparams : generic norm-before-residual param
* llama-ext : consistent names
* cont : fix
* hparams : remove target_hidden_size
* cparams : rename output_layer_inp -> embeddings_layer_inp
* arch : reuse ATTN_NORM_2 instead of adding new hidden norm
* llama : clean-up names
* cont : add assert + comment
* Update conversion/llama.py
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: tnhnyzc <115956684+tnhnyzc@users.noreply.github.com>
Co-authored-by: Doğaç Eldenk <dogacel@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* ui: bake jpeg exif orientation into uploaded images

  stb_image in mtmd ignores exif metadata, so rotated smartphone photos reach the model with raw pixel orientation. The webui now reads the exif orientation tag at send time and feeds it into the existing capImageDataURLSize canvas pass: the browser applies the rotation when decoding, so capped images come out upright for free, and images under the cap threshold get a single plain redraw when orientation > 1. At most one re-encode ever happens per image. Upright jpegs with capping disabled pass through untouched, bit perfect.

  Adds jpeg-orientation.ts with a minimal exif parser working on a bounded base64 prefix (both endianness, returns 1 on any malformed input) and unit tests against handcrafted jpeg byte streams.
* ui: move jpeg exif constants into lib/constants
* ui: add browser test for jpeg orientation and capping

  Covers capImageDataURLSize end to end in chromium with real Pillow generated jpeg fixtures across exif orientations 1/3/5/6/8: upright quadrant colors checked pixel-wise, expected dimensions with and without capping, no orientation tag left in the output, and strict passthrough when nothing needs rewriting.
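The parser described in the commit message above lives in TypeScript (jpeg-orientation.ts) and works on a base64 prefix. As a rough illustration of the same idea, here is a minimal Python sketch operating on raw bytes. The function names `jpeg_exif_orientation` and `make_test_jpeg` are hypothetical and do not come from the PR; the sketch only assumes the standard JPEG/EXIF layout (APP1 segment, TIFF header, IFD0 entry with tag 0x0112) and, like the original, returns 1 on any malformed input.

```python
import struct

ORIENTATION_TAG = 0x0112  # standard EXIF tag id for image orientation


def _orientation_from_tiff(tiff: bytes) -> int:
    # The TIFF header starts with the byte order: "II" little, "MM" big endian.
    if tiff[:2] == b"II":
        endian = "<"
    elif tiff[:2] == b"MM":
        endian = ">"
    else:
        return 1
    magic, ifd_offset = struct.unpack(endian + "HI", tiff[2:8])
    if magic != 42:
        return 1
    (count,) = struct.unpack(endian + "H", tiff[ifd_offset:ifd_offset + 2])
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, typ, n = struct.unpack(endian + "HHI", tiff[entry:entry + 8])
        if tag == ORIENTATION_TAG:
            (value,) = struct.unpack(endian + "H", tiff[entry + 8:entry + 10])
            return value if (typ == 3 and n == 1 and 1 <= value <= 8) else 1
    return 1


def jpeg_exif_orientation(data: bytes) -> int:
    """Return the EXIF orientation (1..8), or 1 if absent or malformed."""
    try:
        if data[:2] != b"\xff\xd8":  # missing SOI marker: not a JPEG
            return 1
        pos = 2
        while pos + 4 <= len(data):
            if data[pos] != 0xFF:
                return 1
            marker = data[pos + 1]
            (seg_len,) = struct.unpack(">H", data[pos + 2:pos + 4])
            if seg_len < 2:
                return 1
            if marker == 0xE1 and data[pos + 4:pos + 10] == b"Exif\x00\x00":
                return _orientation_from_tiff(data[pos + 10:pos + 2 + seg_len])
            if marker == 0xDA:  # start of scan: no metadata after this point
                return 1
            pos += 2 + seg_len
        return 1
    except struct.error:  # truncated input ends up here
        return 1


def make_test_jpeg(orientation: int, endian: str) -> bytes:
    """Handcraft a tiny JPEG-like stream holding only an orientation tag."""
    tiff = (b"II" if endian == "<" else b"MM") + struct.pack(endian + "HI", 42, 8)
    tiff += struct.pack(endian + "H", 1)  # one IFD entry
    tiff += struct.pack(endian + "HHIHH", ORIENTATION_TAG, 3, 1, orientation, 0)
    tiff += struct.pack(endian + "I", 0)  # no next IFD
    payload = b"Exif\x00\x00" + tiff
    return b"\xff\xd8\xff\xe1" + struct.pack(">H", len(payload) + 2) + payload + b"\xff\xd9"
```

Calling `jpeg_exif_orientation(make_test_jpeg(6, "<"))` exercises the little-endian path; passing `">"` exercises the big-endian one. The sketch skips details a production parser may need, such as fill bytes between segments.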

Tests are generally prefixed with -test, so rename export-graph-ops accordingly. rpc-server is probably too generic a name for /usr/bin. Because it should work with any ggml application, it is renamed to ggml-rpc-server.

* [CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy

  Add a CUDA ggml_cpy fast path for same-type, same-shape strided copies that are just 2D pitched block copies. When tensors are not fully contiguous but each row is contiguous, it now uses cudaMemcpy2DAsync instead of the slow element-wise scalar copy kernel. This fixes the GDN recurrent snapshot update with -np 4, where rollback slots are separated by cache stride gaps.
* Add new tests that execute the new optimized strided copy path
* Return unsupported for strided copy in OpenVINO, as new tests are failing
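To make the "2D pitched block copy" idea concrete, the following Python sketch models it on plain byte buffers. It is not the PR's code: `is_pitched_2d_copy` is a hypothetical eligibility check written for a two-dimensional case only, and `memcpy_2d` merely mimics the row-by-row semantics of `cudaMemcpy2DAsync` (destination pitch, source pitch, row width in bytes, row count) on the host.

```python
def is_pitched_2d_copy(elem_size, ne0, src_nb, dst_nb):
    """Hypothetical check: rows are contiguous, but rows may be separated by gaps.

    elem_size: bytes per element (same type on both sides is assumed)
    ne0:       elements per row (same shape on both sides is assumed)
    src_nb, dst_nb: (stride between elements, stride between rows), in bytes
    """
    row_bytes = ne0 * elem_size
    return (
        src_nb[0] == elem_size and dst_nb[0] == elem_size  # elements packed within a row
        and src_nb[1] >= row_bytes and dst_nb[1] >= row_bytes  # rows do not overlap
    )


def memcpy_2d(dst, dpitch, src, spitch, width, height):
    """Host-side model of a pitched copy: `height` rows of `width` bytes each."""
    for row in range(height):
        dst[row * dpitch:row * dpitch + width] = src[row * spitch:row * spitch + width]


# Two rows of 4 bytes, stored with a 6-byte pitch (2 bytes of gap after each row).
src = bytes(range(12))
dst = bytearray(8)
if is_pitched_2d_copy(elem_size=1, ne0=4, src_nb=(1, 6), dst_nb=(1, 4)):
    memcpy_2d(dst, 4, src, 6, 4, 2)
```

After the copy, `dst` holds the two rows packed back to back with the gap bytes dropped. That is the situation the commit message describes for rollback slots separated by cache stride gaps.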

* opencl: rework FA kernel for f16 and f32
* opencl: flash-attention prefill prepass kernels
  - flash_attn_kv_pad_f16 pads the tail KV tile to a BLOCK_N multiple
  - flash_attn_mask_pad_f16 pads the matching mask tile
  - flash_attn_blk_f16 classifies each KV tile per query block as fully masked / mixed / fully unmasked, so the main kernel can skip fully-masked tiles and the mask lookup for fully-unmasked ones
* opencl: FA kernels for q4_0 and q8_0
* opencl: `set_rows` for f32 to q8_0/q4_0
* opencl: dequant kernels for q4_0 and q8_0
* opencl: add FA tile tuning table with override
* opencl: wire host side for FA
* opencl: q4_0 MoE tensors are also SOA'ed
* opencl: cosmetic fix
* opencl: refactor, also clarify some code paths in comments
* opencl: fix infinity for `-cl-finite-math-only`

---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
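The prepass described above (padding the KV length to a BLOCK_N multiple and classifying each tile) can be sketched on the host in Python. This is an illustration only, not the OpenCL kernels: the names `pad_to_multiple` and `classify_tiles`, the numeric class codes, and the mask convention (0.0 for visible, negative infinity for masked) are all assumptions made for the example.

```python
import math

NEG_INF = -math.inf
FULLY_MASKED, MIXED, FULLY_UNMASKED = 0, 1, 2  # assumed codes, for illustration


def pad_to_multiple(n, block):
    """Round n up to the next multiple of block (e.g. pad n_kv to BLOCK_N)."""
    return (n + block - 1) // block * block


def classify_tiles(mask, block_m, block_n):
    """Classify every (query block, KV tile) pair of an additive attention mask."""
    n_q, n_kv = len(mask), len(mask[0])
    classes = []
    for q0 in range(0, n_q, block_m):
        row = []
        for k0 in range(0, n_kv, block_n):
            values = [
                mask[q][k]
                for q in range(q0, min(q0 + block_m, n_q))
                for k in range(k0, min(k0 + block_n, n_kv))
            ]
            if all(v == NEG_INF for v in values):
                row.append(FULLY_MASKED)    # the main kernel could skip this tile
            elif all(v == 0.0 for v in values):
                row.append(FULLY_UNMASKED)  # the mask lookup could be skipped
            else:
                row.append(MIXED)           # the mask must be applied per element
        classes.append(row)
    return classes


# 4x4 causal mask: query i may attend to keys 0..i.
causal = [[0.0 if k <= q else NEG_INF for k in range(4)] for q in range(4)]
tile_classes = classify_tiles(causal, block_m=2, block_n=2)
```

For the causal mask above, the tile above the diagonal is fully masked, the tile below it is fully unmasked, and the two diagonal tiles are mixed.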


Expose the existing --offline flag to `llama download` so a script can run it to check whether a model is already cached and ready to be served without touching the network. Also fix a latent use-after-free in the URL-task on_done callback: first_path is block-scoped and was captured by reference, but invoked after the block ends. Signed-off-by: Adrien Gallouët <angt@huggingface.co>


* Add minicpm5 tool call parser
* Refactor MiniCPM5 PEG parser per review feedback
* Fix jinja min/max API to match Jinja2
* modify by review
* MiniCPM5: use autoparser for XML tool calls and fix grammar preserved-token triggers
* MiniCPM5: fix streaming tool-arg placeholder and remove alt XML markers
* skip min/max attribute tests in -py mode
* test-jinja: use real expected output for min/max attribute tests
* MiniCPM5: revert shared mapper and history fallbacks per review

  Drop streaming tool-arg placeholder workarounds from the generic PEG mapper and restore strict tool-call argument JSON parsing so MiniCPM5 support stays limited to autoparser/diff-analyzer changes.
* chat : refactor minicpm5 back to dedicated parser
* cont : simplify grammar
* cont : refactor
* cont : fixes
* cont : rename template to openbmb-MiniCPM5-1B.jinja
* cont : add message delimiters
* cont : fix tests

---------
Co-authored-by: zhangtao <zhangtao2@modelbest.cn>
Co-authored-by: 张涛 <>


* convert: add dsv4 conversion
* add basic setup
* add llm_graph_input_dsv4
* add save-load state
* add sinkhorn eps - correction by @fairydreaming
* add rope fix
* cleanup dead code
* fix bugs
* support pro model: added by @fairydreaming
* remove redundant V cache
* Chat template
* remove debugging leftovers
* Add mechanism for inlining templates based on architecture
* s/deepseek-v4-flash/deepseek4/g
* s/deepseek-v4-flash/deepseek4/g continued
* enable graph reuse
* enable FA
* fix test llama archs
* rename
* compatibility with antirez ds4 GGUFs
* simplified set_gguf_parameters() by calling super class method, replaced moe.score_func with expert_gating_func.
* reserve worst-case kv-cache
* revert max split inputs
* address review comments
* add padding to enable FA
* pad only the final value of plan.n_kv to 256
* remove built-in cpp chat template
* cont: remove cpp built-in template
* rm outdated test
* replace ggml_view_3d() with ggml_reshape_3d()
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* only support n_seq=1 for now
* remove unused var
* cont: remove unused var
* use scale bias
* use correct ptr for can_reuse
* remove gen-chat-inline-templates.py
* simplify graph reuse
* cont: cleanup
* remove unused inputs
* enable partial checkpointing
* add correct shape for kq_mask + set llama_model_n_swa to 0 for dsv4
* precompute source_idx + add comment about dummy write
* support multi-seq
* remove restored_trim_pos
* use split_equal when possible
* fix indent
* address review comments
* use LLM_KV
* fix ci

---------
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: fairydreaming <166155368+fairydreaming@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>


* HIP: keep MMQ for gfx900 MoE and Q8_0, use hipBLAS for dense K-quants
  Assisted-by: GitHub Copilot CLI
* HIP: tighten conditional block to be explicitly for gfx900
* HIP: Further simplified gfx900 conditional block
* removed unnecessary comment


* common,server: handle bracketed IPv6 literals in URL authority

  Parse the [host]:port form (RFC 3986) and bracket IPv6 hosts when formatting a URL authority: listening log, proxy Host header, proxy log, client rebuild. The per-request remote_addr stays bare.
* common: restore unsupported scheme throw in url parser

  Address @ngxson review: keep the explicit reject in port resolution so the block stays self-contained. Non-http(s) schemes still throw (also gated at the top of common_http_parse_url).
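The bracketed form mentioned above can be illustrated with a small Python sketch. It is not the C++ code from common_http_parse_url: `split_authority` and `format_authority` are hypothetical helpers, and the choice to reject an empty or non-numeric port is an assumption of this example rather than something the commit message states.

```python
def split_authority(authority):
    """Split 'host', 'host:port', '[v6]' or '[v6]:port' into (host, port or None)."""
    if authority.startswith("["):
        end = authority.find("]")
        if end == -1:
            raise ValueError("unterminated IPv6 literal")
        host = authority[1:end]  # the host is returned bare, without brackets
        rest = authority[end + 1:]
        if rest == "":
            return host, None
        if not rest.startswith(":"):
            raise ValueError("unexpected text after ']'")
        port_text = rest[1:]
    else:
        host, sep, port_text = authority.partition(":")
        if not sep:
            return host, None
    if not (port_text.isascii() and port_text.isdigit()):
        raise ValueError("invalid port")
    port = int(port_text)
    if port > 65535:
        raise ValueError("port out of range")
    return host, port


def format_authority(host, port=None):
    """Bracket hosts containing ':' (IPv6 literals) when building an authority."""
    text = f"[{host}]" if ":" in host else host
    return text if port is None else f"{text}:{port}"
```

For example, `split_authority("[::1]:8080")` yields the bare host `::1` with port 8080, and `format_authority("::1", 8080)` puts the brackets back. Without the brackets, the last colon of an IPv6 address would be ambiguous with the port separator.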


github-actions Bot added the documentation, Apple Metal, SYCL, Nvidia GPU, testing, examples, devops, python, script, server, ggml, model, OpenCL, Hexagon, WebGPU, server/webui, Vulkan, IBM zDNN, AMD ZenDNN, build, nix, jinja parser, Ascend NPU, OpenVINO, and android labels Apr 17, 2026

github-actions Bot added the server/ui label May 16, 2026

arthw and others added 4 commits June 12, 2026 09:30

ggml: support concat for scalar types at cuda backend (#24011)

85f99dc

* cuda: support concat for scalar types
* Update concat.cu
* fix metal ci issue

ckastner and others added 30 commits June 27, 2026 10:31

vulkan: fix step operator for 0 input (#25036)

0b6529d

sycl : fix failed ut cases of norm (#25044)

9bebfcb

logs : reduce v2 (#25078)

27c8bb4

* server : reduce logs
* cont : common
* cont : spec
* cont : CMN_ -> COM_

spec : add DFlash support (#22105)

d1b3425

* spec: add DFlash v2 support
* dflash: support sliding window attention per layer_types
* docs: add dflash section

---------
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

jinja: add --dump-prog for debugging (#25086)

f68a788

* jinja: add --dump-prog for debugging
* Update common/jinja/runtime.cpp
  Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

---------
Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

dflash: refactor draft model conversion (#25110)

fa72bc6

* dflash: refactor draft model conversion
* apply fix for eagle3 convert

ui: fix stop and reasoning skip in single-model mode (#25084)

7cb8576

Revert "ui: fix accessibility for hover-gated interactive elements assisted by claude(in debugging and tests) (#24727)" (#25098)

dbdaece

jinja, chat: add --reasoning-preserve flag (#25105)

b3fed31

* jinja, chat: add --reasoning-preserve flag
* correct help message

common : remove unused regex-partial (#25118)

277a105

tools/ui: restore Tailwind scanning in ignored worktrees (#24879)

6cb18b2

vulkan: use flops instead of weight tensor size for submission heuristic (#25005)

25a1d63

* vulkan: extract flops calculation into function
* use flops instead of matmul src0 tensor size for submission threshold
* use unsigned ints

common : dedup preset and cached model entries in /v1/models (#25131)

6f4f53f

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

Revert "sched : reintroduce less synchronizations during split compute (#20793)" (#25138)

86b9470

ggml-webgpu: add support for NVFP4 (#25143)

6c5de1c

vulkan: roll bk loop in matmul for asahi linux (#24663)

f708a5b

* vulkan: roll bk loop in matmul for asahi linux
* vulkan: fix inline comment
* vulkan: revert BK-loop unroll change
* vulkan: edit spirv directly for asahi roll bk loop
* vulkan: remove trailing whitespace at the end of comments

CUDA: fix Gemma E4B MTP FlashAttention (#25148)

e495d1e

* CUDA: fix Gemma E4B MTP FlashAttention
* remove unused template declaration

CUDA: fix get_rows_back for tables with more than 65535 rows (grid-y clamp + stride) (#25103)

931eb37
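The commit title names a "grid-y clamp + stride" fix but gives no further detail, so the following is only a guess at the general technique, simulated in Python rather than CUDA. It assumes the usual CUDA limit of 65535 blocks in the y dimension of a grid: the launch clamps the y dimension to that limit, and each block then strides over rows so that tables with more rows are still fully covered. `launch_rows` is a hypothetical name.

```python
MAX_GRID_Y = 65535  # CUDA limit on gridDim.y


def launch_rows(n_rows):
    """Simulate a kernel launch where blockIdx.y indexes table rows.

    Returns the clamped grid size and how many times each row was visited.
    """
    grid_y = min(n_rows, MAX_GRID_Y)      # clamp: never request more than the limit
    visits = [0] * n_rows
    for block_y in range(grid_y):         # one iteration per simulated block
        row = block_y
        while row < n_rows:               # stride: each block handles several rows
            visits[row] += 1
            row += grid_y
    return grid_y, visits


grid_y, visits = launch_rows(70000)
```

With 70000 rows the grid is clamped to 65535 blocks, and the stride loop still visits every row exactly once.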

model : register t_layer_inp for qwen3next (#25141)

4f31eed

* Fix input assignment in layer processing loop
  Fix DFLASH for qwen-coder-next
* add line break
  Added tensor for attention normalization in Qwen3 model.

cuda : prevent integer truncation and overflow errors when using KQ mask strides in flash_attn_mask_to_KV_max kernel (#24945)

0eca4d4

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

opencl: initial q1_0 support (#25160)

fd1a057

* opencl: general q1_0 support
* opencl: add Adreno GEMM/GEMV for q1_0

ui: Remove PWA navigate fallback to prevent caching API endpoint requests (#25174)

7af4279

OsamaMazhar wants to merge 1206 commits into OsamaMazhar:master from ggml-org:master


20 participants
