[pull] master from ggml-org:master by pull[bot] · Pull Request #121 · CrazyForks/llama.cpp

pull · 2026-06-02T09:42:35Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )


* server: real-time reasoning interruption via control endpoint

Builds on the manual reasoning budget trigger from #23949. Adds a CONTROL task that mirrors the CANCEL path on the live slot and calls common_sampler_reasoning_budget_force to end thinking mid-generation. POST /v1/chat/completions/control with { id_slot, action }, opt-in reasoning_control arms the budget sampler on demand. Router and single model. Minimal WebUI button as a skeleton for further UI work.

* ui: track reasoning phase via explicit streaming state

Add isReasoning to the chat store, mirroring the isLoading pattern: per conversation map, private setter, public accessor and reactive export. Set from the stream callbacks, true on reasoning chunks, false on the first content chunk, reset on stream end and resynced on conversation switch. The skip button now keys off isReasoning so it shows only during the thinking phase, not the whole generation.

* ui: extract control endpoint and action into constants

Move the chat completion routes, the slots route and the reasoning control action out of chat.service into api-endpoints and a dedicated control-actions module. No behavior change, drops the magic strings so the control protocol has a single source of truth.

* server: target reasoning control by completion id

Address @ngxson review on the control endpoint. Switch from id_slot to the chat completion id to avoid a TOCTOU: the slot can be reassigned between the lookup and the control request, so matching the live completion (oaicompat_cmpl_id) is safe and a finished one simply matches nothing. Rename the action to reasoning_end, guard it on the reasoning_control flag of the target slot, and reduce the response to {success} with an optional message.

* ui: target reasoning control by completion id

Keep the streamed completion id on the message and post it back to the control endpoint instead of probing /slots. Drops the slot discovery and the TOCTOU that came with it. Action renamed to reasoning_end, response read as {success}.

* server: address review from @ngxson

Move the control fields into task_params and drop the redundant comments on the control path.

* server: document the reasoning control endpoint

* Update tools/ui/src/lib/types/database.d.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* ui: rename cmplId to completionId

Per @allozaur review, clearer name for the streamed completion id.

* ui: wire completion id capture through the agentic flow

The webui streams through the agentic flow, which relayed onModel but not onCompletionId, so the completion id never reached the message and the control request was never sent. Relay it through the flow and its callbacks type, declare id on the chunk type, and log an explicit error when the button fires without a usable id.

* ui: target reasoning control model from the message

The model is a property of the completion, so read it from the streaming message like the id, not from the model dropdown which is unrelated UI state. Makes the request self-consistent by construction instead of just unlikely to drift.

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
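The switch from id_slot to the completion id can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the server's code: the `Slot` class, its field names, the `handle_control` function and the message strings are all hypothetical; only the `reasoning_end` action, the `reasoning_control` flag and the `{success}` response shape with an optional message come from the commit messages above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Slot:
    """Hypothetical stand-in for a server slot (field names are illustrative)."""
    slot_id: int
    completion_id: Optional[str] = None  # id of the live completion, None when idle
    reasoning_control: bool = False      # opt-in flag carried by the original request
    reasoning_forced_end: bool = False   # set once thinking has been ended


def handle_control(slots: List[Slot], completion_id: str, action: str) -> Dict[str, object]:
    """Apply a control action to the slot that is running `completion_id`.

    Matching on the completion id rather than the slot index means a request
    for a finished completion matches nothing, even if the slot it used to
    occupy has since been reassigned to a different request.
    """
    if action != "reasoning_end":
        return {"success": False, "message": "unknown action"}
    for slot in slots:
        if slot.completion_id is not None and slot.completion_id == completion_id:
            if not slot.reasoning_control:
                return {"success": False, "message": "reasoning control not enabled"}
            slot.reasoning_forced_end = True
            return {"success": True}
    return {"success": False, "message": "no live completion with this id"}
```

In this sketch a stale request cannot touch an unrelated generation: once slot 0 is reassigned to a new completion, a control request carrying the old completion id returns a failure instead of ending the new request's reasoning.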

…tions for latest models (#23989)

* hex-mm: initial support for F32 * F32 -> F32 matmuls

* hex-rms-norm: fix src1 stride use in fused rms_norm_mul

* hex-ops: clear spad pointers in the ops that clobber them

This fixes an odd case where fused rms-norm-mul was failing, but only in qwen3.5-2B and only at certain op-batch sizes.

* hmx-mm: add support for F32 * F32 -> F32 matmul_2d on HMX

Decided to use the Q4_0 * F32 -> F32 matmul for this. Q4_0 gets dequantized and tiled into F16, and here we quantize and tile F32 into F16. Super simple and pretty efficient.

* hmx-mm: route f16 2D matmuls through the same kernel used for all other types

* hmx-mm: re-introduce the pipelined vs non-pipelined mode that we used to have, but in a much more generic way

This update further improves matmul performance and at the same time removes most of the redundant logic we had in different paths.

* hmx-fa: slightly improved pipeline similar to the matmul updates

* hmx-mm: initial version of MUL_MAT_ID support for HMX

* hmx-mm: fixed mxfp4 handling for MUL_MAT_ID

* hex-gdn: optimize GATED_DELTA_NET DMA prefetch/double-buff, vectorize everything with HVX, in other words -- the usual :)

* hmx-mm: missed one more case where we can use fastmod

* hexagon: update DCVS settings for a slight perf bump

* hmx-fa: use fastdiv in hmx-flash-attn

* hmx-fa: precompute slope values to avoid disrupting the inner loop

* hvx-utils/fa: new HVX helpers for powf and logf, and use those to speed up FA alibi

* hex-ops: fixed a bug in fusion logic that was messing up the order of the src tensors when some srcs are empty

* hex-fa: correctly fall back to HVX if we have sinks or the dims are not quite right
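For readers unfamiliar with the fastdiv and fastmod helpers mentioned above, the general idea is to replace division by a loop-invariant divisor with a precomputed multiply and shift. The sketch below shows one well-known scheme (Granlund-Montgomery style) in plain Python; whether the Hexagon backend uses exactly this formulation is an assumption, and the function names and signatures here are illustrative only.

```python
from typing import Tuple


def init_fastdiv(d: int) -> Tuple[int, int]:
    """Precompute (mp, shift) for dividing 32-bit unsigned values by d."""
    if not 0 < d < 2**32:
        raise ValueError("divisor must be in [1, 2**32)")
    shift = (d - 1).bit_length()  # ceil(log2(d)); 0 when d == 1
    mp = ((1 << 32) * ((1 << shift) - d)) // d + 1
    return mp, shift


def fastdiv(n: int, mp: int, shift: int) -> int:
    """Compute n // d for 0 <= n < 2**32 using a multiply-high, add and shift."""
    hi = (n * mp) >> 32  # high 32 bits of the 64-bit product
    return (hi + n) >> shift


def fastmod(n: int, d: int, mp: int, shift: int) -> int:
    """Compute n % d from the fast quotient."""
    return n - fastdiv(n, mp, shift) * d
```

The precomputation runs once per divisor (for example once per tensor dimension), so the inner loop pays only for a multiply, an add and a shift. In fixed-width C code the intermediate sum `hi + n` needs care to avoid overflow; Python integers sidestep that here.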


* feat: support step3.7

* fix: register Step-3.7 BPE pre-tokenizer hash

* delete fromjson

* register step3.7 arch to Step35Model

* drop vit projector in base filter

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* restore blank line

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

…nts for Chat Form Add Action UI (#23434)

* feat: Add "Thinking" toggle and status icon + redesign Chat Form Actions Add panel

* test: Update test reference

* fix: Icon

* fix: E2E test command

* fix: wait for greeting h1 to be visible in e2e test

* fix: remove duplicate PDF option in attachment dropdown

* fix: use label-based group toggle to avoid stale references

* refactor: inline MCP server and tool toggles in mobile sheet

* fix: serve correct build directory in e2e playwright config

* feat: add reasoning effort levels selector in model dropdown

* feat: Reasoning effort

* refactor: Make server origin configurable via environment variable

* feat: Add chat template thinking detector utility

* feat: Add thinking support detection to models store

* refactor: Update model selector components with thinking detection and message-specific indicators

* feat: Update chat form components for model selection and thinking support

* feat: Improve Reasoning controls UI

* refactor: Apply suggestions from code review

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* fix: Model tags

* refactor: Cleanup

* refactor: Remove unneeded components

* refactor: Cleanup

Previously, error-to-string conversion was split across two different files: one function converted errors into strings, and another analyzed those strings to generate yet another string. Now the error handling for network fetches has been centralised and uses HTTP error codes directly wherever possible to generate the human-readable error strings. It also fixes an issue where all JSON errors reported from the backend, such as "Invalid API key", were incorrectly turned into "Failed to connect to server" due to poor matching logic in the now-gone getErrorMessage function.
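A minimal sketch of the centralised approach described above, written in Python for illustration rather than in the UI's own language. The function name `describe_fetch_error`, the status-to-message table and the assumed `{"error": {"message": ...}}` body shape are all hypothetical; only the "Invalid API key" and "Failed to connect to server" strings come from the description.

```python
import json
from typing import Optional

# Hypothetical fallback messages keyed by HTTP status code.
STATUS_MESSAGES = {
    401: "Unauthorized: check your API key",
    404: "Endpoint not found",
    500: "Internal server error",
    503: "Service unavailable",
}


def describe_fetch_error(status: Optional[int], body: Optional[str] = None) -> str:
    """Turn a failed fetch into one human-readable string.

    `status` is None only when no HTTP response was received at all, so a
    connection failure is never inferred by pattern-matching error text.
    """
    if status is None:
        return "Failed to connect to server"
    if body:
        try:
            payload = json.loads(body)
        except ValueError:
            payload = None  # not JSON, fall back to the status code below
        if isinstance(payload, dict):
            err = payload.get("error")
            if isinstance(err, dict) and isinstance(err.get("message"), str):
                return err["message"]  # prefer the backend's own message
    return STATUS_MESSAGES.get(status, f"Request failed with HTTP status {status}")
```

With this shape, a 401 response whose body carries "Invalid API key" is reported as such, and the connection-failure string is reserved for the case where the request never produced a response.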

* docs: update HOWTO-add-model.md with new model registration and graph-building instructions

* docs: improve formatting in HOWTO-add-model.md

* Update docs/development/HOWTO-add-model.md

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

yomaytk and others added 12 commits June 1, 2026 16:59

revert to using global_invocation_id for cpy shader (#23955)

b8275a8

opencl: fix compiler warnings for non-adreno path (#23922)

210a657

* opencl: fix compiler warnings for non-adreno path

* opencl: fix const cast warning

clean up unused variables warnings (#23975)

1fd5f48

hexagon: add gelu_quick (#24007)

d178a11

llama : deprecate llama_set_warmup (#24009)

4f3a4be

* llama : deprecate `llama_set_warmup`

* cont : fix type

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>

---------

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>

kv-cache : SWA checkpoints store only non-masked cells (#23981)

2365315

pull Bot locked and limited conversation to collaborators Jun 2, 2026

pull Bot added the ⤵️ pull label Jun 2, 2026

pull Bot merged commit d5ab083 into CrazyForks:master Jun 2, 2026

github-actions Bot added the documentation, Nvidia GPU, testing, examples, python, server, ggml, OpenCL, WebGPU, Hexagon, and server/ui labels Jun 2, 2026

pull[bot] merged 12 commits into CrazyForks:master from ggml-org:master

11 participants
