llama.cpp SYNC by akapoor3518 · Pull Request #45 · tsisw/llama.cpp

akapoor3518 · 2025-09-02T20:24:37Z

llama.cpp SYNC

akapoor3518 · 2026-05-05T19:55:25Z

closing

akapoor3518 · 2026-05-05T19:56:05Z

closing

akapoor3518 · 2026-05-05T20:13:07Z

close

* ui: add opt-in run_javascript frontend tool

Expose a run_javascript tool to the model, executed entirely in the browser through the existing agentic loop. Code runs in a Web Worker inside a sandboxed iframe with an opaque origin, isolated from the WebUI and its API. Console output, errors and the return value are fed back as the tool result. The parent enforces a hard timeout by removing the iframe, which terminates the worker. Disabled by default, toggle in Settings > Developer.

* ui: address review feedback from allozaur

Use the JsonSchemaType enum for the tool definition parameter types instead of raw string literals, extending it with STRING and NUMBER. Move the worker shim and the iframe harness html into their own files so the service no longer carries inline source blobs. Replace the remaining magic strings with constants: SANDBOX_EMPTY_OUTPUT and SANDBOX_TRUNCATION_NOTICE, and reuse NEWLINE_SEPARATOR for joins.

* ui: move sandbox worker shim to a raw imported file

Replace the inline worker template string with a real sandbox-worker.js imported as raw text, and build the iframe harness from it in sandbox-harness.ts. The raw worker ships as a string, not a module, so it is excluded from eslint and the typecheck program.

@gabe-l-hart

… when deepstack is not used (#24357)

* llama-graph : apply embedding scale when deepstack is not used
* nits: remove non-existent hunyuan-vl from the tests
* apply suggestion from @gabe-l-hart

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* webui: implement pinned conversations support * webui: linter/prettier pass * Fix the unused handleMobileSidebarItemClick from the component. * the search should find pinned conversations as well Co-authored-by: Pascal <admin@serveurperso.com> --------- Co-authored-by: Pascal <admin@serveurperso.com>

This is a non-functional change. When using `--spec-type ngram-map-k4v`, the log messages at startup and runtime say `ngram-map-k`. Added logic in the constructor of `common_speculative_impl_ngram_map_k` to pass the correct `COMMON_SPECULATIVE_TYPE_NGRAM_MAP_K4V` when `config.key_only` is `false`. After this change, the log messages use the correct name.


…state check (#23082)

* mamba2: remove hardcoded 2x expansion factor, support any expand value
* mamba2: remove invalid d_inner % d_state check (unrelated parameters)
* Update convert_hf_to_gguf.py: make expand optional with default 2

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* mamba2: apply expand fix to refactored conversion/mamba.py
* also check for mamba_expand

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>
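A minimal sketch of the conversion-side idea described above: the inner dimension is derived from an `expand` value instead of a hardcoded factor of 2, `expand` is optional with a default of 2, and `mamba_expand` is also checked. The function name and the `hidden_size` key are assumptions made for illustration; the actual conversion code may be organized differently.

```python
def resolve_d_inner(hparams: dict) -> int:
    """Derive d_inner from the model width and an optional expansion factor.

    Hypothetical helper: the `hidden_size` key and this function name are
    assumptions, not taken from convert_hf_to_gguf.py.
    """
    d_model = hparams["hidden_size"]
    # Prefer `expand`, fall back to `mamba_expand`, then to the default of 2.
    expand = hparams.get("expand", hparams.get("mamba_expand", 2))
    return int(expand * d_model)


print(resolve_d_inner({"hidden_size": 1024}))
print(resolve_d_inner({"hidden_size": 1024, "mamba_expand": 1.5}))
```

Note that nothing here relates d_inner to d_state, in line with the commit's remark that the two parameters are unrelated.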


* mtmd: add more validations * fix * refactor a bit * type check for get_arr_int

@ngxson

* server: SSE replay buffer, survives client disconnect

Opt in on POST /v1/chat/completions when the client sends X-Stream-Resume: 1 and a non empty X-Conversation-Id. The conv id is the session identity end to end, no extra opaque token. The drain runs detached server side and buffers SSE bytes, the generation survives HTTP disconnect, F5, or lets users switch from iOS Safari to another app without losing the actively generated response.

Routes:

    GET    /v1/stream/<conv_id>?from=N        replay
    GET    /v1/streams[?conversation_id=X]    list, drives sidebar spinners
    DELETE /v1/stream/<conv_id>               Stop, idempotent

Router parent fans out to children for list and delete, probes on GET to route to the owner, fans out DELETE on POST so "one session per conv" holds across model swaps.

WebUI: the layout snapshots /v1/streams at mount and on visibilitychange, the sidebar reflects live inferences across all convs. The chat page reattaches on mount, append vs fresh is detected from existing content so continue mid stream keeps its prefix.

update_slots: on llama_memory_seq_rm refusal at a deep position, full clear of the seq and reprefill from zero instead of GGML_ABORT.

OAI strict path unchanged when the opt in headers are absent.

* server: create stream session only after post_tasks succeeds
* server, ui: drop X-Stream-Resume, X-Conversation-Id alone enables the replay buffer
* server: drop magic 17, derive the X-Conversation-Id header length from sizeof at build time
* refactor: address review feedback from ngxson
* server-context: cleaning
* server-stream: fix use-after-free on rd

Guard stop_producer with a shared alive flag, flipped by on_stream_end before rd dies. Prevents a late cancel (session eviction by a later POST on the same conv_id, or a DELETE arriving after the producer ended) from touching a destroyed rd.
* ui: fix cross-conversation contamination Scope streaming flags per conv so one finishing does not unflag the others, guard discoverActiveStream against concurrent runs to avoid duplicate attaches, and stop racing syncRemoteRunningStreams for the sidebar set. * server-http: keep request alive in detached SSE drain The response next() lambda may reach into *request via &req long after on_complete reset the request shared_ptr. Capture request in the detached thread so it outlives the drain. * ui: address review feedback from coder543 Forward Authorization to /v1/stream and /v1/streams fetches, the resumable routes must obey --api-key like the rest of the API. Wrap reader.read() in a try/catch, the underlying connection drop rejects with TypeError instead of resolving done=true, treat it as a premature end of stream so the existing resume loop kicks in. Freeze the model at session start in chatStreamingStates.model and thread it through cancel and resume, the dropdown selection may have changed since the POST and the server side identity is fixed at that time. * format * ui: remove unused selectedModelName * server-stream: poll session->is_cancelled() in stream_aware_should_stop Address review feedback from coder543. The cancel propagation through rd.stop() relies on the slot eventually processing the cancel task and posting a result that notifies the recv condvar, remove_waiting_task_ids does not notify directly. Add a defensive poll on session->is_cancelled() so the producer-side next() loop exits on its next iteration after cancel() without waiting for the cancel task to round trip through a slot. * server-stream, ui: replace GET /v1/streams with POST /v1/streams/lookup Address review feedback from coder543. Listing live sessions leaks the conversation_id of every concurrent user, which defeats the random UUID unguessability. 
The new route takes {conversation_ids: [...]} in the body and returns matches only for the ids the caller already owns, so foreign UUIDs stay private. The router fans out the same POST to every child and aggregates, the WebUI passes the convs visible in its sidebar. * ui: read conv ids from IndexedDB in syncRemoteRunningStreams The conversations store is not hydrated yet at +layout onMount, so the sidebar spinners stayed off for background convs until the user clicked on them. Read straight from the DB to dodge the init race. * server-models: deduplicate stream lookup timeouts behind one constant * ui: extract visibility kick grace into a stream constant, bump to 1000 ms * make it safer & more simple * server-stream: survive client disconnect via stream_pipe::finish_producer After the RAII rewrite the generation stopped the moment the client disconnected. httplib bails its content provider on the is_peer_alive check at the top of write_content_chunked, so returning true from the provider never keeps it producing: the response resets, rd is destroyed and its task gets cancelled. Reinstate the disconnect survival inside the pipe. stream_pipe gains finish_producer, which pumps the response next() into the ring buffer until the generation ends, and mark_producer_done for the clean wire end. server-http only triggers them: mark before sink.done on a clean close, finish in on_complete when the peer left early. No detach, no stream logic in server-http beyond the trigger, and the strict OAI path is untouched when no pipe is attached. Known limitation: finish_producer pumps synchronously on the http worker, so a disconnected stream keeps its worker busy until the generation ends. A follow-up will move the drain off the http worker so no worker is held. 
* server-stream: drain disconnected streams on a manager owned thread The previous commit pumped the post disconnect drain synchronously in on_complete, on the http worker, so a disconnected stream kept its worker busy until the generation ended. Under a wave of reloads or tab closes that pins workers from the pool. Move the drain off the http worker. on_complete now hands the response to stream_session_manager::adopt_orphan, which pumps it to completion on a manager owned thread and releases the worker at once. One thread per disconnected stream still generating, stored in a list, joined and reaped on the next adopt, by the GC, and at shutdown. No detach, the thread lifecycle is fully owned by the manager. needs_drain gates the handoff so a cleanly finished stream never spawns a thread, and the strict OAI path stays untouched when no pipe is attached. stop_gc now cancels sessions before finalizing them, so an in flight drain sees is_cancelled and exits instead of blocking the shutdown join until the generation ends naturally. * ui: add missing JSDoc * server-stream: drain on the http worker, drop the manager thread Address @ngxson review: httplib runs a large dynamic pool and a worker blocked in next() sits on a condvar instead of burning cpu, so draining the rest of the generation on that worker is fine and much simpler than a dedicated thread. on_complete calls finish_producer directly again. Removes adopt_orphan, the orphan thread list and its reaping, the stop_gc session cancel that only existed to unblock those threads, and the now dead drain_shutdown flag. * server-stream: split stream_pipe into producer and consumer classes Address @ngxson review: one class covering both ends was messy. stream_pipe is now a base holding the session and is_cancelled, with stream_pipe_producer (write, mark_producer_done, finish_producer, cleanup, finalizes on destruct) and stream_pipe_consumer (read only, no finalize) deriving from it. 
Drops the is_producer_ discriminator and its runtime guards, the type now encodes the role. res.spipe is retyped to shared_ptr<stream_pipe_producer> since it is only ever a producer. No behavior change. * server-stream: rename producer methods to unix pipe semantics Address @ngxson review: mark_producer_done becomes done(), finish_producer becomes close(), matching a unix pipe write end. The producer_done_ member follows as done_. write() is unchanged. No behavior change. * server, ui: route resumable streams via a conv map, persist resume identity Address ngxson review: drop the polling probe, proxy_post records a conv_id -> model map and the stream routes resolve the owning child with one lookup. The map is the single source of truth, the ::model suffix stays for child session uniqueness but the router never parses it. UI: the server keys a session by the POST time identity (conv::model), but reload probed with the bare conv id and missed model tagged sessions, so F5 stopped the stream and sidebar spinners stayed off. Persist the model and rebuild the exact identity on resume, single conv and bulk sidebar both send it. Add unit coverage for the identity round trip. * ui: resolve continue target by id to stop cross-conversation flash on switch * ui: skip stream resume when the abort is intentional * server: move the conv id to model map into a self contained tracker Address review from ngxson: server_models held two mutexes side by side, the global one and a bare conv_model_mu guarding a loose map, which made the locking hard to follow. Wrap the map and its lock in a small conv_model_tracker struct that owns its mutex, one mutex per struct. The remember, lookup and forget methods move inline into the tracker, server_models exposes a single conv_models member and the routes call models.conv_models.lookup and friends. No behavior change, the map stays the single source of truth for routing resumable streams to a child. 
* ui: replace stream magic values with enums and shared constants Address review from allozaur: lift the inline literals around the resumable stream code into named symbols so the intent is explicit and reusable. * ui: fold the stream resume and discovery helpers into ChatService Address review from allozaur: drop the two standalone stream-*.service files. They were used only by the chat service and store, carried no shared state, and did not follow the static class pattern the other services use, so a separate abstraction was not warranted. Move the helpers onto ChatService as static methods. No behavior change, tests now exercise them through ChatService. * docs: document the SSE replay buffer in server README-dev Add the resumable streaming section, list stream_session_manager in the backend component inventory, and link PR 23226 in the related PRs. * ui: align attachServerStream call with onCompletionId param in handleStreamResponse * server-http: rename del_ to del to match get and post * ui: address review feedback from allozaur * ui: drop duplicate SSE constants, keep sse.ts canonical * ui: use svelte:document for the visibilitychange listener address review from allozaur: replace the manual document.addEventListener in onMount with a declarative <svelte:document onvisibilitychange>. svelte handles attach, detach and SSR, so the typeof document guard and the onMount cleanup go away. onMount keeps only the first load snapshot. * server: trim redundant stream drain comments Address review from ngxson * server: balance and clean up stream comments remove redundant comments and tighten the verbose ones across the resumable stream code, keeping the concurrency and lifetime rationale that is not obvious from the code. 
also fix two stale comments in server.cpp and server-models.h that still described the old ::model suffix probe and fan out routing, now replaced by the conv_id -> model map Address review from ngxson * ui: balance and clean up stream comments dedup repeated rationale (frozen conv::model identity, the lookup privacy note, the abort patterns) down to one canonical spot, tighten the verbose blocks, and keep the concurrency and resume-offset reasoning. fix stale comments in stream-identity.ts and chat.service.ts that still described the old loopback probe and fan out routing, now the conv_id -> model map. --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
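To make the replay idea above concrete, here is a minimal sketch of a per-conversation buffer that keeps SSE bytes after the client has gone away and serves them back from an offset. It is not the server's `stream_session_manager`. The class and method names are hypothetical, and treating `from=N` as a byte offset is an assumption, since the commit text only says that SSE bytes are buffered.

```python
from typing import Dict, Optional, Tuple


class _Session:
    """Buffered SSE bytes for one conversation, plus an end-of-stream flag."""

    def __init__(self) -> None:
        self.data = bytearray()
        self.done = False


class ReplayBuffer:
    """Hypothetical replay buffer keyed by conversation id."""

    def __init__(self) -> None:
        self._sessions: Dict[str, _Session] = {}

    def start(self, conv_id: str) -> None:
        # A new POST on the same conv replaces the old session,
        # so there is at most one session per conversation.
        self._sessions[conv_id] = _Session()

    def write(self, conv_id: str, chunk: bytes) -> None:
        # The producer keeps appending even if no client is connected.
        self._sessions[conv_id].data.extend(chunk)

    def finish(self, conv_id: str) -> None:
        self._sessions[conv_id].done = True

    def read_from(self, conv_id: str, offset: int) -> Optional[Tuple[bytes, bool]]:
        # Replay everything after `offset`; None means "no such session".
        session = self._sessions.get(conv_id)
        if session is None:
            return None
        return bytes(session.data[offset:]), session.done

    def stop(self, conv_id: str) -> None:
        # Idempotent: stopping an unknown or already stopped conv is a no-op.
        self._sessions.pop(conv_id, None)
```

A client that reconnects after a reload would call `read_from` with the number of bytes it already received, then keep reading until the second element of the result is `True`.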

* ggml-cpu: fix SVE leftover path in ggml_vec_dot_f32 2D convolutions with kernel size 9 produced different results on SVE enabled ARM devices. After debugging it turned out that ggml_vec_dot_f32 was using data from inactive lanes. Use svmla_f32_m(pg, sum1, ax1, ay1) so inactive lanes retain sum1. * cont : clean-up --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
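The merging behaviour that the fix relies on can be simulated without SVE hardware. The sketch below models a predicated multiply-accumulate in which inactive lanes keep the accumulator value, which is what the `_m` form of `svmla_f32` does. The lane count of 4 and the NaN padding are assumptions chosen for illustration: real SVE vector width is hardware dependent, and NaN simply stands in for whatever an inactive lane might hold.

```python
LANES = 4  # assumed vector width for this simulation only


def mla_merge(pg, acc, a, b):
    """Predicated multiply-accumulate: active lanes get acc + a*b,
    inactive lanes keep acc unchanged (merging predication)."""
    return [s + x * y if p else s for p, s, x, y in zip(pg, acc, a, b)]


def vec_dot(x, y):
    n = len(x)
    acc = [0.0] * LANES
    full = n - n % LANES
    all_true = [True] * LANES
    # Main loop over complete vectors.
    for i in range(0, full, LANES):
        acc = mla_merge(all_true, acc, x[i:i + LANES], y[i:i + LANES])
    # Leftover path: only the first `rem` lanes are active.
    if full < n:
        rem = n - full
        pg = [lane < rem for lane in range(LANES)]
        pad = [float("nan")] * (LANES - rem)  # stand-in for inactive lane contents
        acc = mla_merge(pg, acc, x[full:] + pad, y[full:] + pad)
    return sum(acc)


# Nine elements, as in the kernel size 9 case: one leftover element.
print(vec_dot([float(i) for i in range(1, 10)], [1.0] * 9))
```

Because inactive lanes retain the accumulator, the padding never reaches the final sum. A variant that let inactive lanes contribute would propagate the NaN into the result.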


@jeffbolznv

* vulkan: Workaround compiler bug in conv2d coopmat2 path * apply same workaround to CONV_3D * Apply suggestion from @jeffbolznv


…lus (#24404) * vulkan: add INTEL_PRE_XE2 arch enum and enable coopmat1 on Intel Xe-LPG Plus (1/3, Xe1-ARLH) Co-authored-by: Xia, Jie <jie.xia@intel.com> Co-authored-by: Liu, Russell <russell.liu@intel.com> * Address comments of bf16 and trailing whitespace * Rename INTEL_PRE_XE2 to INTEL_XE1 and remove driver workaround * Add Windows driver check --------- Co-authored-by: Xia, Jie <jie.xia@intel.com> Co-authored-by: Liu, Russell <russell.liu@intel.com>

…rator improvements (#24974) * Update to OV 2026.2.1, Make OV release packages self-contained * Update to OV 2026.2.1, Make OV release packages self-contained * OpenVINO Backend: Remove compute_op_type hardcoded sets (#222) * OpenVINO Backend: Remove compute_op_type hardcoded sets * revert get_op_type removal * OpenVINO backend: enable softmax with sink input * OpenVINO backend: opt mul_mat_id convert process for large size * OpenVINO backend: Modify add_id to support 2D/4D * OpenVINO Backend: Add glu_swiglu_oai * PR review: fix paths * PR review: fix path consistency --------- Co-authored-by: Mostafa <mostafas.main.email@gmail.com> Co-authored-by: Xuejun <Xuejun.Zhai@intel.com>

* arg: fix handling --spec-draft-hf and --hf-repo-v * fix missing mparams.hf_file

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

@ggerganov

* CUDA: Improve performance via less synchronizations between token (#17795) * Adds CPU-to-CUDA copy capability to ggml_backend_cuda_cpy_tensor_async() * Adds function to relax sync requirements between input copies on supported backends (CUDA for now) * Exchanges synchronous copy with async copy function. * Adds macro guards to allow compilation in non-CUDA builds * Reworked backend detection in ggml-backend.cpp to avoid linking conflicts * Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues * Minor cleanup * Makes opt-in to relax use of explicit syncs more general. Backends like vulkan which require a synchronization between HtoD copies and graph execution could also adopt this change now. * Reintroduces stricter check for CPU->CUDA backend async copy via GGML_DEVICE_TYPE_CPU. * Corrects initialization of ggml_backend_sync_mode in ggml_backend_sched_split initialization * Simplifies synchronizations to adhere to `saaasg` pattern. * Apply suggestion from @ggerganov (src->buffer to buf_src) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Apply suggestion from @ggerganov (src->buffer to buf_src) v2 Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Apply suggestions from @JohannesGaessler code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Adds single-GPU synchronizations to multi-GPU settings to fix hip backend pipeline parallel bugs. * Scheduler Hardening: Exclude hip/MUSA from copy_from_host CPU split -> GPU split optimization * Scheduler Hardening: Re-adding original additional synchronizations for non-async backends * Adds disclaimer to hip/musa exclusion of copy_from_host. Highlights that it is out of precaution, but that no perf-impact is visible, and that it can be revisited separately anytime. 
--------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

Tests are generally prefixed with -test, so rename export-graph-ops accordingly. rpc-server is probably too generic a name for /usr/bin. Because it should work with any ggml application, it is renamed to ggml-rpc-server.

* [CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy Add a CUDA ggml_cpy fast path for same-type, same-shape strided copies that are just 2D pitched block copies. When tensors are not fully contiguous but each row is contiguous, it now uses cudaMemcpy2DAsync instead of the slow element-wise scalar copy kernel. This fixes the GDN recurrent snapshot update with -np 4, where rollback slots are separated by cache stride gaps. * Add new tests that execute the new optimized strided copy path * Return unsupported for strided copy in OpenVINO, as new tests are failing
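As an illustration of what a 2D pitched block copy means, the sketch below mimics the row-wise semantics of `cudaMemcpy2DAsync` (destination pitch, source pitch, row width in bytes, row count) on plain byte buffers. It runs on the CPU and is only a model of the copy pattern, not the code in `ggml_cuda_cpy`. The helper name is hypothetical.

```python
def copy_2d_pitched(dst: bytearray, dpitch: int,
                    src: bytes, spitch: int,
                    width: int, height: int) -> None:
    """Copy `height` rows of `width` bytes each.

    Rows start every `spitch` bytes in src and every `dpitch` bytes in dst,
    so gaps between rows (stride padding) are skipped rather than copied.
    """
    if width > dpitch or width > spitch:
        raise ValueError("row width must not exceed either pitch")
    for row in range(height):
        s = row * spitch
        d = row * dpitch
        dst[d:d + width] = src[s:s + width]


# Two rows of 4 bytes stored with a 6-byte stride (2 bytes of gap per row).
src = bytes(range(12))
dst = bytearray(8)
copy_2d_pitched(dst, 4, src, 6, 4, 2)
print(list(dst))
```

The fast path applies when each row is contiguous and the only non-contiguity is a constant gap between rows, which is the situation the commit describes for rollback slots separated by cache stride gaps.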

* opencl: rework FA kernel for f16 and f32
* opencl: flash-attention prefill prepass kernels
  - flash_attn_kv_pad_f16 pads the tail KV tile to a BLOCK_N multiple
  - flash_attn_mask_pad_f16 pads the matching mask tile
  - flash_attn_blk_f16 classifies each KV tile per query block as fully masked / mixed / fully unmasked, so the main kernel can skip fully-masked tiles and the mask lookup for fully-unmasked ones
* opencl: FA kernels for q4_0 and q8_0
* opencl: `set_rows` for f32 to q8_0/q4_0
* opencl: dequant kernels for q4_0 and q8_0
* opencl: add FA tile tuning table with override
* opencl: wire host side for FA
* opencl: q4_0 MoE tensors are also SOA'ed
* opencl: cosmetic fix
* opencl: refactor, also clarify some code paths in comments
* opencl: fix infinity for `-cl-finite-math-only`

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
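The tile classification performed by the prepass can be sketched on the host as follows. This is a plain Python model of the idea, not the OpenCL kernel. It assumes an additive mask in which 0 means visible and negative infinity means masked, and it assumes both dimensions are already multiples of the block sizes (the role the padding kernels play). All names are hypothetical.

```python
import math
from enum import Enum


class TileKind(Enum):
    FULLY_MASKED = 0    # main kernel can skip the tile entirely
    MIXED = 1           # main kernel must read the mask
    FULLY_UNMASKED = 2  # main kernel can skip the mask lookup


def classify_tiles(mask, block_m, block_n):
    """Classify every (query block, KV tile) pair of an additive mask."""
    n_q, n_kv = len(mask), len(mask[0])
    if n_q % block_m or n_kv % block_n:
        raise ValueError("mask must be padded to block multiples")
    kinds = []
    for q0 in range(0, n_q, block_m):
        row = []
        for k0 in range(0, n_kv, block_n):
            masked = sum(
                1
                for q in range(q0, q0 + block_m)
                for k in range(k0, k0 + block_n)
                if mask[q][k] == -math.inf
            )
            if masked == block_m * block_n:
                row.append(TileKind.FULLY_MASKED)
            elif masked == 0:
                row.append(TileKind.FULLY_UNMASKED)
            else:
                row.append(TileKind.MIXED)
        kinds.append(row)
    return kinds


# 4x4 causal mask split into 2x2 tiles.
causal = [[0.0 if k <= q else -math.inf for k in range(4)] for q in range(4)]
for row in classify_tiles(causal, 2, 2):
    print([kind.name for kind in row])
```

For a causal mask, tiles above the diagonal come out fully masked, tiles below it fully unmasked, and only the diagonal tiles are mixed, so most of the mask reads can be avoided.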

* server : reduce logs * cont : common * cont : spec * cont : CMN_ -> COM_

Expose the existing --offline flag to `llama download` so a script can run it to check whether a model is already cached and ready to be served without touching the network. Also fix a latent use-after-free in the URL-task on_done callback: first_path is block-scoped and was captured by reference, but invoked after the block ends. Signed-off-by: Adrien Gallouët <angt@huggingface.co>

akapoor3518 changed the base branch from master to llama.cpp-syn-sept2 September 2, 2025 20:28

github-actions Bot added the documentation, Apple Metal, SYCL, Nvidia GPU, Vulkan, IBM zDNN, testing, build, examples, devops, python, script, android, server, ggml, nix, Ascend NPU, and OpenCL labels Sep 17, 2025

github-actions Bot added the model label Nov 4, 2025

jeffbolznv and others added 7 commits June 9, 2026 13:27

vulkan: reduce iq1 shared memory usage for mul_mm (#24287)

d6d0ce8

mtmd: build_vit batching (#24352)

49f3542

ci : fix windows release (#24369)

e25a32e

tdakhran and others added 30 commits June 25, 2026 18:58

model : Add label for LFM2.5-230M (#25008)

9d5d882

xcframework : disable mtmd video on i/tv/visionos (#25018)

beac530

opencl: flush profiling batch at shutdown for incomplete batches (#25016)

5c7c22c

CUDA: batch out_prod broadcast (dps2>1) path with cublasSgemmBatched (#24426)

f818065

mtmd: add more validations (#25013)

b11f7c1


sycl : clamp softmax input to avoid underflow (#24941)

e7e3f35

CUDA: add cublasSgemmBatched mapping for HIP/MUSA vendor headers (#25033)

2f18fe1

vulkan: Workaround compiler bug in conv2d coopmat2 path (#24924)

9df0680


ui: fix accessibility for hover-gated interactive elements assisted by claude(in debugging and tests) (#24727)

ded1561

vulkan: opt mul_mat_vecq for mi50 (#22933)

487a6cc

ggml : bump version to 0.15.3 (ggml/1550)

96183e9

sync : ggml

e7ea94a

arg: fix handling --spec-draft-hf and --hf-repo-v (#25043)

024930c


devops : add llama in all docker images (#25035)

5d8ccdf

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

app : allow --version, --licenses & --help (#25054)

050ee92

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

tests : fix test-chat-template --no-common option (#25075)

83d385b

ci : add windows-openvino to check-release (#25022)

0275c0f

vulkan: fix step operator for 0 input (#25036)

0b6529d

sycl : fix failed ut cases of norm (#25044)

9bebfcb

logs : reduce v2 (#25078)

27c8bb4



llama.cpp SYNC #45

akapoor3518 wants to merge 4374 commits into tsisw:llama.cpp-syn-sept2 from ggml-org:master


20 participants