Releases · ggml-org/llama.cpp

20 May 04:13

github-actions

b9245

b39a7bf

b9245 Latest


ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (#23349)

macOS/iOS:

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

Assets 30

cudart-llama-bin-win-cuda-12.4-x64.zip
sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6
373 MB 2026-05-20T04:13:18Z

cudart-llama-bin-win-cuda-13.1-x64.zip
sha256:f96935e7e385e3b2d0189239077c10fe8fd7e95690fea4afec455b1b6c7e3f18
384 MB 2026-05-20T04:13:35Z

llama-b9245-bin-310p-openEuler-aarch64.tar.gz
sha256:e4be2f2cc6dc23321949ccdcc062b076e66c5607718c23387534fcf3fec82474
10.4 MB 2026-05-20T04:13:51Z

llama-b9245-bin-310p-openEuler-x86.tar.gz
sha256:cced61b757231bbcd322717753b119e89bf84eef5d6136a184b92601db6e0398
11.2 MB 2026-05-20T04:13:53Z

llama-b9245-bin-910b-openEuler-aarch64-aclgraph.tar.gz
sha256:e3de84ce1a22b0157c8f99939b9609f785737404b55c81f48c28c19c8482709b
10.4 MB 2026-05-20T04:13:54Z

llama-b9245-bin-910b-openEuler-x86-aclgraph.tar.gz
sha256:51f9bcf0e71a91ebff864b39e15572ecbd66da96ef7bbf12e910a2d9e2410425
11.1 MB 2026-05-20T04:13:55Z

llama-b9245-bin-android-arm64.tar.gz
sha256:44bc3220f55b9708a9a5941ae0de28746e9df10e0d0a6b52874645ecec444c3c
62.1 MB 2026-05-20T04:13:57Z

llama-b9245-bin-macos-arm64-kleidiai.tar.gz
sha256:8ca9692e09089e4d6f7e4b84b7304f04181b7d581afb35fe6473b25108d9d62f
8.11 MB 2026-05-20T04:14:00Z

llama-b9245-bin-macos-arm64.tar.gz
sha256:e933326d73810c720da5fb32ea52cc1a0d0af92b3a047aa50050988807143dbc
8.09 MB 2026-05-20T04:14:01Z

llama-b9245-bin-macos-x64.tar.gz
sha256:384d8f55cf24b61cf0dd525ad58349a5f1b50d1755fd25a52e3df3b6ccfecdde
8.14 MB 2026-05-20T04:14:03Z

Source code (zip)
2026-05-20T01:52:21Z

Source code (tar.gz)
2026-05-20T01:52:21Z
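
Each asset above is listed with a `sha256:` digest, so a downloaded archive can be checked before unpacking. The following is a minimal sketch using only the Python standard library; the helper names `sha256_of` and `matches_published` are illustrative and not part of llama.cpp or GitHub tooling.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published: str) -> bool:
    """Compare a file against a digest written as 'sha256:<hex>' (hypothetical helper)."""
    expected = published.removeprefix("sha256:").strip().lower()
    return sha256_of(path) == expected
```

For example, `matches_published("llama-b9245-bin-macos-arm64.tar.gz", "sha256:e933326d73810c720da5fb32ea52cc1a0d0af92b3a047aa50050988807143dbc")` should return `True` for an intact download of that asset, assuming the file sits in the current directory.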

20 May 03:39

github-actions

b9244

b28a2f3

b9244

opencl: add MoE support for q4_k, q5_k, q6_k on Adreno (#23303)

opencl: add q4_k moe support
opencl: add q5_k moe support
opencl: add q6_k moe support
opencl: adjust format

Co-authored-by: Li He <lih@qti.qualcomm.com>


20 May 03:08

github-actions

b9243

17d22a3

b9243

hexagon: add MROPE and IMROPE support in HTP rope op (#23317)


20 May 02:44

github-actions

b9240

57cb35c

b9240

common: fix --help for --verbosity (#23278)


20 May 02:42

github-actions

b9239

7256fce

b9239

common: fix --fit verbosity with --verbosity 4 (#23282)


20 May 03:04

github-actions

b9235

d14ce3d

b9235

llama : MTP clean-up (#23269)

llama : disable equal splits for recurrent memory with partial rollback
spec : re-enable p-min with MTP drafts
spec : re-enable ngram spec in combination with RS rollback
spec : fix ngram-map-* params
spec : fix acceptance logic in combined ngram + draft configs
graph : fix reuse for combined token + embd batches
spec : log parameters for each speculative implementation

add LOG_INF in each constructor with implementation type and parameters
extract device string logic into common_speculative_get_devices_str()
move 'adding speculative implementation' log from init into constructors

Assisted-by: llama.cpp:local pi

spec : extend --spec-default with ngram-map-k4v

Assisted-by: llama.cpp:local pi

minor : fix n_embd log
args : update draft.n_max == 3 + regen docs
spec : relax ngram-mod rejection thold to 0.25 @ 5 low
logs : improve
docs : update speculative decoding CLI argument documentation

Add missing draft model CPU scheduling and tensor override parameters
Update --spec-type to include all available types (excluding draft-eagle3 WIP)
Fix default values to match implementation (n_max=3, n_min=0, p_min=0.0)
Remove deprecated options (spec-draft-ctx-size, spec-draft-replace)
Add environment variables for new parameters

Assisted-by: llama.cpp:local pi

arg : step-back on adding k4v to the default spec config
cont : fix name
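
The notes above mention the draft parameters `n_max`, `n_min`, and `p_min` with defaults 3, 0, and 0.0. As a rough sketch of how such limits are commonly applied in speculative decoding (this is an assumption about the general technique, not llama.cpp's actual implementation), drafting stops once `n_max` tokens are proposed or the draft model's confidence falls below `p_min`, drafts shorter than `n_min` are discarded, and the target model keeps the longest matching prefix. The functions `draft_tokens`, `count_accepted`, and the `propose` callback are hypothetical.

```python
from typing import Callable, List, Tuple


def draft_tokens(
    propose: Callable[[List[int]], Tuple[int, float]],
    context: List[int],
    n_max: int = 3,
    n_min: int = 0,
    p_min: float = 0.0,
) -> List[int]:
    """Greedily draft up to n_max tokens; `propose` is a hypothetical draft-model call
    returning (next_token, probability) for the given context."""
    draft: List[int] = []
    ctx = list(context)
    while len(draft) < n_max:
        token, prob = propose(ctx)
        if prob < p_min:
            break  # draft model is not confident enough to continue
        draft.append(token)
        ctx.append(token)
    # Assumption: a draft shorter than n_min is dropped entirely.
    return draft if len(draft) >= n_min else []


def count_accepted(draft: List[int], target: List[int]) -> int:
    """Number of leading draft tokens that match the target model's own choices."""
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted
```

With a toy `propose` that always returns the context length with probability 0.9, `draft_tokens(propose, [0])` yields three tokens under the defaults, while raising `p_min` above 0.9 yields an empty draft.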


19 May 00:29

github-actions

b9222

9a532ae

b9222

hexagon: add support for TRI op (#22822)

Hexagon: TRI HVX Kernel addition to ggml hexagon HTP ops and context
addressed PR review comments for TRI op
hexagon: clang format
hex-unary: remove merge conflict markers
hex-ggml: remove duplicate op cases (merge conflict)
hex-ggml: fix editor config errors

Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>


18 May 23:16

github-actions

b9221

b734044

b9221

ggml-hexagon: add PAD op HVX kernel (#23078)

ggml-hexagon: add PAD op HVX kernel

Implements GGML_OP_PAD on the Hexagon HTP backend using HVX vectorized
kernels. Supports zero-padding and circular padding across all 4 tensor
dimensions.

hex-ggml: remove duplicate op cases (merge conflict)
hex-pad: fix editorconfig checks and macro alignment

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
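
To illustrate the difference between the zero-padding and circular padding mentioned in the commit above, here is a plain-Python sketch along a single axis. It only demonstrates the padding semantics and is not the HVX kernel; the function `pad_1d` and its `left`/`right` parameters are hypothetical. Applying the same rule along each of the four tensor dimensions in turn would give the multi-dimensional result.

```python
from typing import List


def pad_1d(values: List[float], left: int, right: int, mode: str = "zero") -> List[float]:
    """Pad a 1-D sequence with zeros or by wrapping around (circular)."""
    n = len(values)
    if mode == "zero":
        # New positions are filled with zeros.
        return [0.0] * left + list(values) + [0.0] * right
    if mode == "circular":
        if n == 0:
            raise ValueError("circular padding needs at least one element")
        # Output position i reads from the input index (i - left) wrapped modulo n.
        return [values[(i - left) % n] for i in range(left + n + right)]
    raise ValueError(f"unknown mode: {mode}")
```

For the input `[1, 2, 3]` with `left=1` and `right=2`, zero mode surrounds the data with zeros, while circular mode continues the sequence by wrapping, so the element just before the start is the last input value.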


18 May 19:25

github-actions

b9219

45b455e

b9219

common : remove hf cache migration (#23266)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>


18 May 18:23

github-actions

b9216

1ff0fc1

b9216

ui: Refactor models store, MCP service, and gate logs behind VITE_DEBUG (#23236)

refactor: Scope console logs to DEV + VITE_DEBUG env vars
refactor: skip MCP proxy probe when no server requires it
refactor: suppress expected disconnect errors during MCP client shutdown
refactor: Deduplicate requests
refactor: deduplicate model fetching across ROUTER and MODEL modes
refactor: Clean up models logic
chore: Add .env.example file
refactor: replace client-side CORS proxy probe with server status flag
refactor: Post-review fixes
test: add vitest client setup with API fetch mocks

