Upgrade llama.cpp from b9222 to b9245

claude · claude · commit 90910613dd82 · 2026-05-20T08:54:08.000Z
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9222**
+Current llama.cpp pinned version: **b9245**
 
 ## Upgrading CUDA Version
 
@@ -303,6 +303,24 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
 | ~b9198–b9219 | `ggml/src/ggml-sycl/ggml-sycl.cpp` + `vecdotq.hpp` | SYCL GEMM now falls back to direct MKL for small problems (gemm_flops < 256³); Q6_K dot product refactored to a single scalar fast-path helper `vec_dot_q6_K_q8_1_impl_mmvq_scalar`; internal SYCL backend, no project changes |
 | ~b9219–b9222 | `ggml/src/ggml-hexagon/` + `htp/pad-ops.c` (new) + `htp/unary-ops.c` | Hexagon HTP backend gains `GGML_OP_PAD` (HVX + optional VTCM/DMA double-buffered, both zero-pad and circular-pad variants) and `GGML_OP_TRI` (HVX-vectorised triangular masking) support; new `HTP_OP_PAD` / `HTP_OP_TRI` opcodes; internal Qualcomm DSP backend, no project changes |
 | ~b9219–b9222 | `.devops/*.Dockerfile` + `.github/workflows/docker.yml` | OCI image labels (`org.opencontainers.image.*`) added via `BUILD_DATE`/`APP_VERSION`/`APP_REVISION` build args; new `skip_s390x` workflow_dispatch input; manifest annotations on `docker buildx imagetools create`; upstream packaging/CI only, no project changes |
+| ~b9222–b9245 | `common/common.h` + `common.cpp` | `common_init_result(common_params &, bool model_only = false)` and `common_init_from_params(common_params &, bool model_only = false)` gain an optional `model_only` flag that skips context/sampler/lora/warmup setup and returns only the loaded model. Additive with default value; no project call sites in `src/main/cpp/`, no source changes required |
+| ~b9222–b9245 | `common/common.h` | `common_params_speculative_draft` defaults retuned: `n_max` 16→3, `p_min` 0.75f→0.0f. Defaults only; Java `ModelParameters` sets these explicitly via JSON, so behaviour is unchanged for this project |
+| ~b9222–b9245 | `common/speculative.{h,cpp}` | `common_speculative_impl::accept()` virtual gains a 3rd `bool is_other` parameter; `common_speculative_accept()` now broadcasts the accepted-token count to every registered impl (with `is_other=true` for impls that did not generate the draft). `common_speculative_impl_ngram_map_k` ctor signature simplified (no longer takes `common_params_speculative`). Lots of new `LOG_INF` startup banners per impl. Internal to upstream-compiled `server-context.cpp`; no project call sites |
+| ~b9222–b9245 | `common/arg.cpp` + `common/common.cpp` + `tools/fit-params/fit-params.cpp` | `--verbosity` levels relabeled: level `4` now means "trace (more info)" and level `5` means "debug"; `LOG_LEVEL_DEBUG` constant value moved from `4` to `5`. Direct `params.verbosity >= 4` comparisons in upstream `common.cpp` and `fit-params.cpp` replaced with `>= LOG_LEVEL_DEBUG`. Project does not reference `LOG_LEVEL_DEBUG` or numeric verbosity thresholds in `src/main/cpp/`; no source changes required |
+| ~b9222–b9245 | `common/arg.cpp` | `--spec-type` duplicate-arg DEPRECATED warning suppressed (the flag legitimately accepts repeated values to form the comma-list). Behaviour-only |
+| ~b9222–b9245 | `common/ngram-map.cpp` | One per-draft `LOG_INF` downgraded to `LOG_DBG`. Log-level only |
+| ~b9222–b9245 | `src/llama-graph.h` | `llm_graph_params::operator==` adds a third disjunct so ubatches with both `token` and `embd` arrays present compare equal (graph reuse fix for MTP pre-norm path). Internal |
+| ~b9222–b9245 | `src/llama-memory-recurrent.{h,cpp}` + `src/llama-memory-hybrid.cpp` + `src/llama-memory-hybrid-iswa.cpp` | `init_batch()` now forces sequential split (`split_seq`) instead of equal split when `n_rs_seq > 0` (recurrent-state rollback is incompatible with equal splits). Internal upstream model code, no project impact |
+| ~b9222–b9245 | `src/models/delta-net-base.cpp` + `src/models/models.h` + `src/models/qwen35.cpp` | `llm_build_delta_net_base::keep_rs()` helper removed; conv-state and recurrent-attn paths reworked to read `cparams.n_rs_seq` directly and loop `K = n_rs_seq + 1` snapshot slots. Comment fix in `qwen35.cpp` MTP layer index. All internal upstream model code |
+| ~b9222–b9245 | `tools/server/server-context.cpp` | `pos_min_thold` lowered by one (`pos_next - n_swa` → `pos_next - n_swa - 1`); checkpoint trigger guard relaxed from `n_past < slot.prompt.n_tokens()` to `<=`; per-slot `print_timings_pp`/`print_timings_tg` lines split into separate `SLT_INF` calls; new `graphs reused` and `draft acceptance` lines; `n_draft_total` log moved from `SLT_CNT` to `SLT_INF`. Compiled upstream-as-is, no project changes |
+| ~b9222–b9245 | `ggml/src/ggml-cuda/mmvq.cu` | `calc_nwarps` table tweak: Q6_K returns 2 warps (was grouped with the 8-warp tier). Internal CUDA backend |
+| ~b9222–b9245 | `ggml/src/ggml-hexagon/` (`htp/rope-ops.c`, `htp/unary-ops.c`, `htp-ops.h`, `main.c`, `ggml-hexagon.cpp`) | New `HTP_OP_NORM` opcode (mean+variance norm); `rope-ops.c` adds MROPE / IMROPE position-id support via new `mrope_cache_init()`. Internal Qualcomm DSP backend |
+| ~b9222–b9245 | `ggml/src/ggml-opencl/` (`ggml-opencl.cpp`, `kernels/cvt.cl`, six new `gemm_moe_q{4,5,6}_k_f32_ns` + `gemv_moe_q{4,5,6}_k_f32_ns` kernels) | Adreno MoE pipeline extended to Q4_K / Q5_K / Q6_K (image1d_buffer_t transposed layout, dedicated convert/restore kernels, GEMM + GEMV paths). Internal OpenCL backend |
+| ~b9222–b9245 | `ggml/src/ggml-rpc/ggml-rpc.cpp` | `last_graph_uid` field moved from `ggml_backend_rpc_context` (per-backend) into `ggml_backend_rpc_device_context` (per-device) so multiple backends sharing a device reuse cached graphs. Internal RPC backend |
+| ~b9222–b9245 | `ggml/src/ggml-sycl/ggml-sycl.cpp` | New `GGML_SYCL_USE_ASYNC_MEM_OP` env (default `1`) decouples async USM alloc/free from the graph path. Internal SYCL backend |
+| ~b9222–b9245 | `ggml/src/ggml-webgpu/ggml-webgpu.cpp` + `wgsl-shaders/gated_delta_net.wgsl` | Gated-delta-net shader gains a `K` snapshot-count param; per-slot snapshot write path added. Internal WebGPU backend |
+| ~b9222–b9245 | `convert_hf_to_gguf.py`, `convert_lora_to_gguf.py`, `examples/save-load-state/save-load-state.cpp`, `examples/llama-eval/*`, `tools/cli/README.md`, `tools/server/README.md`, `docs/speculative.md`, `docs/backend/SYCL.md` | Doc/example/tooling updates only. Not compiled by this project |
+| ~b9222–b9245 | `tools/ui/*` | WebUI source reorganisation (enum file renames `*.ts` → `*.enums.ts`, new chat components, Tailwind plugin imports). Project sets `LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE` in `CMakeLists.txt`, so the UI is never built — no impact |
 
 ## Build Commands
 
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -108,7 +108,7 @@ set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9222
+	GIT_TAG        b9245
 )
 FetchContent_MakeAvailable(llama.cpp)
 
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 **Build:**  
 ![Java 11+](https://img.shields.io/badge/Java-11%2B-informational)  
 ![JUnit](https://img.shields.io/badge/tested%20with-JUnit4-yellow)  
-[![llama.cpp b9222](https://img.shields.io/badge/llama.cpp-%23b9222-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9222)  
+[![llama.cpp b9245](https://img.shields.io/badge/llama.cpp-%23b9245-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9245)  
 [![Publish](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml)  
 [![CodeQL](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml)  
 
