Upgrade llama.cpp from b9245 to b9264 (#174)

bernardladenthin · claude · web-flow · commit 674314c3226e · 2026-05-21T12:33:28.000+02:00
Co-authored-by: Claude <noreply@anthropic.com>
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9245**
+Current llama.cpp pinned version: **b9264**
 
 ## Upgrading CUDA Version
 
@@ -321,6 +321,30 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
 | ~b9222–b9245 | `ggml/src/ggml-webgpu/ggml-webgpu.cpp` + `wgsl-shaders/gated_delta_net.wgsl` | Gated-delta-net shader gains a `K` snapshot-count param; per-slot snapshot write path added. Internal WebGPU backend |
 | ~b9222–b9245 | `convert_hf_to_gguf.py`, `convert_lora_to_gguf.py`, `examples/save-load-state/save-load-state.cpp`, `examples/llama-eval/*`, `tools/cli/README.md`, `tools/server/README.md`, `docs/speculative.md`, `docs/backend/SYCL.md` | Doc/example/tooling updates only. Not compiled by this project |
 | ~b9222–b9245 | `tools/ui/*` | WebUI source reorganisation (enum file renames `*.ts` → `*.enums.ts`, new chat components, Tailwind plugin imports). Project sets `LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE` in `CMakeLists.txt`, so the UI is never built — no impact |
+| ~b9245–b9264 | `src/llama-chat.{h,cpp}` | `LLM_CHAT_TEMPLATE_HUNYUAN_OCR` renamed to `LLM_CHAT_TEMPLATE_HUNYUAN_VL` (HunyuanOCR and HunyuanVL now share one template). Not referenced by project — no source changes required |
+| ~b9245–b9264 | `tools/mtmd/clip-impl.h` + `tools/mtmd/models/` | `PROJECTOR_TYPE_HUNYUANOCR` removed and merged into `PROJECTOR_TYPE_HUNYUANVL`; `hunyuanocr.cpp` renamed to `hunyuanvl.cpp`; clip graph class `clip_graph_hunyuanocr` renamed to `clip_graph_hunyuanvl`. Not referenced by project — no source changes required |
+| ~b9245–b9264 | `tools/mtmd/clip.h` | `clip_is_minicpmv()` and `clip_is_glm()` removed from public API. Not referenced by project — no source changes required |
+| ~b9245–b9264 | `tools/mtmd/clip.h` (`struct clip_context_params`) | New `bool no_alloc` field added (initialized via `mtmd_context_params_default()`). Additive default-zero — no project changes required |
+| ~b9245–b9264 | `tools/mtmd/mtmd.h` | New `mtmd_get_memory_usage()` C++ API for estimating mmproj VRAM/RAM usage. Additive, not called by project |
+| ~b9245–b9264 | `tools/mtmd/clip-model.h` | New `enum pad_style { PAD_NONE, PAD_CEIL, PAD_NEAREST }` replacing the `bool image_resize_pad` flag (allows Pillow-byte-parity nearest-integer rounding for DeepSeek-OCR). Internal to mtmd, project links `mtmd` as-is |
+| ~b9245–b9264 | `common/common.h` (`struct common_params_speculative_draft`) | New `bool backend_sampling = true` field — offloads draft sampling to the backend. Additive default-on; Java `ModelParameters` doesn't set it, so the upstream default applies. Backend sampler auto-disables when `split_mode == TENSOR` in `src/llama-context.cpp` — safe |
+| ~b9245–b9264 | `common/speculative.cpp` | `common_speculative_impl_draft_mtp` now registers a per-seq backend sampler chain (top-k 10) on `ctx_dft` via `llama_set_sampler`; cleaned up in destructor. Falls back to CPU sampler if `llama_set_sampler` fails. Internal to upstream-compiled speculative module, no project call sites |
+| ~b9245–b9264 | `app/` (new) | New optional unified `llama` binary (`llama-app` target) dispatching to `serve`/`cli`/`completion`/`bench`. Guarded by `LLAMA_BUILD_APP=OFF` default — project doesn't enable it |
+| ~b9245–b9264 | `tools/{cli,completion,llama-bench,server}/CMakeLists.txt` | Each tool split into a `*-impl` static library (the logic) plus a thin `main.cpp` wrapper; the `main()` in `cli.cpp`/`completion.cpp`/`llama-bench.cpp`/`server.cpp` is renamed to `llama_cli`/`llama_completion`/`llama_bench`/`llama_server` and now satisfies `-Wmissing-declarations` via a forward decl. Project does NOT compile any of these `.cpp` files — only `server-context.cpp`, `server-queue.cpp`, `server-task.cpp`, `server-models.cpp` (see `CMakeLists.txt:237`/`:302`) — so no impact |
+| ~b9245–b9264 | `tools/server/server-context.cpp` | Adds mmproj memory estimation: when `params_base.fit_params` is set, calls `mtmd_get_memory_usage(mmproj_path, mparams)` and adds the per-device cost into `params_base.fit_params_target` before `common_init_from_params`. Also calls `mtmd_helper_log_set(common_log_default_callback, nullptr)` once when `!is_resume`. Compiled upstream-as-is, no project call sites |
+| ~b9245–b9264 | `src/llama-context.cpp` | New `llama_context::set_sampler()` short-circuits with a one-shot `LLAMA_LOG_WARN` and returns `false` when `model.split_mode() == LLAMA_SPLIT_MODE_TENSOR` (backend sampling not supported with tensor split). Internal safety check, no project call sites |
+| ~b9245–b9264 | `common/arg.cpp` | New CLI flags `--spec-draft-backend-sampling` / `--no-spec-draft-backend-sampling` and env `LLAMA_ARG_SPEC_DRAFT_BACKEND_SAMPLING` to toggle the new `backend_sampling` field. Not exposed by `ModelParameters`; could be added later as a Java-side enhancement |
+| ~b9245–b9264 | `ggml/src/ggml-cuda/CMakeLists.txt` + `common.cuh` + `binbcast.cu`, `concat.cu`, `cpy.cu`, `fattn-*.cu`, `gated_delta_net.cu`, `getrows.cu`, `mean.cu`, `mmvf.cu`, `mmvq.cu`, `norm.cu`, `quantize.cu`, `reduce_rows.cuh`, `rope.cu`, `scale.cu`, `set-rows.cu`, `softcap.cu`, `ssm-conv.cu`, `ssm-scan.cu`, `sumrows.cu`, `topk-moe.cu`, `unary.cu` | New PDL (Programmatic Dependent Launch) infrastructure: `GGML_CUDA_USE_PDL` build flag (CUDART ≥ 11.8, non-HIP/MUSA); `ggml_cuda_pdl_sync()` / `ggml_cuda_pdl_lc()` device helpers (active on Hopper sm_90+); `ggml_cuda_kernel_launch_params` + `ggml_cuda_kernel_launch()` host template that calls `cudaLaunchKernelEx` with stream-serialization attribute when `GGML_CUDA_PDL` env var allows. Adds `90-virtual` (Hopper) to default `CMAKE_CUDA_ARCHITECTURES` when CUDA ≥ 11.8. Internal CUDA backend, no project changes required |
+| ~b9245–b9264 | `ggml/src/ggml-metal/ggml-metal-{device,ops}.cpp` + `ggml-metal.metal` | New 4-element `kernel_pad_*_4` variant (currently disabled — `is_c4 = false`); `kernel_pad` rewritten with 1024-element-per-block tiling for larger tensors; `kernel_cpy_*` rewritten to use `tpitg` rows-per-threadgroup batching; Q quantization cpy paths use 256-thread limit. Internal Metal backend |
+| ~b9245–b9264 | `ggml/src/ggml-hexagon/htp/` (`hmx-matmul-ops.c`, `hmx-ops.h`, `matmul-ops.c`, `main.c`) | HMX matmul refactor: K-loop tiled in 32-tile blocks with `Q6_activation_hf_mxmem_RR_deep`; the out-stationary fallback path for large M·K·N was deleted; function rename `hmx_mat_mul_permuted_w16a32` → `hmx_matmul_f16_f32`, `hmx_mat_mul_permuted_qk_0_d16a32` → `hmx_matmul_q_f32`, `hmx_mat_mul_permuted_w16a32_batched_params_t` → `hmx_matmul_f16_f32_batched_params_t`. HMX power-up code reorganized (`HAP_power_set_HMX_v2` now combines power-on + clock in one step for `__HVX_ARCH__ ≥ 75`). Internal Qualcomm DSP backend |
+| ~b9245–b9264 | `ggml/src/ggml-opencl/ggml-opencl.cpp` | Lazy kernel compilation: `argsort` and `flash_attn` programs are now built only when first needed (`load_cl_kernels_argsort` / `load_cl_kernels_flash_attn` called from `supports_op`); new device-supported probe in `ggml_opencl_is_device_supported` runs at registration time; renamed `ggml_cl2_init`/`ggml_cl2_free` → `ggml_cl_init`/`ggml_cl_free`; OpenCL contexts now live as long as the process. Internal OpenCL backend |
+| ~b9245–b9264 | `ggml/src/ggml-vulkan/vulkan-shaders/im2col.comp` | Refactor: precomputed base input coords and step deltas; running pointer/index for destination; one inlined unrolled loop iteration writes `BLOCK_SIZE` outputs per step. Internal Vulkan backend |
+| ~b9245–b9264 | `src/models/delta-net-base.cpp` | Renamed local variables (`state_in_3d`→`s_3d`, `state_3d`→`s_3d_pad`) when reshaping the recurrent state; behaviour unchanged |
+| ~b9245–b9264 | `tools/mtmd/mtmd-image.cpp` | `img_tool::resize()` takes a `pad_style` enum (was `bool add_padding`); new `PAD_NEAREST` rounding path for Pillow byte-parity; `mtmd_image_preprocessor_deepseekocr::preprocess` rewritten with `static constexpr` resolution table and `RESIZE_ALGO_BICUBIC_PILLOW` + `PAD_NEAREST`. Internal mtmd, project links as-is |
+| ~b9245–b9264 | `tools/mtmd/models/deepseekocr.cpp` | Extracted `build_sam(ggml_tensor *inp_raw)` member function from the monolithic build path; FA mask casting to F16 only when `flash_attn_type == CLIP_FLASH_ATTN_TYPE_ENABLED`. Internal |
+| ~b9245–b9264 | `conversion/hunyuan.py`, `gguf-py/gguf/constants.py`, `gguf-py/gguf/tensor_mapping.py` | HunyuanOCR / HunyuanVL unified in conversion: `VisionProjectorType.HUNYUANOCR` removed; `HunYuanVLForConditionalGeneration` registers a single `HunyuanVLVisionModel` + `HunyuanVLTextModel`; `vit.perceive.*` tensor mappings now only mention `HunyuanVL`. Python tooling, not compiled by project |
+| ~b9245–b9264 | `CMakeLists.txt` (upstream) | New `LLAMA_BUILD_APP` option (default OFF); deprecation shims for `LLAMA_BUILD_WEBUI`/`LLAMA_USE_PREBUILT_WEBUI` → `LLAMA_BUILD_UI`/`LLAMA_USE_PREBUILT_UI` preserved. Project's `set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE)` still works unchanged |
+| ~b9245–b9264 | `.devops/*.Dockerfile`, `.github/workflows/build-and-test-snapdragon.yml`, `scripts/snapdragon/`, `docs/backend/snapdragon/`, `tools/cli/README.md`, `tools/server/README.md`, `tools/mtmd/tests/` | Docker images add `conversion/` dir; snapdragon toolchain bumped v0.3 → v0.6 with `+dotprod+i8mm`; mtmd test rewritten to use CER/chrF metrics; doc-only updates. Not compiled by project |
 
 ## Build Commands
 
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -108,7 +108,7 @@ set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9245
+	GIT_TAG        b9264
 )
 FetchContent_MakeAvailable(llama.cpp)
 
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 **Build:**  
 ![Java 11+](https://img.shields.io/badge/Java-11%2B-informational)  
 ![JUnit](https://img.shields.io/badge/tested%20with-JUnit4-yellow)  
-[![llama.cpp b9245](https://img.shields.io/badge/llama.cpp-%23b9245-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9245)  
+[![llama.cpp b9264](https://img.shields.io/badge/llama.cpp-%23b9264-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9264)  
 [![Publish](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml)  
 [![CodeQL](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml)  
 

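The bump above changes the same tag in three places (`CLAUDE.md`, `CMakeLists.txt`, `README.md`), so a future upgrade could miss one of them. The following is a minimal sketch of a consistency check, not part of this commit. The regular expressions and the helper names `pinned_tags` and `is_consistent` are assumptions made for illustration, written against the lines shown in this diff.

```python
import re

# Hypothetical patterns, one per file touched by this commit.
# They are modelled on the changed lines in the diff and are assumptions.
PATTERNS = {
    "CLAUDE.md": re.compile(r"pinned version: \*\*(b\d+)\*\*"),
    "CMakeLists.txt": re.compile(r"GIT_TAG\s+(b\d+)"),
    "README.md": re.compile(r"releases/tag/(b\d+)"),
}


def pinned_tags(contents):
    """Map each file name to the set of llama.cpp tags found in its text."""
    return {
        name: set(PATTERNS[name].findall(text))
        for name, text in contents.items()
    }


def is_consistent(contents, expected):
    """True only if every file mentions exactly the expected tag."""
    return all(tags == {expected} for tags in pinned_tags(contents).values())


if __name__ == "__main__":
    from pathlib import Path

    # Assumes the script is run from the repository root.
    files = {name: Path(name).read_text(encoding="utf-8") for name in PATTERNS}
    print(pinned_tags(files))
```

A file with no match yields an empty set, so `is_consistent` reports a missing pin as well as a stale one.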