Upgrade llama.cpp from b9297 to b9305

claude · claude · commit 278ba4f8c810 · 2026-05-25T09:23:52.000Z
- Bump GIT_TAG, README badge, CLAUDE.md.
- Rename project-side cache pin from LLAMA_BUILD_WEBUI to LLAMA_BUILD_UI:
  the top-level CMakeLists no longer forwards the deprecated name (the
  shim survived only in tools/ui/CMakeLists.txt, which is not configured
  in FetchContent mode).
- No C++ source changes required. server-context.cpp now includes
  common/fit.h to estimate draft/MTP VRAM when fit_params is on; the
  include resolves via the existing llama-common include path and the
  feature is purely additive.
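
Possible follow-up, not part of this commit: a build directory configured
before the rename may still carry the old LLAMA_BUILD_WEBUI entry in its
CMakeCache.txt. The entry is inert, but it can mislead anyone reading the
cache. A minimal sketch of a guard that drops it, assuming CMake 3.14+ (for
the `if(DEFINED CACHE{...})` form) and placement before
FetchContent_MakeAvailable; the warning text is illustrative only:

```cmake
# Hypothetical cleanup guard; NOT part of this commit.
# Drop a stale cache entry left behind by the pre-rename pin.
if(DEFINED CACHE{LLAMA_BUILD_WEBUI})
    message(WARNING
        "LLAMA_BUILD_WEBUI is no longer forwarded by upstream; "
        "removing the stale cache entry. Use LLAMA_BUILD_UI instead.")
    unset(LLAMA_BUILD_WEBUI CACHE)
endif()

# The pin this commit actually sets.
set(LLAMA_BUILD_UI OFF CACHE BOOL "" FORCE)
```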

Local: 435/435 ctest pass.
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9297**
+Current llama.cpp pinned version: **b9305**
 
 ## Upgrading CUDA Version
 
@@ -408,6 +408,17 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
 | ~b9284–b9297 | `ggml/src/ggml-vulkan/CMakeLists.txt` | `find_package(SPIRV-Headers)` switched to `CONFIG REQUIRED` and adds `$ENV{VULKAN_SDK}` to `CMAKE_PREFIX_PATH`; fixes detection when SPIRV-Headers ships only the CMake-config files (no FindSPIRV-Headers.cmake). Internal Vulkan build config, no project changes required |
 | ~b9284–b9297 | `ggml/src/ggml-zendnn/` (`CMakeLists.txt`, `ggml-zendnn.cpp`) | ZenDNN bumped to ZenDNN-2026-WW19; Q8_0 weight support added for matmul and matmul_id paths via dynamic quantization (S8 compute, BF16 scales); ZenDNN matmul/matmul_id now handles `GGML_TYPE_Q8_0` with FP32 src1 directly without F32→Q8_0 conversion. Internal AMD ZenDNN backend, no project changes required |
 | ~b9284–b9297 | `tools/perplexity/perplexity.cpp` | `log_probs.resize(n_ctx * nv)` widened to `size_t(n_ctx) * nv` to avoid 32-bit overflow on large context sizes. Standalone tool not compiled by project, no impact |
+| ~b9297–b9305 | upstream `CMakeLists.txt` | Top-level backward-compat shims that forwarded `LLAMA_BUILD_WEBUI` → `LLAMA_BUILD_UI` and `LLAMA_USE_PREBUILT_WEBUI` → `LLAMA_USE_PREBUILT_UI` were removed (they now live only in `tools/ui/CMakeLists.txt`). **Java impact**: the project's `set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE)` is no longer forwarded at top level. `tools/ui` is not configured in FetchContent mode (`LLAMA_BUILD_TOOLS=OFF`), so the old setting was inert in practice, but the setting at the project's `CMakeLists.txt:107` was renamed to `set(LLAMA_BUILD_UI OFF CACHE BOOL "" FORCE)` for clarity and to guard against a future change to the `LLAMA_BUILD_UI` default |
+| ~b9297–b9305 | `common/common.h` | `LLAMA_UI_DEFAULT_ENABLED` macro removed; `common_params::ui` default is now unconditionally `true`. Not referenced by project, no changes required |
+| ~b9297–b9305 | `common/fit.{h,cpp}` | `common_get_device_memory_data()` made non-static and exported from `fit.h` (was a file-local helper). `fit.h` now also pulls in `ggml-backend.h`, `llama.h`, and `../src/llama-ext.h`. Used by upstream `tools/server/server-context.cpp` (compiled directly into jllama). The `#include "../src/llama-ext.h"` resolves relative to fit.h's location (`common/../src/llama-ext.h`), so no extra include paths are required. No project source changes |
+| ~b9297–b9305 | `tools/server/server-context.cpp` | New `#include "fit.h"` and a new draft/MTP memory measurement block: when `params_base.fit_params` is set AND the speculative config includes a draft model or `COMMON_SPECULATIVE_TYPE_DRAFT_MTP`, `common_get_device_memory_data()` is called against the draft model (or a copy of the target params with `LLAMA_CONTEXT_TYPE_MTP` for MTP) and the resulting per-device `model + context + compute` bytes are added to `params_base.fit_params_target` before the target context is fitted. Compiled directly into jllama from upstream; behaviour is additive and only triggers for speculative-decoding setups. `ModelParameters.setFit(boolean)` defaults to `on`, so this kicks in automatically when a user configures a draft model — no Java-side wiring required |
+| ~b9297–b9305 | `tools/server/server-context.cpp` | `[mtmd] estimated memory usage of mmproj` log line reworded to `estimated worst-case memory usage`; log only, no behavioural change |
+| ~b9297–b9305 | `tools/server/server-http.cpp` | UI serving path migrated from per-asset extern arrays (`index_html`, `bundle_js`, …) and the `LLAMA_BUILD_UI` macro to a runtime `llama_ui_find_asset()` lookup gated on the new `LLAMA_UI_HAS_ASSETS` macro generated by the new `llama-ui-embed` host tool. Project does NOT compile `server-http.cpp` (only `server-context.cpp`/`server-queue.cpp`/`server-task.cpp`/`server-models.cpp`), no impact |
+| ~b9297–b9305 | `tools/ui/` (`CMakeLists.txt`, new `embed.cpp`, new `sources.cmake`, new `scripts/ui-assets.cmake`, removed `scripts/ui-download.cmake` + `scripts/xxd.cmake`, removed `ui.cpp`+`ui.h`) | Full UI build pipeline rewrite: `xxd.cmake`+`ui-download.cmake` replaced by a host-compiled `llama-ui-embed` C++ tool that generates `ui.cpp`/`ui.h` (declaring a `g_assets[]` table and `llama_ui_find_asset()` lookup, plus `LLAMA_UI_HAS_ASSETS` macro) from arbitrary asset files; new `scripts/ui-assets.cmake` orchestrates asset provisioning with a clearer priority (pre-built `tools/ui/dist` → npm build → HF Bucket); `tools/ui` is now an `add_custom_target` always re-run per build. The deprecation shims for `LLAMA_BUILD_WEBUI`/`LLAMA_USE_PREBUILT_WEBUI`/`LLAMA_WEBUI_HF_BUCKET` moved here from the top-level `CMakeLists.txt`. Project does not build the UI (`LLAMA_BUILD_TOOLS=OFF` in FetchContent mode), no impact |
+| ~b9297–b9305 | `ggml/include/ggml-alloc.h` | Comment-only API documentation update for `ggml_backend_alloc_ctx_tensors_from_buft`. No project changes required |
+| ~b9297–b9305 | `ggml/src/ggml-backend-meta.cpp` | Bug fix for zero-sized split tensor slices: `set_tensor`/`get_tensor`/`set_tensor_async`/`get_tensor_async` paths now `continue` when `chunk_size_j == 0`; `ggml_backend_meta_alloc_ctx_tensors_from_buft` now allocates a dummy buffer when all tensors in a context are zero-sized (was returning `NULL` and asserting); `ggml_backend_buft_alloc_buffer` result now `GGML_ASSERT`ed non-null. Internal backend code, no project changes required |
+| ~b9297–b9305 | `ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c` | `hvx_vec_splat_f16(hvx_vec_get_f16(...))` round-trip replaced with `hvx_vec_repl_f16(...)` which stays in the vector domain via `vdelta` (avoids store/reload through scalar). Internal Hexagon DSP backend optimization, no project changes required |
+| ~b9297–b9305 | `ggml/src/ggml-opencl/ggml-opencl.cpp` | `GGML_OPENCL_PROFILING` batching fix: when `profiling_info` reaches 2048 entries the batch is now flushed into a persistent `profiling_results` vector (events released, durations populated) instead of accumulating until shutdown. Also fixes missing `]` closing the JSON array in `cl_trace.json`. Profile-only code (`GGML_OPENCL_PROFILING` is off by default), no project changes required |
 
 ## Build Commands
 
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -104,13 +104,17 @@ endif()
 set(GGML_FMA     ON  CACHE BOOL "" FORCE)
 set(GGML_F16C    ON  CACHE BOOL "" FORCE)
 set(GGML_AVX512  OFF CACHE BOOL "" FORCE)
-set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE)
+# b9305 removed the top-level LLAMA_BUILD_WEBUI -> LLAMA_BUILD_UI shim, so set
+# the new name directly. The shim survives only in tools/ui/CMakeLists.txt,
+# which is not configured in FetchContent mode, so the old name would be
+# inert here.
+set(LLAMA_BUILD_UI OFF CACHE BOOL "" FORCE)
 # b9284 flipped LLAMA_BUILD_APP default to ON; we don't build the unified binary
 set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9297
+	GIT_TAG        b9305
 )
 FetchContent_MakeAvailable(llama.cpp)
 
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 **Build:**  
 ![Java 11+](https://img.shields.io/badge/Java-11%2B-informational)  
 ![JUnit](https://img.shields.io/badge/tested%20with-JUnit4-yellow)  
-[![llama.cpp b9297](https://img.shields.io/badge/llama.cpp-%23b9297-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9297)  
+[![llama.cpp b9305](https://img.shields.io/badge/llama.cpp-%23b9305-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9305)  
 [![Publish](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml)  
 [![CodeQL](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml)