Upgrade llama.cpp from b9305 to b9333 (#196)

bernardladenthin · claude · web-flow · commit 0e2436f22885 · 2026-05-26T10:25:33.000+02:00
No project source changes required. All upstream API changes are either additive (new GGUF buffer/callback init, message_spans, FWHT kernel) or internal to compiled-in upstream files (checkpoint logic refactor, ggml-backend-meta external view fix, OpenMP move to ggml-base). The renamed checkpoint_every_nt → checkpoint_min_step field is not exposed by the Java API. https://claude.ai/code/session_01KhxAkQXywgMS4NxDhYirnX Co-authored-by: Claude <noreply@anthropic.com>
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9305**
+Current llama.cpp pinned version: **b9333**
 
 ## Upgrading CUDA Version
 
@@ -419,6 +419,21 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
 | ~b9297–b9305 | `ggml/src/ggml-backend-meta.cpp` | Bug fix for zero-sized split tensor slices: `set_tensor`/`get_tensor`/`set_tensor_async`/`get_tensor_async` paths now `continue` when `chunk_size_j == 0`; `ggml_backend_meta_alloc_ctx_tensors_from_buft` now allocates a dummy buffer when all tensors in a context are zero-sized (was returning `NULL` and asserting); `ggml_backend_buft_alloc_buffer` result now `GGML_ASSERT`ed non-null. Internal backend code, no project changes required |
 | ~b9297–b9305 | `ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c` | `hvx_vec_splat_f16(hvx_vec_get_f16(...))` round-trip replaced with `hvx_vec_repl_f16(...)` which stays in the vector domain via `vdelta` (avoids store/reload through scalar). Internal Hexagon DSP backend optimization, no project changes required |
 | ~b9297–b9305 | `ggml/src/ggml-opencl/ggml-opencl.cpp` | `GGML_OPENCL_PROFILING` batching fix: when `profiling_info` reaches 2048 entries the batch is now flushed into a persistent `profiling_results` vector (events released, durations populated) instead of accumulating until shutdown. Also fixes missing `]` closing the JSON array in `cl_trace.json`. Profile-only code (`GGML_OPENCL_PROFILING` is off by default), no project changes required |
+| ~b9305–b9333 | `common/common.h` + `common/arg.cpp` | `common_params::checkpoint_every_nt` renamed to `checkpoint_min_step`; default changed 8192 → 256; CLI flag `-cpent`/`--checkpoint-every-n-tokens` **REMOVED** (throws `std::invalid_argument` at parse time) and replaced by `-cms`/`--checkpoint-min-step`; env var `LLAMA_ARG_CHECKPOINT_EVERY_NT` → `LLAMA_ARG_CHECKPOINT_MIN_SPACING_NT`. Java layer does not expose this flag, no project source changes required |
+| ~b9305–b9333 | `common/chat.h` + `common/chat.cpp` | New `common_chat_msg_span` and `common_chat_msg_delimiter` structs; new `common_chat_params::message_spans` field (default empty vector); new `common_chat_split_by_role()` function; populated for GPT-OSS, Gemma4, and all autoparser-handled templates with detected `user_start`/`assistant_start` markers; passed through `server-common.cpp` as `message_spans` JSON array in the task params; compiled from upstream, no Java changes required |
+| ~b9305–b9333 | `common/chat-diff-analyzer.cpp` + `common/chat-auto-parser.h` | New `autoparser::user_start` and `autoparser::assistant_start` fields auto-detected via differential template analysis; new patches for Nemotron Nano v2, Fireworks v2, Solar Open, Apriel 1.6; additive, compiled from upstream, no project changes required |
+| ~b9305–b9333 | `tools/server/server-task.h` + `tools/server/server-context.cpp` | New `task_params::n_before_user` field (default `-1`); server computes it from `message_spans` to place context checkpoints precisely at the last-user-message boundary; MTP context creation now propagates `draft.cache_type_k/v`; compiled directly into jllama from upstream, no project source changes required |
+| ~b9305–b9333 | `ggml/include/gguf.h` + `ggml/src/gguf.cpp` | New `gguf_reader_callback_t` typedef; new `gguf_init_from_buffer(data, size, params)` and `gguf_init_from_callback(callback, userdata, max_chunk_read, max_expected_size, params)` public APIs; internal `gguf_init_from_reader()` helper refactored to use a callback-based reader; additive, not used by project |
+| ~b9305–b9333 | `ggml/CMakeLists.txt` | GGML version bumped 0.12.0 → 0.13.0; no project changes required |
+| ~b9305–b9333 | `ggml/src/CMakeLists.txt` + `ggml/src/ggml-cpu/CMakeLists.txt` | OpenMP detection and `target_link_libraries` moved from `ggml-cpu` into `ggml-base`; exported `ggml-config.cmake.in` updated to add `GGML_BASE_INTERFACE_LINK_LIBRARIES` and guard OpenMP targets before appending; fixes static-lib consumers that link only `ggml-base`; no project source changes required |
+| ~b9305–b9333 | `ggml/src/ggml-alloc.c` | Off-by-one bug fix in `ggml_dyn_tallocr_remove_block`: loop ran one iteration past the last valid element; internal allocator fix, no project changes required |
+| ~b9305–b9333 | `ggml/src/ggml-backend-meta.cpp` | Rotating-pair compute containers: external views created between evals now use a `stc_compute[2]` double-buffer scheme so they don't slowly deplete `stc_static` memory; `split_state_cache` is now unbounded (comment documents it as FIXME); `ggml_backend_meta_alloc_ctx_tensors_from_buft` uses `ggml_get_mem_size(ctx)` for static container and `16×` that for each compute container; internal multi-GPU meta backend refactor, no project changes required |
+| ~b9305–b9333 | `ggml/src/ggml-cuda/fwht.cu` + `fwht.cuh` + `ggml-cuda.cu` | New CUDA FWHT (Fast Walsh-Hadamard Transform) kernel (`fwht_cuda<N>`) for N = 64/128/256/512; dispatched from `ggml_cuda_mul_mat` when `GGML_HINT_SRC0_IS_HADAMARD` op hint is set on a `ggml_mul_mat` node (hint index 1); internal CUDA backend, no project changes required |
+| ~b9305–b9333 | `ggml/src/ggml-metal/ggml-metal-device.{h,m}` | New `ggml_metal_device_id` enum covering M1–M5 variants; `device_id` field added to `ggml_metal_device_props`, populated by new `ggml_metal_device_id_parse()` from the MTL device name string; additive, no project changes required |
+| ~b9305–b9333 | `ggml/src/ggml-quants.c` | IQ2XS and IQ3XS neighbour-search init parallelized with OpenMP (3-pass: parallel count → serial prefix-sum → parallel write); fixes a prior race on `counter` under OpenMP; guards with `#ifdef GGML_USE_OPENMP`; internal quantization init, no project changes required |
+| ~b9305–b9333 | `src/llama-arch.cpp` | `LLM_TENSOR_FFN_LATENT_DOWN` and `LLM_TENSOR_FFN_LATENT_UP` probe op changed from `GGML_OP_MUL` to `GGML_OP_MUL_MAT`; fixes Nemotron 3 Super latent projections not staying on GPU (buft probe must use `MUL_MAT` to keep them there); internal upstream fix, no project changes required |
+| ~b9305–b9333 | `vendor/cpp-httplib/httplib.{h,cpp}` | Bumped to v0.45.1: `close_socket`, `shutdown_socket`, `Server::stop` marked `noexcept`; macOS Keychain cert loading migrated from deprecated `SecTrustCopyAnchorCertificates` to `SecTrustSettingsCopyCertificates` (all three trust domains: system, admin, user); `CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN` now restricted to `TARGET_OS_OSX` only with compile-time `#error` on iOS/tvOS/watchOS; compiled automatically, no project changes required |
+| ~b9305–b9333 | `common/common.h` | New `string_lcs(std::string_view a, std::string_view b)` function (longest common substring via DP); additive, not used by project directly |
 
 ## Build Commands
 
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -114,7 +114,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9305
+	GIT_TAG        b9333
 )
 FetchContent_MakeAvailable(llama.cpp)
 
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 **Build:**  
 ![Java 11+](https://img.shields.io/badge/Java-11%2B-informational)  
 ![JUnit](https://img.shields.io/badge/tested%20with-JUnit4-yellow)  
-[![llama.cpp b9305](https://img.shields.io/badge/llama.cpp-%23b9305-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9305)  
+[![llama.cpp b9333](https://img.shields.io/badge/llama.cpp-%23b9333-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9333)  
 [![Publish](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml)  
 [![CodeQL](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml)  
 

Original file line number	Diff line number	Diff line change
`@@ -114,7 +114,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)`
`114`	`114`	`FetchContent_Declare(`
`115`	`115`	`llama.cpp`
`116`	`116`	`GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git`
`117`		`- GIT_TAG b9305`
	`117`	`+ GIT_TAG b9333`
`118`	`118`	`)`
`119`	`119`	`FetchContent_MakeAvailable(llama.cpp)`
`120`	`120`