
Commit 0e2436f

Upgrade llama.cpp from b9305 to b9333 (#196)
No project source changes required. All upstream API changes are either additive (new GGUF buffer/callback init, message_spans, FWHT kernel) or internal to compiled-in upstream files (checkpoint logic refactor, ggml-backend-meta external view fix, OpenMP move to ggml-base). The renamed checkpoint_every_nt → checkpoint_min_step field is not exposed by the Java API.

https://claude.ai/code/session_01KhxAkQXywgMS4NxDhYirnX

Co-authored-by: Claude <noreply@anthropic.com>
1 parent d78ff92 commit 0e2436f

3 files changed

Lines changed: 18 additions & 3 deletions


CLAUDE.md

Lines changed: 16 additions & 1 deletion
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.

-Current llama.cpp pinned version: **b9305**
+Current llama.cpp pinned version: **b9333**

## Upgrading CUDA Version

@@ -419,6 +419,21 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
| ~b9297–b9305 | `ggml/src/ggml-backend-meta.cpp` | Bug fix for zero-sized split tensor slices: `set_tensor`/`get_tensor`/`set_tensor_async`/`get_tensor_async` paths now `continue` when `chunk_size_j == 0`; `ggml_backend_meta_alloc_ctx_tensors_from_buft` now allocates a dummy buffer when all tensors in a context are zero-sized (was returning `NULL` and asserting); `ggml_backend_buft_alloc_buffer` result now `GGML_ASSERT`ed non-null. Internal backend code, no project changes required |
| ~b9297–b9305 | `ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c` | `hvx_vec_splat_f16(hvx_vec_get_f16(...))` round-trip replaced with `hvx_vec_repl_f16(...)` which stays in the vector domain via `vdelta` (avoids store/reload through scalar). Internal Hexagon DSP backend optimization, no project changes required |
| ~b9297–b9305 | `ggml/src/ggml-opencl/ggml-opencl.cpp` | `GGML_OPENCL_PROFILING` batching fix: when `profiling_info` reaches 2048 entries the batch is now flushed into a persistent `profiling_results` vector (events released, durations populated) instead of accumulating until shutdown. Also fixes missing `]` closing the JSON array in `cl_trace.json`. Profile-only code (`GGML_OPENCL_PROFILING` is off by default), no project changes required |
+| ~b9305–b9333 | `common/common.h` + `common/arg.cpp` | `common_params::checkpoint_every_nt` renamed to `checkpoint_min_step`; default changed 8192 → 256; CLI flag `-cpent`/`--checkpoint-every-n-tokens` **REMOVED** (throws `std::invalid_argument` at parse time) and replaced by `-cms`/`--checkpoint-min-step`; env var `LLAMA_ARG_CHECKPOINT_EVERY_NT` → `LLAMA_ARG_CHECKPOINT_MIN_SPACING_NT`. Java layer does not expose this flag, no project source changes required |
+| ~b9305–b9333 | `common/chat.h` + `common/chat.cpp` | New `common_chat_msg_span` and `common_chat_msg_delimiter` structs; new `common_chat_params::message_spans` field (default empty vector); new `common_chat_split_by_role()` function; populated for GPT-OSS, Gemma4, and all autoparser-handled templates with detected `user_start`/`assistant_start` markers; passed through `server-common.cpp` as `message_spans` JSON array in the task params; compiled from upstream, no Java changes required |
+| ~b9305–b9333 | `common/chat-diff-analyzer.cpp` + `common/chat-auto-parser.h` | New `autoparser::user_start` and `autoparser::assistant_start` fields auto-detected via differential template analysis; new patches for Nemotron Nano v2, Fireworks v2, Solar Open, Apriel 1.6; additive, compiled from upstream, no project changes required |
+| ~b9305–b9333 | `tools/server/server-task.h` + `tools/server/server-context.cpp` | New `task_params::n_before_user` field (default `-1`); server computes it from `message_spans` to place context checkpoints precisely at the last-user-message boundary; MTP context creation now propagates `draft.cache_type_k/v`; compiled directly into jllama from upstream, no project source changes required |
+| ~b9305–b9333 | `ggml/include/gguf.h` + `ggml/src/gguf.cpp` | New `gguf_reader_callback_t` typedef; new `gguf_init_from_buffer(data, size, params)` and `gguf_init_from_callback(callback, userdata, max_chunk_read, max_expected_size, params)` public APIs; internal `gguf_init_from_reader()` helper refactored to use a callback-based reader; additive, not used by project |
+| ~b9305–b9333 | `ggml/CMakeLists.txt` | GGML version bumped 0.12.0 → 0.13.0; no project changes required |
+| ~b9305–b9333 | `ggml/src/CMakeLists.txt` + `ggml/src/ggml-cpu/CMakeLists.txt` | OpenMP detection and `target_link_libraries` moved from `ggml-cpu` into `ggml-base`; exported `ggml-config.cmake.in` updated to add `GGML_BASE_INTERFACE_LINK_LIBRARIES` and guard OpenMP targets before appending; fixes static-lib consumers that link only `ggml-base`; no project source changes required |
+| ~b9305–b9333 | `ggml/src/ggml-alloc.c` | Off-by-one bug fix in `ggml_dyn_tallocr_remove_block`: loop ran one iteration past the last valid element; internal allocator fix, no project changes required |
+| ~b9305–b9333 | `ggml/src/ggml-backend-meta.cpp` | Rotating-pair compute containers: external views created between evals now use a `stc_compute[2]` double-buffer scheme so they don't slowly deplete `stc_static` memory; `split_state_cache` is now unbounded (comment documents it as FIXME); `ggml_backend_meta_alloc_ctx_tensors_from_buft` uses `ggml_get_mem_size(ctx)` for static container and `16×` that for each compute container; internal multi-GPU meta backend refactor, no project changes required |
+| ~b9305–b9333 | `ggml/src/ggml-cuda/fwht.cu` + `fwht.cuh` + `ggml-cuda.cu` | New CUDA FWHT (Fast Walsh-Hadamard Transform) kernel (`fwht_cuda<N>`) for N = 64/128/256/512; dispatched from `ggml_cuda_mul_mat` when `GGML_HINT_SRC0_IS_HADAMARD` op hint is set on a `ggml_mul_mat` node (hint index 1); internal CUDA backend, no project changes required |
+| ~b9305–b9333 | `ggml/src/ggml-metal/ggml-metal-device.{h,m}` | New `ggml_metal_device_id` enum covering M1–M5 variants; `device_id` field added to `ggml_metal_device_props`, populated by new `ggml_metal_device_id_parse()` from the MTL device name string; additive, no project changes required |
+| ~b9305–b9333 | `ggml/src/ggml-quants.c` | IQ2XS and IQ3XS neighbour-search init parallelized with OpenMP (3-pass: parallel count → serial prefix-sum → parallel write); fixes a prior race on `counter` under OpenMP; guards with `#ifdef GGML_USE_OPENMP`; internal quantization init, no project changes required |
+| ~b9305–b9333 | `src/llama-arch.cpp` | `LLM_TENSOR_FFN_LATENT_DOWN` and `LLM_TENSOR_FFN_LATENT_UP` probe op changed from `GGML_OP_MUL` to `GGML_OP_MUL_MAT`; fixes Nemotron 3 Super latent projections not staying on GPU (buft probe must use `MUL_MAT` to keep them there); internal upstream fix, no project changes required |
+| ~b9305–b9333 | `vendor/cpp-httplib/httplib.{h,cpp}` | Bumped to v0.45.1: `close_socket`, `shutdown_socket`, `Server::stop` marked `noexcept`; macOS Keychain cert loading migrated from deprecated `SecTrustCopyAnchorCertificates` to `SecTrustSettingsCopyCertificates` (all three trust domains: system, admin, user); `CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN` now restricted to `TARGET_OS_OSX` only with compile-time `#error` on iOS/tvOS/watchOS; compiled automatically, no project changes required |
+| ~b9305–b9333 | `common/common.h` | New `string_lcs(std::string_view a, std::string_view b)` function (longest common substring via DP); additive, not used by project directly |

## Build Commands

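The FWHT row in the CLAUDE.md table above says the new CUDA kernel is dispatched from `ggml_cuda_mul_mat` when the first operand is hinted to be a Hadamard matrix. As a rough illustration of why that substitution is valid, the CPU-side sketch below compares an unnormalized fast Walsh-Hadamard transform with the plain O(n²) product against a Sylvester-ordered Hadamard matrix. The names `fwht_inplace` and `hadamard_matmul_reference` are hypothetical and are not part of ggml or llama.cpp; the natural (Sylvester) ordering and the lack of normalization are assumptions, and the code does not reproduce the upstream CUDA kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a ggml API): in-place, unnormalized fast
// Walsh-Hadamard transform in natural (Sylvester) order.
// Precondition: data.size() is a power of two.
inline void fwht_inplace(std::vector<float> & data) {
    const std::size_t n = data.size();
    assert(n != 0 && (n & (n - 1)) == 0 && "length must be a power of two");
    // log2(n) butterfly stages; each stage combines elements `half` apart.
    for (std::size_t half = 1; half < n; half *= 2) {
        for (std::size_t block = 0; block < n; block += 2 * half) {
            for (std::size_t i = block; i < block + half; ++i) {
                const float a = data[i];
                const float b = data[i + half];
                data[i]        = a + b;
                data[i + half] = a - b;
            }
        }
    }
}

// Hypothetical reference: O(n^2) product H * x, where
// H[i][j] = +1 if (i & j) has an even number of set bits, else -1.
inline std::vector<float> hadamard_matmul_reference(const std::vector<float> & x) {
    const std::size_t n = x.size();
    std::vector<float> y(n, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            // Parity of the number of set bits in (i & j).
            bool odd = false;
            for (std::size_t bits = i & j; bits != 0; bits >>= 1) {
                odd ^= (bits & 1u) != 0;
            }
            y[i] += odd ? -x[j] : x[j];
        }
    }
    return y;
}
```

Both functions compute the same vector; the butterfly version needs n·log2(n) additions instead of n² multiply-adds, which is presumably the motivation for a dedicated kernel at N = 64 to 512.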
CMakeLists.txt

Lines changed: 1 addition & 1 deletion
@@ -114,7 +114,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
FetchContent_Declare(
llama.cpp
GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-GIT_TAG b9305
+GIT_TAG b9333
)
FetchContent_MakeAvailable(llama.cpp)

README.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
**Build:**
![Java 11+](https://img.shields.io/badge/Java-11%2B-informational)
![JUnit](https://img.shields.io/badge/tested%20with-JUnit4-yellow)
-[![llama.cpp b9305](https://img.shields.io/badge/llama.cpp-%23b9305-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9305)
+[![llama.cpp b9333](https://img.shields.io/badge/llama.cpp-%23b9333-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9333)
[![Publish](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml)
[![CodeQL](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml)

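The CLAUDE.md table in this commit lists a new upstream `string_lcs(std::string_view a, std::string_view b)` helper described as "longest common substring via DP". For readers unfamiliar with that technique, here is a minimal sketch of the classic dynamic-programming approach. The function name `longest_common_substring` is hypothetical, and the return type and tie-breaking rule (first longest match found while scanning `a`) are assumptions; this is not the upstream implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical illustration of longest-common-substring via DP.
// Not the upstream string_lcs body; return type is an assumption.
inline std::string longest_common_substring(std::string_view a, std::string_view b) {
    // prev[j] / curr[j]: length of the longest common suffix of
    // a[0..i) and b[0..j) for the previous / current row i.
    std::vector<std::size_t> prev(b.size() + 1, 0);
    std::vector<std::size_t> curr(b.size() + 1, 0);
    std::size_t best_len = 0;
    std::size_t best_end = 0; // one past the end of the best match in `a`
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            if (a[i - 1] == b[j - 1]) {
                curr[j] = prev[j - 1] + 1; // extend the diagonal run
                if (curr[j] > best_len) {
                    best_len = curr[j];
                    best_end = i;
                }
            } else {
                curr[j] = 0; // run broken
            }
        }
        std::swap(prev, curr); // keep only two rows: O(|b|) memory
    }
    return std::string(a.substr(best_end - best_len, best_len));
}
```

The sketch runs in O(|a|·|b|) time and keeps only two rows of the table, so memory stays linear in the length of `b`.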