You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
No project source changes required. All upstream API changes are either
additive (new GGUF buffer/callback init, message_spans, FWHT kernel) or
internal to compiled-in upstream files (checkpoint logic refactor,
ggml-backend-meta external view fix, OpenMP move to ggml-base).
The renamed checkpoint_every_nt → checkpoint_min_step field is not
exposed by the Java API.
https://claude.ai/code/session_01KhxAkQXywgMS4NxDhYirnX
Co-authored-by: Claude <noreply@anthropic.com>
Copy file name to clipboardExpand all lines: CLAUDE.md
+16-1Lines changed: 16 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
6
6
7
7
Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
8
8
9
-
Current llama.cpp pinned version: **b9305**
9
+
Current llama.cpp pinned version: **b9333**
10
10
11
11
## Upgrading CUDA Version
12
12
@@ -419,6 +419,21 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
419
419
|~b9297–b9305 |`ggml/src/ggml-backend-meta.cpp`| Bug fix for zero-sized split tensor slices: `set_tensor`/`get_tensor`/`set_tensor_async`/`get_tensor_async` paths now `continue` when `chunk_size_j == 0`; `ggml_backend_meta_alloc_ctx_tensors_from_buft` now allocates a dummy buffer when all tensors in a context are zero-sized (was returning `NULL` and asserting); `ggml_backend_buft_alloc_buffer` result now `GGML_ASSERT`ed non-null. Internal backend code, no project changes required |
420
420
|~b9297–b9305 |`ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c`|`hvx_vec_splat_f16(hvx_vec_get_f16(...))` round-trip replaced with `hvx_vec_repl_f16(...)` which stays in the vector domain via `vdelta` (avoids store/reload through scalar). Internal Hexagon DSP backend optimization, no project changes required |
421
421
|~b9297–b9305 |`ggml/src/ggml-opencl/ggml-opencl.cpp`|`GGML_OPENCL_PROFILING` batching fix: when `profiling_info` reaches 2048 entries the batch is now flushed into a persistent `profiling_results` vector (events released, durations populated) instead of accumulating until shutdown. Also fixes missing `]` closing the JSON array in `cl_trace.json`. Profile-only code (`GGML_OPENCL_PROFILING` is off by default), no project changes required |
422
+
|~b9305–b9333 |`common/common.h` + `common/arg.cpp`|`common_params::checkpoint_every_nt` renamed to `checkpoint_min_step`; default changed 8192 → 256; CLI flag `-cpent`/`--checkpoint-every-n-tokens`**REMOVED** (throws `std::invalid_argument` at parse time) and replaced by `-cms`/`--checkpoint-min-step`; env var `LLAMA_ARG_CHECKPOINT_EVERY_NT` → `LLAMA_ARG_CHECKPOINT_MIN_SPACING_NT`. Java layer does not expose this flag, no project source changes required |
423
+
|~b9305–b9333 |`common/chat.h` + `common/chat.cpp`| New `common_chat_msg_span` and `common_chat_msg_delimiter` structs; new `common_chat_params::message_spans` field (default empty vector); new `common_chat_split_by_role()` function; populated for GPT-OSS, Gemma4, and all autoparser-handled templates with detected `user_start`/`assistant_start` markers; passed through `server-common.cpp` as `message_spans` JSON array in the task params; compiled from upstream, no Java changes required |
424
+
|~b9305–b9333 |`common/chat-diff-analyzer.cpp` + `common/chat-auto-parser.h`| New `autoparser::user_start` and `autoparser::assistant_start` fields auto-detected via differential template analysis; new patches for Nemotron Nano v2, Fireworks v2, Solar Open, Apriel 1.6; additive, compiled from upstream, no project changes required |
425
+
|~b9305–b9333 |`tools/server/server-task.h` + `tools/server/server-context.cpp`| New `task_params::n_before_user` field (default `-1`); server computes it from `message_spans` to place context checkpoints precisely at the last-user-message boundary; MTP context creation now propagates `draft.cache_type_k/v`; compiled directly into jllama from upstream, no project source changes required |
426
+
|~b9305–b9333 |`ggml/include/gguf.h` + `ggml/src/gguf.cpp`| New `gguf_reader_callback_t` typedef; new `gguf_init_from_buffer(data, size, params)` and `gguf_init_from_callback(callback, userdata, max_chunk_read, max_expected_size, params)` public APIs; internal `gguf_init_from_reader()` helper refactored to use a callback-based reader; additive, not used by project |
427
+
|~b9305–b9333 |`ggml/CMakeLists.txt`| GGML version bumped 0.12.0 → 0.13.0; no project changes required |
428
+
|~b9305–b9333 |`ggml/src/CMakeLists.txt` + `ggml/src/ggml-cpu/CMakeLists.txt`| OpenMP detection and `target_link_libraries` moved from `ggml-cpu` into `ggml-base`; exported `ggml-config.cmake.in` updated to add `GGML_BASE_INTERFACE_LINK_LIBRARIES` and guard OpenMP targets before appending; fixes static-lib consumers that link only `ggml-base`; no project source changes required |
429
+
|~b9305–b9333 |`ggml/src/ggml-alloc.c`| Off-by-one bug fix in `ggml_dyn_tallocr_remove_block`: loop ran one iteration past the last valid element; internal allocator fix, no project changes required |
430
+
|~b9305–b9333 |`ggml/src/ggml-backend-meta.cpp`| Rotating-pair compute containers: external views created between evals now use a `stc_compute[2]` double-buffer scheme so they don't slowly deplete `stc_static` memory; `split_state_cache` is now unbounded (comment documents it as FIXME); `ggml_backend_meta_alloc_ctx_tensors_from_buft` uses `ggml_get_mem_size(ctx)` for static container and `16×` that for each compute container; internal multi-GPU meta backend refactor, no project changes required |
431
+
|~b9305–b9333 |`ggml/src/ggml-cuda/fwht.cu` + `fwht.cuh` + `ggml-cuda.cu`| New CUDA FWHT (Fast Walsh-Hadamard Transform) kernel (`fwht_cuda<N>`) for N = 64/128/256/512; dispatched from `ggml_cuda_mul_mat` when `GGML_HINT_SRC0_IS_HADAMARD` op hint is set on a `ggml_mul_mat` node (hint index 1); internal CUDA backend, no project changes required |
432
+
|~b9305–b9333 |`ggml/src/ggml-metal/ggml-metal-device.{h,m}`| New `ggml_metal_device_id` enum covering M1–M5 variants; `device_id` field added to `ggml_metal_device_props`, populated by new `ggml_metal_device_id_parse()` from the MTL device name string; additive, no project changes required |
433
+
|~b9305–b9333 |`ggml/src/ggml-quants.c`| IQ2XS and IQ3XS neighbour-search init parallelized with OpenMP (3-pass: parallel count → serial prefix-sum → parallel write); fixes a prior race on `counter` under OpenMP; guards with `#ifdef GGML_USE_OPENMP`; internal quantization init, no project changes required |
434
+
|~b9305–b9333 |`src/llama-arch.cpp`|`LLM_TENSOR_FFN_LATENT_DOWN` and `LLM_TENSOR_FFN_LATENT_UP` probe op changed from `GGML_OP_MUL` to `GGML_OP_MUL_MAT`; fixes Nemotron 3 Super latent projections not staying on GPU (buft probe must use `MUL_MAT` to keep them there); internal upstream fix, no project changes required |
435
+
|~b9305–b9333 |`vendor/cpp-httplib/httplib.{h,cpp}`| Bumped to v0.45.1: `close_socket`, `shutdown_socket`, `Server::stop` marked `noexcept`; macOS Keychain cert loading migrated from deprecated `SecTrustCopyAnchorCertificates` to `SecTrustSettingsCopyCertificates` (all three trust domains: system, admin, user); `CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN` now restricted to `TARGET_OS_OSX` only with compile-time `#error` on iOS/tvOS/watchOS; compiled automatically, no project changes required |
436
+
|~b9305–b9333 |`common/common.h`| New `string_lcs(std::string_view a, std::string_view b)` function (longest common substring via DP); additive, not used by project directly |
0 commit comments