CLAUDE.md
+10 -1 (10 additions & 1 deletion)
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
-
Current llama.cpp pinned version: **b9284**
+
Current llama.cpp pinned version: **b9297**
## Upgrading CUDA Version
@@ -399,6 +399,15 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
|~b9279–b9284 |`tools/{batched-bench,cli,completion,fit-params,llama-bench,perplexity,quantize,server}/CMakeLists.txt`| Each `*-impl` target switched from `add_library(... STATIC ...)` to default library type (becomes SHARED when `BUILD_SHARED_LIBS=ON`); added `WINDOWS_EXPORT_ALL_SYMBOLS ON` and conditional `install(TARGETS ... LIBRARY)` under `LLAMA_TOOLS_INSTALL`. Project doesn't enable `LLAMA_BUILD_TOOLS`, so none of these targets are configured — no impact |
|~b9279–b9284 |`src/llama-vocab.cpp` + `conversion/base.py`| HybridDNA tokenizer fix: k-mers are now stored in `token_to_id` with a reserved `\xee\x80\x80` (U+E000) suffix to disambiguate them from identical base-vocab BPE tokens (e.g. `CCCCCC`); the suffix is stripped from `id_to_token` text after vocab load. Pure tokenizer internals, not exposed via JNI — no project changes required |
|~b9279–b9284 |`ggml/src/ggml-cuda/common.cuh`| PDL-launch gating now uses `ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_HOPPER` instead of the raw device cc — fixes false negatives when running on a Hopper device with a binary compiled for an older arch. Internal CUDA backend, no project changes required |
+
|~b9284–b9297 | upstream `CMakeLists.txt`|`LLAMA_BUILD_APP` default reverted from `ON` back to `${LLAMA_STANDALONE}` (i.e. OFF for FetchContent consumers). Project's `set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)` shim is now redundant but harmless; kept as defensive pin against future flips |
+
|~b9284–b9297 |`common/chat.h` + `tools/server/server-task.cpp`| New additive `common_chat_parser_params::is_continuation` field (default `false`); `params_from_json_cmpl` now parses the `continue_final_message` request field via `common_chat_continuation_parse()` and sets `is_continuation` when the result is non-`NONE`. `task_result_state` ctor guard tightened: the empty-prefill `chat_msg = common_chat_parse("", true, ...)` initialization is now gated on `is_continuation && !echo` (was just `!echo`) — i.e. the assistant-prefill suppression delta is only emitted when an actual continuation is requested. Java `InferenceParameters.setContinueFinalMessage(boolean\|ContinuationMode)` already writes `continue_final_message` to the request JSON, so behaviour is wired through automatically; non-continuation requests now correctly emit the first delta instead of suppressing it |
+
|~b9284–b9297 |`src/llama-model.{h,cpp}` + `src/models/qwen35.cpp` + `src/models/qwen35moe.cpp`| NVFP4 quantization extended to MTP (Multi-Token Prediction) tensors: `llama_layer_nextn` gains four scale fields (`eh_proj_s`, `eh_proj_in_s`, `shared_head_head_s`, `shared_head_head_in_s`); `load_tensors()` loads them when the corresponding base tensor exists and is NVFP4; Qwen3.5 / Qwen3.5-MoE MTP graphs pass the scales into `build_lora_mm()`. Internal model-loading + graph-building changes, no project changes required |
+
|~b9284–b9297 |`ggml/src/ggml-backend.cpp`| Bug fix in `ggml_backend_tensor_get_2d_async`: fast-path condition checked `iface.set_tensor_2d_async == NULL` (typo) instead of `iface.get_tensor_2d_async == NULL`; multi-copy gets now correctly fall back when the backend lacks `get_tensor_2d_async`. Also corrects an out-of-bounds assertion message from "write" to "read". Internal backend code, no project changes required |
+
|~b9284–b9297 |`ggml/src/ggml-opencl/` (`ggml-opencl.cpp` + 17 kernel files) | Adreno MoE pipeline bug fix: GEMM/GEMV kernels for MXFP4/Q4_0/Q4_1/Q4_K/Q5_0/Q5_1/Q5_K/Q6_K had a boundary-check race where the `ne01` bounds check exited threads early and prevented their participation in tile-wide reductions, causing wrong results when `ne01 % 64 != 0`. Fixed by: (1) rounding `global_size[0]` up to the next multiple of 64 in `ggml_cl_mul_mat_id`, (2) moving the per-thread `ne01` early-return in each GEMM kernel to AFTER the tile reduction, (3) adding the same early-return in the GEMV kernels and the cvt.cl trans4_ns/restore_ns kernels; alignment threshold also relaxed from `ne01 % 64 == 0` to `ne01 % 32 == 0` in `use_adreno_moe_kernels`. Internal OpenCL backend, affects the `opencl-android-aarch64` classifier build only — no project source changes |
+
|~b9284–b9297 |`ggml/src/ggml-sycl/` (`ggml-sycl.cpp`, `dmmv.cpp`, `gated_delta_net.cpp`, `common.hpp`) | (1) BF16 added to `ggml_sycl_supports_dmmv()` and `can_use_dequantize_mul_mat_vec()`; new `convert_mul_mat_vec_bf16_sycl` path. (2) Level Zero auto-detect moved into `ggml_sycl_init()` — `info.ext_oneapi_level_zero` flag now reflects the GPU-only check (CPU devices ignored) and is used as the default for `GGML_SYCL_ENABLE_LEVEL_ZERO` env. (3) `mmid_counting_sort_rows()` replaces the per-expert atomic scan in `ggml_sycl_mul_mat_id` — host-side counting sort builds expert-contiguous row slices in a single pass instead of N×expert atomic scans; significant speedup for MoE dispatch. (4) Gated-delta-net kernel extended with `keep_rs_t` template parameter and per-token snapshot writes when `K > 1`, matching the CUDA/Vulkan snapshot changes from b9222. Internal SYCL backend, no project changes required |
+
|~b9284–b9297 |`ggml/src/ggml-vulkan/CMakeLists.txt`|`find_package(SPIRV-Headers)` switched to `CONFIG REQUIRED` and adds `$ENV{VULKAN_SDK}` to `CMAKE_PREFIX_PATH`; fixes detection when SPIRV-Headers ships only the CMake-config files (no FindSPIRV-Headers.cmake). Internal Vulkan build config, no project changes required |
+
|~b9284–b9297 |`ggml/src/ggml-zendnn/` (`CMakeLists.txt`, `ggml-zendnn.cpp`) | ZenDNN bumped to ZenDNN-2026-WW19; Q8_0 weight support added for matmul and matmul_id paths via dynamic quantization (S8 compute, BF16 scales); ZenDNN matmul/matmul_id now handles `GGML_TYPE_Q8_0` with FP32 src1 directly without F32→Q8_0 conversion. Internal AMD ZenDNN backend, no project changes required |
+
|~b9284–b9297 |`tools/perplexity/perplexity.cpp`|`log_probs.resize(n_ctx * nv)` widened to `size_t(n_ctx) * nv` to avoid 32-bit overflow on large context sizes. Standalone tool not compiled by project, no impact |
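
For reviewers who want to see why the `size_t(n_ctx) * nv` widening in the last row matters, here is a minimal sketch. It is not the upstream code: the function names `logits_count` and `logits_count_wrapped` are hypothetical, both operands are assumed to be 32-bit `int`, and `size_t` is assumed to be 64 bits wide. The narrow variant multiplies in unsigned 32-bit arithmetic so the wraparound is well defined for the demonstration (signed overflow in the original expression would be undefined behaviour).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption for this sketch: a 64-bit size_t, as on the usual desktop/server targets.
static_assert(sizeof(std::size_t) >= 8, "sketch assumes a 64-bit size_t");

// Hypothetical helper: element count of an n_ctx x nv buffer.
// Casting one operand to size_t first makes the whole product 64-bit,
// mirroring the shape of the `size_t(n_ctx) * nv` fix described in the table.
inline std::size_t logits_count(std::int32_t n_ctx, std::int32_t nv) {
    assert(n_ctx >= 0 && nv >= 0);  // negative sizes would convert to huge values
    return static_cast<std::size_t>(n_ctx) * nv;
}

// Hypothetical helper: the same product truncated to 32 bits.
// Unsigned arithmetic is used so the wraparound is defined and observable.
inline std::uint32_t logits_count_wrapped(std::int32_t n_ctx, std::int32_t nv) {
    assert(n_ctx >= 0 && nv >= 0);
    return static_cast<std::uint32_t>(n_ctx) * static_cast<std::uint32_t>(nv);
}
```

With an illustrative context of 32768 tokens and a row width of 152064, the widened product is 4982833152, while the 32-bit product wraps to 687865856, so a `resize()` fed the narrow value would allocate far too few elements. For small inputs such as 512 × 32000 the two agree, which is why the bug only appears at large context sizes.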