Merge pull request #211 from bernardladenthin/claude/blissful-meitner-f8Hr6

bernardladenthin · web-flow · commit 09c9ee9e5c82 · 2026-06-06T23:23:54.000+02:00
Upgrade llama.cpp from b9495 to b9543
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9495**
+Current llama.cpp pinned version: **b9543**
 
 ## Upgrading CUDA Version
 
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -114,7 +114,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9495
+	GIT_TAG        b9543
 )
 FetchContent_MakeAvailable(llama.cpp)
 
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 **Build:**  
 ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)  
 ![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20Android-lightgrey)  
-[![llama.cpp b9495](https://img.shields.io/badge/llama.cpp-%23b9495-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9495)  
+[![llama.cpp b9543](https://img.shields.io/badge/llama.cpp-%23b9543-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9543)  
 [![JPMS](https://img.shields.io/badge/JPMS-modular%20JAR-25A162)](https://openjdk.org/projects/jigsaw/)  
 ![JUnit](https://img.shields.io/badge/tested%20with-JUnit6-25A162)  
 [![JSpecify](https://img.shields.io/badge/JSpecify-1.0.0%20%40NullMarked-25A162)](https://jspecify.dev)  
diff --git a/docs/history/llama-cpp-breaking-changes.md b/docs/history/llama-cpp-breaking-changes.md
@@ -303,3 +303,12 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
 | ~b9490–b9495 | `gguf-py/gguf/constants.py` + `gguf-py/gguf/tensor_mapping.py` + `tools/mtmd/clip-impl.h` + `tools/mtmd/clip-model.h` + `tools/mtmd/clip.cpp` + new `tools/mtmd/models/gemma4uv.cpp` + new `tools/mtmd/models/gemma4ua.cpp` + `tools/mtmd/mtmd-audio.{h,cpp}` + `tools/mtmd/mtmd.cpp` + `conversion/__init__.py` + `conversion/gemma.py` | New Gemma4 Unified vision + audio variant (`Gemma4UnifiedForConditionalGeneration`). Adds new projector types `PROJECTOR_TYPE_GEMMA4UV` and `PROJECTOR_TYPE_GEMMA4UA` (vision uses bigger patch size with token merging done on the conv layer; audio is encoder-free, raw 16 kHz waveform chunked into 640-sample frames). New `V_ENC_EMBD_PATCH_NORM` tensor enum (`v.patch_norm.{bid}`) and 3 indexed `patch_norm_{1,2,3}_{w,b}` weights on `clip_model` (Gemma4U uses standard PyTorch LayerNorm rather than RMSNorm before/after the patch embedding). New `mtmd_audio_preprocessor_gemma4ua` mel-major waveform packer (40 ms / 16 kHz frames; no FFT, no filterbank). Multimodal additions are routed through upstream `mtmd-cli` / `mtmd-debug` binaries that the project does not link; the JNI build links `libllama` + `libcommon` only. Additive at the GGUF / projector loader level: existing GGUFs without these projector types continue to load through the previous code paths. No project source or Java API changes required |
 | ~b9490–b9495 | `tools/ui/` (`package.json`, `src/lib/components/app/content/MarkdownContent/`, new `MermaidPreview.svelte`, new `DialogMermaidPreview.svelte`, new constants / icons / rehype plugins) | Upstream `llama-server` web UI gains Mermaid diagram rendering: new `mermaid@^11.15` dependency, lazy-loaded; new rehype plugin chain (`rehype-mermaid-pre`, `rehype-enhance-mermaid-blocks`) converts ` ```mermaid ` code fences to `<pre class="mermaid">` and wraps them with copy / preview action buttons; the existing single-file `MarkdownContent.svelte` is split into a `.svelte` + sibling `.css` / `markdown-utils.ts` / `markdown-handlers.ts` so the new mermaid renderer can share helpers. Project does not compile or ship the upstream `tools/ui` (server-only feature, classpath-only JNI build); no impact |
 | ~b9490–b9495 | upstream build / verification | Local build with `GIT_TAG b9495` was verified clean: `cmake -B build -DBUILD_TESTING=ON` configures cleanly, `cmake --build build --config Release -j$(nproc)` links `libjllama.so` + `jllama_test` with zero warnings on any project translation unit; `ctest --test-dir build --output-on-failure` reports 435/435 tests passing. All breaking changes in this range are renames within upstream-compiled translation units; no project source edits required for the version bump itself |
+| ~b9495–b9543 | `src/llama-hparams.{h,cpp}` + every `src/models/*.cpp` (~150 files) | Field `hparams::n_layer` (uint32_t) was split: the raw count moved to `hparams::n_layer_all` and `hparams::n_layer()` is now a member **function** that returns `n_layer_all - n_layer_nextn` (the effective non-MTP layer count). Sibling rename: `hparams::nextn_predict_layers` &#x2192; `hparams::n_layer_nextn`. Every per-model TU in `src/models/*.cpp` was updated to call `hparams.n_layer()` and `hparams.n_layer_nextn`. New `hparams::set_recr_pattern()` mirror of `set_swa_pattern()` for hybrid recurrent architectures. New per-layer `hparams::deepstack_mapping_arr` (LLAMA_MAX_LAYERS, default -1) populated from new GGUF key `LLM_KV_DEEPSTACK_MAPPING` for Granite4-Vision-style per-layer deepstack injection. `hparams::kv_only_nextn` was removed (MTP heads now use a layer filter callback instead). Project does not reference any of these hparams symbols directly &mdash; verified via `grep -rn "hparams\.n_layer\|nextn_predict_layers\|n_layer_nextn\|n_layer_all\|deepstack_mapping" src/main/cpp/ src/test/cpp/` returns zero matches. All consumers are inside upstream-compiled TUs (`llama-model.cpp`, `llama-context.cpp`, model TUs); no project source changes required |
+| ~b9495–b9543 | `include/llama.h` (state-seq flags) + `tools/server/server-context.cpp` + `examples/speculative-simple/speculative-simple.cpp` | The `LLAMA_STATE_SEQ_FLAGS_ON_DEVICE` flag was removed from the `llama_state_seq_flags` enum. All upstream call sites that passed `LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY \| LLAMA_STATE_SEQ_FLAGS_ON_DEVICE` were updated to pass only `LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY` &mdash; the on-device path is now the default for partial saves/loads. Project does not call `llama_state_seq_get_*` / `llama_state_seq_set_*` directly from `jllama.cpp`; the only consumer in the JNI build is upstream `server-context.cpp` (speculative checkpoint helpers), which was updated upstream. Verified via `grep -rn "LLAMA_STATE_SEQ_FLAGS_ON_DEVICE" src/` returns zero matches. No project source changes required |
+| ~b9495–b9543 | new `common/imatrix-loader.{h,cpp}` + refactor of `tools/imatrix/imatrix.cpp` + `tools/quantize/quantize.cpp` | Extracted shared imatrix-loading logic into a standalone library: new `common_imatrix` struct (`entries`, `datasets`, `chunk_count`, `chunk_size`, `is_legacy`, `has_metadata`) and `common_imatrix_load(const std::string &, common_imatrix &)` reader. New GGUF metadata keys exposed as `LLM_KV_IMATRIX_DATASETS`, `LLM_KV_IMATRIX_CHUNK_COUNT`, `LLM_KV_IMATRIX_CHUNK_SIZE`. The imatrix and quantize CLIs were rewritten to consume this shared loader (the legacy in-file binary parser also moved into the shared loader). Build system: `common/CMakeLists.txt` now includes `imatrix-loader.cpp` and `imatrix-loader.h` in `libcommon`, which means the JNI build picks up the new TU automatically via FetchContent + the existing `target_link_libraries(jllama PRIVATE common)` line. Project does not use imatrix loading from Java today (no `LlamaImatrix` class); the new symbols ship as additive surface area only. No project source changes required |
+| ~b9495–b9543 | `tools/mtmd/clip.{h,cpp}` + `tools/mtmd/clip-impl.h` + `tools/mtmd/clip-model.h` + `tools/mtmd/mtmd.{h,cpp}` + `tools/mtmd/mtmd-helper.{h,cpp}` + `tools/mtmd/mtmd-image.cpp` + every `tools/mtmd/models/*.cpp` | Large MTMD subsystem refactor: (1) `clip_image_u8` and `clip_image_f32` switched from public POD-style `nx` / `ny` / `buf` fields to private members with `get_size()` / `set_size()` / `get_ro_buf()` / `cpy_buf()` / `get_pixel()` / `set_pixel()` / `is_placeholder()` getters/setters; every model TU and image helper was updated to the new API. (2) Several public helpers were removed from `tools/mtmd/clip.h`: `clip_embd_nbytes`, `clip_embd_nbytes_by_img`, `clip_image_u8_get_data`, `clip_build_img_from_pixels`, `clip_get_newline_tensor`, `clip_encode_float_image`, `clip_image_f32_batch_add_mel`. (3) `mtmd_helper_bitmap_init_from_file()` and `mtmd_helper_bitmap_init_from_buf()` gained a required `bool placeholder` parameter (when true the bitmap reserves shape only, no pixel decode &mdash; used for token counting). (4) `mtmd_bitmap` is now a true class (private buffer + `is_placeholder()` / `can_batch_with()`); `mtmd_bitmap_init()` and `mtmd_bitmap_init_from_audio()` accept `nullptr` data to create placeholder bitmaps. (5) New Granite4 Vision projector type `PROJECTOR_TYPE_GRANITE4_VISION` and tensor enums (`V_MULTI_PROJ_*`, `V_QF_*`) for QFormer-with-window projection. (6) Qwen-VL video / temporal-merge support: `clip_graph_qwen2vl::build_inp_with_temporal_merge()` plus `n_batch_max=2` for batch-merged consecutive image frames. Project does not link any `tools/mtmd/*` TUs into the JNI build (`libllama` + `libcommon` only); the JNI vision API surfaces through `mtmd-helper.h` and was reviewed: zero `clip_image_*` / removed-helper references found across `src/main/cpp/` and `src/test/cpp/`. No project source changes required |
+| ~b9495–b9543 | `tools/server/server-context.cpp` + `tools/server/server-http.cpp` + `tools/server/server.cpp` (new `/v1/responses/input_tokens` + `/v1/chat/completions/input_tokens` + `/v1/messages/count_tokens`) | New token-counting endpoints (Anthropic-compatible + OpenAI Responses-API-compatible). Implementation: `server_routes::handle_count_tokens()` consolidates the body parsing path (chat completions, responses, anthropic messages) and emits `{"input_tokens": N, "object": "response.input_tokens"}`. `process_mtmd_prompt()` signature gained a `bool is_placeholder = false` parameter so token-counting can reuse the multimodal tokenization path without decoding image/audio pixels. Server-only HTTP endpoints (the JNI build links neither `tools/server/server.cpp` nor `server-http.cpp`); the only server TU we link is `server-context.cpp`, where the only project-visible change is the new optional `process_mtmd_prompt` parameter, which is defaulted &mdash; existing project call sites compile unchanged. No project source changes required |
+| ~b9495–b9543 | `common/chat-peg-parser.{h,cpp}` + `common/chat.cpp` (LFM2/2.5 unified) | LFM2.5's chat-completion parser was merged into the single `common_chat_params_init_lfm2()` (was a separate `_lfm2_5` function); a `bool tool_list_tokens` flag toggles between the two template flavours. New helper `common_chat_peg_builder::python_or_json_value()` and a new `bool allow_json_literals` parameter on `python_style_tool_calls()` so LFM2.5 can accept JSON-cased `true` / `false` / `null` alongside the Python-cased literals. Pure-Python literal normalisation in `chat-peg-parser.cpp` (`True`/`False`/`None` &#x2192; JSON during streaming). Project does not call any `common_chat_peg_*` or `common_chat_params_init_lfm2*` symbols; routing happens inside upstream-compiled `chat.cpp`. No project source changes required |
+| ~b9495–b9543 | `ggml/src/ggml-cuda/mmvq.cu` + `ggml/src/ggml-cpu/arch/{riscv,wasm}/quants.c` + `ggml/src/ggml-metal/ggml-metal-device.m` + `ggml/src/ggml-opencl/*` + `ggml/src/ggml-sycl/*` + `ggml/src/ggml-vulkan/*` + `ggml/src/ggml-webgpu/*` + `ggml/src/ggml-cpu/kleidiai/kleidiai.cpp` | Per-backend numerical & performance work: (1) CUDA `mul_mat_vec_q_moe` switched to `GGML_CUDA_RESTRICT` aliasing + PDL launch params for Hopper. (2) RISC-V Vector quants: dispatch-by-VL refactor (`vl128` / `vl256` / `vl512` / `vl1024` separate kernels for Q2_K, Q3_K, Q4_K, Q6_K, IQ1_S, IQ1_M, IQ2_S, IQ2_XS, IQ3_S, IQ3_XXS, IQ4_XS, TQ1_0, TQ2_0). (3) WebAssembly SIMD path for Q4_1. (4) Metal residency-set keep-alive polling interval tightened to 5 ms (was 500 ms). (5) OpenCL Adreno: faster `concat`/`cpy`/`get_rows` packed kernels for narrow tensors (`<32` cols); Q6_K mat-vec rewritten with vec4 weight gather. (6) SYCL: multi-column MMVQ paths added for all quant types (ncols=2..8) used by speculative decoding's draft verification batches; `should_reorder_tensor` gate widened from `ne[1]==1` to `ne[1]<=8`. (7) Vulkan: NV cooperative-matrix2 feature detection now requires every `coopmat2_features.*` bit; FWHT shader gains shmem fallback (Intel Windows driver bug workaround). (8) WebGPU: flash-attention split into vector / tile / subgroup-matrix variants with K/V quantization-aware staging (`U32_DEQUANT_HELPERS`); GRANITE_SPEECH bumped to multi-projector. (9) KleidiAI: env vars `GGML_KLEIDIAI_CHUNK_MULTIPLIER` & `GGML_KLEIDIAI_SME` thread-cap auto-detect; SME + non-SME hybrid scheduling. All purely backend-internal; project compiles backends through FetchContent with no API surface change visible to `jllama.cpp`. No project source changes required |
+| ~b9495–b9543 | `conversion/__init__.py` + `conversion/granite.py` + `conversion/gemma.py` + `convert_lora_to_gguf.py` + `gguf-py/gguf/{constants,tensor_mapping,gguf_writer}.py` | Python-side: new `Granite4VisionMmprojModel` (vision-projector for Granite4 with QFormer-window deepstack + per-projector spatial offsets + image-grid pinpoints); Gemma4 unified vision/audio conversion fix-ups for newer HF checkpoints (`hidden_size` falls back to `audio_embed_dim`; `model_patch_size` falls back to `patch_size * pooling_kernel_size`). `convert_lora_to_gguf.py` gained `--trust-remote-code`. New `LLM_KV_DEEPSTACK_MAPPING` writer (`add_deepstack_mapping`) and new clip-vision keys (`KEY_PROJ_SAMPLE_QUERY_SIDE`, `KEY_PROJ_SAMPLE_WINDOW_SIDE`, `KEY_PROJ_SPATIAL_OFFSETS`, `KEY_FEATURE_LAYERS`, `KEY_IMAGE_GRID_PINPOINTS`) for the Granite4 vision projector. Python-side only; no impact on the Java/JNI build. No project source changes required |
+| ~b9495–b9543 | upstream build / verification | Local build pending: the b9495 &#x2192; b9543 bump is expected to compile cleanly given the audit above (zero `grep` matches in `src/main/cpp/` for any of the renamed or removed symbols: `hparams.n_layer`, `nextn_predict_layers`, `n_layer_nextn`, `n_layer_all`, `LLAMA_STATE_SEQ_FLAGS_ON_DEVICE`, `clip_image_u8`/`clip_image_f32` field access, `clip_build_img_from_pixels`, `clip_get_newline_tensor`, `clip_image_u8_get_data`, `clip_embd_nbytes`, `clip_embd_nbytes_by_img`, `clip_encode_float_image`, `clip_image_f32_batch_add_mel`, `mtmd_helper_bitmap_init_from_file`, `mtmd_helper_bitmap_init_from_buf`, `common_imatrix_load`). The only project-visible signature change &mdash; `process_mtmd_prompt()`'s new `bool is_placeholder` parameter &mdash; is defaulted, so existing call sites inside the project compile unchanged. All breaking changes in this range are absorbed inside upstream-compiled translation units; no project source edits required for the version bump itself |
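
Reviewer note on the `hparams::n_layer` row above: the field-to-function split can be pictured with the minimal sketch below. `hparams_sketch` is a hypothetical stand-in, not the upstream struct; only the three members named in the row are modelled, and the values 48 and 1 are illustrative.

```cpp
#include <cstdint>

// Hypothetical stand-in for the upstream hparams struct (assumption: only the
// members named in the changelog row are modelled here).
struct hparams_sketch {
    uint32_t n_layer_all;   // raw layer count (formerly the n_layer field)
    uint32_t n_layer_nextn; // MTP layer count (formerly nextn_predict_layers)

    // Effective non-MTP layer count, as the row describes it.
    constexpr uint32_t n_layer() const { return n_layer_all - n_layer_nextn; }
};

// Illustrative values only: 48 layers in total, 1 of them an MTP layer.
static_assert(hparams_sketch{48, 1}.n_layer() == 47,
              "n_layer() excludes the MTP layers");
```

Code that still reads `hparams.n_layer` as a field would stop compiling after the bump, which is why the `grep` audit in the row matters even though it found no matches in this project.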

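
Reviewer note on the `process_mtmd_prompt()` change described in the changelog rows: a trailing parameter with a default value leaves existing call sites source-compatible. The sketch below illustrates only that language rule. `decodes_pixels_sketch` is a hypothetical function, not the upstream signature, and its body is an assumption made for illustration.

```cpp
// Hypothetical stand-in for a function that gained a defaulted trailing flag
// (assumption: the real process_mtmd_prompt() takes different arguments).
constexpr bool decodes_pixels_sketch(int n_bitmaps, bool is_placeholder = false) {
    // Placeholder mode reserves shape only, so no pixel decode happens.
    return n_bitmaps > 0 && !is_placeholder;
}

// A call site written before the flag existed still compiles unchanged.
static_assert(decodes_pixels_sketch(2), "old one-argument call still works");

// New callers, such as a token-counting path, can opt in explicitly.
static_assert(!decodes_pixels_sketch(2, true), "placeholder skips decoding");
```

The default only protects source compatibility. Anything compiled against the old declaration still has to be rebuilt, which a FetchContent build does as part of the bump.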