Upgrade llama.cpp from b9543 to b9549

claude · claude · commit 78cfef1d1e36 · 2026-06-07T15:37:04.000Z
Recompile-only bump: every upstream change in the b9543..b9549 range is
absorbed inside upstream-compiled translation units (libllama + libcommon +
server-context.cpp pulled via FetchContent). Reviewed the full patch against
the project's own C++ surface — grep across src/main/cpp/ finds zero
references to any changed/added symbol (llama_context_params::ctx_other,
llama_get_ctx_other, llama-ext.h, the kv-cache mem_other/share ctor params,
n_layer_nextn, n_embd_inp_impl). Highlights:

- include/llama.h: new llama_context_params::ctx_other (shared source/target
  context); initialized in llama_context_default_params(), set by upstream
  server-context.cpp for MTP draft contexts. Project never aggregate-inits it.
- New arch GEMMA4_ASSISTANT (NextN/MTP draft head sharing the target's KV
  cache) + NEXTN_PROJ_PRE/POST tensors. This is a speculative-decoding/MTP
  feature, which remains deferred by policy; loading non-assistant GGUFs is
  unaffected.
- KV-cache constructors gained mem_other + share(layer_share_cb) params for
  cross-context cell sharing; all call sites updated upstream.
- common/chat.cpp LFM2 reasoning->thinking + new LFM2.5-8B-A1B.jinja template;
  handled automatically inside upstream chat.cpp.
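
Why a new params field needs no caller changes when callers start from the
defaults function can be sketched in C++ as below. This is a minimal
illustration under stated assumptions: demo_context, demo_context_params,
demo_context_default_params() and make_params() are hypothetical stand-ins,
not the real llama.h layout, and the default values are placeholders.

```cpp
#include <cstdint>

// Hypothetical opaque context type (stand-in only).
struct demo_context;

// Hypothetical params struct; the field order and types are assumptions
// chosen for illustration, not the upstream llama_context_params layout.
struct demo_context_params {
    uint32_t       n_ctx;
    demo_context * ctx_other;  // newly added field in this sketch
    uint32_t       n_batch;
};

// Stand-in for a "default params" factory: it owns the defaults, so a
// newly added field gets a sane value without any caller changes.
inline demo_context_params demo_context_default_params() {
    demo_context_params p{};
    p.n_ctx     = 512;      // placeholder default
    p.ctx_other = nullptr;  // new field defaults to "no shared context"
    p.n_batch   = 2048;     // placeholder default
    return p;
}

// Caller pattern: start from the defaults, then override fields by name.
// This keeps compiling no matter where new fields are added to the struct.
inline demo_context_params make_params(uint32_t n_ctx) {
    demo_context_params p = demo_context_default_params();
    p.n_ctx = n_ctx;
    return p;
}
```

In this sketch, a positional initializer such as
demo_context_params{4096, 2048}, written before ctx_other was inserted, would
stop compiling, which is the failure mode the defaults-then-override pattern
avoids.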

Also fixes a pre-existing latent bug surfaced by the local build (not an
upstream change): CMakeLists.txt invoked net.ladenthin.llama.OSInfo, but the
class moved to the loader subpackage in the layered restructure, so
`cmake -B build` failed to resolve OS_NAME/OS_ARCH on hosts that don't pass
them explicitly (CI does). Both invocations (--os and --arch) now use
loader.OSInfo. This is the same stale-FQN-after-restructure class of bug as
the earlier spotbugs-exclude / PIT-target repairs.

Verified on Linux x86_64: cmake configure clean, libjllama.so + jllama_test
link with zero project-TU warnings, ctest 435/435 passing. Appended the
b9543..b9549 range to docs/history/llama-cpp-breaking-changes.md.
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9543**
+Current llama.cpp pinned version: **b9549**
 
 ## Upgrading CUDA Version
 
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -114,7 +114,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9543
+	GIT_TAG        b9549
 )
 FetchContent_MakeAvailable(llama.cpp)
 
@@ -159,7 +159,7 @@ if(NOT DEFINED OS_NAME)
     find_package(Java REQUIRED)
     find_program(JAVA_EXECUTABLE NAMES java)
 	execute_process(
-      COMMAND ${JAVA_EXECUTABLE} -cp ${CMAKE_SOURCE_DIR}/target/classes net.ladenthin.llama.OSInfo --os
+      COMMAND ${JAVA_EXECUTABLE} -cp ${CMAKE_SOURCE_DIR}/target/classes net.ladenthin.llama.loader.OSInfo --os
       OUTPUT_VARIABLE OS_NAME
       OUTPUT_STRIP_TRAILING_WHITESPACE
     )
@@ -177,7 +177,7 @@ if(NOT DEFINED OS_ARCH)
     find_package(Java REQUIRED)
     find_program(JAVA_EXECUTABLE NAMES java)
     execute_process(
-      COMMAND ${JAVA_EXECUTABLE} -cp ${CMAKE_SOURCE_DIR}/target/classes net.ladenthin.llama.OSInfo --arch
+      COMMAND ${JAVA_EXECUTABLE} -cp ${CMAKE_SOURCE_DIR}/target/classes net.ladenthin.llama.loader.OSInfo --arch
       OUTPUT_VARIABLE OS_ARCH
       OUTPUT_STRIP_TRAILING_WHITESPACE
     )
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 **Build:**  
 ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)  
 ![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20Android-lightgrey)  
-[![llama.cpp b9543](https://img.shields.io/badge/llama.cpp-%23b9543-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9543)  
+[![llama.cpp b9549](https://img.shields.io/badge/llama.cpp-%23b9549-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9549)  
 [![JPMS](https://img.shields.io/badge/JPMS-modular%20JAR-25A162)](https://openjdk.org/projects/jigsaw/)  
 ![JUnit](https://img.shields.io/badge/tested%20with-JUnit6-25A162)  
 [![JSpecify](https://img.shields.io/badge/JSpecify-1.0.0%20%40NullMarked-25A162)](https://jspecify.dev)  
diff --git a/docs/history/llama-cpp-breaking-changes.md b/docs/history/llama-cpp-breaking-changes.md
@@ -312,3 +312,12 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
 | ~b9495–b9543 | `ggml/src/ggml-cuda/mmvq.cu` + `ggml/src/ggml-cpu/arch/{riscv,wasm}/quants.c` + `ggml/src/ggml-metal/ggml-metal-device.m` + `ggml/src/ggml-opencl/*` + `ggml/src/ggml-sycl/*` + `ggml/src/ggml-vulkan/*` + `ggml/src/ggml-webgpu/*` + `ggml/src/ggml-cpu/kleidiai/kleidiai.cpp` | Per-backend numerical & performance work: (1) CUDA `mul_mat_vec_q_moe` switched to `GGML_CUDA_RESTRICT` aliasing + PDL launch params for Hopper. (2) RISC-V Vector quants: dispatch-by-VL refactor (`vl128` / `vl256` / `vl512` / `vl1024` separate kernels for Q2_K, Q3_K, Q4_K, Q6_K, IQ1_S, IQ1_M, IQ2_S, IQ2_XS, IQ3_S, IQ3_XXS, IQ4_XS, TQ1_0, TQ2_0). (3) WebAssembly SIMD path for Q4_1. (4) Metal residency-set keep-alive polling interval tightened to 5 ms (was 500 ms). (5) OpenCL Adreno: faster `concat`/`cpy`/`get_rows` packed kernels for narrow tensors (`<32` cols); Q6_K mat-vec rewritten with vec4 weight gather. (6) SYCL: multi-column MMVQ paths added for all quant types (ncols=2..8) used by speculative decoding's draft verification batches; `should_reorder_tensor` gate widened from `ne[1]==1` to `ne[1]<=8`. (7) Vulkan: NV cooperative-matrix2 feature detection now requires every `coopmat2_features.*` bit; FWHT shader gains shmem fallback (Intel Windows driver bug workaround). (8) WebGPU: flash-attention split into vector / tile / subgroup-matrix variants with K/V quantization-aware staging (`U32_DEQUANT_HELPERS`); GRANITE_SPEECH bumped to multi-projector. (9) KleidiAI: env vars `GGML_KLEIDIAI_CHUNK_MULTIPLIER` & `GGML_KLEIDIAI_SME` thread-cap auto-detect; SME + non-SME hybrid scheduling. All purely backend-internal; project compiles backends through FetchContent with no API surface change visible to `jllama.cpp`. No project source changes required |
 | ~b9495–b9543 | `conversion/__init__.py` + `conversion/granite.py` + `conversion/gemma.py` + `convert_lora_to_gguf.py` + `gguf-py/gguf/{constants,tensor_mapping,gguf_writer}.py` | Python-side: new `Granite4VisionMmprojModel` (vision-projector for Granite4 with QFormer-window deepstack + per-projector spatial offsets + image-grid pinpoints); Gemma4 unified vision/audio conversion fix-ups for newer HF checkpoints (`hidden_size` falls back to `audio_embed_dim`; `model_patch_size` falls back to `patch_size * pooling_kernel_size`). `convert_lora_to_gguf.py` gained `--trust-remote-code`. New `LLM_KV_DEEPSTACK_MAPPING` writer (`add_deepstack_mapping`) and new clip-vision keys (`KEY_PROJ_SAMPLE_QUERY_SIDE`, `KEY_PROJ_SAMPLE_WINDOW_SIDE`, `KEY_PROJ_SPATIAL_OFFSETS`, `KEY_FEATURE_LAYERS`, `KEY_IMAGE_GRID_PINPOINTS`) for the Granite4 vision projector. Python-side only; no impact on the Java/JNI build. No project source changes required |
 | ~b9495–b9543 | upstream build / verification | Local build pending: the b9495 &#x2192; b9543 bump is expected to compile cleanly given the audit above (zero `grep` matches in `src/main/cpp/` for any of the renamed or removed symbols: `hparams.n_layer`, `nextn_predict_layers`, `n_layer_nextn`, `n_layer_all`, `LLAMA_STATE_SEQ_FLAGS_ON_DEVICE`, `clip_image_u8`/`clip_image_f32` field access, `clip_build_img_from_pixels`, `clip_get_newline_tensor`, `clip_image_u8_get_data`, `clip_embd_nbytes`, `clip_embd_nbytes_by_img`, `clip_encode_float_image`, `clip_image_f32_batch_add_mel`, `mtmd_helper_bitmap_init_from_file`, `mtmd_helper_bitmap_init_from_buf`, `common_imatrix_load`). The only project-visible signature change &mdash; `process_mtmd_prompt()`'s new `bool is_placeholder` parameter &mdash; is defaulted, so existing call sites inside the project compile unchanged. All breaking changes in this range are absorbed inside upstream-compiled translation units; no project source edits required for the version bump itself |
+| ~b9543–b9549 | `include/llama.h` + `src/llama-context.{h,cpp}` + `src/llama-cparams.h` + `src/llama-ext.h` | New `llama_context_params::ctx_other` field (a source/target/parent `llama_context *`, default `nullptr`) used to share results or `llama_memory` between two contexts; mirrored by new `cparams.ctx_other` and the new staging API `llama_get_ctx_other()` (`llama-ext.h`). `llama_get_memory()` was moved earlier in `llama-context.cpp` and made null-safe (returns `nullptr` for a null ctx). `llama_context_default_params()` initializes `ctx_other = nullptr`. Project does not aggregate-init `llama_context_params` (it goes through `llama_context_default_params()` inside upstream `server-context.cpp`) and never includes `llama-ext.h` &mdash; verified: `grep -rn "llama_context_params\|ctx_other\|llama-ext.h\|llama_get_ctx_other\|llama_get_memory" src/main/cpp/` returns zero matches. No project source changes required |
+| ~b9543–b9549 | `src/llama-kv-cache.{h,cpp}` + `llama-kv-cache-iswa.{h,cpp}` + `llama-kv-cache-dsa.cpp` + `llama-memory.h` + `llama-memory-hybrid{,-iswa}.cpp` | KV-cache constructors gained two new parameters: `llama_memory_t mem_other` and `layer_share_cb share` (`std::function<int32_t(int32_t il)>` returning the source layer index to share cells from, or negative to skip). Enables one context's KV cache to share cells with another's (used by the new Gemma4-assistant MTP head). `llama_memory_params` gained a `mem_other` field. All call sites (iswa/dsa/hybrid wrappers, `llama_model::create_memory`) updated upstream; the project never constructs a `llama_kv_cache*` or `llama_memory_*` directly. No project source changes required |
+| ~b9543–b9549 | `src/llama-arch.{h,cpp}` + new `src/models/gemma4-assistant.cpp` + `src/models/models.h` + `src/llama-model.{h,cpp}` + `src/llama-hparams.{h,cpp}` + `src/llama-graph.{h,cpp}` + `gguf-py/` + `conversion/gemma.py` | **New model architecture `LLM_ARCH_GEMMA4_ASSISTANT` ("gemma4-assistant")** &mdash; a NextN/MTP draft "assistant" head that shares the target Gemma4's KV cache and reads its post-final-norm hidden state. New tensors `LLM_TENSOR_NEXTN_PROJ_PRE`/`NEXTN_PROJ_POST` (`nextn.pre_projection`/`post_projection`) plus model-level `nextn_proj_pre`/`nextn_proj_post`; new hparams `n_embd_inp_impl` (input-embedding dim override, honoured by `n_embd_inp()`) and graph field `n_layer_nextn`. Python conversion registers `Gemma4AssistantForCausalLM`/`Gemma4UnifiedAssistantForCausalLM`. This is the headline new feature; it is a speculative-decoding / **MTP** mechanism, which this project tracks as deferred-by-policy (see Open TODOs / `spec-draft-backend-sampling` + MTP). Consumed entirely inside upstream-compiled TUs &mdash; loading a non-assistant GGUF is unaffected. No project source changes required to build; exposing MTP through the Java API remains the existing deferred TODO |
+| ~b9543–b9549 | `common/chat.cpp` + new `models/templates/LFM2.5-8B-A1B.jinja` | LFM2 chat-template handling: prior-turn `reasoning_content` is now copied into the template's `thinking` field, and `<think>` reasoning extraction is gated on the template source actually containing `<think>` (and no longer on `enable_thinking`). New `LFM2.5-8B-A1B` template + parser test consolidation. Routing happens inside upstream-compiled `chat.cpp`; the project calls no `common_chat_params_init_lfm2*` symbol. Handled automatically when such a model is loaded; no project source or Java API changes required |
+| ~b9543–b9549 | `common/arg.cpp` + `common/speculative.cpp` + `src/llama-graph.cpp` | `common_params_handle_models()` mmproj auto-download now also requires `params.mmproj.path.empty() && params.mmproj.url.empty()` (an explicitly-specified mmproj is no longer re-downloaded). `speculative.cpp` MTP path adds a shared-memory fast path (`is_mem_shared = llama_get_ctx_other(ctx_dft) == ctx_tgt`) that skips the catch-up decode and reuses the target position for draft tokens (Gemma4 assistant), and switched to `llama_model_n_embd_out()` for the MTP row width. `llama-graph.cpp` moved the `set_input_kq_mask` / `can_reuse_kq_mask` calls out of the k-idxs-buffer guard (iswa/hybrid-iswa mask bugfix). All inside upstream-compiled TUs; no project source changes required |
+| ~b9543–b9549 | `tools/server/server-context.cpp` (project-linked) | The one project-linked server TU changed: now `#include`s `ggml-cpp.h` and `../../src/llama-ext.h`; sets `cparams.ctx_other = ctx_tgt` for MTP draft/MTP contexts; moved the `ctx_dft_seq_rm_type = common_context_can_seq_rm(...)` assignment to after context init (guarded by `if (ctx_dft)`); downgraded the spec memory-measure failure log from `SRV_ERR` to `SRV_WRN`; and gated the mtmd draft-processing block on `llama_get_ctx_other(ctx_dft) != ctx_tgt`. All changes are internal to the TU and the new includes resolve against the FetchContent'd `src/` and `ggml` headers. Compiles into `jllama` unchanged from the project's side. No project source changes required |
+| ~b9543–b9549 | `.github/workflows/docker.yml` (upstream CI) | Upstream's `cuda13` Docker image bumped from CUDA `13.1.1` to `13.3.0`. Upstream's own CI only; this project ships its own `publish.yml` and pins CUDA 13.2 via `.github/build_cuda_linux.sh` (see CLAUDE.md "Upgrading CUDA Version"). No impact |
+| ~b9543–b9549 | project `CMakeLists.txt` (pre-existing latent bug, fixed in this bump) | **Not an upstream change** &mdash; surfaced while build-testing this bump locally. The OS/arch detection block invoked `net.ladenthin.llama.OSInfo`, but the class had moved to `net.ladenthin.llama.loader.OSInfo` in the earlier layered-package restructure, so `cmake -B build` failed with "Could not determine OS name" on any host that does not pass `-DOS_NAME`/`-DOS_ARCH` explicitly (CI does, which is why it went unnoticed). Fixed both `execute_process` invocations (`--os` and `--arch`) to the `loader.OSInfo` FQN. Same stale-FQN-after-restructure class as the earlier `spotbugs-exclude.xml` / PIT-`targetClasses` repairs &mdash; the standing reminder to re-validate every FQN-bearing config after a package move now also covers `CMakeLists.txt` |
+| ~b9543–b9549 | upstream build / verification | Local build with `GIT_TAG b9549` verified clean on Linux x86_64: `cmake -B build -DBUILD_TESTING=ON` configures cleanly (after the `loader.OSInfo` FQN fix above), `cmake --build build --config Release -j$(nproc)` links `libjllama.so` + `jllama_test` with zero warnings on any project translation unit (incl. the changed `server-context.cpp`), and `ctest --test-dir build --output-on-failure` reports 435/435 tests passing. All upstream breaking changes in this range are absorbed inside upstream-compiled translation units; no project C++ source edits were required for the version bump itself |