Upgrade llama.cpp from b9490 to b9495

claude · claude · commit 03bcef1f2935 · 2026-06-03T18:21:13.000Z
Verified clean: 435/435 C++ tests pass, libjllama.so links with zero
warnings on any project translation unit.

Main upstream change in range: mass terminology rename pre_norm -> nextn
across the MTP / speculative-decoding API surface
(llama_set_embeddings_pre_norm -> llama_set_embeddings_nextn etc.). Zero
references to any pre_norm / nextn / embeddings_pre_norm* / t_h_pre_norm*
symbol in src/main/cpp/*.{cpp,hpp} - all uses stay inside upstream-compiled
translation units.
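
The zero-reference check above could be reproduced with something like the
sketch below. It is illustrative only: the helper name count_mtp_refs is
hypothetical and not part of this repository, and the pattern assumes that
matching the substrings pre_norm and nextn is enough to catch every renamed
symbol listed above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the project): count lines in project
# C++ sources that mention the old or new MTP symbol names.
count_mtp_refs() {
    dir="$1"
    # Restrict the search to *.cpp / *.hpp files, as in the commit message.
    # 'pre_norm|nextn' also covers embeddings_pre_norm* and t_h_pre_norm*.
    find "$dir" -type f \( -name '*.cpp' -o -name '*.hpp' \) \
        -exec grep -nE 'pre_norm|nextn' {} + | wc -l | tr -d ' '
}

# Example invocation from the repository root (path taken from the message):
#   count_mtp_refs src/main/cpp
```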

Also additive: GGML_CUDA_RESTRICT macro for Hopper PDL kernels, new
tokenizer.ggml.suppress_tokens GGUF key, new Gemma4 Unified vision+audio
projector types, upstream UI gains Mermaid rendering. None affect the
JNI bridge.

No new Java API setters added - none of the renames or additions expose
plumbing that would justify a Java surface change without a real user
request.
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9490**
+Current llama.cpp pinned version: **b9495**
 
 ## Upgrading CUDA Version
 
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -114,7 +114,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9490
+	GIT_TAG        b9495
 )
 FetchContent_MakeAvailable(llama.cpp)
 
diff --git a/README.md b/README.md
@@ -15,7 +15,7 @@
 [![Lincheck](https://img.shields.io/badge/tested%20with-Lincheck-7F52FF)](https://github.com/JetBrains/lincheck)  
 [![vmlens](https://img.shields.io/badge/tested%20with-vmlens-ff6f00)](https://vmlens.com)  
 [![JMH](https://img.shields.io/badge/benchmarked%20with-JMH-25A162)](https://openjdk.org/projects/code-tools/jmh/)  
-[![llama.cpp b9490](https://img.shields.io/badge/llama.cpp-%23b9490-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9490)  
+[![llama.cpp b9495](https://img.shields.io/badge/llama.cpp-%23b9495-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9495)  
 [![Publish](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml)  
 [![CodeQL](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml)  
 
diff --git a/docs/history/llama-cpp-breaking-changes.md b/docs/history/llama-cpp-breaking-changes.md
@@ -297,3 +297,9 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
 | ~b9444–b9490 | `src/llama-arch.h` + `src/models/*.cpp` (new) | New model architectures: `LLM_ARCH_MELLUM` (JetBrains code-completion), `LLM_ARCH_EXAONE4_5` (LG AI multimodal), `LLM_ARCH_STEP3P7` (StepFun Step-3.7 with MTP support); `LLM_ARCH_QWEN3NEXT`/`LLM_ARCH_QWEN35`/`LLM_ARCH_QWEN35MOE` removed from `llama_model_saver_supports_arch()` allowlist. New tokenizer pre-types: `LLAMA_VOCAB_PRE_TYPE_GRANITE_EMB_MULTI = 54`, `LLAMA_VOCAB_PRE_TYPE_MELLUM2 = 55`. All additive at the architecture level &mdash; consumed automatically when loading a matching GGUF, no project source or Java API changes required |
 | ~b9444–b9490 | `common/arg.cpp` | New `--mtp` / `--no-mtp` flags (env `LLAMA_ARG_MTP`) now apply to Step-3.5 in addition to the existing Qwen3.5 coverage. Multi-Token Prediction is consumed inside upstream-compiled server TUs; project does not expose an MTP setter today (would map to `ModelParameters.setMtp(boolean)`). Tracked under Open TODOs if a user requests it |
 | ~b9444–b9490 | upstream build / verification | Local build with `GIT_TAG b9490` was verified clean: `cmake -B build` configures cleanly; `cmake --build build --config Release -j$(nproc)` links `libjllama.so` with zero warnings on `jllama.cpp` or any project translation unit. All breaking changes in this range are absorbed inside upstream-compiled translation units (`common.cpp`, `arg.cpp`, `llama.cpp`, `server-*.cpp`, `download.cpp`); no project source edits required for the version bump itself |
+| ~b9490–b9495 | `include/llama.h` + `src/llama-ext.h` + `src/llama-context.{h,cpp}` + `src/llama-cparams.h` + `src/llama-graph.{h,cpp}` + `common/speculative.{h,cpp}` + `src/models/{qwen35,qwen35moe,step35}.cpp` | Mass terminology rename: `pre_norm` &#x2192; `nextn` everywhere the pre-final-norm hidden state is referenced. Affects the public API: `llama_set_embeddings_pre_norm()` &#x2192; `llama_set_embeddings_nextn()`, `llama_get_embeddings_pre_norm()` &#x2192; `llama_get_embeddings_nextn()`, `llama_get_embeddings_pre_norm_ith()` &#x2192; `llama_get_embeddings_nextn_ith()`. Internal: `cparams.embeddings_pre_norm` &#x2192; `cparams.embeddings_nextn`, `cparams.embeddings_pre_norm_masked` &#x2192; `cparams.embeddings_nextn_masked`, `llm_graph_result::t_h_pre_norm` &#x2192; `t_h_nextn`, `common_speculative_need_embd_pre_norm()` &#x2192; `common_speculative_need_embd_nextn()`. Qwen3.5 / Qwen3.5-MoE / Step-3.5 model graphs moved the final norm before extracting `t_h_nextn` (was after extracting the pre-norm hidden state). Project does not call any of these MTP-specific APIs directly &mdash; all references stay inside upstream-compiled translation units (`speculative.cpp`, `llama-context.cpp`, `server-context.cpp`, model TUs). Verified by grep across `src/main/cpp/*.{cpp,hpp}`: zero matches for any `pre_norm` / `nextn` / `embeddings_pre_norm*` / `t_h_pre_norm*` symbol. No project source changes required |
+| ~b9490–b9495 | `ggml/src/ggml-cuda/common.cuh` + 10 CUDA kernel files | New `GGML_CUDA_RESTRICT` macro replaces `__restrict__` on kernel parameter pointers. PDL (Programmatic Dependent Launch) on Hopper requires `__restrict__` to be disabled per [llama.cpp PR #24030](https://github.com/ggml-org/llama.cpp/pull/24030); the macro expands to nothing under `GGML_CUDA_USE_PDL && __CUDA_ARCH__ >= GGML_CUDA_CC_HOPPER`, otherwise to `__restrict__`. Kernel signatures change from direct `T * __restrict__ x` parameters to `T * x_ptr` parameter + an internal `T * GGML_CUDA_RESTRICT x = x_ptr;` alias line; `GGML_UNUSED_VARS` calls in fallback branches updated to reference the `_ptr` names. Internal CUDA backend change; project does not compile any CUDA kernels in the JNI build (CUDA build uses upstream sources unchanged via FetchContent). No project source changes required |
+| ~b9490–b9495 | `src/llama-arch.{h,cpp}` + `src/llama-vocab.{h,cpp}` + `gguf-py/gguf/constants.py` + `gguf-py/gguf/gguf_writer.py` | New `LLM_KV_TOKENIZER_SUPPRESS_TOKENS` GGUF key (`tokenizer.ggml.suppress_tokens`). When a GGUF declares this array, the loader stores it on `llama_vocab::impl::suppress_tokens` and exposes it via new `llama_vocab::get_suppress_tokens()` accessor. The Gemma4 model graph (`src/models/gemma4.cpp`) reads this list and appends a `-INFINITY` logit bias to those token IDs at the end of the forward graph (new `llm_graph_input_logits_bias` class). Additive: existing models without the key produce an empty `suppress_tokens` vector and the bias-add branch is skipped. Mirrors a HuggingFace transformers `suppress_tokens` parameter; specifically used for Gemma4 Unified to prevent the model from emitting `<image|>` / `<audio|>` placeholder tokens. No project source or Java API changes required &mdash; the bias is applied automatically when a Gemma4U GGUF is loaded |
+| ~b9490–b9495 | `gguf-py/gguf/constants.py` + `gguf-py/gguf/tensor_mapping.py` + `tools/mtmd/clip-impl.h` + `tools/mtmd/clip-model.h` + `tools/mtmd/clip.cpp` + new `tools/mtmd/models/gemma4uv.cpp` + new `tools/mtmd/models/gemma4ua.cpp` + `tools/mtmd/mtmd-audio.{h,cpp}` + `tools/mtmd/mtmd.cpp` + `conversion/__init__.py` + `conversion/gemma.py` | New Gemma4 Unified vision + audio variant (`Gemma4UnifiedForConditionalGeneration`). Adds new projector types `PROJECTOR_TYPE_GEMMA4UV` and `PROJECTOR_TYPE_GEMMA4UA` (vision uses bigger patch size with token merging done on the conv layer; audio is encoder-free, raw 16 kHz waveform chunked into 640-sample frames). New `V_ENC_EMBD_PATCH_NORM` tensor enum (`v.patch_norm.{bid}`) and 3 indexed `patch_norm_{1,2,3}_{w,b}` weights on `clip_model` (Gemma4U uses standard PyTorch LayerNorm rather than RMSNorm before/after the patch embedding). New `mtmd_audio_preprocessor_gemma4ua` mel-major waveform packer (40 ms / 16 kHz frames; no FFT, no filterbank). Multimodal additions are routed through upstream `mtmd-cli` / `mtmd-debug` binaries that the project does not link; the JNI build links `libllama` + `libcommon` only. Additive at the GGUF / projector loader level: existing GGUFs without these projector types continue to load through the previous code paths. No project source or Java API changes required |
+| ~b9490–b9495 | `tools/ui/` (`package.json`, `src/lib/components/app/content/MarkdownContent/`, new `MermaidPreview.svelte`, new `DialogMermaidPreview.svelte`, new constants / icons / rehype plugins) | Upstream `llama-server` web UI gains Mermaid diagram rendering: new `mermaid@^11.15` dependency, lazy-loaded; new rehype plugin chain (`rehype-mermaid-pre`, `rehype-enhance-mermaid-blocks`) converts ` ```mermaid ` code fences to `<pre class="mermaid">` and wraps them with copy / preview action buttons; the existing single-file `MarkdownContent.svelte` is split into a `.svelte` + sibling `.css` / `markdown-utils.ts` / `markdown-handlers.ts` so the new mermaid renderer can share helpers. Project does not compile or ship the upstream `tools/ui` (server-only feature, classpath-only JNI build); no impact |
+| ~b9490–b9495 | upstream build / verification | Local build with `GIT_TAG b9495` was verified clean: `cmake -B build -DBUILD_TESTING=ON` configures cleanly, `cmake --build build --config Release -j$(nproc)` links `libjllama.so` + `jllama_test` with zero warnings on any project translation unit; `ctest --test-dir build --output-on-failure` reports 435/435 tests passing. All breaking changes in this range are renames within upstream-compiled translation units; no project source edits required for the version bump itself |
