Skip to content

Commit 03bcef1

Browse files
committed
Upgrade llama.cpp from b9490 to b9495
Verified clean: 435/435 C++ tests pass, libjllama.so links with zero warnings on any project translation unit. Main upstream change in range: mass terminology rename pre_norm -> nextn across the MTP / speculative-decoding API surface (llama_set_embeddings_pre_norm -> llama_set_embeddings_nextn etc.). Zero references to any pre_norm / nextn / embeddings_pre_norm* / t_h_pre_norm* symbol in src/main/cpp/*.{cpp,hpp} - all uses stay inside upstream-compiled translation units. Also additive: GGML_CUDA_RESTRICT macro for Hopper PDL kernels, new tokenizer.ggml.suppress_tokens GGUF key, new Gemma4 Unified vision+audio projector types, upstream UI gains Mermaid rendering. None affect the JNI bridge. No new Java API setters added - none of the renames or additions expose plumbing that would justify a Java surface change without a real user request.
1 parent ec3e3ab commit 03bcef1

4 files changed

Lines changed: 9 additions & 3 deletions

File tree

CLAUDE.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
66

77
Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
88

9-
Current llama.cpp pinned version: **b9490**
9+
Current llama.cpp pinned version: **b9495**
1010

1111
## Upgrading CUDA Version
1212

CMakeLists.txt

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -114,7 +114,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
114114
FetchContent_Declare(
115115
llama.cpp
116116
GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
117-
GIT_TAG b9490
117+
GIT_TAG b9495
118118
)
119119
FetchContent_MakeAvailable(llama.cpp)
120120

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@
1515
[![Lincheck](https://img.shields.io/badge/tested%20with-Lincheck-7F52FF)](https://github.com/JetBrains/lincheck)
1616
[![vmlens](https://img.shields.io/badge/tested%20with-vmlens-ff6f00)](https://vmlens.com)
1717
[![JMH](https://img.shields.io/badge/benchmarked%20with-JMH-25A162)](https://openjdk.org/projects/code-tools/jmh/)
18-
[![llama.cpp b9490](https://img.shields.io/badge/llama.cpp-%23b9490-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9490)
18+
[![llama.cpp b9495](https://img.shields.io/badge/llama.cpp-%23b9495-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9495)
1919
[![Publish](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml)
2020
[![CodeQL](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml)
2121

docs/history/llama-cpp-breaking-changes.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -297,3 +297,9 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
297297
| ~b9444–b9490 | `src/llama-arch.h` + `src/models/*.cpp` (new) | New model architectures: `LLM_ARCH_MELLUM` (JetBrains code-completion), `LLM_ARCH_EXAONE4_5` (LG AI multimodal), `LLM_ARCH_STEP3P7` (StepFun Step-3.7 with MTP support); `LLM_ARCH_QWEN3NEXT`/`LLM_ARCH_QWEN35`/`LLM_ARCH_QWEN35MOE` removed from `llama_model_saver_supports_arch()` allowlist. New tokenizer pre-types: `LLAMA_VOCAB_PRE_TYPE_GRANITE_EMB_MULTI = 54`, `LLAMA_VOCAB_PRE_TYPE_MELLUM2 = 55`. All additive at the architecture level — consumed automatically when loading a matching GGUF, no project source or Java API changes required |
298298
| ~b9444–b9490 | `common/arg.cpp` | New `--mtp` / `--no-mtp` flags (env `LLAMA_ARG_MTP`) now apply to Step-3.5 in addition to the existing Qwen3.5 coverage. Multi-Token Prediction is consumed inside upstream-compiled server TUs; project does not expose an MTP setter today (would map to `ModelParameters.setMtp(boolean)`). Tracked under Open TODOs if a user requests it |
299299
| ~b9444–b9490 | upstream build / verification | Local build with `GIT_TAG b9490` was verified clean: `cmake -B build` configures cleanly; `cmake --build build --config Release -j$(nproc)` links `libjllama.so` with zero warnings on `jllama.cpp` or any project translation unit. All breaking changes in this range are absorbed inside upstream-compiled translation units (`common.cpp`, `arg.cpp`, `llama.cpp`, `server-*.cpp`, `download.cpp`); no project source edits required for the version bump itself |
300+
| ~b9490–b9495 | `include/llama.h` + `src/llama-ext.h` + `src/llama-context.{h,cpp}` + `src/llama-cparams.h` + `src/llama-graph.{h,cpp}` + `common/speculative.{h,cpp}` + `src/models/{qwen35,qwen35moe,step35}.cpp` | Mass terminology rename: `pre_norm` → `nextn` everywhere the pre-final-norm hidden state is referenced. Affects the public API: `llama_set_embeddings_pre_norm()` → `llama_set_embeddings_nextn()`, `llama_get_embeddings_pre_norm()` → `llama_get_embeddings_nextn()`, `llama_get_embeddings_pre_norm_ith()` → `llama_get_embeddings_nextn_ith()`. Internal: `cparams.embeddings_pre_norm` → `cparams.embeddings_nextn`, `cparams.embeddings_pre_norm_masked` → `cparams.embeddings_nextn_masked`, `llm_graph_result::t_h_pre_norm` → `t_h_nextn`, `common_speculative_need_embd_pre_norm()` → `common_speculative_need_embd_nextn()`. Qwen3.5 / Qwen3.5-MoE / Step-3.5 model graphs moved the final norm before extracting `t_h_nextn` (was after extracting the pre-norm hidden state). Project does not call any of these MTP-specific APIs directly — all references stay inside upstream-compiled translation units (`speculative.cpp`, `llama-context.cpp`, `server-context.cpp`, model TUs). Verified by grep across `src/main/cpp/*.{cpp,hpp}`: zero matches for any `pre_norm` / `nextn` / `embeddings_pre_norm*` / `t_h_pre_norm*` symbol. No project source changes required |
301+
| ~b9490–b9495 | `ggml/src/ggml-cuda/common.cuh` + 10 CUDA kernel files | New `GGML_CUDA_RESTRICT` macro replaces `__restrict__` on kernel parameter pointers. PDL (Programmatic Dependent Launch) on Hopper requires `__restrict__` to be disabled per [llama.cpp PR #24030](https://github.com/ggml-org/llama.cpp/pull/24030); the macro expands to nothing under `GGML_CUDA_USE_PDL && __CUDA_ARCH__ >= GGML_CUDA_CC_HOPPER`, otherwise to `__restrict__`. Kernel signatures change from direct `T * __restrict__ x` parameters to `T * x_ptr` parameter + an internal `T * GGML_CUDA_RESTRICT x = x_ptr;` alias line; `GGML_UNUSED_VARS` calls in fallback branches updated to reference the `_ptr` names. Internal CUDA backend change; project does not compile any CUDA kernels in the JNI build (CUDA build uses upstream sources unchanged via FetchContent). No project source changes required |
302+
| ~b9490–b9495 | `src/llama-arch.{h,cpp}` + `src/llama-vocab.{h,cpp}` + `gguf-py/gguf/constants.py` + `gguf-py/gguf/gguf_writer.py` | New `LLM_KV_TOKENIZER_SUPPRESS_TOKENS` GGUF key (`tokenizer.ggml.suppress_tokens`). When a GGUF declares this array, the loader stores it on `llama_vocab::impl::suppress_tokens` and exposes it via new `llama_vocab::get_suppress_tokens()` accessor. The Gemma4 model graph (`src/models/gemma4.cpp`) reads this list and appends a `-INFINITY` logit bias to those token IDs at the end of the forward graph (new `llm_graph_input_logits_bias` class). Additive: existing models without the key produce an empty `suppress_tokens` vector and the bias-add branch is skipped. Mirrors a HuggingFace transformers `suppress_tokens` parameter; specifically used for Gemma4 Unified to prevent the model from emitting `<image|>` / `<audio|>` placeholder tokens. No project source or Java API changes required &mdash; the bias is applied automatically when a Gemma4U GGUF is loaded |
303+
| ~b9490–b9495 | `gguf-py/gguf/constants.py` + `gguf-py/gguf/tensor_mapping.py` + `tools/mtmd/clip-impl.h` + `tools/mtmd/clip-model.h` + `tools/mtmd/clip.cpp` + new `tools/mtmd/models/gemma4uv.cpp` + new `tools/mtmd/models/gemma4ua.cpp` + `tools/mtmd/mtmd-audio.{h,cpp}` + `tools/mtmd/mtmd.cpp` + `conversion/__init__.py` + `conversion/gemma.py` | New Gemma4 Unified vision + audio variant (`Gemma4UnifiedForConditionalGeneration`). Adds new projector types `PROJECTOR_TYPE_GEMMA4UV` and `PROJECTOR_TYPE_GEMMA4UA` (vision uses bigger patch size with token merging done on the conv layer; audio is encoder-free, raw 16 kHz waveform chunked into 640-sample frames). New `V_ENC_EMBD_PATCH_NORM` tensor enum (`v.patch_norm.{bid}`) and 3 indexed `patch_norm_{1,2,3}_{w,b}` weights on `clip_model` (Gemma4U uses standard PyTorch LayerNorm rather than RMSNorm before/after the patch embedding). New `mtmd_audio_preprocessor_gemma4ua` mel-major waveform packer (40 ms / 16 kHz frames; no FFT, no filterbank). Multimodal additions are routed through upstream `mtmd-cli` / `mtmd-debug` binaries that the project does not link; the JNI build links `libllama` + `libcommon` only. Additive at the GGUF / projector loader level: existing GGUFs without these projector types continue to load through the previous code paths. No project source or Java API changes required |
304+
| ~b9490–b9495 | `tools/ui/` (`package.json`, `src/lib/components/app/content/MarkdownContent/`, new `MermaidPreview.svelte`, new `DialogMermaidPreview.svelte`, new constants / icons / rehype plugins) | Upstream `llama-server` web UI gains Mermaid diagram rendering: new `mermaid@^11.15` dependency, lazy-loaded; new rehype plugin chain (`rehype-mermaid-pre`, `rehype-enhance-mermaid-blocks`) converts ` ```mermaid ` code fences to `<pre class="mermaid">` and wraps them with copy / preview action buttons; the existing single-file `MarkdownContent.svelte` is split into a `.svelte` + sibling `.css` / `markdown-utils.ts` / `markdown-handlers.ts` so the new mermaid renderer can share helpers. Project does not compile or ship the upstream `tools/ui` (server-only feature, classpath-only JNI build); no impact |
305+
| ~b9490–b9495 | upstream build / verification | Local build with `GIT_TAG b9495` was verified clean: `cmake -B build -DBUILD_TESTING=ON` configures cleanly, `cmake --build build --config Release -j$(nproc)` links `libjllama.so` + `jllama_test` with zero warnings on any project translation unit; `ctest --test-dir build --output-on-failure` reports 435/435 tests passing. All breaking changes in this range are renames within upstream-compiled translation units; no project source edits required for the version bump itself |

0 commit comments

Comments
 (0)