Verified clean: 435/435 C++ tests pass, libjllama.so links with zero
warnings on any project translation unit.
Main upstream change in range: mass terminology rename pre_norm -> nextn
across the MTP / speculative-decoding API surface
(llama_set_embeddings_pre_norm -> llama_set_embeddings_nextn etc.). Zero
references to any pre_norm / nextn / embeddings_pre_norm* / t_h_pre_norm*
symbol in src/main/cpp/*.{cpp,hpp} - all uses stay inside upstream-compiled
translation units.
Also additive: GGML_CUDA_RESTRICT macro for Hopper PDL kernels, new
tokenizer.ggml.suppress_tokens GGUF key, new Gemma4 Unified vision+audio
projector types, upstream UI gains Mermaid rendering. None affect the
JNI bridge.
No new Java API setters added - none of the renames or additions expose
plumbing that would justify a Java surface change without a real user
request.
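The GGML_CUDA_RESTRICT change noted above can be pictured with a small host-side sketch. This is not the upstream definition: SKETCH_USE_PDL, SKETCH_CC_HOPPER, SKETCH_RESTRICT and scale_f32 are hypothetical stand-ins, the value 900 for the Hopper compute capability is an assumption, and the `__restrict__` spelling assumes GCC, Clang or NVCC. It only shows the general shape: a macro that drops the qualifier under a PDL-on-Hopper condition, and a function that takes plain `_ptr` parameters and re-aliases them internally.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the upstream names; the value 900 and the
// exact upstream condition are assumptions made for this sketch.
#define SKETCH_CC_HOPPER 900

#if defined(SKETCH_USE_PDL) && defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= SKETCH_CC_HOPPER)
#    define SKETCH_RESTRICT               // PDL on Hopper: qualifier dropped
#else
#    define SKETCH_RESTRICT __restrict__  // GCC / Clang / NVCC spelling
#endif

// Old shape: scale_f32(const float * __restrict__ x, float * __restrict__ dst, ...)
// New shape: plain "_ptr" parameters, with aliases that carry the macro.
inline void scale_f32(const float * x_ptr, float * dst_ptr, float s, std::size_t n) {
    assert(x_ptr != nullptr && dst_ptr != nullptr);
    const float * SKETCH_RESTRICT x   = x_ptr;   // alias used by the body
    float *       SKETCH_RESTRICT dst = dst_ptr; // alias used by the body
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = x[i] * s;
    }
}
```

On a host compiler the macro expands to `__restrict__`, so the function behaves like the old signature. The alias lines are what let the qualifier be switched off in one place without touching the function body.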
CLAUDE.md: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
docs/history/llama-cpp-breaking-changes.md: 6 additions & 0 deletions
@@ -297,3 +297,9 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
|~b9444–b9490 |`src/llama-arch.h` + `src/models/*.cpp` (new) | New model architectures: `LLM_ARCH_MELLUM` (JetBrains code-completion), `LLM_ARCH_EXAONE4_5` (LG AI multimodal), `LLM_ARCH_STEP3P7` (StepFun Step-3.7 with MTP support); `LLM_ARCH_QWEN3NEXT`/`LLM_ARCH_QWEN35`/`LLM_ARCH_QWEN35MOE` removed from `llama_model_saver_supports_arch()` allowlist. New tokenizer pre-types: `LLAMA_VOCAB_PRE_TYPE_GRANITE_EMB_MULTI = 54`, `LLAMA_VOCAB_PRE_TYPE_MELLUM2 = 55`. All additive at the architecture level — consumed automatically when loading a matching GGUF, no project source or Java API changes required |
|~b9444–b9490 |`common/arg.cpp`| New `--mtp` / `--no-mtp` flags (env `LLAMA_ARG_MTP`) now apply to Step-3.5 in addition to the existing Qwen3.5 coverage. Multi-Token Prediction is consumed inside upstream-compiled server TUs; project does not expose an MTP setter today (would map to `ModelParameters.setMtp(boolean)`). Tracked under Open TODOs if a user requests it |
|~b9444–b9490 | upstream build / verification | Local build with `GIT_TAG b9490` was verified clean: `cmake -B build` configures cleanly; `cmake --build build --config Release -j$(nproc)` links `libjllama.so` with zero warnings on `jllama.cpp` or any project translation unit. All breaking changes in this range are absorbed inside upstream-compiled translation units (`common.cpp`, `arg.cpp`, `llama.cpp`, `server-*.cpp`, `download.cpp`); no project source edits required for the version bump itself |
+
| ~b9490–b9495 | `include/llama.h` + `src/llama-ext.h` + `src/llama-context.{h,cpp}` + `src/llama-cparams.h` + `src/llama-graph.{h,cpp}` + `common/speculative.{h,cpp}` + `src/models/{qwen35,qwen35moe,step35}.cpp` | Mass terminology rename: `pre_norm` → `nextn` everywhere the pre-final-norm hidden state is referenced. Affects the public API: `llama_set_embeddings_pre_norm()` → `llama_set_embeddings_nextn()`, `llama_get_embeddings_pre_norm()` → `llama_get_embeddings_nextn()`, `llama_get_embeddings_pre_norm_ith()` → `llama_get_embeddings_nextn_ith()`. Internal: `cparams.embeddings_pre_norm` → `cparams.embeddings_nextn`, `cparams.embeddings_pre_norm_masked` → `cparams.embeddings_nextn_masked`, `llm_graph_result::t_h_pre_norm` → `t_h_nextn`, `common_speculative_need_embd_pre_norm()` → `common_speculative_need_embd_nextn()`. Qwen3.5 / Qwen3.5-MoE / Step-3.5 model graphs now apply the final norm before extracting `t_h_nextn` (previously the hidden state was extracted before the final norm). Project does not call any of these MTP-specific APIs directly; all references stay inside upstream-compiled translation units (`speculative.cpp`, `llama-context.cpp`, `server-context.cpp`, model TUs). Verified by grep across `src/main/cpp/*.{cpp,hpp}`: zero matches for any `pre_norm` / `nextn` / `embeddings_pre_norm*` / `t_h_pre_norm*` symbol. No project source changes required |
+
|~b9490–b9495 |`ggml/src/ggml-cuda/common.cuh` + 10 CUDA kernel files | New `GGML_CUDA_RESTRICT` macro replaces `__restrict__` on kernel parameter pointers. PDL (Programmatic Dependent Launch) on Hopper requires `__restrict__` to be disabled per [llama.cpp PR #24030](https://github.com/ggml-org/llama.cpp/pull/24030); the macro expands to nothing under `GGML_CUDA_USE_PDL && __CUDA_ARCH__ >= GGML_CUDA_CC_HOPPER`, otherwise to `__restrict__`. Kernel signatures change from direct `T * __restrict__ x` parameters to `T * x_ptr` parameter + an internal `T * GGML_CUDA_RESTRICT x = x_ptr;` alias line; `GGML_UNUSED_VARS` calls in fallback branches updated to reference the `_ptr` names. Internal CUDA backend change; project does not compile any CUDA kernels in the JNI build (CUDA build uses upstream sources unchanged via FetchContent). No project source changes required |
+
|~b9490–b9495 |`src/llama-arch.{h,cpp}` + `src/llama-vocab.{h,cpp}` + `gguf-py/gguf/constants.py` + `gguf-py/gguf/gguf_writer.py`| New `LLM_KV_TOKENIZER_SUPPRESS_TOKENS` GGUF key (`tokenizer.ggml.suppress_tokens`). When a GGUF declares this array, the loader stores it on `llama_vocab::impl::suppress_tokens` and exposes it via the new `llama_vocab::get_suppress_tokens()` accessor. The Gemma4 model graph (`src/models/gemma4.cpp`) reads this list and adds a `-INFINITY` logit bias to those token IDs at the end of the forward graph (new `llm_graph_input_logits_bias` class). Additive: existing models without the key produce an empty `suppress_tokens` vector and the bias-add branch is skipped. Mirrors the Hugging Face transformers `suppress_tokens` parameter; specifically used for Gemma4 Unified to prevent the model from emitting `<image|>` / `<audio|>` placeholder tokens. No project source or Java API changes required; the bias is applied automatically when a Gemma4U GGUF is loaded |
+
| ~b9490–b9495 | `gguf-py/gguf/constants.py` + `gguf-py/gguf/tensor_mapping.py` + `tools/mtmd/clip-impl.h` + `tools/mtmd/clip-model.h` + `tools/mtmd/clip.cpp` + new `tools/mtmd/models/gemma4uv.cpp` + new `tools/mtmd/models/gemma4ua.cpp` + `tools/mtmd/mtmd-audio.{h,cpp}` + `tools/mtmd/mtmd.cpp` + `conversion/__init__.py` + `conversion/gemma.py` | New Gemma4 Unified vision + audio variant (`Gemma4UnifiedForConditionalGeneration`). Adds new projector types `PROJECTOR_TYPE_GEMMA4UV` and `PROJECTOR_TYPE_GEMMA4UA` (vision uses bigger patch size with token merging done on the conv layer; audio is encoder-free, raw 16 kHz waveform chunked into 640-sample frames). New `V_ENC_EMBD_PATCH_NORM` tensor enum (`v.patch_norm.{bid}`) and 3 indexed `patch_norm_{1,2,3}_{w,b}` weights on `clip_model` (Gemma4U uses standard PyTorch LayerNorm rather than RMSNorm before/after the patch embedding). New `mtmd_audio_preprocessor_gemma4ua` mel-major waveform packer (40 ms / 16 kHz frames; no FFT, no filterbank). Multimodal additions are routed through upstream `mtmd-cli` / `mtmd-debug` binaries that the project does not link; the JNI build links `libllama` + `libcommon` only. Additive at the GGUF / projector loader level: existing GGUFs without these projector types continue to load through the previous code paths. No project source or Java API changes required |
+
|~b9490–b9495 |`tools/ui/` (`package.json`, `src/lib/components/app/content/MarkdownContent/`, new `MermaidPreview.svelte`, new `DialogMermaidPreview.svelte`, new constants / icons / rehype plugins) | Upstream `llama-server` web UI gains Mermaid diagram rendering: new `mermaid@^11.15` dependency, lazy-loaded; new rehype plugin chain (`rehype-mermaid-pre`, `rehype-enhance-mermaid-blocks`) converts ` ```mermaid ` code fences to `<pre class="mermaid">` and wraps them with copy / preview action buttons; the existing single-file `MarkdownContent.svelte` is split into a `.svelte` + sibling `.css` / `markdown-utils.ts` / `markdown-handlers.ts` so the new mermaid renderer can share helpers. Project does not compile or ship the upstream `tools/ui` (server-only feature, classpath-only JNI build); no impact |
+
|~b9490–b9495 | upstream build / verification | Local build with `GIT_TAG b9495` was verified clean: `cmake -B build -DBUILD_TESTING=ON` configures cleanly, `cmake --build build --config Release -j$(nproc)` links `libjllama.so` + `jllama_test` with zero warnings on any project translation unit; `ctest --test-dir build --output-on-failure` reports 435/435 tests passing. All breaking changes in this range are renames within upstream-compiled translation units; no project source edits required for the version bump itself |
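
The `tokenizer.ggml.suppress_tokens` row above describes a `-INFINITY` logit bias added at the listed token IDs. A minimal standalone sketch of that idea follows. It is not the upstream graph code: `build_logit_bias` and `apply_logit_bias` are hypothetical helpers that work on plain `std::vector<float>` logits rather than ggml tensors, and the out-of-range check is an assumption added for safety.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: build a per-token bias vector that is 0 everywhere
// except at suppressed token IDs, where it is -infinity.
inline std::vector<float> build_logit_bias(std::size_t n_vocab,
                                           const std::vector<int32_t> & suppress_tokens) {
    std::vector<float> bias(n_vocab, 0.0f);
    for (int32_t id : suppress_tokens) {
        // Ignore IDs outside the vocabulary (assumption made for this sketch).
        if (id >= 0 && static_cast<std::size_t>(id) < n_vocab) {
            bias[static_cast<std::size_t>(id)] = -std::numeric_limits<float>::infinity();
        }
    }
    return bias;
}

// Hypothetical helper: add the bias to the logits in place. A suppressed
// token ends up at -infinity, so it can never be the highest-scoring token.
inline void apply_logit_bias(std::vector<float> & logits, const std::vector<float> & bias) {
    assert(logits.size() == bias.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        logits[i] += bias[i];
    }
}
```

With an empty `suppress_tokens` list the bias vector is all zeros and the logits are unchanged, which matches the additive behaviour the row describes for models without the key.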