You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Recompile-only bump: every upstream change in the b9543..b9549 range is
absorbed inside upstream-compiled translation units (libllama + libcommon +
server-context.cpp pulled via FetchContent). Reviewed the full patch against
the project's own C++ surface — grep across src/main/cpp/ finds zero
references to any changed/added symbol (llama_context_params::ctx_other,
llama_get_ctx_other, llama-ext.h, the kv-cache mem_other/share ctor params,
n_layer_nextn, n_embd_inp_impl). Highlights:
- include/llama.h: new llama_context_params::ctx_other (shared source/target
context); initialized in llama_context_default_params(), set by upstream
server-context.cpp for MTP draft contexts. Project never aggregate-inits it.
- New arch GEMMA4_ASSISTANT (NextN/MTP draft head sharing the target's KV
cache) + NEXTN_PROJ_PRE/POST tensors. A speculative-decoding/MTP feature —
remains deferred-by-policy; loads of non-assistant GGUFs are unaffected.
- KV-cache constructors gained mem_other + share(layer_share_cb) params for
cross-context cell sharing; all call sites updated upstream.
- common/chat.cpp LFM2 reasoning->thinking + new LFM2.5-8B-A1B.jinja template;
handled automatically inside upstream chat.cpp.
Also fixes a pre-existing latent bug surfaced by the local build (not an
upstream change): CMakeLists.txt invoked net.ladenthin.llama.OSInfo, but the
class moved to the loader subpackage in the layered restructure, so
`cmake -B build` failed to resolve OS_NAME/OS_ARCH on hosts that don't pass
them explicitly (CI does). Fixed both --os/--arch invocations to
loader.OSInfo — same stale-FQN-after-restructure class as the earlier
spotbugs-exclude / PIT-target repairs.
Verified on Linux x86_64: cmake configure clean, libjllama.so + jllama_test
link with zero project-TU warnings, ctest 435/435 passing. Appended the
b9543..b9549 range to docs/history/llama-cpp-breaking-changes.md.
Copy file name to clipboardExpand all lines: CLAUDE.md
+1-1Lines changed: 1 addition & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
6
6
7
7
Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
Copy file name to clipboardExpand all lines: docs/history/llama-cpp-breaking-changes.md
+9Lines changed: 9 additions & 0 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -312,3 +312,12 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
312
312
| ~b9495–b9543 | `ggml/src/ggml-cuda/mmvq.cu` + `ggml/src/ggml-cpu/arch/{riscv,wasm}/quants.c` + `ggml/src/ggml-metal/ggml-metal-device.m` + `ggml/src/ggml-opencl/*` + `ggml/src/ggml-sycl/*` + `ggml/src/ggml-vulkan/*` + `ggml/src/ggml-webgpu/*` + `ggml/src/ggml-cpu/kleidiai/kleidiai.cpp` | Per-backend numerical & performance work: (1) CUDA `mul_mat_vec_q_moe` switched to `GGML_CUDA_RESTRICT` aliasing + PDL launch params for Hopper. (2) RISC-V Vector quants: dispatch-by-VL refactor (`vl128` / `vl256` / `vl512` / `vl1024` separate kernels for Q2_K, Q3_K, Q4_K, Q6_K, IQ1_S, IQ1_M, IQ2_S, IQ2_XS, IQ3_S, IQ3_XXS, IQ4_XS, TQ1_0, TQ2_0). (3) WebAssembly SIMD path for Q4_1. (4) Metal residency-set keep-alive polling interval tightened to 5 ms (was 500 ms). (5) OpenCL Adreno: faster `concat`/`cpy`/`get_rows` packed kernels for narrow tensors (`<32` cols); Q6_K mat-vec rewritten with vec4 weight gather. (6) SYCL: multi-column MMVQ paths added for all quant types (ncols=2..8) used by speculative decoding's draft verification batches; `should_reorder_tensor` gate widened from `ne[1]==1` to `ne[1]<=8`. (7) Vulkan: NV cooperative-matrix2 feature detection now requires every `coopmat2_features.*` bit; FWHT shader gains shmem fallback (Intel Windows driver bug workaround). (8) WebGPU: flash-attention split into vector / tile / subgroup-matrix variants with K/V quantization-aware staging (`U32_DEQUANT_HELPERS`); GRANITE_SPEECH bumped to multi-projector. (9) KleidiAI: env vars `GGML_KLEIDIAI_CHUNK_MULTIPLIER` & `GGML_KLEIDIAI_SME` thread-cap auto-detect; SME + non-SME hybrid scheduling. All purely backend-internal; project compiles backends through FetchContent with no API surface change visible to `jllama.cpp`. No project source changes required |
313
313
|~b9495–b9543 |`conversion/__init__.py` + `conversion/granite.py` + `conversion/gemma.py` + `convert_lora_to_gguf.py` + `gguf-py/gguf/{constants,tensor_mapping,gguf_writer}.py`| Python-side: new `Granite4VisionMmprojModel` (vision-projector for Granite4 with QFormer-window deepstack + per-projector spatial offsets + image-grid pinpoints); Gemma4 unified vision/audio conversion fix-ups for newer HF checkpoints (`hidden_size` falls back to `audio_embed_dim`; `model_patch_size` falls back to `patch_size * pooling_kernel_size`). `convert_lora_to_gguf.py` gained `--trust-remote-code`. New `LLM_KV_DEEPSTACK_MAPPING` writer (`add_deepstack_mapping`) and new clip-vision keys (`KEY_PROJ_SAMPLE_QUERY_SIDE`, `KEY_PROJ_SAMPLE_WINDOW_SIDE`, `KEY_PROJ_SPATIAL_OFFSETS`, `KEY_FEATURE_LAYERS`, `KEY_IMAGE_GRID_PINPOINTS`) for the Granite4 vision projector. Python-side only; no impact on the Java/JNI build. No project source changes required |
314
314
| ~b9495–b9543 | upstream build / verification | Local build pending: the b9495 → b9543 bump is expected to compile cleanly given the audit above (zero `grep` matches in `src/main/cpp/` for any of the renamed or removed symbols: `hparams.n_layer`, `nextn_predict_layers`, `n_layer_nextn`, `n_layer_all`, `LLAMA_STATE_SEQ_FLAGS_ON_DEVICE`, `clip_image_u8`/`clip_image_f32` field access, `clip_build_img_from_pixels`, `clip_get_newline_tensor`, `clip_image_u8_get_data`, `clip_embd_nbytes`, `clip_embd_nbytes_by_img`, `clip_encode_float_image`, `clip_image_f32_batch_add_mel`, `mtmd_helper_bitmap_init_from_file`, `mtmd_helper_bitmap_init_from_buf`, `common_imatrix_load`). The only project-visible signature change — `process_mtmd_prompt()`'s new `bool is_placeholder` parameter — is defaulted, so existing call sites inside the project compile unchanged. All breaking changes in this range are absorbed inside upstream-compiled translation units; no project source edits required for the version bump itself |
315
+
|~b9543–b9549 |`include/llama.h` + `src/llama-context.{h,cpp}` + `src/llama-cparams.h` + `src/llama-ext.h`| New `llama_context_params::ctx_other` field (a source/target/parent `llama_context *`, default `nullptr`) used to share results or `llama_memory` between two contexts; mirrored by new `cparams.ctx_other` and the new staging API `llama_get_ctx_other()` (`llama-ext.h`). `llama_get_memory()` was moved earlier in `llama-context.cpp` and made null-safe (returns `nullptr` for a null ctx). `llama_context_default_params()` initializes `ctx_other = nullptr`. Project does not aggregate-init `llama_context_params` (it goes through `llama_context_default_params()` inside upstream `server-context.cpp`) and never includes `llama-ext.h`— verified via `grep -rn "llama_context_params\|ctx_other\|llama-ext.h\|llama_get_ctx_other\|llama_get_memory" src/main/cpp/` returns zero matches. No project source changes required |
316
+
|~b9543–b9549 |`src/llama-kv-cache.{h,cpp}` + `llama-kv-cache-iswa.{h,cpp}` + `llama-kv-cache-dsa.cpp` + `llama-memory.h` + `llama-memory-hybrid{,-iswa}.cpp`| KV-cache constructors gained two new parameters: `llama_memory_t mem_other` and `layer_share_cb share` (`std::function<int32_t(int32_t il)>` returning the source layer index to share cells from, or negative to skip). Enables one context's KV cache to share cells with another's (used by the new Gemma4-assistant MTP head). `llama_memory_params` gained a `mem_other` field. All call sites (iswa/dsa/hybrid wrappers, `llama_model::create_memory`) updated upstream; the project never constructs a `llama_kv_cache*` or `llama_memory_*` directly. No project source changes required |
317
+
| ~b9543–b9549 | `src/llama-arch.{h,cpp}` + new `src/models/gemma4-assistant.cpp` + `src/models/models.h` + `src/llama-model.{h,cpp}` + `src/llama-hparams.{h,cpp}` + `src/llama-graph.{h,cpp}` + `gguf-py/` + `conversion/gemma.py` | **New model architecture `LLM_ARCH_GEMMA4_ASSISTANT` ("gemma4-assistant")** — a NextN/MTP draft "assistant" head that shares the target Gemma4's KV cache and reads its post-final-norm hidden state. New tensors `LLM_TENSOR_NEXTN_PROJ_PRE`/`NEXTN_PROJ_POST` (`nextn.pre_projection`/`post_projection`) plus model-level `nextn_proj_pre`/`nextn_proj_post`; new hparams `n_embd_inp_impl` (input-embedding dim override, honoured by `n_embd_inp()`) and graph field `n_layer_nextn`. Python conversion registers `Gemma4AssistantForCausalLM`/`Gemma4UnifiedAssistantForCausalLM`. This is the headline new feature; it is a speculative-decoding / **MTP** mechanism, which this project tracks as deferred-by-policy (see Open TODOs / `spec-draft-backend-sampling` + MTP). Consumed entirely inside upstream-compiled TUs — loading a non-assistant GGUF is unaffected. No project source changes required to build; exposing MTP through the Java API remains the existing deferred TODO |
318
+
|~b9543–b9549 |`common/chat.cpp` + new `models/templates/LFM2.5-8B-A1B.jinja`| LFM2 chat-template handling: prior-turn `reasoning_content` is now copied into the template's `thinking` field, and `<think>` reasoning extraction is gated on the template source actually containing `<think>` (and no longer on `enable_thinking`). New `LFM2.5-8B-A1B` template + parser test consolidation. Routing happens inside upstream-compiled `chat.cpp`; the project calls no `common_chat_params_init_lfm2*` symbol. Handled automatically when such a model is loaded; no project source or Java API changes required |
319
+
|~b9543–b9549 |`common/arg.cpp` + `common/speculative.cpp` + `src/llama-graph.cpp`|`common_params_handle_models()` mmproj auto-download now also requires `params.mmproj.path.empty() && params.mmproj.url.empty()` (an explicitly-specified mmproj is no longer re-downloaded). `speculative.cpp` MTP path adds a shared-memory fast path (`is_mem_shared = llama_get_ctx_other(ctx_dft) == ctx_tgt`) that skips the catch-up decode and reuses the target position for draft tokens (Gemma4 assistant), and switched to `llama_model_n_embd_out()` for the MTP row width. `llama-graph.cpp` moved the `set_input_kq_mask` / `can_reuse_kq_mask` calls out of the k-idxs-buffer guard (iswa/hybrid-iswa mask bugfix). All inside upstream-compiled TUs; no project source changes required |
320
+
|~b9543–b9549 |`tools/server/server-context.cpp` (project-linked) | The one project-linked server TU changed: now `#include`s `ggml-cpp.h` and `../../src/llama-ext.h`; sets `cparams.ctx_other = ctx_tgt` for MTP draft/MTP contexts; moved the `ctx_dft_seq_rm_type = common_context_can_seq_rm(...)` assignment to after context init (guarded by `if (ctx_dft)`); downgraded the spec memory-measure failure log from `SRV_ERR` to `SRV_WRN`; and gated the mtmd draft-processing block on `llama_get_ctx_other(ctx_dft) != ctx_tgt`. All changes are internal to the TU and the new includes resolve against the FetchContent'd `src/` and `ggml` headers. Compiles into `jllama` unchanged from the project's side. No project source changes required |
321
+
|~b9543–b9549 |`.github/workflows/docker.yml` (upstream CI) | Upstream's `cuda13` Docker image bumped from CUDA `13.1.1` to `13.3.0`. Upstream's own CI only; this project ships its own `publish.yml` and pins CUDA 13.2 via `.github/build_cuda_linux.sh` (see CLAUDE.md "Upgrading CUDA Version"). No impact |
322
+
|~b9543–b9549 | project `CMakeLists.txt` (pre-existing latent bug, fixed in this bump) |**Not an upstream change**— surfaced while build-testing this bump locally. The OS/arch detection block invoked `net.ladenthin.llama.OSInfo`, but the class had moved to `net.ladenthin.llama.loader.OSInfo` in the earlier layered-package restructure, so `cmake -B build` failed with "Could not determine OS name" on any host that does not pass `-DOS_NAME`/`-DOS_ARCH` explicitly (CI does, which is why it went unnoticed). Fixed both `execute_process` invocations (`--os` and `--arch`) to the `loader.OSInfo` FQN. Same stale-FQN-after-restructure class as the earlier `spotbugs-exclude.xml` / PIT-`targetClasses` repairs — the standing reminder to re-validate every FQN-bearing config after a package move now also covers `CMakeLists.txt`|
323
+
|~b9543–b9549 | upstream build / verification | Local build with `GIT_TAG b9549` verified clean on Linux x86_64: `cmake -B build -DBUILD_TESTING=ON` configures cleanly (after the `loader.OSInfo` FQN fix above), `cmake --build build --config Release -j$(nproc)` links `libjllama.so` + `jllama_test` with zero warnings on any project translation unit (incl. the changed `server-context.cpp`), and `ctest --test-dir build --output-on-failure` reports 435/435 tests passing. All upstream breaking changes in this range are absorbed inside upstream-compiled translation units; no project C++ source edits were required for the version bump itself |
0 commit comments