CLAUDE.md
+1 -1 (1 addition, 1 deletion)
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
docs/history/llama-cpp-breaking-changes.md
+9 (9 additions, 0 deletions)
@@ -303,3 +303,12 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
| ~b9490–b9495 | `gguf-py/gguf/constants.py` + `gguf-py/gguf/tensor_mapping.py` + `tools/mtmd/clip-impl.h` + `tools/mtmd/clip-model.h` + `tools/mtmd/clip.cpp` + new `tools/mtmd/models/gemma4uv.cpp` + new `tools/mtmd/models/gemma4ua.cpp` + `tools/mtmd/mtmd-audio.{h,cpp}` + `tools/mtmd/mtmd.cpp` + `conversion/__init__.py` + `conversion/gemma.py` | New Gemma4 Unified vision + audio variant (`Gemma4UnifiedForConditionalGeneration`). Adds new projector types `PROJECTOR_TYPE_GEMMA4UV` and `PROJECTOR_TYPE_GEMMA4UA` (vision uses bigger patch size with token merging done on the conv layer; audio is encoder-free, raw 16 kHz waveform chunked into 640-sample frames). New `V_ENC_EMBD_PATCH_NORM` tensor enum (`v.patch_norm.{bid}`) and 3 indexed `patch_norm_{1,2,3}_{w,b}` weights on `clip_model` (Gemma4U uses standard PyTorch LayerNorm rather than RMSNorm before/after the patch embedding). New `mtmd_audio_preprocessor_gemma4ua` mel-major waveform packer (40 ms / 16 kHz frames; no FFT, no filterbank). Multimodal additions are routed through upstream `mtmd-cli` / `mtmd-debug` binaries that the project does not link; the JNI build links `libllama` + `libcommon` only. Additive at the GGUF / projector loader level: existing GGUFs without these projector types continue to load through the previous code paths. No project source or Java API changes required |
| ~b9490–b9495 | `tools/ui/` (`package.json`, `src/lib/components/app/content/MarkdownContent/`, new `MermaidPreview.svelte`, new `DialogMermaidPreview.svelte`, new constants / icons / rehype plugins) | Upstream `llama-server` web UI gains Mermaid diagram rendering: new `mermaid@^11.15` dependency, lazy-loaded; new rehype plugin chain (`rehype-mermaid-pre`, `rehype-enhance-mermaid-blocks`) converts ` ```mermaid ` code fences to `<pre class="mermaid">` and wraps them with copy / preview action buttons; the existing single-file `MarkdownContent.svelte` is split into a `.svelte` + sibling `.css` / `markdown-utils.ts` / `markdown-handlers.ts` so the new mermaid renderer can share helpers. Project does not compile or ship the upstream `tools/ui` (server-only feature, classpath-only JNI build); no impact |
| ~b9490–b9495 | upstream build / verification | Local build with `GIT_TAG b9495` was verified clean: `cmake -B build -DBUILD_TESTING=ON` configures cleanly, `cmake --build build --config Release -j$(nproc)` links `libjllama.so` + `jllama_test` with zero warnings on any project translation unit; `ctest --test-dir build --output-on-failure` reports 435/435 tests passing. All breaking changes in this range are renames within upstream-compiled translation units; no project source edits required for the version bump itself |
+
| ~b9495–b9543 | `src/llama-hparams.{h,cpp}` + every `src/models/*.cpp` (~150 files) | Field `hparams::n_layer` (uint32_t) was split: the raw count moved to `hparams::n_layer_all` and `hparams::n_layer()` is now a member **function** that returns `n_layer_all - n_layer_nextn` (the effective non-MTP layer count). Sibling rename: `hparams::nextn_predict_layers` → `hparams::n_layer_nextn`. Every per-model TU in `src/models/*.cpp` was updated to call `hparams.n_layer()` and `hparams.n_layer_nextn`. New `hparams::set_recr_pattern()`, mirroring `set_swa_pattern()`, for hybrid recurrent architectures. New per-layer `hparams::deepstack_mapping_arr` (LLAMA_MAX_LAYERS, default -1) populated from new GGUF key `LLM_KV_DEEPSTACK_MAPPING` for Granite4-Vision-style per-layer deepstack injection. `hparams::kv_only_nextn` was removed (MTP heads now use a layer filter callback instead). Project does not reference any of these hparams symbols directly: `grep -rn "hparams\.n_layer\|nextn_predict_layers\|n_layer_nextn\|n_layer_all\|deepstack_mapping" src/main/cpp/ src/test/cpp/` returns zero matches. All consumers are inside upstream-compiled TUs (`llama-model.cpp`, `llama-context.cpp`, model TUs); no project source changes required |
+
| ~b9495–b9543 | `include/llama.h` (state-seq flags) + `tools/server/server-context.cpp` + `examples/speculative-simple/speculative-simple.cpp` | The `LLAMA_STATE_SEQ_FLAGS_ON_DEVICE` flag was removed from the `llama_state_seq_flags` enum. All upstream call sites that passed `LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY \| LLAMA_STATE_SEQ_FLAGS_ON_DEVICE` were updated to pass only `LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY`; the on-device path is now the default for partial saves/loads. Project does not call `llama_state_seq_get_*` / `llama_state_seq_set_*` directly from `jllama.cpp`; the only consumer in the JNI build is upstream `server-context.cpp` (speculative checkpoint helpers), which was updated upstream. Verified: `grep -rn "LLAMA_STATE_SEQ_FLAGS_ON_DEVICE" src/` returns zero matches. No project source changes required |
+
| ~b9495–b9543 | new `common/imatrix-loader.{h,cpp}` + refactor of `tools/imatrix/imatrix.cpp` + `tools/quantize/quantize.cpp` | Extracted shared imatrix-loading logic into a standalone library: new `common_imatrix` struct (`entries`, `datasets`, `chunk_count`, `chunk_size`, `is_legacy`, `has_metadata`) and `common_imatrix_load(const std::string &, common_imatrix &)` reader. New GGUF metadata keys exposed as `LLM_KV_IMATRIX_DATASETS`, `LLM_KV_IMATRIX_CHUNK_COUNT`, `LLM_KV_IMATRIX_CHUNK_SIZE`. The imatrix and quantize CLIs were rewritten to consume this shared loader (the legacy in-file binary parser also moved into the shared loader). Build system: `common/CMakeLists.txt` now includes `imatrix-loader.cpp` and `imatrix-loader.h` in `libcommon`, which means the JNI build picks up the new TU automatically via FetchContent + the existing `target_link_libraries(jllama PRIVATE common)` line. Project does not use imatrix loading from Java today (no `LlamaImatrix` class); the new symbols ship as additive surface area only. No project source changes required |
+
| ~b9495–b9543 | `tools/mtmd/clip.{h,cpp}` + `tools/mtmd/clip-impl.h` + `tools/mtmd/clip-model.h` + `tools/mtmd/mtmd.{h,cpp}` + `tools/mtmd/mtmd-helper.{h,cpp}` + `tools/mtmd/mtmd-image.cpp` + every `tools/mtmd/models/*.cpp` | Large MTMD subsystem refactor: (1) `clip_image_u8` and `clip_image_f32` switched from public POD-style `nx` / `ny` / `buf` fields to private members with `get_size()` / `set_size()` / `get_ro_buf()` / `cpy_buf()` / `get_pixel()` / `set_pixel()` / `is_placeholder()` getters/setters; every model TU and image helper was updated to the new API. (2) Several public helpers were removed from `tools/mtmd/clip.h`: `clip_embd_nbytes`, `clip_embd_nbytes_by_img`, `clip_image_u8_get_data`, `clip_build_img_from_pixels`, `clip_get_newline_tensor`, `clip_encode_float_image`, `clip_image_f32_batch_add_mel`. (3) `mtmd_helper_bitmap_init_from_file()` and `mtmd_helper_bitmap_init_from_buf()` gained a required `bool placeholder` parameter (when true the bitmap reserves shape only, no pixel decode — used for token counting). (4) `mtmd_bitmap` is now a true class (private buffer + `is_placeholder()` / `can_batch_with()`); `mtmd_bitmap_init()` and `mtmd_bitmap_init_from_audio()` accept `nullptr` data to create placeholder bitmaps. (5) New Granite4 Vision projector type `PROJECTOR_TYPE_GRANITE4_VISION` and tensor enums (`V_MULTI_PROJ_*`, `V_QF_*`) for QFormer-with-window projection. (6) Qwen-VL video / temporal-merge support: `clip_graph_qwen2vl::build_inp_with_temporal_merge()` plus `n_batch_max=2` for batch-merged consecutive image frames. Project does not link any `tools/mtmd/*` TUs into the JNI build (`libllama` + `libcommon` only); the JNI vision API surfaces through `mtmd-helper.h` and was reviewed: zero `clip_image_*` / removed-helper references found across `src/main/cpp/` and `src/test/cpp/`. No project source changes required |
+
| ~b9495–b9543 | `tools/server/server-context.cpp` + `tools/server/server-http.cpp` + `tools/server/server.cpp` (new `/v1/responses/input_tokens` + `/v1/chat/completions/input_tokens` + `/v1/messages/count_tokens`) | New token-counting endpoints (Anthropic-compatible + OpenAI Responses-API-compatible). Implementation: `server_routes::handle_count_tokens()` consolidates the body parsing path (chat completions, responses, anthropic messages) and emits `{"input_tokens": N, "object": "response.input_tokens"}`. `process_mtmd_prompt()` signature gained a `bool is_placeholder = false` parameter so token-counting can reuse the multimodal tokenization path without decoding image/audio pixels. Server-only HTTP endpoints (the JNI build links neither `tools/server/server.cpp` nor `server-http.cpp`); the only server TU we link is `server-context.cpp`, where the only project-visible change is the new optional `process_mtmd_prompt` parameter, which is defaulted — existing project call sites compile unchanged. No project source changes required |
+
| ~b9495–b9543 | `common/chat-peg-parser.{h,cpp}` + `common/chat.cpp` (LFM2/2.5 unified) | LFM2.5's chat-completion parser was merged into the single `common_chat_params_init_lfm2()` (was a separate `_lfm2_5` function); a `bool tool_list_tokens` flag toggles between the two template flavours. New helper `common_chat_peg_builder::python_or_json_value()` and a new `bool allow_json_literals` parameter on `python_style_tool_calls()` so LFM2.5 can accept JSON-cased `true` / `false` / `null` alongside the Python-cased literals. Pure-Python literal normalisation in `chat-peg-parser.cpp` (`True` / `False` / `None` → JSON during streaming). Project does not call any `common_chat_peg_*` or `common_chat_params_init_lfm2*` symbols; routing happens inside upstream-compiled `chat.cpp`. No project source changes required |
+
| ~b9495–b9543 | `ggml/src/ggml-cuda/mmvq.cu` + `ggml/src/ggml-cpu/arch/{riscv,wasm}/quants.c` + `ggml/src/ggml-metal/ggml-metal-device.m` + `ggml/src/ggml-opencl/*` + `ggml/src/ggml-sycl/*` + `ggml/src/ggml-vulkan/*` + `ggml/src/ggml-webgpu/*` + `ggml/src/ggml-cpu/kleidiai/kleidiai.cpp` | Per-backend numerical & performance work: (1) CUDA `mul_mat_vec_q_moe` switched to `GGML_CUDA_RESTRICT` aliasing + PDL launch params for Hopper. (2) RISC-V Vector quants: dispatch-by-VL refactor (`vl128` / `vl256` / `vl512` / `vl1024` separate kernels for Q2_K, Q3_K, Q4_K, Q6_K, IQ1_S, IQ1_M, IQ2_S, IQ2_XS, IQ3_S, IQ3_XXS, IQ4_XS, TQ1_0, TQ2_0). (3) WebAssembly SIMD path for Q4_1. (4) Metal residency-set keep-alive polling interval tightened to 5 ms (was 500 ms). (5) OpenCL Adreno: faster `concat`/`cpy`/`get_rows` packed kernels for narrow tensors (`<32` cols); Q6_K mat-vec rewritten with vec4 weight gather. (6) SYCL: multi-column MMVQ paths added for all quant types (ncols=2..8) used by speculative decoding's draft verification batches; `should_reorder_tensor` gate widened from `ne[1]==1` to `ne[1]<=8`. (7) Vulkan: NV cooperative-matrix2 feature detection now requires every `coopmat2_features.*` bit; FWHT shader gains shmem fallback (Intel Windows driver bug workaround). (8) WebGPU: flash-attention split into vector / tile / subgroup-matrix variants with K/V quantization-aware staging (`U32_DEQUANT_HELPERS`); GRANITE_SPEECH bumped to multi-projector. (9) KleidiAI: env vars `GGML_KLEIDIAI_CHUNK_MULTIPLIER` & `GGML_KLEIDIAI_SME` thread-cap auto-detect; SME + non-SME hybrid scheduling. All purely backend-internal; project compiles backends through FetchContent with no API surface change visible to `jllama.cpp`. No project source changes required |
+
| ~b9495–b9543 | `conversion/__init__.py` + `conversion/granite.py` + `conversion/gemma.py` + `convert_lora_to_gguf.py` + `gguf-py/gguf/{constants,tensor_mapping,gguf_writer}.py` | Python-side: new `Granite4VisionMmprojModel` (vision-projector for Granite4 with QFormer-window deepstack + per-projector spatial offsets + image-grid pinpoints); Gemma4 unified vision/audio conversion fix-ups for newer HF checkpoints (`hidden_size` falls back to `audio_embed_dim`; `model_patch_size` falls back to `patch_size * pooling_kernel_size`). `convert_lora_to_gguf.py` gained `--trust-remote-code`. New `LLM_KV_DEEPSTACK_MAPPING` writer (`add_deepstack_mapping`) and new clip-vision keys (`KEY_PROJ_SAMPLE_QUERY_SIDE`, `KEY_PROJ_SAMPLE_WINDOW_SIDE`, `KEY_PROJ_SPATIAL_OFFSETS`, `KEY_FEATURE_LAYERS`, `KEY_IMAGE_GRID_PINPOINTS`) for the Granite4 vision projector. Python-side only; no impact on the Java/JNI build. No project source changes required |
+
| ~b9495–b9543 | upstream build / verification | Local build pending: the b9495 → b9543 bump is expected to compile cleanly given the audit above (zero `grep` matches in `src/main/cpp/` for any of the renamed, removed, re-signatured, or newly added symbols: `hparams.n_layer`, `nextn_predict_layers`, `n_layer_nextn`, `n_layer_all`, `LLAMA_STATE_SEQ_FLAGS_ON_DEVICE`, `clip_image_u8`/`clip_image_f32` field access, `clip_build_img_from_pixels`, `clip_get_newline_tensor`, `clip_image_u8_get_data`, `clip_embd_nbytes`, `clip_embd_nbytes_by_img`, `clip_encode_float_image`, `clip_image_f32_batch_add_mel`, `mtmd_helper_bitmap_init_from_file`, `mtmd_helper_bitmap_init_from_buf`, `common_imatrix_load`). The only project-visible signature change, the new `bool is_placeholder` parameter on `process_mtmd_prompt()`, is defaulted, so existing call sites inside the project compile unchanged. All breaking changes in this range are absorbed inside upstream-compiled translation units; no project source edits required for the version bump itself |
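
The symbol audit described in the b9495 → b9543 rows could be repeated on later bumps with a small script. The following is only a sketch under stated assumptions: the function name `find_symbol_hits` and the constant `AUDITED_SYMBOLS` are hypothetical, the symbol list is copied from the verification row above (omitting the `clip_image_u8`/`clip_image_f32` field-access check, which needs a pattern rather than a fixed name), and matching is plain substring search rather than the `grep` regular expressions quoted in the table.

```python
import os

# Symbols copied from the b9495 -> b9543 verification row.
# Assumption: plain substring matching is close enough to the grep audit.
AUDITED_SYMBOLS = [
    "hparams.n_layer",
    "nextn_predict_layers",
    "n_layer_nextn",
    "n_layer_all",
    "LLAMA_STATE_SEQ_FLAGS_ON_DEVICE",
    "clip_build_img_from_pixels",
    "clip_get_newline_tensor",
    "clip_image_u8_get_data",
    "clip_embd_nbytes",
    "clip_encode_float_image",
    "clip_image_f32_batch_add_mel",
    "mtmd_helper_bitmap_init_from_file",
    "mtmd_helper_bitmap_init_from_buf",
    "common_imatrix_load",
]


def find_symbol_hits(roots, symbols=AUDITED_SYMBOLS):
    """Return (path, line_number, symbol) for every occurrence under roots."""
    hits = []
    for root in roots:
        # os.walk yields nothing for a directory that does not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding="utf-8", errors="replace") as fh:
                        for lineno, line in enumerate(fh, start=1):
                            for symbol in symbols:
                                if symbol in line:
                                    hits.append((path, lineno, symbol))
                except OSError:
                    # Unreadable files are skipped rather than failing the audit.
                    continue
    return hits


if __name__ == "__main__":
    # Directories taken from the grep commands quoted in the table.
    for path, lineno, symbol in find_symbol_hits(["src/main/cpp", "src/test/cpp"]):
        print(f"{path}:{lineno}: {symbol}")
```

An empty result corresponds to the "zero matches" outcome recorded in the table. Because the search is by substring, `clip_embd_nbytes` also covers `clip_embd_nbytes_by_img`, and `hparams.n_layer` would also flag uses of `hparams.n_layer_all`.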