Merged
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -30,6 +30,8 @@ from version 5.0.0 onward. Pre-fork releases (`1.x`–`4.2.0`) were authored by
- `pom.xml` SCM URL: `tree/master` → `tree/main` (default branch renamed).
- Upgraded llama.cpp from b9151 to b9172.
- Upgraded llama.cpp from b9803 to b9829. Compiles the new upstream `server-stream.cpp` (resumable-streaming SSE replay buffer) into `libjllama`, required because `server-context`/`server-http`/`server-models` now reference its symbols; refreshed `patches/0001` for the `tests/test-export-graph-ops.cpp` rename and the `server.cpp` GC-init context shift.
- Upgraded llama.cpp from b9829 to b9839. Pure version bump — no project source changes: all four patches (`0001`–`0004`) apply unchanged against b9839, and every upstream change in the range is absorbed inside upstream-compiled translation units. Brings DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`), the MiniCPM5 XML tool-call chat template, a server `--reasoning-preserve` flag (preserve reasoning trace across the full history when the template supports it), and Jinja `min`/`max` array filters; removes the now-unused `common/regex-partial.{cpp,h}` (partial-regex matching is fully inside the PEG parser), which the project never referenced.
- Upgraded llama.cpp from b9839 to b9840. Pure version bump — no project source changes: the range is entirely the new **DeepSeek-V4** architecture (new `deepseek4` arch + dedicated `llama-kv-cache-dsv4` cache, `sqrtsoftplus` MoE gating, hyper-connection/compressor hparams + tensors, conversion scripts and embedded chat template), all absorbed inside upstream-compiled `libllama` and the Python converters. Upstream's `src/CMakeLists.txt` adds the new `llama-kv-cache-dsv4.cpp` itself (built via FetchContent). All four patches (`0001`–`0004`) apply unchanged; the project binds none of the new symbols.
- `configureParallelInference` now applies `slot_prompt_similarity` live via `server_context::set_slot_prompt_similarity()` (upstream PR ggml-org/llama.cpp#22393, carried as `patches/0003` until merged), instead of validating it and discarding the value.
- Extracted the `chatWithTools` agent loop into `ToolCallingAgent`; tool-result errors (unknown tool / handler exception) are now JSON-serialized so tool names containing special characters remain valid JSON.
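
The last entry is easiest to see with a concrete payload. Below is a minimal, dependency-free sketch of why a tool name containing a quote breaks a hand-concatenated error object, and how escaping it keeps the result valid JSON. The class `ToolErrorJson`, its method names, and the `{"error": ...}` shape are hypothetical illustrations, not the project's actual `ToolCallingAgent` code, which may well use a JSON library instead.

```java
public class ToolErrorJson {

    /** Escapes a string for use inside a JSON string literal (RFC 8259, section 7). */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters need a six-character hex escape.
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    /** Naive concatenation: the output is malformed once the name contains a quote. */
    static String unsafeUnknownTool(String toolName) {
        return "{\"error\":\"unknown tool: " + toolName + "\"}";
    }

    /** Escaped variant: the tool name can no longer terminate the string early. */
    static String unknownTool(String toolName) {
        return "{\"error\":\"unknown tool: " + escape(toolName) + "\"}";
    }

    public static void main(String[] args) {
        String name = "get\"weather"; // hypothetical tool name with a special character
        System.out.println(unsafeUnknownTool(name)); // prints {"error":"unknown tool: get"weather"}
        System.out.println(unknownTool(name));       // prints {"error":"unknown tool: get\"weather"}
    }
}
```

The first printed line is not parseable JSON because the embedded quote ends the string value early; the second one is.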

8 changes: 4 additions & 4 deletions CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.

-Current llama.cpp pinned version: **b9829**
+Current llama.cpp pinned version: **b9840**

## Upgrading CUDA Version

@@ -286,7 +286,7 @@ needs no extra step here, `build-webui` re-reads the tag and rebuilds the matchi
ships no UI):
```bash
# needs node/npm + network; embed.cpp is plain C++17 (no npm)
-git clone --depth 1 --branch b9829 https://github.com/ggml-org/llama.cpp /tmp/lc
+git clone --depth 1 --branch b9840 https://github.com/ggml-org/llama.cpp /tmp/lc
( cd /tmp/lc/tools/ui && npm ci && npm run build \
&& ( cd dist && find . -type f -not -path './_gzip/*' \
| while read -r f; do mkdir -p "_gzip/$(dirname "$f")"; gzip -9 -c "$f" > "_gzip/$f"; done ) \
@@ -320,7 +320,7 @@ plus a cache token are present, `build.sh` adds
- `SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }}` — a Depot **organization** token, stored
as the repo secret **`DEPOT_TOKEN`**.

-Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9829`), the
+Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9840`), the
~280 upstream object files are byte-identical every run, so a warm cache recompiles only the
*changed* files. Depot's cache is **shared across all branches** (unlike GitHub's
per-branch `actions/cache`), so every branch builds incrementally; a `b<nnnn>` version bump
@@ -1021,7 +1021,7 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson"

#### Upstream source location (in CMake build tree)

-llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9829`.
+llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9840`.

**GoogleTest** is a separate `BUILD_TESTING`-only FetchContent (`GIT_TAG v1.17.0`), used solely
by the `jllama_test` C++ unit-test binary — not by the shipped library, and not coupled to the
4 changes: 2 additions & 2 deletions CMakeLists.txt
@@ -143,7 +143,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
FetchContent_Declare(
llama.cpp
GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-GIT_TAG b9829
+GIT_TAG b9840
PATCH_COMMAND ${CMAKE_COMMAND}
-DPATCH_DIR=${CMAKE_CURRENT_SOURCE_DIR}/patches
-DLLAMA_SRC=<SOURCE_DIR>
@@ -166,7 +166,7 @@ execute_process(
COMMAND ${CMAKE_COMMAND}
-DTTS_SRC=${llama.cpp_SOURCE_DIR}/tools/tts/tts.cpp
-DOUT_CPP=${JLLAMA_TTS_GEN_CPP}
--DLLAMA_TAG=b9829
+-DLLAMA_TAG=b9840
-P ${CMAKE_CURRENT_SOURCE_DIR}/cmake/generate-tts-upstream.cmake
RESULT_VARIABLE JLLAMA_TTS_GEN_RESULT
)
2 changes: 1 addition & 1 deletion README.md
@@ -7,7 +7,7 @@
**Build:**
![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)
![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20Android-lightgrey)
-[![llama.cpp b9829](https://img.shields.io/badge/llama.cpp-%23b9829-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9829)
+[![llama.cpp b9840](https://img.shields.io/badge/llama.cpp-%23b9840-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9840)
[![JPMS](https://img.shields.io/badge/JPMS-modular%20JAR-25A162)](https://openjdk.org/projects/jigsaw/)
![JUnit](https://img.shields.io/badge/tested%20with-JUnit6-25A162)
[![JSpecify](https://img.shields.io/badge/JSpecify-1.0.0%20%40NullMarked-25A162)](https://jspecify.dev)
9 changes: 9 additions & 0 deletions docs/history/llama-cpp-breaking-changes.md
@@ -398,3 +398,12 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
| b9803–b9829 | `ggml/src/ggml-opencl/` (FA q4_0/q8_0 KV, +5 new kernel files) + `ggml/src/ggml-cuda/{cpy,out-prod}.cu` + `ggml/src/ggml-vulkan/` + `ggml/src/ggml-sycl/{norm,softmax}.cpp` + `ggml/src/ggml-openvino/` | Backend-internal only: OpenCL gains native flash-attention over quantized (q4_0/q8_0) KV cache + flash-decoding split kernels + Adreno X2/Xe tuning (new `fa_tune.h`, `flash_attn_pre_f16.cl`, `flash_attn_f32_q{4,8}_0.cl`, `cvt.cl`/`set_rows.cl` SoA quant variants); CUDA adds a `cudaMemcpy2DAsync` fast path for strided same-type copies, batched `cublasSgemmBatched` out-prod, and CPU→CUDA async copies; Vulkan/SYCL/OpenVINO kernel + op-table updates (incl. `GGML_GLU_OP_SWIGLU_OAI`, softmax attention-sinks). No API surface visible to `jllama.cpp`; the OpenCL set only affects the `opencl-android-aarch64` classifier, CUDA only `cuda13-linux-x86-64`. No project source changes required |
| b9803–b9829 | `common/common.{h,cpp}` + `common/speculative.cpp` + `common/arg.{cpp,h}` + `tools/mtmd/clip*.{h,cpp}` | Internal upstream churn: new `COM_*`/`SPC_*` logging macros (the `LOG_*` calls inside `common.cpp`/`speculative.cpp`/`reasoning-budget.cpp` were rewrapped, several `LOG_INF`→`LOG_TRC` quieting); `common_models_handler` gained `plan_spec`/`plan_voc` for `--spec-draft-hf`/`--hf-repo-v` downloads + duplicate-task dedup; `clip` hardened GGUF array reads (`get_arr_f32`, even-pinpoints / mean-std validation, `n_merge` defaults to 1). All consumed inside upstream-compiled `common`/`mtmd`; `grep -rn "common_models_handler\|COM_TRC\|n_merge" src/main/cpp src/test/cpp` → zero matches. No project source changes required |
| b9803–b9829 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply + reverse-apply cleanly** against b9829 via `git apply --check` / `git apply --reverse --check` over the actual b9829 sources fetched from `api.github.com` (github.com git-clone — incl. `FetchContent` of `nlohmann/json` and llama.cpp — is blocked in this sandbox, so a full build could not run). Patch 0001 was refreshed for the `test-export-graph-ops` rename and the `server.cpp` GC-insertion context shift (see the row above); 0002/0003/0004 unchanged. The **`server-stream.cpp` link fix** in `CMakeLists.txt` is required by the b9829 server-TU `#include`s (verified against the upstream diff: `server-context`/`server-http`/`server-models` reference symbols defined only in `server-stream.cpp`). Full build + `ctest` (target 454/454) to be confirmed by the CI pipeline. |
| b9829–b9839 | `common/regex-partial.{cpp,h}` (removed) + `common/CMakeLists.txt` + `tests/test-regex-partial.cpp` (removed) + `tests/CMakeLists.txt` | The standalone reversed-partial-regex matcher (`common_regex`, `regex_to_reversed_partial_regex`, `common_regex_match`/`common_string_range`) was **deleted** — partial-match handling during streaming tool-call parsing is now fully inside the PEG parser (same consolidation pattern as the b9739–b9789 `json-partial` removal). Project references none of these symbols — verified `grep -rn "regex-partial\|common_regex\|regex_to_reversed\|COMMON_REGEX" src/main/cpp src/test/cpp` &#x2192; zero matches; the deleted upstream test isn't built here (`LLAMA_BUILD_TESTS` OFF). No project source changes required |
| b9829–b9839 | `common/common.h` + `common/speculative.cpp` + `conversion/*.py` + `gguf-py/` + `src/llama-arch.{cpp,h}` + `src/llama-{context,graph,model}.cpp` + `src/models/dflash.cpp` (new) + `docs/speculative.md` | **New feature** — DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`, PR #22105): a new `LLM_ARCH_DFLASH` arch + `common_speculative_impl_draft_dflash` that drafts a whole block per step and injects the target model's hidden states into the draft KV cache. Adds `COMMON_SPECULATIVE_TYPE_DRAFT_DFLASH` (so `COMMON_SPECULATIVE_TYPE_COUNT` 9&#x2192;10, `static_assert` bumped) and a `self_kq_mask && self_kq_mask->buffer` guard in `llm_graph_input_attn_kv::set_input` for the KV-injection pass. Conversion/gguf-py changes are Python-only (not built/shipped by this repo). The project binds no `common_speculative_*`/arch symbol — all consumed inside upstream-compiled `common`/`libllama`. No project source changes required. Could later surface as a `--spec-type` inference parameter |
| b9829–b9839 | `common/chat.cpp` + `models/templates/openbmb-MiniCPM5-1B.jinja` (new) + `tests/test-chat*.cpp` | **New model support** — MiniCPM5 chat template (`common_chat_params_init_minicpm5`): XML tool calls `<function name="…"><param name="…">…</param></function>` with CDATA-escaped string values + `<think>` reasoning. Detected by `common_chat_try_specialized_template` and handled inside the compiled-in `chat.cpp`, so it flows through the embedded server / `LlamaModel` chat path automatically. Upstream test additions aren't built here (`LLAMA_BUILD_TESTS` OFF). No project source changes required |
| b9829–b9839 | `common/arg.cpp` + `common/chat.cpp` + `common/jinja/caps.{cpp,h}` + `tools/server/server-context.cpp` | **New feature** — `--reasoning-preserve` / `--no-reasoning-preserve` (`LLAMA_ARG_REASONING_PRESERVE`): preserve the reasoning trace across the **full** chat history (not just the last assistant message) when the template advertises the `supports_preserve_reasoning` capability; `server-context.cpp` adds an informational/warning log reconciling the flag with the loaded template's caps. Server-level CLI + capability detection, all inside upstream-compiled TUs; not surfaced by `ModelParameters`. **Note:** the b9839 `server-context.cpp` additions sit in `load_model` after the `chat_params` block — disjoint from the load-progress-callback guard `patches/0002` targets, which still applies cleanly. No project source changes required; could later expose as a model/inference setter |
| b9829–b9839 | `common/jinja/runtime.{cpp,h}` + `common/jinja/value.cpp` + `tools/ui/**` + `tests/test-jinja.cpp` + `tools/server/server-{models,stream}.cpp` | Internal/cosmetic only: Jinja gains an AST `visitor` + `runtime::debug_dump_program` (template debugging) and `min`/`max` array filters; `server-models.cpp`/`server-stream.cpp` add diagnostic warning logs on unknown-conversation stop paths (additive, compiled into `jllama`); the Svelte WebUI got conversation-sidebar/streaming-identity refactors. The WebUI **auto-follows** the pinned `GIT_TAG` (the `build-webui` CI job re-reads it and rebuilds the matching UI), so no manual step here. Project references none of the touched symbols. No project source changes required |
| b9829–b9839 | `common/arg.cpp` (lambda capture + `--offline` examples) | Behaviour-neutral upstream churn: the `common_models_handler_apply` `on_done` lambda now captures `first_path` by value (dangling-reference fix) and `--offline` gained `LLAMA_EXAMPLE_COMMON`/`LLAMA_EXAMPLE_DOWNLOAD` `set_examples` tags. The project's `ModelParameters.setOffline(boolean)` (`--offline`) already exists; both changes are inside upstream-compiled `arg.cpp` and don't touch the `patches/0001` hunks. No project source changes required |
| b9829–b9839 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply cleanly** against b9839 via `git apply --check` over the actual b9839 sources fetched from `raw.githubusercontent.com` (github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run — exit 0 for all four). Patch 0001's `common/arg.{cpp,h}` target regions and the ~34 standalone-main call sites are unchanged in this range (the b9839 arg.cpp edits are the new `--reasoning-preserve` opt, the `--offline` `set_examples`, and the `on_done` capture fix — none overlap the patched hunks); 0002's `server-context.cpp` load-progress guard region is untouched; 0003/0004 unchanged. OuteTTS generator anchors hold (upstream `tools/tts/tts.cpp` is unchanged in this range apart from patch 0001's existing main()-only parse flip). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline |
| b9839–b9840 | `src/llama-arch.{cpp,h}` + `src/llama-model.{cpp,h}` + `src/llama-hparams.h` + `src/llama-graph.{cpp,h}` + `src/llama-kv-cache-dsv4.{cpp,h}` (new) + `src/models/deepseek4.cpp` (new) + `src/llama-kv-cache{,-iswa}.{cpp,h}` + `src/llama-model-loader.cpp` + `src/CMakeLists.txt` + `conversion/*.py` + `gguf-py/` + `models/templates/deepseek-ai-DeepSeek-V4.jinja` (new) | **New model support** — DeepSeek-V4 (`LLM_ARCH_DEEPSEEK4` / `deepseek4`): a brand-new arch with its own compressed KV cache (`llama_kv_cache_dsv4`: raw SWA + CSA/HCA/lightning-indexer compressor states), `sqrtsoftplus` MoE gating (`LLAMA_EXPERT_GATING_FUNC_TYPE_SQRT_SOFTPLUS = 4`), hyper-connection + compressor hparams/tensors, hash-routing experts, and an embedded chat template. `build_moe_ffn` gained an optional trailing `selected_experts_in` param (defaults `nullptr`); `llama_kv_cache_iswa` gained an hparams-taking ctor overload; `llama_kv_cache` exposes `get_layer_ids()`/`get_k_storage()`. **All internal to upstream-compiled `libllama`** — upstream's own `src/CMakeLists.txt` adds the new `llama-kv-cache-dsv4.cpp` (built via FetchContent), and the conversion/gguf-py changes are Python-only (not built/shipped by this repo). The project binds none of the new symbols — verified `grep -rn "DEEPSEEK4\|dsv4\|DSV4\|SQRT_SOFTPLUS\|sqrtsoftplus\|selected_experts_in\|HYPER_CONNECTION\|hash_layer" src/main/cpp src/test/cpp` &#x2192; zero matches. No project source changes required; a DeepSeek-V4 GGUF would just work through the embedded server / `LlamaModel` path. |
| b9839–b9840 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply cleanly** against b9840 via `git apply --check` over the actual b9840 sources fetched from `raw.githubusercontent.com` (github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run — exit 0 for all four). The b9839→b9840 range touches **no** patch-target file (`common/arg.{cpp,h}`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `test-chat.cpp`, the ~34 standalone mains) — it is entirely additive DeepSeek-V4 code — so the patch hunks and offsets are byte-identical to b9839. OuteTTS generator anchors hold (upstream `tools/tts/tts.cpp` unchanged in this range). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline |
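
The "zero matches" checks recorded in the rows above were run with `grep -rn` over `src/main/cpp` and `src/test/cpp`. For environments without `grep`, the same scan can be sketched in plain Java 8. The class `SymbolScan` and the method `findReferences` are hypothetical names for illustration; the symbol list is the one quoted in the DeepSeek-V4 row, and plain substring matching stands in for the `grep` pattern.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SymbolScan {

    /** Returns one "path:line: text" entry per line that contains any of the needles. */
    static List<String> findReferences(List<Path> roots, List<String> needles) throws IOException {
        List<String> hits = new ArrayList<>();
        for (Path root : roots) {
            if (!Files.isDirectory(root)) {
                continue; // a missing root contributes no matches
            }
            List<Path> files;
            try (Stream<Path> walk = Files.walk(root)) {
                files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
            }
            for (Path file : files) {
                // ISO-8859-1 maps every byte to a char, so no file can fail to decode.
                List<String> lines = Files.readAllLines(file, StandardCharsets.ISO_8859_1);
                for (int i = 0; i < lines.size(); i++) {
                    String line = lines.get(i);
                    for (String needle : needles) {
                        if (line.contains(needle)) {
                            hits.add(file + ":" + (i + 1) + ": " + line.trim());
                            break; // report each line once, even if several needles match
                        }
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        List<Path> roots = Arrays.asList(Paths.get("src/main/cpp"), Paths.get("src/test/cpp"));
        List<String> needles = Arrays.asList(
                "DEEPSEEK4", "dsv4", "DSV4", "SQRT_SOFTPLUS", "sqrtsoftplus",
                "selected_experts_in", "HYPER_CONNECTION", "hash_layer");
        List<String> hits = findReferences(roots, needles);
        hits.forEach(System.out::println);
        // An empty result corresponds to the "zero matches" outcome recorded in the table.
        System.exit(hits.isEmpty() ? 0 : 1);
    }
}
```

Run from the repository root, an exit status of 0 means the project binds none of the listed upstream symbols. Note that the exit convention is inverted relative to `grep`, which returns 1 when nothing matches.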