Commit 7889287

Merge pull request #280 from bernardladenthin/claude/hopeful-albattani-9825zx
Upgrade llama.cpp from b9829 to b9840
2 parents 85228ef + de3bda7 commit 7889287

5 files changed

Lines changed: 18 additions & 7 deletions

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -30,6 +30,8 @@ from version 5.0.0 onward. Pre-fork releases (`1.x`–`4.2.0`) were authored by
 - `pom.xml` SCM URL: `tree/master` → `tree/main` (default branch renamed).
 - Upgraded llama.cpp from b9151 to b9172.
 - Upgraded llama.cpp from b9803 to b9829. Compiles the new upstream `server-stream.cpp` (resumable-streaming SSE replay buffer) into `libjllama`, required because `server-context`/`server-http`/`server-models` now reference its symbols; refreshed `patches/0001` for the `tests/test-export-graph-ops.cpp` rename and the `server.cpp` GC-init context shift.
+- Upgraded llama.cpp from b9829 to b9839. Pure version bump with no project source changes: all four patches (`0001`–`0004`) apply unchanged against b9839, and every upstream change in the range is absorbed inside upstream-compiled translation units. Brings DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`), the MiniCPM5 XML tool-call chat template, a server `--reasoning-preserve` flag (preserve reasoning trace across the full history when the template supports it), and Jinja `min`/`max` array filters; removes the now-unused `common/regex-partial.{cpp,h}` (partial-regex matching is fully inside the PEG parser), which the project never referenced.
+- Upgraded llama.cpp from b9839 to b9840. Pure version bump with no project source changes: the range is entirely the new **DeepSeek-V4** architecture (new `deepseek4` arch + dedicated `llama-kv-cache-dsv4` cache, `sqrtsoftplus` MoE gating, hyper-connection/compressor hparams + tensors, conversion scripts and embedded chat template), all absorbed inside upstream-compiled `libllama` and the Python converters. Upstream's `src/CMakeLists.txt` adds the new `llama-kv-cache-dsv4.cpp` itself (built via FetchContent). All four patches (`0001`–`0004`) apply unchanged; the project binds none of the new symbols.
 - `configureParallelInference` now applies `slot_prompt_similarity` live via `server_context::set_slot_prompt_similarity()` (upstream PR ggml-org/llama.cpp#22393, carried as `patches/0003` until merged), instead of validating it and discarding the value.
 - Extracted the `chatWithTools` agent loop into `ToolCallingAgent`; tool-result errors (unknown tool / handler exception) are now JSON-serialized so tool names containing special characters remain valid JSON.
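The last changelog entry above says tool-result errors are now JSON-serialized so that tool names with special characters stay valid JSON. The project itself is Java; the following is only a minimal Python sketch of that idea. The `{"error": ...}` payload shape and the function names are hypothetical and are not taken from `ToolCallingAgent`.

```python
import json


def naive_error(tool_name):
    # Hand-built JSON (hypothetical "before" behaviour): breaks as soon as
    # the tool name contains a quote or a backslash.
    return '{"error": "unknown tool: ' + tool_name + '"}'


def serialized_error(tool_name):
    # Let the JSON encoder escape the name (hypothetical "after" behaviour).
    return json.dumps({"error": "unknown tool: " + tool_name})


def is_valid_json(text):
    # True when `text` parses as JSON, False otherwise.
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

With a tool name such as `get "weather"`, the hand-built string no longer parses, while the serialized variant round-trips the name unchanged.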

CLAUDE.md

Lines changed: 4 additions & 4 deletions
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9829**
+Current llama.cpp pinned version: **b9840**
 
 ## Upgrading CUDA Version

@@ -286,7 +286,7 @@ needs no extra step here, `build-webui` re-reads the tag and rebuilds the matchi
 ships no UI):
 ```bash
 # needs node/npm + network; embed.cpp is plain C++17 (no npm)
-git clone --depth 1 --branch b9829 https://github.com/ggml-org/llama.cpp /tmp/lc
+git clone --depth 1 --branch b9840 https://github.com/ggml-org/llama.cpp /tmp/lc
 ( cd /tmp/lc/tools/ui && npm ci && npm run build \
 && ( cd dist && find . -type f -not -path './_gzip/*' \
 | while read -r f; do mkdir -p "_gzip/$(dirname "$f")"; gzip -9 -c "$f" > "_gzip/$f"; done ) \
@@ -320,7 +320,7 @@ plus a cache token are present, `build.sh` adds
 - `SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }}`: a Depot **organization** token, stored
 as the repo secret **`DEPOT_TOKEN`**.
 
-Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9829`), the
+Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9840`), the
 ~280 upstream object files are byte-identical every run, so a warm cache recompiles only the
 *changed* files. Depot's cache is **shared across all branches** (unlike GitHub's
 per-branch `actions/cache`), so every branch builds incrementally; a `b<nnnn>` version bump
@@ -1021,7 +1021,7 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson"
 
 #### Upstream source location (in CMake build tree)
 
-llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9829`.
+llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9840`.
 
 **GoogleTest** is a separate `BUILD_TESTING`-only FetchContent (`GIT_TAG v1.17.0`), used solely
 by the `jllama_test` C++ unit-test binary, not by the shipped library, and not coupled to the

CMakeLists.txt

Lines changed: 2 additions & 2 deletions
@@ -143,7 +143,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
     llama.cpp
     GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-    GIT_TAG b9829
+    GIT_TAG b9840
     PATCH_COMMAND ${CMAKE_COMMAND}
         -DPATCH_DIR=${CMAKE_CURRENT_SOURCE_DIR}/patches
         -DLLAMA_SRC=<SOURCE_DIR>
@@ -166,7 +166,7 @@ execute_process(
     COMMAND ${CMAKE_COMMAND}
         -DTTS_SRC=${llama.cpp_SOURCE_DIR}/tools/tts/tts.cpp
         -DOUT_CPP=${JLLAMA_TTS_GEN_CPP}
-        -DLLAMA_TAG=b9829
+        -DLLAMA_TAG=b9840
         -P ${CMAKE_CURRENT_SOURCE_DIR}/cmake/generate-tts-upstream.cmake
     RESULT_VARIABLE JLLAMA_TTS_GEN_RESULT
 )

README.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 **Build:**
 ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)
 ![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20Android-lightgrey)
-[![llama.cpp b9829](https://img.shields.io/badge/llama.cpp-%23b9829-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9829)
+[![llama.cpp b9840](https://img.shields.io/badge/llama.cpp-%23b9840-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9840)
 [![JPMS](https://img.shields.io/badge/JPMS-modular%20JAR-25A162)](https://openjdk.org/projects/jigsaw/)
 ![JUnit](https://img.shields.io/badge/tested%20with-JUnit6-25A162)
 [![JSpecify](https://img.shields.io/badge/JSpecify-1.0.0%20%40NullMarked-25A162)](https://jspecify.dev)
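This commit bumps the same tag in several files (`CMakeLists.txt`, `CLAUDE.md`, `README.md`), so a stale reference is easy to miss. The following is a minimal Python sketch of a consistency check. It is not part of the repository; the patterns are inferred only from the hunks shown in this diff and are probably not an exhaustive list of pin sites, and `find_pins` / `check_pins` are hypothetical names.

```python
import re

# Patterns inferred from the hunks in this commit (an assumption, not an
# exhaustive list of every place the tag is pinned).
PIN_PATTERNS = [
    re.compile(r"GIT_TAG\s+(b\d+)"),                # CMakeLists.txt, CLAUDE.md
    re.compile(r"-DLLAMA_TAG=(b\d+)"),              # CMakeLists.txt TTS generator
    re.compile(r"--branch\s+(b\d+)"),               # CLAUDE.md clone command
    re.compile(r"pinned version: \*\*(b\d+)\*\*"),  # CLAUDE.md header line
    re.compile(r"releases/tag/(b\d+)"),             # README.md badge link
]


def find_pins(text):
    """Return every llama.cpp tag referenced in `text`, grouped by pattern order."""
    tags = []
    for pattern in PIN_PATTERNS:
        tags.extend(pattern.findall(text))
    return tags


def check_pins(files, expected):
    """Map each file name to the sorted list of stale tags found in its contents.

    `files` is a dict of {name: contents}; files without stale tags are omitted.
    """
    stale = {}
    for name, text in files.items():
        wrong = sorted({tag for tag in find_pins(text) if tag != expected})
        if wrong:
            stale[name] = wrong
    return stale
```

Calling `check_pins` with the contents of the three files and `"b9840"` should return an empty dict after this commit. Tags that do not follow the `b<nnnn>` form, such as the GoogleTest `GIT_TAG v1.17.0`, are ignored by construction.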

docs/history/llama-cpp-breaking-changes.md

Lines changed: 9 additions & 0 deletions
@@ -398,3 +398,12 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
 | b9803–b9829 | `ggml/src/ggml-opencl/` (FA q4_0/q8_0 KV, +5 new kernel files) + `ggml/src/ggml-cuda/{cpy,out-prod}.cu` + `ggml/src/ggml-vulkan/` + `ggml/src/ggml-sycl/{norm,softmax}.cpp` + `ggml/src/ggml-openvino/` | Backend-internal only: OpenCL gains native flash-attention over quantized (q4_0/q8_0) KV cache + flash-decoding split kernels + Adreno X2/Xe tuning (new `fa_tune.h`, `flash_attn_pre_f16.cl`, `flash_attn_f32_q{4,8}_0.cl`, `cvt.cl`/`set_rows.cl` SoA quant variants); CUDA adds a `cudaMemcpy2DAsync` fast path for strided same-type copies, batched `cublasSgemmBatched` out-prod, and CPU→CUDA async copies; Vulkan/SYCL/OpenVINO kernel + op-table updates (incl. `GGML_GLU_OP_SWIGLU_OAI`, softmax attention-sinks). No API surface visible to `jllama.cpp`; the OpenCL set only affects the `opencl-android-aarch64` classifier, CUDA only `cuda13-linux-x86-64`. No project source changes required |
 | b9803–b9829 | `common/common.{h,cpp}` + `common/speculative.cpp` + `common/arg.{cpp,h}` + `tools/mtmd/clip*.{h,cpp}` | Internal upstream churn: new `COM_*`/`SPC_*` logging macros (the `LOG_*` calls inside `common.cpp`/`speculative.cpp`/`reasoning-budget.cpp` were rewrapped, several `LOG_INF` → `LOG_TRC` quieting); `common_models_handler` gained `plan_spec`/`plan_voc` for `--spec-draft-hf`/`--hf-repo-v` downloads + duplicate-task dedup; `clip` hardened GGUF array reads (`get_arr_f32`, even-pinpoints / mean-std validation, `n_merge` defaults to 1). All consumed inside upstream-compiled `common`/`mtmd`; `grep -rn "common_models_handler\|COM_TRC\|n_merge" src/main/cpp src/test/cpp` → zero matches. No project source changes required |
 | b9803–b9829 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply + reverse-apply cleanly** against b9829 via `git apply --check` / `git apply --reverse --check` over the actual b9829 sources fetched from `api.github.com` (github.com git-clone, incl. `FetchContent` of `nlohmann/json` and llama.cpp, is blocked in this sandbox, so a full build could not run). Patch 0001 was refreshed for the `test-export-graph-ops` rename and the `server.cpp` GC-insertion context shift (see the row above); 0002/0003/0004 unchanged. The **`server-stream.cpp` link fix** in `CMakeLists.txt` is required by the b9829 server-TU `#include`s (verified against the upstream diff: `server-context`/`server-http`/`server-models` reference symbols defined only in `server-stream.cpp`). Full build + `ctest` (target 454/454) to be confirmed by the CI pipeline. |
+| b9829–b9839 | `common/regex-partial.{cpp,h}` (removed) + `common/CMakeLists.txt` + `tests/test-regex-partial.cpp` (removed) + `tests/CMakeLists.txt` | The standalone reversed-partial-regex matcher (`common_regex`, `regex_to_reversed_partial_regex`, `common_regex_match`/`common_string_range`) was **deleted**: partial-match handling during streaming tool-call parsing is now fully inside the PEG parser (same consolidation pattern as the b9739–b9789 `json-partial` removal). Project references none of these symbols (verified: `grep -rn "regex-partial\|common_regex\|regex_to_reversed\|COMMON_REGEX" src/main/cpp src/test/cpp` → zero matches); the deleted upstream test isn't built here (`LLAMA_BUILD_TESTS` OFF). No project source changes required |
+| b9829–b9839 | `common/common.h` + `common/speculative.cpp` + `conversion/*.py` + `gguf-py/` + `src/llama-arch.{cpp,h}` + `src/llama-{context,graph,model}.cpp` + `src/models/dflash.cpp` (new) + `docs/speculative.md` | **New feature**: DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`, PR #22105): a new `LLM_ARCH_DFLASH` arch + `common_speculative_impl_draft_dflash` that drafts a whole block per step and injects the target model's hidden states into the draft KV cache. Adds `COMMON_SPECULATIVE_TYPE_DRAFT_DFLASH` (so `COMMON_SPECULATIVE_TYPE_COUNT` 9→10, `static_assert` bumped) and a `self_kq_mask && self_kq_mask->buffer` guard in `llm_graph_input_attn_kv::set_input` for the KV-injection pass. Conversion/gguf-py changes are Python-only (not built/shipped by this repo). The project binds no `common_speculative_*`/arch symbol; all consumed inside upstream-compiled `common`/`libllama`. No project source changes required. Could later surface as a `--spec-type` inference parameter |
+| b9829–b9839 | `common/chat.cpp` + `models/templates/openbmb-MiniCPM5-1B.jinja` (new) + `tests/test-chat*.cpp` | **New model support**: MiniCPM5 chat template (`common_chat_params_init_minicpm5`): XML tool calls `<function name="…"><param name="…">…</param></function>` with CDATA-escaped string values + `<think>` reasoning. Detected by `common_chat_try_specialized_template` and handled inside the compiled-in `chat.cpp`, so it flows through the embedded server / `LlamaModel` chat path automatically. Upstream test additions aren't built here (`LLAMA_BUILD_TESTS` OFF). No project source changes required |
+| b9829–b9839 | `common/arg.cpp` + `common/chat.cpp` + `common/jinja/caps.{cpp,h}` + `tools/server/server-context.cpp` | **New feature**: `--reasoning-preserve` / `--no-reasoning-preserve` (`LLAMA_ARG_REASONING_PRESERVE`): preserve the reasoning trace across the **full** chat history (not just the last assistant message) when the template advertises the `supports_preserve_reasoning` capability; `server-context.cpp` adds an informational/warning log reconciling the flag with the loaded template's caps. Server-level CLI + capability detection, all inside upstream-compiled TUs; not surfaced by `ModelParameters`. **Note:** the b9839 `server-context.cpp` additions sit in `load_model` after the `chat_params` block, disjoint from the load-progress-callback guard `patches/0002` targets, which still applies cleanly. No project source changes required; could later expose as a model/inference setter |
+| b9829–b9839 | `common/jinja/runtime.{cpp,h}` + `common/jinja/value.cpp` + `tools/ui/**` + `tests/test-jinja.cpp` + `tools/server/server-{models,stream}.cpp` | Internal/cosmetic only: Jinja gains an AST `visitor` + `runtime::debug_dump_program` (template debugging) and `min`/`max` array filters; `server-models.cpp`/`server-stream.cpp` add diagnostic warning logs on unknown-conversation stop paths (additive, compiled into `jllama`); the Svelte WebUI got conversation-sidebar/streaming-identity refactors. The WebUI **auto-follows** the pinned `GIT_TAG` (the `build-webui` CI job re-reads it and rebuilds the matching UI), so no manual step here. Project references none of the touched symbols. No project source changes required |
+| b9829–b9839 | `common/arg.cpp` (lambda capture + `--offline` examples) | Behaviour-neutral upstream churn: the `common_models_handler_apply` `on_done` lambda now captures `first_path` by value (dangling-reference fix) and `--offline` gained `LLAMA_EXAMPLE_COMMON`/`LLAMA_EXAMPLE_DOWNLOAD` `set_examples` tags. The project's `ModelParameters.setOffline(boolean)` (`--offline`) already exists; both changes are inside upstream-compiled `arg.cpp` and don't touch the `patches/0001` hunks. No project source changes required |
+| b9829–b9839 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply cleanly** against b9839 via `git apply --check` over the actual b9839 sources fetched from `raw.githubusercontent.com` (the check exited 0 for all four; github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run). Patch 0001's `common/arg.{cpp,h}` target regions and the ~34 standalone-main call sites are unchanged in this range (the b9839 arg.cpp edits are the new `--reasoning-preserve` opt, the `--offline` `set_examples`, and the `on_done` capture fix; none overlap the patched hunks); 0002's `server-context.cpp` load-progress guard region is untouched; 0003/0004 unchanged. OuteTTS generator anchors hold (upstream `tools/tts/tts.cpp` is unchanged in this range apart from patch 0001's existing main()-only parse flip). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline |
+| b9839–b9840 | `src/llama-arch.{cpp,h}` + `src/llama-model.{cpp,h}` + `src/llama-hparams.h` + `src/llama-graph.{cpp,h}` + `src/llama-kv-cache-dsv4.{cpp,h}` (new) + `src/models/deepseek4.cpp` (new) + `src/llama-kv-cache{,-iswa}.{cpp,h}` + `src/llama-model-loader.cpp` + `src/CMakeLists.txt` + `conversion/*.py` + `gguf-py/` + `models/templates/deepseek-ai-DeepSeek-V4.jinja` (new) | **New model support**: DeepSeek-V4 (`LLM_ARCH_DEEPSEEK4` / `deepseek4`): a brand-new arch with its own compressed KV cache (`llama_kv_cache_dsv4`: raw SWA + CSA/HCA/lightning-indexer compressor states), `sqrtsoftplus` MoE gating (`LLAMA_EXPERT_GATING_FUNC_TYPE_SQRT_SOFTPLUS = 4`), hyper-connection + compressor hparams/tensors, hash-routing experts, and an embedded chat template. `build_moe_ffn` gained an optional trailing `selected_experts_in` param (defaults `nullptr`); `llama_kv_cache_iswa` gained an hparams-taking ctor overload; `llama_kv_cache` exposes `get_layer_ids()`/`get_k_storage()`. **All internal to upstream-compiled `libllama`**: upstream's own `src/CMakeLists.txt` adds the new `llama-kv-cache-dsv4.cpp` (built via FetchContent), and the conversion/gguf-py changes are Python-only (not built/shipped by this repo). The project binds none of the new symbols (verified: `grep -rn "DEEPSEEK4\|dsv4\|DSV4\|SQRT_SOFTPLUS\|sqrtsoftplus\|selected_experts_in\|HYPER_CONNECTION\|hash_layer" src/main/cpp src/test/cpp` → zero matches). No project source changes required; a DeepSeek-V4 GGUF would just work through the embedded server / `LlamaModel` path. |
+| b9839–b9840 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply cleanly** against b9840 via `git apply --check` over the actual b9840 sources fetched from `raw.githubusercontent.com` (the check exited 0 for all four; github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run). The b9839→b9840 range touches **no** patch-target file (`common/arg.{cpp,h}`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `test-chat.cpp`, the ~34 standalone mains); it is entirely additive DeepSeek-V4 code, so the patch hunks and offsets are byte-identical to b9839. OuteTTS generator anchors hold (upstream `tools/tts/tts.cpp` unchanged in this range). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline |
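Several rows above justify "No project source changes required" with a recursive `grep -rn` over `src/main/cpp` and `src/test/cpp` that returns zero matches. The following is a minimal Python sketch of the same kind of scan, for environments where scripting the check is more convenient. `find_symbol_hits` is a hypothetical helper, not something shipped in the repository, and it matches the symbols as literal substrings rather than as grep patterns.

```python
import os
import re


def find_symbol_hits(roots, symbols):
    """Return (path, line_number, line) for every line mentioning a symbol.

    `roots` are directories to walk recursively (like `grep -r`); `symbols`
    are literal substrings, joined into a single alternation.
    """
    pattern = re.compile("|".join(re.escape(symbol) for symbol in symbols))
    hits = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for filename in sorted(filenames):
                path = os.path.join(dirpath, filename)
                # errors="replace" keeps the scan going over non-UTF-8 files.
                with open(path, encoding="utf-8", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if pattern.search(line):
                            hits.append((path, number, line.rstrip("\n")))
    return hits
```

An empty result for, say, `find_symbol_hits(["src/main/cpp", "src/test/cpp"], ["DEEPSEEK4", "dsv4", "sqrtsoftplus"])` would correspond to the "zero matches" outcome recorded in the table; any hit would point at a binding that needs review before the bump.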
