Merge pull request #280 from bernardladenthin/claude/hopeful-albattani-9825zx

bernardladenthin · web-flow · commit 7889287040a0 · 2026-06-29T13:34:25.000+02:00
Upgrade llama.cpp from b9829 to b9840
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -30,6 +30,8 @@ from version 5.0.0 onward. Pre-fork releases (`1.x`–`4.2.0`) were authored by
 - `pom.xml` SCM URL: `tree/master` → `tree/main` (default branch renamed).
 - Upgraded llama.cpp from b9151 to b9172.
 - Upgraded llama.cpp from b9803 to b9829. Compiles the new upstream `server-stream.cpp` (resumable-streaming SSE replay buffer) into `libjllama`, required because `server-context`/`server-http`/`server-models` now reference its symbols; refreshed `patches/0001` for the `tests/test-export-graph-ops.cpp` rename and the `server.cpp` GC-init context shift.
+- Upgraded llama.cpp from b9829 to b9839. Pure version bump — no project source changes: all four patches (`0001`–`0004`) apply unchanged against b9839, and every upstream change in the range is absorbed inside upstream-compiled translation units. Brings DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`), the MiniCPM5 XML tool-call chat template, a server `--reasoning-preserve` flag (preserve reasoning trace across the full history when the template supports it), and Jinja `min`/`max` array filters; removes the now-unused `common/regex-partial.{cpp,h}` (partial-regex matching is fully inside the PEG parser), which the project never referenced.
+- Upgraded llama.cpp from b9839 to b9840. Pure version bump — no project source changes: the range is entirely the new **DeepSeek-V4** architecture (new `deepseek4` arch + dedicated `llama-kv-cache-dsv4` cache, `sqrtsoftplus` MoE gating, hyper-connection/compressor hparams + tensors, conversion scripts and embedded chat template), all absorbed inside upstream-compiled `libllama` and the Python converters. Upstream's `src/CMakeLists.txt` adds the new `llama-kv-cache-dsv4.cpp` itself (built via FetchContent). All four patches (`0001`–`0004`) apply unchanged; the project binds none of the new symbols.
 - `configureParallelInference` now applies `slot_prompt_similarity` live via `server_context::set_slot_prompt_similarity()` (upstream PR ggml-org/llama.cpp#22393, carried as `patches/0003` until merged), instead of validating it and discarding the value.
 - Extracted the `chatWithTools` agent loop into `ToolCallingAgent`; tool-result errors (unknown tool / handler exception) are now JSON-serialized so tool names containing special characters remain valid JSON.
 
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9829**
+Current llama.cpp pinned version: **b9840**
 
 ## Upgrading CUDA Version
 
@@ -286,7 +286,7 @@ needs no extra step here, `build-webui` re-reads the tag and rebuilds the matchi
 ships no UI):
 ```bash
 # needs node/npm + network; embed.cpp is plain C++17 (no npm)
-git clone --depth 1 --branch b9829 https://github.com/ggml-org/llama.cpp /tmp/lc
+git clone --depth 1 --branch b9840 https://github.com/ggml-org/llama.cpp /tmp/lc
 ( cd /tmp/lc/tools/ui && npm ci && npm run build \
   && ( cd dist && find . -type f -not -path './_gzip/*' \
        | while read -r f; do mkdir -p "_gzip/$(dirname "$f")"; gzip -9 -c "$f" > "_gzip/$f"; done ) \
@@ -320,7 +320,7 @@ plus a cache token are present, `build.sh` adds
 - `SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }}` — a Depot **organization** token, stored
   as the repo secret **`DEPOT_TOKEN`**.
 
-Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9829`), the
+Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9840`), the
 ~280 upstream object files are byte-identical every run, so a warm cache recompiles only the
 *changed* files. Depot's cache is **shared across all branches** (unlike GitHub's
 per-branch `actions/cache`), so every branch builds incrementally; a `b<nnnn>` version bump
@@ -1021,7 +1021,7 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson"
 
 #### Upstream source location (in CMake build tree)
 
-llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9829`.
+llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9840`.
 
 **GoogleTest** is a separate `BUILD_TESTING`-only FetchContent (`GIT_TAG v1.17.0`), used solely
 by the `jllama_test` C++ unit-test binary — not by the shipped library, and not coupled to the
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -143,7 +143,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9829
+	GIT_TAG        b9840
 	PATCH_COMMAND  ${CMAKE_COMMAND}
 		-DPATCH_DIR=${CMAKE_CURRENT_SOURCE_DIR}/patches
 		-DLLAMA_SRC=<SOURCE_DIR>
@@ -166,7 +166,7 @@ execute_process(
     COMMAND ${CMAKE_COMMAND}
         -DTTS_SRC=${llama.cpp_SOURCE_DIR}/tools/tts/tts.cpp
         -DOUT_CPP=${JLLAMA_TTS_GEN_CPP}
-        -DLLAMA_TAG=b9829
+        -DLLAMA_TAG=b9840
         -P ${CMAKE_CURRENT_SOURCE_DIR}/cmake/generate-tts-upstream.cmake
     RESULT_VARIABLE JLLAMA_TTS_GEN_RESULT
 )
diff --git a/README.md b/README.md
@@ -7,7 +7,7 @@
 **Build:**  
 ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)  
 ![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20Android-lightgrey)  
-[![llama.cpp b9829](https://img.shields.io/badge/llama.cpp-%23b9829-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9829)  
+[![llama.cpp b9840](https://img.shields.io/badge/llama.cpp-%23b9840-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9840)  
 [![JPMS](https://img.shields.io/badge/JPMS-modular%20JAR-25A162)](https://openjdk.org/projects/jigsaw/)  
 ![JUnit](https://img.shields.io/badge/tested%20with-JUnit6-25A162)  
 [![JSpecify](https://img.shields.io/badge/JSpecify-1.0.0%20%40NullMarked-25A162)](https://jspecify.dev)  
diff --git a/docs/history/llama-cpp-breaking-changes.md b/docs/history/llama-cpp-breaking-changes.md
@@ -398,3 +398,12 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
 | b9803–b9829 | `ggml/src/ggml-opencl/` (FA q4_0/q8_0 KV, +5 new kernel files) + `ggml/src/ggml-cuda/{cpy,out-prod}.cu` + `ggml/src/ggml-vulkan/` + `ggml/src/ggml-sycl/{norm,softmax}.cpp` + `ggml/src/ggml-openvino/` | Backend-internal only: OpenCL gains native flash-attention over quantized (q4_0/q8_0) KV cache + flash-decoding split kernels + Adreno X2/Xe tuning (new `fa_tune.h`, `flash_attn_pre_f16.cl`, `flash_attn_f32_q{4,8}_0.cl`, `cvt.cl`/`set_rows.cl` SoA quant variants); CUDA adds a `cudaMemcpy2DAsync` fast path for strided same-type copies, batched `cublasSgemmBatched` out-prod, and CPU→CUDA async copies; Vulkan/SYCL/OpenVINO kernel + op-table updates (incl. `GGML_GLU_OP_SWIGLU_OAI`, softmax attention-sinks). No API surface visible to `jllama.cpp`; the OpenCL set only affects the `opencl-android-aarch64` classifier, CUDA only `cuda13-linux-x86-64`. No project source changes required |
 | b9803–b9829 | `common/common.{h,cpp}` + `common/speculative.cpp` + `common/arg.{cpp,h}` + `tools/mtmd/clip*.{h,cpp}` | Internal upstream churn: new `COM_*`/`SPC_*` logging macros (the `LOG_*` calls inside `common.cpp`/`speculative.cpp`/`reasoning-budget.cpp` were rewrapped, several `LOG_INF`→`LOG_TRC` quieting); `common_models_handler` gained `plan_spec`/`plan_voc` for `--spec-draft-hf`/`--hf-repo-v` downloads + duplicate-task dedup; `clip` hardened GGUF array reads (`get_arr_f32`, even-pinpoints / mean-std validation, `n_merge` defaults to 1). All consumed inside upstream-compiled `common`/`mtmd`; `grep -rn "common_models_handler\|COM_TRC\|n_merge" src/main/cpp src/test/cpp` → zero matches. No project source changes required |
 | b9803–b9829 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply + reverse-apply cleanly** against b9829 via `git apply --check` / `git apply --reverse --check` over the actual b9829 sources fetched from `api.github.com` (github.com git-clone — incl. `FetchContent` of `nlohmann/json` and llama.cpp — is blocked in this sandbox, so a full build could not run). Patch 0001 was refreshed for the `test-export-graph-ops` rename and the `server.cpp` GC-insertion context shift (see the row above); 0002/0003/0004 unchanged. The **`server-stream.cpp` link fix** in `CMakeLists.txt` is required by the b9829 server-TU `#include`s (verified against the upstream diff: `server-context`/`server-http`/`server-models` reference symbols defined only in `server-stream.cpp`). Full build + `ctest` (target 454/454) to be confirmed by the CI pipeline. |
+| b9829–b9839 | `common/regex-partial.{cpp,h}` (removed) + `common/CMakeLists.txt` + `tests/test-regex-partial.cpp` (removed) + `tests/CMakeLists.txt` | The standalone reversed-partial-regex matcher (`common_regex`, `regex_to_reversed_partial_regex`, `common_regex_match`/`common_string_range`) was **deleted**: partial-match handling during streaming tool-call parsing is now fully inside the PEG parser (same consolidation pattern as the b9739–b9789 `json-partial` removal). Project references none of these symbols: verified `grep -rn "regex-partial\|common_regex\|regex_to_reversed\|COMMON_REGEX" src/main/cpp src/test/cpp` → zero matches; the deleted upstream test isn't built here (`LLAMA_BUILD_TESTS` OFF). No project source changes required |
+| b9829–b9839 | `common/common.h` + `common/speculative.cpp` + `conversion/*.py` + `gguf-py/` + `src/llama-arch.{cpp,h}` + `src/llama-{context,graph,model}.cpp` + `src/models/dflash.cpp` (new) + `docs/speculative.md` | **New feature**: DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`, PR #22105): a new `LLM_ARCH_DFLASH` arch + `common_speculative_impl_draft_dflash` that drafts a whole block per step and injects the target model's hidden states into the draft KV cache. Adds `COMMON_SPECULATIVE_TYPE_DRAFT_DFLASH` (so `COMMON_SPECULATIVE_TYPE_COUNT` 9→10, `static_assert` bumped) and a `self_kq_mask && self_kq_mask->buffer` guard in `llm_graph_input_attn_kv::set_input` for the KV-injection pass. Conversion/gguf-py changes are Python-only (not built/shipped by this repo). The project binds no `common_speculative_*`/arch symbol; all are consumed inside upstream-compiled `common`/`libllama`. No project source changes required. Could later surface as a `--spec-type` inference parameter |
+| b9829–b9839 | `common/chat.cpp` + `models/templates/openbmb-MiniCPM5-1B.jinja` (new) + `tests/test-chat*.cpp` | **New model support** — MiniCPM5 chat template (`common_chat_params_init_minicpm5`): XML tool calls `<function name="…"><param name="…">…</param></function>` with CDATA-escaped string values + `<think>` reasoning. Detected by `common_chat_try_specialized_template` and handled inside the compiled-in `chat.cpp`, so it flows through the embedded server / `LlamaModel` chat path automatically. Upstream test additions aren't built here (`LLAMA_BUILD_TESTS` OFF). No project source changes required |
+| b9829–b9839 | `common/arg.cpp` + `common/chat.cpp` + `common/jinja/caps.{cpp,h}` + `tools/server/server-context.cpp` | **New feature** — `--reasoning-preserve` / `--no-reasoning-preserve` (`LLAMA_ARG_REASONING_PRESERVE`): preserve the reasoning trace across the **full** chat history (not just the last assistant message) when the template advertises the `supports_preserve_reasoning` capability; `server-context.cpp` adds an informational/warning log reconciling the flag with the loaded template's caps. Server-level CLI + capability detection, all inside upstream-compiled TUs; not surfaced by `ModelParameters`. **Note:** the b9839 `server-context.cpp` additions sit in `load_model` after the `chat_params` block — disjoint from the load-progress-callback guard `patches/0002` targets, which still applies cleanly. No project source changes required; could later expose as a model/inference setter |
+| b9829–b9839 | `common/jinja/runtime.{cpp,h}` + `common/jinja/value.cpp` + `tools/ui/**` + `tests/test-jinja.cpp` + `tools/server/server-{models,stream}.cpp` | Internal/cosmetic only: Jinja gains an AST `visitor` + `runtime::debug_dump_program` (template debugging) and `min`/`max` array filters; `server-models.cpp`/`server-stream.cpp` add diagnostic warning logs on unknown-conversation stop paths (additive, compiled into `jllama`); the Svelte WebUI got conversation-sidebar/streaming-identity refactors. The WebUI **auto-follows** the pinned `GIT_TAG` (the `build-webui` CI job re-reads it and rebuilds the matching UI), so no manual step here. Project references none of the touched symbols. No project source changes required |
+| b9829–b9839 | `common/arg.cpp` (lambda capture + `--offline` examples) | Behaviour-neutral upstream churn: the `common_models_handler_apply` `on_done` lambda now captures `first_path` by value (dangling-reference fix) and `--offline` gained `LLAMA_EXAMPLE_COMMON`/`LLAMA_EXAMPLE_DOWNLOAD` `set_examples` tags. The project's `ModelParameters.setOffline(boolean)` (`--offline`) already exists; both changes are inside upstream-compiled `arg.cpp` and don't touch the `patches/0001` hunks. No project source changes required |
+| b9829–b9839 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply cleanly** against b9839 via `git apply --check` (exit 0 for all four) over the actual b9839 sources fetched from `raw.githubusercontent.com` (github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run). Patch 0001's `common/arg.{cpp,h}` target regions and the ~34 standalone-main call sites are unchanged in this range (the b9839 `arg.cpp` edits are the new `--reasoning-preserve` option, the `--offline` `set_examples`, and the `on_done` capture fix; none overlap the patched hunks); 0002's `server-context.cpp` load-progress guard region is untouched; 0003/0004 unchanged. OuteTTS generator anchors hold (upstream `tools/tts/tts.cpp` is unchanged in this range apart from patch 0001's existing main()-only parse flip). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline |
+| b9839–b9840 | `src/llama-arch.{cpp,h}` + `src/llama-model.{cpp,h}` + `src/llama-hparams.h` + `src/llama-graph.{cpp,h}` + `src/llama-kv-cache-dsv4.{cpp,h}` (new) + `src/models/deepseek4.cpp` (new) + `src/llama-kv-cache{,-iswa}.{cpp,h}` + `src/llama-model-loader.cpp` + `src/CMakeLists.txt` + `conversion/*.py` + `gguf-py/` + `models/templates/deepseek-ai-DeepSeek-V4.jinja` (new) | **New model support**: DeepSeek-V4 (`LLM_ARCH_DEEPSEEK4` / `deepseek4`): a brand-new arch with its own compressed KV cache (`llama_kv_cache_dsv4`: raw SWA + CSA/HCA/lightning-indexer compressor states), `sqrtsoftplus` MoE gating (`LLAMA_EXPERT_GATING_FUNC_TYPE_SQRT_SOFTPLUS = 4`), hyper-connection + compressor hparams/tensors, hash-routing experts, and an embedded chat template. `build_moe_ffn` gained an optional trailing `selected_experts_in` param (defaults `nullptr`); `llama_kv_cache_iswa` gained an hparams-taking ctor overload; `llama_kv_cache` exposes `get_layer_ids()`/`get_k_storage()`. **All internal to upstream-compiled `libllama`**: upstream's own `src/CMakeLists.txt` adds the new `llama-kv-cache-dsv4.cpp` (built via FetchContent), and the conversion/gguf-py changes are Python-only (not built/shipped by this repo). The project binds none of the new symbols: verified `grep -rn "DEEPSEEK4\|dsv4\|DSV4\|SQRT_SOFTPLUS\|sqrtsoftplus\|selected_experts_in\|HYPER_CONNECTION\|hash_layer" src/main/cpp src/test/cpp` → zero matches. No project source changes required; a DeepSeek-V4 GGUF would just work through the embedded server / `LlamaModel` path. |
+| b9839–b9840 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply cleanly** against b9840 via `git apply --check` (exit 0 for all four) over the actual b9840 sources fetched from `raw.githubusercontent.com` (github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run). The b9839→b9840 range touches **no** patch-target file (`common/arg.{cpp,h}`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `test-chat.cpp`, the ~34 standalone mains); it is entirely additive DeepSeek-V4 code, so the patch hunks and offsets are byte-identical to those for b9839. OuteTTS generator anchors hold (upstream `tools/tts/tts.cpp` unchanged in this range). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline |
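
Reviewer note: the verification rows above describe dry-running each patch with `git apply --check` against an unpacked upstream tree. A minimal sketch of that loop follows. The function name `check_patches`, the `*.patch` file extension, and the directory arguments are assumptions for illustration, not taken from this repository. It also assumes the source tree is a standalone directory (or a repository root), so patch paths resolve relative to it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: dry-run every patch in a directory against a source tree.
# Nothing is modified; `git apply --check` only reports whether a patch would apply.
set -euo pipefail

check_patches() {
    local src_dir="$1"
    local patch_dir
    # Resolve to an absolute path, because `git -C` changes the working directory.
    patch_dir="$(cd "$2" && pwd)"
    local failed=0 p
    for p in "$patch_dir"/*.patch; do
        [ -e "$p" ] || continue          # glob matched nothing
        if git -C "$src_dir" apply --check "$p" 2>/dev/null; then
            echo "OK   $(basename "$p")"
        else
            echo "FAIL $(basename "$p")"
            failed=1
        fi
    done
    return "$failed"                     # non-zero if any patch failed the check
}

# Example invocation (paths are placeholders):
#   check_patches /tmp/lc ./patches
```

Adding `--reverse` to the `git apply` call would give the reverse-apply check mentioned in the b9803–b9829 row. Either way this only confirms that the hunks still match; it does not replace the full build and `ctest` run left to CI.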