Upgrade llama.cpp from b9840 to b9842

claude · claude · commit 1af4f4d023bc · 2026-06-29T16:57:56.000Z
Pure version bump folded into the 5.0.3 release. The b9840->b9842 range is internal-only with no API impact:

- common/preset.cpp: canonical_tag() helper canonicalizes the tag suffix of INI preset section names (CLI/common feature; not bound by the JNI layer).
- ggml-vulkan.cpp: graph-submission heuristic switched from weight-matrix bytes to estimated FLOPs (Vulkan-backend perf tuning; vulkan-windows classifier only).

All four local patches (0001-0004) apply unchanged (no patch-target file or OuteTTS generator anchor touched).

Updated CMakeLists GIT_TAG/LLAMA_TAG, README badge, CLAUDE.md pin references, CHANGELOG 5.0.3 range, and appended b9840-b9842 rows to docs/history/llama-cpp-breaking-changes.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01URUX3HiqQ1wzJnT8qn8c8E
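
Illustrative note on the common/preset.cpp change: the sketch below shows one way the tag suffix of a preset section name could be canonicalized so that differently cased tags resolve to the same preset. It is not the upstream canonical_tag() code; the function name `canonicalize_section_tag` is hypothetical, and it assumes that canonicalization amounts to upper-casing everything after the last `:` (the upstream helper is described as also normalizing via a regex, which is not reproduced here).

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, NOT the upstream canonical_tag(): upper-cases everything
// after the last ':' in a preset section name and leaves the prefix untouched.
std::string canonicalize_section_tag(const std::string & section) {
    const std::size_t pos = section.rfind(':');
    if (pos == std::string::npos) {
        return section; // no tag suffix, nothing to canonicalize
    }
    std::string out = section;
    const auto first = out.begin() + static_cast<std::ptrdiff_t>(pos) + 1;
    // Cast through unsigned char so std::toupper never sees a negative value.
    std::transform(first, out.end(), first, [](unsigned char c) {
        return static_cast<char>(std::toupper(c));
    });
    return out;
}
```

Under this assumption, `model:q4_k_m` and `model:Q4_K_M` both map to `model:Q4_K_M`, while a section name without a `:` is returned as-is.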
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -13,7 +13,7 @@ from version 5.0.0 onward. Pre-fork releases (`1.x`–`4.2.0`) were authored by
 
 Feature release. Headline addition is a full OpenAI-compatible embedded HTTP
 server with multi-protocol surfaces, plus end-to-end multimodal (vision, audio
-input, text-to-speech) and slot-bound sessions. Tracks llama.cpp **b9555 → b9840**.
+input, text-to-speech) and slot-bound sessions. Tracks llama.cpp **b9555 → b9842**.
 
 ### Added
 - **OpenAI-compatible HTTP server** (`server` package, built on the JDK's `com.sun.net.httpserver` — no new runtime dependency; embeddable and the fat-jar `Main-Class`). Serves `POST /v1/chat/completions` (streaming SSE + non-streaming), `/v1/completions` (token-by-token streaming), `/v1/embeddings`, `/v1/rerank`, `/infill`, `GET /v1/models`, `GET /health`, and `GET /props` (every route also reachable without the `/v1` prefix), with optional bearer auth and CORS — drives editor clients such as VS Code Copilot, Cline, Roo Code, and Continue.
@@ -30,7 +30,7 @@ input, text-to-speech) and slot-bound sessions. Tracks llama.cpp **b9555 → b98
 - `log_helpers.hpp` — pure, unit-tested log-formatting helpers (`log_level_name`, `format_log_as_json`).
 
 ### Changed
-- Upgraded llama.cpp from **b9555 to b9840** across ten incremental upgrades. Notable upstream features now reachable: DRY sampling, `--swa-full`, DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`), the MiniCPM5 XML tool-call chat template, the server `--reasoning-preserve` flag, Jinja `min`/`max` array filters, and the **DeepSeek-V4** architecture (b9840). The b9829 bump additionally compiles the new upstream `server-stream.cpp` (resumable-streaming SSE replay buffer) into `libjllama`. All four local patches (`0001`–`0004`) apply across the range.
+- Upgraded llama.cpp from **b9555 to b9842** across eleven incremental upgrades. Notable upstream features now reachable: DRY sampling, `--swa-full`, DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`), the MiniCPM5 XML tool-call chat template, the server `--reasoning-preserve` flag, Jinja `min`/`max` array filters, and the **DeepSeek-V4** architecture (b9840). The b9829 bump additionally compiles the new upstream `server-stream.cpp` (resumable-streaming SSE replay buffer) into `libjllama`. The final b9840→b9842 step is internal-only (preset INI section-tag canonicalization in `common/preset.cpp`; a Vulkan graph-submission heuristic switched from weight-matrix bytes to estimated FLOPs): it requires no project source changes and has no API impact. All four local patches (`0001`–`0004`) apply unchanged across the range.
 - Replaced the `--skip-download` flag with `--offline` (llama.cpp b9803).
 - `Session` now pins every inference request to its configured slot, so generation and slot save/restore/erase target the same KV state (`SessionState` extracted as a testable concurrency contract).
 - `configureParallelInference` now applies `slot_prompt_similarity` live via `server_context::set_slot_prompt_similarity()` (upstream PR ggml-org/llama.cpp#22393, carried as `patches/0003`), instead of validating and discarding the value.
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9840**
+Current llama.cpp pinned version: **b9842**
 
 ## Upgrading CUDA Version
 
@@ -286,7 +286,7 @@ needs no extra step here, `build-webui` re-reads the tag and rebuilds the matchi
 ships no UI):
 ```bash
 # needs node/npm + network; embed.cpp is plain C++17 (no npm)
-git clone --depth 1 --branch b9840 https://github.com/ggml-org/llama.cpp /tmp/lc
+git clone --depth 1 --branch b9842 https://github.com/ggml-org/llama.cpp /tmp/lc
 ( cd /tmp/lc/tools/ui && npm ci && npm run build \
   && ( cd dist && find . -type f -not -path './_gzip/*' \
        | while read -r f; do mkdir -p "_gzip/$(dirname "$f")"; gzip -9 -c "$f" > "_gzip/$f"; done ) \
@@ -320,7 +320,7 @@ plus a cache token are present, `build.sh` adds
 - `SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }}` — a Depot **organization** token, stored
   as the repo secret **`DEPOT_TOKEN`**.
 
-Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9840`), the
+Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9842`), the
 ~280 upstream object files are byte-identical every run, so a warm cache recompiles only the
 *changed* files. Depot's cache is **shared across all branches** (unlike GitHub's
 per-branch `actions/cache`), so every branch builds incrementally; a `b<nnnn>` version bump
@@ -1021,7 +1021,7 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson"
 
 #### Upstream source location (in CMake build tree)
 
-llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9840`.
+llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9842`.
 
 **GoogleTest** is a separate `BUILD_TESTING`-only FetchContent (`GIT_TAG v1.17.0`), used solely
 by the `jllama_test` C++ unit-test binary — not by the shipped library, and not coupled to the
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -143,7 +143,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9840
+	GIT_TAG        b9842
 	PATCH_COMMAND  ${CMAKE_COMMAND}
 		-DPATCH_DIR=${CMAKE_CURRENT_SOURCE_DIR}/patches
 		-DLLAMA_SRC=<SOURCE_DIR>
@@ -166,7 +166,7 @@ execute_process(
     COMMAND ${CMAKE_COMMAND}
         -DTTS_SRC=${llama.cpp_SOURCE_DIR}/tools/tts/tts.cpp
         -DOUT_CPP=${JLLAMA_TTS_GEN_CPP}
-        -DLLAMA_TAG=b9840
+        -DLLAMA_TAG=b9842
         -P ${CMAKE_CURRENT_SOURCE_DIR}/cmake/generate-tts-upstream.cmake
     RESULT_VARIABLE JLLAMA_TTS_GEN_RESULT
 )
diff --git a/README.md b/README.md
@@ -7,7 +7,7 @@
 **Build:**  
 ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)  
 ![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20Android-lightgrey)  
-[![llama.cpp b9840](https://img.shields.io/badge/llama.cpp-%23b9840-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9840)  
+[![llama.cpp b9842](https://img.shields.io/badge/llama.cpp-%23b9842-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9842)  
 [![JPMS](https://img.shields.io/badge/JPMS-modular%20JAR-25A162)](https://openjdk.org/projects/jigsaw/)  
 ![JUnit](https://img.shields.io/badge/tested%20with-JUnit6-25A162)  
 [![JSpecify](https://img.shields.io/badge/JSpecify-1.0.0%20%40NullMarked-25A162)](https://jspecify.dev)  
diff --git a/docs/history/llama-cpp-breaking-changes.md b/docs/history/llama-cpp-breaking-changes.md
@@ -407,3 +407,5 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
 | b9829–b9839 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply cleanly** against b9839 via `git apply --check` over the actual b9839 sources fetched from `raw.githubusercontent.com` (github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run — exit 0 for all four). Patch 0001's `common/arg.{cpp,h}` target regions and the ~34 standalone-main call sites are unchanged in this range (the b9839 arg.cpp edits are the new `--reasoning-preserve` opt, the `--offline` `set_examples`, and the `on_done` capture fix — none overlap the patched hunks); 0002's `server-context.cpp` load-progress guard region is untouched; 0003/0004 unchanged. OuteTTS generator anchors hold (upstream `tools/tts/tts.cpp` is unchanged in this range apart from patch 0001's existing main()-only parse flip). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline |
 | b9839–b9840 | `src/llama-arch.{cpp,h}` + `src/llama-model.{cpp,h}` + `src/llama-hparams.h` + `src/llama-graph.{cpp,h}` + `src/llama-kv-cache-dsv4.{cpp,h}` (new) + `src/models/deepseek4.cpp` (new) + `src/llama-kv-cache{,-iswa}.{cpp,h}` + `src/llama-model-loader.cpp` + `src/CMakeLists.txt` + `conversion/*.py` + `gguf-py/` + `models/templates/deepseek-ai-DeepSeek-V4.jinja` (new) | **New model support** — DeepSeek-V4 (`LLM_ARCH_DEEPSEEK4` / `deepseek4`): a brand-new arch with its own compressed KV cache (`llama_kv_cache_dsv4`: raw SWA + CSA/HCA/lightning-indexer compressor states), `sqrtsoftplus` MoE gating (`LLAMA_EXPERT_GATING_FUNC_TYPE_SQRT_SOFTPLUS = 4`), hyper-connection + compressor hparams/tensors, hash-routing experts, and an embedded chat template. `build_moe_ffn` gained an optional trailing `selected_experts_in` param (defaults `nullptr`); `llama_kv_cache_iswa` gained an hparams-taking ctor overload; `llama_kv_cache` exposes `get_layer_ids()`/`get_k_storage()`. **All internal to upstream-compiled `libllama`** — upstream's own `src/CMakeLists.txt` adds the new `llama-kv-cache-dsv4.cpp` (built via FetchContent), and the conversion/gguf-py changes are Python-only (not built/shipped by this repo). The project binds none of the new symbols — verified `grep -rn "DEEPSEEK4\|dsv4\|DSV4\|SQRT_SOFTPLUS\|sqrtsoftplus\|selected_experts_in\|HYPER_CONNECTION\|hash_layer" src/main/cpp src/test/cpp` &#x2192; zero matches. No project source changes required; a DeepSeek-V4 GGUF would just work through the embedded server / `LlamaModel` path. |
 | b9839–b9840 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply cleanly** against b9840 via `git apply --check` over the actual b9840 sources fetched from `raw.githubusercontent.com` (github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run — exit 0 for all four). The b9839→b9840 range touches **no** patch-target file (`common/arg.{cpp,h}`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `test-chat.cpp`, the ~34 standalone mains) — it is entirely additive DeepSeek-V4 code — so the patch hunks and offsets are byte-identical to b9839. OuteTTS generator anchors hold (upstream `tools/tts/tts.cpp` unchanged in this range). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline |
+| b9840–b9842 | `common/preset.cpp` + `ggml/src/ggml-vulkan/ggml-vulkan.cpp` | Internal-only, no API surface. (1) `common/preset.cpp` adds a `canonical_tag()` helper and **canonicalizes the tag suffix** of INI preset section names (everything after the last `:` is upper-cased / normalized via a `<regex>`), so `[model:q4_k_m]` and `[model:Q4_K_M]` resolve to one preset. Preset loading is a CLI/`common` feature; the project's C++ never calls `common_preset*` — verified `grep -rn "common_preset\|canonical_tag\|load_from_ini" src/main/cpp src/test/cpp` &#x2192; zero matches. (2) `ggml-vulkan.cpp` reworks the per-graph **command-buffer submission heuristic** from "weight-matrix bytes per submit" to "estimated FLOPs per submit" (new `ggml_vk_get_node_flops()` over MUL_MAT/CONV/FLASH_ATTN nodes; `last_total_mul_mat_bytes` → `last_total_flops`; submit every ~200 GFLOP, still bounded by `max_nodes_per_submit`). Pure Vulkan-backend perf tuning, behaviour-neutral to callers; only affects the `vulkan-windows-x86-64` classifier. No project source changes required. |
+| b9840–b9842 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply cleanly** against b9842 via `git apply --check` over the actual b9842 sources (github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run). The b9840→b9842 range touches **no** patch-target file (`common/arg.{cpp,h}`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `test-chat.cpp`, the ~34 standalone mains) and **no** OuteTTS generator anchor (`tools/tts/tts.cpp` unchanged) — the only edits are `common/preset.cpp` and `ggml-vulkan.cpp` — so all patch hunks/offsets are byte-identical to b9840. Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline. |