Commit 1af4f4d

Upgrade llama.cpp from b9840 to b9842

Pure version bump folded into the 5.0.3 release. The b9840->b9842 range is internal-only with no API impact:

- common/preset.cpp: canonical_tag() helper canonicalizes the tag suffix of INI preset section names (CLI/common feature; not bound by the JNI layer).
- ggml-vulkan.cpp: graph-submission heuristic switched from weight-matrix bytes to estimated FLOPs (Vulkan-backend perf tuning; vulkan-windows classifier only).

All four local patches (0001-0004) apply unchanged (no patch-target file or OuteTTS generator anchor touched).

Updated CMakeLists GIT_TAG/LLAMA_TAG, README badge, CLAUDE.md pin references, CHANGELOG 5.0.3 range, and appended b9840-b9842 rows to docs/history/llama-cpp-breaking-changes.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01URUX3HiqQ1wzJnT8qn8c8E
1 parent 90de054 commit 1af4f4d
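
The `canonical_tag()` change is described in this commit only at a high level: the tag suffix of an INI preset section name (everything after the last `:`) is upper-cased and normalized, so `[model:q4_k_m]` and `[model:Q4_K_M]` resolve to one preset. The Python sketch below illustrates that behaviour only. It is not the upstream C++ code, the name `canonical_section` is hypothetical, and any normalization beyond upper-casing is an unknown that the sketch does not reproduce.

```python
def canonical_section(name: str) -> str:
    """Upper-case the tag after the last ':' of a preset section name.

    Illustrative only: the upstream helper also normalizes via a regex,
    which this sketch does not attempt to model.
    """
    head, sep, tag = name.rpartition(":")
    if not sep:
        # No ':' present, so there is no tag suffix to canonicalize.
        return name
    return f"{head}:{tag.upper()}"


# Two spellings of the same quantization tag map to one section key.
presets = {}
for section in ("model:q4_k_m", "model:Q4_K_M"):
    presets[canonical_section(section)] = section
```

After the loop, `presets` holds a single key, which is the property the commit message relies on when it calls the change internal-only.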

5 files changed

Lines changed: 11 additions & 9 deletions

CHANGELOG.md

Lines changed: 2 additions & 2 deletions
@@ -13,7 +13,7 @@ from version 5.0.0 onward. Pre-fork releases (`1.x`–`4.2.0`) were authored by
 
 Feature release. Headline addition is a full OpenAI-compatible embedded HTTP
 server with multi-protocol surfaces, plus end-to-end multimodal (vision, audio
-input, text-to-speech) and slot-bound sessions. Tracks llama.cpp **b9555 → b9840**.
+input, text-to-speech) and slot-bound sessions. Tracks llama.cpp **b9555 → b9842**.
 
 ### Added
 - **OpenAI-compatible HTTP server** (`server` package, built on the JDK's `com.sun.net.httpserver` — no new runtime dependency; embeddable and the fat-jar `Main-Class`). Serves `POST /v1/chat/completions` (streaming SSE + non-streaming), `/v1/completions` (token-by-token streaming), `/v1/embeddings`, `/v1/rerank`, `/infill`, `GET /v1/models`, `GET /health`, and `GET /props` (every route also reachable without the `/v1` prefix), with optional bearer auth and CORS — drives editor clients such as VS Code Copilot, Cline, Roo Code, and Continue.
@@ -30,7 +30,7 @@ input, text-to-speech) and slot-bound sessions. Tracks llama.cpp **b9555 → b98
 - `log_helpers.hpp` — pure, unit-tested log-formatting helpers (`log_level_name`, `format_log_as_json`).
 
 ### Changed
-- Upgraded llama.cpp from **b9555 to b9840** across ten incremental upgrades. Notable upstream features now reachable: DRY sampling, `--swa-full`, DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`), the MiniCPM5 XML tool-call chat template, the server `--reasoning-preserve` flag, Jinja `min`/`max` array filters, and the **DeepSeek-V4** architecture (b9840). The b9829 bump additionally compiles the new upstream `server-stream.cpp` (resumable-streaming SSE replay buffer) into `libjllama`. All four local patches (`0001``0004`) apply across the range.
+- Upgraded llama.cpp from **b9555 to b9842** across eleven incremental upgrades. Notable upstream features now reachable: DRY sampling, `--swa-full`, DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`), the MiniCPM5 XML tool-call chat template, the server `--reasoning-preserve` flag, Jinja `min`/`max` array filters, and the **DeepSeek-V4** architecture (b9840). The b9829 bump additionally compiles the new upstream `server-stream.cpp` (resumable-streaming SSE replay buffer) into `libjllama`. The final b9840→b9842 step is internal-only (preset INI section-tag canonicalization in `common/preset.cpp`; a Vulkan graph-submission heuristic switched from weight-matrix bytes to estimated FLOPs) — no project source changes, no API impact, all four local patches (`0001``0004`) apply unchanged across the range.
 - Replaced the `--skip-download` flag with `--offline` (llama.cpp b9803).
 - `Session` now pins every inference request to its configured slot, so generation and slot save/restore/erase target the same KV state (`SessionState` extracted as a testable concurrency contract).
 - `configureParallelInference` now applies `slot_prompt_similarity` live via `server_context::set_slot_prompt_similarity()` (upstream PR ggml-org/llama.cpp#22393, carried as `patches/0003`), instead of validating and discarding the value.

CLAUDE.md

Lines changed: 4 additions & 4 deletions
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9840**
+Current llama.cpp pinned version: **b9842**
 
 ## Upgrading CUDA Version
 
@@ -286,7 +286,7 @@ needs no extra step here, `build-webui` re-reads the tag and rebuilds the matchi
 ships no UI):
 ```bash
 # needs node/npm + network; embed.cpp is plain C++17 (no npm)
-git clone --depth 1 --branch b9840 https://github.com/ggml-org/llama.cpp /tmp/lc
+git clone --depth 1 --branch b9842 https://github.com/ggml-org/llama.cpp /tmp/lc
 ( cd /tmp/lc/tools/ui && npm ci && npm run build \
 && ( cd dist && find . -type f -not -path './_gzip/*' \
 | while read -r f; do mkdir -p "_gzip/$(dirname "$f")"; gzip -9 -c "$f" > "_gzip/$f"; done ) \
@@ -320,7 +320,7 @@ plus a cache token are present, `build.sh` adds
 - `SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }}` — a Depot **organization** token, stored
 as the repo secret **`DEPOT_TOKEN`**.
 
-Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9840`), the
+Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9842`), the
 ~280 upstream object files are byte-identical every run, so a warm cache recompiles only the
 *changed* files. Depot's cache is **shared across all branches** (unlike GitHub's
 per-branch `actions/cache`), so every branch builds incrementally; a `b<nnnn>` version bump
@@ -1021,7 +1021,7 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson"
 
 #### Upstream source location (in CMake build tree)
 
-llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9840`.
+llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9842`.
 
 **GoogleTest** is a separate `BUILD_TESTING`-only FetchContent (`GIT_TAG v1.17.0`), used solely
 by the `jllama_test` C++ unit-test binary — not by the shipped library, and not coupled to the

CMakeLists.txt

Lines changed: 2 additions & 2 deletions
@@ -143,7 +143,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 llama.cpp
 GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-GIT_TAG b9840
+GIT_TAG b9842
 PATCH_COMMAND ${CMAKE_COMMAND}
 -DPATCH_DIR=${CMAKE_CURRENT_SOURCE_DIR}/patches
 -DLLAMA_SRC=<SOURCE_DIR>
@@ -166,7 +166,7 @@ execute_process(
 COMMAND ${CMAKE_COMMAND}
 -DTTS_SRC=${llama.cpp_SOURCE_DIR}/tools/tts/tts.cpp
 -DOUT_CPP=${JLLAMA_TTS_GEN_CPP}
-DLLAMA_TAG=b9840
+-DLLAMA_TAG=b9842
 -P ${CMAKE_CURRENT_SOURCE_DIR}/cmake/generate-tts-upstream.cmake
 RESULT_VARIABLE JLLAMA_TTS_GEN_RESULT
 )
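
The two hunks above change `GIT_TAG` and `-DLLAMA_TAG=` in lockstep, which suggests the two pins are expected to name the same upstream tag. A minimal Python sketch of a check for that invariant follows. The helper names `pinned_tags` and `is_consistent` are hypothetical and not part of this repository, and the regex assumes tags always have the `b<nnnn>` form seen in this commit.

```python
import re

# Matches the tag that follows either "GIT_TAG " or "-DLLAMA_TAG=".
# Only b<nnnn> tags are captured, so other pins such as "GIT_TAG v1.17.0"
# are ignored.
_PIN = re.compile(r"(?:GIT_TAG\s+|-DLLAMA_TAG=)(b\d+)\b")


def pinned_tags(cmake_text: str) -> set:
    """Return every distinct b<nnnn> tag pinned in the given CMake text."""
    return set(_PIN.findall(cmake_text))


def is_consistent(cmake_text: str) -> bool:
    """True when exactly one llama.cpp tag is pinned across all pin sites."""
    return len(pinned_tags(cmake_text)) == 1


# Inline sample mirroring the post-commit state of the two hunks.
sample = "GIT_TAG b9842\n-DLLAMA_TAG=b9842\n"
```

In practice the text would come from reading `CMakeLists.txt`; a half-finished bump (one pin at `b9840`, the other at `b9842`) makes `is_consistent` return `False`.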

README.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 **Build:**
 ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)
 ![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20Android-lightgrey)
-[![llama.cpp b9840](https://img.shields.io/badge/llama.cpp-%23b9840-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9840)
+[![llama.cpp b9842](https://img.shields.io/badge/llama.cpp-%23b9842-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9842)
 [![JPMS](https://img.shields.io/badge/JPMS-modular%20JAR-25A162)](https://openjdk.org/projects/jigsaw/)
 ![JUnit](https://img.shields.io/badge/tested%20with-JUnit6-25A162)
 [![JSpecify](https://img.shields.io/badge/JSpecify-1.0.0%20%40NullMarked-25A162)](https://jspecify.dev)

docs/history/llama-cpp-breaking-changes.md

Lines changed: 2 additions & 0 deletions
@@ -407,3 +407,5 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
 | b9829–b9839 | upstream verification (sandbox) | All four patches (`0001``0004`) re-verified to **apply cleanly** against b9839 via `git apply --check` over the actual b9839 sources fetched from `raw.githubusercontent.com` (github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run — exit 0 for all four). Patch 0001's `common/arg.{cpp,h}` target regions and the ~34 standalone-main call sites are unchanged in this range (the b9839 arg.cpp edits are the new `--reasoning-preserve` opt, the `--offline` `set_examples`, and the `on_done` capture fix — none overlap the patched hunks); 0002's `server-context.cpp` load-progress guard region is untouched; 0003/0004 unchanged. OuteTTS generator anchors hold (upstream `tools/tts/tts.cpp` is unchanged in this range apart from patch 0001's existing main()-only parse flip). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline |
 | b9839–b9840 | `src/llama-arch.{cpp,h}` + `src/llama-model.{cpp,h}` + `src/llama-hparams.h` + `src/llama-graph.{cpp,h}` + `src/llama-kv-cache-dsv4.{cpp,h}` (new) + `src/models/deepseek4.cpp` (new) + `src/llama-kv-cache{,-iswa}.{cpp,h}` + `src/llama-model-loader.cpp` + `src/CMakeLists.txt` + `conversion/*.py` + `gguf-py/` + `models/templates/deepseek-ai-DeepSeek-V4.jinja` (new) | **New model support** — DeepSeek-V4 (`LLM_ARCH_DEEPSEEK4` / `deepseek4`): a brand-new arch with its own compressed KV cache (`llama_kv_cache_dsv4`: raw SWA + CSA/HCA/lightning-indexer compressor states), `sqrtsoftplus` MoE gating (`LLAMA_EXPERT_GATING_FUNC_TYPE_SQRT_SOFTPLUS = 4`), hyper-connection + compressor hparams/tensors, hash-routing experts, and an embedded chat template. `build_moe_ffn` gained an optional trailing `selected_experts_in` param (defaults `nullptr`); `llama_kv_cache_iswa` gained an hparams-taking ctor overload; `llama_kv_cache` exposes `get_layer_ids()`/`get_k_storage()`. **All internal to upstream-compiled `libllama`** — upstream's own `src/CMakeLists.txt` adds the new `llama-kv-cache-dsv4.cpp` (built via FetchContent), and the conversion/gguf-py changes are Python-only (not built/shipped by this repo). The project binds none of the new symbols — verified `grep -rn "DEEPSEEK4\|dsv4\|DSV4\|SQRT_SOFTPLUS\|sqrtsoftplus\|selected_experts_in\|HYPER_CONNECTION\|hash_layer" src/main/cpp src/test/cpp` &#x2192; zero matches. No project source changes required; a DeepSeek-V4 GGUF would just work through the embedded server / `LlamaModel` path. |
 | b9839–b9840 | upstream verification (sandbox) | All four patches (`0001``0004`) re-verified to **apply cleanly** against b9840 via `git apply --check` over the actual b9840 sources fetched from `raw.githubusercontent.com` (github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run — exit 0 for all four). The b9839→b9840 range touches **no** patch-target file (`common/arg.{cpp,h}`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `test-chat.cpp`, the ~34 standalone mains) — it is entirely additive DeepSeek-V4 code — so the patch hunks and offsets are byte-identical to b9839. OuteTTS generator anchors hold (upstream `tools/tts/tts.cpp` unchanged in this range). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline |
+| b9840–b9842 | `common/preset.cpp` + `ggml/src/ggml-vulkan/ggml-vulkan.cpp` | Internal-only, no API surface. (1) `common/preset.cpp` adds a `canonical_tag()` helper and **canonicalizes the tag suffix** of INI preset section names (everything after the last `:` is upper-cased / normalized via a `<regex>`), so `[model:q4_k_m]` and `[model:Q4_K_M]` resolve to one preset. Preset loading is a CLI/`common` feature; the project's C++ never calls `common_preset*` — verified `grep -rn "common_preset\|canonical_tag\|load_from_ini" src/main/cpp src/test/cpp` &#x2192; zero matches. (2) `ggml-vulkan.cpp` reworks the per-graph **command-buffer submission heuristic** from "weight-matrix bytes per submit" to "estimated FLOPs per submit" (new `ggml_vk_get_node_flops()` over MUL_MAT/CONV/FLASH_ATTN nodes; `last_total_mul_mat_bytes` → `last_total_flops`; submit every ~200 GFLOP, still bounded by `max_nodes_per_submit`). Pure Vulkan-backend perf tuning, behaviour-neutral to callers; only affects the `vulkan-windows-x86-64` classifier. No project source changes required. |
+| b9840–b9842 | upstream verification (sandbox) | All four patches (`0001``0004`) re-verified to **apply cleanly** against b9842 via `git apply --check` over the actual b9842 sources (github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run). The b9840→b9842 range touches **no** patch-target file (`common/arg.{cpp,h}`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `test-chat.cpp`, the ~34 standalone mains) and **no** OuteTTS generator anchor (`tools/tts/tts.cpp` unchanged) — the only edits are `common/preset.cpp` and `ggml-vulkan.cpp` — so all patch hunks/offsets are byte-identical to b9840. Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline. |
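
The first b9840–b9842 row describes the Vulkan submission heuristic as "submit every ~200 GFLOP, still bounded by `max_nodes_per_submit`". The Python sketch below shows one way such a batching rule could work. It is an assumption-laden illustration, not the upstream `ggml-vulkan.cpp` logic: `Node`, `mul_mat_flops`, and `plan_submits` are hypothetical names, the `2*m*n*k` cost is the textbook matrix-multiply estimate rather than what `ggml_vk_get_node_flops()` necessarily computes, and nodes with no estimate are simply given zero cost.

```python
from dataclasses import dataclass
from typing import List

GFLOP = 1e9


@dataclass
class Node:
    op: str        # e.g. "MUL_MAT", "ADD" (labels are illustrative)
    flops: float   # estimated cost; 0.0 for ops without an estimate


def mul_mat_flops(m: int, n: int, k: int) -> float:
    """Textbook estimate for an (m x k) by (k x n) matrix multiply."""
    return 2.0 * m * n * k


def plan_submits(nodes: List[Node], max_nodes_per_submit: int,
                 flops_per_submit: float = 200 * GFLOP) -> List[List[Node]]:
    """Group nodes into submissions, closing a batch once either the
    accumulated FLOP estimate or the node-count bound is reached."""
    batches, current, acc = [], [], 0.0
    for node in nodes:
        current.append(node)
        acc += node.flops
        if acc >= flops_per_submit or len(current) >= max_nodes_per_submit:
            batches.append(current)
            current, acc = [], 0.0
    if current:
        # Flush whatever remains at the end of the graph.
        batches.append(current)
    return batches


# Three 120-GFLOP matmuls followed by a cheap elementwise op.
graph = [Node("MUL_MAT", 120 * GFLOP) for _ in range(3)] + [Node("ADD", 0.0)]
batches = plan_submits(graph, max_nodes_per_submit=100)
```

With these inputs the first two matmuls cross the 200 GFLOP threshold together and form one submission, and the remaining two nodes are flushed as a second one. Lowering `max_nodes_per_submit` to 1 forces one submission per node regardless of cost, which is the sense in which the node-count bound still applies.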
