Upgrade llama.cpp from b9739 to b9789

claude · claude · commit 9a9ac4f1632c · 2026-06-25T11:35:34.000Z
Bump the pinned llama.cpp tag and refresh the Windows argv patch for the
upgraded source. Every upstream breaking change in this range is absorbed
inside upstream-compiled translation units; no project C++ source edits
were required.

- CMakeLists.txt: GIT_TAG + LLAMA_TAG b9739 -> b9789.
- README.md / CLAUDE.md / publish.yml / TODO.md: version badge,
  pinned-version notes, WebUI clone example, aarch64 GCC rationale.
- patches/0001-win32-arg-parse-embed-guard.patch: refreshed for b9789.
  Upstream replaced the original #24779 argv override with the count-guard
  form (if utf8.buf.size() == argc), which is exactly the variant that
  breaks the Windows server-integration tests, so the patch still drops it
  entirely and keeps "(void) utf8;". Re-verified to apply and reverse-apply
  cleanly (idempotent) against b9789 common/arg.cpp.
- docs/history/llama-cpp-breaking-changes.md: new b9739-b9789 rows
  (json-partial.{h,cpp} removed -> peg-parser; chat.h message-span
  restructure; server-task n_before_user -> message_spans; new
  llama_model_n_layer_nextn; mtmd/clip progress_callback; server-models
  child-process download refactor).

Verified locally on Linux x86_64 (GCC 13.3): cmake configure passes the
fail-loud OuteTTS extraction and refreshed-patch anchor checks against
b9789, the full Release build links libjllama.so + jllama_test with zero
warnings on any project translation unit, and ctest reports 454/454
passing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SLQk4Fk7vk7R4f2za1KxYg
diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml
@@ -264,7 +264,7 @@ jobs:
     # Native ARM64 build on GitHub's free arm64 runner, mirroring upstream llama.cpp's
     # `ubuntu-cpu` aarch64 release job (ubuntu-24.04-arm + GCC 14). Replaces the former dockcross
     # `linux-arm64-lts` cross-compile (GCC 8.5, glibc 2.17), which can no longer compile llama.cpp
-    # b9739 — its C++17 CTAD-in-`new` needs GCC >= 12. Building natively also lets us run the C++
+    # b9789 — its C++17 CTAD-in-`new` needs GCC >= 12. Building natively also lets us run the C++
     # unit suite (ctest) on real ARM hardware for the first time (the cross build ran no tests).
     # Trade-off: the glibc floor rises 2.17 -> ~2.39, the same envelope upstream's own ARM binaries
     # require. GGML_NATIVE=OFF keeps the artifact portable across ARMv8 CPU generations (no
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9739**
+Current llama.cpp pinned version: **b9789**
 
 ## Upgrading CUDA Version
 
@@ -241,7 +241,7 @@ needs no extra step here, `build-webui` re-reads the tag and rebuilds the matchi
 ships no UI):
 ```bash
 # needs node/npm + network; embed.cpp is plain C++17 (no npm)
-git clone --depth 1 --branch b9739 https://github.com/ggml-org/llama.cpp /tmp/lc
+git clone --depth 1 --branch b9789 https://github.com/ggml-org/llama.cpp /tmp/lc
 ( cd /tmp/lc/tools/ui && npm ci && npm run build \
   && ( cd dist && find . -type f -not -path './_gzip/*' \
        | while read -r f; do mkdir -p "_gzip/$(dirname "$f")"; gzip -9 -c "$f" > "_gzip/$f"; done ) \
@@ -275,7 +275,7 @@ plus a cache token are present, `build.sh` adds
 - `SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }}` — a Depot **organization** token, stored
   as the repo secret **`DEPOT_TOKEN`**.
 
-Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9739`), the
+Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9789`), the
 ~280 upstream object files are byte-identical every run, so a warm cache recompiles only the
 *changed* files. Depot's cache is **shared across all branches** (unlike GitHub's
 per-branch `actions/cache`), so every branch builds incrementally; a `b<nnnn>` version bump
@@ -382,7 +382,7 @@ Current patches:
 
 | Patch | Fixes |
 |-------|-------|
-| `0001-win32-arg-parse-embed-guard.patch` | Windows JNI regression from llama.cpp **#24779** (b9739): `common_params_parse` unconditionally replaced the caller's argv with the process command line (`GetCommandLineW`), so an embedded/JNI caller (`java.exe`) lost its `--model …` args → "Failed to parse model parameters". The patch **drops the override for our build** (keeps the `make_utf8_argv()` call referenced so there's no `-Wunused-function`, but never adopts its result), so the caller's already-UTF-8 argv is always used. This is **deterministic** — an earlier count-guard variant (only override when the re-derived arg count equals `argc`) collided on the server-integration tests whose argv length happened to equal `java.exe`'s and kept them failing. The upstream PR can instead expose an opt-out / `common_params_parse_argv` that preserves the standalone tools' UTF-8 fix. |
+| `0001-win32-arg-parse-embed-guard.patch` | Windows JNI regression from llama.cpp **#24779** (introduced b9739): `common_params_parse` replaced the caller's argv with the process command line (`GetCommandLineW`), so an embedded/JNI caller (`java.exe`) lost its `--model …` args → "Failed to parse model parameters". The patch **drops the override for our build** (keeps the `make_utf8_argv()` call referenced so there's no `-Wunused-function`, but never adopts its result), so the caller's already-UTF-8 argv is always used. This is **deterministic** — a count-guard variant (only override when the re-derived arg count equals `argc`) collided on the server-integration tests whose argv length happened to equal `java.exe`'s and kept them failing, and **upstream b9789 now ships exactly that count-guard form** (`if (static_cast<int>(utf8.buf.size()) == argc) { argv = utf8.ptrs.data(); }`), so the patch still drops it rather than adopting upstream's guard. The upstream PR can instead expose an opt-out / `common_params_parse_argv` that preserves the standalone tools' UTF-8 fix. |
 
 ## OuteTTS build-time extraction (`cmake/generate-tts-upstream.cmake`)
 
@@ -888,7 +888,7 @@ now **"Build and Test Linux aarch64"**) builds **natively on `ubuntu-24.04-arm`*
 llama.cpp's own `ubuntu-cpu` aarch64 release job (`ubuntu-24.04-arm` + **GCC 14**).
 
 **Why it moved off dockcross.** The old `dockcross/linux-arm64-lts` image ships **GCC 8.5 / glibc
-2.17**; llama.cpp **b9739** uses C++17 CTAD-in-`new`, which needs **GCC ≥ 12**, so the cross build
+2.17**; llama.cpp **b9789** uses C++17 CTAD-in-`new`, which needs **GCC ≥ 12**, so the cross build
 stopped compiling. Upstream solved the same problem by building natively on `ubuntu-24.04-arm` with
 GCC 14 and ships a **glibc ≈ 2.39** ARM binary with no old-glibc compatibility layer. This repo now
 does the same: the aarch64 artifact's **glibc floor rises 2.17 → ~2.39** — the same envelope
@@ -956,7 +956,7 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson"
 
 #### Upstream source location (in CMake build tree)
 
-llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9739`.
+llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9789`.
 
 **GoogleTest** is a separate `BUILD_TESTING`-only FetchContent (`GIT_TAG v1.17.0`), used solely
 by the `jllama_test` C++ unit-test binary — not by the shipped library, and not coupled to the
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -143,7 +143,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9739
+	GIT_TAG        b9789
 	PATCH_COMMAND  ${CMAKE_COMMAND}
 		-DPATCH_DIR=${CMAKE_CURRENT_SOURCE_DIR}/patches
 		-DLLAMA_SRC=<SOURCE_DIR>
@@ -166,7 +166,7 @@ execute_process(
     COMMAND ${CMAKE_COMMAND}
         -DTTS_SRC=${llama.cpp_SOURCE_DIR}/tools/tts/tts.cpp
         -DOUT_CPP=${JLLAMA_TTS_GEN_CPP}
-        -DLLAMA_TAG=b9739
+        -DLLAMA_TAG=b9789
         -P ${CMAKE_CURRENT_SOURCE_DIR}/cmake/generate-tts-upstream.cmake
     RESULT_VARIABLE JLLAMA_TTS_GEN_RESULT
 )
diff --git a/README.md b/README.md
@@ -7,7 +7,7 @@
 **Build:**  
 ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)  
 ![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20Android-lightgrey)  
-[![llama.cpp b9739](https://img.shields.io/badge/llama.cpp-%23b9739-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9739)  
+[![llama.cpp b9789](https://img.shields.io/badge/llama.cpp-%23b9789-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9789)  
 [![JPMS](https://img.shields.io/badge/JPMS-modular%20JAR-25A162)](https://openjdk.org/projects/jigsaw/)  
 ![JUnit](https://img.shields.io/badge/tested%20with-JUnit6-25A162)  
 [![JSpecify](https://img.shields.io/badge/JSpecify-1.0.0%20%40NullMarked-25A162)](https://jspecify.dev)  
diff --git a/TODO.md b/TODO.md
@@ -164,7 +164,7 @@ primary goal: agentic tool-calling with Qwen):
   What remains is manual validation against the actual editor clients — point Copilot's Ollama provider /
   a Custom Endpoint, Claude Code, and a Responses client at the running server — since a server-side
   round-trip confirms the wire shapes but not each client's own parser.
-- **Gemma 4 tool-calling validation.** Confirm the pinned llama.cpp (`b9739`) includes the Gemma 4
+- **Gemma 4 tool-calling validation.** Confirm the pinned llama.cpp (`b9789`) includes the Gemma 4
   tool-call parser fixes; if not, bump per the upgrade procedure.
 - **NativeServer — wire upstream `server.cpp` routes to JNI (in progress; scaffold landed `dd264b2`).**
   The upstream HTTP transport (`tools/server/server-http.cpp` + the cpp-httplib backend) is already
diff --git a/docs/history/llama-cpp-breaking-changes.md b/docs/history/llama-cpp-breaking-changes.md
@@ -373,3 +373,14 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
 | b9682–b9739 | `ggml/src/ggml-sycl/` | New conv2d, conv2d_dw, conv2d_transpose, conv3d SYCL ops; Q1_0 quantization support. Internal SYCL backend, no project changes required |
 | b9682–b9739 | `ggml/src/ggml-cuda/` | New `col2im_1d` CUDA op. Internal CUDA backend, no project changes required |
 | b9682–b9739 | `ggml/src/ggml-metal/` | ROPE_BACK Metal support; concat kernel extended to additional types. Internal Metal backend, no project changes required |
+| b9739–b9789 | `common/json-partial.{h,cpp}` (removed) + `common/peg-parser.{h,cpp}` + `common/chat.cpp` | The standalone partial-JSON parser was **deleted** (`json-partial.h`/`.cpp`, &minus;363 lines) and its incremental-JSON handling folded into the PEG parser (`peg-parser.cpp` +194/&minus;81). Partial JSON during streaming tool-call parsing is now produced by `peg-parser` instead of `common_json_parse`. Project never included `json-partial.h` &mdash; verified `grep -rn "json-partial\|common_json_parse" src/main/cpp src/test/cpp` &#x2192; zero matches. All consumers stay inside upstream-compiled `chat.cpp`. No project source changes required |
+| b9739–b9789 | `common/chat.h` + `common/chat.cpp` | Message-span types restructured: new `enum common_chat_role` (+ `common_chat_role_from_string`/`_to_string`); `common_chat_msg_span::role` and `common_chat_msg_delimiter::role` changed `std::string` &#x2192; `common_chat_role`; new container structs `common_chat_msg_spans` / `common_chat_msg_delimiters` (the latter with `tokenize()`/`split()`/`to_json()`); `common_chat_params::message_spans` (vector) &#x2192; `message_delimiters`; free function `common_chat_split_by_role()` **removed**, replaced by `common_chat_msg_delimiters_parse()`. `common_chat_msg_diff` (used by `test_server.cpp`) is **unchanged**. Project references none of the changed span/delimiter symbols &mdash; verified `grep -rn "message_spans\|common_chat_split_by_role\|common_chat_msg_span\|common_chat_msg_delimiter" src/main/cpp src/test/cpp` &#x2192; zero matches. Routing happens inside upstream-compiled `chat.cpp` / `server-*.cpp`. No project source changes required |
+| b9739–b9789 | `tools/server/server-task.h` + `server-context.cpp` + `server-common.{h,cpp}` | Context-checkpointing reworked from a precomputed offset to message spans: `task_params::n_before_user` (int32) **removed**, replaced by `task_params::message_spans` (`common_chat_msg_spans`); new `server_tokens::find_message_spans(const common_chat_msg_delimiters &)` helper. `test_server.cpp` asserts against `task_params::to_json()` but never references `n_before_user` &mdash; verified `grep -rn "n_before_user\|message_spans" src/test/cpp` &#x2192; zero matches, so it compiles and passes unchanged. Consumed inside upstream-compiled `server-context.cpp` linked into `jllama`. No project source changes required |
+| b9739–b9789 | `include/llama.h` | **New API** `llama_model_n_layer_nextn(const llama_model *)` &mdash; returns the number of NextN/MTP layers (additive; the surrounding accessor block was otherwise only column-realigned). Not called by project; could back a future introspection accessor. No project source changes required |
+| b9739–b9789 | `common/common.h` | `common_params::checkpoint_min_step` default raised `256` &#x2192; `8192` (minimum spacing between context checkpoints). Tuning default consumed inside upstream-compiled `server-context.cpp`; not surfaced by `ModelParameters`. No project source changes required |
+| b9739–b9789 | `common/arg.h` + `common/arg.cpp` + `common/download.h` | `common_params_handle_models()` gained a 3rd parameter &mdash; a `common_params_handle_models_params` struct (`{ common_download_callback*, bool preset_only }`) for router-mode preset-only downloads; `arg.h` now `#include`s `download.h`; new `common_download_opts::preset_only`. Project does not call `common_params_handle_models()` directly (arg parsing happens upstream) &mdash; `grep -rn common_params_handle_models src/` &#x2192; zero matches. No project source changes required |
+| b9739–b9789 | `common/arg.cpp` (**patch target**) | Upstream's Windows `common_params_parse` argv handling changed again: the unconditional `argc/argv = make_utf8_argv()` override (the original #24779 regression) became a **count-guard** `if (static_cast<int>(utf8.buf.size()) == argc) { argv = utf8.ptrs.data(); }`. That count-guard is exactly the variant this project already found breaks its Windows server-integration tests (argv length coincides with `java.exe`'s), so **`patches/0001-win32-arg-parse-embed-guard.patch` was refreshed** to drop the new form and keep `(void) utf8;` (caller's UTF-8 argv always used). The patch was re-verified to apply cleanly **and** reverse-cleanly (idempotency) against b9789 `common/arg.cpp` |
+| b9739–b9789 | `tools/mtmd/mtmd.h` + `tools/mtmd/clip.h` + `clip.cpp` + `mtmd.cpp` | **New feature** &mdash; multimodal model-load progress: new `mtmd_progress_callback` typedef + `progress_callback` / `progress_callback_user_data` fields on `mtmd_context_params` and `clip_context_params` (additive, appended to the structs; returning `false` aborts the load). Project does not aggregate-init either struct (`grep -rn mtmd_context_params src/` &#x2192; zero matches) so the new fields are harmless; could later feed a Java `LoadProgressCallback` for vision models. No project source changes required |
+| b9739–b9789 | `tools/server/server-models.{h,cpp}` + `server-context.h` | Multi-model router refactor: model downloading moved into a dedicated child-process mode (`enum server_child_mode`, `server_models::load(name, load_options)`, `server_child::run_download()`; old `server_models::download()` removed); `SERVER_STATE_DOWNLOADING` re-enabled in `server_state`. Project links `server-models.cpp` but does not drive the router (`grep -rn "server_models\|SERVER_CHILD_MODE" src/` &#x2192; zero matches). Compiles into `jllama` unchanged. No project source changes required |
+| b9739–b9789 | `ggml/src/ggml-{hexagon,vulkan,sycl,opencl,webgpu,cuda}/` + shaders | Backend-internal work only: Hexagon HTP matmul kernels re-tiled (`hmx-matmul-ops.c` &#x2192; `hmx-mm-kernels-tiled.h`); Vulkan gains a `conv3d_mm` shader + `get_rows_back` and folds the elementwise unary shaders (`clamp`/`cos`/`sin`/`sqrt`/`square`/`leaky_relu`.comp removed) into `unary.comp`; SYCL element-wise / conv3d additions; OpenCL Adreno norm/gemv tweaks; WebGPU `mul_mat_vec` refactor. No API surface visible to `jllama.cpp`; the OpenCL set only affects the `opencl-android-aarch64` classifier. No project source changes required |
+| b9739–b9789 | upstream build / verification | Local build with `GIT_TAG b9789` verified clean on Linux x86_64 (GCC 13.3; sources were pre-staged from release tarballs + the patch hand-applied because this sandbox blocks `github.com` git-clone, so `FetchContent`'s git path and `PATCH_COMMAND` could not run &mdash; the published CI pipeline uses the normal git FetchContent path). `cmake -B build -DBUILD_TESTING=ON` configures cleanly (the OuteTTS build-time extraction **and** the refreshed Windows patch both pass their fail-loud anchor checks against b9789), `cmake --build build --config Release -j$(nproc)` links `libjllama.so` + `jllama_test` with zero warnings on any project translation unit, and `ctest --test-dir build --output-on-failure` reports **454/454 tests passing**. Every upstream breaking change in this range is absorbed inside upstream-compiled translation units; no project C++ source edits were required for the version bump itself |
diff --git a/patches/0001-win32-arg-parse-embed-guard.patch b/patches/0001-win32-arg-parse-embed-guard.patch
@@ -1,12 +1,12 @@
 diff --git a/common/arg.cpp b/common/arg.cpp
 --- a/common/arg.cpp
 +++ b/common/arg.cpp
-@@ -924,10 +924,17 @@ bool common_params_parse(int argc, char ** argv, common_params & params, llama_e
+@@ -933,10 +933,18 @@ bool common_params_parse(int argc, char ** argv, common_params & params, llama_e
  bool common_params_parse(int argc, char ** argv, common_params & params, llama_example ex, void(*print_usage)(int, char **)) {
  #ifdef _WIN32
      auto utf8 = make_utf8_argv();
--    if (!utf8.ptrs.empty()) {
--        argc = static_cast<int>(utf8.buf.size());
+-    // repair argv only when it matches the process command line
+-    if (static_cast<int>(utf8.buf.size()) == argc) {
 -        argv = utf8.ptrs.data();
 -    }
 +    // java-llama.cpp patch (PR #248): upstream (llama.cpp #24779) replaced the caller's argv with
@@ -15,10 +15,11 @@ diff --git a/common/arg.cpp b/common/arg.cpp
 +    // UTF-8 argv (GetStringUTFChars), and adopting GetCommandLineW discarded it -> common_params_parse
 +    // parsed java.exe's command line and failed with "Failed to parse model parameters". We keep the
 +    // make_utf8_argv() call (so it stays referenced -> -Wunused-function-clean) but do NOT adopt its
-+    // result, so the caller's already-UTF-8 argv is always used. This is deterministic: a count-guard
-+    // (only override when the re-derived arg count equals argc) collided on the server-integration
-+    // tests whose argv length happened to equal java.exe's, so they kept failing. The upstream PR
-+    // can instead expose an opt-out / a common_params_parse_argv that preserves the standalone fix.
++    // result, so the caller's already-UTF-8 argv is always used. This is deterministic: upstream's
++    // count-guard form here (only override when the re-derived arg count equals argc) collided on the
++    // server-integration tests whose argv length happened to equal java.exe's, so they kept failing.
++    // The upstream PR can instead expose an opt-out / a common_params_parse_argv that preserves the
++    // standalone tools' UTF-8 fix.
 +    (void) utf8;
  #endif
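
Reviewer note: the patch comment above argues that the count-guard is not deterministic for an embedded caller. A minimal, portable sketch of that reasoning follows. It does not call any Windows API; `arg_list`, `pick_count_guard`, and `pick_patched` are hypothetical names used only for illustration, and the process command line that `make_utf8_argv()` would re-derive is modelled as a plain vector of strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an argv, so the sketch runs on any platform.
using arg_list = std::vector<std::string>;

// Count-guard form: adopt the process command line whenever its length
// equals the caller's argc. The contents are never compared, only the count.
arg_list pick_count_guard(const arg_list & caller, const arg_list & process) {
    return process.size() == caller.size() ? process : caller;
}

// Patched form: the process command line is computed but never adopted,
// so the caller's argv is used regardless of any length coincidence.
arg_list pick_patched(const arg_list & caller, const arg_list & /*process*/) {
    return caller;
}
```

With illustrative values such as a process command line of `{"java.exe", "-jar", "tests.jar"}` and a caller argv of `{"jllama", "--model", "model.gguf"}`, the lengths coincide, so `pick_count_guard` returns the process command line and the caller's `--model` argument is lost, while `pick_patched` returns the caller's argv unchanged. When the lengths differ, both forms return the caller's argv, which is why the guard only fails on some tests.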