Upgrade llama.cpp from b9264 to b9279

claude · claude · commit cf2d2b346a58 · 2026-05-22T06:09:52.000Z
All upstream changes in this range are additive or internal to llama.cpp.
Of the upstream files this project compiles, only two are touched
(server-context.cpp, server-models.cpp): new slot-info JSON fields, a
teardown reorder in server_context_impl::destroy(), and a LLAMA_APP_CMD
env-var hook that is a no-op for this project.
No project source changes required.

Verified: cmake build clean, all 417 C++ tests pass.
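
Note on the server-context.cpp teardown change: upstream now resets the draft-side state before `llama_init.reset()`, because a dependent that holds a back-reference must be released while its target is still alive. The sketch below illustrates that ordering only. Every type and function name in it (`TargetContext`, `DraftState`, `Holder`, `run_teardown`) is hypothetical and does not exist in llama.cpp.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the target context that owns shared resources.
struct TargetContext {
    std::vector<std::string> *log;
    int released_dependents = 0;
    explicit TargetContext(std::vector<std::string> *l) : log(l) {}
    ~TargetContext() { log->push_back("target destroyed"); }
};

// Hypothetical stand-in for draft-side state holding a non-owning
// back-reference into the target.
struct DraftState {
    TargetContext *target;
    std::vector<std::string> *log;
    DraftState(TargetContext *t, std::vector<std::string> *l) : target(t), log(l) {}
    ~DraftState() {
        assert(target != nullptr);
        // Touches the target during teardown, so the target must still be alive.
        target->released_dependents++;
        log->push_back("draft released");
    }
};

struct Holder {
    std::unique_ptr<TargetContext> target;
    std::unique_ptr<DraftState>    draft;

    // Release dependents first, then the object they point into.
    void destroy() {
        draft.reset();
        target.reset();
    }
};

// Builds both objects, tears them down, and returns the recorded order.
std::vector<std::string> run_teardown() {
    std::vector<std::string> log;
    Holder h;
    h.target = std::make_unique<TargetContext>(&log);
    h.draft  = std::make_unique<DraftState>(h.target.get(), &log);
    h.destroy();
    return log;
}
```

Swapping the two `reset()` calls in `destroy()` would make the `DraftState` destructor write through a dangling pointer, which is the class of use-after-free the upstream reorder is described as avoiding.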
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9264**
+Current llama.cpp pinned version: **b9279**
 
 ## Upgrading CUDA Version
 
@@ -345,6 +345,18 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
 | ~b9245–b9264 | `conversion/hunyuan.py`, `gguf-py/gguf/constants.py`, `gguf-py/gguf/tensor_mapping.py` | HunyuanOCR / HunyuanVL unified in conversion: `VisionProjectorType.HUNYUANOCR` removed; `HunYuanVLForConditionalGeneration` registers a single `HunyuanVLVisionModel` + `HunyuanVLTextModel`; `vit.perceive.*` tensor mappings now only mention `HunyuanVL`. Python tooling, not compiled by project |
 | ~b9245–b9264 | `CMakeLists.txt` (upstream) | New `LLAMA_BUILD_APP` option (default OFF); deprecation shims for `LLAMA_BUILD_WEBUI`/`LLAMA_USE_PREBUILT_WEBUI` → `LLAMA_BUILD_UI`/`LLAMA_USE_PREBUILT_UI` preserved. Project's `set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE)` still works unchanged |
 | ~b9245–b9264 | `.devops/*.Dockerfile`, `.github/workflows/build-and-test-snapdragon.yml`, `scripts/snapdragon/`, `docs/backend/snapdragon/`, `tools/cli/README.md`, `tools/server/README.md`, `tools/mtmd/tests/` | Docker images add `conversion/` dir; snapdragon toolchain bumped v0.3 → v0.6 with `+dotprod+i8mm`; mtmd test rewritten to use CER/chrF metrics; doc-only updates. Not compiled by project |
+| ~b9264–b9279 | `tools/server/server-context.cpp` | Slot-info JSON adds three additive fields (`n_prompt_tokens`, `n_prompt_tokens_processed`, `n_prompt_tokens_cache`) on each in-flight task; `server_context_impl::destroy()` now resets `spec` / `ctx_dft` / `model_dft` BEFORE `llama_init.reset()` to avoid use-after-free when a draft model holds back-references into the target context. Compiled directly into jllama from upstream — no project source changes required |
+| ~b9264–b9279 | `tools/server/server-models.cpp` | Adds `#include <cstdlib>` and a `LLAMA_APP_CMD` env-var lookup in `server_model_meta::update_args()` to re-inject the unified-binary subcommand into router-spawned child argv. Env var is only set by the new `llama-app` binary (which this project does not build), so the lookup harmlessly returns null and the code path is a no-op. Compiled upstream-as-is, no project changes |
+| ~b9264–b9279 | `src/llama-vocab.cpp` | New `hybriddna` BPE tokenizer model (DNA k-mer tokenization with `<dna>…</dna>` tag handling, k=6, OOV fallback) registered as a BPE variant; reached only when GGUF metadata declares `tokenizer.model = "hybriddna"`. Adds a virtual destructor + virtual `tokenize()` to `llm_tokenizer_bpe_session` and a `llm_tokenizer_hybriddna_session` subclass; existing BPE callers unchanged. Additive, no project changes |
+| ~b9264–b9279 | `src/llama-graph.cpp` | `llm_graph_input_attn_kv_iswa::set_input()` / `can_reuse()` now guard the base and SWA tensor accesses behind `if (self_k_idxs && self_k_idxs->buffer)` / `if (self_k_idxs_swa && self_k_idxs_swa->buffer)`. Fixes crashes on models with only-SWA or only-non-SWA attention layers. Internal, no project impact |
+| ~b9264–b9279 | `src/models/qwen35.cpp` + `src/models/qwen35moe.cpp` | MTP draft sub-graph now builds an `inp_out_ids` input and applies `ggml_get_rows(cur, inp_out_ids)` just before the head norm, so only the requested output rows are projected. Bug fix for MTP draft path; internal, no project changes |
+| ~b9264–b9279 | `ggml/src/ggml-backend.cpp` | `ggml_backend_tensor_get_2d()` fast-path condition fixed: now checks `iface.get_tensor_2d == NULL` (was incorrectly checking `set_tensor_2d`), so multi-copy gets correctly fall back to the per-copy loop when the backend lacks `get_tensor_2d`. Bug fix, no project changes |
+| ~b9264–b9279 | `ggml/src/ggml-vulkan/` (`ggml-vulkan.cpp`, new `vulkan-shaders/snake.comp`, `vulkan-shaders-gen.cpp`) | New Vulkan Snake activation fusion: detects the 5-op chain `MUL → SIN → SQR → MUL → ADD` (matching CUDA b9094 introduction) and dispatches a single fused `snake_{f32,f16,bf16}` kernel `y = x + sin(a*x)^2 * inv_b`. New `ggml_vk_can_fuse_snake()` validates contiguity, 2D shape, and broadcast operands `[1, C, 1, 1]`. Internal Vulkan backend, no project changes |
+| ~b9264–b9279 | `ggml/src/ggml-metal/ggml-metal-ops.cpp` + `ggml-metal.metal` | `kernel_concat` / `kernel_set` now batch multiple small rows into one threadgroup (`nrptg = min(256/ne0, ne1)`, capped at 256 threads/group) to improve small-row throughput; `kernel_concat` gains an early-return bounds check. Internal Metal backend, no project changes |
+| ~b9264–b9279 | `ggml/src/ggml-hexagon/` (`ggml-hexagon.cpp`, `htp/ssm-conv.c`, `htp/rope-ops.c`) | SSM_CONV HVX kernel rewritten with VTCM-staged 32×32 fp32 in-register transpose and per-thread tiling (1 MiB VTCM budget); strictly-contiguous gate replaced with byte-stride checks (`nb[0]==sizeof(float)` and `nb[1]==ne[0]*sizeof(float)`); `rope_cache_init` / `mrope_cache_init` marked `__attribute__((noinline))` to reduce code-bloat on Hexagon. Internal Qualcomm DSP backend, no project changes |
+| ~b9264–b9279 | `examples/save-load-state/` removed, `tests/test-save-load-state.cpp` added; `tools/{batched-bench,fit-params,quantize,perplexity}/CMakeLists.txt` | The `llama-save-load-state` example binary was removed and re-homed as a CTest target; the four remaining standalone tools were each split into a `*-impl` static library + a thin `main.cpp` wrapper (mirroring the b9245 split of cli/completion/llama-bench/server), with the entry-point renamed to `llama_batched_bench` / `llama_fit_params` / `llama_quantize` / `llama_perplexity` to satisfy `-Wmissing-declarations`. Project does not compile any of these `.cpp` files (only `server-context.cpp`, `server-queue.cpp`, `server-task.cpp`, `server-models.cpp` — see `CMakeLists.txt`), so no impact |
+| ~b9264–b9279 | `app/` (`CMakeLists.txt`, `llama.cpp`) | `llama-app` unified binary gains four new subcommands (`batched-bench`, `fit-params`, `quantize`, `perplexity`) and sets `LLAMA_APP_CMD` in the env before dispatching so that the router can re-inject the subcommand into spawned child argv. Guarded by `LLAMA_BUILD_APP=OFF` default — project doesn't enable it, no impact |
+| ~b9264–b9279 | `conversion/base.py` + `conversion/llama.py` | New `_set_vocab_hybriddna()` Python helper that emits a `gpt2`-style BPE vocab tagged as `tokenizer.model = "hybriddna"`; `LlamaModel.set_vocab()` dispatches to it when `tokenizer_config.json` declares `"tokenizer_class": "HybridDNATokenizer"`; `add_prefix_space` handling moved earlier in the same method. Conversion tooling only, not compiled by project |
 
 ## Build Commands
 
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -108,7 +108,7 @@ set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9264
+	GIT_TAG        b9279
 )
 FetchContent_MakeAvailable(llama.cpp)
 
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 **Build:**  
 ![Java 11+](https://img.shields.io/badge/Java-11%2B-informational)  
 ![JUnit](https://img.shields.io/badge/tested%20with-JUnit4-yellow)  
-[![llama.cpp b9264](https://img.shields.io/badge/llama.cpp-%23b9264-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9264)  
+[![llama.cpp b9279](https://img.shields.io/badge/llama.cpp-%23b9279-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9279)  
 [![Publish](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/publish.yml)  
 [![CodeQL](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml/badge.svg)](https://github.com/bernardladenthin/java-llama.cpp/actions/workflows/codeql.yml)  
 
