Upgrade llama.cpp from b9621 to b9637

claude · claude · commit acf7052311d3 · 2026-06-14T20:30:12.000Z
No breaking API changes: none of the project's include surface (common.h, chat.h, speculative.h, mtmd.h, llama-cpp.h, arg.h, llama.h, download.h) is touched. The upgrade is purely additive.

New capabilities gained automatically (no project code needed):
- Cohere2 MoE ("North Code") model arch (MoE + MTP/NextN) with a dedicated chat parser, auto-detected via the existing specialized-template path.
- Jinja chat-template engine fixes (count/d/e filter aliases, negative-step slicing, empty-separator split guard, empty-old_str replace).

Vulkan unary-shader consolidation + EXPM1, WebUI gzip serving, and CLI/Docker/CI/Python-converter changes are all in TUs the project does not compile or ship.

Verified: CMake configures cleanly against b9637 (ggml 0.15.1, CPU backend).

docs/history/llama-cpp-breaking-changes.md gains the b9621-b9637 rows.

https://claude.ai/code/session_01EQJCrQGmxCBf8WTCDuFE3X
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9621**
+Current llama.cpp pinned version: **b9637**
 
 ## Upgrading CUDA Version
 
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -139,7 +139,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9621
+	GIT_TAG        b9637
 )
 FetchContent_MakeAvailable(llama.cpp)
 
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 **Build:**  
 ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)  
 ![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20Android-lightgrey)  
-[![llama.cpp b9621](https://img.shields.io/badge/llama.cpp-%23b9621-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9621)  
+[![llama.cpp b9637](https://img.shields.io/badge/llama.cpp-%23b9637-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9637)  
 [![JPMS](https://img.shields.io/badge/JPMS-modular%20JAR-25A162)](https://openjdk.org/projects/jigsaw/)  
 ![JUnit](https://img.shields.io/badge/tested%20with-JUnit6-25A162)  
 [![JSpecify](https://img.shields.io/badge/JSpecify-1.0.0%20%40NullMarked-25A162)](https://jspecify.dev)  
diff --git a/docs/history/llama-cpp-breaking-changes.md b/docs/history/llama-cpp-breaking-changes.md
@@ -346,3 +346,9 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
 | ~b9555–b9621 | `ggml/src/ggml-vulkan/` + Vulkan shaders | New `VK_VALVE_shader_mixed_float_dot_product` extension support for F16→F32 fused dot products (`dot2_f16`) in flash attention and GEMM matmul. Internal Vulkan backend, no project changes required |
 | ~b9555–b9621 | `ggml/src/ggml-opencl/` + OpenCL kernels | New Q5_0 and Q5_1 GEMM/GEMV noshuffle kernels for Qualcomm Adreno GPUs. Internal OpenCL backend (affects `opencl-android-aarch64` classifier build only); no project source changes required |
 | ~b9555–b9621 | `ggml/src/ggml-cuda/ssm-scan.cu` | Added `__syncthreads()` before the final reduction stage to prevent shared-memory race conditions on multi-warp SSM scan. Bug fix, internal CUDA backend, no project changes required |
+| b9621–b9637 | `common/chat.cpp` | New Cohere2 MoE ("North Code") chat parser `common_chat_params_init_cohere2moe` + auto-detection (template containing `<\|START_TEXT\|>` and `<\|START_ACTION\|>`). Purely additive — compiled in the `chat.cpp` TU and reached through the existing specialized-template path, so the project's `oaicompat_chat_params_parse` picks it up automatically. No project source changes required. **New feature:** Cohere2 MoE reasoning + JSON tool-call chat support |
+| b9621–b9637 | `common/jinja/runtime.cpp`, `common/jinja/value.cpp` | Jinja chat-template engine fixes: filter aliases `count`→`length`, `d`→`default`, `e`→`escape`; negative-step slice start/stop defaults; `split` raises on empty separator; `replace('', x)` now expands between every char. Compiled into `common`; improves chat-template compatibility automatically. No project source changes required |
+| b9621–b9637 | `src/llama-arch.{h,cpp}`, `src/models/cohere2moe.cpp` (new), `src/models/models.h`, `src/llama-model.cpp`, `src/llama-model-saver.cpp`, `src/llama-vocab.cpp` | New `LLM_ARCH_COHERE2MOE` architecture (MoE + MTP/NextN) with `llama_model_cohere2moe`; `cohere2moe` tokenizer pre-type (maps to `LLAMA_VOCAB_PRE_TYPE_TINY_AYA`); Cohere2 dense path gains `ffn_*_s` NVFP4 scale tensors; tied-NVFP4-`output` assert relaxed to allow sidecar LM-head scales. Additive enum/struct internal to libllama; the project includes `llama.h`, not `llama-arch.h`/`models.h`, and switches on no arch enum. No project source changes required. **New feature:** loads North-Mini-Code GGUFs |
+| b9621–b9637 | `ggml/src/ggml-vulkan/` + shaders | Unary shaders consolidated into one templated `unary.comp`; new `EXPM1` Vulkan op; GLU push-constants reworked (per-dim strides + misalign offsets); fastdiv `L` values byte-packed to stay under the 128B push-constant limit. Internal Vulkan backend — the project builds CPU/CUDA/Metal/OpenCL only, never Vulkan. No project changes required |
+| b9621–b9637 | `tools/server/server-http.cpp`, `tools/ui/`, `scripts/ui-assets.cmake` | Optional gzip-compressed WebUI asset serving (`LLAMA_UI_GZIP`, `llama_ui_use_gzip()`). The project compiles `server-context/queue/task/models` but not `server-http.cpp` or `tools/ui`, so the HTTP/WebUI layer is absent from `jllama`. No project changes required |
+| b9621–b9637 | `tools/cli/cli.cpp`, `.devops/*.Dockerfile`, `.github/`, `conversion/`, `convert_hf_to_gguf_update.py`, `gguf-py/`, `models/templates/Cohere2MoE.jinja`, `docs/`, `tests/` | CLI preserved-token wiring, Docker image `docker.io/` prefixes, CI labeler/release tweaks, Python GGUF converters, the new model template asset, doc typos, and upstream tests. None are compiled into `jllama` or shipped by the project. No project changes required |
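
The Jinja row above lists several template-engine fixes. As a rough illustration only, the sketch below shows the plain-Python behaviours those fixes appear to mirror. It assumes the llama.cpp Jinja engine aims for Python/Jinja2-compatible semantics; it does not call llama.cpp, and the `FILTER_ALIASES` and `resolve_filter` names are hypothetical. Python's `replace` also inserts at both ends of the string, which the row does not confirm for llama.cpp.

```python
# Plain-Python analogues of the Jinja fixes named in the changelog row.
# Assumption: the llama.cpp engine targets Python/Jinja2-like semantics here.

# Filter aliases as listed in the row (alias -> canonical filter name).
# FILTER_ALIASES / resolve_filter are hypothetical names for this sketch.
FILTER_ALIASES = {"count": "length", "d": "default", "e": "escape"}


def resolve_filter(name: str) -> str:
    """Return the canonical filter name, following an alias if one exists."""
    return FILTER_ALIASES.get(name, name)


text = "abc"

# Negative-step slice: omitted start/stop default to the far ends,
# so the whole sequence is walked backwards.
reversed_text = text[::-1]

# replace() with an empty old string inserts the new string around
# every character (Python includes both ends of the string).
expanded = text.replace("", "-")

# split() with an empty separator is an error rather than a silent no-op.
try:
    text.split("")
    split_error = None
except ValueError as exc:
    split_error = str(exc)
```

Templates that previously failed on `|count`, `|d`, or `|e`, or on reversed slices, are the ones most likely to benefit from the upgrade without any project-side change.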
