Upgrade llama.cpp from b9106 to b9134

claude · claude · commit 3b019855b817 · 2026-05-13T22:02:46.000Z
Breaking changes handled:
- Remove ModelParameters.setCtxSizeDraft(): the --spec-draft-ctx-size CLI flag was removed in b9134 and now throws std::invalid_argument at parse time
- Fix SlotParamsToJson test: task_params::to_json() renamed field "speculative.type" → "speculative.types" (now serialises a vector)

Version bumps: CMakeLists.txt GIT_TAG, README.md badge/link, CLAUDE.md pinned version + new breaking-change table rows for b9106→b9134.

https://claude.ai/code/session_01GYjU7CXHB6QLbxrbZcQWCn
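
Because the removed flags now fail at parse time, callers that build argument lists by hand may want to check them before handing them to the native layer. The following is a minimal sketch of such a check, not code from this commit: the class `RemovedFlagCheck` and the method `findRemovedFlags` are hypothetical names, and the flag list is copied from the breaking-change table rows this commit adds to CLAUDE.md.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: not part of this project or of llama.cpp.
public class RemovedFlagCheck {

    // Flags recorded as removed between b9106 and b9134 in the CLAUDE.md table rows.
    private static final Set<String> REMOVED_FLAGS = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList(
                    "--spec-draft-ctx-size", "-cd", "--ctx-size-draft",
                    "--spec-draft-replace", "--spec-replace")));

    /** Returns every removed flag that occurs in args, in order of appearance. */
    public static List<String> findRemovedFlags(List<String> args) {
        List<String> hits = new ArrayList<>();
        for (String arg : args) {
            if (REMOVED_FLAGS.contains(arg)) {
                hits.add(arg);
            }
        }
        return hits;
    }

    public static void main(String[] argv) {
        List<String> args = Arrays.asList(
                "--ctx-size", "4096", "--spec-draft-ctx-size", "2048");
        // Prints [--spec-draft-ctx-size]
        System.out.println(findRemovedFlags(args));
    }
}
```

The sketch only reports offending flags and does not strip them, since the number of values each removed flag consumed is not stated in this commit.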
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9106**
+Current llama.cpp pinned version: **b9134**
 
 ## Upgrading CUDA Version
 
@@ -253,6 +253,12 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
 | ~b9103–b9106 | `ggml/src/ggml-vulkan/ggml-vulkan.cpp` + Vulkan shaders | Vulkan flash attention refactored: `pipeline_flash_attn_f32_f16` changed from a per-type array of maps to a single map; mixed K/V quant types (e.g. Q4_0 K + F16 V) now supported on all Vulkan FA paths (scalar, cm1, cm2) rather than coopmat2 only; per-type SPIR-V variants replaced by two generic modules (`flash_attn_f32_f16` and `flash_attn_f32_f16_int8`) that select K/V type at runtime via `FaTypeK`/`FaTypeV` spec constants; new `flash_attn_dequant.glsl` contains aliased SSBO views and an uber `dequantize4()` switch; the K/V type mismatch guard removed from `ggml_backend_vk_device_supports_op`; internal Vulkan backend refactor, no project changes required |
 | ~b9103–b9106 | `ggml/src/ggml-cuda/argsort.cu` | Added `#include <cuda/iterator>` for CCCL ≥ 3.1 strided-iterator path; internal CUDA backend, no project changes required |
 | ~b9103–b9106 | `convert_hf_to_gguf.py` | Mistral Medium 3.5 mmproj support: `n_embd_text` now reads `"dim"` key instead of `"hidden_dim"`; negative `img_break_tok_id` placeholders resolved from `tekken.json` or `tokenizer.json`; conversion tool only, no project changes required |
+| ~b9106–b9134 | `common/arg.cpp` | CLI option `--spec-draft-ctx-size` / `-cd` / `--ctx-size-draft` REMOVED — throws `std::invalid_argument` at parse time; `ModelParameters.setCtxSizeDraft()` removed; no replacement (context size now managed internally by speculative engine) |
+| ~b9106–b9134 | `common/arg.cpp` | CLI option `--spec-draft-replace` / `--spec-replace` REMOVED — throws `std::invalid_argument` at parse time; no corresponding Java method existed |
+| ~b9106–b9134 | `common/speculative.h` | Full redesign: `common_speculative_type` enum values renamed `DRAFT`&#x2192;`DRAFT_SIMPLE`, `EAGLE3`&#x2192;`DRAFT_EAGLE3`; `common_params_speculative.type` (single enum) &#x2192; `.types` (vector); `common_speculative_n_max()` / `common_speculative_n_min()` REMOVED; new `common_speculative_init(params, n_seq)` no longer takes ctx; new `common_speculative_begin(spec, seq_id, prompt)`, `common_speculative_draft(spec)`, `common_speculative_accept(spec, seq_id, n)`, `common_speculative_process(spec, batch)` signatures; `common_speculative_draft_params` struct added; server sources compiled directly, no project JNI changes required |
+| ~b9106–b9134 | `common/common.h` | New `common_prompt_checkpoint` struct (contains `data_tgt` + `data_dft`) replaces the old `server_prompt_checkpoint` in `server-task.h`; compiled from upstream server sources, no project JNI changes required |
+| ~b9106–b9134 | `tools/server/server-task.cpp` | `task_params::to_json()` renamed field `"speculative.type"` &#x2192; `"speculative.types"` (now serialises the vector); test `SlotParamsToJson.SpeculativeFields_Present` updated accordingly |
+| ~b9106–b9134 | `include/llama.h` | New `LLAMA_STATE_SEQ_FLAGS_NONE = 0` macro added; additive, no project changes required |
 
 ## Build Commands
 
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -97,7 +97,7 @@ set(GGML_AVX512  OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9106
+	GIT_TAG        b9134
 )
 FetchContent_MakeAvailable(llama.cpp)
 
diff --git a/README.md b/README.md
@@ -1,5 +1,5 @@
 ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)
-[![llama.cpp b9106](https://img.shields.io/badge/llama.cpp-%23b9106-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9106)
+[![llama.cpp b9134](https://img.shields.io/badge/llama.cpp-%23b9134-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9134)
 [![Maven Central](https://img.shields.io/maven-central/v/net.ladenthin/llama)](https://central.sonatype.com/artifact/net.ladenthin/llama)
 [![Snapshot](https://img.shields.io/badge/snapshot-latest-informational)](https://central.sonatype.com/repository/maven-snapshots/net/ladenthin/llama/)
 
diff --git a/src/main/java/net/ladenthin/llama/ModelParameters.java b/src/main/java/net/ladenthin/llama/ModelParameters.java
@@ -1263,17 +1263,6 @@ public ModelParameters setDraftPMin(float draftPMin) {
         return this;
     }
 
-    /**
-     * Set the size of the prompt context for the draft model.
-     *
-     * @param ctxSizeDraft the prompt context size for the draft model
-     * @return this builder
-     */
-    public ModelParameters setCtxSizeDraft(int ctxSizeDraft) {
-        parameters.put("--spec-draft-ctx-size", String.valueOf(ctxSizeDraft));
-        return this;
-    }
-
     /**
      * Set the comma-separated list of devices to use for offloading the draft model.
      *
diff --git a/src/test/cpp/test_server.cpp b/src/test/cpp/test_server.cpp
@@ -243,9 +243,8 @@ TEST(SlotParamsToJson, SpeculativeFields_Present) {
     task_params p;
     const json j = p.to_json();
 
-    // b8962: only speculative.type is serialised; n_max/n_min/p_min are
-    // input-only (consumed by params_from_json_cmpl, not emitted by to_json)
-    EXPECT_TRUE(j.contains("speculative.type"));
+    // b9134: field renamed speculative.type → speculative.types (now a vector)
+    EXPECT_TRUE(j.contains("speculative.types"));
 }
 
 TEST(SlotParamsToJson, GrammarTriggers_IsArrayByDefault) {
