Merge pull request #134 from bernardladenthin/claude/gallant-bell-JKvSV

bernardladenthin · web-flow · commit 1c49078d1b59 · 2026-05-14T10:56:58.000+02:00
Upgrade llama.cpp from b9106 to b9134
diff --git a/.gitignore b/.gitignore
@@ -33,7 +33,7 @@ hs_err_pid*
 replay_pid*
 
 models/*.gguf
-src/main/cpp/de_kherud_llama_*.h
+src/main/cpp/net_ladenthin_llama_*.h
 src/main/resources_cuda_linux/
 src/main/resources/**/*.so
 src/main/resources/**/*.dylib
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9106**
+Current llama.cpp pinned version: **b9145**
 
 ## Upgrading CUDA Version
 
@@ -253,6 +253,18 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
 | ~b9103–b9106 | `ggml/src/ggml-vulkan/ggml-vulkan.cpp` + Vulkan shaders | Vulkan flash attention refactored: `pipeline_flash_attn_f32_f16` changed from a per-type array of maps to a single map; mixed K/V quant types (e.g. Q4_0 K + F16 V) now supported on all Vulkan FA paths (scalar, cm1, cm2) rather than coopmat2 only; per-type SPIR-V variants replaced by two generic modules (`flash_attn_f32_f16` and `flash_attn_f32_f16_int8`) that select K/V type at runtime via `FaTypeK`/`FaTypeV` spec constants; new `flash_attn_dequant.glsl` contains aliased SSBO views and an uber `dequantize4()` switch; the K/V type mismatch guard removed from `ggml_backend_vk_device_supports_op`; internal Vulkan backend refactor, no project changes required |
 | ~b9103–b9106 | `ggml/src/ggml-cuda/argsort.cu` | Added `#include <cuda/iterator>` for CCCL ≥ 3.1 strided-iterator path; internal CUDA backend, no project changes required |
 | ~b9103–b9106 | `convert_hf_to_gguf.py` | Mistral Medium 3.5 mmproj support: `n_embd_text` now reads `"dim"` key instead of `"hidden_dim"`; negative `img_break_tok_id` placeholders resolved from `tekken.json` or `tokenizer.json`; conversion tool only, no project changes required |
+| ~b9106–b9134 | `common/arg.cpp` | CLI option `--spec-draft-ctx-size` / `-cd` / `--ctx-size-draft` REMOVED — throws `std::invalid_argument` at parse time; `ModelParameters.setCtxSizeDraft()` removed; no replacement (context size now managed internally by speculative engine) |
+| ~b9106–b9134 | `common/arg.cpp` | CLI option `--spec-draft-replace` / `--spec-replace` REMOVED — throws `std::invalid_argument` at parse time; no corresponding Java method existed |
+| ~b9106–b9134 | `common/speculative.h` | Full redesign: `common_speculative_type` enum values renamed `DRAFT`→`DRAFT_SIMPLE`, `EAGLE3`→`DRAFT_EAGLE3`; `common_params_speculative.type` (single enum) → `.types` (vector); `common_speculative_n_max()` / `common_speculative_n_min()` REMOVED; new `common_speculative_init(params, n_seq)` no longer takes ctx; new `common_speculative_begin(spec, seq_id, prompt)`, `common_speculative_draft(spec)`, `common_speculative_accept(spec, seq_id, n)`, `common_speculative_process(spec, batch)` signatures; `common_speculative_draft_params` struct added; server sources compiled directly, no project JNI changes required |
+| ~b9106–b9134 | `common/common.h` | New `common_prompt_checkpoint` struct (contains `data_tgt` + `data_dft`) replaces the old `server_prompt_checkpoint` in `server-task.h`; compiled from upstream server sources, no project JNI changes required |
+| ~b9106–b9134 | `tools/server/server-task.cpp` | `task_params::to_json()` renamed field `"speculative.type"` → `"speculative.types"` (now serialises the vector); test `SlotParamsToJson.SpeculativeFields_Present` updated accordingly |
+| ~b9106–b9134 | `include/llama.h` | New `LLAMA_STATE_SEQ_FLAGS_NONE = 0` macro added; additive, no project changes required |
+| ~b9134–b9145 | `tools/server/server-common.cpp` | New `continue_final_message` boolean request field in `oaicompat_chat_params_parse`; vLLM/transformers-compatible alias for the prefill-assistant heuristic — when `true`, the last assistant message is extended without appending an end-of-turn token; mutually exclusive with `add_generation_prompt=true` (throws 400); compiled from upstream server sources; `InferenceParameters.setContinueFinalMessage(boolean)` added |
+| ~b9134–b9145 | `ggml/src/ggml-sycl/` | Level Zero API integration for SYCL device memory allocation (`GGML_SYCL_SUPPORT_LEVEL_ZERO` build option, `GGML_SYCL_ENABLE_LEVEL_ZERO` runtime env); reduces system RAM usage on Intel dGPUs; internal SYCL backend, no project changes required |
+| ~b9134–b9145 | `ggml/src/ggml-opencl/` | Q5_0 and Q5_1 MoE GEMM/GEMV kernels added for Adreno (Qualcomm) GPUs; internal OpenCL backend, no project changes required |
+| ~b9134–b9145 | `ggml/src/ggml-cuda/allreduce.cu` | AllReduce accumulation now routed through `float` intermediate for precision (avoids BF16 truncation); internal CUDA backend, no project changes required |
+| ~b9134–b9145 | `ggml/src/ggml-hexagon/` | `GGML_UNARY_OP_TANH` added to Hexagon HTP backend; internal DSP backend, no project changes required |
+| ~b9134–b9145 | `ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp` | `use_subgroup_matrix` condition now also checks `sg_mat_k > 0 && sg_mat_n > 0` and alignment; prevents crash on devices reporting subgroup matrix support with zero k/n; internal WebGPU backend, no project changes required |
 
 ## Build Commands
 
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -107,7 +107,7 @@ set(GGML_AVX512  OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9106
+	GIT_TAG        b9145
 )
 FetchContent_MakeAvailable(llama.cpp)
 
diff --git a/README.md b/README.md
@@ -1,5 +1,5 @@
 ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)
-[![llama.cpp b9106](https://img.shields.io/badge/llama.cpp-%23b9106-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9106)
+[![llama.cpp b9145](https://img.shields.io/badge/llama.cpp-%23b9145-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9145)
 [![Maven Central](https://img.shields.io/maven-central/v/net.ladenthin/llama)](https://central.sonatype.com/artifact/net.ladenthin/llama)
 [![Snapshot](https://img.shields.io/badge/snapshot-latest-informational)](https://central.sonatype.com/repository/maven-snapshots/net/ladenthin/llama/)
 
diff --git a/src/main/java/net/ladenthin/llama/InferenceParameters.java b/src/main/java/net/ladenthin/llama/InferenceParameters.java
@@ -56,6 +56,7 @@ public final class InferenceParameters extends JsonParameters {
 	private static final String PARAM_TOP_N_SIGMA = "top_n_sigma";
 	private static final String PARAM_REASONING_FORMAT = "reasoning_format";
 	private static final String PARAM_REASONING_BUDGET_TOKENS = "reasoning_budget_tokens";
+	private static final String PARAM_CONTINUE_FINAL_MESSAGE = "continue_final_message";
 
 	/**
 	 * Creates inference parameters with the given prompt.
@@ -593,6 +594,20 @@ public InferenceParameters setReasoningBudgetTokens(int budgetTokens) {
 		return this;
 	}
 
+	/**
+	 * Continue the final assistant message rather than starting a new one (vLLM/transformers compatible alias).
+	 * When {@code true}, {@code add_generation_prompt} is implicitly set to {@code false} and the last
+	 * assistant message in the conversation is extended without appending an end-of-turn token.
+	 * Mutually exclusive with {@code add_generation_prompt=true}.
+	 *
+	 * @param continueFinalMessage {@code true} to continue the last assistant message
+	 * @return this builder
+	 */
+	public InferenceParameters setContinueFinalMessage(boolean continueFinalMessage) {
+		parameters.put(PARAM_CONTINUE_FINAL_MESSAGE, String.valueOf(continueFinalMessage));
+		return this;
+	}
+
 	InferenceParameters setStream(boolean stream) {
 		parameters.put(PARAM_STREAM, String.valueOf(stream));
 		return this;
diff --git a/src/main/java/net/ladenthin/llama/ModelParameters.java b/src/main/java/net/ladenthin/llama/ModelParameters.java
@@ -1263,17 +1263,6 @@ public ModelParameters setDraftPMin(float draftPMin) {
         return this;
     }
 
-    /**
-     * Set the size of the prompt context for the draft model.
-     *
-     * @param ctxSizeDraft the prompt context size for the draft model
-     * @return this builder
-     */
-    public ModelParameters setCtxSizeDraft(int ctxSizeDraft) {
-        parameters.put("--spec-draft-ctx-size", String.valueOf(ctxSizeDraft));
-        return this;
-    }
-
     /**
      * Set the comma-separated list of devices to use for offloading the draft model.
      *
diff --git a/src/test/cpp/test_server.cpp b/src/test/cpp/test_server.cpp
@@ -243,9 +243,8 @@ TEST(SlotParamsToJson, SpeculativeFields_Present) {
     task_params p;
     const json j = p.to_json();
 
-    // b8962: only speculative.type is serialised; n_max/n_min/p_min are
-    // input-only (consumed by params_from_json_cmpl, not emitted by to_json)
-    EXPECT_TRUE(j.contains("speculative.type"));
+    // b9134: field renamed speculative.type → speculative.types (now a vector)
+    EXPECT_TRUE(j.contains("speculative.types"));
 }
 
 TEST(SlotParamsToJson, GrammarTriggers_IsArrayByDefault) {
diff --git a/src/test/java/net/ladenthin/llama/InferenceParametersTest.java b/src/test/java/net/ladenthin/llama/InferenceParametersTest.java
@@ -292,6 +292,18 @@ public void testSetReasoningBudgetTokensDisabled() {
 		assertEquals("-1", params.parameters.get("reasoning_budget_tokens"));
 	}
 
+	@Test
+	public void testSetContinueFinalMessageTrue() {
+		InferenceParameters params = new InferenceParameters("").setContinueFinalMessage(true);
+		assertEquals("true", params.parameters.get("continue_final_message"));
+	}
+
+	@Test
+	public void testSetContinueFinalMessageFalse() {
+		InferenceParameters params = new InferenceParameters("").setContinueFinalMessage(false);
+		assertEquals("false", params.parameters.get("continue_final_message"));
+	}
+
 	// -------------------------------------------------------------------------
 	// MiroStat
 	// -------------------------------------------------------------------------
diff --git a/src/test/java/net/ladenthin/llama/LlamaModelTest.java b/src/test/java/net/ladenthin/llama/LlamaModelTest.java
@@ -926,7 +926,6 @@ public void testSpeculativeDecoding() {
 						.setModel(TestConstants.MODEL_PATH)
 						.setModelDraft(TestConstants.DRAFT_MODEL_PATH)
 						.setCtxSize(128)
-						.setCtxSizeDraft(128)
 						.setDraftMax(8)
 						.setDraftMin(1)
 						.setGpuLayers(gpuLayers)
diff --git a/src/test/java/net/ladenthin/llama/ModelParametersExtendedTest.java b/src/test/java/net/ladenthin/llama/ModelParametersExtendedTest.java
@@ -897,12 +897,6 @@ public void testSetModelDraft() {
         assertEquals("/path/to/draft.gguf", p.parameters.get("--spec-draft-model"));
     }
 
-    @Test
-    public void testSetCtxSizeDraft() {
-        ModelParameters p = new ModelParameters().setCtxSizeDraft(512);
-        assertEquals("512", p.parameters.get("--spec-draft-ctx-size"));
-    }
-
     @Test
     public void testSetDeviceDraft() {
         ModelParameters p = new ModelParameters().setDeviceDraft("cuda0");
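
The new `setContinueFinalMessage(boolean)` setter stores the flag under the `continue_final_message` key as a string, and the changelog entry notes that the server rejects the request with a 400 when it is combined with `add_generation_prompt=true`. The sketch below illustrates that contract with a hypothetical stand-in class, not the project's `InferenceParameters`. The `setAddGenerationPrompt` setter and the client-side `validated()` check are assumptions made for illustration; in the PR the conflict is detected by the upstream server, not by the Java builder.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical stand-in that mimics the string-valued parameter map used by
 * the builder in this PR. It is NOT the project's InferenceParameters class.
 */
final class ContinueFinalMessageSketch {
    private final Map<String, String> parameters = new LinkedHashMap<>();

    /** Mirrors the setter added in this PR: stores the flag as "true"/"false". */
    ContinueFinalMessageSketch setContinueFinalMessage(boolean value) {
        parameters.put("continue_final_message", String.valueOf(value));
        return this;
    }

    /** Assumed setter, present only to demonstrate the conflicting combination. */
    ContinueFinalMessageSketch setAddGenerationPrompt(boolean value) {
        parameters.put("add_generation_prompt", String.valueOf(value));
        return this;
    }

    /**
     * Assumed client-side pre-check that fails fast on the combination the
     * server is documented to reject. Returns the parameter map when valid.
     */
    Map<String, String> validated() {
        // Boolean.parseBoolean(null) is false, so unset keys count as false.
        boolean continueFinal = Boolean.parseBoolean(parameters.get("continue_final_message"));
        boolean addGenerationPrompt = Boolean.parseBoolean(parameters.get("add_generation_prompt"));
        if (continueFinal && addGenerationPrompt) {
            throw new IllegalArgumentException(
                    "continue_final_message and add_generation_prompt are mutually exclusive");
        }
        return parameters;
    }

    public static void main(String[] args) {
        // Valid: only continue_final_message is set.
        System.out.println(new ContinueFinalMessageSketch()
                .setContinueFinalMessage(true)
                .validated());

        // Invalid: both flags are true, so the pre-check throws.
        try {
            new ContinueFinalMessageSketch()
                    .setContinueFinalMessage(true)
                    .setAddGenerationPrompt(true)
                    .validated();
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A pre-check like this is optional. It only moves the failure from the server response to the call site, which can make the conflict easier to spot in tests.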
