Commit 1c49078

Merge pull request #134 from bernardladenthin/claude/gallant-bell-JKvSV
Upgrade llama.cpp from b9106 to b9134
2 parents e6941b8 + 79aecdf commit 1c49078

10 files changed

Lines changed: 45 additions & 25 deletions

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ hs_err_pid*
 replay_pid*
 
 models/*.gguf
-src/main/cpp/de_kherud_llama_*.h
+src/main/cpp/net_ladenthin_llama_*.h
 src/main/resources_cuda_linux/
 src/main/resources/**/*.so
 src/main/resources/**/*.dylib

CLAUDE.md

Lines changed: 13 additions & 1 deletion
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9106**
+Current llama.cpp pinned version: **b9145**
 
 ## Upgrading CUDA Version
 
@@ -253,6 +253,18 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
 | ~b9103–b9106 | `ggml/src/ggml-vulkan/ggml-vulkan.cpp` + Vulkan shaders | Vulkan flash attention refactored: `pipeline_flash_attn_f32_f16` changed from a per-type array of maps to a single map; mixed K/V quant types (e.g. Q4_0 K + F16 V) now supported on all Vulkan FA paths (scalar, cm1, cm2) rather than coopmat2 only; per-type SPIR-V variants replaced by two generic modules (`flash_attn_f32_f16` and `flash_attn_f32_f16_int8`) that select K/V type at runtime via `FaTypeK`/`FaTypeV` spec constants; new `flash_attn_dequant.glsl` contains aliased SSBO views and an uber `dequantize4()` switch; the K/V type mismatch guard removed from `ggml_backend_vk_device_supports_op`; internal Vulkan backend refactor, no project changes required |
 | ~b9103–b9106 | `ggml/src/ggml-cuda/argsort.cu` | Added `#include <cuda/iterator>` for CCCL ≥ 3.1 strided-iterator path; internal CUDA backend, no project changes required |
 | ~b9103–b9106 | `convert_hf_to_gguf.py` | Mistral Medium 3.5 mmproj support: `n_embd_text` now reads `"dim"` key instead of `"hidden_dim"`; negative `img_break_tok_id` placeholders resolved from `tekken.json` or `tokenizer.json`; conversion tool only, no project changes required |
+| ~b9106–b9134 | `common/arg.cpp` | CLI option `--spec-draft-ctx-size` / `-cd` / `--ctx-size-draft` REMOVED — throws `std::invalid_argument` at parse time; `ModelParameters.setCtxSizeDraft()` removed; no replacement (context size now managed internally by speculative engine) |
+| ~b9106–b9134 | `common/arg.cpp` | CLI option `--spec-draft-replace` / `--spec-replace` REMOVED — throws `std::invalid_argument` at parse time; no corresponding Java method existed |
+| ~b9106–b9134 | `common/speculative.h` | Full redesign: `common_speculative_type` enum values renamed `DRAFT`&#x2192;`DRAFT_SIMPLE`, `EAGLE3`&#x2192;`DRAFT_EAGLE3`; `common_params_speculative.type` (single enum) &#x2192; `.types` (vector); `common_speculative_n_max()` / `common_speculative_n_min()` REMOVED; new `common_speculative_init(params, n_seq)` no longer takes ctx; new `common_speculative_begin(spec, seq_id, prompt)`, `common_speculative_draft(spec)`, `common_speculative_accept(spec, seq_id, n)`, `common_speculative_process(spec, batch)` signatures; `common_speculative_draft_params` struct added; server sources compiled directly, no project JNI changes required |
+| ~b9106–b9134 | `common/common.h` | New `common_prompt_checkpoint` struct (contains `data_tgt` + `data_dft`) replaces the old `server_prompt_checkpoint` in `server-task.h`; compiled from upstream server sources, no project JNI changes required |
+| ~b9106–b9134 | `tools/server/server-task.cpp` | `task_params::to_json()` renamed field `"speculative.type"` &#x2192; `"speculative.types"` (now serialises the vector); test `SlotParamsToJson.SpeculativeFields_Present` updated accordingly |
+| ~b9106–b9134 | `include/llama.h` | New `LLAMA_STATE_SEQ_FLAGS_NONE = 0` macro added; additive, no project changes required |
+| ~b9134–b9145 | `tools/server/server-common.cpp` | New `continue_final_message` boolean request field in `oaicompat_chat_params_parse`; vLLM/transformers-compatible alias for the prefill-assistant heuristic — when `true`, the last assistant message is extended without appending an end-of-turn token; mutually exclusive with `add_generation_prompt=true` (throws 400); compiled from upstream server sources; `InferenceParameters.setContinueFinalMessage(boolean)` added |
+| ~b9134–b9145 | `ggml/src/ggml-sycl/` | Level Zero API integration for SYCL device memory allocation (`GGML_SYCL_SUPPORT_LEVEL_ZERO` build option, `GGML_SYCL_ENABLE_LEVEL_ZERO` runtime env); reduces system RAM usage on Intel dGPUs; internal SYCL backend, no project changes required |
+| ~b9134–b9145 | `ggml/src/ggml-opencl/` | Q5_0 and Q5_1 MoE GEMM/GEMV kernels added for Adreno (Qualcomm) GPUs; internal OpenCL backend, no project changes required |
+| ~b9134–b9145 | `ggml/src/ggml-cuda/allreduce.cu` | AllReduce accumulation now routed through `float` intermediate for precision (avoids BF16 truncation); internal CUDA backend, no project changes required |
+| ~b9134–b9145 | `ggml/src/ggml-hexagon/` | `GGML_UNARY_OP_TANH` added to Hexagon HTP backend; internal DSP backend, no project changes required |
+| ~b9134–b9145 | `ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp` | `use_subgroup_matrix` condition now also checks `sg_mat_k > 0 && sg_mat_n > 0` and alignment; prevents crash on devices reporting subgroup matrix support with zero k/n; internal WebGPU backend, no project changes required |
 
 ## Build Commands
 

CMakeLists.txt

Lines changed: 1 addition & 1 deletion
@@ -107,7 +107,7 @@ set(GGML_AVX512 OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
     llama.cpp
     GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-    GIT_TAG b9106
+    GIT_TAG b9145
 )
 FetchContent_MakeAvailable(llama.cpp)
 

README.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)
-[![llama.cpp b9106](https://img.shields.io/badge/llama.cpp-%23b9106-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9106)
+[![llama.cpp b9145](https://img.shields.io/badge/llama.cpp-%23b9145-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9145)
 [![Maven Central](https://img.shields.io/maven-central/v/net.ladenthin/llama)](https://central.sonatype.com/artifact/net.ladenthin/llama)
 [![Snapshot](https://img.shields.io/badge/snapshot-latest-informational)](https://central.sonatype.com/repository/maven-snapshots/net/ladenthin/llama/)
 

src/main/java/net/ladenthin/llama/InferenceParameters.java

Lines changed: 15 additions & 0 deletions
@@ -56,6 +56,7 @@ public final class InferenceParameters extends JsonParameters {
     private static final String PARAM_TOP_N_SIGMA = "top_n_sigma";
     private static final String PARAM_REASONING_FORMAT = "reasoning_format";
     private static final String PARAM_REASONING_BUDGET_TOKENS = "reasoning_budget_tokens";
+    private static final String PARAM_CONTINUE_FINAL_MESSAGE = "continue_final_message";
 
     /**
      * Creates inference parameters with the given prompt.
@@ -593,6 +594,20 @@ public InferenceParameters setReasoningBudgetTokens(int budgetTokens) {
         return this;
     }
 
+    /**
+     * Continue the final assistant message rather than starting a new one (vLLM/transformers compatible alias).
+     * When {@code true}, {@code add_generation_prompt} is implicitly set to {@code false} and the last
+     * assistant message in the conversation is extended without appending an end-of-turn token.
+     * Mutually exclusive with {@code add_generation_prompt=true}.
+     *
+     * @param continueFinalMessage {@code true} to continue the last assistant message
+     * @return this builder
+     */
+    public InferenceParameters setContinueFinalMessage(boolean continueFinalMessage) {
+        parameters.put(PARAM_CONTINUE_FINAL_MESSAGE, String.valueOf(continueFinalMessage));
+        return this;
+    }
+
     InferenceParameters setStream(boolean stream) {
         parameters.put(PARAM_STREAM, String.valueOf(stream));
         return this;
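
The new setter only records the flag; per the upgrade notes, the conflict with `add_generation_prompt=true` is rejected by the server with a 400. The following is a minimal, self-contained sketch of how a caller might catch that conflict before sending a request. `ChatRequestSketch`, `setAddGenerationPrompt`, and `validate` are hypothetical names introduced for illustration and are not part of `InferenceParameters`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical stand-in for a request-parameter builder.
 * This is NOT the real net.ladenthin.llama.InferenceParameters class.
 */
final class ChatRequestSketch {
    private final Map<String, String> parameters = new LinkedHashMap<>();

    // Mirrors the diff: the flag is stored under its JSON key as a string.
    ChatRequestSketch setContinueFinalMessage(boolean value) {
        parameters.put("continue_final_message", String.valueOf(value));
        return this;
    }

    // Hypothetical setter, present only to demonstrate the conflict.
    ChatRequestSketch setAddGenerationPrompt(boolean value) {
        parameters.put("add_generation_prompt", String.valueOf(value));
        return this;
    }

    String get(String key) {
        return parameters.get(key);
    }

    // Assumed client-side pre-check; the authoritative check lives in the server.
    void validate() {
        boolean continueFinal = Boolean.parseBoolean(parameters.get("continue_final_message"));
        boolean addGenPrompt = Boolean.parseBoolean(parameters.get("add_generation_prompt"));
        if (continueFinal && addGenPrompt) {
            throw new IllegalStateException(
                "continue_final_message and add_generation_prompt cannot both be true");
        }
    }

    public static void main(String[] args) {
        ChatRequestSketch request = new ChatRequestSketch().setContinueFinalMessage(true);
        request.validate(); // passes: add_generation_prompt was never set to true
        System.out.println(request.get("continue_final_message"));
    }
}
```

Failing early on the client is a design choice, not a requirement: it trades a duplicated rule for a clearer error than an HTTP 400 surfaced through JNI.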

src/main/java/net/ladenthin/llama/ModelParameters.java

Lines changed: 0 additions & 11 deletions
@@ -1263,17 +1263,6 @@ public ModelParameters setDraftPMin(float draftPMin) {
         return this;
     }
 
-    /**
-     * Set the size of the prompt context for the draft model.
-     *
-     * @param ctxSizeDraft the prompt context size for the draft model
-     * @return this builder
-     */
-    public ModelParameters setCtxSizeDraft(int ctxSizeDraft) {
-        parameters.put("--spec-draft-ctx-size", String.valueOf(ctxSizeDraft));
-        return this;
-    }
-
     /**
      * Set the comma-separated list of devices to use for offloading the draft model.
      *
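
Because the removed flag now fails at parse time rather than being ignored, any argument list assembled outside `ModelParameters` (for example from a saved configuration) has to drop it as well. A minimal sketch of such a migration step follows, assuming arguments are held as a flat list of strings and that each removed spelling is followed by exactly one value, as the deleted setter implies. `RemovedDraftCtxFlagSketch` and `strip` are hypothetical names, not part of this project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper; not part of the net.ladenthin.llama API. */
final class RemovedDraftCtxFlagSketch {
    // The three spellings listed as removed in the CLAUDE.md upgrade notes.
    private static final Set<String> REMOVED = new HashSet<>(
        Arrays.asList("--spec-draft-ctx-size", "-cd", "--ctx-size-draft"));

    /** Returns a copy of args without the removed flag and the value that follows it. */
    static List<String> strip(List<String> args) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < args.size(); i++) {
            if (REMOVED.contains(args.get(i))) {
                i++; // skip the flag's single value as well
                continue;
            }
            kept.add(args.get(i));
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> legacy = Arrays.asList(
            "--ctx-size", "128", "--spec-draft-ctx-size", "128", "--spec-draft-model", "draft.gguf");
        System.out.println(strip(legacy));
    }
}
```

The sketch deliberately leaves `--spec-draft-replace` / `--spec-replace` alone, since the number of values that flag consumed is not stated in this commit.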

src/test/cpp/test_server.cpp

Lines changed: 2 additions & 3 deletions
@@ -243,9 +243,8 @@ TEST(SlotParamsToJson, SpeculativeFields_Present) {
     task_params p;
     const json j = p.to_json();
 
-    // b8962: only speculative.type is serialised; n_max/n_min/p_min are
-    // input-only (consumed by params_from_json_cmpl, not emitted by to_json)
-    EXPECT_TRUE(j.contains("speculative.type"));
+    // b9134: field renamed speculative.type → speculative.types (now a vector)
+    EXPECT_TRUE(j.contains("speculative.types"));
 }
 
 TEST(SlotParamsToJson, GrammarTriggers_IsArrayByDefault) {
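
Java code that inspects the serialised slot parameters may see either key, depending on which native build produced the JSON. The sketch below shows one way to read both shapes, assuming the JSON has already been decoded into a `Map<String, Object>` and that the new field decodes to a `List`. `SpeculativeTypesReaderSketch` is a hypothetical helper, and the element values in the example are placeholders, since this commit does not show how the enum is serialised.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical reader; not part of the net.ladenthin.llama API. */
final class SpeculativeTypesReaderSketch {
    /** Prefers the new vector-valued key and falls back to the old single-valued key. */
    static List<String> read(Map<String, Object> slotParams) {
        Object current = slotParams.get("speculative.types");
        if (current instanceof List) {
            List<String> out = new ArrayList<>();
            for (Object element : (List<?>) current) {
                out.add(String.valueOf(element));
            }
            return out;
        }
        Object legacy = slotParams.get("speculative.type");
        if (legacy != null) {
            return Collections.singletonList(String.valueOf(legacy));
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        Map<String, Object> decoded = new HashMap<>();
        decoded.put("speculative.types", Arrays.asList("A", "B")); // placeholder values
        System.out.println(read(decoded));
    }
}
```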

src/test/java/net/ladenthin/llama/InferenceParametersTest.java

Lines changed: 12 additions & 0 deletions
@@ -292,6 +292,18 @@ public void testSetReasoningBudgetTokensDisabled() {
         assertEquals("-1", params.parameters.get("reasoning_budget_tokens"));
     }
 
+    @Test
+    public void testSetContinueFinalMessageTrue() {
+        InferenceParameters params = new InferenceParameters("").setContinueFinalMessage(true);
+        assertEquals("true", params.parameters.get("continue_final_message"));
+    }
+
+    @Test
+    public void testSetContinueFinalMessageFalse() {
+        InferenceParameters params = new InferenceParameters("").setContinueFinalMessage(false);
+        assertEquals("false", params.parameters.get("continue_final_message"));
+    }
+
     // -------------------------------------------------------------------------
     // MiroStat
     // -------------------------------------------------------------------------

src/test/java/net/ladenthin/llama/LlamaModelTest.java

Lines changed: 0 additions & 1 deletion
@@ -926,7 +926,6 @@ public void testSpeculativeDecoding() {
                 .setModel(TestConstants.MODEL_PATH)
                 .setModelDraft(TestConstants.DRAFT_MODEL_PATH)
                 .setCtxSize(128)
-                .setCtxSizeDraft(128)
                 .setDraftMax(8)
                 .setDraftMin(1)
                 .setGpuLayers(gpuLayers)

src/test/java/net/ladenthin/llama/ModelParametersExtendedTest.java

Lines changed: 0 additions & 6 deletions
@@ -897,12 +897,6 @@ public void testSetModelDraft() {
         assertEquals("/path/to/draft.gguf", p.parameters.get("--spec-draft-model"));
     }
 
-    @Test
-    public void testSetCtxSizeDraft() {
-        ModelParameters p = new ModelParameters().setCtxSizeDraft(512);
-        assertEquals("512", p.parameters.get("--spec-draft-ctx-size"));
-    }
-
     @Test
     public void testSetDeviceDraft() {
         ModelParameters p = new ModelParameters().setDeviceDraft("cuda0");
