You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: CLAUDE.md
+13-1Lines changed: 13 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
6
6
7
7
Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
8
8
9
-
Current llama.cpp pinned version: **b9106**
9
+
Current llama.cpp pinned version: **b9145**
10
10
11
11
## Upgrading CUDA Version
12
12
@@ -253,6 +253,18 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
253
253
|~b9103–b9106 |`ggml/src/ggml-vulkan/ggml-vulkan.cpp` + Vulkan shaders | Vulkan flash attention refactored: `pipeline_flash_attn_f32_f16` changed from a per-type array of maps to a single map; mixed K/V quant types (e.g. Q4_0 K + F16 V) now supported on all Vulkan FA paths (scalar, cm1, cm2) rather than coopmat2 only; per-type SPIR-V variants replaced by two generic modules (`flash_attn_f32_f16` and `flash_attn_f32_f16_int8`) that select K/V type at runtime via `FaTypeK`/`FaTypeV` spec constants; new `flash_attn_dequant.glsl` contains aliased SSBO views and an uber `dequantize4()` switch; the K/V type mismatch guard removed from `ggml_backend_vk_device_supports_op`; internal Vulkan backend refactor, no project changes required |
254
254
|~b9103–b9106 |`ggml/src/ggml-cuda/argsort.cu`| Added `#include <cuda/iterator>` for CCCL ≥ 3.1 strided-iterator path; internal CUDA backend, no project changes required |
255
255
|~b9103–b9106 |`convert_hf_to_gguf.py`| Mistral Medium 3.5 mmproj support: `n_embd_text` now reads `"dim"` key instead of `"hidden_dim"`; negative `img_break_tok_id` placeholders resolved from `tekken.json` or `tokenizer.json`; conversion tool only, no project changes required |
256
+
|~b9106–b9134 |`common/arg.cpp`| CLI option `--spec-draft-ctx-size` / `-cd` / `--ctx-size-draft` REMOVED — throws `std::invalid_argument` at parse time; `ModelParameters.setCtxSizeDraft()` removed; no replacement (context size now managed internally by speculative engine) |
257
+
|~b9106–b9134 |`common/arg.cpp`| CLI option `--spec-draft-replace` / `--spec-replace` REMOVED — throws `std::invalid_argument` at parse time; no corresponding Java method existed |
258
+
|~b9106–b9134 |`common/speculative.h`| Full redesign: `common_speculative_type` enum values renamed `DRAFT`→`DRAFT_SIMPLE`, `EAGLE3`→`DRAFT_EAGLE3`; `common_params_speculative.type` (single enum) →`.types` (vector); `common_speculative_n_max()` / `common_speculative_n_min()` REMOVED; new `common_speculative_init(params, n_seq)` no longer takes ctx; new `common_speculative_begin(spec, seq_id, prompt)`, `common_speculative_draft(spec)`, `common_speculative_accept(spec, seq_id, n)`, `common_speculative_process(spec, batch)` signatures; `common_speculative_draft_params` struct added; server sources compiled directly, no project JNI changes required |
259
+
|~b9106–b9134 |`common/common.h`| New `common_prompt_checkpoint` struct (contains `data_tgt` + `data_dft`) replaces the old `server_prompt_checkpoint` in `server-task.h`; compiled from upstream server sources, no project JNI changes required |
260
+
|~b9106–b9134 |`tools/server/server-task.cpp`|`task_params::to_json()` renamed field `"speculative.type"`→`"speculative.types"` (now serialises the vector); test `SlotParamsToJson.SpeculativeFields_Present` updated accordingly |
261
+
|~b9106–b9134 |`include/llama.h`| New `LLAMA_STATE_SEQ_FLAGS_NONE = 0` macro added; additive, no project changes required |
262
+
|~b9134–b9145 |`tools/server/server-common.cpp`| New `continue_final_message` boolean request field in `oaicompat_chat_params_parse`; vLLM/transformers-compatible alias for the prefill-assistant heuristic — when `true`, the last assistant message is extended without appending an end-of-turn token; mutually exclusive with `add_generation_prompt=true` (throws 400); compiled from upstream server sources; `InferenceParameters.setContinueFinalMessage(boolean)` added |
263
+
|~b9134–b9145 |`ggml/src/ggml-sycl/`| Level Zero API integration for SYCL device memory allocation (`GGML_SYCL_SUPPORT_LEVEL_ZERO` build option, `GGML_SYCL_ENABLE_LEVEL_ZERO` runtime env); reduces system RAM usage on Intel dGPUs; internal SYCL backend, no project changes required |
264
+
|~b9134–b9145 |`ggml/src/ggml-opencl/`| Q5_0 and Q5_1 MoE GEMM/GEMV kernels added for Adreno (Qualcomm) GPUs; internal OpenCL backend, no project changes required |
265
+
|~b9134–b9145 |`ggml/src/ggml-cuda/allreduce.cu`| AllReduce accumulation now routed through `float` intermediate for precision (avoids BF16 truncation); internal CUDA backend, no project changes required |
266
+
|~b9134–b9145 |`ggml/src/ggml-hexagon/`|`GGML_UNARY_OP_TANH` added to Hexagon HTP backend; internal DSP backend, no project changes required |
267
+
|~b9134–b9145 |`ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp`|`use_subgroup_matrix` condition now also checks `sg_mat_k > 0 && sg_mat_n > 0` and alignment; prevents crash on devices reporting subgroup matrix support with zero k/n; internal WebGPU backend, no project changes required |
0 commit comments