- Upgraded llama.cpp from b9803 to b9829. Compiles the new upstream `server-stream.cpp` (resumable-streaming SSE replay buffer) into `libjllama`, required because `server-context`/`server-http`/`server-models` now reference its symbols; refreshed `patches/0001` for the `tests/test-export-graph-ops.cpp` rename and the `server.cpp` GC-init context shift.
+
- Upgraded llama.cpp from b9829 to b9839. Pure version bump — no project source changes: all four patches (`0001`–`0004`) apply unchanged against b9839, and every upstream change in the range is absorbed inside upstream-compiled translation units. Brings DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`), the MiniCPM5 XML tool-call chat template, a server `--reasoning-preserve` flag (preserve reasoning trace across the full history when the template supports it), and Jinja `min`/`max` array filters; removes the now-unused `common/regex-partial.{cpp,h}` (partial-regex matching is fully inside the PEG parser), which the project never referenced.
+
- Upgraded llama.cpp from b9839 to b9840. Pure version bump — no project source changes: the range is entirely the new **DeepSeek-V4** architecture (new `deepseek4` arch + dedicated `llama-kv-cache-dsv4` cache, `sqrtsoftplus` MoE gating, hyper-connection/compressor hparams + tensors, conversion scripts and embedded chat template), all absorbed inside upstream-compiled `libllama` and the Python converters. Upstream's `src/CMakeLists.txt` adds the new `llama-kv-cache-dsv4.cpp` itself (built via FetchContent). All four patches (`0001`–`0004`) apply unchanged; the project binds none of the new symbols.
- `configureParallelInference` now applies `slot_prompt_similarity` live via `server_context::set_slot_prompt_similarity()` (upstream PR ggml-org/llama.cpp#22393, carried as `patches/0003` until merged), instead of validating it and discarding the value.
- Extracted the `chatWithTools` agent loop into `ToolCallingAgent`; tool-result errors (unknown tool / handler exception) are now JSON-serialized so tool names containing special characters remain valid JSON.
CLAUDE.md: 4 additions & 4 deletions
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
-
Current llama.cpp pinned version: **b9829**
+
Current llama.cpp pinned version: **b9840**
## Upgrading CUDA Version
@@ -286,7 +286,7 @@ needs no extra step here, `build-webui` re-reads the tag and rebuilds the matchi
ships no UI):
```bash
# needs node/npm + network; embed.cpp is plain C++17 (no npm)
docs/history/llama-cpp-breaking-changes.md: 9 additions & 0 deletions
@@ -398,3 +398,12 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
| b9803–b9829 |`ggml/src/ggml-opencl/` (FA q4_0/q8_0 KV, +5 new kernel files) + `ggml/src/ggml-cuda/{cpy,out-prod}.cu` + `ggml/src/ggml-vulkan/` + `ggml/src/ggml-sycl/{norm,softmax}.cpp` + `ggml/src/ggml-openvino/`| Backend-internal only: OpenCL gains native flash-attention over quantized (q4_0/q8_0) KV cache + flash-decoding split kernels + Adreno X2/Xe tuning (new `fa_tune.h`, `flash_attn_pre_f16.cl`, `flash_attn_f32_q{4,8}_0.cl`, `cvt.cl`/`set_rows.cl` SoA quant variants); CUDA adds a `cudaMemcpy2DAsync` fast path for strided same-type copies, batched `cublasSgemmBatched` out-prod, and CPU→CUDA async copies; Vulkan/SYCL/OpenVINO kernel + op-table updates (incl. `GGML_GLU_OP_SWIGLU_OAI`, softmax attention-sinks). No API surface visible to `jllama.cpp`; the OpenCL set only affects the `opencl-android-aarch64` classifier, CUDA only `cuda13-linux-x86-64`. No project source changes required |
| b9803–b9829 |`common/common.{h,cpp}` + `common/speculative.cpp` + `common/arg.{cpp,h}` + `tools/mtmd/clip*.{h,cpp}`| Internal upstream churn: new `COM_*`/`SPC_*` logging macros (the `LOG_*` calls inside `common.cpp`/`speculative.cpp`/`reasoning-budget.cpp` were rewrapped, several `LOG_INF`→`LOG_TRC` quieting); `common_models_handler` gained `plan_spec`/`plan_voc` for `--spec-draft-hf`/`--hf-repo-v` downloads + duplicate-task dedup; `clip` hardened GGUF array reads (`get_arr_f32`, even-pinpoints / mean-std validation, `n_merge` defaults to 1). All consumed inside upstream-compiled `common`/`mtmd`; `grep -rn "common_models_handler\|COM_TRC\|n_merge" src/main/cpp src/test/cpp` → zero matches. No project source changes required |
| b9803–b9829 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply + reverse-apply cleanly** against b9829 via `git apply --check` / `git apply --reverse --check` over the actual b9829 sources fetched from `api.github.com` (github.com git-clone — incl. `FetchContent` of `nlohmann/json` and llama.cpp — is blocked in this sandbox, so a full build could not run). Patch 0001 was refreshed for the `test-export-graph-ops` rename and the `server.cpp` GC-insertion context shift (see the row above); 0002/0003/0004 unchanged. The **`server-stream.cpp` link fix** in `CMakeLists.txt` is required by the b9829 server-TU `#include`s (verified against the upstream diff: `server-context`/`server-http`/`server-models` reference symbols defined only in `server-stream.cpp`). Full build + `ctest` (target 454/454) to be confirmed by the CI pipeline. |
+
| b9829–b9839 | `common/regex-partial.{cpp,h}` (removed) + `common/CMakeLists.txt` + `tests/test-regex-partial.cpp` (removed) + `tests/CMakeLists.txt` | The standalone reversed-partial-regex matcher (`common_regex`, `regex_to_reversed_partial_regex`, `common_regex_match`/`common_string_range`) was **deleted**; partial-match handling during streaming tool-call parsing is now fully inside the PEG parser (same consolidation pattern as the b9739–b9789 `json-partial` removal). Project references none of these symbols: verified `grep -rn "regex-partial\|common_regex\|regex_to_reversed\|COMMON_REGEX" src/main/cpp src/test/cpp` → zero matches; the deleted upstream test isn't built here (`LLAMA_BUILD_TESTS` OFF). No project source changes required |
+
| b9829–b9839 | `common/common.h` + `common/speculative.cpp` + `conversion/*.py` + `gguf-py/` + `src/llama-arch.{cpp,h}` + `src/llama-{context,graph,model}.cpp` + `src/models/dflash.cpp` (new) + `docs/speculative.md` | **New feature** — DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`, PR #22105): a new `LLM_ARCH_DFLASH` arch + `common_speculative_impl_draft_dflash` that drafts a whole block per step and injects the target model's hidden states into the draft KV cache. Adds `COMMON_SPECULATIVE_TYPE_DRAFT_DFLASH` (so `COMMON_SPECULATIVE_TYPE_COUNT` 9→10, `static_assert` bumped) and a `self_kq_mask && self_kq_mask->buffer` guard in `llm_graph_input_attn_kv::set_input` for the KV-injection pass. Conversion/gguf-py changes are Python-only (not built/shipped by this repo). The project binds no `common_speculative_*`/arch symbol — all consumed inside upstream-compiled `common`/`libllama`. No project source changes required. Could later surface as a `--spec-type` inference parameter |
+
| b9829–b9839 |`common/chat.cpp` + `models/templates/openbmb-MiniCPM5-1B.jinja` (new) + `tests/test-chat*.cpp`|**New model support** — MiniCPM5 chat template (`common_chat_params_init_minicpm5`): XML tool calls `<function name="…"><param name="…">…</param></function>` with CDATA-escaped string values + `<think>` reasoning. Detected by `common_chat_try_specialized_template` and handled inside the compiled-in `chat.cpp`, so it flows through the embedded server / `LlamaModel` chat path automatically. Upstream test additions aren't built here (`LLAMA_BUILD_TESTS` OFF). No project source changes required |
+
| b9829–b9839 |`common/arg.cpp` + `common/chat.cpp` + `common/jinja/caps.{cpp,h}` + `tools/server/server-context.cpp`|**New feature** — `--reasoning-preserve` / `--no-reasoning-preserve` (`LLAMA_ARG_REASONING_PRESERVE`): preserve the reasoning trace across the **full** chat history (not just the last assistant message) when the template advertises the `supports_preserve_reasoning` capability; `server-context.cpp` adds an informational/warning log reconciling the flag with the loaded template's caps. Server-level CLI + capability detection, all inside upstream-compiled TUs; not surfaced by `ModelParameters`. **Note:** the b9839 `server-context.cpp` additions sit in `load_model` after the `chat_params` block — disjoint from the load-progress-callback guard `patches/0002` targets, which still applies cleanly. No project source changes required; could later expose as a model/inference setter |
+
| b9829–b9839 |`common/jinja/runtime.{cpp,h}` + `common/jinja/value.cpp` + `tools/ui/**` + `tests/test-jinja.cpp` + `tools/server/server-{models,stream}.cpp`| Internal/cosmetic only: Jinja gains an AST `visitor` + `runtime::debug_dump_program` (template debugging) and `min`/`max` array filters; `server-models.cpp`/`server-stream.cpp` add diagnostic warning logs on unknown-conversation stop paths (additive, compiled into `jllama`); the Svelte WebUI got conversation-sidebar/streaming-identity refactors. The WebUI **auto-follows** the pinned `GIT_TAG` (the `build-webui` CI job re-reads it and rebuilds the matching UI), so no manual step here. Project references none of the touched symbols. No project source changes required |
+
| b9829–b9839 | `common/arg.cpp` (lambda capture + `--offline` examples) | Behaviour-neutral upstream churn: the `common_models_handler_apply` `on_done` lambda now captures `first_path` by value (dangling-reference fix) and `--offline` gained `LLAMA_EXAMPLE_COMMON`/`LLAMA_EXAMPLE_DOWNLOAD` `set_examples` tags. The project's `ModelParameters.setOffline(boolean)` (`--offline`) already exists; both changes are inside upstream-compiled `arg.cpp` and don't touch the `patches/0001` hunks. No project source changes required |
+
| b9829–b9839 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply cleanly** against b9839: `git apply --check` exits 0 for all four over the actual b9839 sources fetched from `raw.githubusercontent.com`. A full `FetchContent` build could not run because github.com git-clone is blocked in this sandbox. Patch 0001's `common/arg.{cpp,h}` target regions and the ~34 standalone-main call sites are unchanged in this range (the b9839 `arg.cpp` edits are the new `--reasoning-preserve` option, the `--offline` `set_examples` tags, and the `on_done` capture fix; none overlap the patched hunks); 0002's `server-context.cpp` load-progress guard region is untouched; 0003/0004 unchanged. OuteTTS generator anchors hold (upstream `tools/tts/tts.cpp` is unchanged in this range apart from patch 0001's existing main()-only parse flip). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline |
+
| b9839–b9840 | `src/llama-arch.{cpp,h}` + `src/llama-model.{cpp,h}` + `src/llama-hparams.h` + `src/llama-graph.{cpp,h}` + `src/llama-kv-cache-dsv4.{cpp,h}` (new) + `src/models/deepseek4.cpp` (new) + `src/llama-kv-cache{,-iswa}.{cpp,h}` + `src/llama-model-loader.cpp` + `src/CMakeLists.txt` + `conversion/*.py` + `gguf-py/` + `models/templates/deepseek-ai-DeepSeek-V4.jinja` (new) | **New model support** — DeepSeek-V4 (`LLM_ARCH_DEEPSEEK4` / `deepseek4`): a brand-new arch with its own compressed KV cache (`llama_kv_cache_dsv4`: raw SWA + CSA/HCA/lightning-indexer compressor states), `sqrtsoftplus` MoE gating (`LLAMA_EXPERT_GATING_FUNC_TYPE_SQRT_SOFTPLUS = 4`), hyper-connection + compressor hparams/tensors, hash-routing experts, and an embedded chat template. `build_moe_ffn` gained an optional trailing `selected_experts_in` param (defaults `nullptr`); `llama_kv_cache_iswa` gained an hparams-taking ctor overload; `llama_kv_cache` exposes `get_layer_ids()`/`get_k_storage()`. **All internal to upstream-compiled `libllama`** — upstream's own `src/CMakeLists.txt` adds the new `llama-kv-cache-dsv4.cpp` (built via FetchContent), and the conversion/gguf-py changes are Python-only (not built/shipped by this repo). The project binds none of the new symbols — verified `grep -rn "DEEPSEEK4\|dsv4\|DSV4\|SQRT_SOFTPLUS\|sqrtsoftplus\|selected_experts_in\|HYPER_CONNECTION\|hash_layer" src/main/cpp src/test/cpp` → zero matches. No project source changes required; a DeepSeek-V4 GGUF would just work through the embedded server / `LlamaModel` path. |
+
| b9839–b9840 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply cleanly** against b9840: `git apply --check` exits 0 for all four over the actual b9840 sources fetched from `raw.githubusercontent.com`. A full `FetchContent` build could not run because github.com git-clone is blocked in this sandbox. The b9839→b9840 range touches **no** patch-target file (`common/arg.{cpp,h}`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `test-chat.cpp`, the ~34 standalone mains); it is entirely additive DeepSeek-V4 code, so the patch hunks and offsets are byte-identical to b9839. OuteTTS generator anchors hold (upstream `tools/tts/tts.cpp` unchanged in this range). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline |
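The two checks that recur in the verification rows above (a `git apply --check` dry run per patch, and a `grep -rn` that must return zero matches) could be scripted roughly as follows. This is a minimal sketch, not a script from this repository: the function names `check_patches` and `check_no_refs`, their argument order, and the paths in the usage note are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: function names and arguments are illustrative assumptions.

# check_patches SRC_DIR PATCH...
# Dry-runs each patch against SRC_DIR with `git apply --check`;
# no file is modified. Returns non-zero if any patch fails to apply.
check_patches() {
  local src_dir=$1
  shift
  local patch abs failed=0
  for patch in "$@"; do
    # Make the patch path absolute, because git runs from inside SRC_DIR.
    abs="$(cd "$(dirname "$patch")" && pwd)/$(basename "$patch")"
    if (cd "$src_dir" && git apply --check "$abs"); then
      echo "ok: $patch"
    else
      echo "FAILED: $patch" >&2
      failed=1
    fi
  done
  return "$failed"
}

# check_no_refs PATTERN DIR...
# Succeeds only when grep finds no match. grep exits 1 for "no match"
# and 2 or more for errors, so the two cases are reported separately.
check_no_refs() {
  local pattern=$1
  shift
  local status=0
  grep -rn -- "$pattern" "$@" || status=$?
  case "$status" in
    0) echo "unexpected references found" >&2; return 1 ;;
    1) echo "zero matches" ;;
    *) echo "grep failed with status $status" >&2; return 2 ;;
  esac
}
```

A hypothetical invocation would be `check_patches /path/to/llama.cpp-b9840 patches/*` followed by `check_no_refs 'DEEPSEEK4\|dsv4' src/main/cpp src/test/cpp`, where the source directory and the patch glob are placeholders. Treating a grep error as a failure rather than as "zero matches" avoids a false pass when a directory argument is mistyped.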