Pure version bump folded into the 5.0.3 release. The b9840->b9842 range is
internal-only with no API impact:
- common/preset.cpp: canonical_tag() helper canonicalizes the tag suffix of
INI preset section names (CLI/common feature; not bound by the JNI layer).
- ggml-vulkan.cpp: graph-submission heuristic switched from weight-matrix
bytes to estimated FLOPs (Vulkan-backend perf tuning; vulkan-windows
classifier only).
All four local patches (0001-0004) apply unchanged (no patch-target file or
OuteTTS generator anchor touched). Updated CMakeLists GIT_TAG/LLAMA_TAG,
README badge, CLAUDE.md pin references, CHANGELOG 5.0.3 range, and appended
b9840-b9842 rows to docs/history/llama-cpp-breaking-changes.md.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01URUX3HiqQ1wzJnT8qn8c8E
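A rough illustration of the tag canonicalization described for `common/preset.cpp`, written in Java only so it matches the project's binding language. The upstream helper is C++ and normalizes via a regex whose exact rules are not shown in this commit; the class name `PresetTagSketch`, the method `canonicalSectionName`, and the "trim and upper-case" rule are assumptions made for this sketch, not upstream API.

```java
import java.util.Locale;

public final class PresetTagSketch {
    private PresetTagSketch() {}

    /**
     * Hypothetical stand-in for upstream canonical_tag(): upper-cases the text
     * after the last ':' of a preset section name. Names without a ':' are
     * returned unchanged. The real normalization may do more than this.
     */
    public static String canonicalSectionName(String section) {
        int idx = section.lastIndexOf(':');
        if (idx < 0) {
            return section; // no tag suffix to canonicalize
        }
        String base = section.substring(0, idx);
        String tag = section.substring(idx + 1).trim();
        // Locale.ROOT keeps the result independent of the JVM default locale.
        return base + ":" + tag.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Both spellings collapse to the same key, so they name one preset.
        System.out.println(canonicalSectionName("model:q4_k_m")); // model:Q4_K_M
        System.out.println(canonicalSectionName("model:Q4_K_M")); // model:Q4_K_M
    }
}
```

Under these assumptions, a lookup keyed on the canonical name treats `[model:q4_k_m]` and `[model:Q4_K_M]` as the same section, which is the behaviour the breaking-changes entry reports.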
CHANGELOG.md: 2 additions & 2 deletions
@@ -13,7 +13,7 @@ from version 5.0.0 onward. Pre-fork releases (`1.x`–`4.2.0`) were authored by
Feature release. Headline addition is a full OpenAI-compatible embedded HTTP
server with multi-protocol surfaces, plus end-to-end multimodal (vision, audio
-
input, text-to-speech) and slot-bound sessions. Tracks llama.cpp **b9555 → b9840**.
+
input, text-to-speech) and slot-bound sessions. Tracks llama.cpp **b9555 → b9842**.
### Added
- **OpenAI-compatible HTTP server** (`server` package, built on the JDK's `com.sun.net.httpserver` — no new runtime dependency; embeddable and the fat-jar `Main-Class`). Serves `POST /v1/chat/completions` (streaming SSE + non-streaming), `/v1/completions` (token-by-token streaming), `/v1/embeddings`, `/v1/rerank`, `/infill`, `GET /v1/models`, `GET /health`, and `GET /props` (every route also reachable without the `/v1` prefix), with optional bearer auth and CORS — drives editor clients such as VS Code Copilot, Cline, Roo Code, and Continue.
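As a hedged illustration of the routes listed in the entry above, the sketch below builds (but does not send) a non-streaming `POST /v1/chat/completions` request with the JDK's `java.net.http` client. The base URL `http://localhost:8080`, the class `ChatRequestSketch`, and its helpers are assumptions for illustration; the JSON body follows the general OpenAI chat-completions convention, and whether this server requires further fields such as `model` is not stated in the changelog.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

public final class ChatRequestSketch {
    private ChatRequestSketch() {}

    /** Minimal JSON string escaping so user text cannot break the request body. */
    public static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds the request; bearerToken may be null when auth is disabled. */
    public static HttpRequest buildChatRequest(String baseUrl, String bearerToken, String userMessage) {
        String body = "{\"messages\":[{\"role\":\"user\",\"content\":\""
                + escapeJson(userMessage) + "\"}],\"stream\":false}";
        HttpRequest.Builder builder = HttpRequest
                .newBuilder(URI.create(baseUrl + "/v1/chat/completions"))
                .timeout(Duration.ofSeconds(60))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8));
        if (bearerToken != null && !bearerToken.isEmpty()) {
            builder.header("Authorization", "Bearer " + bearerToken);
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // Assumed address; adjust to wherever the embedded server listens.
        HttpRequest request = buildChatRequest("http://localhost:8080", null, "Hello");
        System.out.println(request.method() + " " + request.uri());
        // To actually send it (requires a running server):
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Since the entry says every route is also reachable without the `/v1` prefix, dropping `/v1` from the path should address the same handler.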
- Upgraded llama.cpp from **b9555 to b9840** across ten incremental upgrades. Notable upstream features now reachable: DRY sampling, `--swa-full`, DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`), the MiniCPM5 XML tool-call chat template, the server `--reasoning-preserve` flag, Jinja `min`/`max` array filters, and the **DeepSeek-V4** architecture (b9840). The b9829 bump additionally compiles the new upstream `server-stream.cpp` (resumable-streaming SSE replay buffer) into `libjllama`. All four local patches (`0001`–`0004`) apply across the range.
+
- Upgraded llama.cpp from **b9555 to b9842** across eleven incremental upgrades. Notable upstream features now reachable: DRY sampling, `--swa-full`, DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`), the MiniCPM5 XML tool-call chat template, the server `--reasoning-preserve` flag, Jinja `min`/`max` array filters, and the **DeepSeek-V4** architecture (b9840). The b9829 bump additionally compiles the new upstream `server-stream.cpp` (resumable-streaming SSE replay buffer) into `libjllama`. The final b9840→b9842 step is internal-only (preset INI section-tag canonicalization in `common/preset.cpp`; a Vulkan graph-submission heuristic switched from weight-matrix bytes to estimated FLOPs) — no project source changes, no API impact, all four local patches (`0001`–`0004`) apply unchanged across the range.
- Replaced the `--skip-download` flag with `--offline` (llama.cpp b9803).
- `Session` now pins every inference request to its configured slot, so generation and slot save/restore/erase target the same KV state (`SessionState` extracted as a testable concurrency contract).
- `configureParallelInference` now applies `slot_prompt_similarity` live via `server_context::set_slot_prompt_similarity()` (upstream PR ggml-org/llama.cpp#22393, carried as `patches/0003`), instead of validating and discarding the value.
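The Vulkan change mentioned in the changelog entry (submitting by estimated FLOPs rather than weight-matrix bytes) can be pictured with the simplified Java model below. It is not the upstream C++ code: `FlopsSubmitSketch`, `matmulFlops`, and `planSubmits` are hypothetical names, only the standard `2*m*n*k` matrix-multiply estimate is modelled (the upstream estimates for CONV and FLASH_ATTN nodes are not shown in this commit), and the threshold and node cap are plain parameters standing in for the roughly 200 GFLOP budget and `max_nodes_per_submit` named in the breaking-changes entry.

```java
import java.util.ArrayList;
import java.util.List;

public final class FlopsSubmitSketch {
    private FlopsSubmitSketch() {}

    /** FLOPs of an (m x k) by (k x n) product: one multiply and one add per term. */
    public static long matmulFlops(long m, long n, long k) {
        return 2L * m * n * k;
    }

    /**
     * Splits a graph's nodes into submission batches. A batch is closed once
     * its accumulated FLOPs reach flopsPerSubmit or it holds maxNodesPerSubmit
     * nodes, whichever comes first. Returns the node count of each batch.
     */
    public static List<Integer> planSubmits(long[] nodeFlops, long flopsPerSubmit, int maxNodesPerSubmit) {
        List<Integer> batches = new ArrayList<>();
        long flops = 0;
        int nodes = 0;
        for (long f : nodeFlops) {
            flops += f;
            nodes++;
            if (flops >= flopsPerSubmit || nodes >= maxNodesPerSubmit) {
                batches.add(nodes);
                flops = 0;
                nodes = 0;
            }
        }
        if (nodes > 0) {
            batches.add(nodes); // flush the trailing partial batch
        }
        return batches;
    }

    public static void main(String[] args) {
        long big = matmulFlops(4096, 4096, 4096); // 137438953472 FLOPs
        long[] graph = {big, big, big};
        // With a 200 GFLOP budget, two large matmuls fill one batch.
        System.out.println(planSubmits(graph, 200_000_000_000L, 100)); // [2, 1]
    }
}
```

In this model, a byte-based budget would treat a small matrix applied to many tokens the same as one applied to a single token, whereas a FLOP-based budget scales with the work actually queued; that is one plausible reading of why upstream switched, not a claim taken from the commit.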
CLAUDE.md: 4 additions & 4 deletions
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
-
Current llama.cpp pinned version: **b9840**
+
Current llama.cpp pinned version: **b9842**
## Upgrading CUDA Version
@@ -286,7 +286,7 @@ needs no extra step here, `build-webui` re-reads the tag and rebuilds the matchi
ships no UI):
```bash
# needs node/npm + network; embed.cpp is plain C++17 (no npm)
docs/history/llama-cpp-breaking-changes.md: 2 additions & 0 deletions
@@ -407,3 +407,5 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
| b9829–b9839 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply cleanly** against b9839 via `git apply --check` over the actual b9839 sources fetched from `raw.githubusercontent.com` (github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run — exit 0 for all four). Patch 0001's `common/arg.{cpp,h}` target regions and the ~34 standalone-main call sites are unchanged in this range (the b9839 arg.cpp edits are the new `--reasoning-preserve` opt, the `--offline``set_examples`, and the `on_done` capture fix — none overlap the patched hunks); 0002's `server-context.cpp` load-progress guard region is untouched; 0003/0004 unchanged. OuteTTS generator anchors hold (upstream `tools/tts/tts.cpp` is unchanged in this range apart from patch 0001's existing main()-only parse flip). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline |
| b9839–b9840 | `src/llama-arch.{cpp,h}` + `src/llama-model.{cpp,h}` + `src/llama-hparams.h` + `src/llama-graph.{cpp,h}` + `src/llama-kv-cache-dsv4.{cpp,h}` (new) + `src/models/deepseek4.cpp` (new) + `src/llama-kv-cache{,-iswa}.{cpp,h}` + `src/llama-model-loader.cpp` + `src/CMakeLists.txt` + `conversion/*.py` + `gguf-py/` + `models/templates/deepseek-ai-DeepSeek-V4.jinja` (new) | **New model support** — DeepSeek-V4 (`LLM_ARCH_DEEPSEEK4` / `deepseek4`): a brand-new arch with its own compressed KV cache (`llama_kv_cache_dsv4`: raw SWA + CSA/HCA/lightning-indexer compressor states), `sqrtsoftplus` MoE gating (`LLAMA_EXPERT_GATING_FUNC_TYPE_SQRT_SOFTPLUS = 4`), hyper-connection + compressor hparams/tensors, hash-routing experts, and an embedded chat template. `build_moe_ffn` gained an optional trailing `selected_experts_in` param (defaults `nullptr`); `llama_kv_cache_iswa` gained an hparams-taking ctor overload; `llama_kv_cache` exposes `get_layer_ids()`/`get_k_storage()`. **All internal to upstream-compiled `libllama`** — upstream's own `src/CMakeLists.txt` adds the new `llama-kv-cache-dsv4.cpp` (built via FetchContent), and the conversion/gguf-py changes are Python-only (not built/shipped by this repo). The project binds none of the new symbols — verified `grep -rn "DEEPSEEK4\|dsv4\|DSV4\|SQRT_SOFTPLUS\|sqrtsoftplus\|selected_experts_in\|HYPER_CONNECTION\|hash_layer" src/main/cpp src/test/cpp` → zero matches. No project source changes required; a DeepSeek-V4 GGUF would just work through the embedded server / `LlamaModel` path. |
| b9839–b9840 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply cleanly** against b9840 via `git apply --check` over the actual b9840 sources fetched from `raw.githubusercontent.com` (github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run — exit 0 for all four). The b9839→b9840 range touches **no** patch-target file (`common/arg.{cpp,h}`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `test-chat.cpp`, the ~34 standalone mains) — it is entirely additive DeepSeek-V4 code — so the patch hunks and offsets are byte-identical to b9839. OuteTTS generator anchors hold (upstream `tools/tts/tts.cpp` unchanged in this range). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline |
+
| b9840–b9842 | `common/preset.cpp` + `ggml/src/ggml-vulkan/ggml-vulkan.cpp` | Internal-only, no API surface. (1) `common/preset.cpp` adds a `canonical_tag()` helper and **canonicalizes the tag suffix** of INI preset section names (everything after the last `:` is upper-cased / normalized via a `<regex>`), so `[model:q4_k_m]` and `[model:Q4_K_M]` resolve to one preset. Preset loading is a CLI/`common` feature; the project's C++ never calls `common_preset*` — verified `grep -rn "common_preset\|canonical_tag\|load_from_ini" src/main/cpp src/test/cpp` → zero matches. (2) `ggml-vulkan.cpp` reworks the per-graph **command-buffer submission heuristic** from "weight-matrix bytes per submit" to "estimated FLOPs per submit" (new `ggml_vk_get_node_flops()` over MUL_MAT/CONV/FLASH_ATTN nodes; `last_total_mul_mat_bytes` → `last_total_flops`; submit every ~200 GFLOP, still bounded by `max_nodes_per_submit`). Pure Vulkan-backend perf tuning, behaviour-neutral to callers; only affects the `vulkan-windows-x86-64` classifier. No project source changes required. |
+
| b9840–b9842 | upstream verification (sandbox) | All four patches (`0001`–`0004`) re-verified to **apply cleanly** against b9842 via `git apply --check` over the actual b9842 sources (github.com git-clone is blocked in this sandbox, so a full `FetchContent` build could not run). The b9840→b9842 range touches **no** patch-target file (`common/arg.{cpp,h}`, `tools/server/server-context.{cpp,h}`, `server-common.cpp`, `test-chat.cpp`, the ~34 standalone mains) and **no** OuteTTS generator anchor (`tools/tts/tts.cpp` unchanged) — the only edits are `common/preset.cpp` and `ggml-vulkan.cpp` — so all patch hunks/offsets are byte-identical to b9840. Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline. |