Commit 334fd86

Merge pull request #281 from bernardladenthin/claude/java-llama-cpp-release-cn3o8z
Release 5.0.3: OpenAI-compatible server, multimodal, and llama.cpp b9842
2 parents 7889287 + 1af4f4d commit 334fd86

6 files changed

Lines changed: 66 additions & 43 deletions

CHANGELOG.md

Lines changed: 48 additions & 27 deletions
Original file line number | Diff line number | Diff line change
@@ -9,43 +9,62 @@ from version 5.0.0 onward. Pre-fork releases (`1.x`–`4.2.0`) were authored by
99

1010
## [Unreleased]
1111

12+
## [5.0.3] - 2026-06-29
13+
14+
Feature release. Headline addition is a full OpenAI-compatible embedded HTTP
15+
server with multi-protocol surfaces, plus end-to-end multimodal (vision, audio
16+
input, text-to-speech) and slot-bound sessions. Tracks llama.cpp **b9555 → b9842**.
17+
1218
### Added
13-
- `CODE_OF_CONDUCT.md` (Contributor Covenant 2.0).
14-
- `docs/RELEASE.md` capturing the maintainer-facing release procedure (moved out of CHANGELOG).
15-
- OpenSSF Best Practices badge (project 12862) on README.
16-
- OpenAI-compatible `parallel_tool_calls` support: `ChatRequest.withParallelToolCalls(Boolean)` / `getParallelToolCalls()`, `InferenceParameters.withParallelToolCalls(boolean)`, and pass-through in the `/v1/chat/completions` server mapper.
17-
- Real-model tool-calling integration tests for blocking and streaming required tool calls (`ToolCallingIntegrationTest`, Qwen2.5-1.5B-Instruct), wired into CI and `validate-models`.
18-
- End-to-end vision input across blocking, typed `ChatRequest`, streaming, and OpenAI-compatible request mapping; real-model tests verify that distinct red and blue images produce the correct semantic answers.
19-
- Explicit `setMmprojAuto(boolean)` and `setMmprojOffload(boolean)` controls, including the upstream `--no-mmproj-auto` and `--no-mmproj-offload` flags.
19+
- **OpenAI-compatible HTTP server** (`server` package, built on the JDK's `com.sun.net.httpserver` — no new runtime dependency; embeddable and the fat-jar `Main-Class`). Serves `POST /v1/chat/completions` (streaming SSE + non-streaming), `/v1/completions` (token-by-token streaming), `/v1/embeddings`, `/v1/rerank`, `/infill`, `GET /v1/models`, `GET /health`, and `GET /props` (every route also reachable without the `/v1` prefix), with optional bearer auth and CORS — drives editor clients such as VS Code Copilot, Cline, Roo Code, and Continue.
20+
- **Multi-protocol surfaces** over the same inference core (pure translation, no second inference path): **Ollama-native** (`/api/version`, `/api/tags`, `/api/show`, `/api/chat` NDJSON, `/api/generate`), **Anthropic Messages** (`POST /v1/messages`, SSE), and **OpenAI Responses** (`POST /v1/responses`, SSE).
21+
- **Agentic tool-calling**: `parallel_tool_calls` support (`ChatRequest.withParallelToolCalls(Boolean)`, `InferenceParameters.withParallelToolCalls(boolean)`, server-mapper pass-through), the `ToolCallingAgent` chat loop (JSON-serialized tool-result errors), and `ToolCallDeltaAccumulator` for reconstructing streamed tool calls; real-model integration tests (`ToolCallingIntegrationTest`, Qwen2.5-1.5B-Instruct).
22+
- **Text-to-speech** (`TextToSpeech`): OuteTTS (text-to-codes) + WavTokenizer (codes-to-speech) pipeline; `synthesize(text)` returns a 24 kHz mono 16-bit WAV byte stream. The OuteTTS DSP is derived at build time from upstream `tts.cpp` rather than hand-copied.
23+
- **Audio input** via OpenAI `input_audio` content parts (`ContentPart.audioFile`), for Ultravox / Qwen2.5-Omni-class models.
24+
- **End-to-end vision input** across blocking, typed `ChatRequest`, streaming, and OpenAI-compatible request mapping; real-model tests verify distinct red/blue images produce the correct semantic answers. Explicit `setMmprojAuto(boolean)` / `setMmprojOffload(boolean)` controls (`--no-mmproj-auto` / `--no-mmproj-offload`).
2025
- Per-request KV controls: `InferenceParameters.withSlotId(int)` and `withCacheReuse(int)`.
21-
- Per-request DRY sampling to `InferenceParameters` (`dry_multiplier`/`dry_base`/`dry_allowed_length`/`dry_penalty_last_n`/`dry_sequence_breakers`).
22-
- `ModelParameters.enableSwaFull()` (`--swa-full`): keep full-size SWA KV cache to enable cross-request prompt-prefix reuse.
23-
- Typed cache observability through `Usage.getCachedTokens()`, `Usage.getProcessedPromptTokens()`, `SlotMetrics`, and `ServerMetrics.getSlotMetrics()`.
24-
- Authenticated JSON `GET /metrics` and `GET /slots` endpoints on the embedded server.
26+
- Per-request DRY sampling on `InferenceParameters` (`dry_multiplier` / `dry_base` / `dry_allowed_length` / `dry_penalty_last_n` / `dry_sequence_breakers`).
27+
- `ModelParameters.enableSwaFull()` (`--swa-full`): keep a full-size SWA KV cache to enable cross-request prompt-prefix reuse.
28+
- Typed cache observability: `Usage.getCachedTokens()`, `Usage.getProcessedPromptTokens()`, `SlotMetrics`, `ServerMetrics.getSlotMetrics()`, plus authenticated JSON `GET /metrics` and `GET /slots`.
29+
- **Windows GPU native classifiers**: `cuda13-windows-x86-64`, `vulkan-windows-x86-64`, `opencl-windows-x86-64`, and the `msvc-windows` CPU classifier (the default Windows CPU JAR flipped to the Ninja Multi-Config generator).
30+
- `log_helpers.hpp` — pure, unit-tested log-formatting helpers (`log_level_name`, `format_log_as_json`).
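The OpenAI-compatible server entry above names `POST /v1/chat/completions` with optional bearer auth. A minimal client-side sketch of such a request, using only the JDK's `java.net.http` API, might look like the following. The base URL, API key, model name, and the class `ChatCompletionRequestSketch` are illustrative assumptions, not part of the project; the request is built but not sent, so no running server is needed.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

public class ChatCompletionRequestSketch {

    /** Escapes characters that must not appear raw inside a JSON string. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds a minimal OpenAI-style chat body with one user message. */
    static String body(String model, String userMessage, boolean stream) {
        return "{\"model\":\"" + jsonEscape(model) + "\","
             + "\"messages\":[{\"role\":\"user\",\"content\":\"" + jsonEscape(userMessage) + "\"}],"
             + "\"stream\":" + stream + "}";
    }

    /** Builds the POST request; apiKey may be null when the server runs without auth. */
    static HttpRequest build(String baseUrl, String apiKey, String model, String userMessage) {
        HttpRequest.Builder builder = HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + "/v1/chat/completions"))
            .timeout(Duration.ofSeconds(60))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body(model, userMessage, false)));
        if (apiKey != null) {
            builder.header("Authorization", "Bearer " + apiKey);
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // Hypothetical host, key, and model name; adjust to the actual deployment.
        HttpRequest request = build("http://localhost:8080", "example-key", "example-model", "Say hello");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request is a single call to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`; for the streaming variant the body would set `"stream": true` and the response would be read line by line as SSE.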
2531

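The agentic tool-calling entry mentions reconstructing streamed tool calls. In the OpenAI streaming format, each chunk carries tool-call deltas keyed by `index`; the `id` and function name usually arrive once, while the `arguments` JSON arrives in fragments. The sketch below shows that accumulation idea only. The class `ToolCallDeltaSketch` and its methods are hypothetical and are not the project's `ToolCallDeltaAccumulator` API; JSON parsing of the chunks is assumed to happen elsewhere.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ToolCallDeltaSketch {

    /** Mutable state for one tool call being assembled. */
    static final class Call {
        String id;
        String name;
        final StringBuilder arguments = new StringBuilder();
    }

    // TreeMap keeps calls ordered by their stream index.
    private final Map<Integer, Call> calls = new TreeMap<>();

    /** Applies one delta; null fields mean "not present in this chunk". */
    void accept(int index, String id, String name, String argumentsFragment) {
        Call call = calls.computeIfAbsent(index, i -> new Call());
        if (id != null) call.id = id;
        if (name != null) call.name = name;
        if (argumentsFragment != null) call.arguments.append(argumentsFragment);
    }

    /** Returns one "id name arguments" line per completed call, in index order. */
    List<String> finish() {
        List<String> out = new ArrayList<>();
        for (Call call : calls.values()) {
            out.add(call.id + " " + call.name + " " + call.arguments);
        }
        return out;
    }

    public static void main(String[] args) {
        ToolCallDeltaSketch acc = new ToolCallDeltaSketch();
        // Two parallel tool calls whose argument fragments arrive interleaved.
        acc.accept(0, "call_a", "get_weather", "{\"city\":");
        acc.accept(1, "call_b", "get_time", "{\"tz\":");
        acc.accept(0, null, null, "\"Paris\"}");
        acc.accept(1, null, null, "\"UTC\"}");
        acc.finish().forEach(System.out::println);
    }
}
```

Keying on `index` rather than `id` matters because later deltas for the same call typically omit the `id`.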
2632
### Changed
27-
- Unified `CONTRIBUTING.md` and `SECURITY.md` structure with sibling repositories in the project family.
28-
- Reconciled Java baseline to **11+** across `pom.xml`, README badge, `CLAUDE.md`, and `CONTRIBUTING.md`.
29-
- README license badge corrected from "Apache 2.0" to "MIT" (matches `LICENSE` file and `pom.xml`).
30-
- `pom.xml` SCM URL: `tree/master` → `tree/main` (default branch renamed).
31-
- Upgraded llama.cpp from b9151 to b9172.
32-
- Upgraded llama.cpp from b9803 to b9829. Compiles the new upstream `server-stream.cpp` (resumable-streaming SSE replay buffer) into `libjllama`, required because `server-context`/`server-http`/`server-models` now reference its symbols; refreshed `patches/0001` for the `tests/test-export-graph-ops.cpp` rename and the `server.cpp` GC-init context shift.
33-
- Upgraded llama.cpp from b9829 to b9839. Pure version bump with no project source changes: all four patches (`0001`–`0004`) apply unchanged against b9839, and every upstream change in the range is absorbed inside upstream-compiled translation units. Brings DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`), the MiniCPM5 XML tool-call chat template, a server `--reasoning-preserve` flag (preserve reasoning trace across the full history when the template supports it), and Jinja `min`/`max` array filters; removes the now-unused `common/regex-partial.{cpp,h}` (partial-regex matching is fully inside the PEG parser), which the project never referenced.
34-
- Upgraded llama.cpp from b9839 to b9840. Pure version bump with no project source changes: the range is entirely the new **DeepSeek-V4** architecture (new `deepseek4` arch + dedicated `llama-kv-cache-dsv4` cache, `sqrtsoftplus` MoE gating, hyper-connection/compressor hparams + tensors, conversion scripts and embedded chat template), all absorbed inside upstream-compiled `libllama` and the Python converters. Upstream's `src/CMakeLists.txt` adds the new `llama-kv-cache-dsv4.cpp` itself (built via FetchContent). All four patches (`0001`–`0004`) apply unchanged; the project binds none of the new symbols.
35-
- `configureParallelInference` now applies `slot_prompt_similarity` live via `server_context::set_slot_prompt_similarity()` (upstream PR ggml-org/llama.cpp#22393, carried as `patches/0003` until merged), instead of validating it and discarding the value.
36-
- Extracted the `chatWithTools` agent loop into `ToolCallingAgent`; tool-result errors (unknown tool / handler exception) are now JSON-serialized so tool names containing special characters remain valid JSON.
33+
- Upgraded llama.cpp from **b9555 to b9842** across eleven incremental upgrades. Notable upstream features now reachable: DRY sampling, `--swa-full`, DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`), the MiniCPM5 XML tool-call chat template, the server `--reasoning-preserve` flag, Jinja `min`/`max` array filters, and the **DeepSeek-V4** architecture (b9840). The b9829 bump additionally compiles the new upstream `server-stream.cpp` (resumable-streaming SSE replay buffer) into `libjllama`. The final b9840→b9842 step is internal-only (preset INI section-tag canonicalization in `common/preset.cpp`; a Vulkan graph-submission heuristic switched from weight-matrix bytes to estimated FLOPs): no project source changes, no API impact, and all four local patches (`0001`–`0004`) apply unchanged across the range.
34+
- Replaced the `--skip-download` flag with `--offline` (llama.cpp b9803).
35+
- `Session` now pins every inference request to its configured slot, so generation and slot save/restore/erase target the same KV state (`SessionState` extracted as a testable concurrency contract).
36+
- `configureParallelInference` now applies `slot_prompt_similarity` live via `server_context::set_slot_prompt_similarity()` (upstream PR ggml-org/llama.cpp#22393, carried as `patches/0003`), instead of validating and discarding the value.
37+
- **Android minimum API level raised from 24 to 28** (Android 9.0 Pie), satisfied via bionic's weak-symbol mechanism rather than `__ANDROID_API__`.
38+
- CI: rolled out the sccache → Depot shared compiler cache across all native build jobs (incl. nvcc wrapping for full-arch CUDA and the Windows Ninja path), fork-PR token-gating, and a shared GGUF model cache.
39+
- `LlamaLoader` native-library extraction is now race-safe (atomic write) and uses a private lock object instead of `synchronized` methods.
40+
- SpotBugs (effort=Max, threshold=Low) made clean and wired into CI; C++ unit suite grown to 459 tests.
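The `LlamaLoader` entry above describes race-safe extraction via an atomic write guarded by a private lock object. A common way to get that behavior in Java is to copy into a temporary file in the target directory and then rename it into place, so other readers never observe a half-written library. The sketch below illustrates this general pattern under that assumption; the class `AtomicExtractSketch` and its method are hypothetical and do not reproduce the project's actual loader code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class AtomicExtractSketch {

    // Private lock: outside code cannot synchronize on it and interfere,
    // which is possible when a class uses synchronized static methods.
    private static final Object LOCK = new Object();

    private AtomicExtractSketch() {}

    /** Writes source to target at most once; returns target. */
    static Path extract(InputStream source, Path target) throws IOException {
        synchronized (LOCK) {
            if (Files.exists(target)) {
                return target; // already extracted by an earlier call
            }
            Path dir = target.toAbsolutePath().getParent();
            Files.createDirectories(dir);
            // Temp file in the same directory so the rename stays on one filesystem.
            Path tmp = Files.createTempFile(dir, target.getFileName().toString(), ".tmp");
            try {
                Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
                try {
                    Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
                } catch (AtomicMoveNotSupportedException e) {
                    // Fallback for filesystems without atomic rename (not race-free).
                    Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
                }
            } finally {
                Files.deleteIfExists(tmp); // no-op after a successful move
            }
            return target;
        }
    }
}
```

The in-process lock only serializes threads of one JVM; the rename step is what protects against a second process reading a partially written file.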
3741

3842
### Fixed
39-
- Per-request `reasoning_budget_tokens` is now honored (via `patches/0004`, upstream PR ggml-org/llama.cpp#23116): `reasoning_budget_tokens=0` suppresses thinking. `ReasoningBudgetTest` now asserts the suppression directly (the previous test that pinned the unfixed-bug behavior was removed).
40-
- Preserved decoded image buffers across the JNI chat boundary and submitted media requests through llama.cpp's upstream multimodal task path instead of silently tokenizing them as text-only prompts.
41-
- Preserved multipart image content when using the typed `ChatRequest` serializer.
43+
- Per-request `reasoning_budget_tokens` is now honored (via `patches/0004`, upstream PR ggml-org/llama.cpp#23116): `reasoning_budget_tokens=0` suppresses thinking.
44+
- Preserved decoded image buffers across the JNI chat boundary and submitted media requests through llama.cpp's multimodal task path instead of silently tokenizing them as text-only prompts; preserved multipart image content in the typed `ChatRequest` serializer.
4245
- The standalone OpenAI-compatible server now advertises vision only when the loaded model confirms usable vision support.
43-
- `Session` now pins every inference request to its configured slot, so generation and slot save/restore/erase target the same KV state.
44-
- Cached-token usage is preserved through typed Java responses and OpenAI Responses/Anthropic blocking and streaming adapters.
46+
- Cached-token usage is preserved through typed Java responses and the OpenAI Responses / Anthropic blocking and streaming adapters.
47+
- Stabilized flaky reasoning-budget tests on Metal by using greedy sampling.
48+
49+
## [5.0.2] - 2026-06-08
50+
51+
Tracks llama.cpp **b9151 → b9555**.
4552

4653
### Added
54+
- `CODE_OF_CONDUCT.md` (Contributor Covenant 2.0).
55+
- `docs/RELEASE.md` capturing the maintainer-facing release procedure (moved out of CHANGELOG).
56+
- OpenSSF Best Practices badge (project 12862) on README.
4757
- Reasoning-budget tests (Qwen3-0.6B).
4858

59+
### Changed
60+
- **Reorganized the Java API into subpackages**: `parameters` (`ModelParameters`, `InferenceParameters`, …), `value` (`LogLevel`, …), `callback`, `exception` (`LlamaException`, …), and `loader` (`LlamaLoader`, `OSInfo`). Source-incompatible for consumers: import statements for the moved types must be updated.
61+
- Unified `CONTRIBUTING.md` and `SECURITY.md` structure with sibling repositories, and migrated cross-repo `CLAUDE.md` sections to `workspace` pointers.
62+
- Reconciled Java baseline to **11+** across `pom.xml`, README badge, `CLAUDE.md`, and `CONTRIBUTING.md`.
63+
- README license badge corrected from "Apache 2.0" to "MIT" (matches `LICENSE` file and `pom.xml`).
64+
- `pom.xml` SCM URL: `tree/master` → `tree/main` (default branch renamed).
65+
- Upgraded Maven dependencies (incl. `logback-classic` 1.5.32 → 1.5.33).
66+
- Upgraded llama.cpp from **b9151 to b9555** across multiple incremental upgrades.
67+
4968
## [5.0.1] - 2026-05-14
5069

5170
### Added
@@ -110,6 +129,8 @@ Releases `1.1.1` through `4.2.0` were authored by [@kherud](https://github.com/k
110129

111130
For an architecture-level diff between the pre-fork baseline (`49be664`) and the first 5.0.0 candidate (`24918e4`), see [`docs/history/49be664_24918e4.md`](docs/history/49be664_24918e4.md). For the server-fork-deletion refactor that culminated in 5.0.0, see [`docs/history/REFACTORING.md`](docs/history/REFACTORING.md). For the chat-completion integration that landed in 5.0.0, see [`docs/history/CHAT_INTEGRATION_SUMMARY.md`](docs/history/CHAT_INTEGRATION_SUMMARY.md).
112131

113-
[Unreleased]: https://github.com/bernardladenthin/java-llama.cpp/compare/v5.0.1...HEAD
132+
[Unreleased]: https://github.com/bernardladenthin/java-llama.cpp/compare/v5.0.3...HEAD
133+
[5.0.3]: https://github.com/bernardladenthin/java-llama.cpp/compare/v5.0.2...v5.0.3
134+
[5.0.2]: https://github.com/bernardladenthin/java-llama.cpp/compare/v5.0.1...v5.0.2
114135
[5.0.1]: https://github.com/bernardladenthin/java-llama.cpp/compare/v5.0.0...v5.0.1
115136
[5.0.0]: https://github.com/bernardladenthin/java-llama.cpp/releases/tag/v5.0.0

CLAUDE.md

Lines changed: 4 additions & 4 deletions
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
66

77
Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
88

9-
Current llama.cpp pinned version: **b9840**
9+
Current llama.cpp pinned version: **b9842**
1010

1111
## Upgrading CUDA Version
1212

@@ -286,7 +286,7 @@ needs no extra step here, `build-webui` re-reads the tag and rebuilds the matchi
286286
ships no UI):
287287
```bash
288288
# needs node/npm + network; embed.cpp is plain C++17 (no npm)
289-
git clone --depth 1 --branch b9840 https://github.com/ggml-org/llama.cpp /tmp/lc
289+
git clone --depth 1 --branch b9842 https://github.com/ggml-org/llama.cpp /tmp/lc
290290
( cd /tmp/lc/tools/ui && npm ci && npm run build \
291291
&& ( cd dist && find . -type f -not -path './_gzip/*' \
292292
| while read -r f; do mkdir -p "_gzip/$(dirname "$f")"; gzip -9 -c "$f" > "_gzip/$f"; done ) \
@@ -320,7 +320,7 @@ plus a cache token are present, `build.sh` adds
320320
- `SCCACHE_WEBDAV_TOKEN: ${{ secrets.DEPOT_TOKEN }}` — a Depot **organization** token, stored
321321
as the repo secret **`DEPOT_TOKEN`**.
322322

323-
Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9840`), the
323+
Because `sccache` is **content-addressed** and llama.cpp is pinned (`GIT_TAG b9842`), the
324324
~280 upstream object files are byte-identical every run, so a warm cache recompiles only the
325325
*changed* files. Depot's cache is **shared across all branches** (unlike GitHub's
326326
per-branch `actions/cache`), so every branch builds incrementally; a `b<nnnn>` version bump
@@ -1021,7 +1021,7 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson"
10211021

10221022
#### Upstream source location (in CMake build tree)
10231023

1024-
llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9840`.
1024+
llama.cpp is fetched via CMake FetchContent, pinned to `GIT_TAG b9842`.
10251025

10261026
**GoogleTest** is a separate `BUILD_TESTING`-only FetchContent (`GIT_TAG v1.17.0`), used solely
10271027
by the `jllama_test` C++ unit-test binary — not by the shipped library, and not coupled to the

CMakeLists.txt

Lines changed: 2 additions & 2 deletions
@@ -143,7 +143,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
143143
FetchContent_Declare(
144144
llama.cpp
145145
GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
146-
GIT_TAG b9840
146+
GIT_TAG b9842
147147
PATCH_COMMAND ${CMAKE_COMMAND}
148148
-DPATCH_DIR=${CMAKE_CURRENT_SOURCE_DIR}/patches
149149
-DLLAMA_SRC=<SOURCE_DIR>
@@ -166,7 +166,7 @@ execute_process(
166166
COMMAND ${CMAKE_COMMAND}
167167
-DTTS_SRC=${llama.cpp_SOURCE_DIR}/tools/tts/tts.cpp
168168
-DOUT_CPP=${JLLAMA_TTS_GEN_CPP}
169-
-DLLAMA_TAG=b9840
169+
-DLLAMA_TAG=b9842
170170
-P ${CMAKE_CURRENT_SOURCE_DIR}/cmake/generate-tts-upstream.cmake
171171
RESULT_VARIABLE JLLAMA_TTS_GEN_RESULT
172172
)

README.md

Lines changed: 9 additions & 9 deletions
@@ -7,7 +7,7 @@
77
**Build:**
88
![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)
99
![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20Android-lightgrey)
10-
[![llama.cpp b9840](https://img.shields.io/badge/llama.cpp-%23b9840-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9840)
10+
[![llama.cpp b9842](https://img.shields.io/badge/llama.cpp-%23b9842-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9842)
1111
[![JPMS](https://img.shields.io/badge/JPMS-modular%20JAR-25A162)](https://openjdk.org/projects/jigsaw/)
1212
![JUnit](https://img.shields.io/badge/tested%20with-JUnit6-25A162)
1313
[![JSpecify](https://img.shields.io/badge/JSpecify-1.0.0%20%40NullMarked-25A162)](https://jspecify.dev)
@@ -119,7 +119,7 @@ Access this library via Maven (released versions on Maven Central):
119119
<dependency>
120120
<groupId>net.ladenthin</groupId>
121121
<artifactId>llama</artifactId>
122-
<version>5.0.2</version>
122+
<version>5.0.3</version>
123123
</dependency>
124124
```
125125

@@ -184,54 +184,54 @@ classifier — those are mutually exclusive — and optionally a CPU Windows bui
184184
<dependency>
185185
<groupId>net.ladenthin</groupId>
186186
<artifactId>llama</artifactId>
187-
<version>5.0.2</version>
187+
<version>5.0.3</version>
188188
</dependency>
189189

190190
<!-- CUDA on Linux x86-64 (requires CUDA 13 runtime on the host) -->
191191
<dependency>
192192
<groupId>net.ladenthin</groupId>
193193
<artifactId>llama</artifactId>
194-
<version>5.0.2</version>
194+
<version>5.0.3</version>
195195
<classifier>cuda13-linux-x86-64</classifier>
196196
</dependency>
197197

198198
<!-- OpenCL/Adreno on Android (requires device-provided OpenCL ICD) -->
199199
<dependency>
200200
<groupId>net.ladenthin</groupId>
201201
<artifactId>llama</artifactId>
202-
<version>5.0.2</version>
202+
<version>5.0.3</version>
203203
<classifier>opencl-android-aarch64</classifier>
204204
</dependency>
205205

206206
<!-- CUDA on Windows x86-64 (requires CUDA 13 Toolkit on the host) -->
207207
<dependency>
208208
<groupId>net.ladenthin</groupId>
209209
<artifactId>llama</artifactId>
210-
<version>5.0.2</version>
210+
<version>5.0.3</version>
211211
<classifier>cuda13-windows-x86-64</classifier>
212212
</dependency>
213213

214214
<!-- Vulkan on Windows x86-64 (NVIDIA/AMD/Intel; vulkan-1.dll from the driver) -->
215215
<dependency>
216216
<groupId>net.ladenthin</groupId>
217217
<artifactId>llama</artifactId>
218-
<version>5.0.2</version>
218+
<version>5.0.3</version>
219219
<classifier>vulkan-windows-x86-64</classifier>
220220
</dependency>
221221

222222
<!-- OpenCL on Windows x86-64 (requires a driver-provided OpenCL ICD) -->
223223
<dependency>
224224
<groupId>net.ladenthin</groupId>
225225
<artifactId>llama</artifactId>
226-
<version>5.0.2</version>
226+
<version>5.0.3</version>
227227
<classifier>opencl-windows-x86-64</classifier>
228228
</dependency>
229229

230230
<!-- Windows CPU natives built with the MSVC / Visual Studio generator -->
231231
<dependency>
232232
<groupId>net.ladenthin</groupId>
233233
<artifactId>llama</artifactId>
234-
<version>5.0.2</version>
234+
<version>5.0.3</version>
235235
<classifier>msvc-windows</classifier>
236236
</dependency>
237237
```
