CHANGELOG: rewrite 5.0.2 and 5.0.3 from actual git history
- 5.0.3: comprehensive feature set (OpenAI server + multi-protocol,
vision/audio/TTS multimodality, slot-bound sessions, Windows GPU
classifiers, Android API 28); correct llama.cpp range b9555 -> b9840.
- 5.0.2: accurate range b9151 -> b9555; document the source-incompatible
Java package reorganization into subpackages.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01URUX3HiqQ1wzJnT8qn8c8E
CHANGELOG.md: 33 additions & 22 deletions
@@ -11,48 +11,59 @@ from version 5.0.0 onward. Pre-fork releases (`1.x`–`4.2.0`) were authored by
11
11
12
12
## [5.0.3] - 2026-06-29
13
13
14
+
Feature release. Headline addition is a full OpenAI-compatible embedded HTTP
15
+
server with multi-protocol surfaces, plus end-to-end multimodal (vision, audio
16
+
input, text-to-speech) and slot-bound sessions. Tracks llama.cpp **b9555 → b9840**.
17
+
14
18
### Added
15
-
- OpenAI-compatible `parallel_tool_calls` support: `ChatRequest.withParallelToolCalls(Boolean)` / `getParallelToolCalls()`, `InferenceParameters.withParallelToolCalls(boolean)`, and pass-through in the `/v1/chat/completions` server mapper.
16
-
- Real-model tool-calling integration tests for blocking and streaming required tool calls (`ToolCallingIntegrationTest`, Qwen2.5-1.5B-Instruct), wired into CI and `validate-models`.
17
-
- End-to-end vision input across blocking, typed `ChatRequest`, streaming, and OpenAI-compatible request mapping; real-model tests verify that distinct red and blue images produce the correct semantic answers.
18
-
- Explicit `setMmprojAuto(boolean)` and `setMmprojOffload(boolean)` controls, including the upstream `--no-mmproj-auto` and `--no-mmproj-offload` flags.
19
+
- **OpenAI-compatible HTTP server** (`server` package, built on the JDK's `com.sun.net.httpserver` — no new runtime dependency; embeddable and the fat-jar `Main-Class`). Serves `POST /v1/chat/completions` (streaming SSE + non-streaming), `/v1/completions` (token-by-token streaming), `/v1/embeddings`, `/v1/rerank`, `/infill`, `GET /v1/models`, `GET /health`, and `GET /props` (every route also reachable without the `/v1` prefix), with optional bearer auth and CORS — drives editor clients such as VS Code Copilot, Cline, Roo Code, and Continue.
20
+
- **Multi-protocol surfaces** over the same inference core (pure translation, no second inference path): **Ollama-native** (`/api/version`, `/api/tags`, `/api/show`, `/api/chat` NDJSON, `/api/generate`), **Anthropic Messages** (`POST /v1/messages`, SSE), and **OpenAI Responses** (`POST /v1/responses`, SSE).
21
+
- **Agentic tool-calling**: `parallel_tool_calls` support (`ChatRequest.withParallelToolCalls(Boolean)`, `InferenceParameters.withParallelToolCalls(boolean)`, server-mapper pass-through), the `ToolCallingAgent` chat loop (JSON-serialized tool-result errors), and `ToolCallDeltaAccumulator` for reconstructing streamed tool calls; real-model integration tests (`ToolCallingIntegrationTest`, Qwen2.5-1.5B-Instruct).
22
+
- **Text-to-speech** (`TextToSpeech`): OuteTTS (text-to-codes) + WavTokenizer (codes-to-speech) pipeline; `synthesize(text)` returns a 24 kHz mono 16-bit WAV byte stream. The OuteTTS DSP is derived at build time from upstream `tts.cpp` rather than hand-copied.
23
+
- **Audio input** via OpenAI `input_audio` content parts (`ContentPart.audioFile`), for Ultravox / Qwen2.5-Omni-class models.
24
+
- **End-to-end vision input** across blocking, typed `ChatRequest`, streaming, and OpenAI-compatible request mapping; real-model tests verify distinct red/blue images produce the correct semantic answers. Explicit `setMmprojAuto(boolean)` / `setMmprojOffload(boolean)` controls (`--no-mmproj-auto` / `--no-mmproj-offload`).
19
25
- Per-request KV controls: `InferenceParameters.withSlotId(int)` and `withCacheReuse(int)`.
- Typed cache observability through `Usage.getCachedTokens()`, `Usage.getProcessedPromptTokens()`, `SlotMetrics`, and `ServerMetrics.getSlotMetrics()`.
23
-
- Authenticated JSON `GET /metrics` and `GET /slots` endpoints on the embedded server.
24
-
- Windows GPU native classifiers: `cuda13-windows-x86-64`, `vulkan-windows-x86-64`, and `opencl-windows-x86-64`; the default Windows CPU JAR flipped to the Ninja Multi-Config generator with an `msvc-windows` classifier preserving the Visual Studio build.
- `ModelParameters.enableSwaFull()` (`--swa-full`): keep a full-size SWA KV cache to enable cross-request prompt-prefix reuse.
28
+
- Typed cache observability: `Usage.getCachedTokens()`, `Usage.getProcessedPromptTokens()`, `SlotMetrics`, `ServerMetrics.getSlotMetrics()`, plus authenticated JSON `GET /metrics` and `GET /slots`.
29
+
- **Windows GPU native classifiers**: `cuda13-windows-x86-64`, `vulkan-windows-x86-64`, `opencl-windows-x86-64`, and the `msvc-windows` CPU classifier (the default Windows CPU JAR flipped to the Ninja Multi-Config generator).
- Upgraded llama.cpp from b9172 to b9803 across multiple incremental upgrades.
28
-
- Upgraded llama.cpp from b9803 to b9829. Compiles the new upstream `server-stream.cpp` (resumable-streaming SSE replay buffer) into `libjllama`, required because `server-context`/`server-http`/`server-models` now reference its symbols; refreshed `patches/0001` for the `tests/test-export-graph-ops.cpp` rename and the `server.cpp` GC-init context shift.
29
-
- Upgraded llama.cpp from b9829 to b9839. Pure version bump — no project source changes: all four patches (`0001`–`0004`) apply unchanged against b9839, and every upstream change in the range is absorbed inside upstream-compiled translation units. Brings DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`), the MiniCPM5 XML tool-call chat template, a server `--reasoning-preserve` flag (preserve reasoning trace across the full history when the template supports it), and Jinja `min`/`max` array filters; removes the now-unused `common/regex-partial.{cpp,h}` (partial-regex matching is fully inside the PEG parser), which the project never referenced.
30
-
- Upgraded llama.cpp from b9839 to b9840. Pure version bump — no project source changes: the range is entirely the new **DeepSeek-V4** architecture (new `deepseek4` arch + dedicated `llama-kv-cache-dsv4` cache, `sqrtsoftplus` MoE gating, hyper-connection/compressor hparams + tensors, conversion scripts and embedded chat template), all absorbed inside upstream-compiled `libllama` and the Python converters. Upstream's `src/CMakeLists.txt` adds the new `llama-kv-cache-dsv4.cpp` itself (built via FetchContent). All four patches (`0001`–`0004`) apply unchanged; the project binds none of the new symbols.
31
-
- `configureParallelInference` now applies `slot_prompt_similarity` live via `server_context::set_slot_prompt_similarity()` (upstream PR ggml-org/llama.cpp#22393, carried as `patches/0003` until merged), instead of validating it and discarding the value.
32
-
- Extracted the `chatWithTools` agent loop into `ToolCallingAgent`; tool-result errors (unknown tool / handler exception) are now JSON-serialized so tool names containing special characters remain valid JSON.
33
+
- Upgraded llama.cpp from **b9555 to b9840** across ten incremental upgrades. Notable upstream features now reachable: DRY sampling, `--swa-full`, DFlash block-diffusion speculative decoding (`--spec-type draft-dflash`), the MiniCPM5 XML tool-call chat template, the server `--reasoning-preserve` flag, Jinja `min`/`max` array filters, and the **DeepSeek-V4** architecture (b9840). The b9829 bump additionally compiles the new upstream `server-stream.cpp` (resumable-streaming SSE replay buffer) into `libjllama`. All four local patches (`0001`–`0004`) apply across the range.
34
+
- Replaced the `--skip-download` flag with `--offline` (llama.cpp b9803).
35
+
- `Session` now pins every inference request to its configured slot, so generation and slot save/restore/erase target the same KV state (`SessionState` extracted as a testable concurrency contract).
36
+
- `configureParallelInference` now applies `slot_prompt_similarity` live via `server_context::set_slot_prompt_similarity()` (upstream PR ggml-org/llama.cpp#22393, carried as `patches/0003`), instead of validating and discarding the value.
37
+
- **Android minimum API level raised from 24 to 28** (Android 9.0 Pie), satisfied via bionic's weak-symbol mechanism rather than `__ANDROID_API__`.
38
+
- CI: rolled out the sccache → Depot shared compiler cache across all native build jobs (incl. nvcc wrapping for full-arch CUDA and the Windows Ninja path), fork-PR token-gating, and a shared GGUF model cache.
39
+
- `LlamaLoader` native-library extraction is now race-safe (atomic write) and uses a private lock object instead of `synchronized` methods.
40
+
- SpotBugs (effort=Max, threshold=Low) made clean and wired into CI; C++ unit suite grown to 459 tests.
33
41
34
42
### Fixed
35
-
- Per-request `reasoning_budget_tokens` is now honored (via `patches/0004`, upstream PR ggml-org/llama.cpp#23116): `reasoning_budget_tokens=0` suppresses thinking. `ReasoningBudgetTest` now asserts the suppression directly (the previous test that pinned the unfixed-bug behavior was removed).
36
-
- Preserved decoded image buffers across the JNI chat boundary and submitted media requests through llama.cpp's upstream multimodal task path instead of silently tokenizing them as text-only prompts.
37
-
- Preserved multipart image content when using the typed `ChatRequest` serializer.
43
+
- Per-request `reasoning_budget_tokens` is now honored (via `patches/0004`, upstream PR ggml-org/llama.cpp#23116): `reasoning_budget_tokens=0` suppresses thinking.
44
+
- Preserved decoded image buffers across the JNI chat boundary and submitted media requests through llama.cpp's multimodal task path instead of silently tokenizing them as text-only prompts; preserved multipart image content in the typed `ChatRequest` serializer.
38
45
- The standalone OpenAI-compatible server now advertises vision only when the loaded model confirms usable vision support.
39
-
- `Session` now pins every inference request to its configured slot, so generation and slot save/restore/erase target the same KV state.
40
-
- Cached-token usage is preserved through typed Java responses and OpenAI Responses/Anthropic blocking and streaming adapters.
46
+
- Cached-token usage is preserved through typed Java responses and the OpenAI Responses / Anthropic blocking and streaming adapters.
47
+
- Stabilized flaky reasoning-budget tests on Metal by using greedy sampling.
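
The 5.0.3 entries above describe an OpenAI-compatible `POST /v1/chat/completions` route with optional bearer auth and `parallel_tool_calls` pass-through. As a rough illustration only, the sketch below builds such a request with the JDK's `java.net.http` client (Java 11+). The base URL, model name, token, and the class and method names are assumptions made for this example and are not part of the project's API; the request is constructed but not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Hypothetical helper: builds (but does not send) a chat-completions request. */
public final class ChatCompletionRequestSketch {

    private ChatCompletionRequestSketch() {}

    /** Minimal JSON body; "local-model" and the message text are placeholder values. */
    public static String body(boolean parallelToolCalls) {
        return "{"
            + "\"model\":\"local-model\","
            + "\"stream\":false,"
            + "\"parallel_tool_calls\":" + parallelToolCalls + ","
            + "\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}]"
            + "}";
    }

    /** baseUrl and bearerToken are assumed values supplied by the caller. */
    public static HttpRequest build(String baseUrl, String bearerToken, boolean parallelToolCalls) {
        HttpRequest.Builder builder = HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + "/v1/chat/completions"))
            .timeout(Duration.ofSeconds(60))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body(parallelToolCalls)));
        if (bearerToken != null && !bearerToken.isEmpty()) {
            // Bearer auth is described as optional, so the header is added only when a token is given.
            builder.header("Authorization", "Bearer " + bearerToken);
        }
        return builder.build();
    }

    public static void main(String[] args) {
        // Assumed local address; adjust to wherever the embedded server is listening.
        HttpRequest request = build("http://localhost:8080", "example-token", true);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request would then be a matter of passing it to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` against a running server; the streaming variant of the same route is listed above as SSE and would need a line-oriented body handler instead.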
41
48
42
49
## [5.0.2] - 2026-06-08
43
50
51
+
Tracks llama.cpp **b9151 → b9555**.
52
+
44
53
### Added
45
54
- `CODE_OF_CONDUCT.md` (Contributor Covenant 2.0).
46
55
- `docs/RELEASE.md` capturing the maintainer-facing release procedure (moved out of CHANGELOG).
47
56
- OpenSSF Best Practices badge (project 12862) on README.
48
57
- Reasoning-budget tests (Qwen3-0.6B).
49
58
50
59
### Changed
51
-
- Unified `CONTRIBUTING.md` and `SECURITY.md` structure with sibling repositories in the project family.
60
+
- **Reorganized the Java API into subpackages** — `parameters` (`ModelParameters`, `InferenceParameters`, …), `value` (`LogLevel`, …), `callback`, `exception` (`LlamaException`, …), and `loader` (`LlamaLoader`, `OSInfo`). Source-incompatible for consumers: import statements for the moved types must be updated.
61
+
- Unified `CONTRIBUTING.md` and `SECURITY.md` structure with sibling repositories, and migrated cross-repo `CLAUDE.md` sections to `workspace` pointers.
52
62
- Reconciled Java baseline to **11+** across `pom.xml`, README badge, `CLAUDE.md`, and `CONTRIBUTING.md`.
53
63
- README license badge corrected from "Apache 2.0" to "MIT" (matches `LICENSE` file and `pom.xml`).