Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning from version 5.0.0 onward. Pre-fork releases (1.x–4.2.0) were authored by kherud/java-llama.cpp.

Unreleased

5.0.3 - 2026-06-29

Feature release. Headline addition is a full OpenAI-compatible embedded HTTP server with multi-protocol surfaces, plus end-to-end multimodal (vision, audio input, text-to-speech) and slot-bound sessions. Tracks llama.cpp b9555 → b9842.

Added

OpenAI-compatible HTTP server (server package, built on the JDK's com.sun.net.httpserver — no new runtime dependency; embeddable and the fat-jar Main-Class). Serves POST /v1/chat/completions (streaming SSE + non-streaming), /v1/completions (token-by-token streaming), /v1/embeddings, /v1/rerank, /infill, GET /v1/models, GET /health, and GET /props (every route also reachable without the /v1 prefix), with optional bearer auth and CORS — drives editor clients such as VS Code Copilot, Cline, Roo Code, and Continue.
Multi-protocol surfaces over the same inference core (pure translation, no second inference path): Ollama-native (/api/version, /api/tags, /api/show, /api/chat NDJSON, /api/generate), Anthropic Messages (POST /v1/messages, SSE), and OpenAI Responses (POST /v1/responses, SSE).
Agentic tool-calling: parallel_tool_calls support (ChatRequest.withParallelToolCalls(Boolean), InferenceParameters.withParallelToolCalls(boolean), server-mapper pass-through), the ToolCallingAgent chat loop (JSON-serialized tool-result errors), and ToolCallDeltaAccumulator for reconstructing streamed tool calls; real-model integration tests (ToolCallingIntegrationTest, Qwen2.5-1.5B-Instruct).
Text-to-speech (TextToSpeech): OuteTTS (text-to-codes) + WavTokenizer (codes-to-speech) pipeline; synthesize(text) returns a 24 kHz mono 16-bit WAV byte stream. The OuteTTS DSP is derived at build time from upstream tts.cpp rather than hand-copied.
Audio input via OpenAI input_audio content parts (ContentPart.audioFile), for Ultravox / Qwen2.5-Omni-class models.
End-to-end vision input across blocking, typed ChatRequest, streaming, and OpenAI-compatible request mapping; real-model tests verify distinct red/blue images produce the correct semantic answers. Explicit setMmprojAuto(boolean) / setMmprojOffload(boolean) controls (--no-mmproj-auto / --no-mmproj-offload).
Per-request KV controls: InferenceParameters.withSlotId(int) and withCacheReuse(int).
Per-request DRY sampling on InferenceParameters (dry_multiplier / dry_base / dry_allowed_length / dry_penalty_last_n / dry_sequence_breakers).
ModelParameters.enableSwaFull() (--swa-full): keep a full-size SWA KV cache to enable cross-request prompt-prefix reuse.
Typed cache observability: Usage.getCachedTokens(), Usage.getProcessedPromptTokens(), SlotMetrics, ServerMetrics.getSlotMetrics(), plus authenticated JSON GET /metrics and GET /slots.
Windows GPU native classifiers: cuda13-windows-x86-64, vulkan-windows-x86-64, opencl-windows-x86-64, and the msvc-windows CPU classifier (the default Windows CPU JAR flipped to the Ninja Multi-Config generator).
log_helpers.hpp — pure, unit-tested log-formatting helpers (log_level_name, format_log_as_json).

Changed

Upgraded llama.cpp from b9555 to b9842 across eleven incremental upgrades. Notable upstream features now reachable: DRY sampling, --swa-full, DFlash block-diffusion speculative decoding (--spec-type draft-dflash), the MiniCPM5 XML tool-call chat template, the server --reasoning-preserve flag, Jinja min/max array filters, and the DeepSeek-V4 architecture (b9840). The b9829 bump additionally compiles the new upstream server-stream.cpp (resumable-streaming SSE replay buffer) into libjllama. The final b9840→b9842 step is internal-only (preset INI section-tag canonicalization in common/preset.cpp; a Vulkan graph-submission heuristic switched from weight-matrix bytes to estimated FLOPs) — no project source changes, no API impact, all four local patches (0001–0004) apply unchanged across the range.
Replaced the --skip-download flag with --offline (llama.cpp b9803).
Session now pins every inference request to its configured slot, so generation and slot save/restore/erase target the same KV state (SessionState extracted as a testable concurrency contract).
configureParallelInference now applies slot_prompt_similarity live via server_context::set_slot_prompt_similarity() (upstream PR ggml-org/llama.cpp#22393, carried as patches/0003), instead of validating and discarding the value.
Android minimum API level raised from 24 to 28 (Android 9.0 Pie), satisfied via bionic's weak-symbol mechanism rather than __ANDROID_API__.
CI: rolled out the sccache → Depot shared compiler cache across all native build jobs (incl. nvcc wrapping for full-arch CUDA and the Windows Ninja path), fork-PR token-gating, and a shared GGUF model cache.
LlamaLoader native-library extraction is now race-safe (atomic write) and uses a private lock object instead of synchronized methods.
SpotBugs (effort=Max, threshold=Low) made clean and wired into CI; C++ unit suite grown to 459 tests.

Fixed

Per-request reasoning_budget_tokens is now honored (via patches/0004, upstream PR ggml-org/llama.cpp#23116): reasoning_budget_tokens=0 suppresses thinking.
Preserved decoded image buffers across the JNI chat boundary and submitted media requests through llama.cpp's multimodal task path instead of silently tokenizing them as text-only prompts; preserved multipart image content in the typed ChatRequest serializer.
The standalone OpenAI-compatible server now advertises vision only when the loaded model confirms usable vision support.
Cached-token usage is preserved through typed Java responses and the OpenAI Responses / Anthropic blocking and streaming adapters.
Stabilized flaky reasoning-budget tests on Metal by using greedy sampling.

5.0.2 - 2026-06-08

Tracks llama.cpp b9151 → b9555.

Added

CODE_OF_CONDUCT.md (Contributor Covenant 2.0).
docs/RELEASE.md capturing the maintainer-facing release procedure (moved out of CHANGELOG).
OpenSSF Best Practices badge (project 12862) on README.
Reasoning-budget tests (Qwen3-0.6B).

Changed

Reorganized the Java API into subpackages — parameters (ModelParameters, InferenceParameters, …), value (LogLevel, …), callback, exception (LlamaException, …), and loader (LlamaLoader, OSInfo). Source-incompatible for consumers: import statements for the moved types must be updated.
Unified CONTRIBUTING.md and SECURITY.md structure with sibling repositories, and migrated cross-repo CLAUDE.md sections to workspace pointers.
Reconciled Java baseline to 11+ across pom.xml, README badge, CLAUDE.md, and CONTRIBUTING.md.
README license badge corrected from "Apache 2.0" to "MIT" (matches LICENSE file and pom.xml).
pom.xml SCM URL: tree/master → tree/main (default branch renamed).
Upgraded Maven dependencies (incl. logback-classic 1.5.32 → 1.5.33).
Upgraded llama.cpp from b9151 to b9555 across multiple incremental upgrades.

5.0.1 - 2026-05-14

Added

InferenceParameters.setContinueFinalMessage(boolean) for the vLLM/transformers-compatible prefill-assistant heuristic (llama.cpp b9134+).
Tests for setContinueFinalMessage.
Comprehensive Javadoc on public APIs (PR #129).
Maven Central badge on README (PR #130).

Changed

Bumped project version to 5.0.1-SNAPSHOT (PR #127), then released as 5.0.1 (PR #135).
Refactored GitHub release workflow to parallelise snapshot and release jobs (PR #128).
Removed snapshot build documentation and badge (PR #131).
Upgraded Windows CI to windows-2025 with Visual Studio 2026 (PR #132).
Switched Windows MSVC runtime from dynamic (/MD) to static (/MT) to eliminate the msvcp140.dll runtime dependency (PR #133).
Upgraded llama.cpp from b9106 to b9134 (PR #134), then to b9150 (PR #136), then to b9151 (PR #139).
Refactored CI workflow with explicit snapshot/tag check gates (PR #137).
Removed setCtxSizeDraft() — the underlying CLI flag was deleted upstream in llama.cpp b9106.

Fixed

fix(publish): quoted gate job names to avoid YAML colon-in-scalar parse errors (PR #138).
Release routing in the publish workflow now correctly distinguishes snapshot vs. tag pushes.

5.0.0 - 2026-05-11

First release of the fork under the net.ladenthin:llama Maven coordinates. ~100 merged pull requests since baseline 49be664 (the last pre-fork upstream commit).

Added

First publish to Maven Central under net.ladenthin:llama.
Pre-built native libraries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86).
Java API surface: LlamaModel, ModelParameters, InferenceParameters, LlamaIterator / LlamaIterable for streaming, chat completion (chatComplete, generateChat, chatCompleteText), embeddings, reranking, infilling, raw JSON endpoint handlers, slot management (saveSlot, restoreSlot, eraseSlot), and getModelMeta().
chatComplete() for OpenAI-compatible chat completions, re-implemented from scratch based on a patch by @vaiju1981 (PR #61; see docs/history/CHAT_INTEGRATION_SUMMARY.md).
mmproj, reasoning-budget, sigma, and sleep-idle parameters added to ModelParameters.
JaCoCo code-coverage reporting integrated with Coveralls and Codecov (PR #124).
CodeQL static-analysis workflow on push, PR, and a weekly schedule.
Automated Claude Code review workflow on pull requests.
Dependabot for Maven and GitHub Actions dependency updates.
Automatic snapshot release workflow on main push (PR #105) publishing to the Sonatype Central snapshot repository.
CUDA, Metal, and Vulkan build support via local CMake build.
Android integration documented in README.
All system properties (net.ladenthin.llama.*) and LogLevel values documented.
CLAUDE.md maintainer guide covering upstream upgrade procedure and the b5022→b9172 breaking-change table.

Changed

Migrated Maven group and artifact from de.kherud:java-llama.cpp to net.ladenthin:llama (PR #101).
Migrated Maven Central publishing from OSSRH (Legacy) to the Sonatype Central Publisher Portal.
Deleted the hand-ported server.hpp fork (~3,780 lines) and linked the upstream llama.cpp server source files directly into jllama. ~4,100 C++ lines removed in total; future upstream upgrades become a CMake version bump. The Java API is unchanged. See docs/history/REFACTORING.md.
Compiled upstream server-context / queue / task / models directly into jllama (PR #96).
Unified CI into a single publish.yml workflow with cross-compilation, testing, coverage, and release stages.
Upgraded CUDA from 12.1 to 13.2 (PR #50).
Upgraded llama.cpp from b8913 through b9106 across multiple incremental upgrades.
setDraftMax / setDraftMin now emit the canonical --spec-draft-n-max / --spec-draft-n-min flags (llama.cpp b9016 removed the old aliases).
Bumped CI GitHub Actions: actions/checkout v4 → v6, actions/upload-artifact v6 → v7, actions/download-artifact v6 → v8, codeql-action v3 → v4.

Fixed

Javadoc warnings resolved across the public API by adding missing comments.
cache_idle_slots slot-parameter handling aligned with the upstream rename (b8841 → b8854).

Pre-fork history (kherud/java-llama.cpp 1.x–4.2.0)

Releases 1.1.1 through 4.2.0 were authored by @kherud on the upstream repository. The full upstream release notes are at https://github.com/kherud/java-llama.cpp/releases. The fork's baseline is upstream commit 49be664 (tagged v4.2.0, 2025-06-20).

For an architecture-level diff between the pre-fork baseline (49be664) and the first 5.0.0 candidate (24918e4), see docs/history/49be664_24918e4.md. For the server-fork-deletion refactor that culminated in 5.0.0, see docs/history/REFACTORING.md. For the chat-completion integration that landed in 5.0.0, see docs/history/CHAT_INTEGRATION_SUMMARY.md.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changelog

Unreleased

5.0.3 - 2026-06-29

Added

Changed

Fixed

5.0.2 - 2026-06-08

Added

Changed

5.0.1 - 2026-05-14

Added

Changed

Fixed

5.0.0 - 2026-05-11

Added

Changed

Fixed

Pre-fork history (kherud/java-llama.cpp 1.x–4.2.0)

FilesExpand file tree

CHANGELOG.md

Latest commit

History

CHANGELOG.md

File metadata and controls

Changelog

Unreleased

5.0.3 - 2026-06-29

Added

Changed

Fixed

5.0.2 - 2026-06-08

Added

Changed

5.0.1 - 2026-05-14

Added

Changed

Fixed

5.0.0 - 2026-05-11

Added

Changed

Fixed

Pre-fork history (kherud/java-llama.cpp 1.x–4.2.0)