All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog,
and this project adheres to Semantic Versioning
from version 5.0.0 onward. Pre-fork releases (1.x–4.2.0) were authored by
kherud/java-llama.cpp.
5.0.3 - 2026-06-29
Feature release. Headline addition is a full OpenAI-compatible embedded HTTP server with multi-protocol surfaces, plus end-to-end multimodal (vision, audio input, text-to-speech) and slot-bound sessions. Tracks llama.cpp b9555 → b9842.
- OpenAI-compatible HTTP server (server package, built on the JDK's com.sun.net.httpserver, so no new runtime dependency; embeddable and the fat-jar Main-Class). Serves POST /v1/chat/completions (streaming SSE + non-streaming), /v1/completions (token-by-token streaming), /v1/embeddings, /v1/rerank, /infill, GET /v1/models, GET /health, and GET /props (every route also reachable without the /v1 prefix), with optional bearer auth and CORS. Drives editor clients such as VS Code Copilot, Cline, Roo Code, and Continue.
- Multi-protocol surfaces over the same inference core (pure translation, no second inference path): Ollama-native (/api/version, /api/tags, /api/show, /api/chat NDJSON, /api/generate), Anthropic Messages (POST /v1/messages, SSE), and OpenAI Responses (POST /v1/responses, SSE).
- Agentic tool-calling: parallel_tool_calls support (ChatRequest.withParallelToolCalls(Boolean), InferenceParameters.withParallelToolCalls(boolean), server-mapper pass-through), the ToolCallingAgent chat loop (JSON-serialized tool-result errors), and ToolCallDeltaAccumulator for reconstructing streamed tool calls; real-model integration tests (ToolCallingIntegrationTest, Qwen2.5-1.5B-Instruct).
- Text-to-speech (TextToSpeech): OuteTTS (text-to-codes) + WavTokenizer (codes-to-speech) pipeline; synthesize(text) returns a 24 kHz mono 16-bit WAV byte stream. The OuteTTS DSP is derived at build time from upstream tts.cpp rather than hand-copied.
- Audio input via OpenAI input_audio content parts (ContentPart.audioFile), for Ultravox / Qwen2.5-Omni-class models.
- End-to-end vision input across blocking, typed ChatRequest, streaming, and OpenAI-compatible request mapping; real-model tests verify distinct red/blue images produce the correct semantic answers. Explicit setMmprojAuto(boolean)/setMmprojOffload(boolean) controls (--no-mmproj-auto/--no-mmproj-offload).
- Per-request KV controls: InferenceParameters.withSlotId(int) and withCacheReuse(int).
- Per-request DRY sampling on InferenceParameters (dry_multiplier/dry_base/dry_allowed_length/dry_penalty_last_n/dry_sequence_breakers).
- ModelParameters.enableSwaFull() (--swa-full): keep a full-size SWA KV cache to enable cross-request prompt-prefix reuse.
- Typed cache observability: Usage.getCachedTokens(), Usage.getProcessedPromptTokens(), SlotMetrics, ServerMetrics.getSlotMetrics(), plus authenticated JSON GET /metrics and GET /slots.
- Windows GPU native classifiers: cuda13-windows-x86-64, vulkan-windows-x86-64, opencl-windows-x86-64, and the msvc-windows CPU classifier (the default Windows CPU JAR flipped to the Ninja Multi-Config generator).
- log_helpers.hpp: pure, unit-tested log-formatting helpers (log_level_name, format_log_as_json).
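
For readers wondering why streamed tool calls need reconstructing: in the OpenAI streaming format, each tool call arrives as a series of deltas keyed by an index, with the id and function name typically sent once and the JSON arguments split across chunks. The sketch below illustrates that general technique only. The class name StreamedToolCalls and its accept/finish methods are hypothetical and are not the project's ToolCallDeltaAccumulator, whose actual API is not shown in this changelog.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not the project's ToolCallDeltaAccumulator. */
public class StreamedToolCalls {

    /** One reconstructed call: id and name arrive once, arguments arrive in pieces. */
    public static final class Call {
        String id;
        String name;
        final StringBuilder arguments = new StringBuilder();

        @Override
        public String toString() {
            return id + ":" + name + "(" + arguments + ")";
        }
    }

    // Keyed by the delta's index so parallel calls stay separate and ordered.
    private final Map<Integer, Call> calls = new TreeMap<>();

    /** Merges one delta; a null field means "not present in this chunk". */
    public void accept(int index, String id, String name, String argumentsFragment) {
        Call call = calls.computeIfAbsent(index, i -> new Call());
        if (id != null) {
            call.id = id;
        }
        if (name != null) {
            call.name = name;
        }
        if (argumentsFragment != null) {
            call.arguments.append(argumentsFragment);
        }
    }

    /** Returns the calls in index order once the stream has finished. */
    public List<Call> finish() {
        return new ArrayList<>(calls.values());
    }

    public static void main(String[] args) {
        StreamedToolCalls acc = new StreamedToolCalls();
        // Two parallel calls whose deltas arrive interleaved.
        acc.accept(0, "call_a", "get_weather", "{\"city\":");
        acc.accept(1, "call_b", "get_time", "{\"tz\":\"UTC\"}");
        acc.accept(0, null, null, "\"Berlin\"}");
        for (Call c : acc.finish()) {
            System.out.println(c);
        }
    }
}
```

Running main prints one line per reconstructed call, with the argument fragments for index 0 joined into a single JSON object. Keying by index rather than by id is the design choice that matters here, since later deltas for a call usually omit the id.
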
- Upgraded llama.cpp from b9555 to b9842 across eleven incremental upgrades. Notable upstream features now reachable: DRY sampling, --swa-full, DFlash block-diffusion speculative decoding (--spec-type draft-dflash), the MiniCPM5 XML tool-call chat template, the server --reasoning-preserve flag, Jinja min/max array filters, and the DeepSeek-V4 architecture (b9840). The b9829 bump additionally compiles the new upstream server-stream.cpp (resumable-streaming SSE replay buffer) into libjllama. The final b9840→b9842 step is internal-only (preset INI section-tag canonicalization in common/preset.cpp; a Vulkan graph-submission heuristic switched from weight-matrix bytes to estimated FLOPs): no project source changes, no API impact, and all four local patches (0001–0004) apply unchanged across the range.
- Replaced the --skip-download flag with --offline (llama.cpp b9803).
- Session now pins every inference request to its configured slot, so generation and slot save/restore/erase target the same KV state (SessionState extracted as a testable concurrency contract).
- configureParallelInference now applies slot_prompt_similarity live via server_context::set_slot_prompt_similarity() (upstream PR ggml-org/llama.cpp#22393, carried as patches/0003), instead of validating and discarding the value.
- Android minimum API level raised from 24 to 28 (Android 9.0 Pie), satisfied via bionic's weak-symbol mechanism rather than __ANDROID_API__.
- CI: rolled out the sccache → Depot shared compiler cache across all native build jobs (incl. nvcc wrapping for full-arch CUDA and the Windows Ninja path), fork-PR token-gating, and a shared GGUF model cache.
- LlamaLoader native-library extraction is now race-safe (atomic write) and uses a private lock object instead of synchronized methods.
- SpotBugs (effort=Max, threshold=Low) made clean and wired into CI; C++ unit suite grown to 459 tests.
- Per-request reasoning_budget_tokens is now honored (via patches/0004, upstream PR ggml-org/llama.cpp#23116): reasoning_budget_tokens=0 suppresses thinking.
- Preserved decoded image buffers across the JNI chat boundary and submitted media requests through llama.cpp's multimodal task path instead of silently tokenizing them as text-only prompts; preserved multipart image content in the typed ChatRequest serializer.
- The standalone OpenAI-compatible server now advertises vision only when the loaded model confirms usable vision support.
- Cached-token usage is preserved through typed Java responses and the OpenAI Responses / Anthropic blocking and streaming adapters.
- Stabilized flaky reasoning-budget tests on Metal by using greedy sampling.
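
As a rough illustration of how a client might target the POST /v1/chat/completions route and the optional bearer auth introduced in this release, the sketch below builds (but does not send) a request using only the JDK's java.net.http API. The base URL http://localhost:8080, the API key, and the model name local-model are assumptions made for the example; this changelog does not state the server's default port or model naming, and the JSON body follows the general OpenAI chat-completions shape rather than anything project-specific.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

/** Hypothetical client-side sketch; uses only JDK 11+ classes. */
public class ChatCompletionRequestSketch {

    /** Builds a non-streaming chat-completions request without sending it. */
    public static HttpRequest build(String baseUrl, String apiKey, String model, String userText) {
        // Minimal OpenAI-style body; a real client would use a JSON library.
        String body = "{\"model\":\"" + escape(model) + "\","
                + "\"stream\":false,"
                + "\"messages\":[{\"role\":\"user\",\"content\":\"" + escape(userText) + "\"}]}";
        HttpRequest.Builder builder = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/v1/chat/completions"))
                .timeout(Duration.ofSeconds(60))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8));
        // Bearer auth is optional on the server, so only add the header when a key is given.
        if (apiKey != null) {
            builder.header("Authorization", "Bearer " + apiKey);
        }
        return builder.build();
    }

    /** Escapes only backslashes and double quotes; enough for this sketch, not for arbitrary text. */
    public static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        // Assumed values: the port, key, and model name are placeholders.
        HttpRequest request = build("http://localhost:8080", "secret", "local-model", "Say \"hi\".");
        System.out.println(request.method() + " " + request.uri());
        // To send it against a running server:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Because every route is also reachable without the /v1 prefix, the same builder would work with the path shortened to /chat/completions.
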
5.0.2 - 2026-06-08
Tracks llama.cpp b9151 → b9555.
- CODE_OF_CONDUCT.md (Contributor Covenant 2.0).
- docs/RELEASE.md capturing the maintainer-facing release procedure (moved out of CHANGELOG).
- OpenSSF Best Practices badge (project 12862) on README.
- Reasoning-budget tests (Qwen3-0.6B).
- Reorganized the Java API into subpackages: parameters (ModelParameters, InferenceParameters, …), value (LogLevel, …), callback, exception (LlamaException, …), and loader (LlamaLoader, OSInfo). Source-incompatible for consumers: import statements for the moved types must be updated.
- Unified CONTRIBUTING.md and SECURITY.md structure with sibling repositories, and migrated cross-repo CLAUDE.md sections to workspace pointers.
- Reconciled Java baseline to 11+ across pom.xml, README badge, CLAUDE.md, and CONTRIBUTING.md.
- README license badge corrected from "Apache 2.0" to "MIT" (matches LICENSE file and pom.xml).
- pom.xml SCM URL: tree/master → tree/main (default branch renamed).
- Upgraded Maven dependencies (incl. logback-classic 1.5.32 → 1.5.33).
- Upgraded llama.cpp from b9151 to b9555 across multiple incremental upgrades.
5.0.1 - 2026-05-14
- InferenceParameters.setContinueFinalMessage(boolean) for the vLLM/transformers-compatible prefill-assistant heuristic (llama.cpp b9134+).
- Tests for setContinueFinalMessage.
- Comprehensive Javadoc on public APIs (PR #129).
- Maven Central badge on README (PR #130).
- Bumped project version to 5.0.1-SNAPSHOT (PR #127), then released as 5.0.1 (PR #135).
- Refactored GitHub release workflow to parallelise snapshot and release jobs (PR #128).
- Removed snapshot build documentation and badge (PR #131).
- Upgraded Windows CI to windows-2025 with Visual Studio 2026 (PR #132).
- Switched Windows MSVC runtime from dynamic (/MD) to static (/MT) to eliminate the msvcp140.dll runtime dependency (PR #133).
- Upgraded llama.cpp from b9106 to b9134 (PR #134), then to b9150 (PR #136), then to b9151 (PR #139).
- Refactored CI workflow with explicit snapshot/tag check gates (PR #137).
- Removed setCtxSizeDraft(); the underlying CLI flag was deleted upstream in llama.cpp b9106.
- fix(publish): quoted gate job names to avoid YAML colon-in-scalar parse errors (PR #138).
- Release routing in the publish workflow now correctly distinguishes snapshot vs. tag pushes.
5.0.0 - 2026-05-11
First release of the fork under the net.ladenthin:llama Maven coordinates. ~100 merged pull requests since baseline 49be664 (the last pre-fork upstream commit).
- First publish to Maven Central under net.ladenthin:llama.
- Pre-built native libraries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86).
- Java API surface: LlamaModel, ModelParameters, InferenceParameters, LlamaIterator/LlamaIterable for streaming, chat completion (chatComplete, generateChat, chatCompleteText), embeddings, reranking, infilling, raw JSON endpoint handlers, slot management (saveSlot, restoreSlot, eraseSlot), and getModelMeta().
- chatComplete() for OpenAI-compatible chat completions, re-implemented from scratch based on a patch by @vaiju1981 (PR #61; see docs/history/CHAT_INTEGRATION_SUMMARY.md).
- mmproj, reasoning-budget, sigma, and sleep-idle parameters added to ModelParameters.
- JaCoCo code-coverage reporting integrated with Coveralls and Codecov (PR #124).
- CodeQL static-analysis workflow on push, PR, and a weekly schedule.
- Automated Claude Code review workflow on pull requests.
- Dependabot for Maven and GitHub Actions dependency updates.
- Automatic snapshot release workflow on main push (PR #105) publishing to the Sonatype Central snapshot repository.
- CUDA, Metal, and Vulkan build support via local CMake build.
- Android integration documented in README.
- All system properties (net.ladenthin.llama.*) and LogLevel values documented.
- CLAUDE.md maintainer guide covering upstream upgrade procedure and the b5022→b9172 breaking-change table.
- Migrated Maven group and artifact from de.kherud:java-llama.cpp to net.ladenthin:llama (PR #101).
- Migrated Maven Central publishing from OSSRH (Legacy) to the Sonatype Central Publisher Portal.
- Deleted the hand-ported server.hpp fork (~3,780 lines) and linked the upstream llama.cpp server source files directly into jllama. ~4,100 C++ lines removed in total; future upstream upgrades become a CMake version bump. The Java API is unchanged. See docs/history/REFACTORING.md.
- Compiled upstream server-context / queue / task / models directly into jllama (PR #96).
- Unified CI into a single publish.yml workflow with cross-compilation, testing, coverage, and release stages.
- Upgraded CUDA from 12.1 to 13.2 (PR #50).
- Upgraded llama.cpp from b8913 through b9106 across multiple incremental upgrades.
- setDraftMax/setDraftMin now emit the canonical --spec-draft-n-max/--spec-draft-n-min flags (llama.cpp b9016 removed the old aliases).
- Bumped CI GitHub Actions: actions/checkout v4 → v6, actions/upload-artifact v6 → v7, actions/download-artifact v6 → v8, codeql-action v3 → v4.
- Javadoc warnings resolved across the public API by adding missing comments.
- cache_idle_slots slot-parameter handling aligned with the upstream rename (b8841 → b8854).
Releases 1.1.1 through 4.2.0 were authored by @kherud on the upstream repository. The full upstream release notes are at
https://github.com/kherud/java-llama.cpp/releases. The fork's baseline is upstream commit 49be664 (tagged v4.2.0, 2025-06-20).
For an architecture-level diff between the pre-fork baseline (49be664) and the first 5.0.0 candidate (24918e4), see docs/history/49be664_24918e4.md. For the server-fork-deletion refactor that culminated in 5.0.0, see docs/history/REFACTORING.md. For the chat-completion integration that landed in 5.0.0, see docs/history/CHAT_INTEGRATION_SUMMARY.md.