Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning from version 5.0.0 onward. Pre-fork releases (1.x–4.2.0) were authored by kherud/java-llama.cpp.

5.0.3 - 2026-06-29

Feature release. Headline addition is a full OpenAI-compatible embedded HTTP server with multi-protocol surfaces, plus end-to-end multimodal (vision, audio input, text-to-speech) and slot-bound sessions. Tracks llama.cpp b9555 → b9842.

Added

  • OpenAI-compatible HTTP server (server package, built on the JDK's com.sun.net.httpserver — no new runtime dependency; embeddable and the fat-jar Main-Class). Serves POST /v1/chat/completions (streaming SSE + non-streaming), /v1/completions (token-by-token streaming), /v1/embeddings, /v1/rerank, /infill, GET /v1/models, GET /health, and GET /props (every route also reachable without the /v1 prefix), with optional bearer auth and CORS — drives editor clients such as VS Code Copilot, Cline, Roo Code, and Continue.
  • Multi-protocol surfaces over the same inference core (pure translation, no second inference path): Ollama-native (/api/version, /api/tags, /api/show, /api/chat NDJSON, /api/generate), Anthropic Messages (POST /v1/messages, SSE), and OpenAI Responses (POST /v1/responses, SSE).
  • Agentic tool-calling: parallel_tool_calls support (ChatRequest.withParallelToolCalls(Boolean), InferenceParameters.withParallelToolCalls(boolean), server-mapper pass-through), the ToolCallingAgent chat loop (JSON-serialized tool-result errors), and ToolCallDeltaAccumulator for reconstructing streamed tool calls; real-model integration tests (ToolCallingIntegrationTest, Qwen2.5-1.5B-Instruct).
  • Text-to-speech (TextToSpeech): OuteTTS (text-to-codes) + WavTokenizer (codes-to-speech) pipeline; synthesize(text) returns a 24 kHz mono 16-bit WAV byte stream. The OuteTTS DSP is derived at build time from upstream tts.cpp rather than hand-copied.
  • Audio input via OpenAI input_audio content parts (ContentPart.audioFile), for Ultravox / Qwen2.5-Omni-class models.
  • End-to-end vision input across blocking, typed ChatRequest, streaming, and OpenAI-compatible request mapping; real-model tests verify distinct red/blue images produce the correct semantic answers. Explicit setMmprojAuto(boolean) / setMmprojOffload(boolean) controls (--no-mmproj-auto / --no-mmproj-offload).
  • Per-request KV controls: InferenceParameters.withSlotId(int) and withCacheReuse(int).
  • Per-request DRY sampling on InferenceParameters (dry_multiplier / dry_base / dry_allowed_length / dry_penalty_last_n / dry_sequence_breakers).
  • ModelParameters.enableSwaFull() (--swa-full): keep a full-size SWA KV cache to enable cross-request prompt-prefix reuse.
  • Typed cache observability: Usage.getCachedTokens(), Usage.getProcessedPromptTokens(), SlotMetrics, ServerMetrics.getSlotMetrics(), plus authenticated JSON GET /metrics and GET /slots.
  • Windows GPU native classifiers: cuda13-windows-x86-64, vulkan-windows-x86-64, opencl-windows-x86-64, and the msvc-windows CPU classifier (the default Windows CPU JAR flipped to the Ninja Multi-Config generator).
  • log_helpers.hpp — pure, unit-tested log-formatting helpers (log_level_name, format_log_as_json).
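
As a rough illustration of the client side of the OpenAI-compatible surface listed above, the following sketch builds (but does not send) a non-streaming POST /v1/chat/completions request with the JDK's java.net.http client. The base URL, port, API key, and model name are assumptions chosen for the example, and the class and method names are hypothetical; none of this is part of the project's API.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical helper, not part of the library: builds a chat-completion request.
final class ChatCompletionRequestSketch {

    /** Builds a non-streaming chat-completion request; nothing is sent here. */
    static HttpRequest build(String baseUrl, String apiKey, String model, String userText) {
        String body = "{\"model\":\"" + jsonEscape(model) + "\",\"stream\":false,"
                + "\"messages\":[{\"role\":\"user\",\"content\":\"" + jsonEscape(userText) + "\"}]}";
        HttpRequest.Builder builder = HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/v1/chat/completions"))
                .timeout(Duration.ofSeconds(60))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body));
        // Bearer auth is optional on the server, so the header is added only when a key is given.
        if (apiKey != null && !apiKey.isEmpty()) {
            builder.header("Authorization", "Bearer " + apiKey);
        }
        return builder.build();
    }

    /** Minimal JSON string escaping for quotes, backslashes, and control characters. */
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed address and model name; adjust to the running server.
        HttpRequest request = build("http://localhost:8080", "secret", "local-model", "Hello");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request is then a matter of passing it to HttpClient.send with a string body handler. Since the changelog states that every route is also reachable without the /v1 prefix, the same request should work against /chat/completions.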

Changed

  • Upgraded llama.cpp from b9555 to b9842 across eleven incremental upgrades. Notable upstream features now reachable: DRY sampling, --swa-full, DFlash block-diffusion speculative decoding (--spec-type draft-dflash), the MiniCPM5 XML tool-call chat template, the server --reasoning-preserve flag, Jinja min/max array filters, and the DeepSeek-V4 architecture (b9840). The b9829 bump additionally compiles the new upstream server-stream.cpp (resumable-streaming SSE replay buffer) into libjllama. The final b9840→b9842 step is internal-only (preset INI section-tag canonicalization in common/preset.cpp; a Vulkan graph-submission heuristic switched from weight-matrix bytes to estimated FLOPs): no project source changes, no API impact, and all four local patches (0001–0004) apply unchanged across the range.
  • Replaced the --skip-download flag with --offline (llama.cpp b9803).
  • Session now pins every inference request to its configured slot, so generation and slot save/restore/erase target the same KV state (SessionState extracted as a testable concurrency contract).
  • configureParallelInference now applies slot_prompt_similarity live via server_context::set_slot_prompt_similarity() (upstream PR ggml-org/llama.cpp#22393, carried as patches/0003), instead of validating and discarding the value.
  • Android minimum API level raised from 24 to 28 (Android 9.0 Pie), satisfied via bionic's weak-symbol mechanism rather than __ANDROID_API__.
  • CI: rolled out the sccache → Depot shared compiler cache across all native build jobs (incl. nvcc wrapping for full-arch CUDA and the Windows Ninja path), fork-PR token-gating, and a shared GGUF model cache.
  • LlamaLoader native-library extraction is now race-safe (atomic write) and uses a private lock object instead of synchronized methods.
  • SpotBugs (effort=Max, threshold=Low) made clean and wired into CI; C++ unit suite grown to 459 tests.
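
The race-safe extraction mentioned in the list above follows a common pattern: write to a temporary file in the target directory, then move it into place atomically, guarded by a private lock object. The sketch below shows that general technique only; it is not the actual LlamaLoader code, and the class name, method name, and the "already extracted" shortcut are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of atomic-write extraction; not the project's LlamaLoader.
final class AtomicExtractSketch {

    // Private lock: outside code cannot contend on it, unlike a synchronized method's monitor.
    private static final Object LOCK = new Object();

    /** Copies source to target so that readers never observe a half-written file. */
    static Path extract(InputStream source, Path target) throws IOException {
        synchronized (LOCK) {
            if (Files.exists(target)) {
                // Simplification: a real loader would likely also validate version or content.
                return target;
            }
            Path dir = target.toAbsolutePath().getParent();
            Files.createDirectories(dir);
            // The temp file lives in the same directory so the final move stays on one file system.
            Path tmp = Files.createTempFile(dir, "extract-", ".tmp");
            try {
                Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
                try {
                    Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
                } catch (AtomicMoveNotSupportedException e) {
                    // Fallback for file systems without atomic rename; no longer strictly atomic.
                    Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
                }
            } finally {
                // No-op after a successful move; cleans up the temp file if copy or move failed.
                Files.deleteIfExists(tmp);
            }
            return target;
        }
    }
}
```

The lock only serializes threads inside one JVM. Protection against a second process extracting the same file comes from the atomic move, which publishes the file in a single step.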

Fixed

  • Per-request reasoning_budget_tokens is now honored (via patches/0004, upstream PR ggml-org/llama.cpp#23116): reasoning_budget_tokens=0 suppresses thinking.
  • Preserved decoded image buffers across the JNI chat boundary and submitted media requests through llama.cpp's multimodal task path instead of silently tokenizing them as text-only prompts; preserved multipart image content in the typed ChatRequest serializer.
  • The standalone OpenAI-compatible server now advertises vision only when the loaded model confirms usable vision support.
  • Cached-token usage is preserved through typed Java responses and the OpenAI Responses / Anthropic blocking and streaming adapters.
  • Stabilized flaky reasoning-budget tests on Metal by using greedy sampling.

5.0.2 - 2026-06-08

Tracks llama.cpp b9151 → b9555.

Added

  • CODE_OF_CONDUCT.md (Contributor Covenant 2.0).
  • docs/RELEASE.md capturing the maintainer-facing release procedure (moved out of CHANGELOG).
  • OpenSSF Best Practices badge (project 12862) on README.
  • Reasoning-budget tests (Qwen3-0.6B).

Changed

  • Reorganized the Java API into subpackages: parameters (ModelParameters, InferenceParameters, …), value (LogLevel, …), callback, exception (LlamaException, …), and loader (LlamaLoader, OSInfo). Source-incompatible for consumers: import statements for the moved types must be updated.
  • Unified CONTRIBUTING.md and SECURITY.md structure with sibling repositories, and migrated cross-repo CLAUDE.md sections to workspace pointers.
  • Reconciled Java baseline to 11+ across pom.xml, README badge, CLAUDE.md, and CONTRIBUTING.md.
  • README license badge corrected from "Apache 2.0" to "MIT" (matches LICENSE file and pom.xml).
  • pom.xml SCM URL: tree/master → tree/main (default branch renamed).
  • Upgraded Maven dependencies (incl. logback-classic 1.5.32 → 1.5.33).
  • Upgraded llama.cpp from b9151 to b9555 across multiple incremental upgrades.

5.0.1 - 2026-05-14

Added

  • InferenceParameters.setContinueFinalMessage(boolean) for the vLLM/transformers-compatible prefill-assistant heuristic (llama.cpp b9134+).
  • Tests for setContinueFinalMessage.
  • Comprehensive Javadoc on public APIs (PR #129).
  • Maven Central badge on README (PR #130).

Changed

  • Bumped project version to 5.0.1-SNAPSHOT (PR #127), then released as 5.0.1 (PR #135).
  • Refactored GitHub release workflow to parallelise snapshot and release jobs (PR #128).
  • Removed snapshot build documentation and badge (PR #131).
  • Upgraded Windows CI to windows-2025 with Visual Studio 2026 (PR #132).
  • Switched Windows MSVC runtime from dynamic (/MD) to static (/MT) to eliminate the msvcp140.dll runtime dependency (PR #133).
  • Upgraded llama.cpp from b9106 to b9134 (PR #134), then to b9150 (PR #136), then to b9151 (PR #139).
  • Refactored CI workflow with explicit snapshot/tag check gates (PR #137).
  • Removed setCtxSizeDraft() — the underlying CLI flag was deleted upstream in llama.cpp b9106.

Fixed

  • Quoted gate job names in the publish workflow to avoid YAML colon-in-scalar parse errors (PR #138).
  • Release routing in the publish workflow now correctly distinguishes snapshot vs. tag pushes.

5.0.0 - 2026-05-11

First release of the fork under the net.ladenthin:llama Maven coordinates. ~100 merged pull requests since baseline 49be664 (the last pre-fork upstream commit).

Added

  • First publish to Maven Central under net.ladenthin:llama.
  • Pre-built native libraries for Linux (x86-64, aarch64), macOS (x86-64, arm64), and Windows (x86-64, x86).
  • Java API surface: LlamaModel, ModelParameters, InferenceParameters, LlamaIterator / LlamaIterable for streaming, chat completion (chatComplete, generateChat, chatCompleteText), embeddings, reranking, infilling, raw JSON endpoint handlers, slot management (saveSlot, restoreSlot, eraseSlot), and getModelMeta().
  • chatComplete() for OpenAI-compatible chat completions, re-implemented from scratch based on a patch by @vaiju1981 (PR #61; see docs/history/CHAT_INTEGRATION_SUMMARY.md).
  • mmproj, reasoning-budget, sigma, and sleep-idle parameters added to ModelParameters.
  • JaCoCo code-coverage reporting integrated with Coveralls and Codecov (PR #124).
  • CodeQL static-analysis workflow on push, PR, and a weekly schedule.
  • Automated Claude Code review workflow on pull requests.
  • Dependabot for Maven and GitHub Actions dependency updates.
  • Automatic snapshot release workflow on main push (PR #105) publishing to the Sonatype Central snapshot repository.
  • CUDA, Metal, and Vulkan build support via local CMake build.
  • Android integration documented in README.
  • All system properties (net.ladenthin.llama.*) and LogLevel values documented.
  • CLAUDE.md maintainer guide covering upstream upgrade procedure and the b5022→b9172 breaking-change table.

Changed

  • Migrated Maven group and artifact from de.kherud:java-llama.cpp to net.ladenthin:llama (PR #101).
  • Migrated Maven Central publishing from OSSRH (Legacy) to the Sonatype Central Publisher Portal.
  • Deleted the hand-ported server.hpp fork (~3,780 lines) and linked the upstream llama.cpp server source files directly into jllama. ~4,100 C++ lines removed in total; future upstream upgrades become a CMake version bump. The Java API is unchanged. See docs/history/REFACTORING.md.
  • Compiled upstream server-context / queue / task / models directly into jllama (PR #96).
  • Unified CI into a single publish.yml workflow with cross-compilation, testing, coverage, and release stages.
  • Upgraded CUDA from 12.1 to 13.2 (PR #50).
  • Upgraded llama.cpp from b8913 through b9106 across multiple incremental upgrades.
  • setDraftMax / setDraftMin now emit the canonical --spec-draft-n-max / --spec-draft-n-min flags (llama.cpp b9016 removed the old aliases).
  • Bumped CI GitHub Actions: actions/checkout v4 → v6, actions/upload-artifact v6 → v7, actions/download-artifact v6 → v8, codeql-action v3 → v4.

Fixed

  • Javadoc warnings resolved across the public API by adding missing comments.
  • cache_idle_slots slot-parameter handling aligned with the upstream rename (b8841 → b8854).

Pre-fork history (kherud/java-llama.cpp 1.x–4.2.0)

Releases 1.1.1 through 4.2.0 were authored by @kherud on the upstream repository. The full upstream release notes are at https://github.com/kherud/java-llama.cpp/releases. The fork's baseline is upstream commit 49be664 (tagged v4.2.0, 2025-06-20).

For an architecture-level diff between the pre-fork baseline (49be664) and the first 5.0.0 candidate (24918e4), see docs/history/49be664_24918e4.md. For the server-fork-deletion refactor that culminated in 5.0.0, see docs/history/REFACTORING.md. For the chat-completion integration that landed in 5.0.0, see docs/history/CHAT_INTEGRATION_SUMMARY.md.