Merged
CLAUDE.md: 2 changes (1 addition & 1 deletion)
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.

-Current llama.cpp pinned version: **b9555**
+Current llama.cpp pinned version: **b9621**

## Upgrading CUDA Version

CMakeLists.txt: 2 changes (1 addition & 1 deletion)
@@ -114,7 +114,7 @@ set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)
FetchContent_Declare(
llama.cpp
GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-GIT_TAG b9555
+GIT_TAG b9621
)
FetchContent_MakeAvailable(llama.cpp)

README.md: 2 changes (1 addition & 1 deletion)
@@ -1,7 +1,7 @@
**Build:**
![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)
![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows%20%7C%20Android-lightgrey)
-[![llama.cpp b9555](https://img.shields.io/badge/llama.cpp-%23b9555-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9555)
+[![llama.cpp b9621](https://img.shields.io/badge/llama.cpp-%23b9621-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9621)
[![JPMS](https://img.shields.io/badge/JPMS-modular%20JAR-25A162)](https://openjdk.org/projects/jigsaw/)
![JUnit](https://img.shields.io/badge/tested%20with-JUnit6-25A162)
[![JSpecify](https://img.shields.io/badge/JSpecify-1.0.0%20%40NullMarked-25A162)](https://jspecify.dev)
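The three pin edits in CLAUDE.md, CMakeLists.txt and README.md have to move in lockstep, so a partial bump is easy to miss. One way to guard against that is a small consistency check run before committing. The sketch below is hypothetical and not part of this repository: the class name `PinCheck` and its regular expressions are assumptions modeled on the lines changed in this diff, and it takes file contents as strings instead of reading the files itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of java-llama.cpp): checks that the llama.cpp
// tag is identical in CLAUDE.md, CMakeLists.txt and the README badge link.
final class PinCheck {
    // "Current llama.cpp pinned version: **b9621**"
    private static final Pattern CLAUDE_PIN =
            Pattern.compile("pinned version: \\*\\*(b\\d+)\\*\\*");
    // "GIT_TAG b9621" inside FetchContent_Declare(llama.cpp ...)
    private static final Pattern CMAKE_TAG =
            Pattern.compile("GIT_TAG\\s+(b\\d+)");
    // ".../releases/tag/b9621)" at the end of the README badge link
    private static final Pattern README_TAG =
            Pattern.compile("releases/tag/(b\\d+)\\)");

    private PinCheck() {}

    /** Returns the first captured tag, or null if the pattern does not match. */
    static String find(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Returns a list of problems; an empty list means all three pins agree. */
    static List<String> check(String claudeMd, String cmakeLists, String readme) {
        String a = find(CLAUDE_PIN, claudeMd);
        String b = find(CMAKE_TAG, cmakeLists);
        String c = find(README_TAG, readme);
        List<String> problems = new ArrayList<>();
        if (a == null) problems.add("CLAUDE.md: pinned version not found");
        if (b == null) problems.add("CMakeLists.txt: GIT_TAG not found");
        if (c == null) problems.add("README.md: release badge tag not found");
        // Only compare once all three tags were found.
        if (problems.isEmpty() && !(a.equals(b) && b.equals(c))) {
            problems.add("mismatch: CLAUDE.md=" + a + ", CMakeLists.txt=" + b
                    + ", README.md=" + c);
        }
        return problems;
    }

    public static void main(String[] args) {
        String claude = "Current llama.cpp pinned version: **b9621**";
        String cmake = "FetchContent_Declare(\n llama.cpp\n GIT_TAG b9621\n)";
        String readme = "[![llama.cpp b9621](...)]"
                + "(https://github.com/ggml-org/llama.cpp/releases/tag/b9621)";
        System.out.println(check(claude, cmake, readme)); // prints []
    }
}
```

This sketch only inspects the release link target in the README, not the shields.io label (`%23b9621`), so a stale badge label would still pass unless a fourth pattern is added.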
TODO.md: 6 changes (6 additions & 0 deletions)
@@ -1,3 +1,9 @@
+<!--
+SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
+
+SPDX-License-Identifier: MIT
+-->
+
# TODO — java-llama.cpp

Open work items for this repo. Cross-cutting tracking lives in
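TODO.md, docs/feature-investigation-similar-projects.md and docs/history/llama-cpp-breaking-changes.md all gain the same SPDX comment header in this PR. If the header is meant to be mandatory for every Markdown file, a check along the lines of the hypothetical `SpdxHeaderCheck` below could enforce it. The header shape is taken from the diff; the class name, method name and the decision to make the blank line optional are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical checker (not part of java-llama.cpp): verifies that Markdown text
// opens with the SPDX comment header used in this repository's docs.
final class SpdxHeaderCheck {
    // Expected shape, matched from the very first character:
    //   <!--
    //   SPDX-FileCopyrightText: <year> <holder>
    //   (optional blank line)
    //   SPDX-License-Identifier: <id>
    //   -->
    private static final Pattern HEADER = Pattern.compile(
            "<!--\\R"
            + "SPDX-FileCopyrightText: \\d{4} .+\\R"
            + "\\R?"
            + "SPDX-License-Identifier: (\\S+)\\R"
            + "-->\\R");

    private SpdxHeaderCheck() {}

    /** Returns the declared license identifier, or null if the header is missing or misplaced. */
    static String licenseOf(String markdown) {
        Matcher m = HEADER.matcher(markdown);
        // lookingAt() only matches starting at index 0, so a header further down does not count.
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String todo = "<!--\n"
                + "SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>\n"
                + "\n"
                + "SPDX-License-Identifier: MIT\n"
                + "-->\n"
                + "\n"
                + "# TODO\n";
        System.out.println(licenseOf(todo)); // prints MIT
    }
}
```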
docs/feature-investigation-similar-projects.md: 6 changes (6 additions & 0 deletions)
@@ -1,3 +1,9 @@
+<!--
+SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
+
+SPDX-License-Identifier: MIT
+-->
+
# Feature Investigation — ideas from pure-Java sibling runtimes and `llamacpp4j`

Comparison sources (all surveyed in one pass for this document):
docs/history/llama-cpp-breaking-changes.md: 20 changes (20 additions & 0 deletions)
@@ -1,3 +1,9 @@
+<!--
+SPDX-FileCopyrightText: 2026 Bernard Ladenthin <bernard.ladenthin@gmail.com>
+
+SPDX-License-Identifier: MIT
+-->
+
# llama.cpp upstream breaking changes — version-range changelog

Per-version-range record of upstream API breaks observed in the b5022 &#x2192; latest range, what the affected upstream files are, and the project-side fix (or "no project changes required" when the break stayed inside an upstream-compiled translation unit).
@@ -326,3 +332,17 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
| ~b9549&ndash;b9553 | `conversion/mistral.py` + `convert_hf_to_gguf.py` | Python conversion-script robustness only: `hparams["llama_4_scaling"]` and `"moe" in hparams` replaced with `hparams.get(...)` / `is not None` guards so a present-but-null key no longer crashes conversion. Python tooling, not part of the JNI build. No impact |
| ~b9549&ndash;b9553 | upstream build / verification | Local build with `GIT_TAG b9553` verified clean on Linux x86_64: `cmake -B build -DBUILD_TESTING=ON` configures cleanly, `cmake --build build --config Release -j$(nproc)` links `libjllama.so` + `jllama_test` with zero warnings on any project translation unit, and `ctest --test-dir build --output-on-failure` reports **440/440 tests passing** (435 prior + 5 new `Samplers_*` tests). The sole breaking change in this range (the `common_sampler_types_from_names` signature) is absorbed inside upstream-compiled translation units; no project C++ source edits were required for the version bump itself |
| ~b9553&ndash;b9555 | `.devops/intel.Dockerfile` + `ggml/src/ggml-metal/ggml-metal-device.cpp` + `tests/test-backend-ops.cpp` | Tiny maintenance bump &mdash; **no API change and no new feature**. (1) `intel.Dockerfile`: Intel GPU userspace driver pins bumped (IGC `v2.20.5`&#x2192;`v2.34.4`, compute-runtime `25.40.35563.10`&#x2192;`26.18.38308.1`, IGDGMM `22.8.2`&#x2192;`22.10.0`) with the old multi-GPU-safe versions commented out; upstream's own Docker image only &mdash; this project ships its own `publish.yml` and does not consume `.devops/`. No impact. (2) `ggml-metal-device.cpp`: bugfix to the Metal im2col pipeline selector &mdash; the standard-vs-`_ext` kernel choice now keys off the actual conv-kernel footprint (`KH*KW`, with `KH = is_2D ? ne01 : 1`, `KW = ne00`) instead of the raw `ne00*ne01` product, fixing kernel selection for 1-D convolutions. Backend-internal Metal TU compiled via FetchContent; no API surface visible to `jllama.cpp`, and only affects the macOS/Metal backend at runtime. (3) `tests/test-backend-ops.cpp`: one extra `test_im2col` case (`{3000,384,1,1}` / `{3,384,384,1}`) added &mdash; upstream test only, not linked into the JNI build. **No project source changes required; no new Java-API-exposable feature.** Build verification deferred to CI (`publish.yml`) / a developer host as usual |
+| ~b9555&ndash;b9621 | `ggml/include/ggml.h` + `ggml/src/ggml.c` + `ggml/src/ggml-cuda/gated_delta_net.cu` + `ggml/src/ggml-metal/ggml-metal.metal` + `ggml/src/ggml-vulkan/vulkan-shaders/gated_delta_net.comp` | `ggml_gated_delta_net` state tensor reshaped again: the 3D `(S_v*S_v*H, K, n_seqs)` layout is now the 4D `[S_v, S_v, H, n_seqs]` layout, with an explicit `int64_t K` seventh parameter (the snapshot count; `K=1` keeps only the final state). Signature: `ggml_gated_delta_net(ctx, q, k, v, g, beta, state, K)` (was 6-argument). Snapshot-slot ordering also flipped to most-recent-first. Internal Qwen3.5 / Qwen3-Next recurrent-attention kernel; the project does not call `ggml_gated_delta_net` directly. No project source changes required |
+| ~b9555&ndash;b9621 | `ggml/include/ggml.h` | New `ggml_col2im_1d(ctx, a, s0, oc, p0)` function and `GGML_OP_COL2IM_1D` enum value added; `GGML_OP_COUNT` incremented 96 &#x2192; 97. Additive and not called by the project; no project source changes required |
+| ~b9555&ndash;b9621 | `common/fit.h` + `tools/server/server-context.cpp` | `common_get_device_memory_data()` return type changed: it now returns `common_device_memory_data_vec` (typedef for `std::vector<common_device_memory_data>`). The new `common_device_memory_data` struct carries the `.total`, `.free`, `.model`, `.context` and `.compute` fields directly (previously the caller reached them via `.mb.model` etc.). `fit.h` also dropped its `#include "ggml-backend.h"` and `#include "../src/llama-ext.h"` lines, since those types are no longer needed at the header level. Consumed exclusively in upstream-compiled `server-context.cpp` (the field-accessor update from `.mb.model` &#x2192; `.model` etc. was applied upstream); the project neither includes `fit.h` nor calls `common_get_device_memory_data()` directly. No project source changes required |
+| ~b9555&ndash;b9621 | `tools/mtmd/mtmd-helper.h` + `tools/mtmd/mtmd-helper.cpp` + `tools/server/server-common.cpp` | `mtmd_helper_bitmap_init_from_file()` and `mtmd_helper_bitmap_init_from_buf()` return type changed: both now return an `mtmd_helper_bitmap_wrapper` struct (with `bitmap` and `video_ctx` fields) instead of `mtmd_bitmap*`. All call sites were updated in upstream `server-common.cpp`. The project does not call these functions from `src/main/cpp/` (verified via grep: zero matches). No project source changes required |
+| ~b9555&ndash;b9621 | `tools/mtmd/mtmd-helper.h` + `tools/mtmd/mtmd-helper.cpp` | New video pipeline: `mtmd_helper_video_context`, the `mtmd_helper_video_*` API family (init/free/decode) and ffmpeg-based frame extraction. New `--video` CLI flag in `common/arg.cpp`; new `input_video` content type in `server-common.cpp`. The multimodal helper additions flow through the upstream-compiled `mtmd-helper.cpp` and `server-common.cpp`, and the project does not reference any `mtmd_helper_video_*` symbol. No project source changes required. Could be exposed in a future Java API as `InferenceParameters.setVideoPath(String)` |
+| ~b9555&ndash;b9621 | `common/common.h` | New `common_params` fields: `path_prompts_log_dir` (prompt-logging output directory, string) and `mtmd_batch_max_tokens` (multimodal batch token limit, default 1024). Both are additive with harmless defaults. Not surfaced by `ModelParameters` today; could be added in a future enhancement. No project source changes required |
+| ~b9555&ndash;b9621 | `src/llama-ext.h` | New EAGLE3 speculative-decoding support APIs: `llama_set_embeddings_layer_inp(ctx, lid, value)`, `llama_get_embeddings_layer_inp(ctx, lid)`, `llama_model_target_layer_ids(model)` &#x2192; `const int32_t*`, `llama_model_target_layer_ids_n(model)` &#x2192; `uint32_t`. New `LLM_ARCH_EAGLE3` model architecture; new `llama_model_eagle3` struct in upstream model sources. EAGLE3 support includes a full encoder+decoder graph implementation for speculative decoding. All consumed inside upstream-compiled `speculative.cpp` and model TUs; the project does not reference any of these symbols. No project source changes required. Could be exposed later as a speculative-decoding backend type in `ModelParameters` |
+| ~b9555&ndash;b9621 | `src/llama-graph.h` + `src/llama-graph.cpp` | `llm_graph_result::set_outputs()` signature changed: it now takes a `const llm_graph_params &` parameter (previously it took none). A new `t_layer_inp` vector was added to `llm_graph_result` for layer-input embedding extraction (used by EAGLE3). Internal graph-building API not called from project sources; no project source changes required |
+| ~b9555&ndash;b9621 | `src/llama-context.cpp` | `llama_context` now initializes `embeddings_layer_inp` storage for EAGLE3 layer-input extraction; `n_outputs_max` is forced to `n_batch` when `llama_model_has_encoder()` returns true (encoder models always need all outputs). Internal context lifecycle; no project sources reference these fields. No project source changes required |
+| ~b9555&ndash;b9621 | `vendor/cpp-httplib/httplib.h` + `httplib.cpp` | cpp-httplib bumped to v0.47.0. Compiled automatically via FetchContent; no project source changes required |
+| ~b9555&ndash;b9621 | `ggml/src/ggml-cuda/ggml-cuda.cu` | `ggml_concat` on CUDA now handles F16, BF16, I8, I16, I32 and I64 element types in addition to F32; `active_count` tracking was added to the CUDA context to prevent a memory leak from the context that `cudaMemGetInfo` creates lazily. Internal CUDA backend; no project source changes required |
+| ~b9555&ndash;b9621 | `ggml/src/ggml-vulkan/` + Vulkan shaders | New `VK_VALVE_shader_mixed_float_dot_product` extension support for F16&#x2192;F32 fused dot products (`dot2_f16`) in the flash-attention and GEMM matmul shaders. Internal Vulkan backend; no project source changes required |
+| ~b9555&ndash;b9621 | `ggml/src/ggml-opencl/` + OpenCL kernels | New Q5_0 and Q5_1 GEMM/GEMV noshuffle kernels for Qualcomm Adreno GPUs. Internal OpenCL backend (affects the `opencl-android-aarch64` classifier build only); no project source changes required |
+| ~b9555&ndash;b9621 | `ggml/src/ggml-cuda/ssm-scan.cu` | Added `__syncthreads()` before the final reduction stage to prevent a shared-memory race in the multi-warp SSM scan. Bug fix in the internal CUDA backend; no project source changes required |
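The `ggml_col2im_1d(ctx, a, s0, oc, p0)` entry lists only a signature. For context, col2im is the adjoint of im2col. It scatters each column of kernel-window values back onto an output signal and sums the values where windows overlap, which is a common way to compute a 1-D transposed convolution. The Java sketch below shows that textbook operation on plain arrays. The `[oc][kernel][positions]` input layout, the output-length formula and the treatment of `p0` as cropping applied to both ends are assumptions. The sketch does not describe ggml's tensor shapes or kernels.

```java
import java.util.Arrays;

// Textbook 1-D col2im on plain arrays (an illustration, not ggml's implementation).
// cols[c][k][t] is the contribution of kernel tap k at input position t for output channel c.
final class Col2Im1d {
    private Col2Im1d() {}

    /**
     * @param cols input columns, shape [oc][kernel][positions]
     * @param s0   stride between consecutive positions
     * @param p0   padding cropped from each end of the output
     * @return output signal, shape [oc][(positions - 1) * s0 + kernel - 2 * p0]
     */
    static float[][] col2im1d(float[][][] cols, int s0, int p0) {
        int oc = cols.length;
        int kernel = cols[0].length;
        int positions = cols[0][0].length;
        int outLen = (positions - 1) * s0 + kernel - 2 * p0;
        if (outLen <= 0) throw new IllegalArgumentException("padding too large for this input");
        float[][] out = new float[oc][outLen];
        for (int c = 0; c < oc; c++) {
            for (int k = 0; k < kernel; k++) {
                for (int t = 0; t < positions; t++) {
                    int x = t * s0 + k - p0; // destination index after cropping p0
                    if (x >= 0 && x < outLen) {
                        out[c][x] += cols[c][k][t]; // overlapping taps accumulate
                    }
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // One channel, 3 taps, 2 positions, stride 2: both windows cover x = 2,
        // so that element receives two contributions.
        float[][][] ones = new float[1][3][2];
        for (float[] row : ones[0]) Arrays.fill(row, 1f);
        System.out.println(Arrays.toString(col2im1d(ones, 2, 0)[0])); // prints [1.0, 1.0, 2.0, 1.0, 1.0]
    }
}
```

The scatter-add loop is deliberately naive. A real backend would parallelize over output elements and gather the taps that reach each element, which avoids concurrent writes to the same output value.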
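The Metal im2col fix in the ~b9553&ndash;b9555 row is easier to follow with numbers. The sketch below reproduces only the footprint formula quoted in that row (`KH = is_2D ? ne01 : 1`, `KW = ne00`) alongside the old raw `ne00*ne01` product. It assumes that `{3,384,384,1}` from the new `test_im2col` case is the kernel shape, read as 3 taps, 384 input channels and 384 output channels. It omits the threshold that actually selects the `_ext` kernel, because that threshold is not given here.

```java
// Illustration of the kernel-footprint formula from the Metal im2col fix.
// Not ggml code: the threshold that picks the standard vs. _ext kernel is omitted.
final class Im2colFootprint {
    private Im2colFootprint() {}

    /** Pre-fix heuristic: raw product of the kernel's first two dimensions. */
    static long oldFootprint(long ne00, long ne01) {
        return ne00 * ne01;
    }

    /** Post-fix: KH * KW, where a 1-D convolution has a kernel height of 1. */
    static long newFootprint(long ne00, long ne01, boolean is2D) {
        long kh = is2D ? ne01 : 1;
        long kw = ne00;
        return kh * kw;
    }

    public static void main(String[] args) {
        // Assumed 1-D kernel {3, 384, 384, 1}: ne00 = 3 taps, ne01 = 384 channels.
        System.out.println(oldFootprint(3, 384));        // prints 1152
        System.out.println(newFootprint(3, 384, false)); // prints 3
        // For a 2-D 3x3 kernel both formulas agree.
        System.out.println(newFootprint(3, 3, true));    // prints 9
    }
}
```

In the 1-D case the old product is inflated by the channel count (1152 instead of 3), which could push selection toward the wrong kernel variant. For 2-D kernels the two formulas coincide, which is consistent with the row's statement that the fix targets 1-D convolutions.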