Merged
14 changes: 13 additions & 1 deletion CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.

-Current llama.cpp pinned version: **b9071**
+Current llama.cpp pinned version: **b9094**

✓ Correctly updated version from b9071 to b9094 as required by the upgrade procedure (also updated in CMakeLists.txt and README.md).


## Upgrading CUDA Version

@@ -228,6 +228,18 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
| ~b9049–b9071 | `ggml/src/ggml-cuda/out-prod.cu` | CUDA outer-product uses `cublasSgemmStridedBatched` for batched path (dps2==1, ne2>1); HIP/MUSA compat headers gain the alias; performance improvement, no project changes required |
| ~b9049–b9071 | `tools/mtmd/` | MiniCPM-V 4.6 multimodal support added (`PROJECTOR_TYPE_MINICPMV4_6`, ViT merger graph, new tensor names); additive, no project changes required |
| ~b9049–b9071 | `tools/server/webui/` | LLM-based conversation title generation; CSS animation `fill-mode-forwards` fixes; UI-only changes compiled into upstream server, no project changes required |
| ~b9071–b9094 | `ggml/src/ggml-cuda/allreduce.cu` + `allreduce.cuh` (NEW) | 2-GPU PCIe AllReduce pipeline for tensor parallelism (no NVLink required); requires Volta+ (sm70+); enabled via `GGML_CUDA_ALLREDUCE` env var (`nccl`/`internal`/`none`); compiled automatically via FetchContent, no project changes required |
| ~b9071–b9094 | `ggml/src/ggml-cuda/snake.cu` + `snake.cuh` (NEW) | Fused CUDA Snake activation kernel (`y = x + sin(a*x)^2 * inv_b`) for BigVGAN/Vocos audio models; fuses 5-op chain `MUL→SIN→SQR→MUL→ADD` at graph level; F32/F16/BF16; compiled automatically, no project changes required |
| ~b9071–b9094 | `ggml/src/ggml-cuda/ggml-cuda.cu` | Flash attention head size 192 (DKQ=192, DV=128) for MiMo-V2.5/V2.5-Pro/V2-Flash with GQA ratio 8/16; multi-GPU comm context refactored to `ggml_backend_cuda_comm_context` with `try_allreduce` function pointer; PCI bus IDs lowercased; compiled automatically, no project changes required |
| ~b9071–b9094 | `ggml/src/ggml-sycl/` | Q5_K reordered memory layout + MMVQ kernel for Intel GPUs; PAD op supports non-contiguous src0; dedicated growing K/V buffer for flash attention; all internal SYCL backend, no project changes required |
| ~b9071–b9094 | `ggml/src/ggml-hexagon/` | GATED_DELTA_NET and L2_NORM HVX-vectorized on Hexagon HTP backend; internal DSP backend, no project changes required |
| ~b9071–b9094 | `src/models/sarvam.cpp` (NEW) | Sarvam-MoE model (`sarvamai/sarvam-30b`); reuses BailingMoeV2 arch; new vocab pre-type `LLAMA_VOCAB_PRE_TYPE_SARVAM_MOE = 51`; additive, no project changes required |
| ~b9071–b9094 | `src/models/gemma4.cpp` | Gemma4 split gate/up experts: `ffn_gate_up_exps` now TENSOR_NOT_REQUIRED; fallback to separate `ffn_gate_exps`/`ffn_up_exps`; NVFP4 per_expert_scale folding; internal model-loading, no project changes required |
| ~b9071–b9094 | `tools/server/server-context.h` + `server-context.cpp` | New `get_model_info()` method on `server_context`; `/v1/models` response now includes `"n_ctx"` field (value: `slot_n_ctx`); compiled from upstream sources, no JNI changes required (Java callers of model info APIs receive the new field transparently) |
| ~b9071–b9094 | `tools/server/server-http.h` + `server.cpp` | `handlers` map moved from private to public in `server_http_context`; new `register_gcp_compat()` method exposes GCP/Vertex AI Prediction Protocol endpoint reading `AIP_MODE`/`AIP_PREDICT_ROUTE`/`AIP_HEALTH_ROUTE`/`AIP_HTTP_PORT` env vars; compiled from upstream sources, no project changes required |
| ~b9071–b9094 | `tools/server/server-models.h` + `server.cpp` | Router child→parent model info propagation: new `CMD_CHILD_TO_ROUTER_INFO` command; `setup_child_server()` gains `const json & model_info` parameter; new `update_loaded_info()` method; `server_model_meta` gains `loaded_info` field; all internally consistent across compiled upstream sources, no project changes required |
| ~b9071–b9094 | `common/reasoning-budget.cpp` | Forced token logit no longer set to `+INFINITY`; only competing tokens set to `-INFINITY`; internal sampler behavior change, no project changes required |
| ~b9071–b9094 | `tools/server/webui/` | Settings registry refactored (`settings-config.ts`/`settings-fields.ts`/`settings-sections.ts` merged into `settings-registry.ts`); MCP route `#/settings/mcp` → `#/mcp-servers`; settings route `/settings/chat/[section]` → `/settings/[[section]]`; UI-only, no project changes required |
Comment on lines +231 to +242

Excellent changelog entries (b9071–b9094). Each row clearly documents:

  1. What changed and where (file paths)
  2. The technical detail (kernel improvements, new models, API additions)
  3. Project impact (all correctly marked as "no project changes required" or "transparent to Java callers")

Particularly good: The model info API note explicitly confirms that Java callers transparently receive new fields, and the reasoning budget behavior change is correctly identified as internal sampler-only with zero JNI impact.
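
For readers unfamiliar with the reasoning-budget change, the row's description ("forced token logit no longer set to `+INFINITY`; only competing tokens set to `-INFINITY`") can be illustrated with a minimal Java sketch. This is not the upstream C++ code in `common/reasoning-budget.cpp`; the class and method names (`ForcedTokenMask`, `maskCompeting`, `argmax`) are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;

// Hypothetical illustration of the behavior described in the changelog row.
// It is NOT a port of the upstream sampler code.
final class ForcedTokenMask {

    // Returns a copy of the logits in which every token except forcedToken
    // is set to -Infinity. The forced token keeps its original logit
    // (it is not raised to +Infinity).
    static float[] maskCompeting(float[] logits, int forcedToken) {
        if (forcedToken < 0 || forcedToken >= logits.length) {
            throw new IllegalArgumentException("forcedToken out of range: " + forcedToken);
        }
        float[] out = new float[logits.length];
        Arrays.fill(out, Float.NEGATIVE_INFINITY);
        out[forcedToken] = logits[forcedToken];
        return out;
    }

    // Index of the largest logit (first one wins on ties).
    static int argmax(float[] logits) {
        int best = 0;
        for (int i = 1; i < logits.length; i++) {
            if (logits[i] > logits[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        float[] logits = {1.5f, -0.3f, 4.0f, 2.2f};
        float[] masked = maskCompeting(logits, 1);
        System.out.println(Arrays.toString(masked));
        System.out.println("selected token: " + argmax(masked));
    }
}
```

With every competing token masked out, the forced token is the only candidate left even though its own logit is untouched, which is consistent with the row's assessment that the change is internal to the sampler.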


## Build Commands

2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -97,7 +97,7 @@ set(GGML_AVX512 OFF CACHE BOOL "" FORCE)
FetchContent_Declare(
llama.cpp
GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-GIT_TAG b9071
+GIT_TAG b9094
)
FetchContent_MakeAvailable(llama.cpp)

2 changes: 1 addition & 1 deletion README.md
@@ -1,5 +1,5 @@
![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)
-[![llama.cpp b9071](https://img.shields.io/badge/llama.cpp-%23b9071-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9071)
+[![llama.cpp b9094](https://img.shields.io/badge/llama.cpp-%23b9094-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9094)

✓ Badge and link correctly updated from b9071 to b9094, keeping the badge href in sync with the displayed version.
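
Since the same tag has to be bumped in three files (CLAUDE.md, CMakeLists.txt, README.md), a small consistency check could catch a missed file in future upgrades. The following is a minimal sketch, not something that exists in the repository: the class name `PinnedVersionCheck` and its regular expressions are assumptions derived from the lines shown in this diff, and they would need adjusting if the wording of those lines changes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: checks that the llama.cpp tag is identical in the
// three places this PR touches. Patterns are based on the diff lines only.
final class PinnedVersionCheck {

    static final Pattern CLAUDE_MD = Pattern.compile("pinned version: \\*\\*(b\\d+)\\*\\*");
    static final Pattern CMAKE_TAG = Pattern.compile("GIT_TAG\\s+(b\\d+)");
    static final Pattern README_BADGE = Pattern.compile("llama\\.cpp-%23(b\\d+)-");
    static final Pattern README_LINK = Pattern.compile("llama\\.cpp/releases/tag/(b\\d+)");

    // First captured tag for the pattern, or null if the pattern does not match.
    static String first(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // True only if all four locations are present and carry the same tag.
    static boolean inSync(String claudeMd, String cmakeLists, String readme) {
        String expected = first(CLAUDE_MD, claudeMd);
        return expected != null
                && expected.equals(first(CMAKE_TAG, cmakeLists))
                && expected.equals(first(README_BADGE, readme))
                && expected.equals(first(README_LINK, readme));
    }

    public static void main(String[] args) {
        String claudeMd = "Current llama.cpp pinned version: **b9094**";
        String cmake = "GIT_TAG b9094";
        String readme = "[![llama.cpp b9094](https://img.shields.io/badge/llama.cpp-%23b9094-informational)]"
                + "(https://github.com/ggml-org/llama.cpp/releases/tag/b9094)";
        System.out.println("in sync: " + inSync(claudeMd, cmake, readme));
    }
}
```

In practice the three strings would be read from the files in the working tree; inline strings are used here only to keep the sketch self-contained.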

[![Snapshot](https://img.shields.io/badge/snapshot-latest-informational)](https://github.com/bernardladenthin/java-llama.cpp/releases/tag/snapshot)

# Java Bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp)