Upgrade llama.cpp from b9071 to b9094 by bernardladenthin · Pull Request #115 · bernardladenthin/java-llama.cpp

bernardladenthin · 2026-05-10T09:53:31Z

Summary

This PR upgrades the pinned llama.cpp version from b9071 to b9094, incorporating upstream improvements across CUDA kernels, model support, server functionality, and UI components.

Key Changes

CUDA & Backend Enhancements

AllReduce Pipeline: New 2-GPU PCIe AllReduce for tensor parallelism (Volta+ required), controlled via GGML_CUDA_ALLREDUCE env var
Snake Activation Kernel: Fused CUDA kernel for BigVGAN/Vocos audio models, optimizing the y = x + sin(a*x)^2 * inv_b operation
Flash Attention: Extended head size support (192) for MiMo-V2.5/V2.5-Pro/V2-Flash models with GQA
Multi-GPU Communication: Refactored context to ggml_backend_cuda_comm_context with try_allreduce function pointer
SYCL Backend: Q5_K memory layout reordering and MMVQ kernel for Intel GPUs; improved K/V buffer handling
Hexagon Backend: HVX-vectorized GATED_DELTA_NET and L2_NORM operations
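
For readers unfamiliar with the Snake activation, the formula above can be written as a plain CPU reference. This is only an illustrative sketch, not the upstream CUDA kernel: the class and method names are hypothetical, and `a` and `inv_b` are treated as scalars for simplicity (an assumption; the upstream kernel may apply them per channel). The second method spells out the unfused `MUL→SIN→SQR→MUL→ADD` chain that the changelog row for `snake.cu` says is fused at graph level.

```java
/** Hypothetical CPU reference for the Snake activation: y = x + sin(a*x)^2 * inv_b. */
final class SnakeActivationSketch {

    /** Direct evaluation of the formula for a single element. */
    static float snake(float x, float a, float invB) {
        float s = (float) Math.sin(a * x);
        return x + s * s * invB;
    }

    /** The same value computed as the unfused five-op chain, one op per line. */
    static float[] snakeUnfused(float[] x, float a, float invB) {
        float[] out = new float[x.length];
        for (int i = 0; i < x.length; i++) {
            float t = a * x[i];          // MUL
            t = (float) Math.sin(t);     // SIN
            t = t * t;                   // SQR
            t = t * invB;                // MUL
            out[i] = x[i] + t;           // ADD
        }
        return out;
    }

    public static void main(String[] args) {
        float[] xs = {-1.5f, 0.0f, 0.25f, 3.0f};
        float[] ys = snakeUnfused(xs, 0.7f, 0.5f);
        for (int i = 0; i < xs.length; i++) {
            System.out.println(xs[i] + " -> " + ys[i]);
        }
    }
}
```

A fused kernel would compute the same per-element value in one pass instead of materializing four intermediate tensors, which is presumably where the benefit comes from.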

Model Support

Sarvam-MoE: New model support (sarvamai/sarvam-30b) with new vocab pre-type LLAMA_VOCAB_PRE_TYPE_SARVAM_MOE = 51
Gemma4: Updated expert handling with fallback support for separate gate/up experts and NVFP4 per-expert scale folding

Server Improvements

Model Info API: New get_model_info() method; /v1/models endpoint now includes n_ctx field
GCP/Vertex AI Compatibility: New register_gcp_compat() method supporting AIP environment variables
Router Communication: Child-to-parent model info propagation via new CMD_CHILD_TO_ROUTER_INFO command
Reasoning Budget: Adjusted token logit handling (only competing tokens set to -INFINITY)
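
The reasoning-budget change can be illustrated with a small sketch of the masking rule described above: every competing token is set to `-INFINITY` while the forced token keeps its original logit. The class and method names are hypothetical and this is not the upstream sampler code. The softmax helper is included only to show that the forced token still ends up with probability 1 when its logit stays finite.

```java
/** Hypothetical sketch of forcing one token by masking only its competitors. */
final class ForcedTokenLogitsSketch {

    /** Sets every logit except forcedToken to -INFINITY; the forced logit is left untouched. */
    static void forceToken(float[] logits, int forcedToken) {
        for (int i = 0; i < logits.length; i++) {
            if (i != forcedToken) {
                logits[i] = Float.NEGATIVE_INFINITY;
            }
        }
    }

    /** Numerically stable softmax (subtracts the maximum logit before exponentiating). */
    static double[] softmax(float[] logits) {
        double max = Double.NEGATIVE_INFINITY;
        for (float l : logits) {
            max = Math.max(max, l);
        }
        double[] p = new double[logits.length];
        double sum = 0.0;
        for (int i = 0; i < logits.length; i++) {
            p[i] = Math.exp(logits[i] - max);
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    public static void main(String[] args) {
        float[] logits = {1.0f, -2.0f, 0.5f};
        forceToken(logits, 1);
        System.out.println(java.util.Arrays.toString(softmax(logits)));
    }
}
```

One plausible reason to avoid `+INFINITY` on the forced token (an assumption; the PR does not state the motivation) is that a stable softmax would then evaluate `inf - inf`, which is NaN.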

UI Updates

Settings Registry: Refactored settings configuration structure
Route Changes: MCP route updated (#/settings/mcp → #/mcp-servers); settings route generalized

Notes

All changes are compiled from upstream sources. No JNI layer modifications are required; Java callers transparently receive new fields and functionality.
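
As a rough illustration of what "transparently receive new fields" could mean for a caller that already holds the raw `/v1/models` JSON text, the sketch below pulls out the first `n_ctx` value and tolerates responses from older builds that lack it. Everything here is an assumption for illustration: the class and method names are hypothetical, the sample payload in `main` is invented, and the regular expression deliberately avoids assuming where in the response the field is nested. A real caller would more likely use a JSON library.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reads the first "n_ctx" value from a /v1/models JSON string. */
final class ModelInfoNCtxSketch {

    // Matches "n_ctx": <digits> at any nesting depth.
    private static final Pattern N_CTX = Pattern.compile("\"n_ctx\"\\s*:\\s*(\\d+)");

    /** Returns the first n_ctx found, or empty when the field is absent (e.g. older builds). */
    static OptionalInt firstNCtx(String modelsJson) {
        Matcher m = N_CTX.matcher(modelsJson);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Invented payload for illustration only; the real response shape may differ.
        String sample = "{\"object\":\"list\",\"data\":[{\"id\":\"example-model\",\"n_ctx\":4096}]}";
        System.out.println(firstNCtx(sample));
    }
}
```

Returning `OptionalInt` keeps callers working against both the old and new pinned versions, since the absence of the field is not treated as an error.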

https://claude.ai/code/session_01Jbb6GBWWnZ94gsdFeuKHNw

No breaking changes to JNI layer. New upstream features: 2-GPU PCIe AllReduce (CUDA), Snake activation fusion (CUDA), flash attention head size 192, Q5_K SYCL reorder, Hexagon GATED_DELTA_NET/L2_NORM, Sarvam-MoE model, Gemma4 split experts, n_ctx in /v1/models, GCP/Vertex AI compat server endpoint, router child→parent model info propagation. https://claude.ai/code/session_01Jbb6GBWWnZ94gsdFeuKHNw

claude · 2026-05-10T09:54:27Z

 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.

-Current llama.cpp pinned version: **b9071**
+Current llama.cpp pinned version: **b9094**


✓ Correctly updated version from b9071 to b9094 as required by the upgrade procedure (also updated in CMakeLists.txt and README.md).

claude · 2026-05-10T09:54:37Z

Review Summary

✅ Version Upgrade Procedure: Correctly updated all three required files:

CMakeLists.txt: GIT_TAG b9071 to b9094
README.md: Badge and link updated
CLAUDE.md: Version line updated, plus comprehensive changelog

✅ Changelog Documentation: The added 12 rows (b9071–b9094) are comprehensive and well-structured:

Clearly categorized by subsystem (CUDA, SYCL, Hexagon, models, server, UI)
Includes impact assessment for each change
Explicitly confirms no breaking changes to JNI layer in commit message
The new model info API adds the n_ctx field to the /v1/models response, which is transparent to Java callers
All backend and kernel changes are properly noted as internal only

✅ Key Safety Notes:

GCP/Vertex AI compat registration is new and additive
Router child-to-parent model info propagation is internal and consistent
Reasoning budget logit behavior change is internal sampler-only with no JNI impact
All upstream source compilation is handled automatically via FetchContent

No blockers detected. The upgrade follows documented procedures and introduces no visible breaking changes to the Java API.

claude · 2026-05-10T09:54:41Z

+| ~b9071–b9094 | `ggml/src/ggml-cuda/allreduce.cu` + `allreduce.cuh` (NEW) | 2-GPU PCIe AllReduce pipeline for tensor parallelism (no NVLink required); requires Volta+ (sm70+); enabled via `GGML_CUDA_ALLREDUCE` env var (`nccl`/`internal`/`none`); compiled automatically via FetchContent, no project changes required |
+| ~b9071–b9094 | `ggml/src/ggml-cuda/snake.cu` + `snake.cuh` (NEW) | Fused CUDA Snake activation kernel (`y = x + sin(a*x)^2 * inv_b`) for BigVGAN/Vocos audio models; fuses 5-op chain `MUL→SIN→SQR→MUL→ADD` at graph level; F32/F16/BF16; compiled automatically, no project changes required |
+| ~b9071–b9094 | `ggml/src/ggml-cuda/ggml-cuda.cu` | Flash attention head size 192 (DKQ=192, DV=128) for MiMo-V2.5/V2.5-Pro/V2-Flash with GQA ratio 8/16; multi-GPU comm context refactored to `ggml_backend_cuda_comm_context` with `try_allreduce` function pointer; PCI bus IDs lowercased; compiled automatically, no project changes required |
+| ~b9071–b9094 | `ggml/src/ggml-sycl/` | Q5_K reordered memory layout + MMVQ kernel for Intel GPUs; PAD op supports non-contiguous src0; dedicated growing K/V buffer for flash attention; all internal SYCL backend, no project changes required |
+| ~b9071–b9094 | `ggml/src/ggml-hexagon/` | GATED_DELTA_NET and L2_NORM HVX-vectorized on Hexagon HTP backend; internal DSP backend, no project changes required |
+| ~b9071–b9094 | `src/models/sarvam.cpp` (NEW) | Sarvam-MoE model (`sarvamai/sarvam-30b`); reuses BailingMoeV2 arch; new vocab pre-type `LLAMA_VOCAB_PRE_TYPE_SARVAM_MOE = 51`; additive, no project changes required |
+| ~b9071–b9094 | `src/models/gemma4.cpp` | Gemma4 split gate/up experts: `ffn_gate_up_exps` now TENSOR_NOT_REQUIRED; fallback to separate `ffn_gate_exps`/`ffn_up_exps`; NVFP4 per_expert_scale folding; internal model-loading, no project changes required |
+| ~b9071–b9094 | `tools/server/server-context.h` + `server-context.cpp` | New `get_model_info()` method on `server_context`; `/v1/models` response now includes `"n_ctx"` field (value: `slot_n_ctx`); compiled from upstream sources, no JNI changes required (Java callers of model info APIs receive the new field transparently) |
+| ~b9071–b9094 | `tools/server/server-http.h` + `server.cpp` | `handlers` map moved from private to public in `server_http_context`; new `register_gcp_compat()` method exposes GCP/Vertex AI Prediction Protocol endpoint reading `AIP_MODE`/`AIP_PREDICT_ROUTE`/`AIP_HEALTH_ROUTE`/`AIP_HTTP_PORT` env vars; compiled from upstream sources, no project changes required |
+| ~b9071–b9094 | `tools/server/server-models.h` + `server.cpp` | Router child→parent model info propagation: new `CMD_CHILD_TO_ROUTER_INFO` command; `setup_child_server()` gains `const json & model_info` parameter; new `update_loaded_info()` method; `server_model_meta` gains `loaded_info` field; all internally consistent across compiled upstream sources, no project changes required |
+| ~b9071–b9094 | `common/reasoning-budget.cpp` | Forced token logit no longer set to `+INFINITY`; only competing tokens set to `-INFINITY`; internal sampler behavior change, no project changes required |
+| ~b9071–b9094 | `tools/server/webui/` | Settings registry refactored (`settings-config.ts`/`settings-fields.ts`/`settings-sections.ts` merged into `settings-registry.ts`); MCP route `#/settings/mcp` → `#/mcp-servers`; settings route `/settings/chat/[section]` → `/settings/[[section]]`; UI-only, no project changes required |


Excellent changelog entries (b9071–b9094). Each row clearly documents:

What changed and where (file paths)
The technical detail (kernel improvements, new models, API additions)
Project impact (all correctly marked as "no project changes required" or "transparent to Java callers")

Particularly good: The model info API note explicitly confirms that Java callers transparently receive new fields, and the reasoning budget behavior change is correctly identified as internal sampler-only with zero JNI impact.

claude · 2026-05-10T09:54:43Z

@@ -1,5 +1,5 @@
 ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)
-[![llama.cpp b9071](https://img.shields.io/badge/llama.cpp-%23b9071-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9071)
+[![llama.cpp b9094](https://img.shields.io/badge/llama.cpp-%23b9094-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9094)


✓ Badge and link correctly updated from b9071 to b9094, keeping the badge href in sync with the displayed version.

claude Bot reviewed May 10, 2026

bernardladenthin merged commit 202404b into main May 10, 2026
23 of 24 checks passed

bernardladenthin deleted the claude/sweet-dirac-63Uij branch May 10, 2026 22:38

bernardladenthin merged 1 commit into main from claude/sweet-dirac-63Uij
