Upgrade llama.cpp from b9094 to b9102

claude · claude · commit 73b9b53f3376 · 2026-05-11T10:46:12.000Z
Key changes in b9102:
- Internal CUDA AllReduce pipeline (no NCCL required, works on Windows/PCIe)
- SYCL IM2COL_3D support for Intel GPU backend
- Bug fix: backend sampling now correctly tracks cur_p.selected for n_probs
- Bug fix: post_sampling_probs now works with backend sampling
- n_vocab loading moved to per-model load_arch_hparams() (internal refactor)
- httplib 0.43.4: chunk-size security fix (manual hex parsing vs strtoul)
- ggml version patch 0.11.0 → 0.11.1

No project-level JNI or Java changes required.

https://claude.ai/code/session_01QopdxqEvbkhiaaBRqBzgzc
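
To illustrate the httplib chunk-size item, here is a minimal sketch of manual hex-digit scanning for an HTTP chunk-size line. It is not httplib's actual code: the name `parse_chunk_size` and its signature are hypothetical, and the strictness rules (only hex digits, optionally followed by a `;` extension) are assumptions chosen to show the idea. The motivation is that `strtoul` tolerates leading whitespace, a sign, and a `0x` prefix, and reports overflow only through `errno`, whereas a hand-written scan can reject all of these explicitly.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>

// Hypothetical helper (not httplib's API): parse the hex chunk-size at the
// start of a chunk header line whose trailing CRLF is already stripped.
// Returns false when there are no hex digits, when the value would overflow
// std::uint64_t, or when the digits are followed by anything other than
// end-of-line or a ';' that starts a chunk extension.
bool parse_chunk_size(const std::string &line, std::uint64_t &out) {
    std::uint64_t value = 0;
    std::size_t i = 0;
    for (; i < line.size(); ++i) {
        const char c = line[i];
        unsigned digit = 0;
        if (c >= '0' && c <= '9') {
            digit = static_cast<unsigned>(c - '0');
        } else if (c >= 'a' && c <= 'f') {
            digit = static_cast<unsigned>(c - 'a') + 10;
        } else if (c >= 'A' && c <= 'F') {
            digit = static_cast<unsigned>(c - 'A') + 10;
        } else {
            break; // first non-hex character ends the size field
        }
        // If any of the top four bits are set, shifting left would lose them.
        if (value > (std::numeric_limits<std::uint64_t>::max() >> 4)) {
            return false;
        }
        value = (value << 4) | digit;
    }
    if (i == 0) {
        return false; // no digits at all (also rejects signs and whitespace)
    }
    if (i < line.size() && line[i] != ';') {
        return false; // trailing garbage such as the 'x' in "0x10"
    }
    out = value;
    return true;
}
```

Under these assumptions, `"1a"` and `"1A;ext=1"` both parse to 26, while `"-1"`, `"0x10"`, an empty line, and a 17-digit value such as `"10000000000000000"` are rejected instead of being silently wrapped or truncated.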
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
 
-Current llama.cpp pinned version: **b9094**
+Current llama.cpp pinned version: **b9102**
 
 ## Upgrading CUDA Version
 
@@ -240,6 +240,15 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
 | ~b9071–b9094 | `tools/server/server-models.h` + `server.cpp` | Router child→parent model info propagation: new `CMD_CHILD_TO_ROUTER_INFO` command; `setup_child_server()` gains `const json & model_info` parameter; new `update_loaded_info()` method; `server_model_meta` gains `loaded_info` field; all internally consistent across compiled upstream sources, no project changes required |
 | ~b9071–b9094 | `common/reasoning-budget.cpp` | Forced token logit no longer set to `+INFINITY`; only competing tokens set to `-INFINITY`; internal sampler behavior change, no project changes required |
 | ~b9071–b9094 | `tools/server/webui/` | Settings registry refactored (`settings-config.ts`/`settings-fields.ts`/`settings-sections.ts` merged into `settings-registry.ts`); MCP route `#/settings/mcp` → `#/mcp-servers`; settings route `/settings/chat/[section]` → `/settings/[[section]]`; UI-only, no project changes required |
+| ~b9094–b9102 | `ggml/src/ggml-cuda/allreduce.cu` + `allreduce.cuh` | Internal CUDA AllReduce pipeline refactored with `ggml_cuda_ar_pipeline` struct; `ggml_cuda_ar_pipeline_init(devices, n_devices)` / `_free` / `_allreduce` APIs; supports 2-GPU PCIe AllReduce without NCCL (Volta+ / sm70+); chunked kernel path (small tensors) vs copy-engine path (large tensors); `GGML_CUDA_ALLREDUCE` env = `nccl`/`internal`/`none`; env tuning vars `GGML_CUDA_AR_COPY_THRESHOLD` / `GGML_CUDA_AR_COPY_CHUNK_BYTES` / `GGML_CUDA_AR_BF16_THRESHOLD`; HIP/MUSA builds return nullptr stub; compiled automatically via FetchContent, no project changes required |
+| ~b9094–b9102 | `ggml/src/ggml-cuda/ggml-cuda.cu` | `GGML_LOG_WARN_ONCE` macro added; `ggml_backend_cuda_comm_context` gains `try_allreduce` fn pointer and `ar_pipeline`; three dispatch fns: `try_allreduce_nccl`, `try_allreduce_internal`, `try_allreduce_butterfly`; init chain: `comm_init_nccl` → `comm_init_internal` → `comm_init_none`; platform default Linux→NCCL, Windows→internal; no project changes required |
+| ~b9094–b9102 | `ggml/src/ggml-sycl/ggml-sycl.cpp` + `im2col.cpp` + `im2col.hpp` | New `ggml_sycl_im2col_3d` function; `GGML_OP_IM2COL_3D` now supported on Intel GPU via SYCL; 2D im2col kernel rewritten with tile-based `IC_KH_KW` thread decomposition; new `SYCL_IM2COL_BLOCK_SIZE 256`; additive, no project changes required |
+| ~b9094–b9102 | `ggml/CMakeLists.txt` | GGML version patch bumped 0.11.0 → 0.11.1; no project changes required |
+| ~b9094–b9102 | `common/sampling.cpp` | Bug fix in `common_sampler_sample`: `set_logits` now called at the top before backend-sampling check; backend sampling token-selection now scans all of `cur_p.data` to find matching token (instead of artificial 1-element array), fixing `cur_p.selected` for downstream `n_probs`; post-sampling probabilities now work correctly with backend sampling |
+| ~b9094–b9102 | `tools/server/server-context.cpp` | `need_logits` renamed to `need_pre_sample_logits`; only set when `n_probs > 0 && !post_sampling_probs`; backend sampling now works with `post_sampling_probs`; 0.0-probability tokens filtered from `result.probs`; compiled from upstream, no project JNI changes required |
+| ~b9094–b9102 | `src/llama-model.cpp` | `n_vocab` loading moved from `llama_model_base::load_hparams()` to per-model `load_arch_hparams()` (e.g. `src/models/deepseek2.cpp`, `src/models/llama.cpp`); internal model-loading refactor, no project changes required |
+| ~b9094–b9102 | `src/llama-model.cpp` | `ggml/src/ggml-virtgpu/ggml-backend-device.cpp` gains `#include <mutex>` for `std::once_flag`; internal backend fix, no project changes required |
+| ~b9094–b9102 | `vendor/cpp-httplib/httplib.cpp` + `httplib.h` | Security fix: chunk-size parsing replaced `strtoul` with manual hex-digit scanning to prevent overflow and reject invalid chunk extensions; version bumped to 0.43.4; compiled automatically, no project changes required |
 
 ## Build Commands
 
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -97,7 +97,7 @@ set(GGML_AVX512  OFF CACHE BOOL "" FORCE)
 FetchContent_Declare(
 	llama.cpp
 	GIT_REPOSITORY https://github.com/ggerganov/llama.cpp.git
-	GIT_TAG        b9094
+	GIT_TAG        b9102
 )
 FetchContent_MakeAvailable(llama.cpp)
 
diff --git a/README.md b/README.md
@@ -1,5 +1,5 @@
 ![Java 8+](https://img.shields.io/badge/Java-8%2B-informational)
-[![llama.cpp b9094](https://img.shields.io/badge/llama.cpp-%23b9094-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9094)
+[![llama.cpp b9102](https://img.shields.io/badge/llama.cpp-%23b9102-informational)](https://github.com/ggml-org/llama.cpp/releases/tag/b9102)
 [![Snapshot](https://img.shields.io/badge/snapshot-latest-informational)](https://github.com/bernardladenthin/java-llama.cpp/releases/tag/snapshot)
 
 # Java Bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp)
