- Bump GIT_TAG, README badge, CLAUDE.md.
- Rename project-side cache pin from LLAMA_BUILD_WEBUI to LLAMA_BUILD_UI:
the top-level CMakeLists no longer forwards the deprecated name (the
shim survived only in tools/ui/CMakeLists.txt, which is not configured
in FetchContent mode).
- No C++ source changes required. server-context.cpp now includes
common/fit.h to estimate draft/MTP VRAM when fit_params is on; the
include resolves via the existing llama-common include path and the
feature is purely additive.
Local: 435/435 ctest pass.
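
The draft/MTP VRAM estimate mentioned above amounts to adding the extra model's per-device footprint to a per-device reserve before the target model is fitted. The following is a minimal sketch of that accounting only. `device_memory` and `reserve_for_draft` are hypothetical names introduced for illustration, not the upstream `common/fit.h` API, and treating the fit target as a per-device reserve is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-device breakdown; field names are illustrative only.
struct device_memory {
    std::size_t model   = 0; // bytes used by weights
    std::size_t context = 0; // bytes used by context state (e.g. KV cache)
    std::size_t compute = 0; // bytes used by compute buffers
};

// Adds the draft model's footprint on each device to the amount that is
// reserved before the target model is fitted.
// reserved[i] and draft[i] are assumed to describe the same device.
inline void reserve_for_draft(std::vector<std::size_t> & reserved,
                              const std::vector<device_memory> & draft) {
    assert(reserved.size() == draft.size());
    for (std::size_t i = 0; i < draft.size(); ++i) {
        reserved[i] += draft[i].model + draft[i].context + draft[i].compute;
    }
}
```

Because the helper only ever increases the reserve, a setup without a draft model (all-zero entries) leaves the fit unchanged, which matches the "purely additive" description in the commit message.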
CLAUDE.md (12 additions, 1 deletion)
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co

 Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.

-Current llama.cpp pinned version: **b9297**
+Current llama.cpp pinned version: **b9305**

 ## Upgrading CUDA Version

@@ -408,6 +408,17 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
|~b9284–b9297 |`ggml/src/ggml-vulkan/CMakeLists.txt`|`find_package(SPIRV-Headers)` switched to `CONFIG REQUIRED` and adds `$ENV{VULKAN_SDK}` to `CMAKE_PREFIX_PATH`; fixes detection when SPIRV-Headers ships only the CMake-config files (no FindSPIRV-Headers.cmake). Internal Vulkan build config, no project changes required |
|~b9284–b9297 |`ggml/src/ggml-zendnn/` (`CMakeLists.txt`, `ggml-zendnn.cpp`) | ZenDNN bumped to ZenDNN-2026-WW19; Q8_0 weight support added for matmul and matmul_id paths via dynamic quantization (S8 compute, BF16 scales); ZenDNN matmul/matmul_id now handles `GGML_TYPE_Q8_0` with FP32 src1 directly without F32→Q8_0 conversion. Internal AMD ZenDNN backend, no project changes required |
|~b9284–b9297 |`tools/perplexity/perplexity.cpp`|`log_probs.resize(n_ctx * nv)` widened to `size_t(n_ctx) * nv` to avoid 32-bit overflow on large context sizes. Standalone tool not compiled by project, no impact |
+|~b9297–b9305 | upstream `CMakeLists.txt`| Top-level backward-compat shims that forwarded `LLAMA_BUILD_WEBUI` → `LLAMA_BUILD_UI` and `LLAMA_USE_PREBUILT_WEBUI` → `LLAMA_USE_PREBUILT_UI` were REMOVED (they now live only in `tools/ui/CMakeLists.txt`). **Java impact**: project's `set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE)` no longer hits the shim at top level. `tools/ui` is not configured in FetchContent mode (`LLAMA_BUILD_TOOLS=OFF`), so the old setting was inert in practice, but the project's `CMakeLists.txt:107` was renamed to `set(LLAMA_BUILD_UI OFF CACHE BOOL "" FORCE)` for clarity and to defend against future flips of `LLAMA_BUILD_UI` default |
+|~b9297–b9305 |`common/common.h`|`LLAMA_UI_DEFAULT_ENABLED` macro removed; `common_params::ui` default is now unconditionally `true`. Not referenced by project, no changes required |
+|~b9297–b9305 |`common/fit.{h,cpp}`|`common_get_device_memory_data()` made non-static and exported from `fit.h` (was a file-local helper). `fit.h` now also pulls in `ggml-backend.h`, `llama.h`, and `../src/llama-ext.h`. Used by upstream `tools/server/server-context.cpp` (compiled directly into jllama). The `#include "../src/llama-ext.h"` resolves relative to fit.h's location (`common/../src/llama-ext.h`), so no extra include paths are required. No project source changes |
+|~b9297–b9305 |`tools/server/server-context.cpp`| New `#include "fit.h"` and a new draft/MTP memory measurement block: when `params_base.fit_params` is set AND the speculative config includes a draft model or `COMMON_SPECULATIVE_TYPE_DRAFT_MTP`, `common_get_device_memory_data()` is called against the draft model (or a copy of the target params with `LLAMA_CONTEXT_TYPE_MTP` for MTP) and the resulting per-device `model + context + compute` bytes are added to `params_base.fit_params_target` before the target context is fitted. Compiled directly into jllama from upstream; behaviour is additive and only triggers for speculative-decoding setups. `ModelParameters.setFit(boolean)` defaults to `on`, so this kicks in automatically when a user configures a draft model; no Java-side wiring required |
+|~b9297–b9305 |`tools/server/server-context.cpp`|`[mtmd] estimated memory usage of mmproj` log line reworded to `estimated worst-case memory usage`; log only, no behavioural change |
+|~b9297–b9305 |`tools/server/server-http.cpp`| UI serving path migrated from per-asset extern arrays (`index_html`, `bundle_js`, …) and the `LLAMA_BUILD_UI` macro to a runtime `llama_ui_find_asset()` lookup gated on the new `LLAMA_UI_HAS_ASSETS` macro generated by the new `llama-ui-embed` host tool. Project does NOT compile `server-http.cpp` (only `server-context.cpp`/`server-queue.cpp`/`server-task.cpp`/`server-models.cpp`), no impact |
+|~b9297–b9305 |`tools/ui/` (`CMakeLists.txt`, new `embed.cpp`, new `sources.cmake`, new `scripts/ui-assets.cmake`, removed `scripts/ui-download.cmake` + `scripts/xxd.cmake`, removed `ui.cpp`+`ui.h`) | Full UI build pipeline rewrite: `xxd.cmake`+`ui-download.cmake` replaced by a host-compiled `llama-ui-embed` C++ tool that generates `ui.cpp`/`ui.h` (declaring a `g_assets[]` table and `llama_ui_find_asset()` lookup, plus `LLAMA_UI_HAS_ASSETS` macro) from arbitrary asset files; new `scripts/ui-assets.cmake` orchestrates asset provisioning with a clearer priority (pre-built `tools/ui/dist` → npm build → HF Bucket); `tools/ui` is now an `add_custom_target` that is re-run on every build. The deprecation shims for `LLAMA_BUILD_WEBUI`/`LLAMA_USE_PREBUILT_WEBUI`/`LLAMA_WEBUI_HF_BUCKET` moved here from the top-level `CMakeLists.txt`. Project does not build the UI (`LLAMA_BUILD_TOOLS=OFF` in FetchContent mode), no impact |
+|~b9297–b9305 |`ggml/include/ggml-alloc.h`| Comment-only API documentation update for `ggml_backend_alloc_ctx_tensors_from_buft`. No project changes required |
+|~b9297–b9305 |`ggml/src/ggml-backend-meta.cpp`| Bug fix for zero-sized split tensor slices: `set_tensor`/`get_tensor`/`set_tensor_async`/`get_tensor_async` paths now `continue` when `chunk_size_j == 0`; `ggml_backend_meta_alloc_ctx_tensors_from_buft` now allocates a dummy buffer when all tensors in a context are zero-sized (was returning `NULL` and asserting); `ggml_backend_buft_alloc_buffer` result now `GGML_ASSERT`ed non-null. Internal backend code, no project changes required |
+|~b9297–b9305 |`ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c`|`hvx_vec_splat_f16(hvx_vec_get_f16(...))` round-trip replaced with `hvx_vec_repl_f16(...)` which stays in the vector domain via `vdelta` (avoids store/reload through scalar). Internal Hexagon DSP backend optimization, no project changes required |
+|~b9297–b9305 |`ggml/src/ggml-opencl/ggml-opencl.cpp`|`GGML_OPENCL_PROFILING` batching fix: when `profiling_info` reaches 2048 entries the batch is now flushed into a persistent `profiling_results` vector (events released, durations populated) instead of accumulating until shutdown. Also fixes missing `]` closing the JSON array in `cl_trace.json`. Profile-only code (`GGML_OPENCL_PROFILING` is off by default), no project changes required |
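
The `perplexity.cpp` row above describes widening a product to `size_t` before multiplying. The sketch below shows why the cast has to happen on an operand rather than on the result. It assumes both factors are 32-bit integers and that `size_t` is 64 bits wide; the function names and the sample values are illustrative and not taken from upstream.

```cpp
#include <cstddef>
#include <cstdint>

static_assert(sizeof(std::size_t) >= 8, "sketch assumes a 64-bit size_t");

// 32-bit product: the result wraps modulo 2^32. Unsigned operands are used
// so the wrap is well-defined; with plain int the overflow would be
// undefined behaviour instead.
inline std::uint32_t element_count_32(std::uint32_t n_ctx, std::uint32_t nv) {
    return n_ctx * nv;
}

// Widened product: one operand is converted to size_t first, so the
// multiplication itself is carried out in size_t.
inline std::size_t element_count_wide(std::int32_t n_ctx, std::int32_t nv) {
    return std::size_t(n_ctx) * nv;
}
```

With `n_ctx = 131072` and `nv = 32768` the exact product is 2^32, which the 32-bit version reduces to 0 while the widened version preserves it.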