You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: CLAUDE.md
+25-1Lines changed: 25 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -6,7 +6,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
6
6
7
7
Java bindings for [llama.cpp](https://github.com/ggerganov/llama.cpp) via JNI, providing a high-level API for LLM inference in Java. The Java layer communicates with a native C++ library through JNI.
8
8
9
-
Current llama.cpp pinned version: **b9245**
9
+
Current llama.cpp pinned version: **b9264**
10
10
11
11
## Upgrading CUDA Version
12
12
@@ -321,6 +321,30 @@ Also review the project `CMakeLists.txt` for build-system-level breaks (e.g. ren
|~b9222–b9245 |`convert_hf_to_gguf.py`, `convert_lora_to_gguf.py`, `examples/save-load-state/save-load-state.cpp`, `examples/llama-eval/*`, `tools/cli/README.md`, `tools/server/README.md`, `docs/speculative.md`, `docs/backend/SYCL.md`| Doc/example/tooling updates only. Not compiled by this project |
323
323
|~b9222–b9245 |`tools/ui/*`| WebUI source reorganisation (enum file renames `*.ts` → `*.enums.ts`, new chat components, Tailwind plugin imports). Project sets `LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE` in `CMakeLists.txt`, so the UI is never built — no impact |
324
+
|~b9245–b9264 |`src/llama-chat.{h,cpp}`|`LLM_CHAT_TEMPLATE_HUNYUAN_OCR` renamed to `LLM_CHAT_TEMPLATE_HUNYUAN_VL` (HunyuanOCR and HunyuanVL now share one template). Not referenced by project — no source changes required |
325
+
|~b9245–b9264 |`tools/mtmd/clip-impl.h` + `tools/mtmd/models/`|`PROJECTOR_TYPE_HUNYUANOCR` removed and merged into `PROJECTOR_TYPE_HUNYUANVL`; `hunyuanocr.cpp` renamed to `hunyuanvl.cpp`; clip graph class `clip_graph_hunyuanocr` renamed to `clip_graph_hunyuanvl`. Not referenced by project — no source changes required |
326
+
|~b9245–b9264 |`tools/mtmd/clip.h`|`clip_is_minicpmv()` and `clip_is_glm()` removed from public API. Not referenced by project — no source changes required |
327
+
|~b9245–b9264 |`tools/mtmd/clip.h` (`struct clip_context_params`) | New `bool no_alloc` field added (initialized via `mtmd_context_params_default()`). Additive default-zero — no project changes required |
328
+
|~b9245–b9264 |`tools/mtmd/mtmd.h`| New `mtmd_get_memory_usage()` C++ API for estimating mmproj VRAM/RAM usage. Additive, not called by project |
329
+
|~b9245–b9264 |`tools/mtmd/clip-model.h`| New `enum pad_style { PAD_NONE, PAD_CEIL, PAD_NEAREST }` replacing the `bool image_resize_pad` flag (allows Pillow-byte-parity nearest-integer rounding for DeepSeek-OCR). Internal to mtmd, project links `mtmd` as-is |
330
+
|~b9245–b9264 |`common/common.h` (`struct common_params_speculative_draft`) | New `bool backend_sampling = true` field — offloads draft sampling to the backend. Additive default-on; Java `ModelParameters` doesn't set it, so the upstream default applies. Backend sampler auto-disables when `split_mode == TENSOR` in `src/llama-context.cpp` — safe |
331
+
|~b9245–b9264 |`common/speculative.cpp`|`common_speculative_impl_draft_mtp` now registers a per-seq backend sampler chain (top-k 10) on `ctx_dft` via `llama_set_sampler`; cleaned up in destructor. Falls back to CPU sampler if `llama_set_sampler` fails. Internal to upstream-compiled speculative module, no project call sites |
332
+
|~b9245–b9264 |`app/` (new) | New optional unified `llama` binary (`llama-app` target) dispatching to `serve`/`cli`/`completion`/`bench`. Guarded by `LLAMA_BUILD_APP=OFF` default — project doesn't enable it |
333
+
|~b9245–b9264 |`tools/{cli,completion,llama-bench,server}/CMakeLists.txt`| Each tool split into a `*-impl` static library (the logic) plus a thin `main.cpp` wrapper; the `main()` in `cli.cpp`/`completion.cpp`/`llama-bench.cpp`/`server.cpp` is renamed to `llama_cli`/`llama_completion`/`llama_bench`/`llama_server` and now satisfies `-Wmissing-declarations` via a forward decl. Project does NOT compile any of these `.cpp` files — only `server-context.cpp`, `server-queue.cpp`, `server-task.cpp`, `server-models.cpp` (see `CMakeLists.txt:237`/`:302`) — so no impact |
334
+
|~b9245–b9264 |`tools/server/server-context.cpp`| Adds mmproj memory estimation: when `params_base.fit_params` is set, calls `mtmd_get_memory_usage(mmproj_path, mparams)` and adds the per-device cost into `params_base.fit_params_target` before `common_init_from_params`. Also calls `mtmd_helper_log_set(common_log_default_callback, nullptr)` once when `!is_resume`. Compiled upstream-as-is, no project call sites |
335
+
|~b9245–b9264 |`src/llama-context.cpp`| New `llama_context::set_sampler()` short-circuits with a one-shot `LLAMA_LOG_WARN` and returns `false` when `model.split_mode() == LLAMA_SPLIT_MODE_TENSOR` (backend sampling not supported with tensor split). Internal safety check, no project call sites |
336
+
|~b9245–b9264 |`common/arg.cpp`| New CLI flags `--spec-draft-backend-sampling` / `--no-spec-draft-backend-sampling` and env `LLAMA_ARG_SPEC_DRAFT_BACKEND_SAMPLING` to toggle the new `backend_sampling` field. Not exposed by `ModelParameters`; could be added later as a Java-side enhancement |
337
+
|~b9245–b9264 |`ggml/src/ggml-cuda/CMakeLists.txt` + `common.cuh` + `binbcast.cu`, `concat.cu`, `cpy.cu`, `fattn-*.cu`, `gated_delta_net.cu`, `getrows.cu`, `mean.cu`, `mmvf.cu`, `mmvq.cu`, `norm.cu`, `quantize.cu`, `reduce_rows.cuh`, `rope.cu`, `scale.cu`, `set-rows.cu`, `softcap.cu`, `ssm-conv.cu`, `ssm-scan.cu`, `sumrows.cu`, `topk-moe.cu`, `unary.cu`| New PDL (Programmatic Dependent Launch) infrastructure: `GGML_CUDA_USE_PDL` build flag (CUDART ≥ 11.8, non-HIP/MUSA); `ggml_cuda_pdl_sync()` / `ggml_cuda_pdl_lc()` device helpers (active on Hopper sm_90+); `ggml_cuda_kernel_launch_params` + `ggml_cuda_kernel_launch()` host template that calls `cudaLaunchKernelEx` with stream-serialization attribute when `GGML_CUDA_PDL` env var allows. Adds `90-virtual` (Hopper) to default `CMAKE_CUDA_ARCHITECTURES` when CUDA ≥ 11.8. Internal CUDA backend, no project changes required |
338
+
|~b9245–b9264 |`ggml/src/ggml-metal/ggml-metal-{device,ops}.cpp` + `ggml-metal.metal`| New 4-element `kernel_pad_*_4` variant (currently disabled — `is_c4 = false`); `kernel_pad` rewritten with 1024-element-per-block tiling for larger tensors; `kernel_cpy_*` rewritten to use `tpitg` rows-per-threadgroup batching; Q quantization cpy paths use 256-thread limit. Internal Metal backend |
339
+
|~b9245–b9264 |`ggml/src/ggml-hexagon/htp/` (`hmx-matmul-ops.c`, `hmx-ops.h`, `matmul-ops.c`, `main.c`) | HMX matmul refactor: K-loop tiled in 32-tile blocks with `Q6_activation_hf_mxmem_RR_deep`; the out-stationary fallback path for large M·K·N was deleted; function rename `hmx_mat_mul_permuted_w16a32` → `hmx_matmul_f16_f32`, `hmx_mat_mul_permuted_qk_0_d16a32` → `hmx_matmul_q_f32`, `hmx_mat_mul_permuted_w16a32_batched_params_t` → `hmx_matmul_f16_f32_batched_params_t`. HMX power-up code reorganized (`HAP_power_set_HMX_v2` now combines power-on + clock in one step for `__HVX_ARCH__ ≥ 75`). Internal Qualcomm DSP backend |
340
+
|~b9245–b9264 |`ggml/src/ggml-opencl/ggml-opencl.cpp`| Lazy kernel compilation: `argsort` and `flash_attn` programs are now built only when first needed (`load_cl_kernels_argsort` / `load_cl_kernels_flash_attn` called from `supports_op`); new device-supported probe in `ggml_opencl_is_device_supported` runs at registration time; renamed `ggml_cl2_init`/`ggml_cl2_free` → `ggml_cl_init`/`ggml_cl_free`; OpenCL contexts now live as long as the process. Internal OpenCL backend |
341
+
|~b9245–b9264 |`ggml/src/ggml-vulkan/vulkan-shaders/im2col.comp`| Refactor: precomputed base input coords and step deltas; running pointer/index for destination; one inlined unrolled loop iteration writes `BLOCK_SIZE` outputs per step. Internal Vulkan backend |
342
+
|~b9245–b9264 |`src/models/delta-net-base.cpp`| Renamed local variables (`state_in_3d`→`s_3d`, `state_3d`→`s_3d_pad`) when reshaping the recurrent state; behaviour unchanged |
343
+
|~b9245–b9264 |`tools/mtmd/mtmd-image.cpp`|`img_tool::resize()` takes a `pad_style` enum (was `bool add_padding`); new `PAD_NEAREST` rounding path for Pillow byte-parity; `mtmd_image_preprocessor_deepseekocr::preprocess` rewritten with `static constexpr` resolution table and `RESIZE_ALGO_BICUBIC_PILLOW` + `PAD_NEAREST`. Internal mtmd, project links as-is |
344
+
|~b9245–b9264 |`tools/mtmd/models/deepseekocr.cpp`| Extracted `build_sam(ggml_tensor *inp_raw)` member function from the monolithic build path; FA mask casting to F16 only when `flash_attn_type == CLIP_FLASH_ATTN_TYPE_ENABLED`. Internal |
345
+
|~b9245–b9264 |`conversion/hunyuan.py`, `gguf-py/gguf/constants.py`, `gguf-py/gguf/tensor_mapping.py`| HunyuanOCR / HunyuanVL unified in conversion: `VisionProjectorType.HUNYUANOCR` removed; `HunYuanVLForConditionalGeneration` registers a single `HunyuanVLVisionModel` + `HunyuanVLTextModel`; `vit.perceive.*` tensor mappings now only mention `HunyuanVL`. Python tooling, not compiled by project |
346
+
|~b9245–b9264 |`CMakeLists.txt` (upstream) | New `LLAMA_BUILD_APP` option (default OFF); deprecation shims for `LLAMA_BUILD_WEBUI`/`LLAMA_USE_PREBUILT_WEBUI` → `LLAMA_BUILD_UI`/`LLAMA_USE_PREBUILT_UI` preserved. Project's `set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE)` still works unchanged |
347
+
|~b9245–b9264 |`.devops/*.Dockerfile`, `.github/workflows/build-and-test-snapdragon.yml`, `scripts/snapdragon/`, `docs/backend/snapdragon/`, `tools/cli/README.md`, `tools/server/README.md`, `tools/mtmd/tests/`| Docker images add `conversion/` dir; snapdragon toolchain bumped v0.3 → v0.6 with `+dotprod+i8mm`; mtmd test rewritten to use CER/chrF metrics; doc-only updates. Not compiled by project |
0 commit comments