llama.cpp upstream breaking changes — version-range changelog

Per-version-range record of upstream API breaks observed in the b5022 → latest range: for each break, the affected upstream files and the project-side fix (or "no project changes required" when the break stayed inside an upstream-compiled translation unit).

Used during llama.cpp version bumps: when upgrading, scan this file from the row matching the current pinned version forward to the target, apply any rows marked as needing project source changes, and append a new row covering the upgrade range. See the "Upgrading/Downgrading llama.cpp Version" section in ../../CLAUDE.md for the upgrade workflow.
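
The scan step described above can be sketched as a small script. This is a minimal illustration, not part of the project: it assumes the table stays tab-separated with `Version`, `File`, `Change` columns, and the names `parse_range`, `rows_for_upgrade`, and `NO_ACTION_HINTS` are hypothetical. The "needs review" flag is only a text heuristic based on the phrases this file already uses; it does not replace reading the rows.

```python
import re

# Matches Version cells such as "~b7433" or "~b7217–b7433" (en dash or hyphen).
RANGE_RE = re.compile(r"^~?b(\d+)(?:[–-]b(\d+))?$")

# Phrases this changelog uses for rows that need no project-side work (heuristic).
NO_ACTION_HINTS = ("no project", "upstream only", "not used", "not called",
                   "not referenced", "no impact", "no jni")


def parse_range(cell):
    """Return (start, end) build numbers for a Version cell, or None if it is not one."""
    m = RANGE_RE.match(cell.strip())
    if not m:
        return None
    start = int(m.group(1))
    end = int(m.group(2)) if m.group(2) else start
    return start, end


def rows_for_upgrade(text, current, target):
    """List (version, file, needs_review) for rows relevant to current -> target."""
    selected = []
    for line in text.splitlines():
        cells = line.split("\t")
        if len(cells) != 3:
            continue  # prose, blank line, or malformed row
        span = parse_range(cells[0])
        if span is None:
            continue  # header row
        start, end = span
        if start == end:
            hit = current < end <= target          # single-version row
        else:
            hit = end > current and start < target  # range row overlapping the upgrade
        if hit:
            change = cells[2].lower()
            needs_review = not any(h in change for h in NO_ACTION_HINTS)
            selected.append((cells[0], cells[1], needs_review))
    return selected


# A few rows copied from the table below, used as sample input.
SAMPLE = "\n".join([
    "Version\tFile\tChange",
    "~b7433\t`common/arg.h`\t`n_parallel` default changed to sentinel `-1` (auto); "
    "Java bindings must resolve to `1` before model load",
    "~b7864\t`common/mtmd.h`\t`mtmd_init_params.verbosity` field removed",
    "~b9022–b9049\t`ggml/CMakeLists.txt`\tGGML version bumped 0.10.2 → 0.11.0; "
    "no project changes required",
    "~b9094–b9102\t`ggml/CMakeLists.txt`\tGGML version patch bumped 0.11.0 → 0.11.1; "
    "no project changes required",
])

for version, path, needs_review in rows_for_upgrade(SAMPLE, current=7433, target=9049):
    print(version, path, "REVIEW" if needs_review else "no action noted")
```

In practice the text passed to `rows_for_upgrade` would be the contents of this file, and every row flagged for review should still be read in full before touching project sources.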

Version	File	Change
~b7217–b7433	`common/common.h`, `include/llama-cpp.h`	`common_init_result` became `common_init_result_ptr`; access changed to `->model()` / `->context()` / `->free_context()`
~b7433	`common/arg.h`	`n_parallel` default changed to sentinel `-1` (auto); Java bindings must resolve to `1` before model load
~b7217–b7783	`common/arg.h` → `common/download.h`	`common_remote_get_content` and `common_remote_params` split into new `download.h`; `headers` changed from `vector<string>` to `vector<pair>`
~b7783	`common/common.h`	`build_info` string moved into `common.h`; local definition must be removed
~b7783–b7858	`common/chat.h`	`common_chat_syntax` renamed to `common_chat_parser_params`; `to_json_oaicompat<json>()` template removed (no template arg); `ensure_tool_call_ids_set()` → `set_tool_call_ids()`
~b7858–b7864	`common/speculative.h`	Full redesign: `common_speculative_init(ctx_tgt, ctx_dft)` → `common_speculative_init(params_speculative, ctx)`; `common_speculative_gen_draft` → `common_speculative_draft`; new `common_speculative_accept()`; `common_speculative_params` struct replaced by `common_params_speculative`; draft model loaded via `llama_model_load_from_file` into `llama_model_ptr`
~b7858–b7864	`common/common.h`	`params_speculative`: `.model.path`/`.hf_repo` replaced by `.has_dft()`/`.mparams_dft`; new `.model_dft` and `.cparams_dft` fields; `speculative.type` enum added (`COMMON_SPECULATIVE_TYPE_NONE`)
~b7858–b7864	`server.hpp` (internal)	`slot_action.slot_id` → `slot_action.id_slot`; `llama_init_dft` removed from `server_context`; `model_dft` changed from `llama_model*` to `llama_model_ptr`; `slot.ctx_tgt`/`ctx_dft` removed
~b7864	`common/mtmd.h`	`mtmd_init_params.verbosity` field removed
~b7904–b8190	`common/common.h`	`params_base.model_alias` changed from `std::string` to a container; use `*model_alias.begin()` instead of direct string cast
~b8778–b8808	`tools/mtmd/mtmd.h`	`MTMD_DEFAULT_IMAGE_MARKER` macro removed; `mtmd_image_tokens_get_nx/ny` deprecated; new `mtmd_decoder_pos` struct + `mtmd_image_tokens_get_decoder_pos()`; `mtmd_context_params_default()` now sets `image_marker = nullptr` (throws `"custom image_marker is not supported anymore"` if non-null); upstream server adds randomized `get_media_marker()` in `server-common.h` — our `server.hpp` is unaffected since it does not include that header and uses `mtmd_default_marker()` consistently
~b8808–b8831	project `CMakeLists.txt`	CMake target `common` renamed to `llama-common`; update `target_link_libraries` for `jllama` and `jllama_test`
~b8808–b8831	`common/common.h` → new `common/build-info.h`	`build_info` `std::string` removed; replaced by `llama_build_info()` (`const char*`) in new `build-info.h`; add `#include "build-info.h"` in `server.hpp` and `utils.hpp`; call sites: `std::string(llama_build_info())` in `server.hpp` (6×), `llama_build_info()` in `jllama.cpp` (1×) and `utils.hpp` (1×)
~b8808–b8831	`ggml/src/ggml.c`	New `ggml_graph_next_uid()` calls `_InterlockedIncrement64` via `<intrin.h>` on x86; intrinsic unavailable on 32-bit MSVC; fix: `src/main/cpp/compat/ggml_x86_compat.c` provides `__cdecl _InterlockedIncrement64` via `InterlockedIncrement64` (CMPXCHG8B), added to `ggml-base` via `target_sources` guarded by `MSVC AND CMAKE_SIZEOF_VOID_P EQUAL 4`
~b8838–b8841	`src/llama-model.h`	Attention bias fields renamed: `bq`→`wq_b`, `bk`→`wk_b`, `bv`→`wv_b`, `bo`→`wo_b`, `bqkv`→`wqkv_b`; internal to llama.cpp, no impact on this project
~b8841–b8854	`common/common.h`	`common_params::clear_idle` renamed to `cache_idle_slots`; new `common_context_seq_rm_type` enum + `common_context_can_seq_rm()` replacing `common_speculative_is_compat()`; `get_model_endpoint()` → `common_get_model_endpoint()`
~b8841–b8854	`tools/mtmd/mtmd.h` + `mtmd-helper.h`	`mtmd_decoder_pos` gains `z` field; `mtmd_image_tokens_get_decoder_pos()` + `mtmd_helper_image_get_decoder_pos()` gain new `pos_0` parameter
~b8841–b8854	project `utils.hpp` / `server.hpp`	`server_tokens::get_text_tokens()` split: `get_tokens()` returns raw `const llama_tokens &`; new `get_text_tokens()` returns filtered copy (removes `LLAMA_TOKEN_NULL` mtmd placeholders); save/load and context-shift call sites updated to `get_tokens()`
~b8854–b8887	`common/chat.h`	`common_chat_msg_diff_to_json_oaicompat` removed; moved to `tools/server/server-chat.cpp`; project defines it locally in `server.hpp` — importing server-chat.cpp is impractical because it pulls in `convert_transcriptions_to_chatcmpl` → `get_media_marker` → `server-common.cpp`
~b8854–b8887	`common/common.h`	`common_params::reasoning_budget` and `reasoning_budget_message` moved into `common_params::sampling` sub-struct as `reasoning_budget_tokens`; update: `params_base.reasoning_budget` → `params_base.sampling.reasoning_budget_tokens`
~b8854–b8887	`common/fit.h` (new)	`llama_params_fit` and `llama_memory_breakdown_print` removed from `include/llama.h`; now `common_fit_params` / `common_memory_breakdown_print` in new `common/fit.h`; not used directly by project
~b8887–b8913	`tools/server/server-chat.h`	`convert_transcriptions_to_chatcmpl` gained a new `const common_chat_templates * tmpls` second parameter; not called by project's `server.hpp` — handled automatically by upstream `server-chat.cpp`
~b8887–b8913	`tools/server/server-task.cpp`	`n_discard` clamped to non-negative: `params.n_discard = std::max(0, params.n_discard)`; applied in project's `server.hpp` after the `json_value` parse
~b8887–b8913	`tools/server/server-common.cpp`	`parallel_tool_calls` now defaults to `caps["supports_parallel_tool_calls"]` instead of hardcoded `false`; handled automatically by upstream file
~b8887–b8913	`common/chat.h`	New additive `common_chat_prompt_preset` struct and `common_chat_get_asr_prompt()` function; no project changes required
~b8887–b8913	`common/common.h`	New `string_starts_with(std::string_view, char)` overload added; no project changes required
~b8887–b8913	`tools/mtmd/mtmd.cpp`	Added `LLAMA_ROPE_TYPE_NONE` case to rope-type switch; internal fix, no project changes required
~b8913–b8953	`common/debug.h`	`base_callback_data` renamed to `common_debug_cb_user_data`; template `common_debug_cb_eval<false/true>` replaced by plain `common_debug_cb_eval`; not used by this project
~b8913–b8953	`tools/server/server-http.h`	New `uploaded_file` struct; `files` map type changed from `map<string, raw_buffer>` to `map<string, uploaded_file>`; upstream server sources compiled directly — no project impact
~b8913–b8953	`src/llama-quant.cpp`	Default quantization ftype changed from `LLAMA_FTYPE_MOSTLY_Q5_1` to `LLAMA_FTYPE_MOSTLY_Q8_0`; upstream only
~b8913–b8953	`src/models/llama.cpp`, `qwen3.cpp`, `qwen3moe.cpp`	Removed duplicate `ggml_mul` for `wo_s` scale (now handled exclusively by `build_attn`); upstream only
~b8953–b8962	`common/common.h`	`struct cpu_params` → `struct common_cpu_params`; `cpu_get_num_physical_cores()` → `common_cpu_get_num_physical_cores()`; `cpu_get_num_math()` → `common_cpu_get_num_math()`; not used directly by project
~b8953–b8962	`common/common.h`	`common_params_speculative` fully restructured with nested sub-structs: `.mparams_dft`/`.model_dft`/`.cparams_dft`/`.n_max`/`.n_min`/`.p_split`/`.p_min` → `.draft.mparams`/`.draft.model`/`.draft.cparams`/`.draft.n_max`/`.draft.n_min`/`.draft.p_split`/`.draft.p_min`; ngram fields moved to `.ngram_cache`/`.ngram_mod`/`.ngram_simple`/etc sub-structs; not referenced by project directly
~b8953–b8962	`common/arg.h`	`is_sparam` bool split into `is_sampling` + `is_spec`; `set_sparam()` split into `set_sampling()` + `set_spec()`; not used by project
~b8953–b8962	`tools/server/server-task.cpp`	`task_params::to_json()` drops `"speculative.n_max"`, `"speculative.n_min"`, `"speculative.p_min"` from output; only `"speculative.type"` remains; test `SlotParamsToJson.SpeculativeFields_Present` updated accordingly
~b8953–b8962	`common/speculative.h`	New public API: `common_speculative_n_max()` and `common_speculative_n_min()` added; server-context.cpp uses these instead of direct field access; no project changes required
~b8962–b8982	`common/sampling.h`	`common_sampler_accept` 3rd param renamed `accept_grammar` → `is_generated`; semantics broadened: `false` now also skips reasoning budget update (not just grammar); no project call sites affected
~b8962–b8982	`common/reasoning-budget.h`	Two overloads merged: `prefill_tokens` variant removed; new single overload takes `initial_state = REASONING_BUDGET_IDLE`; prefill now fed via `llama_sampler_accept()` loop after init; not called directly by project
~b8962–b8982	`ggml/src/ggml-cuda/ssm-conv.cuh`	`ggml_cuda_op_ssm_conv` gained optional `bias_add_node` param; `SSM_CONV + ADD + SILU` fusion now supported; internal CUDA code, no project changes required
~b8962–b8982	`common/speculative.cpp`	Draft token confidence check (`p_min`) moved before push to result: low-confidence tokens are now discarded entirely rather than included then ignored; behavior fix, no project changes required
~b8962–b8982	`tools/server/server-context.cpp`	`n_draft_total` accounting moved to draft generation site instead of acceptance site (bug fix); upstream only
~b8982–b8994	`ggml/src/ggml-cuda.cu`	`ggml_backend_cuda_i` struct: `.get_tensor_2d_async` and `.set_tensor_2d_async` function pointers were swapped (get pointed to set impl and vice versa); corrected; internal CUDA backend, no project changes required
~b8982–b8994	`ggml/src/ggml-vulkan.cpp`	`ggml_vk_buffer_write_2d_async` and `ggml_vk_buffer_write_2d` gained a `dpitch` parameter; Vulkan now implements `set_tensor_2d`/`get_tensor_2d` in buffer interface; internal backend code, no project changes required
~b8982–b8994	`common/speculative.cpp`	Checkpoint helpers renamed: `draft_create_checkpoint` → `create_checkpoint`, `draft_restore_checkpoint` → `restore_checkpoint`; `ckpt_size` field removed (size computed from context directly); internal speculative module, not called by project
~b8982–b8994	`common/arg.cpp`	CLI option typo fixed: `--spec--draft-p-split` → `--spec-draft-p-split` (extra dash removed); CLI-only, no project changes required
~b8982–b8994	`src/llama-mmap.cpp`	Windows large-file (>2 GB) fix: `ftell`/`fseek` replaced with `_ftelli64`/`_fseeki64`; upstream only
~b8982–b8994	`tools/server/httplib.h`	cpp-httplib bumped to v0.43.2: Windows `FILE_SHARE_WRITE` fix, Linux DNS cancel race fix, mbedTLS `close_notify` fix; upstream server header, no project changes required
~b8982–b8994	`tools/server/server-context.cpp`	New `LLAMA_TRACE` env variable enables slot acceptance tracing; upstream only
~b8994–b9004	`ggml/src/ggml-vulkan/ggml-vulkan.cpp`	`vk_fa_pipeline_state` gains `k_type`/`v_type` fields; `get_fa_tuning_params_coopmat2` now takes separate `k_type`/`v_type` params; mixed K/V type FA pipeline creation refactored to `CREATE_FA_CM2_MIXED()` macro; `flash_attn_cm2.comp` shader uses runtime `FaTypeK`/`FaTypeV` spec constants (spec constants 12–15 added); `DECODEFUNC`/`NEEDS_INIT_IQ_SHMEM` macros removed; internal Vulkan backend, no project changes required
~b8994–b9004	`ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp`	`get_mul_mat_fast_pipeline` vectorized-path condition fixed: `dst->ne[1] % 4 == 0` check removed (was preventing vectorization for non-multiple-of-4 batch sizes); internal WebGPU backend, no project changes required
~b8994–b9004	`ggml/src/ggml-hexagon/`	Hexagon HTP backend: FA `exp2` half-precision option, unary-op non-contiguous tensor fix; internal DSP backend, no project changes required
~b8994–b9004	`tools/server/webui/`	Major frontend component reorganization (Svelte/TypeScript); purely UI, no C++ or JNI impact
~b9004–b9016	`src/llama-io.h`	`llama_io_read_i` interface changed: `read(size_t)` → `read(void *, size_t)`, `read_to(void *, size_t)` removed, new `read_tensor(tensor, offset, size)` added; `llama_io_write_buffer`/`llama_io_read_buffer` now batch backend tensor ops in destructors for performance; internal state-save/load path, not called by project
~b9004–b9016	`tools/server/server-context.cpp`	Static `server_get_checkpoint()` (returns by value) renamed to `server_prompt_checkpoint_update()` (takes `server_prompt_checkpoint &` by reference, in-place update); compiled directly into jllama, no call site in project code
~b9004–b9016	`common/arg.cpp` + docs	Speculative decoding CLI args renamed: `--draft`/`--draft-n`/`--draft-max` and `--draft-min`/`--draft-n-min` were REMOVED (handler `throw`s `std::invalid_argument` at parse time, not just deprecated); other draft flags (`--draft-p-min`, `--ctx-size-draft`, `--device-draft`, `--gpu-layers-draft`, `--model-draft`) kept as aliases for new canonical `--spec-draft-` names. Java impact: `ModelParameters.setDraftMax`/`setDraftMin` produced removed flags → threw at model load; fixed to canonical `--spec-draft-n-max`/`--spec-draft-n-min`. Other `setDraft` methods updated to canonical names for forward compatibility. Env vars also renamed (`LLAMA_ARG_DRAFT_MAX`→`LLAMA_ARG_SPEC_DRAFT_N_MAX`, etc.)
~b9004–b9016	`ggml/src/ggml-cuda/ggml-cuda.cu`	PCI bus ID detection replaced `snprintf` with `cudaDeviceGetPCIBusId` (buffer 16→32 bytes); HIP/MUSA compat headers gain `cudaDeviceGetPCIBusId` alias; internal CUDA backend
~b9004–b9016	`ggml/src/ggml-opencl/`	Adreno MoE MXFP4: new `kernel_convert_block_mxfp4_trans4_ns`/`restore` kernels in `cvt.cl`; new `gemm_moe_mxfp4_f32_ns`, `gemv_moe_mxfp4_f32_ns`, `moe_reorder_b`, `moe_sort_by_expert` kernel files; GPU-side router reorder replaces CPU-side preprocessing; `q_img` created for GEMM path; internal OpenCL backend
~b9004–b9016	`ggml/src/ggml-vulkan/ggml-vulkan.cpp`	`GGML_VK_MAX_NODES 8192` macro removed (node limit now determined differently); internal Vulkan backend
~b9004–b9016	`ggml/src/ggml-webgpu/`	`ggml_webgpu_row_norm_pipeline_key` gains `src_type`/`dst_type` fields; `GGML_OP_NORM` now supported alongside `GGML_OP_RMS_NORM`/`GGML_OP_L2_NORM`; `row_norm.wgsl` gains SRC_TYPE/DST_TYPE parameterization and NORM two-pass algorithm; internal WebGPU backend
~b9004–b9016	`src/llama-model.cpp`	`rope_yarn_log_mul` `get_key` call changed from `required=0.0f` to `required=false`; fixes Mistral YaRN log_mul loading; internal model loading, no project impact
~b9004–b9016	`common/chat.cpp`	`common_chat_templates_generation_prompt()` extracted from `common_chat_templates_apply_jinja()`; internal refactor, no API change
~b9016–b9022	`src/llama-model.h` + `src/llama-model.cpp` + `src/models/`	`llama_model` becomes abstract base with pure virtual methods (`load_stats`, `load_hparams`, `load_vocab`, `load_tensors`, `load_arch_hparams`, `load_arch_tensors`, `build_arch_graph`); `load_arch()` removed; new intermediate `llama_model_base` class provides concrete implementations; per-arch subclasses (e.g. `llama_model_llama`, `llama_model_gemma2`) in `src/models/`; factory `llama_model_create(llm_arch, params)` and `llama_model_create(ml, params)` replace direct instantiation; `LLAMA_LOAD_LOCALS` convenience macro added; public C API (`llama_model_load_from_file` etc.) unchanged — no project impact
~b9016–b9022	`src/models/`	Many model files renamed: `cohere2-iswa.cpp`→`cohere2.cpp`, `gemma2-iswa.cpp`→`gemma2.cpp`, `gemma3n-iswa.cpp`→`gemma3n.cpp`, `gemma4-iswa.cpp`→`gemma4.cpp`, `mimo2-iswa.cpp`→`mimo2.cpp`, `openai-moe-iswa.cpp`→`openai-moe.cpp`, `pangu-embedded.cpp`→`pangu-embed.cpp`, `qwen3vl-moe.cpp`→`qwen3vlmoe.cpp`, `step35-iswa.cpp`→`step35.cpp`; new model files added (`deepseek2ocr.cpp`, `glm-dsa.cpp`, `granite-moe.cpp`, `hunyuan-vl.cpp`, `jina-bert-v2/v3.cpp`, `lfm2moe.cpp`, `llama-embed.cpp`, `mamba2.cpp`, `minicpm.cpp`, `mistral4.cpp`, `nemotron-h-moe.cpp`, `nomic-bert.cpp`, `nomic-bert-moe.cpp`, `phimoe.cpp`); upstream only, no project changes required
~b9016–b9022	`tools/server/server-context.cpp`	Static `server_prompt_checkpoint_update()` (the function renamed in the ~b9004–b9016 row) changed from returning by value to taking `server_prompt_checkpoint &` by reference; compiled directly into jllama, no project call site
~b9016–b9022	`tools/server/server-tools.cpp`	New built-in `get_datetime` tool added via new `server_tool_get_datetime` struct in `build_tools()`; no project changes required (handled automatically by compiled upstream source)
~b9016–b9022	`common/chat-auto-parser-generator.cpp`	`force_tools` variable removed from `build_tool_parser_json_native`, `build_tool_parser_tag_json`, `build_tool_parser_tag_tagged`; content before tool calls is now always `p.optional(p.content(...))` regardless of `tool_choice=required`; upstream only, no project changes required
~b9016–b9022	`common/chat-peg-parser.h/cpp`	New `optspace(const std::string & tag)` method added to `common_chat_peg_builder`; makes leading/trailing spaces in reasoning tags optional; upstream only, no project changes required
~b9016–b9022	`common/reasoning-budget.cpp`	Forced token logit now set to `+INFINITY` (previously left at whatever the model computed); reasoning budget enforcement is now absolute; upstream only, no project changes required
~b9016–b9022	`common/chat.cpp`	`thinking_start_tag` and `thinking_end_tag` now trimmed via `trim_whitespace()`; upstream only, no project changes required
~b9016–b9022	`examples/diffusion/`	`diffusion_generate` extracted from `diffusion-cli.cpp` to new `diffusion.h`/`diffusion.cpp` static library; enum names prefixed: `ORIGIN`→`DIFFUSION_ALGORITHM_ORIGIN`, `TIMESTEP_BASED`→`DIFFUSION_TRANSFER_SCHEDULE_TIMESTEP_BASED` etc.; examples only, no project changes required
~b9022–b9049	`include/llama.h`	New `LLAMA_STATE_SEQ_FLAGS_ON_DEVICE 2` macro added alongside existing `LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY 1`; enables on-device KV cache state save/restore without host round-trip via `llama_state_seq_get_size_ext`/`get_data_ext`/`set_data_ext`; no project call-site changes required (not used by JNI layer)
~b9022–b9049	`src/llama-context.cpp`	State seq data format breaking change: `llama_state_seq_get_data`/`set_data` now prepend a 4-byte magic (`0xaf143cd8`) + 4-byte `seq_id` header; state data saved with ≤b9022 is incompatible with b9049+; internal I/O classes renamed `llama_io_write_buffer`→`llama_io_write_host`, `llama_io_read_buffer`→`llama_io_read_host`; new `llama_io_write_device`/`llama_io_read_device` classes for on-device paths; no project changes required (not called by JNI layer)
~b9022–b9049	`ggml/include/ggml.h`	New `ggml_op_hint` enum (`GGML_HINT_DEFAULT=0`, `GGML_HINT_SRC0_IS_HADAMARD=1`) and `ggml_mul_mat_set_hint()` function added for FWHT (Fast Walsh-Hadamard Transform) support; used internally in `llama-graph.cpp` / `llama-kv-cache.cpp`; no project call-site changes required
~b9022–b9049	`src/llama.cpp`	`llama_backend_init()` now auto-calls `ggml_backend_load_all()` if no backends are yet registered; `ggml_backend_load_all()` removed from `common_params_parser_init()` (was in `common/arg.cpp`); no project changes required — backend loading still happens correctly
~b9022–b9049	`tools/server/server-context.cpp`	`server_prompt_checkpoint_update()` gained an `on_device` bool parameter; speculative checkpoints now use `LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE`; compiled directly into jllama from upstream source — no project call-site changes required
~b9022–b9049	`src/llama-model.cpp`	Unsupported model architecture now throws `std::runtime_error` instead of calling `GGML_ABORT`; allows callers to catch unknown-arch errors gracefully; no project changes required
~b9022–b9049	`ggml/CMakeLists.txt`	GGML version bumped 0.10.2 → 0.11.0; no project changes required
~b9022–b9049	`vendor/cpp-httplib/`	Updated to 0.43.3: `str2tag` converted to iterative loop (eliminates recursion stack depth risk), `res.body.reserve` now OOM-safe; upstream server header, no project changes required
~b9049–b9071	`common/chat.h`	`contains_media()` method added to `common_chat_msg`; `to_json_oaicompat()` now forces text concatenation when message contains media markers; additive change, no project impact
~b9049–b9071	`src/llama-arch.h/cpp` + `src/llama-hparams.h`	New `LLM_KV_ATTENTION_VALUE_SCALE` KV key and `f_attn_value_scale` hparam field added for MiMo-V2 attention value scaling; additive, no project changes required
~b9049–b9071	`src/llama.cpp`	`llama_supports_gpu_offload()` and `llama_supports_rpc()` now auto-call `ggml_backend_load_all()` if no backends are registered; behavior fix, no project changes required
~b9049–b9071	`src/llama-context.cpp`	`state_seq_set_data`: removed too-strict seq_id matching guard that was gated on `LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY`; KV slot restorer now checks tensor shapes and view offsets before deciding to reallocate (avoids unnecessary realloc on shape-compatible updates); both are bug fixes, no project API changes required
~b9049–b9071	`src/models/mimo2.cpp`	MiMo-V2 extended with MTP (Multi-Token Prediction) layer support via `nextn_predict_layers`; fused `wqkv` projection; `attention_value_scale` post-attention scaling; all internal model-loading changes, no project changes required
~b9049–b9071	`ggml/src/ggml-sycl/`	SYCL implementations added for `CUMSUM`, `DIAG`, `FILL`, `SSM_SCAN`, `SOLVE_TRI` ops; additive, no project changes required
~b9049–b9071	`ggml/src/ggml-cuda/out-prod.cu`	CUDA outer-product uses `cublasSgemmStridedBatched` for batched path (dps2==1, ne2>1); HIP/MUSA compat headers gain the alias; performance improvement, no project changes required
~b9049–b9071	`tools/mtmd/`	MiniCPM-V 4.6 multimodal support added (`PROJECTOR_TYPE_MINICPMV4_6`, ViT merger graph, new tensor names); additive, no project changes required
~b9049–b9071	`tools/server/webui/`	LLM-based conversation title generation; CSS animation `fill-mode-forwards` fixes; UI-only changes compiled into upstream server, no project changes required
~b9071–b9094	`ggml/src/ggml-cuda/allreduce.cu` + `allreduce.cuh` (NEW)	2-GPU PCIe AllReduce pipeline for tensor parallelism (no NVLink required); requires Volta+ (sm70+); enabled via `GGML_CUDA_ALLREDUCE` env var (`nccl`/`internal`/`none`); compiled automatically via FetchContent, no project changes required
~b9071–b9094	`ggml/src/ggml-cuda/snake.cu` + `snake.cuh` (NEW)	Fused CUDA Snake activation kernel (`y = x + sin(a*x)^2 * inv_b`) for BigVGAN/Vocos audio models; fuses 5-op chain `MUL→SIN→SQR→MUL→ADD` at graph level; F32/F16/BF16; compiled automatically, no project changes required
~b9071–b9094	`ggml/src/ggml-cuda/ggml-cuda.cu`	Flash attention head size 192 (DKQ=192, DV=128) for MiMo-V2.5/V2.5-Pro/V2-Flash with GQA ratio 8/16; multi-GPU comm context refactored to `ggml_backend_cuda_comm_context` with `try_allreduce` function pointer; PCI bus IDs lowercased; compiled automatically, no project changes required
~b9071–b9094	`ggml/src/ggml-sycl/`	Q5_K reordered memory layout + MMVQ kernel for Intel GPUs; PAD op supports non-contiguous src0; dedicated growing K/V buffer for flash attention; all internal SYCL backend, no project changes required
~b9071–b9094	`ggml/src/ggml-hexagon/`	GATED_DELTA_NET and L2_NORM HVX-vectorized on Hexagon HTP backend; internal DSP backend, no project changes required
~b9071–b9094	`src/models/sarvam.cpp` (NEW)	Sarvam-MoE model (`sarvamai/sarvam-30b`); reuses BailingMoeV2 arch; new vocab pre-type `LLAMA_VOCAB_PRE_TYPE_SARVAM_MOE = 51`; additive, no project changes required
~b9071–b9094	`src/models/gemma4.cpp`	Gemma4 split gate/up experts: `ffn_gate_up_exps` now TENSOR_NOT_REQUIRED; fallback to separate `ffn_gate_exps`/`ffn_up_exps`; NVFP4 per_expert_scale folding; internal model-loading, no project changes required
~b9071–b9094	`tools/server/server-context.h` + `server-context.cpp`	New `get_model_info()` method on `server_context`; `/v1/models` response now includes `"n_ctx"` field (value: `slot_n_ctx`); compiled from upstream sources, no JNI changes required (Java callers of model info APIs receive the new field transparently)
~b9071–b9094	`tools/server/server-http.h` + `server.cpp`	`handlers` map moved from private to public in `server_http_context`; new `register_gcp_compat()` method exposes GCP/Vertex AI Prediction Protocol endpoint reading `AIP_MODE`/`AIP_PREDICT_ROUTE`/`AIP_HEALTH_ROUTE`/`AIP_HTTP_PORT` env vars; compiled from upstream sources, no project changes required
~b9071–b9094	`tools/server/server-models.h` + `server.cpp`	Router child→parent model info propagation: new `CMD_CHILD_TO_ROUTER_INFO` command; `setup_child_server()` gains `const json & model_info` parameter; new `update_loaded_info()` method; `server_model_meta` gains `loaded_info` field; all internally consistent across compiled upstream sources, no project changes required
~b9071–b9094	`common/reasoning-budget.cpp`	Forced token logit no longer set to `+INFINITY`; only competing tokens set to `-INFINITY`; internal sampler behavior change, no project changes required
~b9071–b9094	`tools/server/webui/`	Settings registry refactored (`settings-config.ts`/`settings-fields.ts`/`settings-sections.ts` merged into `settings-registry.ts`); MCP route `#/settings/mcp` → `#/mcp-servers`; settings route `/settings/chat/[section]` → `/settings/[[section]]`; UI-only, no project changes required
~b9094–b9102	`ggml/src/ggml-cuda/allreduce.cu` + `allreduce.cuh`	Internal CUDA AllReduce pipeline refactored with `ggml_cuda_ar_pipeline` struct; `ggml_cuda_ar_pipeline_init(devices, n_devices)` / `_free` / `_allreduce` APIs; supports 2-GPU PCIe AllReduce without NCCL (Volta+ / sm70+); chunked kernel path (small tensors) vs copy-engine path (large tensors); `GGML_CUDA_ALLREDUCE` env = `nccl`/`internal`/`none`; env tuning vars `GGML_CUDA_AR_COPY_THRESHOLD` / `GGML_CUDA_AR_COPY_CHUNK_BYTES` / `GGML_CUDA_AR_BF16_THRESHOLD`; HIP/MUSA builds return nullptr stub; compiled automatically via FetchContent, no project changes required
~b9094–b9102	`ggml/src/ggml-cuda/ggml-cuda.cu`	`GGML_LOG_WARN_ONCE` macro added; `ggml_backend_cuda_comm_context` gains `try_allreduce` fn pointer and `ar_pipeline`; three dispatch fns: `try_allreduce_nccl`, `try_allreduce_internal`, `try_allreduce_butterfly`; init chain: `comm_init_nccl` → `comm_init_internal` → `comm_init_none`; platform default Linux→NCCL, Windows→internal; no project changes required
~b9094–b9102	`ggml/src/ggml-sycl/ggml-sycl.cpp` + `im2col.cpp` + `im2col.hpp`	New `ggml_sycl_im2col_3d` function; `GGML_OP_IM2COL_3D` now supported on Intel GPU via SYCL; 2D im2col kernel rewritten with tile-based `IC_KH_KW` thread decomposition; new `SYCL_IM2COL_BLOCK_SIZE 256`; additive, no project changes required
~b9094–b9102	`ggml/CMakeLists.txt`	GGML version patch bumped 0.11.0 → 0.11.1; no project changes required
~b9094–b9102	`common/sampling.cpp`	Bug fix in `common_sampler_sample`: `set_logits` now called at the top before backend-sampling check; backend sampling token-selection now scans all of `cur_p.data` to find matching token (instead of artificial 1-element array), fixing `cur_p.selected` for downstream `n_probs`; post-sampling probabilities now work correctly with backend sampling
~b9094–b9102	`tools/server/server-context.cpp`	`need_logits` renamed to `need_pre_sample_logits`; only set when `n_probs > 0 && !post_sampling_probs`; backend sampling now works with `post_sampling_probs`; 0.0-probability tokens filtered from `result.probs`; compiled from upstream, no project JNI changes required
~b9094–b9102	`src/llama-model.cpp`	`n_vocab` loading moved from `llama_model_base::load_hparams()` to per-model `load_arch_hparams()` (e.g. `src/models/deepseek2.cpp`, `src/models/llama.cpp`); internal model-loading refactor, no project changes required
~b9094–b9102	`ggml/src/ggml-virtgpu/ggml-backend-device.cpp`	Added `#include <mutex>` for `std::once_flag`; internal backend fix, no project changes required
~b9094–b9102	`vendor/cpp-httplib/httplib.cpp` + `httplib.h`	Security fix: chunk-size parsing replaced `strtoul` with manual hex-digit scanning to prevent overflow and reject invalid chunk extensions; version bumped to 0.43.4; compiled automatically, no project changes required
~b9102–b9103	`vendor/cpp-httplib/httplib.cpp` + `httplib.h`	cpp-httplib bumped to v0.44.0: (1) RFC 9110 §5.5 compliance — header field values are no longer percent-decoded by the recipient in `parse_header`; `Location`/`Referer` special-casing removed; callers that need URI-component decoding must call `decode_uri_component()` explicitly; (2) `ThreadPool` constructor is now exception-safe — if thread creation fails partway through, already-started workers are signalled to exit and joined before rethrowing, preventing `std::terminate` from joinable threads in the destructor; compiled automatically, no project changes required
~b9103–b9106	`ggml/src/ggml-vulkan/ggml-vulkan.cpp` + Vulkan shaders	Vulkan flash attention refactored: `pipeline_flash_attn_f32_f16` changed from a per-type array of maps to a single map; mixed K/V quant types (e.g. Q4_0 K + F16 V) now supported on all Vulkan FA paths (scalar, cm1, cm2) rather than coopmat2 only; per-type SPIR-V variants replaced by two generic modules (`flash_attn_f32_f16` and `flash_attn_f32_f16_int8`) that select K/V type at runtime via `FaTypeK`/`FaTypeV` spec constants; new `flash_attn_dequant.glsl` contains aliased SSBO views and an uber `dequantize4()` switch; the K/V type mismatch guard removed from `ggml_backend_vk_device_supports_op`; internal Vulkan backend refactor, no project changes required
~b9103–b9106	`ggml/src/ggml-cuda/argsort.cu`	Added `#include <cuda/iterator>` for CCCL ≥ 3.1 strided-iterator path; internal CUDA backend, no project changes required
~b9103–b9106	`convert_hf_to_gguf.py`	Mistral Medium 3.5 mmproj support: `n_embd_text` now reads `"dim"` key instead of `"hidden_dim"`; negative `img_break_tok_id` placeholders resolved from `tekken.json` or `tokenizer.json`; conversion tool only, no project changes required
~b9106–b9134	`common/arg.cpp`	CLI option `--spec-draft-ctx-size` / `-cd` / `--ctx-size-draft` REMOVED — throws `std::invalid_argument` at parse time; `ModelParameters.setCtxSizeDraft()` removed; no replacement (context size now managed internally by speculative engine)
~b9106–b9134	`common/arg.cpp`	CLI option `--spec-draft-replace` / `--spec-replace` REMOVED — throws `std::invalid_argument` at parse time; no corresponding Java method existed
~b9106–b9134	`common/speculative.h`	Full redesign: `common_speculative_type` enum values renamed `DRAFT`→`DRAFT_SIMPLE`, `EAGLE3`→`DRAFT_EAGLE3`; `common_params_speculative.type` (single enum) → `.types` (vector); `common_speculative_n_max()` / `common_speculative_n_min()` REMOVED; new `common_speculative_init(params, n_seq)` no longer takes ctx; new `common_speculative_begin(spec, seq_id, prompt)`, `common_speculative_draft(spec)`, `common_speculative_accept(spec, seq_id, n)`, `common_speculative_process(spec, batch)` signatures; `common_speculative_draft_params` struct added; server sources compiled directly, no project JNI changes required
~b9106–b9134	`common/common.h`	New `common_prompt_checkpoint` struct (contains `data_tgt` + `data_dft`) replaces the old `server_prompt_checkpoint` in `server-task.h`; compiled from upstream server sources, no project JNI changes required
~b9106–b9134	`tools/server/server-task.cpp`	`task_params::to_json()` renamed field `"speculative.type"` → `"speculative.types"` (now serialises the vector); test `SlotParamsToJson.SpeculativeFields_Present` updated accordingly
~b9106–b9134	`include/llama.h`	New `LLAMA_STATE_SEQ_FLAGS_NONE = 0` macro added; additive, no project changes required
~b9134–b9145	`tools/server/server-common.cpp`	New `continue_final_message` boolean request field in `oaicompat_chat_params_parse`; vLLM/transformers-compatible alias for the prefill-assistant heuristic — when `true`, the last assistant message is extended without appending an end-of-turn token; mutually exclusive with `add_generation_prompt=true` (throws 400); compiled from upstream server sources; `InferenceParameters.setContinueFinalMessage(boolean)` added
~b9134–b9145	`ggml/src/ggml-sycl/`	Level Zero API integration for SYCL device memory allocation (`GGML_SYCL_SUPPORT_LEVEL_ZERO` build option, `GGML_SYCL_ENABLE_LEVEL_ZERO` runtime env); reduces system RAM usage on Intel dGPUs; internal SYCL backend, no project changes required
~b9134–b9145	`ggml/src/ggml-opencl/`	Q5_0 and Q5_1 MoE GEMM/GEMV kernels added for Adreno (Qualcomm) GPUs; internal OpenCL backend, no project changes required
~b9134–b9145	`ggml/src/ggml-cuda/allreduce.cu`	AllReduce accumulation now routed through `float` intermediate for precision (avoids BF16 truncation); internal CUDA backend, no project changes required
~b9134–b9145	`ggml/src/ggml-hexagon/`	`GGML_UNARY_OP_TANH` added to Hexagon HTP backend; internal DSP backend, no project changes required
~b9134–b9145	`ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp`	`use_subgroup_matrix` condition now also checks `sg_mat_k > 0 && sg_mat_n > 0` and alignment; prevents crash on devices reporting subgroup matrix support with zero k/n; internal WebGPU backend, no project changes required
~b9145–b9150	`ggml/src/ggml-vulkan/ggml-vulkan.cpp`	Bug fix: `mul_mat_l_int[i]` / `mul_mat_m_int[i]` / `mul_mat_s_int[i]` / `mul_mat_id_l_int[i]` / `mul_mat_id_m_int[i]` / `mul_mat_id_s_int[i]` were unconditionally set to `true` instead of mirroring the actual device pipeline capabilities from `mul_mat_l[i]` etc.; now properly initialized; internal Vulkan backend bug fix, no project changes required
~b9145–b9150	`src/unicode.cpp`	New `unicode_regex_split_custom_qwen35()` function registered for the Qwen 3.5 tokenizer regex pattern; uses `[\p{L}\p{M}]+` letter-plus-combining-mark runs vs. Qwen2's `\p{L}+`; additive internal tokenizer change, no project changes required
~b9145–b9150	`ggml/src/ggml-cpu/ggml-cpu-riscv64-spacemit/`	SpaceMIT RISC-V IME backend major refactor: IME2 kernels, expanded quantization (Q2_K, Q3_K, Q6_K, Q8_0, Q5_0, Q5_1, Q5_K, MXFP4), TCM (Tightly Coupled Memory) pool; new source files `ime2_kernels.cpp`, `ime_env.cpp`, `repack.cpp`, `rvv_kernels.cpp`, `spine_mem_pool.cpp`; guarded by `GGML_CPU_RISCV64_SPACEMIT` build flag; no project changes required
~b9150–b9151	`common/log.h`	New `LOG_TRC` macro added at `LOG_LEVEL_TRACE = 4` (between INFO=3 and DEBUG=5); `LOG_LEVEL_DEBUG` bumped from 4 to 5; new `LOG_TRCV` verbosity variant; additive, no project changes required
~b9150–b9151	`common/common.h` + `common/common.cpp`	New `common_params_print_info(const common_params &)` function: prints verbosity level, per-device memory (name, total, free), and system info at `LOG_INF` level; replaces the two-line pattern `LOG_INF("build_info: %s\n", llama_build_info()); LOG_INF("%s\n", common_params_get_system_info(params).c_str());` — updated in `jllama.cpp`
~b9150–b9151	`common/common.cpp`	`common_init()` now unconditionally calls `common_log_set_prefix(…, true)` and `common_log_set_timestamps(…, true)` before setting the log callback; log output will always include prefix and timestamps unless explicitly disabled with `--no-log-prefix` / `--no-log-timestamps`
~b9150–b9151	`common/arg.cpp`	`--log-prefix` and `--log-timestamps` now also accept negated forms `--no-log-prefix` / `--no-log-timestamps` (lambda receives a `bool value`); backing env vars renamed `LLAMA_LOG_PREFIX` → `LLAMA_ARG_LOG_PREFIX` and `LLAMA_LOG_TIMESTAMPS` → `LLAMA_ARG_LOG_TIMESTAMPS`; Java layer does not expose these, so no project changes required
~b9150–b9151	`tools/server/server-common.h`	New `SLT_TRC` and `SRV_TRC` macros (emit at `LOG_TRC` level); additive, no project changes required
~b9150–b9151	`tools/server/server-context.cpp`	New `server_slot::t_print_last` field + `print_timings_tg()` / `print_timings_pp()` methods: emit periodic in-flight token-generation and prompt-processing throughput to `SLT_INF` (throttled to ≥100 decoded tokens and ≥3 s interval); `server_context_impl` constructor now calls `mtmd_helper_log_set` unconditionally (was guarded by `!is_resume`); many `SLT_INF`/`SRV_WRN` downgraded to `SLT_TRC`/`SRV_INF`; compiled from upstream, no project JNI changes required
~b9150–b9151	`tools/server/server-task.cpp`	Several `SRV_WRN` calls downgraded to `SRV_INF`; one `SRV_WRN` upgraded to `SRV_ERR` for failed state restore; compiled from upstream, no project changes required
~b9151–b9172	`tools/mtmd/clip.h`	`clip_has_whisper_encoder()` removed from public API; not referenced by project — no changes required
~b9151–b9172	`tools/server/CMakeLists.txt` + `scripts/webui-download.cmake` (new)	WebUI assets no longer committed (`tools/server/public/` gitignored); provisioned at build time via HF bucket (`LLAMA_USE_PREBUILT_WEBUI=ON` default) or built from source (`LLAMA_BUILD_WEBUI`); project calls `set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE)` before FetchContent to skip the asset download
~b9151–b9172	`common/common.h`	`common_params::webui` default made conditional on `LLAMA_WEBUI_DEFAULT_ENABLED` macro (falls back to `true` when undefined); compiled server sources unaffected
~b9151–b9172	`common/reasoning-budget.cpp`	`common_reasoning_budget_clone` rewritten to use `llama_sampler_init` properly; pure bug fix, no API change, no project changes required
~b9151–b9172	`ggml/src/ggml-cuda/fattn-mma-f16.cuh` + `mma.cuh`	AMD RDNA3 WMMA flash attention support; new `DATA_LAYOUT_I_MAJOR_SCRAMBLED`, `tile<16,16,half2,I_MAJOR_SCRAMBLED>`, extended config tables; internal CUDA backend, no project changes required
~b9151–b9172	`tools/server/server-chat.cpp`	Non-function Responses API tools now silently skipped (`continue`) instead of throwing; server behavior fix, no Java API change required
~b9172–b9198	project `CMakeLists.txt`	Option `LLAMA_BUILD_WEBUI` renamed to `LLAMA_BUILD_UI` (and `LLAMA_USE_PREBUILT_WEBUI` → `LLAMA_USE_PREBUILT_UI`); upstream keeps a backward-compat shim that forwards the old cache variable with a `DEPRECATION` message, so this project's `set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE)` still works unchanged
~b9172–b9198	`common/common.h`	`common_params::webui` / `webui_mcp_proxy` / `webui_config_json` deprecated in favour of `ui` / `ui_mcp_proxy` / `ui_config_json`; the old and new fields are all kept and synced by `common/arg.cpp`, so compiled upstream sources are unaffected; new `common_params::ctx_type` and `cparams.n_rs_seq` fields added (default `LLAMA_CONTEXT_TYPE_DEFAULT` / `0`), additive
~b9172–b9198	`common/common.cpp` + `common.h`	`common_params_print_info` gained optional `print_devices` parameter (default `true`); upstream `tools/server/server.cpp` passes `!is_router_server` to skip GPU enumeration on the router process; this project does not compile `server.cpp`, no impact
~b9172–b9198	`common/speculative.h` + `speculative.cpp`	New enum value `COMMON_SPECULATIVE_TYPE_DRAFT_MTP` (count is now 9); new `common_speculative_need_embd()` API; MTP draft implementation added (`common_speculative_state_draft_mtp`); `--spec-type draft-mtp` CLI flag added in `common/arg.cpp`; additive, no project changes (could be exposed later as a `ModelParameters` enhancement)
~b9172–b9198	`include/llama.h`	New `enum llama_context_type { LLAMA_CONTEXT_TYPE_DEFAULT, LLAMA_CONTEXT_TYPE_MTP }`; new `llama_context_params::n_rs_seq` (recurrent-state snapshots per seq for rollback) and `ctx_type` fields; new `llama_n_rs_seq()` accessor; all additive, default-zero, no project impact
~b9172–b9198	`src/llama-ext.h` (new) + `src/llama-context.cpp`	New pre-norm embedding extraction path: `llama_set_embeddings_pre_norm` / `llama_get_embeddings_pre_norm[_ith]` APIs and an `embd_pre_norm` output buffer in `llama_context`; used by the MTP draft loop only, additive
~b9172–b9198	`src/llama-memory-recurrent.cpp`	Recurrent-state rollback support: per-seq `rs_idx` snapshot index and `set_rs_idx()` helper; tensors widened to `(1 + n_rs_seq)` groups; `seq_rm` now rolls back via snapshot when within `n_rs_seq` bounds. Backwards-compatible when `n_rs_seq == 0` (this project's default), no project changes
~b9172–b9198	`tools/server/server-context.cpp`	Embedding endpoint default now reads `params.embd_normalize` (was hard-coded `2`); compiled upstream, no project changes
~b9172–b9198	`tools/server/CMakeLists.txt` + new `tools/ui/CMakeLists.txt`	WebUI asset wiring moved into a new `llama-ui` static library; `tools/server` now links `llama-ui`; project does not build the `llama-server` binary (only compiles `server-context.cpp` / `server-queue.cpp` / `server-task.cpp` / `server-models.cpp` directly into `jllama`), so no impact. HF bucket name renamed `LLAMA_WEBUI_HF_BUCKET` → `LLAMA_UI_HF_BUCKET` (old name still honoured)
~b9172–b9198	`vendor/cpp-httplib/httplib.{h,cpp}`	Bumped to v0.45.0: RFC 9112 §6 message-body framing — requests without `Content-Length` / `Transfer-Encoding` no longer scan for stray body bytes on persistent connections (fixes #2450 keep-alive misframing); X-Forwarded-For parser falls back to the connection remote address when the header is empty/malformed; compiled automatically, no project changes
~b9172–b9198	`ggml/CMakeLists.txt`	GGML version bumped 0.11.1 → 0.12.0; no project changes
~b9172–b9198	`ggml/src/ggml.c` + `ggml-cuda/gated_delta_net.cu` + `ggml-metal/ggml-metal.metal` + `ggml-vulkan/vulkan-shaders/gated_delta_net.comp`	`ggml_gated_delta_net` state tensor reshaped from 2D `(S_v*S_v*H, n_seqs)` to 3D `(S_v*S_v*H, K, n_seqs)` where `K` is the snapshot slot count (`K=1` is final-state-only, `K>1` keeps last `min(n_tokens, K)` per-token snapshots); internal Qwen3.5 / Qwen3-Next recurrent-attention kernel, no project changes
~b9198–b9219	`common/chat.{h,cpp}`	New `common_chat_continuation` enum (`NONE`/`AUTO`/`REASONING`/`CONTENT`); new `common_chat_msg::render_content(delimiter)` method; new `continue_final_message` field on `common_chat_templates_inputs`; new `common_chat_continuation_parse()` accepts both `bool` and `"reasoning_content"`/`"content"` strings; `common_chat_template_generation_prompt()` extracted; `oaicompat_chat_params_parse` refactored to route the prefill-assistant heuristic through the new continuation enum. Existing `bool` wire-format unchanged; the new string variants are exposed via `InferenceParameters.setContinueFinalMessage(ContinuationMode)`
~b9198–b9219	`common/hf-cache.{h,cpp}` + `common/arg.cpp`	`hf_cache::migrate_old_cache_to_hf_cache()` and `hf_file::size` field removed; the migration call in `common_params_parse_ex` was dropped. Internal to `arg.cpp`, no project impact
~b9198–b9219	`common/speculative.{h,cpp}` + `src/llama-ext.h` + `src/llama-context.{h,cpp}` + `src/llama-cparams.h`	`llama_set_embeddings_pre_norm(ctx, value)` → `llama_set_embeddings_pre_norm(ctx, value, masked)` (3rd `bool` arg distinguishes "embeddings for outputs only" from "embeddings for every token"); new `cparams.embeddings_pre_norm_masked`; new `common_speculative_need_embd_pre_norm()` API; MTP draft path now uses pre-norm extraction. Project does not call any of these APIs (speculative decoding is configured via `ModelParameters` only), no source changes required
~b9198–b9219	`tools/server/server-task.{h,cpp}`	`task_result_state` ctor moved from header into `.cpp` — now seeds `chat_msg` via `common_chat_parse("", true, …)` when `!echo` so the assistant prefill is not echoed back as a delta; new `bool echo` field on `chat_parser_params` (default `false`, populated from request body via `json_value(data, "echo", false)`). Project compiles `server-task.cpp` from upstream and does not instantiate `task_result_state` directly, no source changes required
~b9198–b9219	`tools/server/server-context.cpp` + `server-models.cpp`	New `cors_proxy_enabled` boolean field added to `/props` and `/v1/models` JSON responses (set from `params.ui_mcp_proxy || params.webui_mcp_proxy`). Additive, no Java consumer in this project
~b9198–b9219	upstream `CMakeLists.txt`	Backward-compat shim widened: `if(DEFINED LLAMA_BUILD_WEBUI AND NOT DEFINED LLAMA_BUILD_UI)` → `if(DEFINED LLAMA_BUILD_WEBUI)` — setting the old name now always forwards to the new one (and emits the existing `DEPRECATION` message). Project sets only `LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE` (`CMakeLists.txt:107`), behaviour unchanged
~b9198–b9219	`ggml/src/ggml-cuda/ssm-conv.cu` + `top-k.cu`	Added kernel size 15 to SSM-conv launcher (now supports 3/4/5/9/15); `top-k.cu` includes `<cuda/iterator>` for CCCL ≥ 3.1; internal CUDA backend, no project changes
~b9198–b9219	`ggml/src/ggml-sycl/ggml-sycl.cpp` + `vecdotq.hpp`	SYCL GEMM now falls back to direct MKL for small problems (gemm_flops < 256³); Q6_K dot product refactored to a single scalar fast-path helper `vec_dot_q6_K_q8_1_impl_mmvq_scalar`; internal SYCL backend, no project changes
~b9219–b9222	`ggml/src/ggml-hexagon/` + `htp/pad-ops.c` (new) + `htp/unary-ops.c`	Hexagon HTP backend gains `GGML_OP_PAD` (HVX + optional VTCM/DMA double-buffered, both zero-pad and circular-pad variants) and `GGML_OP_TRI` (HVX-vectorised triangular masking) support; new `HTP_OP_PAD` / `HTP_OP_TRI` opcodes; internal Qualcomm DSP backend, no project changes
~b9219–b9222	`.devops/*.Dockerfile` + `.github/workflows/docker.yml`	OCI image labels (`org.opencontainers.image.*`) added via `BUILD_DATE`/`APP_VERSION`/`APP_REVISION` build args; new `skip_s390x` workflow_dispatch input; manifest annotations on `docker buildx imagetools create`; upstream packaging/CI only, no project changes
~b9222–b9245	`common/common.h` + `common.cpp`	`common_init_result(common_params &, bool model_only = false)` and `common_init_from_params(common_params &, bool model_only = false)` gain an optional `model_only` flag that skips context/sampler/lora/warmup setup and returns only the loaded model. Additive with default value; no project call sites in `src/main/cpp/`, no source changes required
~b9222–b9245	`common/common.h`	`common_params_speculative_draft` defaults retuned: `n_max` 16→3, `p_min` 0.75f→0.0f. Defaults only; Java `ModelParameters` sets these explicitly via JSON, so behaviour is unchanged for this project
~b9222–b9245	`common/speculative.{h,cpp}`	`common_speculative_impl::accept()` virtual gains a 3rd `bool is_other` parameter; `common_speculative_accept()` now broadcasts the accepted-token count to every registered impl (with `is_other=true` for impls that did not generate the draft). `common_speculative_impl_ngram_map_k` ctor signature simplified (no longer takes `common_params_speculative`). Each impl also logs new `LOG_INF` startup banners. Internal to upstream-compiled `server-context.cpp`; no project call sites
~b9222–b9245	`common/arg.cpp` + `common/common.cpp` + `tools/fit-params/fit-params.cpp`	`--verbosity` levels relabeled: level `4` now means "trace (more info)" and level `5` means "debug"; `LOG_LEVEL_DEBUG` constant value moved from `4` to `5`. Direct `params.verbosity >= 4` comparisons in upstream `common.cpp` and `fit-params.cpp` replaced with `>= LOG_LEVEL_DEBUG`. Project does not reference `LOG_LEVEL_DEBUG` or numeric verbosity thresholds in `src/main/cpp/`; no source changes required
~b9222–b9245	`common/arg.cpp`	`--spec-type` duplicate-arg DEPRECATED warning suppressed (the flag legitimately accepts repeated values to form the comma-list). Behaviour-only
~b9222–b9245	`common/ngram-map.cpp`	One per-draft `LOG_INF` downgraded to `LOG_DBG`. Log-level only
~b9222–b9245	`src/llama-graph.h`	`llm_graph_params::operator==` adds a third disjunct so ubatches with both `token` and `embd` arrays present compare equal (graph reuse fix for MTP pre-norm path). Internal
~b9222–b9245	`src/llama-memory-recurrent.{h,cpp}` + `src/llama-memory-hybrid.cpp` + `src/llama-memory-hybrid-iswa.cpp`	`init_batch()` now forces sequential split (`split_seq`) instead of equal split when `n_rs_seq > 0` (recurrent-state rollback is incompatible with equal splits). Internal upstream model code, no project impact
~b9222–b9245	`src/models/delta-net-base.cpp` + `src/models/models.h` + `src/models/qwen35.cpp`	`llm_build_delta_net_base::keep_rs()` helper removed; conv-state and recurrent-attn paths reworked to read `cparams.n_rs_seq` directly and loop `K = n_rs_seq + 1` snapshot slots. Comment fix in `qwen35.cpp` MTP layer index. All internal upstream model code
~b9222–b9245	`tools/server/server-context.cpp`	`pos_min_thold` lowered by one (`pos_next - n_swa` → `pos_next - n_swa - 1`); checkpoint trigger guard relaxed from `n_past < slot.prompt.n_tokens()` to `<=`; per-slot `print_timings_pp`/`print_timings_tg` lines split into separate `SLT_INF` calls; new `graphs reused` and `draft acceptance` lines; `n_draft_total` log moved from `SLT_CNT` to `SLT_INF`. Compiled upstream-as-is, no project changes
~b9222–b9245	`ggml/src/ggml-cuda/mmvq.cu`	`calc_nwarps` table tweak: Q6_K returns 2 warps (was grouped with the 8-warp tier). Internal CUDA backend
~b9222–b9245	`ggml/src/ggml-hexagon/` (`htp/rope-ops.c`, `htp/unary-ops.c`, `htp-ops.h`, `main.c`, `ggml-hexagon.cpp`)	New `HTP_OP_NORM` opcode (mean+variance norm); `rope-ops.c` adds MROPE / IMROPE position-id support via new `mrope_cache_init()`. Internal Qualcomm DSP backend
~b9222–b9245	`ggml/src/ggml-opencl/` (`ggml-opencl.cpp`, `kernels/cvt.cl`, six new `gemm_moe_q{4,5,6}_k_f32_ns` + `gemv_moe_q{4,5,6}_k_f32_ns` kernels)	Adreno MoE pipeline extended to Q4_K / Q5_K / Q6_K (image1d_buffer_t transposed layout, dedicated convert/restore kernels, GEMM + GEMV paths). Internal OpenCL backend
~b9222–b9245	`ggml/src/ggml-rpc/ggml-rpc.cpp`	`last_graph_uid` field moved from `ggml_backend_rpc_context` (per-backend) into `ggml_backend_rpc_device_context` (per-device) so multiple backends sharing a device reuse cached graphs. Internal RPC backend
~b9222–b9245	`ggml/src/ggml-sycl/ggml-sycl.cpp`	New `GGML_SYCL_USE_ASYNC_MEM_OP` env (default `1`) decouples async USM alloc/free from the graph path. Internal SYCL backend
~b9222–b9245	`ggml/src/ggml-webgpu/ggml-webgpu.cpp` + `wgsl-shaders/gated_delta_net.wgsl`	Gated-delta-net shader gains a `K` snapshot-count param; per-slot snapshot write path added. Internal WebGPU backend
~b9222–b9245	`convert_hf_to_gguf.py`, `convert_lora_to_gguf.py`, `examples/save-load-state/save-load-state.cpp`, `examples/llama-eval/*`, `tools/cli/README.md`, `tools/server/README.md`, `docs/speculative.md`, `docs/backend/SYCL.md`	Doc/example/tooling updates only. Not compiled by this project
~b9222–b9245	`tools/ui/*`	WebUI source reorganisation (enum file renames `.ts` → `.enums.ts`, new chat components, Tailwind plugin imports). Project sets `LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE` in `CMakeLists.txt`, so the UI is never built — no impact
~b9245–b9264	`src/llama-chat.{h,cpp}`	`LLM_CHAT_TEMPLATE_HUNYUAN_OCR` renamed to `LLM_CHAT_TEMPLATE_HUNYUAN_VL` (HunyuanOCR and HunyuanVL now share one template). Not referenced by project — no source changes required
~b9245–b9264	`tools/mtmd/clip-impl.h` + `tools/mtmd/models/`	`PROJECTOR_TYPE_HUNYUANOCR` removed and merged into `PROJECTOR_TYPE_HUNYUANVL`; `hunyuanocr.cpp` renamed to `hunyuanvl.cpp`; clip graph class `clip_graph_hunyuanocr` renamed to `clip_graph_hunyuanvl`. Not referenced by project — no source changes required
~b9245–b9264	`tools/mtmd/clip.h`	`clip_is_minicpmv()` and `clip_is_glm()` removed from public API. Not referenced by project — no source changes required
~b9245–b9264	`tools/mtmd/clip.h` (`struct clip_context_params`)	New `bool no_alloc` field added (initialized via `mtmd_context_params_default()`). Additive default-zero — no project changes required
~b9245–b9264	`tools/mtmd/mtmd.h`	New `mtmd_get_memory_usage()` C++ API for estimating mmproj VRAM/RAM usage. Additive, not called by project
~b9245–b9264	`tools/mtmd/clip-model.h`	New `enum pad_style { PAD_NONE, PAD_CEIL, PAD_NEAREST }` replacing the `bool image_resize_pad` flag (allows Pillow-byte-parity nearest-integer rounding for DeepSeek-OCR). Internal to mtmd, project links `mtmd` as-is
~b9245–b9264	`common/common.h` (`struct common_params_speculative_draft`)	New `bool backend_sampling = true` field — offloads draft sampling to the backend. Additive default-on; Java `ModelParameters` doesn't set it, so the upstream default applies. Backend sampler auto-disables when `split_mode == TENSOR` in `src/llama-context.cpp` — safe
~b9245–b9264	`common/speculative.cpp`	`common_speculative_impl_draft_mtp` now registers a per-seq backend sampler chain (top-k 10) on `ctx_dft` via `llama_set_sampler`; cleaned up in destructor. Falls back to CPU sampler if `llama_set_sampler` fails. Internal to upstream-compiled speculative module, no project call sites
~b9245–b9264	`app/` (new)	New optional unified `llama` binary (`llama-app` target) dispatching to `serve`/`cli`/`completion`/`bench`. Guarded by `LLAMA_BUILD_APP=OFF` default — project doesn't enable it
~b9245–b9264	`tools/{cli,completion,llama-bench,server}/CMakeLists.txt`	Each tool split into a `*-impl` static library (the logic) plus a thin `main.cpp` wrapper; the `main()` in `cli.cpp`/`completion.cpp`/`llama-bench.cpp`/`server.cpp` is renamed to `llama_cli`/`llama_completion`/`llama_bench`/`llama_server` and now satisfies `-Wmissing-declarations` via a forward decl. Project does NOT compile any of these `.cpp` files — only `server-context.cpp`, `server-queue.cpp`, `server-task.cpp`, `server-models.cpp` (see `CMakeLists.txt:237`/`:302`) — so no impact
~b9245–b9264	`tools/server/server-context.cpp`	Adds mmproj memory estimation: when `params_base.fit_params` is set, calls `mtmd_get_memory_usage(mmproj_path, mparams)` and adds the per-device cost into `params_base.fit_params_target` before `common_init_from_params`. Also calls `mtmd_helper_log_set(common_log_default_callback, nullptr)` once when `!is_resume`. Compiled upstream-as-is, no project call sites
~b9245–b9264	`src/llama-context.cpp`	New `llama_context::set_sampler()` short-circuits with a one-shot `LLAMA_LOG_WARN` and returns `false` when `model.split_mode() == LLAMA_SPLIT_MODE_TENSOR` (backend sampling not supported with tensor split). Internal safety check, no project call sites
~b9245–b9264	`common/arg.cpp`	New CLI flags `--spec-draft-backend-sampling` / `--no-spec-draft-backend-sampling` and env `LLAMA_ARG_SPEC_DRAFT_BACKEND_SAMPLING` to toggle the new `backend_sampling` field. Not exposed by `ModelParameters`; could be added later as a Java-side enhancement
~b9245–b9264	`ggml/src/ggml-cuda/CMakeLists.txt` + `common.cuh` + `binbcast.cu`, `concat.cu`, `cpy.cu`, `fattn-*.cu`, `gated_delta_net.cu`, `getrows.cu`, `mean.cu`, `mmvf.cu`, `mmvq.cu`, `norm.cu`, `quantize.cu`, `reduce_rows.cuh`, `rope.cu`, `scale.cu`, `set-rows.cu`, `softcap.cu`, `ssm-conv.cu`, `ssm-scan.cu`, `sumrows.cu`, `topk-moe.cu`, `unary.cu`	New PDL (Programmatic Dependent Launch) infrastructure: `GGML_CUDA_USE_PDL` build flag (CUDART ≥ 11.8, non-HIP/MUSA); `ggml_cuda_pdl_sync()` / `ggml_cuda_pdl_lc()` device helpers (active on Hopper sm_90+); `ggml_cuda_kernel_launch_params` + `ggml_cuda_kernel_launch()` host template that calls `cudaLaunchKernelEx` with stream-serialization attribute when `GGML_CUDA_PDL` env var allows. Adds `90-virtual` (Hopper) to default `CMAKE_CUDA_ARCHITECTURES` when CUDA ≥ 11.8. Internal CUDA backend, no project changes required
~b9245–b9264	`ggml/src/ggml-metal/ggml-metal-{device,ops}.cpp` + `ggml-metal.metal`	New 4-element `kernel_pad__4` variant (currently disabled — `is_c4 = false`); `kernel_pad` rewritten with 1024-element-per-block tiling for larger tensors; `kernel_cpy_` rewritten to use `tpitg` rows-per-threadgroup batching; Q quantization cpy paths use 256-thread limit. Internal Metal backend
~b9245–b9264	`ggml/src/ggml-hexagon/htp/` (`hmx-matmul-ops.c`, `hmx-ops.h`, `matmul-ops.c`, `main.c`)	HMX matmul refactor: K-loop tiled in 32-tile blocks with `Q6_activation_hf_mxmem_RR_deep`; the out-stationary fallback path for large M·K·N was deleted; function rename `hmx_mat_mul_permuted_w16a32` → `hmx_matmul_f16_f32`, `hmx_mat_mul_permuted_qk_0_d16a32` → `hmx_matmul_q_f32`, `hmx_mat_mul_permuted_w16a32_batched_params_t` → `hmx_matmul_f16_f32_batched_params_t`. HMX power-up code reorganized (`HAP_power_set_HMX_v2` now combines power-on + clock in one step for `__HVX_ARCH__ ≥ 75`). Internal Qualcomm DSP backend
~b9245–b9264	`ggml/src/ggml-opencl/ggml-opencl.cpp`	Lazy kernel compilation: `argsort` and `flash_attn` programs are now built only when first needed (`load_cl_kernels_argsort` / `load_cl_kernels_flash_attn` called from `supports_op`); new device-supported probe in `ggml_opencl_is_device_supported` runs at registration time; renamed `ggml_cl2_init`/`ggml_cl2_free` → `ggml_cl_init`/`ggml_cl_free`; OpenCL contexts now live as long as the process. Internal OpenCL backend
~b9245–b9264	`ggml/src/ggml-vulkan/vulkan-shaders/im2col.comp`	Refactor: precomputed base input coords and step deltas; running pointer/index for destination; one inlined unrolled loop iteration writes `BLOCK_SIZE` outputs per step. Internal Vulkan backend
~b9245–b9264	`src/models/delta-net-base.cpp`	Renamed local variables (`state_in_3d`→`s_3d`, `state_3d`→`s_3d_pad`) when reshaping the recurrent state; behaviour unchanged
~b9245–b9264	`tools/mtmd/mtmd-image.cpp`	`img_tool::resize()` takes a `pad_style` enum (was `bool add_padding`); new `PAD_NEAREST` rounding path for Pillow byte-parity; `mtmd_image_preprocessor_deepseekocr::preprocess` rewritten with `static constexpr` resolution table and `RESIZE_ALGO_BICUBIC_PILLOW` + `PAD_NEAREST`. Internal mtmd, project links as-is
~b9245–b9264	`tools/mtmd/models/deepseekocr.cpp`	Extracted `build_sam(ggml_tensor *inp_raw)` member function from the monolithic build path; FA mask casting to F16 only when `flash_attn_type == CLIP_FLASH_ATTN_TYPE_ENABLED`. Internal
~b9245–b9264	`conversion/hunyuan.py`, `gguf-py/gguf/constants.py`, `gguf-py/gguf/tensor_mapping.py`	HunyuanOCR / HunyuanVL unified in conversion: `VisionProjectorType.HUNYUANOCR` removed; `HunYuanVLForConditionalGeneration` registers a single `HunyuanVLVisionModel` + `HunyuanVLTextModel`; `vit.perceive.*` tensor mappings now only mention `HunyuanVL`. Python tooling, not compiled by project
~b9245–b9264	`CMakeLists.txt` (upstream)	New `LLAMA_BUILD_APP` option (default OFF); deprecation shims for `LLAMA_BUILD_WEBUI`/`LLAMA_USE_PREBUILT_WEBUI` → `LLAMA_BUILD_UI`/`LLAMA_USE_PREBUILT_UI` preserved. Project's `set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE)` still works unchanged
~b9245–b9264	`.devops/*.Dockerfile`, `.github/workflows/build-and-test-snapdragon.yml`, `scripts/snapdragon/`, `docs/backend/snapdragon/`, `tools/cli/README.md`, `tools/server/README.md`, `tools/mtmd/tests/`	Docker images add `conversion/` dir; snapdragon toolchain bumped v0.3 → v0.6 with `+dotprod+i8mm`; mtmd test rewritten to use CER/chrF metrics; doc-only updates. Not compiled by project
~b9264–b9279	`tools/server/server-context.cpp`	Slot-info JSON adds three additive fields (`n_prompt_tokens`, `n_prompt_tokens_processed`, `n_prompt_tokens_cache`) on each in-flight task; `server_context_impl::destroy()` now resets `spec` / `ctx_dft` / `model_dft` BEFORE `llama_init.reset()` to avoid use-after-free when a draft model holds back-references into the target context. Compiled directly into jllama from upstream — no project source changes required
~b9264–b9279	`tools/server/server-models.cpp`	Adds `#include <cstdlib>` and a `LLAMA_APP_CMD` env-var lookup in `server_model_meta::update_args()` to re-inject the unified-binary subcommand into router-spawned child argv. Env var is only set by the new `llama-app` binary (which this project does not build), so the lookup harmlessly returns null and the code path is a no-op. Compiled upstream-as-is, no project changes
~b9264–b9279	`src/llama-vocab.cpp`	New `hybriddna` BPE tokenizer model (DNA k-mer tokenization with `<dna>…</dna>` tag handling, k=6, OOV fallback) registered as a BPE variant; reached only when GGUF metadata declares `tokenizer.model = "hybriddna"`. Adds a virtual destructor + virtual `tokenize()` to `llm_tokenizer_bpe_session` and a `llm_tokenizer_hybriddna_session` subclass; existing BPE callers unchanged. Additive, no project changes
~b9264–b9279	`src/llama-graph.cpp`	`llm_graph_input_attn_kv_iswa::set_input()` / `can_reuse()` now guard the base and SWA tensor accesses behind `if (self_k_idxs && self_k_idxs->buffer)` / `if (self_k_idxs_swa && self_k_idxs_swa->buffer)`. Fixes crashes on models with only-SWA or only-non-SWA attention layers. Internal, no project impact
~b9264–b9279	`src/models/qwen35.cpp` + `src/models/qwen35moe.cpp`	MTP draft sub-graph now builds an `inp_out_ids` input and applies `ggml_get_rows(cur, inp_out_ids)` just before the head norm, so only the requested output rows are projected. Bug fix for MTP draft path; internal, no project changes
~b9264–b9279	`ggml/src/ggml-backend.cpp`	`ggml_backend_tensor_get_2d()` fast-path condition fixed: now checks `iface.get_tensor_2d == NULL` (was incorrectly checking `set_tensor_2d`), so multi-copy gets correctly fall back to the per-copy loop when the backend lacks `get_tensor_2d`. Bug fix, no project changes
~b9264–b9279	`ggml/src/ggml-vulkan/` (`ggml-vulkan.cpp`, new `vulkan-shaders/snake.comp`, `vulkan-shaders-gen.cpp`)	New Vulkan Snake activation fusion: detects the 5-op chain `MUL → SIN → SQR → MUL → ADD` (matching CUDA b9094 introduction) and dispatches a single fused `snake_{f32,f16,bf16}` kernel `y = x + sin(a*x)^2 * inv_b`. New `ggml_vk_can_fuse_snake()` validates contiguity, 2D shape, and broadcast operands `[1, C, 1, 1]`. Internal Vulkan backend, no project changes
~b9264–b9279	`ggml/src/ggml-metal/ggml-metal-ops.cpp` + `ggml-metal.metal`	`kernel_concat` / `kernel_set` now batch multiple small rows into one threadgroup (`nrptg = min(256/ne0, ne1)`, capped at 256 threads/group) to improve small-row throughput; `kernel_concat` gains an early-return bounds check. Internal Metal backend, no project changes
~b9264–b9279	`ggml/src/ggml-hexagon/` (`ggml-hexagon.cpp`, `htp/ssm-conv.c`, `htp/rope-ops.c`)	SSM_CONV HVX kernel rewritten with VTCM-staged 32×32 fp32 in-register transpose and per-thread tiling (1 MiB VTCM budget); strictly-contiguous gate replaced with byte-stride checks (`nb[0]==sizeof(float)` and `nb[1]==ne[0]*sizeof(float)`); `rope_cache_init` / `mrope_cache_init` marked `__attribute__((noinline))` to reduce code-bloat on Hexagon. Internal Qualcomm DSP backend, no project changes
~b9264–b9279	`examples/save-load-state/` removed, `tests/test-save-load-state.cpp` added; `tools/{batched-bench,fit-params,quantize,perplexity}/CMakeLists.txt`	The `llama-save-load-state` example binary was removed and re-homed as a CTest target; the four remaining standalone tools were each split into a `*-impl` static library + a thin `main.cpp` wrapper (mirroring the b9245 split of cli/completion/llama-bench/server), with the entry-point renamed to `llama_batched_bench` / `llama_fit_params` / `llama_quantize` / `llama_perplexity` to satisfy `-Wmissing-declarations`. Project does not compile any of these `.cpp` files (only `server-context.cpp`, `server-queue.cpp`, `server-task.cpp`, `server-models.cpp` — see `CMakeLists.txt`), so no impact
~b9264–b9279	`app/` (`CMakeLists.txt`, `llama.cpp`)	`llama-app` unified binary gains four new subcommands (`batched-bench`, `fit-params`, `quantize`, `perplexity`) and sets `LLAMA_APP_CMD` in the env before dispatching so that the router can re-inject the subcommand into spawned child argv. Guarded by `LLAMA_BUILD_APP=OFF` default — project doesn't enable it, no impact
~b9264–b9279	`conversion/base.py` + `conversion/llama.py`	New `_set_vocab_hybriddna()` Python helper that emits a `gpt2`-style BPE vocab tagged as `tokenizer.model = "hybriddna"`; `LlamaModel.set_vocab()` dispatches to it when `tokenizer_config.json` declares `"tokenizer_class": "HybridDNATokenizer"`; `add_prefix_space` handling moved earlier in the same method. Conversion tooling only, not compiled by project
~b9279–b9284	upstream `CMakeLists.txt`	`LLAMA_BUILD_APP` default flipped `OFF` → `ON`. Project's `LLAMA_BUILD_TOOLS` is OFF (FetchContent, `LLAMA_STANDALONE=OFF`), so `tools/`-dependent app targets are not configured; nevertheless `CMakeLists.txt:108` now explicitly forces `set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)` to keep the cache pinned across upgrades
~b9279–b9284	`tools/{batched-bench,cli,completion,fit-params,llama-bench,perplexity,quantize,server}/CMakeLists.txt`	Each `*-impl` target switched from `add_library(... STATIC ...)` to default library type (becomes SHARED when `BUILD_SHARED_LIBS=ON`); added `WINDOWS_EXPORT_ALL_SYMBOLS ON` and conditional `install(TARGETS ... LIBRARY)` under `LLAMA_TOOLS_INSTALL`. Project doesn't enable `LLAMA_BUILD_TOOLS`, so none of these targets are configured — no impact
~b9279–b9284	`src/llama-vocab.cpp` + `conversion/base.py`	HybridDNA tokenizer fix: k-mers are now stored in `token_to_id` with a reserved `\xee\x80\x80` (U+E000) suffix to disambiguate them from identical base-vocab BPE tokens (e.g. `CCCCCC`); the suffix is stripped from `id_to_token` text after vocab load. Pure tokenizer internals, not exposed via JNI — no project changes required
~b9279–b9284	`ggml/src/ggml-cuda/common.cuh`	PDL-launch gating now uses `ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_HOPPER` instead of the raw device cc — fixes false negatives when running on a Hopper device with a binary compiled for an older arch. Internal CUDA backend, no project changes required
~b9284–b9297	upstream `CMakeLists.txt`	`LLAMA_BUILD_APP` default reverted from `ON` back to `${LLAMA_STANDALONE}` (i.e. OFF for FetchContent consumers). Project's `set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE)` shim is now redundant but harmless; kept as defensive pin against future flips
~b9284–b9297	`common/chat.h` + `tools/server/server-task.cpp`	New additive `common_chat_parser_params::is_continuation` field (default `false`); `params_from_json_cmpl` now parses the `continue_final_message` request field via `common_chat_continuation_parse()` and sets `is_continuation` when the result is non-`NONE`. `task_result_state` ctor guard tightened: the empty-prefill `chat_msg = common_chat_parse("", true, ...)` initialization is now gated on `is_continuation && !echo` (was just `!echo`), i.e. the assistant-prefill suppression delta is only emitted when an actual continuation is requested. Java `InferenceParameters.setContinueFinalMessage(boolean|ContinuationMode)` already writes `continue_final_message` to the request JSON, so behaviour is wired through automatically; non-continuation requests now correctly emit the first delta instead of suppressing it
~b9284–b9297	`src/llama-model.{h,cpp}` + `src/models/qwen35.cpp` + `src/models/qwen35moe.cpp`	NVFP4 quantization extended to MTP (Multi-Token Prediction) tensors: `llama_layer_nextn` gains four scale fields (`eh_proj_s`, `eh_proj_in_s`, `shared_head_head_s`, `shared_head_head_in_s`); `load_tensors()` loads them when the corresponding base tensor exists and is NVFP4; Qwen3.5 / Qwen3.5-MoE MTP graphs pass the scales into `build_lora_mm()`. Internal model-loading + graph-building changes, no project changes required
~b9284–b9297	`ggml/src/ggml-backend.cpp`	Bug fix in `ggml_backend_tensor_get_2d_async`: fast-path condition checked `iface.set_tensor_2d_async == NULL` (typo) instead of `iface.get_tensor_2d_async == NULL`; multi-copy gets now correctly fall back when the backend lacks `get_tensor_2d_async`. Also corrects an out-of-bounds assertion message from "write" to "read". Internal backend code, no project changes required
~b9284–b9297	`ggml/src/ggml-opencl/` (`ggml-opencl.cpp` + 17 kernel files)	Adreno MoE pipeline bug fix: GEMM/GEMV kernels for MXFP4/Q4_0/Q4_1/Q4_K/Q5_0/Q5_1/Q5_K/Q6_K had a boundary-check race where the `ne01` bounds check exited threads early and prevented their participation in tile-wide reductions, causing wrong results when `ne01 % 64 != 0`. Fixed by: (1) rounding `global_size[0]` up to the next multiple of 64 in `ggml_cl_mul_mat_id`, (2) moving the per-thread `ne01` early-return in each GEMM kernel to AFTER the tile reduction, (3) adding the same early-return in the GEMV kernels and the cvt.cl trans4_ns/restore_ns kernels; alignment threshold also relaxed from `ne01 % 64 == 0` to `ne01 % 32 == 0` in `use_adreno_moe_kernels`. Internal OpenCL backend, affects the `opencl-android-aarch64` classifier build only — no project source changes
~b9284–b9297	`ggml/src/ggml-sycl/` (`ggml-sycl.cpp`, `dmmv.cpp`, `gated_delta_net.cpp`, `common.hpp`)	(1) BF16 added to `ggml_sycl_supports_dmmv()` and `can_use_dequantize_mul_mat_vec()`; new `convert_mul_mat_vec_bf16_sycl` path. (2) Level Zero auto-detect moved into `ggml_sycl_init()` — `info.ext_oneapi_level_zero` flag now reflects the GPU-only check (CPU devices ignored) and is used as the default for `GGML_SYCL_ENABLE_LEVEL_ZERO` env. (3) `mmid_counting_sort_rows()` replaces the per-expert atomic scan in `ggml_sycl_mul_mat_id` — host-side counting sort builds expert-contiguous row slices in a single pass instead of N×expert atomic scans; significant speedup for MoE dispatch. (4) Gated-delta-net kernel extended with `keep_rs_t` template parameter and per-token snapshot writes when `K > 1`, matching the CUDA/Vulkan snapshot changes from b9222. Internal SYCL backend, no project changes required
~b9284–b9297	`ggml/src/ggml-vulkan/CMakeLists.txt`	`find_package(SPIRV-Headers)` switched to `CONFIG REQUIRED` and adds `$ENV{VULKAN_SDK}` to `CMAKE_PREFIX_PATH`; fixes detection when SPIRV-Headers ships only the CMake-config files (no FindSPIRV-Headers.cmake). Internal Vulkan build config, no project changes required
~b9284–b9297	`ggml/src/ggml-zendnn/` (`CMakeLists.txt`, `ggml-zendnn.cpp`)	ZenDNN bumped to ZenDNN-2026-WW19; Q8_0 weight support added for matmul and matmul_id paths via dynamic quantization (S8 compute, BF16 scales); ZenDNN matmul/matmul_id now handles `GGML_TYPE_Q8_0` with FP32 src1 directly without F32→Q8_0 conversion. Internal AMD ZenDNN backend, no project changes required
~b9284–b9297	`tools/perplexity/perplexity.cpp`	`log_probs.resize(n_ctx * nv)` widened to `size_t(n_ctx) * nv` to avoid 32-bit overflow on large context sizes. Standalone tool not compiled by project, no impact
~b9297–b9305	upstream `CMakeLists.txt`	Top-level backward-compat shims that forwarded `LLAMA_BUILD_WEBUI` → `LLAMA_BUILD_UI` and `LLAMA_USE_PREBUILT_WEBUI` → `LLAMA_USE_PREBUILT_UI` were REMOVED (they now live only in `tools/ui/CMakeLists.txt`). Java impact: project's `set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE)` no longer hits the shim at top level. `tools/ui` is not configured in FetchContent mode (`LLAMA_BUILD_TOOLS=OFF`), so the old setting was inert in practice, but the project's `CMakeLists.txt:107` was renamed to `set(LLAMA_BUILD_UI OFF CACHE BOOL "" FORCE)` for clarity and to defend against future flips of `LLAMA_BUILD_UI` default
~b9297–b9305	`common/common.h`	`LLAMA_UI_DEFAULT_ENABLED` macro removed; `common_params::ui` default is now unconditionally `true`. Not referenced by project, no changes required
~b9297–b9305	`common/fit.{h,cpp}`	`common_get_device_memory_data()` made non-static and exported from `fit.h` (was a file-local helper). `fit.h` now also pulls in `ggml-backend.h`, `llama.h`, and `../src/llama-ext.h`. Used by upstream `tools/server/server-context.cpp` (compiled directly into jllama). The `#include "../src/llama-ext.h"` resolves relative to fit.h's location (`common/../src/llama-ext.h`), so no extra include paths are required. No project source changes
~b9297–b9305	`tools/server/server-context.cpp`	New `#include "fit.h"` and a new draft/MTP memory measurement block: when `params_base.fit_params` is set AND the speculative config includes a draft model or `COMMON_SPECULATIVE_TYPE_DRAFT_MTP`, `common_get_device_memory_data()` is called against the draft model (or a copy of the target params with `LLAMA_CONTEXT_TYPE_MTP` for MTP) and the resulting per-device `model + context + compute` bytes are added to `params_base.fit_params_target` before the target context is fitted. Compiled directly into jllama from upstream; behaviour is additive and only triggers for speculative-decoding setups. `ModelParameters.setFit(boolean)` defaults to `on`, so this kicks in automatically when a user configures a draft model — no Java-side wiring required
~b9297–b9305	`tools/server/server-context.cpp`	`[mtmd] estimated memory usage of mmproj` log line reworded to `estimated worst-case memory usage`; log only, no behavioural change
~b9297–b9305	`tools/server/server-http.cpp`	UI serving path migrated from per-asset extern arrays (`index_html`, `bundle_js`, …) and the `LLAMA_BUILD_UI` macro to a runtime `llama_ui_find_asset()` lookup gated on the new `LLAMA_UI_HAS_ASSETS` macro generated by the new `llama-ui-embed` host tool. Project does NOT compile `server-http.cpp` (only `server-context.cpp`/`server-queue.cpp`/`server-task.cpp`/`server-models.cpp`), no impact
~b9297–b9305	`tools/ui/` (`CMakeLists.txt`, new `embed.cpp`, new `sources.cmake`, new `scripts/ui-assets.cmake`, removed `scripts/ui-download.cmake` + `scripts/xxd.cmake`, removed `ui.cpp`+`ui.h`)	Full UI build pipeline rewrite: `xxd.cmake`+`ui-download.cmake` replaced by a host-compiled `llama-ui-embed` C++ tool that generates `ui.cpp`/`ui.h` (declaring a `g_assets[]` table and `llama_ui_find_asset()` lookup, plus `LLAMA_UI_HAS_ASSETS` macro) from arbitrary asset files; new `scripts/ui-assets.cmake` orchestrates asset provisioning with a clearer priority (pre-built `tools/ui/dist` → npm build → HF Bucket); `tools/ui` is now an `add_custom_target` always re-run per build. The deprecation shims for `LLAMA_BUILD_WEBUI`/`LLAMA_USE_PREBUILT_WEBUI`/`LLAMA_WEBUI_HF_BUCKET` moved here from the top-level `CMakeLists.txt`. Project does not build the UI (`LLAMA_BUILD_TOOLS=OFF` in FetchContent mode), no impact
~b9297–b9305	`ggml/include/ggml-alloc.h`	Comment-only API documentation update for `ggml_backend_alloc_ctx_tensors_from_buft`. No project changes required
~b9297–b9305	`ggml/src/ggml-backend-meta.cpp`	Bug fix for zero-sized split tensor slices: `set_tensor`/`get_tensor`/`set_tensor_async`/`get_tensor_async` paths now `continue` when `chunk_size_j == 0`; `ggml_backend_meta_alloc_ctx_tensors_from_buft` now allocates a dummy buffer when all tensors in a context are zero-sized (was returning `NULL` and asserting); `ggml_backend_buft_alloc_buffer` result now `GGML_ASSERT`ed non-null. Internal backend code, no project changes required
~b9297–b9305	`ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c`	`hvx_vec_splat_f16(hvx_vec_get_f16(...))` round-trip replaced with `hvx_vec_repl_f16(...)` which stays in the vector domain via `vdelta` (avoids store/reload through scalar). Internal Hexagon DSP backend optimization, no project changes required
~b9297–b9305	`ggml/src/ggml-opencl/ggml-opencl.cpp`	`GGML_OPENCL_PROFILING` batching fix: when `profiling_info` reaches 2048 entries the batch is now flushed into a persistent `profiling_results` vector (events released, durations populated) instead of accumulating until shutdown. Also fixes missing `]` closing the JSON array in `cl_trace.json`. Profile-only code (`GGML_OPENCL_PROFILING` is off by default), no project changes required
~b9305–b9333	`common/common.h` + `common/arg.cpp`	`common_params::checkpoint_every_nt` renamed to `checkpoint_min_step`; default changed 8192 → 256; CLI flag `-cpent`/`--checkpoint-every-n-tokens` REMOVED (throws `std::invalid_argument` at parse time) and replaced by `-cms`/`--checkpoint-min-step`; env var `LLAMA_ARG_CHECKPOINT_EVERY_NT` → `LLAMA_ARG_CHECKPOINT_MIN_SPACING_NT`. Java layer does not expose this flag, no project source changes required
~b9305–b9333	`common/chat.h` + `common/chat.cpp`	New `common_chat_msg_span` and `common_chat_msg_delimiter` structs; new `common_chat_params::message_spans` field (default empty vector); new `common_chat_split_by_role()` function; populated for GPT-OSS, Gemma4, and all autoparser-handled templates with detected `user_start`/`assistant_start` markers; passed through `server-common.cpp` as `message_spans` JSON array in the task params; compiled from upstream, no Java changes required
~b9305–b9333	`common/chat-diff-analyzer.cpp` + `common/chat-auto-parser.h`	New `autoparser::user_start` and `autoparser::assistant_start` fields auto-detected via differential template analysis; new patches for Nemotron Nano v2, Fireworks v2, Solar Open, Apriel 1.6; additive, compiled from upstream, no project changes required
~b9305–b9333	`tools/server/server-task.h` + `tools/server/server-context.cpp`	New `task_params::n_before_user` field (default `-1`); server computes it from `message_spans` to place context checkpoints precisely at the last-user-message boundary; MTP context creation now propagates `draft.cache_type_k/v`; compiled directly into jllama from upstream, no project source changes required
~b9305–b9333	`ggml/include/gguf.h` + `ggml/src/gguf.cpp`	New `gguf_reader_callback_t` typedef; new `gguf_init_from_buffer(data, size, params)` and `gguf_init_from_callback(callback, userdata, max_chunk_read, max_expected_size, params)` public APIs; internal `gguf_init_from_reader()` helper refactored to use a callback-based reader; additive, not used by project
~b9305–b9333	`ggml/CMakeLists.txt`	GGML version bumped 0.12.0 → 0.13.0; no project changes required
~b9305–b9333	`ggml/src/CMakeLists.txt` + `ggml/src/ggml-cpu/CMakeLists.txt`	OpenMP detection and `target_link_libraries` moved from `ggml-cpu` into `ggml-base`; exported `ggml-config.cmake.in` updated to add `GGML_BASE_INTERFACE_LINK_LIBRARIES` and guard OpenMP targets before appending; fixes static-lib consumers that link only `ggml-base`; no project source changes required
~b9305–b9333	`ggml/src/ggml-alloc.c`	Off-by-one bug fix in `ggml_dyn_tallocr_remove_block`: loop ran one iteration past the last valid element; internal allocator fix, no project changes required
~b9305–b9333	`ggml/src/ggml-backend-meta.cpp`	Rotating-pair compute containers: external views created between evals now use a `stc_compute[2]` double-buffer scheme so they don't slowly deplete `stc_static` memory; `split_state_cache` is now unbounded (comment documents it as FIXME); `ggml_backend_meta_alloc_ctx_tensors_from_buft` uses `ggml_get_mem_size(ctx)` for static container and `16×` that for each compute container; internal multi-GPU meta backend refactor, no project changes required
~b9305–b9333	`ggml/src/ggml-cuda/fwht.cu` + `fwht.cuh` + `ggml-cuda.cu`	New CUDA FWHT (Fast Walsh-Hadamard Transform) kernel (`fwht_cuda<N>`) for N = 64/128/256/512; dispatched from `ggml_cuda_mul_mat` when `GGML_HINT_SRC0_IS_HADAMARD` op hint is set on a `ggml_mul_mat` node (hint index 1); internal CUDA backend, no project changes required
~b9305–b9333	`ggml/src/ggml-metal/ggml-metal-device.{h,m}`	New `ggml_metal_device_id` enum covering M1–M5 variants; `device_id` field added to `ggml_metal_device_props`, populated by new `ggml_metal_device_id_parse()` from the MTL device name string; additive, no project changes required
~b9305–b9333	`ggml/src/ggml-quants.c`	IQ2XS and IQ3XS neighbour-search init parallelized with OpenMP (3-pass: parallel count → serial prefix-sum → parallel write); fixes a prior race on `counter` under OpenMP; guards with `#ifdef GGML_USE_OPENMP`; internal quantization init, no project changes required
~b9305–b9333	`src/llama-arch.cpp`	`LLM_TENSOR_FFN_LATENT_DOWN` and `LLM_TENSOR_FFN_LATENT_UP` probe op changed from `GGML_OP_MUL` to `GGML_OP_MUL_MAT`; fixes Nemotron 3 Super latent projections not staying on GPU (buft probe must use `MUL_MAT` to keep them there); internal upstream fix, no project changes required
~b9305–b9333	`vendor/cpp-httplib/httplib.{h,cpp}`	Bumped to v0.45.1: `close_socket`, `shutdown_socket`, `Server::stop` marked `noexcept`; macOS Keychain cert loading migrated from deprecated `SecTrustCopyAnchorCertificates` to `SecTrustSettingsCopyCertificates` (all three trust domains: system, admin, user); `CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN` now restricted to `TARGET_OS_OSX` only with compile-time `#error` on iOS/tvOS/watchOS; compiled automatically, no project changes required
~b9305–b9333	`common/common.h`	New `string_lcs(std::string_view a, std::string_view b)` function (longest common substring via DP); additive, not used by project directly
~b9333–b9354	`src/models/talkie.cpp` (new) + `src/llama-arch.h/cpp` + `src/llama-model.cpp` + `src/llama-vocab.cpp/h`	New Talkie model architecture (`LLM_ARCH_TALKIE`); uses NEOX rope type; embedding skip connections via `out_scale`; per-head Q gain via `attn_q_norm`; logit scale; new `LLAMA_VOCAB_PRE_TYPE_MINICPM5 = 52` ("minicpm5" pre-type with `ignore_merges = true`); "talkie" tokenizer_pre mapped to GPT4O; `Gemma4ForCausalLM` registered as Gemma4 in HF conversion map; all additive, no project source changes required
~b9333–b9354	`src/models/mistral3.cpp`	Dense FFN now passes `ffn_up_s`/`ffn_gate_s`/`ffn_down_s` instead of `nullptr`; MoE passes `ffn_up_exps_s`/`ffn_gate_exps_s`/`ffn_down_exps_s` to `build_moe_ffn`; bug fix for NVFP4 Mistral3/Mistral-MoE models; upstream only, no project changes required
~b9333–b9354	`tools/server/server-http.h` + `server-http.cpp`	`bool is_ssl = false` field added to `server_http_context`; `listening_address` now uses `https://` prefix when SSL is configured (was always `http://`); compiled from upstream, no project changes required
~b9333–b9354	`ggml/src/ggml-sycl/ggml-sycl.cpp`	Virtual memory pool (`ggml_sycl_pool_vmm`) implemented when `SYCL_EXT_ONEAPI_VIRTUAL_MEM` is available; `GGML_SYCL_ENABLE_VMM` env var (default `1`) controls it; `DEBUG_SYCL_MALLOC` compile flag for verbose allocation logging; `vmm_granularity` field in `sycl_device_info`; internal SYCL backend, no project changes required
~b9333–b9354	`ggml/src/ggml-cuda/fwht.cu` + `fwht.cuh`	`ggml_cuda_op_fwht` return type changed `void` → `bool`; returns `false` for non-contiguous tensors or unsupported N values instead of calling `GGML_ABORT`; caller in `ggml-cuda.cu` now skips FWHT gracefully; internal CUDA backend, no project changes required
~b9333–b9354	`ggml/src/ggml-vulkan/ggml-vulkan.cpp` + `conv2d_mm.comp`	Cooperative matrix 1 (cm1) path for conv2d; new `CONV_SHAPE_64x128` tile size; `aligned` spec constant skips bounds checks when K/CRS/NPQ are tile-aligned; `csh_store` stages cm2/cm1 output through shared memory for coalesced global stores; internal Vulkan backend, no project changes required
~b9333–b9354	`ggml/src/ggml-webgpu/`	New MMVQ path for mat-vec using `packed_4x8_integer_dot_product`; legacy `mul_mat.wgsl` removed (replaced by register-tile path); new `quantize_q8.wgsl` and `mul_mat_vec_q_acc.tmpl`; vendor and dot-product capability detection at init; `q8_1.m` renamed to `q8_1.s` in WGSL struct; internal WebGPU backend, no project changes required
~b9333–b9354	upstream CI (`.github/workflows/`)	CANN and SYCL builds disabled to save Actions resources; macOS builds moved to `build-apple.yml`; cache keys prefixed with `cache-gha-`; `[no release]` commit message token skips release pipeline; no project changes required
~b9354–b9437	`common/common.h` + `common/arg.h` + `common/arg.cpp`	`common_params_handle_models()` return type `void` → `bool` (caller can detect skip-download misses); new `common_params::skip_download`; `common_params::timeout_read` default raised 600 → 3600. Project does not call `common_params_handle_models()` directly — arg parsing happens upstream; the new defaults flow through transparently
~b9354–b9437	`common/download.h` + `common/download.cpp`	`common_download_model()` parameter list trimmed: `download_mmproj`/`download_mtp` moved into `common_download_opts`; new `common_skip_download_exception`; new opt `skip_download` returns `-2` on missing/etag mismatch. Project does not include `download.h` directly, no source changes required
~b9354–b9437	`tools/server/server-task.h` + `server-task.cpp`	`task_params::stream` default `true` → `false`; new `server_task_result_cmpl_partial::is_begin` bool to let HTTP layer emit SSE headers before the first delta; `to_json()` returns `nullptr` for the begin marker (sentinel meaning "HTTP-headers-only, no body"). Project always sets `stream` explicitly from Java (`LlamaIterator.java`, `LlamaModel.java`) so the default change is inert. The `is_begin` / nullable-`to_json` contract DOES leak into the JNI bridge — see the row below for the required fix
~b9354–b9437	`tools/server/server-context.cpp` + `server-queue.cpp`	`send_partial_response()` gained `is_begin` parameter (defaulted); SSE stream now emits a no-content opening event when `stream && !return_progress` (`server-context.cpp:2835`) so the client sees HTTP 200 + headers before first token. `server_response_reader::next()` 30s warn-on-cancel diagnostic message updated. Required project source change: `Java_net_ladenthin_llama_LlamaModel_receiveCompletionJson` in `src/main/cpp/jllama.cpp` called `result->to_json()` once and assigned `response["stop"]`, which silently auto-promoted the `nullptr` to an object `{"stop": false}` and surfaced a phantom empty `LlamaOutput` to every Java streaming caller (`LlamaModelTest.testGenerateAnswer` and four sibling tests overran by +1 token). Fixed by wrapping the `rd->next()` call in a loop that skips `response.is_null()` results so only real events reach Java
~b9354–b9437	`common/arg.cpp` (env-var renames)	`LLAMA_LOG_*` → `LLAMA_ARG_LOG_*`, `LLAMA_OFFLINE` → `LLAMA_ARG_OFFLINE`, `LLAMA_LOG_FILE` → `LLAMA_ARG_LOG_FILE`, `LLAMA_CHAT_TEMPLATE_KWARGS` → `LLAMA_ARG_CHAT_TEMPLATE_KWARGS`. CLI verbosity values relabeled (4=trace, 5=debug). The `--license` CLI flag was REMOVED and moved to the new `llama-app licenses` subcommand. Project does not expose these env vars or the `--license` flag through the Java API, no changes required
~b9354–b9437	`src/llama.cpp`	`llama_backend_init()` device-discovery rule tightened: iGPUs are now added only when no discrete GPUs were found (was: when no devices at all). RPC servers no longer count as "found" for this purpose, so iGPU + RPC setups keep the local iGPU. Behavioural only, single-line caller in `jllama.cpp` unchanged
~b9354–b9437	`src/llama-chat.cpp`	New `LLM_CHAT_TEMPLATE_GRANITE_4_1` enum value + "granite-4.1" template name; `granite-4.0` detection now requires the literal token `g4_default_system_message` in the template, otherwise it routes to 4.1. Project does not implement chat-template detection directly — routing happens inside compiled-from-upstream code, no source changes required
~b9354–b9437	`vendor/cpp-httplib/`	Bumped to v0.46.0: adds `Client::set_no_proxy(std::vector<std::string>)` with full hostname-suffix and IPv4/IPv6 CIDR matching; `Server::ThreadPool` constructor is exception-safe (already in v0.45.0); `Client::set_proxy()` now disconnects the held socket immediately so a later proxy change cannot reuse the old TLS session. Compiled automatically, no project changes required
~b9354–b9437	`common/arg.cpp` (additive flags)	New `--spec-draft-backend-sampling` / `--no-spec-draft-backend-sampling` (env `LLAMA_ARG_SPEC_DRAFT_BACKEND_SAMPLING`) and `--skip-download` (mapped to `common_params::skip_download`). The defaults of both flags preserve current Java behaviour. Consider exposing them as `ModelParameters.setSpecDraftBackendSampling(boolean)` and `setSkipDownload(boolean)` in a follow-up; tracked under Open TODOs
~b9354–b9437	`ggml/src/ggml-cuda/common.cuh`	`GGML_CUDA_USE_PDL` gating tightened: for MSVC, now requires CTK ≥ 12.3 (was 11.8) due to a compiler bug in the older Windows CUDA toolchains. Project's only CUDA build is Linux (dockcross, CUDA 13.2) so the MSVC gate has no CI impact; Windows CI builds CPU-only
~b9437–b9442	`src/llama-vocab.{h,cpp}` + `src/llama-arch.{h,cpp}`	New `LLAMA_VOCAB_PRE_TYPE_WHITESPACE = 53` and `llm_tokenizer_whitespace_session` (used by jina-v2-base-zh embeddings); new "whitespace" tokenizer_model routed as `LLAMA_VOCAB_TYPE_BPE`; new `LLM_KV_TOKENIZER_NORMALIZER_LOWERCASE` key (`tokenizer.ggml.normalizer.lowercase`) read into `llama_vocab::impl::normalizer_lowercase`; new public accessor `llama_vocab::get_normalizer_lowercase()`. All additive — existing tokenizers untouched; new whitespace + lowercase normalizer is consumed automatically when loading a GGUF that sets these vocabulary keys, no project source or Java API changes required
~b9437–b9442	`src/llama.cpp`	`llama_prepare_model_devices()` iGPU collection now appends only the FIRST `GGML_BACKEND_DEVICE_TYPE_IGPU` device (prevents duplicate iGPU registration on multi-iGPU hosts). Behavioural fix, single-line caller in `jllama.cpp` unchanged, no project source changes required
~b9437–b9442	`tools/ui/embed.cpp` + `tools/ui/src/...` (Svelte)	Webasset embedder tightened printf format specifiers (`%lu` → `%zu` and `PRIx64`); UI settings split `custom` into `customJson` + `customCss`; runtime CSS injection via `<svelte:head>`. Project does not ship the upstream UI, no impact
~b9437–b9442	`gguf-py/`, `conversion/` (Python)	New `_set_vocab_whitespace()` helper and `add_normalizer_lowercase()` GGUF writer for the new whitespace tokenizer + lowercase normalizer keys (mirrors the vocab additions above); jina-v2 Roberta-tokenizer path now branches to whitespace when `tokenizer.json` declares a `Whitespace` pre-tokenizer. Python-side only, no impact on the Java/JNI build
~b9442–b9444	`.github/workflows/build-cpu.yml` (upstream CI)	Upstream's CPU-build CI trigger paths narrowed to `**/*.h`, `**/*.hpp`, `**/*.c`, `**/*.cpp` (dropped `**/*.cu`, `**/*.cuh`, `**/*.swift`, `**/*.m`, `**/*.metal`, `**/*.comp`, `**/*.glsl`, `**/*.wgsl`) so GPU/Metal/Vulkan/WebGPU/Swift source edits no longer trigger the CPU build. Upstream-only CI plumbing; this project consumes none of upstream's workflow files and has its own `publish.yml`, no impact
~b9442–b9444	`tools/server/server-http.cpp`	`If-None-Match` conditional-GET handling now also accepts the weak ETag form `W/"..."` (previously only a byte-exact match against the strong ETag was accepted); 304 Not Modified is returned for either form. This is the standalone `llama-server` HTTP tool, which is not linked into the JNI build (`libllama` + `libcommon` only); no project source changes required and no new Java API surface to expose
~b9444–b9490	`common/common.cpp`	`common_prompt_batch_decode()` signature changed: new `int n_new` parameter added between `all_tokens` and `n_past`. Callers must pass the count of newly-decoded tokens for the batch. Only called inside upstream `tools/server/server-context.cpp` (compiled directly into jllama); no project source changes required — the new signature flows through transparently
~b9444–b9490	`include/llama.h`	`llama_set_warmup()` deprecated via `LLAMA_DEPRECATED` macro (warmup is now handled internally during model load + first decode). Not called from `jllama.cpp` or any project source; the change is absorbed inside upstream-compiled code, no project changes required. If a future jllama feature needs explicit warmup control, it should use the replacement path rather than this deprecated call
~b9444–b9490	`include/llama.h` + `src/llama-context.cpp`	New `llama_context_params::n_outputs_max` field (default `-1` = derived from `n_batch`). Limits the number of output slots allocated per context; useful for low-memory setups that always request `logits_all=false`. Not exposed by project today — consider adding `ModelParameters.setMaxOutputs(int)` if a user requests fine-grained control. Tracked under Open TODOs
~b9444–b9490	`common/arg.cpp` + `common/common.cpp`	`common_params_handle_models()` no longer sets `hf_opts.download_mmproj = true` unconditionally; instead uses `opts.download_mmproj = !params.no_mmproj` so the new `--no-mmproj` flag suppresses the multimodal projector download. Not called from project source — arg parsing happens upstream, no project changes required
~b9444–b9490	`common/sampling.h` + `common/sampling.cpp`	New `common_sampler_reasoning_budget_force(common_sampler *)` API that triggers the budget sampler to inject the end-of-thinking token on the next sample. Paired with new `common_params_sampling::reasoning_control` bool: when set, arms the budget sampler so external code (e.g. a server control endpoint) can end reasoning at runtime. Not used by project today — would pair with a future `InferenceParameters.setReasoningControl(boolean)` setter and a `LlamaModel.endReasoning(...)` helper. Tracked under Open TODOs
~b9444–b9490	`common/common.h` + `common/arg.cpp`	New `common_params::sse_ping_interval` (int32, env `LLAMA_ARG_SSE_PING_INTERVAL`, CLI `--sse-ping-interval`); server emits SSE keep-alive comments at this interval. Server-only; project does not run the upstream HTTP server (uses a direct in-process API), no Java setter required
~b9444–b9490	`tools/server/server-http.cpp`	New `POST /v1/chat/completions/control` endpoint accepting `{"id": "...", "action": "reasoning_end"}` — tells a streaming completion to wrap up reasoning early. Server-only; not linked into the JNI build (`libllama` + `libcommon` only), no project source changes required. If exposed in Java, would map to a new `LlamaModel.endReasoning(String taskId)` method that calls `common_sampler_reasoning_budget_force` on the slot's sampler. Tracked under Open TODOs
~b9444–b9490	`src/llama-hparams.h` + `src/llama-model.cpp`	Internal renames: `hparams::recurrent_layer_arr` → `hparams::is_recr_impl`; `hparams::swa_layers` → `hparams::is_swa_impl`. Internal helper fields not part of the public API; not referenced by `jllama.cpp` or any project source, no changes required
~b9444–b9490	`src/llama-arch.h` + `src/llama-arch.cpp` + `gguf-py/`	New `LLM_KV_HIDDEN_ACT` GGUF key (`%s.hidden_act`) for ModernBert SwiGLU/GeGLU activation selection; new `LLM_KV_ATTENTION_RECURRENT_LAYERS` key for hybrid (recurrent + attention) models. Additive vocabulary keys consumed automatically when loading a GGUF that sets them; no project source or Java API changes required
~b9444–b9490	`src/llama-arch.h` + `src/models/*.cpp` (new)	New model architectures: `LLM_ARCH_MELLUM` (JetBrains code-completion), `LLM_ARCH_EXAONE4_5` (LG AI multimodal), `LLM_ARCH_STEP3P7` (StepFun Step-3.7 with MTP support); `LLM_ARCH_QWEN3NEXT`/`LLM_ARCH_QWEN35`/`LLM_ARCH_QWEN35MOE` removed from `llama_model_saver_supports_arch()` allowlist. New tokenizer pre-types: `LLAMA_VOCAB_PRE_TYPE_GRANITE_EMB_MULTI = 54`, `LLAMA_VOCAB_PRE_TYPE_MELLUM2 = 55`. All additive at the architecture level — consumed automatically when loading a matching GGUF, no project source or Java API changes required
~b9444–b9490	`common/arg.cpp`	New `--mtp` / `--no-mtp` flags (env `LLAMA_ARG_MTP`) now apply to Step-3.5 in addition to the existing Qwen3.5 coverage. Multi-Token Prediction is consumed inside upstream-compiled server TUs; project does not expose an MTP setter today (would map to `ModelParameters.setMtp(boolean)`). Tracked under Open TODOs if a user requests it
~b9444–b9490	upstream build / verification	Local build with `GIT_TAG b9490` was verified: `cmake -B build` configures cleanly; `cmake --build build --config Release -j$(nproc)` links `libjllama.so` with zero warnings for `jllama.cpp` or any other project translation unit. All breaking changes in this range are absorbed inside upstream-compiled translation units (`common.cpp`, `arg.cpp`, `llama.cpp`, `server-*.cpp`, `download.cpp`); no project source edits required for the version bump itself
~b9490–b9495	`include/llama.h` + `src/llama-ext.h` + `src/llama-context.{h,cpp}` + `src/llama-cparams.h` + `src/llama-graph.{h,cpp}` + `common/speculative.{h,cpp}` + `src/models/{qwen35,qwen35moe,step35}.cpp`	Mass terminology rename: `pre_norm` → `nextn` everywhere the pre-final-norm hidden state is referenced. Affects the public API: `llama_set_embeddings_pre_norm()` → `llama_set_embeddings_nextn()`, `llama_get_embeddings_pre_norm()` → `llama_get_embeddings_nextn()`, `llama_get_embeddings_pre_norm_ith()` → `llama_get_embeddings_nextn_ith()`. Internal: `cparams.embeddings_pre_norm` → `cparams.embeddings_nextn`, `cparams.embeddings_pre_norm_masked` → `cparams.embeddings_nextn_masked`, `llm_graph_result::t_h_pre_norm` → `t_h_nextn`, `common_speculative_need_embd_pre_norm()` → `common_speculative_need_embd_nextn()`. Qwen3.5 / Qwen3.5-MoE / Step-3.5 model graphs moved the final norm before extracting `t_h_nextn` (was after extracting the pre-norm hidden state). Project does not call any of these MTP-specific APIs directly; all references stay inside upstream-compiled translation units (`speculative.cpp`, `llama-context.cpp`, `server-context.cpp`, model TUs). Verified by grep across `src/main/cpp/*.{cpp,hpp}`: zero matches for any `pre_norm` / `nextn` / `embeddings_pre_norm` / `t_h_pre_norm*` symbol. No project source changes required
~b9490–b9495	`ggml/src/ggml-cuda/common.cuh` + 10 CUDA kernel files	New `GGML_CUDA_RESTRICT` macro replaces `__restrict__` on kernel parameter pointers. PDL (Programmatic Dependent Launch) on Hopper requires `__restrict__` to be disabled per llama.cpp PR #24030; the macro expands to nothing under `GGML_CUDA_USE_PDL && __CUDA_ARCH__ >= GGML_CUDA_CC_HOPPER`, otherwise to `__restrict__`. Kernel signatures change from direct `T * __restrict__ x` parameters to `T * x_ptr` parameter + an internal `T * GGML_CUDA_RESTRICT x = x_ptr;` alias line; `GGML_UNUSED_VARS` calls in fallback branches updated to reference the `_ptr` names. Internal CUDA backend change; project does not compile any CUDA kernels in the JNI build (CUDA build uses upstream sources unchanged via FetchContent). No project source changes required
~b9490–b9495	`src/llama-arch.{h,cpp}` + `src/llama-vocab.{h,cpp}` + `gguf-py/gguf/constants.py` + `gguf-py/gguf/gguf_writer.py`	New `LLM_KV_TOKENIZER_SUPPRESS_TOKENS` GGUF key (`tokenizer.ggml.suppress_tokens`). When a GGUF declares this array, the loader stores it on `llama_vocab::impl::suppress_tokens` and exposes it via new `llama_vocab::get_suppress_tokens()` accessor. The Gemma4 model graph (`src/models/gemma4.cpp`) reads this list and appends a `-INFINITY` logit bias to those token IDs at the end of the forward graph (new `llm_graph_input_logits_bias` class). Additive: existing models without the key produce an empty `suppress_tokens` vector and the bias-add branch is skipped. Mirrors a HuggingFace transformers `suppress_tokens` parameter; specifically used for Gemma4 Unified to prevent the model from emitting `<image
~b9490–b9495	`gguf-py/gguf/constants.py` + `gguf-py/gguf/tensor_mapping.py` + `tools/mtmd/clip-impl.h` + `tools/mtmd/clip-model.h` + `tools/mtmd/clip.cpp` + new `tools/mtmd/models/gemma4uv.cpp` + new `tools/mtmd/models/gemma4ua.cpp` + `tools/mtmd/mtmd-audio.{h,cpp}` + `tools/mtmd/mtmd.cpp` + `conversion/__init__.py` + `conversion/gemma.py`	New Gemma4 Unified vision + audio variant (`Gemma4UnifiedForConditionalGeneration`). Adds new projector types `PROJECTOR_TYPE_GEMMA4UV` and `PROJECTOR_TYPE_GEMMA4UA` (vision uses bigger patch size with token merging done on the conv layer; audio is encoder-free, raw 16 kHz waveform chunked into 640-sample frames). New `V_ENC_EMBD_PATCH_NORM` tensor enum (`v.patch_norm.{bid}`) and 3 indexed `patch_norm_{1,2,3}_{w,b}` weights on `clip_model` (Gemma4U uses standard PyTorch LayerNorm rather than RMSNorm before/after the patch embedding). New `mtmd_audio_preprocessor_gemma4ua` mel-major waveform packer (40 ms / 16 kHz frames; no FFT, no filterbank). Multimodal additions are routed through upstream `mtmd-cli` / `mtmd-debug` binaries that the project does not link; the JNI build links `libllama` + `libcommon` only. Additive at the GGUF / projector loader level: existing GGUFs without these projector types continue to load through the previous code paths. No project source or Java API changes required
~b9490–b9495	`tools/ui/` (`package.json`, `src/lib/components/app/content/MarkdownContent/`, new `MermaidPreview.svelte`, new `DialogMermaidPreview.svelte`, new constants / icons / rehype plugins)	Upstream `llama-server` web UI gains Mermaid diagram rendering: new `mermaid@^11.15` dependency, lazy-loaded; new rehype plugin chain (`rehype-mermaid-pre`, `rehype-enhance-mermaid-blocks`) converts ```mermaid code fences to `<pre class="mermaid">` and wraps them with copy / preview action buttons; the existing single-file `MarkdownContent.svelte` is split into a `.svelte` + sibling `.css` / `markdown-utils.ts` / `markdown-handlers.ts` so the new mermaid renderer can share helpers. Project does not compile or ship the upstream `tools/ui` (server-only feature, classpath-only JNI build); no impact
~b9490–b9495	upstream build / verification	Local build with `GIT_TAG b9495` was verified clean: `cmake -B build -DBUILD_TESTING=ON` configures cleanly, `cmake --build build --config Release -j$(nproc)` links `libjllama.so` + `jllama_test` with zero warnings on any project translation unit; `ctest --test-dir build --output-on-failure` reports 435/435 tests passing. All breaking changes in this range are renames within upstream-compiled translation units; no project source edits required for the version bump itself
~b9495–b9543	`src/llama-hparams.{h,cpp}` + every `src/models/*.cpp` (~150 files)	Field `hparams::n_layer` (uint32_t) was split: the raw count moved to `hparams::n_layer_all` and `hparams::n_layer()` is now a member function that returns `n_layer_all - n_layer_nextn` (the effective non-MTP layer count). Sibling rename: `hparams::nextn_predict_layers` → `hparams::n_layer_nextn`. Every per-model TU in `src/models/*.cpp` was updated to call `hparams.n_layer()` and `hparams.n_layer_nextn`. New `hparams::set_recr_pattern()` mirror of `set_swa_pattern()` for hybrid recurrent architectures. New per-layer `hparams::deepstack_mapping_arr` (LLAMA_MAX_LAYERS, default -1) populated from new GGUF key `LLM_KV_DEEPSTACK_MAPPING` for Granite4-Vision-style per-layer deepstack injection. `hparams::kv_only_nextn` was removed (MTP heads now use a layer filter callback instead). Project does not reference any of these hparams symbols directly — verified via `grep -rn "hparams\.n_layer\|nextn_predict_layers\|n_layer_nextn\|n_layer_all\|deepstack_mapping" src/main/cpp/ src/test/cpp/` returns zero matches. All consumers are inside upstream-compiled TUs (`llama-model.cpp`, `llama-context.cpp`, model TUs); no project source changes required
~b9495–b9543	`include/llama.h` (state-seq flags) + `tools/server/server-context.cpp` + `examples/speculative-simple/speculative-simple.cpp`	The `LLAMA_STATE_SEQ_FLAGS_ON_DEVICE` flag was removed from the `llama_state_seq_flags` enum. All upstream call sites that passed `LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY \| LLAMA_STATE_SEQ_FLAGS_ON_DEVICE` were updated to pass only `LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY` — the on-device path is now the default for partial saves/loads. Project does not call `llama_state_seq_get_*` / `llama_state_seq_set_*` directly from `jllama.cpp`; the only consumer in the JNI build is upstream `server-context.cpp` (speculative checkpoint helpers), which was updated upstream. Verified via `grep -rn "LLAMA_STATE_SEQ_FLAGS_ON_DEVICE" src/` returns zero matches. No project source changes required
~b9495–b9543	new `common/imatrix-loader.{h,cpp}` + refactor of `tools/imatrix/imatrix.cpp` + `tools/quantize/quantize.cpp`	Extracted shared imatrix-loading logic into a standalone library: new `common_imatrix` struct (`entries`, `datasets`, `chunk_count`, `chunk_size`, `is_legacy`, `has_metadata`) and `common_imatrix_load(const std::string &, common_imatrix &)` reader. New GGUF metadata keys exposed as `LLM_KV_IMATRIX_DATASETS`, `LLM_KV_IMATRIX_CHUNK_COUNT`, `LLM_KV_IMATRIX_CHUNK_SIZE`. The imatrix and quantize CLIs were rewritten to consume this shared loader (the legacy in-file binary parser also moved into the shared loader). Build system: `common/CMakeLists.txt` now includes `imatrix-loader.cpp` and `imatrix-loader.h` in `libcommon`, which means the JNI build picks up the new TU automatically via FetchContent + the existing `target_link_libraries(jllama PRIVATE common)` line. Project does not use imatrix loading from Java today (no `LlamaImatrix` class); the new symbols ship as additive surface area only. No project source changes required
~b9495–b9543	`tools/mtmd/clip.{h,cpp}` + `tools/mtmd/clip-impl.h` + `tools/mtmd/clip-model.h` + `tools/mtmd/mtmd.{h,cpp}` + `tools/mtmd/mtmd-helper.{h,cpp}` + `tools/mtmd/mtmd-image.cpp` + every `tools/mtmd/models/*.cpp`	Large MTMD subsystem refactor: (1) `clip_image_u8` and `clip_image_f32` switched from public POD-style `nx` / `ny` / `buf` fields to private members with `get_size()` / `set_size()` / `get_ro_buf()` / `cpy_buf()` / `get_pixel()` / `set_pixel()` / `is_placeholder()` getters/setters; every model TU and image helper was updated to the new API. (2) Several public helpers were removed from `tools/mtmd/clip.h`: `clip_embd_nbytes`, `clip_embd_nbytes_by_img`, `clip_image_u8_get_data`, `clip_build_img_from_pixels`, `clip_get_newline_tensor`, `clip_encode_float_image`, `clip_image_f32_batch_add_mel`. (3) `mtmd_helper_bitmap_init_from_file()` and `mtmd_helper_bitmap_init_from_buf()` gained a required `bool placeholder` parameter (when true the bitmap reserves shape only, no pixel decode — used for token counting). (4) `mtmd_bitmap` is now a true class (private buffer + `is_placeholder()` / `can_batch_with()`); `mtmd_bitmap_init()` and `mtmd_bitmap_init_from_audio()` accept `nullptr` data to create placeholder bitmaps. (5) New Granite4 Vision projector type `PROJECTOR_TYPE_GRANITE4_VISION` and tensor enums (`V_MULTI_PROJ_*`, `V_QF_*`) for QFormer-with-window projection. (6) Qwen-VL video / temporal-merge support: `clip_graph_qwen2vl::build_inp_with_temporal_merge()` plus `n_batch_max=2` for batch-merged consecutive image frames. Project does not link any `tools/mtmd/` TUs into the JNI build (`libllama` + `libcommon` only); the JNI vision API surfaces through `mtmd-helper.h` and was reviewed: zero `clip_image_*` / removed-helper references found across `src/main/cpp/` and `src/test/cpp/`. No project source changes required
~b9495–b9543	`tools/server/server-context.cpp` + `tools/server/server-http.cpp` + `tools/server/server.cpp` (new `/v1/responses/input_tokens` + `/v1/chat/completions/input_tokens` + `/v1/messages/count_tokens`)	New token-counting endpoints (Anthropic-compatible + OpenAI Responses-API-compatible). Implementation: `server_routes::handle_count_tokens()` consolidates the body parsing path (chat completions, responses, anthropic messages) and emits `{"input_tokens": N, "object": "response.input_tokens"}`. `process_mtmd_prompt()` signature gained a `bool is_placeholder = false` parameter so token-counting can reuse the multimodal tokenization path without decoding image/audio pixels. Server-only HTTP endpoints (the JNI build links neither `tools/server/server.cpp` nor `server-http.cpp`); the only server TU we link is `server-context.cpp`, where the only project-visible change is the new optional `process_mtmd_prompt` parameter, which is defaulted — existing project call sites compile unchanged. No project source changes required
~b9495–b9543	`common/chat-peg-parser.{h,cpp}` + `common/chat.cpp` (LFM2/2.5 unified)	LFM2.5's chat-completion parser was merged into the single `common_chat_params_init_lfm2()` (was a separate `_lfm2_5` function); a `bool tool_list_tokens` flag toggles between the two template flavours. New helper `common_chat_peg_builder::python_or_json_value()` and a new `bool allow_json_literals` parameter on `python_style_tool_calls()` so LFM2.5 can accept JSON-cased `true` / `false` / `null` alongside the Python-cased literals. Pure-Python literal normalisation in `chat-peg-parser.cpp` (`True`/`False`/`None` → JSON during streaming). Project does not call any `common_chat_peg_*` or `common_chat_params_init_lfm2` symbols; routing happens inside upstream-compiled `chat.cpp`. No project source changes required
~b9495–b9543	`ggml/src/ggml-cuda/mmvq.cu` + `ggml/src/ggml-cpu/arch/{riscv,wasm}/quants.c` + `ggml/src/ggml-metal/ggml-metal-device.m` + `ggml/src/ggml-opencl/` + `ggml/src/ggml-sycl/` + `ggml/src/ggml-vulkan/` + `ggml/src/ggml-webgpu/` + `ggml/src/ggml-cpu/kleidiai/kleidiai.cpp`	Per-backend numerical & performance work: (1) CUDA `mul_mat_vec_q_moe` switched to `GGML_CUDA_RESTRICT` aliasing + PDL launch params for Hopper. (2) RISC-V Vector quants: dispatch-by-VL refactor (`vl128` / `vl256` / `vl512` / `vl1024` separate kernels for Q2_K, Q3_K, Q4_K, Q6_K, IQ1_S, IQ1_M, IQ2_S, IQ2_XS, IQ3_S, IQ3_XXS, IQ4_XS, TQ1_0, TQ2_0). (3) WebAssembly SIMD path for Q4_1. (4) Metal residency-set keep-alive polling interval tightened to 5 ms (was 500 ms). (5) OpenCL Adreno: faster `concat`/`cpy`/`get_rows` packed kernels for narrow tensors (`<32` cols); Q6_K mat-vec rewritten with vec4 weight gather. (6) SYCL: multi-column MMVQ paths added for all quant types (ncols=2..8) used by speculative decoding's draft verification batches; `should_reorder_tensor` gate widened from `ne[1]==1` to `ne[1]<=8`. (7) Vulkan: NV cooperative-matrix2 feature detection now requires every `coopmat2_features.*` bit; FWHT shader gains shmem fallback (Intel Windows driver bug workaround). (8) WebGPU: flash-attention split into vector / tile / subgroup-matrix variants with K/V quantization-aware staging (`U32_DEQUANT_HELPERS`); GRANITE_SPEECH bumped to multi-projector. (9) KleidiAI: env vars `GGML_KLEIDIAI_CHUNK_MULTIPLIER` & `GGML_KLEIDIAI_SME` thread-cap auto-detect; SME + non-SME hybrid scheduling. All purely backend-internal; project compiles backends through FetchContent with no API surface change visible to `jllama.cpp`. No project source changes required
~b9495–b9543	`conversion/__init__.py` + `conversion/granite.py` + `conversion/gemma.py` + `convert_lora_to_gguf.py` + `gguf-py/gguf/{constants,tensor_mapping,gguf_writer}.py`	Python-side: new `Granite4VisionMmprojModel` (vision-projector for Granite4 with QFormer-window deepstack + per-projector spatial offsets + image-grid pinpoints); Gemma4 unified vision/audio conversion fix-ups for newer HF checkpoints (`hidden_size` falls back to `audio_embed_dim`; `model_patch_size` falls back to `patch_size * pooling_kernel_size`). `convert_lora_to_gguf.py` gained `--trust-remote-code`. New `LLM_KV_DEEPSTACK_MAPPING` writer (`add_deepstack_mapping`) and new clip-vision keys (`KEY_PROJ_SAMPLE_QUERY_SIDE`, `KEY_PROJ_SAMPLE_WINDOW_SIDE`, `KEY_PROJ_SPATIAL_OFFSETS`, `KEY_FEATURE_LAYERS`, `KEY_IMAGE_GRID_PINPOINTS`) for the Granite4 vision projector. Python-side only; no impact on the Java/JNI build. No project source changes required
~b9495–b9543	upstream build / verification	Local build pending: the b9495 → b9543 bump is expected to compile cleanly given the audit above (zero `grep` matches in `src/main/cpp/` for any of the renamed or removed symbols: `hparams.n_layer`, `nextn_predict_layers`, `n_layer_nextn`, `n_layer_all`, `LLAMA_STATE_SEQ_FLAGS_ON_DEVICE`, `clip_image_u8`/`clip_image_f32` field access, `clip_build_img_from_pixels`, `clip_get_newline_tensor`, `clip_image_u8_get_data`, `clip_embd_nbytes`, `clip_embd_nbytes_by_img`, `clip_encode_float_image`, `clip_image_f32_batch_add_mel`, `mtmd_helper_bitmap_init_from_file`, `mtmd_helper_bitmap_init_from_buf`, `common_imatrix_load`). The only project-visible signature change — `process_mtmd_prompt()`'s new `bool is_placeholder` parameter — is defaulted, so existing call sites inside the project compile unchanged. All breaking changes in this range are absorbed inside upstream-compiled translation units; no project source edits required for the version bump itself
~b9543–b9549	`include/llama.h` + `src/llama-context.{h,cpp}` + `src/llama-cparams.h` + `src/llama-ext.h`	New `llama_context_params::ctx_other` field (a source/target/parent `llama_context *`, default `nullptr`) used to share results or `llama_memory` between two contexts; mirrored by new `cparams.ctx_other` and the new staging API `llama_get_ctx_other()` (`llama-ext.h`). `llama_get_memory()` was moved earlier in `llama-context.cpp` and made null-safe (returns `nullptr` for a null ctx). `llama_context_default_params()` initializes `ctx_other = nullptr`. Project does not aggregate-init `llama_context_params` (it goes through `llama_context_default_params()` inside upstream `server-context.cpp`) and never includes `llama-ext.h` — verified via `grep -rn "llama_context_params\|ctx_other\|llama-ext.h\|llama_get_ctx_other\|llama_get_memory" src/main/cpp/` returns zero matches. No project source changes required
~b9543–b9549	`src/llama-kv-cache.{h,cpp}` + `llama-kv-cache-iswa.{h,cpp}` + `llama-kv-cache-dsa.cpp` + `llama-memory.h` + `llama-memory-hybrid{,-iswa}.cpp`	KV-cache constructors gained two new parameters: `llama_memory_t mem_other` and `layer_share_cb share` (`std::function<int32_t(int32_t il)>` returning the source layer index to share cells from, or negative to skip). Enables one context's KV cache to share cells with another's (used by the new Gemma4-assistant MTP head). `llama_memory_params` gained a `mem_other` field. All call sites (iswa/dsa/hybrid wrappers, `llama_model::create_memory`) updated upstream; the project never constructs a `llama_kv_cache` or `llama_memory_*` directly. No project source changes required
~b9543–b9549	`src/llama-arch.{h,cpp}` + new `src/models/gemma4-assistant.cpp` + `src/models/models.h` + `src/llama-model.{h,cpp}` + `src/llama-hparams.{h,cpp}` + `src/llama-graph.{h,cpp}` + `gguf-py/` + `conversion/gemma.py`	New model architecture `LLM_ARCH_GEMMA4_ASSISTANT` ("gemma4-assistant") — a NextN/MTP draft "assistant" head that shares the target Gemma4's KV cache and reads its post-final-norm hidden state. New tensors `LLM_TENSOR_NEXTN_PROJ_PRE`/`NEXTN_PROJ_POST` (`nextn.pre_projection`/`post_projection`) plus model-level `nextn_proj_pre`/`nextn_proj_post`; new hparams `n_embd_inp_impl` (input-embedding dim override, honoured by `n_embd_inp()`) and graph field `n_layer_nextn`. Python conversion registers `Gemma4AssistantForCausalLM`/`Gemma4UnifiedAssistantForCausalLM`. This is the headline new feature; it is a speculative-decoding / MTP mechanism, which this project tracks as deferred-by-policy (see Open TODOs / `spec-draft-backend-sampling` + MTP). Consumed entirely inside upstream-compiled TUs — loading a non-assistant GGUF is unaffected. No project source changes required to build; exposing MTP through the Java API remains the existing deferred TODO
~b9543–b9549	`common/chat.cpp` + new `models/templates/LFM2.5-8B-A1B.jinja`	LFM2 chat-template handling: prior-turn `reasoning_content` is now copied into the template's `thinking` field, and `<think>` reasoning extraction is gated on the template source actually containing `<think>` (and no longer on `enable_thinking`). New `LFM2.5-8B-A1B` template + parser test consolidation. Routing happens inside upstream-compiled `chat.cpp`; the project calls no `common_chat_params_init_lfm2*` symbol. Handled automatically when such a model is loaded; no project source or Java API changes required
~b9543–b9549	`common/arg.cpp` + `common/speculative.cpp` + `src/llama-graph.cpp`	`common_params_handle_models()` mmproj auto-download now also requires `params.mmproj.path.empty() && params.mmproj.url.empty()` (an explicitly-specified mmproj is no longer re-downloaded). `speculative.cpp` MTP path adds a shared-memory fast path (`is_mem_shared = llama_get_ctx_other(ctx_dft) == ctx_tgt`) that skips the catch-up decode and reuses the target position for draft tokens (Gemma4 assistant), and switched to `llama_model_n_embd_out()` for the MTP row width. `llama-graph.cpp` moved the `set_input_kq_mask` / `can_reuse_kq_mask` calls out of the k-idxs-buffer guard (iswa/hybrid-iswa mask bugfix). All inside upstream-compiled TUs; no project source changes required
~b9543–b9549	`tools/server/server-context.cpp` (project-linked)	The one project-linked server TU changed: now `#include`s `ggml-cpp.h` and `../../src/llama-ext.h`; sets `cparams.ctx_other = ctx_tgt` for MTP draft/MTP contexts; moved the `ctx_dft_seq_rm_type = common_context_can_seq_rm(...)` assignment to after context init (guarded by `if (ctx_dft)`); downgraded the spec memory-measure failure log from `SRV_ERR` to `SRV_WRN`; and gated the mtmd draft-processing block on `llama_get_ctx_other(ctx_dft) != ctx_tgt`. All changes are internal to the TU and the new includes resolve against the FetchContent'd `src/` and `ggml` headers. Compiles into `jllama` unchanged from the project's side. No project source changes required
~b9543–b9549	`.github/workflows/docker.yml` (upstream CI)	Upstream's `cuda13` Docker image bumped from CUDA `13.1.1` to `13.3.0`. Upstream's own CI only; this project ships its own `publish.yml` and pins CUDA 13.2 via `.github/build_cuda_linux.sh` (see CLAUDE.md "Upgrading CUDA Version"). No impact
~b9543–b9549	project `CMakeLists.txt` (pre-existing latent bug, fixed in this bump)	Not an upstream change — surfaced while build-testing this bump locally. The OS/arch detection block invoked `net.ladenthin.llama.OSInfo`, but the class had moved to `net.ladenthin.llama.loader.OSInfo` in the earlier layered-package restructure, so `cmake -B build` failed with "Could not determine OS name" on any host that does not pass `-DOS_NAME`/`-DOS_ARCH` explicitly (CI does, which is why it went unnoticed). Fixed both `execute_process` invocations (`--os` and `--arch`) to the `loader.OSInfo` FQN. Same stale-FQN-after-restructure class as the earlier `spotbugs-exclude.xml` / PIT-`targetClasses` repairs — the standing reminder to re-validate every FQN-bearing config after a package move now also covers `CMakeLists.txt`
~b9543–b9549	upstream build / verification	Local build with `GIT_TAG b9549` verified clean on Linux x86_64: `cmake -B build -DBUILD_TESTING=ON` configures cleanly (after the `loader.OSInfo` FQN fix above), `cmake --build build --config Release -j$(nproc)` links `libjllama.so` + `jllama_test` with zero warnings on any project translation unit (incl. the changed `server-context.cpp`), and `ctest --test-dir build --output-on-failure` reports 435/435 tests passing. All upstream breaking changes in this range are absorbed inside upstream-compiled translation units; no project C++ source edits were required for the version bump itself
~b9549–b9553	`common/sampling.h` + `common/sampling.cpp` + `common/arg.cpp` + `common/common.cpp` + `tools/server/server-task.cpp`	`common_sampler_types_from_names()` dropped its `bool allow_alt_names` parameter — the signature is now `common_sampler_types_from_names(const std::vector<std::string> & names)`. The body was rewritten to (a) auto-generate kebab-case (`top-k`) and no-dash (`topk`) aliases from the canonical snake_case names, plus misc aliases (`nucleus`→top_p, `temp`→temperature, `typ`→typical_p), and (b) lowercase the input so matching is case-insensitive; aliases are now always accepted (the old gate is gone). All three call sites were updated upstream (`arg.cpp` / `common.cpp` dropped the `, true` arg; `server-task.cpp` dropped the `, false` arg). Project impact: none at the source level — `grep -rn common_sampler_types_from_names src/main/cpp src/test/cpp` returns zero matches; the symbol is reached only through the upstream-compiled `server-task.cpp` linked into `jllama`. New behaviour exposed for free: because `server-task.cpp` previously passed `allow_alt_names=false`, the project's `InferenceParameters` `samplers` JSON array only matched canonical names like `top_k`; it now also accepts `top-k` / `topk` / `nucleus` / `temp` / `typ` and is case-insensitive (`TOP_K`, `Min-P`). Pinned by 5 new `ParamsFromJsonCmpl.Samplers_*` tests in `test_server.cpp`
~b9549–b9553	`src/llama-kv-cache.cpp` + `src/llama-kv-cache.h` + `src/llama-kv-cells.h`	KV-cache shared-cells refactor (continues `TAG_KV_CACHE_SHARE_CELLS`, used by the Gemma4-assistant MTP head): the `v_cells` member changed from a by-value `std::vector<llama_kv_cells>` to a `std::shared_ptr<llama_kv_cells_vec> v_cells_impl` plus a `llama_kv_cells_vec & v_cells` reference, so a target cache now views the source cache's cells instead of copying them in `apply_ubatch()`; the constructor also clamps `kv_size` down to the shared source's size. New type alias `using llama_kv_cells_vec = std::vector<llama_kv_cells>;` in `llama-kv-cells.h`. All internal `src/` headers the JNI build does not include (the project pulls public `llama.h` / `llama-cpp.h`, never `llama-kv-cache.h` / `llama-kv-cells.h`) — verified via `grep -rn "llama_kv_cells\|llama-kv-cache" src/main/cpp src/test/cpp` → zero matches. No project source changes required
~b9549–b9553	`conversion/mistral.py` + `convert_hf_to_gguf.py`	Python conversion-script robustness only: `hparams["llama_4_scaling"]` and `"moe" in hparams` replaced with `hparams.get(...)` / `is not None` guards so a present-but-null key no longer crashes conversion. Python tooling, not part of the JNI build. No impact
~b9549–b9553	upstream build / verification	Local build with `GIT_TAG b9553` verified clean on Linux x86_64: `cmake -B build -DBUILD_TESTING=ON` configures cleanly, `cmake --build build --config Release -j$(nproc)` links `libjllama.so` + `jllama_test` with zero warnings on any project translation unit, and `ctest --test-dir build --output-on-failure` reports 440/440 tests passing (435 prior + 5 new `Samplers_*` tests). The sole breaking change in this range (the `common_sampler_types_from_names` signature) is absorbed inside upstream-compiled translation units; no project C++ source edits were required for the version bump itself
~b9553–b9555	`.devops/intel.Dockerfile` + `ggml/src/ggml-metal/ggml-metal-device.cpp` + `tests/test-backend-ops.cpp`	Tiny maintenance bump — no API change and no new feature. (1) `intel.Dockerfile`: Intel GPU userspace driver pins bumped (IGC `v2.20.5`→`v2.34.4`, compute-runtime `25.40.35563.10`→`26.18.38308.1`, IGDGMM `22.8.2`→`22.10.0`) with the old multi-GPU-safe versions commented out; upstream's own Docker image only — this project ships its own `publish.yml` and does not consume `.devops/`. No impact. (2) `ggml-metal-device.cpp`: bugfix to the Metal im2col pipeline selector — the standard-vs-`_ext` kernel choice now keys off the actual conv-kernel footprint (`KH*KW`, with `KH = is_2D ? ne01 : 1`, `KW = ne00`) instead of the raw `ne00*ne01` product, fixing kernel selection for 1-D convolutions. Backend-internal Metal TU compiled via FetchContent; no API surface visible to `jllama.cpp`, and only affects the macOS/Metal backend at runtime. (3) `tests/test-backend-ops.cpp`: one extra `test_im2col` case (`{3000,384,1,1}` / `{3,384,384,1}`) added — upstream test only, not linked into the JNI build. No project source changes required; no new Java-API-exposable feature. Build verification deferred to CI (`publish.yml`) / a developer host as usual
~b9555–b9621	`ggml/include/ggml.h` + `ggml/src/ggml.c` + `ggml/src/ggml-cuda/gated_delta_net.cu` + `ggml/src/ggml-metal/ggml-metal.metal` + `ggml/src/ggml-vulkan/vulkan-shaders/gated_delta_net.comp`	`ggml_gated_delta_net` state tensor reshaped again: the 3D `(S_v*S_v*H, K, n_seqs)` layout is now the 4D `[S_v, S_v, H, n_seqs]` with an explicit `int64_t K` seventh parameter (snapshot count, K=1 is final-state-only). Signature: `ggml_gated_delta_net(ctx, q, k, v, g, beta, state, K)` (was 6-argument). Snapshot-slot ordering also flipped to most-recent-first. Internal Qwen3.5 / Qwen3-Next recurrent-attention kernel; project does not call `ggml_gated_delta_net` directly — no project source changes required
~b9555–b9621	`ggml/include/ggml.h`	New `ggml_col2im_1d(ctx, a, s0, oc, p0)` function and `GGML_OP_COL2IM_1D` enum value added; `GGML_OP_COUNT` incremented 96 → 97. Additive; not called by project — no project source changes required
~b9555–b9621	`common/fit.h` + `tools/server/server-context.cpp`	`common_get_device_memory_data()` return type changed: now returns `common_device_memory_data_vec` (typedef for `std::vector<common_device_memory_data>`). New `common_device_memory_data` struct carries `.total`, `.free`, `.model`, `.context`, `.compute` fields directly (previously the caller reached them via `.mb.model` etc.). `fit.h` also dropped its `#include "ggml-backend.h"` and `#include "../src/llama-ext.h"` lines (those types are no longer needed at the header level). Consumed exclusively in upstream-compiled `server-context.cpp` (field-accessor update from `.mb.model` → `.model` etc. was applied upstream); project does not include `fit.h` or call `common_get_device_memory_data()` directly — no project source changes required
~b9555–b9621	`tools/mtmd/mtmd-helper.h` + `tools/mtmd/mtmd-helper.cpp` + `tools/server/server-common.cpp`	`mtmd_helper_bitmap_init_from_file()` and `mtmd_helper_bitmap_init_from_buf()` return type changed: both now return a `mtmd_helper_bitmap_wrapper` struct (containing `bitmap` + `video_ctx` fields) instead of `mtmd_bitmap*`. All call sites were updated in upstream `server-common.cpp`. Project does not call these functions from `src/main/cpp/` (verified via grep: zero matches) — no project source changes required
~b9555–b9621	`tools/mtmd/mtmd-helper.h` + `tools/mtmd/mtmd-helper.cpp`	New video pipeline: `mtmd_helper_video_context`, `mtmd_helper_video_*` API family (init/free/decode), ffmpeg-based frame extraction. New `--video` CLI flag in `common/arg.cpp`; new `input_video` content type in `server-common.cpp`. Multimodal helper additions flow through the upstream-compiled `mtmd-helper.cpp` and `server-common.cpp`; project does not reference any `mtmd_helper_video_*` symbol — no project source changes required. Could be exposed in a future Java API as `InferenceParameters.setVideoPath(String)`
~b9555–b9621	`common/common.h`	New `common_params` fields: `path_prompts_log_dir` (prompt-logging output directory, string) and `mtmd_batch_max_tokens` (multimodal batch token limit, default 1024). Both additive with harmless defaults. Not surfaced by `ModelParameters` today — could be added in a future enhancement. No project source changes required
~b9555–b9621	`src/llama-ext.h`	New EAGLE3 speculative-decoding support APIs: `llama_set_embeddings_layer_inp(ctx, lid, value)`, `llama_get_embeddings_layer_inp(ctx, lid)`, `llama_model_target_layer_ids(model)` → `const int32_t*`, `llama_model_target_layer_ids_n(model)` → `uint32_t`. New `LLM_ARCH_EAGLE3` model architecture; new `llama_model_eagle3` struct in upstream model sources. EAGLE3 enables full encoder+decoder graph implementation for speculative decoding. All consumed inside upstream-compiled `speculative.cpp` and model TUs; project does not reference any of these symbols — no project source changes required. Could be exposed later as a speculative-decoding backend type in `ModelParameters`
~b9555–b9621	`src/llama-graph.h` + `src/llama-graph.cpp`	`llm_graph_result::set_outputs()` signature changed: now takes a `const llm_graph_params &` parameter (was no-parameter). New `t_layer_inp` vector added to `llm_graph_result` for layer-input embedding extraction (used by EAGLE3). Internal graph-building API; not called from project sources — no project source changes required
~b9555–b9621	`src/llama-context.cpp`	`llama_context` now initializes `embeddings_layer_inp` storage for EAGLE3 layer-input extraction; `n_outputs_max` is forced to `n_batch` when `llama_model_has_encoder()` returns true (encoder models always need all outputs). Internal context lifecycle; no project sources reference these fields — no project source changes required
~b9555–b9621	`vendor/cpp-httplib/httplib.h` + `httplib.cpp`	cpp-httplib bumped to v0.47.0. Compiled automatically via FetchContent — no project source changes required
~b9555–b9621	`ggml/src/ggml-cuda/ggml-cuda.cu`	`ggml_concat` on CUDA now handles F16, BF16, I8, I16, I32, I64 element types in addition to F32; `active_count` tracking added to CUDA context to prevent memory leak from lazy `cudaMemGetInfo` context creation. Internal CUDA backend, no project changes required
~b9555–b9621	`ggml/src/ggml-vulkan/` + Vulkan shaders	New `VK_VALVE_shader_mixed_float_dot_product` extension support for F16→F32 fused dot products (`dot2_f16`) in flash attention and GEMM matmul. Internal Vulkan backend, no project changes required
~b9555–b9621	`ggml/src/ggml-opencl/` + OpenCL kernels	New Q5_0 and Q5_1 GEMM/GEMV noshuffle kernels for Qualcomm Adreno GPUs. Internal OpenCL backend (affects `opencl-android-aarch64` classifier build only); no project source changes required
~b9555–b9621	`ggml/src/ggml-cuda/ssm-scan.cu`	Added `__syncthreads()` before the final reduction stage to prevent shared-memory race conditions on multi-warp SSM scan. Bug fix, internal CUDA backend, no project changes required
b9621–b9637	`common/chat.cpp`	New Cohere2 MoE ("North Code") chat parser `common_chat_params_init_cohere2moe` + auto-detection (template containing `<\|START_TEXT\|>` and `<\|START_ACTION\|>`). Purely additive — compiled in the `chat.cpp` TU and reached through the existing specialized-template path, so the project's `oaicompat_chat_params_parse` picks it up automatically. No project source changes required. New feature: Cohere2 MoE reasoning + JSON tool-call chat support
b9621–b9637	`common/jinja/runtime.cpp`, `common/jinja/value.cpp`	Jinja chat-template engine fixes: filter aliases `count`→`length`, `d`→`default`, `e`→`escape`; negative-step slice start/stop defaults; `split` raises on empty separator; `replace('', x)` now expands between every char. Compiled into `common`; improves chat-template compatibility automatically. No project source changes required
b9621–b9637	`src/llama-arch.{h,cpp}`, `src/models/cohere2moe.cpp` (new), `src/models/models.h`, `src/llama-model.cpp`, `src/llama-model-saver.cpp`, `src/llama-vocab.cpp`	New `LLM_ARCH_COHERE2MOE` architecture (MoE + MTP/NextN) with `llama_model_cohere2moe`; `cohere2moe` tokenizer pre-type (maps to `LLAMA_VOCAB_PRE_TYPE_TINY_AYA`); Cohere2 dense path gains `ffn__s` NVFP4 scale tensors; tied-NVFP4-`output` assert relaxed to allow sidecar LM-head scales. Additive enum/struct internal to libllama; the project includes `llama.h`, not `llama-arch.h`/`models.h`, and switches on no arch enum. No project source changes required. New feature: loads North-Mini-Code GGUFs
b9621–b9637	`ggml/src/ggml-vulkan/` + shaders	Unary shaders consolidated into one templated `unary.comp`; new `EXPM1` Vulkan op; GLU push-constants reworked (per-dim strides + misalign offsets); fastdiv `L` values byte-packed to stay under the 128B push-constant limit. Internal Vulkan backend — the project builds CPU/CUDA/Metal/OpenCL only, never Vulkan. No project changes required
b9621–b9637	`tools/server/server-http.cpp`, `tools/ui/`, `scripts/ui-assets.cmake`	Optional gzip-compressed WebUI asset serving (`LLAMA_UI_GZIP`, `llama_ui_use_gzip()`). The project compiles `server-context/queue/task/models` but not `server-http.cpp` or `tools/ui`, so the HTTP/WebUI layer is absent from `jllama`. No project changes required
b9621–b9637	`tools/cli/cli.cpp`, `.devops/*.Dockerfile`, `.github/`, `conversion/`, `convert_hf_to_gguf_update.py`, `gguf-py/`, `models/templates/Cohere2MoE.jinja`, `docs/`, `tests/`	CLI preserved-token wiring, Docker image `docker.io/` prefixes, CI labeler/release tweaks, Python GGUF converters, the new model template asset, doc typos, and upstream tests. None are compiled into `jllama` or shipped by the project. No project changes required
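
Note on the b9621–b9637 `common/chat.cpp` row: the auto-detection described there keys off two marker strings appearing in the chat template. A minimal sketch of that kind of check follows, assuming a plain substring test. The function name `looks_like_cohere2moe_template` is hypothetical and this is not the upstream implementation, which may apply further conditions.

```cpp
#include <string>

// Hypothetical helper for illustration only (not part of llama.cpp):
// returns true when the template text contains both marker tokens that
// the changelog row names for Cohere2 MoE auto-detection.
inline bool looks_like_cohere2moe_template(const std::string & tmpl) {
    return tmpl.find("<|START_TEXT|>")   != std::string::npos &&
           tmpl.find("<|START_ACTION|>") != std::string::npos;
}
```

A check like this can be handy in a project-side test to confirm that a given GGUF's embedded template would be routed to the new parser, without touching project sources.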
