llama.cpp upstream breaking changes — version-range changelog

Per-version-range record of upstream API breaks observed in the range from b5022 to latest. Each row names the affected upstream files and the project-side fix, or says "no project changes required" when the break stayed inside an upstream-compiled translation unit.

Used during llama.cpp version bumps: when upgrading, scan this file from the row matching the current pinned version forward to the target, apply any rows marked as needing project source changes, and append a new row covering the upgrade range. See the "Upgrading/Downgrading llama.cpp Version" section in ../../CLAUDE.md for the upgrade workflow.
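The scan described above can be sketched as a small filter over the Version column. This is a minimal sketch, not project tooling: `VersionRange`, `parse_version_cell`, and `row_in_upgrade_path` are hypothetical names. It assumes an upgrade (pinned build lower than target), and the overlap rule is deliberately conservative, so a row whose range merely touches the target is still reported.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Build-number span of one changelog row, e.g. "~b7217–b7433" or "~b7433".
// Hypothetical type for illustration; not part of the project.
struct VersionRange {
    int first;
    int last;
};

// Reads "b<digits>" starting at `pos` and advances `pos` past it.
// Returns false (leaving `pos` untouched) when no build tag is found.
static bool read_build(const std::string & s, std::size_t & pos, int & out) {
    if (pos >= s.size() || s[pos] != 'b') {
        return false;
    }
    std::size_t i = pos + 1;
    int value = 0;
    while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) {
        if (i - pos > 9) {
            return false; // more than 9 digits would not fit an int safely
        }
        value = value * 10 + (s[i] - '0');
        ++i;
    }
    if (i == pos + 1) {
        return false; // 'b' with no digits after it
    }
    pos = i;
    out = value;
    return true;
}

// Parses a Version cell: optional '~', a build tag, then optionally an
// en dash (or ASCII hyphen) and a second build tag.
bool parse_version_cell(const std::string & cell, VersionRange & out) {
    static const std::string en_dash = "\xE2\x80\x93"; // UTF-8 for U+2013
    std::size_t pos = 0;
    if (!cell.empty() && cell[0] == '~') {
        pos = 1;
    }
    int first = 0;
    if (!read_build(cell, pos, first)) {
        return false;
    }
    if (pos == cell.size()) {
        out = VersionRange{first, first}; // single-version row
        return true;
    }
    if (cell.compare(pos, en_dash.size(), en_dash) == 0) {
        pos += en_dash.size();
    } else if (cell[pos] == '-') {
        pos += 1;
    } else {
        return false;
    }
    int last = 0;
    if (!read_build(cell, pos, last) || pos != cell.size() || last < first) {
        return false;
    }
    out = VersionRange{first, last};
    return true;
}

// True when a row should be reviewed for an upgrade from `pinned` to `target`:
// the row ends after the pinned build and starts at or before the target.
bool row_in_upgrade_path(const VersionRange & row, int pinned, int target) {
    return row.last > pinned && row.first <= target;
}
```

For an upgrade from b9004 to b9022, a row tagged `~b9004–b9016` is selected and a row tagged `~b8994–b9004` is skipped, since the pinned build already includes it.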

Version File Change
~b7217–b7433 common/common.h, include/llama-cpp.h common_init_result became common_init_result_ptr; access changed to ->model() / ->context() / ->free_context()
~b7433 common/arg.h n_parallel default changed to sentinel -1 (auto); Java bindings must resolve to 1 before model load
~b7217–b7783 common/arg.h → common/download.h common_remote_get_content and common_remote_params split into new download.h; headers changed from vector<string> to vector<pair>
~b7783 common/common.h build_info string moved into common.h; local definition must be removed
~b7783–b7858 common/chat.h common_chat_syntax renamed to common_chat_parser_params; to_json_oaicompat<json>() template removed (no template arg); ensure_tool_call_ids_set() → set_tool_call_ids()
~b7858–b7864 common/speculative.h Full redesign: common_speculative_init(ctx_tgt, ctx_dft) → common_speculative_init(params_speculative, ctx); common_speculative_gen_draft → common_speculative_draft; new common_speculative_accept(); common_speculative_params struct replaced by common_params_speculative; draft model loaded via llama_model_load_from_file into llama_model_ptr
~b7858–b7864 common/common.h params_speculative: .model.path/.hf_repo replaced by .has_dft()/.mparams_dft; new .model_dft and .cparams_dft fields; speculative.type enum added (COMMON_SPECULATIVE_TYPE_NONE)
~b7858–b7864 server.hpp (internal) slot_action.slot_id → slot_action.id_slot; llama_init_dft removed from server_context; model_dft changed from llama_model* to llama_model_ptr; slot.ctx_tgt/ctx_dft removed
~b7864 common/mtmd.h mtmd_init_params.verbosity field removed
~b7904–b8190 common/common.h params_base.model_alias changed from std::string to a container; use *model_alias.begin() instead of direct string cast
~b8778–b8808 tools/mtmd/mtmd.h MTMD_DEFAULT_IMAGE_MARKER macro removed; mtmd_image_tokens_get_nx/ny deprecated; new mtmd_decoder_pos struct + mtmd_image_tokens_get_decoder_pos(); mtmd_context_params_default() now sets image_marker = nullptr (throws "custom image_marker is not supported anymore" if non-null); upstream server adds randomized get_media_marker() in server-common.h — our server.hpp is unaffected since it does not include that header and uses mtmd_default_marker() consistently
~b8808–b8831 project CMakeLists.txt CMake target common renamed to llama-common; update target_link_libraries for jllama and jllama_test
~b8808–b8831 common/common.h → new common/build-info.h build_info std::string removed; replaced by llama_build_info() (const char*) in new build-info.h; add #include "build-info.h" in server.hpp and utils.hpp; call sites: std::string(llama_build_info()) in server.hpp (6×), llama_build_info() in jllama.cpp (1×) and utils.hpp (1×)
~b8808–b8831 ggml/src/ggml.c New ggml_graph_next_uid() calls _InterlockedIncrement64 via <intrin.h> on x86; intrinsic unavailable on 32-bit MSVC; fix: src/main/cpp/compat/ggml_x86_compat.c provides __cdecl _InterlockedIncrement64 via InterlockedIncrement64 (CMPXCHG8B), added to ggml-base via target_sources guarded by MSVC AND CMAKE_SIZEOF_VOID_P EQUAL 4
~b8838–b8841 src/llama-model.h Attention bias fields renamed: bq → wq_b, bk → wk_b, bv → wv_b, bo → wo_b, bqkv → wqkv_b; internal to llama.cpp, no impact on this project
~b8841–b8854 common/common.h common_params::clear_idle renamed to cache_idle_slots; new common_context_seq_rm_type enum + common_context_can_seq_rm() replacing common_speculative_is_compat(); get_model_endpoint() → common_get_model_endpoint()
~b8841–b8854 tools/mtmd/mtmd.h + mtmd-helper.h mtmd_decoder_pos gains z field; mtmd_image_tokens_get_decoder_pos() + mtmd_helper_image_get_decoder_pos() gain new pos_0 parameter
~b8841–b8854 project utils.hpp / server.hpp server_tokens::get_text_tokens() split: get_tokens() returns raw const llama_tokens &; new get_text_tokens() returns filtered copy (removes LLAMA_TOKEN_NULL mtmd placeholders); save/load and context-shift call sites updated to get_tokens()
~b8854–b8887 common/chat.h common_chat_msg_diff_to_json_oaicompat removed; moved to tools/server/server-chat.cpp; project defines it locally in server.hpp; importing server-chat.cpp is impractical because it pulls in convert_transcriptions_to_chatcmpl → get_media_marker → server-common.cpp
~b8854–b8887 common/common.h common_params::reasoning_budget and reasoning_budget_message moved into common_params::sampling sub-struct as reasoning_budget_tokens; update: params_base.reasoning_budget → params_base.sampling.reasoning_budget_tokens
~b8854–b8887 common/fit.h (new) llama_params_fit and llama_memory_breakdown_print removed from include/llama.h; now common_fit_params / common_memory_breakdown_print in new common/fit.h; not used directly by project
~b8887–b8913 tools/server/server-chat.h convert_transcriptions_to_chatcmpl gained a new const common_chat_templates * tmpls second parameter; not called by project's server.hpp — handled automatically by upstream server-chat.cpp
~b8887–b8913 tools/server/server-task.cpp n_discard clamped to non-negative: params.n_discard = std::max(0, params.n_discard); applied in project's server.hpp after the json_value parse
~b8887–b8913 tools/server/server-common.cpp parallel_tool_calls now defaults to caps["supports_parallel_tool_calls"] instead of hardcoded false; handled automatically by upstream file
~b8887–b8913 common/chat.h New additive common_chat_prompt_preset struct and common_chat_get_asr_prompt() function; no project changes required
~b8887–b8913 common/common.h New string_starts_with(std::string_view, char) overload added; no project changes required
~b8887–b8913 tools/mtmd/mtmd.cpp Added LLAMA_ROPE_TYPE_NONE case to rope-type switch; internal fix, no project changes required
~b8913–b8953 common/debug.h base_callback_data renamed to common_debug_cb_user_data; template common_debug_cb_eval<false/true> replaced by plain common_debug_cb_eval; not used by this project
~b8913–b8953 tools/server/server-http.h New uploaded_file struct; files map type changed from map<string, raw_buffer> to map<string, uploaded_file>; upstream server sources compiled directly — no project impact
~b8913–b8953 src/llama-quant.cpp Default quantization ftype changed from LLAMA_FTYPE_MOSTLY_Q5_1 to LLAMA_FTYPE_MOSTLY_Q8_0; upstream only
~b8913–b8953 src/models/llama.cpp, qwen3.cpp, qwen3moe.cpp Removed duplicate ggml_mul for wo_s scale (now handled exclusively by build_attn); upstream only
~b8953–b8962 common/common.h struct cpu_params → struct common_cpu_params; cpu_get_num_physical_cores() → common_cpu_get_num_physical_cores(); cpu_get_num_math() → common_cpu_get_num_math(); not used directly by project
~b8953–b8962 common/common.h common_params_speculative fully restructured with nested sub-structs: .mparams_dft/.model_dft/.cparams_dft/.n_max/.n_min/.p_split/.p_min → .draft.mparams/.draft.model/.draft.cparams/.draft.n_max/.draft.n_min/.draft.p_split/.draft.p_min; ngram fields moved to .ngram_cache/.ngram_mod/.ngram_simple/etc sub-structs; not referenced by project directly
~b8953–b8962 common/arg.h is_sparam bool split into is_sampling + is_spec; set_sparam() split into set_sampling() + set_spec(); not used by project
~b8953–b8962 tools/server/server-task.cpp task_params::to_json() drops "speculative.n_max", "speculative.n_min", "speculative.p_min" from output; only "speculative.type" remains; test SlotParamsToJson.SpeculativeFields_Present updated accordingly
~b8953–b8962 common/speculative.h New public API: common_speculative_n_max() and common_speculative_n_min() added; server-context.cpp uses these instead of direct field access; no project changes required
~b8962–b8982 common/sampling.h common_sampler_accept 3rd param renamed accept_grammar → is_generated; semantics broadened: false now also skips reasoning budget update (not just grammar); no project call sites affected
~b8962–b8982 common/reasoning-budget.h Two overloads merged: prefill_tokens variant removed; new single overload takes initial_state = REASONING_BUDGET_IDLE; prefill now fed via llama_sampler_accept() loop after init; not called directly by project
~b8962–b8982 ggml/src/ggml-cuda/ssm-conv.cuh ggml_cuda_op_ssm_conv gained optional bias_add_node param; SSM_CONV + ADD + SILU fusion now supported; internal CUDA code, no project changes required
~b8962–b8982 common/speculative.cpp Draft token confidence check (p_min) moved before push to result: low-confidence tokens are now discarded entirely rather than included then ignored; behavior fix, no project changes required
~b8962–b8982 tools/server/server-context.cpp n_draft_total accounting moved to draft generation site instead of acceptance site (bug fix); upstream only
~b8982–b8994 ggml/src/ggml-cuda.cu ggml_backend_cuda_i struct: .get_tensor_2d_async and .set_tensor_2d_async function pointers were swapped (get pointed to set impl and vice versa); corrected; internal CUDA backend, no project changes required
~b8982–b8994 ggml/src/ggml-vulkan.cpp ggml_vk_buffer_write_2d_async and ggml_vk_buffer_write_2d gained a dpitch parameter; Vulkan now implements set_tensor_2d/get_tensor_2d in buffer interface; internal backend code, no project changes required
~b8982–b8994 common/speculative.cpp Checkpoint helpers renamed: draft_create_checkpoint → create_checkpoint, draft_restore_checkpoint → restore_checkpoint; ckpt_size field removed (size computed from context directly); internal speculative module, not called by project
~b8982–b8994 common/arg.cpp CLI option typo fixed: --spec--draft-p-split → --spec-draft-p-split (extra dash removed); CLI-only, no project changes required
~b8982–b8994 src/llama-mmap.cpp Windows large-file (>2 GB) fix: ftell/fseek replaced with _ftelli64/_fseeki64; upstream only
~b8982–b8994 tools/server/httplib.h cpp-httplib bumped to v0.43.2: Windows FILE_SHARE_WRITE fix, Linux DNS cancel race fix, mbedTLS close_notify fix; upstream server header, no project changes required
~b8982–b8994 tools/server/server-context.cpp New LLAMA_TRACE env variable enables slot acceptance tracing; upstream only
~b8994–b9004 ggml/src/ggml-vulkan/ggml-vulkan.cpp vk_fa_pipeline_state gains k_type/v_type fields; get_fa_tuning_params_coopmat2 now takes separate k_type/v_type params; mixed K/V type FA pipeline creation refactored to CREATE_FA_CM2_MIXED() macro; flash_attn_cm2.comp shader uses runtime FaTypeK/FaTypeV spec constants (spec constants 12–15 added); DECODEFUNC/NEEDS_INIT_IQ_SHMEM macros removed; internal Vulkan backend, no project changes required
~b8994–b9004 ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp get_mul_mat_fast_pipeline vectorized-path condition fixed: dst->ne[1] % 4 == 0 check removed (was preventing vectorization for non-multiple-of-4 batch sizes); internal WebGPU backend, no project changes required
~b8994–b9004 ggml/src/ggml-hexagon/ Hexagon HTP backend: FA exp2 half-precision option, unary-op non-contiguous tensor fix; internal DSP backend, no project changes required
~b8994–b9004 tools/server/webui/ Major frontend component reorganization (Svelte/TypeScript); purely UI, no C++ or JNI impact
~b9004–b9016 src/llama-io.h llama_io_read_i interface changed: read(size_t)→read(void*,size_t), read_to(void*,size_t) removed, new read_tensor(tensor,offset,size) added; llama_io_write_buffer/llama_io_read_buffer now batch backend tensor ops in destructors for performance; internal state-save/load path, not called by project
~b9004–b9016 tools/server/server-context.cpp Static server_get_checkpoint() (returns by value) renamed to server_prompt_checkpoint_update() (takes server_prompt_checkpoint & by reference, in-place update); compiled directly into jllama, no call site in project code
~b9004–b9016 common/arg.cpp + docs Speculative decoding CLI args renamed: --draft/--draft-n/--draft-max and --draft-min/--draft-n-min were REMOVED (handler throws std::invalid_argument at parse time, not just deprecated); other draft flags (--draft-p-min, --ctx-size-draft, --device-draft, --gpu-layers-draft, --model-draft) kept as aliases for new canonical --spec-draft-* names. Java impact: ModelParameters.setDraftMax/setDraftMin produced removed flags → threw at model load; fixed to canonical --spec-draft-n-max/--spec-draft-n-min. Other set*Draft methods updated to canonical names for forward compatibility. Env vars also renamed (LLAMA_ARG_DRAFT_MAX → LLAMA_ARG_SPEC_DRAFT_N_MAX, etc.)
~b9004–b9016 ggml/src/ggml-cuda/ggml-cuda.cu PCI bus ID detection replaced snprintf with cudaDeviceGetPCIBusId (buffer 16→32 bytes); HIP/MUSA compat headers gain cudaDeviceGetPCIBusId alias; internal CUDA backend
~b9004–b9016 ggml/src/ggml-opencl/ Adreno MoE MXFP4: new kernel_convert_block_mxfp4_trans4_ns/restore kernels in cvt.cl; new gemm_moe_mxfp4_f32_ns, gemv_moe_mxfp4_f32_ns, moe_reorder_b, moe_sort_by_expert kernel files; GPU-side router reorder replaces CPU-side preprocessing; q_img created for GEMM path; internal OpenCL backend
~b9004–b9016 ggml/src/ggml-vulkan/ggml-vulkan.cpp GGML_VK_MAX_NODES 8192 macro removed (node limit now determined differently); internal Vulkan backend
~b9004–b9016 ggml/src/ggml-webgpu/ ggml_webgpu_row_norm_pipeline_key gains src_type/dst_type fields; GGML_OP_NORM now supported alongside GGML_OP_RMS_NORM/GGML_OP_L2_NORM; row_norm.wgsl gains SRC_TYPE/DST_TYPE parameterization and NORM two-pass algorithm; internal WebGPU backend
~b9004–b9016 src/llama-model.cpp rope_yarn_log_mul get_key call changed from required=0.0f to required=false; fixes Mistral YaRN log_mul loading; internal model loading, no project impact
~b9004–b9016 common/chat.cpp common_chat_templates_generation_prompt() extracted from common_chat_templates_apply_jinja(); internal refactor, no API change
~b9016–b9022 src/llama-model.h + src/llama-model.cpp + src/models/ llama_model becomes abstract base with pure virtual methods (load_stats, load_hparams, load_vocab, load_tensors, load_arch_hparams, load_arch_tensors, build_arch_graph); load_arch() removed; new intermediate llama_model_base class provides concrete implementations; per-arch subclasses (e.g. llama_model_llama, llama_model_gemma2) in src/models/; factory llama_model_create(llm_arch, params) and llama_model_create(ml, params) replace direct instantiation; LLAMA_LOAD_LOCALS convenience macro added; public C API (llama_model_load_from_file etc.) unchanged — no project impact
~b9016–b9022 src/models/ Many model files renamed: cohere2-iswa.cpp → cohere2.cpp, gemma2-iswa.cpp → gemma2.cpp, gemma3n-iswa.cpp → gemma3n.cpp, gemma4-iswa.cpp → gemma4.cpp, mimo2-iswa.cpp → mimo2.cpp, openai-moe-iswa.cpp → openai-moe.cpp, pangu-embedded.cpp → pangu-embed.cpp, qwen3vl-moe.cpp → qwen3vlmoe.cpp, step35-iswa.cpp → step35.cpp; new model files added (deepseek2ocr.cpp, glm-dsa.cpp, granite-moe.cpp, hunyuan-vl.cpp, jina-bert-v2/v3.cpp, lfm2moe.cpp, llama-embed.cpp, mamba2.cpp, minicpm.cpp, mistral4.cpp, nemotron-h-moe.cpp, nomic-bert.cpp, nomic-bert-moe.cpp, phimoe.cpp); upstream only, no project changes required
~b9016–b9022 tools/server/server-context.cpp server_prompt_checkpoint_update (the renamed function from b9016) static function signature changed from returning by value to taking server_prompt_checkpoint & by reference; compiled directly into jllama, no project call site
~b9016–b9022 tools/server/server-tools.cpp New built-in get_datetime tool added via new server_tool_get_datetime struct in build_tools(); no project changes required (handled automatically by compiled upstream source)
~b9016–b9022 common/chat-auto-parser-generator.cpp force_tools variable removed from build_tool_parser_json_native, build_tool_parser_tag_json, build_tool_parser_tag_tagged; content before tool calls is now always p.optional(p.content(...)) regardless of tool_choice=required; upstream only, no project changes required
~b9016–b9022 common/chat-peg-parser.h/cpp New optspace(const std::string & tag) method added to common_chat_peg_builder; makes leading/trailing spaces in reasoning tags optional; upstream only, no project changes required
~b9016–b9022 common/reasoning-budget.cpp Forced token logit now set to +INFINITY (previously left at whatever the model computed); reasoning budget enforcement is now absolute; upstream only, no project changes required
~b9016–b9022 common/chat.cpp thinking_start_tag and thinking_end_tag now trimmed via trim_whitespace(); upstream only, no project changes required
~b9016–b9022 examples/diffusion/ diffusion_generate extracted from diffusion-cli.cpp to new diffusion.h/diffusion.cpp static library; enum names prefixed: ORIGIN → DIFFUSION_ALGORITHM_ORIGIN, TIMESTEP_BASED → DIFFUSION_TRANSFER_SCHEDULE_TIMESTEP_BASED etc.; examples only, no project changes required
~b9022–b9049 include/llama.h New LLAMA_STATE_SEQ_FLAGS_ON_DEVICE 2 macro added alongside existing LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY 1; enables on-device KV cache state save/restore without host round-trip via llama_state_seq_get_size_ext/get_data_ext/set_data_ext; no project call-site changes required (not used by JNI layer)
~b9022–b9049 src/llama-context.cpp State seq data format breaking change: llama_state_seq_get_data/set_data now prepend a 4-byte magic (0xaf143cd8) + 4-byte seq_id header; state data saved with ≤b9022 is incompatible with b9049+; internal I/O classes renamed llama_io_write_buffer → llama_io_write_host, llama_io_read_buffer → llama_io_read_host; new llama_io_write_device/llama_io_read_device classes for on-device paths; no project changes required (not called by JNI layer)
~b9022–b9049 ggml/include/ggml.h New ggml_op_hint enum (GGML_HINT_DEFAULT=0, GGML_HINT_SRC0_IS_HADAMARD=1) and ggml_mul_mat_set_hint() function added for FWHT (Fast Walsh-Hadamard Transform) support; used internally in llama-graph.cpp / llama-kv-cache.cpp; no project call-site changes required
~b9022–b9049 src/llama.cpp llama_backend_init() now auto-calls ggml_backend_load_all() if no backends are yet registered; ggml_backend_load_all() removed from common_params_parser_init() (was in common/arg.cpp); no project changes required — backend loading still happens correctly
~b9022–b9049 tools/server/server-context.cpp server_prompt_checkpoint_update() gained an on_device bool parameter; speculative checkpoints now use LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE; compiled directly into jllama from upstream source — no project call-site changes required
~b9022–b9049 src/llama-model.cpp Unsupported model architecture now throws std::runtime_error instead of calling GGML_ABORT; allows callers to catch unknown-arch errors gracefully; no project changes required
~b9022–b9049 ggml/CMakeLists.txt GGML version bumped 0.10.2 → 0.11.0; no project changes required
~b9022–b9049 vendor/cpp-httplib/ Updated to 0.43.3: str2tag converted to iterative loop (eliminates recursion stack depth risk), res.body.reserve now OOM-safe; upstream server header, no project changes required
~b9049–b9071 common/chat.h contains_media() method added to common_chat_msg; to_json_oaicompat() now forces text concatenation when message contains media markers; additive change, no project impact
~b9049–b9071 src/llama-arch.h/cpp + src/llama-hparams.h New LLM_KV_ATTENTION_VALUE_SCALE KV key and f_attn_value_scale hparam field added for MiMo-V2 attention value scaling; additive, no project changes required
~b9049–b9071 src/llama.cpp llama_supports_gpu_offload() and llama_supports_rpc() now auto-call ggml_backend_load_all() if no backends are registered; behavior fix, no project changes required
~b9049–b9071 src/llama-context.cpp state_seq_set_data: removed too-strict seq_id matching guard that was gated on LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY; KV slot restorer now checks tensor shapes and view offsets before deciding to reallocate (avoids unnecessary realloc on shape-compatible updates); both are bug fixes, no project API changes required
~b9049–b9071 src/models/mimo2.cpp MiMo-V2 extended with MTP (Multi-Token Prediction) layer support via nextn_predict_layers; fused wqkv projection; attention_value_scale post-attention scaling; all internal model-loading changes, no project changes required
~b9049–b9071 ggml/src/ggml-sycl/ SYCL implementations added for CUMSUM, DIAG, FILL, SSM_SCAN, SOLVE_TRI ops; additive, no project changes required
~b9049–b9071 ggml/src/ggml-cuda/out-prod.cu CUDA outer-product uses cublasSgemmStridedBatched for batched path (dps2==1, ne2>1); HIP/MUSA compat headers gain the alias; performance improvement, no project changes required
~b9049–b9071 tools/mtmd/ MiniCPM-V 4.6 multimodal support added (PROJECTOR_TYPE_MINICPMV4_6, ViT merger graph, new tensor names); additive, no project changes required
~b9049–b9071 tools/server/webui/ LLM-based conversation title generation; CSS animation fill-mode-forwards fixes; UI-only changes compiled into upstream server, no project changes required
~b9071–b9094 ggml/src/ggml-cuda/allreduce.cu + allreduce.cuh (NEW) 2-GPU PCIe AllReduce pipeline for tensor parallelism (no NVLink required); requires Volta+ (sm70+); enabled via GGML_CUDA_ALLREDUCE env var (nccl/internal/none); compiled automatically via FetchContent, no project changes required
~b9071–b9094 ggml/src/ggml-cuda/snake.cu + snake.cuh (NEW) Fused CUDA Snake activation kernel (y = x + sin(a*x)^2 * inv_b) for BigVGAN/Vocos audio models; fuses 5-op chain MUL→SIN→SQR→MUL→ADD at graph level; F32/F16/BF16; compiled automatically, no project changes required
~b9071–b9094 ggml/src/ggml-cuda/ggml-cuda.cu Flash attention head size 192 (DKQ=192, DV=128) for MiMo-V2.5/V2.5-Pro/V2-Flash with GQA ratio 8/16; multi-GPU comm context refactored to ggml_backend_cuda_comm_context with try_allreduce function pointer; PCI bus IDs lowercased; compiled automatically, no project changes required
~b9071–b9094 ggml/src/ggml-sycl/ Q5_K reordered memory layout + MMVQ kernel for Intel GPUs; PAD op supports non-contiguous src0; dedicated growing K/V buffer for flash attention; all internal SYCL backend, no project changes required
~b9071–b9094 ggml/src/ggml-hexagon/ GATED_DELTA_NET and L2_NORM HVX-vectorized on Hexagon HTP backend; internal DSP backend, no project changes required
~b9071–b9094 src/models/sarvam.cpp (NEW) Sarvam-MoE model (sarvamai/sarvam-30b); reuses BailingMoeV2 arch; new vocab pre-type LLAMA_VOCAB_PRE_TYPE_SARVAM_MOE = 51; additive, no project changes required
~b9071–b9094 src/models/gemma4.cpp Gemma4 split gate/up experts: ffn_gate_up_exps now TENSOR_NOT_REQUIRED; fallback to separate ffn_gate_exps/ffn_up_exps; NVFP4 per_expert_scale folding; internal model-loading, no project changes required
~b9071–b9094 tools/server/server-context.h + server-context.cpp New get_model_info() method on server_context; /v1/models response now includes "n_ctx" field (value: slot_n_ctx); compiled from upstream sources, no JNI changes required (Java callers of model info APIs receive the new field transparently)
~b9071–b9094 tools/server/server-http.h + server.cpp handlers map moved from private to public in server_http_context; new register_gcp_compat() method exposes GCP/Vertex AI Prediction Protocol endpoint reading AIP_MODE/AIP_PREDICT_ROUTE/AIP_HEALTH_ROUTE/AIP_HTTP_PORT env vars; compiled from upstream sources, no project changes required
~b9071–b9094 tools/server/server-models.h + server.cpp Router child→parent model info propagation: new CMD_CHILD_TO_ROUTER_INFO command; setup_child_server() gains const json & model_info parameter; new update_loaded_info() method; server_model_meta gains loaded_info field; all internally consistent across compiled upstream sources, no project changes required
~b9071–b9094 common/reasoning-budget.cpp Forced token logit no longer set to +INFINITY; only competing tokens set to -INFINITY; internal sampler behavior change, no project changes required
~b9071–b9094 tools/server/webui/ Settings registry refactored (settings-config.ts/settings-fields.ts/settings-sections.ts merged into settings-registry.ts); MCP route #/settings/mcp → #/mcp-servers; settings route /settings/chat/[section] → /settings/[[section]]; UI-only, no project changes required
~b9094–b9102 ggml/src/ggml-cuda/allreduce.cu + allreduce.cuh Internal CUDA AllReduce pipeline refactored with ggml_cuda_ar_pipeline struct; ggml_cuda_ar_pipeline_init(devices, n_devices) / _free / _allreduce APIs; supports 2-GPU PCIe AllReduce without NCCL (Volta+ / sm70+); chunked kernel path (small tensors) vs copy-engine path (large tensors); GGML_CUDA_ALLREDUCE env = nccl/internal/none; env tuning vars GGML_CUDA_AR_COPY_THRESHOLD / GGML_CUDA_AR_COPY_CHUNK_BYTES / GGML_CUDA_AR_BF16_THRESHOLD; HIP/MUSA builds return nullptr stub; compiled automatically via FetchContent, no project changes required
~b9094–b9102 ggml/src/ggml-cuda/ggml-cuda.cu GGML_LOG_WARN_ONCE macro added; ggml_backend_cuda_comm_context gains try_allreduce fn pointer and ar_pipeline; three dispatch fns: try_allreduce_nccl, try_allreduce_internal, try_allreduce_butterfly; init chain: comm_init_nccl → comm_init_internal → comm_init_none; platform default Linux→NCCL, Windows→internal; no project changes required
~b9094–b9102 ggml/src/ggml-sycl/ggml-sycl.cpp + im2col.cpp + im2col.hpp New ggml_sycl_im2col_3d function; GGML_OP_IM2COL_3D now supported on Intel GPU via SYCL; 2D im2col kernel rewritten with tile-based IC_KH_KW thread decomposition; new SYCL_IM2COL_BLOCK_SIZE 256; additive, no project changes required
~b9094–b9102 ggml/CMakeLists.txt GGML version patch bumped 0.11.0 → 0.11.1; no project changes required
~b9094–b9102 common/sampling.cpp Bug fix in common_sampler_sample: set_logits now called at the top before backend-sampling check; backend sampling token-selection now scans all of cur_p.data to find matching token (instead of artificial 1-element array), fixing cur_p.selected for downstream n_probs; post-sampling probabilities now work correctly with backend sampling
~b9094–b9102 tools/server/server-context.cpp need_logits renamed to need_pre_sample_logits; only set when n_probs > 0 && !post_sampling_probs; backend sampling now works with post_sampling_probs; 0.0-probability tokens filtered from result.probs; compiled from upstream, no project JNI changes required
~b9094–b9102 src/llama-model.cpp n_vocab loading moved from llama_model_base::load_hparams() to per-model load_arch_hparams() (e.g. src/models/deepseek2.cpp, src/models/llama.cpp); internal model-loading refactor, no project changes required
~b9094–b9102 ggml/src/ggml-virtgpu/ggml-backend-device.cpp Gains #include <mutex> for std::once_flag; internal backend fix, no project changes required
~b9094–b9102 vendor/cpp-httplib/httplib.cpp + httplib.h Security fix: chunk-size parsing replaced strtoul with manual hex-digit scanning to prevent overflow and reject invalid chunk extensions; version bumped to 0.43.4; compiled automatically, no project changes required
~b9102–b9103 vendor/cpp-httplib/httplib.cpp + httplib.h cpp-httplib bumped to v0.44.0: (1) RFC 9110 §5.5 compliance — header field values are no longer percent-decoded by the recipient in parse_header; Location/Referer special-casing removed; callers that need URI-component decoding must call decode_uri_component() explicitly; (2) ThreadPool constructor is now exception-safe — if thread creation fails partway through, already-started workers are signalled to exit and joined before rethrowing, preventing std::terminate from joinable threads in the destructor; compiled automatically, no project changes required
~b9103–b9106 ggml/src/ggml-vulkan/ggml-vulkan.cpp + Vulkan shaders Vulkan flash attention refactored: pipeline_flash_attn_f32_f16 changed from a per-type array of maps to a single map; mixed K/V quant types (e.g. Q4_0 K + F16 V) now supported on all Vulkan FA paths (scalar, cm1, cm2) rather than coopmat2 only; per-type SPIR-V variants replaced by two generic modules (flash_attn_f32_f16 and flash_attn_f32_f16_int8) that select K/V type at runtime via FaTypeK/FaTypeV spec constants; new flash_attn_dequant.glsl contains aliased SSBO views and an uber dequantize4() switch; the K/V type mismatch guard removed from ggml_backend_vk_device_supports_op; internal Vulkan backend refactor, no project changes required
~b9103–b9106 ggml/src/ggml-cuda/argsort.cu Added #include <cuda/iterator> for CCCL ≥ 3.1 strided-iterator path; internal CUDA backend, no project changes required
~b9103–b9106 convert_hf_to_gguf.py Mistral Medium 3.5 mmproj support: n_embd_text now reads "dim" key instead of "hidden_dim"; negative img_break_tok_id placeholders resolved from tekken.json or tokenizer.json; conversion tool only, no project changes required
~b9106–b9134 common/arg.cpp CLI option --spec-draft-ctx-size / -cd / --ctx-size-draft REMOVED — throws std::invalid_argument at parse time; ModelParameters.setCtxSizeDraft() removed; no replacement (context size now managed internally by speculative engine)
~b9106–b9134 common/arg.cpp CLI option --spec-draft-replace / --spec-replace REMOVED — throws std::invalid_argument at parse time; no corresponding Java method existed
~b9106–b9134 common/speculative.h Full redesign: common_speculative_type enum values renamed DRAFT → DRAFT_SIMPLE, EAGLE3 → DRAFT_EAGLE3; common_params_speculative.type (single enum) → .types (vector); common_speculative_n_max() / common_speculative_n_min() REMOVED; new common_speculative_init(params, n_seq) no longer takes ctx; new common_speculative_begin(spec, seq_id, prompt), common_speculative_draft(spec), common_speculative_accept(spec, seq_id, n), common_speculative_process(spec, batch) signatures; common_speculative_draft_params struct added; server sources compiled directly, no project JNI changes required
~b9106–b9134 common/common.h New common_prompt_checkpoint struct (contains data_tgt + data_dft) replaces the old server_prompt_checkpoint in server-task.h; compiled from upstream server sources, no project JNI changes required
~b9106–b9134 tools/server/server-task.cpp task_params::to_json() renamed field "speculative.type" → "speculative.types" (now serialises the vector); test SlotParamsToJson.SpeculativeFields_Present updated accordingly
~b9106–b9134 include/llama.h New LLAMA_STATE_SEQ_FLAGS_NONE = 0 macro added; additive, no project changes required
~b9134–b9145 tools/server/server-common.cpp New continue_final_message boolean request field in oaicompat_chat_params_parse; vLLM/transformers-compatible alias for the prefill-assistant heuristic — when true, the last assistant message is extended without appending an end-of-turn token; mutually exclusive with add_generation_prompt=true (throws 400); compiled from upstream server sources; InferenceParameters.setContinueFinalMessage(boolean) added
~b9134–b9145 ggml/src/ggml-sycl/ Level Zero API integration for SYCL device memory allocation (GGML_SYCL_SUPPORT_LEVEL_ZERO build option, GGML_SYCL_ENABLE_LEVEL_ZERO runtime env); reduces system RAM usage on Intel dGPUs; internal SYCL backend, no project changes required
~b9134–b9145 ggml/src/ggml-opencl/ Q5_0 and Q5_1 MoE GEMM/GEMV kernels added for Adreno (Qualcomm) GPUs; internal OpenCL backend, no project changes required
~b9134–b9145 ggml/src/ggml-cuda/allreduce.cu AllReduce accumulation now routed through float intermediate for precision (avoids BF16 truncation); internal CUDA backend, no project changes required
~b9134–b9145 ggml/src/ggml-hexagon/ GGML_UNARY_OP_TANH added to Hexagon HTP backend; internal DSP backend, no project changes required
~b9134–b9145 ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp use_subgroup_matrix condition now also checks sg_mat_k > 0 && sg_mat_n > 0 and alignment; prevents crash on devices reporting subgroup matrix support with zero k/n; internal WebGPU backend, no project changes required
~b9145–b9150 ggml/src/ggml-vulkan/ggml-vulkan.cpp Bug fix: mul_mat_l_int[i] / mul_mat_m_int[i] / mul_mat_s_int[i] / mul_mat_id_l_int[i] / mul_mat_id_m_int[i] / mul_mat_id_s_int[i] were unconditionally set to true instead of mirroring the actual device pipeline capabilities from mul_mat_l[i] etc.; now properly initialized; internal Vulkan backend bug fix, no project changes required
~b9145–b9150 src/unicode.cpp New unicode_regex_split_custom_qwen35() function registered for the Qwen 3.5 tokenizer regex pattern; uses [\p{L}\p{M}]+ letter-plus-combining-mark runs vs. Qwen2's \p{L}+; additive internal tokenizer change, no project changes required
~b9145–b9150 ggml/src/ggml-cpu/ggml-cpu-riscv64-spacemit/ SpaceMIT RISC-V IME backend major refactor: IME2 kernels, expanded quantization (Q2_K, Q3_K, Q6_K, Q8_0, Q5_0, Q5_1, Q5_K, MXFP4), TCM (Tightly Coupled Memory) pool; new source files ime2_kernels.cpp, ime_env.cpp, repack.cpp, rvv_kernels.cpp, spine_mem_pool.cpp; guarded by GGML_CPU_RISCV64_SPACEMIT build flag; no project changes required
~b9150–b9151 common/log.h New LOG_TRC macro added at LOG_LEVEL_TRACE = 4 (between INFO=3 and DEBUG=5); LOG_LEVEL_DEBUG bumped from 4 to 5; new LOG_TRCV verbosity variant; additive, no project changes required
~b9150–b9151 common/common.h + common/common.cpp New common_params_print_info(const common_params &) function: prints verbosity level, per-device memory (name, total, free), and system info at LOG_INF level; replaces the two-line pattern LOG_INF("build_info: %s\n", llama_build_info()); LOG_INF("%s\n", common_params_get_system_info(params).c_str()); — updated in jllama.cpp
~b9150–b9151 common/common.cpp common_init() now unconditionally calls common_log_set_prefix(…, true) and common_log_set_timestamps(…, true) before setting the log callback; log output will always include prefix and timestamps unless explicitly disabled with --no-log-prefix / --no-log-timestamps
~b9150–b9151 common/arg.cpp --log-prefix and --log-timestamps now also accept negated forms --no-log-prefix / --no-log-timestamps (lambda receives a bool value); backing env vars renamed LLAMA_LOG_PREFIX → LLAMA_ARG_LOG_PREFIX and LLAMA_LOG_TIMESTAMPS → LLAMA_ARG_LOG_TIMESTAMPS; Java layer does not expose these, so no project changes required
~b9150–b9151 tools/server/server-common.h New SLT_TRC and SRV_TRC macros (emit at LOG_TRC level); additive, no project changes required
~b9150–b9151 tools/server/server-context.cpp New server_slot::t_print_last field + print_timings_tg() / print_timings_pp() methods: emit periodic in-flight token-generation and prompt-processing throughput to SLT_INF (throttled to ≥100 decoded tokens and ≥3 s interval); server_context_impl constructor now calls mtmd_helper_log_set unconditionally (was guarded by !is_resume); many SLT_INF/SRV_WRN downgraded to SLT_TRC/SRV_INF; compiled from upstream, no project JNI changes required
~b9150–b9151 tools/server/server-task.cpp Several SRV_WRN calls downgraded to SRV_INF; one SRV_WRN upgraded to SRV_ERR for failed state restore; compiled from upstream, no project changes required
~b9151–b9172 tools/mtmd/clip.h clip_has_whisper_encoder() removed from public API; not referenced by project — no changes required
~b9151–b9172 tools/server/CMakeLists.txt + scripts/webui-download.cmake (new) WebUI assets no longer committed (tools/server/public/ gitignored); provisioned at build time via HF bucket (LLAMA_USE_PREBUILT_WEBUI=ON default) or built from source (LLAMA_BUILD_WEBUI); project sets LLAMA_BUILD_WEBUI=OFF CACHE BOOL "" FORCE before FetchContent to skip asset download
~b9151–b9172 common/common.h common_params::webui default made conditional on LLAMA_WEBUI_DEFAULT_ENABLED macro (falls back to true when undefined); compiled server sources unaffected
~b9151–b9172 common/reasoning-budget.cpp common_reasoning_budget_clone rewritten to use llama_sampler_init properly; pure bug fix, no API change, no project changes required
~b9151–b9172 ggml/src/ggml-cuda/fattn-mma-f16.cuh + mma.cuh AMD RDNA3 WMMA flash attention support; new DATA_LAYOUT_I_MAJOR_SCRAMBLED, tile<16,16,half2,I_MAJOR_SCRAMBLED>, extended config tables; internal CUDA backend, no project changes required
~b9151–b9172 tools/server/server-chat.cpp Non-function Responses API tools now silently skipped (continue) instead of throwing; server behavior fix, no Java API change required
~b9172–b9198 project CMakeLists.txt Option LLAMA_BUILD_WEBUI renamed to LLAMA_BUILD_UI (and LLAMA_USE_PREBUILT_WEBUI → LLAMA_USE_PREBUILT_UI); upstream keeps a backward-compat shim that forwards the old cache variable with a DEPRECATION message, so this project's set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE) still works unchanged
~b9172–b9198 common/common.h common_params::webui / webui_mcp_proxy / webui_config_json deprecated in favour of ui / ui_mcp_proxy / ui_config_json; both pairs of fields are kept and synced by common/arg.cpp, compiled upstream sources unaffected; new common_params::ctx_type and cparams.n_rs_seq fields added (default LLAMA_CONTEXT_TYPE_DEFAULT / 0), additive
~b9172–b9198 common/common.cpp + common.h common_params_print_info gained optional print_devices parameter (default true); upstream tools/server/server.cpp passes !is_router_server to skip GPU enumeration on the router process; this project does not compile server.cpp, no impact
~b9172–b9198 common/speculative.h + speculative.cpp New enum value COMMON_SPECULATIVE_TYPE_DRAFT_MTP (count is now 9); new common_speculative_need_embd() API; MTP draft implementation added (common_speculative_state_draft_mtp); --spec-type draft-mtp CLI flag added in common/arg.cpp; additive, no project changes (could be exposed later as a ModelParameters enhancement)
~b9172–b9198 include/llama.h New enum llama_context_type { LLAMA_CONTEXT_TYPE_DEFAULT, LLAMA_CONTEXT_TYPE_MTP }; new llama_context_params::n_rs_seq (recurrent-state snapshots per seq for rollback) and ctx_type fields; new llama_n_rs_seq() accessor; all additive, default-zero, no project impact
~b9172–b9198 src/llama-ext.h (new) + src/llama-context.cpp New pre-norm embedding extraction path: llama_set_embeddings_pre_norm / llama_get_embeddings_pre_norm[_ith] APIs and an embd_pre_norm output buffer in llama_context; used by the MTP draft loop only, additive
~b9172–b9198 src/llama-memory-recurrent.cpp Recurrent-state rollback support: per-seq rs_idx snapshot index and set_rs_idx() helper; tensors widened to (1 + n_rs_seq) groups; seq_rm now rolls back via snapshot when within n_rs_seq bounds. Backwards-compatible when n_rs_seq == 0 (this project's default), no project changes
~b9172–b9198 tools/server/server-context.cpp Embedding endpoint default now reads params.embd_normalize (was hard-coded 2); compiled upstream, no project changes
~b9172–b9198 tools/server/CMakeLists.txt + new tools/ui/CMakeLists.txt WebUI asset wiring moved into a new llama-ui static library; tools/server now links llama-ui; project does not build the llama-server binary (only compiles server-context.cpp / server-queue.cpp / server-task.cpp / server-models.cpp directly into jllama), so no impact. HF bucket name renamed LLAMA_WEBUI_HF_BUCKET → LLAMA_UI_HF_BUCKET (old name still honoured)
~b9172–b9198 vendor/cpp-httplib/httplib.{h,cpp} Bumped to v0.45.0: RFC 9112 §6 message-body framing — requests without Content-Length / Transfer-Encoding no longer scan for stray body bytes on persistent connections (fixes #2450 keep-alive misframing); X-Forwarded-For parser falls back to the connection remote address when the header is empty/malformed; compiled automatically, no project changes
~b9172–b9198 ggml/CMakeLists.txt GGML version bumped 0.11.1 → 0.12.0; no project changes
~b9172–b9198 ggml/src/ggml.c + ggml-cuda/gated_delta_net.cu + ggml-metal/ggml-metal.metal + ggml-vulkan/vulkan-shaders/gated_delta_net.comp ggml_gated_delta_net state tensor reshaped from 2D (S_v*S_v*H, n_seqs) to 3D (S_v*S_v*H, K, n_seqs) where K is the snapshot slot count (K=1 is final-state-only, K>1 keeps last min(n_tokens, K) per-token snapshots); internal Qwen3.5 / Qwen3-Next recurrent-attention kernel, no project changes
~b9198–b9219 common/chat.{h,cpp} New common_chat_continuation enum (NONE/AUTO/REASONING/CONTENT); new common_chat_msg::render_content(delimiter) method; new continue_final_message field on common_chat_templates_inputs; new common_chat_continuation_parse() accepts both bool and "reasoning_content"/"content" strings; common_chat_template_generation_prompt() extracted; oaicompat_chat_params_parse refactored to route the prefill-assistant heuristic through the new continuation enum. Existing bool wire-format unchanged; the new string variants are exposed via InferenceParameters.setContinueFinalMessage(ContinuationMode)
~b9198–b9219 common/hf-cache.{h,cpp} + common/arg.cpp hf_cache::migrate_old_cache_to_hf_cache() and hf_file::size field removed; the migration call in common_params_parse_ex was dropped. Internal to arg.cpp, no project impact
~b9198–b9219 common/speculative.{h,cpp} + src/llama-ext.h + src/llama-context.{h,cpp} + src/llama-cparams.h llama_set_embeddings_pre_norm(ctx, value) → llama_set_embeddings_pre_norm(ctx, value, masked) (3rd bool arg distinguishes "embeddings for outputs only" from "embeddings for every token"); new cparams.embeddings_pre_norm_masked; new common_speculative_need_embd_pre_norm() API; MTP draft path now uses pre-norm extraction. Project does not call any of these APIs (speculative decoding is configured via ModelParameters only), no source changes required
~b9198–b9219 tools/server/server-task.{h,cpp} task_result_state ctor moved from header into .cpp — now seeds chat_msg via common_chat_parse("", true, …) when !echo so the assistant prefill is not echoed back as a delta; new bool echo field on chat_parser_params (default false, populated from request body via json_value(data, "echo", false)). Project compiles server-task.cpp from upstream and does not instantiate task_result_state directly, no source changes required
~b9198–b9219 tools/server/server-context.cpp + server-models.cpp New cors_proxy_enabled boolean field added to /props and /v1/models JSON responses (set from params.ui_mcp_proxy || params.webui_mcp_proxy). Additive, no Java consumer in this project
~b9198–b9219 upstream CMakeLists.txt Backward-compat shim widened: if(DEFINED LLAMA_BUILD_WEBUI AND NOT DEFINED LLAMA_BUILD_UI) → if(DEFINED LLAMA_BUILD_WEBUI); setting the old name now always forwards to the new one (and emits the existing DEPRECATION message). Project sets only LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE (CMakeLists.txt:107), behaviour unchanged
~b9198–b9219 ggml/src/ggml-cuda/ssm-conv.cu + top-k.cu Added kernel size 15 to SSM-conv launcher (now supports 3/4/5/9/15); top-k.cu includes <cuda/iterator> for CCCL ≥ 3.1; internal CUDA backend, no project changes
~b9198–b9219 ggml/src/ggml-sycl/ggml-sycl.cpp + vecdotq.hpp SYCL GEMM now falls back to direct MKL for small problems (gemm_flops < 256³); Q6_K dot product refactored to a single scalar fast-path helper vec_dot_q6_K_q8_1_impl_mmvq_scalar; internal SYCL backend, no project changes
~b9219–b9222 ggml/src/ggml-hexagon/ + htp/pad-ops.c (new) + htp/unary-ops.c Hexagon HTP backend gains GGML_OP_PAD (HVX + optional VTCM/DMA double-buffered, both zero-pad and circular-pad variants) and GGML_OP_TRI (HVX-vectorised triangular masking) support; new HTP_OP_PAD / HTP_OP_TRI opcodes; internal Qualcomm DSP backend, no project changes
~b9219–b9222 .devops/*.Dockerfile + .github/workflows/docker.yml OCI image labels (org.opencontainers.image.*) added via BUILD_DATE/APP_VERSION/APP_REVISION build args; new skip_s390x workflow_dispatch input; manifest annotations on docker buildx imagetools create; upstream packaging/CI only, no project changes
~b9222–b9245 common/common.h + common.cpp common_init_result(common_params &, bool model_only = false) and common_init_from_params(common_params &, bool model_only = false) gain an optional model_only flag that skips context/sampler/lora/warmup setup and returns only the loaded model. Additive with default value; no project call sites in src/main/cpp/, no source changes required
~b9222–b9245 common/common.h common_params_speculative_draft defaults retuned: n_max 16→3, p_min 0.75f→0.0f. Defaults only; Java ModelParameters sets these explicitly via JSON, so behaviour is unchanged for this project
~b9222–b9245 common/speculative.{h,cpp} common_speculative_impl::accept() virtual gains a 3rd bool is_other parameter; common_speculative_accept() now broadcasts the accepted-token count to every registered impl (with is_other=true for impls that did not generate the draft). common_speculative_impl_ngram_map_k ctor signature simplified (no longer takes common_params_speculative). Lots of new LOG_INF startup banners per impl. Internal to upstream-compiled server-context.cpp; no project call sites
~b9222–b9245 common/arg.cpp + common/common.cpp + tools/fit-params/fit-params.cpp --verbosity levels relabeled: level 4 now means "trace (more info)" and level 5 means "debug"; LOG_LEVEL_DEBUG constant value moved from 4 to 5. Direct params.verbosity >= 4 comparisons in upstream common.cpp and fit-params.cpp replaced with >= LOG_LEVEL_DEBUG. Project does not reference LOG_LEVEL_DEBUG or numeric verbosity thresholds in src/main/cpp/; no source changes required
~b9222–b9245 common/arg.cpp --spec-type duplicate-arg DEPRECATED warning suppressed (the flag legitimately accepts repeated values to form the comma-list). Behaviour-only
~b9222–b9245 common/ngram-map.cpp One per-draft LOG_INF downgraded to LOG_DBG. Log-level only
~b9222–b9245 src/llama-graph.h llm_graph_params::operator== adds a third disjunct so ubatches with both token and embd arrays present compare equal (graph reuse fix for MTP pre-norm path). Internal
~b9222–b9245 src/llama-memory-recurrent.{h,cpp} + src/llama-memory-hybrid.cpp + src/llama-memory-hybrid-iswa.cpp init_batch() now forces sequential split (split_seq) instead of equal split when n_rs_seq > 0 (recurrent-state rollback is incompatible with equal splits). Internal upstream model code, no project impact
~b9222–b9245 src/models/delta-net-base.cpp + src/models/models.h + src/models/qwen35.cpp llm_build_delta_net_base::keep_rs() helper removed; conv-state and recurrent-attn paths reworked to read cparams.n_rs_seq directly and loop K = n_rs_seq + 1 snapshot slots. Comment fix in qwen35.cpp MTP layer index. All internal upstream model code
~b9222–b9245 tools/server/server-context.cpp pos_min_thold lowered by one (pos_next - n_swa → pos_next - n_swa - 1); checkpoint trigger guard relaxed from n_past < slot.prompt.n_tokens() to <=; per-slot print_timings_pp/print_timings_tg lines split into separate SLT_INF calls; new graphs reused and draft acceptance lines; n_draft_total log moved from SLT_CNT to SLT_INF. Compiled upstream-as-is, no project changes
~b9222–b9245 ggml/src/ggml-cuda/mmvq.cu calc_nwarps table tweak: Q6_K returns 2 warps (was grouped with the 8-warp tier). Internal CUDA backend
~b9222–b9245 ggml/src/ggml-hexagon/ (htp/rope-ops.c, htp/unary-ops.c, htp-ops.h, main.c, ggml-hexagon.cpp) New HTP_OP_NORM opcode (mean+variance norm); rope-ops.c adds MROPE / IMROPE position-id support via new mrope_cache_init(). Internal Qualcomm DSP backend
~b9222–b9245 ggml/src/ggml-opencl/ (ggml-opencl.cpp, kernels/cvt.cl, six new gemm_moe_q{4,5,6}_k_f32_ns + gemv_moe_q{4,5,6}_k_f32_ns kernels) Adreno MoE pipeline extended to Q4_K / Q5_K / Q6_K (image1d_buffer_t transposed layout, dedicated convert/restore kernels, GEMM + GEMV paths). Internal OpenCL backend
~b9222–b9245 ggml/src/ggml-rpc/ggml-rpc.cpp last_graph_uid field moved from ggml_backend_rpc_context (per-backend) into ggml_backend_rpc_device_context (per-device) so multiple backends sharing a device reuse cached graphs. Internal RPC backend
~b9222–b9245 ggml/src/ggml-sycl/ggml-sycl.cpp New GGML_SYCL_USE_ASYNC_MEM_OP env (default 1) decouples async USM alloc/free from the graph path. Internal SYCL backend
~b9222–b9245 ggml/src/ggml-webgpu/ggml-webgpu.cpp + wgsl-shaders/gated_delta_net.wgsl Gated-delta-net shader gains a K snapshot-count param; per-slot snapshot write path added. Internal WebGPU backend
~b9222–b9245 convert_hf_to_gguf.py, convert_lora_to_gguf.py, examples/save-load-state/save-load-state.cpp, examples/llama-eval/*, tools/cli/README.md, tools/server/README.md, docs/speculative.md, docs/backend/SYCL.md Doc/example/tooling updates only. Not compiled by this project
~b9222–b9245 tools/ui/* WebUI source reorganisation (enum file renames *.ts → *.enums.ts, new chat components, Tailwind plugin imports). Project sets LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE in CMakeLists.txt, so the UI is never built; no impact
~b9245–b9264 src/llama-chat.{h,cpp} LLM_CHAT_TEMPLATE_HUNYUAN_OCR renamed to LLM_CHAT_TEMPLATE_HUNYUAN_VL (HunyuanOCR and HunyuanVL now share one template). Not referenced by project — no source changes required
~b9245–b9264 tools/mtmd/clip-impl.h + tools/mtmd/models/ PROJECTOR_TYPE_HUNYUANOCR removed and merged into PROJECTOR_TYPE_HUNYUANVL; hunyuanocr.cpp renamed to hunyuanvl.cpp; clip graph class clip_graph_hunyuanocr renamed to clip_graph_hunyuanvl. Not referenced by project — no source changes required
~b9245–b9264 tools/mtmd/clip.h clip_is_minicpmv() and clip_is_glm() removed from public API. Not referenced by project — no source changes required
~b9245–b9264 tools/mtmd/clip.h (struct clip_context_params) New bool no_alloc field added (initialized via mtmd_context_params_default()). Additive default-zero — no project changes required
~b9245–b9264 tools/mtmd/mtmd.h New mtmd_get_memory_usage() C++ API for estimating mmproj VRAM/RAM usage. Additive, not called by project
~b9245–b9264 tools/mtmd/clip-model.h New enum pad_style { PAD_NONE, PAD_CEIL, PAD_NEAREST } replacing the bool image_resize_pad flag (allows Pillow-byte-parity nearest-integer rounding for DeepSeek-OCR). Internal to mtmd, project links mtmd as-is
~b9245–b9264 common/common.h (struct common_params_speculative_draft) New bool backend_sampling = true field — offloads draft sampling to the backend. Additive default-on; Java ModelParameters doesn't set it, so the upstream default applies. Backend sampler auto-disables when split_mode == TENSOR in src/llama-context.cpp — safe
~b9245–b9264 common/speculative.cpp common_speculative_impl_draft_mtp now registers a per-seq backend sampler chain (top-k 10) on ctx_dft via llama_set_sampler; cleaned up in destructor. Falls back to CPU sampler if llama_set_sampler fails. Internal to upstream-compiled speculative module, no project call sites
~b9245–b9264 app/ (new) New optional unified llama binary (llama-app target) dispatching to serve/cli/completion/bench. Guarded by LLAMA_BUILD_APP=OFF default — project doesn't enable it
~b9245–b9264 tools/{cli,completion,llama-bench,server}/CMakeLists.txt Each tool split into a *-impl static library (the logic) plus a thin main.cpp wrapper; the main() in cli.cpp/completion.cpp/llama-bench.cpp/server.cpp is renamed to llama_cli/llama_completion/llama_bench/llama_server and now satisfies -Wmissing-declarations via a forward decl. Project does NOT compile any of these .cpp files — only server-context.cpp, server-queue.cpp, server-task.cpp, server-models.cpp (see CMakeLists.txt:237/:302) — so no impact
~b9245–b9264 tools/server/server-context.cpp Adds mmproj memory estimation: when params_base.fit_params is set, calls mtmd_get_memory_usage(mmproj_path, mparams) and adds the per-device cost into params_base.fit_params_target before common_init_from_params. Also calls mtmd_helper_log_set(common_log_default_callback, nullptr) once when !is_resume. Compiled upstream-as-is, no project call sites
~b9245–b9264 src/llama-context.cpp New llama_context::set_sampler() short-circuits with a one-shot LLAMA_LOG_WARN and returns false when model.split_mode() == LLAMA_SPLIT_MODE_TENSOR (backend sampling not supported with tensor split). Internal safety check, no project call sites
~b9245–b9264 common/arg.cpp New CLI flags --spec-draft-backend-sampling / --no-spec-draft-backend-sampling and env LLAMA_ARG_SPEC_DRAFT_BACKEND_SAMPLING to toggle the new backend_sampling field. Not exposed by ModelParameters; could be added later as a Java-side enhancement
~b9245–b9264 ggml/src/ggml-cuda/CMakeLists.txt + common.cuh + binbcast.cu, concat.cu, cpy.cu, fattn-*.cu, gated_delta_net.cu, getrows.cu, mean.cu, mmvf.cu, mmvq.cu, norm.cu, quantize.cu, reduce_rows.cuh, rope.cu, scale.cu, set-rows.cu, softcap.cu, ssm-conv.cu, ssm-scan.cu, sumrows.cu, topk-moe.cu, unary.cu New PDL (Programmatic Dependent Launch) infrastructure: GGML_CUDA_USE_PDL build flag (CUDART ≥ 11.8, non-HIP/MUSA); ggml_cuda_pdl_sync() / ggml_cuda_pdl_lc() device helpers (active on Hopper sm_90+); ggml_cuda_kernel_launch_params + ggml_cuda_kernel_launch() host template that calls cudaLaunchKernelEx with stream-serialization attribute when GGML_CUDA_PDL env var allows. Adds 90-virtual (Hopper) to default CMAKE_CUDA_ARCHITECTURES when CUDA ≥ 11.8. Internal CUDA backend, no project changes required
~b9245–b9264 ggml/src/ggml-metal/ggml-metal-{device,ops}.cpp + ggml-metal.metal New 4-element kernel_pad_*_4 variant (currently disabled — is_c4 = false); kernel_pad rewritten with 1024-element-per-block tiling for larger tensors; kernel_cpy_* rewritten to use tpitg rows-per-threadgroup batching; Q quantization cpy paths use 256-thread limit. Internal Metal backend
~b9245–b9264 ggml/src/ggml-hexagon/htp/ (hmx-matmul-ops.c, hmx-ops.h, matmul-ops.c, main.c) HMX matmul refactor: K-loop tiled in 32-tile blocks with Q6_activation_hf_mxmem_RR_deep; the out-stationary fallback path for large M·K·N was deleted; function renames hmx_mat_mul_permuted_w16a32 → hmx_matmul_f16_f32, hmx_mat_mul_permuted_qk_0_d16a32 → hmx_matmul_q_f32, hmx_mat_mul_permuted_w16a32_batched_params_t → hmx_matmul_f16_f32_batched_params_t. HMX power-up code reorganized (HAP_power_set_HMX_v2 now combines power-on + clock in one step for __HVX_ARCH__ ≥ 75). Internal Qualcomm DSP backend
~b9245–b9264 ggml/src/ggml-opencl/ggml-opencl.cpp Lazy kernel compilation: argsort and flash_attn programs are now built only when first needed (load_cl_kernels_argsort / load_cl_kernels_flash_attn called from supports_op); new device-supported probe in ggml_opencl_is_device_supported runs at registration time; renamed ggml_cl2_init/ggml_cl2_free → ggml_cl_init/ggml_cl_free; OpenCL contexts now live as long as the process. Internal OpenCL backend
~b9245–b9264 ggml/src/ggml-vulkan/vulkan-shaders/im2col.comp Refactor: precomputed base input coords and step deltas; running pointer/index for destination; one inlined unrolled loop iteration writes BLOCK_SIZE outputs per step. Internal Vulkan backend
~b9245–b9264 src/models/delta-net-base.cpp Renamed local variables (state_in_3d → s_3d, state_3d → s_3d_pad) when reshaping the recurrent state; behaviour unchanged
~b9245–b9264 tools/mtmd/mtmd-image.cpp img_tool::resize() takes a pad_style enum (was bool add_padding); new PAD_NEAREST rounding path for Pillow byte-parity; mtmd_image_preprocessor_deepseekocr::preprocess rewritten with static constexpr resolution table and RESIZE_ALGO_BICUBIC_PILLOW + PAD_NEAREST. Internal mtmd, project links as-is
~b9245–b9264 tools/mtmd/models/deepseekocr.cpp Extracted build_sam(ggml_tensor *inp_raw) member function from the monolithic build path; FA mask casting to F16 only when flash_attn_type == CLIP_FLASH_ATTN_TYPE_ENABLED. Internal
~b9245–b9264 conversion/hunyuan.py, gguf-py/gguf/constants.py, gguf-py/gguf/tensor_mapping.py HunyuanOCR / HunyuanVL unified in conversion: VisionProjectorType.HUNYUANOCR removed; HunYuanVLForConditionalGeneration registers a single HunyuanVLVisionModel + HunyuanVLTextModel; vit.perceive.* tensor mappings now only mention HunyuanVL. Python tooling, not compiled by project
~b9245–b9264 CMakeLists.txt (upstream) New LLAMA_BUILD_APP option (default OFF); deprecation shims for LLAMA_BUILD_WEBUI/LLAMA_USE_PREBUILT_WEBUI → LLAMA_BUILD_UI/LLAMA_USE_PREBUILT_UI preserved. Project's set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE) still works unchanged
~b9245–b9264 .devops/*.Dockerfile, .github/workflows/build-and-test-snapdragon.yml, scripts/snapdragon/, docs/backend/snapdragon/, tools/cli/README.md, tools/server/README.md, tools/mtmd/tests/ Docker images add conversion/ dir; snapdragon toolchain bumped v0.3 → v0.6 with +dotprod+i8mm; mtmd test rewritten to use CER/chrF metrics; doc-only updates. Not compiled by project
~b9264–b9279 tools/server/server-context.cpp Slot-info JSON adds three additive fields (n_prompt_tokens, n_prompt_tokens_processed, n_prompt_tokens_cache) on each in-flight task; server_context_impl::destroy() now resets spec / ctx_dft / model_dft BEFORE llama_init.reset() to avoid use-after-free when a draft model holds back-references into the target context. Compiled directly into jllama from upstream — no project source changes required
~b9264–b9279 tools/server/server-models.cpp Adds #include <cstdlib> and a LLAMA_APP_CMD env-var lookup in server_model_meta::update_args() to re-inject the unified-binary subcommand into router-spawned child argv. Env var is only set by the new llama-app binary (which this project does not build), so the lookup harmlessly returns null and the code path is a no-op. Compiled upstream-as-is, no project changes
~b9264–b9279 src/llama-vocab.cpp New hybriddna BPE tokenizer model (DNA k-mer tokenization with <dna>…</dna> tag handling, k=6, OOV fallback) registered as a BPE variant; reached only when GGUF metadata declares tokenizer.model = "hybriddna". Adds a virtual destructor + virtual tokenize() to llm_tokenizer_bpe_session and a llm_tokenizer_hybriddna_session subclass; existing BPE callers unchanged. Additive, no project changes
~b9264–b9279 src/llama-graph.cpp llm_graph_input_attn_kv_iswa::set_input() / can_reuse() now guard the base and SWA tensor accesses behind if (self_k_idxs && self_k_idxs->buffer) / if (self_k_idxs_swa && self_k_idxs_swa->buffer). Fixes crashes on models with only-SWA or only-non-SWA attention layers. Internal, no project impact
~b9264–b9279 src/models/qwen35.cpp + src/models/qwen35moe.cpp MTP draft sub-graph now builds an inp_out_ids input and applies ggml_get_rows(cur, inp_out_ids) just before the head norm, so only the requested output rows are projected. Bug fix for MTP draft path; internal, no project changes
~b9264–b9279 ggml/src/ggml-backend.cpp ggml_backend_tensor_get_2d() fast-path condition fixed: now checks iface.get_tensor_2d == NULL (was incorrectly checking set_tensor_2d), so multi-copy gets correctly fall back to the per-copy loop when the backend lacks get_tensor_2d. Bug fix, no project changes
~b9264–b9279 ggml/src/ggml-vulkan/ (ggml-vulkan.cpp, new vulkan-shaders/snake.comp, vulkan-shaders-gen.cpp) New Vulkan Snake activation fusion: detects the 5-op chain MUL → SIN → SQR → MUL → ADD (matching CUDA b9094 introduction) and dispatches a single fused snake_{f32,f16,bf16} kernel y = x + sin(a*x)^2 * inv_b. New ggml_vk_can_fuse_snake() validates contiguity, 2D shape, and broadcast operands [1, C, 1, 1]. Internal Vulkan backend, no project changes
~b9264–b9279 ggml/src/ggml-metal/ggml-metal-ops.cpp + ggml-metal.metal kernel_concat / kernel_set now batch multiple small rows into one threadgroup (nrptg = min(256/ne0, ne1), capped at 256 threads/group) to improve small-row throughput; kernel_concat gains an early-return bounds check. Internal Metal backend, no project changes
~b9264–b9279 ggml/src/ggml-hexagon/ (ggml-hexagon.cpp, htp/ssm-conv.c, htp/rope-ops.c) SSM_CONV HVX kernel rewritten with VTCM-staged 32×32 fp32 in-register transpose and per-thread tiling (1 MiB VTCM budget); strictly-contiguous gate replaced with byte-stride checks (nb[0]==sizeof(float) and nb[1]==ne[0]*sizeof(float)); rope_cache_init / mrope_cache_init marked __attribute__((noinline)) to reduce code-bloat on Hexagon. Internal Qualcomm DSP backend, no project changes
~b9264–b9279 examples/save-load-state/ removed, tests/test-save-load-state.cpp added; tools/{batched-bench,fit-params,quantize,perplexity}/CMakeLists.txt The llama-save-load-state example binary was removed and re-homed as a CTest target; the four remaining standalone tools were each split into a *-impl static library + a thin main.cpp wrapper (mirroring the b9245 split of cli/completion/llama-bench/server), with the entry-point renamed to llama_batched_bench / llama_fit_params / llama_quantize / llama_perplexity to satisfy -Wmissing-declarations. Project does not compile any of these .cpp files (only server-context.cpp, server-queue.cpp, server-task.cpp, server-models.cpp — see CMakeLists.txt), so no impact
~b9264–b9279 app/ (CMakeLists.txt, llama.cpp) llama-app unified binary gains four new subcommands (batched-bench, fit-params, quantize, perplexity) and sets LLAMA_APP_CMD in the env before dispatching so that the router can re-inject the subcommand into spawned child argv. Guarded by LLAMA_BUILD_APP=OFF default — project doesn't enable it, no impact
~b9264–b9279 conversion/base.py + conversion/llama.py New _set_vocab_hybriddna() Python helper that emits a gpt2-style BPE vocab tagged as tokenizer.model = "hybriddna"; LlamaModel.set_vocab() dispatches to it when tokenizer_config.json declares "tokenizer_class": "HybridDNATokenizer"; add_prefix_space handling moved earlier in the same method. Conversion tooling only, not compiled by project
~b9279–b9284 upstream CMakeLists.txt LLAMA_BUILD_APP default flipped OFF → ON. Project's LLAMA_BUILD_TOOLS is OFF (FetchContent, LLAMA_STANDALONE=OFF), so tools/-dependent app targets are not configured; nevertheless CMakeLists.txt:108 now explicitly forces set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE) to keep the cache pinned across upgrades
~b9279–b9284 tools/{batched-bench,cli,completion,fit-params,llama-bench,perplexity,quantize,server}/CMakeLists.txt Each *-impl target switched from add_library(... STATIC ...) to default library type (becomes SHARED when BUILD_SHARED_LIBS=ON); added WINDOWS_EXPORT_ALL_SYMBOLS ON and conditional install(TARGETS ... LIBRARY) under LLAMA_TOOLS_INSTALL. Project doesn't enable LLAMA_BUILD_TOOLS, so none of these targets are configured — no impact
~b9279–b9284 src/llama-vocab.cpp + conversion/base.py HybridDNA tokenizer fix: k-mers are now stored in token_to_id with a reserved \xee\x80\x80 (U+E000) suffix to disambiguate them from identical base-vocab BPE tokens (e.g. CCCCCC); the suffix is stripped from id_to_token text after vocab load. Pure tokenizer internals, not exposed via JNI — no project changes required
~b9279–b9284 ggml/src/ggml-cuda/common.cuh PDL-launch gating now uses ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_HOPPER instead of the raw device cc — fixes false negatives when running on a Hopper device with a binary compiled for an older arch. Internal CUDA backend, no project changes required
~b9284–b9297 upstream CMakeLists.txt LLAMA_BUILD_APP default reverted from ON back to ${LLAMA_STANDALONE} (i.e. OFF for FetchContent consumers). Project's set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE) override is now redundant but harmless; kept as a defensive pin against future flips
~b9284–b9297 common/chat.h + tools/server/server-task.cpp New additive common_chat_parser_params::is_continuation field (default false); params_from_json_cmpl now parses the continue_final_message request field via common_chat_continuation_parse() and sets is_continuation when the result is non-NONE. task_result_state ctor guard tightened: the empty-prefill chat_msg = common_chat_parse("", true, ...) initialization is now gated on is_continuation && !echo (was just !echo), so the assistant prefill is suppressed from the first delta only when a continuation is actually requested. Java InferenceParameters.setContinueFinalMessage(boolean|ContinuationMode) already writes continue_final_message to the request JSON, so behaviour is wired through automatically; non-continuation requests now correctly emit the first delta instead of suppressing it
~b9284–b9297 src/llama-model.{h,cpp} + src/models/qwen35.cpp + src/models/qwen35moe.cpp NVFP4 quantization extended to MTP (Multi-Token Prediction) tensors: llama_layer_nextn gains four scale fields (eh_proj_s, eh_proj_in_s, shared_head_head_s, shared_head_head_in_s); load_tensors() loads them when the corresponding base tensor exists and is NVFP4; Qwen3.5 / Qwen3.5-MoE MTP graphs pass the scales into build_lora_mm(). Internal model-loading + graph-building changes, no project changes required
~b9284–b9297 ggml/src/ggml-backend.cpp Bug fix in ggml_backend_tensor_get_2d_async: fast-path condition checked iface.set_tensor_2d_async == NULL (typo) instead of iface.get_tensor_2d_async == NULL; multi-copy gets now correctly fall back when the backend lacks get_tensor_2d_async. Also corrects an out-of-bounds assertion message from "write" to "read". Internal backend code, no project changes required
~b9284–b9297 ggml/src/ggml-opencl/ (ggml-opencl.cpp + 17 kernel files) Adreno MoE pipeline bug fix: GEMM/GEMV kernels for MXFP4/Q4_0/Q4_1/Q4_K/Q5_0/Q5_1/Q5_K/Q6_K had a boundary-check race where the ne01 bounds check exited threads early and prevented their participation in tile-wide reductions, causing wrong results when ne01 % 64 != 0. Fixed by: (1) rounding global_size[0] up to the next multiple of 64 in ggml_cl_mul_mat_id, (2) moving the per-thread ne01 early-return in each GEMM kernel to AFTER the tile reduction, (3) adding the same early-return in the GEMV kernels and the cvt.cl trans4_ns/restore_ns kernels; alignment threshold also relaxed from ne01 % 64 == 0 to ne01 % 32 == 0 in use_adreno_moe_kernels. Internal OpenCL backend, affects the opencl-android-aarch64 classifier build only — no project source changes
~b9284–b9297 ggml/src/ggml-sycl/ (ggml-sycl.cpp, dmmv.cpp, gated_delta_net.cpp, common.hpp) (1) BF16 added to ggml_sycl_supports_dmmv() and can_use_dequantize_mul_mat_vec(); new convert_mul_mat_vec_bf16_sycl path. (2) Level Zero auto-detect moved into ggml_sycl_init()info.ext_oneapi_level_zero flag now reflects the GPU-only check (CPU devices ignored) and is used as the default for GGML_SYCL_ENABLE_LEVEL_ZERO env. (3) mmid_counting_sort_rows() replaces the per-expert atomic scan in ggml_sycl_mul_mat_id — host-side counting sort builds expert-contiguous row slices in a single pass instead of N×expert atomic scans; significant speedup for MoE dispatch. (4) Gated-delta-net kernel extended with keep_rs_t template parameter and per-token snapshot writes when K > 1, matching the CUDA/Vulkan snapshot changes from b9222. Internal SYCL backend, no project changes required
~b9284–b9297 ggml/src/ggml-vulkan/CMakeLists.txt find_package(SPIRV-Headers) switched to CONFIG REQUIRED and adds $ENV{VULKAN_SDK} to CMAKE_PREFIX_PATH; fixes detection when SPIRV-Headers ships only the CMake-config files (no FindSPIRV-Headers.cmake). Internal Vulkan build config, no project changes required
~b9284–b9297 ggml/src/ggml-zendnn/ (CMakeLists.txt, ggml-zendnn.cpp) ZenDNN bumped to ZenDNN-2026-WW19; Q8_0 weight support added for matmul and matmul_id paths via dynamic quantization (S8 compute, BF16 scales); ZenDNN matmul/matmul_id now handles GGML_TYPE_Q8_0 with FP32 src1 directly without F32→Q8_0 conversion. Internal AMD ZenDNN backend, no project changes required
~b9284–b9297 tools/perplexity/perplexity.cpp log_probs.resize(n_ctx * nv) widened to size_t(n_ctx) * nv to avoid 32-bit overflow on large context sizes. Standalone tool not compiled by project, no impact
~b9297–b9305 upstream CMakeLists.txt Top-level backward-compat shims that forwarded LLAMA_BUILD_WEBUILLAMA_BUILD_UI and LLAMA_USE_PREBUILT_WEBUILLAMA_USE_PREBUILT_UI were REMOVED (they now live only in tools/ui/CMakeLists.txt). Java impact: project's set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE) no longer hits the shim at top level. tools/ui is not configured in FetchContent mode (LLAMA_BUILD_TOOLS=OFF), so the old setting was inert in practice, but the project's CMakeLists.txt:107 was renamed to set(LLAMA_BUILD_UI OFF CACHE BOOL "" FORCE) for clarity and to defend against future flips of LLAMA_BUILD_UI default
~b9297–b9305 common/common.h LLAMA_UI_DEFAULT_ENABLED macro removed; common_params::ui default is now unconditionally true. Not referenced by project, no changes required
~b9297–b9305 common/fit.{h,cpp} common_get_device_memory_data() made non-static and exported from fit.h (was a file-local helper). fit.h now also pulls in ggml-backend.h, llama.h, and ../src/llama-ext.h. Used by upstream tools/server/server-context.cpp (compiled directly into jllama). The #include "../src/llama-ext.h" resolves relative to fit.h's location (common/../src/llama-ext.h), so no extra include paths are required. No project source changes
~b9297–b9305 tools/server/server-context.cpp New #include "fit.h" and a new draft/MTP memory measurement block: when params_base.fit_params is set AND the speculative config includes a draft model or COMMON_SPECULATIVE_TYPE_DRAFT_MTP, common_get_device_memory_data() is called against the draft model (or a copy of the target params with LLAMA_CONTEXT_TYPE_MTP for MTP) and the resulting per-device model + context + compute bytes are added to params_base.fit_params_target before the target context is fitted. Compiled directly into jllama from upstream; behaviour is additive and only triggers for speculative-decoding setups. ModelParameters.setFit(boolean) defaults to on, so this kicks in automatically when a user configures a draft model — no Java-side wiring required
~b9297–b9305 tools/server/server-context.cpp [mtmd] estimated memory usage of mmproj log line reworded to estimated worst-case memory usage; log only, no behavioural change
~b9297–b9305 tools/server/server-http.cpp UI serving path migrated from per-asset extern arrays (index_html, bundle_js, …) and the LLAMA_BUILD_UI macro to a runtime llama_ui_find_asset() lookup gated on the new LLAMA_UI_HAS_ASSETS macro generated by the new llama-ui-embed host tool. Project does NOT compile server-http.cpp (only server-context.cpp/server-queue.cpp/server-task.cpp/server-models.cpp), no impact
~b9297–b9305 tools/ui/ (CMakeLists.txt, new embed.cpp, new sources.cmake, new scripts/ui-assets.cmake, removed scripts/ui-download.cmake + scripts/xxd.cmake, removed ui.cpp+ui.h) Full UI build pipeline rewrite: xxd.cmake+ui-download.cmake replaced by a host-compiled llama-ui-embed C++ tool that generates ui.cpp/ui.h (declaring a g_assets[] table and llama_ui_find_asset() lookup, plus LLAMA_UI_HAS_ASSETS macro) from arbitrary asset files; new scripts/ui-assets.cmake orchestrates asset provisioning with a clearer priority (pre-built tools/ui/dist → npm build → HF Bucket); tools/ui is now an add_custom_target always re-run per build. The deprecation shims for LLAMA_BUILD_WEBUI/LLAMA_USE_PREBUILT_WEBUI/LLAMA_WEBUI_HF_BUCKET moved here from the top-level CMakeLists.txt. Project does not build the UI (LLAMA_BUILD_TOOLS=OFF in FetchContent mode), no impact
~b9297–b9305 ggml/include/ggml-alloc.h Comment-only API documentation update for ggml_backend_alloc_ctx_tensors_from_buft. No project changes required
~b9297–b9305 ggml/src/ggml-backend-meta.cpp Bug fix for zero-sized split tensor slices: set_tensor/get_tensor/set_tensor_async/get_tensor_async paths now continue when chunk_size_j == 0; ggml_backend_meta_alloc_ctx_tensors_from_buft now allocates a dummy buffer when all tensors in a context are zero-sized (was returning NULL and asserting); ggml_backend_buft_alloc_buffer result now GGML_ASSERTed non-null. Internal backend code, no project changes required
~b9297–b9305 ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c hvx_vec_splat_f16(hvx_vec_get_f16(...)) round-trip replaced with hvx_vec_repl_f16(...) which stays in the vector domain via vdelta (avoids store/reload through scalar). Internal Hexagon DSP backend optimization, no project changes required
~b9297–b9305 ggml/src/ggml-opencl/ggml-opencl.cpp GGML_OPENCL_PROFILING batching fix: when profiling_info reaches 2048 entries the batch is now flushed into a persistent profiling_results vector (events released, durations populated) instead of accumulating until shutdown. Also fixes missing ] closing the JSON array in cl_trace.json. Profile-only code (GGML_OPENCL_PROFILING is off by default), no project changes required
~b9305–b9333 common/common.h + common/arg.cpp common_params::checkpoint_every_nt renamed to checkpoint_min_step; default changed 8192 → 256; CLI flag -cpent/--checkpoint-every-n-tokens REMOVED (throws std::invalid_argument at parse time) and replaced by -cms/--checkpoint-min-step; env var LLAMA_ARG_CHECKPOINT_EVERY_NTLLAMA_ARG_CHECKPOINT_MIN_SPACING_NT. Java layer does not expose this flag, no project source changes required
~b9305–b9333 common/chat.h + common/chat.cpp New common_chat_msg_span and common_chat_msg_delimiter structs; new common_chat_params::message_spans field (default empty vector); new common_chat_split_by_role() function; populated for GPT-OSS, Gemma4, and all autoparser-handled templates with detected user_start/assistant_start markers; passed through server-common.cpp as message_spans JSON array in the task params; compiled from upstream, no Java changes required
~b9305–b9333 common/chat-diff-analyzer.cpp + common/chat-auto-parser.h New autoparser::user_start and autoparser::assistant_start fields auto-detected via differential template analysis; new patches for Nemotron Nano v2, Fireworks v2, Solar Open, Apriel 1.6; additive, compiled from upstream, no project changes required
~b9305–b9333 tools/server/server-task.h + tools/server/server-context.cpp New task_params::n_before_user field (default -1); server computes it from message_spans to place context checkpoints precisely at the last-user-message boundary; MTP context creation now propagates draft.cache_type_k/v; compiled directly into jllama from upstream, no project source changes required
~b9305–b9333 ggml/include/gguf.h + ggml/src/gguf.cpp New gguf_reader_callback_t typedef; new gguf_init_from_buffer(data, size, params) and gguf_init_from_callback(callback, userdata, max_chunk_read, max_expected_size, params) public APIs; internal gguf_init_from_reader() helper refactored to use a callback-based reader; additive, not used by project
~b9305–b9333 ggml/CMakeLists.txt GGML version bumped 0.12.0 → 0.13.0; no project changes required
~b9305–b9333 ggml/src/CMakeLists.txt + ggml/src/ggml-cpu/CMakeLists.txt OpenMP detection and target_link_libraries moved from ggml-cpu into ggml-base; exported ggml-config.cmake.in updated to add GGML_BASE_INTERFACE_LINK_LIBRARIES and guard OpenMP targets before appending; fixes static-lib consumers that link only ggml-base; no project source changes required
~b9305–b9333 ggml/src/ggml-alloc.c Off-by-one bug fix in ggml_dyn_tallocr_remove_block: loop ran one iteration past the last valid element; internal allocator fix, no project changes required
~b9305–b9333 ggml/src/ggml-backend-meta.cpp Rotating-pair compute containers: external views created between evals now use a stc_compute[2] double-buffer scheme so they don't slowly deplete stc_static memory; split_state_cache is now unbounded (comment documents it as FIXME); ggml_backend_meta_alloc_ctx_tensors_from_buft uses ggml_get_mem_size(ctx) for static container and 16× that for each compute container; internal multi-GPU meta backend refactor, no project changes required
~b9305–b9333 ggml/src/ggml-cuda/fwht.cu + fwht.cuh + ggml-cuda.cu New CUDA FWHT (Fast Walsh-Hadamard Transform) kernel (fwht_cuda<N>) for N = 64/128/256/512; dispatched from ggml_cuda_mul_mat when GGML_HINT_SRC0_IS_HADAMARD op hint is set on a ggml_mul_mat node (hint index 1); internal CUDA backend, no project changes required
~b9305–b9333 ggml/src/ggml-metal/ggml-metal-device.{h,m} New ggml_metal_device_id enum covering M1–M5 variants; device_id field added to ggml_metal_device_props, populated by new ggml_metal_device_id_parse() from the MTL device name string; additive, no project changes required
~b9305–b9333 ggml/src/ggml-quants.c IQ2XS and IQ3XS neighbour-search init parallelized with OpenMP (3-pass: parallel count → serial prefix-sum → parallel write); fixes a prior race on counter under OpenMP; guards with #ifdef GGML_USE_OPENMP; internal quantization init, no project changes required
~b9305–b9333 src/llama-arch.cpp LLM_TENSOR_FFN_LATENT_DOWN and LLM_TENSOR_FFN_LATENT_UP probe op changed from GGML_OP_MUL to GGML_OP_MUL_MAT; fixes Nemotron 3 Super latent projections not staying on GPU (buft probe must use MUL_MAT to keep them there); internal upstream fix, no project changes required
~b9305–b9333 vendor/cpp-httplib/httplib.{h,cpp} Bumped to v0.45.1: close_socket, shutdown_socket, Server::stop marked noexcept; macOS Keychain cert loading migrated from deprecated SecTrustCopyAnchorCertificates to SecTrustSettingsCopyCertificates (all three trust domains: system, admin, user); CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN now restricted to TARGET_OS_OSX only with compile-time #error on iOS/tvOS/watchOS; compiled automatically, no project changes required
~b9305–b9333 common/common.h New string_lcs(std::string_view a, std::string_view b) function (longest common substring via DP); additive, not used by project directly
~b9333–b9354 src/models/talkie.cpp (new) + src/llama-arch.h/cpp + src/llama-model.cpp + src/llama-vocab.cpp/h New Talkie model architecture (LLM_ARCH_TALKIE); uses NEOX rope type; embedding skip connections via out_scale; per-head Q gain via attn_q_norm; logit scale; new LLAMA_VOCAB_PRE_TYPE_MINICPM5 = 52 ("minicpm5" pre-type with ignore_merges = true); "talkie" tokenizer_pre mapped to GPT4O; Gemma4ForCausalLM registered as Gemma4 in HF conversion map; all additive, no project source changes required
~b9333–b9354 src/models/mistral3.cpp Dense FFN now passes ffn_up_s/ffn_gate_s/ffn_down_s instead of nullptr; MoE passes ffn_up_exps_s/ffn_gate_exps_s/ffn_down_exps_s to build_moe_ffn; bug fix for NVFP4 Mistral3/Mistral-MoE models; upstream only, no project changes required
~b9333–b9354 tools/server/server-http.h + server-http.cpp bool is_ssl = false field added to server_http_context; listening_address now uses https:// prefix when SSL is configured (was always http://); compiled from upstream, no project changes required
~b9333–b9354 ggml/src/ggml-sycl/ggml-sycl.cpp Virtual memory pool (ggml_sycl_pool_vmm) implemented when SYCL_EXT_ONEAPI_VIRTUAL_MEM is available; GGML_SYCL_ENABLE_VMM env var (default 1) controls it; DEBUG_SYCL_MALLOC compile flag for verbose allocation logging; vmm_granularity field in sycl_device_info; internal SYCL backend, no project changes required
~b9333–b9354 ggml/src/ggml-cuda/fwht.cu + fwht.cuh ggml_cuda_op_fwht return type changed voidbool; returns false for non-contiguous tensors or unsupported N values instead of calling GGML_ABORT; caller in ggml-cuda.cu now skips FWHT gracefully; internal CUDA backend, no project changes required
~b9333–b9354 ggml/src/ggml-vulkan/ggml-vulkan.cpp + conv2d_mm.comp Cooperative matrix 1 (cm1) path for conv2d; new CONV_SHAPE_64x128 tile size; aligned spec constant skips bounds checks when K/CRS/NPQ are tile-aligned; csh_store stages cm2/cm1 output through shared memory for coalesced global stores; internal Vulkan backend, no project changes required
~b9333–b9354 ggml/src/ggml-webgpu/ New MMVQ path for mat-vec using packed_4x8_integer_dot_product; legacy mul_mat.wgsl removed (replaced by register-tile path); new quantize_q8.wgsl and mul_mat_vec_q_acc.tmpl; vendor and dot-product capability detection at init; q8_1.m renamed to q8_1.s in WGSL struct; internal WebGPU backend, no project changes required
~b9333–b9354 upstream CI (.github/workflows/) CANN and SYCL builds disabled to save Actions resources; macOS builds moved to build-apple.yml; cache keys prefixed with cache-gha-; [no release] commit message token skips release pipeline; no project changes required
~b9354–b9437 common/common.h + common/arg.h + common/arg.cpp common_params_handle_models() return type voidbool (caller can detect skip-download misses); new common_params::skip_download; common_params::timeout_read default raised 600 → 3600. Project does not call common_params_handle_models() directly — arg parsing happens upstream; the new defaults flow through transparently
~b9354–b9437 common/download.h + common/download.cpp common_download_model() parameter list trimmed: download_mmproj/download_mtp moved into common_download_opts; new common_skip_download_exception; new opt skip_download returns -2 on missing/etag mismatch. Project does not include download.h directly, no source changes required
~b9354–b9437 tools/server/server-task.h + server-task.cpp task_params::stream default truefalse; new server_task_result_cmpl_partial::is_begin bool to let HTTP layer emit SSE headers before the first delta; to_json() returns nullptr for the begin marker (sentinel meaning "HTTP-headers-only, no body"). Project always sets stream explicitly from Java (LlamaIterator.java, LlamaModel.java) so the default change is inert. The is_begin / nullable-to_json contract DOES leak into the JNI bridge — see the row below for the required fix
~b9354–b9437 tools/server/server-context.cpp + server-queue.cpp send_partial_response() gained is_begin parameter (defaulted); SSE stream now emits a no-content opening event when stream && !return_progress (server-context.cpp:2835) so the client sees HTTP 200 + headers before first token. server_response_reader::next() 30s warn-on-cancel diagnostic message updated. Required project source change: Java_net_ladenthin_llama_LlamaModel_receiveCompletionJson in src/main/cpp/jllama.cpp called result->to_json() once and assigned response["stop"], which silently auto-promoted the nullptr to an object {"stop": false} and surfaced a phantom empty LlamaOutput to every Java streaming caller (LlamaModelTest.testGenerateAnswer and four sibling tests overran by +1 token). Fixed by wrapping the rd->next() call in a loop that skips response.is_null() results so only real events reach Java
~b9354–b9437 common/arg.cpp (env-var renames) LLAMA_LOG_*LLAMA_ARG_LOG_*, LLAMA_OFFLINELLAMA_ARG_OFFLINE, LLAMA_LOG_FILELLAMA_ARG_LOG_FILE, LLAMA_CHAT_TEMPLATE_KWARGSLLAMA_ARG_CHAT_TEMPLATE_KWARGS. CLI verbosity values relabeled (4=trace, 5=debug). The --license CLI flag was REMOVED and moved to the new llama-app licenses subcommand. Project does not expose these env vars or the --license flag through the Java API, no changes required
~b9354–b9437 src/llama.cpp llama_backend_init() device-discovery rule tightened: iGPUs are now added only when no discrete GPUs were found (was: when no devices at all). RPC servers no longer count as "found" for this purpose, so iGPU + RPC setups keep the local iGPU. Behavioural only, single-line caller in jllama.cpp unchanged
~b9354–b9437 src/llama-chat.cpp New LLM_CHAT_TEMPLATE_GRANITE_4_1 enum value + "granite-4.1" template name; granite-4.0 detection now requires the literal token g4_default_system_message in the template, otherwise it routes to 4.1. Project does not implement chat-template detection directly — routing happens inside compiled-from-upstream code, no source changes required
~b9354–b9437 vendor/cpp-httplib/ Bumped to v0.46.0: adds Client::set_no_proxy(std::vector<std::string>) with full hostname-suffix and IPv4/IPv6 CIDR matching; Server::ThreadPool constructor is exception-safe (already in v0.45.0); Client::set_proxy() now disconnects the held socket immediately so a later proxy change cannot reuse the old TLS session. Compiled automatically, no project changes required
~b9354–b9437 common/arg.cpp (additive flags) New --spec-draft-backend-sampling / --no-spec-draft-backend-sampling (env LLAMA_ARG_SPEC_DRAFT_BACKEND_SAMPLING) and --skip-download (mapped to common_params::skip_download). Both flags have defaults that preserve current Java behaviour. Consider exposing them as ModelParameters.setSpecDraftBackendSampling(boolean) and ModelParameters.setSkipDownload(boolean) in a follow-up; tracked under Open TODOs
~b9354–b9437 ggml/src/ggml-cuda/common.cuh GGML_CUDA_USE_PDL gating tightened: for MSVC, now requires CTK ≥ 12.3 (was 11.8) due to a compiler bug in the older Windows CUDA toolchains. Project's only CUDA build is Linux (dockcross, CUDA 13.2) so the MSVC gate has no CI impact; Windows CI builds CPU-only
~b9437–b9442 src/llama-vocab.{h,cpp} + src/llama-arch.{h,cpp} New LLAMA_VOCAB_PRE_TYPE_WHITESPACE = 53 and llm_tokenizer_whitespace_session (used by jina-v2-base-zh embeddings); new "whitespace" tokenizer_model routed as LLAMA_VOCAB_TYPE_BPE; new LLM_KV_TOKENIZER_NORMALIZER_LOWERCASE key (tokenizer.ggml.normalizer.lowercase) read into llama_vocab::impl::normalizer_lowercase; new public accessor llama_vocab::get_normalizer_lowercase(). All additive — existing tokenizers untouched; new whitespace + lowercase normalizer is consumed automatically when loading a GGUF that sets these vocabulary keys, no project source or Java API changes required
~b9437–b9442 src/llama.cpp llama_prepare_model_devices() iGPU collection now appends only the FIRST GGML_BACKEND_DEVICE_TYPE_IGPU device (prevents duplicate iGPU registration on multi-iGPU hosts). Behavioural fix, single-line caller in jllama.cpp unchanged, no project source changes required
~b9437–b9442 tools/ui/embed.cpp + tools/ui/src/... (Svelte) Webasset embedder tightened printf format specifiers (%lu%zu and PRIx64); UI settings split custom into customJson + customCss; runtime CSS injection via <svelte:head>. Project does not ship the upstream UI, no impact
~b9437–b9442 gguf-py/, conversion/ (Python) New _set_vocab_whitespace() helper and add_normalizer_lowercase() GGUF writer for the new whitespace tokenizer + lowercase normalizer keys (mirrors the vocab additions above); jina-v2 Roberta-tokenizer path now branches to whitespace when tokenizer.json declares a Whitespace pre-tokenizer. Python-side only, no impact on the Java/JNI build
~b9442–b9444 .github/workflows/build-cpu.yml (upstream CI) Upstream's CPU-build CI trigger paths narrowed to **/*.h, **/*.hpp, **/*.c, **/*.cpp (dropped **/*.cu, **/*.cuh, **/*.swift, **/*.m, **/*.metal, **/*.comp, **/*.glsl, **/*.wgsl) so GPU/Metal/Vulkan/WebGPU/Swift source edits no longer trigger the CPU build. Upstream-only CI plumbing; this project consumes none of upstream's workflow files and has its own publish.yml, no impact
~b9442–b9444 tools/server/server-http.cpp If-None-Match conditional-GET handling now also accepts the weak ETag form W/"..." (previously matched only strong ETag bytes-equal); 304 Not Modified returned for either form. This is the standalone llama-server HTTP tool, which is not linked into the JNI build (libllama + libcommon only); no project source changes required and no new Java API surface to expose
~b9444–b9490 common/common.cpp common_prompt_batch_decode() signature changed: new int n_new parameter added between all_tokens and n_past. Callers must pass the count of newly-decoded tokens for the batch. Only called inside upstream tools/server/server-context.cpp (compiled directly into jllama); no project source changes required — the new signature flows through transparently
~b9444–b9490 include/llama.h llama_set_warmup() deprecated via LLAMA_DEPRECATED macro (warmup is now handled internally during model load + first decode). Not called from jllama.cpp or any project source; the change is absorbed inside upstream-compiled code, no project changes required. A future jllama feature that needs explicit warmup control should avoid this deprecated call and use the new replacement instead
~b9444–b9490 include/llama.h + src/llama-context.cpp New llama_context_params::n_outputs_max field (default -1 = derived from n_batch). Limits the number of output slots allocated per context; useful for low-memory setups that always request logits_all=false. Not exposed by project today — consider adding ModelParameters.setMaxOutputs(int) if a user requests fine-grained control. Tracked under Open TODOs
~b9444–b9490 common/arg.cpp + common/common.cpp common_params_handle_models() no longer sets hf_opts.download_mmproj = true unconditionally; instead uses opts.download_mmproj = !params.no_mmproj so the new --no-mmproj flag suppresses the multimodal projector download. Not called from project source — arg parsing happens upstream, no project changes required
~b9444–b9490 common/sampling.h + common/sampling.cpp New common_sampler_reasoning_budget_force(common_sampler *) API that triggers the budget sampler to inject the end-of-thinking token on the next sample. Paired with new common_params_sampling::reasoning_control bool: when set, arms the budget sampler so external code (e.g. a server control endpoint) can end reasoning at runtime. Not used by project today — would pair with a future InferenceParameters.setReasoningControl(boolean) setter and a LlamaModel.endReasoning(...) helper. Tracked under Open TODOs
~b9444–b9490 common/common.h + common/arg.cpp New common_params::sse_ping_interval (int32, env LLAMA_ARG_SSE_PING_INTERVAL, CLI --sse-ping-interval); server emits SSE keep-alive comments at this interval. Server-only; project does not run the upstream HTTP server (uses a direct in-process API), no Java setter required
~b9444–b9490 tools/server/server-http.cpp New POST /v1/chat/completions/control endpoint accepting {"id": "...", "action": "reasoning_end"} — tells a streaming completion to wrap up reasoning early. Server-only; not linked into the JNI build (libllama + libcommon only), no project source changes required. If exposed in Java, would map to a new LlamaModel.endReasoning(String taskId) method that calls common_sampler_reasoning_budget_force on the slot's sampler. Tracked under Open TODOs
~b9444–b9490 src/llama-hparams.h + src/llama-model.cpp Internal renames: hparams::recurrent_layer_arrhparams::is_recr_impl; hparams::swa_layershparams::is_swa_impl. Internal helper fields not part of the public API; not referenced by jllama.cpp or any project source, no changes required
~b9444–b9490 src/llama-arch.h + src/llama-arch.cpp + gguf-py/ New LLM_KV_HIDDEN_ACT GGUF key (%s.hidden_act) for ModernBert SwiGLU/GeGLU activation selection; new LLM_KV_ATTENTION_RECURRENT_LAYERS key for hybrid (recurrent + attention) models. Additive vocabulary keys consumed automatically when loading a GGUF that sets them; no project source or Java API changes required
~b9444–b9490 src/llama-arch.h + src/models/*.cpp (new) New model architectures: LLM_ARCH_MELLUM (JetBrains code-completion), LLM_ARCH_EXAONE4_5 (LG AI multimodal), LLM_ARCH_STEP3P7 (StepFun Step-3.7 with MTP support); LLM_ARCH_QWEN3NEXT/LLM_ARCH_QWEN35/LLM_ARCH_QWEN35MOE removed from llama_model_saver_supports_arch() allowlist. New tokenizer pre-types: LLAMA_VOCAB_PRE_TYPE_GRANITE_EMB_MULTI = 54, LLAMA_VOCAB_PRE_TYPE_MELLUM2 = 55. All additive at the architecture level — consumed automatically when loading a matching GGUF, no project source or Java API changes required
~b9444–b9490 common/arg.cpp New --mtp / --no-mtp flags (env LLAMA_ARG_MTP) now apply to Step-3.5 in addition to the existing Qwen3.5 coverage. Multi-Token Prediction is consumed inside upstream-compiled server TUs; project does not expose an MTP setter today (would map to ModelParameters.setMtp(boolean)). Tracked under Open TODOs if a user requests it
~b9444–b9490 upstream build / verification Local build with GIT_TAG b9490 was verified clean: cmake -B build configures cleanly; cmake --build build --config Release -j$(nproc) links libjllama.so with zero warnings on jllama.cpp or any project translation unit. All breaking changes in this range are absorbed inside upstream-compiled translation units (common.cpp, arg.cpp, llama.cpp, server-*.cpp, download.cpp); no project source edits required for the version bump itself
~b9490–b9495 include/llama.h + src/llama-ext.h + src/llama-context.{h,cpp} + src/llama-cparams.h + src/llama-graph.{h,cpp} + common/speculative.{h,cpp} + src/models/{qwen35,qwen35moe,step35}.cpp Mass terminology rename: pre_normnextn everywhere the pre-final-norm hidden state is referenced. Affects the public API: llama_set_embeddings_pre_norm()llama_set_embeddings_nextn(), llama_get_embeddings_pre_norm()llama_get_embeddings_nextn(), llama_get_embeddings_pre_norm_ith()llama_get_embeddings_nextn_ith(). Internal: cparams.embeddings_pre_normcparams.embeddings_nextn, cparams.embeddings_pre_norm_maskedcparams.embeddings_nextn_masked, llm_graph_result::t_h_pre_normt_h_nextn, common_speculative_need_embd_pre_norm()common_speculative_need_embd_nextn(). Qwen3.5 / Qwen3.5-MoE / Step-3.5 model graphs moved the final norm before extracting t_h_nextn (was after extracting the pre-norm hidden state). Project does not call any of these MTP-specific APIs directly — all references stay inside upstream-compiled translation units (speculative.cpp, llama-context.cpp, server-context.cpp, model TUs). Verified by grep across src/main/cpp/*.{cpp,hpp}: zero matches for any pre_norm / nextn / embeddings_pre_norm* / t_h_pre_norm* symbol. No project source changes required
~b9490–b9495 ggml/src/ggml-cuda/common.cuh + 10 CUDA kernel files New GGML_CUDA_RESTRICT macro replaces __restrict__ on kernel parameter pointers. PDL (Programmatic Dependent Launch) on Hopper requires __restrict__ to be disabled per llama.cpp PR #24030; the macro expands to nothing under GGML_CUDA_USE_PDL && __CUDA_ARCH__ >= GGML_CUDA_CC_HOPPER, otherwise to __restrict__. Kernel signatures change from direct T * __restrict__ x parameters to T * x_ptr parameter + an internal T * GGML_CUDA_RESTRICT x = x_ptr; alias line; GGML_UNUSED_VARS calls in fallback branches updated to reference the _ptr names. Internal CUDA backend change; project does not compile any CUDA kernels in the JNI build (CUDA build uses upstream sources unchanged via FetchContent). No project source changes required
~b9490–b9495 src/llama-arch.{h,cpp} + src/llama-vocab.{h,cpp} + gguf-py/gguf/constants.py + gguf-py/gguf/gguf_writer.py New LLM_KV_TOKENIZER_SUPPRESS_TOKENS GGUF key (tokenizer.ggml.suppress_tokens). When a GGUF declares this array, the loader stores it on llama_vocab::impl::suppress_tokens and exposes it via new llama_vocab::get_suppress_tokens() accessor. The Gemma4 model graph (src/models/gemma4.cpp) reads this list and appends a -INFINITY logit bias to those token IDs at the end of the forward graph (new llm_graph_input_logits_bias class). Additive: existing models without the key produce an empty suppress_tokens vector and the bias-add branch is skipped. Mirrors a HuggingFace transformers suppress_tokens parameter; specifically used for Gemma4 Unified to prevent the model from emitting `<image
~b9490–b9495 gguf-py/gguf/constants.py + gguf-py/gguf/tensor_mapping.py + tools/mtmd/clip-impl.h + tools/mtmd/clip-model.h + tools/mtmd/clip.cpp + new tools/mtmd/models/gemma4uv.cpp + new tools/mtmd/models/gemma4ua.cpp + tools/mtmd/mtmd-audio.{h,cpp} + tools/mtmd/mtmd.cpp + conversion/__init__.py + conversion/gemma.py New Gemma4 Unified vision + audio variant (Gemma4UnifiedForConditionalGeneration). Adds new projector types PROJECTOR_TYPE_GEMMA4UV and PROJECTOR_TYPE_GEMMA4UA (vision uses bigger patch size with token merging done on the conv layer; audio is encoder-free, raw 16 kHz waveform chunked into 640-sample frames). New V_ENC_EMBD_PATCH_NORM tensor enum (v.patch_norm.{bid}) and 3 indexed patch_norm_{1,2,3}_{w,b} weights on clip_model (Gemma4U uses standard PyTorch LayerNorm rather than RMSNorm before/after the patch embedding). New mtmd_audio_preprocessor_gemma4ua mel-major waveform packer (40 ms / 16 kHz frames; no FFT, no filterbank). Multimodal additions are routed through upstream mtmd-cli / mtmd-debug binaries that the project does not link; the JNI build links libllama + libcommon only. Additive at the GGUF / projector loader level: existing GGUFs without these projector types continue to load through the previous code paths. No project source or Java API changes required
~b9490–b9495 tools/ui/ (package.json, src/lib/components/app/content/MarkdownContent/, new MermaidPreview.svelte, new DialogMermaidPreview.svelte, new constants / icons / rehype plugins) Upstream llama-server web UI gains Mermaid diagram rendering: new mermaid@^11.15 dependency, lazy-loaded; new rehype plugin chain (rehype-mermaid-pre, rehype-enhance-mermaid-blocks) converts ```mermaid code fences to <pre class="mermaid"> and wraps them with copy / preview action buttons; the existing single-file MarkdownContent.svelte is split into a .svelte + sibling .css / markdown-utils.ts / markdown-handlers.ts so the new mermaid renderer can share helpers. Project does not compile or ship the upstream tools/ui (server-only feature, classpath-only JNI build); no impact
~b9490–b9495 upstream build / verification Local build with GIT_TAG b9495 was verified clean: cmake -B build -DBUILD_TESTING=ON configures cleanly, cmake --build build --config Release -j$(nproc) links libjllama.so + jllama_test with zero warnings on any project translation unit; ctest --test-dir build --output-on-failure reports 435/435 tests passing. All breaking changes in this range are renames within upstream-compiled translation units; no project source edits required for the version bump itself
~b9495–b9543 src/llama-hparams.{h,cpp} + every src/models/*.cpp (~150 files) Field hparams::n_layer (uint32_t) was split: the raw count moved to hparams::n_layer_all and hparams::n_layer() is now a member function that returns n_layer_all - n_layer_nextn (the effective non-MTP layer count). Sibling rename: hparams::nextn_predict_layers → hparams::n_layer_nextn. Every per-model TU in src/models/*.cpp was updated to call hparams.n_layer() and hparams.n_layer_nextn. New hparams::set_recr_pattern(), a mirror of set_swa_pattern(), for hybrid recurrent architectures. New per-layer hparams::deepstack_mapping_arr (LLAMA_MAX_LAYERS, default -1) populated from new GGUF key LLM_KV_DEEPSTACK_MAPPING for Granite4-Vision-style per-layer deepstack injection. hparams::kv_only_nextn was removed (MTP heads now use a layer filter callback instead). Project does not reference any of these hparams symbols directly — verified via grep -rnE "hparams\.n_layer|nextn_predict_layers|n_layer_nextn|n_layer_all|deepstack_mapping" src/main/cpp/ src/test/cpp/, which returns zero matches. All consumers are inside upstream-compiled TUs (llama-model.cpp, llama-context.cpp, model TUs); no project source changes required
~b9495–b9543 include/llama.h (state-seq flags) + tools/server/server-context.cpp + examples/speculative-simple/speculative-simple.cpp The LLAMA_STATE_SEQ_FLAGS_ON_DEVICE flag was removed from the llama_state_seq_flags enum. All upstream call sites that passed LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE were updated to pass only LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY — the on-device path is now the default for partial saves/loads. Project does not call llama_state_seq_get_* / llama_state_seq_set_* directly from jllama.cpp; the only consumer in the JNI build is upstream server-context.cpp (speculative checkpoint helpers), which was updated upstream. Verified via grep -rn "LLAMA_STATE_SEQ_FLAGS_ON_DEVICE" src/ returns zero matches. No project source changes required
~b9495–b9543 new common/imatrix-loader.{h,cpp} + refactor of tools/imatrix/imatrix.cpp + tools/quantize/quantize.cpp Imatrix-loading logic was extracted into a shared loader inside libcommon: new common_imatrix struct (entries, datasets, chunk_count, chunk_size, is_legacy, has_metadata) and common_imatrix_load(const std::string &, common_imatrix &) reader. New GGUF metadata keys exposed as LLM_KV_IMATRIX_DATASETS, LLM_KV_IMATRIX_CHUNK_COUNT, LLM_KV_IMATRIX_CHUNK_SIZE. The imatrix and quantize CLIs were rewritten to consume this shared loader (the legacy in-file binary parser also moved into the shared loader). Build system: common/CMakeLists.txt now includes imatrix-loader.cpp and imatrix-loader.h in libcommon, which means the JNI build picks up the new TU automatically via FetchContent + the existing target_link_libraries(jllama PRIVATE common) line. Project does not use imatrix loading from Java today (no LlamaImatrix class); the new symbols ship as additive surface area only. No project source changes required
~b9495–b9543 tools/mtmd/clip.{h,cpp} + tools/mtmd/clip-impl.h + tools/mtmd/clip-model.h + tools/mtmd/mtmd.{h,cpp} + tools/mtmd/mtmd-helper.{h,cpp} + tools/mtmd/mtmd-image.cpp + every tools/mtmd/models/*.cpp Large MTMD subsystem refactor: (1) clip_image_u8 and clip_image_f32 switched from public POD-style nx / ny / buf fields to private members with get_size() / set_size() / get_ro_buf() / cpy_buf() / get_pixel() / set_pixel() / is_placeholder() getters/setters; every model TU and image helper was updated to the new API. (2) Several public helpers were removed from tools/mtmd/clip.h: clip_embd_nbytes, clip_embd_nbytes_by_img, clip_image_u8_get_data, clip_build_img_from_pixels, clip_get_newline_tensor, clip_encode_float_image, clip_image_f32_batch_add_mel. (3) mtmd_helper_bitmap_init_from_file() and mtmd_helper_bitmap_init_from_buf() gained a required bool placeholder parameter (when true the bitmap reserves shape only, no pixel decode — used for token counting). (4) mtmd_bitmap is now a true class (private buffer + is_placeholder() / can_batch_with()); mtmd_bitmap_init() and mtmd_bitmap_init_from_audio() accept nullptr data to create placeholder bitmaps. (5) New Granite4 Vision projector type PROJECTOR_TYPE_GRANITE4_VISION and tensor enums (V_MULTI_PROJ_*, V_QF_*) for QFormer-with-window projection. (6) Qwen-VL video / temporal-merge support: clip_graph_qwen2vl::build_inp_with_temporal_merge() plus n_batch_max=2 for batch-merged consecutive image frames. Project does not link any tools/mtmd/* TUs into the JNI build (libllama + libcommon only); the JNI vision API surfaces through mtmd-helper.h and was reviewed: zero clip_image_* / removed-helper references found across src/main/cpp/ and src/test/cpp/. No project source changes required
~b9495–b9543 tools/server/server-context.cpp + tools/server/server-http.cpp + tools/server/server.cpp (new /v1/responses/input_tokens + /v1/chat/completions/input_tokens + /v1/messages/count_tokens) New token-counting endpoints (Anthropic-compatible + OpenAI Responses-API-compatible). Implementation: server_routes::handle_count_tokens() consolidates the body parsing path (chat completions, responses, anthropic messages) and emits {"input_tokens": N, "object": "response.input_tokens"}. process_mtmd_prompt() signature gained a bool is_placeholder = false parameter so token-counting can reuse the multimodal tokenization path without decoding image/audio pixels. Server-only HTTP endpoints (the JNI build links neither tools/server/server.cpp nor server-http.cpp); of the TUs listed here only server-context.cpp is linked, and its only project-visible change is the new optional process_mtmd_prompt parameter, which is defaulted, so existing project call sites compile unchanged. No project source changes required
~b9495–b9543 common/chat-peg-parser.{h,cpp} + common/chat.cpp (LFM2/2.5 unified) LFM2.5's chat-completion parser was merged into the single common_chat_params_init_lfm2() (was a separate _lfm2_5 function); a bool tool_list_tokens flag toggles between the two template flavours. New helper common_chat_peg_builder::python_or_json_value() and a new bool allow_json_literals parameter on python_style_tool_calls() so LFM2.5 can accept JSON-cased true / false / null alongside the Python-cased literals. Pure-Python literal normalisation in chat-peg-parser.cpp (True/False/None → JSON during streaming). Project does not call any common_chat_peg_* or common_chat_params_init_lfm2* symbols; routing happens inside upstream-compiled chat.cpp. No project source changes required
~b9495–b9543 ggml/src/ggml-cuda/mmvq.cu + ggml/src/ggml-cpu/arch/{riscv,wasm}/quants.c + ggml/src/ggml-metal/ggml-metal-device.m + ggml/src/ggml-opencl/* + ggml/src/ggml-sycl/* + ggml/src/ggml-vulkan/* + ggml/src/ggml-webgpu/* + ggml/src/ggml-cpu/kleidiai/kleidiai.cpp Per-backend numerical & performance work: (1) CUDA mul_mat_vec_q_moe switched to GGML_CUDA_RESTRICT aliasing + PDL launch params for Hopper. (2) RISC-V Vector quants: dispatch-by-VL refactor (vl128 / vl256 / vl512 / vl1024 separate kernels for Q2_K, Q3_K, Q4_K, Q6_K, IQ1_S, IQ1_M, IQ2_S, IQ2_XS, IQ3_S, IQ3_XXS, IQ4_XS, TQ1_0, TQ2_0). (3) WebAssembly SIMD path for Q4_1. (4) Metal residency-set keep-alive polling interval tightened to 5 ms (was 500 ms). (5) OpenCL Adreno: faster concat/cpy/get_rows packed kernels for narrow tensors (<32 cols); Q6_K mat-vec rewritten with vec4 weight gather. (6) SYCL: multi-column MMVQ paths added for all quant types (ncols=2..8) used by speculative decoding's draft verification batches; should_reorder_tensor gate widened from ne[1]==1 to ne[1]<=8. (7) Vulkan: NV cooperative-matrix2 feature detection now requires every coopmat2_features.* bit; FWHT shader gains shmem fallback (Intel Windows driver bug workaround). (8) WebGPU: flash-attention split into vector / tile / subgroup-matrix variants with K/V quantization-aware staging (U32_DEQUANT_HELPERS); GRANITE_SPEECH bumped to multi-projector. (9) KleidiAI: env vars GGML_KLEIDIAI_CHUNK_MULTIPLIER & GGML_KLEIDIAI_SME thread-cap auto-detect; SME + non-SME hybrid scheduling. All purely backend-internal; project compiles backends through FetchContent with no API surface change visible to jllama.cpp. No project source changes required
~b9495–b9543 conversion/__init__.py + conversion/granite.py + conversion/gemma.py + convert_lora_to_gguf.py + gguf-py/gguf/{constants,tensor_mapping,gguf_writer}.py Python-side: new Granite4VisionMmprojModel (vision-projector for Granite4 with QFormer-window deepstack + per-projector spatial offsets + image-grid pinpoints); Gemma4 unified vision/audio conversion fix-ups for newer HF checkpoints (hidden_size falls back to audio_embed_dim; model_patch_size falls back to patch_size * pooling_kernel_size). convert_lora_to_gguf.py gained --trust-remote-code. New LLM_KV_DEEPSTACK_MAPPING writer (add_deepstack_mapping) and new clip-vision keys (KEY_PROJ_SAMPLE_QUERY_SIDE, KEY_PROJ_SAMPLE_WINDOW_SIDE, KEY_PROJ_SPATIAL_OFFSETS, KEY_FEATURE_LAYERS, KEY_IMAGE_GRID_PINPOINTS) for the Granite4 vision projector. Python-side only; no impact on the Java/JNI build. No project source changes required
~b9495–b9543 upstream build / verification Local build pending: the b9495 → b9543 bump is expected to compile cleanly given the audit above (zero grep matches in src/main/cpp/ for any of the renamed, removed, re-signatured, or newly added symbols: hparams.n_layer, nextn_predict_layers, n_layer_nextn, n_layer_all, LLAMA_STATE_SEQ_FLAGS_ON_DEVICE, clip_image_u8/clip_image_f32 field access, clip_build_img_from_pixels, clip_get_newline_tensor, clip_image_u8_get_data, clip_embd_nbytes, clip_embd_nbytes_by_img, clip_encode_float_image, clip_image_f32_batch_add_mel, mtmd_helper_bitmap_init_from_file, mtmd_helper_bitmap_init_from_buf, common_imatrix_load). The only project-visible signature change — process_mtmd_prompt()'s new bool is_placeholder parameter — is defaulted, so existing call sites inside the project should compile unchanged. Pending that build, all breaking changes in this range are expected to be absorbed inside upstream-compiled translation units, with no project source edits required for the version bump itself
~b9543–b9549 include/llama.h + src/llama-context.{h,cpp} + src/llama-cparams.h + src/llama-ext.h New llama_context_params::ctx_other field (a source/target/parent llama_context *, default nullptr) used to share results or llama_memory between two contexts; mirrored by new cparams.ctx_other and the new staging API llama_get_ctx_other() (llama-ext.h). llama_get_memory() was moved earlier in llama-context.cpp and made null-safe (returns nullptr for a null ctx). llama_context_default_params() initializes ctx_other = nullptr. Project does not aggregate-init llama_context_params (it goes through llama_context_default_params() inside upstream server-context.cpp) and never includes llama-ext.h — verified via grep -rnE "llama_context_params|ctx_other|llama-ext.h|llama_get_ctx_other|llama_get_memory" src/main/cpp/, which returns zero matches. No project source changes required
~b9543–b9549 src/llama-kv-cache.{h,cpp} + llama-kv-cache-iswa.{h,cpp} + llama-kv-cache-dsa.cpp + llama-memory.h + llama-memory-hybrid{,-iswa}.cpp KV-cache constructors gained two new parameters: llama_memory_t mem_other and layer_share_cb share (std::function<int32_t(int32_t il)> returning the source layer index to share cells from, or negative to skip). Enables one context's KV cache to share cells with another's (used by the new Gemma4-assistant MTP head). llama_memory_params gained a mem_other field. All call sites (iswa/dsa/hybrid wrappers, llama_model::create_memory) updated upstream; the project never constructs a llama_kv_cache* or llama_memory_* directly. No project source changes required
~b9543–b9549 src/llama-arch.{h,cpp} + new src/models/gemma4-assistant.cpp + src/models/models.h + src/llama-model.{h,cpp} + src/llama-hparams.{h,cpp} + src/llama-graph.{h,cpp} + gguf-py/ + conversion/gemma.py New model architecture LLM_ARCH_GEMMA4_ASSISTANT ("gemma4-assistant") — a NextN/MTP draft "assistant" head that shares the target Gemma4's KV cache and reads its post-final-norm hidden state. New tensors LLM_TENSOR_NEXTN_PROJ_PRE/NEXTN_PROJ_POST (nextn.pre_projection/post_projection) plus model-level nextn_proj_pre/nextn_proj_post; new hparams n_embd_inp_impl (input-embedding dim override, honoured by n_embd_inp()) and graph field n_layer_nextn. Python conversion registers Gemma4AssistantForCausalLM/Gemma4UnifiedAssistantForCausalLM. This is the headline new feature; it is a speculative-decoding / MTP mechanism, which this project tracks as deferred-by-policy (see Open TODOs / spec-draft-backend-sampling + MTP). Consumed entirely inside upstream-compiled TUs — loading a non-assistant GGUF is unaffected. No project source changes required to build; exposing MTP through the Java API remains the existing deferred TODO
~b9543–b9549 common/chat.cpp + new models/templates/LFM2.5-8B-A1B.jinja LFM2 chat-template handling: prior-turn reasoning_content is now copied into the template's thinking field, and <think> reasoning extraction is gated on the template source actually containing <think> (and no longer on enable_thinking). New LFM2.5-8B-A1B template + parser test consolidation. Routing happens inside upstream-compiled chat.cpp; the project calls no common_chat_params_init_lfm2* symbol. Handled automatically when such a model is loaded; no project source or Java API changes required
~b9543–b9549 common/arg.cpp + common/speculative.cpp + src/llama-graph.cpp common_params_handle_models() mmproj auto-download now also requires params.mmproj.path.empty() && params.mmproj.url.empty() (an explicitly-specified mmproj is no longer re-downloaded). The speculative.cpp MTP path adds a shared-memory fast path (is_mem_shared = llama_get_ctx_other(ctx_dft) == ctx_tgt) that skips the catch-up decode and reuses the target position for draft tokens (Gemma4 assistant), and now uses llama_model_n_embd_out() for the MTP row width. llama-graph.cpp moved the set_input_kq_mask / can_reuse_kq_mask calls out of the k-idxs-buffer guard (iswa/hybrid-iswa mask bugfix). All inside upstream-compiled TUs; no project source changes required
~b9543–b9549 tools/server/server-context.cpp (project-linked) The only project-linked server TU changed in this range: it now #includes ggml-cpp.h and ../../src/llama-ext.h; sets cparams.ctx_other = ctx_tgt for MTP draft/MTP contexts; moves the ctx_dft_seq_rm_type = common_context_can_seq_rm(...) assignment to after context init (guarded by if (ctx_dft)); downgrades the spec memory-measure failure log from SRV_ERR to SRV_WRN; and gates the mtmd draft-processing block on llama_get_ctx_other(ctx_dft) != ctx_tgt. All changes are internal to the TU and the new includes resolve against the FetchContent'd src/ and ggml headers. Compiles into jllama unchanged from the project's side. No project source changes required
~b9543–b9549 .github/workflows/docker.yml (upstream CI) Upstream's cuda13 Docker image bumped from CUDA 13.1.1 to 13.3.0. Upstream's own CI only; this project ships its own publish.yml and pins CUDA 13.2 via .github/build_cuda_linux.sh (see CLAUDE.md "Upgrading CUDA Version"). No impact
~b9543–b9549 project CMakeLists.txt (pre-existing latent bug, fixed in this bump) Not an upstream change — surfaced while build-testing this bump locally. The OS/arch detection block invoked net.ladenthin.llama.OSInfo, but the class had moved to net.ladenthin.llama.loader.OSInfo in the earlier layered-package restructure, so cmake -B build failed with "Could not determine OS name" on any host that does not pass -DOS_NAME/-DOS_ARCH explicitly (CI does, which is why it went unnoticed). Fixed both execute_process invocations (--os and --arch) to the loader.OSInfo FQN. Same stale-FQN-after-restructure class as the earlier spotbugs-exclude.xml / PIT-targetClasses repairs — the standing reminder to re-validate every FQN-bearing config after a package move now also covers CMakeLists.txt
~b9543–b9549 upstream build / verification Local build with GIT_TAG b9549 verified clean on Linux x86_64: cmake -B build -DBUILD_TESTING=ON configures cleanly (after the loader.OSInfo FQN fix above), cmake --build build --config Release -j$(nproc) links libjllama.so + jllama_test with zero warnings on any project translation unit (incl. the changed server-context.cpp), and ctest --test-dir build --output-on-failure reports 435/435 tests passing. All upstream breaking changes in this range are absorbed inside upstream-compiled translation units; no project C++ source edits were required for the version bump itself
~b9549–b9553 common/sampling.h + common/sampling.cpp + common/arg.cpp + common/common.cpp + tools/server/server-task.cpp common_sampler_types_from_names() dropped its bool allow_alt_names parameter — the signature is now common_sampler_types_from_names(const std::vector<std::string> & names). The body was rewritten to (a) auto-generate kebab-case (top-k) and no-dash (topk) aliases from the canonical snake_case names, plus misc aliases (nucleus→top_p, temp→temperature, typ→typical_p), and (b) lowercase the input so matching is case-insensitive; aliases are now always accepted (the old gate is gone). All three call sites were updated upstream (arg.cpp / common.cpp dropped the , true arg; server-task.cpp dropped the , false arg). Project impact: none at the source levelgrep -rn common_sampler_types_from_names src/main/cpp src/test/cpp returns zero matches; the symbol is reached only through the upstream-compiled server-task.cpp linked into jllama. New behaviour exposed for free: because server-task.cpp previously passed allow_alt_names=false, the project's InferenceParameters samplers JSON array only matched canonical names like top_k; it now also accepts top-k / topk / nucleus / temp / typ and is case-insensitive (TOP_K, Min-P). Pinned by 5 new ParamsFromJsonCmpl.Samplers_* tests in test_server.cpp
~b9549–b9553 src/llama-kv-cache.cpp + src/llama-kv-cache.h + src/llama-kv-cells.h KV-cache shared-cells refactor (continues TAG_KV_CACHE_SHARE_CELLS, used by the Gemma4-assistant MTP head): the v_cells member changed from a by-value std::vector<llama_kv_cells> to a std::shared_ptr<llama_kv_cells_vec> v_cells_impl plus a llama_kv_cells_vec & v_cells reference, so a target cache now views the source cache's cells instead of copying them in apply_ubatch(); the constructor also clamps kv_size down to the shared source's size. New type alias using llama_kv_cells_vec = std::vector<llama_kv_cells>; in llama-kv-cells.h. All internal src/ headers the JNI build does not include (the project pulls public llama.h / llama-cpp.h, never llama-kv-cache.h / llama-kv-cells.h) — verified via grep -rnE "llama_kv_cells|llama-kv-cache" src/main/cpp src/test/cpp → zero matches. No project source changes required
~b9549–b9553 conversion/mistral.py + convert_hf_to_gguf.py Python conversion-script robustness only: hparams["llama_4_scaling"] and "moe" in hparams replaced with hparams.get(...) / is not None guards so a present-but-null key no longer crashes conversion. Python tooling, not part of the JNI build. No impact
~b9549–b9553 upstream build / verification Local build with GIT_TAG b9553 verified clean on Linux x86_64: cmake -B build -DBUILD_TESTING=ON configures cleanly, cmake --build build --config Release -j$(nproc) links libjllama.so + jllama_test with zero warnings on any project translation unit, and ctest --test-dir build --output-on-failure reports 440/440 tests passing (435 prior + 5 new Samplers_* tests). The sole breaking change in this range (the common_sampler_types_from_names signature) is absorbed inside upstream-compiled translation units; no project C++ source edits were required for the version bump itself
~b9553–b9555 .devops/intel.Dockerfile + ggml/src/ggml-metal/ggml-metal-device.cpp + tests/test-backend-ops.cpp Tiny maintenance bump — no API change and no new feature. (1) intel.Dockerfile: Intel GPU userspace driver pins bumped (IGC v2.20.5 → v2.34.4, compute-runtime 25.40.35563.10 → 26.18.38308.1, IGDGMM 22.8.2 → 22.10.0) with the old multi-GPU-safe versions commented out; upstream's own Docker image only — this project ships its own publish.yml and does not consume .devops/. No impact. (2) ggml-metal-device.cpp: bugfix to the Metal im2col pipeline selector — the standard-vs-_ext kernel choice now keys off the actual conv-kernel footprint (KH*KW, with KH = is_2D ? ne01 : 1, KW = ne00) instead of the raw ne00*ne01 product, fixing kernel selection for 1-D convolutions. Backend-internal Metal TU compiled via FetchContent; no API surface visible to jllama.cpp, and only affects the macOS/Metal backend at runtime. (3) tests/test-backend-ops.cpp: one extra test_im2col case ({3000,384,1,1} / {3,384,384,1}) added — upstream test only, not linked into the JNI build. No project source changes required; no new Java-API-exposable feature. Build verification deferred to CI (publish.yml) / a developer host as usual
~b9555–b9621 ggml/include/ggml.h + ggml/src/ggml.c + ggml/src/ggml-cuda/gated_delta_net.cu + ggml/src/ggml-metal/ggml-metal.metal + ggml/src/ggml-vulkan/vulkan-shaders/gated_delta_net.comp ggml_gated_delta_net state tensor reshaped again: the 3D (S_v*S_v*H, K, n_seqs) layout is now the 4D [S_v, S_v, H, n_seqs] with an explicit trailing int64_t K parameter (snapshot count; K=1 is final-state-only). Signature: ggml_gated_delta_net(ctx, q, k, v, g, beta, state, K) (previously the same argument list without K). Snapshot-slot ordering also flipped to most-recent-first. Internal Qwen3.5 / Qwen3-Next recurrent-attention kernel; project does not call ggml_gated_delta_net directly — no project source changes required
~b9555–b9621 ggml/include/ggml.h New ggml_col2im_1d(ctx, a, s0, oc, p0) function and GGML_OP_COL2IM_1D enum value added; GGML_OP_COUNT incremented 96 → 97. Additive; not called by project — no project source changes required
~b9555–b9621 common/fit.h + tools/server/server-context.cpp common_get_device_memory_data() return type changed: now returns common_device_memory_data_vec (typedef for std::vector<common_device_memory_data>). New common_device_memory_data struct carries .total, .free, .model, .context, .compute fields directly (previously the caller reached them via .mb.model etc.). fit.h also dropped its #include "ggml-backend.h" and #include "../src/llama-ext.h" lines (those types are no longer needed at the header level). Consumed exclusively in upstream-compiled server-context.cpp (field-accessor update from .mb.model → .model etc. was applied upstream); project does not include fit.h or call common_get_device_memory_data() directly — no project source changes required
~b9555–b9621 tools/mtmd/mtmd-helper.h + tools/mtmd/mtmd-helper.cpp + tools/server/server-common.cpp mtmd_helper_bitmap_init_from_file() and mtmd_helper_bitmap_init_from_buf() return type changed: both now return mtmd_helper_bitmap_wrapper struct (contains bitmap + video_ctx fields) instead of mtmd_bitmap*. All call sites updated in upstream server-common.cpp. Project does not call these functions from src/main/cpp/ (verified via grep: zero matches) — no project source changes required
~b9555–b9621 tools/mtmd/mtmd-helper.h + tools/mtmd/mtmd-helper.cpp New video pipeline: mtmd_helper_video_context, mtmd_helper_video_* API family (init/free/decode), ffmpeg-based frame extraction. New --video CLI flag in common/arg.cpp; new input_video content type in server-common.cpp. Multimodal helper additions flow through the upstream-compiled mtmd-helper.cpp and server-common.cpp; project does not reference any mtmd_helper_video_* symbol — no project source changes required. Could be exposed in a future Java API as InferenceParameters.setVideoPath(String)
~b9555–b9621 common/common.h New common_params fields: path_prompts_log_dir (prompt-logging output directory, string) and mtmd_batch_max_tokens (multimodal batch token limit, default 1024). Both additive with harmless defaults. Not surfaced by ModelParameters today — could be added in a future enhancement. No project source changes required
~b9555–b9621 src/llama-ext.h New EAGLE3 speculative-decoding support APIs: llama_set_embeddings_layer_inp(ctx, lid, value), llama_get_embeddings_layer_inp(ctx, lid), llama_model_target_layer_ids(model) → const int32_t*, llama_model_target_layer_ids_n(model) → uint32_t. New LLM_ARCH_EAGLE3 model architecture; new llama_model_eagle3 struct in upstream model sources. EAGLE3 enables full encoder+decoder graph implementation for speculative decoding. All consumed inside upstream-compiled speculative.cpp and model TUs; project does not reference any of these symbols — no project source changes required. Could be exposed later as a speculative-decoding backend type in ModelParameters
~b9555–b9621 src/llama-graph.h + src/llama-graph.cpp llm_graph_result::set_outputs() signature changed: now takes a const llm_graph_params & parameter (previously took no parameters). New t_layer_inp vector added to llm_graph_result for layer-input embedding extraction (used by EAGLE3). Internal graph-building API; not called from project sources — no project source changes required
~b9555–b9621 src/llama-context.cpp llama_context now initializes embeddings_layer_inp storage for EAGLE3 layer-input extraction; n_outputs_max is forced to n_batch when llama_model_has_encoder() returns true (encoder models always need all outputs). Internal context lifecycle; no project sources reference these fields — no project source changes required
~b9555–b9621 vendor/cpp-httplib/httplib.h + httplib.cpp cpp-httplib bumped to v0.47.0. Compiled automatically via FetchContent — no project source changes required
~b9555–b9621 ggml/src/ggml-cuda/ggml-cuda.cu ggml_concat on CUDA now handles F16, BF16, I8, I16, I32, I64 element types in addition to F32; active_count tracking added to the CUDA context to prevent a memory leak from lazy cudaMemGetInfo context creation. Internal CUDA backend, no project changes required
~b9555–b9621 ggml/src/ggml-vulkan/ + Vulkan shaders New VK_VALVE_shader_mixed_float_dot_product extension support for F16→F32 fused dot products (dot2_f16) in flash attention and GEMM matmul. Internal Vulkan backend, no project changes required
~b9555–b9621 ggml/src/ggml-opencl/ + OpenCL kernels New Q5_0 and Q5_1 GEMM/GEMV noshuffle kernels for Qualcomm Adreno GPUs. Internal OpenCL backend (affects opencl-android-aarch64 classifier build only); no project source changes required
~b9555–b9621 ggml/src/ggml-cuda/ssm-scan.cu Added __syncthreads() before the final reduction stage to prevent shared-memory race conditions on multi-warp SSM scan. Bug fix, internal CUDA backend, no project changes required
b9621–b9637 common/chat.cpp New Cohere2 MoE ("North Code") chat parser common_chat_params_init_cohere2moe + auto-detection (template containing <|START_TEXT|> and <|START_ACTION|>). Purely additive — compiled in the chat.cpp TU and reached through the existing specialized-template path, so the project's oaicompat_chat_params_parse picks it up automatically. No project source changes required. New feature: Cohere2 MoE reasoning + JSON tool-call chat support
b9621–b9637 common/jinja/runtime.cpp, common/jinja/value.cpp Jinja chat-template engine fixes: filter aliases count → length, d → default, e → escape; negative-step slice start/stop defaults; split raises on empty separator; replace('', x) now expands between every char. Compiled into common; improves chat-template compatibility automatically. No project source changes required
b9621–b9637 src/llama-arch.{h,cpp}, src/models/cohere2moe.cpp (new), src/models/models.h, src/llama-model.cpp, src/llama-model-saver.cpp, src/llama-vocab.cpp New LLM_ARCH_COHERE2MOE architecture (MoE + MTP/NextN) with llama_model_cohere2moe; cohere2moe tokenizer pre-type (maps to LLAMA_VOCAB_PRE_TYPE_TINY_AYA); Cohere2 dense path gains ffn_*_s NVFP4 scale tensors; tied-NVFP4-output assert relaxed to allow sidecar LM-head scales. Additive enum/struct internal to libllama; the project includes llama.h, not llama-arch.h/models.h, and switches on no arch enum. No project source changes required. New feature: loads North-Mini-Code GGUFs
b9621–b9637 ggml/src/ggml-vulkan/ + shaders Unary shaders consolidated into one templated unary.comp; new EXPM1 Vulkan op; GLU push-constants reworked (per-dim strides + misalign offsets); fastdiv L values byte-packed to stay under the 128B push-constant limit. Internal Vulkan backend — the project builds CPU/CUDA/Metal/OpenCL only, never Vulkan. No project changes required
b9621–b9637 tools/server/server-http.cpp, tools/ui/, scripts/ui-assets.cmake Optional gzip-compressed WebUI asset serving (LLAMA_UI_GZIP, llama_ui_use_gzip()). The project compiles server-context/queue/task/models but not server-http.cpp or tools/ui, so the HTTP/WebUI layer is absent from jllama. No project changes required
b9621–b9637 tools/cli/cli.cpp, .devops/*.Dockerfile, .github/, conversion/, convert_hf_to_gguf_update.py, gguf-py/, models/templates/Cohere2MoE.jinja, docs/, tests/ CLI preserved-token wiring, Docker image docker.io/ prefixes, CI labeler/release tweaks, Python GGUF converters, the new model template asset, doc typos, and upstream tests. None are compiled into jllama or shipped by the project. No project changes required
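
The grep audits recorded in the rows above can be wrapped in a small helper so that each version bump runs them the same way. The following is a minimal sketch, not part of the project: the function name audit_symbols and its return-code convention are assumptions made for illustration. It relies on grep -E so that | acts as alternation (in basic-regex mode an unescaped | is matched literally, which would make a multi-symbol audit report zero matches regardless of the source).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the project): audit_symbols PATTERN DIR...
# Returns 0 when no project file references PATTERN, 1 when at least one
# reference exists (matches are printed), and 2 when grep itself failed,
# for example because a directory does not exist.
audit_symbols() {
    local pattern="$1"
    shift
    # -r: recurse, -n: print line numbers, -E: extended regex so that
    # "a|b" means "a or b" instead of the literal string "a|b".
    grep -rnE -- "$pattern" "$@"
    case $? in
        0) return 1 ;;  # grep found a match: project source needs review
        1) return 0 ;;  # grep found nothing: break stays upstream-internal
        *) return 2 ;;  # grep error: treat the audit as inconclusive
    esac
}
```

A call such as audit_symbols 'hparams\.n_layer|n_layer_all' src/main/cpp src/test/cpp would then mirror the audit recorded for the b9495–b9543 range. Distinguishing the error case from the no-match case is a deliberate choice in this sketch: a mistyped directory should not be recorded as "zero matches".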