llama.cpp upstream breaking changes — version-range changelog

Per-version-range record of upstream API breaks observed in the b5022 → latest range, what the affected upstream files are, and the project-side fix (or "no project changes required" when the break stayed inside an upstream-compiled translation unit).

Used during llama.cpp version bumps: when upgrading, scan this file from the row matching the current pinned version forward to the target, apply any rows marked as needing project source changes, and append a new row covering the upgrade range. See the "Upgrading/Downgrading llama.cpp Version" section in ../../CLAUDE.md for the upgrade workflow.
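As a rough illustration of the scan described above, the sketch below parses version labels of the form used in the table and selects the rows relevant to an upgrade from the current pinned build to a target build. The `Row` struct and the `parse_range` and `rows_for_upgrade` helpers are hypothetical names for this example only; they are not part of the project or of llama.cpp. The selection rule (keep any row whose range ends after the current build and starts at or before the target) is an assumed, deliberately conservative reading of the ranges.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory form of one table row (illustration only).
struct Row {
    std::string version;  // e.g. "~b7217–b7433" or "~b7433"
    std::string change;   // free-text description of the break
};

// Collect every "b<digits>" build number in a version label.
// Returns {first, last}; a single-build label yields first == last.
// Returns nullopt when the label contains no build number.
static std::optional<std::pair<int, int>> parse_range(const std::string & label) {
    std::vector<int> builds;
    for (std::size_t i = 0; i < label.size(); ++i) {
        if (label[i] != 'b') continue;
        std::size_t j = i + 1;
        int value = 0;
        while (j < label.size() && label[j] >= '0' && label[j] <= '9') {
            value = value * 10 + (label[j] - '0');
            ++j;
        }
        if (j > i + 1) builds.push_back(value);  // at least one digit followed 'b'
        i = j - 1;                               // resume after the digits
    }
    if (builds.empty()) return std::nullopt;
    return std::make_pair(builds.front(), builds.back());
}

// Keep rows whose range ends after `current` and starts at or before `target`.
// Assumed rule: it may include a row that turns out not to apply, but it
// should not skip one that does.
static std::vector<Row> rows_for_upgrade(const std::vector<Row> & rows,
                                         int current, int target) {
    std::vector<Row> out;
    for (const Row & r : rows) {
        const auto range = parse_range(r.version);
        if (!range) continue;
        if (range->second > current && range->first <= target) out.push_back(r);
    }
    return out;
}
```

Calling `rows_for_upgrade(rows, 7300, 8831)` on rows labelled `~b7217–b7433`, `~b7433`, `~b8808–b8831`, and `~b9004–b9016` would keep the first three and drop the last, since its range starts after the target build.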

Version File Change
~b7217–b7433 common/common.h, include/llama-cpp.h common_init_result became common_init_result_ptr; access changed to ->model() / ->context() / ->free_context()
~b7433 common/arg.h n_parallel default changed to sentinel -1 (auto); Java bindings must resolve to 1 before model load
~b7217–b7783 common/arg.h → common/download.h common_remote_get_content and common_remote_params split into new download.h; headers changed from vector<string> to vector<pair>
~b7783 common/common.h build_info string moved into common.h; local definition must be removed
~b7783–b7858 common/chat.h common_chat_syntax renamed to common_chat_parser_params; to_json_oaicompat<json>() template removed (no template arg); ensure_tool_call_ids_set() → set_tool_call_ids()
~b7858–b7864 common/speculative.h Full redesign: common_speculative_init(ctx_tgt, ctx_dft) → common_speculative_init(params_speculative, ctx); common_speculative_gen_draft → common_speculative_draft; new common_speculative_accept(); common_speculative_params struct replaced by common_params_speculative; draft model loaded via llama_model_load_from_file into llama_model_ptr
~b7858–b7864 common/common.h params_speculative: .model.path/.hf_repo replaced by .has_dft()/.mparams_dft; new .model_dft and .cparams_dft fields; speculative.type enum added (COMMON_SPECULATIVE_TYPE_NONE)
~b7858–b7864 server.hpp (internal) slot_action.slot_id → slot_action.id_slot; llama_init_dft removed from server_context; model_dft changed from llama_model* to llama_model_ptr; slot.ctx_tgt/ctx_dft removed
~b7864 common/mtmd.h mtmd_init_params.verbosity field removed
~b7904–b8190 common/common.h params_base.model_alias changed from std::string to a container; use *model_alias.begin() instead of direct string cast
~b8778–b8808 tools/mtmd/mtmd.h MTMD_DEFAULT_IMAGE_MARKER macro removed; mtmd_image_tokens_get_nx/ny deprecated; new mtmd_decoder_pos struct + mtmd_image_tokens_get_decoder_pos(); mtmd_context_params_default() now sets image_marker = nullptr (throws "custom image_marker is not supported anymore" if non-null); upstream server adds randomized get_media_marker() in server-common.h — our server.hpp is unaffected since it does not include that header and uses mtmd_default_marker() consistently
~b8808–b8831 project CMakeLists.txt CMake target common renamed to llama-common; update target_link_libraries for jllama and jllama_test
~b8808–b8831 common/common.h → new common/build-info.h build_info std::string removed; replaced by llama_build_info() (const char*) in new build-info.h; add #include "build-info.h" in server.hpp and utils.hpp; call sites: std::string(llama_build_info()) in server.hpp (6×), llama_build_info() in jllama.cpp (1×) and utils.hpp (1×)
~b8808–b8831 ggml/src/ggml.c New ggml_graph_next_uid() calls _InterlockedIncrement64 via <intrin.h> on x86; intrinsic unavailable on 32-bit MSVC; fix: src/main/cpp/compat/ggml_x86_compat.c provides __cdecl _InterlockedIncrement64 via InterlockedIncrement64 (CMPXCHG8B), added to ggml-base via target_sources guarded by MSVC AND CMAKE_SIZEOF_VOID_P EQUAL 4
~b8838–b8841 src/llama-model.h Attention bias fields renamed: bq → wq_b, bk → wk_b, bv → wv_b, bo → wo_b, bqkv → wqkv_b; internal to llama.cpp, no impact on this project
~b8841–b8854 common/common.h common_params::clear_idle renamed to cache_idle_slots; new common_context_seq_rm_type enum + common_context_can_seq_rm() replacing common_speculative_is_compat(); get_model_endpoint() → common_get_model_endpoint()
~b8841–b8854 tools/mtmd/mtmd.h + mtmd-helper.h mtmd_decoder_pos gains z field; mtmd_image_tokens_get_decoder_pos() + mtmd_helper_image_get_decoder_pos() gain new pos_0 parameter
~b8841–b8854 project utils.hpp / server.hpp server_tokens::get_text_tokens() split: get_tokens() returns raw const llama_tokens &; new get_text_tokens() returns filtered copy (removes LLAMA_TOKEN_NULL mtmd placeholders); save/load and context-shift call sites updated to get_tokens()
~b8854–b8887 common/chat.h common_chat_msg_diff_to_json_oaicompat removed; moved to tools/server/server-chat.cpp; project defines it locally in server.hpp; importing server-chat.cpp is impractical because it pulls in convert_transcriptions_to_chatcmpl → get_media_marker → server-common.cpp
~b8854–b8887 common/common.h common_params::reasoning_budget and reasoning_budget_message moved into common_params::sampling sub-struct as reasoning_budget_tokens; update: params_base.reasoning_budget → params_base.sampling.reasoning_budget_tokens
~b8854–b8887 common/fit.h (new) llama_params_fit and llama_memory_breakdown_print removed from include/llama.h; now common_fit_params / common_memory_breakdown_print in new common/fit.h; not used directly by project
~b8887–b8913 tools/server/server-chat.h convert_transcriptions_to_chatcmpl gained a new const common_chat_templates * tmpls second parameter; not called by project's server.hpp — handled automatically by upstream server-chat.cpp
~b8887–b8913 tools/server/server-task.cpp n_discard clamped to non-negative: params.n_discard = std::max(0, params.n_discard); applied in project's server.hpp after the json_value parse
~b8887–b8913 tools/server/server-common.cpp parallel_tool_calls now defaults to caps["supports_parallel_tool_calls"] instead of hardcoded false; handled automatically by upstream file
~b8887–b8913 common/chat.h New additive common_chat_prompt_preset struct and common_chat_get_asr_prompt() function; no project changes required
~b8887–b8913 common/common.h New string_starts_with(std::string_view, char) overload added; no project changes required
~b8887–b8913 tools/mtmd/mtmd.cpp Added LLAMA_ROPE_TYPE_NONE case to rope-type switch; internal fix, no project changes required
~b8913–b8953 common/debug.h base_callback_data renamed to common_debug_cb_user_data; template common_debug_cb_eval<false/true> replaced by plain common_debug_cb_eval; not used by this project
~b8913–b8953 tools/server/server-http.h New uploaded_file struct; files map type changed from map<string, raw_buffer> to map<string, uploaded_file>; upstream server sources compiled directly — no project impact
~b8913–b8953 src/llama-quant.cpp Default quantization ftype changed from LLAMA_FTYPE_MOSTLY_Q5_1 to LLAMA_FTYPE_MOSTLY_Q8_0; upstream only
~b8913–b8953 src/models/llama.cpp, qwen3.cpp, qwen3moe.cpp Removed duplicate ggml_mul for wo_s scale (now handled exclusively by build_attn); upstream only
~b8953–b8962 common/common.h struct cpu_params → struct common_cpu_params; cpu_get_num_physical_cores() → common_cpu_get_num_physical_cores(); cpu_get_num_math() → common_cpu_get_num_math(); not used directly by project
~b8953–b8962 common/common.h common_params_speculative fully restructured with nested sub-structs: .mparams_dft/.model_dft/.cparams_dft/.n_max/.n_min/.p_split/.p_min → .draft.mparams/.draft.model/.draft.cparams/.draft.n_max/.draft.n_min/.draft.p_split/.draft.p_min; ngram fields moved to .ngram_cache/.ngram_mod/.ngram_simple/etc sub-structs; not referenced by project directly
~b8953–b8962 common/arg.h is_sparam bool split into is_sampling + is_spec; set_sparam() split into set_sampling() + set_spec(); not used by project
~b8953–b8962 tools/server/server-task.cpp task_params::to_json() drops "speculative.n_max", "speculative.n_min", "speculative.p_min" from output; only "speculative.type" remains; test SlotParamsToJson.SpeculativeFields_Present updated accordingly
~b8953–b8962 common/speculative.h New public API: common_speculative_n_max() and common_speculative_n_min() added; server-context.cpp uses these instead of direct field access; no project changes required
~b8962–b8982 common/sampling.h common_sampler_accept 3rd param renamed accept_grammar → is_generated; semantics broadened: false now also skips reasoning budget update (not just grammar); no project call sites affected
~b8962–b8982 common/reasoning-budget.h Two overloads merged: prefill_tokens variant removed; new single overload takes initial_state = REASONING_BUDGET_IDLE; prefill now fed via llama_sampler_accept() loop after init; not called directly by project
~b8962–b8982 ggml/src/ggml-cuda/ssm-conv.cuh ggml_cuda_op_ssm_conv gained optional bias_add_node param; SSM_CONV + ADD + SILU fusion now supported; internal CUDA code, no project changes required
~b8962–b8982 common/speculative.cpp Draft token confidence check (p_min) moved before push to result: low-confidence tokens are now discarded entirely rather than included then ignored; behavior fix, no project changes required
~b8962–b8982 tools/server/server-context.cpp n_draft_total accounting moved to draft generation site instead of acceptance site (bug fix); upstream only
~b8982–b8994 ggml/src/ggml-cuda.cu ggml_backend_cuda_i struct: .get_tensor_2d_async and .set_tensor_2d_async function pointers were swapped (get pointed to set impl and vice versa); corrected; internal CUDA backend, no project changes required
~b8982–b8994 ggml/src/ggml-vulkan.cpp ggml_vk_buffer_write_2d_async and ggml_vk_buffer_write_2d gained a dpitch parameter; Vulkan now implements set_tensor_2d/get_tensor_2d in buffer interface; internal backend code, no project changes required
~b8982–b8994 common/speculative.cpp Checkpoint helpers renamed: draft_create_checkpoint → create_checkpoint, draft_restore_checkpoint → restore_checkpoint; ckpt_size field removed (size computed from context directly); internal speculative module, not called by project
~b8982–b8994 common/arg.cpp CLI option typo fixed: --spec--draft-p-split → --spec-draft-p-split (extra dash removed); CLI-only, no project changes required
~b8982–b8994 src/llama-mmap.cpp Windows large-file (>2 GB) fix: ftell/fseek replaced with _ftelli64/_fseeki64; upstream only
~b8982–b8994 tools/server/httplib.h cpp-httplib bumped to v0.43.2: Windows FILE_SHARE_WRITE fix, Linux DNS cancel race fix, mbedTLS close_notify fix; upstream server header, no project changes required
~b8982–b8994 tools/server/server-context.cpp New LLAMA_TRACE env variable enables slot acceptance tracing; upstream only
~b8994–b9004 ggml/src/ggml-vulkan/ggml-vulkan.cpp vk_fa_pipeline_state gains k_type/v_type fields; get_fa_tuning_params_coopmat2 now takes separate k_type/v_type params; mixed K/V type FA pipeline creation refactored to CREATE_FA_CM2_MIXED() macro; flash_attn_cm2.comp shader uses runtime FaTypeK/FaTypeV spec constants (spec constants 12–15 added); DECODEFUNC/NEEDS_INIT_IQ_SHMEM macros removed; internal Vulkan backend, no project changes required
~b8994–b9004 ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp get_mul_mat_fast_pipeline vectorized-path condition fixed: dst->ne[1] % 4 == 0 check removed (was preventing vectorization for non-multiple-of-4 batch sizes); internal WebGPU backend, no project changes required
~b8994–b9004 ggml/src/ggml-hexagon/ Hexagon HTP backend: FA exp2 half-precision option, unary-op non-contiguous tensor fix; internal DSP backend, no project changes required
~b8994–b9004 tools/server/webui/ Major frontend component reorganization (Svelte/TypeScript); purely UI, no C++ or JNI impact
~b9004–b9016 src/llama-io.h llama_io_read_i interface changed: read(size_t)→read(void*,size_t), read_to(void*,size_t) removed, new read_tensor(tensor,offset,size) added; llama_io_write_buffer/llama_io_read_buffer now batch backend tensor ops in destructors for performance; internal state-save/load path, not called by project
~b9004–b9016 tools/server/server-context.cpp Static server_get_checkpoint() (returns by value) renamed to server_prompt_checkpoint_update() (takes server_prompt_checkpoint & by reference, in-place update); compiled directly into jllama, no call site in project code
~b9004–b9016 common/arg.cpp + docs Speculative decoding CLI args renamed: --draft/--draft-n/--draft-max and --draft-min/--draft-n-min were REMOVED (handler throws std::invalid_argument at parse time, not just deprecated); other draft flags (--draft-p-min, --ctx-size-draft, --device-draft, --gpu-layers-draft, --model-draft) kept as aliases for new canonical --spec-draft-* names. Java impact: ModelParameters.setDraftMax/setDraftMin produced removed flags → threw at model load; fixed to canonical --spec-draft-n-max/--spec-draft-n-min. Other set*Draft methods updated to canonical names for forward compatibility. Env vars also renamed (LLAMA_ARG_DRAFT_MAX → LLAMA_ARG_SPEC_DRAFT_N_MAX, etc.)
~b9004–b9016 ggml/src/ggml-cuda/ggml-cuda.cu PCI bus ID detection replaced snprintf with cudaDeviceGetPCIBusId (buffer 16→32 bytes); HIP/MUSA compat headers gain cudaDeviceGetPCIBusId alias; internal CUDA backend
~b9004–b9016 ggml/src/ggml-opencl/ Adreno MoE MXFP4: new kernel_convert_block_mxfp4_trans4_ns/restore kernels in cvt.cl; new gemm_moe_mxfp4_f32_ns, gemv_moe_mxfp4_f32_ns, moe_reorder_b, moe_sort_by_expert kernel files; GPU-side router reorder replaces CPU-side preprocessing; q_img created for GEMM path; internal OpenCL backend
~b9004–b9016 ggml/src/ggml-vulkan/ggml-vulkan.cpp GGML_VK_MAX_NODES 8192 macro removed (node limit now determined differently); internal Vulkan backend
~b9004–b9016 ggml/src/ggml-webgpu/ ggml_webgpu_row_norm_pipeline_key gains src_type/dst_type fields; GGML_OP_NORM now supported alongside GGML_OP_RMS_NORM/GGML_OP_L2_NORM; row_norm.wgsl gains SRC_TYPE/DST_TYPE parameterization and NORM two-pass algorithm; internal WebGPU backend
~b9004–b9016 src/llama-model.cpp rope_yarn_log_mul get_key call changed from required=0.0f to required=false; fixes Mistral YaRN log_mul loading; internal model loading, no project impact
~b9004–b9016 common/chat.cpp common_chat_templates_generation_prompt() extracted from common_chat_templates_apply_jinja(); internal refactor, no API change
~b9016–b9022 src/llama-model.h + src/llama-model.cpp + src/models/ llama_model becomes abstract base with pure virtual methods (load_stats, load_hparams, load_vocab, load_tensors, load_arch_hparams, load_arch_tensors, build_arch_graph); load_arch() removed; new intermediate llama_model_base class provides concrete implementations; per-arch subclasses (e.g. llama_model_llama, llama_model_gemma2) in src/models/; factory llama_model_create(llm_arch, params) and llama_model_create(ml, params) replace direct instantiation; LLAMA_LOAD_LOCALS convenience macro added; public C API (llama_model_load_from_file etc.) unchanged — no project impact
~b9016–b9022 src/models/ Many model files renamed: cohere2-iswa.cpp → cohere2.cpp, gemma2-iswa.cpp → gemma2.cpp, gemma3n-iswa.cpp → gemma3n.cpp, gemma4-iswa.cpp → gemma4.cpp, mimo2-iswa.cpp → mimo2.cpp, openai-moe-iswa.cpp → openai-moe.cpp, pangu-embedded.cpp → pangu-embed.cpp, qwen3vl-moe.cpp → qwen3vlmoe.cpp, step35-iswa.cpp → step35.cpp; new model files added (deepseek2ocr.cpp, glm-dsa.cpp, granite-moe.cpp, hunyuan-vl.cpp, jina-bert-v2/v3.cpp, lfm2moe.cpp, llama-embed.cpp, mamba2.cpp, minicpm.cpp, mistral4.cpp, nemotron-h-moe.cpp, nomic-bert.cpp, nomic-bert-moe.cpp, phimoe.cpp); upstream only, no project changes required
~b9016–b9022 tools/server/server-context.cpp server_prompt_checkpoint_update (the renamed function from b9016) static function signature changed from returning by value to taking server_prompt_checkpoint & by reference; compiled directly into jllama, no project call site
~b9016–b9022 tools/server/server-tools.cpp New built-in get_datetime tool added via new server_tool_get_datetime struct in build_tools(); no project changes required (handled automatically by compiled upstream source)
~b9016–b9022 common/chat-auto-parser-generator.cpp force_tools variable removed from build_tool_parser_json_native, build_tool_parser_tag_json, build_tool_parser_tag_tagged; content before tool calls is now always p.optional(p.content(...)) regardless of tool_choice=required; upstream only, no project changes required
~b9016–b9022 common/chat-peg-parser.h/cpp New optspace(const std::string & tag) method added to common_chat_peg_builder; makes leading/trailing spaces in reasoning tags optional; upstream only, no project changes required
~b9016–b9022 common/reasoning-budget.cpp Forced token logit now set to +INFINITY (previously left at whatever the model computed); reasoning budget enforcement is now absolute; upstream only, no project changes required
~b9016–b9022 common/chat.cpp thinking_start_tag and thinking_end_tag now trimmed via trim_whitespace(); upstream only, no project changes required
~b9016–b9022 examples/diffusion/ diffusion_generate extracted from diffusion-cli.cpp to new diffusion.h/diffusion.cpp static library; enum names prefixed: ORIGIN → DIFFUSION_ALGORITHM_ORIGIN, TIMESTEP_BASED → DIFFUSION_TRANSFER_SCHEDULE_TIMESTEP_BASED etc.; examples only, no project changes required
~b9022–b9049 include/llama.h New LLAMA_STATE_SEQ_FLAGS_ON_DEVICE 2 macro added alongside existing LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY 1; enables on-device KV cache state save/restore without host round-trip via llama_state_seq_get_size_ext/get_data_ext/set_data_ext; no project call-site changes required (not used by JNI layer)
~b9022–b9049 src/llama-context.cpp State seq data format breaking change: llama_state_seq_get_data/set_data now prepend a 4-byte magic (0xaf143cd8) + 4-byte seq_id header; state data saved with ≤b9022 is incompatible with b9049+; internal I/O classes renamed llama_io_write_buffer → llama_io_write_host, llama_io_read_buffer → llama_io_read_host; new llama_io_write_device/llama_io_read_device classes for on-device paths; no project changes required (not called by JNI layer)
~b9022–b9049 ggml/include/ggml.h New ggml_op_hint enum (GGML_HINT_DEFAULT=0, GGML_HINT_SRC0_IS_HADAMARD=1) and ggml_mul_mat_set_hint() function added for FWHT (Fast Walsh-Hadamard Transform) support; used internally in llama-graph.cpp / llama-kv-cache.cpp; no project call-site changes required
~b9022–b9049 src/llama.cpp llama_backend_init() now auto-calls ggml_backend_load_all() if no backends are yet registered; ggml_backend_load_all() removed from common_params_parser_init() (was in common/arg.cpp); no project changes required — backend loading still happens correctly
~b9022–b9049 tools/server/server-context.cpp server_prompt_checkpoint_update() gained an on_device bool parameter; speculative checkpoints now use LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE; compiled directly into jllama from upstream source — no project call-site changes required
~b9022–b9049 src/llama-model.cpp Unsupported model architecture now throws std::runtime_error instead of calling GGML_ABORT; allows callers to catch unknown-arch errors gracefully; no project changes required
~b9022–b9049 ggml/CMakeLists.txt GGML version bumped 0.10.2 → 0.11.0; no project changes required
~b9022–b9049 vendor/cpp-httplib/ Updated to 0.43.3: str2tag converted to iterative loop (eliminates recursion stack depth risk), res.body.reserve now OOM-safe; upstream server header, no project changes required
~b9049–b9071 common/chat.h contains_media() method added to common_chat_msg; to_json_oaicompat() now forces text concatenation when message contains media markers; additive change, no project impact
~b9049–b9071 src/llama-arch.h/cpp + src/llama-hparams.h New LLM_KV_ATTENTION_VALUE_SCALE KV key and f_attn_value_scale hparam field added for MiMo-V2 attention value scaling; additive, no project changes required
~b9049–b9071 src/llama.cpp llama_supports_gpu_offload() and llama_supports_rpc() now auto-call ggml_backend_load_all() if no backends are registered; behavior fix, no project changes required
~b9049–b9071 src/llama-context.cpp state_seq_set_data: removed too-strict seq_id matching guard that was gated on LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY; KV slot restorer now checks tensor shapes and view offsets before deciding to reallocate (avoids unnecessary realloc on shape-compatible updates); both are bug fixes, no project API changes required
~b9049–b9071 src/models/mimo2.cpp MiMo-V2 extended with MTP (Multi-Token Prediction) layer support via nextn_predict_layers; fused wqkv projection; attention_value_scale post-attention scaling; all internal model-loading changes, no project changes required
~b9049–b9071 ggml/src/ggml-sycl/ SYCL implementations added for CUMSUM, DIAG, FILL, SSM_SCAN, SOLVE_TRI ops; additive, no project changes required
~b9049–b9071 ggml/src/ggml-cuda/out-prod.cu CUDA outer-product uses cublasSgemmStridedBatched for batched path (dps2==1, ne2>1); HIP/MUSA compat headers gain the alias; performance improvement, no project changes required
~b9049–b9071 tools/mtmd/ MiniCPM-V 4.6 multimodal support added (PROJECTOR_TYPE_MINICPMV4_6, ViT merger graph, new tensor names); additive, no project changes required
~b9049–b9071 tools/server/webui/ LLM-based conversation title generation; CSS animation fill-mode-forwards fixes; UI-only changes compiled into upstream server, no project changes required
~b9071–b9094 ggml/src/ggml-cuda/allreduce.cu + allreduce.cuh (NEW) 2-GPU PCIe AllReduce pipeline for tensor parallelism (no NVLink required); requires Volta+ (sm70+); enabled via GGML_CUDA_ALLREDUCE env var (nccl/internal/none); compiled automatically via FetchContent, no project changes required
~b9071–b9094 ggml/src/ggml-cuda/snake.cu + snake.cuh (NEW) Fused CUDA Snake activation kernel (y = x + sin(a*x)^2 * inv_b) for BigVGAN/Vocos audio models; fuses 5-op chain MUL→SIN→SQR→MUL→ADD at graph level; F32/F16/BF16; compiled automatically, no project changes required
~b9071–b9094 ggml/src/ggml-cuda/ggml-cuda.cu Flash attention head size 192 (DKQ=192, DV=128) for MiMo-V2.5/V2.5-Pro/V2-Flash with GQA ratio 8/16; multi-GPU comm context refactored to ggml_backend_cuda_comm_context with try_allreduce function pointer; PCI bus IDs lowercased; compiled automatically, no project changes required
~b9071–b9094 ggml/src/ggml-sycl/ Q5_K reordered memory layout + MMVQ kernel for Intel GPUs; PAD op supports non-contiguous src0; dedicated growing K/V buffer for flash attention; all internal SYCL backend, no project changes required
~b9071–b9094 ggml/src/ggml-hexagon/ GATED_DELTA_NET and L2_NORM HVX-vectorized on Hexagon HTP backend; internal DSP backend, no project changes required
~b9071–b9094 src/models/sarvam.cpp (NEW) Sarvam-MoE model (sarvamai/sarvam-30b); reuses BailingMoeV2 arch; new vocab pre-type LLAMA_VOCAB_PRE_TYPE_SARVAM_MOE = 51; additive, no project changes required
~b9071–b9094 src/models/gemma4.cpp Gemma4 split gate/up experts: ffn_gate_up_exps now TENSOR_NOT_REQUIRED; fallback to separate ffn_gate_exps/ffn_up_exps; NVFP4 per_expert_scale folding; internal model-loading, no project changes required
~b9071–b9094 tools/server/server-context.h + server-context.cpp New get_model_info() method on server_context; /v1/models response now includes "n_ctx" field (value: slot_n_ctx); compiled from upstream sources, no JNI changes required (Java callers of model info APIs receive the new field transparently)
~b9071–b9094 tools/server/server-http.h + server.cpp handlers map moved from private to public in server_http_context; new register_gcp_compat() method exposes GCP/Vertex AI Prediction Protocol endpoint reading AIP_MODE/AIP_PREDICT_ROUTE/AIP_HEALTH_ROUTE/AIP_HTTP_PORT env vars; compiled from upstream sources, no project changes required
~b9071–b9094 tools/server/server-models.h + server.cpp Router child→parent model info propagation: new CMD_CHILD_TO_ROUTER_INFO command; setup_child_server() gains const json & model_info parameter; new update_loaded_info() method; server_model_meta gains loaded_info field; all internally consistent across compiled upstream sources, no project changes required
~b9071–b9094 common/reasoning-budget.cpp Forced token logit no longer set to +INFINITY; only competing tokens set to -INFINITY; internal sampler behavior change, no project changes required
~b9071–b9094 tools/server/webui/ Settings registry refactored (settings-config.ts/settings-fields.ts/settings-sections.ts merged into settings-registry.ts); MCP route #/settings/mcp → #/mcp-servers; settings route /settings/chat/[section] → /settings/[[section]]; UI-only, no project changes required
~b9094–b9102 ggml/src/ggml-cuda/allreduce.cu + allreduce.cuh Internal CUDA AllReduce pipeline refactored with ggml_cuda_ar_pipeline struct; ggml_cuda_ar_pipeline_init(devices, n_devices) / _free / _allreduce APIs; supports 2-GPU PCIe AllReduce without NCCL (Volta+ / sm70+); chunked kernel path (small tensors) vs copy-engine path (large tensors); GGML_CUDA_ALLREDUCE env = nccl/internal/none; env tuning vars GGML_CUDA_AR_COPY_THRESHOLD / GGML_CUDA_AR_COPY_CHUNK_BYTES / GGML_CUDA_AR_BF16_THRESHOLD; HIP/MUSA builds return nullptr stub; compiled automatically via FetchContent, no project changes required
~b9094–b9102 ggml/src/ggml-cuda/ggml-cuda.cu GGML_LOG_WARN_ONCE macro added; ggml_backend_cuda_comm_context gains try_allreduce fn pointer and ar_pipeline; three dispatch fns: try_allreduce_nccl, try_allreduce_internal, try_allreduce_butterfly; init chain: comm_init_nccl → comm_init_internal → comm_init_none; platform default Linux→NCCL, Windows→internal; no project changes required
~b9094–b9102 ggml/src/ggml-sycl/ggml-sycl.cpp + im2col.cpp + im2col.hpp New ggml_sycl_im2col_3d function; GGML_OP_IM2COL_3D now supported on Intel GPU via SYCL; 2D im2col kernel rewritten with tile-based IC_KH_KW thread decomposition; new SYCL_IM2COL_BLOCK_SIZE 256; additive, no project changes required
~b9094–b9102 ggml/CMakeLists.txt GGML version patch bumped 0.11.0 → 0.11.1; no project changes required
~b9094–b9102 common/sampling.cpp Bug fix in common_sampler_sample: set_logits now called at the top before backend-sampling check; backend sampling token-selection now scans all of cur_p.data to find matching token (instead of artificial 1-element array), fixing cur_p.selected for downstream n_probs; post-sampling probabilities now work correctly with backend sampling
~b9094–b9102 tools/server/server-context.cpp need_logits renamed to need_pre_sample_logits; only set when n_probs > 0 && !post_sampling_probs; backend sampling now works with post_sampling_probs; 0.0-probability tokens filtered from result.probs; compiled from upstream, no project JNI changes required
~b9094–b9102 src/llama-model.cpp n_vocab loading moved from llama_model_base::load_hparams() to per-model load_arch_hparams() (e.g. src/models/deepseek2.cpp, src/models/llama.cpp); internal model-loading refactor, no project changes required
~b9094–b9102 src/llama-model.cpp ggml/src/ggml-virtgpu/ggml-backend-device.cpp gains #include <mutex> for std::once_flag; internal backend fix, no project changes required
~b9094–b9102 vendor/cpp-httplib/httplib.cpp + httplib.h Security fix: chunk-size parsing replaced strtoul with manual hex-digit scanning to prevent overflow and reject invalid chunk extensions; version bumped to 0.43.4; compiled automatically, no project changes required
~b9102–b9103 vendor/cpp-httplib/httplib.cpp + httplib.h cpp-httplib bumped to v0.44.0: (1) RFC 9110 §5.5 compliance — header field values are no longer percent-decoded by the recipient in parse_header; Location/Referer special-casing removed; callers that need URI-component decoding must call decode_uri_component() explicitly; (2) ThreadPool constructor is now exception-safe — if thread creation fails partway through, already-started workers are signalled to exit and joined before rethrowing, preventing std::terminate from joinable threads in the destructor; compiled automatically, no project changes required
~b9103–b9106 ggml/src/ggml-vulkan/ggml-vulkan.cpp + Vulkan shaders Vulkan flash attention refactored: pipeline_flash_attn_f32_f16 changed from a per-type array of maps to a single map; mixed K/V quant types (e.g. Q4_0 K + F16 V) now supported on all Vulkan FA paths (scalar, cm1, cm2) rather than coopmat2 only; per-type SPIR-V variants replaced by two generic modules (flash_attn_f32_f16 and flash_attn_f32_f16_int8) that select K/V type at runtime via FaTypeK/FaTypeV spec constants; new flash_attn_dequant.glsl contains aliased SSBO views and an uber dequantize4() switch; the K/V type mismatch guard removed from ggml_backend_vk_device_supports_op; internal Vulkan backend refactor, no project changes required
~b9103–b9106 ggml/src/ggml-cuda/argsort.cu Added #include <cuda/iterator> for CCCL ≥ 3.1 strided-iterator path; internal CUDA backend, no project changes required
~b9103–b9106 convert_hf_to_gguf.py Mistral Medium 3.5 mmproj support: n_embd_text now reads "dim" key instead of "hidden_dim"; negative img_break_tok_id placeholders resolved from tekken.json or tokenizer.json; conversion tool only, no project changes required
~b9106–b9134 common/arg.cpp CLI option --spec-draft-ctx-size / -cd / --ctx-size-draft REMOVED — throws std::invalid_argument at parse time; ModelParameters.setCtxSizeDraft() removed; no replacement (context size now managed internally by speculative engine)
~b9106–b9134 common/arg.cpp CLI option --spec-draft-replace / --spec-replace REMOVED — throws std::invalid_argument at parse time; no corresponding Java method existed
~b9106–b9134 common/speculative.h Full redesign: common_speculative_type enum values renamed DRAFT → DRAFT_SIMPLE, EAGLE3 → DRAFT_EAGLE3; common_params_speculative.type (single enum) → .types (vector); common_speculative_n_max() / common_speculative_n_min() REMOVED; new common_speculative_init(params, n_seq) no longer takes ctx; new common_speculative_begin(spec, seq_id, prompt), common_speculative_draft(spec), common_speculative_accept(spec, seq_id, n), common_speculative_process(spec, batch) signatures; common_speculative_draft_params struct added; server sources compiled directly, no project JNI changes required
~b9106–b9134 common/common.h New common_prompt_checkpoint struct (contains data_tgt + data_dft) replaces the old server_prompt_checkpoint in server-task.h; compiled from upstream server sources, no project JNI changes required
~b9106–b9134 tools/server/server-task.cpp task_params::to_json() renamed field "speculative.type" → "speculative.types" (now serialises the vector); test SlotParamsToJson.SpeculativeFields_Present updated accordingly
~b9106–b9134 include/llama.h New LLAMA_STATE_SEQ_FLAGS_NONE = 0 macro added; additive, no project changes required
~b9134–b9145 tools/server/server-common.cpp New continue_final_message boolean request field in oaicompat_chat_params_parse; vLLM/transformers-compatible alias for the prefill-assistant heuristic — when true, the last assistant message is extended without appending an end-of-turn token; mutually exclusive with add_generation_prompt=true (throws 400); compiled from upstream server sources; InferenceParameters.setContinueFinalMessage(boolean) added
~b9134–b9145 ggml/src/ggml-sycl/ Level Zero API integration for SYCL device memory allocation (GGML_SYCL_SUPPORT_LEVEL_ZERO build option, GGML_SYCL_ENABLE_LEVEL_ZERO runtime env); reduces system RAM usage on Intel dGPUs; internal SYCL backend, no project changes required
~b9134–b9145 ggml/src/ggml-opencl/ Q5_0 and Q5_1 MoE GEMM/GEMV kernels added for Adreno (Qualcomm) GPUs; internal OpenCL backend, no project changes required
~b9134–b9145 ggml/src/ggml-cuda/allreduce.cu AllReduce accumulation now routed through float intermediate for precision (avoids BF16 truncation); internal CUDA backend, no project changes required
~b9134–b9145 ggml/src/ggml-hexagon/ GGML_UNARY_OP_TANH added to Hexagon HTP backend; internal DSP backend, no project changes required
~b9134–b9145 ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp use_subgroup_matrix condition now also checks sg_mat_k > 0 && sg_mat_n > 0 and alignment; prevents crash on devices reporting subgroup matrix support with zero k/n; internal WebGPU backend, no project changes required
~b9145–b9150 ggml/src/ggml-vulkan/ggml-vulkan.cpp Bug fix: mul_mat_l_int[i] / mul_mat_m_int[i] / mul_mat_s_int[i] / mul_mat_id_l_int[i] / mul_mat_id_m_int[i] / mul_mat_id_s_int[i] were unconditionally set to true instead of mirroring the actual device pipeline capabilities from mul_mat_l[i] etc.; now properly initialized; internal Vulkan backend bug fix, no project changes required
~b9145–b9150 src/unicode.cpp New unicode_regex_split_custom_qwen35() function registered for the Qwen 3.5 tokenizer regex pattern; uses [\p{L}\p{M}]+ letter-plus-combining-mark runs vs. Qwen2's \p{L}+; additive internal tokenizer change, no project changes required
~b9145–b9150 ggml/src/ggml-cpu/ggml-cpu-riscv64-spacemit/ SpaceMIT RISC-V IME backend major refactor: IME2 kernels, expanded quantization (Q2_K, Q3_K, Q6_K, Q8_0, Q5_0, Q5_1, Q5_K, MXFP4), TCM (Tightly Coupled Memory) pool; new source files ime2_kernels.cpp, ime_env.cpp, repack.cpp, rvv_kernels.cpp, spine_mem_pool.cpp; guarded by GGML_CPU_RISCV64_SPACEMIT build flag; no project changes required
~b9150–b9151 common/log.h New LOG_TRC macro added at LOG_LEVEL_TRACE = 4 (between INFO=3 and DEBUG=5); LOG_LEVEL_DEBUG bumped from 4 to 5; new LOG_TRCV verbosity variant; additive, no project changes required
~b9150–b9151 common/common.h + common/common.cpp New common_params_print_info(const common_params &) function: prints verbosity level, per-device memory (name, total, free), and system info at LOG_INF level; replaces the two-line pattern LOG_INF("build_info: %s\n", llama_build_info()); LOG_INF("%s\n", common_params_get_system_info(params).c_str()); — updated in jllama.cpp
~b9150–b9151 common/common.cpp common_init() now unconditionally calls common_log_set_prefix(…, true) and common_log_set_timestamps(…, true) before setting the log callback; log output will always include prefix and timestamps unless explicitly disabled with --no-log-prefix / --no-log-timestamps
~b9150–b9151 common/arg.cpp --log-prefix and --log-timestamps now also accept negated forms --no-log-prefix / --no-log-timestamps (lambda receives a bool value); backing env vars renamed LLAMA_LOG_PREFIX → LLAMA_ARG_LOG_PREFIX and LLAMA_LOG_TIMESTAMPS → LLAMA_ARG_LOG_TIMESTAMPS; Java layer does not expose these, so no project changes required
~b9150–b9151 tools/server/server-common.h New SLT_TRC and SRV_TRC macros (emit at LOG_TRC level); additive, no project changes required
~b9150–b9151 tools/server/server-context.cpp New server_slot::t_print_last field + print_timings_tg() / print_timings_pp() methods: emit periodic in-flight token-generation and prompt-processing throughput to SLT_INF (throttled to ≥100 decoded tokens and ≥3 s interval); server_context_impl constructor now calls mtmd_helper_log_set unconditionally (was guarded by !is_resume); many SLT_INF/SRV_WRN downgraded to SLT_TRC/SRV_INF; compiled from upstream, no project JNI changes required
~b9150–b9151 tools/server/server-task.cpp Several SRV_WRN calls downgraded to SRV_INF; one SRV_WRN upgraded to SRV_ERR for failed state restore; compiled from upstream, no project changes required
~b9151–b9172 tools/mtmd/clip.h clip_has_whisper_encoder() removed from public API; not referenced by project — no changes required
~b9151–b9172 tools/server/CMakeLists.txt + scripts/webui-download.cmake (new) WebUI assets no longer committed (tools/server/public/ gitignored); provisioned at build time via HF bucket (LLAMA_USE_PREBUILT_WEBUI=ON default) or built from source (LLAMA_BUILD_WEBUI); project sets LLAMA_BUILD_WEBUI=OFF CACHE BOOL "" FORCE before FetchContent to skip asset download
~b9151–b9172 common/common.h common_params::webui default made conditional on LLAMA_WEBUI_DEFAULT_ENABLED macro (falls back to true when undefined); compiled server sources unaffected
~b9151–b9172 common/reasoning-budget.cpp common_reasoning_budget_clone rewritten to use llama_sampler_init properly; pure bug fix, no API change, no project changes required
~b9151–b9172 ggml/src/ggml-cuda/fattn-mma-f16.cuh + mma.cuh AMD RDNA3 WMMA flash attention support; new DATA_LAYOUT_I_MAJOR_SCRAMBLED, tile<16,16,half2,I_MAJOR_SCRAMBLED>, extended config tables; internal CUDA backend, no project changes required
~b9151–b9172 tools/server/server-chat.cpp Non-function Responses API tools now silently skipped (continue) instead of throwing; server behavior fix, no Java API change required
~b9172–b9198 project CMakeLists.txt Option LLAMA_BUILD_WEBUI renamed to LLAMA_BUILD_UI (and LLAMA_USE_PREBUILT_WEBUI → LLAMA_USE_PREBUILT_UI); upstream keeps a backward-compat shim that forwards the old cache variable with a DEPRECATION message, so this project's set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE) still works unchanged
~b9172–b9198 common/common.h common_params::webui / webui_mcp_proxy / webui_config_json deprecated in favour of ui / ui_mcp_proxy / ui_config_json; both pairs of fields are kept and synced by common/arg.cpp, compiled upstream sources unaffected; new common_params::ctx_type and cparams.n_rs_seq fields added (default LLAMA_CONTEXT_TYPE_DEFAULT / 0), additive
~b9172–b9198 common/common.cpp + common.h common_params_print_info gained optional print_devices parameter (default true); upstream tools/server/server.cpp passes !is_router_server to skip GPU enumeration on the router process; this project does not compile server.cpp, no impact
~b9172–b9198 common/speculative.h + speculative.cpp New enum value COMMON_SPECULATIVE_TYPE_DRAFT_MTP (count is now 9); new common_speculative_need_embd() API; MTP draft implementation added (common_speculative_state_draft_mtp); --spec-type draft-mtp CLI flag added in common/arg.cpp; additive, no project changes (could be exposed later as a ModelParameters enhancement)
~b9172–b9198 include/llama.h New enum llama_context_type { LLAMA_CONTEXT_TYPE_DEFAULT, LLAMA_CONTEXT_TYPE_MTP }; new llama_context_params::n_rs_seq (recurrent-state snapshots per seq for rollback) and ctx_type fields; new llama_n_rs_seq() accessor; all additive, default-zero, no project impact
~b9172–b9198 src/llama-ext.h (new) + src/llama-context.cpp New pre-norm embedding extraction path: llama_set_embeddings_pre_norm / llama_get_embeddings_pre_norm[_ith] APIs and an embd_pre_norm output buffer in llama_context; used by the MTP draft loop only, additive
~b9172–b9198 src/llama-memory-recurrent.cpp Recurrent-state rollback support: per-seq rs_idx snapshot index and set_rs_idx() helper; tensors widened to (1 + n_rs_seq) groups; seq_rm now rolls back via snapshot when within n_rs_seq bounds. Backwards-compatible when n_rs_seq == 0 (this project's default), no project changes
~b9172–b9198 tools/server/server-context.cpp Embedding endpoint default now reads params.embd_normalize (was hard-coded 2); compiled upstream, no project changes
~b9172–b9198 tools/server/CMakeLists.txt + new tools/ui/CMakeLists.txt WebUI asset wiring moved into a new llama-ui static library; tools/server now links llama-ui; project does not build the llama-server binary (only compiles server-context.cpp / server-queue.cpp / server-task.cpp / server-models.cpp directly into jllama), so no impact. HF bucket name renamed LLAMA_WEBUI_HF_BUCKET → LLAMA_UI_HF_BUCKET (old name still honoured)
~b9172–b9198 vendor/cpp-httplib/httplib.{h,cpp} Bumped to v0.45.0: RFC 9112 §6 message-body framing — requests without Content-Length / Transfer-Encoding no longer scan for stray body bytes on persistent connections (fixes #2450 keep-alive misframing); X-Forwarded-For parser falls back to the connection remote address when the header is empty/malformed; compiled automatically, no project changes
~b9172–b9198 ggml/CMakeLists.txt GGML version bumped 0.11.1 → 0.12.0; no project changes
~b9172–b9198 ggml/src/ggml.c + ggml-cuda/gated_delta_net.cu + ggml-metal/ggml-metal.metal + ggml-vulkan/vulkan-shaders/gated_delta_net.comp ggml_gated_delta_net state tensor reshaped from 2D (S_v*S_v*H, n_seqs) to 3D (S_v*S_v*H, K, n_seqs) where K is the snapshot slot count (K=1 is final-state-only, K>1 keeps last min(n_tokens, K) per-token snapshots); internal Qwen3.5 / Qwen3-Next recurrent-attention kernel, no project changes
~b9198–b9219 common/chat.{h,cpp} New common_chat_continuation enum (NONE/AUTO/REASONING/CONTENT); new common_chat_msg::render_content(delimiter) method; new continue_final_message field on common_chat_templates_inputs; new common_chat_continuation_parse() accepts both bool and "reasoning_content"/"content" strings; common_chat_template_generation_prompt() extracted; oaicompat_chat_params_parse refactored to route the prefill-assistant heuristic through the new continuation enum. Existing bool wire-format unchanged; the new string variants are exposed via InferenceParameters.setContinueFinalMessage(ContinuationMode)
~b9198–b9219 common/hf-cache.{h,cpp} + common/arg.cpp hf_cache::migrate_old_cache_to_hf_cache() and hf_file::size field removed; the migration call in common_params_parse_ex was dropped. Internal to arg.cpp, no project impact
~b9198–b9219 common/speculative.{h,cpp} + src/llama-ext.h + src/llama-context.{h,cpp} + src/llama-cparams.h llama_set_embeddings_pre_norm(ctx, value) → llama_set_embeddings_pre_norm(ctx, value, masked) (3rd bool arg distinguishes "embeddings for outputs only" from "embeddings for every token"); new cparams.embeddings_pre_norm_masked; new common_speculative_need_embd_pre_norm() API; MTP draft path now uses pre-norm extraction. Project does not call any of these APIs (speculative decoding is configured via ModelParameters only), no source changes required
~b9198–b9219 tools/server/server-task.{h,cpp} task_result_state ctor moved from header into .cpp — now seeds chat_msg via common_chat_parse("", true, …) when !echo so the assistant prefill is not echoed back as a delta; new bool echo field on chat_parser_params (default false, populated from request body via json_value(data, "echo", false)). Project compiles server-task.cpp from upstream and does not instantiate task_result_state directly, no source changes required
~b9198–b9219 tools/server/server-context.cpp + server-models.cpp New cors_proxy_enabled boolean field added to /props and /v1/models JSON responses (set from params.ui_mcp_proxy || params.webui_mcp_proxy). Additive, no Java consumer in this project
~b9198–b9219 upstream CMakeLists.txt Backward-compat shim widened: if(DEFINED LLAMA_BUILD_WEBUI AND NOT DEFINED LLAMA_BUILD_UI) → if(DEFINED LLAMA_BUILD_WEBUI); setting the old name now always forwards to the new one (and emits the existing DEPRECATION message). Project sets only LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE (CMakeLists.txt:107), behaviour unchanged
~b9198–b9219 ggml/src/ggml-cuda/ssm-conv.cu + top-k.cu Added kernel size 15 to SSM-conv launcher (now supports 3/4/5/9/15); top-k.cu includes <cuda/iterator> for CCCL ≥ 3.1; internal CUDA backend, no project changes
~b9198–b9219 ggml/src/ggml-sycl/ggml-sycl.cpp + vecdotq.hpp SYCL GEMM now falls back to direct MKL for small problems (gemm_flops < 256³); Q6_K dot product refactored to a single scalar fast-path helper vec_dot_q6_K_q8_1_impl_mmvq_scalar; internal SYCL backend, no project changes
~b9219–b9222 ggml/src/ggml-hexagon/ + htp/pad-ops.c (new) + htp/unary-ops.c Hexagon HTP backend gains GGML_OP_PAD (HVX + optional VTCM/DMA double-buffered, both zero-pad and circular-pad variants) and GGML_OP_TRI (HVX-vectorised triangular masking) support; new HTP_OP_PAD / HTP_OP_TRI opcodes; internal Qualcomm DSP backend, no project changes
~b9219–b9222 .devops/*.Dockerfile + .github/workflows/docker.yml OCI image labels (org.opencontainers.image.*) added via BUILD_DATE/APP_VERSION/APP_REVISION build args; new skip_s390x workflow_dispatch input; manifest annotations on docker buildx imagetools create; upstream packaging/CI only, no project changes
~b9222–b9245 common/common.h + common.cpp common_init_result(common_params &, bool model_only = false) and common_init_from_params(common_params &, bool model_only = false) gain an optional model_only flag that skips context/sampler/lora/warmup setup and returns only the loaded model. Additive with default value; no project call sites in src/main/cpp/, no source changes required
~b9222–b9245 common/common.h common_params_speculative_draft defaults retuned: n_max 16→3, p_min 0.75f→0.0f. Defaults only; Java ModelParameters sets these explicitly via JSON, so behaviour is unchanged for this project
~b9222–b9245 common/speculative.{h,cpp} common_speculative_impl::accept() virtual gains a 3rd bool is_other parameter; common_speculative_accept() now broadcasts the accepted-token count to every registered impl (with is_other=true for impls that did not generate the draft). common_speculative_impl_ngram_map_k ctor signature simplified (no longer takes common_params_speculative). Each impl also gains new LOG_INF startup banners. Internal to upstream-compiled server-context.cpp; no project call sites
~b9222–b9245 common/arg.cpp + common/common.cpp + tools/fit-params/fit-params.cpp --verbosity levels relabeled: level 4 now means "trace (more info)" and level 5 means "debug"; LOG_LEVEL_DEBUG constant value moved from 4 to 5. Direct params.verbosity >= 4 comparisons in upstream common.cpp and fit-params.cpp replaced with >= LOG_LEVEL_DEBUG. Project does not reference LOG_LEVEL_DEBUG or numeric verbosity thresholds in src/main/cpp/; no source changes required
~b9222–b9245 common/arg.cpp --spec-type duplicate-arg DEPRECATED warning suppressed (the flag legitimately accepts repeated values to form the comma-list). Behaviour-only
~b9222–b9245 common/ngram-map.cpp One per-draft LOG_INF downgraded to LOG_DBG. Log-level only
~b9222–b9245 src/llama-graph.h llm_graph_params::operator== adds a third disjunct so ubatches with both token and embd arrays present compare equal (graph reuse fix for MTP pre-norm path). Internal
~b9222–b9245 src/llama-memory-recurrent.{h,cpp} + src/llama-memory-hybrid.cpp + src/llama-memory-hybrid-iswa.cpp init_batch() now forces sequential split (split_seq) instead of equal split when n_rs_seq > 0 (recurrent-state rollback is incompatible with equal splits). Internal upstream model code, no project impact
~b9222–b9245 src/models/delta-net-base.cpp + src/models/models.h + src/models/qwen35.cpp llm_build_delta_net_base::keep_rs() helper removed; conv-state and recurrent-attn paths reworked to read cparams.n_rs_seq directly and loop K = n_rs_seq + 1 snapshot slots. Comment fix in qwen35.cpp MTP layer index. All internal upstream model code
~b9222–b9245 tools/server/server-context.cpp pos_min_thold lowered by one (pos_next - n_swa → pos_next - n_swa - 1); checkpoint trigger guard relaxed from n_past < slot.prompt.n_tokens() to <=; per-slot print_timings_pp/print_timings_tg lines split into separate SLT_INF calls; new graphs reused and draft acceptance lines; n_draft_total log moved from SLT_CNT to SLT_INF. Compiled upstream-as-is, no project changes
~b9222–b9245 ggml/src/ggml-cuda/mmvq.cu calc_nwarps table tweak: Q6_K returns 2 warps (was grouped with the 8-warp tier). Internal CUDA backend
~b9222–b9245 ggml/src/ggml-hexagon/ (htp/rope-ops.c, htp/unary-ops.c, htp-ops.h, main.c, ggml-hexagon.cpp) New HTP_OP_NORM opcode (mean+variance norm); rope-ops.c adds MROPE / IMROPE position-id support via new mrope_cache_init(). Internal Qualcomm DSP backend
~b9222–b9245 ggml/src/ggml-opencl/ (ggml-opencl.cpp, kernels/cvt.cl, six new gemm_moe_q{4,5,6}_k_f32_ns + gemv_moe_q{4,5,6}_k_f32_ns kernels) Adreno MoE pipeline extended to Q4_K / Q5_K / Q6_K (image1d_buffer_t transposed layout, dedicated convert/restore kernels, GEMM + GEMV paths). Internal OpenCL backend
~b9222–b9245 ggml/src/ggml-rpc/ggml-rpc.cpp last_graph_uid field moved from ggml_backend_rpc_context (per-backend) into ggml_backend_rpc_device_context (per-device) so multiple backends sharing a device reuse cached graphs. Internal RPC backend
~b9222–b9245 ggml/src/ggml-sycl/ggml-sycl.cpp New GGML_SYCL_USE_ASYNC_MEM_OP env (default 1) decouples async USM alloc/free from the graph path. Internal SYCL backend
~b9222–b9245 ggml/src/ggml-webgpu/ggml-webgpu.cpp + wgsl-shaders/gated_delta_net.wgsl Gated-delta-net shader gains a K snapshot-count param; per-slot snapshot write path added. Internal WebGPU backend
~b9222–b9245 convert_hf_to_gguf.py, convert_lora_to_gguf.py, examples/save-load-state/save-load-state.cpp, examples/llama-eval/*, tools/cli/README.md, tools/server/README.md, docs/speculative.md, docs/backend/SYCL.md Doc/example/tooling updates only. Not compiled by this project
~b9222–b9245 tools/ui/* WebUI source reorganisation (enum file renames *.ts → *.enums.ts, new chat components, Tailwind plugin imports). Project sets LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE in CMakeLists.txt, so the UI is never built; no impact
~b9245–b9264 src/llama-chat.{h,cpp} LLM_CHAT_TEMPLATE_HUNYUAN_OCR renamed to LLM_CHAT_TEMPLATE_HUNYUAN_VL (HunyuanOCR and HunyuanVL now share one template). Not referenced by project — no source changes required
~b9245–b9264 tools/mtmd/clip-impl.h + tools/mtmd/models/ PROJECTOR_TYPE_HUNYUANOCR removed and merged into PROJECTOR_TYPE_HUNYUANVL; hunyuanocr.cpp renamed to hunyuanvl.cpp; clip graph class clip_graph_hunyuanocr renamed to clip_graph_hunyuanvl. Not referenced by project — no source changes required
~b9245–b9264 tools/mtmd/clip.h clip_is_minicpmv() and clip_is_glm() removed from public API. Not referenced by project — no source changes required
~b9245–b9264 tools/mtmd/clip.h (struct clip_context_params) New bool no_alloc field added (initialized via mtmd_context_params_default()). Additive default-zero — no project changes required
~b9245–b9264 tools/mtmd/mtmd.h New mtmd_get_memory_usage() C++ API for estimating mmproj VRAM/RAM usage. Additive, not called by project
~b9245–b9264 tools/mtmd/clip-model.h New enum pad_style { PAD_NONE, PAD_CEIL, PAD_NEAREST } replacing the bool image_resize_pad flag (allows Pillow-byte-parity nearest-integer rounding for DeepSeek-OCR). Internal to mtmd, project links mtmd as-is
~b9245–b9264 common/common.h (struct common_params_speculative_draft) New bool backend_sampling = true field — offloads draft sampling to the backend. Additive default-on; Java ModelParameters doesn't set it, so the upstream default applies. Backend sampler auto-disables when split_mode == TENSOR in src/llama-context.cpp — safe
~b9245–b9264 common/speculative.cpp common_speculative_impl_draft_mtp now registers a per-seq backend sampler chain (top-k 10) on ctx_dft via llama_set_sampler; cleaned up in destructor. Falls back to CPU sampler if llama_set_sampler fails. Internal to upstream-compiled speculative module, no project call sites
~b9245–b9264 app/ (new) New optional unified llama binary (llama-app target) dispatching to serve/cli/completion/bench. Guarded by LLAMA_BUILD_APP=OFF default — project doesn't enable it
~b9245–b9264 tools/{cli,completion,llama-bench,server}/CMakeLists.txt Each tool split into a *-impl static library (the logic) plus a thin main.cpp wrapper; the main() in cli.cpp/completion.cpp/llama-bench.cpp/server.cpp is renamed to llama_cli/llama_completion/llama_bench/llama_server and now satisfies -Wmissing-declarations via a forward decl. Project does NOT compile any of these .cpp files — only server-context.cpp, server-queue.cpp, server-task.cpp, server-models.cpp (see CMakeLists.txt:237/:302) — so no impact
~b9245–b9264 tools/server/server-context.cpp Adds mmproj memory estimation: when params_base.fit_params is set, calls mtmd_get_memory_usage(mmproj_path, mparams) and adds the per-device cost into params_base.fit_params_target before common_init_from_params. Also calls mtmd_helper_log_set(common_log_default_callback, nullptr) once when !is_resume. Compiled upstream-as-is, no project call sites
~b9245–b9264 src/llama-context.cpp New llama_context::set_sampler() short-circuits with a one-shot LLAMA_LOG_WARN and returns false when model.split_mode() == LLAMA_SPLIT_MODE_TENSOR (backend sampling not supported with tensor split). Internal safety check, no project call sites
~b9245–b9264 common/arg.cpp New CLI flags --spec-draft-backend-sampling / --no-spec-draft-backend-sampling and env LLAMA_ARG_SPEC_DRAFT_BACKEND_SAMPLING to toggle the new backend_sampling field. Not exposed by ModelParameters; could be added later as a Java-side enhancement
~b9245–b9264 ggml/src/ggml-cuda/CMakeLists.txt + common.cuh + binbcast.cu, concat.cu, cpy.cu, fattn-*.cu, gated_delta_net.cu, getrows.cu, mean.cu, mmvf.cu, mmvq.cu, norm.cu, quantize.cu, reduce_rows.cuh, rope.cu, scale.cu, set-rows.cu, softcap.cu, ssm-conv.cu, ssm-scan.cu, sumrows.cu, topk-moe.cu, unary.cu New PDL (Programmatic Dependent Launch) infrastructure: GGML_CUDA_USE_PDL build flag (CUDART ≥ 11.8, non-HIP/MUSA); ggml_cuda_pdl_sync() / ggml_cuda_pdl_lc() device helpers (active on Hopper sm_90+); ggml_cuda_kernel_launch_params + ggml_cuda_kernel_launch() host template that calls cudaLaunchKernelEx with stream-serialization attribute when GGML_CUDA_PDL env var allows. Adds 90-virtual (Hopper) to default CMAKE_CUDA_ARCHITECTURES when CUDA ≥ 11.8. Internal CUDA backend, no project changes required
~b9245–b9264 ggml/src/ggml-metal/ggml-metal-{device,ops}.cpp + ggml-metal.metal New 4-element kernel_pad_*_4 variant (currently disabled — is_c4 = false); kernel_pad rewritten with 1024-element-per-block tiling for larger tensors; kernel_cpy_* rewritten to use tpitg rows-per-threadgroup batching; Q quantization cpy paths use 256-thread limit. Internal Metal backend
~b9245–b9264 ggml/src/ggml-hexagon/htp/ (hmx-matmul-ops.c, hmx-ops.h, matmul-ops.c, main.c) HMX matmul refactor: K-loop tiled in 32-tile blocks with Q6_activation_hf_mxmem_RR_deep; the out-stationary fallback path for large M·K·N was deleted; function renames hmx_mat_mul_permuted_w16a32 → hmx_matmul_f16_f32, hmx_mat_mul_permuted_qk_0_d16a32 → hmx_matmul_q_f32, hmx_mat_mul_permuted_w16a32_batched_params_t → hmx_matmul_f16_f32_batched_params_t. HMX power-up code reorganized (HAP_power_set_HMX_v2 now combines power-on + clock in one step for __HVX_ARCH__ ≥ 75). Internal Qualcomm DSP backend
~b9245–b9264 ggml/src/ggml-opencl/ggml-opencl.cpp Lazy kernel compilation: argsort and flash_attn programs are now built only when first needed (load_cl_kernels_argsort / load_cl_kernels_flash_attn called from supports_op); new device-supported probe in ggml_opencl_is_device_supported runs at registration time; renamed ggml_cl2_init/ggml_cl2_free → ggml_cl_init/ggml_cl_free; OpenCL contexts now live as long as the process. Internal OpenCL backend
~b9245–b9264 ggml/src/ggml-vulkan/vulkan-shaders/im2col.comp Refactor: precomputed base input coords and step deltas; running pointer/index for destination; one inlined unrolled loop iteration writes BLOCK_SIZE outputs per step. Internal Vulkan backend
~b9245–b9264 src/models/delta-net-base.cpp Renamed local variables (state_in_3d → s_3d, state_3d → s_3d_pad) when reshaping the recurrent state; behaviour unchanged
~b9245–b9264 tools/mtmd/mtmd-image.cpp img_tool::resize() takes a pad_style enum (was bool add_padding); new PAD_NEAREST rounding path for Pillow byte-parity; mtmd_image_preprocessor_deepseekocr::preprocess rewritten with static constexpr resolution table and RESIZE_ALGO_BICUBIC_PILLOW + PAD_NEAREST. Internal mtmd, project links as-is
~b9245–b9264 tools/mtmd/models/deepseekocr.cpp Extracted build_sam(ggml_tensor *inp_raw) member function from the monolithic build path; FA mask casting to F16 only when flash_attn_type == CLIP_FLASH_ATTN_TYPE_ENABLED. Internal
~b9245–b9264 conversion/hunyuan.py, gguf-py/gguf/constants.py, gguf-py/gguf/tensor_mapping.py HunyuanOCR / HunyuanVL unified in conversion: VisionProjectorType.HUNYUANOCR removed; HunYuanVLForConditionalGeneration registers a single HunyuanVLVisionModel + HunyuanVLTextModel; vit.perceive.* tensor mappings now only mention HunyuanVL. Python tooling, not compiled by project
~b9245–b9264 CMakeLists.txt (upstream) New LLAMA_BUILD_APP option (default OFF); deprecation shims for LLAMA_BUILD_WEBUI/LLAMA_USE_PREBUILT_WEBUI → LLAMA_BUILD_UI/LLAMA_USE_PREBUILT_UI preserved. Project's set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE) still works unchanged
~b9245–b9264 .devops/*.Dockerfile, .github/workflows/build-and-test-snapdragon.yml, scripts/snapdragon/, docs/backend/snapdragon/, tools/cli/README.md, tools/server/README.md, tools/mtmd/tests/ Docker images add conversion/ dir; snapdragon toolchain bumped v0.3 → v0.6 with +dotprod+i8mm; mtmd test rewritten to use CER/chrF metrics; doc-only updates. Not compiled by project
~b9264–b9279 tools/server/server-context.cpp Slot-info JSON adds three additive fields (n_prompt_tokens, n_prompt_tokens_processed, n_prompt_tokens_cache) on each in-flight task; server_context_impl::destroy() now resets spec / ctx_dft / model_dft BEFORE llama_init.reset() to avoid use-after-free when a draft model holds back-references into the target context. Compiled directly into jllama from upstream — no project source changes required
~b9264–b9279 tools/server/server-models.cpp Adds #include <cstdlib> and a LLAMA_APP_CMD env-var lookup in server_model_meta::update_args() to re-inject the unified-binary subcommand into router-spawned child argv. Env var is only set by the new llama-app binary (which this project does not build), so the lookup harmlessly returns null and the code path is a no-op. Compiled upstream-as-is, no project changes
~b9264–b9279 src/llama-vocab.cpp New hybriddna BPE tokenizer model (DNA k-mer tokenization with <dna>…</dna> tag handling, k=6, OOV fallback) registered as a BPE variant; reached only when GGUF metadata declares tokenizer.model = "hybriddna". Adds a virtual destructor + virtual tokenize() to llm_tokenizer_bpe_session and a llm_tokenizer_hybriddna_session subclass; existing BPE callers unchanged. Additive, no project changes
~b9264–b9279 src/llama-graph.cpp llm_graph_input_attn_kv_iswa::set_input() / can_reuse() now guard the base and SWA tensor accesses behind if (self_k_idxs && self_k_idxs->buffer) / if (self_k_idxs_swa && self_k_idxs_swa->buffer). Fixes crashes on models with only-SWA or only-non-SWA attention layers. Internal, no project impact
~b9264–b9279 src/models/qwen35.cpp + src/models/qwen35moe.cpp MTP draft sub-graph now builds an inp_out_ids input and applies ggml_get_rows(cur, inp_out_ids) just before the head norm, so only the requested output rows are projected. Bug fix for MTP draft path; internal, no project changes
~b9264–b9279 ggml/src/ggml-backend.cpp ggml_backend_tensor_get_2d() fast-path condition fixed: now checks iface.get_tensor_2d == NULL (was incorrectly checking set_tensor_2d), so multi-copy gets correctly fall back to the per-copy loop when the backend lacks get_tensor_2d. Bug fix, no project changes
~b9264–b9279 ggml/src/ggml-vulkan/ (ggml-vulkan.cpp, new vulkan-shaders/snake.comp, vulkan-shaders-gen.cpp) New Vulkan Snake activation fusion: detects the 5-op chain MUL → SIN → SQR → MUL → ADD (matching CUDA b9094 introduction) and dispatches a single fused snake_{f32,f16,bf16} kernel y = x + sin(a*x)^2 * inv_b. New ggml_vk_can_fuse_snake() validates contiguity, 2D shape, and broadcast operands [1, C, 1, 1]. Internal Vulkan backend, no project changes
~b9264–b9279 ggml/src/ggml-metal/ggml-metal-ops.cpp + ggml-metal.metal kernel_concat / kernel_set now batch multiple small rows into one threadgroup (nrptg = min(256/ne0, ne1), capped at 256 threads/group) to improve small-row throughput; kernel_concat gains an early-return bounds check. Internal Metal backend, no project changes
~b9264–b9279 ggml/src/ggml-hexagon/ (ggml-hexagon.cpp, htp/ssm-conv.c, htp/rope-ops.c) SSM_CONV HVX kernel rewritten with VTCM-staged 32×32 fp32 in-register transpose and per-thread tiling (1 MiB VTCM budget); strictly-contiguous gate replaced with byte-stride checks (nb[0]==sizeof(float) and nb[1]==ne[0]*sizeof(float)); rope_cache_init / mrope_cache_init marked __attribute__((noinline)) to reduce code-bloat on Hexagon. Internal Qualcomm DSP backend, no project changes
~b9264–b9279 examples/save-load-state/ removed, tests/test-save-load-state.cpp added; tools/{batched-bench,fit-params,quantize,perplexity}/CMakeLists.txt The llama-save-load-state example binary was removed and re-homed as a CTest target; the four remaining standalone tools were each split into a *-impl static library + a thin main.cpp wrapper (mirroring the b9245 split of cli/completion/llama-bench/server), with the entry-point renamed to llama_batched_bench / llama_fit_params / llama_quantize / llama_perplexity to satisfy -Wmissing-declarations. Project does not compile any of these .cpp files (only server-context.cpp, server-queue.cpp, server-task.cpp, server-models.cpp — see CMakeLists.txt), so no impact
~b9264–b9279 app/ (CMakeLists.txt, llama.cpp) llama-app unified binary gains four new subcommands (batched-bench, fit-params, quantize, perplexity) and sets LLAMA_APP_CMD in the env before dispatching so that the router can re-inject the subcommand into spawned child argv. Guarded by LLAMA_BUILD_APP=OFF default — project doesn't enable it, no impact
~b9264–b9279 conversion/base.py + conversion/llama.py New _set_vocab_hybriddna() Python helper that emits a gpt2-style BPE vocab tagged as tokenizer.model = "hybriddna"; LlamaModel.set_vocab() dispatches to it when tokenizer_config.json declares "tokenizer_class": "HybridDNATokenizer"; add_prefix_space handling moved earlier in the same method. Conversion tooling only, not compiled by project
~b9279–b9284 upstream CMakeLists.txt LLAMA_BUILD_APP default flipped OFF → ON. Project's LLAMA_BUILD_TOOLS is OFF (FetchContent, LLAMA_STANDALONE=OFF), so tools/-dependent app targets are not configured; nevertheless CMakeLists.txt:108 now explicitly forces set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE) to keep the cache pinned across upgrades
~b9279–b9284 tools/{batched-bench,cli,completion,fit-params,llama-bench,perplexity,quantize,server}/CMakeLists.txt Each *-impl target switched from add_library(... STATIC ...) to default library type (becomes SHARED when BUILD_SHARED_LIBS=ON); added WINDOWS_EXPORT_ALL_SYMBOLS ON and conditional install(TARGETS ... LIBRARY) under LLAMA_TOOLS_INSTALL. Project doesn't enable LLAMA_BUILD_TOOLS, so none of these targets are configured — no impact
~b9279–b9284 src/llama-vocab.cpp + conversion/base.py HybridDNA tokenizer fix: k-mers are now stored in token_to_id with a reserved \xee\x80\x80 (U+E000) suffix to disambiguate them from identical base-vocab BPE tokens (e.g. CCCCCC); the suffix is stripped from id_to_token text after vocab load. Pure tokenizer internals, not exposed via JNI — no project changes required
~b9279–b9284 ggml/src/ggml-cuda/common.cuh PDL-launch gating now uses ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_HOPPER instead of the raw device cc — fixes false negatives when running on a Hopper device with a binary compiled for an older arch. Internal CUDA backend, no project changes required
~b9284–b9297 upstream CMakeLists.txt LLAMA_BUILD_APP default reverted from ON back to ${LLAMA_STANDALONE} (i.e. OFF for FetchContent consumers). Project's set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE) shim is now redundant but harmless; kept as defensive pin against future flips
~b9284–b9297 common/chat.h + tools/server/server-task.cpp New additive common_chat_parser_params::is_continuation field (default false); params_from_json_cmpl now parses the continue_final_message request field via common_chat_continuation_parse() and sets is_continuation when the result is non-NONE. task_result_state ctor guard tightened: the empty-prefill chat_msg = common_chat_parse("", true, ...) initialization is now gated on is_continuation && !echo (was just !echo) — i.e. the assistant-prefill suppression delta is only emitted when an actual continuation is requested. Java InferenceParameters.setContinueFinalMessage(boolean|ContinuationMode) already writes continue_final_message to the request JSON, so behaviour is wired through automatically; non-continuation requests now correctly emit the first delta instead of suppressing it
~b9284–b9297 src/llama-model.{h,cpp} + src/models/qwen35.cpp + src/models/qwen35moe.cpp NVFP4 quantization extended to MTP (Multi-Token Prediction) tensors: llama_layer_nextn gains four scale fields (eh_proj_s, eh_proj_in_s, shared_head_head_s, shared_head_head_in_s); load_tensors() loads them when the corresponding base tensor exists and is NVFP4; Qwen3.5 / Qwen3.5-MoE MTP graphs pass the scales into build_lora_mm(). Internal model-loading + graph-building changes, no project changes required
~b9284–b9297 ggml/src/ggml-backend.cpp Bug fix in ggml_backend_tensor_get_2d_async: fast-path condition checked iface.set_tensor_2d_async == NULL (typo) instead of iface.get_tensor_2d_async == NULL; multi-copy gets now correctly fall back when the backend lacks get_tensor_2d_async. Also corrects an out-of-bounds assertion message from "write" to "read". Internal backend code, no project changes required
~b9284–b9297 ggml/src/ggml-opencl/ (ggml-opencl.cpp + 17 kernel files) Adreno MoE pipeline bug fix: GEMM/GEMV kernels for MXFP4/Q4_0/Q4_1/Q4_K/Q5_0/Q5_1/Q5_K/Q6_K had a boundary-check race where the ne01 bounds check exited threads early and prevented their participation in tile-wide reductions, causing wrong results when ne01 % 64 != 0. Fixed by: (1) rounding global_size[0] up to the next multiple of 64 in ggml_cl_mul_mat_id, (2) moving the per-thread ne01 early-return in each GEMM kernel to AFTER the tile reduction, (3) adding the same early-return in the GEMV kernels and the cvt.cl trans4_ns/restore_ns kernels; alignment threshold also relaxed from ne01 % 64 == 0 to ne01 % 32 == 0 in use_adreno_moe_kernels. Internal OpenCL backend, affects the opencl-android-aarch64 classifier build only — no project source changes
~b9284–b9297 ggml/src/ggml-sycl/ (ggml-sycl.cpp, dmmv.cpp, gated_delta_net.cpp, common.hpp) (1) BF16 added to ggml_sycl_supports_dmmv() and can_use_dequantize_mul_mat_vec(); new convert_mul_mat_vec_bf16_sycl path. (2) Level Zero auto-detect moved into ggml_sycl_init()info.ext_oneapi_level_zero flag now reflects the GPU-only check (CPU devices ignored) and is used as the default for GGML_SYCL_ENABLE_LEVEL_ZERO env. (3) mmid_counting_sort_rows() replaces the per-expert atomic scan in ggml_sycl_mul_mat_id — host-side counting sort builds expert-contiguous row slices in a single pass instead of N×expert atomic scans; significant speedup for MoE dispatch. (4) Gated-delta-net kernel extended with keep_rs_t template parameter and per-token snapshot writes when K > 1, matching the CUDA/Vulkan snapshot changes from b9222. Internal SYCL backend, no project changes required
~b9284–b9297 ggml/src/ggml-vulkan/CMakeLists.txt find_package(SPIRV-Headers) switched to CONFIG REQUIRED and adds $ENV{VULKAN_SDK} to CMAKE_PREFIX_PATH; fixes detection when SPIRV-Headers ships only the CMake-config files (no FindSPIRV-Headers.cmake). Internal Vulkan build config, no project changes required
~b9284–b9297 ggml/src/ggml-zendnn/ (CMakeLists.txt, ggml-zendnn.cpp) ZenDNN bumped to ZenDNN-2026-WW19; Q8_0 weight support added for matmul and matmul_id paths via dynamic quantization (S8 compute, BF16 scales); ZenDNN matmul/matmul_id now handles GGML_TYPE_Q8_0 with FP32 src1 directly without F32→Q8_0 conversion. Internal AMD ZenDNN backend, no project changes required
~b9284–b9297 tools/perplexity/perplexity.cpp log_probs.resize(n_ctx * nv) widened to size_t(n_ctx) * nv to avoid 32-bit overflow on large context sizes. Standalone tool not compiled by project, no impact
~b9297–b9305 upstream CMakeLists.txt Top-level backward-compat shims that forwarded LLAMA_BUILD_WEBUILLAMA_BUILD_UI and LLAMA_USE_PREBUILT_WEBUILLAMA_USE_PREBUILT_UI were REMOVED (they now live only in tools/ui/CMakeLists.txt). Java impact: project's set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE) no longer hits the shim at top level. tools/ui is not configured in FetchContent mode (LLAMA_BUILD_TOOLS=OFF), so the old setting was inert in practice, but the project's CMakeLists.txt:107 was renamed to set(LLAMA_BUILD_UI OFF CACHE BOOL "" FORCE) for clarity and to defend against future flips of LLAMA_BUILD_UI default
~b9297–b9305 common/common.h LLAMA_UI_DEFAULT_ENABLED macro removed; common_params::ui default is now unconditionally true. Not referenced by project, no changes required
~b9297–b9305 common/fit.{h,cpp} common_get_device_memory_data() made non-static and exported from fit.h (was a file-local helper). fit.h now also pulls in ggml-backend.h, llama.h, and ../src/llama-ext.h. Used by upstream tools/server/server-context.cpp (compiled directly into jllama). The #include "../src/llama-ext.h" resolves relative to fit.h's location (common/../src/llama-ext.h), so no extra include paths are required. No project source changes
~b9297–b9305 tools/server/server-context.cpp New #include "fit.h" and a new draft/MTP memory measurement block: when params_base.fit_params is set AND the speculative config includes a draft model or COMMON_SPECULATIVE_TYPE_DRAFT_MTP, common_get_device_memory_data() is called against the draft model (or a copy of the target params with LLAMA_CONTEXT_TYPE_MTP for MTP) and the resulting per-device model + context + compute bytes are added to params_base.fit_params_target before the target context is fitted. Compiled directly into jllama from upstream; behaviour is additive and only triggers for speculative-decoding setups. ModelParameters.setFit(boolean) defaults to on, so this kicks in automatically when a user configures a draft model — no Java-side wiring required
~b9297–b9305 tools/server/server-context.cpp [mtmd] estimated memory usage of mmproj log line reworded to estimated worst-case memory usage; log only, no behavioural change
~b9297–b9305 tools/server/server-http.cpp UI serving path migrated from per-asset extern arrays (index_html, bundle_js, …) and the LLAMA_BUILD_UI macro to a runtime llama_ui_find_asset() lookup gated on the new LLAMA_UI_HAS_ASSETS macro generated by the new llama-ui-embed host tool. Project does NOT compile server-http.cpp (only server-context.cpp/server-queue.cpp/server-task.cpp/server-models.cpp), no impact
~b9297–b9305 tools/ui/ (CMakeLists.txt, new embed.cpp, new sources.cmake, new scripts/ui-assets.cmake, removed scripts/ui-download.cmake + scripts/xxd.cmake, removed ui.cpp+ui.h) Full UI build pipeline rewrite: xxd.cmake+ui-download.cmake replaced by a host-compiled llama-ui-embed C++ tool that generates ui.cpp/ui.h (declaring a g_assets[] table and llama_ui_find_asset() lookup, plus LLAMA_UI_HAS_ASSETS macro) from arbitrary asset files; new scripts/ui-assets.cmake orchestrates asset provisioning with a clearer priority (pre-built tools/ui/dist → npm build → HF Bucket); tools/ui is now an add_custom_target always re-run per build. The deprecation shims for LLAMA_BUILD_WEBUI/LLAMA_USE_PREBUILT_WEBUI/LLAMA_WEBUI_HF_BUCKET moved here from the top-level CMakeLists.txt. Project does not build the UI (LLAMA_BUILD_TOOLS=OFF in FetchContent mode), no impact
~b9297–b9305 ggml/include/ggml-alloc.h Comment-only API documentation update for ggml_backend_alloc_ctx_tensors_from_buft. No project changes required
~b9297–b9305 ggml/src/ggml-backend-meta.cpp Bug fix for zero-sized split tensor slices: set_tensor/get_tensor/set_tensor_async/get_tensor_async paths now continue when chunk_size_j == 0; ggml_backend_meta_alloc_ctx_tensors_from_buft now allocates a dummy buffer when all tensors in a context are zero-sized (was returning NULL and asserting); ggml_backend_buft_alloc_buffer result now GGML_ASSERTed non-null. Internal backend code, no project changes required
~b9297–b9305 ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c hvx_vec_splat_f16(hvx_vec_get_f16(...)) round-trip replaced with hvx_vec_repl_f16(...) which stays in the vector domain via vdelta (avoids store/reload through scalar). Internal Hexagon DSP backend optimization, no project changes required
~b9297–b9305 ggml/src/ggml-opencl/ggml-opencl.cpp GGML_OPENCL_PROFILING batching fix: when profiling_info reaches 2048 entries the batch is now flushed into a persistent profiling_results vector (events released, durations populated) instead of accumulating until shutdown. Also fixes missing ] closing the JSON array in cl_trace.json. Profile-only code (GGML_OPENCL_PROFILING is off by default), no project changes required
~b9305–b9333 common/common.h + common/arg.cpp common_params::checkpoint_every_nt renamed to checkpoint_min_step; default changed 8192 → 256; CLI flag -cpent/--checkpoint-every-n-tokens REMOVED (throws std::invalid_argument at parse time) and replaced by -cms/--checkpoint-min-step; env var LLAMA_ARG_CHECKPOINT_EVERY_NTLLAMA_ARG_CHECKPOINT_MIN_SPACING_NT. Java layer does not expose this flag, no project source changes required
~b9305–b9333 common/chat.h + common/chat.cpp New common_chat_msg_span and common_chat_msg_delimiter structs; new common_chat_params::message_spans field (default empty vector); new common_chat_split_by_role() function; populated for GPT-OSS, Gemma4, and all autoparser-handled templates with detected user_start/assistant_start markers; passed through server-common.cpp as message_spans JSON array in the task params; compiled from upstream, no Java changes required
~b9305–b9333 common/chat-diff-analyzer.cpp + common/chat-auto-parser.h New autoparser::user_start and autoparser::assistant_start fields auto-detected via differential template analysis; new patches for Nemotron Nano v2, Fireworks v2, Solar Open, Apriel 1.6; additive, compiled from upstream, no project changes required
~b9305–b9333 tools/server/server-task.h + tools/server/server-context.cpp New task_params::n_before_user field (default -1); server computes it from message_spans to place context checkpoints precisely at the last-user-message boundary; MTP context creation now propagates draft.cache_type_k/v; compiled directly into jllama from upstream, no project source changes required
~b9305–b9333 ggml/include/gguf.h + ggml/src/gguf.cpp New gguf_reader_callback_t typedef; new gguf_init_from_buffer(data, size, params) and gguf_init_from_callback(callback, userdata, max_chunk_read, max_expected_size, params) public APIs; internal gguf_init_from_reader() helper refactored to use a callback-based reader; additive, not used by project
~b9305–b9333 ggml/CMakeLists.txt GGML version bumped 0.12.0 → 0.13.0; no project changes required
~b9305–b9333 ggml/src/CMakeLists.txt + ggml/src/ggml-cpu/CMakeLists.txt OpenMP detection and target_link_libraries moved from ggml-cpu into ggml-base; exported ggml-config.cmake.in updated to add GGML_BASE_INTERFACE_LINK_LIBRARIES and guard OpenMP targets before appending; fixes static-lib consumers that link only ggml-base; no project source changes required
~b9305–b9333 ggml/src/ggml-alloc.c Off-by-one bug fix in ggml_dyn_tallocr_remove_block: loop ran one iteration past the last valid element; internal allocator fix, no project changes required
~b9305–b9333 ggml/src/ggml-backend-meta.cpp Rotating-pair compute containers: external views created between evals now use a stc_compute[2] double-buffer scheme so they don't slowly deplete stc_static memory; split_state_cache is now unbounded (comment documents it as FIXME); ggml_backend_meta_alloc_ctx_tensors_from_buft uses ggml_get_mem_size(ctx) for static container and 16× that for each compute container; internal multi-GPU meta backend refactor, no project changes required
~b9305–b9333 ggml/src/ggml-cuda/fwht.cu + fwht.cuh + ggml-cuda.cu New CUDA FWHT (Fast Walsh-Hadamard Transform) kernel (fwht_cuda<N>) for N = 64/128/256/512; dispatched from ggml_cuda_mul_mat when GGML_HINT_SRC0_IS_HADAMARD op hint is set on a ggml_mul_mat node (hint index 1); internal CUDA backend, no project changes required
~b9305–b9333 ggml/src/ggml-metal/ggml-metal-device.{h,m} New ggml_metal_device_id enum covering M1–M5 variants; device_id field added to ggml_metal_device_props, populated by new ggml_metal_device_id_parse() from the MTL device name string; additive, no project changes required
~b9305–b9333 ggml/src/ggml-quants.c IQ2XS and IQ3XS neighbour-search init parallelized with OpenMP (3-pass: parallel count → serial prefix-sum → parallel write); fixes a prior race on counter under OpenMP; guards with #ifdef GGML_USE_OPENMP; internal quantization init, no project changes required
~b9305–b9333 src/llama-arch.cpp LLM_TENSOR_FFN_LATENT_DOWN and LLM_TENSOR_FFN_LATENT_UP probe op changed from GGML_OP_MUL to GGML_OP_MUL_MAT; fixes Nemotron 3 Super latent projections not staying on GPU (buft probe must use MUL_MAT to keep them there); internal upstream fix, no project changes required
~b9305–b9333 vendor/cpp-httplib/httplib.{h,cpp} Bumped to v0.45.1: close_socket, shutdown_socket, Server::stop marked noexcept; macOS Keychain cert loading migrated from deprecated SecTrustCopyAnchorCertificates to SecTrustSettingsCopyCertificates (all three trust domains: system, admin, user); CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN now restricted to TARGET_OS_OSX only with compile-time #error on iOS/tvOS/watchOS; compiled automatically, no project changes required
~b9305–b9333 common/common.h New string_lcs(std::string_view a, std::string_view b) function (longest common substring via DP); additive, not used by project directly
~b9333–b9354 src/models/talkie.cpp (new) + src/llama-arch.h/cpp + src/llama-model.cpp + src/llama-vocab.cpp/h New Talkie model architecture (LLM_ARCH_TALKIE); uses NEOX rope type; embedding skip connections via out_scale; per-head Q gain via attn_q_norm; logit scale; new LLAMA_VOCAB_PRE_TYPE_MINICPM5 = 52 ("minicpm5" pre-type with ignore_merges = true); "talkie" tokenizer_pre mapped to GPT4O; Gemma4ForCausalLM registered as Gemma4 in HF conversion map; all additive, no project source changes required
~b9333–b9354 src/models/mistral3.cpp Dense FFN now passes ffn_up_s/ffn_gate_s/ffn_down_s instead of nullptr; MoE passes ffn_up_exps_s/ffn_gate_exps_s/ffn_down_exps_s to build_moe_ffn; bug fix for NVFP4 Mistral3/Mistral-MoE models; upstream only, no project changes required
~b9333–b9354 tools/server/server-http.h + server-http.cpp bool is_ssl = false field added to server_http_context; listening_address now uses https:// prefix when SSL is configured (was always http://); compiled from upstream, no project changes required
~b9333–b9354 ggml/src/ggml-sycl/ggml-sycl.cpp Virtual memory pool (ggml_sycl_pool_vmm) implemented when SYCL_EXT_ONEAPI_VIRTUAL_MEM is available; GGML_SYCL_ENABLE_VMM env var (default 1) controls it; DEBUG_SYCL_MALLOC compile flag for verbose allocation logging; vmm_granularity field in sycl_device_info; internal SYCL backend, no project changes required
~b9333–b9354 ggml/src/ggml-cuda/fwht.cu + fwht.cuh ggml_cuda_op_fwht return type changed voidbool; returns false for non-contiguous tensors or unsupported N values instead of calling GGML_ABORT; caller in ggml-cuda.cu now skips FWHT gracefully; internal CUDA backend, no project changes required
~b9333–b9354 ggml/src/ggml-vulkan/ggml-vulkan.cpp + conv2d_mm.comp Cooperative matrix 1 (cm1) path for conv2d; new CONV_SHAPE_64x128 tile size; aligned spec constant skips bounds checks when K/CRS/NPQ are tile-aligned; csh_store stages cm2/cm1 output through shared memory for coalesced global stores; internal Vulkan backend, no project changes required
~b9333–b9354 ggml/src/ggml-webgpu/ New MMVQ path for mat-vec using packed_4x8_integer_dot_product; legacy mul_mat.wgsl removed (replaced by register-tile path); new quantize_q8.wgsl and mul_mat_vec_q_acc.tmpl; vendor and dot-product capability detection at init; q8_1.m renamed to q8_1.s in WGSL struct; internal WebGPU backend, no project changes required
~b9333–b9354 upstream CI (.github/workflows/) CANN and SYCL builds disabled to save Actions resources; macOS builds moved to build-apple.yml; cache keys prefixed with cache-gha-; [no release] commit message token skips release pipeline; no project changes required
~b9354–b9437 common/common.h + common/arg.h + common/arg.cpp common_params_handle_models() return type voidbool (caller can detect skip-download misses); new common_params::skip_download; common_params::timeout_read default raised 600 → 3600. Project does not call common_params_handle_models() directly — arg parsing happens upstream; the new defaults flow through transparently
~b9354–b9437 common/download.h + common/download.cpp common_download_model() parameter list trimmed: download_mmproj/download_mtp moved into common_download_opts; new common_skip_download_exception; new opt skip_download returns -2 on missing/etag mismatch. Project does not include download.h directly, no source changes required
~b9354–b9437 tools/server/server-task.h + server-task.cpp task_params::stream default truefalse; new server_task_result_cmpl_partial::is_begin bool to let HTTP layer emit SSE headers before the first delta; to_json() returns nullptr for the begin marker (sentinel meaning "HTTP-headers-only, no body"). Project always sets stream explicitly from Java (LlamaIterator.java, LlamaModel.java) so the default change is inert. The is_begin / nullable-to_json contract DOES leak into the JNI bridge — see the row below for the required fix
~b9354–b9437 tools/server/server-context.cpp + server-queue.cpp send_partial_response() gained is_begin parameter (defaulted); SSE stream now emits a no-content opening event when stream && !return_progress (server-context.cpp:2835) so the client sees HTTP 200 + headers before first token. server_response_reader::next() 30s warn-on-cancel diagnostic message updated. Required project source change: Java_net_ladenthin_llama_LlamaModel_receiveCompletionJson in src/main/cpp/jllama.cpp called result->to_json() once and assigned response["stop"], which silently auto-promoted the nullptr to an object {"stop": false} and surfaced a phantom empty LlamaOutput to every Java streaming caller (LlamaModelTest.testGenerateAnswer and four sibling tests overran by +1 token). Fixed by wrapping the rd->next() call in a loop that skips response.is_null() results so only real events reach Java
~b9354–b9437 common/arg.cpp (env-var renames) LLAMA_LOG_*LLAMA_ARG_LOG_*, LLAMA_OFFLINELLAMA_ARG_OFFLINE, LLAMA_LOG_FILELLAMA_ARG_LOG_FILE, LLAMA_CHAT_TEMPLATE_KWARGSLLAMA_ARG_CHAT_TEMPLATE_KWARGS. CLI verbosity values relabeled (4=trace, 5=debug). The --license CLI flag was REMOVED and moved to the new llama-app licenses subcommand. Project does not expose these env vars or the --license flag through the Java API, no changes required
~b9354–b9437 src/llama.cpp llama_backend_init() device-discovery rule tightened: iGPUs are now added only when no discrete GPUs were found (was: when no devices at all). RPC servers no longer count as "found" for this purpose, so iGPU + RPC setups keep the local iGPU. Behavioural only, single-line caller in jllama.cpp unchanged
~b9354–b9437 src/llama-chat.cpp New LLM_CHAT_TEMPLATE_GRANITE_4_1 enum value + "granite-4.1" template name; granite-4.0 detection now requires the literal token g4_default_system_message in the template, otherwise it routes to 4.1. Project does not implement chat-template detection directly — routing happens inside compiled-from-upstream code, no source changes required
~b9354–b9437 vendor/cpp-httplib/ Bumped to v0.46.0: adds Client::set_no_proxy(std::vector<std::string>) with full hostname-suffix and IPv4/IPv6 CIDR matching; Server::ThreadPool constructor is exception-safe (already in v0.45.0); Client::set_proxy() now disconnects the held socket immediately so a later proxy change cannot reuse the old TLS session. Compiled automatically, no project changes required
~b9354–b9437 common/arg.cpp (additive flags) New --spec-draft-backend-sampling / --no-spec-draft-backend-sampling (env LLAMA_ARG_SPEC_DRAFT_BACKEND_SAMPLING) and --skip-download (mapped to common_params::skip_download). The default for each flag preserves current Java behaviour. Consider exposing as ModelParameters.setSpecDraftBackendSampling(boolean) and setSkipDownload(boolean) in a follow-up. Tracked under Open TODOs
~b9354–b9437 ggml/src/ggml-cuda/common.cuh GGML_CUDA_USE_PDL gating tightened: for MSVC, now requires CTK ≥ 12.3 (was 11.8) due to a compiler bug in the older Windows CUDA toolchains. Project's only CUDA build is Linux (dockcross, CUDA 13.2) so the MSVC gate has no CI impact; Windows CI builds CPU-only
~b9437–b9442 src/llama-vocab.{h,cpp} + src/llama-arch.{h,cpp} New LLAMA_VOCAB_PRE_TYPE_WHITESPACE = 53 and llm_tokenizer_whitespace_session (used by jina-v2-base-zh embeddings); new "whitespace" tokenizer_model routed as LLAMA_VOCAB_TYPE_BPE; new LLM_KV_TOKENIZER_NORMALIZER_LOWERCASE key (tokenizer.ggml.normalizer.lowercase) read into llama_vocab::impl::normalizer_lowercase; new public accessor llama_vocab::get_normalizer_lowercase(). All additive — existing tokenizers untouched; new whitespace + lowercase normalizer is consumed automatically when loading a GGUF that sets these vocabulary keys, no project source or Java API changes required
~b9437–b9442 src/llama.cpp llama_prepare_model_devices() iGPU collection now appends only the FIRST GGML_BACKEND_DEVICE_TYPE_IGPU device (prevents duplicate iGPU registration on multi-iGPU hosts). Behavioural fix, single-line caller in jllama.cpp unchanged, no project source changes required
~b9437–b9442 tools/ui/embed.cpp + tools/ui/src/... (Svelte) Webasset embedder tightened printf format specifiers (%lu%zu and PRIx64); UI settings split custom into customJson + customCss; runtime CSS injection via <svelte:head>. Project does not ship the upstream UI, no impact
~b9437–b9442 gguf-py/, conversion/ (Python) New _set_vocab_whitespace() helper and add_normalizer_lowercase() GGUF writer for the new whitespace tokenizer + lowercase normalizer keys (mirrors the vocab additions above); jina-v2 Roberta-tokenizer path now branches to whitespace when tokenizer.json declares a Whitespace pre-tokenizer. Python-side only, no impact on the Java/JNI build
~b9442–b9444 .github/workflows/build-cpu.yml (upstream CI) Upstream's CPU-build CI trigger paths narrowed to **/*.h, **/*.hpp, **/*.c, **/*.cpp (dropped **/*.cu, **/*.cuh, **/*.swift, **/*.m, **/*.metal, **/*.comp, **/*.glsl, **/*.wgsl) so GPU/Metal/Vulkan/WebGPU/Swift source edits no longer trigger the CPU build. Upstream-only CI plumbing; this project consumes none of upstream's workflow files and has its own publish.yml, no impact
~b9442–b9444 tools/server/server-http.cpp If-None-Match conditional-GET handling now also accepts the weak ETag form W/"..." (previously only a byte-for-byte match against the strong ETag was accepted); 304 Not Modified is returned for either form. This is the standalone llama-server HTTP tool, which is not linked into the JNI build (libllama + libcommon only); no project source changes required and no new Java API surface to expose
~b9444–b9490 common/common.cpp common_prompt_batch_decode() signature changed: new int n_new parameter added between all_tokens and n_past. Callers must pass the count of newly-decoded tokens for the batch. Only called inside upstream tools/server/server-context.cpp (compiled directly into jllama); no project source changes required — the new signature flows through transparently
~b9444–b9490 include/llama.h llama_set_warmup() deprecated via LLAMA_DEPRECATED macro (warmup is now handled internally during model load + first decode). Not called from jllama.cpp or any project source; the deprecation is absorbed inside upstream-compiled code, no project changes required. Any future jllama feature that needs explicit warmup control should not build on this deprecated call and should use its upstream replacement instead
~b9444–b9490 include/llama.h + src/llama-context.cpp New llama_context_params::n_outputs_max field (default -1 = derived from n_batch). Limits the number of output slots allocated per context; useful for low-memory setups that always request logits_all=false. Not exposed by project today — consider adding ModelParameters.setMaxOutputs(int) if a user requests fine-grained control. Tracked under Open TODOs
~b9444–b9490 common/arg.cpp + common/common.cpp common_params_handle_models() no longer sets hf_opts.download_mmproj = true unconditionally; instead uses opts.download_mmproj = !params.no_mmproj so the new --no-mmproj flag suppresses the multimodal projector download. Not called from project source — arg parsing happens upstream, no project changes required
~b9444–b9490 common/sampling.h + common/sampling.cpp New common_sampler_reasoning_budget_force(common_sampler *) API that triggers the budget sampler to inject the end-of-thinking token on the next sample. Paired with new common_params_sampling::reasoning_control bool: when set, arms the budget sampler so external code (e.g. a server control endpoint) can end reasoning at runtime. Not used by project today — would pair with a future InferenceParameters.setReasoningControl(boolean) setter and a LlamaModel.endReasoning(...) helper. Tracked under Open TODOs
~b9444–b9490 common/common.h + common/arg.cpp New common_params::sse_ping_interval (int32, env LLAMA_ARG_SSE_PING_INTERVAL, CLI --sse-ping-interval); server emits SSE keep-alive comments at this interval. Server-only; project does not run the upstream HTTP server (uses a direct in-process API), no Java setter required
~b9444–b9490 tools/server/server-http.cpp New POST /v1/chat/completions/control endpoint accepting {"id": "...", "action": "reasoning_end"} — tells a streaming completion to wrap up reasoning early. Server-only; not linked into the JNI build (libllama + libcommon only), no project source changes required. If exposed in Java, would map to a new LlamaModel.endReasoning(String taskId) method that calls common_sampler_reasoning_budget_force on the slot's sampler. Tracked under Open TODOs
~b9444–b9490 src/llama-hparams.h + src/llama-model.cpp Internal renames: hparams::recurrent_layer_arrhparams::is_recr_impl; hparams::swa_layershparams::is_swa_impl. Internal helper fields not part of the public API; not referenced by jllama.cpp or any project source, no changes required
~b9444–b9490 src/llama-arch.h + src/llama-arch.cpp + gguf-py/ New LLM_KV_HIDDEN_ACT GGUF key (%s.hidden_act) for ModernBert SwiGLU/GeGLU activation selection; new LLM_KV_ATTENTION_RECURRENT_LAYERS key for hybrid (recurrent + attention) models. Additive GGUF metadata keys consumed automatically when loading a GGUF that sets them; no project source or Java API changes required
~b9444–b9490 src/llama-arch.h + src/models/*.cpp (new) New model architectures: LLM_ARCH_MELLUM (JetBrains code-completion), LLM_ARCH_EXAONE4_5 (LG AI multimodal), LLM_ARCH_STEP3P7 (StepFun Step-3.7 with MTP support); LLM_ARCH_QWEN3NEXT/LLM_ARCH_QWEN35/LLM_ARCH_QWEN35MOE removed from llama_model_saver_supports_arch() allowlist. New tokenizer pre-types: LLAMA_VOCAB_PRE_TYPE_GRANITE_EMB_MULTI = 54, LLAMA_VOCAB_PRE_TYPE_MELLUM2 = 55. All additive at the architecture level — consumed automatically when loading a matching GGUF, no project source or Java API changes required
~b9444–b9490 common/arg.cpp New --mtp / --no-mtp flags (env LLAMA_ARG_MTP) now apply to Step-3.5 in addition to the existing Qwen3.5 coverage. Multi-Token Prediction is consumed inside upstream-compiled server TUs; project does not expose an MTP setter today (would map to ModelParameters.setMtp(boolean)). Tracked under Open TODOs if a user requests it
~b9444–b9490 upstream build / verification Local build with GIT_TAG b9490 was verified clean: cmake -B build configures cleanly; cmake --build build --config Release -j$(nproc) links libjllama.so with zero warnings on jllama.cpp or any project translation unit. All breaking changes in this range are absorbed inside upstream-compiled translation units (common.cpp, arg.cpp, llama.cpp, server-*.cpp, download.cpp); no project source edits required for the version bump itself
~b9490–b9495 include/llama.h + src/llama-ext.h + src/llama-context.{h,cpp} + src/llama-cparams.h + src/llama-graph.{h,cpp} + common/speculative.{h,cpp} + src/models/{qwen35,qwen35moe,step35}.cpp Mass terminology rename: pre_normnextn everywhere the pre-final-norm hidden state is referenced. Affects the public API: llama_set_embeddings_pre_norm()llama_set_embeddings_nextn(), llama_get_embeddings_pre_norm()llama_get_embeddings_nextn(), llama_get_embeddings_pre_norm_ith()llama_get_embeddings_nextn_ith(). Internal: cparams.embeddings_pre_normcparams.embeddings_nextn, cparams.embeddings_pre_norm_maskedcparams.embeddings_nextn_masked, llm_graph_result::t_h_pre_normt_h_nextn, common_speculative_need_embd_pre_norm()common_speculative_need_embd_nextn(). Qwen3.5 / Qwen3.5-MoE / Step-3.5 model graphs moved the final norm before extracting t_h_nextn (was after extracting the pre-norm hidden state). Project does not call any of these MTP-specific APIs directly — all references stay inside upstream-compiled translation units (speculative.cpp, llama-context.cpp, server-context.cpp, model TUs). Verified by grep across src/main/cpp/*.{cpp,hpp}: zero matches for any pre_norm / nextn / embeddings_pre_norm* / t_h_pre_norm* symbol. No project source changes required
~b9490–b9495 ggml/src/ggml-cuda/common.cuh + 10 CUDA kernel files New GGML_CUDA_RESTRICT macro replaces __restrict__ on kernel parameter pointers. PDL (Programmatic Dependent Launch) on Hopper requires __restrict__ to be disabled per llama.cpp PR #24030; the macro expands to nothing under GGML_CUDA_USE_PDL && __CUDA_ARCH__ >= GGML_CUDA_CC_HOPPER, otherwise to __restrict__. Kernel signatures change from direct T * __restrict__ x parameters to T * x_ptr parameter + an internal T * GGML_CUDA_RESTRICT x = x_ptr; alias line; GGML_UNUSED_VARS calls in fallback branches updated to reference the _ptr names. Internal CUDA backend change; project does not compile any CUDA kernels in the JNI build (CUDA build uses upstream sources unchanged via FetchContent). No project source changes required
~b9490–b9495 src/llama-arch.{h,cpp} + src/llama-vocab.{h,cpp} + gguf-py/gguf/constants.py + gguf-py/gguf/gguf_writer.py New LLM_KV_TOKENIZER_SUPPRESS_TOKENS GGUF key (tokenizer.ggml.suppress_tokens). When a GGUF declares this array, the loader stores it on llama_vocab::impl::suppress_tokens and exposes it via new llama_vocab::get_suppress_tokens() accessor. The Gemma4 model graph (src/models/gemma4.cpp) reads this list and appends a -INFINITY logit bias to those token IDs at the end of the forward graph (new llm_graph_input_logits_bias class). Additive: existing models without the key produce an empty suppress_tokens vector and the bias-add branch is skipped. Mirrors a HuggingFace transformers suppress_tokens parameter; specifically used for Gemma4 Unified to prevent the model from emitting `<image
~b9490–b9495 gguf-py/gguf/constants.py + gguf-py/gguf/tensor_mapping.py + tools/mtmd/clip-impl.h + tools/mtmd/clip-model.h + tools/mtmd/clip.cpp + new tools/mtmd/models/gemma4uv.cpp + new tools/mtmd/models/gemma4ua.cpp + tools/mtmd/mtmd-audio.{h,cpp} + tools/mtmd/mtmd.cpp + conversion/__init__.py + conversion/gemma.py New Gemma4 Unified vision + audio variant (Gemma4UnifiedForConditionalGeneration). Adds new projector types PROJECTOR_TYPE_GEMMA4UV and PROJECTOR_TYPE_GEMMA4UA (vision uses bigger patch size with token merging done on the conv layer; audio is encoder-free, raw 16 kHz waveform chunked into 640-sample frames). New V_ENC_EMBD_PATCH_NORM tensor enum (v.patch_norm.{bid}) and 3 indexed patch_norm_{1,2,3}_{w,b} weights on clip_model (Gemma4U uses standard PyTorch LayerNorm rather than RMSNorm before/after the patch embedding). New mtmd_audio_preprocessor_gemma4ua mel-major waveform packer (40 ms / 16 kHz frames; no FFT, no filterbank). Multimodal additions are routed through upstream mtmd-cli / mtmd-debug binaries that the project does not link; the JNI build links libllama + libcommon only. Additive at the GGUF / projector loader level: existing GGUFs without these projector types continue to load through the previous code paths. No project source or Java API changes required
~b9490–b9495 tools/ui/ (package.json, src/lib/components/app/content/MarkdownContent/, new MermaidPreview.svelte, new DialogMermaidPreview.svelte, new constants / icons / rehype plugins) Upstream llama-server web UI gains Mermaid diagram rendering: new mermaid@^11.15 dependency, lazy-loaded; new rehype plugin chain (rehype-mermaid-pre, rehype-enhance-mermaid-blocks) converts ```mermaid code fences to <pre class="mermaid"> and wraps them with copy / preview action buttons; the existing single-file MarkdownContent.svelte is split into a .svelte + sibling .css / markdown-utils.ts / markdown-handlers.ts so the new mermaid renderer can share helpers. Project does not compile or ship the upstream tools/ui (server-only feature, classpath-only JNI build); no impact
~b9490–b9495 upstream build / verification Local build with GIT_TAG b9495 was verified clean: cmake -B build -DBUILD_TESTING=ON configures cleanly, cmake --build build --config Release -j$(nproc) links libjllama.so + jllama_test with zero warnings on any project translation unit; ctest --test-dir build --output-on-failure reports 435/435 tests passing. All breaking changes in this range are renames within upstream-compiled translation units; no project source edits required for the version bump itself
~b9495–b9543 src/llama-hparams.{h,cpp} + every src/models/*.cpp (~150 files) Field hparams::n_layer (uint32_t) was split: the raw count moved to hparams::n_layer_all and hparams::n_layer() is now a member function that returns n_layer_all - n_layer_nextn (the effective non-MTP layer count). Sibling rename: hparams::nextn_predict_layers → hparams::n_layer_nextn. Every per-model TU in src/models/*.cpp was updated to call hparams.n_layer() and hparams.n_layer_nextn. New hparams::set_recr_pattern() mirror of set_swa_pattern() for hybrid recurrent architectures. New per-layer hparams::deepstack_mapping_arr (LLAMA_MAX_LAYERS, default -1) populated from new GGUF key LLM_KV_DEEPSTACK_MAPPING for Granite4-Vision-style per-layer deepstack injection. hparams::kv_only_nextn was removed (MTP heads now use a layer filter callback instead). Project does not reference any of these hparams symbols directly: grep -rnE "hparams\.n_layer|nextn_predict_layers|n_layer_nextn|n_layer_all|deepstack_mapping" src/main/cpp/ src/test/cpp/ returns zero matches (-E is required so that | is treated as alternation rather than a literal character). All consumers are inside upstream-compiled TUs (llama-model.cpp, llama-context.cpp, model TUs); no project source changes required
~b9495–b9543 include/llama.h (state-seq flags) + tools/server/server-context.cpp + examples/speculative-simple/speculative-simple.cpp The LLAMA_STATE_SEQ_FLAGS_ON_DEVICE flag was removed from the llama_state_seq_flags enum. All upstream call sites that passed LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE were updated to pass only LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY — the on-device path is now the default for partial saves/loads. Project does not call llama_state_seq_get_* / llama_state_seq_set_* directly from jllama.cpp; the only consumer in the JNI build is upstream server-context.cpp (speculative checkpoint helpers), which was updated upstream. Verified via grep -rn "LLAMA_STATE_SEQ_FLAGS_ON_DEVICE" src/ returns zero matches. No project source changes required
~b9495–b9543 new common/imatrix-loader.{h,cpp} + refactor of tools/imatrix/imatrix.cpp + tools/quantize/quantize.cpp Extracted shared imatrix-loading logic into a standalone library: new common_imatrix struct (entries, datasets, chunk_count, chunk_size, is_legacy, has_metadata) and common_imatrix_load(const std::string &, common_imatrix &) reader. New GGUF metadata keys exposed as LLM_KV_IMATRIX_DATASETS, LLM_KV_IMATRIX_CHUNK_COUNT, LLM_KV_IMATRIX_CHUNK_SIZE. The imatrix and quantize CLIs were rewritten to consume this shared loader (the legacy in-file binary parser also moved into the shared loader). Build system: common/CMakeLists.txt now includes imatrix-loader.cpp and imatrix-loader.h in libcommon, which means the JNI build picks up the new TU automatically via FetchContent + the existing target_link_libraries(jllama PRIVATE common) line. Project does not use imatrix loading from Java today (no LlamaImatrix class); the new symbols ship as additive surface area only. No project source changes required
~b9495–b9543 tools/mtmd/clip.{h,cpp} + tools/mtmd/clip-impl.h + tools/mtmd/clip-model.h + tools/mtmd/mtmd.{h,cpp} + tools/mtmd/mtmd-helper.{h,cpp} + tools/mtmd/mtmd-image.cpp + every tools/mtmd/models/*.cpp Large MTMD subsystem refactor: (1) clip_image_u8 and clip_image_f32 switched from public POD-style nx / ny / buf fields to private members with get_size() / set_size() / get_ro_buf() / cpy_buf() / get_pixel() / set_pixel() / is_placeholder() getters/setters; every model TU and image helper was updated to the new API. (2) Several public helpers were removed from tools/mtmd/clip.h: clip_embd_nbytes, clip_embd_nbytes_by_img, clip_image_u8_get_data, clip_build_img_from_pixels, clip_get_newline_tensor, clip_encode_float_image, clip_image_f32_batch_add_mel. (3) mtmd_helper_bitmap_init_from_file() and mtmd_helper_bitmap_init_from_buf() gained a required bool placeholder parameter (when true the bitmap reserves shape only, no pixel decode — used for token counting). (4) mtmd_bitmap is now a true class (private buffer + is_placeholder() / can_batch_with()); mtmd_bitmap_init() and mtmd_bitmap_init_from_audio() accept nullptr data to create placeholder bitmaps. (5) New Granite4 Vision projector type PROJECTOR_TYPE_GRANITE4_VISION and tensor enums (V_MULTI_PROJ_*, V_QF_*) for QFormer-with-window projection. (6) Qwen-VL video / temporal-merge support: clip_graph_qwen2vl::build_inp_with_temporal_merge() plus n_batch_max=2 for batch-merged consecutive image frames. Project does not link any tools/mtmd/* TUs into the JNI build (libllama + libcommon only); the JNI vision API surfaces through mtmd-helper.h and was reviewed: zero clip_image_* / removed-helper references found across src/main/cpp/ and src/test/cpp/. No project source changes required
~b9495–b9543 tools/server/server-context.cpp + tools/server/server-http.cpp + tools/server/server.cpp (new /v1/responses/input_tokens + /v1/chat/completions/input_tokens + /v1/messages/count_tokens) New token-counting endpoints (Anthropic-compatible + OpenAI Responses-API-compatible). Implementation: server_routes::handle_count_tokens() consolidates the body parsing path (chat completions, responses, anthropic messages) and emits {"input_tokens": N, "object": "response.input_tokens"}. process_mtmd_prompt() signature gained a bool is_placeholder = false parameter so token-counting can reuse the multimodal tokenization path without decoding image/audio pixels. Server-only HTTP endpoints (the JNI build links neither tools/server/server.cpp nor server-http.cpp); the only server TU we link is server-context.cpp, where the only project-visible change is the new optional process_mtmd_prompt parameter, which is defaulted — existing project call sites compile unchanged. No project source changes required
~b9495–b9543 common/chat-peg-parser.{h,cpp} + common/chat.cpp (LFM2/2.5 unified) LFM2.5's chat-completion parser was merged into the single common_chat_params_init_lfm2() (was a separate _lfm2_5 function); a bool tool_list_tokens flag toggles between the two template flavours. New helper common_chat_peg_builder::python_or_json_value() and a new bool allow_json_literals parameter on python_style_tool_calls() so LFM2.5 can accept JSON-cased true / false / null alongside the Python-cased literals. chat-peg-parser.cpp also normalises Python-cased literals (True/False/None) to their JSON equivalents during streaming. Project does not call any common_chat_peg_* or common_chat_params_init_lfm2* symbols; routing happens inside upstream-compiled chat.cpp. No project source changes required
~b9495–b9543 ggml/src/ggml-cuda/mmvq.cu + ggml/src/ggml-cpu/arch/{riscv,wasm}/quants.c + ggml/src/ggml-metal/ggml-metal-device.m + ggml/src/ggml-opencl/* + ggml/src/ggml-sycl/* + ggml/src/ggml-vulkan/* + ggml/src/ggml-webgpu/* + ggml/src/ggml-cpu/kleidiai/kleidiai.cpp Per-backend numerical & performance work: (1) CUDA mul_mat_vec_q_moe switched to GGML_CUDA_RESTRICT aliasing + PDL launch params for Hopper. (2) RISC-V Vector quants: dispatch-by-VL refactor (vl128 / vl256 / vl512 / vl1024 separate kernels for Q2_K, Q3_K, Q4_K, Q6_K, IQ1_S, IQ1_M, IQ2_S, IQ2_XS, IQ3_S, IQ3_XXS, IQ4_XS, TQ1_0, TQ2_0). (3) WebAssembly SIMD path for Q4_1. (4) Metal residency-set keep-alive polling interval tightened to 5 ms (was 500 ms). (5) OpenCL Adreno: faster concat/cpy/get_rows packed kernels for narrow tensors (<32 cols); Q6_K mat-vec rewritten with vec4 weight gather. (6) SYCL: multi-column MMVQ paths added for all quant types (ncols=2..8) used by speculative decoding's draft verification batches; should_reorder_tensor gate widened from ne[1]==1 to ne[1]<=8. (7) Vulkan: NV cooperative-matrix2 feature detection now requires every coopmat2_features.* bit; FWHT shader gains shmem fallback (Intel Windows driver bug workaround). (8) WebGPU: flash-attention split into vector / tile / subgroup-matrix variants with K/V quantization-aware staging (U32_DEQUANT_HELPERS); GRANITE_SPEECH bumped to multi-projector. (9) KleidiAI: env vars GGML_KLEIDIAI_CHUNK_MULTIPLIER & GGML_KLEIDIAI_SME thread-cap auto-detect; SME + non-SME hybrid scheduling. All purely backend-internal; project compiles backends through FetchContent with no API surface change visible to jllama.cpp. No project source changes required
~b9495–b9543 conversion/__init__.py + conversion/granite.py + conversion/gemma.py + convert_lora_to_gguf.py + gguf-py/gguf/{constants,tensor_mapping,gguf_writer}.py Python-side: new Granite4VisionMmprojModel (vision-projector for Granite4 with QFormer-window deepstack + per-projector spatial offsets + image-grid pinpoints); Gemma4 unified vision/audio conversion fix-ups for newer HF checkpoints (hidden_size falls back to audio_embed_dim; model_patch_size falls back to patch_size * pooling_kernel_size). convert_lora_to_gguf.py gained --trust-remote-code. New LLM_KV_DEEPSTACK_MAPPING writer (add_deepstack_mapping) and new clip-vision keys (KEY_PROJ_SAMPLE_QUERY_SIDE, KEY_PROJ_SAMPLE_WINDOW_SIDE, KEY_PROJ_SPATIAL_OFFSETS, KEY_FEATURE_LAYERS, KEY_IMAGE_GRID_PINPOINTS) for the Granite4 vision projector. Python-side only; no impact on the Java/JNI build. No project source changes required
~b9495–b9543 upstream build / verification Local build pending: the b9495 → b9543 bump is expected to compile cleanly given the audit above (zero grep matches in src/main/cpp/ for any of the renamed or removed symbols: hparams.n_layer, nextn_predict_layers, n_layer_nextn, n_layer_all, LLAMA_STATE_SEQ_FLAGS_ON_DEVICE, clip_image_u8/clip_image_f32 field access, clip_build_img_from_pixels, clip_get_newline_tensor, clip_image_u8_get_data, clip_embd_nbytes, clip_embd_nbytes_by_img, clip_encode_float_image, clip_image_f32_batch_add_mel, mtmd_helper_bitmap_init_from_file, mtmd_helper_bitmap_init_from_buf, common_imatrix_load). The only project-visible signature change — process_mtmd_prompt()'s new bool is_placeholder parameter — is defaulted, so existing call sites inside the project compile unchanged. All breaking changes in this range are absorbed inside upstream-compiled translation units; no project source edits required for the version bump itself
~b9543–b9549 include/llama.h + src/llama-context.{h,cpp} + src/llama-cparams.h + src/llama-ext.h New llama_context_params::ctx_other field (a source/target/parent llama_context *, default nullptr) used to share results or llama_memory between two contexts; mirrored by new cparams.ctx_other and the new staging API llama_get_ctx_other() (llama-ext.h). llama_get_memory() was moved earlier in llama-context.cpp and made null-safe (returns nullptr for a null ctx). llama_context_default_params() initializes ctx_other = nullptr. Project does not aggregate-init llama_context_params (it goes through llama_context_default_params() inside upstream server-context.cpp) and never includes llama-ext.h: grep -rnE "llama_context_params|ctx_other|llama-ext\.h|llama_get_ctx_other|llama_get_memory" src/main/cpp/ returns zero matches (-E is required so that | is treated as alternation rather than a literal character). No project source changes required
~b9543–b9549 src/llama-kv-cache.{h,cpp} + llama-kv-cache-iswa.{h,cpp} + llama-kv-cache-dsa.cpp + llama-memory.h + llama-memory-hybrid{,-iswa}.cpp KV-cache constructors gained two new parameters: llama_memory_t mem_other and layer_share_cb share (std::function<int32_t(int32_t il)> returning the source layer index to share cells from, or negative to skip). Enables one context's KV cache to share cells with another's (used by the new Gemma4-assistant MTP head). llama_memory_params gained a mem_other field. All call sites (iswa/dsa/hybrid wrappers, llama_model::create_memory) updated upstream; the project never constructs a llama_kv_cache* or llama_memory_* directly. No project source changes required
~b9543–b9549 src/llama-arch.{h,cpp} + new src/models/gemma4-assistant.cpp + src/models/models.h + src/llama-model.{h,cpp} + src/llama-hparams.{h,cpp} + src/llama-graph.{h,cpp} + gguf-py/ + conversion/gemma.py New model architecture LLM_ARCH_GEMMA4_ASSISTANT ("gemma4-assistant") — a NextN/MTP draft "assistant" head that shares the target Gemma4's KV cache and reads its post-final-norm hidden state. New tensors LLM_TENSOR_NEXTN_PROJ_PRE/NEXTN_PROJ_POST (nextn.pre_projection/post_projection) plus model-level nextn_proj_pre/nextn_proj_post; new hparams n_embd_inp_impl (input-embedding dim override, honoured by n_embd_inp()) and graph field n_layer_nextn. Python conversion registers Gemma4AssistantForCausalLM/Gemma4UnifiedAssistantForCausalLM. This is the headline new feature; it is a speculative-decoding / MTP mechanism, which this project tracks as deferred-by-policy (see Open TODOs / spec-draft-backend-sampling + MTP). Consumed entirely inside upstream-compiled TUs — loading a non-assistant GGUF is unaffected. No project source changes required to build; exposing MTP through the Java API remains the existing deferred TODO
~b9543–b9549 common/chat.cpp + new models/templates/LFM2.5-8B-A1B.jinja LFM2 chat-template handling: prior-turn reasoning_content is now copied into the template's thinking field, and <think> reasoning extraction is gated on the template source actually containing <think> (and no longer on enable_thinking). New LFM2.5-8B-A1B template + parser test consolidation. Routing happens inside upstream-compiled chat.cpp; the project calls no common_chat_params_init_lfm2* symbol. Handled automatically when such a model is loaded; no project source or Java API changes required
~b9543–b9549 common/arg.cpp + common/speculative.cpp + src/llama-graph.cpp common_params_handle_models() mmproj auto-download now also requires params.mmproj.path.empty() && params.mmproj.url.empty() (an explicitly-specified mmproj is no longer re-downloaded). speculative.cpp MTP path adds a shared-memory fast path (is_mem_shared = llama_get_ctx_other(ctx_dft) == ctx_tgt) that skips the catch-up decode and reuses the target position for draft tokens (Gemma4 assistant), and switched to llama_model_n_embd_out() for the MTP row width. llama-graph.cpp moved the set_input_kq_mask / can_reuse_kq_mask calls out of the k-idxs-buffer guard (iswa/hybrid-iswa mask bugfix). All inside upstream-compiled TUs; no project source changes required
~b9543–b9549 tools/server/server-context.cpp (project-linked) The one project-linked server TU changed: now #includes ggml-cpp.h and ../../src/llama-ext.h; sets cparams.ctx_other = ctx_tgt for MTP draft/MTP contexts; moved the ctx_dft_seq_rm_type = common_context_can_seq_rm(...) assignment to after context init (guarded by if (ctx_dft)); downgraded the spec memory-measure failure log from SRV_ERR to SRV_WRN; and gated the mtmd draft-processing block on llama_get_ctx_other(ctx_dft) != ctx_tgt. All changes are internal to the TU and the new includes resolve against the FetchContent'd src/ and ggml headers. Compiles into jllama unchanged from the project's side. No project source changes required
~b9543–b9549 .github/workflows/docker.yml (upstream CI) Upstream's cuda13 Docker image bumped from CUDA 13.1.1 to 13.3.0. Upstream's own CI only; this project ships its own publish.yml and pins CUDA 13.2 via .github/build_cuda_linux.sh (see CLAUDE.md "Upgrading CUDA Version"). No impact
~b9543–b9549 project CMakeLists.txt (pre-existing latent bug, fixed in this bump) Not an upstream change — surfaced while build-testing this bump locally. The OS/arch detection block invoked net.ladenthin.llama.OSInfo, but the class had moved to net.ladenthin.llama.loader.OSInfo in the earlier layered-package restructure, so cmake -B build failed with "Could not determine OS name" on any host that does not pass -DOS_NAME/-DOS_ARCH explicitly (CI does, which is why it went unnoticed). Fixed both execute_process invocations (--os and --arch) to the loader.OSInfo FQN. Same stale-FQN-after-restructure class as the earlier spotbugs-exclude.xml / PIT-targetClasses repairs — the standing reminder to re-validate every FQN-bearing config after a package move now also covers CMakeLists.txt
~b9543–b9549 upstream build / verification Local build with GIT_TAG b9549 verified clean on Linux x86_64: cmake -B build -DBUILD_TESTING=ON configures cleanly (after the loader.OSInfo FQN fix above), cmake --build build --config Release -j$(nproc) links libjllama.so + jllama_test with zero warnings on any project translation unit (incl. the changed server-context.cpp), and ctest --test-dir build --output-on-failure reports 435/435 tests passing. All upstream breaking changes in this range are absorbed inside upstream-compiled translation units; no project C++ source edits were required for the version bump itself
~b9549–b9553 common/sampling.h + common/sampling.cpp + common/arg.cpp + common/common.cpp + tools/server/server-task.cpp common_sampler_types_from_names() dropped its bool allow_alt_names parameter — the signature is now common_sampler_types_from_names(const std::vector<std::string> & names). The body was rewritten to (a) auto-generate kebab-case (top-k) and no-dash (topk) aliases from the canonical snake_case names, plus misc aliases (nucleus→top_p, temp→temperature, typ→typical_p), and (b) lowercase the input so matching is case-insensitive; aliases are now always accepted (the old gate is gone). All three call sites were updated upstream (arg.cpp / common.cpp dropped the , true arg; server-task.cpp dropped the , false arg). Project impact: none at the source levelgrep -rn common_sampler_types_from_names src/main/cpp src/test/cpp returns zero matches; the symbol is reached only through the upstream-compiled server-task.cpp linked into jllama. New behaviour exposed for free: because server-task.cpp previously passed allow_alt_names=false, the project's InferenceParameters samplers JSON array only matched canonical names like top_k; it now also accepts top-k / topk / nucleus / temp / typ and is case-insensitive (TOP_K, Min-P). Pinned by 5 new ParamsFromJsonCmpl.Samplers_* tests in test_server.cpp
~b9549–b9553 src/llama-kv-cache.cpp + src/llama-kv-cache.h + src/llama-kv-cells.h KV-cache shared-cells refactor (continues TAG_KV_CACHE_SHARE_CELLS, used by the Gemma4-assistant MTP head): the v_cells member changed from a by-value std::vector<llama_kv_cells> to a std::shared_ptr<llama_kv_cells_vec> v_cells_impl plus a llama_kv_cells_vec & v_cells reference, so a target cache now views the source cache's cells instead of copying them in apply_ubatch(); the constructor also clamps kv_size down to the shared source's size. New type alias using llama_kv_cells_vec = std::vector<llama_kv_cells>; in llama-kv-cells.h. All of these are internal src/ headers that the JNI build does not include (the project pulls public llama.h / llama-cpp.h, never llama-kv-cache.h / llama-kv-cells.h): grep -rnE "llama_kv_cells|llama-kv-cache" src/main/cpp src/test/cpp → zero matches (-E is required so that | is treated as alternation rather than a literal character). No project source changes required
~b9549–b9553 conversion/mistral.py + convert_hf_to_gguf.py Python conversion-script robustness only: hparams["llama_4_scaling"] and "moe" in hparams replaced with hparams.get(...) / is not None guards so a present-but-null key no longer crashes conversion. Python tooling, not part of the JNI build. No impact
~b9549–b9553 upstream build / verification Local build with GIT_TAG b9553 verified clean on Linux x86_64: cmake -B build -DBUILD_TESTING=ON configures cleanly, cmake --build build --config Release -j$(nproc) links libjllama.so + jllama_test with zero warnings on any project translation unit, and ctest --test-dir build --output-on-failure reports 440/440 tests passing (435 prior + 5 new Samplers_* tests). The sole breaking change in this range (the common_sampler_types_from_names signature) is absorbed inside upstream-compiled translation units; no project C++ source edits were required for the version bump itself
~b9553–b9555 .devops/intel.Dockerfile + ggml/src/ggml-metal/ggml-metal-device.cpp + tests/test-backend-ops.cpp Tiny maintenance bump: no API change and no new feature. (1) intel.Dockerfile: Intel GPU userspace driver pins bumped (IGC v2.20.5 → v2.34.4, compute-runtime 25.40.35563.10 → 26.18.38308.1, IGDGMM 22.8.2 → 22.10.0) with the old multi-GPU-safe versions commented out; this affects upstream's own Docker image only, because this project ships its own publish.yml and does not consume .devops/. No impact. (2) ggml-metal-device.cpp: bugfix to the Metal im2col pipeline selector; the standard-vs-_ext kernel choice now keys off the actual conv-kernel footprint (KH*KW, with KH = is_2D ? ne01 : 1, KW = ne00) instead of the raw ne00*ne01 product, fixing kernel selection for 1-D convolutions. Backend-internal Metal TU compiled via FetchContent; no API surface visible to jllama.cpp, and only affects the macOS/Metal backend at runtime. (3) tests/test-backend-ops.cpp: one extra test_im2col case ({3000,384,1,1} / {3,384,384,1}) added (upstream test only, not linked into the JNI build). No project source changes required; no new Java-API-exposable feature. Build verification deferred to CI (publish.yml) / a developer host as usual
~b9555–b9621 ggml/include/ggml.h + ggml/src/ggml.c + ggml/src/ggml-cuda/gated_delta_net.cu + ggml/src/ggml-metal/ggml-metal.metal + ggml/src/ggml-vulkan/vulkan-shaders/gated_delta_net.comp ggml_gated_delta_net state tensor reshaped again: the 3D (S_v*S_v*H, K, n_seqs) layout is now the 4D [S_v, S_v, H, n_seqs], and the snapshot count is passed as an explicit trailing int64_t K parameter (K=1 is final-state-only). Signature: ggml_gated_delta_net(ctx, q, k, v, g, beta, state, K); not counting ctx, K is the seventh argument (previously six: q, k, v, g, beta, state). Snapshot-slot ordering also flipped to most-recent-first. Internal Qwen3.5 / Qwen3-Next recurrent-attention kernel; project does not call ggml_gated_delta_net directly, so no project source changes required
~b9555–b9621 ggml/include/ggml.h New ggml_col2im_1d(ctx, a, s0, oc, p0) function and GGML_OP_COL2IM_1D enum value added; GGML_OP_COUNT incremented 96 → 97. Additive; not called by project — no project source changes required
~b9555–b9621 common/fit.h + tools/server/server-context.cpp common_get_device_memory_data() return type changed: now returns common_device_memory_data_vec (typedef for std::vector<common_device_memory_data>). New common_device_memory_data struct carries .total, .free, .model, .context, .compute fields directly (previously the caller reached them via .mb.model etc.). fit.h also dropped its #include "ggml-backend.h" and #include "../src/llama-ext.h" lines (those types are no longer needed at the header level). Consumed exclusively in upstream-compiled server-context.cpp (field-accessor update from .mb.model.model etc. was applied upstream); project does not include fit.h or call common_get_device_memory_data() directly — no project source changes required
~b9555–b9621 tools/mtmd/mtmd-helper.h + tools/mtmd/mtmd-helper.cpp + tools/server/server-common.cpp mtmd_helper_bitmap_init_from_file() and mtmd_helper_bitmap_init_from_buf() return type changed: both now return mtmd_helper_bitmap_wrapper struct (contains bitmap + video_ctx fields) instead of mtmd_bitmap*. All call sites updated in upstream server-common.cpp. Project does not call these functions from src/main/cpp/ (verified via grep: zero matches) — no project source changes required
~b9555–b9621 tools/mtmd/mtmd-helper.h + tools/mtmd/mtmd-helper.cpp New video pipeline: mtmd_helper_video_context, mtmd_helper_video_* API family (init/free/decode), ffmpeg-based frame extraction. New --video CLI flag in common/arg.cpp; new input_video content type in server-common.cpp. Multimodal helper additions flow through the upstream-compiled mtmd-helper.cpp and server-common.cpp; project does not reference any mtmd_helper_video_* symbol — no project source changes required. Could be exposed in a future Java API as InferenceParameters.setVideoPath(String)
~b9555–b9621 common/common.h New common_params fields: path_prompts_log_dir (prompt-logging output directory, string) and mtmd_batch_max_tokens (multimodal batch token limit, default 1024). Both additive with harmless defaults. Not surfaced by ModelParameters today — could be added in a future enhancement. No project source changes required
~b9555–b9621 src/llama-ext.h New EAGLE3 speculative-decoding support APIs: llama_set_embeddings_layer_inp(ctx, lid, value), llama_get_embeddings_layer_inp(ctx, lid), llama_model_target_layer_ids(model)const int32_t*, llama_model_target_layer_ids_n(model)uint32_t. New LLM_ARCH_EAGLE3 model architecture; new llama_model_eagle3 struct in upstream model sources. EAGLE3 enables full encoder+decoder graph implementation for speculative decoding. All consumed inside upstream-compiled speculative.cpp and model TUs; project does not reference any of these symbols — no project source changes required. Could be exposed later as a speculative-decoding backend type in ModelParameters
~b9555–b9621 src/llama-graph.h + src/llama-graph.cpp llm_graph_result::set_outputs() signature changed: it now takes a const llm_graph_params & parameter (previously it took no parameters). New t_layer_inp vector added to llm_graph_result for layer-input embedding extraction (used by EAGLE3). Internal graph-building API; not called from project sources, so no project source changes required
~b9555–b9621 src/llama-context.cpp llama_context now initializes embeddings_layer_inp storage for EAGLE3 layer-input extraction; n_outputs_max is forced to n_batch when llama_model_has_encoder() returns true (encoder models always need all outputs). Internal context lifecycle; no project sources reference these fields — no project source changes required
~b9555–b9621 vendor/cpp-httplib/httplib.h + httplib.cpp cpp-httplib bumped to v0.47.0. Compiled automatically via FetchContent — no project source changes required
~b9555–b9621 ggml/src/ggml-cuda/ggml-cuda.cu ggml_concat on CUDA now handles F16, BF16, I8, I16, I32, I64 element types in addition to F32; active_count tracking added to the CUDA context to prevent a memory leak from lazy cudaMemGetInfo context creation. Internal CUDA backend; no project changes required
~b9555–b9621 ggml/src/ggml-vulkan/ + Vulkan shaders New VK_VALVE_shader_mixed_float_dot_product extension support for F16→F32 fused dot products (dot2_f16) in flash attention and GEMM matmul. Internal Vulkan backend, no project changes required
~b9555–b9621 ggml/src/ggml-opencl/ + OpenCL kernels New Q5_0 and Q5_1 GEMM/GEMV noshuffle kernels for Qualcomm Adreno GPUs. Internal OpenCL backend (affects opencl-android-aarch64 classifier build only); no project source changes required
~b9555–b9621 ggml/src/ggml-cuda/ssm-scan.cu Added __syncthreads() before the final reduction stage to prevent shared-memory race conditions on multi-warp SSM scan. Bug fix, internal CUDA backend, no project changes required
b9621–b9637 common/chat.cpp New Cohere2 MoE ("North Code") chat parser common_chat_params_init_cohere2moe + auto-detection (template containing <|START_TEXT|> and <|START_ACTION|>). Purely additive — compiled in the chat.cpp TU and reached through the existing specialized-template path, so the project's oaicompat_chat_params_parse picks it up automatically. No project source changes required. New feature: Cohere2 MoE reasoning + JSON tool-call chat support
b9621–b9637 common/jinja/runtime.cpp, common/jinja/value.cpp Jinja chat-template engine fixes: filter aliases countlength, ddefault, eescape; negative-step slice start/stop defaults; split raises on empty separator; replace('', x) now expands between every char. Compiled into common; improves chat-template compatibility automatically. No project source changes required
b9621–b9637 src/llama-arch.{h,cpp}, src/models/cohere2moe.cpp (new), src/models/models.h, src/llama-model.cpp, src/llama-model-saver.cpp, src/llama-vocab.cpp New LLM_ARCH_COHERE2MOE architecture (MoE + MTP/NextN) with llama_model_cohere2moe; cohere2moe tokenizer pre-type (maps to LLAMA_VOCAB_PRE_TYPE_TINY_AYA); Cohere2 dense path gains ffn_*_s NVFP4 scale tensors; tied-NVFP4-output assert relaxed to allow sidecar LM-head scales. Additive enum/struct internal to libllama; the project includes llama.h, not llama-arch.h/models.h, and switches on no arch enum. No project source changes required. New feature: loads North-Mini-Code GGUFs
b9621–b9637 ggml/src/ggml-vulkan/ + shaders Unary shaders consolidated into one templated unary.comp; new EXPM1 Vulkan op; GLU push-constants reworked (per-dim strides + misalign offsets); fastdiv L values byte-packed to stay under the 128B push-constant limit. Internal Vulkan backend — the project builds CPU/CUDA/Metal/OpenCL only, never Vulkan. No project changes required
b9621–b9637 tools/server/server-http.cpp, tools/ui/, scripts/ui-assets.cmake Optional gzip-compressed WebUI asset serving (LLAMA_UI_GZIP, llama_ui_use_gzip()). The project compiles server-context/queue/task/models but not server-http.cpp or tools/ui, so the HTTP/WebUI layer is absent from jllama. No project changes required
b9621–b9637 tools/cli/cli.cpp, .devops/*.Dockerfile, .github/, conversion/, convert_hf_to_gguf_update.py, gguf-py/, models/templates/Cohere2MoE.jinja, docs/, tests/ CLI preserved-token wiring, Docker image docker.io/ prefixes, CI labeler/release tweaks, Python GGUF converters, the new model template asset, doc typos, and upstream tests. None are compiled into jllama or shipped by the project. No project changes required
b9637–b9642 ggml/src/ggml-cuda/ggml-cuda.cu ggml_backend_cuda_device_supports_op for GGML_OP_REPEAT tightened: the supported-types check changed from a blocklist (!= I32 && != I16) to an allowlist (== F32 || == F16), because the CUDA REPEAT path only implements F32/F16 and other types asserted at runtime. Internal CUDA backend; the project switches on no op-support enum and never calls this. No project changes required
b9637–b9642 ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl WebGPU matmul shared-memory dequant templates rewritten: legacy/k-quant #elif chains converted to independent #if defined(...) blocks, and the i-quant (super-block 256) IQ1/IQ2/IQ3/IQ4 paths reworked to process NQ quants per thread with vectorized store_shmem_iquants/create_iq_gw4 helpers. Internal WebGPU backend — the project builds CPU/CUDA/Metal/OpenCL only, never WebGPU. No project changes required
b9637–b9642 tools/ui/, tools/ui/src/lib/utils/heic-to-jpeg.ts (new) WebUI gains a "render thinking as Markdown" display setting and client-side HEIC/HEIF image upload support (lazy CDN-loaded heic-to decoder → JPEG). The project compiles server-context/queue/task/models but not tools/ui, so the WebUI is absent from jllama. No project changes required
b9637–b9642 convert_lora_to_gguf.py, tests/test-backend-ops.cpp LoRA converter now resolves the base-model architecture via get_model_architecture(hparams, ModelType.TEXT) instead of hand-reading text_config/architectures; a GGML_TYPE_BF16 test_repeat case was added to the backend-ops test. Python tooling and an upstream test — neither is compiled into jllama. No project changes required
b9642–b9682 tools/mtmd/mtmd-helper.h + tools/mtmd/mtmd-helper.cpp mtmd_helper_decode_image_chunk gained two parameters — a post-decode callback plus its user_data — so callers can hook each decoded multimodal chunk; the standalone process_chunk helper was removed and folded into mtmd_helper_eval_chunk_single. Consumed only inside the upstream-compiled mtmd-helper.cpp / server-context.cpp; the project's hand-written C++ references no mtmd_*/process_chunk symbol (zero matches in src/main/cpp). No project source changes required. New feature: the post-decode callback enables multimodal speculative-draft decoding — exposable later as a vision + draft-model Java path
b9642–b9682 common/common.cpp (build_lora_mm_id) The LoRA multimodal id-embedding builder gained a w_s scale-weight argument for per-adapter scaling. Internal to the upstream-compiled common library; the project never calls it. No project source changes required
b9642–b9682 common/speculative.{h,cpp} Speculative decoding now accumulates per-draft-position acceptance statistics and adds an Eagle3 backend-sampling path (the draft model samples on the compute backend). common_speculative_* is compiled into common and reached only through the upstream server's speculative slot; the project's C++ references no speculative/draft symbol. No project source changes required. New feature: per-position draft-acceptance metrics — could surface as speculative-decoding telemetry in a future Java API
b9642–b9682 tools/server/server-context.cpp Server slot refactored so an mtmd (multimodal) prompt can feed a speculative draft model: image/media chunks are routed through the new mtmd_helper_decode_image_chunk callback before drafting. Compiled directly into jllama (the project builds server-context/queue/task/models), but the change is internal to the slot state machine and binds no new/renamed symbol; verified that jllama.cpp and the *_helpers.hpp headers call none of the touched functions. No project source changes required
b9642–b9682 ggml/src/ggml-* backends, tools/ (incl. llama-bench --offline), conda-forge packaging, docs/, .github/ Routine backend kernel updates and tooling/docs/CI tweaks (a new llama-bench --offline flag, conda-forge recipe notes). None are compiled into jllama beyond the already-built CPU/CUDA/Metal/OpenCL backends, and none change a symbol the project binds. No project changes required
b9682–b9739 tools/server/server-schema.{h,cpp} (new) + tools/server/server-task.{h,cpp} Build-breaking. server_task::params_from_json_cmpl() MOVED to server_schema::eval_llama_cmpl_schema() in new server-schema.h/server-schema.cpp. Required project changes: (1) add server-schema.cpp to the target_sources(jllama ...) block in CMakeLists.txt; (2) add #include "server-schema.h" in src/main/cpp/jllama.cpp and src/test/cpp/test_server.cpp; (3) update the call sites in jllama.cpp:203 and test_server.cpp:1722 from server_task::params_from_json_cmpl(...) to server_schema::eval_llama_cmpl_schema(...)
b9682–b9739 common/common.h (common_params_model) common_params_model::name field REMOVED; replaced by get_name() method. Not referenced in project source (model name is read from server_context_meta::model_name, populated upstream) — no project source changes required
b9682–b9739 common/common.h (common_params) webui, webui_mcp_proxy, webui_config_json fields REMOVED (deprecated aliases; replaced by ui/ui_mcp_proxy/ui_config_json introduced in b9172). Project never references these fields directly — no project source changes required
b9682–b9739 tools/server/server-models.h + server-models.cpp server_state enum: SERVER_STATE_LOADING_MODEL renamed to SERVER_STATE_LOADING; new SERVER_STATE_SLEEPING added. on_sleeping_changed callback replaced by set_state_callback with server_state_callback_t type. None are referenced in jllama.cpp — no project source changes required
b9682–b9739 vendor/cpp-httplib/httplib.{h,cpp} cpp-httplib bumped from v0.47.0 to v0.48.0. Compiled automatically via FetchContent — no project source changes required
b9682–b9739 common/speculative.{h,cpp} New common_speculative_get_state() / common_speculative_set_state() Eagle3 state checkpointing APIs; common_prompt_checkpoint::data_spec field added for Eagle3 speculative draft state stash. Additive; compiled into upstream common; project does not call these functions — no project source changes required. New feature: Eagle3 speculative decoding state save/restore — could expose later
b9682–b9739 common/download.h + common/download.cpp New common_download_remove() function for deleting cached model files. Additive; project does not call it — no project source changes required. New feature: could be exposed as LlamaModel.deleteCachedModel(String path)
b9682–b9739 common/arg.cpp New --agent flag that enables all tools + MCP CORS proxy in one step. Server-level CLI flag; not referenced by ModelParameters — no project source changes required. New feature: consider ModelParameters.setAgent(boolean)
b9682–b9739 common/arg.cpp + tools/server/server-http.cpp API key file: lines starting with # are now treated as comments and ignored. Behaviour fix for existing ModelParameters.setApiKeyFile(String) users — upgrade picks it up automatically, no source changes required
b9682–b9739 ggml/src/ggml-sycl/ New conv2d, conv2d_dw, conv2d_transpose, conv3d SYCL ops; Q1_0 quantization support. Internal SYCL backend, no project changes required
b9682–b9739 ggml/src/ggml-cuda/ New col2im_1d CUDA op. Internal CUDA backend, no project changes required
b9682–b9739 ggml/src/ggml-metal/ ROPE_BACK Metal support; concat kernel extended to additional types. Internal Metal backend, no project changes required
b9739–b9789 common/json-partial.{h,cpp} (removed) + common/peg-parser.{h,cpp} + common/chat.cpp The standalone partial-JSON parser was deleted (json-partial.h/.cpp, −363 lines) and its incremental-JSON handling folded into the PEG parser (peg-parser.cpp +194/−81). Partial JSON during streaming tool-call parsing is now produced by peg-parser instead of common_json_parse. Project never included json-partial.h — verified grep -rn "json-partial|common_json_parse" src/main/cpp src/test/cpp → zero matches. All consumers stay inside upstream-compiled chat.cpp. No project source changes required
b9739–b9789 common/chat.h + common/chat.cpp Message-span types restructured: new enum common_chat_role (+ common_chat_role_from_string/_to_string); common_chat_msg_span::role and common_chat_msg_delimiter::role changed std::string → common_chat_role; new container structs common_chat_msg_spans / common_chat_msg_delimiters (the latter with tokenize()/split()/to_json()); common_chat_params::message_spans (vector) → message_delimiters; free function common_chat_split_by_role() removed, replaced by common_chat_msg_delimiters_parse(). common_chat_msg_diff (used by test_server.cpp) is unchanged. Project references none of the changed span/delimiter symbols (verified: grep -rn "message_spans|common_chat_split_by_role|common_chat_msg_span|common_chat_msg_delimiter" src/main/cpp src/test/cpp → zero matches). Routing happens inside upstream-compiled chat.cpp / server-*.cpp. No project source changes required
b9739–b9789 tools/server/server-task.h + server-context.cpp + server-common.{h,cpp} Context-checkpointing reworked from a precomputed offset to message spans: task_params::n_before_user (int32) removed, replaced by task_params::message_spans (common_chat_msg_spans); new server_tokens::find_message_spans(const common_chat_msg_delimiters &) helper. test_server.cpp asserts against task_params::to_json() but never references n_before_user — verified grep -rn "n_before_user|message_spans" src/test/cpp → zero matches, so it compiles and passes unchanged. Consumed inside upstream-compiled server-context.cpp linked into jllama. No project source changes required
b9739–b9789 include/llama.h New API llama_model_n_layer_nextn(const llama_model *) — returns the number of NextN/MTP layers (additive; the surrounding accessor block was otherwise only column-realigned). Not called by project; could back a future introspection accessor. No project source changes required
b9739–b9789 common/common.h common_params::checkpoint_min_step default raised 256 → 8192 (minimum spacing between context checkpoints). Tuning default consumed inside upstream-compiled server-context.cpp; not surfaced by ModelParameters. No project source changes required
b9739–b9789 common/arg.h + common/arg.cpp + common/download.h common_params_handle_models() gained a 3rd parameter — a common_params_handle_models_params struct ({ common_download_callback*, bool preset_only }) for router-mode preset-only downloads; arg.h now #includes download.h; new common_download_opts::preset_only. Project does not call common_params_handle_models() directly (arg parsing happens upstream) — grep -rn common_params_handle_models src/ → zero matches. No project source changes required
b9739–b9789 common/arg.cpp + common/arg.h + ~34 tools/*,examples/*,tests/* mains + tests/test-arg-parser.cpp (patch target) Upstream's Windows common_params_parse argv handling changed again: the unconditional argc/argv = make_utf8_argv() override (the original #24779 regression) became a count-guard, if (static_cast<int>(utf8.buf.size()) == argc) { argv = utf8.ptrs.data(); }. This is exactly the variant the project already found to break its Windows server-integration tests, because the embedded argv length coincides with java.exe's. patches/0001-win32-arg-parse-embed-guard.patch now carries the complete upstream fix (37 files): common_params_parse parses exactly the argv it is given; a new common_params_parse_main() wrapper holds the GetCommandLineW recovery; the ~34 standalone main() call sites flip to it; and a tests/test-arg-parser.cpp case pins the contract. The embedded JNI caller stays on common_params_parse, so the argv it passes is used as given. Our subproject build compiles only the arg.{cpp,h} core (LLAMA_BUILD_TOOLS/TESTS OFF, so our 454-test suite is unchanged); the call-site flips and the new test were validated via a one-off tools+tests build (the new test passes; test-arg-parser's only failure is the live ggml.ai download check, a sandbox network restriction). 37-file patch: refresh on every bump
b9739–b9789 tools/mtmd/mtmd.h + tools/mtmd/clip.h + clip.cpp + mtmd.cpp New feature — multimodal model-load progress: new mtmd_progress_callback typedef + progress_callback / progress_callback_user_data fields on mtmd_context_params and clip_context_params (additive, appended to the structs; returning false aborts the load). Project does not aggregate-init either struct (grep -rn mtmd_context_params src/ → zero matches) so the new fields are harmless; could later feed a Java LoadProgressCallback for vision models. No project source changes required
b9739–b9789 tools/server/server-models.{h,cpp} + server-context.h Multi-model router refactor: model downloading moved into a dedicated child-process mode (enum server_child_mode, server_models::load(name, load_options), server_child::run_download(); old server_models::download() removed); SERVER_STATE_DOWNLOADING re-enabled in server_state. Project links server-models.cpp but does not drive the router (grep -rn "server_models|SERVER_CHILD_MODE" src/ → zero matches). Compiles into jllama unchanged. No project source changes required
b9739–b9789 ggml/src/ggml-{hexagon,vulkan,sycl,opencl,webgpu,cuda}/ + shaders Backend-internal work only: Hexagon HTP matmul kernels re-tiled (hmx-matmul-ops.c → hmx-mm-kernels-tiled.h); Vulkan gains a conv3d_mm shader + get_rows_back and folds the elementwise unary shaders (clamp/cos/sin/sqrt/square/leaky_relu.comp removed) into unary.comp; SYCL element-wise / conv3d additions; OpenCL Adreno norm/gemv tweaks; WebGPU mul_mat_vec refactor. No API surface visible to jllama.cpp; the OpenCL set only affects the opencl-android-aarch64 classifier. No project source changes required
b9739–b9789 common/json-schema-to-grammar.cpp (Java-test impact) The JSON-schema → GBNF serializer changed where it emits the space whitespace rule: a closing object is now … )? space "}" (was … )? "}" space) and a root-level string rule no longer appends a trailing space (string ::= "\"" char* "\"", was … "\"" space). Functionally equivalent (leading- vs trailing-whitespace placement) but byte-different, so the pinned expectation in LlamaModelTest.testJsonSchemaToGrammar was updated to the b9789 output. LlamaModel.jsonSchemaToGrammar is a pure JNI call (no model), so this failed on every platform's Java-test job; the new expectation was verified locally against the built b9789 libjllama. Test-data change only
b9739–b9789 tools/server/server-context.cpp (patch target, regression) server_context::load_model now unconditionally installs the server's own load-progress reporter on params_base.load_progress_callback immediately before common_init_from_params (b9739 called common_init_from_params(params_base) with no such assignment). This clobbered libjllama's LoadProgressCallback JNI trampoline (set on common_params.load_progress_callback before load_model), so LoadProgressCallbackTest observed zero progress updates and the abort-on-false path stopped throwing. Fixed by new patches/0002-server-preserve-caller-load-progress-callback.patch, which guards the install behind if (params_base.load_progress_callback == nullptr) so a caller-supplied callback survives (standalone llama-server keeps its reporter — the field is null there). Re-verified to apply + reverse-apply cleanly against b9789 and to compile clean (ctest still 454/454)
b9739–b9789 upstream build / verification Local build with GIT_TAG b9789 verified clean on Linux x86_64 (GCC 13.3; sources were pre-staged from release tarballs + both patches hand-applied because this sandbox blocks github.com git-clone, so FetchContent's git path and PATCH_COMMAND could not run — the published CI pipeline uses the normal git FetchContent path). cmake -B build -DBUILD_TESTING=ON configures cleanly (the OuteTTS build-time extraction and the refreshed Windows patch both pass their fail-loud anchor checks against b9789), cmake --build build --config Release -j$(nproc) links libjllama.so + jllama_test with zero warnings on any project translation unit, and ctest --test-dir build --output-on-failure reports 454/454 tests passing. Every upstream breaking change in this range is absorbed inside upstream-compiled translation units, so no project C++ source edits were required — but PR CI's model-backed Java suite (which the restricted sandbox cannot run) surfaced two project-side fixes captured in the two rows above: the json-schema-to-grammar test-expectation update and the load_progress_callback server regression (patches/0002)
b9789–b9803 common/arg.{cpp,h} + common/download.{cpp,h} + common/common.h (model-download refactor) The model-download pipeline was rewritten: common_params_handle_models() / common_params_handle_models_params / common_download_model() / common_download_model_result removed and replaced by a two-phase common_models_handler API (common_models_handler_init() builds the HF plan + opts; common_models_handler_apply() runs the parallel common_download_task list); common_download_opts::skip_download / ::preset_only and the whole common_skip_download_exception type removed; new common_download_get_hf_plan() / common_download_run_tasks() / common_download_get_all_parts(); download.h now #includes hf-cache.h. Project C++ references none of these (verified: grep -rn "common_params_handle_models|common_download_model|common_skip_download|skip_download|preset_only" src/main/cpp src/test/cpp → zero matches), and no project TU includes download.h directly. All consumers (arg parsing, server-models.cpp, llama-bench.cpp) are upstream-compiled. No project C++ source changes required. Java API follow-up (behavioural): this removal exposed that the project's ModelFlag.SKIP_DOWNLOAD (--skip-download) was never a registered upstream arg. It only ever forced a parse failure that SkipDownloadFailureTranslator mapped to ModelUnavailableException, and it could never load a present model. It was replaced with the real upstream --offline flag: ModelFlag.OFFLINE + ModelParameters.setOffline(boolean); the heuristic translator was replaced by a deterministic pre-check OfflineModelGuard (throws ModelUnavailableException when --offline is set and the configured local --model file is absent, before the native call); LlamaModelSkipDownloadTest → LlamaModelOfflineTest. ModelUnavailableException is retained. Pure-Java change, no JNI rebuild
b9789–b9803 common/common.h common_params_model gained bool empty() and get_name() became const (additive); common_params::skip_download field removed; new LLAMA_EXAMPLE_DOWNLOAD enumerator appended before LLAMA_EXAMPLE_COUNT. None surfaced by ModelParameters; consumed inside upstream-compiled TUs. No project source changes required
b9789–b9803 CMakeLists.txt + tools/mtmd/CMakeLists.txt New top-level LLAMA_BUILD_MTMD option for standalone library-only mtmd builds; the mtmd CLI executables (llama-llava-cli, llama-gemma3-cli, llama-minicpmv-cli, llama-qwen2vl-cli, llama-mtmd-cli, llama-mtmd-debug) are now gated behind if (LLAMA_BUILD_TOOLS). The project adds tools/mtmd directly with LLAMA_BUILD_TOOLS=OFF, so after this bump those CLI executables are no longer built as collateral — beneficial (less build time); the mtmd library target the project links still builds via the if (TARGET mtmd) block above the gate. No project source changes required
b9789–b9803 common/arg.cpp + docs/speculative.md New feature — EAGLE-3 speculative decoding (--spec-type draft-eagle3): a small one-layer draft transformer that reads the target model's hidden states for higher acceptance; plus a new standalone llama download / llama get subcommand (app/download.cpp, LLAMA_EXAMPLE_DOWNLOAD) and a --mtp download flag. Server-level CLI; not surfaced by ModelParameters/InferenceParameters. Could later feed an inference-parameter setter (--spec-type). No project source changes required
b9789–b9803 ggml/src/ggml-cuda/{binbcast,cpy}.cu + ggml-opencl + src/llama-model.{cpp,h} + src/models/lfm2.cpp Backend/model-internal only: CUDA binbcast/cpy kernels reworked for >INT_MAX index safety (int→uint32/int64 widening + overflow guards); OpenCL flushes the profiling batch on context teardown; new LLM_TYPE_230M mapped for LFM2 (n_ff == 2560). No API surface visible to jllama.cpp; CUDA set only affects the cuda13-linux-x86-64 classifier, OpenCL only the opencl-android-aarch64 classifier. No project source changes required
b9789–b9803 upstream verification (sandbox) Both patches/0001-win32-arg-parse-embed-guard.patch (37 files) and patches/0002-server-preserve-caller-load-progress-callback.patch re-verified to apply cleanly against b9803 via git apply --check (exit 0 for both patches) over the actual b9803 sources fetched from raw.githubusercontent.com; github.com git-clone is blocked in this sandbox, so a full FetchContent build could not run. Patch 0001's common_params_parse target region is byte-identical to b9789; the b9803 arg.cpp churn is confined to the common_models_handler rewrite and set_examples tags, which don't overlap the patched hunks. OuteTTS generator anchors hold (upstream tts.cpp unchanged in this range apart from patch 0001's main()-only parse flip). Full build + ctest to be confirmed by the CI pipeline
b9803–b9829 tools/server/server-stream.{cpp,h} (new) + server-context.cpp + server-http.{cpp,h} + server-models.{cpp,h} + server.cpp + CMakeLists.txt Build-breaking. Upstream added a resumable-streaming SSE replay buffer (PR #23226): a new TU server-stream.cpp defines g_stream_sessions (a process-wide stream_session_manager), stream_session_attach_pipe(), stream_aware_should_stop(), stream_conv_id_from_headers(), and the stream_pipe_producer/stream_pipe_consumer types. The three server TUs the project already compiles into jllamaserver-context.cpp, server-http.cpp, server-models.cpp — now #include "server-stream.h" and reference those symbols (server_res_generator gained a stop() override + a ~server_res_generator that calls spipe->cleanup(); server_http_res gained a std::shared_ptr<stream_pipe_producer> spipe member + virtual stop(); server-models tracks a conv_id → model map). Required project change: add ${llama.cpp_SOURCE_DIR}/tools/server/server-stream.cpp to both the target_sources(jllama ...) block and the jllama_test add_executable(...) sources in CMakeLists.txt, or the link fails with undefined references. It is platform-neutral (threads + std mutex/condvar, no subprocess.h/posix_spawn_*), so it builds on Android too and sits outside the server-models.cpp Android guard. jllama wires its own JNI routes and never calls g_stream_sessions.start_gc() (only the excluded standalone server.cpp main() does), so the GC thread stays dormant — the resumable-stream HTTP routes are not active in the embedded library. New feature: resumable SSE streams (reattach after a dropped socket via X-Conversation-Id) could later be wired into the project's Java OpenAiCompatServer.
b9803–b9829 tools/server/server.cpp + tests/export-graph-ops.cpp → tests/test-export-graph-ops.cpp (rename) (patch 0001 targets) Patch refresh. patches/0001-win32-arg-parse-embed-guard.patch stopped applying for two reasons: (1) upstream renamed tests/export-graph-ops.cpp to tests/test-export-graph-ops.cpp (also the llama-export-graph-ops artifact text), so the patch's call-site-flip hunk targeted a now-missing path; (2) the resumable-stream PR inserted g_stream_sessions.start_gc(); right after common_init() in server.cpp, shifting the context of the common_params_parse → common_params_parse_main flip (@@ -82 → @@ -87). Both hunks were regenerated against b9829 (path + index + @@ + leading context). Patch content is otherwise unchanged; the flips remain applied-but-not-compiled here (LLAMA_BUILD_TOOLS/TESTS OFF). Patches 0002/0003/0004 apply unchanged: their target regions (the server-context.cpp load-progress guard, the get_meta/get_response_reader area for the slot-prompt-similarity getter/setter, and server-common.cpp/test-chat.cpp) were untouched in this range.
b9803–b9829 src/models/mamba2.cpp + src/models/mamba-base.cpp + conversion/mamba.py Mamba2 generalized beyond a fixed expansion factor of 2: d_in_proj now derived from ssm_dt_rank + conv_dim (was 2*d_inner + 2*n_group*d_state + n_head), the GGML_ASSERT(2*n_embd == d_inner) / d_inner % d_state == 0 asserts removed, and ssm_dt_b/ssm_a/ssm_d tensor shapes keyed on dt_rank. Model-build internals inside upstream-compiled libllama; no symbol the project binds. No project source changes required
b9803–b9829 ggml/src/ggml-opencl/ (FA q4_0/q8_0 KV, +5 new kernel files) + ggml/src/ggml-cuda/{cpy,out-prod}.cu + ggml/src/ggml-vulkan/ + ggml/src/ggml-sycl/{norm,softmax}.cpp + ggml/src/ggml-openvino/ Backend-internal only: OpenCL gains native flash-attention over quantized (q4_0/q8_0) KV cache + flash-decoding split kernels + Adreno X2/Xe tuning (new fa_tune.h, flash_attn_pre_f16.cl, flash_attn_f32_q{4,8}_0.cl, cvt.cl/set_rows.cl SoA quant variants); CUDA adds a cudaMemcpy2DAsync fast path for strided same-type copies, batched cublasSgemmBatched out-prod, and CPU→CUDA async copies; Vulkan/SYCL/OpenVINO kernel + op-table updates (incl. GGML_GLU_OP_SWIGLU_OAI, softmax attention-sinks). No API surface visible to jllama.cpp; the OpenCL set only affects the opencl-android-aarch64 classifier, CUDA only cuda13-linux-x86-64. No project source changes required
b9803–b9829 common/common.{h,cpp} + common/speculative.cpp + common/arg.{cpp,h} + tools/mtmd/clip*.{h,cpp} Internal upstream churn: new COM_*/SPC_* logging macros (the LOG_* calls inside common.cpp/speculative.cpp/reasoning-budget.cpp were rewrapped, several LOG_INF → LOG_TRC quieting); common_models_handler gained plan_spec/plan_voc for --spec-draft-hf/--hf-repo-v downloads + duplicate-task dedup; clip hardened GGUF array reads (get_arr_f32, even-pinpoints / mean-std validation, n_merge defaults to 1). All consumed inside upstream-compiled common/mtmd; grep -rn "common_models_handler|COM_TRC|n_merge" src/main/cpp src/test/cpp → zero matches. No project source changes required
b9803–b9829 upstream verification (sandbox) All four patches (0001–0004) re-verified to apply + reverse-apply cleanly against b9829 via git apply --check / git apply --reverse --check over the actual b9829 sources fetched from api.github.com (github.com git-clone, incl. FetchContent of nlohmann/json and llama.cpp, is blocked in this sandbox, so a full build could not run). Patch 0001 was refreshed for the test-export-graph-ops rename and the server.cpp GC-insertion context shift (see the patch-refresh row above); 0002/0003/0004 unchanged. The server-stream.cpp link fix in CMakeLists.txt is required by the b9829 server-TU #includes (verified against the upstream diff: server-context/server-http/server-models reference symbols defined only in server-stream.cpp). Full build + ctest (target 454/454) to be confirmed by the CI pipeline.
b9829–b9839 common/regex-partial.{cpp,h} (removed) + common/CMakeLists.txt + tests/test-regex-partial.cpp (removed) + tests/CMakeLists.txt The standalone reversed-partial-regex matcher (common_regex, regex_to_reversed_partial_regex, common_regex_match/common_string_range) was deleted — partial-match handling during streaming tool-call parsing is now fully inside the PEG parser (same consolidation pattern as the b9739–b9789 json-partial removal). Project references none of these symbols — verified grep -rn "regex-partial|common_regex|regex_to_reversed|COMMON_REGEX" src/main/cpp src/test/cpp → zero matches; the deleted upstream test isn't built here (LLAMA_BUILD_TESTS OFF). No project source changes required
b9829–b9839 common/common.h + common/speculative.cpp + conversion/*.py + gguf-py/ + src/llama-arch.{cpp,h} + src/llama-{context,graph,model}.cpp + src/models/dflash.cpp (new) + docs/speculative.md New feature — DFlash block-diffusion speculative decoding (--spec-type draft-dflash, PR #22105): a new LLM_ARCH_DFLASH arch + common_speculative_impl_draft_dflash that drafts a whole block per step and injects the target model's hidden states into the draft KV cache. Adds COMMON_SPECULATIVE_TYPE_DRAFT_DFLASH (so COMMON_SPECULATIVE_TYPE_COUNT 9→10, static_assert bumped) and a self_kq_mask && self_kq_mask->buffer guard in llm_graph_input_attn_kv::set_input for the KV-injection pass. Conversion/gguf-py changes are Python-only (not built/shipped by this repo). The project binds no common_speculative_*/arch symbol — all consumed inside upstream-compiled common/libllama. No project source changes required. Could later surface as a --spec-type inference parameter
b9829–b9839 common/chat.cpp + models/templates/openbmb-MiniCPM5-1B.jinja (new) + tests/test-chat*.cpp New model support — MiniCPM5 chat template (common_chat_params_init_minicpm5): XML tool calls <function name="…"><param name="…">…</param></function> with CDATA-escaped string values + <think> reasoning. Detected by common_chat_try_specialized_template and handled inside the compiled-in chat.cpp, so it flows through the embedded server / LlamaModel chat path automatically. Upstream test additions aren't built here (LLAMA_BUILD_TESTS OFF). No project source changes required
b9829–b9839 common/arg.cpp + common/chat.cpp + common/jinja/caps.{cpp,h} + tools/server/server-context.cpp New feature--reasoning-preserve / --no-reasoning-preserve (LLAMA_ARG_REASONING_PRESERVE): preserve the reasoning trace across the full chat history (not just the last assistant message) when the template advertises the supports_preserve_reasoning capability; server-context.cpp adds an informational/warning log reconciling the flag with the loaded template's caps. Server-level CLI + capability detection, all inside upstream-compiled TUs; not surfaced by ModelParameters. Note: the b9839 server-context.cpp additions sit in load_model after the chat_params block — disjoint from the load-progress-callback guard patches/0002 targets, which still applies cleanly. No project source changes required; could later expose as a model/inference setter
b9829–b9839 common/jinja/runtime.{cpp,h} + common/jinja/value.cpp + tools/ui/** + tests/test-jinja.cpp + tools/server/server-{models,stream}.cpp Internal/cosmetic only: Jinja gains an AST visitor + runtime::debug_dump_program (template debugging) and min/max array filters; server-models.cpp/server-stream.cpp add diagnostic warning logs on unknown-conversation stop paths (additive, compiled into jllama); the Svelte WebUI got conversation-sidebar/streaming-identity refactors. The WebUI auto-follows the pinned GIT_TAG (the build-webui CI job re-reads it and rebuilds the matching UI), so no manual step here. Project references none of the touched symbols. No project source changes required
b9829–b9839 common/arg.cpp (lambda capture + --offline examples) Behaviour-neutral upstream churn: the common_models_handler_apply on_done lambda now captures first_path by value (dangling-reference fix) and --offline gained LLAMA_EXAMPLE_COMMON/LLAMA_EXAMPLE_DOWNLOAD set_examples tags. The project's ModelParameters.setOffline(boolean) (--offline) already exists; both changes are inside upstream-compiled arg.cpp and don't touch the patches/0001 hunks. No project source changes required
b9829–b9839 upstream verification (sandbox) All four patches (0001–0004) re-verified to apply cleanly against b9839 via git apply --check (exit 0 for all four) over the actual b9839 sources fetched from raw.githubusercontent.com; github.com git-clone is blocked in this sandbox, so a full FetchContent build could not run. Patch 0001's common/arg.{cpp,h} target regions and the ~34 standalone-main call sites are unchanged in this range (the b9839 arg.cpp edits are the new --reasoning-preserve opt, the --offline set_examples, and the on_done capture fix, none of which overlap the patched hunks); 0002's server-context.cpp load-progress guard region is untouched; 0003/0004 unchanged. OuteTTS generator anchors hold (upstream tools/tts/tts.cpp is unchanged in this range apart from patch 0001's existing main()-only parse flip). Full build + ctest (target 459/459) to be confirmed by the CI pipeline
b9839–b9840 src/llama-arch.{cpp,h} + src/llama-model.{cpp,h} + src/llama-hparams.h + src/llama-graph.{cpp,h} + src/llama-kv-cache-dsv4.{cpp,h} (new) + src/models/deepseek4.cpp (new) + src/llama-kv-cache{,-iswa}.{cpp,h} + src/llama-model-loader.cpp + src/CMakeLists.txt + conversion/*.py + gguf-py/ + models/templates/deepseek-ai-DeepSeek-V4.jinja (new) New model support — DeepSeek-V4 (LLM_ARCH_DEEPSEEK4 / deepseek4): a brand-new arch with its own compressed KV cache (llama_kv_cache_dsv4: raw SWA + CSA/HCA/lightning-indexer compressor states), sqrtsoftplus MoE gating (LLAMA_EXPERT_GATING_FUNC_TYPE_SQRT_SOFTPLUS = 4), hyper-connection + compressor hparams/tensors, hash-routing experts, and an embedded chat template. build_moe_ffn gained an optional trailing selected_experts_in param (defaults nullptr); llama_kv_cache_iswa gained an hparams-taking ctor overload; llama_kv_cache exposes get_layer_ids()/get_k_storage(). All internal to upstream-compiled libllama — upstream's own src/CMakeLists.txt adds the new llama-kv-cache-dsv4.cpp (built via FetchContent), and the conversion/gguf-py changes are Python-only (not built/shipped by this repo). The project binds none of the new symbols — verified grep -rn "DEEPSEEK4|dsv4|DSV4|SQRT_SOFTPLUS|sqrtsoftplus|selected_experts_in|HYPER_CONNECTION|hash_layer" src/main/cpp src/test/cpp → zero matches. No project source changes required; a DeepSeek-V4 GGUF would just work through the embedded server / LlamaModel path.
b9839–b9840 upstream verification (sandbox) All four patches (0001–0004) re-verified to apply cleanly against b9840 via git apply --check (exit 0 for all four) over the actual b9840 sources fetched from raw.githubusercontent.com; github.com git-clone is blocked in this sandbox, so a full FetchContent build could not run. The b9839→b9840 range touches no patch-target file (common/arg.{cpp,h}, tools/server/server-context.{cpp,h}, server-common.cpp, test-chat.cpp, the ~34 standalone mains); it is entirely additive DeepSeek-V4 code, so the patch hunks and offsets are byte-identical to b9839. OuteTTS generator anchors hold (upstream tools/tts/tts.cpp unchanged in this range). Full build + ctest (target 459/459) to be confirmed by the CI pipeline
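
The load-progress guard described in the b9739–b9789 server-context.cpp regression row can be illustrated with a minimal, self-contained sketch. It is not the upstream code or the contents of patches/0002: params_sketch, server_reporter, install_reporter_if_unset, load_model_sketch, caller_state and caller_callback are hypothetical stand-ins, and only the null check on load_progress_callback mirrors what the row describes.

```cpp
#include <cassert>

// Hypothetical stand-in for the callback type: receives progress in [0, 1]
// and returns false to abort the load.
using progress_callback_t = bool (*)(float progress, void * user_data);

// Hypothetical stand-in for the params struct; only the two fields this
// sketch needs are modelled.
struct params_sketch {
    progress_callback_t load_progress_callback           = nullptr;
    void *              load_progress_callback_user_data = nullptr;
};

// Stand-in for the server's own progress reporter (never aborts).
static bool server_reporter(float /*progress*/, void * /*user_data*/) {
    return true;
}

// The guard: install the server reporter only when the caller left the
// field null, so a caller-supplied callback survives.
inline void install_reporter_if_unset(params_sketch & p) {
    if (p.load_progress_callback == nullptr) {
        p.load_progress_callback           = server_reporter;
        p.load_progress_callback_user_data = nullptr;
    }
}

// Stand-in for a model load: reports progress in `steps` increments and
// returns false as soon as the callback asks to abort.
inline bool load_model_sketch(params_sketch & p, int steps) {
    assert(steps > 0);
    install_reporter_if_unset(p);
    for (int i = 1; i <= steps; ++i) {
        const float progress = static_cast<float>(i) / static_cast<float>(steps);
        if (!p.load_progress_callback(progress, p.load_progress_callback_user_data)) {
            return false;
        }
    }
    return true;
}

// Caller-side state and callback, playing the role of an embedding
// application's trampoline: counts updates, optionally aborts.
struct caller_state {
    int updates     = 0;
    int abort_after = -1; // negative means never abort
};

static bool caller_callback(float /*progress*/, void * user_data) {
    auto * state = static_cast<caller_state *>(user_data);
    ++state->updates;
    return state->abort_after < 0 || state->updates < state->abort_after;
}
```

With the guard in place, a caller that sets its own callback before the load sees every progress update and can still abort by returning false, while a caller that leaves the field null gets the server reporter, which matches the standalone-server behaviour the row notes.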