Per-version-range record of upstream API breaks observed in the b5022 → latest range: the affected upstream files and the project-side fix for each break (or "no project changes required" when the break stayed inside an upstream-compiled translation unit).
Used during llama.cpp version bumps: when upgrading, scan this file from the row matching the current pinned version forward to the target, apply any rows marked as needing project source changes, and append a new row covering the upgrade range. See the "Upgrading/Downgrading llama.cpp Version" section in ../../CLAUDE.md for the upgrade workflow.
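
The scan step described above can be sketched as follows. This is a minimal illustration, assuming the table has already been split into rows; `Row`, `parse_range`, and `rows_to_review` are hypothetical names, not part of this project or of llama.cpp. It selects rows whose version range falls between the pinned and target builds, and leaves the judgment of whether a row needs project source changes to the reader.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical in-memory form of one table row (names are illustrative only).
struct Row {
    std::string version;  // e.g. "~b7217–b7433" or "~b7433"
    std::string file;
    std::string change;
};

// Collect every "b<digits>" build number in a version cell.
// A single-version cell such as "~b7433" yields from == to.
static bool parse_range(const std::string & cell, int & from, int & to) {
    std::vector<int> nums;
    for (size_t i = 0; i + 1 < cell.size(); ++i) {
        if (cell[i] == 'b' && std::isdigit(static_cast<unsigned char>(cell[i + 1]))) {
            int n = 0;
            size_t j = i + 1;
            while (j < cell.size() && std::isdigit(static_cast<unsigned char>(cell[j]))) {
                n = n * 10 + (cell[j] - '0');
                ++j;
            }
            nums.push_back(n);
            i = j - 1;
        }
    }
    if (nums.empty()) {
        return false;
    }
    from = nums.front();
    to   = nums.back();
    return true;
}

// Rows to review when upgrading from build `pinned` to build `target`.
static std::vector<Row> rows_to_review(const std::vector<Row> & rows, int pinned, int target) {
    assert(pinned < target);  // this sketch covers upgrades only
    std::vector<Row> out;
    for (const Row & r : rows) {
        int from = 0, to = 0;
        if (!parse_range(r.version, from, to)) {
            continue;  // header or separator line
        }
        const bool single = (from == to);
        const bool hit = single ? (from > pinned && from <= target)
                                : (to > pinned && from < target);
        if (hit) {
            out.push_back(r);
        }
    }
    return out;
}
```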
| Version | File | Change |
|---|---|---|
| ~b7217–b7433 | common/common.h, include/llama-cpp.h | common_init_result became common_init_result_ptr; access changed to ->model() / ->context() / ->free_context() |
| ~b7433 | common/arg.h | n_parallel default changed to sentinel -1 (auto); Java bindings must resolve to 1 before model load |
| ~b7217–b7783 | common/arg.h → common/download.h | common_remote_get_content and common_remote_params split into new download.h; headers changed from `vector<string>` to `vector<pair>` |
| ~b7783 | common/common.h | build_info string moved into common.h; local definition must be removed |
| ~b7783–b7858 | common/chat.h | common_chat_syntax renamed to common_chat_parser_params; `to_json_oaicompat<json>()` template removed (no template arg); ensure_tool_call_ids_set() → set_tool_call_ids() |
| ~b7858–b7864 | common/speculative.h | Full redesign: common_speculative_init(ctx_tgt, ctx_dft) → common_speculative_init(params_speculative, ctx); common_speculative_gen_draft → common_speculative_draft; new common_speculative_accept(); common_speculative_params struct replaced by common_params_speculative; draft model loaded via llama_model_load_from_file into llama_model_ptr |
| ~b7858–b7864 | common/common.h | params_speculative: .model.path/.hf_repo replaced by .has_dft()/.mparams_dft; new .model_dft and .cparams_dft fields; speculative.type enum added (COMMON_SPECULATIVE_TYPE_NONE) |
| ~b7858–b7864 | server.hpp (internal) | slot_action.slot_id → slot_action.id_slot; llama_init_dft removed from server_context; model_dft changed from llama_model* to llama_model_ptr; slot.ctx_tgt/ctx_dft removed |
| ~b7864 | common/mtmd.h | mtmd_init_params.verbosity field removed |
| ~b7904–b8190 | common/common.h | params_base.model_alias changed from std::string to a container; use *model_alias.begin() instead of direct string cast |
| ~b8778–b8808 | tools/mtmd/mtmd.h | MTMD_DEFAULT_IMAGE_MARKER macro removed; mtmd_image_tokens_get_nx/ny deprecated; new mtmd_decoder_pos struct + mtmd_image_tokens_get_decoder_pos(); mtmd_context_params_default() now sets image_marker = nullptr (throws "custom image_marker is not supported anymore" if non-null); upstream server adds randomized get_media_marker() in server-common.h; our server.hpp is unaffected since it does not include that header and uses mtmd_default_marker() consistently |
| ~b8808–b8831 | project CMakeLists.txt | CMake target common renamed to llama-common; update target_link_libraries for jllama and jllama_test |
| ~b8808–b8831 | common/common.h → new common/build-info.h | build_info std::string removed; replaced by llama_build_info() (const char*) in new build-info.h; add #include "build-info.h" in server.hpp and utils.hpp; call sites: std::string(llama_build_info()) in server.hpp (6×), llama_build_info() in jllama.cpp (1×) and utils.hpp (1×) |
| ~b8808–b8831 | ggml/src/ggml.c | New ggml_graph_next_uid() calls _InterlockedIncrement64 via `<intrin.h>` on x86; intrinsic unavailable on 32-bit MSVC; fix: src/main/cpp/compat/ggml_x86_compat.c provides __cdecl _InterlockedIncrement64 via InterlockedIncrement64 (CMPXCHG8B), added to ggml-base via target_sources guarded by MSVC AND CMAKE_SIZEOF_VOID_P EQUAL 4 |
| ~b8838–b8841 | src/llama-model.h | Attention bias fields renamed: bq→wq_b, bk→wk_b, bv→wv_b, bo→wo_b, bqkv→wqkv_b; internal to llama.cpp, no impact on this project |
| ~b8841–b8854 | common/common.h | common_params::clear_idle renamed to cache_idle_slots; new common_context_seq_rm_type enum + common_context_can_seq_rm() replacing common_speculative_is_compat(); get_model_endpoint() → common_get_model_endpoint() |
| ~b8841–b8854 | tools/mtmd/mtmd.h + mtmd-helper.h | mtmd_decoder_pos gains z field; mtmd_image_tokens_get_decoder_pos() + mtmd_helper_image_get_decoder_pos() gain new pos_0 parameter |
| ~b8841–b8854 | project utils.hpp / server.hpp | server_tokens::get_text_tokens() split: get_tokens() returns raw const llama_tokens &; new get_text_tokens() returns filtered copy (removes LLAMA_TOKEN_NULL mtmd placeholders); save/load and context-shift call sites updated to get_tokens() |
| ~b8854–b8887 | common/chat.h | common_chat_msg_diff_to_json_oaicompat removed; moved to tools/server/server-chat.cpp; project defines it locally in server.hpp; importing server-chat.cpp is impractical because it pulls in convert_transcriptions_to_chatcmpl → get_media_marker → server-common.cpp |
| ~b8854–b8887 | common/common.h | common_params::reasoning_budget and reasoning_budget_message moved into common_params::sampling sub-struct as reasoning_budget_tokens; update: params_base.reasoning_budget → params_base.sampling.reasoning_budget_tokens |
| ~b8854–b8887 | common/fit.h (new) | llama_params_fit and llama_memory_breakdown_print removed from include/llama.h; now common_fit_params / common_memory_breakdown_print in new common/fit.h; not used directly by project |
| ~b8887–b8913 | tools/server/server-chat.h | convert_transcriptions_to_chatcmpl gained a new const common_chat_templates * tmpls second parameter; not called by project's server.hpp; handled automatically by upstream server-chat.cpp |
| ~b8887–b8913 | tools/server/server-task.cpp | n_discard clamped to non-negative: params.n_discard = std::max(0, params.n_discard); applied in project's server.hpp after the json_value parse |
| ~b8887–b8913 | tools/server/server-common.cpp | parallel_tool_calls now defaults to caps["supports_parallel_tool_calls"] instead of hardcoded false; handled automatically by upstream file |
| ~b8887–b8913 | common/chat.h | New additive common_chat_prompt_preset struct and common_chat_get_asr_prompt() function; no project changes required |
| ~b8887–b8913 | common/common.h | New string_starts_with(std::string_view, char) overload added; no project changes required |
| ~b8887–b8913 | tools/mtmd/mtmd.cpp | Added LLAMA_ROPE_TYPE_NONE case to rope-type switch; internal fix, no project changes required |
| ~b8913–b8953 | common/debug.h | base_callback_data renamed to common_debug_cb_user_data; template `common_debug_cb_eval<false/true>` replaced by plain common_debug_cb_eval; not used by this project |
| ~b8913–b8953 | tools/server/server-http.h | New uploaded_file struct; files map type changed from `map<string, raw_buffer>` to `map<string, uploaded_file>`; upstream server sources compiled directly, no project impact |
| ~b8913–b8953 | src/llama-quant.cpp | Default quantization ftype changed from LLAMA_FTYPE_MOSTLY_Q5_1 to LLAMA_FTYPE_MOSTLY_Q8_0; upstream only |
| ~b8913–b8953 | src/models/llama.cpp, qwen3.cpp, qwen3moe.cpp | Removed duplicate ggml_mul for wo_s scale (now handled exclusively by build_attn); upstream only |
| ~b8953–b8962 | common/common.h | struct cpu_params → struct common_cpu_params; cpu_get_num_physical_cores() → common_cpu_get_num_physical_cores(); cpu_get_num_math() → common_cpu_get_num_math(); not used directly by project |
| ~b8953–b8962 | common/common.h | common_params_speculative fully restructured with nested sub-structs: .mparams_dft/.model_dft/.cparams_dft/.n_max/.n_min/.p_split/.p_min → .draft.mparams/.draft.model/.draft.cparams/.draft.n_max/.draft.n_min/.draft.p_split/.draft.p_min; ngram fields moved to .ngram_cache/.ngram_mod/.ngram_simple/etc sub-structs; not referenced by project directly |
| ~b8953–b8962 | common/arg.h | is_sparam bool split into is_sampling + is_spec; set_sparam() split into set_sampling() + set_spec(); not used by project |
| ~b8953–b8962 | tools/server/server-task.cpp | task_params::to_json() drops "speculative.n_max", "speculative.n_min", "speculative.p_min" from output; only "speculative.type" remains; test SlotParamsToJson.SpeculativeFields_Present updated accordingly |
| ~b8953–b8962 | common/speculative.h | New public API: common_speculative_n_max() and common_speculative_n_min() added; server-context.cpp uses these instead of direct field access; no project changes required |
| ~b8962–b8982 | common/sampling.h | common_sampler_accept 3rd param renamed accept_grammar → is_generated; semantics broadened: false now also skips reasoning budget update (not just grammar); no project call sites affected |
| ~b8962–b8982 | common/reasoning-budget.h | Two overloads merged: prefill_tokens variant removed; new single overload takes initial_state = REASONING_BUDGET_IDLE; prefill now fed via llama_sampler_accept() loop after init; not called directly by project |
| ~b8962–b8982 | ggml/src/ggml-cuda/ssm-conv.cuh | ggml_cuda_op_ssm_conv gained optional bias_add_node param; SSM_CONV + ADD + SILU fusion now supported; internal CUDA code, no project changes required |
| ~b8962–b8982 | common/speculative.cpp | Draft token confidence check (p_min) moved before push to result: low-confidence tokens are now discarded entirely rather than included then ignored; behavior fix, no project changes required |
| ~b8962–b8982 | tools/server/server-context.cpp | n_draft_total accounting moved to draft generation site instead of acceptance site (bug fix); upstream only |
| ~b8982–b8994 | ggml/src/ggml-cuda.cu | ggml_backend_cuda_i struct: .get_tensor_2d_async and .set_tensor_2d_async function pointers were swapped (get pointed to set impl and vice versa); corrected; internal CUDA backend, no project changes required |
| ~b8982–b8994 | ggml/src/ggml-vulkan.cpp | ggml_vk_buffer_write_2d_async and ggml_vk_buffer_write_2d gained a dpitch parameter; Vulkan now implements set_tensor_2d/get_tensor_2d in buffer interface; internal backend code, no project changes required |
| ~b8982–b8994 | common/speculative.cpp | Checkpoint helpers renamed: draft_create_checkpoint → create_checkpoint, draft_restore_checkpoint → restore_checkpoint; ckpt_size field removed (size computed from context directly); internal speculative module, not called by project |
| ~b8982–b8994 | common/arg.cpp | CLI option typo fixed: --spec--draft-p-split → --spec-draft-p-split (extra dash removed); CLI-only, no project changes required |
| ~b8982–b8994 | src/llama-mmap.cpp | Windows large-file (>2 GB) fix: ftell/fseek replaced with _ftelli64/_fseeki64; upstream only |
| ~b8982–b8994 | tools/server/httplib.h | cpp-httplib bumped to v0.43.2: Windows FILE_SHARE_WRITE fix, Linux DNS cancel race fix, mbedTLS close_notify fix; upstream server header, no project changes required |
| ~b8982–b8994 | tools/server/server-context.cpp | New LLAMA_TRACE env variable enables slot acceptance tracing; upstream only |
| ~b8994–b9004 | ggml/src/ggml-vulkan/ggml-vulkan.cpp | vk_fa_pipeline_state gains k_type/v_type fields; get_fa_tuning_params_coopmat2 now takes separate k_type/v_type params; mixed K/V type FA pipeline creation refactored to CREATE_FA_CM2_MIXED() macro; flash_attn_cm2.comp shader uses runtime FaTypeK/FaTypeV spec constants (spec constants 12–15 added); DECODEFUNC/NEEDS_INIT_IQ_SHMEM macros removed; internal Vulkan backend, no project changes required |
| ~b8994–b9004 | ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp | get_mul_mat_fast_pipeline vectorized-path condition fixed: dst->ne[1] % 4 == 0 check removed (was preventing vectorization for non-multiple-of-4 batch sizes); internal WebGPU backend, no project changes required |
| ~b8994–b9004 | ggml/src/ggml-hexagon/ | Hexagon HTP backend: FA exp2 half-precision option, unary-op non-contiguous tensor fix; internal DSP backend, no project changes required |
| ~b8994–b9004 | tools/server/webui/ | Major frontend component reorganization (Svelte/TypeScript); purely UI, no C++ or JNI impact |
| ~b9004–b9016 | src/llama-io.h | llama_io_read_i interface changed: read(size_t)→read(void*,size_t), read_to(void*,size_t) removed, new read_tensor(tensor,offset,size) added; llama_io_write_buffer/llama_io_read_buffer now batch backend tensor ops in destructors for performance; internal state-save/load path, not called by project |
| ~b9004–b9016 | tools/server/server-context.cpp | Static server_get_checkpoint() (returns by value) renamed to server_prompt_checkpoint_update() (takes server_prompt_checkpoint & by reference, in-place update); compiled directly into jllama, no call site in project code |
| ~b9004–b9016 | common/arg.cpp + docs | Speculative decoding CLI args renamed: --draft/--draft-n/--draft-max and --draft-min/--draft-n-min were REMOVED (handler throws std::invalid_argument at parse time, not just deprecated); other draft flags (--draft-p-min, --ctx-size-draft, --device-draft, --gpu-layers-draft, --model-draft) kept as aliases for new canonical --spec-draft-* names. Java impact: ModelParameters.setDraftMax/setDraftMin produced removed flags → threw at model load; fixed to canonical --spec-draft-n-max/--spec-draft-n-min. Other set*Draft methods updated to canonical names for forward compatibility. Env vars also renamed (LLAMA_ARG_DRAFT_MAX→LLAMA_ARG_SPEC_DRAFT_N_MAX, etc.) |
| ~b9004–b9016 | ggml/src/ggml-cuda/ggml-cuda.cu | PCI bus ID detection replaced snprintf with cudaDeviceGetPCIBusId (buffer 16→32 bytes); HIP/MUSA compat headers gain cudaDeviceGetPCIBusId alias; internal CUDA backend |
| ~b9004–b9016 | ggml/src/ggml-opencl/ | Adreno MoE MXFP4: new kernel_convert_block_mxfp4_trans4_ns/restore kernels in cvt.cl; new gemm_moe_mxfp4_f32_ns, gemv_moe_mxfp4_f32_ns, moe_reorder_b, moe_sort_by_expert kernel files; GPU-side router reorder replaces CPU-side preprocessing; q_img created for GEMM path; internal OpenCL backend |
| ~b9004–b9016 | ggml/src/ggml-vulkan/ggml-vulkan.cpp | GGML_VK_MAX_NODES 8192 macro removed (node limit now determined differently); internal Vulkan backend |
| ~b9004–b9016 | ggml/src/ggml-webgpu/ | ggml_webgpu_row_norm_pipeline_key gains src_type/dst_type fields; GGML_OP_NORM now supported alongside GGML_OP_RMS_NORM/GGML_OP_L2_NORM; row_norm.wgsl gains SRC_TYPE/DST_TYPE parameterization and NORM two-pass algorithm; internal WebGPU backend |
| ~b9004–b9016 | src/llama-model.cpp | rope_yarn_log_mul get_key call changed from required=0.0f to required=false; fixes Mistral YaRN log_mul loading; internal model loading, no project impact |
| ~b9004–b9016 | common/chat.cpp | common_chat_templates_generation_prompt() extracted from common_chat_templates_apply_jinja(); internal refactor, no API change |
| ~b9016–b9022 | src/llama-model.h + src/llama-model.cpp + src/models/ | llama_model becomes abstract base with pure virtual methods (load_stats, load_hparams, load_vocab, load_tensors, load_arch_hparams, load_arch_tensors, build_arch_graph); load_arch() removed; new intermediate llama_model_base class provides concrete implementations; per-arch subclasses (e.g. llama_model_llama, llama_model_gemma2) in src/models/; factory llama_model_create(llm_arch, params) and llama_model_create(ml, params) replace direct instantiation; LLAMA_LOAD_LOCALS convenience macro added; public C API (llama_model_load_from_file etc.) unchanged, no project impact |
| ~b9016–b9022 | src/models/ | Many model files renamed: cohere2-iswa.cpp→cohere2.cpp, gemma2-iswa.cpp→gemma2.cpp, gemma3n-iswa.cpp→gemma3n.cpp, gemma4-iswa.cpp→gemma4.cpp, mimo2-iswa.cpp→mimo2.cpp, openai-moe-iswa.cpp→openai-moe.cpp, pangu-embedded.cpp→pangu-embed.cpp, qwen3vl-moe.cpp→qwen3vlmoe.cpp, step35-iswa.cpp→step35.cpp; new model files added (deepseek2ocr.cpp, glm-dsa.cpp, granite-moe.cpp, hunyuan-vl.cpp, jina-bert-v2/v3.cpp, lfm2moe.cpp, llama-embed.cpp, mamba2.cpp, minicpm.cpp, mistral4.cpp, nemotron-h-moe.cpp, nomic-bert.cpp, nomic-bert-moe.cpp, phimoe.cpp); upstream only, no project changes required |
| ~b9016–b9022 | tools/server/server-context.cpp | server_prompt_checkpoint_update (the renamed function from b9016) static function signature changed from returning by value to taking server_prompt_checkpoint & by reference; compiled directly into jllama, no project call site |
| ~b9016–b9022 | tools/server/server-tools.cpp | New built-in get_datetime tool added via new server_tool_get_datetime struct in build_tools(); no project changes required (handled automatically by compiled upstream source) |
| ~b9016–b9022 | common/chat-auto-parser-generator.cpp | force_tools variable removed from build_tool_parser_json_native, build_tool_parser_tag_json, build_tool_parser_tag_tagged; content before tool calls is now always p.optional(p.content(...)) regardless of tool_choice=required; upstream only, no project changes required |
| ~b9016–b9022 | common/chat-peg-parser.h/cpp | New optspace(const std::string & tag) method added to common_chat_peg_builder; makes leading/trailing spaces in reasoning tags optional; upstream only, no project changes required |
| ~b9016–b9022 | common/reasoning-budget.cpp | Forced token logit now set to +INFINITY (previously left at whatever the model computed); reasoning budget enforcement is now absolute; upstream only, no project changes required |
| ~b9016–b9022 | common/chat.cpp | thinking_start_tag and thinking_end_tag now trimmed via trim_whitespace(); upstream only, no project changes required |
| ~b9016–b9022 | examples/diffusion/ | diffusion_generate extracted from diffusion-cli.cpp to new diffusion.h/diffusion.cpp static library; enum names prefixed: ORIGIN→DIFFUSION_ALGORITHM_ORIGIN, TIMESTEP_BASED→DIFFUSION_TRANSFER_SCHEDULE_TIMESTEP_BASED etc.; examples only, no project changes required |
| ~b9022–b9049 | include/llama.h | New LLAMA_STATE_SEQ_FLAGS_ON_DEVICE 2 macro added alongside existing LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY 1; enables on-device KV cache state save/restore without host round-trip via llama_state_seq_get_size_ext/get_data_ext/set_data_ext; no project call-site changes required (not used by JNI layer) |
| ~b9022–b9049 | src/llama-context.cpp | State seq data format breaking change: llama_state_seq_get_data/set_data now prepend a 4-byte magic (0xaf143cd8) + 4-byte seq_id header; state data saved with ≤b9022 is incompatible with b9049+; internal I/O classes renamed llama_io_write_buffer→llama_io_write_host, llama_io_read_buffer→llama_io_read_host; new llama_io_write_device/llama_io_read_device classes for on-device paths; no project changes required (not called by JNI layer) |
| ~b9022–b9049 | ggml/include/ggml.h | New ggml_op_hint enum (GGML_HINT_DEFAULT=0, GGML_HINT_SRC0_IS_HADAMARD=1) and ggml_mul_mat_set_hint() function added for FWHT (Fast Walsh-Hadamard Transform) support; used internally in llama-graph.cpp / llama-kv-cache.cpp; no project call-site changes required |
| ~b9022–b9049 | src/llama.cpp | llama_backend_init() now auto-calls ggml_backend_load_all() if no backends are yet registered; ggml_backend_load_all() removed from common_params_parser_init() (was in common/arg.cpp); no project changes required; backend loading still happens correctly |
| ~b9022–b9049 | tools/server/server-context.cpp | server_prompt_checkpoint_update() gained an on_device bool parameter; speculative checkpoints now use LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY \| LLAMA_STATE_SEQ_FLAGS_ON_DEVICE; compiled directly into jllama from upstream source, no project call-site changes required |
| ~b9022–b9049 | src/llama-model.cpp | Unsupported model architecture now throws std::runtime_error instead of calling GGML_ABORT; allows callers to catch unknown-arch errors gracefully; no project changes required |
| ~b9022–b9049 | ggml/CMakeLists.txt | GGML version bumped 0.10.2 → 0.11.0; no project changes required |
| ~b9022–b9049 | vendor/cpp-httplib/ | Updated to 0.43.3: str2tag converted to iterative loop (eliminates recursion stack depth risk), res.body.reserve now OOM-safe; upstream server header, no project changes required |
| ~b9049–b9071 | common/chat.h | contains_media() method added to common_chat_msg; to_json_oaicompat() now forces text concatenation when message contains media markers; additive change, no project impact |
| ~b9049–b9071 | src/llama-arch.h/cpp + src/llama-hparams.h | New LLM_KV_ATTENTION_VALUE_SCALE KV key and f_attn_value_scale hparam field added for MiMo-V2 attention value scaling; additive, no project changes required |
| ~b9049–b9071 | src/llama.cpp | llama_supports_gpu_offload() and llama_supports_rpc() now auto-call ggml_backend_load_all() if no backends are registered; behavior fix, no project changes required |
| ~b9049–b9071 | src/llama-context.cpp | state_seq_set_data: removed too-strict seq_id matching guard that was gated on LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY; KV slot restorer now checks tensor shapes and view offsets before deciding to reallocate (avoids unnecessary realloc on shape-compatible updates); both are bug fixes, no project API changes required |
| ~b9049–b9071 | src/models/mimo2.cpp | MiMo-V2 extended with MTP (Multi-Token Prediction) layer support via nextn_predict_layers; fused wqkv projection; attention_value_scale post-attention scaling; all internal model-loading changes, no project changes required |
| ~b9049–b9071 | ggml/src/ggml-sycl/ | SYCL implementations added for CUMSUM, DIAG, FILL, SSM_SCAN, SOLVE_TRI ops; additive, no project changes required |
| ~b9049–b9071 | ggml/src/ggml-cuda/out-prod.cu | CUDA outer-product uses cublasSgemmStridedBatched for batched path (dps2==1, ne2>1); HIP/MUSA compat headers gain the alias; performance improvement, no project changes required |
| ~b9049–b9071 | tools/mtmd/ | MiniCPM-V 4.6 multimodal support added (PROJECTOR_TYPE_MINICPMV4_6, ViT merger graph, new tensor names); additive, no project changes required |
| ~b9049–b9071 | tools/server/webui/ | LLM-based conversation title generation; CSS animation fill-mode-forwards fixes; UI-only changes compiled into upstream server, no project changes required |
| ~b9071–b9094 | ggml/src/ggml-cuda/allreduce.cu + allreduce.cuh (NEW) | 2-GPU PCIe AllReduce pipeline for tensor parallelism (no NVLink required); requires Volta+ (sm70+); enabled via GGML_CUDA_ALLREDUCE env var (nccl/internal/none); compiled automatically via FetchContent, no project changes required |
| ~b9071–b9094 | ggml/src/ggml-cuda/snake.cu + snake.cuh (NEW) | Fused CUDA Snake activation kernel (y = x + sin(a*x)^2 * inv_b) for BigVGAN/Vocos audio models; fuses 5-op chain MUL→SIN→SQR→MUL→ADD at graph level; F32/F16/BF16; compiled automatically, no project changes required |
| ~b9071–b9094 | ggml/src/ggml-cuda/ggml-cuda.cu | Flash attention head size 192 (DKQ=192, DV=128) for MiMo-V2.5/V2.5-Pro/V2-Flash with GQA ratio 8/16; multi-GPU comm context refactored to ggml_backend_cuda_comm_context with try_allreduce function pointer; PCI bus IDs lowercased; compiled automatically, no project changes required |
| ~b9071–b9094 | ggml/src/ggml-sycl/ | Q5_K reordered memory layout + MMVQ kernel for Intel GPUs; PAD op supports non-contiguous src0; dedicated growing K/V buffer for flash attention; all internal SYCL backend, no project changes required |
| ~b9071–b9094 | ggml/src/ggml-hexagon/ | GATED_DELTA_NET and L2_NORM HVX-vectorized on Hexagon HTP backend; internal DSP backend, no project changes required |
| ~b9071–b9094 | src/models/sarvam.cpp (NEW) | Sarvam-MoE model (sarvamai/sarvam-30b); reuses BailingMoeV2 arch; new vocab pre-type LLAMA_VOCAB_PRE_TYPE_SARVAM_MOE = 51; additive, no project changes required |
| ~b9071–b9094 | src/models/gemma4.cpp | Gemma4 split gate/up experts: ffn_gate_up_exps now TENSOR_NOT_REQUIRED; fallback to separate ffn_gate_exps/ffn_up_exps; NVFP4 per_expert_scale folding; internal model-loading, no project changes required |
| ~b9071–b9094 | tools/server/server-context.h + server-context.cpp | New get_model_info() method on server_context; /v1/models response now includes "n_ctx" field (value: slot_n_ctx); compiled from upstream sources, no JNI changes required (Java callers of model info APIs receive the new field transparently) |
| ~b9071–b9094 | tools/server/server-http.h + server.cpp | handlers map moved from private to public in server_http_context; new register_gcp_compat() method exposes GCP/Vertex AI Prediction Protocol endpoint reading AIP_MODE/AIP_PREDICT_ROUTE/AIP_HEALTH_ROUTE/AIP_HTTP_PORT env vars; compiled from upstream sources, no project changes required |
| ~b9071–b9094 | tools/server/server-models.h + server.cpp | Router child→parent model info propagation: new CMD_CHILD_TO_ROUTER_INFO command; setup_child_server() gains const json & model_info parameter; new update_loaded_info() method; server_model_meta gains loaded_info field; all internally consistent across compiled upstream sources, no project changes required |
| ~b9071–b9094 | common/reasoning-budget.cpp | Forced token logit no longer set to +INFINITY; only competing tokens set to -INFINITY; internal sampler behavior change, no project changes required |
| ~b9071–b9094 | tools/server/webui/ | Settings registry refactored (settings-config.ts/settings-fields.ts/settings-sections.ts merged into settings-registry.ts); MCP route #/settings/mcp → #/mcp-servers; settings route /settings/chat/[section] → /settings/[[section]]; UI-only, no project changes required |
| ~b9094–b9102 | ggml/src/ggml-cuda/allreduce.cu + allreduce.cuh | Internal CUDA AllReduce pipeline refactored with ggml_cuda_ar_pipeline struct; ggml_cuda_ar_pipeline_init(devices, n_devices) / _free / _allreduce APIs; supports 2-GPU PCIe AllReduce without NCCL (Volta+ / sm70+); chunked kernel path (small tensors) vs copy-engine path (large tensors); GGML_CUDA_ALLREDUCE env = nccl/internal/none; env tuning vars GGML_CUDA_AR_COPY_THRESHOLD / GGML_CUDA_AR_COPY_CHUNK_BYTES / GGML_CUDA_AR_BF16_THRESHOLD; HIP/MUSA builds return nullptr stub; compiled automatically via FetchContent, no project changes required |
| ~b9094–b9102 | ggml/src/ggml-cuda/ggml-cuda.cu | GGML_LOG_WARN_ONCE macro added; ggml_backend_cuda_comm_context gains try_allreduce fn pointer and ar_pipeline; three dispatch fns: try_allreduce_nccl, try_allreduce_internal, try_allreduce_butterfly; init chain: comm_init_nccl → comm_init_internal → comm_init_none; platform default Linux→NCCL, Windows→internal; no project changes required |
| ~b9094–b9102 | ggml/src/ggml-sycl/ggml-sycl.cpp + im2col.cpp + im2col.hpp | New ggml_sycl_im2col_3d function; GGML_OP_IM2COL_3D now supported on Intel GPU via SYCL; 2D im2col kernel rewritten with tile-based IC_KH_KW thread decomposition; new SYCL_IM2COL_BLOCK_SIZE 256; additive, no project changes required |
| ~b9094–b9102 | ggml/CMakeLists.txt | GGML version patch bumped 0.11.0 → 0.11.1; no project changes required |
| ~b9094–b9102 | common/sampling.cpp | Bug fix in common_sampler_sample: set_logits now called at the top before backend-sampling check; backend sampling token-selection now scans all of cur_p.data to find matching token (instead of artificial 1-element array), fixing cur_p.selected for downstream n_probs; post-sampling probabilities now work correctly with backend sampling |
| ~b9094–b9102 | tools/server/server-context.cpp | need_logits renamed to need_pre_sample_logits; only set when n_probs > 0 && !post_sampling_probs; backend sampling now works with post_sampling_probs; 0.0-probability tokens filtered from result.probs; compiled from upstream, no project JNI changes required |
| ~b9094–b9102 | src/llama-model.cpp | n_vocab loading moved from llama_model_base::load_hparams() to per-model load_arch_hparams() (e.g. src/models/deepseek2.cpp, src/models/llama.cpp); internal model-loading refactor, no project changes required |
| ~b9094–b9102 | ggml/src/ggml-virtgpu/ggml-backend-device.cpp | Gains `#include <mutex>` for std::once_flag; internal backend fix, no project changes required |
| ~b9094–b9102 | vendor/cpp-httplib/httplib.cpp + httplib.h | Security fix: chunk-size parsing replaced strtoul with manual hex-digit scanning to prevent overflow and reject invalid chunk extensions; version bumped to 0.43.4; compiled automatically, no project changes required |
| ~b9102–b9103 | vendor/cpp-httplib/httplib.cpp + httplib.h | cpp-httplib bumped to v0.44.0: (1) RFC 9110 §5.5 compliance: header field values are no longer percent-decoded by the recipient in parse_header; Location/Referer special-casing removed; callers that need URI-component decoding must call decode_uri_component() explicitly; (2) ThreadPool constructor is now exception-safe: if thread creation fails partway through, already-started workers are signalled to exit and joined before rethrowing, preventing std::terminate from joinable threads in the destructor; compiled automatically, no project changes required |
| ~b9103–b9106 | ggml/src/ggml-vulkan/ggml-vulkan.cpp + Vulkan shaders | Vulkan flash attention refactored: pipeline_flash_attn_f32_f16 changed from a per-type array of maps to a single map; mixed K/V quant types (e.g. Q4_0 K + F16 V) now supported on all Vulkan FA paths (scalar, cm1, cm2) rather than coopmat2 only; per-type SPIR-V variants replaced by two generic modules (flash_attn_f32_f16 and flash_attn_f32_f16_int8) that select K/V type at runtime via FaTypeK/FaTypeV spec constants; new flash_attn_dequant.glsl contains aliased SSBO views and an uber dequantize4() switch; the K/V type mismatch guard removed from ggml_backend_vk_device_supports_op; internal Vulkan backend refactor, no project changes required |
| ~b9103–b9106 | ggml/src/ggml-cuda/argsort.cu | Added `#include <cuda/iterator>` for CCCL ≥ 3.1 strided-iterator path; internal CUDA backend, no project changes required |
| ~b9103–b9106 | convert_hf_to_gguf.py | Mistral Medium 3.5 mmproj support: n_embd_text now reads "dim" key instead of "hidden_dim"; negative img_break_tok_id placeholders resolved from tekken.json or tokenizer.json; conversion tool only, no project changes required |
| ~b9106–b9134 | common/arg.cpp | CLI option --spec-draft-ctx-size / -cd / --ctx-size-draft REMOVED (throws std::invalid_argument at parse time); ModelParameters.setCtxSizeDraft() removed; no replacement (context size now managed internally by speculative engine) |
| ~b9106–b9134 | common/arg.cpp | CLI option --spec-draft-replace / --spec-replace REMOVED (throws std::invalid_argument at parse time); no corresponding Java method existed |
| ~b9106–b9134 | common/speculative.h | Full redesign: common_speculative_type enum values renamed DRAFT→DRAFT_SIMPLE, EAGLE3→DRAFT_EAGLE3; common_params_speculative.type (single enum) → .types (vector); common_speculative_n_max() / common_speculative_n_min() REMOVED; new common_speculative_init(params, n_seq) no longer takes ctx; new common_speculative_begin(spec, seq_id, prompt), common_speculative_draft(spec), common_speculative_accept(spec, seq_id, n), common_speculative_process(spec, batch) signatures; common_speculative_draft_params struct added; server sources compiled directly, no project JNI changes required |
| ~b9106–b9134 | common/common.h | New common_prompt_checkpoint struct (contains data_tgt + data_dft) replaces the old server_prompt_checkpoint in server-task.h; compiled from upstream server sources, no project JNI changes required |
| ~b9106–b9134 | tools/server/server-task.cpp | task_params::to_json() renamed field "speculative.type" → "speculative.types" (now serialises the vector); test SlotParamsToJson.SpeculativeFields_Present updated accordingly |
| ~b9106–b9134 | include/llama.h | New LLAMA_STATE_SEQ_FLAGS_NONE = 0 macro added; additive, no project changes required |
| ~b9134–b9145 | tools/server/server-common.cpp | New continue_final_message boolean request field in oaicompat_chat_params_parse; vLLM/transformers-compatible alias for the prefill-assistant heuristic: when true, the last assistant message is extended without appending an end-of-turn token; mutually exclusive with add_generation_prompt=true (throws 400); compiled from upstream server sources; InferenceParameters.setContinueFinalMessage(boolean) added |
| ~b9134–b9145 | ggml/src/ggml-sycl/ | Level Zero API integration for SYCL device memory allocation (GGML_SYCL_SUPPORT_LEVEL_ZERO build option, GGML_SYCL_ENABLE_LEVEL_ZERO runtime env); reduces system RAM usage on Intel dGPUs; internal SYCL backend, no project changes required |
| ~b9134–b9145 | ggml/src/ggml-opencl/ | Q5_0 and Q5_1 MoE GEMM/GEMV kernels added for Adreno (Qualcomm) GPUs; internal OpenCL backend, no project changes required |
| ~b9134–b9145 | ggml/src/ggml-cuda/allreduce.cu | AllReduce accumulation now routed through float intermediate for precision (avoids BF16 truncation); internal CUDA backend, no project changes required |
| ~b9134–b9145 | ggml/src/ggml-hexagon/ | GGML_UNARY_OP_TANH added to Hexagon HTP backend; internal DSP backend, no project changes required |
| ~b9134–b9145 | ggml/src/ggml-webgpu/ggml-webgpu-shader-lib.hpp | use_subgroup_matrix condition now also checks sg_mat_k > 0 && sg_mat_n > 0 and alignment; prevents crash on devices reporting subgroup matrix support with zero k/n; internal WebGPU backend, no project changes required |
| ~b9145–b9150 | ggml/src/ggml-vulkan/ggml-vulkan.cpp |
Bug fix: mul_mat_l_int[i] / mul_mat_m_int[i] / mul_mat_s_int[i] / mul_mat_id_l_int[i] / mul_mat_id_m_int[i] / mul_mat_id_s_int[i] were unconditionally set to true instead of mirroring the actual device pipeline capabilities from mul_mat_l[i] etc.; now properly initialized; internal Vulkan backend bug fix, no project changes required |
| ~b9145–b9150 | src/unicode.cpp |
New unicode_regex_split_custom_qwen35() function registered for the Qwen 3.5 tokenizer regex pattern; uses [\p{L}\p{M}]+ letter-plus-combining-mark runs vs. Qwen2's \p{L}+; additive internal tokenizer change, no project changes required |
| ~b9145–b9150 | ggml/src/ggml-cpu/ggml-cpu-riscv64-spacemit/ |
SpaceMIT RISC-V IME backend major refactor: IME2 kernels, expanded quantization (Q2_K, Q3_K, Q6_K, Q8_0, Q5_0, Q5_1, Q5_K, MXFP4), TCM (Tightly Coupled Memory) pool; new source files ime2_kernels.cpp, ime_env.cpp, repack.cpp, rvv_kernels.cpp, spine_mem_pool.cpp; guarded by GGML_CPU_RISCV64_SPACEMIT build flag; no project changes required |
| ~b9150–b9151 | common/log.h |
New LOG_TRC macro added at LOG_LEVEL_TRACE = 4 (between INFO=3 and DEBUG=5); LOG_LEVEL_DEBUG bumped from 4 to 5; new LOG_TRCV verbosity variant; additive, no project changes required |
| ~b9150–b9151 | common/common.h + common/common.cpp |
New common_params_print_info(const common_params &) function: prints verbosity level, per-device memory (name, total, free), and system info at LOG_INF level; replaces the two-line pattern LOG_INF("build_info: %s\n", llama_build_info()); LOG_INF("%s\n", common_params_get_system_info(params).c_str()); — updated in jllama.cpp |
| ~b9150–b9151 | common/common.cpp |
common_init() now unconditionally calls common_log_set_prefix(…, true) and common_log_set_timestamps(…, true) before setting the log callback; log output will always include prefix and timestamps unless explicitly disabled with --no-log-prefix / --no-log-timestamps |
| ~b9150–b9151 | common/arg.cpp |
--log-prefix and --log-timestamps now also accept negated forms --no-log-prefix / --no-log-timestamps (lambda receives a bool value); backing env vars renamed LLAMA_LOG_PREFIX → LLAMA_ARG_LOG_PREFIX and LLAMA_LOG_TIMESTAMPS → LLAMA_ARG_LOG_TIMESTAMPS; Java layer does not expose these, so no project changes required |
| ~b9150–b9151 | tools/server/server-common.h |
New SLT_TRC and SRV_TRC macros (emit at LOG_TRC level); additive, no project changes required |
| ~b9150–b9151 | tools/server/server-context.cpp |
New server_slot::t_print_last field + print_timings_tg() / print_timings_pp() methods: emit periodic in-flight token-generation and prompt-processing throughput to SLT_INF (throttled to ≥100 decoded tokens and ≥3 s interval); server_context_impl constructor now calls mtmd_helper_log_set unconditionally (was guarded by !is_resume); many SLT_INF/SRV_WRN downgraded to SLT_TRC/SRV_INF; compiled from upstream, no project JNI changes required |
| ~b9150–b9151 | tools/server/server-task.cpp |
Several SRV_WRN calls downgraded to SRV_INF; one SRV_WRN upgraded to SRV_ERR for failed state restore; compiled from upstream, no project changes required |
| ~b9151–b9172 | tools/mtmd/clip.h |
clip_has_whisper_encoder() removed from public API; not referenced by project — no changes required |
| ~b9151–b9172 | tools/server/CMakeLists.txt + scripts/webui-download.cmake (new) |
WebUI assets no longer committed (tools/server/public/ gitignored); provisioned at build time via HF bucket (LLAMA_USE_PREBUILT_WEBUI=ON default) or built from source (LLAMA_BUILD_WEBUI); project calls set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE) before FetchContent to skip asset download |
| ~b9151–b9172 | common/common.h |
common_params::webui default made conditional on LLAMA_WEBUI_DEFAULT_ENABLED macro (falls back to true when undefined); compiled server sources unaffected |
| ~b9151–b9172 | common/reasoning-budget.cpp |
common_reasoning_budget_clone rewritten to use llama_sampler_init properly; pure bug fix, no API change, no project changes required |
| ~b9151–b9172 | ggml/src/ggml-cuda/fattn-mma-f16.cuh + mma.cuh |
AMD RDNA3 WMMA flash attention support; new DATA_LAYOUT_I_MAJOR_SCRAMBLED, tile<16,16,half2,I_MAJOR_SCRAMBLED>, extended config tables; internal CUDA backend, no project changes required |
| ~b9151–b9172 | tools/server/server-chat.cpp |
Non-function Responses API tools now silently skipped (continue) instead of throwing; server behavior fix, no Java API change required |
| ~b9172–b9198 | project CMakeLists.txt |
Option LLAMA_BUILD_WEBUI renamed to LLAMA_BUILD_UI (and LLAMA_USE_PREBUILT_WEBUI → LLAMA_USE_PREBUILT_UI); upstream keeps a backward-compat shim that forwards the old cache variable with a DEPRECATION message, so this project's set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE) still works unchanged |
| ~b9172–b9198 | common/common.h |
common_params::webui / webui_mcp_proxy / webui_config_json deprecated in favour of ui / ui_mcp_proxy / ui_config_json; both pairs of fields are kept and synced by common/arg.cpp, compiled upstream sources unaffected; new common_params::ctx_type and cparams.n_rs_seq fields added (default LLAMA_CONTEXT_TYPE_DEFAULT / 0), additive |
| ~b9172–b9198 | common/common.cpp + common.h |
common_params_print_info gained optional print_devices parameter (default true); upstream tools/server/server.cpp passes !is_router_server to skip GPU enumeration on the router process; this project does not compile server.cpp, no impact |
| ~b9172–b9198 | common/speculative.h + speculative.cpp |
New enum value COMMON_SPECULATIVE_TYPE_DRAFT_MTP (count is now 9); new common_speculative_need_embd() API; MTP draft implementation added (common_speculative_state_draft_mtp); --spec-type draft-mtp CLI flag added in common/arg.cpp; additive, no project changes (could be exposed later as a ModelParameters enhancement) |
| ~b9172–b9198 | include/llama.h |
New enum llama_context_type { LLAMA_CONTEXT_TYPE_DEFAULT, LLAMA_CONTEXT_TYPE_MTP }; new llama_context_params::n_rs_seq (recurrent-state snapshots per seq for rollback) and ctx_type fields; new llama_n_rs_seq() accessor; all additive, default-zero, no project impact |
| ~b9172–b9198 | src/llama-ext.h (new) + src/llama-context.cpp |
New pre-norm embedding extraction path: llama_set_embeddings_pre_norm / llama_get_embeddings_pre_norm[_ith] APIs and an embd_pre_norm output buffer in llama_context; used by the MTP draft loop only, additive |
| ~b9172–b9198 | src/llama-memory-recurrent.cpp |
Recurrent-state rollback support: per-seq rs_idx snapshot index and set_rs_idx() helper; tensors widened to (1 + n_rs_seq) groups; seq_rm now rolls back via snapshot when within n_rs_seq bounds. Backwards-compatible when n_rs_seq == 0 (this project's default), no project changes |
| ~b9172–b9198 | tools/server/server-context.cpp |
Embedding endpoint default now reads params.embd_normalize (was hard-coded 2); compiled upstream, no project changes |
| ~b9172–b9198 | tools/server/CMakeLists.txt + new tools/ui/CMakeLists.txt |
WebUI asset wiring moved into a new llama-ui static library; tools/server now links llama-ui; project does not build the llama-server binary (only compiles server-context.cpp / server-queue.cpp / server-task.cpp / server-models.cpp directly into jllama), so no impact. HF bucket name renamed LLAMA_WEBUI_HF_BUCKET → LLAMA_UI_HF_BUCKET (old name still honoured) |
| ~b9172–b9198 | vendor/cpp-httplib/httplib.{h,cpp} |
Bumped to v0.45.0: RFC 9112 §6 message-body framing — requests without Content-Length / Transfer-Encoding no longer scan for stray body bytes on persistent connections (fixes #2450 keep-alive misframing); X-Forwarded-For parser falls back to the connection remote address when the header is empty/malformed; compiled automatically, no project changes |
| ~b9172–b9198 | ggml/CMakeLists.txt |
GGML version bumped 0.11.1 → 0.12.0; no project changes |
| ~b9172–b9198 | ggml/src/ggml.c + ggml-cuda/gated_delta_net.cu + ggml-metal/ggml-metal.metal + ggml-vulkan/vulkan-shaders/gated_delta_net.comp |
ggml_gated_delta_net state tensor reshaped from 2D (S_v*S_v*H, n_seqs) to 3D (S_v*S_v*H, K, n_seqs) where K is the snapshot slot count (K=1 is final-state-only, K>1 keeps last min(n_tokens, K) per-token snapshots); internal Qwen3.5 / Qwen3-Next recurrent-attention kernel, no project changes |
| ~b9198–b9219 | common/chat.{h,cpp} |
New common_chat_continuation enum (NONE/AUTO/REASONING/CONTENT); new common_chat_msg::render_content(delimiter) method; new continue_final_message field on common_chat_templates_inputs; new common_chat_continuation_parse() accepts both bool and "reasoning_content"/"content" strings; common_chat_template_generation_prompt() extracted; oaicompat_chat_params_parse refactored to route the prefill-assistant heuristic through the new continuation enum. Existing bool wire-format unchanged; the new string variants are exposed via InferenceParameters.setContinueFinalMessage(ContinuationMode) |
| ~b9198–b9219 | common/hf-cache.{h,cpp} + common/arg.cpp |
hf_cache::migrate_old_cache_to_hf_cache() and hf_file::size field removed; the migration call in common_params_parse_ex was dropped. Internal to arg.cpp, no project impact |
| ~b9198–b9219 | common/speculative.{h,cpp} + src/llama-ext.h + src/llama-context.{h,cpp} + src/llama-cparams.h |
llama_set_embeddings_pre_norm(ctx, value) → llama_set_embeddings_pre_norm(ctx, value, masked) (3rd bool arg distinguishes "embeddings for outputs only" from "embeddings for every token"); new cparams.embeddings_pre_norm_masked; new common_speculative_need_embd_pre_norm() API; MTP draft path now uses pre-norm extraction. Project does not call any of these APIs (speculative decoding is configured via ModelParameters only), no source changes required |
| ~b9198–b9219 | tools/server/server-task.{h,cpp} |
task_result_state ctor moved from header into .cpp — now seeds chat_msg via common_chat_parse("", true, …) when !echo so the assistant prefill is not echoed back as a delta; new bool echo field on chat_parser_params (default false, populated from request body via json_value(data, "echo", false)). Project compiles server-task.cpp from upstream and does not instantiate task_result_state directly, no source changes required |
| ~b9198–b9219 | tools/server/server-context.cpp + server-models.cpp |
New cors_proxy_enabled boolean field added to /props and /v1/models JSON responses (set from params.ui_mcp_proxy || params.webui_mcp_proxy). Additive, no Java consumer in this project |
| ~b9198–b9219 | upstream CMakeLists.txt |
Backward-compat shim widened: if(DEFINED LLAMA_BUILD_WEBUI AND NOT DEFINED LLAMA_BUILD_UI) → if(DEFINED LLAMA_BUILD_WEBUI), so setting the old name now always forwards to the new one (and emits the existing DEPRECATION message). Project sets only the old name, via set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE) (CMakeLists.txt:107), so behaviour is unchanged |
| ~b9198–b9219 | ggml/src/ggml-cuda/ssm-conv.cu + top-k.cu |
Added kernel size 15 to SSM-conv launcher (now supports 3/4/5/9/15); top-k.cu includes <cuda/iterator> for CCCL ≥ 3.1; internal CUDA backend, no project changes |
| ~b9198–b9219 | ggml/src/ggml-sycl/ggml-sycl.cpp + vecdotq.hpp |
SYCL GEMM now falls back to direct MKL for small problems (gemm_flops < 256³); Q6_K dot product refactored to a single scalar fast-path helper vec_dot_q6_K_q8_1_impl_mmvq_scalar; internal SYCL backend, no project changes |
| ~b9219–b9222 | ggml/src/ggml-hexagon/ + htp/pad-ops.c (new) + htp/unary-ops.c |
Hexagon HTP backend gains GGML_OP_PAD (HVX + optional VTCM/DMA double-buffered, both zero-pad and circular-pad variants) and GGML_OP_TRI (HVX-vectorised triangular masking) support; new HTP_OP_PAD / HTP_OP_TRI opcodes; internal Qualcomm DSP backend, no project changes |
| ~b9219–b9222 | .devops/*.Dockerfile + .github/workflows/docker.yml |
OCI image labels (org.opencontainers.image.*) added via BUILD_DATE/APP_VERSION/APP_REVISION build args; new skip_s390x workflow_dispatch input; manifest annotations on docker buildx imagetools create; upstream packaging/CI only, no project changes |
| ~b9222–b9245 | common/common.h + common.cpp |
common_init_result(common_params &, bool model_only = false) and common_init_from_params(common_params &, bool model_only = false) gain an optional model_only flag that skips context/sampler/lora/warmup setup and returns only the loaded model. Additive with default value; no project call sites in src/main/cpp/, no source changes required |
| ~b9222–b9245 | common/common.h |
common_params_speculative_draft defaults retuned: n_max 16→3, p_min 0.75f→0.0f. Defaults only; Java ModelParameters sets these explicitly via JSON, so behaviour is unchanged for this project |
| ~b9222–b9245 | common/speculative.{h,cpp} |
common_speculative_impl::accept() virtual gains a 3rd bool is_other parameter; common_speculative_accept() now broadcasts the accepted-token count to every registered impl (with is_other=true for impls that did not generate the draft). common_speculative_impl_ngram_map_k ctor signature simplified (no longer takes common_params_speculative). Each impl also gains new LOG_INF startup banners. Internal to upstream-compiled server-context.cpp; no project call sites |
| ~b9222–b9245 | common/arg.cpp + common/common.cpp + tools/fit-params/fit-params.cpp |
--verbosity levels relabeled: level 4 now means "trace (more info)" and level 5 means "debug"; LOG_LEVEL_DEBUG constant value moved from 4 to 5. Direct params.verbosity >= 4 comparisons in upstream common.cpp and fit-params.cpp replaced with >= LOG_LEVEL_DEBUG. Project does not reference LOG_LEVEL_DEBUG or numeric verbosity thresholds in src/main/cpp/; no source changes required |
| ~b9222–b9245 | common/arg.cpp |
--spec-type duplicate-arg DEPRECATED warning suppressed (the flag legitimately accepts repeated values to form the comma-list). Behaviour-only |
| ~b9222–b9245 | common/ngram-map.cpp |
One per-draft LOG_INF downgraded to LOG_DBG. Log-level only |
| ~b9222–b9245 | src/llama-graph.h |
llm_graph_params::operator== adds a third disjunct so ubatches with both token and embd arrays present compare equal (graph reuse fix for MTP pre-norm path). Internal |
| ~b9222–b9245 | src/llama-memory-recurrent.{h,cpp} + src/llama-memory-hybrid.cpp + src/llama-memory-hybrid-iswa.cpp |
init_batch() now forces sequential split (split_seq) instead of equal split when n_rs_seq > 0 (recurrent-state rollback is incompatible with equal splits). Internal upstream model code, no project impact |
| ~b9222–b9245 | src/models/delta-net-base.cpp + src/models/models.h + src/models/qwen35.cpp |
llm_build_delta_net_base::keep_rs() helper removed; conv-state and recurrent-attn paths reworked to read cparams.n_rs_seq directly and loop K = n_rs_seq + 1 snapshot slots. Comment fix in qwen35.cpp MTP layer index. All internal upstream model code |
| ~b9222–b9245 | tools/server/server-context.cpp |
pos_min_thold lowered by one (pos_next - n_swa → pos_next - n_swa - 1); checkpoint trigger guard relaxed from n_past < slot.prompt.n_tokens() to n_past <= slot.prompt.n_tokens(); per-slot print_timings_pp/print_timings_tg lines split into separate SLT_INF calls; new log lines reporting graphs reused and draft acceptance; n_draft_total log moved from SLT_CNT to SLT_INF. Compiled upstream-as-is, no project changes |
| ~b9222–b9245 | ggml/src/ggml-cuda/mmvq.cu |
calc_nwarps table tweak: Q6_K returns 2 warps (was grouped with the 8-warp tier). Internal CUDA backend |
| ~b9222–b9245 | ggml/src/ggml-hexagon/ (htp/rope-ops.c, htp/unary-ops.c, htp-ops.h, main.c, ggml-hexagon.cpp) |
New HTP_OP_NORM opcode (mean+variance norm); rope-ops.c adds MROPE / IMROPE position-id support via new mrope_cache_init(). Internal Qualcomm DSP backend |
| ~b9222–b9245 | ggml/src/ggml-opencl/ (ggml-opencl.cpp, kernels/cvt.cl, six new gemm_moe_q{4,5,6}_k_f32_ns + gemv_moe_q{4,5,6}_k_f32_ns kernels) |
Adreno MoE pipeline extended to Q4_K / Q5_K / Q6_K (image1d_buffer_t transposed layout, dedicated convert/restore kernels, GEMM + GEMV paths). Internal OpenCL backend |
| ~b9222–b9245 | ggml/src/ggml-rpc/ggml-rpc.cpp |
last_graph_uid field moved from ggml_backend_rpc_context (per-backend) into ggml_backend_rpc_device_context (per-device) so multiple backends sharing a device reuse cached graphs. Internal RPC backend |
| ~b9222–b9245 | ggml/src/ggml-sycl/ggml-sycl.cpp |
New GGML_SYCL_USE_ASYNC_MEM_OP env (default 1) decouples async USM alloc/free from the graph path. Internal SYCL backend |
| ~b9222–b9245 | ggml/src/ggml-webgpu/ggml-webgpu.cpp + wgsl-shaders/gated_delta_net.wgsl |
Gated-delta-net shader gains a K snapshot-count param; per-slot snapshot write path added. Internal WebGPU backend |
| ~b9222–b9245 | convert_hf_to_gguf.py, convert_lora_to_gguf.py, examples/save-load-state/save-load-state.cpp, examples/llama-eval/*, tools/cli/README.md, tools/server/README.md, docs/speculative.md, docs/backend/SYCL.md |
Doc/example/tooling updates only. Not compiled by this project |
| ~b9222–b9245 | tools/ui/* |
WebUI source reorganisation (enum file renames *.ts → *.enums.ts, new chat components, Tailwind plugin imports). Project calls set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE) in CMakeLists.txt, so the UI is never built; no impact |
| ~b9245–b9264 | src/llama-chat.{h,cpp} |
LLM_CHAT_TEMPLATE_HUNYUAN_OCR renamed to LLM_CHAT_TEMPLATE_HUNYUAN_VL (HunyuanOCR and HunyuanVL now share one template). Not referenced by project — no source changes required |
| ~b9245–b9264 | tools/mtmd/clip-impl.h + tools/mtmd/models/ |
PROJECTOR_TYPE_HUNYUANOCR removed and merged into PROJECTOR_TYPE_HUNYUANVL; hunyuanocr.cpp renamed to hunyuanvl.cpp; clip graph class clip_graph_hunyuanocr renamed to clip_graph_hunyuanvl. Not referenced by project — no source changes required |
| ~b9245–b9264 | tools/mtmd/clip.h |
clip_is_minicpmv() and clip_is_glm() removed from public API. Not referenced by project — no source changes required |
| ~b9245–b9264 | tools/mtmd/clip.h (struct clip_context_params) |
New bool no_alloc field added (initialized via mtmd_context_params_default()). Additive default-zero — no project changes required |
| ~b9245–b9264 | tools/mtmd/mtmd.h |
New mtmd_get_memory_usage() C++ API for estimating mmproj VRAM/RAM usage. Additive, not called by project |
| ~b9245–b9264 | tools/mtmd/clip-model.h |
New enum pad_style { PAD_NONE, PAD_CEIL, PAD_NEAREST } replacing the bool image_resize_pad flag (allows Pillow-byte-parity nearest-integer rounding for DeepSeek-OCR). Internal to mtmd, project links mtmd as-is |
| ~b9245–b9264 | common/common.h (struct common_params_speculative_draft) |
New bool backend_sampling = true field — offloads draft sampling to the backend. Additive default-on; Java ModelParameters doesn't set it, so the upstream default applies. Backend sampler auto-disables when split_mode == TENSOR in src/llama-context.cpp — safe |
| ~b9245–b9264 | common/speculative.cpp |
common_speculative_impl_draft_mtp now registers a per-seq backend sampler chain (top-k 10) on ctx_dft via llama_set_sampler; cleaned up in destructor. Falls back to CPU sampler if llama_set_sampler fails. Internal to upstream-compiled speculative module, no project call sites |
| ~b9245–b9264 | app/ (new) |
New optional unified llama binary (llama-app target) dispatching to serve/cli/completion/bench. Guarded by LLAMA_BUILD_APP=OFF default — project doesn't enable it |
| ~b9245–b9264 | tools/{cli,completion,llama-bench,server}/CMakeLists.txt |
Each tool split into a *-impl static library (the logic) plus a thin main.cpp wrapper; the main() in cli.cpp/completion.cpp/llama-bench.cpp/server.cpp is renamed to llama_cli/llama_completion/llama_bench/llama_server and now satisfies -Wmissing-declarations via a forward decl. Project does NOT compile any of these .cpp files — only server-context.cpp, server-queue.cpp, server-task.cpp, server-models.cpp (see CMakeLists.txt:237/:302) — so no impact |
| ~b9245–b9264 | tools/server/server-context.cpp |
Adds mmproj memory estimation: when params_base.fit_params is set, calls mtmd_get_memory_usage(mmproj_path, mparams) and adds the per-device cost into params_base.fit_params_target before common_init_from_params. Also calls mtmd_helper_log_set(common_log_default_callback, nullptr) once when !is_resume. Compiled upstream-as-is, no project call sites |
| ~b9245–b9264 | src/llama-context.cpp |
New llama_context::set_sampler() short-circuits with a one-shot LLAMA_LOG_WARN and returns false when model.split_mode() == LLAMA_SPLIT_MODE_TENSOR (backend sampling not supported with tensor split). Internal safety check, no project call sites |
| ~b9245–b9264 | common/arg.cpp |
New CLI flags --spec-draft-backend-sampling / --no-spec-draft-backend-sampling and env LLAMA_ARG_SPEC_DRAFT_BACKEND_SAMPLING to toggle the new backend_sampling field. Not exposed by ModelParameters; could be added later as a Java-side enhancement |
| ~b9245–b9264 | ggml/src/ggml-cuda/CMakeLists.txt + common.cuh + binbcast.cu, concat.cu, cpy.cu, fattn-*.cu, gated_delta_net.cu, getrows.cu, mean.cu, mmvf.cu, mmvq.cu, norm.cu, quantize.cu, reduce_rows.cuh, rope.cu, scale.cu, set-rows.cu, softcap.cu, ssm-conv.cu, ssm-scan.cu, sumrows.cu, topk-moe.cu, unary.cu |
New PDL (Programmatic Dependent Launch) infrastructure: GGML_CUDA_USE_PDL build flag (CUDART ≥ 11.8, non-HIP/MUSA); ggml_cuda_pdl_sync() / ggml_cuda_pdl_lc() device helpers (active on Hopper sm_90+); ggml_cuda_kernel_launch_params + ggml_cuda_kernel_launch() host template that calls cudaLaunchKernelEx with stream-serialization attribute when GGML_CUDA_PDL env var allows. Adds 90-virtual (Hopper) to default CMAKE_CUDA_ARCHITECTURES when CUDA ≥ 11.8. Internal CUDA backend, no project changes required |
| ~b9245–b9264 | ggml/src/ggml-metal/ggml-metal-{device,ops}.cpp + ggml-metal.metal |
New 4-element kernel_pad_*_4 variant (currently disabled — is_c4 = false); kernel_pad rewritten with 1024-element-per-block tiling for larger tensors; kernel_cpy_* rewritten to use tpitg rows-per-threadgroup batching; Q quantization cpy paths use 256-thread limit. Internal Metal backend |
| ~b9245–b9264 | ggml/src/ggml-hexagon/htp/ (hmx-matmul-ops.c, hmx-ops.h, matmul-ops.c, main.c) |
HMX matmul refactor: K-loop tiled in 32-tile blocks with Q6_activation_hf_mxmem_RR_deep; the out-stationary fallback path for large M·K·N was deleted; function rename hmx_mat_mul_permuted_w16a32 → hmx_matmul_f16_f32, hmx_mat_mul_permuted_qk_0_d16a32 → hmx_matmul_q_f32, hmx_mat_mul_permuted_w16a32_batched_params_t → hmx_matmul_f16_f32_batched_params_t. HMX power-up code reorganized (HAP_power_set_HMX_v2 now combines power-on + clock in one step for __HVX_ARCH__ ≥ 75). Internal Qualcomm DSP backend |
| ~b9245–b9264 | ggml/src/ggml-opencl/ggml-opencl.cpp |
Lazy kernel compilation: argsort and flash_attn programs are now built only when first needed (load_cl_kernels_argsort / load_cl_kernels_flash_attn called from supports_op); new device-supported probe in ggml_opencl_is_device_supported runs at registration time; renamed ggml_cl2_init/ggml_cl2_free → ggml_cl_init/ggml_cl_free; OpenCL contexts now live as long as the process. Internal OpenCL backend |
| ~b9245–b9264 | ggml/src/ggml-vulkan/vulkan-shaders/im2col.comp |
Refactor: precomputed base input coords and step deltas; running pointer/index for destination; one inlined unrolled loop iteration writes BLOCK_SIZE outputs per step. Internal Vulkan backend |
| ~b9245–b9264 | src/models/delta-net-base.cpp |
Renamed local variables (state_in_3d→s_3d, state_3d→s_3d_pad) when reshaping the recurrent state; behaviour unchanged |
| ~b9245–b9264 | tools/mtmd/mtmd-image.cpp |
img_tool::resize() takes a pad_style enum (was bool add_padding); new PAD_NEAREST rounding path for Pillow byte-parity; mtmd_image_preprocessor_deepseekocr::preprocess rewritten with static constexpr resolution table and RESIZE_ALGO_BICUBIC_PILLOW + PAD_NEAREST. Internal mtmd, project links as-is |
| ~b9245–b9264 | tools/mtmd/models/deepseekocr.cpp |
Extracted build_sam(ggml_tensor *inp_raw) member function from the monolithic build path; FA mask casting to F16 only when flash_attn_type == CLIP_FLASH_ATTN_TYPE_ENABLED. Internal |
| ~b9245–b9264 | conversion/hunyuan.py, gguf-py/gguf/constants.py, gguf-py/gguf/tensor_mapping.py |
HunyuanOCR / HunyuanVL unified in conversion: VisionProjectorType.HUNYUANOCR removed; HunYuanVLForConditionalGeneration registers a single HunyuanVLVisionModel + HunyuanVLTextModel; vit.perceive.* tensor mappings now only mention HunyuanVL. Python tooling, not compiled by project |
| ~b9245–b9264 | CMakeLists.txt (upstream) |
New LLAMA_BUILD_APP option (default OFF); deprecation shims for LLAMA_BUILD_WEBUI/LLAMA_USE_PREBUILT_WEBUI → LLAMA_BUILD_UI/LLAMA_USE_PREBUILT_UI preserved. Project's set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE) still works unchanged |
| ~b9245–b9264 | .devops/*.Dockerfile, .github/workflows/build-and-test-snapdragon.yml, scripts/snapdragon/, docs/backend/snapdragon/, tools/cli/README.md, tools/server/README.md, tools/mtmd/tests/ |
Docker images add conversion/ dir; snapdragon toolchain bumped v0.3 → v0.6 with +dotprod+i8mm; mtmd test rewritten to use CER/chrF metrics; doc-only updates. Not compiled by project |
| ~b9264–b9279 | tools/server/server-context.cpp |
Slot-info JSON adds three additive fields (n_prompt_tokens, n_prompt_tokens_processed, n_prompt_tokens_cache) on each in-flight task; server_context_impl::destroy() now resets spec / ctx_dft / model_dft BEFORE llama_init.reset() to avoid use-after-free when a draft model holds back-references into the target context. Compiled directly into jllama from upstream — no project source changes required |
| ~b9264–b9279 | tools/server/server-models.cpp |
Adds #include <cstdlib> and a LLAMA_APP_CMD env-var lookup in server_model_meta::update_args() to re-inject the unified-binary subcommand into router-spawned child argv. Env var is only set by the new llama-app binary (which this project does not build), so the lookup harmlessly returns null and the code path is a no-op. Compiled upstream-as-is, no project changes |
| ~b9264–b9279 | src/llama-vocab.cpp |
New hybriddna BPE tokenizer model (DNA k-mer tokenization with <dna>…</dna> tag handling, k=6, OOV fallback) registered as a BPE variant; reached only when GGUF metadata declares tokenizer.model = "hybriddna". Adds a virtual destructor + virtual tokenize() to llm_tokenizer_bpe_session and a llm_tokenizer_hybriddna_session subclass; existing BPE callers unchanged. Additive, no project changes |
| ~b9264–b9279 | src/llama-graph.cpp |
llm_graph_input_attn_kv_iswa::set_input() / can_reuse() now guard the base and SWA tensor accesses behind if (self_k_idxs && self_k_idxs->buffer) / if (self_k_idxs_swa && self_k_idxs_swa->buffer). Fixes crashes on models with only-SWA or only-non-SWA attention layers. Internal, no project impact |
| ~b9264–b9279 | src/models/qwen35.cpp + src/models/qwen35moe.cpp |
MTP draft sub-graph now builds an inp_out_ids input and applies ggml_get_rows(cur, inp_out_ids) just before the head norm, so only the requested output rows are projected. Bug fix for MTP draft path; internal, no project changes |
| ~b9264–b9279 | ggml/src/ggml-backend.cpp |
ggml_backend_tensor_get_2d() fast-path condition fixed: now checks iface.get_tensor_2d == NULL (was incorrectly checking set_tensor_2d), so multi-copy gets correctly fall back to the per-copy loop when the backend lacks get_tensor_2d. Bug fix, no project changes |
| ~b9264–b9279 | ggml/src/ggml-vulkan/ (ggml-vulkan.cpp, new vulkan-shaders/snake.comp, vulkan-shaders-gen.cpp) |
New Vulkan Snake activation fusion: detects the 5-op chain MUL → SIN → SQR → MUL → ADD (matching the CUDA fusion introduced in b9094) and dispatches a single fused snake_{f32,f16,bf16} kernel computing y = x + sin(a*x)^2 * inv_b. New ggml_vk_can_fuse_snake() validates contiguity, 2D shape, and broadcast operands [1, C, 1, 1]. Internal Vulkan backend, no project changes |
| ~b9264–b9279 | ggml/src/ggml-metal/ggml-metal-ops.cpp + ggml-metal.metal |
kernel_concat / kernel_set now batch multiple small rows into one threadgroup (nrptg = min(256/ne0, ne1), capped at 256 threads/group) to improve small-row throughput; kernel_concat gains an early-return bounds check. Internal Metal backend, no project changes |
| ~b9264–b9279 | ggml/src/ggml-hexagon/ (ggml-hexagon.cpp, htp/ssm-conv.c, htp/rope-ops.c) |
SSM_CONV HVX kernel rewritten with VTCM-staged 32×32 fp32 in-register transpose and per-thread tiling (1 MiB VTCM budget); strictly-contiguous gate replaced with byte-stride checks (nb[0]==sizeof(float) and nb[1]==ne[0]*sizeof(float)); rope_cache_init / mrope_cache_init marked __attribute__((noinline)) to reduce code-bloat on Hexagon. Internal Qualcomm DSP backend, no project changes |
| ~b9264–b9279 | examples/save-load-state/ removed, tests/test-save-load-state.cpp added; tools/{batched-bench,fit-params,quantize,perplexity}/CMakeLists.txt |
The llama-save-load-state example binary was removed and re-homed as a CTest target; the four remaining standalone tools were each split into a *-impl static library + a thin main.cpp wrapper (mirroring the b9245 split of cli/completion/llama-bench/server), with the entry-point renamed to llama_batched_bench / llama_fit_params / llama_quantize / llama_perplexity to satisfy -Wmissing-declarations. Project does not compile any of these .cpp files (only server-context.cpp, server-queue.cpp, server-task.cpp, server-models.cpp — see CMakeLists.txt), so no impact |
| ~b9264–b9279 | app/ (CMakeLists.txt, llama.cpp) |
llama-app unified binary gains four new subcommands (batched-bench, fit-params, quantize, perplexity) and sets LLAMA_APP_CMD in the env before dispatching so that the router can re-inject the subcommand into spawned child argv. Guarded by LLAMA_BUILD_APP=OFF default — project doesn't enable it, no impact |
| ~b9264–b9279 | conversion/base.py + conversion/llama.py |
New _set_vocab_hybriddna() Python helper that emits a gpt2-style BPE vocab tagged as tokenizer.model = "hybriddna"; LlamaModel.set_vocab() dispatches to it when tokenizer_config.json declares "tokenizer_class": "HybridDNATokenizer"; add_prefix_space handling moved earlier in the same method. Conversion tooling only, not compiled by project |
| ~b9279–b9284 | upstream CMakeLists.txt |
LLAMA_BUILD_APP default flipped OFF → ON. Project's LLAMA_BUILD_TOOLS is OFF (FetchContent, LLAMA_STANDALONE=OFF), so tools/-dependent app targets are not configured; nevertheless CMakeLists.txt:108 now explicitly forces set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE) to keep the cache pinned across upgrades |
| ~b9279–b9284 | tools/{batched-bench,cli,completion,fit-params,llama-bench,perplexity,quantize,server}/CMakeLists.txt |
Each *-impl target switched from add_library(... STATIC ...) to default library type (becomes SHARED when BUILD_SHARED_LIBS=ON); added WINDOWS_EXPORT_ALL_SYMBOLS ON and conditional install(TARGETS ... LIBRARY) under LLAMA_TOOLS_INSTALL. Project doesn't enable LLAMA_BUILD_TOOLS, so none of these targets are configured — no impact |
| ~b9279–b9284 | src/llama-vocab.cpp + conversion/base.py |
HybridDNA tokenizer fix: k-mers are now stored in token_to_id with a reserved \xee\x80\x80 (U+E000) suffix to disambiguate them from identical base-vocab BPE tokens (e.g. CCCCCC); the suffix is stripped from id_to_token text after vocab load. Pure tokenizer internals, not exposed via JNI — no project changes required |
| ~b9279–b9284 | ggml/src/ggml-cuda/common.cuh |
PDL-launch gating now uses ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_HOPPER instead of the raw device cc; this fixes false positives (PDL wrongly enabled) when running on a Hopper device with a binary compiled only for an older arch. Internal CUDA backend, no project changes required |
| ~b9284–b9297 | upstream CMakeLists.txt |
LLAMA_BUILD_APP default reverted from ON back to ${LLAMA_STANDALONE} (i.e. OFF for FetchContent consumers). Project's set(LLAMA_BUILD_APP OFF CACHE BOOL "" FORCE) shim is now redundant but harmless; kept as defensive pin against future flips |
| ~b9284–b9297 | common/chat.h + tools/server/server-task.cpp |
New additive common_chat_parser_params::is_continuation field (default false); params_from_json_cmpl now parses the continue_final_message request field via common_chat_continuation_parse() and sets is_continuation when the result is non-NONE. task_result_state ctor guard tightened: the empty-prefill chat_msg = common_chat_parse("", true, ...) initialization is now gated on is_continuation && !echo (was just !echo) — i.e. the assistant-prefill suppression delta is only emitted when an actual continuation is requested. Java InferenceParameters.setContinueFinalMessage(boolean|ContinuationMode) already writes continue_final_message to the request JSON, so behaviour is wired through automatically; non-continuation requests now correctly emit the first delta instead of suppressing it |
| ~b9284–b9297 | src/llama-model.{h,cpp} + src/models/qwen35.cpp + src/models/qwen35moe.cpp |
NVFP4 quantization extended to MTP (Multi-Token Prediction) tensors: llama_layer_nextn gains four scale fields (eh_proj_s, eh_proj_in_s, shared_head_head_s, shared_head_head_in_s); load_tensors() loads them when the corresponding base tensor exists and is NVFP4; Qwen3.5 / Qwen3.5-MoE MTP graphs pass the scales into build_lora_mm(). Internal model-loading + graph-building changes, no project changes required |
| ~b9284–b9297 | ggml/src/ggml-backend.cpp |
Bug fix in ggml_backend_tensor_get_2d_async: fast-path condition checked iface.set_tensor_2d_async == NULL (typo) instead of iface.get_tensor_2d_async == NULL; multi-copy gets now correctly fall back when the backend lacks get_tensor_2d_async. Also corrects an out-of-bounds assertion message from "write" to "read". Internal backend code, no project changes required |
| ~b9284–b9297 | ggml/src/ggml-opencl/ (ggml-opencl.cpp + 17 kernel files) |
Adreno MoE pipeline bug fix: GEMM/GEMV kernels for MXFP4/Q4_0/Q4_1/Q4_K/Q5_0/Q5_1/Q5_K/Q6_K had a boundary-check race where the ne01 bounds check exited threads early and prevented their participation in tile-wide reductions, causing wrong results when ne01 % 64 != 0. Fixed by: (1) rounding global_size[0] up to the next multiple of 64 in ggml_cl_mul_mat_id, (2) moving the per-thread ne01 early-return in each GEMM kernel to AFTER the tile reduction, (3) adding the same early-return in the GEMV kernels and the cvt.cl trans4_ns/restore_ns kernels; alignment threshold also relaxed from ne01 % 64 == 0 to ne01 % 32 == 0 in use_adreno_moe_kernels. Internal OpenCL backend, affects the opencl-android-aarch64 classifier build only — no project source changes |
| ~b9284–b9297 | ggml/src/ggml-sycl/ (ggml-sycl.cpp, dmmv.cpp, gated_delta_net.cpp, common.hpp) |
(1) BF16 added to ggml_sycl_supports_dmmv() and can_use_dequantize_mul_mat_vec(); new convert_mul_mat_vec_bf16_sycl path. (2) Level Zero auto-detect moved into ggml_sycl_init() — info.ext_oneapi_level_zero flag now reflects the GPU-only check (CPU devices ignored) and is used as the default for GGML_SYCL_ENABLE_LEVEL_ZERO env. (3) mmid_counting_sort_rows() replaces the per-expert atomic scan in ggml_sycl_mul_mat_id — host-side counting sort builds expert-contiguous row slices in a single pass instead of N×expert atomic scans; significant speedup for MoE dispatch. (4) Gated-delta-net kernel extended with keep_rs_t template parameter and per-token snapshot writes when K > 1, matching the CUDA/Vulkan snapshot changes from b9222. Internal SYCL backend, no project changes required |
| ~b9284–b9297 | ggml/src/ggml-vulkan/CMakeLists.txt |
find_package(SPIRV-Headers) switched to CONFIG REQUIRED, and $ENV{VULKAN_SDK} is now added to CMAKE_PREFIX_PATH; fixes detection when SPIRV-Headers ships only the CMake-config files (no FindSPIRV-Headers.cmake). Internal Vulkan build config, no project changes required |
| ~b9284–b9297 | ggml/src/ggml-zendnn/ (CMakeLists.txt, ggml-zendnn.cpp) |
ZenDNN bumped to ZenDNN-2026-WW19; Q8_0 weight support added for matmul and matmul_id paths via dynamic quantization (S8 compute, BF16 scales); ZenDNN matmul/matmul_id now handles GGML_TYPE_Q8_0 with FP32 src1 directly without F32→Q8_0 conversion. Internal AMD ZenDNN backend, no project changes required |
| ~b9284–b9297 | tools/perplexity/perplexity.cpp |
log_probs.resize(n_ctx * nv) widened to size_t(n_ctx) * nv to avoid 32-bit overflow on large context sizes. Standalone tool not compiled by project, no impact |
| ~b9297–b9305 | upstream CMakeLists.txt |
Top-level backward-compat shims that forwarded LLAMA_BUILD_WEBUI → LLAMA_BUILD_UI and LLAMA_USE_PREBUILT_WEBUI → LLAMA_USE_PREBUILT_UI were REMOVED (they now live only in tools/ui/CMakeLists.txt). Java impact: project's set(LLAMA_BUILD_WEBUI OFF CACHE BOOL "" FORCE) no longer hits the shim at top level. tools/ui is not configured in FetchContent mode (LLAMA_BUILD_TOOLS=OFF), so the old setting was inert in practice, but the project's CMakeLists.txt:107 was renamed to set(LLAMA_BUILD_UI OFF CACHE BOOL "" FORCE) for clarity and to defend against future flips of LLAMA_BUILD_UI default |
| ~b9297–b9305 | common/common.h |
LLAMA_UI_DEFAULT_ENABLED macro removed; common_params::ui default is now unconditionally true. Not referenced by project, no changes required |
| ~b9297–b9305 | common/fit.{h,cpp} |
common_get_device_memory_data() made non-static and exported from fit.h (was a file-local helper). fit.h now also pulls in ggml-backend.h, llama.h, and ../src/llama-ext.h. Used by upstream tools/server/server-context.cpp (compiled directly into jllama). The #include "../src/llama-ext.h" resolves relative to fit.h's location (common/../src/llama-ext.h), so no extra include paths are required. No project source changes |
| ~b9297–b9305 | tools/server/server-context.cpp |
New #include "fit.h" and a new draft/MTP memory measurement block: when params_base.fit_params is set AND the speculative config includes a draft model or COMMON_SPECULATIVE_TYPE_DRAFT_MTP, common_get_device_memory_data() is called against the draft model (or a copy of the target params with LLAMA_CONTEXT_TYPE_MTP for MTP) and the resulting per-device model + context + compute bytes are added to params_base.fit_params_target before the target context is fitted. Compiled directly into jllama from upstream; behaviour is additive and only triggers for speculative-decoding setups. ModelParameters.setFit(boolean) defaults to on, so this kicks in automatically when a user configures a draft model — no Java-side wiring required |
| ~b9297–b9305 | tools/server/server-context.cpp |
[mtmd] estimated memory usage of mmproj log line reworded to estimated worst-case memory usage; log only, no behavioural change |
| ~b9297–b9305 | tools/server/server-http.cpp |
UI serving path migrated from per-asset extern arrays (index_html, bundle_js, …) and the LLAMA_BUILD_UI macro to a runtime llama_ui_find_asset() lookup gated on the new LLAMA_UI_HAS_ASSETS macro generated by the new llama-ui-embed host tool. Project does NOT compile server-http.cpp (only server-context.cpp/server-queue.cpp/server-task.cpp/server-models.cpp), no impact |
| ~b9297–b9305 | tools/ui/ (CMakeLists.txt, new embed.cpp, new sources.cmake, new scripts/ui-assets.cmake, removed scripts/ui-download.cmake + scripts/xxd.cmake, removed ui.cpp+ui.h) |
Full UI build pipeline rewrite: xxd.cmake+ui-download.cmake replaced by a host-compiled llama-ui-embed C++ tool that generates ui.cpp/ui.h (declaring a g_assets[] table and llama_ui_find_asset() lookup, plus LLAMA_UI_HAS_ASSETS macro) from arbitrary asset files; new scripts/ui-assets.cmake orchestrates asset provisioning with a clearer priority (pre-built tools/ui/dist → npm build → HF Bucket); tools/ui is now an add_custom_target that is re-run on every build. The deprecation shims for LLAMA_BUILD_WEBUI/LLAMA_USE_PREBUILT_WEBUI/LLAMA_WEBUI_HF_BUCKET moved here from the top-level CMakeLists.txt. Project does not build the UI (LLAMA_BUILD_TOOLS=OFF in FetchContent mode), no impact |
| ~b9297–b9305 | ggml/include/ggml-alloc.h |
Comment-only API documentation update for ggml_backend_alloc_ctx_tensors_from_buft. No project changes required |
| ~b9297–b9305 | ggml/src/ggml-backend-meta.cpp |
Bug fix for zero-sized split tensor slices: set_tensor/get_tensor/set_tensor_async/get_tensor_async paths now continue when chunk_size_j == 0; ggml_backend_meta_alloc_ctx_tensors_from_buft now allocates a dummy buffer when all tensors in a context are zero-sized (was returning NULL and asserting); ggml_backend_buft_alloc_buffer result now GGML_ASSERTed non-null. Internal backend code, no project changes required |
| ~b9297–b9305 | ggml/src/ggml-hexagon/htp/hmx-flash-attn-ops.c |
hvx_vec_splat_f16(hvx_vec_get_f16(...)) round-trip replaced with hvx_vec_repl_f16(...) which stays in the vector domain via vdelta (avoids store/reload through scalar). Internal Hexagon DSP backend optimization, no project changes required |
| ~b9297–b9305 | ggml/src/ggml-opencl/ggml-opencl.cpp |
GGML_OPENCL_PROFILING batching fix: when profiling_info reaches 2048 entries the batch is now flushed into a persistent profiling_results vector (events released, durations populated) instead of accumulating until shutdown. Also fixes missing ] closing the JSON array in cl_trace.json. Profile-only code (GGML_OPENCL_PROFILING is off by default), no project changes required |
| ~b9305–b9333 | common/common.h + common/arg.cpp |
common_params::checkpoint_every_nt renamed to checkpoint_min_step; default changed 8192 → 256; CLI flag -cpent/--checkpoint-every-n-tokens REMOVED (throws std::invalid_argument at parse time) and replaced by -cms/--checkpoint-min-step; env var LLAMA_ARG_CHECKPOINT_EVERY_NT → LLAMA_ARG_CHECKPOINT_MIN_SPACING_NT. Java layer does not expose this flag, no project source changes required |
| ~b9305–b9333 | common/chat.h + common/chat.cpp |
New common_chat_msg_span and common_chat_msg_delimiter structs; new common_chat_params::message_spans field (default empty vector); new common_chat_split_by_role() function; populated for GPT-OSS, Gemma4, and all autoparser-handled templates with detected user_start/assistant_start markers; passed through server-common.cpp as message_spans JSON array in the task params; compiled from upstream, no Java changes required |
| ~b9305–b9333 | common/chat-diff-analyzer.cpp + common/chat-auto-parser.h |
New autoparser::user_start and autoparser::assistant_start fields auto-detected via differential template analysis; new patches for Nemotron Nano v2, Fireworks v2, Solar Open, Apriel 1.6; additive, compiled from upstream, no project changes required |
| ~b9305–b9333 | tools/server/server-task.h + tools/server/server-context.cpp |
New task_params::n_before_user field (default -1); server computes it from message_spans to place context checkpoints precisely at the last-user-message boundary; MTP context creation now propagates draft.cache_type_k/v; compiled directly into jllama from upstream, no project source changes required |
| ~b9305–b9333 | ggml/include/gguf.h + ggml/src/gguf.cpp |
New gguf_reader_callback_t typedef; new gguf_init_from_buffer(data, size, params) and gguf_init_from_callback(callback, userdata, max_chunk_read, max_expected_size, params) public APIs; internal gguf_init_from_reader() helper refactored to use a callback-based reader; additive, not used by project |
| ~b9305–b9333 | ggml/CMakeLists.txt |
GGML version bumped 0.12.0 → 0.13.0; no project changes required |
| ~b9305–b9333 | ggml/src/CMakeLists.txt + ggml/src/ggml-cpu/CMakeLists.txt |
OpenMP detection and target_link_libraries moved from ggml-cpu into ggml-base; exported ggml-config.cmake.in updated to add GGML_BASE_INTERFACE_LINK_LIBRARIES and guard OpenMP targets before appending; fixes static-lib consumers that link only ggml-base; no project source changes required |
| ~b9305–b9333 | ggml/src/ggml-alloc.c |
Off-by-one bug fix in ggml_dyn_tallocr_remove_block: loop ran one iteration past the last valid element; internal allocator fix, no project changes required |
| ~b9305–b9333 | ggml/src/ggml-backend-meta.cpp |
Rotating-pair compute containers: external views created between evals now use a stc_compute[2] double-buffer scheme so they don't slowly deplete stc_static memory; split_state_cache is now unbounded (comment documents it as FIXME); ggml_backend_meta_alloc_ctx_tensors_from_buft uses ggml_get_mem_size(ctx) for static container and 16× that for each compute container; internal multi-GPU meta backend refactor, no project changes required |
| ~b9305–b9333 | ggml/src/ggml-cuda/fwht.cu + fwht.cuh + ggml-cuda.cu |
New CUDA FWHT (Fast Walsh-Hadamard Transform) kernel (fwht_cuda<N>) for N = 64/128/256/512; dispatched from ggml_cuda_mul_mat when GGML_HINT_SRC0_IS_HADAMARD op hint is set on a ggml_mul_mat node (hint index 1); internal CUDA backend, no project changes required |
| ~b9305–b9333 | ggml/src/ggml-metal/ggml-metal-device.{h,m} |
New ggml_metal_device_id enum covering M1–M5 variants; device_id field added to ggml_metal_device_props, populated by new ggml_metal_device_id_parse() from the MTL device name string; additive, no project changes required |
| ~b9305–b9333 | ggml/src/ggml-quants.c |
IQ2XS and IQ3XS neighbour-search init parallelized with OpenMP (3-pass: parallel count → serial prefix-sum → parallel write); fixes a prior race on the counter under OpenMP; guarded by #ifdef GGML_USE_OPENMP; internal quantization init, no project changes required |
| ~b9305–b9333 | src/llama-arch.cpp |
LLM_TENSOR_FFN_LATENT_DOWN and LLM_TENSOR_FFN_LATENT_UP probe op changed from GGML_OP_MUL to GGML_OP_MUL_MAT; fixes Nemotron 3 Super latent projections not staying on GPU (buft probe must use MUL_MAT to keep them there); internal upstream fix, no project changes required |
| ~b9305–b9333 | vendor/cpp-httplib/httplib.{h,cpp} |
Bumped to v0.45.1: close_socket, shutdown_socket, Server::stop marked noexcept; macOS Keychain cert loading migrated from deprecated SecTrustCopyAnchorCertificates to SecTrustSettingsCopyCertificates (all three trust domains: system, admin, user); CPPHTTPLIB_USE_CERTS_FROM_MACOSX_KEYCHAIN now restricted to TARGET_OS_OSX only with compile-time #error on iOS/tvOS/watchOS; compiled automatically, no project changes required |
| ~b9305–b9333 | common/common.h |
New string_lcs(std::string_view a, std::string_view b) function (longest common substring via DP); additive, not used by project directly |
| ~b9333–b9354 | src/models/talkie.cpp (new) + src/llama-arch.h/cpp + src/llama-model.cpp + src/llama-vocab.cpp/h |
New Talkie model architecture (LLM_ARCH_TALKIE); uses NEOX rope type; embedding skip connections via out_scale; per-head Q gain via attn_q_norm; logit scale; new LLAMA_VOCAB_PRE_TYPE_MINICPM5 = 52 ("minicpm5" pre-type with ignore_merges = true); "talkie" tokenizer_pre mapped to GPT4O; Gemma4ForCausalLM registered as Gemma4 in HF conversion map; all additive, no project source changes required |
| ~b9333–b9354 | src/models/mistral3.cpp |
Dense FFN now passes ffn_up_s/ffn_gate_s/ffn_down_s instead of nullptr; MoE passes ffn_up_exps_s/ffn_gate_exps_s/ffn_down_exps_s to build_moe_ffn; bug fix for NVFP4 Mistral3/Mistral-MoE models; upstream only, no project changes required |
| ~b9333–b9354 | tools/server/server-http.h + server-http.cpp |
bool is_ssl = false field added to server_http_context; listening_address now uses https:// prefix when SSL is configured (was always http://); compiled from upstream, no project changes required |
| ~b9333–b9354 | ggml/src/ggml-sycl/ggml-sycl.cpp |
Virtual memory pool (ggml_sycl_pool_vmm) implemented when SYCL_EXT_ONEAPI_VIRTUAL_MEM is available; GGML_SYCL_ENABLE_VMM env var (default 1) controls it; DEBUG_SYCL_MALLOC compile flag for verbose allocation logging; vmm_granularity field in sycl_device_info; internal SYCL backend, no project changes required |
| ~b9333–b9354 | ggml/src/ggml-cuda/fwht.cu + fwht.cuh |
ggml_cuda_op_fwht return type changed void → bool; returns false for non-contiguous tensors or unsupported N values instead of calling GGML_ABORT; caller in ggml-cuda.cu now skips FWHT gracefully; internal CUDA backend, no project changes required |
| ~b9333–b9354 | ggml/src/ggml-vulkan/ggml-vulkan.cpp + conv2d_mm.comp |
Cooperative matrix 1 (cm1) path for conv2d; new CONV_SHAPE_64x128 tile size; aligned spec constant skips bounds checks when K/CRS/NPQ are tile-aligned; csh_store stages cm2/cm1 output through shared memory for coalesced global stores; internal Vulkan backend, no project changes required |
| ~b9333–b9354 | ggml/src/ggml-webgpu/ |
New MMVQ path for mat-vec using packed_4x8_integer_dot_product; legacy mul_mat.wgsl removed (replaced by register-tile path); new quantize_q8.wgsl and mul_mat_vec_q_acc.tmpl; vendor and dot-product capability detection at init; q8_1.m renamed to q8_1.s in WGSL struct; internal WebGPU backend, no project changes required |
| ~b9333–b9354 | upstream CI (.github/workflows/) |
CANN and SYCL builds disabled to save Actions resources; macOS builds moved to build-apple.yml; cache keys prefixed with cache-gha-; [no release] commit message token skips release pipeline; no project changes required |
| ~b9354–b9437 | common/common.h + common/arg.h + common/arg.cpp |
common_params_handle_models() return type void → bool (caller can detect skip-download misses); new common_params::skip_download; common_params::timeout_read default raised 600 → 3600. Project does not call common_params_handle_models() directly — arg parsing happens upstream; the new defaults flow through transparently |
| ~b9354–b9437 | common/download.h + common/download.cpp |
common_download_model() parameter list trimmed: download_mmproj/download_mtp moved into common_download_opts; new common_skip_download_exception; new skip_download option, under which the call returns -2 when the file is missing or its ETag does not match. Project does not include download.h directly, no source changes required |
| ~b9354–b9437 | tools/server/server-task.h + server-task.cpp |
task_params::stream default true → false; new server_task_result_cmpl_partial::is_begin bool to let HTTP layer emit SSE headers before the first delta; to_json() returns nullptr for the begin marker (sentinel meaning "HTTP-headers-only, no body"). Project always sets stream explicitly from Java (LlamaIterator.java, LlamaModel.java) so the default change is inert. The is_begin / nullable-to_json contract DOES leak into the JNI bridge — see the row below for the required fix |
| ~b9354–b9437 | tools/server/server-context.cpp + server-queue.cpp |
send_partial_response() gained is_begin parameter (defaulted); SSE stream now emits a no-content opening event when stream && !return_progress (server-context.cpp:2835) so the client sees HTTP 200 + headers before first token. server_response_reader::next() 30s warn-on-cancel diagnostic message updated. Required project source change: Java_net_ladenthin_llama_LlamaModel_receiveCompletionJson in src/main/cpp/jllama.cpp called result->to_json() once and assigned response["stop"], which silently auto-promoted the nullptr to an object {"stop": false} and surfaced a phantom empty LlamaOutput to every Java streaming caller (LlamaModelTest.testGenerateAnswer and four sibling tests overran by +1 token). Fixed by wrapping the rd->next() call in a loop that skips response.is_null() results so only real events reach Java |
| ~b9354–b9437 | common/arg.cpp (env-var renames) |
LLAMA_LOG_* → LLAMA_ARG_LOG_*, LLAMA_OFFLINE → LLAMA_ARG_OFFLINE, LLAMA_LOG_FILE → LLAMA_ARG_LOG_FILE, LLAMA_CHAT_TEMPLATE_KWARGS → LLAMA_ARG_CHAT_TEMPLATE_KWARGS. CLI verbosity values relabeled (4=trace, 5=debug). The --license CLI flag was REMOVED and moved to the new llama-app licenses subcommand. Project does not expose these env vars or the --license flag through the Java API, no changes required |
| ~b9354–b9437 | src/llama.cpp |
llama_backend_init() device-discovery rule tightened: iGPUs are now added only when no discrete GPUs were found (was: when no devices at all). RPC servers no longer count as "found" for this purpose, so iGPU + RPC setups keep the local iGPU. Behavioural only, single-line caller in jllama.cpp unchanged |
| ~b9354–b9437 | src/llama-chat.cpp |
New LLM_CHAT_TEMPLATE_GRANITE_4_1 enum value + "granite-4.1" template name; granite-4.0 detection now requires the literal token g4_default_system_message in the template, otherwise it routes to 4.1. Project does not implement chat-template detection directly — routing happens inside compiled-from-upstream code, no source changes required |
| ~b9354–b9437 | vendor/cpp-httplib/ |
Bumped to v0.46.0: adds Client::set_no_proxy(std::vector<std::string>) with full hostname-suffix and IPv4/IPv6 CIDR matching; Server::ThreadPool constructor is exception-safe (already in v0.45.0); Client::set_proxy() now disconnects the held socket immediately so a later proxy change cannot reuse the old TLS session. Compiled automatically, no project changes required |
| ~b9354–b9437 | common/arg.cpp (additive flags) |
New --spec-draft-backend-sampling / --no-spec-draft-backend-sampling (env LLAMA_ARG_SPEC_DRAFT_BACKEND_SAMPLING) and --skip-download (mapped to common_params::skip_download). Both flags have defaults that preserve current Java behaviour. Consider exposing them as ModelParameters.setSpecDraftBackendSampling(boolean) and setSkipDownload(boolean) in a follow-up; tracked under Open TODOs |
| ~b9354–b9437 | ggml/src/ggml-cuda/common.cuh |
GGML_CUDA_USE_PDL gating tightened: for MSVC, now requires CTK ≥ 12.3 (was 11.8) due to a compiler bug in the older Windows CUDA toolchains. Project's only CUDA build is Linux (dockcross, CUDA 13.2) so the MSVC gate has no CI impact; Windows CI builds CPU-only |
| ~b9437–b9442 | src/llama-vocab.{h,cpp} + src/llama-arch.{h,cpp} |
New LLAMA_VOCAB_PRE_TYPE_WHITESPACE = 53 and llm_tokenizer_whitespace_session (used by jina-v2-base-zh embeddings); new "whitespace" tokenizer_model routed as LLAMA_VOCAB_TYPE_BPE; new LLM_KV_TOKENIZER_NORMALIZER_LOWERCASE key (tokenizer.ggml.normalizer.lowercase) read into llama_vocab::impl::normalizer_lowercase; new public accessor llama_vocab::get_normalizer_lowercase(). All additive — existing tokenizers untouched; new whitespace + lowercase normalizer is consumed automatically when loading a GGUF that sets these vocabulary keys, no project source or Java API changes required |
| ~b9437–b9442 | src/llama.cpp |
llama_prepare_model_devices() iGPU collection now appends only the FIRST GGML_BACKEND_DEVICE_TYPE_IGPU device (prevents duplicate iGPU registration on multi-iGPU hosts). Behavioural fix, single-line caller in jllama.cpp unchanged, no project source changes required |
| ~b9437–b9442 | tools/ui/embed.cpp + tools/ui/src/... (Svelte) |
Web-asset embedder tightened printf format specifiers (%lu → %zu and PRIx64); UI settings split custom into customJson + customCss; runtime CSS injection via <svelte:head>. Project does not ship the upstream UI, no impact |
| ~b9437–b9442 | gguf-py/, conversion/ (Python) |
New _set_vocab_whitespace() helper and add_normalizer_lowercase() GGUF writer for the new whitespace tokenizer + lowercase normalizer keys (mirrors the vocab additions above); jina-v2 Roberta-tokenizer path now branches to whitespace when tokenizer.json declares a Whitespace pre-tokenizer. Python-side only, no impact on the Java/JNI build |
| ~b9442–b9444 | .github/workflows/build-cpu.yml (upstream CI) |
Upstream's CPU-build CI trigger paths narrowed to **/*.h, **/*.hpp, **/*.c, **/*.cpp (dropped **/*.cu, **/*.cuh, **/*.swift, **/*.m, **/*.metal, **/*.comp, **/*.glsl, **/*.wgsl) so GPU/Metal/Vulkan/WebGPU/Swift source edits no longer trigger the CPU build. Upstream-only CI plumbing; this project consumes none of upstream's workflow files and has its own publish.yml, no impact |
| ~b9442–b9444 | tools/server/server-http.cpp |
If-None-Match conditional-GET handling now also accepts the weak ETag form W/"..." (previously only a byte-for-byte match against the strong ETag was accepted); 304 Not Modified is returned for either form. This is the standalone llama-server HTTP tool, which is not linked into the JNI build (libllama + libcommon only); no project source changes required and no new Java API surface to expose |
| ~b9444–b9490 | common/common.cpp |
common_prompt_batch_decode() signature changed: new int n_new parameter added between all_tokens and n_past. Callers must pass the count of newly-decoded tokens for the batch. Only called inside upstream tools/server/server-context.cpp (compiled directly into jllama); no project source changes required — the new signature flows through transparently |
| ~b9444–b9490 | include/llama.h |
llama_set_warmup() deprecated via LLAMA_DEPRECATED macro (warmup is now handled internally during model load + first decode). Not called from jllama.cpp or any project source; absorbed inside upstream-compiled code, no project changes required. If a future jllama feature needs explicit warmup control, it should use the upstream replacement rather than the deprecated llama_set_warmup() |
| ~b9444–b9490 | include/llama.h + src/llama-context.cpp |
New llama_context_params::n_outputs_max field (default -1 = derived from n_batch). Limits the number of output slots allocated per context; useful for low-memory setups that always request logits_all=false. Not exposed by project today — consider adding ModelParameters.setMaxOutputs(int) if a user requests fine-grained control. Tracked under Open TODOs |
| ~b9444–b9490 | common/arg.cpp + common/common.cpp |
common_params_handle_models() no longer sets hf_opts.download_mmproj = true unconditionally; instead uses opts.download_mmproj = !params.no_mmproj so the new --no-mmproj flag suppresses the multimodal projector download. Not called from project source — arg parsing happens upstream, no project changes required |
| ~b9444–b9490 | common/sampling.h + common/sampling.cpp |
New common_sampler_reasoning_budget_force(common_sampler *) API that triggers the budget sampler to inject the end-of-thinking token on the next sample. Paired with new common_params_sampling::reasoning_control bool: when set, arms the budget sampler so external code (e.g. a server control endpoint) can end reasoning at runtime. Not used by project today — would pair with a future InferenceParameters.setReasoningControl(boolean) setter and a LlamaModel.endReasoning(...) helper. Tracked under Open TODOs |
| ~b9444–b9490 | common/common.h + common/arg.cpp |
New common_params::sse_ping_interval (int32, env LLAMA_ARG_SSE_PING_INTERVAL, CLI --sse-ping-interval); server emits SSE keep-alive comments at this interval. Server-only; project does not run the upstream HTTP server (uses a direct in-process API), no Java setter required |
| ~b9444–b9490 | tools/server/server-http.cpp |
New POST /v1/chat/completions/control endpoint accepting {"id": "...", "action": "reasoning_end"} — tells a streaming completion to wrap up reasoning early. Server-only; not linked into the JNI build (libllama + libcommon only), no project source changes required. If exposed in Java, would map to a new LlamaModel.endReasoning(String taskId) method that calls common_sampler_reasoning_budget_force on the slot's sampler. Tracked under Open TODOs |
| ~b9444–b9490 | src/llama-hparams.h + src/llama-model.cpp |
Internal renames: hparams::recurrent_layer_arr → hparams::is_recr_impl; hparams::swa_layers → hparams::is_swa_impl. Internal helper fields not part of the public API; not referenced by jllama.cpp or any project source, no changes required |
| ~b9444–b9490 | src/llama-arch.h + src/llama-arch.cpp + gguf-py/ |
New LLM_KV_HIDDEN_ACT GGUF key (%s.hidden_act) for ModernBert SwiGLU/GeGLU activation selection; new LLM_KV_ATTENTION_RECURRENT_LAYERS key for hybrid (recurrent + attention) models. Additive GGUF metadata keys consumed automatically when loading a GGUF that sets them; no project source or Java API changes required |
| ~b9444–b9490 | src/llama-arch.h + src/models/*.cpp (new) |
New model architectures: LLM_ARCH_MELLUM (JetBrains code-completion), LLM_ARCH_EXAONE4_5 (LG AI multimodal), LLM_ARCH_STEP3P7 (StepFun Step-3.7 with MTP support); LLM_ARCH_QWEN3NEXT/LLM_ARCH_QWEN35/LLM_ARCH_QWEN35MOE removed from llama_model_saver_supports_arch() allowlist. New tokenizer pre-types: LLAMA_VOCAB_PRE_TYPE_GRANITE_EMB_MULTI = 54, LLAMA_VOCAB_PRE_TYPE_MELLUM2 = 55. All additive at the architecture level — consumed automatically when loading a matching GGUF, no project source or Java API changes required |
| ~b9444–b9490 | common/arg.cpp |
New --mtp / --no-mtp flags (env LLAMA_ARG_MTP) now apply to Step-3.5 in addition to the existing Qwen3.5 coverage. Multi-Token Prediction is consumed inside upstream-compiled server TUs; project does not expose an MTP setter today (would map to ModelParameters.setMtp(boolean)). Tracked under Open TODOs if a user requests it |
| ~b9444–b9490 | upstream build / verification | Local build with GIT_TAG b9490 was verified clean: cmake -B build configures cleanly; cmake --build build --config Release -j$(nproc) links libjllama.so with zero warnings on jllama.cpp or any project translation unit. All breaking changes in this range are absorbed inside upstream-compiled translation units (common.cpp, arg.cpp, llama.cpp, server-*.cpp, download.cpp); no project source edits required for the version bump itself |
| ~b9490–b9495 | include/llama.h + src/llama-ext.h + src/llama-context.{h,cpp} + src/llama-cparams.h + src/llama-graph.{h,cpp} + common/speculative.{h,cpp} + src/models/{qwen35,qwen35moe,step35}.cpp |
Mass terminology rename: pre_norm → nextn everywhere the pre-final-norm hidden state is referenced. Affects the public API: llama_set_embeddings_pre_norm() → llama_set_embeddings_nextn(), llama_get_embeddings_pre_norm() → llama_get_embeddings_nextn(), llama_get_embeddings_pre_norm_ith() → llama_get_embeddings_nextn_ith(). Internal: cparams.embeddings_pre_norm → cparams.embeddings_nextn, cparams.embeddings_pre_norm_masked → cparams.embeddings_nextn_masked, llm_graph_result::t_h_pre_norm → t_h_nextn, common_speculative_need_embd_pre_norm() → common_speculative_need_embd_nextn(). Qwen3.5 / Qwen3.5-MoE / Step-3.5 model graphs moved the final norm before extracting t_h_nextn (was after extracting the pre-norm hidden state). Project does not call any of these MTP-specific APIs directly — all references stay inside upstream-compiled translation units (speculative.cpp, llama-context.cpp, server-context.cpp, model TUs). Verified by grep across src/main/cpp/*.{cpp,hpp}: zero matches for any pre_norm / nextn / embeddings_pre_norm* / t_h_pre_norm* symbol. No project source changes required |
| ~b9490–b9495 | ggml/src/ggml-cuda/common.cuh + 10 CUDA kernel files |
New GGML_CUDA_RESTRICT macro replaces __restrict__ on kernel parameter pointers. PDL (Programmatic Dependent Launch) on Hopper requires __restrict__ to be disabled per llama.cpp PR #24030; the macro expands to nothing under GGML_CUDA_USE_PDL && __CUDA_ARCH__ >= GGML_CUDA_CC_HOPPER, otherwise to __restrict__. Kernel signatures change from direct T * __restrict__ x parameters to T * x_ptr parameter + an internal T * GGML_CUDA_RESTRICT x = x_ptr; alias line; GGML_UNUSED_VARS calls in fallback branches updated to reference the _ptr names. Internal CUDA backend change; project does not compile any CUDA kernels in the JNI build (CUDA build uses upstream sources unchanged via FetchContent). No project source changes required |
| ~b9490–b9495 | src/llama-arch.{h,cpp} + src/llama-vocab.{h,cpp} + gguf-py/gguf/constants.py + gguf-py/gguf/gguf_writer.py |
New LLM_KV_TOKENIZER_SUPPRESS_TOKENS GGUF key (tokenizer.ggml.suppress_tokens). When a GGUF declares this array, the loader stores it on llama_vocab::impl::suppress_tokens and exposes it via new llama_vocab::get_suppress_tokens() accessor. The Gemma4 model graph (src/models/gemma4.cpp) reads this list and appends a -INFINITY logit bias to those token IDs at the end of the forward graph (new llm_graph_input_logits_bias class). Additive: existing models without the key produce an empty suppress_tokens vector and the bias-add branch is skipped. Mirrors a HuggingFace transformers suppress_tokens parameter; specifically used for Gemma4 Unified to prevent the model from emitting `<image |
| ~b9490–b9495 | gguf-py/gguf/constants.py + gguf-py/gguf/tensor_mapping.py + tools/mtmd/clip-impl.h + tools/mtmd/clip-model.h + tools/mtmd/clip.cpp + new tools/mtmd/models/gemma4uv.cpp + new tools/mtmd/models/gemma4ua.cpp + tools/mtmd/mtmd-audio.{h,cpp} + tools/mtmd/mtmd.cpp + conversion/__init__.py + conversion/gemma.py |
New Gemma4 Unified vision + audio variant (Gemma4UnifiedForConditionalGeneration). Adds new projector types PROJECTOR_TYPE_GEMMA4UV and PROJECTOR_TYPE_GEMMA4UA (vision uses bigger patch size with token merging done on the conv layer; audio is encoder-free, raw 16 kHz waveform chunked into 640-sample frames). New V_ENC_EMBD_PATCH_NORM tensor enum (v.patch_norm.{bid}) and 3 indexed patch_norm_{1,2,3}_{w,b} weights on clip_model (Gemma4U uses standard PyTorch LayerNorm rather than RMSNorm before/after the patch embedding). New mtmd_audio_preprocessor_gemma4ua mel-major waveform packer (40 ms / 16 kHz frames; no FFT, no filterbank). Multimodal additions are routed through upstream mtmd-cli / mtmd-debug binaries that the project does not link; the JNI build links libllama + libcommon only. Additive at the GGUF / projector loader level: existing GGUFs without these projector types continue to load through the previous code paths. No project source or Java API changes required |
| ~b9490–b9495 | tools/ui/ (package.json, src/lib/components/app/content/MarkdownContent/, new MermaidPreview.svelte, new DialogMermaidPreview.svelte, new constants / icons / rehype plugins) |
Upstream llama-server web UI gains Mermaid diagram rendering: new mermaid@^11.15 dependency, lazy-loaded; new rehype plugin chain (rehype-mermaid-pre, rehype-enhance-mermaid-blocks) converts ```mermaid code fences to <pre class="mermaid"> and wraps them with copy / preview action buttons; the existing single-file MarkdownContent.svelte is split into a .svelte + sibling .css / markdown-utils.ts / markdown-handlers.ts so the new mermaid renderer can share helpers. Project does not compile or ship the upstream tools/ui (server-only feature, classpath-only JNI build); no impact |
| ~b9490–b9495 | upstream build / verification | Local build with GIT_TAG b9495 was verified clean: cmake -B build -DBUILD_TESTING=ON configures cleanly, cmake --build build --config Release -j$(nproc) links libjllama.so + jllama_test with zero warnings on any project translation unit; ctest --test-dir build --output-on-failure reports 435/435 tests passing. All breaking changes in this range are renames within upstream-compiled translation units; no project source edits required for the version bump itself |
| ~b9495–b9543 | src/llama-hparams.{h,cpp} + every src/models/*.cpp (~150 files) |
Field hparams::n_layer (uint32_t) was split: the raw count moved to hparams::n_layer_all, and hparams::n_layer() is now a member function that returns n_layer_all - n_layer_nextn (the effective non-MTP layer count). Sibling rename: hparams::nextn_predict_layers → hparams::n_layer_nextn. Every per-model TU in src/models/*.cpp was updated to use hparams.n_layer() and hparams.n_layer_nextn. New hparams::set_recr_pattern(), mirroring set_swa_pattern(), for hybrid recurrent architectures. New per-layer hparams::deepstack_mapping_arr (LLAMA_MAX_LAYERS, default -1) populated from new GGUF key LLM_KV_DEEPSTACK_MAPPING for Granite4-Vision-style per-layer deepstack injection. hparams::kv_only_nextn was removed (MTP heads now use a layer filter callback instead). Project does not reference any of these hparams symbols directly: grep -rn "hparams\.n_layer|nextn_predict_layers|n_layer_nextn|n_layer_all|deepstack_mapping" src/main/cpp/ src/test/cpp/ returns zero matches. All consumers are inside upstream-compiled TUs (llama-model.cpp, llama-context.cpp, model TUs); no project source changes required |
| ~b9495–b9543 | include/llama.h (state-seq flags) + tools/server/server-context.cpp + examples/speculative-simple/speculative-simple.cpp |
The LLAMA_STATE_SEQ_FLAGS_ON_DEVICE flag was removed from the llama_state_seq_flags enum. All upstream call sites that passed LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY | LLAMA_STATE_SEQ_FLAGS_ON_DEVICE were updated to pass only LLAMA_STATE_SEQ_FLAGS_PARTIAL_ONLY; the on-device path is now the default for partial saves/loads. Project does not call llama_state_seq_get_* / llama_state_seq_set_* directly from jllama.cpp; the only consumer in the JNI build is upstream server-context.cpp (speculative checkpoint helpers), which was updated upstream. Verified: grep -rn "LLAMA_STATE_SEQ_FLAGS_ON_DEVICE" src/ returns zero matches. No project source changes required |
| ~b9495–b9543 | new common/imatrix-loader.{h,cpp} + refactor of tools/imatrix/imatrix.cpp + tools/quantize/quantize.cpp |
Extracted shared imatrix-loading logic into a standalone library: new common_imatrix struct (entries, datasets, chunk_count, chunk_size, is_legacy, has_metadata) and common_imatrix_load(const std::string &, common_imatrix &) reader. New GGUF metadata keys exposed as LLM_KV_IMATRIX_DATASETS, LLM_KV_IMATRIX_CHUNK_COUNT, LLM_KV_IMATRIX_CHUNK_SIZE. The imatrix and quantize CLIs were rewritten to consume this shared loader (the legacy in-file binary parser also moved into the shared loader). Build system: common/CMakeLists.txt now includes imatrix-loader.cpp and imatrix-loader.h in libcommon, which means the JNI build picks up the new TU automatically via FetchContent + the existing target_link_libraries(jllama PRIVATE common) line. Project does not use imatrix loading from Java today (no LlamaImatrix class); the new symbols ship as additive surface area only. No project source changes required |
| ~b9495–b9543 | tools/mtmd/clip.{h,cpp} + tools/mtmd/clip-impl.h + tools/mtmd/clip-model.h + tools/mtmd/mtmd.{h,cpp} + tools/mtmd/mtmd-helper.{h,cpp} + tools/mtmd/mtmd-image.cpp + every tools/mtmd/models/*.cpp |
Large MTMD subsystem refactor: (1) clip_image_u8 and clip_image_f32 switched from public POD-style nx / ny / buf fields to private members with get_size() / set_size() / get_ro_buf() / cpy_buf() / get_pixel() / set_pixel() / is_placeholder() getters/setters; every model TU and image helper was updated to the new API. (2) Several public helpers were removed from tools/mtmd/clip.h: clip_embd_nbytes, clip_embd_nbytes_by_img, clip_image_u8_get_data, clip_build_img_from_pixels, clip_get_newline_tensor, clip_encode_float_image, clip_image_f32_batch_add_mel. (3) mtmd_helper_bitmap_init_from_file() and mtmd_helper_bitmap_init_from_buf() gained a required bool placeholder parameter (when true the bitmap reserves shape only, no pixel decode — used for token counting). (4) mtmd_bitmap is now a true class (private buffer + is_placeholder() / can_batch_with()); mtmd_bitmap_init() and mtmd_bitmap_init_from_audio() accept nullptr data to create placeholder bitmaps. (5) New Granite4 Vision projector type PROJECTOR_TYPE_GRANITE4_VISION and tensor enums (V_MULTI_PROJ_*, V_QF_*) for QFormer-with-window projection. (6) Qwen-VL video / temporal-merge support: clip_graph_qwen2vl::build_inp_with_temporal_merge() plus n_batch_max=2 for batch-merged consecutive image frames. Project does not link any tools/mtmd/* TUs into the JNI build (libllama + libcommon only); the JNI vision API surfaces through mtmd-helper.h and was reviewed: zero clip_image_* / removed-helper references found across src/main/cpp/ and src/test/cpp/. No project source changes required |
| ~b9495–b9543 | tools/server/server-context.cpp + tools/server/server-http.cpp + tools/server/server.cpp (new /v1/responses/input_tokens + /v1/chat/completions/input_tokens + /v1/messages/count_tokens) |
New token-counting endpoints (Anthropic-compatible + OpenAI Responses-API-compatible). Implementation: server_routes::handle_count_tokens() consolidates the body parsing path (chat completions, responses, Anthropic messages) and emits {"input_tokens": N, "object": "response.input_tokens"}. process_mtmd_prompt() signature gained a bool is_placeholder = false parameter so token-counting can reuse the multimodal tokenization path without decoding image/audio pixels. Server-only HTTP endpoints (the JNI build links neither tools/server/server.cpp nor server-http.cpp); of the three TUs touched here, the project links only server-context.cpp, where the only project-visible change is the new process_mtmd_prompt parameter, which is defaulted, so existing project call sites compile unchanged. No project source changes required |
| ~b9495–b9543 | common/chat-peg-parser.{h,cpp} + common/chat.cpp (LFM2/2.5 unified) |
LFM2.5's chat-completion parser was merged into the single common_chat_params_init_lfm2() (was a separate _lfm2_5 function); a bool tool_list_tokens flag toggles between the two template flavours. New helper common_chat_peg_builder::python_or_json_value() and a new bool allow_json_literals parameter on python_style_tool_calls() so LFM2.5 can accept JSON-cased true / false / null alongside the Python-cased literals. Pure-Python literal normalisation in chat-peg-parser.cpp (True/False/None → JSON during streaming). Project does not call any common_chat_peg_* or common_chat_params_init_lfm2* symbols; routing happens inside upstream-compiled chat.cpp. No project source changes required |
| ~b9495–b9543 | ggml/src/ggml-cuda/mmvq.cu + ggml/src/ggml-cpu/arch/{riscv,wasm}/quants.c + ggml/src/ggml-metal/ggml-metal-device.m + ggml/src/ggml-opencl/* + ggml/src/ggml-sycl/* + ggml/src/ggml-vulkan/* + ggml/src/ggml-webgpu/* + ggml/src/ggml-cpu/kleidiai/kleidiai.cpp |
Per-backend numerical & performance work: (1) CUDA mul_mat_vec_q_moe switched to GGML_CUDA_RESTRICT aliasing + PDL launch params for Hopper. (2) RISC-V Vector quants: dispatch-by-VL refactor (vl128 / vl256 / vl512 / vl1024 separate kernels for Q2_K, Q3_K, Q4_K, Q6_K, IQ1_S, IQ1_M, IQ2_S, IQ2_XS, IQ3_S, IQ3_XXS, IQ4_XS, TQ1_0, TQ2_0). (3) WebAssembly SIMD path for Q4_1. (4) Metal residency-set keep-alive polling interval tightened to 5 ms (was 500 ms). (5) OpenCL Adreno: faster concat/cpy/get_rows packed kernels for narrow tensors (<32 cols); Q6_K mat-vec rewritten with vec4 weight gather. (6) SYCL: multi-column MMVQ paths added for all quant types (ncols=2..8) used by speculative decoding's draft verification batches; should_reorder_tensor gate widened from ne[1]==1 to ne[1]<=8. (7) Vulkan: NV cooperative-matrix2 feature detection now requires every coopmat2_features.* bit; FWHT shader gains shmem fallback (Intel Windows driver bug workaround). (8) WebGPU: flash-attention split into vector / tile / subgroup-matrix variants with K/V quantization-aware staging (U32_DEQUANT_HELPERS); GRANITE_SPEECH bumped to multi-projector. (9) KleidiAI: env vars GGML_KLEIDIAI_CHUNK_MULTIPLIER & GGML_KLEIDIAI_SME thread-cap auto-detect; SME + non-SME hybrid scheduling. All purely backend-internal; project compiles backends through FetchContent with no API surface change visible to jllama.cpp. No project source changes required |
| ~b9495–b9543 | conversion/__init__.py + conversion/granite.py + conversion/gemma.py + convert_lora_to_gguf.py + gguf-py/gguf/{constants,tensor_mapping,gguf_writer}.py |
Python-side: new Granite4VisionMmprojModel (vision-projector for Granite4 with QFormer-window deepstack + per-projector spatial offsets + image-grid pinpoints); Gemma4 unified vision/audio conversion fix-ups for newer HF checkpoints (hidden_size falls back to audio_embed_dim; model_patch_size falls back to patch_size * pooling_kernel_size). convert_lora_to_gguf.py gained --trust-remote-code. New LLM_KV_DEEPSTACK_MAPPING writer (add_deepstack_mapping) and new clip-vision keys (KEY_PROJ_SAMPLE_QUERY_SIDE, KEY_PROJ_SAMPLE_WINDOW_SIDE, KEY_PROJ_SPATIAL_OFFSETS, KEY_FEATURE_LAYERS, KEY_IMAGE_GRID_PINPOINTS) for the Granite4 vision projector. Python-side only; no impact on the Java/JNI build. No project source changes required |
| ~b9495–b9543 | upstream build / verification | Local build pending: the b9495 → b9543 bump is expected to compile cleanly given the audit above (zero grep matches in src/main/cpp/ for any of the renamed or removed symbols: hparams.n_layer, nextn_predict_layers, n_layer_nextn, n_layer_all, LLAMA_STATE_SEQ_FLAGS_ON_DEVICE, clip_image_u8/clip_image_f32 field access, clip_build_img_from_pixels, clip_get_newline_tensor, clip_image_u8_get_data, clip_embd_nbytes, clip_embd_nbytes_by_img, clip_encode_float_image, clip_image_f32_batch_add_mel, mtmd_helper_bitmap_init_from_file, mtmd_helper_bitmap_init_from_buf, common_imatrix_load). The only project-visible signature change — process_mtmd_prompt()'s new bool is_placeholder parameter — is defaulted, so existing call sites inside the project compile unchanged. All breaking changes in this range are absorbed inside upstream-compiled translation units; no project source edits required for the version bump itself |
| ~b9543–b9549 | include/llama.h + src/llama-context.{h,cpp} + src/llama-cparams.h + src/llama-ext.h |
New llama_context_params::ctx_other field (a source/target/parent llama_context *, default nullptr) used to share results or llama_memory between two contexts; mirrored by new cparams.ctx_other and the new staging API llama_get_ctx_other() (llama-ext.h). llama_get_memory() was moved earlier in llama-context.cpp and made null-safe (returns nullptr for a null ctx). llama_context_default_params() initializes ctx_other = nullptr. Project does not aggregate-init llama_context_params (it goes through llama_context_default_params() inside upstream server-context.cpp) and never includes llama-ext.h: grep -rn "llama_context_params|ctx_other|llama-ext.h|llama_get_ctx_other|llama_get_memory" src/main/cpp/ returns zero matches. No project source changes required |
| ~b9543–b9549 | src/llama-kv-cache.{h,cpp} + llama-kv-cache-iswa.{h,cpp} + llama-kv-cache-dsa.cpp + llama-memory.h + llama-memory-hybrid{,-iswa}.cpp |
KV-cache constructors gained two new parameters: llama_memory_t mem_other and layer_share_cb share (std::function<int32_t(int32_t il)> returning the source layer index to share cells from, or negative to skip). Enables one context's KV cache to share cells with another's (used by the new Gemma4-assistant MTP head). llama_memory_params gained a mem_other field. All call sites (iswa/dsa/hybrid wrappers, llama_model::create_memory) updated upstream; the project never constructs a llama_kv_cache* or llama_memory_* directly. No project source changes required |
| ~b9543–b9549 | src/llama-arch.{h,cpp} + new src/models/gemma4-assistant.cpp + src/models/models.h + src/llama-model.{h,cpp} + src/llama-hparams.{h,cpp} + src/llama-graph.{h,cpp} + gguf-py/ + conversion/gemma.py |
New model architecture LLM_ARCH_GEMMA4_ASSISTANT ("gemma4-assistant") — a NextN/MTP draft "assistant" head that shares the target Gemma4's KV cache and reads its post-final-norm hidden state. New tensors LLM_TENSOR_NEXTN_PROJ_PRE/NEXTN_PROJ_POST (nextn.pre_projection/post_projection) plus model-level nextn_proj_pre/nextn_proj_post; new hparams n_embd_inp_impl (input-embedding dim override, honoured by n_embd_inp()) and graph field n_layer_nextn. Python conversion registers Gemma4AssistantForCausalLM/Gemma4UnifiedAssistantForCausalLM. This is the headline new feature; it is a speculative-decoding / MTP mechanism, which this project tracks as deferred-by-policy (see Open TODOs / spec-draft-backend-sampling + MTP). Consumed entirely inside upstream-compiled TUs — loading a non-assistant GGUF is unaffected. No project source changes required to build; exposing MTP through the Java API remains the existing deferred TODO |
| ~b9543–b9549 | common/chat.cpp + new models/templates/LFM2.5-8B-A1B.jinja |
LFM2 chat-template handling: prior-turn reasoning_content is now copied into the template's thinking field, and <think> reasoning extraction is gated on the template source actually containing <think> (and no longer on enable_thinking). New LFM2.5-8B-A1B template + parser test consolidation. Routing happens inside upstream-compiled chat.cpp; the project calls no common_chat_params_init_lfm2* symbol. Handled automatically when such a model is loaded; no project source or Java API changes required |
| ~b9543–b9549 | common/arg.cpp + common/speculative.cpp + src/llama-graph.cpp |
common_params_handle_models() mmproj auto-download now also requires params.mmproj.path.empty() && params.mmproj.url.empty() (an explicitly-specified mmproj is no longer re-downloaded). speculative.cpp MTP path adds a shared-memory fast path (is_mem_shared = llama_get_ctx_other(ctx_dft) == ctx_tgt) that skips the catch-up decode and reuses the target position for draft tokens (Gemma4 assistant), and switched to llama_model_n_embd_out() for the MTP row width. llama-graph.cpp moved the set_input_kq_mask / can_reuse_kq_mask calls out of the k-idxs-buffer guard (iswa/hybrid-iswa mask bugfix). All inside upstream-compiled TUs; no project source changes required |
| ~b9543–b9549 | tools/server/server-context.cpp (project-linked) |
The one project-linked server TU changed: now #includes ggml-cpp.h and ../../src/llama-ext.h; sets cparams.ctx_other = ctx_tgt for MTP draft/MTP contexts; moved the ctx_dft_seq_rm_type = common_context_can_seq_rm(...) assignment to after context init (guarded by if (ctx_dft)); downgraded the spec memory-measure failure log from SRV_ERR to SRV_WRN; and gated the mtmd draft-processing block on llama_get_ctx_other(ctx_dft) != ctx_tgt. All changes are internal to the TU and the new includes resolve against the FetchContent'd src/ and ggml headers. Compiles into jllama unchanged from the project's side. No project source changes required |
| ~b9543–b9549 | .github/workflows/docker.yml (upstream CI) |
Upstream's cuda13 Docker image bumped from CUDA 13.1.1 to 13.3.0. Upstream's own CI only; this project ships its own publish.yml and pins CUDA 13.2 via .github/build_cuda_linux.sh (see CLAUDE.md "Upgrading CUDA Version"). No impact |
| ~b9543–b9549 | project CMakeLists.txt (pre-existing latent bug, fixed in this bump) |
Not an upstream change — surfaced while build-testing this bump locally. The OS/arch detection block invoked net.ladenthin.llama.OSInfo, but the class had moved to net.ladenthin.llama.loader.OSInfo in the earlier layered-package restructure, so cmake -B build failed with "Could not determine OS name" on any host that does not pass -DOS_NAME/-DOS_ARCH explicitly (CI does, which is why it went unnoticed). Fixed both execute_process invocations (--os and --arch) to the loader.OSInfo FQN. Same stale-FQN-after-restructure class as the earlier spotbugs-exclude.xml / PIT-targetClasses repairs — the standing reminder to re-validate every FQN-bearing config after a package move now also covers CMakeLists.txt |
| ~b9543–b9549 | upstream build / verification | Local build with GIT_TAG b9549 verified clean on Linux x86_64: cmake -B build -DBUILD_TESTING=ON configures cleanly (after the loader.OSInfo FQN fix above), cmake --build build --config Release -j$(nproc) links libjllama.so + jllama_test with zero warnings on any project translation unit (incl. the changed server-context.cpp), and ctest --test-dir build --output-on-failure reports 435/435 tests passing. All upstream breaking changes in this range are absorbed inside upstream-compiled translation units; no project C++ source edits were required for the version bump itself |
| ~b9549–b9553 | common/sampling.h + common/sampling.cpp + common/arg.cpp + common/common.cpp + tools/server/server-task.cpp |
common_sampler_types_from_names() dropped its bool allow_alt_names parameter — the signature is now common_sampler_types_from_names(const std::vector<std::string> & names). The body was rewritten to (a) auto-generate kebab-case (top-k) and no-dash (topk) aliases from the canonical snake_case names, plus misc aliases (nucleus→top_p, temp→temperature, typ→typical_p), and (b) lowercase the input so matching is case-insensitive; aliases are now always accepted (the old gate is gone). All three call sites were updated upstream (arg.cpp / common.cpp dropped the , true arg; server-task.cpp dropped the , false arg). Project impact: none at the source level — grep -rn common_sampler_types_from_names src/main/cpp src/test/cpp returns zero matches; the symbol is reached only through the upstream-compiled server-task.cpp linked into jllama. New behaviour exposed for free: because server-task.cpp previously passed allow_alt_names=false, the project's InferenceParameters samplers JSON array only matched canonical names like top_k; it now also accepts top-k / topk / nucleus / temp / typ and is case-insensitive (TOP_K, Min-P). Pinned by 5 new ParamsFromJsonCmpl.Samplers_* tests in test_server.cpp |
| ~b9549–b9553 | src/llama-kv-cache.cpp + src/llama-kv-cache.h + src/llama-kv-cells.h |
KV-cache shared-cells refactor (continues TAG_KV_CACHE_SHARE_CELLS, used by the Gemma4-assistant MTP head): the v_cells member changed from a by-value std::vector<llama_kv_cells> to a std::shared_ptr<llama_kv_cells_vec> v_cells_impl plus a llama_kv_cells_vec & v_cells reference, so a target cache now views the source cache's cells instead of copying them in apply_ubatch(); the constructor also clamps kv_size down to the shared source's size. New type alias using llama_kv_cells_vec = std::vector<llama_kv_cells>; in llama-kv-cells.h. All internal src/ headers the JNI build does not include (the project pulls public llama.h / llama-cpp.h, never llama-kv-cache.h / llama-kv-cells.h) — verified via grep -rn "llama_kv_cells|llama-kv-cache" src/main/cpp src/test/cpp → zero matches. No project source changes required |
| ~b9549–b9553 | conversion/mistral.py + convert_hf_to_gguf.py |
Python conversion-script robustness only: hparams["llama_4_scaling"] and "moe" in hparams replaced with hparams.get(...) / is not None guards so a present-but-null key no longer crashes conversion. Python tooling, not part of the JNI build. No impact |
| ~b9549–b9553 | upstream build / verification | Local build with GIT_TAG b9553 verified clean on Linux x86_64: cmake -B build -DBUILD_TESTING=ON configures cleanly, cmake --build build --config Release -j$(nproc) links libjllama.so + jllama_test with zero warnings on any project translation unit, and ctest --test-dir build --output-on-failure reports 440/440 tests passing (435 prior + 5 new Samplers_* tests). The sole breaking change in this range (the common_sampler_types_from_names signature) is absorbed inside upstream-compiled translation units; no project C++ source edits were required for the version bump itself |
| ~b9553–b9555 | .devops/intel.Dockerfile + ggml/src/ggml-metal/ggml-metal-device.cpp + tests/test-backend-ops.cpp |
Tiny maintenance bump — no API change and no new feature. (1) intel.Dockerfile: Intel GPU userspace driver pins bumped (IGC v2.20.5→v2.34.4, compute-runtime 25.40.35563.10→26.18.38308.1, IGDGMM 22.8.2→22.10.0) with the old multi-GPU-safe versions commented out; upstream's own Docker image only — this project ships its own publish.yml and does not consume .devops/. No impact. (2) ggml-metal-device.cpp: bugfix to the Metal im2col pipeline selector — the standard-vs-_ext kernel choice now keys off the actual conv-kernel footprint (KH*KW, with KH = is_2D ? ne01 : 1, KW = ne00) instead of the raw ne00*ne01 product, fixing kernel selection for 1-D convolutions. Backend-internal Metal TU compiled via FetchContent; no API surface visible to jllama.cpp, and only affects the macOS/Metal backend at runtime. (3) tests/test-backend-ops.cpp: one extra test_im2col case ({3000,384,1,1} / {3,384,384,1}) added — upstream test only, not linked into the JNI build. No project source changes required; no new Java-API-exposable feature. Build verification deferred to CI (publish.yml) / a developer host as usual |
| ~b9555–b9621 | ggml/include/ggml.h + ggml/src/ggml.c + ggml/src/ggml-cuda/gated_delta_net.cu + ggml/src/ggml-metal/ggml-metal.metal + ggml/src/ggml-vulkan/vulkan-shaders/gated_delta_net.comp |
ggml_gated_delta_net state tensor reshaped again: the 3D (S_v*S_v*H, K, n_seqs) layout is now the 4D [S_v, S_v, H, n_seqs] with an explicit int64_t K seventh parameter (snapshot count, K=1 is final-state-only). Signature: ggml_gated_delta_net(ctx, q, k, v, g, beta, state, K) (was 6-argument). Snapshot-slot ordering also flipped to most-recent-first. Internal Qwen3.5 / Qwen3-Next recurrent-attention kernel; project does not call ggml_gated_delta_net directly — no project source changes required |
| ~b9555–b9621 | ggml/include/ggml.h |
New ggml_col2im_1d(ctx, a, s0, oc, p0) function and GGML_OP_COL2IM_1D enum value added; GGML_OP_COUNT incremented 96 → 97. Additive; not called by project — no project source changes required |
| ~b9555–b9621 | common/fit.h + tools/server/server-context.cpp |
common_get_device_memory_data() return type changed: now returns common_device_memory_data_vec (typedef for std::vector<common_device_memory_data>). New common_device_memory_data struct carries .total, .free, .model, .context, .compute fields directly (previously the caller reached them via .mb.model etc.). fit.h also dropped its #include "ggml-backend.h" and #include "../src/llama-ext.h" lines (those types are no longer needed at the header level). Consumed exclusively in upstream-compiled server-context.cpp (field-accessor update from .mb.model → .model etc. was applied upstream); project does not include fit.h or call common_get_device_memory_data() directly — no project source changes required |
| ~b9555–b9621 | tools/mtmd/mtmd-helper.h + tools/mtmd/mtmd-helper.cpp + tools/server/server-common.cpp |
mtmd_helper_bitmap_init_from_file() and mtmd_helper_bitmap_init_from_buf() return type changed: both now return mtmd_helper_bitmap_wrapper struct (contains bitmap + video_ctx fields) instead of mtmd_bitmap*. All call sites updated in upstream server-common.cpp. Project does not call these functions from src/main/cpp/ (verified via grep: zero matches) — no project source changes required |
| ~b9555–b9621 | tools/mtmd/mtmd-helper.h + tools/mtmd/mtmd-helper.cpp |
New video pipeline: mtmd_helper_video_context, mtmd_helper_video_* API family (init/free/decode), ffmpeg-based frame extraction. New --video CLI flag in common/arg.cpp; new input_video content type in server-common.cpp. Multimodal helper additions flow through the upstream-compiled mtmd-helper.cpp and server-common.cpp; project does not reference any mtmd_helper_video_* symbol — no project source changes required. Could be exposed in a future Java API as InferenceParameters.setVideoPath(String) |
| ~b9555–b9621 | common/common.h |
New common_params fields: path_prompts_log_dir (prompt-logging output directory, string) and mtmd_batch_max_tokens (multimodal batch token limit, default 1024). Both additive with harmless defaults. Not surfaced by ModelParameters today — could be added in a future enhancement. No project source changes required |
| ~b9555–b9621 | src/llama-ext.h |
New EAGLE3 speculative-decoding support APIs: llama_set_embeddings_layer_inp(ctx, lid, value), llama_get_embeddings_layer_inp(ctx, lid), llama_model_target_layer_ids(model) → const int32_t*, llama_model_target_layer_ids_n(model) → uint32_t. New LLM_ARCH_EAGLE3 model architecture; new llama_model_eagle3 struct in upstream model sources. EAGLE3 support adds a full encoder + decoder graph implementation for speculative decoding. All consumed inside upstream-compiled speculative.cpp and model TUs; project does not reference any of these symbols, so no project source changes are required. Could be exposed later as a speculative-decoding backend type in ModelParameters |
| ~b9555–b9621 | src/llama-graph.h + src/llama-graph.cpp |
llm_graph_result::set_outputs() signature changed: now takes a const llm_graph_params & parameter (was no-parameter). New t_layer_inp vector added to llm_graph_result for layer-input embedding extraction (used by EAGLE3). Internal graph-building API; not called from project sources — no project source changes required |
| ~b9555–b9621 | src/llama-context.cpp |
llama_context now initializes embeddings_layer_inp storage for EAGLE3 layer-input extraction; n_outputs_max is forced to n_batch when llama_model_has_encoder() returns true (encoder models always need all outputs). Internal context lifecycle; no project sources reference these fields — no project source changes required |
| ~b9555–b9621 | vendor/cpp-httplib/httplib.h + httplib.cpp |
cpp-httplib bumped to v0.47.0. Compiled automatically via FetchContent — no project source changes required |
| ~b9555–b9621 | ggml/src/ggml-cuda/ggml-cuda.cu |
ggml_concat on CUDA now handles F16, BF16, I8, I16, I32, I64 element types in addition to F32; active_count tracking added to the CUDA context to prevent a memory leak from lazy cudaMemGetInfo context creation. Internal CUDA backend; no project changes required |
| ~b9555–b9621 | ggml/src/ggml-vulkan/ + Vulkan shaders |
New VK_VALVE_shader_mixed_float_dot_product extension support for F16→F32 fused dot products (dot2_f16) in flash attention and GEMM matmul. Internal Vulkan backend; no project changes required |
| ~b9555–b9621 | ggml/src/ggml-opencl/ + OpenCL kernels |
New Q5_0 and Q5_1 GEMM/GEMV noshuffle kernels for Qualcomm Adreno GPUs. Internal OpenCL backend (affects opencl-android-aarch64 classifier build only); no project source changes required |
| ~b9555–b9621 | ggml/src/ggml-cuda/ssm-scan.cu |
Added __syncthreads() before the final reduction stage to prevent shared-memory race conditions in the multi-warp SSM scan. Bug fix internal to the CUDA backend; no project changes required |
| b9621–b9637 | common/chat.cpp |
New Cohere2 MoE ("North Code") chat parser common_chat_params_init_cohere2moe + auto-detection (template containing <|START_TEXT|> and <|START_ACTION|>). Purely additive — compiled in the chat.cpp TU and reached through the existing specialized-template path, so the project's oaicompat_chat_params_parse picks it up automatically. No project source changes required. New feature: Cohere2 MoE reasoning + JSON tool-call chat support |
| b9621–b9637 | common/jinja/runtime.cpp, common/jinja/value.cpp |
Jinja chat-template engine fixes: filter aliases count→length, d→default, e→escape; negative-step slice start/stop defaults; split raises on empty separator; replace('', x) now expands between every char. Compiled into common; improves chat-template compatibility automatically. No project source changes required |
| b9621–b9637 | src/llama-arch.{h,cpp}, src/models/cohere2moe.cpp (new), src/models/models.h, src/llama-model.cpp, src/llama-model-saver.cpp, src/llama-vocab.cpp |
New LLM_ARCH_COHERE2MOE architecture (MoE + MTP/NextN) with llama_model_cohere2moe; cohere2moe tokenizer pre-type (maps to LLAMA_VOCAB_PRE_TYPE_TINY_AYA); Cohere2 dense path gains ffn_*_s NVFP4 scale tensors; tied-NVFP4-output assert relaxed to allow sidecar LM-head scales. Additive enum/struct internal to libllama; the project includes llama.h, not llama-arch.h/models.h, and switches on no arch enum. No project source changes required. New feature: loads North-Mini-Code GGUFs |
| b9621–b9637 | ggml/src/ggml-vulkan/ + shaders |
Unary shaders consolidated into one templated unary.comp; new EXPM1 Vulkan op; GLU push-constants reworked (per-dim strides + misalign offsets); fastdiv L values byte-packed to stay under the 128B push-constant limit. Internal Vulkan backend — the project builds CPU/CUDA/Metal/OpenCL only, never Vulkan. No project changes required |
| b9621–b9637 | tools/server/server-http.cpp, tools/ui/, scripts/ui-assets.cmake |
Optional gzip-compressed WebUI asset serving (LLAMA_UI_GZIP, llama_ui_use_gzip()). The project compiles server-context/queue/task/models but not server-http.cpp or tools/ui, so the HTTP/WebUI layer is absent from jllama. No project changes required |
| b9621–b9637 | tools/cli/cli.cpp, .devops/*.Dockerfile, .github/, conversion/, convert_hf_to_gguf_update.py, gguf-py/, models/templates/Cohere2MoE.jinja, docs/, tests/ |
CLI preserved-token wiring, Docker image docker.io/ prefixes, CI labeler/release tweaks, Python GGUF converters, the new model template asset, doc typos, and upstream tests. None are compiled into jllama or shipped by the project. No project changes required |
| b9637–b9642 | ggml/src/ggml-cuda/ggml-cuda.cu |
ggml_backend_cuda_device_supports_op for GGML_OP_REPEAT tightened: the supported-types check changed from a blocklist (!= I32 && != I16) to an allowlist (== F32 || == F16), because the CUDA REPEAT path only implements F32/F16 and other types asserted at runtime. Internal CUDA backend; the project switches on no op-support enum and never calls this. No project changes required |
| b9637–b9642 | ggml/src/ggml-webgpu/wgsl-shaders/mul_mat_decls.tmpl |
WebGPU matmul shared-memory dequant templates rewritten: legacy/k-quant #elif chains converted to independent #if defined(...) blocks, and the i-quant (super-block 256) IQ1/IQ2/IQ3/IQ4 paths reworked to process NQ quants per thread with vectorized store_shmem_iquants/create_iq_gw4 helpers. Internal WebGPU backend — the project builds CPU/CUDA/Metal/OpenCL only, never WebGPU. No project changes required |
| b9637–b9642 | tools/ui/, tools/ui/src/lib/utils/heic-to-jpeg.ts (new) |
WebUI gains a "render thinking as Markdown" display setting and client-side HEIC/HEIF image upload support (lazy CDN-loaded heic-to decoder → JPEG). The project compiles server-context/queue/task/models but not tools/ui, so the WebUI is absent from jllama. No project changes required |
| b9637–b9642 | convert_lora_to_gguf.py, tests/test-backend-ops.cpp |
LoRA converter now resolves the base-model architecture via get_model_architecture(hparams, ModelType.TEXT) instead of hand-reading text_config/architectures; a GGML_TYPE_BF16 test_repeat case was added to the backend-ops test. Python tooling and an upstream test — neither is compiled into jllama. No project changes required |
| b9642–b9682 | tools/mtmd/mtmd-helper.h + tools/mtmd/mtmd-helper.cpp |
mtmd_helper_decode_image_chunk gained two parameters — a post-decode callback plus its user_data — so callers can hook each decoded multimodal chunk; the standalone process_chunk helper was removed and folded into mtmd_helper_eval_chunk_single. Consumed only inside the upstream-compiled mtmd-helper.cpp / server-context.cpp; the project's hand-written C++ references no mtmd_*/process_chunk symbol (zero matches in src/main/cpp). No project source changes required. New feature: the post-decode callback enables multimodal speculative-draft decoding — exposable later as a vision + draft-model Java path |
| b9642–b9682 | common/common.cpp (build_lora_mm_id) |
The LoRA multimodal id-embedding builder gained a w_s scale-weight argument for per-adapter scaling. Internal to the upstream-compiled common library; the project never calls it. No project source changes required |
| b9642–b9682 | common/speculative.{h,cpp} |
Speculative decoding now accumulates per-draft-position acceptance statistics and adds an Eagle3 backend-sampling path (the draft model samples on the compute backend). common_speculative_* is compiled into common and reached only through the upstream server's speculative slot; the project's C++ references no speculative/draft symbol. No project source changes required. New feature: per-position draft-acceptance metrics — could surface as speculative-decoding telemetry in a future Java API |
| b9642–b9682 | tools/server/server-context.cpp |
Server slot refactored so an mtmd (multimodal) prompt can feed a speculative draft model: image/media chunks are routed through the new mtmd_helper_decode_image_chunk callback before drafting. Compiled directly into jllama (the project builds server-context/queue/task/models), but the change is internal to the slot state machine and binds no new/renamed symbol; verified that jllama.cpp and the *_helpers.hpp headers call none of the touched functions. No project source changes required |
| b9642–b9682 | ggml/src/ggml-* backends, tools/ (incl. llama-bench --offline), conda-forge packaging, docs/, .github/ |
Routine backend kernel updates and tooling/docs/CI tweaks (a new llama-bench --offline flag, conda-forge recipe notes). None are compiled into jllama beyond the already-built CPU/CUDA/Metal/OpenCL backends, and none change a symbol the project binds. No project changes required |
| b9682–b9739 | tools/server/server-schema.{h,cpp} (new) + tools/server/server-task.{h,cpp} |
Build-breaking. server_task::params_from_json_cmpl() MOVED to server_schema::eval_llama_cmpl_schema() in new server-schema.h/server-schema.cpp. Required project changes: (1) add server-schema.cpp to the target_sources(jllama ...) block in CMakeLists.txt; (2) add #include "server-schema.h" in src/main/cpp/jllama.cpp and src/test/cpp/test_server.cpp; (3) update the call sites in jllama.cpp:203 and test_server.cpp:1722 from server_task::params_from_json_cmpl(...) to server_schema::eval_llama_cmpl_schema(...) |
| b9682–b9739 | common/common.h (common_params_model) |
common_params_model::name field REMOVED; replaced by get_name() method. Not referenced in project source (model name is read from server_context_meta::model_name, populated upstream) — no project source changes required |
| b9682–b9739 | common/common.h (common_params) |
webui, webui_mcp_proxy, webui_config_json fields REMOVED (deprecated aliases; replaced by ui/ui_mcp_proxy/ui_config_json introduced in b9172). Project never references these fields directly — no project source changes required |
| b9682–b9739 | tools/server/server-models.h + server-models.cpp |
server_state enum: SERVER_STATE_LOADING_MODEL renamed to SERVER_STATE_LOADING; new SERVER_STATE_SLEEPING added. on_sleeping_changed callback replaced by set_state_callback with server_state_callback_t type. None are referenced in jllama.cpp — no project source changes required |
| b9682–b9739 | vendor/cpp-httplib/httplib.{h,cpp} |
cpp-httplib bumped from v0.47.0 to v0.48.0. Compiled automatically via FetchContent — no project source changes required |
| b9682–b9739 | common/speculative.{h,cpp} |
New common_speculative_get_state() / common_speculative_set_state() Eagle3 state-checkpointing APIs; common_prompt_checkpoint::data_spec field added to stash the Eagle3 speculative draft state. Additive; compiled into upstream common; the project does not call these functions, so no project source changes are required. New feature: Eagle3 speculative-decoding state save/restore, which could be exposed later |
| b9682–b9739 | common/download.h + common/download.cpp |
New common_download_remove() function for deleting cached model files. Additive; project does not call it — no project source changes required. New feature: could be exposed as LlamaModel.deleteCachedModel(String path) |
| b9682–b9739 | common/arg.cpp |
New --agent flag that enables all tools + MCP CORS proxy in one step. Server-level CLI flag; not referenced by ModelParameters — no project source changes required. New feature: consider ModelParameters.setAgent(boolean) |
| b9682–b9739 | common/arg.cpp + tools/server/server-http.cpp |
API key file: lines starting with # are now treated as comments and ignored. Behaviour fix for existing ModelParameters.setApiKeyFile(String) users — upgrade picks it up automatically, no source changes required |
| b9682–b9739 | ggml/src/ggml-sycl/ |
New conv2d, conv2d_dw, conv2d_transpose, conv3d SYCL ops; Q1_0 quantization support. Internal SYCL backend, no project changes required |
| b9682–b9739 | ggml/src/ggml-cuda/ |
New col2im_1d CUDA op. Internal CUDA backend, no project changes required |
| b9682–b9739 | ggml/src/ggml-metal/ |
ROPE_BACK Metal support; concat kernel extended to additional types. Internal Metal backend, no project changes required |
| b9739–b9789 | common/json-partial.{h,cpp} (removed) + common/peg-parser.{h,cpp} + common/chat.cpp |
The standalone partial-JSON parser was deleted (json-partial.h/.cpp, −363 lines) and its incremental-JSON handling folded into the PEG parser (peg-parser.cpp +194/−81). Partial JSON during streaming tool-call parsing is now produced by peg-parser instead of common_json_parse. Project never included json-partial.h — verified grep -rn "json-partial|common_json_parse" src/main/cpp src/test/cpp → zero matches. All consumers stay inside upstream-compiled chat.cpp. No project source changes required |
| b9739–b9789 | common/chat.h + common/chat.cpp |
Message-span types restructured: new enum common_chat_role (+ common_chat_role_from_string/_to_string); common_chat_msg_span::role and common_chat_msg_delimiter::role changed std::string → common_chat_role; new container structs common_chat_msg_spans / common_chat_msg_delimiters (the latter with tokenize()/split()/to_json()); common_chat_params::message_spans (vector) → message_delimiters; free function common_chat_split_by_role() removed, replaced by common_chat_msg_delimiters_parse(). common_chat_msg_diff (used by test_server.cpp) is unchanged. Project references none of the changed span/delimiter symbols — verified grep -rn "message_spans|common_chat_split_by_role|common_chat_msg_span|common_chat_msg_delimiter" src/main/cpp src/test/cpp → zero matches. Routing happens inside upstream-compiled chat.cpp / server-*.cpp. No project source changes required |
| b9739–b9789 | tools/server/server-task.h + server-context.cpp + server-common.{h,cpp} |
Context-checkpointing reworked from a precomputed offset to message spans: task_params::n_before_user (int32) removed, replaced by task_params::message_spans (common_chat_msg_spans); new server_tokens::find_message_spans(const common_chat_msg_delimiters &) helper. test_server.cpp asserts against task_params::to_json() but never references n_before_user — verified grep -rn "n_before_user|message_spans" src/test/cpp → zero matches, so it compiles and passes unchanged. Consumed inside upstream-compiled server-context.cpp linked into jllama. No project source changes required |
| b9739–b9789 | include/llama.h |
New API llama_model_n_layer_nextn(const llama_model *) — returns the number of NextN/MTP layers (additive; the surrounding accessor block was otherwise only column-realigned). Not called by project; could back a future introspection accessor. No project source changes required |
| b9739–b9789 | common/common.h |
common_params::checkpoint_min_step default raised 256 → 8192 (minimum spacing between context checkpoints). Tuning default consumed inside upstream-compiled server-context.cpp; not surfaced by ModelParameters. No project source changes required |
| b9739–b9789 | common/arg.h + common/arg.cpp + common/download.h |
common_params_handle_models() gained a 3rd parameter — a common_params_handle_models_params struct ({ common_download_callback*, bool preset_only }) for router-mode preset-only downloads; arg.h now #includes download.h; new common_download_opts::preset_only. Project does not call common_params_handle_models() directly (arg parsing happens upstream) — grep -rn common_params_handle_models src/ → zero matches. No project source changes required |
| b9739–b9789 | common/arg.cpp + common/arg.h + ~34 tools/*,examples/*,tests/* mains + tests/test-arg-parser.cpp (patch target) |
Upstream's Windows common_params_parse argv handling changed again: the unconditional argc/argv = make_utf8_argv() override (the original #24779 regression) became a count-guard if (static_cast<int>(utf8.buf.size()) == argc) { argv = utf8.ptrs.data(); }, which is exactly the variant the project already found to break its Windows server-integration tests (the embedded argv length coincides with java.exe's). patches/0001-win32-arg-parse-embed-guard.patch now carries the complete upstream fix (37 files): common_params_parse parses exactly the argv it is given; a new common_params_parse_main() wrapper holds the GetCommandLineW recovery; the ~34 standalone main() call sites flip to it; and a tests/test-arg-parser.cpp case pins the contract. The embedded JNI caller stays on common_params_parse, so the argv it supplies is parsed as given. Our subproject build compiles only the arg.{cpp,h} core (LLAMA_BUILD_TOOLS/TESTS OFF, so our 454-test suite is unchanged); the flips + test were validated via a one-off tools+tests build (the new test passes; test-arg-parser's only failure is the live ggml.ai download check, attributable to the sandbox network). The patch spans 37 files and must be refreshed on every bump |
| b9739–b9789 | tools/mtmd/mtmd.h + tools/mtmd/clip.h + clip.cpp + mtmd.cpp |
New feature — multimodal model-load progress: new mtmd_progress_callback typedef + progress_callback / progress_callback_user_data fields on mtmd_context_params and clip_context_params (additive, appended to the structs; returning false aborts the load). Project does not aggregate-init either struct (grep -rn mtmd_context_params src/ → zero matches) so the new fields are harmless; could later feed a Java LoadProgressCallback for vision models. No project source changes required |
| b9739–b9789 | tools/server/server-models.{h,cpp} + server-context.h |
Multi-model router refactor: model downloading moved into a dedicated child-process mode (enum server_child_mode, server_models::load(name, load_options), server_child::run_download(); old server_models::download() removed); SERVER_STATE_DOWNLOADING re-enabled in server_state. Project links server-models.cpp but does not drive the router (grep -rn "server_models|SERVER_CHILD_MODE" src/ → zero matches). Compiles into jllama unchanged. No project source changes required |
| b9739–b9789 | ggml/src/ggml-{hexagon,vulkan,sycl,opencl,webgpu,cuda}/ + shaders |
Backend-internal work only: Hexagon HTP matmul kernels re-tiled (hmx-matmul-ops.c → hmx-mm-kernels-tiled.h); Vulkan gains a conv3d_mm shader + get_rows_back and folds the elementwise unary shaders (clamp/cos/sin/sqrt/square/leaky_relu.comp removed) into unary.comp; SYCL element-wise / conv3d additions; OpenCL Adreno norm/gemv tweaks; WebGPU mul_mat_vec refactor. No API surface visible to jllama.cpp; the OpenCL set only affects the opencl-android-aarch64 classifier. No project source changes required |
| b9739–b9789 | common/json-schema-to-grammar.cpp (Java-test impact) |
The JSON-schema → GBNF serializer changed where it emits the space whitespace rule: a closing object is now … )? space "}" (was … )? "}" space) and a root-level string rule no longer appends a trailing space (string ::= "\"" char* "\"", was … "\"" space). Functionally equivalent (leading- vs trailing-whitespace placement) but byte-different, so the pinned expectation in LlamaModelTest.testJsonSchemaToGrammar was updated to the b9789 output. LlamaModel.jsonSchemaToGrammar is a pure JNI call (no model), so this failed on every platform's Java-test job; the new expectation was verified locally against the built b9789 libjllama. Test-data change only |
| b9739–b9789 | tools/server/server-context.cpp (patch target, regression) |
server_context::load_model now unconditionally installs the server's own load-progress reporter on params_base.load_progress_callback immediately before common_init_from_params (b9739 called common_init_from_params(params_base) with no such assignment). This clobbered libjllama's LoadProgressCallback JNI trampoline (set on common_params.load_progress_callback before load_model), so LoadProgressCallbackTest observed zero progress updates and the abort-on-false path stopped throwing. Fixed by new patches/0002-server-preserve-caller-load-progress-callback.patch, which guards the install behind if (params_base.load_progress_callback == nullptr) so a caller-supplied callback survives (standalone llama-server keeps its reporter — the field is null there). Re-verified to apply + reverse-apply cleanly against b9789 and to compile clean (ctest still 454/454) |
| b9739–b9789 | upstream build / verification | Local build with GIT_TAG b9789 verified clean on Linux x86_64 (GCC 13.3; sources were pre-staged from release tarballs + both patches hand-applied because this sandbox blocks github.com git-clone, so FetchContent's git path and PATCH_COMMAND could not run — the published CI pipeline uses the normal git FetchContent path). cmake -B build -DBUILD_TESTING=ON configures cleanly (the OuteTTS build-time extraction and the refreshed Windows patch both pass their fail-loud anchor checks against b9789), cmake --build build --config Release -j$(nproc) links libjllama.so + jllama_test with zero warnings on any project translation unit, and ctest --test-dir build --output-on-failure reports 454/454 tests passing. Every upstream breaking change in this range is absorbed inside upstream-compiled translation units, so no project C++ source edits were required — but PR CI's model-backed Java suite (which the restricted sandbox cannot run) surfaced two project-side fixes captured in the two rows above: the json-schema-to-grammar test-expectation update and the load_progress_callback server regression (patches/0002) |
| b9789–b9803 | common/arg.{cpp,h} + common/download.{cpp,h} + common/common.h (model-download refactor) |
The model-download pipeline was rewritten: common_params_handle_models() / common_params_handle_models_params / common_download_model() / common_download_model_result removed and replaced by a two-phase common_models_handler API (common_models_handler_init() builds the HF plan + opts; common_models_handler_apply() runs the parallel common_download_task list); common_download_opts::skip_download / ::preset_only and the whole common_skip_download_exception type removed; new common_download_get_hf_plan() / common_download_run_tasks() / common_download_get_all_parts(); download.h now #includes hf-cache.h. Project C++ references none of these — verified grep -rn "common_params_handle_models|common_download_model|common_skip_download|skip_download|preset_only" src/main/cpp src/test/cpp → zero matches, and no project TU includes download.h directly. All consumers (arg parsing, server-models.cpp, llama-bench.cpp) are upstream-compiled. No project C++ source changes required. Java API follow-up (behavioural): this removal exposed that the project's ModelFlag.SKIP_DOWNLOAD (--skip-download) was never a registered upstream arg — it only ever forced a parse failure that SkipDownloadFailureTranslator mapped to ModelUnavailableException, and it could never load a present model. It was replaced with the real upstream --offline flag: ModelFlag.OFFLINE + ModelParameters.setOffline(boolean); the heuristic translator was replaced by a deterministic pre-check OfflineModelGuard (throws ModelUnavailableException when --offline is set and the configured local --model file is absent, before the native call); LlamaModelSkipDownloadTest → LlamaModelOfflineTest. ModelUnavailableException is retained. Pure-Java change, no JNI rebuild |
| b9789–b9803 | common/common.h |
common_params_model gained bool empty() and get_name() became const (additive); common_params::skip_download field removed; new LLAMA_EXAMPLE_DOWNLOAD enumerator appended before LLAMA_EXAMPLE_COUNT. None surfaced by ModelParameters; consumed inside upstream-compiled TUs. No project source changes required |
| b9789–b9803 | CMakeLists.txt + tools/mtmd/CMakeLists.txt |
New top-level LLAMA_BUILD_MTMD option for standalone library-only mtmd builds; the mtmd CLI executables (llama-llava-cli, llama-gemma3-cli, llama-minicpmv-cli, llama-qwen2vl-cli, llama-mtmd-cli, llama-mtmd-debug) are now gated behind if (LLAMA_BUILD_TOOLS). The project adds tools/mtmd directly with LLAMA_BUILD_TOOLS=OFF, so after this bump those CLI executables are no longer built as collateral — beneficial (less build time); the mtmd library target the project links still builds via the if (TARGET mtmd) block above the gate. No project source changes required |
| b9789–b9803 | common/arg.cpp + docs/speculative.md |
New feature — EAGLE-3 speculative decoding (--spec-type draft-eagle3): a small one-layer draft transformer that reads the target model's hidden states for higher acceptance; plus a new standalone llama download / llama get subcommand (app/download.cpp, LLAMA_EXAMPLE_DOWNLOAD) and a --mtp download flag. Server-level CLI; not surfaced by ModelParameters/InferenceParameters. Could later feed an inference-parameter setter (--spec-type). No project source changes required |
| b9789–b9803 | ggml/src/ggml-cuda/{binbcast,cpy}.cu + ggml-opencl + src/llama-model.{cpp,h} + src/models/lfm2.cpp |
Backend/model-internal only: CUDA binbcast/cpy kernels reworked for >INT_MAX index safety (int→uint32/int64 widening + overflow guards); OpenCL flushes the profiling batch on context teardown; new LLM_TYPE_230M mapped for LFM2 (n_ff == 2560). No API surface visible to jllama.cpp; CUDA set only affects the cuda13-linux-x86-64 classifier, OpenCL only the opencl-android-aarch64 classifier. No project source changes required |
| b9789–b9803 | upstream verification (sandbox) | Both patches/0001-win32-arg-parse-embed-guard.patch (37 files) and patches/0002-server-preserve-caller-load-progress-callback.patch were re-verified to apply cleanly against b9803: git apply --check exited 0 for both patches over the actual b9803 sources fetched from raw.githubusercontent.com (github.com git-clone is blocked in this sandbox, so a full FetchContent build could not run). Patch 0001's common_params_parse target region is byte-identical to b9789; the b9803 arg.cpp churn is confined to the common_models_handler rewrite and set_examples tags, which don't overlap the patched hunks. OuteTTS generator anchors hold (upstream tts.cpp is unchanged in this range apart from patch 0001's main()-only parse flip). Full build + ctest to be confirmed by the CI pipeline |
| b9803–b9829 | tools/server/server-stream.{cpp,h} (new) + server-context.cpp + server-http.{cpp,h} + server-models.{cpp,h} + server.cpp + CMakeLists.txt |
Build-breaking. Upstream added a resumable-streaming SSE replay buffer (PR #23226): a new TU server-stream.cpp defines g_stream_sessions (a process-wide stream_session_manager), stream_session_attach_pipe(), stream_aware_should_stop(), stream_conv_id_from_headers(), and the stream_pipe_producer/stream_pipe_consumer types. The three server TUs the project already compiles into jllama — server-context.cpp, server-http.cpp, server-models.cpp — now #include "server-stream.h" and reference those symbols (server_res_generator gained a stop() override + a ~server_res_generator that calls spipe->cleanup(); server_http_res gained a std::shared_ptr<stream_pipe_producer> spipe member + virtual stop(); server-models tracks a conv_id → model map). Required project change: add ${llama.cpp_SOURCE_DIR}/tools/server/server-stream.cpp to both the target_sources(jllama ...) block and the jllama_test add_executable(...) sources in CMakeLists.txt, or the link fails with undefined references. It is platform-neutral (threads + std mutex/condvar, no subprocess.h/posix_spawn_*), so it builds on Android too and sits outside the server-models.cpp Android guard. jllama wires its own JNI routes and never calls g_stream_sessions.start_gc() (only the excluded standalone server.cpp main() does), so the GC thread stays dormant — the resumable-stream HTTP routes are not active in the embedded library. New feature: resumable SSE streams (reattach after a dropped socket via X-Conversation-Id) could later be wired into the project's Java OpenAiCompatServer. |
| b9803–b9829 | tools/server/server.cpp + tests/export-graph-ops.cpp → tests/test-export-graph-ops.cpp (rename) (patch 0001 targets) |
Patch refresh. patches/0001-win32-arg-parse-embed-guard.patch stopped applying for two reasons: (1) upstream renamed tests/export-graph-ops.cpp → tests/test-export-graph-ops.cpp (also the llama-export-graph-ops artifact text), so the patch's call-site-flip hunk targeted a now-missing path; (2) the resumable-stream PR inserted g_stream_sessions.start_gc(); right after common_init() in server.cpp, shifting the context of the common_params_parse → common_params_parse_main flip (@@ -82 → @@ -87). Both hunks were regenerated against b9829 (path + index + @@ + leading context). Patch content is otherwise unchanged; the flips remain applied-but-not-compiled here (LLAMA_BUILD_TOOLS/TESTS OFF). Patches 0002/0003/0004 apply unchanged (their target regions — server-context.cpp load-progress guard, the get_meta/get_response_reader area for the slot-prompt-similarity getter/setter, and server-common.cpp/test-chat.cpp — were untouched in this range). |
| b9803–b9829 | src/models/mamba2.cpp + src/models/mamba-base.cpp + conversion/mamba.py |
Mamba2 generalized beyond a fixed expansion factor of 2: d_in_proj now derived from ssm_dt_rank + conv_dim (was 2*d_inner + 2*n_group*d_state + n_head), the GGML_ASSERT(2*n_embd == d_inner) / d_inner % d_state == 0 asserts removed, and ssm_dt_b/ssm_a/ssm_d tensor shapes keyed on dt_rank. Model-build internals inside upstream-compiled libllama; no symbol the project binds. No project source changes required |
| b9803–b9829 | ggml/src/ggml-opencl/ (FA q4_0/q8_0 KV, +5 new kernel files) + ggml/src/ggml-cuda/{cpy,out-prod}.cu + ggml/src/ggml-vulkan/ + ggml/src/ggml-sycl/{norm,softmax}.cpp + ggml/src/ggml-openvino/ |
Backend-internal only: OpenCL gains native flash-attention over quantized (q4_0/q8_0) KV cache + flash-decoding split kernels + Adreno X2/Xe tuning (new fa_tune.h, flash_attn_pre_f16.cl, flash_attn_f32_q{4,8}_0.cl, cvt.cl/set_rows.cl SoA quant variants); CUDA adds a cudaMemcpy2DAsync fast path for strided same-type copies, batched cublasSgemmBatched out-prod, and CPU→CUDA async copies; Vulkan/SYCL/OpenVINO kernel + op-table updates (incl. GGML_GLU_OP_SWIGLU_OAI, softmax attention-sinks). No API surface visible to jllama.cpp; the OpenCL set only affects the opencl-android-aarch64 classifier, CUDA only cuda13-linux-x86-64. No project source changes required |
| b9803–b9829 | common/common.{h,cpp} + common/speculative.cpp + common/arg.{cpp,h} + tools/mtmd/clip*.{h,cpp} |
Internal upstream churn: new COM_*/SPC_* logging macros (the LOG_* calls inside common.cpp/speculative.cpp/reasoning-budget.cpp were rewrapped, and several LOG_INF calls were demoted to LOG_TRC to quiet the logs); common_models_handler gained plan_spec/plan_voc for --spec-draft-hf/--hf-repo-v downloads plus duplicate-task dedup; clip hardened its GGUF array reads (get_arr_f32, even-pinpoints / mean-std validation, n_merge defaults to 1). All consumed inside upstream-compiled common/mtmd; verified grep -rn "common_models_handler|COM_TRC|n_merge" src/main/cpp src/test/cpp → zero matches. No project source changes required |
| b9803–b9829 | upstream verification (sandbox) | All four patches (0001–0004) were re-verified to apply + reverse-apply cleanly against b9829 via git apply --check / git apply --reverse --check over the actual b9829 sources fetched from api.github.com (github.com git-clone, incl. FetchContent of nlohmann/json and llama.cpp, is blocked in this sandbox, so a full build could not run). Patch 0001 was refreshed for the test-export-graph-ops rename and the server.cpp GC-insertion context shift (see the "Patch refresh" row for this range); 0002/0003/0004 are unchanged. The server-stream.cpp link fix in CMakeLists.txt is required by the b9829 server-TU #includes (verified against the upstream diff: server-context/server-http/server-models reference symbols defined only in server-stream.cpp). Full build + ctest (target 454/454) to be confirmed by the CI pipeline |
| b9829–b9839 | common/regex-partial.{cpp,h} (removed) + common/CMakeLists.txt + tests/test-regex-partial.cpp (removed) + tests/CMakeLists.txt |
The standalone reversed-partial-regex matcher (common_regex, regex_to_reversed_partial_regex, common_regex_match/common_string_range) was deleted — partial-match handling during streaming tool-call parsing is now fully inside the PEG parser (same consolidation pattern as the b9739–b9789 json-partial removal). Project references none of these symbols — verified grep -rn "regex-partial|common_regex|regex_to_reversed|COMMON_REGEX" src/main/cpp src/test/cpp → zero matches; the deleted upstream test isn't built here (LLAMA_BUILD_TESTS OFF). No project source changes required |
| b9829–b9839 | common/common.h + common/speculative.cpp + conversion/*.py + gguf-py/ + src/llama-arch.{cpp,h} + src/llama-{context,graph,model}.cpp + src/models/dflash.cpp (new) + docs/speculative.md |
New feature — DFlash block-diffusion speculative decoding (--spec-type draft-dflash, PR #22105): a new LLM_ARCH_DFLASH arch + common_speculative_impl_draft_dflash that drafts a whole block per step and injects the target model's hidden states into the draft KV cache. Adds COMMON_SPECULATIVE_TYPE_DRAFT_DFLASH (so COMMON_SPECULATIVE_TYPE_COUNT 9→10, static_assert bumped) and a self_kq_mask && self_kq_mask->buffer guard in llm_graph_input_attn_kv::set_input for the KV-injection pass. Conversion/gguf-py changes are Python-only (not built/shipped by this repo). The project binds no common_speculative_*/arch symbol — all consumed inside upstream-compiled common/libllama. No project source changes required. Could later surface as a --spec-type inference parameter |
| b9829–b9839 | common/chat.cpp + models/templates/openbmb-MiniCPM5-1B.jinja (new) + tests/test-chat*.cpp |
New model support — MiniCPM5 chat template (common_chat_params_init_minicpm5): XML tool calls <function name="…"><param name="…">…</param></function> with CDATA-escaped string values + <think> reasoning. Detected by common_chat_try_specialized_template and handled inside the compiled-in chat.cpp, so it flows through the embedded server / LlamaModel chat path automatically. Upstream test additions aren't built here (LLAMA_BUILD_TESTS OFF). No project source changes required |
| b9829–b9839 | common/arg.cpp + common/chat.cpp + common/jinja/caps.{cpp,h} + tools/server/server-context.cpp |
New feature — --reasoning-preserve / --no-reasoning-preserve (LLAMA_ARG_REASONING_PRESERVE): preserve the reasoning trace across the full chat history (not just the last assistant message) when the template advertises the supports_preserve_reasoning capability; server-context.cpp adds an informational/warning log reconciling the flag with the loaded template's caps. Server-level CLI + capability detection, all inside upstream-compiled TUs; not surfaced by ModelParameters. Note: the b9839 server-context.cpp additions sit in load_model after the chat_params block — disjoint from the load-progress-callback guard patches/0002 targets, which still applies cleanly. No project source changes required; could later expose as a model/inference setter |
| b9829–b9839 | common/jinja/runtime.{cpp,h} + common/jinja/value.cpp + tools/ui/** + tests/test-jinja.cpp + tools/server/server-{models,stream}.cpp |
Internal/cosmetic only: Jinja gains an AST visitor + runtime::debug_dump_program (template debugging) and min/max array filters; server-models.cpp/server-stream.cpp add diagnostic warning logs on unknown-conversation stop paths (additive, compiled into jllama); the Svelte WebUI got conversation-sidebar/streaming-identity refactors. The WebUI auto-follows the pinned GIT_TAG (the build-webui CI job re-reads it and rebuilds the matching UI), so no manual step here. Project references none of the touched symbols. No project source changes required |
| b9829–b9839 | common/arg.cpp (lambda capture + --offline examples) |
Behaviour-neutral upstream churn: the common_models_handler_apply on_done lambda now captures first_path by value (dangling-reference fix) and --offline gained LLAMA_EXAMPLE_COMMON/LLAMA_EXAMPLE_DOWNLOAD set_examples tags. The project's ModelParameters.setOffline(boolean) (--offline) already exists; both changes are inside upstream-compiled arg.cpp and don't touch the patches/0001 hunks. No project source changes required |
| b9829–b9839 | upstream verification (sandbox) | All four patches (0001–0004) re-verified to apply cleanly against b9839: git apply --check exited 0 for all four when run over the actual b9839 sources fetched from raw.githubusercontent.com. A full FetchContent build could not run because github.com git-clone is blocked in this sandbox. Patch 0001's common/arg.{cpp,h} target regions and the ~34 standalone-main call sites are unchanged in this range (the b9839 arg.cpp edits are the new --reasoning-preserve opt, the --offline set_examples, and the on_done capture fix; none overlap the patched hunks); 0002's server-context.cpp load-progress guard region is untouched; 0003/0004 unchanged. OuteTTS generator anchors hold (upstream tools/tts/tts.cpp is unchanged in this range apart from patch 0001's existing main()-only parse flip). Full build + ctest (target 459/459) to be confirmed by the CI pipeline |
| b9839–b9840 | src/llama-arch.{cpp,h} + src/llama-model.{cpp,h} + src/llama-hparams.h + src/llama-graph.{cpp,h} + src/llama-kv-cache-dsv4.{cpp,h} (new) + src/models/deepseek4.cpp (new) + src/llama-kv-cache{,-iswa}.{cpp,h} + src/llama-model-loader.cpp + src/CMakeLists.txt + conversion/*.py + gguf-py/ + models/templates/deepseek-ai-DeepSeek-V4.jinja (new) |
New model support: DeepSeek-V4 (LLM_ARCH_DEEPSEEK4 / deepseek4), a brand-new arch with its own compressed KV cache (llama_kv_cache_dsv4: raw SWA + CSA/HCA/lightning-indexer compressor states), sqrtsoftplus MoE gating (LLAMA_EXPERT_GATING_FUNC_TYPE_SQRT_SOFTPLUS = 4), hyper-connection + compressor hparams/tensors, hash-routing experts, and an embedded chat template. build_moe_ffn gained an optional trailing selected_experts_in param (defaults to nullptr); llama_kv_cache_iswa gained an hparams-taking ctor overload; llama_kv_cache exposes get_layer_ids()/get_k_storage(). All of this is internal to upstream-compiled libllama: upstream's own src/CMakeLists.txt adds the new llama-kv-cache-dsv4.cpp (built via FetchContent), and the conversion/gguf-py changes are Python-only (not built/shipped by this repo). The project binds none of the new symbols: verified with grep -rnE "DEEPSEEK4|dsv4|DSV4|SQRT_SOFTPLUS|sqrtsoftplus|selected_experts_in|HYPER_CONNECTION|hash_layer" src/main/cpp src/test/cpp → zero matches. No project source changes required; a DeepSeek-V4 GGUF should load through the embedded server / LlamaModel path unchanged |
| b9839–b9840 | upstream verification (sandbox) | All four patches (0001–0004) re-verified to apply cleanly against b9840: git apply --check exited 0 for all four when run over the actual b9840 sources fetched from raw.githubusercontent.com. A full FetchContent build could not run because github.com git-clone is blocked in this sandbox. The b9839→b9840 range is entirely additive DeepSeek-V4 code and touches no patch-target file (common/arg.{cpp,h}, tools/server/server-context.{cpp,h}, server-common.cpp, test-chat.cpp, the ~34 standalone mains), so the patch hunks and offsets are byte-identical to b9839. OuteTTS generator anchors hold (upstream tools/tts/tts.cpp unchanged in this range). Full build + ctest (target 459/459) to be confirmed by the CI pipeline |