bernardladenthin
diff --git a/CLAUDE.md b/CLAUDE.md
diff --git a/docs/history/llama-cpp-breaking-changes.md b/docs/history/llama-cpp-breaking-changes.md
@@ -1076,13 +1076,13 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson"
 | File | Tests | Scope |
 |------|-------|-------|
 | `src/test/cpp/test_utils.cpp` | 156 | Upstream helpers: `server_tokens`, `server_grammar_trigger`, `gen_tool_call_id`, `json_value`, `json_get_nested_values`, UTF-8 helpers, `format_response_rerank`, `format_embeddings_response_oaicompat`, `oaicompat_completion_params_parse`, `oaicompat_chat_params_parse`, `are_lora_equal`, `strip_flag_from_argv`, `token_piece_value`, `json_is_array_and_contains_numbers`, `format_oai_sse`, `format_oai_resp_sse`, `format_anthropic_sse` |
-| `src/test/cpp/test_server.cpp` | 194 | Upstream result types: `result_timings`, `task_params::to_json()` (incl. `dry_sequence_breakers`, `preserved_tokens`, `timings_per_token`), `completion_token_output`, `server_task_result_cmpl_partial` (non-oaicompat + `to_json_oaicompat` + logprobs + `to_json_oaicompat_chat` + `to_json_anthropic` + dispatcher), `server_task_result_cmpl_final` (non-oaicompat + `to_json_oaicompat` + `to_json_oaicompat_chat` + `to_json_oaicompat_chat_stream` + `to_json_anthropic` + `to_json_anthropic_stream` + tool_calls + dispatcher), `server_task_result_embd`, `server_task_result_rerank`, `server_task_result_metrics`, `server_task_result_slot_save_load`, `server_task_result_slot_erase`, `server_task_result_apply_lora`, `server_task_result_error`, `format_error_response`, `server_task::need_sampling()`, `server_task::n_tokens()`, `server_schema::eval_llama_cmpl_schema()` (parsing pipeline + grammar routing + error paths + per-request `dry_*` field round-trips), `response_fields` projection |
+| `src/test/cpp/test_server.cpp` | 197 | Upstream result types: `result_timings`, `task_params::to_json()` (incl. `dry_sequence_breakers`, `preserved_tokens`, `timings_per_token`), `completion_token_output`, `server_task_result_cmpl_partial` (non-oaicompat + `to_json_oaicompat` + logprobs + `to_json_oaicompat_chat` + `to_json_anthropic` + dispatcher), `server_task_result_cmpl_final` (non-oaicompat + `to_json_oaicompat` + `to_json_oaicompat_chat` + `to_json_oaicompat_chat_stream` + `to_json_anthropic` + `to_json_anthropic_stream` + tool_calls + dispatcher), `server_task_result_embd`, `server_task_result_rerank`, `server_task_result_metrics`, `server_task_result_slot_save_load`, `server_task_result_slot_erase`, `server_task_result_apply_lora`, `server_task_result_error`, `format_error_response`, `server_task::need_sampling()`, `server_task::n_tokens()`, `server_schema::eval_llama_cmpl_schema()` (parsing pipeline + grammar routing + error paths + per-request `dry_*` and `sse_ping_interval` field round-trips incl. hard-limit + server-default inheritance), `response_fields` projection |
 | `src/test/cpp/test_json_helpers.cpp` | 47 | All functions in `json_helpers.hpp`: `get_result_error_message`, `results_to_json`, `rerank_results_to_json`, `parse_encoding_format`, `extract_embedding_prompt`, `is_infill_request`, `parse_slot_prompt_similarity`, `parse_positive_int_config`, `wrap_stream_chunk` |
 | `src/test/cpp/test_log_helpers.cpp` | 13 | All functions in `log_helpers.hpp`: `log_level_name`, `format_log_as_json` |
 | `src/test/cpp/test_jni_helpers.cpp` | 47 | All functions in `jni_helpers.hpp` using a zero-filled `JNINativeInterface_` mock |
 | `src/test/cpp/test_tts_wav.cpp` | 2 | The in-memory WAV writer `pcm_to_wav16_bytes` in `tts_wav.hpp` (WAV header/payload + little-endian clamping). The OuteTTS DSP it pairs with is derived from upstream `tts.cpp` and covered end-to-end by the Java `TtsIntegrationTest`, not unit-tested here. |
 
-**Current total: 459 tests (all passing).**
+**Current total: 462 tests (all passing).**
 
 #### Upstream source location (in CMake build tree)
 
 
@@ -415,5 +415,5 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
 | b9859–b9862 | `include/llama.h` + `src/llama-model-loader.cpp` + `src/llama-model.{cpp,h}` + `tools/server/server-context.{cpp,h}` + `tools/cli/cli.cpp` | **New feature (additive C API), no break.** Upstream promoted the previously-`static` `llama_model_ftype_name(llama_ftype)` (in `llama-model-loader.cpp`) to a **public** `LLAMA_API const char * llama_ftype_name(enum llama_ftype)` and added `LLAMA_API enum llama_ftype llama_model_ftype(const llama_model *)` (backed by a new `llama_model::ftype()` / `impl::ftype` cached from `ml.ftype` at `load_hparams`). `server_context::get_meta()` now fills a **new `std::string model_ftype`** field on `server_context_meta` (`server-context.h`) and `server_routes::get_model_info()` emits a `"ftype"` key — so the **NativeServer** mode's model-info/`/props` surface gains the quant type automatically (WebUI + `llama-server` clients). `cli.cpp` prints an `ftype :` line. **All inside upstream-compiled `libllama`/server TUs the project already links** — the project binds none of the new symbols (`grep` → only a *comment* mentions `server_context_meta` in `jllama.cpp`; nothing constructs it, and adding a trailing field is source-additive). No project source changes required for the bump itself. **Follow-up (done):** the quant type is now also surfaced through the Java layer — `getModelMetaJson` emits `"ftype"` (from `server_context_meta::model_ftype`), `ModelMeta.getFtype()` / `LlamaModel.getModelFtype()` expose it, and the Java `OpenAiCompatServer` advertises it as `data[].ftype` in `GET /v1/models` (threaded through `OpenAiServerConfig.modelFtype`, mirroring how `supportsVision` is threaded), matching the upstream `get_model_info()` key. |
 | b9859–b9862 | `ggml/src/ggml-cuda/gated_delta_net.{cu,cuh}` + `ggml/src/ggml-cuda/ggml-cuda.cu` + `vendor/cpp-httplib/httplib.{cpp,h}` (v0.48.0→v0.49.0) | Backend/vendor-internal only, no API surface visible to `jllama.cpp`. (1) **CUDA gated-delta-net perf**: a fused `gated_delta_net → cpy` path (`ggml_cuda_op_gated_delta_net_fused_cache` + `ggml_cuda_try_gdn_cache_fusion`) lets the kernel scatter recurrent-state snapshots straight into the rollback cache and skip the follow-up strided copy (a decode win for gated-delta / hybrid-recurrent models, e.g. Qwen3-Next); plus a `ggml_cuda_is_view_or_noop` refactor. Affects only the `cuda13-*` classifiers. (2) **cpp-httplib bumped to v0.49.0** (the vendored copy inside llama.cpp, compiled into `jllama` via `server-http.cpp`): locale-independent ASCII classifiers (`is_ascii_digit/alpha/alnum` replacing `std::isdigit`/`isalnum`), a new additive `MultipartFormDataWriter` + `is_valid_multipart_boundary`, multipart field-name/filename escaping (WHATWG), an unsigned base64 accumulator (UB fix), a `ThreadPool` `idle_timeout_sec` ctor param (defaulted — backward-compatible), a `perform_websocket_handshake` `is_ssl` arg (internal), and a `path_encode_`-gated query-normalization skip. All internal to the compiled TU; the project binds no httplib symbol directly (it uses the upstream `server-http.cpp` transport). No project source changes required. |
 | b9859–b9862 | upstream verification (sandbox) | All **six** patches (`0001`–`0006`) re-verified against b9862. The b9859→b9862 diff touches only two patch-target files — `tools/server/server-context.cpp` and `server-context.h` (the `model_ftype`/`get_meta`/`get_model_info` additions at ~L3989/~L5121 and the new struct field at ~L50). Patches **0002** (load-progress guard, ~L1152), **0003** (slot-prompt-similarity getter/setter, ~L3965 + `server_context` struct ~L106) and **0005** (near-prompt-end checkpoints, `update_slots` ~L3560) were **applied in sequence** against the actual b9862 `server-context.{cpp,h}` fetched from `raw.githubusercontent.com` — all three applied cleanly (their regions are disjoint from and far from the b9862 additions). Patches **0001** (`common/arg.{cpp,h}`, `test-arg-parser.cpp`, ~34 standalone mains), **0004** (`server-common.cpp`, `test-chat.cpp`) and **0006** (`server.cpp`) target files **not present** in the b9859→b9862 changed-file list, so their hunks are byte-identical to b9859 and apply unchanged. OuteTTS generator anchors hold (`tools/tts/tts.cpp` unchanged in this range). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline. |
-| b9862–b9864 | `tools/server/server-context.cpp` + `server-schema.cpp` + `server-task.h` + `tools/server/README.md` + `tools/ui/**` | **New feature (additive), no break.** Adds a **per-request `sse_ping_interval`** to the completion API: `task_params` gains `int32_t sse_ping_interval = 30` (`server-task.h`), `make_llama_cmpl_schema` exposes it as a `field_num` with hard limits `[-1, INT32_MAX]` and `eval_llama_cmpl_schema` seeds it from `params_base.sse_ping_interval` (`server-schema.cpp`), and `handle_completions_impl` (`server-context.cpp`, ~L4089) captures the per-task value (instead of the server-level `params.sse_ping_interval`) into the SSE `next` lambda so a request can override the server `--sse-ping-interval` (`-1` disables pings). All inside upstream-compiled server TUs the project already links; the project binds no new symbol. **NativeServer** mode gets it for free (full `llama_server`). The rest of the diff is the **Svelte WebUI** (`tools/ui/**`: MCP server recommendations dialog, a bearer-token Authorization field, migration of the MCP default-enabled key into settings config, `STREAM_VISIBILITY_KICK_MS` 1000→3000, + Vitest units) — the WebUI **auto-follows** the pinned `GIT_TAG` (the `build-webui` CI job rebuilds it), so no manual step. No project source changes required. Optional future work: expose `sse_ping_interval` on the Java `InferenceParameters` (it would flow through the OAI-compat completion path via `eval_llama_cmpl_schema`, like any other sampling field). |
+| b9862–b9864 | `tools/server/server-context.cpp` + `server-schema.cpp` + `server-task.h` + `tools/server/README.md` + `tools/ui/**` | **New feature (additive), no break.** Adds a **per-request `sse_ping_interval`** to the completion API: `task_params` gains `int32_t sse_ping_interval = 30` (`server-task.h`), `make_llama_cmpl_schema` exposes it as a `field_num` with hard limits `[-1, INT32_MAX]` and `eval_llama_cmpl_schema` seeds it from `params_base.sse_ping_interval` (`server-schema.cpp`), and `handle_completions_impl` (`server-context.cpp`, ~L4089) captures the per-task value (instead of the server-level `params.sse_ping_interval`) into the SSE `next` lambda so a request can override the server `--sse-ping-interval` (`-1` disables pings). All inside upstream-compiled server TUs the project already links; the project binds no new symbol. **NativeServer** mode gets it for free (full `llama_server`). The rest of the diff is the **Svelte WebUI** (`tools/ui/**`: MCP server recommendations dialog, a bearer-token Authorization field, migration of the MCP default-enabled key into settings config, `STREAM_VISIBILITY_KICK_MS` 1000→3000, + Vitest units) — the WebUI **auto-follows** the pinned `GIT_TAG` (the `build-webui` CI job rebuilds it), so no manual step. No project source changes required for the bump itself. **Follow-up (done):** `InferenceParameters.withSsePingInterval(int)` now emits the `sse_ping_interval` key (it flows through the OAI-compat completion path via `eval_llama_cmpl_schema`), covered by a Java wither test + three C++ schema round-trip guards (round-trip, `-1` disables, below-hard-limit throws, absent inherits the server default). The same follow-up **audited the completion schema for other already-parseable-but-unexposed fields** and added the plain-scalar wins as withers: `withXtcProbability`/`withXtcThreshold` (XTC sampler), `withNDiscard`, `withNIndent`, `withTMaxPredictMs`, `withPostSamplingProbs`, `withTimingsPerToken`, `withReturnTokens`. (`t_max_prompt_ms` was deliberately skipped — it is commented out `// TODO: implement` in b9864's `make_llama_cmpl_schema`, so it is not parseable.) Remaining schema fields left unexposed on purpose: OAI aliases already covered (`max_tokens`/`max_completion_tokens` → `n_predict`), OAI/server-internal or array-shaped/advanced knobs (`n`/`n_cmpl`, `logprobs`, `echo`, `verbose`, `include_usage`, `return_progress`, `response_fields`, `lora`, `grammar_lazy`/`grammar_triggers`/`preserved_tokens`, `chat_format`, `parse_tool_calls`, `reasoning_control`, `backend_sampling`, `adaptive_*`). |
 | b9862–b9864 | upstream verification (sandbox) | All **six** patches (`0001`–`0006`) re-verified against b9864. The b9862→b9864 diff touches exactly one patch-target file — `tools/server/server-context.cpp` — and only in `handle_completions_impl` (~L4089), far below every patched region (0002 load-progress guard ~L1152, 0005 near-prompt-end checkpoints ~L3560, 0003 slot-prompt-similarity getter/setter ~L3965). Patches **0002/0003/0005** were **applied in sequence** against the actual b9864 `server-context.{cpp,h}` fetched from `raw.githubusercontent.com` — all clean. `server-context.h` is unchanged in this range (so 0003's `.h` hunk is byte-identical); `server-schema.cpp`/`server-task.h` are **not** patch targets. Patches **0001** (`common/arg.*`, `test-arg-parser.cpp`, ~34 mains), **0004** (`server-common.cpp`, `test-chat.cpp`) and **0006** (`server.cpp`) target files **not** in the changed-file list, so they apply unchanged. Confirmed end-to-end by a clean `cmake` configure: b9864 fetched and **all six patches applied via the fail-loud `PATCH_COMMAND`** (exit 0; 0005's `is_ckpt_only_rollback` marker present), OuteTTS generator anchors held (`tools/tts/tts.cpp` unchanged). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline. |
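
The `sse_ping_interval` rules described in the b9862–b9864 row (hard limits `[-1, INT32_MAX]`, `-1` disables pings, an absent field inherits the server default, a value below the hard limit is rejected) can be sketched in isolation as below. This is a minimal illustration under stated assumptions, not the upstream `eval_llama_cmpl_schema` code: the helper names `resolve_sse_ping_interval` and `sse_pings_disabled` are hypothetical, and the exception type is an assumption, since the row only says that an out-of-range value throws.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <stdexcept>

// Hypothetical helper (not an upstream llama.cpp function).
// Resolves the effective per-request SSE ping interval:
//   - field absent               -> inherit the server default
//   - value in [-1, INT32_MAX]   -> use the request's value
//   - anything outside the range -> reject (exception type is an assumption)
inline std::int32_t resolve_sse_ping_interval(std::optional<std::int64_t> requested,
                                              std::int32_t server_default) {
    if (!requested.has_value()) {
        return server_default;
    }
    constexpr std::int64_t lo = -1;
    constexpr std::int64_t hi = std::numeric_limits<std::int32_t>::max();
    if (*requested < lo || *requested > hi) {
        throw std::invalid_argument("sse_ping_interval must be in [-1, INT32_MAX]");
    }
    return static_cast<std::int32_t>(*requested);
}

// Per the row above, -1 is the value that turns pings off.
inline bool sse_pings_disabled(std::int32_t interval) {
    return interval == -1;
}

// Mirrors the four cases the row lists for the schema guards.
inline void self_check() {
    assert(resolve_sse_ping_interval(5, 30) == 5);                  // round-trip
    assert(sse_pings_disabled(resolve_sse_ping_interval(-1, 30)));  // -1 disables
    assert(resolve_sse_ping_interval(std::nullopt, 30) == 30);      // absent inherits default

    bool threw = false;
    try {
        resolve_sse_ping_interval(-2, 30);                          // below hard limit
    } catch (const std::invalid_argument &) {
        threw = true;
    }
    assert(threw);
}
```

Calling `self_check()` from a test body exercises all four cases. The `std::optional` parameter stands in for "key present or absent in the request JSON"; the real schema evaluation works on the parsed request object and may differ in detail.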