Commit dca1b20

Expose sse_ping_interval + audited completion params on InferenceParameters
Wires the b9864 per-request sse_ping_interval into the Java API and, from a completion-schema audit, adds the other already-parseable-but-unexposed plain scalars so callers of the OAI-compat completion path can set them.

New withers (each emits a JSON key honored by eval_llama_cmpl_schema):
- withSsePingInterval(int) -> sse_ping_interval (b9864; -1 disables pings)
- withXtcProbability(float) / withXtcThreshold(float) -> XTC sampler
- withNDiscard(int) -> n_discard (context-shift discard)
- withNIndent(int) -> n_indent (infill indentation)
- withTMaxPredictMs(int) -> t_max_predict_ms (generation time budget)
- withPostSamplingProbs(boolean) -> post_sampling_probs
- withTimingsPerToken(boolean) -> timings_per_token
- withReturnTokens(boolean) -> return_tokens

Audit method: extracted every field name from b9864's make_llama_cmpl_schema and diffed against the InferenceParameters keys. t_max_prompt_ms was deliberately skipped (commented out upstream, so not parseable). The remaining unexposed fields are OAI aliases already covered (max_tokens/max_completion_tokens -> n_predict) or OAI/server-internal / array-shaped / advanced knobs (n, logprobs, echo, verbose, include_usage, return_progress, response_fields, lora, grammar_lazy/grammar_triggers/preserved_tokens, chat_format, parse_tool_calls, reasoning_control, backend_sampling, adaptive_*), left out on purpose and documented in the breaking-changes history.

Tests: +2 Java withers tests (InferenceParametersTest -> 104 pass) and +3 C++ schema round-trip guards in test_server.cpp pinning that the native parser honors sse_ping_interval (round-trip, -1 disables, below-hard-limit throws, absent inherits the server default) -> full C++ suite 462 pass (was 459). javadoc (llama module) + spotless + clang-format all clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01HL7d4uQ3cKR5HwYFPvZvv7
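
To illustrate the wither-to-JSON-key mapping described in the commit message, here is a minimal, self-contained sketch. It is not the project's `InferenceParameters` class: `WitherSketch`, `empty()` and `toJson()` are hypothetical stand-ins, and returning a new instance from each wither is an assumption about the style, not something this commit states. Only the JSON key names (`sse_ping_interval`, `t_max_predict_ms`, `return_tokens`) come from the commit message.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical stand-in for illustration only; NOT the project's InferenceParameters.
final class WitherSketch {
    // Maps a JSON key to its already-serialized JSON value, in insertion order.
    private final Map<String, String> fields = new LinkedHashMap<>();

    private WitherSketch() {}

    private WitherSketch(Map<String, String> base, String key, String rawJson) {
        fields.putAll(base);
        fields.put(key, rawJson);
    }

    static WitherSketch empty() {
        return new WitherSketch();
    }

    // Each wither returns a copy carrying one more key (assumed immutable style).
    private WitherSketch with(String key, String rawJson) {
        return new WitherSketch(fields, key, rawJson);
    }

    WitherSketch withSsePingInterval(int interval) {
        return with("sse_ping_interval", Integer.toString(interval));
    }

    WitherSketch withTMaxPredictMs(int millis) {
        return with("t_max_predict_ms", Integer.toString(millis));
    }

    WitherSketch withReturnTokens(boolean enabled) {
        return with("return_tokens", Boolean.toString(enabled));
    }

    // Serializes the collected keys as a flat JSON object.
    String toJson() {
        StringJoiner joiner = new StringJoiner(",", "{", "}");
        fields.forEach((key, value) -> joiner.add("\"" + key + "\":" + value));
        return joiner.toString();
    }

    public static void main(String[] args) {
        WitherSketch params = WitherSketch.empty()
                .withSsePingInterval(-1)   // -1 disables pings, per the commit message
                .withTMaxPredictMs(2000)
                .withReturnTokens(true);
        // prints {"sse_ping_interval":-1,"t_max_predict_ms":2000,"return_tokens":true}
        System.out.println(params.toJson());
    }
}
```

In the real API the emitted keys are consumed by `eval_llama_cmpl_schema` on the native side; the sketch stops at producing the JSON text.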
1 parent 8eb55ff commit dca1b20

5 files changed

Lines changed: 172 additions & 3 deletions


CLAUDE.md

Lines changed: 2 additions & 2 deletions
@@ -1076,13 +1076,13 @@ ctest --test-dir build --output-on-failure -R "ResultsToJson"
| File | Tests | Scope |
|------|-------|-------|
| `src/test/cpp/test_utils.cpp` | 156 | Upstream helpers: `server_tokens`, `server_grammar_trigger`, `gen_tool_call_id`, `json_value`, `json_get_nested_values`, UTF-8 helpers, `format_response_rerank`, `format_embeddings_response_oaicompat`, `oaicompat_completion_params_parse`, `oaicompat_chat_params_parse`, `are_lora_equal`, `strip_flag_from_argv`, `token_piece_value`, `json_is_array_and_contains_numbers`, `format_oai_sse`, `format_oai_resp_sse`, `format_anthropic_sse` |
1079-
| `src/test/cpp/test_server.cpp` | 194 | Upstream result types: `result_timings`, `task_params::to_json()` (incl. `dry_sequence_breakers`, `preserved_tokens`, `timings_per_token`), `completion_token_output`, `server_task_result_cmpl_partial` (non-oaicompat + `to_json_oaicompat` + logprobs + `to_json_oaicompat_chat` + `to_json_anthropic` + dispatcher), `server_task_result_cmpl_final` (non-oaicompat + `to_json_oaicompat` + `to_json_oaicompat_chat` + `to_json_oaicompat_chat_stream` + `to_json_anthropic` + `to_json_anthropic_stream` + tool_calls + dispatcher), `server_task_result_embd`, `server_task_result_rerank`, `server_task_result_metrics`, `server_task_result_slot_save_load`, `server_task_result_slot_erase`, `server_task_result_apply_lora`, `server_task_result_error`, `format_error_response`, `server_task::need_sampling()`, `server_task::n_tokens()`, `server_schema::eval_llama_cmpl_schema()` (parsing pipeline + grammar routing + error paths + per-request `dry_*` field round-trips), `response_fields` projection |
1079+
| `src/test/cpp/test_server.cpp` | 197 | Upstream result types: `result_timings`, `task_params::to_json()` (incl. `dry_sequence_breakers`, `preserved_tokens`, `timings_per_token`), `completion_token_output`, `server_task_result_cmpl_partial` (non-oaicompat + `to_json_oaicompat` + logprobs + `to_json_oaicompat_chat` + `to_json_anthropic` + dispatcher), `server_task_result_cmpl_final` (non-oaicompat + `to_json_oaicompat` + `to_json_oaicompat_chat` + `to_json_oaicompat_chat_stream` + `to_json_anthropic` + `to_json_anthropic_stream` + tool_calls + dispatcher), `server_task_result_embd`, `server_task_result_rerank`, `server_task_result_metrics`, `server_task_result_slot_save_load`, `server_task_result_slot_erase`, `server_task_result_apply_lora`, `server_task_result_error`, `format_error_response`, `server_task::need_sampling()`, `server_task::n_tokens()`, `server_schema::eval_llama_cmpl_schema()` (parsing pipeline + grammar routing + error paths + per-request `dry_*` and `sse_ping_interval` field round-trips incl. hard-limit + server-default inheritance), `response_fields` projection |
| `src/test/cpp/test_json_helpers.cpp` | 47 | All functions in `json_helpers.hpp`: `get_result_error_message`, `results_to_json`, `rerank_results_to_json`, `parse_encoding_format`, `extract_embedding_prompt`, `is_infill_request`, `parse_slot_prompt_similarity`, `parse_positive_int_config`, `wrap_stream_chunk` |
| `src/test/cpp/test_log_helpers.cpp` | 13 | All functions in `log_helpers.hpp`: `log_level_name`, `format_log_as_json` |
| `src/test/cpp/test_jni_helpers.cpp` | 47 | All functions in `jni_helpers.hpp` using a zero-filled `JNINativeInterface_` mock |
| `src/test/cpp/test_tts_wav.cpp` | 2 | The in-memory WAV writer `pcm_to_wav16_bytes` in `tts_wav.hpp` (WAV header/payload + little-endian clamping). The OuteTTS DSP it pairs with is derived from upstream `tts.cpp` and covered end-to-end by the Java `TtsIntegrationTest`, not unit-tested here. |

1085-
**Current total: 459 tests (all passing).**
1085+
**Current total: 462 tests (all passing).**

#### Upstream source location (in CMake build tree)

docs/history/llama-cpp-breaking-changes.md

Lines changed: 1 addition & 1 deletion
@@ -415,5 +415,5 @@ Used during `llama.cpp` version bumps: when upgrading, scan this file from the r
| b9859–b9862 | `include/llama.h` + `src/llama-model-loader.cpp` + `src/llama-model.{cpp,h}` + `tools/server/server-context.{cpp,h}` + `tools/cli/cli.cpp` | **New feature (additive C API), no break.** Upstream promoted the previously-`static` `llama_model_ftype_name(llama_ftype)` (in `llama-model-loader.cpp`) to a **public** `LLAMA_API const char * llama_ftype_name(enum llama_ftype)` and added `LLAMA_API enum llama_ftype llama_model_ftype(const llama_model *)` (backed by a new `llama_model::ftype()` / `impl::ftype` cached from `ml.ftype` at `load_hparams`). `server_context::get_meta()` now fills a **new `std::string model_ftype`** field on `server_context_meta` (`server-context.h`) and `server_routes::get_model_info()` emits a `"ftype"` key — so the **NativeServer** mode's model-info/`/props` surface gains the quant type automatically (WebUI + `llama-server` clients). `cli.cpp` prints an `ftype :` line. **All inside upstream-compiled `libllama`/server TUs the project already links** — the project binds none of the new symbols (`grep` → only a *comment* mentions `server_context_meta` in `jllama.cpp`; nothing constructs it, and adding a trailing field is source-additive). No project source changes required for the bump itself. **Follow-up (done):** the quant type is now also surfaced through the Java layer — `getModelMetaJson` emits `"ftype"` (from `server_context_meta::model_ftype`), `ModelMeta.getFtype()` / `LlamaModel.getModelFtype()` expose it, and the Java `OpenAiCompatServer` advertises it as `data[].ftype` in `GET /v1/models` (threaded through `OpenAiServerConfig.modelFtype`, mirroring how `supportsVision` is threaded), matching the upstream `get_model_info()` key. |
| b9859–b9862 | `ggml/src/ggml-cuda/gated_delta_net.{cu,cuh}` + `ggml/src/ggml-cuda/ggml-cuda.cu` + `vendor/cpp-httplib/httplib.{cpp,h}` (v0.48.0→v0.49.0) | Backend/vendor-internal only, no API surface visible to `jllama.cpp`. (1) **CUDA gated-delta-net perf**: a fused `gated_delta_net → cpy` path (`ggml_cuda_op_gated_delta_net_fused_cache` + `ggml_cuda_try_gdn_cache_fusion`) lets the kernel scatter recurrent-state snapshots straight into the rollback cache and skip the follow-up strided copy (a decode win for gated-delta / hybrid-recurrent models, e.g. Qwen3-Next); plus a `ggml_cuda_is_view_or_noop` refactor. Affects only the `cuda13-*` classifiers. (2) **cpp-httplib bumped to v0.49.0** (the vendored copy inside llama.cpp, compiled into `jllama` via `server-http.cpp`): locale-independent ASCII classifiers (`is_ascii_digit/alpha/alnum` replacing `std::isdigit`/`isalnum`), a new additive `MultipartFormDataWriter` + `is_valid_multipart_boundary`, multipart field-name/filename escaping (WHATWG), an unsigned base64 accumulator (UB fix), a `ThreadPool` `idle_timeout_sec` ctor param (defaulted — backward-compatible), a `perform_websocket_handshake` `is_ssl` arg (internal), and a `path_encode_`-gated query-normalization skip. All internal to the compiled TU; the project binds no httplib symbol directly (it uses the upstream `server-http.cpp` transport). No project source changes required. |
| b9859–b9862 | upstream verification (sandbox) | All **six** patches (`0001`–`0006`) re-verified against b9862. The b9859→b9862 diff touches only two patch-target files — `tools/server/server-context.cpp` and `server-context.h` (the `model_ftype`/`get_meta`/`get_model_info` additions at ~L3989/~L5121 and the new struct field at ~L50). Patches **0002** (load-progress guard, ~L1152), **0003** (slot-prompt-similarity getter/setter, ~L3965 + `server_context` struct ~L106) and **0005** (near-prompt-end checkpoints, `update_slots` ~L3560) were **applied in sequence** against the actual b9862 `server-context.{cpp,h}` fetched from `raw.githubusercontent.com` — all three applied cleanly (their regions are disjoint from and far from the b9862 additions). Patches **0001** (`common/arg.{cpp,h}`, `test-arg-parser.cpp`, ~34 standalone mains), **0004** (`server-common.cpp`, `test-chat.cpp`) and **0006** (`server.cpp`) target files **not present** in the b9859→b9862 changed-file list, so their hunks are byte-identical to b9859 and apply unchanged. OuteTTS generator anchors hold (`tools/tts/tts.cpp` unchanged in this range). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline. |
418-
| b9862–b9864 | `tools/server/server-context.cpp` + `server-schema.cpp` + `server-task.h` + `tools/server/README.md` + `tools/ui/**` | **New feature (additive), no break.** Adds a **per-request `sse_ping_interval`** to the completion API: `task_params` gains `int32_t sse_ping_interval = 30` (`server-task.h`), `make_llama_cmpl_schema` exposes it as a `field_num` with hard limits `[-1, INT32_MAX]` and `eval_llama_cmpl_schema` seeds it from `params_base.sse_ping_interval` (`server-schema.cpp`), and `handle_completions_impl` (`server-context.cpp`, ~L4089) captures the per-task value (instead of the server-level `params.sse_ping_interval`) into the SSE `next` lambda so a request can override the server `--sse-ping-interval` (`-1` disables pings). All inside upstream-compiled server TUs the project already links; the project binds no new symbol. **NativeServer** mode gets it for free (full `llama_server`). The rest of the diff is the **Svelte WebUI** (`tools/ui/**`: MCP server recommendations dialog, a bearer-token Authorization field, migration of the MCP default-enabled key into settings config, `STREAM_VISIBILITY_KICK_MS` 1000→3000, + Vitest units) — the WebUI **auto-follows** the pinned `GIT_TAG` (the `build-webui` CI job rebuilds it), so no manual step. No project source changes required. Optional future work: expose `sse_ping_interval` on the Java `InferenceParameters` (it would flow through the OAI-compat completion path via `eval_llama_cmpl_schema`, like any other sampling field). |
418+
| b9862–b9864 | `tools/server/server-context.cpp` + `server-schema.cpp` + `server-task.h` + `tools/server/README.md` + `tools/ui/**` | **New feature (additive), no break.** Adds a **per-request `sse_ping_interval`** to the completion API: `task_params` gains `int32_t sse_ping_interval = 30` (`server-task.h`), `make_llama_cmpl_schema` exposes it as a `field_num` with hard limits `[-1, INT32_MAX]` and `eval_llama_cmpl_schema` seeds it from `params_base.sse_ping_interval` (`server-schema.cpp`), and `handle_completions_impl` (`server-context.cpp`, ~L4089) captures the per-task value (instead of the server-level `params.sse_ping_interval`) into the SSE `next` lambda so a request can override the server `--sse-ping-interval` (`-1` disables pings). All inside upstream-compiled server TUs the project already links; the project binds no new symbol. **NativeServer** mode gets it for free (full `llama_server`). The rest of the diff is the **Svelte WebUI** (`tools/ui/**`: MCP server recommendations dialog, a bearer-token Authorization field, migration of the MCP default-enabled key into settings config, `STREAM_VISIBILITY_KICK_MS` 1000→3000, + Vitest units) — the WebUI **auto-follows** the pinned `GIT_TAG` (the `build-webui` CI job rebuilds it), so no manual step. No project source changes required for the bump itself. **Follow-up (done):** `InferenceParameters.withSsePingInterval(int)` now emits the `sse_ping_interval` key (it flows through the OAI-compat completion path via `eval_llama_cmpl_schema`), covered by a Java wither test + three C++ schema round-trip guards (round-trip, `-1` disables, below-hard-limit throws, absent inherits the server default). The same follow-up **audited the completion schema for other already-parseable-but-unexposed fields** and added the plain-scalar wins as withers: `withXtcProbability`/`withXtcThreshold` (XTC sampler), `withNDiscard`, `withNIndent`, `withTMaxPredictMs`, `withPostSamplingProbs`, `withTimingsPerToken`, `withReturnTokens`. (`t_max_prompt_ms` was deliberately skipped — it is commented out `// TODO: implement` in b9864's `make_llama_cmpl_schema`, so it is not parseable.) Remaining schema fields left unexposed on purpose: OAI aliases already covered (`max_tokens`/`max_completion_tokens` → `n_predict`), OAI/server-internal or array-shaped/advanced knobs (`n`/`n_cmpl`, `logprobs`, `echo`, `verbose`, `include_usage`, `return_progress`, `response_fields`, `lora`, `grammar_lazy`/`grammar_triggers`/`preserved_tokens`, `chat_format`, `parse_tool_calls`, `reasoning_control`, `backend_sampling`, `adaptive_*`). |
| b9862–b9864 | upstream verification (sandbox) | All **six** patches (`0001`–`0006`) re-verified against b9864. The b9862→b9864 diff touches exactly one patch-target file — `tools/server/server-context.cpp` — and only in `handle_completions_impl` (~L4089), far below every patched region (0002 load-progress guard ~L1152, 0005 near-prompt-end checkpoints ~L3560, 0003 slot-prompt-similarity getter/setter ~L3965). Patches **0002/0003/0005** were **applied in sequence** against the actual b9864 `server-context.{cpp,h}` fetched from `raw.githubusercontent.com` — all clean. `server-context.h` is unchanged in this range (so 0003's `.h` hunk is byte-identical); `server-schema.cpp`/`server-task.h` are **not** patch targets. Patches **0001** (`common/arg.*`, `test-arg-parser.cpp`, ~34 mains), **0004** (`server-common.cpp`, `test-chat.cpp`) and **0006** (`server.cpp`) target files **not** in the changed-file list, so they apply unchanged. Confirmed end-to-end by a clean `cmake` configure: b9864 fetched and **all six patches applied via the fail-loud `PATCH_COMMAND`** (exit 0; 0005's `is_ckpt_only_rollback` marker present), OuteTTS generator anchors held (`tools/tts/tts.cpp` unchanged). Full build + `ctest` (target 459/459) to be confirmed by the CI pipeline. |
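
The `sse_ping_interval` behaviours pinned by the schema guards in the b9862–b9864 row (an absent field inherits the server default, `-1` disables pings, a value below the hard limit throws) can be modelled in a few lines. This is a rough Java sketch of those rules, not a port of upstream `eval_llama_cmpl_schema`: the class name, the `resolve` and `pingsEnabled` helpers, and the choice of `IllegalArgumentException` are all assumptions made for illustration. The lower bound `-1` and the default `30` come from the row above.

```java
import java.util.OptionalInt;

// Hypothetical model of the documented rules; NOT the upstream C++ parser.
final class SsePingIntervalSketch {
    // Lower hard limit from the changelog row: [-1, INT32_MAX].
    static final int HARD_MIN = -1;

    // The only value the source documents as disabling pings.
    static final int DISABLED = -1;

    // Returns the effective interval for one request.
    static int resolve(OptionalInt requested, int serverDefault) {
        if (!requested.isPresent()) {
            return serverDefault; // absent field inherits the server default
        }
        int value = requested.getAsInt();
        if (value < HARD_MIN) {
            // Exception type is an assumption; the source only says it "throws".
            throw new IllegalArgumentException(
                    "sse_ping_interval below hard limit: " + value);
        }
        // The upper bound INT32_MAX equals Integer.MAX_VALUE, so no check is needed.
        return value;
    }

    // Treats every value other than -1 as enabled (an assumption for other values).
    static boolean pingsEnabled(int resolved) {
        return resolved != DISABLED;
    }

    public static void main(String[] args) {
        int serverDefault = 30; // matches the task_params default quoted above
        System.out.println(resolve(OptionalInt.empty(), serverDefault)); // prints 30
        System.out.println(resolve(OptionalInt.of(-1), serverDefault));  // prints -1
        System.out.println(pingsEnabled(-1));                            // prints false
    }
}
```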
