Add per-request DRY sampling to InferenceParameters; carry upstream PRs #22393 and #23116 as patches

claude · claude · commit 381a3fc84e3f · 2026-06-28T08:53:11.000Z
InferenceParameters gains five immutable withers mirroring the existing withMinP/withTopNSigma (scalars) and withStopStrings (string array) style:

  withDryMultiplier(float)           -> "dry_multiplier"
  withDryBase(float)                 -> "dry_base"
  withDryAllowedLength(int)          -> "dry_allowed_length"
  withDryPenaltyLastN(int)           -> "dry_penalty_last_n" (rejects < -1)
  withDrySequenceBreakers(String...) -> "dry_sequence_breakers" (omitted when unset)

This exposes DRY per request, uniformly with the other samplers, instead of only at model/launch level (ModelParameters --dry-*). Defaults are unchanged: if no wither is called, nothing is emitted and DRY stays disabled.

Adds 12 unit tests covering field/value serialization, the JSON string array, the no-op-when-empty contract, penalty-last-n validation, and immutable-instance semantics (InferenceParametersTest: 90 -> 102 tests).

Also carries two still-open upstream llama.cpp PRs as local patches (named after the PR number), refreshed against the pinned b9803 source and verified to apply cleanly + reverse-check idempotently:

  patches/0003-pr22393-... server_context slot_prompt_similarity get/set
  patches/0004-pr23116-... per-request reasoning_budget_tokens override (incl. upstream test-chat.cpp additions, verbatim)

Updates CLAUDE.md patches table and CHANGELOG.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY
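
The wither contract described in this message (copy-on-write instances, a no-op on empty breakers, rejection of dry_penalty_last_n below -1) can be illustrated with a minimal, self-contained sketch. `DryRequestSketch` is a hypothetical stand-in written for this note, not the real `InferenceParameters`; its hand-rolled JSON output is an assumption and may differ from what the library's serializer produces.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical stand-in for InferenceParameters; illustrates the wither contract only. */
final class DryRequestSketch {
    static final DryRequestSketch EMPTY = new DryRequestSketch(Collections.emptyMap());

    // JSON key -> already-serialized JSON value, in insertion order.
    private final Map<String, String> fields;

    private DryRequestSketch(Map<String, String> fields) {
        this.fields = fields;
    }

    // Copy-on-write: every wither returns a new instance and leaves this one untouched.
    private DryRequestSketch with(String key, String jsonValue) {
        Map<String, String> copy = new LinkedHashMap<>(fields);
        copy.put(key, jsonValue);
        return new DryRequestSketch(Collections.unmodifiableMap(copy));
    }

    DryRequestSketch withDryMultiplier(float value) {
        return with("dry_multiplier", Float.toString(value));
    }

    DryRequestSketch withDryPenaltyLastN(int value) {
        if (value < -1) {
            throw new IllegalArgumentException("dry_penalty_last_n must be >= -1, got " + value);
        }
        return with("dry_penalty_last_n", Integer.toString(value));
    }

    DryRequestSketch withDrySequenceBreakers(String... breakers) {
        if (breakers.length == 0) {
            return this; // no-op: the key stays absent, so the server default applies
        }
        StringJoiner array = new StringJoiner(",", "[", "]");
        for (String breaker : breakers) {
            array.add(quote(breaker));
        }
        return with("dry_sequence_breakers", array.toString());
    }

    // Minimal JSON string escaping, sufficient for this sketch only.
    private static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                default:   sb.append(c);
            }
        }
        return sb.append('"').toString();
    }

    String toJson() {
        StringJoiner object = new StringJoiner(",", "{", "}");
        fields.forEach((key, value) -> object.add("\"" + key + "\":" + value));
        return object.toString();
    }

    public static void main(String[] args) {
        DryRequestSketch dry = EMPTY.withDryMultiplier(0.8f).withDrySequenceBreakers("\n", ":");
        System.out.println(EMPTY.toJson()); // untouched base instance
        System.out.println(dry.toJson());
    }
}
```

Because unset keys are simply absent from the request, a caller that never touches the DRY withers sends the same JSON as before this commit.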
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -18,6 +18,7 @@ from version 5.0.0 onward. Pre-fork releases (`1.x`–`4.2.0`) were authored by
 - End-to-end vision input across blocking, typed `ChatRequest`, streaming, and OpenAI-compatible request mapping; real-model tests verify that distinct red and blue images produce the correct semantic answers.
 - Explicit `setMmprojAuto(boolean)` and `setMmprojOffload(boolean)` controls, including the upstream `--no-mmproj-auto` and `--no-mmproj-offload` flags.
 - Per-request KV controls: `InferenceParameters.withSlotId(int)` and `withCacheReuse(int)`.
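+- Per-request DRY sampling controls on `InferenceParameters`: `withDryMultiplier(float)`, `withDryBase(float)`, `withDryAllowedLength(int)`, `withDryPenaltyLastN(int)`, and `withDrySequenceBreakers(String...)` (JSON keys `dry_multiplier`/`dry_base`/`dry_allowed_length`/`dry_penalty_last_n`/`dry_sequence_breakers`).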
 - Typed cache observability through `Usage.getCachedTokens()`, `Usage.getProcessedPromptTokens()`, `SlotMetrics`, and `ServerMetrics.getSlotMetrics()`.
 - Authenticated JSON `GET /metrics` and `GET /slots` endpoints on the embedded server.
 
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -384,6 +384,8 @@ Current patches:
 |-------|-------|
 | `0001-win32-arg-parse-embed-guard.patch` | Windows JNI regression from llama.cpp **#24779** (introduced b9739): on Windows `common_params_parse` re-derived argv from the **process** command line (`GetCommandLineW`) and adopted it, so an embedded/JNI caller (`java.exe`) lost its `--model …` args → "Failed to parse model parameters". b9789 narrowed the unconditional override to a **count-guard** (`if (static_cast<int>(utf8.buf.size()) == argc) { argv = utf8.ptrs.data(); }`), but that is exactly the variant the project already found breaks its Windows server-integration tests (when the embedded argv length coincides with `java.exe`'s). The patch carries the **complete upstream change** (so it can be submitted to llama.cpp verbatim and then dropped here): **(1)** `common_params_parse` parses **exactly the argv it is given** (no `GetCommandLineW` magic) and a new `common_params_parse_main()` wrapper holds the UTF-8 recovery for the standalone tools' `main()` (`common/arg.{cpp,h}`); **(2)** the **~34 standalone `main()` call sites** (every `common_params_parse(argc, argv, …)` across `tools/*`, `examples/*` and the `tests/*` programs) flip to `common_params_parse_main()`; **(3)** a `tests/test-arg-parser.cpp` regression case pins that `common_params_parse` honors a caller-supplied argv. The embedded caller (`jllama.cpp`) keeps calling `common_params_parse` and is never overridden. **Our subproject build compiles only the `arg.{cpp,h}` core** — `LLAMA_BUILD_TOOLS`/`LLAMA_BUILD_TESTS` are OFF for a FetchContent subproject — so the flips + test are applied-but-not-compiled here; they were validated via a one-off `-DLLAMA_BUILD_TOOLS=ON -DLLAMA_BUILD_TESTS=ON` build (the new test compiles and its asserts pass; `test-arg-parser`'s only red there is the live `ggml.ai` download check, which is sandbox-network, not the patch). Because it spans **37 files** it must be refreshed on every llama.cpp bump (the applier fails loud). |
 | `0002-server-preserve-caller-load-progress-callback.patch` | Load-progress-callback regression introduced in llama.cpp **b9789**: `server_context::load_model` (`tools/server/server-context.cpp`) now **unconditionally** installs the server's own load-progress reporter on `params_base.load_progress_callback` immediately before `common_init_from_params`, clobbering any callback the embedding caller already set. libjllama's `LoadProgressCallback` feature wires `common_params.load_progress_callback` to a JNI trampoline *before* calling `load_model`, so the bump silently killed it — `LoadProgressCallbackTest` saw zero progress updates and the abort-on-`false` path never threw. The patch guards the assignment with `if (params_base.load_progress_callback == nullptr)`, so the server installs its own reporter **only when the caller hasn't** — a caller-supplied callback survives and fires during load. Standalone `llama-server` (no caller callback, so the field is null) is unaffected. Same JNI-vs-standalone divergence class as `0001`. |
+| `0003-pr22393-server-add-slot-prompt-similarity-getter-setter.patch` | **Upstream-PR carry** of [ggml-org/llama.cpp#22393](https://github.com/ggml-org/llama.cpp/pull/22393) ("server : add slot_prompt_similarity getter/setter") while it is still open upstream. Purely additive: adds `server_context::get_slot_prompt_similarity()` / `set_slot_prompt_similarity(float)` (`tools/server/server-context.{cpp,h}`) so an embedding/JNI caller can query and tune the slot-selection threshold at runtime without reloading the model. Verbatim copy of the PR — drop it once a pinned `b<nnnn>` includes the change. |
+| `0004-pr23116-server-per-request-reasoning-budget-tokens.patch` | **Upstream-PR carry** of [ggml-org/llama.cpp#23116](https://github.com/ggml-org/llama.cpp/pull/23116) ("server: honour per-request reasoning_budget_tokens in chat completions"), motivated by java-llama.cpp#140, while it is still open upstream. `oaicompat_chat_params_parse` (`tools/server/server-common.cpp`) only read the Anthropic `thinking_budget_tokens` alias and always wrote the server-level `reasoning_budget_message`, so a per-request `reasoning_budget_tokens` / `reasoning_budget_message` on a chat-completions request was ignored. The patch reads both overrides **before** the generic copy loop (precedence: `reasoning_budget_tokens` > `thinking_budget_tokens` alias > server default) and threads the per-request message through. Carries the upstream `tests/test-chat.cpp` additions verbatim so the patch is submittable as-is; like `0001`'s test/call-site flips they are **applied-but-not-compiled** here (`LLAMA_BUILD_TESTS` is OFF for the FetchContent subproject). Drop it once a pinned `b<nnnn>` includes the change. |
 
 ## OuteTTS build-time extraction (`cmake/generate-tts-upstream.cmake`)
 
diff --git a/patches/0003-pr22393-server-add-slot-prompt-similarity-getter-setter.patch b/patches/0003-pr22393-server-add-slot-prompt-similarity-getter-setter.patch
@@ -0,0 +1,45 @@
+Upstream PR: ggml-org/llama.cpp#22393 — "server : add slot_prompt_similarity getter/setter"
+https://github.com/ggml-org/llama.cpp/pull/22393
+
+Carried locally until the PR is merged upstream. Adds public get/set accessors for the
+server_context `slot_prompt_similarity` field so an embedding/JNI caller can query and tune
+the slot-selection threshold at runtime without reloading the model. The change is purely
+additive (two new accessors + their declarations) and is a verbatim copy of the upstream PR,
+so it can be dropped from patches/ once b<nnnn> includes it. Refresh against the new source on
+every llama.cpp version bump (the applier fails loud if the context shifts).
+
+diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
+index 39b7eb2..7c274cb 100644
+--- a/tools/server/server-context.cpp
++++ b/tools/server/server-context.cpp
+@@ -3965,6 +3965,14 @@ server_response_reader server_context::get_response_reader() {
+     return impl->get_response_reader();
+ }
+ 
++float server_context::get_slot_prompt_similarity() const {
++    return impl->slot_prompt_similarity;
++}
++
++void server_context::set_slot_prompt_similarity(float value) {
++    impl->slot_prompt_similarity = value;
++}
++
+ server_context_meta server_context::get_meta() const {
+     auto bos_id = llama_vocab_bos(impl->vocab);
+     auto eos_id = llama_vocab_eos(impl->vocab);
+diff --git a/tools/server/server-context.h b/tools/server/server-context.h
+index 952f825..938c985 100644
+--- a/tools/server/server-context.h
++++ b/tools/server/server-context.h
+@@ -106,6 +106,11 @@ struct server_context {
+     // not thread-safe, should only be used from the main thread
+     server_context_meta get_meta() const;
+ 
++    // get/set the slot-prompt-similarity threshold for slot selection
++    // not thread-safe, should only be used from the main thread
++    float get_slot_prompt_similarity() const;
++    void  set_slot_prompt_similarity(float value);
++
+     // note: must be set before load_model() is called
+     void set_state_callback(server_state_callback_t callback);
+ };
diff --git a/patches/0004-pr23116-server-per-request-reasoning-budget-tokens.patch b/patches/0004-pr23116-server-per-request-reasoning-budget-tokens.patch
@@ -0,0 +1,131 @@
+Upstream PR: ggml-org/llama.cpp#23116 — "server: honour per-request reasoning_budget_tokens in
+chat completions"
+https://github.com/ggml-org/llama.cpp/pull/23116
+
+Carried locally until the PR is merged upstream. Motivated by java-llama.cpp#140: a per-request
+`reasoning_budget_tokens` (and `reasoning_budget_message`) sent on a chat-completions request must
+override the server-launch default. Upstream `oaicompat_chat_params_parse` only read the Anthropic
+`thinking_budget_tokens` alias and always wrote the server-level `reasoning_budget_message`, so the
+canonical per-request keys were ignored. The patch reads both overrides before the generic copy loop
+(precedence: reasoning_budget_tokens > thinking_budget_tokens alias > server default) and threads the
+per-request message through. Includes the upstream test additions (tests/test-chat.cpp) verbatim so
+the patch is submittable as-is; LLAMA_BUILD_TESTS is OFF for the FetchContent subproject, so those are
+applied-but-not-compiled here. Refresh against the new source on every llama.cpp version bump (the
+applier fails loud if the context shifts).
+
+diff --git a/tests/test-chat.cpp b/tests/test-chat.cpp
+index c38aed8..dfa8006 100644
+--- a/tests/test-chat.cpp
++++ b/tests/test-chat.cpp
+@@ -5780,6 +5780,71 @@ static void test_developer_role_to_system_workaround() {
+     }
+ }
+ 
++static void test_reasoning_budget_tokens_per_request() {
++    LOG_DBG("%s\n", __func__);
++    // Use Qwen3 template which has <think>...</think> reasoning markers.
++    // The autoparser detects them and sets thinking_start/end_tag, which enables
++    // the reasoning-budget code path in oaicompat_chat_params_parse.
++    auto tmpls = read_templates("models/templates/Qwen-Qwen3-0.6B.jinja");
++
++    server_chat_params opt;
++    opt.tmpls            = std::move(tmpls);
++    opt.use_jinja        = true;
++    opt.enable_thinking  = true;
++    opt.reasoning_budget = -1;
++    opt.reasoning_format = COMMON_REASONING_FORMAT_NONE;
++
++    // Body with per-request reasoning_budget_tokens=0 (suppress thinking).
++    json body = {
++        {"messages", json::array({json{{"role", "user"}, {"content", "hello"}}})},
++        {"reasoning_budget_tokens", 0},
++    };
++    std::vector<raw_buffer> out_files;
++    auto llama_params = oaicompat_chat_params_parse(body, opt, out_files);
++
++    // The per-request value must win over the server default (-1).
++    if (!llama_params.contains("reasoning_budget_tokens")) {
++        throw std::runtime_error("reasoning_budget_tokens missing from llama_params (thinking_end_tag may be empty for this template)");
++    }
++    int got = llama_params["reasoning_budget_tokens"].get<int>();
++    if (got != 0) {
++        throw std::runtime_error(std::string("Expected reasoning_budget_tokens=0, got ") + std::to_string(got));
++    }
++}
++
++static void test_reasoning_budget_message_per_request() {
++    LOG_DBG("%s\n", __func__);
++    // Same code path as test_reasoning_budget_tokens_per_request: the Qwen3 template's
++    // <think>...</think> markers enable the reasoning-budget block in oaicompat_chat_params_parse.
++    auto tmpls = read_templates("models/templates/Qwen-Qwen3-0.6B.jinja");
++
++    server_chat_params opt;
++    opt.tmpls                   = std::move(tmpls);
++    opt.use_jinja               = true;
++    opt.enable_thinking         = true;
++    opt.reasoning_budget        = -1;
++    opt.reasoning_format        = COMMON_REASONING_FORMAT_NONE;
++    opt.reasoning_budget_message = "server default";
++
++    // Body with a per-request reasoning_budget_message override.
++    const std::string per_request_message = "per-request message";
++    json body = {
++        {"messages", json::array({json{{"role", "user"}, {"content", "hello"}}})},
++        {"reasoning_budget_message", per_request_message},
++    };
++    std::vector<raw_buffer> out_files;
++    auto llama_params = oaicompat_chat_params_parse(body, opt, out_files);
++
++    // The per-request value must win over the server default.
++    if (!llama_params.contains("reasoning_budget_message")) {
++        throw std::runtime_error("reasoning_budget_message missing from llama_params (thinking_end_tag may be empty for this template)");
++    }
++    std::string got = llama_params["reasoning_budget_message"].get<std::string>();
++    if (got != per_request_message) {
++        throw std::runtime_error("Expected reasoning_budget_message='" + per_request_message + "', got '" + got + "'");
++    }
++}
++
+ static void test_msg_diffs_compute() {
+     LOG_DBG("%s\n", __func__);
+     {
+@@ -5937,6 +6002,8 @@ int main(int argc, char ** argv) {
+         test_convert_responses_to_chatcmpl();
+         test_developer_role_to_system_workaround();
+         test_template_generation_prompt();
++        test_reasoning_budget_tokens_per_request();
++        test_reasoning_budget_message_per_request();
+         test_template_output_peg_parsers(detailed_debug);
+         std::cout << "\n[chat] All tests passed!" << '\n';
+     }
+diff --git a/tools/server/server-common.cpp b/tools/server/server-common.cpp
+index ac291d3..26cdfd2 100644
+--- a/tools/server/server-common.cpp
++++ b/tools/server/server-common.cpp
+@@ -1116,16 +1116,24 @@ json oaicompat_chat_params_parse(
+ 
+     // Reasoning budget: pass parameters through to sampling layer
+     {
+-        int reasoning_budget = json_value(body, "thinking_budget_tokens", -1);
++        // Per-request overrides, read before writing to llama_params so the generic copy
++        // loop (which skips keys already present) won't clobber the caller-supplied values.
++        // Precedence: canonical reasoning_budget_tokens > Anthropic thinking_budget_tokens
++        // alias > server-level default.
++        int reasoning_budget = json_value(body, "reasoning_budget_tokens", -1);
++        if (reasoning_budget == -1) {
++            reasoning_budget = json_value(body, "thinking_budget_tokens", -1);
++        }
+         if (reasoning_budget == -1) {
+             reasoning_budget = opt.reasoning_budget;
+         }
++        std::string reasoning_budget_message = json_value(body, "reasoning_budget_message", opt.reasoning_budget_message);
+ 
+         if (!chat_params.thinking_end_tag.empty()) {
+             llama_params["reasoning_budget_tokens"] = reasoning_budget;
+             llama_params["reasoning_budget_start_tag"] = chat_params.thinking_start_tag;
+             llama_params["reasoning_budget_end_tag"] = chat_params.thinking_end_tag;
+-            llama_params["reasoning_budget_message"] = opt.reasoning_budget_message;
++            llama_params["reasoning_budget_message"] = reasoning_budget_message;
+             llama_params["reasoning_control"] = json_value(body, "reasoning_control", false);
+         }
+     }
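
The precedence applied by the patched `oaicompat_chat_params_parse` hunk above (canonical key, then the Anthropic alias, then the server default) can be restated as a small Java sketch. `ReasoningBudgetSketch` and `resolveBudget` are hypothetical names for illustration; the request body is modeled as a plain map rather than a JSON object, and -1 is treated as the "unset" sentinel exactly as in the C++ hunk.

```java
import java.util.Map;

/** Hypothetical restatement of the budget precedence in the patched C++ hunk. */
final class ReasoningBudgetSketch {
    static final int UNSET = -1; // same sentinel the C++ code passes to json_value

    static int resolveBudget(Map<String, Integer> body, int serverDefault) {
        // 1. Canonical per-request key wins when present and not the sentinel.
        int budget = body.getOrDefault("reasoning_budget_tokens", UNSET);
        // 2. Otherwise fall back to the Anthropic-style alias.
        if (budget == UNSET) {
            budget = body.getOrDefault("thinking_budget_tokens", UNSET);
        }
        // 3. Otherwise use the server-launch default.
        if (budget == UNSET) {
            budget = serverDefault;
        }
        return budget;
    }

    public static void main(String[] args) {
        // Canonical key present: the alias and the server default are ignored.
        System.out.println(resolveBudget(
                Map.of("reasoning_budget_tokens", 0, "thinking_budget_tokens", 512), 1024));
    }
}
```

One consequence of reusing -1 as the sentinel is that an explicit per-request `reasoning_budget_tokens` of -1 is indistinguishable from an absent key, so it falls through to the alias and then to the server default.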
diff --git a/src/main/java/net/ladenthin/llama/parameters/InferenceParameters.java b/src/main/java/net/ladenthin/llama/parameters/InferenceParameters.java
@@ -103,6 +103,11 @@ public final class InferenceParameters extends JsonParameters {
     private static final String PARAM_TOOLS = "tools";
     private static final String PARAM_TOOL_CHOICE = "tool_choice";
     private static final String PARAM_PARALLEL_TOOL_CALLS = "parallel_tool_calls";
+    private static final String PARAM_DRY_MULTIPLIER = "dry_multiplier";
+    private static final String PARAM_DRY_BASE = "dry_base";
+    private static final String PARAM_DRY_ALLOWED_LENGTH = "dry_allowed_length";
+    private static final String PARAM_DRY_PENALTY_LAST_N = "dry_penalty_last_n";
+    private static final String PARAM_DRY_SEQUENCE_BREAKERS = "dry_sequence_breakers";
 
     private static final InferenceParameters EMPTY = new InferenceParameters();
 
@@ -734,6 +739,83 @@ public InferenceParameters withTopNSigma(float topNSigma) {
         return withScalar(PARAM_TOP_N_SIGMA, topNSigma);
     }
 
+    /**
+     * Returns a new request with the per-request DRY (Don't Repeat Yourself) repetition multiplier
+     * replaced (default: 0.0, 0.0 = DRY disabled). DRY suppresses repeated multi-token sequences
+     * without the collateral damage of the classic {@code repeat_penalty}. This is the per-request
+     * mirror of {@link ModelParameters#setDryMultiplier(float)} (the {@code --dry-multiplier} launch
+     * flag); when this wither is not called, nothing is emitted and DRY stays disabled.
+     *
+     * @param dryMultiplier the DRY sampling multiplier (0.0 = disabled)
+     * @return a new instance; this instance is unchanged
+     */
+    public InferenceParameters withDryMultiplier(float dryMultiplier) {
+        return withScalar(PARAM_DRY_MULTIPLIER, dryMultiplier);
+    }
+
+    /**
+     * Returns a new request with the per-request DRY base replaced (default: 1.75). The base is the
+     * exponential growth factor applied to the penalty as a repeated sequence lengthens; it only takes
+     * effect when {@link #withDryMultiplier(float)} is non-zero. Per-request mirror of
+     * {@link ModelParameters#setDryBase(float)} (the {@code --dry-base} launch flag).
+     *
+     * @param dryBase the DRY sampling base value
+     * @return a new instance; this instance is unchanged
+     */
+    public InferenceParameters withDryBase(float dryBase) {
+        return withScalar(PARAM_DRY_BASE, dryBase);
+    }
+
+    /**
+     * Returns a new request with the per-request DRY allowed length replaced (default: 2). Sequences
+     * up to this length are not penalized; the penalty applies only once a repeated sequence grows
+     * longer. Only takes effect when {@link #withDryMultiplier(float)} is non-zero. Per-request mirror
+     * of {@link ModelParameters#setDryAllowedLength(int)} (the {@code --dry-allowed-length} launch flag).
+     *
+     * @param dryAllowedLength the allowed length for DRY sampling
+     * @return a new instance; this instance is unchanged
+     */
+    public InferenceParameters withDryAllowedLength(int dryAllowedLength) {
+        return withScalar(PARAM_DRY_ALLOWED_LENGTH, dryAllowedLength);
+    }
+
+    /**
+     * Returns a new request with the per-request DRY penalty window replaced (default: -1, -1 = context
+     * size, 0 = disabled). Only takes effect when {@link #withDryMultiplier(float)} is non-zero.
+     * Per-request mirror of {@link ModelParameters#setDryPenaltyLastN(int)} (the
+     * {@code --dry-penalty-last-n} launch flag); values below {@code -1} are rejected.
+     *
+     * @param dryPenaltyLastN the DRY penalty window (-1 = context size, 0 = disabled)
+     * @return a new instance; this instance is unchanged
+     * @throws IllegalArgumentException if {@code dryPenaltyLastN} is less than {@code -1}
+     */
+    public InferenceParameters withDryPenaltyLastN(int dryPenaltyLastN) {
+        if (dryPenaltyLastN < -1) {
+            throw new IllegalArgumentException("Invalid dry_penalty_last_n value: " + dryPenaltyLastN
+                    + " (must be >= -1; -1 = context size, 0 = disabled)");
+        }
+        return withScalar(PARAM_DRY_PENALTY_LAST_N, dryPenaltyLastN);
+    }
+
+    /**
+     * Returns a new request with the per-request DRY sequence breakers replaced. Sequence breakers are
+     * tokens at which DRY restarts matching, so repetition is not penalized across them (llama.cpp
+     * default: {@code ["\n", ":", "\"", "*"]}). Empty input is a no-op (returns {@code this}), so when
+     * this wither is not called nothing is emitted and the server's default breakers apply. Only takes
+     * effect when {@link #withDryMultiplier(float)} is non-zero.
+     *
+     * @param breakers the sequence-breaker strings
+     * @return a new instance with the breaker array set, or {@code this} if {@code breakers} is empty
+     */
+    public InferenceParameters withDrySequenceBreakers(String... breakers) {
+        if (breakers.length == 0) {
+            return this;
+        }
+        return withRaw(
+                PARAM_DRY_SEQUENCE_BREAKERS,
+                serializer.buildStopStrings(breakers).toString());
+    }
+
     /**
      * Returns a new request with the reasoning-format choice replaced.
      *
diff --git a/src/test/java/net/ladenthin/llama/parameters/InferenceParametersTest.java b/src/test/java/net/ladenthin/llama/parameters/InferenceParametersTest.java