Commit 381a3fc

Add per-request DRY sampling to InferenceParameters; carry upstream PRs #22393 and #23116 as patches
InferenceParameters gains five immutable withers mirroring the existing
withMinP/withTopNSigma (scalars) and withStopStrings (string array) style:

  withDryMultiplier(float)            -> "dry_multiplier"
  withDryBase(float)                  -> "dry_base"
  withDryAllowedLength(int)           -> "dry_allowed_length"
  withDryPenaltyLastN(int)            -> "dry_penalty_last_n" (rejects < -1)
  withDrySequenceBreakers(String...)  -> "dry_sequence_breakers" (omitted when unset)

This exposes DRY per request, uniformly with the other samplers, instead of
only at model/launch level (ModelParameters --dry-*). Defaults are unchanged:
if no wither is called, nothing is emitted and DRY stays disabled.

Adds 12 unit tests covering field/value serialization, the JSON string array,
the no-op-when-empty contract, penalty-last-n validation, and
immutable-instance semantics (InferenceParametersTest: 90 -> 102 tests).

Also carries two still-open upstream llama.cpp PRs as local patches (named
after the PR number), refreshed against the pinned b9803 source and verified to
apply cleanly + reverse-check idempotently:

  patches/0003-pr22393-...  server_context slot_prompt_similarity get/set
  patches/0004-pr23116-...  per-request reasoning_budget_tokens override
                            (incl. upstream test-chat.cpp additions, verbatim)

Updates CLAUDE.md patches table and CHANGELOG.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NoVagFhnb7af9DFSDzpsuY
1 parent b9ad93a commit 381a3fc

6 files changed

Lines changed: 341 additions & 0 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ from version 5.0.0 onward. Pre-fork releases (`1.x`–`4.2.0`) were authored by
 - End-to-end vision input across blocking, typed `ChatRequest`, streaming, and OpenAI-compatible request mapping; real-model tests verify that distinct red and blue images produce the correct semantic answers.
 - Explicit `setMmprojAuto(boolean)` and `setMmprojOffload(boolean)` controls, including the upstream `--no-mmproj-auto` and `--no-mmproj-offload` flags.
 - Per-request KV controls: `InferenceParameters.withSlotId(int)` and `withCacheReuse(int)`.
+- Per-request DRY sampling to `InferenceParameters` (`dry_multiplier`/`dry_base`/`dry_allowed_length`/`dry_penalty_last_n`/`dry_sequence_breakers`).
 - Typed cache observability through `Usage.getCachedTokens()`, `Usage.getProcessedPromptTokens()`, `SlotMetrics`, and `ServerMetrics.getSlotMetrics()`.
 - Authenticated JSON `GET /metrics` and `GET /slots` endpoints on the embedded server.

CLAUDE.md

Lines changed: 2 additions & 0 deletions
@@ -384,6 +384,8 @@ Current patches:
 |-------|-------|
 | `0001-win32-arg-parse-embed-guard.patch` | Windows JNI regression from llama.cpp **#24779** (introduced b9739): on Windows `common_params_parse` re-derived argv from the **process** command line (`GetCommandLineW`) and adopted it, so an embedded/JNI caller (`java.exe`) lost its `--model …` args → "Failed to parse model parameters". b9789 narrowed the unconditional override to a **count-guard** (`if (static_cast<int>(utf8.buf.size()) == argc) { argv = utf8.ptrs.data(); }`), but that is exactly the variant the project already found breaks its Windows server-integration tests (when the embedded argv length coincides with `java.exe`'s). The patch carries the **complete upstream change** (so it can be submitted to llama.cpp verbatim and then dropped here): **(1)** `common_params_parse` parses **exactly the argv it is given** (no `GetCommandLineW` magic) and a new `common_params_parse_main()` wrapper holds the UTF-8 recovery for the standalone tools' `main()` (`common/arg.{cpp,h}`); **(2)** the **~34 standalone `main()` call sites** (every `common_params_parse(argc, argv, …)` across `tools/*`, `examples/*` and the `tests/*` programs) flip to `common_params_parse_main()`; **(3)** a `tests/test-arg-parser.cpp` regression case pins that `common_params_parse` honors a caller-supplied argv. The embedded caller (`jllama.cpp`) keeps calling `common_params_parse` and is never overridden. **Our subproject build compiles only the `arg.{cpp,h}` core** — `LLAMA_BUILD_TOOLS`/`LLAMA_BUILD_TESTS` are OFF for a FetchContent subproject — so the flips + test are applied-but-not-compiled here; they were validated via a one-off `-DLLAMA_BUILD_TOOLS=ON -DLLAMA_BUILD_TESTS=ON` build (the new test compiles and its asserts pass; `test-arg-parser`'s only red there is the live `ggml.ai` download check, which is sandbox-network, not the patch). Because it spans **37 files** it must be refreshed on every llama.cpp bump (the applier fails loud). |
 | `0002-server-preserve-caller-load-progress-callback.patch` | Load-progress-callback regression introduced in llama.cpp **b9789**: `server_context::load_model` (`tools/server/server-context.cpp`) now **unconditionally** installs the server's own load-progress reporter on `params_base.load_progress_callback` immediately before `common_init_from_params`, clobbering any callback the embedding caller already set. libjllama's `LoadProgressCallback` feature wires `common_params.load_progress_callback` to a JNI trampoline *before* calling `load_model`, so the bump silently killed it — `LoadProgressCallbackTest` saw zero progress updates and the abort-on-`false` path never threw. The patch guards the assignment with `if (params_base.load_progress_callback == nullptr)`, so the server installs its own reporter **only when the caller hasn't** — a caller-supplied callback survives and fires during load. Standalone `llama-server` (no caller callback, so the field is null) is unaffected. Same JNI-vs-standalone divergence class as `0001`. |
+| `0003-pr22393-server-add-slot-prompt-similarity-getter-setter.patch` | **Upstream-PR carry** of [ggml-org/llama.cpp#22393](https://github.com/ggml-org/llama.cpp/pull/22393) ("server : add slot_prompt_similarity getter/setter") while it is still open upstream. Purely additive: adds `server_context::get_slot_prompt_similarity()` / `set_slot_prompt_similarity(float)` (`tools/server/server-context.{cpp,h}`) so an embedding/JNI caller can query and tune the slot-selection threshold at runtime without reloading the model. Verbatim copy of the PR — drop it once a pinned `b<nnnn>` includes the change. |
+| `0004-pr23116-server-per-request-reasoning-budget-tokens.patch` | **Upstream-PR carry** of [ggml-org/llama.cpp#23116](https://github.com/ggml-org/llama.cpp/pull/23116) ("server: honour per-request reasoning_budget_tokens in chat completions"), motivated by java-llama.cpp#140, while it is still open upstream. `oaicompat_chat_params_parse` (`tools/server/server-common.cpp`) only read the Anthropic `thinking_budget_tokens` alias and always wrote the server-level `reasoning_budget_message`, so a per-request `reasoning_budget_tokens` / `reasoning_budget_message` on a chat-completions request was ignored. The patch reads both overrides **before** the generic copy loop (precedence: `reasoning_budget_tokens` > `thinking_budget_tokens` alias > server default) and threads the per-request message through. Carries the upstream `tests/test-chat.cpp` additions verbatim so the patch is submittable as-is; like `0001`'s test/call-site flips they are **applied-but-not-compiled** here (`LLAMA_BUILD_TESTS` is OFF for the FetchContent subproject). Drop it once a pinned `b<nnnn>` includes the change. |

 ## OuteTTS build-time extraction (`cmake/generate-tts-upstream.cmake`)

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+Upstream PR: ggml-org/llama.cpp#22393 — "server : add slot_prompt_similarity getter/setter"
+https://github.com/ggml-org/llama.cpp/pull/22393
+
+Carried locally until the PR is merged upstream. Adds public get/set accessors for the
+server_context `slot_prompt_similarity` field so an embedding/JNI caller can query and tune
+the slot-selection threshold at runtime without reloading the model. The change is purely
+additive (two new accessors + their declarations) and is a verbatim copy of the upstream PR,
+so it can be dropped from patches/ once b<nnnn> includes it. Refresh against the new source on
+every llama.cpp version bump (the applier fails loud if the context shifts).
+
+diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
+index 39b7eb2..7c274cb 100644
+--- a/tools/server/server-context.cpp
++++ b/tools/server/server-context.cpp
+@@ -3965,6 +3965,14 @@ server_response_reader server_context::get_response_reader() {
+     return impl->get_response_reader();
+ }
+
++float server_context::get_slot_prompt_similarity() const {
++    return impl->slot_prompt_similarity;
++}
++
++void server_context::set_slot_prompt_similarity(float value) {
++    impl->slot_prompt_similarity = value;
++}
++
+ server_context_meta server_context::get_meta() const {
+     auto bos_id = llama_vocab_bos(impl->vocab);
+     auto eos_id = llama_vocab_eos(impl->vocab);
+diff --git a/tools/server/server-context.h b/tools/server/server-context.h
+index 952f825..938c985 100644
+--- a/tools/server/server-context.h
++++ b/tools/server/server-context.h
+@@ -106,6 +106,11 @@ struct server_context {
+     // not thread-safe, should only be used from the main thread
+     server_context_meta get_meta() const;
+
++    // get/set the slot-prompt-similarity threshold for slot selection
++    // not thread-safe, should only be used from the main thread
++    float get_slot_prompt_similarity() const;
++    void set_slot_prompt_similarity(float value);
++
+     // note: must be set before load_model() is called
+     void set_state_callback(server_state_callback_t callback);
+ };
Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
+Upstream PR: ggml-org/llama.cpp#23116 — "server: honour per-request reasoning_budget_tokens in
+chat completions"
+https://github.com/ggml-org/llama.cpp/pull/23116
+
+Carried locally until the PR is merged upstream. Motivated by java-llama.cpp#140: a per-request
+`reasoning_budget_tokens` (and `reasoning_budget_message`) sent on a chat-completions request must
+override the server-launch default. Upstream `oaicompat_chat_params_parse` only read the Anthropic
+`thinking_budget_tokens` alias and always wrote the server-level `reasoning_budget_message`, so the
+canonical per-request keys were ignored. The patch reads both overrides before the generic copy loop
+(precedence: reasoning_budget_tokens > thinking_budget_tokens alias > server default) and threads the
+per-request message through. Includes the upstream test additions (tests/test-chat.cpp) verbatim so
+the patch is submittable as-is; LLAMA_BUILD_TESTS is OFF for the FetchContent subproject, so those are
+applied-but-not-compiled here. Refresh against the new source on every llama.cpp version bump (the
+applier fails loud if the context shifts).
+
+diff --git a/tests/test-chat.cpp b/tests/test-chat.cpp
+index c38aed8..dfa8006 100644
+--- a/tests/test-chat.cpp
++++ b/tests/test-chat.cpp
+@@ -5780,6 +5780,71 @@ static void test_developer_role_to_system_workaround() {
+     }
+ }
+
++static void test_reasoning_budget_tokens_per_request() {
++    LOG_DBG("%s\n", __func__);
++    // Use Qwen3 template which has <think>...</think> reasoning markers.
++    // The autoparser detects them and sets thinking_start/end_tag, which enables
++    // the reasoning-budget code path in oaicompat_chat_params_parse.
++    auto tmpls = read_templates("models/templates/Qwen-Qwen3-0.6B.jinja");
++
++    server_chat_params opt;
++    opt.tmpls = std::move(tmpls);
++    opt.use_jinja = true;
++    opt.enable_thinking = true;
++    opt.reasoning_budget = -1;
++    opt.reasoning_format = COMMON_REASONING_FORMAT_NONE;
++
++    // Body with per-request reasoning_budget_tokens=0 (suppress thinking).
++    json body = {
++        {"messages", json::array({json{{"role", "user"}, {"content", "hello"}}})},
++        {"reasoning_budget_tokens", 0},
++    };
++    std::vector<raw_buffer> out_files;
++    auto llama_params = oaicompat_chat_params_parse(body, opt, out_files);
++
++    // The per-request value must win over the server default (-1).
++    if (!llama_params.contains("reasoning_budget_tokens")) {
++        throw std::runtime_error("reasoning_budget_tokens missing from llama_params (thinking_end_tag may be empty for this template)");
++    }
++    int got = llama_params["reasoning_budget_tokens"].get<int>();
++    if (got != 0) {
++        throw std::runtime_error(std::string("Expected reasoning_budget_tokens=0, got ") + std::to_string(got));
++    }
++}
++
++static void test_reasoning_budget_message_per_request() {
++    LOG_DBG("%s\n", __func__);
++    // Same code path as test_reasoning_budget_tokens_per_request: the Qwen3 template's
++    // <think>...</think> markers enable the reasoning-budget block in oaicompat_chat_params_parse.
++    auto tmpls = read_templates("models/templates/Qwen-Qwen3-0.6B.jinja");
++
++    server_chat_params opt;
++    opt.tmpls = std::move(tmpls);
++    opt.use_jinja = true;
++    opt.enable_thinking = true;
++    opt.reasoning_budget = -1;
++    opt.reasoning_format = COMMON_REASONING_FORMAT_NONE;
++    opt.reasoning_budget_message = "server default";
++
++    // Body with a per-request reasoning_budget_message override.
++    const std::string per_request_message = "per-request message";
++    json body = {
++        {"messages", json::array({json{{"role", "user"}, {"content", "hello"}}})},
++        {"reasoning_budget_message", per_request_message},
++    };
++    std::vector<raw_buffer> out_files;
++    auto llama_params = oaicompat_chat_params_parse(body, opt, out_files);
++
++    // The per-request value must win over the server default.
++    if (!llama_params.contains("reasoning_budget_message")) {
++        throw std::runtime_error("reasoning_budget_message missing from llama_params (thinking_end_tag may be empty for this template)");
++    }
++    std::string got = llama_params["reasoning_budget_message"].get<std::string>();
++    if (got != per_request_message) {
++        throw std::runtime_error("Expected reasoning_budget_message='" + per_request_message + "', got '" + got + "'");
++    }
++}
++
+ static void test_msg_diffs_compute() {
+     LOG_DBG("%s\n", __func__);
+     {
+@@ -5937,6 +6002,8 @@ int main(int argc, char ** argv) {
+         test_convert_responses_to_chatcmpl();
+         test_developer_role_to_system_workaround();
+         test_template_generation_prompt();
++        test_reasoning_budget_tokens_per_request();
++        test_reasoning_budget_message_per_request();
+         test_template_output_peg_parsers(detailed_debug);
+         std::cout << "\n[chat] All tests passed!" << '\n';
+     }
+diff --git a/tools/server/server-common.cpp b/tools/server/server-common.cpp
+index ac291d3..26cdfd2 100644
+--- a/tools/server/server-common.cpp
++++ b/tools/server/server-common.cpp
+@@ -1116,16 +1116,24 @@ json oaicompat_chat_params_parse(
+
+     // Reasoning budget: pass parameters through to sampling layer
+     {
+-        int reasoning_budget = json_value(body, "thinking_budget_tokens", -1);
++        // Per-request overrides, read before writing to llama_params so the generic copy
++        // loop (which skips keys already present) won't clobber the caller-supplied values.
++        // Precedence: canonical reasoning_budget_tokens > Anthropic thinking_budget_tokens
++        // alias > server-level default.
++        int reasoning_budget = json_value(body, "reasoning_budget_tokens", -1);
++        if (reasoning_budget == -1) {
++            reasoning_budget = json_value(body, "thinking_budget_tokens", -1);
++        }
+         if (reasoning_budget == -1) {
+             reasoning_budget = opt.reasoning_budget;
+         }
++        std::string reasoning_budget_message = json_value(body, "reasoning_budget_message", opt.reasoning_budget_message);
+
+         if (!chat_params.thinking_end_tag.empty()) {
+             llama_params["reasoning_budget_tokens"] = reasoning_budget;
+             llama_params["reasoning_budget_start_tag"] = chat_params.thinking_start_tag;
+             llama_params["reasoning_budget_end_tag"] = chat_params.thinking_end_tag;
+-            llama_params["reasoning_budget_message"] = opt.reasoning_budget_message;
++            llama_params["reasoning_budget_message"] = reasoning_budget_message;
+             llama_params["reasoning_control"] = json_value(body, "reasoning_control", false);
+         }
+     }
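
The precedence rule in the `server-common.cpp` hunk above (canonical `reasoning_budget_tokens`, then the `thinking_budget_tokens` alias, then the server default) can be restated as a small Java sketch. `ReasoningBudgetSketch` and `resolve` are hypothetical names used only for illustration; they are not part of llama.cpp or of this repository, and the request body is modelled as a plain `Map` rather than real JSON.

```java
import java.util.Map;

/** Hypothetical illustration of the patched precedence; not a real llama.cpp or binding API. */
final class ReasoningBudgetSketch {
    /** Sentinel the patched C++ code uses for "no value supplied". */
    static final int UNSET = -1;

    /**
     * Resolves the effective budget:
     * reasoning_budget_tokens > thinking_budget_tokens alias > server default.
     */
    static int resolve(Map<String, Integer> body, int serverDefault) {
        // 1. The canonical per-request key wins when present.
        int budget = body.getOrDefault("reasoning_budget_tokens", UNSET);
        // 2. Otherwise fall back to the Anthropic-style alias.
        if (budget == UNSET) {
            budget = body.getOrDefault("thinking_budget_tokens", UNSET);
        }
        // 3. Otherwise use the server-launch default.
        if (budget == UNSET) {
            budget = serverDefault;
        }
        return budget;
    }
}
```

One consequence visible in this sketch, and in the C++ it mirrors: because `-1` doubles as the "unset" sentinel, a request that sends `reasoning_budget_tokens = -1` explicitly is treated as if the key were absent and falls through to the alias or the server default.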

src/main/java/net/ladenthin/llama/parameters/InferenceParameters.java

Lines changed: 82 additions & 0 deletions
@@ -103,6 +103,11 @@ public final class InferenceParameters extends JsonParameters {
     private static final String PARAM_TOOLS = "tools";
     private static final String PARAM_TOOL_CHOICE = "tool_choice";
     private static final String PARAM_PARALLEL_TOOL_CALLS = "parallel_tool_calls";
+    private static final String PARAM_DRY_MULTIPLIER = "dry_multiplier";
+    private static final String PARAM_DRY_BASE = "dry_base";
+    private static final String PARAM_DRY_ALLOWED_LENGTH = "dry_allowed_length";
+    private static final String PARAM_DRY_PENALTY_LAST_N = "dry_penalty_last_n";
+    private static final String PARAM_DRY_SEQUENCE_BREAKERS = "dry_sequence_breakers";

     private static final InferenceParameters EMPTY = new InferenceParameters();

@@ -734,6 +739,83 @@ public InferenceParameters withTopNSigma(float topNSigma) {
         return withScalar(PARAM_TOP_N_SIGMA, topNSigma);
     }

+    /**
+     * Returns a new request with the per-request DRY (Don't Repeat Yourself) repetition multiplier
+     * replaced (default: 0.0, 0.0 = DRY disabled). DRY suppresses repeated multi-token sequences
+     * without the collateral damage of the classic {@code repeat_penalty}. This is the per-request
+     * mirror of {@link ModelParameters#setDryMultiplier(float)} (the {@code --dry-multiplier} launch
+     * flag); when this wither is not called, nothing is emitted and DRY stays disabled.
+     *
+     * @param dryMultiplier the DRY sampling multiplier (0.0 = disabled)
+     * @return a new instance; this instance is unchanged
+     */
+    public InferenceParameters withDryMultiplier(float dryMultiplier) {
+        return withScalar(PARAM_DRY_MULTIPLIER, dryMultiplier);
+    }
+
+    /**
+     * Returns a new request with the per-request DRY base replaced (default: 1.75). The base is the
+     * exponential growth factor applied to the penalty as a repeated sequence lengthens; it only takes
+     * effect when {@link #withDryMultiplier(float)} is non-zero. Per-request mirror of
+     * {@link ModelParameters#setDryBase(float)} (the {@code --dry-base} launch flag).
+     *
+     * @param dryBase the DRY sampling base value
+     * @return a new instance; this instance is unchanged
+     */
+    public InferenceParameters withDryBase(float dryBase) {
+        return withScalar(PARAM_DRY_BASE, dryBase);
+    }
+
+    /**
+     * Returns a new request with the per-request DRY allowed length replaced (default: 2). Sequences
+     * up to this length are not penalized; the penalty applies only once a repeated sequence grows
+     * longer. Only takes effect when {@link #withDryMultiplier(float)} is non-zero. Per-request mirror
+     * of {@link ModelParameters#setDryAllowedLength(int)} (the {@code --dry-allowed-length} launch flag).
+     *
+     * @param dryAllowedLength the allowed length for DRY sampling
+     * @return a new instance; this instance is unchanged
+     */
+    public InferenceParameters withDryAllowedLength(int dryAllowedLength) {
+        return withScalar(PARAM_DRY_ALLOWED_LENGTH, dryAllowedLength);
+    }
+
+    /**
+     * Returns a new request with the per-request DRY penalty window replaced (default: -1, -1 = context
+     * size, 0 = disabled). Only takes effect when {@link #withDryMultiplier(float)} is non-zero.
+     * Per-request mirror of {@link ModelParameters#setDryPenaltyLastN(int)} (the
+     * {@code --dry-penalty-last-n} launch flag); values below {@code -1} are rejected.
+     *
+     * @param dryPenaltyLastN the DRY penalty window (-1 = context size, 0 = disabled)
+     * @return a new instance; this instance is unchanged
+     * @throws IllegalArgumentException if {@code dryPenaltyLastN} is less than {@code -1}
+     */
+    public InferenceParameters withDryPenaltyLastN(int dryPenaltyLastN) {
+        if (dryPenaltyLastN < -1) {
+            throw new IllegalArgumentException("Invalid dry_penalty_last_n value: " + dryPenaltyLastN
+                    + " (must be >= -1; -1 = context size, 0 = disabled)");
+        }
+        return withScalar(PARAM_DRY_PENALTY_LAST_N, dryPenaltyLastN);
+    }
+
+    /**
+     * Returns a new request with the per-request DRY sequence breakers replaced. Sequence breakers are
+     * tokens at which DRY restarts matching, so repetition is not penalized across them (llama.cpp
+     * default: {@code ["\n", ":", "\"", "*"]}). Empty input is a no-op (returns {@code this}), so when
+     * this wither is not called nothing is emitted and the server's default breakers apply. Only takes
+     * effect when {@link #withDryMultiplier(float)} is non-zero.
+     *
+     * @param breakers the sequence-breaker strings
+     * @return a new instance with the breaker array set, or {@code this} if {@code breakers} is empty
+     */
+    public InferenceParameters withDrySequenceBreakers(String... breakers) {
+        if (breakers.length == 0) {
+            return this;
+        }
+        return withRaw(
+                PARAM_DRY_SEQUENCE_BREAKERS,
+                serializer.buildStopStrings(breakers).toString());
+    }
+
     /**
      * Returns a new request with the reasoning-format choice replaced.
      *
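
The wither contract described in the Javadoc above (copy-on-write instances, nothing emitted until a wither is called, empty breaker input as a no-op, and rejection of `dry_penalty_last_n` below `-1`) can be illustrated with a self-contained sketch. `DryParamsSketch` is a hypothetical stand-in written for this example; it does not reproduce the real `InferenceParameters`, `withScalar`, `withRaw`, or serializer internals, which are not shown in this diff, and its JSON escaping is deliberately minimal.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical stand-in for InferenceParameters; illustrates the wither contract only. */
final class DryParamsSketch {
    private static final DryParamsSketch EMPTY = new DryParamsSketch(new LinkedHashMap<>());

    // JSON key -> already-serialized JSON value, kept in insertion order.
    private final Map<String, String> fields;

    private DryParamsSketch(Map<String, String> fields) {
        this.fields = fields;
    }

    static DryParamsSketch empty() {
        return EMPTY;
    }

    // Copy-on-write: the receiver is never mutated.
    private DryParamsSketch with(String key, String jsonValue) {
        Map<String, String> copy = new LinkedHashMap<>(fields);
        copy.put(key, jsonValue);
        return new DryParamsSketch(copy);
    }

    DryParamsSketch withDryMultiplier(float value) {
        return with("dry_multiplier", Float.toString(value));
    }

    DryParamsSketch withDryPenaltyLastN(int value) {
        if (value < -1) {
            throw new IllegalArgumentException("dry_penalty_last_n must be >= -1, got " + value);
        }
        return with("dry_penalty_last_n", Integer.toString(value));
    }

    DryParamsSketch withDrySequenceBreakers(String... breakers) {
        if (breakers.length == 0) {
            return this; // no-op: the key stays absent, so server defaults apply
        }
        StringJoiner array = new StringJoiner(",", "[", "]");
        for (String breaker : breakers) {
            array.add(quote(breaker));
        }
        return with("dry_sequence_breakers", array.toString());
    }

    // Minimal JSON string escaping, sufficient for the breaker characters used here.
    private static String quote(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                default: out.append(c);
            }
        }
        return out.append('"').toString();
    }

    String toJson() {
        StringJoiner object = new StringJoiner(",", "{", "}");
        fields.forEach((key, value) -> object.add(quote(key) + ":" + value));
        return object.toString();
    }
}
```

With the real class, the equivalent call chain would presumably start from an existing `InferenceParameters` instance and read `.withDryMultiplier(0.8f).withDrySequenceBreakers("\n", ":")`; how that instance is obtained is outside this diff, so the sketch uses its own `empty()` factory instead.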
