Commit e2c548b

Merge pull request #251 from vaiju1981/feature/session_improvement
feat: pin sessions to their slot and expose KV-cache usage + metrics
2 parents 2cc53b3 + 6860cc8 commit e2c548b

33 files changed

Lines changed: 577 additions & 37 deletions

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -17,6 +17,9 @@ from version 5.0.0 onward. Pre-fork releases (`1.x`–`4.2.0`) were authored by
 - Real-model tool-calling integration tests for blocking and streaming required tool calls (`ToolCallingIntegrationTest`, Qwen2.5-1.5B-Instruct), wired into CI and `validate-models`.
 - End-to-end vision input across blocking, typed `ChatRequest`, streaming, and OpenAI-compatible request mapping; real-model tests verify that distinct red and blue images produce the correct semantic answers.
 - Explicit `setMmprojAuto(boolean)` and `setMmprojOffload(boolean)` controls, including the upstream `--no-mmproj-auto` and `--no-mmproj-offload` flags.
+- Per-request KV controls: `InferenceParameters.withSlotId(int)` and `withCacheReuse(int)`.
+- Typed cache observability through `Usage.getCachedTokens()`, `Usage.getProcessedPromptTokens()`, `SlotMetrics`, and `ServerMetrics.getSlotMetrics()`.
+- Authenticated JSON `GET /metrics` and `GET /slots` endpoints on the embedded server.
 
 ### Changed
 - Unified `CONTRIBUTING.md` and `SECURITY.md` structure with sibling repositories in the project family.
@@ -30,6 +33,8 @@ from version 5.0.0 onward. Pre-fork releases (`1.x`–`4.2.0`) were authored by
 - Preserved decoded image buffers across the JNI chat boundary and submitted media requests through llama.cpp's upstream multimodal task path instead of silently tokenizing them as text-only prompts.
 - Preserved multipart image content when using the typed `ChatRequest` serializer.
 - The standalone OpenAI-compatible server now advertises vision only when the loaded model confirms usable vision support.
+- `Session` now pins every inference request to its configured slot, so generation and slot save/restore/erase target the same KV state.
+- Cached-token usage is preserved through typed Java responses and OpenAI Responses/Anthropic blocking and streaming adapters.
 
 ### Added
 - Reasoning-budget tests (Qwen3-0.6B).

README.md

Lines changed: 19 additions & 0 deletions
@@ -473,6 +473,23 @@ a JSON response, matching the HTTP server's contract:
 Server state is exposed via `getMetrics()`, `eraseSlot(int)`, `saveSlot(int, String)`,
 `restoreSlot(int, String)`, and `getModelMeta()`.
 
+### Prompt and KV Cache Reuse
+
+Prompt-prefix reuse is enabled by default in llama.cpp and can be controlled per request with
+`InferenceParameters.withCachePrompt(boolean)`. `withCacheReuse(int)` enables non-prefix chunk reuse,
+while `withSlotId(int)` pins a request to a specific server slot. `Session` applies its slot id to every
+request, so generation and `save`/`restore` operate on the same KV state.
+
+Typed results expose logical prompt, generated, cached prompt, and evaluated prompt counts through
+`Usage`. Per-request timing also remains available through `Timings.getCacheN()`.
+`LlamaModel.getMetricsTyped().getSlotMetrics()` reports each slot's logical, processed, cached,
+decoded, and remaining token counts.
+
+The embedded HTTP server exposes the same native JSON at authenticated `GET /metrics`, with the slot
+array alone at `GET /slots`. OpenAI responses preserve
+`usage.prompt_tokens_details.cached_tokens`; Responses API output uses
+`usage.input_tokens_details.cached_tokens`; Anthropic output uses `cache_read_input_tokens`.
+
 ### OpenAI-compatible HTTP server
 
 `net.ladenthin.llama.server.OpenAiCompatServer` turns a loaded model into a local
@@ -488,6 +505,8 @@ serves:
 | `POST /v1/rerank` (requires `--reranking`) | `LlamaModel.handleRerank` (reshaped to `results`/`data`) |
 | `POST /infill` | `LlamaModel.handleInfill` (fill-in-the-middle autocomplete) |
 | `GET /v1/models` | the configured model id |
+| `GET /metrics` | native server and per-slot token/cache counters (JSON) |
+| `GET /slots` | native per-slot token/cache counters (JSON array) |
 | `GET /health` | static `{"status":"ok"}` (unauthenticated) |
 
 Chat completions support **streaming via Server-Sent Events** and non-streaming, forwarding

src/main/cpp/jllama.cpp

Lines changed: 1 addition & 0 deletions
@@ -205,6 +205,7 @@ static void populate_completion_task(server_task &task, jllama_context *jctx, in
         }
     }
     task.params = server_schema::eval_llama_cmpl_schema(jctx->vocab, jctx->params, n_ctx_slot, logit_bias_eog, data);
+    configure_task_slot_impl(task, data);
 }
 
 [[nodiscard]] static jint dispatch_streaming_completion(JNIEnv *env, jllama_context *jctx, const json &data,

src/main/cpp/jni_helpers.hpp

Lines changed: 7 additions & 1 deletion
@@ -14,7 +14,7 @@
 // require_json_field_impl, jint_array_to_tokens_impl
 //
 // Layer B — JNI + server orchestration:
-// configure_multimodal_task_impl,
+// configure_multimodal_task_impl, configure_task_slot_impl,
 // json_to_jstring_impl, results_to_jstring_impl,
 // embedding_to_jfloat_array_impl, tokens_to_jint_array_impl
 //
@@ -175,6 +175,12 @@ inline void erase_reader(jllama_context *jctx, int id_task) {
     return true;
 }
 
+// Match server_routes::handle_completions_impl(): slot selection is task
+// metadata, not part of task_params, so eval_llama_cmpl_schema() does not set it.
+inline void configure_task_slot_impl(server_task &task, const json &data) {
+    task.id_slot = json_value(data, "id_slot", -1);
+}
+
 // ---------------------------------------------------------------------------
 // json_to_jstring_impl
 //

src/main/java/net/ladenthin/llama/Session.java

Lines changed: 9 additions & 2 deletions
@@ -185,7 +185,14 @@ public void close() {
      * @return inference parameters carrying the system message + wire messages
      */
     private InferenceParameters buildParams(@Nullable String systemMessage, List<Pair<String, String>> wireMessages) {
-        InferenceParameters params = InferenceParameters.empty().withMessages(systemMessage, wireMessages);
-        return paramsCustomizer == null ? params : paramsCustomizer.apply(params);
+        InferenceParameters params = InferenceParameters.empty()
+                .withMessages(systemMessage, wireMessages)
+                .withCachePrompt(true);
+        if (paramsCustomizer != null) {
+            params = paramsCustomizer.apply(params);
+        }
+        // Apply last: a Session must never drift away from the slot used by
+        // save(), restore(), and close(), even if a customizer supplies another id.
+        return params.withSlotId(slotId);
     }
 }
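
The ordering in `buildParams` is what makes the pin reliable: the default `cache_prompt` is set first, the customizer runs second, and the slot id is written last. Below is a minimal sketch of that ordering using a plain map in place of `InferenceParameters`. The class `SlotPinningSketch`, its `buildParams` signature, and the map representation are hypothetical stand-ins for illustration, not the library's types.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for the ordering used in Session.buildParams. */
public class SlotPinningSketch {

    /** Builds request parameters; the slot id is always written after the customizer ran. */
    public static Map<String, Object> buildParams(int slotId, UnaryOperator<Map<String, Object>> customizer) {
        Map<String, Object> params = new HashMap<>();
        params.put("cache_prompt", true); // default, set before the customizer so it can be overridden
        if (customizer != null) {
            params = new HashMap<>(customizer.apply(params));
        }
        params.put("id_slot", slotId); // applied last: a customizer cannot move the request to another slot
        return params;
    }

    public static void main(String[] args) {
        // A customizer that tries to change both the slot and the prompt-cache flag.
        Map<String, Object> params = buildParams(2, in -> {
            Map<String, Object> copy = new HashMap<>(in);
            copy.put("id_slot", 7);
            copy.put("cache_prompt", false);
            return copy;
        });
        System.out.println("id_slot=" + params.get("id_slot"));
        System.out.println("cache_prompt=" + params.get("cache_prompt"));
    }
}
```

In this sketch the customizer's `id_slot` of 7 is overwritten by the session's slot 2, while its `cache_prompt` override survives, which mirrors the order of calls in the diff.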

src/main/java/net/ladenthin/llama/json/ChatResponseParser.java

Lines changed: 7 additions & 2 deletions
@@ -150,9 +150,14 @@ public ChatResponse parseResponse(String json) {
             JsonNode node = OBJECT_MAPPER.readTree(json);
             String id = node.path("id").asText("");
             List<ChatChoice> choices = parseChoices(node.path("choices"));
+            JsonNode usageNode = node.path("usage");
             Usage usage = new Usage(
-                    node.path("usage").path("prompt_tokens").asLong(0L),
-                    node.path("usage").path("completion_tokens").asLong(0L));
+                    usageNode.path("prompt_tokens").asLong(0L),
+                    usageNode.path("completion_tokens").asLong(0L),
+                    usageNode
+                            .path("prompt_tokens_details")
+                            .path("cached_tokens")
+                            .asLong(0L));
             Timings timings = Timings.fromJson(node.path("timings"));
             TimingsLogger.log(timings);
             return new ChatResponse(id, choices, usage, timings, json);
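
The chained `path(...).asLong(0L)` calls mean a response without `prompt_tokens_details` yields a cached-token count of zero rather than an error. The sketch below shows the same defaulting behaviour on nested maps, so it runs without a JSON library. `CachedTokensSketch` and its `cachedTokens` helper are hypothetical names, and the map input is an assumption standing in for the parsed `usage` node.

```java
import java.util.Map;

/** Hypothetical sketch: read usage.prompt_tokens_details.cached_tokens with a default of 0. */
public class CachedTokensSketch {

    /** Returns the cached-token count, or 0 when the details object or the field is missing. */
    public static long cachedTokens(Map<String, Object> usage) {
        Object details = usage.get("prompt_tokens_details");
        if (!(details instanceof Map)) {
            return 0L; // older or minimal responses carry no details object
        }
        Object cached = ((Map<?, ?>) details).get("cached_tokens");
        return cached instanceof Number ? ((Number) cached).longValue() : 0L;
    }

    public static void main(String[] args) {
        Map<String, Object> withDetails = Map.of(
                "prompt_tokens", 120,
                "prompt_tokens_details", Map.of("cached_tokens", 100));
        Map<String, Object> withoutDetails = Map.of("prompt_tokens", 120);

        System.out.println(cachedTokens(withDetails));
        System.out.println(cachedTokens(withoutDetails));
    }
}
```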

src/main/java/net/ladenthin/llama/json/CompletionResponseParser.java

Lines changed: 3 additions & 2 deletions
@@ -187,10 +187,11 @@ public CompletionResult parseCompletionResult(String json) {
         try {
             JsonNode node = OBJECT_MAPPER.readTree(json);
             String text = extractContent(node);
+            Timings timings = Timings.fromJson(node.path("timings"));
             Usage usage = new Usage(
                     node.path("tokens_evaluated").asLong(0L),
-                    node.path("tokens_predicted").asLong(0L));
-            Timings timings = Timings.fromJson(node.path("timings"));
+                    node.path("tokens_predicted").asLong(0L),
+                    Math.max(0, timings.getCacheN()));
             TimingsLogger.log(timings);
             List<TokenLogprob> logprobs = parseLogprobs(node);
             StopReason stopReason =

src/main/java/net/ladenthin/llama/parameters/InferenceParameters.java

Lines changed: 32 additions & 0 deletions
@@ -58,6 +58,8 @@ public final class InferenceParameters extends JsonParameters {
     private static final String PARAM_INPUT_PREFIX = "input_prefix";
     private static final String PARAM_INPUT_SUFFIX = "input_suffix";
     private static final String PARAM_CACHE_PROMPT = "cache_prompt";
+    private static final String PARAM_CACHE_REUSE = "n_cache_reuse";
+    private static final String PARAM_SLOT_ID = "id_slot";
     private static final String PARAM_STREAM_OPTIONS = "stream_options";
     private static final String PARAM_RESPONSE_FORMAT = "response_format";
     private static final String PARAM_N_PREDICT = "n_predict";
@@ -204,6 +206,36 @@ public InferenceParameters withCachePrompt(boolean cachePrompt) {
         return withScalar(PARAM_CACHE_PROMPT, cachePrompt);
     }
 
+    /**
+     * Returns a new request with the minimum reusable KV-cache chunk size replaced.
+     * A value of {@code 0} disables non-prefix chunk reuse. Ordinary common-prefix
+     * reuse remains controlled by {@link #withCachePrompt(boolean)}.
+     *
+     * @param cacheReuse minimum reusable chunk size, or {@code 0} to disable
+     * @return a new instance; this instance is unchanged
+     */
+    public InferenceParameters withCacheReuse(int cacheReuse) {
+        if (cacheReuse < 0) {
+            throw new IllegalArgumentException("cacheReuse must be non-negative");
+        }
+        return withScalar(PARAM_CACHE_REUSE, cacheReuse);
+    }
+
+    /**
+     * Returns a new request pinned to a llama.cpp server slot. Pinning is useful
+     * for deterministic multi-turn KV reuse and for matching inference with
+     * {@code saveSlot}/{@code restoreSlot} operations.
+     *
+     * @param slotId non-negative slot identifier
+     * @return a new instance; this instance is unchanged
+     */
+    public InferenceParameters withSlotId(int slotId) {
+        if (slotId < 0) {
+            throw new IllegalArgumentException("slotId must be non-negative");
+        }
+        return withScalar(PARAM_SLOT_ID, slotId);
+    }
+
     /**
      * Returns a new request with the number of tokens to predict replaced
      * (default: -1, -1 = infinity, -2 = until context filled).
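
Both new setters follow the same contract: validate the argument, then return a copy with one key replaced while the receiver stays unchanged. Here is a self-contained sketch of that contract. `RequestParamsSketch`, its `asMap()` accessor, and the map-backed storage are hypothetical; only the key names `n_cache_reuse` and `id_slot` and the exception messages are taken from the diff.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical copy-on-write parameter object mirroring withCacheReuse/withSlotId. */
public final class RequestParamsSketch {
    private final Map<String, Object> values;

    private RequestParamsSketch(Map<String, Object> values) {
        this.values = Collections.unmodifiableMap(values);
    }

    public static RequestParamsSketch empty() {
        return new RequestParamsSketch(new LinkedHashMap<>());
    }

    /** Copies the current values, replaces one key, and wraps the copy in a new instance. */
    private RequestParamsSketch withScalar(String key, Object value) {
        Map<String, Object> copy = new LinkedHashMap<>(values);
        copy.put(key, value);
        return new RequestParamsSketch(copy);
    }

    public RequestParamsSketch withCacheReuse(int cacheReuse) {
        if (cacheReuse < 0) {
            throw new IllegalArgumentException("cacheReuse must be non-negative");
        }
        return withScalar("n_cache_reuse", cacheReuse);
    }

    public RequestParamsSketch withSlotId(int slotId) {
        if (slotId < 0) {
            throw new IllegalArgumentException("slotId must be non-negative");
        }
        return withScalar("id_slot", slotId);
    }

    public Map<String, Object> asMap() {
        return values;
    }

    public static void main(String[] args) {
        RequestParamsSketch base = RequestParamsSketch.empty();
        RequestParamsSketch pinned = base.withCacheReuse(256).withSlotId(3);
        System.out.println(base.asMap());   // the original instance is still empty
        System.out.println(pinned.asMap()); // the derived instance carries both keys
    }
}
```

Rejecting negative slot ids at the Java boundary keeps `-1` from reaching the request from user code; on the native side `configure_task_slot_impl` uses `-1` as its default when `id_slot` is absent.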

src/main/java/net/ladenthin/llama/parameters/ModelParameters.java

Lines changed: 9 additions & 10 deletions
@@ -1398,10 +1398,10 @@ public ModelParameters setKvUnified(boolean kvUnified) {
     /**
      * Set the maximum RAM cache size in MiB used to store saved slot KV state.
      * <p>
-     * Requires {@link #setKvUnified(boolean) unified KV} to be enabled.
      * Set to {@code -1} for no limit, {@code 0} to disable (default: 8192 MiB).
-     * Together with {@link #setClearIdle} this allows idle slots to be evicted
-     * from GPU/CPU memory and restored quickly on the next matching request.
+     * Together with {@link #setClearIdle}, idle slot states are copied into this
+     * RAM cache and restored on a matching request. Unified KV is required only
+     * when those idle slots should also be cleared from the active KV buffer.
      *
      * @param cacheRamMib maximum cache size in MiB, or {@code -1} for unlimited
      * @return this builder
@@ -1414,14 +1414,13 @@ public ModelParameters setCacheRamMib(int cacheRamMib) {
      * Enable or disable saving and clearing idle slots when a new task starts.
      * <p>
      * When enabled (the default), idle slots have their KV state saved to the
-     * RAM cache ({@link #setCacheRamMib}) and are then cleared, freeing GPU/CPU
-     * memory for the active request. The saved state is transparently restored
-     * on the next request that shares the same prompt prefix, so cache-hit
-     * latency is preserved.
+     * RAM cache ({@link #setCacheRamMib}). With unified KV enabled, the active
+     * slot state is also cleared, freeing KV-buffer capacity for other requests.
+     * Without unified KV the RAM-cache copy is still created, but the active
+     * slot remains allocated.
      * <p>
-     * Requires {@link #setKvUnified(boolean) unified KV} and a non-zero
-     * {@link #setCacheRamMib RAM cache}. If either dependency is absent the
-     * server logs a warning and silently disables the feature.
+     * Requires a non-zero {@link #setCacheRamMib RAM cache}. Unified KV is
+     * required only for active-buffer eviction.
      *
      * @param clearIdle {@code true} to enable idle-slot eviction (default), {@code false} to disable
      * @return this builder

src/main/java/net/ladenthin/llama/server/AnthropicApiSupport.java

Lines changed: 16 additions & 2 deletions
@@ -254,7 +254,13 @@ static String toAnthropicResponse(String openAiCompletionJson, String model) {
             stopReason = anthropicStopReason(choice.path("finish_reason").asText("stop"));
             JsonNode openAiUsage = completion.path("usage");
             if (openAiUsage.isObject()) {
-                usage.put("input_tokens", openAiUsage.path("prompt_tokens").asInt(0));
+                int promptTokens = openAiUsage.path("prompt_tokens").asInt(0);
+                int cachedTokens = openAiUsage
+                        .path("prompt_tokens_details")
+                        .path("cached_tokens")
+                        .asInt(0);
+                usage.put("input_tokens", Math.max(0, promptTokens - cachedTokens));
+                usage.put("cache_read_input_tokens", cachedTokens);
                 usage.put("output_tokens", openAiUsage.path("completion_tokens").asInt(0));
             }
         } catch (IOException e) {
@@ -391,12 +397,20 @@ static String blockStopEvent(int index) {
 
     /** {@code message_delta} event carrying the final stop reason. */
     static String messageDeltaEvent(String stopReason) {
+        return messageDeltaEvent(stopReason, 0, 0, 0);
+    }
+
+    /** Final message delta carrying token usage collected from the trailing OpenAI usage chunk. */
+    static String messageDeltaEvent(String stopReason, int inputTokens, int outputTokens, int cachedTokens) {
         ObjectNode data = OBJECT_MAPPER.createObjectNode();
         data.put("type", "message_delta");
         ObjectNode delta = data.putObject("delta");
         delta.put("stop_reason", stopReason);
         delta.putNull("stop_sequence");
-        data.putObject("usage").put("output_tokens", 0);
+        ObjectNode usage = data.putObject("usage");
+        usage.put("input_tokens", inputTokens);
+        usage.put("output_tokens", outputTokens);
+        usage.put("cache_read_input_tokens", cachedTokens);
         return sseEvent("message_delta", data.toString());
     }
 
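
The blocking adapter now splits the OpenAI-style prompt count into two Anthropic fields: `cache_read_input_tokens` takes the cached portion, and `input_tokens` takes the remainder, clamped at zero. The arithmetic can be checked in isolation with the sketch below. `AnthropicUsageSketch` and its array-returning helper are hypothetical, and the token counts are made-up example values.

```java
/** Hypothetical sketch of the prompt/cached split used for Anthropic-style usage. */
public class AnthropicUsageSketch {

    /**
     * Returns {input_tokens, cache_read_input_tokens, output_tokens}.
     * input_tokens is the uncached remainder of the prompt and never goes below zero.
     */
    public static int[] toAnthropicUsage(int promptTokens, int cachedTokens, int completionTokens) {
        int inputTokens = Math.max(0, promptTokens - cachedTokens);
        return new int[] {inputTokens, cachedTokens, completionTokens};
    }

    public static void main(String[] args) {
        // 120 prompt tokens, 100 of them served from the KV cache, 30 generated.
        int[] usage = toAnthropicUsage(120, 100, 30);
        System.out.println("input_tokens=" + usage[0]);
        System.out.println("cache_read_input_tokens=" + usage[1]);
        System.out.println("output_tokens=" + usage[2]);

        // An inconsistent report (cached > prompt) clamps input_tokens to 0.
        System.out.println("clamped input_tokens=" + toAnthropicUsage(10, 15, 0)[0]);
    }
}
```

With these example numbers the split is 20 uncached input tokens plus 100 cache-read tokens, so the two fields still sum to the original prompt count of 120.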
