Name and Version
llama-server --version
version: 9670 (02810c7aa)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
Problem description & steps to reproduce
Summary
When a completion request specifies id_slot, the server bypasses get_available_slot() and directly calls get_slot_by_id(). As a result, the prompt cache restore path (prompt_save(), prompt_load(), prompt_cache->update()) is never executed for explicit-slot requests.
This causes prompt-cache-backed reuse to stop working for workloads that pin conversations to specific slots.
Impact
For long-running conversations, especially with hybrid recurrent models where partial seq_rm() may fail, this results in:
- Missed prompt cache restores
- Reduced cache reuse
- Unnecessary checkpoint rescue activation
- Full or partial prompt re-evaluation when a prompt-cache restore could have been used instead
- Significant latency regressions on large contexts
In observed cases, an explicit-slot request with a prompt that should have matched an existing cached prompt instead fell through to the recurrent recovery path:
partial seq_rm at p0=21619 failed; restored context checkpoint ...
rather than restoring from the prompt cache.
Root Cause
There are currently two slot acquisition paths:
Automatic slot selection
get_available_slot(task)
This path performs prompt-cache maintenance:
ret->prompt_save(*prompt_cache);
if (!ret->prompt_load(*prompt_cache, task.tokens)) {
    ret->prompt_clear(false);
}
prompt_cache->update();
Explicit slot selection
get_slot_by_id(task.id_slot)
This path performs none of the above work.
As a result, the two code paths have different behavior even though both ultimately launch a completion task.
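The divergence can be illustrated with a toy model. This is only a sketch: `toy_slot`, `toy_server`, `n_cache_restores`, and `acquire_before_fix` are hypothetical names invented for illustration, not llama.cpp types, and the counter merely stands in for the prompt_save() / prompt_load() / prompt_cache->update() sequence.

```cpp
#include <vector>

// Toy stand-ins; these are NOT the real llama.cpp types.
struct toy_slot {
    int id = 0;
    int n_cache_restores = 0; // how many times prompt-cache maintenance ran
};

struct toy_server {
    std::vector<toy_slot> slots;

    // Plays the role of get_slot_by_id(): a pure lookup with no side effects.
    toy_slot * get_slot_by_id(int id) {
        for (auto & s : slots) {
            if (s.id == id) {
                return &s;
            }
        }
        return nullptr;
    }

    // Plays the role of get_available_slot(): picks a slot and runs
    // the cache maintenance step.
    toy_slot * get_available_slot() {
        if (slots.empty()) {
            return nullptr;
        }
        toy_slot * ret = &slots.front(); // toy policy: always the first slot
        ret->n_cache_restores++;         // stands in for save/load/update
        return ret;
    }

    // The dispatch as described in this issue, before the fix.
    toy_slot * acquire_before_fix(int id_slot) {
        return id_slot != -1 ? get_slot_by_id(id_slot) : get_available_slot();
    }
};
```

In this model, a request with id_slot == -1 bumps the counter on the selected slot, while a request that names a slot leaves the counter untouched, which is the asymmetry reported here.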
Proposed Fix
Factor prompt-cache preparation into a shared helper:
void update_slot_prompt_cache(
    server_slot * slot,
    const server_task & task,
    const char * label);
Introduce a common slot-selection path:
server_slot * get_slot_for_task(const server_task & task);
Behavior:
- task.id_slot == -1
  - delegate to the existing get_available_slot(task)
- explicit id_slot
  - look up the slot
  - if the slot is currently processing, return it without touching the prompt cache
  - otherwise perform the prompt-cache update/restore
  - return the prepared slot
Then replace:
server_slot * slot =
    id_slot != -1
        ? get_slot_by_id(id_slot)
        : get_available_slot(task);
with:
server_slot * slot = get_slot_for_task(task);
Why Not Put This in get_slot_by_id()?
get_slot_by_id() is a generic lookup routine used by multiple code paths, including slot-management operations.
Adding prompt-cache side effects to a simple lookup function would make its behavior non-obvious and could introduce unexpected work in unrelated callers.
Keeping cache preparation in a task-oriented wrapper preserves separation of concerns:
get_slot_by_id() → lookup only
get_available_slot() → scheduling
get_slot_for_task() → prepare slot for completion execution
Expected Result
Explicit-slot requests receive the same prompt-cache restore behavior as automatically scheduled requests.
This restores symmetry between the two execution paths and avoids unnecessary fallback into checkpoint rescue or full prompt replay when a matching cached prompt already exists.
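The expected behavior can be checked against a small mock. This is a sketch under stated assumptions: `mock_slot`, `mock_server`, `processing`, and `n_cache_updates` are hypothetical names, the first-idle-slot policy is a placeholder for the real scheduling logic, and the counter stands in for the actual cache work. Only the control flow of get_slot_for_task() follows the proposal in this issue.

```cpp
#include <vector>

// Mock types for illustration only; not the llama.cpp definitions.
struct mock_slot {
    int  id = 0;
    bool processing = false;  // stands in for server_slot::is_processing()
    int  n_cache_updates = 0; // counts prompt-cache preparation runs
};

struct mock_server {
    std::vector<mock_slot> slots;

    // Lookup only: never touches the cache counter.
    mock_slot * get_slot_by_id(int id) {
        for (auto & s : slots) {
            if (s.id == id) {
                return &s;
            }
        }
        return nullptr;
    }

    // Shared helper: the single place where cache preparation happens.
    void update_slot_prompt_cache(mock_slot * slot) {
        if (!slot) {
            return;
        }
        slot->n_cache_updates++;
    }

    // Scheduling: placeholder policy that picks the first idle slot.
    mock_slot * get_available_slot() {
        for (auto & s : slots) {
            if (!s.processing) {
                update_slot_prompt_cache(&s);
                return &s;
            }
        }
        return nullptr;
    }

    // Task-oriented wrapper: both paths end with a prepared slot.
    mock_slot * get_slot_for_task(int id_slot) {
        if (id_slot == -1) {
            return get_available_slot();
        }
        mock_slot * slot = get_slot_by_id(id_slot);
        if (!slot || slot->processing) {
            return slot; // busy or missing: leave the cache alone
        }
        update_slot_prompt_cache(slot);
        return slot;
    }
};
```

With this shape, an idle explicit slot gets exactly one cache preparation per request, a busy slot is returned untouched, and get_slot_by_id() stays free of side effects for its other callers.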
First Bad Commit
No response
Relevant log output
Logs
# Automatic slot path: prompt-cache restore runs
2026-06-18 01:18:35.896 I slot get_availabl: id 1 | task -1 | selected slot by LRU
2026-06-18 01:18:35.896 I srv get_availabl: updating prompt cache
2026-06-18 01:18:35.896 I srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
2026-06-18 01:18:35.896 I srv load: - found better prompt with f_keep = 0.996, sim = 0.998
2026-06-18 01:18:35.945 I srv update: - cache state: 1 prompts, 5249.568 MiB
2026-06-18 01:18:35.946 I slot launch_slot_: id 1 | task 1956 | processing task
# Explicit slot path: request launches on slot 0 without the get_availabl / prompt-cache restore path
2026-06-18 01:45:57.098 I Server POST /api/v1/chat/completions - Streaming
2026-06-18 01:45:57.551 I srv params_from_: Chat format: peg-native
2026-06-18 01:45:57.554 I slot launch_slot_: id 0 | task 1994 | processing task
2026-06-18 01:45:57.554 I slot process_sing: id 1 | task -1 | saving idle slot to prompt cache
2026-06-18 01:45:57.555 W srv prompt_save: - saving prompt with length 21750, total state size = 341.697 MiB
2026-06-18 01:45:57.658 I srv update: - cache state: 2 prompts, 34298.837 MiB
2026-06-18 01:45:57.658 I srv update: - prompt 0x55c1c07bf2a0: 175715 tokens, checkpoints: 113, 33851.889 MiB
2026-06-18 01:45:57.658 I srv update: - prompt 0x55c1b6fce470: 21750 tokens, checkpoints: 1, 446.948 MiB
2026-06-18 01:45:57.658 I slot prompt_clear: id 1 | task -1 | clearing prompt with 21750 tokens
# Consequence: active slot has pos_next = 0, so checkpoints are invalidated instead of rehydrated
2026-06-18 01:45:57.662 W slot update_slots: id 0 | task 1994 | erased invalidated context checkpoint (pos_min = 9284, pos_max = 9284, n_tokens = 9285, n_swa = 0, pos_next = 0, size = 81.125 MiB)
2026-06-18 01:45:57.665 W slot update_slots: id 0 | task 1994 | erased invalidated context checkpoint (pos_min = 22914, pos_max = 22914, n_tokens = 22915, n_swa = 0, pos_next = 0, size = 108.006 MiB)
2026-06-18 01:45:57.667 W slot update_slots: id 0 | task 1994 | erased invalidated context checkpoint (pos_min = 35290, pos_max = 35290, n_tokens = 35291, n_swa = 0, pos_next = 0, size = 132.414 MiB)
Fix
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -1386,6 +1386,26 @@
return nullptr;
}
+ void update_slot_prompt_cache(server_slot * slot, const server_task & task, const char * label) {
+ if (!slot || !prompt_cache || task.type != SERVER_TASK_TYPE_COMPLETION) {
+ return;
+ }
+
+ SRV_INF("updating prompt cache%s\n", label);
+
+ const int64_t t_start = ggml_time_us();
+
+ slot->prompt_save(*prompt_cache);
+
+ if (!slot->prompt_load(*prompt_cache, task.tokens)) {
+ slot->prompt_clear(false);
+ }
+
+ prompt_cache->update();
+
+ SRV_INF("prompt cache update%s took %.2f ms\n", label, (ggml_time_us() - t_start) / 1000.0);
+ }
+
server_slot * get_available_slot(const server_task & task) {
server_slot * ret = nullptr;
@@ -1463,23 +1483,27 @@
update_cache = update_cache && task.type == SERVER_TASK_TYPE_COMPLETION;
if (update_cache) {
- SRV_INF("%s", "updating prompt cache\n");
-
- const int64_t t_start = ggml_time_us();
+ update_slot_prompt_cache(ret, task, "");
+ }
+ }
- ret->prompt_save(*prompt_cache);
+ return ret;
+ }
- if (!ret->prompt_load(*prompt_cache, task.tokens)) {
- ret->prompt_clear(false);
- }
+ server_slot * get_slot_for_task(const server_task & task) {
+ if (task.id_slot == -1) {
+ return get_available_slot(task);
+ }
- prompt_cache->update();
+ server_slot * slot = get_slot_by_id(task.id_slot);
- SRV_INF("prompt cache update took %.2f ms\n", (ggml_time_us() - t_start) / 1000.0);
- }
+ if (!slot || slot->is_processing()) {
+ return slot;
}
- return ret;
+ update_slot_prompt_cache(slot, task, " for requested slot");
+
+ return slot;
}
// return true if at least one slot has been cleared
@@ -2179,7 +2203,6 @@
- const int id_slot = task.id_slot;
const int id_task = task.id;
- server_slot * slot = id_slot != -1 ? get_slot_by_id(id_slot) : get_available_slot(task);
+ server_slot * slot = get_slot_for_task(task);
//
// slot scheduling logic