Misc. bug: Explicit Slot Requests Bypass Prompt Cache Restore

### Name and Version

```
llama-server --version
version: 9670 (02810c7aa)
built with GNU 13.3.0 for Linux x86_64
```

### Operating systems

Linux

### Which llama.cpp modules do you know to be affected?

llama-server

### Command line

```shell

```

### Problem description & steps to reproduce

## Summary

When a completion request specifies `id_slot`, the server bypasses `get_available_slot()` and directly calls `get_slot_by_id()`. As a result, the prompt cache restore path (`prompt_save()`, `prompt_load()`, `prompt_cache->update()`) is never executed for explicit-slot requests.

This causes prompt-cache-backed reuse to stop working for workloads that pin conversations to specific slots.

## Impact

For long-running conversations, especially with hybrid recurrent models where partial `seq_rm()` may fail, this results in:

- Missed prompt cache restores
- Reduced cache reuse
- Unnecessary checkpoint rescue activation
- Full or partial prompt re-evaluation when a prompt-cache restore could have been used instead
- Significant latency regressions on large contexts

In observed cases, an explicit-slot request with a prompt that should have matched an existing cached prompt instead fell through to the recurrent recovery path:

```text
partial seq_rm at p0=21619 failed; restored context checkpoint ...
```

rather than restoring from the prompt cache.

## Root Cause

There are currently two slot acquisition paths:

### Automatic slot selection

```cpp
get_available_slot(task)
```

This path performs prompt-cache maintenance:

```cpp
ret->prompt_save(*prompt_cache);

if (!ret->prompt_load(*prompt_cache, task.tokens)) {
    ret->prompt_clear(false);
}

prompt_cache->update();
```

### Explicit slot selection

```cpp
get_slot_by_id(task.id_slot)
```

This path performs none of the above work.

The two paths therefore behave differently even though both ultimately launch a completion task: only automatically scheduled requests get a chance to restore a matching prompt from the cache.
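The divergence can be reproduced in a toy model. The sketch below is not llama-server code: `toy_slot`, `toy_server`, `restore()`, `acquire_auto()`, `acquire_explicit()` and `reusable()` are hypothetical names, and the restore step is an assumed simplification (adopt the cached prompt with the longest shared prefix) of what `prompt_save()` / `prompt_load()` do.

```cpp
// Toy model: every name below is a hypothetical stand-in, not llama-server code.
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using tokens = std::vector<int>;

// Number of leading tokens two sequences share.
static std::size_t common_prefix(const tokens & a, const tokens & b) {
    std::size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) {
        ++n;
    }
    return n;
}

struct toy_slot {
    tokens prompt; // tokens whose state the slot currently holds
};

struct toy_server {
    std::vector<toy_slot> slots;
    std::vector<tokens>   cache; // stand-in for the prompt cache

    // Assumed, simplified save/load step: park the slot's prompt in the cache,
    // then adopt the cached prompt sharing the longest prefix with the request.
    // Load failures (the prompt_clear() branch) are not modeled here.
    void restore(toy_slot & slot, const tokens & request) {
        if (!slot.prompt.empty()) {
            cache.push_back(slot.prompt);
        }
        std::size_t best     = common_prefix(slot.prompt, request);
        std::size_t best_idx = cache.size(); // cache.size() means "nothing better found"
        for (std::size_t i = 0; i < cache.size(); ++i) {
            const std::size_t n = common_prefix(cache[i], request);
            if (n > best) {
                best     = n;
                best_idx = i;
            }
        }
        if (best_idx != cache.size()) {
            slot.prompt = std::move(cache[best_idx]);
            cache.erase(cache.begin() + static_cast<std::ptrdiff_t>(best_idx));
        }
    }

    // Automatic path: pick a slot (here simply the first) and run the restore step.
    toy_slot & acquire_auto(const tokens & request) {
        assert(!slots.empty());
        toy_slot & slot = slots.front();
        restore(slot, request);
        return slot;
    }

    // Explicit path as described in this report: lookup only, no restore.
    toy_slot & acquire_explicit(std::size_t id) {
        return slots.at(id);
    }
};

// Tokens the slot can reuse for the request without re-evaluation.
static std::size_t reusable(const toy_slot & slot, const tokens & request) {
    return common_prefix(slot.prompt, request);
}
```

With one empty slot, a cached prompt `{1, 2, 3, 4}` and a request `{1, 2, 3, 4, 5}`, the explicit path leaves the slot with nothing reusable, while the automatic path adopts the cached prompt and can reuse four tokens.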

## Proposed Fix

Factor prompt-cache preparation into a shared helper:

```cpp
void update_slot_prompt_cache(
    server_slot * slot,
    const server_task & task,
    const char * label);
```

Introduce a common slot-selection path:

```cpp
server_slot * get_slot_for_task(const server_task & task);
```

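Behavior:

- `task.id_slot == -1`
  - delegate to the existing `get_available_slot(task)`
- explicit `id_slot`
  - look up the slot with `get_slot_by_id()`
  - if the slot does not exist or is currently processing, return it unchanged and skip the prompt-cache step
  - otherwise perform the prompt-cache update/restore
  - return the prepared slot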
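The dispatch rules above can be sketched on mock types. This is an illustration under stated assumptions, not the patch itself: `toy_task`, `toy_slot`, `toy_server` and the `cache_updates` counter are hypothetical, and the cache helper only counts invocations instead of saving or loading state.

```cpp
// Sketch only: the toy_* types are hypothetical stand-ins for the server's types.
#include <cassert>
#include <vector>

struct toy_task {
    int id_slot = -1; // -1 requests automatic selection
};

struct toy_slot {
    int  id            = 0;
    bool processing    = false;
    int  cache_updates = 0; // how many times cache preparation ran on this slot
};

struct toy_server {
    std::vector<toy_slot> slots;

    // Stand-in for the shared save/load/update helper; it only counts calls.
    void update_slot_prompt_cache(toy_slot & slot) {
        ++slot.cache_updates;
    }

    // Lookup only, no side effects.
    toy_slot * get_slot_by_id(int id) {
        for (toy_slot & slot : slots) {
            if (slot.id == id) {
                return &slot;
            }
        }
        return nullptr;
    }

    // Scheduling: first idle slot, prepared before it is returned.
    toy_slot * get_available_slot() {
        for (toy_slot & slot : slots) {
            if (!slot.processing) {
                update_slot_prompt_cache(slot);
                return &slot;
            }
        }
        return nullptr;
    }

    // Common entry point for both kinds of request.
    toy_slot * get_slot_for_task(const toy_task & task) {
        assert(task.id_slot >= -1);
        if (task.id_slot == -1) {
            return get_available_slot();
        }
        toy_slot * slot = get_slot_by_id(task.id_slot);
        if (slot == nullptr || slot->processing) {
            return slot; // missing or busy: leave the cache untouched
        }
        update_slot_prompt_cache(*slot);
        return slot;
    }
};
```

In this model a pinned request for an idle slot triggers exactly one cache preparation, a pinned request for a busy slot triggers none, and direct callers of `get_slot_by_id()` never touch the cache.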

Then replace:

```cpp
server_slot * slot =
    id_slot != -1
        ? get_slot_by_id(id_slot)
        : get_available_slot(task);
```

with:

```cpp
server_slot * slot = get_slot_for_task(task);
```

## Why Not Put This in `get_slot_by_id()`?

`get_slot_by_id()` is a generic lookup routine used by multiple code paths, including slot-management operations.

Adding prompt-cache side effects to a simple lookup function would make its behavior non-obvious and could introduce unexpected work in unrelated callers.

Keeping cache preparation in a task-oriented wrapper preserves separation of concerns:

- `get_slot_by_id()` → lookup only
- `get_available_slot()` → scheduling
- `get_slot_for_task()` → prepare slot for completion execution

## Expected Result

Explicit-slot requests receive the same prompt-cache restore behavior as automatically scheduled requests.

This restores symmetry between the two execution paths and avoids unnecessary fallback into checkpoint rescue or full prompt replay when a matching cached prompt already exists.

### First Bad Commit

_No response_

### Relevant log output

<details>
<summary>Logs</summary>


```console
# Automatic slot path: prompt-cache restore runs
2026-06-18 01:18:35.896 I slot get_availabl: id  1 | task -1 | selected slot by LRU
2026-06-18 01:18:35.896 I srv  get_availabl: updating prompt cache
2026-06-18 01:18:35.896 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
2026-06-18 01:18:35.896 I srv          load:  - found better prompt with f_keep = 0.996, sim = 0.998
2026-06-18 01:18:35.945 I srv        update:  - cache state: 1 prompts, 5249.568 MiB
2026-06-18 01:18:35.946 I slot launch_slot_: id  1 | task 1956 | processing task

# Explicit slot path: request launches on slot 0 without the get_availabl / prompt-cache restore path
2026-06-18 01:45:57.098 I Server POST /api/v1/chat/completions - Streaming
2026-06-18 01:45:57.551 I srv  params_from_: Chat format: peg-native
2026-06-18 01:45:57.554 I slot launch_slot_: id  0 | task 1994 | processing task
2026-06-18 01:45:57.554 I slot process_sing: id  1 | task -1 | saving idle slot to prompt cache
2026-06-18 01:45:57.555 W srv   prompt_save:  - saving prompt with length 21750, total state size = 341.697 MiB
2026-06-18 01:45:57.658 I srv        update:  - cache state: 2 prompts, 34298.837 MiB
2026-06-18 01:45:57.658 I srv        update:    - prompt 0x55c1c07bf2a0: 175715 tokens, checkpoints: 113, 33851.889 MiB
2026-06-18 01:45:57.658 I srv        update:    - prompt 0x55c1b6fce470: 21750 tokens, checkpoints: 1, 446.948 MiB
2026-06-18 01:45:57.658 I slot prompt_clear: id  1 | task -1 | clearing prompt with 21750 tokens

# Consequence: active slot has pos_next = 0, so checkpoints are invalidated instead of rehydrated
2026-06-18 01:45:57.662 W slot update_slots: id  0 | task 1994 | erased invalidated context checkpoint (pos_min = 9284, pos_max = 9284, n_tokens = 9285, n_swa = 0, pos_next = 0, size = 81.125 MiB)
2026-06-18 01:45:57.665 W slot update_slots: id  0 | task 1994 | erased invalidated context checkpoint (pos_min = 22914, pos_max = 22914, n_tokens = 22915, n_swa = 0, pos_next = 0, size = 108.006 MiB)
2026-06-18 01:45:57.667 W slot update_slots: id  0 | task 1994 | erased invalidated context checkpoint (pos_min = 35290, pos_max = 35290, n_tokens = 35291, n_swa = 0, pos_next = 0, size = 132.414 MiB)
```
</details>



## Fix

```diff
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -1386,6 +1386,26 @@
         return nullptr;
     }
 
+    void update_slot_prompt_cache(server_slot * slot, const server_task & task, const char * label) {
+        if (!slot || !prompt_cache || task.type != SERVER_TASK_TYPE_COMPLETION) {
+            return;
+        }
+
+        SRV_INF("updating prompt cache%s\n", label);
+
+        const int64_t t_start = ggml_time_us();
+
+        slot->prompt_save(*prompt_cache);
+
+        if (!slot->prompt_load(*prompt_cache, task.tokens)) {
+            slot->prompt_clear(false);
+        }
+
+        prompt_cache->update();
+
+        SRV_INF("prompt cache update%s took %.2f ms\n", label, (ggml_time_us() - t_start) / 1000.0);
+    }
+
     server_slot * get_available_slot(const server_task & task) {
         server_slot * ret = nullptr;
 
@@ -1463,23 +1483,27 @@
             update_cache = update_cache && task.type == SERVER_TASK_TYPE_COMPLETION;
 
             if (update_cache) {
-                SRV_INF("%s", "updating prompt cache\n");
-
-                const int64_t t_start = ggml_time_us();
+                update_slot_prompt_cache(ret, task, "");
+            }
+        }
 
-                ret->prompt_save(*prompt_cache);
+        return ret;
+    }
 
-                if (!ret->prompt_load(*prompt_cache, task.tokens)) {
-                    ret->prompt_clear(false);
-                }
+    server_slot * get_slot_for_task(const server_task & task) {
+        if (task.id_slot == -1) {
+            return get_available_slot(task);
+        }
 
-                prompt_cache->update();
+        server_slot * slot = get_slot_by_id(task.id_slot);
 
-                SRV_INF("prompt cache update took %.2f ms\n", (ggml_time_us() - t_start) / 1000.0);
-            }
+        if (!slot || slot->is_processing()) {
+            return slot;
         }
 
-        return ret;
+        update_slot_prompt_cache(slot, task, " for requested slot");
+
+        return slot;
     }
 
     // return true if at least one slot has been cleared
@@ -2179,7 +2203,6 @@
-                    const int id_slot = task.id_slot;
                     const int id_task = task.id;
 
-                    server_slot * slot = id_slot != -1 ? get_slot_by_id(id_slot) : get_available_slot(task);
+                    server_slot * slot = get_slot_for_task(task);
 
                     //
                     // slot scheduling logic
```

Misc. bug: Explicit Slot Requests Bypass Prompt Cache Restore #24746
