Misc. bug: Explicit Slot Requests Bypass Prompt Cache Restore #24746

Description

@aarononeal

Name and Version

llama-server --version
version: 9670 (02810c7aa)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

Problem description & steps to reproduce

Summary

When a completion request specifies id_slot, the server bypasses get_available_slot() and directly calls get_slot_by_id(). As a result, the prompt cache restore path (prompt_save(), prompt_load(), prompt_cache->update()) is never executed for explicit-slot requests.

This causes prompt-cache-backed reuse to stop working for workloads that pin conversations to specific slots.

Impact

For long-running conversations, especially with hybrid recurrent models where partial seq_rm() may fail, this results in:

  • Missed prompt cache restores
  • Reduced cache reuse
  • Unnecessary checkpoint rescue activation
  • Full or partial prompt re-evaluation when a prompt-cache restore could have been used instead
  • Significant latency regressions on large contexts

In observed cases, an explicit-slot request with a prompt that should have matched an existing cached prompt instead fell through to the recurrent recovery path:

partial seq_rm at p0=21619 failed; restored context checkpoint ...

rather than restoring from the prompt cache.

Root Cause

There are currently two slot acquisition paths:

Automatic slot selection

get_available_slot(task)

This path performs prompt-cache maintenance:

ret->prompt_save(*prompt_cache);

if (!ret->prompt_load(*prompt_cache, task.tokens)) {
    ret->prompt_clear(false);
}

prompt_cache->update();

Explicit slot selection

get_slot_by_id(task.id_slot)

This path performs none of the above work.

As a result, the two code paths have different behavior even though both ultimately launch a completion task.
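The divergence can be reproduced in a few lines with stand-in types. This is a minimal sketch, not the real server code: mock_slot, mock_server, acquire() and went_through_cache() are hypothetical names, the LRU selection is reduced to "first slot", and a single boolean stands in for the prompt_save() / prompt_load() / prompt_cache->update() sequence.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the server types; not the real llama.cpp definitions.
struct mock_slot {
    int  id             = 0;
    bool cache_restored = false; // set when the prompt-cache step runs
};

struct mock_server {
    std::vector<mock_slot> slots;

    // Lookup only: like get_slot_by_id(), it has no prompt-cache side effects.
    mock_slot * get_slot_by_id(int id) {
        for (auto & s : slots) {
            if (s.id == id) {
                return &s;
            }
        }
        return nullptr;
    }

    // Scheduling plus cache maintenance: like get_available_slot().
    mock_slot * get_available_slot() {
        if (slots.empty()) {
            return nullptr;
        }
        mock_slot * ret = &slots.front(); // placeholder for the real selection logic
        ret->cache_restored = true;       // stands in for prompt_save/prompt_load/update
        return ret;
    }

    // The dispatch described in this issue: explicit ids skip the cache step.
    mock_slot * acquire(int id_slot) {
        return id_slot != -1 ? get_slot_by_id(id_slot) : get_available_slot();
    }
};

// Returns true when the acquired slot went through the prompt-cache step.
bool went_through_cache(int id_slot) {
    mock_server srv;
    srv.slots.resize(2);
    srv.slots[1].id = 1;

    mock_slot * s = srv.acquire(id_slot);
    return s != nullptr && s->cache_restored;
}
```

Under these assumptions, went_through_cache(-1) is true while went_through_cache(0) and went_through_cache(1) are false, which is the asymmetry reported above.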

Proposed Fix

Factor prompt-cache preparation into a shared helper:

void update_slot_prompt_cache(
    server_slot * slot,
    const server_task & task,
    const char * label);

Introduce a common slot-selection path:

server_slot * get_slot_for_task(const server_task & task);

Behavior:

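  • task.id_slot == -1
    • delegate to the existing get_available_slot(task)
  • task.id_slot != -1 (explicit slot)
    • look up the slot with get_slot_by_id()
    • if the slot does not exist or is currently processing, return it unchanged and leave the prompt cache untouched
    • otherwise perform the prompt-cache update/restore
    • return the prepared slot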
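The behavior listed above can be sketched against simplified stand-in types. Everything prefixed with sketch_ is hypothetical, as is the cache_updates_for() helper; the cache step is reduced to a counter, and the automatic path simply picks the first idle slot rather than using the real selection logic.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified types for illustration only.
struct sketch_task {
    int id_slot = -1; // -1 means "let the server pick a slot"
};

struct sketch_slot {
    int  id            = 0;
    bool processing    = false;
    int  cache_updates = 0; // counts how often the cache step ran on this slot
};

struct sketch_server {
    std::vector<sketch_slot> slots;

    // Lookup only, no side effects.
    sketch_slot * get_slot_by_id(int id) {
        for (auto & s : slots) {
            if (s.id == id) {
                return &s;
            }
        }
        return nullptr;
    }

    // Stand-in for prompt_save() / prompt_load() / prompt_cache->update().
    void update_slot_prompt_cache(sketch_slot * slot) {
        if (slot) {
            slot->cache_updates++;
        }
    }

    // Stand-in for the automatic path: first idle slot, then the cache step.
    sketch_slot * get_available_slot() {
        for (auto & s : slots) {
            if (!s.processing) {
                update_slot_prompt_cache(&s);
                return &s;
            }
        }
        return nullptr;
    }

    // Common entry point: both paths end up running the cache step on an idle slot.
    sketch_slot * get_slot_for_task(const sketch_task & task) {
        if (task.id_slot == -1) {
            return get_available_slot();
        }
        sketch_slot * slot = get_slot_by_id(task.id_slot);
        if (!slot || slot->processing) {
            return slot; // unknown or busy slot: returned as-is, cache untouched
        }
        update_slot_prompt_cache(slot);
        return slot;
    }
};

// Builds a two-slot server (slot 1 busy) and reports how many cache updates
// the chosen slot received, or -1 if no slot was returned.
int cache_updates_for(int id_slot) {
    sketch_server srv;
    srv.slots.resize(2);
    srv.slots[1].id         = 1;
    srv.slots[1].processing = true;

    sketch_task task;
    task.id_slot = id_slot;

    sketch_slot * slot = srv.get_slot_for_task(task);
    return slot ? slot->cache_updates : -1;
}
```

In this model an idle slot receives exactly one cache update whether it was requested explicitly or chosen automatically, a busy slot receives none, and an unknown id yields no slot at all.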

Then replace:

server_slot * slot =
    id_slot != -1
        ? get_slot_by_id(id_slot)
        : get_available_slot(task);

with:

server_slot * slot = get_slot_for_task(task);

Why Not Put This in get_slot_by_id()?

get_slot_by_id() is a generic lookup routine used by multiple code paths, including slot-management operations.

Adding prompt-cache side effects to a simple lookup function would make its behavior non-obvious and could introduce unexpected work in unrelated callers.

Keeping cache preparation in a task-oriented wrapper preserves separation of concerns:

  • get_slot_by_id() → lookup only
  • get_available_slot() → scheduling
  • get_slot_for_task() → prepare slot for completion execution

Expected Result

Explicit-slot requests receive the same prompt-cache restore behavior as automatically scheduled requests.

This restores symmetry between the two execution paths and avoids unnecessary fallback into checkpoint rescue or full prompt replay when a matching cached prompt already exists.

First Bad Commit

No response

Relevant log output

Logs
# Automatic slot path: prompt-cache restore runs
2026-06-18 01:18:35.896 I slot get_availabl: id  1 | task -1 | selected slot by LRU
2026-06-18 01:18:35.896 I srv  get_availabl: updating prompt cache
2026-06-18 01:18:35.896 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
2026-06-18 01:18:35.896 I srv          load:  - found better prompt with f_keep = 0.996, sim = 0.998
2026-06-18 01:18:35.945 I srv        update:  - cache state: 1 prompts, 5249.568 MiB
2026-06-18 01:18:35.946 I slot launch_slot_: id  1 | task 1956 | processing task

# Explicit slot path: request launches on slot 0 without the get_availabl / prompt-cache restore path
2026-06-18 01:45:57.098 I Server POST /api/v1/chat/completions - Streaming
2026-06-18 01:45:57.551 I srv  params_from_: Chat format: peg-native
2026-06-18 01:45:57.554 I slot launch_slot_: id  0 | task 1994 | processing task
2026-06-18 01:45:57.554 I slot process_sing: id  1 | task -1 | saving idle slot to prompt cache
2026-06-18 01:45:57.555 W srv   prompt_save:  - saving prompt with length 21750, total state size = 341.697 MiB
2026-06-18 01:45:57.658 I srv        update:  - cache state: 2 prompts, 34298.837 MiB
2026-06-18 01:45:57.658 I srv        update:    - prompt 0x55c1c07bf2a0: 175715 tokens, checkpoints: 113, 33851.889 MiB
2026-06-18 01:45:57.658 I srv        update:    - prompt 0x55c1b6fce470: 21750 tokens, checkpoints: 1, 446.948 MiB
2026-06-18 01:45:57.658 I slot prompt_clear: id  1 | task -1 | clearing prompt with 21750 tokens

# Consequence: active slot has pos_next = 0, so checkpoints are invalidated instead of rehydrated
2026-06-18 01:45:57.662 W slot update_slots: id  0 | task 1994 | erased invalidated context checkpoint (pos_min = 9284, pos_max = 9284, n_tokens = 9285, n_swa = 0, pos_next = 0, size = 81.125 MiB)
2026-06-18 01:45:57.665 W slot update_slots: id  0 | task 1994 | erased invalidated context checkpoint (pos_min = 22914, pos_max = 22914, n_tokens = 22915, n_swa = 0, pos_next = 0, size = 108.006 MiB)
2026-06-18 01:45:57.667 W slot update_slots: id  0 | task 1994 | erased invalidated context checkpoint (pos_min = 35290, pos_max = 35290, n_tokens = 35291, n_swa = 0, pos_next = 0, size = 132.414 MiB)
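
To check whether a given server log shows the same pattern, one rough approach is to look for an "updating prompt cache" line between the previous launch and the launch of the task in question. The helper below is a hypothetical diagnostic, not part of llama.cpp; it relies only on the "launch_slot_", "| task N | processing task" and "updating prompt cache" substrings visible in the logs above, and it assumes launches are not interleaved with cache updates that belong to other tasks.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns true if an "updating prompt cache" line appears between the previous
// "launch_slot_" line (or the start of the log) and the launch line for task_id.
// Returns false if no such line appears or if the task's launch line is missing.
bool launch_preceded_by_cache_update(const std::vector<std::string> & lines, int task_id) {
    const std::string needle = "| task " + std::to_string(task_id) + " | processing task";

    bool seen_update = false;
    for (const auto & line : lines) {
        const bool is_launch = line.find("launch_slot_") != std::string::npos;

        if (is_launch && line.find(needle) != std::string::npos) {
            return seen_update;
        }
        if (is_launch) {
            seen_update = false; // a different task launched; start a new window
        } else if (line.find("updating prompt cache") != std::string::npos) {
            seen_update = true;
        }
    }
    return false;
}
```

Fed the two excerpts above, this would report a cache update for task 1956 and none for task 1994. The match is by substring, so it would also pick up the "updating prompt cache for requested slot" message that the proposed fix emits.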

Fix

--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -1386,6 +1386,26 @@
         return nullptr;
     }
 
+    void update_slot_prompt_cache(server_slot * slot, const server_task & task, const char * label) {
+        if (!slot || !prompt_cache || task.type != SERVER_TASK_TYPE_COMPLETION) {
+            return;
+        }
+
+        SRV_INF("updating prompt cache%s\n", label);
+
+        const int64_t t_start = ggml_time_us();
+
+        slot->prompt_save(*prompt_cache);
+
+        if (!slot->prompt_load(*prompt_cache, task.tokens)) {
+            slot->prompt_clear(false);
+        }
+
+        prompt_cache->update();
+
+        SRV_INF("prompt cache update%s took %.2f ms\n", label, (ggml_time_us() - t_start) / 1000.0);
+    }
+
     server_slot * get_available_slot(const server_task & task) {
         server_slot * ret = nullptr;
 
@@ -1463,23 +1483,27 @@
             update_cache = update_cache && task.type == SERVER_TASK_TYPE_COMPLETION;
 
             if (update_cache) {
-                SRV_INF("%s", "updating prompt cache\n");
-
-                const int64_t t_start = ggml_time_us();
+                update_slot_prompt_cache(ret, task, "");
+            }
+        }
 
-                ret->prompt_save(*prompt_cache);
+        return ret;
+    }
 
-                if (!ret->prompt_load(*prompt_cache, task.tokens)) {
-                    ret->prompt_clear(false);
-                }
+    server_slot * get_slot_for_task(const server_task & task) {
+        if (task.id_slot == -1) {
+            return get_available_slot(task);
+        }
 
-                prompt_cache->update();
+        server_slot * slot = get_slot_by_id(task.id_slot);
 
-                SRV_INF("prompt cache update took %.2f ms\n", (ggml_time_us() - t_start) / 1000.0);
-            }
+        if (!slot || slot->is_processing()) {
+            return slot;
         }
 
-        return ret;
+        update_slot_prompt_cache(slot, task, " for requested slot");
+
+        return slot;
     }
 
     // return true if at least one slot has been cleared
@@ -2179,7 +2203,6 @@
-                    const int id_slot = task.id_slot;
                     const int id_task = task.id;
 
-                    server_slot * slot = id_slot != -1 ? get_slot_by_id(id_slot) : get_available_slot(task);
+                    server_slot * slot = get_slot_for_task(task);
 
                     //
                     // slot scheduling logic
