[kvoffload] feat: make LMCache connecter work by AlpinDale · Pull Request #1589 · dphnAI/aphrodite-engine

AlpinDale · 2025-11-04T10:04:01Z

Enables LMCache offloader backend, and makes the CLI work properly.

Metrics

Model: Llama 3.1 8B
GPU: RTX 6000 ADA (48GB)
Average input length: 12566.1
P50 input length: 10047
Min Prompt Length: 8556
Max Prompt Length: 29651

Launch the benchmark:

python benchmarks/benchmark_prefix_caching.py \
  --model NousResearch/Meta-Llama-3.1-8B-Instruct \
  --dataset-path ~/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json \
  --enable-prefix-caching \
  --num-prompts 100 \
  --repeat-count 5 \
  --input-length-range 8192:32768

Download the dataset from here

Add these to enable KV cache offloading:

--kv-offloading-size: allocate CPU memory for KV cache offloading, takes a number in GBs (100 in this case)
--kv-offloading-backend: the offloader to use, supports native and lmcache (required pip install -r requirements/kv_connectors.txt)
--disable-hybrid-kv-cache-manager: not supported by the offloader connector

Backend	Prefill Throughput (tok/s)	Cost Time (s)	Speedup
No KV Offloading	17041.03	115.05	1.0x
Native Offloader	31960.46	63.58	1.81x
LMCache Offloader	34955.26	58.03	1.98x

Signed-off-by: AlpinDale <alpindale@gmail.com>

gemini-code-assist

Code Review

This pull request enables the LMCache offloader backend and makes the CLI work properly. The changes involve vendoring several components from lmcache and integrating them, as well as updating configuration handling for KV cache offloading. My review identified several critical and high-severity issues, including potential AttributeError and ZeroDivisionError exceptions, incorrect package exports, and improper resource management in background threads. Addressing these issues is crucial for the stability and correctness of this new feature.

gemini-code-assist · 2025-11-04T10:09:45Z

+            for s in self.push_sockets:
+                s.close(linger=0)  # type: ignore[arg-type]


The LMCacheAsyncLookupServer class does not have an attribute push_sockets. It has self.push_socket. This will raise an AttributeError during shutdown, preventing a clean exit. You should close self.push_socket directly without iterating.

Suggested change

for s in self.push_sockets:

s.close(linger=0) # type: ignore[arg-type]

self.push_socket.close(linger=0) # type: ignore[arg-type]

gemini-code-assist · 2025-11-04T10:09:45Z

+            kv_bytes_per_offloaded_block = self.aphrodite_config.cache_config.kv_bytes_per_block * (
+                self.offloaded_block_size // self.gpu_block_size
+            )
+            num_blocks = self.swap_space_bytes // kv_bytes_per_offloaded_block


The calculation of num_blocks can lead to a ZeroDivisionError if kv_bytes_per_offloaded_block is zero. This can happen if aphrodite_config.cache_config.kv_bytes_per_block is zero (e.g., for attention-free models). You should add a check to prevent division by zero.

This logic is also duplicated in get_handlers (lines 73-76) and should be fixed there as well. Consider refactoring this calculation into a helper method or a property to avoid code duplication.

kv_bytes_per_offloaded_block = self.aphrodite_config.cache_config.kv_bytes_per_block * ( self.offloaded_block_size // self.gpu_block_size ) num_blocks = ( self.swap_space_bytes // kv_bytes_per_offloaded_block if kv_bytes_per_offloaded_block > 0 else 0 )

gemini-code-assist · 2025-11-04T10:09:45Z

+            self._stats_monitor.update_interval_vllm_hit_tokens(
                request.load_spec.aphrodite_cached_tokens
            )


The method name update_interval_vllm_hit_tokens seems to be a leftover from vllm, which could be confusing in the aphrodite codebase. While this is inherited from the lmcache dependency, consider aliasing or wrapping this to use a more consistent naming convention (e.g., update_interval_aphrodite_hit_tokens) if possible, to improve code clarity and maintainability.

gemini-code-assist · 2025-11-04T10:09:45Z

+    def close(self):
+        self.socket.close(linger=0)
+        # TODO: close the thread!


The close method for LMCacheLookupServer does not properly shut down the background thread, as noted by the TODO. While the thread is a daemon, a graceful shutdown is better to ensure resources are released correctly and prevent potential issues. You should set self.running = False and join the thread.

Suggested change

def close(self):

self.socket.close(linger=0)

# TODO: close the thread!

def close(self):

self.running = False

self.socket.close(linger=0)

try:

self.thread.join(timeout=1.0)

except Exception as e:

logger.warning(f"Failed to join thread during close: {e}")

gemini-code-assist · 2025-11-04T10:09:45Z

+# SPDX-License-Identifier: Apache-2.0
+from ..lookup_client.abstract_client import LookupClientInterface
+from ..lookup_client.factory import LookupClientFactory
+from ..lookup_client.lmcache_lookup_client import (LMCacheLookupClient,
+                                                   LMCacheLookupServer)
+from ..lookup_client.mooncake_lookup_client import MooncakeLookupClient
+
+__all__ = [
+    "LookupClientInterface",
+    "LookupClientFactory",
+    "MooncakeLookupClient",
+    "LMCacheLookupClient",
+    "LMCacheLookupServer",
+]


This __init__.py file for the offload_server package appears to be incorrectly exporting symbols from the lookup_client package. This is likely a copy-paste error and should be corrected to export symbols relevant to the offload server, such as OffloadServerInterface and ZMQOffloadServer.

# SPDX-License-Identifier: Apache-2.0 from .abstract_server import OffloadServerInterface from .zmq_server import ZMQOffloadServer __all__ = [ "OffloadServerInterface", "ZMQOffloadServer", ]

gemini-code-assist · 2025-11-04T10:09:46Z

+    def close(self) -> None:
+        self.socket.close(linger=0)
+        self.running = False
+        self.thread.join()


The order of operations in close() could be improved for a more graceful shutdown. By setting self.running = False before closing the socket, you ensure that if the thread is between recv calls, it will exit the loop cleanly. Closing the socket first relies on an exception being raised in the blocked recv call to terminate the thread, which is less graceful.

Suggested change

def close(self) -> None:

self.socket.close(linger=0)

self.running = False

self.thread.join()

def close(self) -> None:

self.running = False

self.socket.close(linger=0)

self.thread.join()

gemini-code-assist · 2025-11-04T10:09:46Z

+    if isinstance(rpc_port, str):
+        rpc_port = rpc_port + str(tp_rank)
+    else:
+        rpc_port += tp_rank


The function get_zmq_rpc_path_lmcache has a type hint for rpc_port as int, but the implementation also handles str type, leading to inconsistent behavior. If rpc_port is a string (e.g., "100") and tp_rank is 1, the result is "1001". If rpc_port is an integer (100), the result is 101. This can cause subtle bugs. The handling should be consistent by converting rpc_port to an integer.

if isinstance(rpc_port, str): rpc_port = int(rpc_port) rpc_port += tp_rank

[kvoffload] feat: make LMCache connecter work

2ccf543

Signed-off-by: AlpinDale <alpindale@gmail.com>

gemini-code-assist Bot reviewed Nov 4, 2025

View reviewed changes

AlpinDale merged commit 52d12ec into main Nov 4, 2025
1 check passed

AlpinDale deleted the lmcache-offloader branch November 4, 2025 10:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[kvoffload] feat: make LMCache connecter work#1589

[kvoffload] feat: make LMCache connecter work#1589
AlpinDale merged 1 commit into
mainfrom
lmcache-offloader

AlpinDale commented Nov 4, 2025 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Nov 4, 2025

Uh oh!

gemini-code-assist Bot Nov 4, 2025

Uh oh!

gemini-code-assist Bot Nov 4, 2025

Uh oh!

gemini-code-assist Bot Nov 4, 2025

Uh oh!

gemini-code-assist Bot Nov 4, 2025

Uh oh!

gemini-code-assist Bot Nov 4, 2025

Uh oh!

gemini-code-assist Bot Nov 4, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		for s in self.push_sockets:
		s.close(linger=0) # type: ignore[arg-type]

Uh oh!

Uh oh!

Conversation

AlpinDale commented Nov 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Metrics

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

AlpinDale commented Nov 4, 2025 •

edited

Loading