Skip to content

[kvoffload] feat: make LMCache connecter work#1589

Merged
AlpinDale merged 1 commit into
mainfrom
lmcache-offloader
Nov 4, 2025
Merged

[kvoffload] feat: make LMCache connecter work#1589
AlpinDale merged 1 commit into
mainfrom
lmcache-offloader

Conversation

@AlpinDale
Copy link
Copy Markdown
Collaborator

@AlpinDale AlpinDale commented Nov 4, 2025

Enables LMCache offloader backend, and makes the CLI work properly.

Metrics

Model: Llama 3.1 8B
GPU: RTX 6000 ADA (48GB)
Average input length: 12566.1
P50 input length: 10047
Min Prompt Length: 8556
Max Prompt Length: 29651

Launch the benchmark:

python benchmarks/benchmark_prefix_caching.py \
  --model NousResearch/Meta-Llama-3.1-8B-Instruct \
  --dataset-path ~/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json \
  --enable-prefix-caching \
  --num-prompts 100 \
  --repeat-count 5 \
  --input-length-range 8192:32768

Download the dataset from here

Add these to enable KV cache offloading:

  • --kv-offloading-size: allocate CPU memory for KV cache offloading, takes a number in GBs (100 in this case)
  • --kv-offloading-backend: the offloader to use, supports native and lmcache (required pip install -r requirements/kv_connectors.txt)
  • --disable-hybrid-kv-cache-manager: not supported by the offloader connector
Backend Prefill Throughput (tok/s) Cost Time (s) Speedup
No KV Offloading 17041.03 115.05 1.0x
Native Offloader 31960.46 63.58 1.81x
LMCache Offloader 34955.26 58.03 1.98x

Signed-off-by: AlpinDale <alpindale@gmail.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enables the LMCache offloader backend and makes the CLI work properly. The changes involve vendoring several components from lmcache and integrating them, as well as updating configuration handling for KV cache offloading. My review identified several critical and high-severity issues, including potential AttributeError and ZeroDivisionError exceptions, incorrect package exports, and improper resource management in background threads. Addressing these issues is crucial for the stability and correctness of this new feature.

Comment on lines +323 to +324
for s in self.push_sockets:
s.close(linger=0) # type: ignore[arg-type]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The LMCacheAsyncLookupServer class does not have an attribute push_sockets. It has self.push_socket. This will raise an AttributeError during shutdown, preventing a clean exit. You should close self.push_socket directly without iterating.

Suggested change
for s in self.push_sockets:
s.close(linger=0) # type: ignore[arg-type]
self.push_socket.close(linger=0) # type: ignore[arg-type]

Comment on lines +51 to +54
kv_bytes_per_offloaded_block = self.aphrodite_config.cache_config.kv_bytes_per_block * (
self.offloaded_block_size // self.gpu_block_size
)
num_blocks = self.swap_space_bytes // kv_bytes_per_offloaded_block
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The calculation of num_blocks can lead to a ZeroDivisionError if kv_bytes_per_offloaded_block is zero. This can happen if aphrodite_config.cache_config.kv_bytes_per_block is zero (e.g., for attention-free models). You should add a check to prevent division by zero.

This logic is also duplicated in get_handlers (lines 73-76) and should be fixed there as well. Consider refactoring this calculation into a helper method or a property to avoid code duplication.

            kv_bytes_per_offloaded_block = self.aphrodite_config.cache_config.kv_bytes_per_block * (
                self.offloaded_block_size // self.gpu_block_size
            )
            num_blocks = (
                self.swap_space_bytes // kv_bytes_per_offloaded_block
                if kv_bytes_per_offloaded_block > 0 else 0
            )

Comment on lines +822 to 824
self._stats_monitor.update_interval_vllm_hit_tokens(
request.load_spec.aphrodite_cached_tokens
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The method name update_interval_vllm_hit_tokens seems to be a leftover from vllm, which could be confusing in the aphrodite codebase. While this is inherited from the lmcache dependency, consider aliasing or wrapping this to use a more consistent naming convention (e.g., update_interval_aphrodite_hit_tokens) if possible, to improve code clarity and maintainability.

Comment on lines +247 to +249
def close(self):
self.socket.close(linger=0)
# TODO: close the thread!
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The close method for LMCacheLookupServer does not properly shut down the background thread, as noted by the TODO. While the thread is a daemon, a graceful shutdown is better to ensure resources are released correctly and prevent potential issues. You should set self.running = False and join the thread.

Suggested change
def close(self):
self.socket.close(linger=0)
# TODO: close the thread!
def close(self):
self.running = False
self.socket.close(linger=0)
try:
self.thread.join(timeout=1.0)
except Exception as e:
logger.warning(f"Failed to join thread during close: {e}")

Comment on lines +1 to +14
# SPDX-License-Identifier: Apache-2.0
from ..lookup_client.abstract_client import LookupClientInterface
from ..lookup_client.factory import LookupClientFactory
from ..lookup_client.lmcache_lookup_client import (LMCacheLookupClient,
LMCacheLookupServer)
from ..lookup_client.mooncake_lookup_client import MooncakeLookupClient

__all__ = [
"LookupClientInterface",
"LookupClientFactory",
"MooncakeLookupClient",
"LMCacheLookupClient",
"LMCacheLookupServer",
]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This __init__.py file for the offload_server package appears to be incorrectly exporting symbols from the lookup_client package. This is likely a copy-paste error and should be corrected to export symbols relevant to the offload server, such as OffloadServerInterface and ZMQOffloadServer.

# SPDX-License-Identifier: Apache-2.0
from .abstract_server import OffloadServerInterface
from .zmq_server import ZMQOffloadServer

__all__ = [
    "OffloadServerInterface",
    "ZMQOffloadServer",
]

Comment on lines +70 to +73
def close(self) -> None:
self.socket.close(linger=0)
self.running = False
self.thread.join()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The order of operations in close() could be improved for a more graceful shutdown. By setting self.running = False before closing the socket, you ensure that if the thread is between recv calls, it will exit the loop cleanly. Closing the socket first relies on an exception being raised in the blocked recv call to terminate the thread, which is less graceful.

Suggested change
def close(self) -> None:
self.socket.close(linger=0)
self.running = False
self.thread.join()
def close(self) -> None:
self.running = False
self.socket.close(linger=0)
self.thread.join()

Comment on lines +102 to +105
if isinstance(rpc_port, str):
rpc_port = rpc_port + str(tp_rank)
else:
rpc_port += tp_rank
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The function get_zmq_rpc_path_lmcache has a type hint for rpc_port as int, but the implementation also handles str type, leading to inconsistent behavior. If rpc_port is a string (e.g., "100") and tp_rank is 1, the result is "1001". If rpc_port is an integer (100), the result is 101. This can cause subtle bugs. The handling should be consistent by converting rpc_port to an integer.

    if isinstance(rpc_port, str):
        rpc_port = int(rpc_port)
    rpc_port += tp_rank

@AlpinDale AlpinDale merged commit 52d12ec into main Nov 4, 2025
1 check passed
@AlpinDale AlpinDale deleted the lmcache-offloader branch November 4, 2025 10:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant