[kvoffload] feat: make LMCache connecter work#1589
Conversation
Signed-off-by: AlpinDale <alpindale@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request enables the LMCache offloader backend and makes the CLI work properly. The changes involve vendoring several components from lmcache and integrating them, as well as updating configuration handling for KV cache offloading. My review identified several critical and high-severity issues, including potential AttributeError and ZeroDivisionError exceptions, incorrect package exports, and improper resource management in background threads. Addressing these issues is crucial for the stability and correctness of this new feature.
| for s in self.push_sockets: | ||
| s.close(linger=0) # type: ignore[arg-type] |
There was a problem hiding this comment.
The LMCacheAsyncLookupServer class does not have an attribute push_sockets. It has self.push_socket. This will raise an AttributeError during shutdown, preventing a clean exit. You should close self.push_socket directly without iterating.
| for s in self.push_sockets: | |
| s.close(linger=0) # type: ignore[arg-type] | |
| self.push_socket.close(linger=0) # type: ignore[arg-type] | |
| kv_bytes_per_offloaded_block = self.aphrodite_config.cache_config.kv_bytes_per_block * ( | ||
| self.offloaded_block_size // self.gpu_block_size | ||
| ) | ||
| num_blocks = self.swap_space_bytes // kv_bytes_per_offloaded_block |
There was a problem hiding this comment.
The calculation of num_blocks can lead to a ZeroDivisionError if kv_bytes_per_offloaded_block is zero. This can happen if aphrodite_config.cache_config.kv_bytes_per_block is zero (e.g., for attention-free models). You should add a check to prevent division by zero.
This logic is also duplicated in get_handlers (lines 73-76) and should be fixed there as well. Consider refactoring this calculation into a helper method or a property to avoid code duplication.
kv_bytes_per_offloaded_block = self.aphrodite_config.cache_config.kv_bytes_per_block * (
self.offloaded_block_size // self.gpu_block_size
)
num_blocks = (
self.swap_space_bytes // kv_bytes_per_offloaded_block
if kv_bytes_per_offloaded_block > 0 else 0
)| self._stats_monitor.update_interval_vllm_hit_tokens( | ||
| request.load_spec.aphrodite_cached_tokens | ||
| ) |
There was a problem hiding this comment.
The method name update_interval_vllm_hit_tokens seems to be a leftover from vllm, which could be confusing in the aphrodite codebase. While this is inherited from the lmcache dependency, consider aliasing or wrapping this to use a more consistent naming convention (e.g., update_interval_aphrodite_hit_tokens) if possible, to improve code clarity and maintainability.
| def close(self): | ||
| self.socket.close(linger=0) | ||
| # TODO: close the thread! |
There was a problem hiding this comment.
The close method for LMCacheLookupServer does not properly shut down the background thread, as noted by the TODO. While the thread is a daemon, a graceful shutdown is better to ensure resources are released correctly and prevent potential issues. You should set self.running = False and join the thread.
| def close(self): | |
| self.socket.close(linger=0) | |
| # TODO: close the thread! | |
| def close(self): | |
| self.running = False | |
| self.socket.close(linger=0) | |
| try: | |
| self.thread.join(timeout=1.0) | |
| except Exception as e: | |
| logger.warning(f"Failed to join thread during close: {e}") |
| # SPDX-License-Identifier: Apache-2.0 | ||
| from ..lookup_client.abstract_client import LookupClientInterface | ||
| from ..lookup_client.factory import LookupClientFactory | ||
| from ..lookup_client.lmcache_lookup_client import (LMCacheLookupClient, | ||
| LMCacheLookupServer) | ||
| from ..lookup_client.mooncake_lookup_client import MooncakeLookupClient | ||
|
|
||
| __all__ = [ | ||
| "LookupClientInterface", | ||
| "LookupClientFactory", | ||
| "MooncakeLookupClient", | ||
| "LMCacheLookupClient", | ||
| "LMCacheLookupServer", | ||
| ] |
There was a problem hiding this comment.
This __init__.py file for the offload_server package appears to be incorrectly exporting symbols from the lookup_client package. This is likely a copy-paste error and should be corrected to export symbols relevant to the offload server, such as OffloadServerInterface and ZMQOffloadServer.
# SPDX-License-Identifier: Apache-2.0
from .abstract_server import OffloadServerInterface
from .zmq_server import ZMQOffloadServer
__all__ = [
"OffloadServerInterface",
"ZMQOffloadServer",
]| def close(self) -> None: | ||
| self.socket.close(linger=0) | ||
| self.running = False | ||
| self.thread.join() |
There was a problem hiding this comment.
The order of operations in close() could be improved for a more graceful shutdown. By setting self.running = False before closing the socket, you ensure that if the thread is between recv calls, it will exit the loop cleanly. Closing the socket first relies on an exception being raised in the blocked recv call to terminate the thread, which is less graceful.
| def close(self) -> None: | |
| self.socket.close(linger=0) | |
| self.running = False | |
| self.thread.join() | |
| def close(self) -> None: | |
| self.running = False | |
| self.socket.close(linger=0) | |
| self.thread.join() |
| if isinstance(rpc_port, str): | ||
| rpc_port = rpc_port + str(tp_rank) | ||
| else: | ||
| rpc_port += tp_rank |
There was a problem hiding this comment.
The function get_zmq_rpc_path_lmcache has a type hint for rpc_port as int, but the implementation also handles str type, leading to inconsistent behavior. If rpc_port is a string (e.g., "100") and tp_rank is 1, the result is "1001". If rpc_port is an integer (100), the result is 101. This can cause subtle bugs. The handling should be consistent by converting rpc_port to an integer.
if isinstance(rpc_port, str):
rpc_port = int(rpc_port)
rpc_port += tp_rank
Enables LMCache offloader backend, and makes the CLI work properly.
Metrics
Model: Llama 3.1 8B
GPU: RTX 6000 ADA (48GB)
Average input length: 12566.1
P50 input length: 10047
Min Prompt Length: 8556
Max Prompt Length: 29651
Launch the benchmark:
Download the dataset from here
Add these to enable KV cache offloading:
--kv-offloading-size: allocate CPU memory for KV cache offloading, takes a number in GBs (100in this case)--kv-offloading-backend: the offloader to use, supportsnativeandlmcache(requiredpip install -r requirements/kv_connectors.txt)--disable-hybrid-kv-cache-manager: not supported by the offloader connector