Commit 2cb480c

[Feat] Integrate Mooncake into UCM with UcmMooncakeStoreV1 (Python Version)
1 parent 3a11f16 commit 2cb480c

6 files changed

Lines changed: 1151 additions & 312 deletions

docs/source/user-guide/prefix-cache/index.md

Lines changed: 1 addition & 0 deletions
@@ -82,4 +82,5 @@ performance.
 pipeline_store
 nfs_store
 ds3fs_store
+mooncakestore
 :::
Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
# Mooncake Store

This document describes how to use `UcmMooncakeStoreV1` as the storage backend for UCM Prefix Cache in Ascend environments.

## Overview

`UcmMooncakeStoreV1` is a Mooncake-based storage backend that UCM provides for Prefix Cache scenarios. It targets Ascend platforms and plugs into the vLLM inference workflow through `UCMConnector`. It handles prefix cache lookup, loading, and dumping, so Prefix Cache is no longer limited to the local memory of a single process or a single instance.

By integrating Mooncake, UCM extends its original local caching capability with both DRAM pooling and remote storage support. Prefix Cache can therefore use a tiered cache hierarchy:

- Local DRAM on the serving node acts as the high-speed near-end cache.
- The DRAM pool provided by Mooncake serves as a shareable intermediate cache layer.
- Remote storage connected through UCM serves as a larger-capacity persistence layer.

This three-tier design balances capacity, shareability, and access cost, allowing Prefix Cache to be reused across a broader scope and improving cache-hit benefits in long-prefix scenarios.

This document focuses on the capability boundaries, configuration, and basic usage flow of `UcmMooncakeStoreV1` in vLLM.

## Features

The current `UcmMooncakeStoreV1` implementation supports:

- `lookup` / `lookup_on_prefix`: probing prefix hits by block hash
- `load_data`: loading KV blocks from Mooncake into model KV buffers
- `dump_data`: dumping KV blocks from model KV buffers into Mooncake
- `wait` / `check`: handling asynchronous task completion
- NPU buffer registration for RDMA transfer
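
To make "probing prefix hits by block hash" concrete, here is a minimal sketch of the idea using a plain Python set as a stand-in for the store. The function names `hit_mask` and `leading_hits` are hypothetical illustrations. The real `lookup` / `lookup_on_prefix` signatures and return types are not shown in this document and may differ.

```python
from typing import Sequence, Set


def hit_mask(block_hashes: Sequence[bytes], stored: Set[bytes]) -> list[bool]:
    """Per-block probe: True where the block hash is present in the store."""
    return [h in stored for h in block_hashes]


def leading_hits(block_hashes: Sequence[bytes], stored: Set[bytes]) -> int:
    """Count how many blocks at the start of the sequence are all present.

    Only an unbroken run of hits from block 0 is reusable as a prefix,
    so counting stops at the first miss.
    """
    count = 0
    for h in block_hashes:
        if h not in stored:
            break
        count += 1
    return count


# Hypothetical data: the store holds blocks a, b and d, but not c.
stored_blocks = {b"a", b"b", b"d"}
request_blocks = [b"a", b"b", b"c", b"d"]
mask = hit_mask(request_blocks, stored_blocks)
reusable = leading_hits(request_blocks, stored_blocks)
```

In this toy case block `d` is present but cannot be reused, because the miss on `c` ends the prefix.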

## Prerequisites

`UcmMooncakeStoreV1` is intended for Ascend-based deployments and requires:

- Linux
- Ascend/NPU runtime with `torch.npu` available
- vLLM + vLLM-Ascend + UCM integration environment
- Mooncake runtime environment

For deployment, the pre-built vLLM-Ascend Docker image is recommended. The `vllm-ascend` 0.17.0 image already includes the Mooncake runtime dependencies required by this guide.

If Mooncake needs to be installed manually, refer to the [official Ascend Store / KV Pool guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html) and follow its Mooncake installation instructions.

## Configuration for Prefix Caching

Edit or copy:

`unified-cache-management/examples/ucm_config_example.yaml`

### Minimal Configuration Example

```yaml
ucm_connectors:
  - ucm_connector_name: "UcmMooncakeStoreV1"
    ucm_connector_config:
      protocol: "ascend"
      local_hostname: "127.0.0.1"
      metadata_server: "P2PHANDSHAKE"
      master_server_address: "127.0.0.1:50088"
      device_name: ""
      global_segment_size: "5GB"
      local_buffer_size: "5GB"
      executor_workers: 4
```

### Required Parameters

- `ucm_connector_name`
  - Must be set to `UcmMooncakeStoreV1`.
- `protocol`
  - Must be set to `ascend`.
- `metadata_server`
  - Specifies the Mooncake metadata discovery mode or endpoint. In the common Ascend deployment path, use `P2PHANDSHAKE`.
- `master_server_address`
  - Specifies the address of the Mooncake master service, for example `127.0.0.1:50088`.

### Common Optional Parameters

- `local_hostname` (default: `127.0.0.1`)
  - Local host address passed into Mooncake setup.
- `device_name` (default: empty)
  - Optional device identifier passed to Mooncake.
- `global_segment_size` (default: `5GB`)
  - Size of the global Mooncake segment, that is, the registered memory size per card.
- `local_buffer_size` (default: `5GB`)
  - Size of the local buffer used by the connector.
- `executor_workers` (default: `4`)
  - Number of worker threads used for asynchronous load and dump execution.
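
The required and optional parameters above can be checked before launch with a small pre-flight script. The sketch below is only an illustration: `resolve_config`, `REQUIRED_KEYS`, and `DEFAULTS` are hypothetical names, not part of UCM. It takes one already-parsed entry of `ucm_connectors` as a dict and assumes the defaults listed in this guide.

```python
# Hypothetical pre-flight check; the names below are not part of UCM.
REQUIRED_KEYS = ("protocol", "metadata_server", "master_server_address")

# Defaults as documented in "Common Optional Parameters".
DEFAULTS = {
    "local_hostname": "127.0.0.1",
    "device_name": "",
    "global_segment_size": "5GB",
    "local_buffer_size": "5GB",
    "executor_workers": 4,
}


def resolve_config(entry: dict) -> dict:
    """Validate one `ucm_connectors` entry and fill in documented defaults."""
    if entry.get("ucm_connector_name") != "UcmMooncakeStoreV1":
        raise ValueError("ucm_connector_name must be 'UcmMooncakeStoreV1'")
    user = entry.get("ucm_connector_config") or {}
    missing = [key for key in REQUIRED_KEYS if key not in user]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    if user["protocol"] != "ascend":
        raise ValueError("protocol must be 'ascend'")
    # User-supplied values override the documented defaults.
    return {**DEFAULTS, **user}


minimal_entry = {
    "ucm_connector_name": "UcmMooncakeStoreV1",
    "ucm_connector_config": {
        "protocol": "ascend",
        "metadata_server": "P2PHANDSHAKE",
        "master_server_address": "127.0.0.1:50088",
    },
}
resolved = resolve_config(minimal_entry)
```

Running a check like this before `vllm serve` surfaces a missing `master_server_address` as a clear error instead of a connection failure at startup.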

## Run Mooncake Master

Before launching vLLM, start the Mooncake master service:

```bash
mooncake_master \
  --port 50088 \
  --eviction_high_watermark_ratio 0.9 \
  --eviction_ratio 0.1 \
  --default_kv_lease_ttl 11000
```

Parameter description:

- `eviction_high_watermark_ratio`
  - Controls the watermark at which eviction is triggered.
- `eviction_ratio`
  - Controls the fraction of objects to evict once eviction starts.
- `default_kv_lease_ttl`
  - Controls the default KV lease TTL. It should be configured larger than both `ASCEND_CONNECT_TIMEOUT` and `ASCEND_TRANSFER_TIMEOUT`.
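
The constraint on `default_kv_lease_ttl` is easy to verify in a launch script. The following is a minimal sketch with a hypothetical helper, `lease_ttl_is_safe`, which assumes that the TTL and both timeout variables are expressed in the same unit, as the example values in this guide (11000 versus 10000) suggest.

```python
import os
from typing import Mapping, Optional

TIMEOUT_VARS = ("ASCEND_CONNECT_TIMEOUT", "ASCEND_TRANSFER_TIMEOUT")


def lease_ttl_is_safe(
    default_kv_lease_ttl: int, env: Optional[Mapping[str, str]] = None
) -> bool:
    """Return True if the lease TTL is strictly larger than every timeout that is set.

    Assumption: the TTL and the timeout variables share one unit.
    Variables that are not set are ignored.
    """
    env = os.environ if env is None else env
    timeouts = [int(env[name]) for name in TIMEOUT_VARS if name in env]
    return all(default_kv_lease_ttl > timeout for timeout in timeouts)


# Values taken from the commands in this guide.
guide_env = {"ASCEND_CONNECT_TIMEOUT": "10000", "ASCEND_TRANSFER_TIMEOUT": "10000"}
ok = lease_ttl_is_safe(11000, guide_env)
too_small = lease_ttl_is_safe(10000, guide_env)
```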

## Launching Inference

Use `vllm serve` with `UCMConnector`, and pass the Mooncake-backed UCM configuration file through `UCM_CONFIG_FILE`.

### Recommended Launch Command

```bash
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export PYTHONHASHSEED=0
export HCCL_INTRA_ROCE_ENABLE=1
export HCCL_RDMA_TIMEOUT=17
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000

vllm serve <your-model> \
  --host 0.0.0.0 \
  --port 8100 \
  --trust-remote-code \
  --enforce-eager \
  --no-enable-prefix-caching \
  --tensor-parallel-size 1 \
  --data-parallel-size 1 \
  --max-model-len 32768 \
  --block-size 128 \
  --max-num-batched-tokens 16384 \
  --kv-transfer-config \
  '{
    "kv_connector": "UCMConnector",
    "kv_role": "kv_both",
    "kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
    "kv_connector_extra_config": {
      "UCM_CONFIG_FILE": "/path/to/unified-cache-management/examples/ucm_config_example.yaml"
    }
  }'
```
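
Hand-written JSON inside a shell command is easy to break with a stray quote or comma. As an optional convenience, the `--kv-transfer-config` value can be generated instead. The helper name `build_kv_transfer_config` below is hypothetical. The keys and values are copied from the launch command in this guide.

```python
import json
import shlex


def build_kv_transfer_config(ucm_config_file: str) -> str:
    """Serialize the connector settings from the launch command as compact JSON."""
    config = {
        "kv_connector": "UCMConnector",
        "kv_role": "kv_both",
        "kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
        "kv_connector_extra_config": {"UCM_CONFIG_FILE": ucm_config_file},
    }
    return json.dumps(config)


kv_arg = build_kv_transfer_config(
    "/path/to/unified-cache-management/examples/ucm_config_example.yaml"
)
# shlex.quote wraps the JSON so that it survives shell word splitting.
print("--kv-transfer-config", shlex.quote(kv_arg))
```

The printed fragment can be pasted into the `vllm serve` command in place of the inline JSON.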

ucm/integration/vllm/ucm_connector.py

Lines changed: 28 additions & 0 deletions
@@ -335,6 +335,12 @@ def _create_store(
         config["shard_size"] = kv_cache_layout.shard_size * self.blocks_per_chunk
         config["block_size"] = kv_cache_layout.block_size * self.blocks_per_chunk
         config["local_rank_size"] = self.tp_size if self.is_mla else 1
+        register_buffer_ptrs, register_buffer_sizes = (
+            self._build_register_buffer_regions()
+        )
+        if register_buffer_ptrs:
+            config["register_buffer_ptrs"] = register_buffer_ptrs
+            config["register_buffer_sizes"] = register_buffer_sizes
         if cpu_affinity_cores:
             config["cpu_affinity_cores"] = list(cpu_affinity_cores)
         else:
@@ -353,6 +359,28 @@ def _create_store(
         logger.info(f"create {name} with config: {config}")
         return UcmConnectorFactoryV1.create_connector(name, config, module_path)

+    def _build_register_buffer_regions(self) -> tuple[list[int], list[int]]:
+        ptrs: list[int] = []
+        sizes: list[int] = []
+        for kv_layer in self.kv_caches.values():
+            for tensor in self._iter_register_buffer_tensors(kv_layer):
+                ptrs.append(int(tensor.data_ptr()))
+                sizes.append(int(tensor.numel() * tensor.element_size()))
+        return ptrs, sizes
+
+    def _iter_register_buffer_tensors(
+        self, kv_layer: torch.Tensor | Tuple[torch.Tensor, ...]
+    ) -> list[torch.Tensor]:
+        if isinstance(kv_layer, torch.Tensor):
+            if kv_layer.dim() == 5:
+                return [kv_layer[0], kv_layer[1]]
+            if kv_layer.dim() == 3:
+                return [kv_layer]
+            raise ValueError(f"Unsupported kv cache tensor shape: {kv_layer.shape}")
+        if isinstance(kv_layer, tuple):
+            return list(kv_layer)
+        raise TypeError(f"Unsupported kv cache type: {type(kv_layer)}")
+
     def register_kv_caches(self, kv_caches: dict[str, torch.Tensor]):
         if has_ucm_sparse() and os.getenv("VLLM_HASH_ATTENTION") == "1":
             for layer_name, value in kv_caches.items():
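
To see what sizes `_build_register_buffer_regions` would report, the shape dispatch in `_iter_register_buffer_tensors` can be mirrored with plain integer arithmetic. The sketch below needs no `torch` or NPU. `region_sizes` is a hypothetical helper, and the example shapes and the 2-byte element size are assumptions chosen for illustration, not values from the commit.

```python
import math


def region_sizes(shape: tuple[int, ...], element_size: int) -> list[int]:
    """Byte size of each region registered for one KV-cache tensor of `shape`.

    Mirrors the tensor branch of the diff: a 5-D tensor contributes two
    regions (index 0 and index 1 along the first axis), a 3-D tensor one.
    """
    if len(shape) == 5:
        # kv_layer[0] and kv_layer[1] each drop the leading axis.
        per_region = math.prod(shape[1:]) * element_size
        return [per_region, per_region]
    if len(shape) == 3:
        return [math.prod(shape) * element_size]
    raise ValueError(f"Unsupported kv cache tensor shape: {shape}")


# Assumed example shapes with 2-byte elements (for example fp16).
split_layer = region_sizes((2, 1024, 128, 8, 128), element_size=2)
single_layer = region_sizes((1024, 128, 576), element_size=2)
```

Here each of the two regions of the 5-D example is 268435456 bytes (256 MiB), and the 3-D example is a single region of 150994944 bytes. Note that `numel() * element_size()` equals the span starting at `data_ptr()` only for contiguous tensors.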

ucm/store/factory_v1.py

Lines changed: 5 additions & 0 deletions
@@ -68,3 +68,8 @@ def create_connector(
 UcmConnectorFactoryV1.register_connector(
     "UcmPipelineStore", "ucm.store.pipeline.connector", "UcmPipelineStore"
 )
+UcmConnectorFactoryV1.register_connector(
+    "UcmMooncakeStoreV1",
+    "ucm.store.mooncakestore.mooncake_connector",
+    "UcmMooncakeStoreV1",
+)
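
The registration call maps a connector name to a module path and a class name. One common way to implement such a registry is to defer the import until the connector is first created, so that optional backends do not have to be importable at startup. The sketch below shows that pattern with a hypothetical `LazyRegistry` class. It is not the implementation of `UcmConnectorFactoryV1`, whose body is not shown in this diff.

```python
import importlib
from typing import Any, Callable, Dict


class LazyRegistry:
    """Hypothetical name-to-class registry that imports modules on first use."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], type]] = {}

    def register(self, name: str, module_path: str, class_name: str) -> None:
        if name in self._loaders:
            raise ValueError(f"'{name}' is already registered")

        def loader() -> type:
            # The import happens here, not at registration time.
            module = importlib.import_module(module_path)
            return getattr(module, class_name)

        self._loaders[name] = loader

    def create(self, name: str, *args: Any, **kwargs: Any) -> Any:
        if name not in self._loaders:
            raise KeyError(f"unknown connector: {name}")
        cls = self._loaders[name]()
        return cls(*args, **kwargs)


# Demonstration with a standard-library class standing in for a connector.
registry = LazyRegistry()
registry.register("OrderedDict", "collections", "OrderedDict")
instance = registry.create("OrderedDict", [("a", 1)])
```

With this pattern, a typo in the module path only surfaces when the connector is first created, which is worth keeping in mind when testing a newly registered backend.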
