|
| 1 | +# Mooncake Store |
| 2 | + |
| 3 | +This document describes how to use `UcmMooncakeStoreV1` as the storage backend for UCM Prefix Cache in Ascend environments. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +`UcmMooncakeStoreV1` is a Mooncake-based storage backend provided by UCM for Prefix Cache scenarios. It is designed for Ascend platforms and integrated into the vLLM inference workflow through `UCMConnector`. It is responsible for prefix cache lookup, loading, and dumping, so that Prefix Cache is no longer limited to local memory within a single process or a single instance. |
| 8 | + |
| 9 | +By integrating Mooncake, UCM extends its original local caching capability with both DRAM pooling and remote storage support. As a result, in Prefix Cache scenarios, a tiered cache hierarchy can be formed: |
| 10 | + |
| 11 | +- Local DRAM on the serving node acts as the high-speed near-end cache. |
| 12 | +- The DRAM pool provided by Mooncake serves as a shareable intermediate cache layer. |
| 13 | +- Remote storage connected through UCM serves as a larger-capacity persistence layer. |
| 14 | + |
| 15 | +This three-tier design provides a better balance among capacity, shareability, and access cost, allowing Prefix Cache to be reused across a broader scope and improving overall cache-hit benefits in long-prefix scenarios. |
| 16 | + |
| 17 | +This document focuses on the capability boundaries, configuration, and basic usage flow of `UcmMooncakeStoreV1` in vLLM. |
| 18 | + |
| 19 | +## Features |
| 20 | + |
| 21 | +The current `UcmMooncakeStoreV1` implementation supports: |
| 22 | + |
| 23 | +- `lookup` / `lookup_on_prefix`: probing prefix hits by block hash |
| 24 | +- `load_data`: loading KV blocks from Mooncake into model KV buffers |
| 25 | +- `dump_data`: dumping KV blocks from model KV buffers into Mooncake |
| 26 | +- `wait` / `check`: handling asynchronous task completion |
| 27 | +- NPU buffer registration for RDMA transfer |
| 28 | + |
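The idea behind a prefix lookup by block hash can be illustrated with a toy model. This is only a sketch of the general concept and not the `UcmMooncakeStoreV1` API: the function name `count_prefix_hits`, the string hashes, and the in-memory set are all hypothetical stand-ins for illustration.

```python
from typing import Iterable, Set


def count_prefix_hits(block_hashes: Iterable[str], stored: Set[str]) -> int:
    """Count how many leading blocks of a request are present in the store.

    Hypothetical helper: a prefix is only reusable up to the first miss,
    so counting stops at the first block hash that is not stored.
    """
    hits = 0
    for block_hash in block_hashes:
        if block_hash not in stored:
            break
        hits += 1
    return hits


# Illustrative data only: "h3" is stored but unreachable behind the miss on "h2".
stored_blocks = {"h0", "h1", "h3"}
print(count_prefix_hits(["h0", "h1", "h2", "h3"], stored_blocks))  # 2
```

In this model, only the first two blocks would be loaded from the store and the remaining blocks would be recomputed.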
| 29 | +## Prerequisites |
| 30 | + |
| 31 | +`UcmMooncakeStoreV1` is intended for Ascend-based deployments and requires: |
| 32 | + |
| 33 | +- Linux |
| 34 | +- Ascend/NPU runtime with `torch.npu` available |
| 35 | +- vLLM + vLLM-Ascend + UCM integration environment |
| 36 | +- Mooncake runtime environment |
| 37 | + |
| 38 | +For deployment, the pre-built vLLM-Ascend Docker image is recommended. The vllm-ascend 0.17.0 image already includes the Mooncake runtime dependencies required by this guide. |
| 39 | + |
| 40 | +If Mooncake needs to be installed manually, refer to the [official Ascend Store / KV Pool guide](https://docs.vllm.ai/projects/ascend/en/latest/user_guide/feature_guide/kv_pool.html) and follow its Mooncake installation instructions. |
| 41 | + |
| 42 | +## Configuration for Prefix Caching |
| 43 | + |
| 44 | +Edit or copy the example configuration file: |
| 45 | + |
| 46 | +`unified-cache-management/examples/ucm_config_example.yaml` |
| 47 | + |
| 48 | +### Minimal Configuration Example |
| 49 | + |
| 50 | +```yaml |
| 51 | +ucm_connectors: |
| 52 | +  - ucm_connector_name: "UcmMooncakeStoreV1" |
| 53 | +    ucm_connector_config: |
| 54 | +      protocol: "ascend" |
| 55 | +      local_hostname: "127.0.0.1" |
| 56 | +      metadata_server: "P2PHANDSHAKE" |
| 57 | +      master_server_address: "127.0.0.1:50088" |
| 58 | +      device_name: "" |
| 59 | +      global_segment_size: "5GB" |
| 60 | +      local_buffer_size: "5GB" |
| 61 | +      executor_workers: 4 |
| 62 | +``` |
| 63 | + |
| 64 | +### Required Parameters |
| 65 | + |
| 66 | +- `ucm_connector_name` |
| 67 | + - Must be set to `UcmMooncakeStoreV1`. |
| 68 | +- `protocol` |
| 69 | + - Must be set to `ascend`. |
| 70 | +- `metadata_server` |
| 71 | + - Specifies the Mooncake metadata discovery mode or endpoint. In the common Ascend deployment path, use `P2PHANDSHAKE`. |
| 72 | +- `master_server_address` |
| 73 | + - Specifies the address of the Mooncake master service, for example `127.0.0.1:50088`. |
| 74 | + |
| 75 | +### Common Optional Parameters |
| 76 | + |
| 77 | +- `local_hostname` (default: `127.0.0.1`) |
| 78 | + - Local host address passed into Mooncake setup. |
| 79 | +- `device_name` (default: empty) |
| 80 | + - Optional device identifier passed to Mooncake. |
| 81 | +- `global_segment_size` (default: `5GB`) |
| 82 | + - Size of the global Mooncake segment. This represents the registered memory size per card. |
| 83 | +- `local_buffer_size` (default: `5GB`) |
| 84 | + - Size of the local buffer used by the connector. |
| 85 | +- `executor_workers` (default: `4`) |
| 86 | + - Number of worker threads used for asynchronous load and dump execution. |
| 87 | + |
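Before launching, it can help to sanity-check a connector entry against the required parameters listed above. The following is a minimal sketch, assuming the configuration has already been parsed into a Python dict with the same shape as the minimal example. The helper `check_connector` is hypothetical and not part of UCM.

```python
from typing import Dict, List

# Keys that must be present and non-empty inside ucm_connector_config.
REQUIRED_KEYS = ("metadata_server", "master_server_address")


def check_connector(entry: Dict) -> List[str]:
    """Return a list of problems found in one ucm_connectors entry (hypothetical helper)."""
    problems = []
    if entry.get("ucm_connector_name") != "UcmMooncakeStoreV1":
        problems.append("ucm_connector_name must be UcmMooncakeStoreV1")
    config = entry.get("ucm_connector_config") or {}
    if config.get("protocol") != "ascend":
        problems.append("protocol must be ascend")
    for key in REQUIRED_KEYS:
        if not config.get(key):
            problems.append(f"{key} is required")
    return problems


# Same values as the minimal configuration example, written as a dict.
entry = {
    "ucm_connector_name": "UcmMooncakeStoreV1",
    "ucm_connector_config": {
        "protocol": "ascend",
        "metadata_server": "P2PHANDSHAKE",
        "master_server_address": "127.0.0.1:50088",
    },
}
print(check_connector(entry))  # []
```

An empty list means all four required parameters are set as described. Optional parameters are not checked here because they fall back to their defaults.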
| 88 | +## Run Mooncake Master |
| 89 | + |
| 90 | +Before launching vLLM, start the Mooncake master service: |
| 91 | + |
| 92 | +```bash |
| 93 | +mooncake_master \ |
| 94 | + --port 50088 \ |
| 95 | + --eviction_high_watermark_ratio 0.9 \ |
| 96 | + --eviction_ratio 0.1 \ |
| 97 | + --default_kv_lease_ttl 11000 |
| 98 | +``` |
| 99 | + |
| 100 | +Parameter description: |
| 101 | + |
| 102 | +- `eviction_high_watermark_ratio` |
| 103 | + - Controls the watermark at which eviction is triggered. |
| 104 | +- `eviction_ratio` |
| 105 | + - Controls the fraction of objects to evict once eviction starts. |
| 106 | +- `default_kv_lease_ttl` |
| 107 | + - Controls the default KV lease TTL. Set it to a value larger than both `ASCEND_CONNECT_TIMEOUT` and `ASCEND_TRANSFER_TIMEOUT`. |
| 108 | + |
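The lease TTL constraint can be expressed as a one-line check. This sketch assumes that all three values are given in the same unit, which this guide does not state explicitly. The function name `lease_ttl_is_safe` is hypothetical.

```python
def lease_ttl_is_safe(lease_ttl: int, connect_timeout: int, transfer_timeout: int) -> bool:
    """Return True if the lease TTL exceeds both Ascend timeouts.

    Assumption: all three arguments use the same time unit.
    """
    return lease_ttl > max(connect_timeout, transfer_timeout)


# Values used in this guide: TTL 11000, both timeouts 10000.
print(lease_ttl_is_safe(11000, 10000, 10000))  # True
```

If either timeout is raised, `default_kv_lease_ttl` should be raised with it so that the check still holds.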
| 109 | +## Launching Inference |
| 110 | + |
| 111 | +Use `vllm serve` with `UCMConnector`, and pass the Mooncake-backed UCM configuration file through `UCM_CONFIG_FILE`. |
| 112 | + |
| 113 | +### Recommended Launch Command |
| 114 | + |
| 115 | +```bash |
| 116 | +export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH |
| 117 | +export PYTHONHASHSEED=0 |
| 118 | +export HCCL_INTRA_ROCE_ENABLE=1 |
| 119 | +export HCCL_RDMA_TIMEOUT=17 |
| 120 | +export ASCEND_CONNECT_TIMEOUT=10000 |
| 121 | +export ASCEND_TRANSFER_TIMEOUT=10000 |
| 122 | + |
| 123 | +vllm serve <your-model> \ |
| 124 | + --host 0.0.0.0 \ |
| 125 | + --port 8100 \ |
| 126 | + --trust-remote-code \ |
| 127 | + --enforce-eager \ |
| 128 | + --no-enable-prefix-caching \ |
| 129 | + --tensor-parallel-size 1 \ |
| 130 | + --data-parallel-size 1 \ |
| 131 | + --max-model-len 32768 \ |
| 132 | + --block-size 128 \ |
| 133 | + --max-num-batched-tokens 16384 \ |
| 134 | + --kv-transfer-config \ |
| 135 | + '{ |
| 136 | + "kv_connector": "UCMConnector", |
| 137 | + "kv_role": "kv_both", |
| 138 | + "kv_connector_module_path": "ucm.integration.vllm.ucm_connector", |
| 139 | + "kv_connector_extra_config": { |
| 140 | + "UCM_CONFIG_FILE": "/path/to/unified-cache-management/examples/ucm_config_example.yaml" |
| 141 | + } |
| 142 | + }' |
| 143 | +``` |
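Hand-written JSON inside a shell command is easy to break with a stray quote or comma. As one possible alternative, the `--kv-transfer-config` value can be generated with Python's standard library. This is a sketch using only the fields shown in the command above. The helper name `build_kv_transfer_config` is hypothetical, and the path remains a placeholder.

```python
import json
import shlex


def build_kv_transfer_config(ucm_config_file: str) -> str:
    """Serialize the kv-transfer-config fields from this guide into a JSON string."""
    return json.dumps(
        {
            "kv_connector": "UCMConnector",
            "kv_role": "kv_both",
            "kv_connector_module_path": "ucm.integration.vllm.ucm_connector",
            "kv_connector_extra_config": {"UCM_CONFIG_FILE": ucm_config_file},
        }
    )


# Placeholder path copied from the launch command; replace it with the real location.
arg = build_kv_transfer_config(
    "/path/to/unified-cache-management/examples/ucm_config_example.yaml"
)
# shlex.quote wraps the JSON so it survives as a single shell argument.
print("--kv-transfer-config", shlex.quote(arg))
```

The printed line can be pasted in place of the inline JSON in the `vllm serve` command.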