
Commit daad723

[feat]Add Layerwise Connector (#656)
# Purpose

Introduce the Layerwise Connector.

The LayerwiseConnector is designed to overlap computation with load/dump operations, thereby speeding up the prefill phase. As soon as the attention computation for layer i finishes, its KV-cache is dumped immediately; we no longer wait until every layer is done. Before a forward pass begins, all KV-caches are loaded asynchronously, and layer i can start its attention computation as soon as its own KV-cache becomes available, without waiting for the entire load to complete.

> **:warning: Note**
> **Only PipelineStore is available for layerwise; NfsStore doesn't support layerwise.**

# Modifications

# Test

## Performance

H20-QwQ32B-TP4-prefix cache-80%hit-TTFT

| input | output | parallel | recalculation | PipelineStore with layerwise | speedup (PipelineStore) | NfsStore | speedup (NfsStore) |
|-------|--------|----------|---------------|------------------------------|-------------------------|----------|--------------------|
| 4000 | 1000 | 1 | 551.99 | 169.26 | 226.12% | 177.69 | 210.65% |
| 8000 | 1000 | 1 | 1102.31 | 298.52 | 269.26% | 327.72 | 236.36% |
| 16000 | 1000 | 1 | 2356.01 | 610.73 | 285.77% | 688.89 | 242.00% |
| 32000 | 1000 | 1 | 5341.1 | 1384.49 | 285.78% | 1544.76 | 245.76% |
| 4000 | 1000 | 8 | 2642.04 | 981.39 | 169.21% | 1038.27 | 154.47% |
| 8000 | 1000 | 8 | 5031.3 | 1706.1 | 194.90% | 1858.99 | 170.65% |
| 16000 | 1000 | 8 | 10840.92 | 3250.35 | 233.53% | 3544.2 | 205.88% |
| 32000 | 1000 | 8 | 24709.55 | 6848.46 | 260.80% | 7958.29 | 210.49% |
| 4000 | 1000 | 16 | 4791.96 | 1628.33 | 194.29% | 1747.01 | 174.29% |
| 8000 | 1000 | 16 | 9489.08 | 3002.13 | 216.08% | 3269.28 | 190.25% |
| 16000 | 1000 | 16 | 20556.38 | 5677.6 | 262.06% | 6342.14 | 224.12% |
| 32000 | 1000 | 16 | 46992.56 | 12584.95 | 273.40% | 14296.63 | 228.70% |

H20-QwQ32B-TP4-20%hit-TTFT

| input | output | parallel | recalculation | PipelineStore with layerwise | speedup (PipelineStore) | NfsStore | speedup (NfsStore) |
|-------|--------|----------|---------------|------------------------------|-------------------------|----------|--------------------|
| 4000 | 1000 | 1 | 551.99 | 473.26 | 16.64% | 483.74 | 14.11% |
| 8000 | 1000 | 1 | 1102.31 | 936.96 | 17.65% | 961.04 | 14.70% |
| 16000 | 1000 | 1 | 2356.01 | 1986.57 | 18.60% | 2044.30 | 15.25% |
| 32000 | 1000 | 1 | 5341.10 | 4615.83 | 15.71% | 4732.61 | 12.86% |
| 4000 | 1000 | 8 | 2642.04 | 2352.42 | 12.31% | 2507.12 | 5.38% |
| 8000 | 1000 | 8 | 5031.30 | 4481.31 | 12.27% | 4849.90 | 3.74% |
| 16000 | 1000 | 8 | 10840.92 | 9342.19 | 16.04% | 10010.21 | 8.30% |
| 32000 | 1000 | 8 | 24709.55 | 21135.45 | 16.91% | 22135.89 | 11.63% |
| 4000 | 1000 | 16 | 4791.96 | 4173.80 | 14.81% | 4498.91 | 6.51% |
| 8000 | 1000 | 16 | 9489.08 | 8264.28 | 14.82% | 9005.86 | 5.37% |
| 16000 | 1000 | 16 | 20556.38 | 17299.15 | 18.83% | 18669.48 | 10.11% |
| 32000 | 1000 | 16 | 46992.56 | 39894.73 | 17.79% | 41823.67 | 12.36% |

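The overlap described under Purpose can be illustrated with a minimal Python sketch built on plain threads. This is not the LayerwiseConnector implementation: `run_prefill`, `load_all`, `dump`, the layer count, and the sleep-based timings are hypothetical stand-ins chosen for illustration, and the real connector presumably operates on KV-cache tensors rather than Python lists.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

NUM_LAYERS = 4  # hypothetical layer count, for illustration only


def run_prefill(num_layers: int = NUM_LAYERS) -> dict:
    """Toy model of a layerwise prefill: load, compute, and dump overlap per layer."""
    loaded = [threading.Event() for _ in range(num_layers)]
    computed, dumped = [], []
    lock = threading.Lock()

    def load_all() -> None:
        # Stand-in for the asynchronous KV-cache load started before the forward pass.
        for layer in range(num_layers):
            time.sleep(0.01)     # pretend to read this layer's KV-cache from storage
            loaded[layer].set()  # this layer is usable now; later layers keep loading

    def dump(layer: int) -> None:
        time.sleep(0.01)         # pretend to write this layer's KV-cache to storage
        with lock:
            dumped.append(layer)

    loader = threading.Thread(target=load_all)
    loader.start()
    with ThreadPoolExecutor(max_workers=2) as dump_pool:
        for layer in range(num_layers):
            loaded[layer].wait()           # block only on this layer's own KV-cache
            computed.append(layer)         # stand-in for the attention computation
            dump_pool.submit(dump, layer)  # dump right away, without waiting for other layers
    # Leaving the `with` block waits for all outstanding dumps to finish.
    loader.join()
    return {"computed": computed, "dumped": sorted(dumped)}


if __name__ == "__main__":
    print(run_prefill())
```

The point of the sketch is the shape of the control flow: the forward loop waits per layer rather than on the whole load, and each dump is handed off as soon as its layer is done.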
1 parent 6f90147 commit daad723

5 files changed

Lines changed: 236 additions & 44 deletions


docs/source/getting-started/quickstart_vllm.md

Lines changed: 1 addition & 1 deletion
@@ -133,7 +133,7 @@ For quick start, just follow the guide below to launch your own inference experi

 ### Feature 1: Prefix Caching

-You may directly edit the example file at `unified-cache-management/examples/ucm_config_example.yaml`. For more please refer to [Prefix Cache with NFS Store](../user-guide/prefix-cache/nfs_store.md) document.
+You may directly edit the example file at `unified-cache-management/examples/ucm_config_example.yaml`. For more please refer to [Prefix Cache with NFS Store](../user-guide/prefix-cache/nfs_store.md) and [Prefix Cache with Pipeline Store](../user-guide/prefix-cache/pipeline_store.md) document.

 ⚠️ Make sure to replace `/mnt/test` with your actual storage directory.

docs/source/getting-started/quickstart_vllm_ascend.md

Lines changed: 1 addition & 1 deletion
@@ -107,7 +107,7 @@ For quick start, just follow the guide below to launch your own inference experi

 ### Feature 1: Prefix Caching

-You may directly edit the example file at `unified-cache-management/examples/ucm_config_example.yaml`. For more please refer to [Prefix Cache with NFS Store](../user-guide/prefix-cache/nfs_store.md) document.
+You may directly edit the example file at `unified-cache-management/examples/ucm_config_example.yaml`. For more please refer to [Prefix Cache with NFS Store](../user-guide/prefix-cache/nfs_store.md) and [Prefix Cache with Pipeline Store](../user-guide/prefix-cache/pipeline_store.md) document.

 ⚠️ Make sure to replace `/mnt/test` with your actual storage directory.


docs/source/user-guide/prefix-cache/pipeline_store.md

Lines changed: 25 additions & 4 deletions
@@ -102,7 +102,22 @@ load_only_first_rank: false
 Whether to enable direct I/O.

 * **stream_number** *(optional, default: 8)*
-Number of concurrent streams used for data transfer.
+Number of threads used for data transfer between the Host and Storage.
+
+* **buffer_number** *(optional, default: 16384)*
+The number of dram pinned buffers for data transfer between the Device and Host.
+In the vast majority of cases, the default value of 16384 is already sufficient.
+You can also check the vLLM startup logs, where you’ll see a line like
+```
+vllm cache_config_info with initialization after num_gpu_blocks is: xxx
+```
+As a rule of thumb, set `buffer_number` **>=** the reported `num_gpu_blocks` for better performance.
+If you are using the **Layerwise Connector**, you could set
+```
+buffer_number = num_gpu_blocks × num_layers
+```
+But as said before, the default value of 16384 is already enough in most cases.
+

 * **waiting_queue_depth** *(optional, default: 1024)*
 Depth of the waiting queue for transfer tasks.
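
The `buffer_number` rule of thumb added in this hunk can be written out as a small Python helper. This is only a sketch of the arithmetic stated in the doc text: `suggest_buffer_number` is a hypothetical name, not part of UCM, and the block and layer counts in the usage lines are made-up values rather than figures from any real startup log.

```python
DEFAULT_BUFFER_NUMBER = 16384  # default value stated in pipeline_store.md


def suggest_buffer_number(num_gpu_blocks: int,
                          num_layers: int = 1,
                          use_layerwise: bool = False) -> int:
    """Return the rule-of-thumb buffer_number from the documentation.

    Without layerwise: at least num_gpu_blocks.
    With the Layerwise Connector: num_gpu_blocks * num_layers.
    """
    if num_gpu_blocks <= 0 or num_layers <= 0:
        raise ValueError("num_gpu_blocks and num_layers must be positive")
    if use_layerwise:
        return num_gpu_blocks * num_layers
    return num_gpu_blocks


if __name__ == "__main__":
    # Hypothetical values: 2000 GPU blocks reported at startup, 64 layers.
    plain = suggest_buffer_number(2000)
    layerwise = suggest_buffer_number(2000, num_layers=64, use_layerwise=True)
    print(plain, plain <= DEFAULT_BUFFER_NUMBER)          # default already covers this case
    print(layerwise, layerwise <= DEFAULT_BUFFER_NUMBER)  # rule of thumb exceeds the default
```

The helper reports only the rule-of-thumb target. Whether to raise `buffer_number` above the default is left to the operator, since the doc notes that 16384 is enough in most cases.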
@@ -113,9 +128,6 @@ load_only_first_rank: false
 * **timeout_ms** *(optional, default: 30000)*
 Timeout in milliseconds for external interfaces.

-* **buffer_size** *(optional, default: 64GB)*
-Amount of dram pinned memory used by a single worker process.
-
 ### Must-be-Set Parameters

 * **load_only_first_rank** (must be `false`):
@@ -146,6 +158,15 @@ vllm serve Qwen/Qwen2.5-14B-Instruct \
 "kv_connector_extra_config": {"UCM_CONFIG_FILE": "/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"}
 }'
 ```
+You can also use the Layerwise Connector by adding `"use_layerwise": true` to the `kv_connector_extra_config`.
+for example:
+
+```bash
+"kv_connector_extra_config": {
+"use_layerwise": true,
+"UCM_CONFIG_FILE": "/home/qiuyuhao1/unified-cache-management/examples/ucm_config_example.yaml"
+}
+```

 **⚠️ Make sure to replace `"/vllm-workspace/unified-cache-management/examples/ucm_config_example.yaml"` with your actual config file path.**

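
For readers who build the launch command from a script, the `kv_connector_extra_config` fragment shown in the last hunk of `pipeline_store.md` could be generated in Python as sketched below. Only the two keys that appear in the diff (`UCM_CONFIG_FILE` and `use_layerwise`) are used; `build_extra_config` is a hypothetical helper, the path is a placeholder, and the remaining fields of the full `--kv-transfer-config` value are not reproduced here.

```python
import json


def build_extra_config(ucm_config_file: str, use_layerwise: bool = False) -> str:
    """Serialize the kv_connector_extra_config fragment as JSON."""
    extra = {"UCM_CONFIG_FILE": ucm_config_file}
    if use_layerwise:
        # Per the doc change, this key enables the Layerwise Connector.
        extra["use_layerwise"] = True
    return json.dumps(extra)


if __name__ == "__main__":
    # Placeholder path; replace it with your actual config file path.
    print(build_extra_config("/path/to/ucm_config_example.yaml", use_layerwise=True))
```

Generating the fragment with `json.dumps` avoids hand-quoting mistakes, such as writing Python's `True` instead of JSON's `true`.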