|
| 1 | +[简体中文](../zh/best_practices/Disaggregated.md) |
| 2 | + |
| 3 | +# PD Disaggregated Deployment Best Practices |
| 4 | + |
| 5 | +This document provides a comprehensive guide to FastDeploy's PD (Prefill-Decode) disaggregated deployment solution, covering both single-machine and cross-machine deployment modes with support for Tensor Parallelism (TP), Data Parallelism (DP), and Expert Parallelism (EP). |
| 6 | + |
| 7 | +## 1. Deployment Overview and Environment Preparation |
| 8 | + |
| 9 | +This guide demonstrates deployment practices using the ERNIE-4.5-300B-A47B-Paddle model on H100 80GB GPUs. Below are the minimum GPU requirements for different deployment configurations: |
| 10 | + |
| 11 | +**Single-Machine Deployment (8 GPUs, Single Node)** |
| 12 | + |
| 13 | +| Configuration | TP | DP | EP | GPUs Required | |
| 14 | +|---------|----|----|----|---------| |
| 15 | +| P:TP4DP1<br>D:TP4DP1 | 4 | 1 | - | 8 | |
| 16 | +| P:TP1DP4EP4 <br> D:TP1DP4EP4| 1 | 4 | ✓ | 8 | |
| 17 | + |
| 18 | +**Multi-Machine Deployment (16 GPUs, Cross-Node)** |
| 19 | + |
| 20 | +| Configuration | TP | DP | EP | GPUs Required | |
| 21 | +|---------|----|----|----|---------| |
| 22 | +| P:TP8DP1<br>D:TP8DP1 | 8 | 1 | - | 16 | |
| 23 | +| P:TP4DP2<br>D:TP4DP2 | 4 | 2 | - | 16 | |
| 24 | +| P:TP1DP8EP8<br>D:TP1DP8EP8 | 1 | 8 | ✓ | 16 | |
| 25 | + |
| 26 | +**Important Notes**: |
| 27 | +1. **Quantization**: All configurations above use WINT4 quantization, specified via `--quantization wint4` |
| 28 | +2. **EP Limitations**: When Expert Parallelism (EP) is enabled, only TP=1 is currently supported; multi-TP scenarios are not yet available |
| 29 | +3. **Cross-Machine Network**: Cross-machine deployment requires RDMA network support for high-speed KV Cache transmission |
| 30 | +4. **GPU Calculation**: Total GPUs = TP × DP × 2, with identical configurations for both Prefill and Decode instances |
| 31 | +5. **CUDA Graph Capture**: Decode instances enable CUDA Graph capture by default for inference acceleration, while Prefill instances do not |
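
The GPU calculation and the EP limitation in the notes above can be sketched as a small sanity check. This is only an illustration: `total_gpus` is a hypothetical helper written for this guide, not part of FastDeploy.

```python
def total_gpus(tp: int, dp: int, expert_parallel: bool = False) -> int:
    """GPUs needed for one Prefill group plus one Decode group.

    Hypothetical helper (not a FastDeploy API). It assumes Prefill and
    Decode use identical TP/DP settings, as stated in the notes above.
    """
    if tp < 1 or dp < 1:
        raise ValueError("tp and dp must be positive integers")
    if expert_parallel and tp != 1:
        # Per the notes above, EP is currently supported only with TP=1.
        raise ValueError("expert parallelism currently requires TP=1")
    return tp * dp * 2


# Configurations taken from the two tables above.
for label, tp, dp, ep in [
    ("P:TP4DP1 / D:TP4DP1", 4, 1, False),
    ("P:TP1DP4EP4 / D:TP1DP4EP4", 1, 4, True),
    ("P:TP8DP1 / D:TP8DP1", 8, 1, False),
    ("P:TP4DP2 / D:TP4DP2", 4, 2, False),
    ("P:TP1DP8EP8 / D:TP1DP8EP8", 1, 8, True),
]:
    print(f"{label}: {total_gpus(tp, dp, ep)} GPUs")
```

Running the check before launching makes a mismatch between the planned configuration and the available hardware visible early.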
| 32 | + |
| 33 | +### 1.1 Installing FastDeploy |
| 34 | + |
| 35 | +Please refer to the [FastDeploy Installation Guide](https://paddlepaddle.github.io/FastDeploy/zh/install/) to set up your environment. |
| 36 | + |
| 37 | +For model downloads, please check the [Supported Models List](https://paddlepaddle.github.io/FastDeploy/zh/model_summary/). |
| 38 | + |
| 39 | +### 1.2 Deployment Topology |
| 40 | + |
| 41 | +**Single-Machine Deployment Topology** |
| 42 | + |
| 43 | +``` |
| 44 | +┌──────────────────────────────┐ |
| 45 | +│ Single Machine 8×H100 80GB │ |
| 46 | +│ ┌──────────────┐ │ |
| 47 | +│ │ Router │ │ |
| 48 | +│ │ 0.0.0.0:8109│ │ |
| 49 | +│ └──────────────┘ │ |
| 50 | +│ │ │ |
| 51 | +│ ┌────┴────┐ │ |
| 52 | +│ ▼ ▼ │ |
| 53 | +│ ┌─────────┐ ┌─────────┐ │ |
| 54 | +│ │Prefill │ │Decode │ │ |
| 55 | +│ │GPU 0-3 │ │GPU 4-7 │ │ |
| 56 | +│ └─────────┘ └─────────┘ │ |
| 57 | +└──────────────────────────────┘ |
| 58 | +``` |
| 59 | + |
| 60 | +**Cross-Machine Deployment Topology** |
| 61 | + |
| 62 | +``` |
| 63 | +┌─────────────────────┐ ┌─────────────────────┐ |
| 64 | +│ Prefill Machine │ RDMA Network │ Decode Machine │ |
| 65 | +│ 8×H100 80GB │◄────────────────────►│ 8×H100 80GB │ |
| 66 | +│ │ │ │ |
| 67 | +│ ┌──────────────┐ │ │ │ |
| 68 | +│ │ Router │ │ │ │ |
| 69 | +│ │ 0.0.0.0:8109 │───┼──────────────────────┼────────── │ |
| 70 | +│ └──────────────┘ │ │ │ │ |
| 71 | +│ │ │ │ │ │ |
| 72 | +│ ▼ │ │ ▼ │ |
| 73 | +│ ┌──────────────┐ │ │ ┌──────────────┐ │ |
| 74 | +│ │Prefill Nodes │ │ │ │Decode Nodes │ │ |
| 75 | +│ │GPU 0-7 │ │ │ │GPU 0-7 │ │ |
| 76 | +│ └──────────────┘ │ │ └──────────────┘ │ |
| 77 | +└─────────────────────┘ └─────────────────────┘ |
| 78 | +``` |
| 79 | + |
| 80 | +--- |
| 81 | +## 2. Single-Machine PD Disaggregated Deployment |
| 82 | + |
| 83 | +### 2.1 Test Scenarios and Parallelism Configuration |
| 84 | + |
| 85 | +This chapter demonstrates the **P:TP4DP1|D:TP4DP1** configuration: |
| 86 | +- **Tensor Parallelism (TP)**: 4 (each instance shards one complete copy of the model parameters across its 4 GPUs) |
| 87 | +- **Data Parallelism (DP)**: 1 (each instance runs a single data-parallel replica) |
| 88 | +- **Expert Parallelism (EP)**: Not enabled |
| 89 | + |
| 90 | +**To test other parallelism configurations, adjust parameters as follows:** |
| 91 | +1. **TP Adjustment**: Modify `--tensor-parallel-size` |
| 92 | +2. **DP Adjustment**: Modify `--data-parallel-size`; when launching with `multi_api_server`, keep the number of entries in `--ports` and the value of `--num-servers` equal to DP |
| 93 | +3. **EP Toggle**: Add or remove `--enable-expert-parallel` |
| 94 | +4. **GPU Allocation**: Control GPUs used by Prefill and Decode instances via `CUDA_VISIBLE_DEVICES` |
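
As a rough illustration of the GPU allocation step above, the two `CUDA_VISIBLE_DEVICES` values can be derived from the GPU count. `split_devices` is a hypothetical helper, and the split assumes Prefill takes the lowest-numbered GPUs, as in the single-machine topology diagram.

```python
def split_devices(total: int, prefill_count: int) -> tuple[str, str]:
    """Return (prefill, decode) values for CUDA_VISIBLE_DEVICES.

    Hypothetical helper: Prefill gets GPUs 0..prefill_count-1 and
    Decode gets the remaining ones.
    """
    if not 0 < prefill_count < total:
        raise ValueError("prefill_count must be between 1 and total - 1")
    ids = [str(i) for i in range(total)]
    return ",".join(ids[:prefill_count]), ",".join(ids[prefill_count:])


prefill_devices, decode_devices = split_devices(total=8, prefill_count=4)
print(f"export CUDA_VISIBLE_DEVICES={prefill_devices}  # Prefill instance")
print(f"export CUDA_VISIBLE_DEVICES={decode_devices}  # Decode instance")
```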
| 95 | + |
| 96 | +### 2.2 Startup Scripts |
| 97 | + |
| 98 | +#### Start Router |
| 99 | + |
| 100 | +```bash |
| 101 | +python -m fastdeploy.router.launch \ |
| 102 | + --port 8109 \ |
| 103 | + --splitwise |
| 104 | +``` |
| 105 | + |
| 106 | +Note: This uses the Python version of the router. If needed, you can also use the high-performance [Golang version router](../online_serving/router.md). |
| 107 | + |
| 108 | +#### Start Prefill Nodes |
| 109 | + |
| 110 | +```bash |
| 111 | +export CUDA_VISIBLE_DEVICES=0,1,2,3 |
| 112 | + |
| 113 | +python -m fastdeploy.entrypoints.openai.api_server \ |
| 114 | + --model /path/to/ERNIE-4.5-300B-A47B-Paddle \ |
| 115 | + --port 8188 \ |
| 116 | + --splitwise-role "prefill" \ |
| 117 | + --cache-transfer-protocol "rdma,ipc" \ |
| 118 | + --router "0.0.0.0:8109" \ |
| 119 | + --quantization wint4 \ |
| 120 | + --tensor-parallel-size 4 \ |
| 121 | + --data-parallel-size 1 \ |
| 122 | + --max-model-len 8192 \ |
| 123 | + --max-num-seqs 64 |
| 124 | +``` |
| 125 | + |
| 126 | +#### Start Decode Nodes |
| 127 | + |
| 128 | +```bash |
| 129 | +export CUDA_VISIBLE_DEVICES=4,5,6,7 |
| 130 | + |
| 131 | +python -m fastdeploy.entrypoints.openai.api_server \ |
| 132 | + --model /path/to/ERNIE-4.5-300B-A47B-Paddle \ |
| 133 | + --port 8200 \ |
| 134 | + --splitwise-role "decode" \ |
| 135 | + --cache-transfer-protocol "rdma,ipc" \ |
| 136 | + --router "0.0.0.0:8109" \ |
| 137 | + --quantization wint4 \ |
| 138 | + --tensor-parallel-size 4 \ |
| 139 | + --data-parallel-size 1 \ |
| 140 | + --max-model-len 8192 \ |
| 141 | + --max-num-seqs 64 |
| 142 | +``` |
| 143 | + |
| 144 | +### 2.3 Key Parameter Descriptions |
| 145 | + |
| 146 | +| Parameter | Description | |
| 147 | +|-----|------| |
| 148 | +| `--splitwise` | Enable PD disaggregated mode | |
| 149 | +| `--splitwise-role` | Node role: `prefill` or `decode` | |
| 150 | +| `--cache-transfer-protocol` | KV Cache transfer protocol: `rdma`, `ipc`, or a comma-separated list such as `rdma,ipc` | |
| 151 | +| `--router` | Router service address | |
| 152 | +| `--quantization` | Quantization strategy (wint4/wint8/fp8, etc.) | |
| 153 | +| `--tensor-parallel-size` | Tensor parallelism degree (TP) | |
| 154 | +| `--data-parallel-size` | Data parallelism degree (DP) | |
| 155 | +| `--max-model-len` | Maximum sequence length | |
| 156 | +| `--max-num-seqs` | Maximum concurrent sequences | |
| 157 | +| `--num-gpu-blocks-override` | GPU KV Cache block count override | |
| 158 | + |
| 159 | +--- |
| 160 | + |
| 161 | +## 3. Cross-Machine PD Disaggregated Deployment |
| 162 | + |
| 163 | +### 3.1 Deployment Principles |
| 164 | + |
| 165 | +Cross-machine PD disaggregation deploys Prefill and Decode instances on different physical machines: |
| 166 | +- **Prefill Machine**: Runs the Router and the Prefill nodes, and handles prefill computation over the input sequence |
| 167 | +- **Decode Machine**: Runs the Decode nodes, communicates with the Prefill machine over the RDMA network, and handles autoregressive decoding |
| 168 | + |
| 169 | +### 3.2 Test Scenarios and Parallelism Configuration |
| 170 | + |
| 171 | +This chapter demonstrates the **P:TP1DP8EP8|D:TP1DP8EP8** cross-machine configuration (16 GPUs total): |
| 172 | +- **Tensor Parallelism (TP)**: 1 |
| 173 | +- **Data Parallelism (DP)**: 8 (one replica per GPU, giving 8 Prefill instances on one machine and 8 Decode instances on the other) |
| 174 | +- **Expert Parallelism (EP)**: Enabled (the MoE layer experts are distributed across the 8 GPUs of each machine for parallel computation) |
| 175 | + |
| 176 | +**To test other cross-machine parallelism configurations, adjust parameters as follows:** |
| 177 | +1. **Inter-Machine Communication**: Ensure RDMA network connectivity between the machines; the Prefill machine needs the `KVCACHE_RDMA_NICS` environment variable configured |
| 178 | +2. **Router Address**: The `--router` parameter on the Decode machine must point to the actual IP address of the Prefill machine |
| 179 | +3. **Port Configuration**: The number of ports in the `--ports` list must match `--num-servers` and `--data-parallel-size` |
| 180 | +4. **GPU Visibility**: Each machine specifies its local GPUs via `CUDA_VISIBLE_DEVICES` |
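
The port rule above (one port per server, with the server count equal to DP) can be sketched as follows. Both functions are hypothetical helpers for illustration only, and consecutive port numbering is an assumption that matches the scripts in this guide rather than a FastDeploy requirement.

```python
def make_ports(base_port: int, dp: int) -> str:
    """Build a comma-separated --ports value with one port per DP replica."""
    if dp < 1:
        raise ValueError("dp must be a positive integer")
    return ",".join(str(base_port + i) for i in range(dp))


def check_launch_args(ports: str, num_servers: int, dp: int) -> None:
    """Raise ValueError unless len(ports) == num_servers == dp."""
    count = len(ports.split(","))
    if not (count == num_servers == dp):
        raise ValueError(
            f"{count} ports, --num-servers {num_servers} and "
            f"--data-parallel-size {dp} must all be equal"
        )


ports = make_ports(base_port=8198, dp=8)
check_launch_args(ports, num_servers=8, dp=8)
print(f"--ports {ports} --num-servers 8")
```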
| 181 | + |
| 182 | +### 3.3 Prefill Machine Startup Scripts |
| 183 | + |
| 184 | +#### Start Router |
| 185 | + |
| 186 | +```bash |
| 187 | +unset http_proxy && unset https_proxy |
| 188 | + |
| 189 | +python -m fastdeploy.router.launch \ |
| 190 | + --port 8109 \ |
| 191 | + --splitwise |
| 192 | +``` |
| 193 | + |
| 194 | +#### Start Prefill Nodes |
| 195 | + |
| 196 | +```bash |
| 197 | +export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 |
| 198 | + |
| 199 | +python -m fastdeploy.entrypoints.openai.multi_api_server \ |
| 200 | + --ports 8198,8199,8200,8201,8202,8203,8204,8205 \ |
| 201 | + --num-servers 8 \ |
| 202 | + --args --model /path/to/ERNIE-4.5-300B-A47B-Paddle \ |
| 203 | + --splitwise-role "prefill" \ |
| 204 | + --cache-transfer-protocol "rdma,ipc" \ |
| 205 | + --router "<ROUTER_MACHINE_IP>:8109" \ |
| 206 | + --quantization wint4 \ |
| 207 | + --tensor-parallel-size 1 \ |
| 208 | + --data-parallel-size 8 \ |
| 209 | + --enable-expert-parallel \ |
| 210 | + --max-model-len 8192 \ |
| 211 | + --max-num-seqs 64 |
| 212 | +``` |
| 213 | + |
| 214 | +### 3.4 Decode Machine Startup Scripts |
| 215 | + |
| 216 | +#### Start Decode Nodes |
| 217 | + |
| 218 | +```bash |
| 219 | +export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 |
| 220 | + |
| 221 | +python -m fastdeploy.entrypoints.openai.multi_api_server \ |
| 222 | + --ports 8198,8199,8200,8201,8202,8203,8204,8205 \ |
| 223 | + --num-servers 8 \ |
| 224 | + --args --model /path/to/ERNIE-4.5-300B-A47B-Paddle \ |
| 225 | + --splitwise-role "decode" \ |
| 226 | + --cache-transfer-protocol "rdma,ipc" \ |
| 227 | + --router "<PREFILL_MACHINE_IP>:8109" \ |
| 228 | + --quantization wint4 \ |
| 229 | + --tensor-parallel-size 1 \ |
| 230 | + --data-parallel-size 8 \ |
| 231 | + --enable-expert-parallel \ |
| 232 | + --max-model-len 8192 \ |
| 233 | + --max-num-seqs 64 |
| 234 | +``` |
| 235 | + |
| 236 | +**Note**: Replace `<PREFILL_MACHINE_IP>` (and `<ROUTER_MACHINE_IP>` in the Prefill script) with the actual IP address of the Prefill machine, which also hosts the Router. |
| 237 | + |
| 238 | +## 4. Sending Test Requests |
| 239 | + |
| 240 | +```bash |
| 241 | +curl -X POST "http://localhost:8109/v1/chat/completions" \ |
| 242 | +-H "Content-Type: application/json" \ |
| 243 | +-d '{ |
| 244 | + "messages": [ |
| 245 | + {"role": "user", "content": "Hello, please introduce yourself."} |
| 246 | + ], |
| 247 | + "max_tokens": 100, |
| 248 | + "stream": false |
| 249 | +}' |
| 250 | +``` |
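
The same request can be sent from Python using only the standard library. This is a minimal sketch: `build_chat_request` and `send` are hypothetical helper names, and reading the reply from `choices[0].message.content` assumes the Router returns an OpenAI-style chat completion body.

```python
import json
import urllib.request


def build_chat_request(router_url: str, prompt: str,
                       max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST request mirroring the curl command above."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{router_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the reply text.

    Assumes an OpenAI-style response body; adjust the lookup if the
    actual response differs.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage once the Router and both instances are running:
#   request = build_chat_request("http://localhost:8109", "Hello, please introduce yourself.")
#   print(send(request))
```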
| 251 | + |
| 252 | +## 5. Frequently Asked Questions (FAQ) |
| 253 | + |
| 254 | +If you encounter issues during use, please refer to [FAQ](./FAQ.md) for solutions. |