Commit 3b56411

[Docs] Add docs for disaggregated deployment (#6700)
* add docs for disaggregated deployment
* pre-commit run for style check
* update docs
1 parent ceaf5df commit 3b56411

6 files changed

Lines changed: 513 additions & 0 deletions

Lines changed: 254 additions & 0 deletions
@@ -0,0 +1,254 @@
[简体中文](../zh/best_practices/Disaggregated.md)

# PD Disaggregated Deployment Best Practices

This document describes FastDeploy's PD (Prefill-Decode) disaggregated deployment solution. It covers single-machine and cross-machine deployment modes, with support for Tensor Parallelism (TP), Data Parallelism (DP), and Expert Parallelism (EP).

## 1. Deployment Overview and Environment Preparation

This guide uses the ERNIE-4.5-300B-A47B-Paddle model on H100 80GB GPUs as its example. The minimum GPU requirements for each deployment configuration are listed below.

**Single-Machine Deployment (8 GPUs, Single Node)**

| Configuration | TP | DP | EP | GPUs Required |
|---------|----|----|----|---------|
| P:TP4DP1<br>D:TP4DP1 | 4 | 1 | - | 8 |
| P:TP1DP4EP4<br>D:TP1DP4EP4 | 1 | 4 | 4 | 8 |

**Multi-Machine Deployment (16 GPUs, Cross-Node)**

| Configuration | TP | DP | EP | GPUs Required |
|---------|----|----|----|---------|
| P:TP8DP1<br>D:TP8DP1 | 8 | 1 | - | 16 |
| P:TP4DP2<br>D:TP4DP2 | 4 | 2 | - | 16 |
| P:TP1DP8EP8<br>D:TP1DP8EP8 | 1 | 8 | 8 | 16 |

**Important Notes**:
1. **Quantization**: All configurations above use WINT4 quantization, specified via `--quantization wint4`.
2. **EP Limitations**: When Expert Parallelism (EP) is enabled, only TP=1 is currently supported; combining EP with TP>1 is not yet available.
3. **Cross-Machine Network**: Cross-machine deployment requires an RDMA network for high-speed KV Cache transmission.
4. **GPU Calculation**: Total GPUs = TP × DP × 2, because the Prefill and Decode instances use identical configurations.
5. **CUDA Graph Capture**: Decode instances enable CUDA Graph capture by default to accelerate inference; Prefill instances do not.
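
The GPU calculation and the EP limitation from the notes above can be checked before launching anything. The following is a minimal sketch: `total_gpus` is a hypothetical helper written for this guide, not a FastDeploy command, and its third argument (1 when EP is enabled) is an assumption made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of FastDeploy): total GPUs for a PD deployment.
# Usage: total_gpus TP DP [EP_ENABLED]   (EP_ENABLED is 1 or 0, default 0)
total_gpus() {
    local tp=$1 dp=$2 ep=${3:-0}
    # Note 2: with EP enabled, only TP=1 is currently supported.
    if [ "$ep" -eq 1 ] && [ "$tp" -ne 1 ]; then
        echo "error: EP requires TP=1 (got TP=$tp)" >&2
        return 1
    fi
    # Note 4: Prefill and Decode use identical configurations, hence the factor 2.
    echo $(( tp * dp * 2 ))
}

total_gpus 4 1      # P:TP4DP1 | D:TP4DP1, prints 8
total_gpus 1 8 1    # P:TP1DP8EP8 | D:TP1DP8EP8, prints 16
```

The results match the "GPUs Required" column of the tables above.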

### 1.1 Installing FastDeploy

Please refer to the [FastDeploy Installation Guide](https://paddlepaddle.github.io/FastDeploy/zh/install/) to set up your environment.

For model downloads, see the [Supported Models List](https://paddlepaddle.github.io/FastDeploy/zh/model_summary/).

### 1.2 Deployment Topology

**Single-Machine Deployment Topology**

```
┌──────────────────────────────┐
│  Single Machine 8×H100 80GB  │
│       ┌──────────────┐       │
│       │    Router    │       │
│       │ 0.0.0.0:8109 │       │
│       └──────────────┘       │
│               │              │
│          ┌────┴────┐         │
│          ▼         ▼         │
│    ┌─────────┐ ┌─────────┐   │
│    │Prefill  │ │Decode   │   │
│    │GPU 0-3  │ │GPU 4-7  │   │
│    └─────────┘ └─────────┘   │
└──────────────────────────────┘
```

**Cross-Machine Deployment Topology**

```
┌─────────────────────┐                      ┌─────────────────────┐
│   Prefill Machine   │     RDMA Network     │   Decode Machine    │
│     8×H100 80GB     │◄────────────────────►│     8×H100 80GB     │
│                     │                      │                     │
│  ┌──────────────┐   │                      │                     │
│  │    Router    │   │                      │                     │
│  │ 0.0.0.0:8109 │───┼──────────────────────┼──────────┐          │
│  └──────────────┘   │                      │          │          │
│         │           │                      │          │          │
│         ▼           │                      │          ▼          │
│  ┌──────────────┐   │                      │  ┌──────────────┐   │
│  │Prefill Nodes │   │                      │  │Decode Nodes  │   │
│  │GPU 0-7       │   │                      │  │GPU 0-7       │   │
│  └──────────────┘   │                      │  └──────────────┘   │
└─────────────────────┘                      └─────────────────────┘
```

---
## 2. Single-Machine PD Disaggregated Deployment

### 2.1 Test Scenarios and Parallelism Configuration

This chapter demonstrates the **P:TP4DP1|D:TP4DP1** configuration:
- **Tensor Parallelism (TP)**: 4. Each instance shards the model parameters across its 4 GPUs, so each group of 4 GPUs holds one complete copy of the model
- **Data Parallelism (DP)**: 1. Each instance runs a single data-parallel replica
- **Expert Parallelism (EP)**: Not enabled

**To test other parallelism configurations, adjust parameters as follows:**
1. **TP Adjustment**: Modify `--tensor-parallel-size`
2. **DP Adjustment**: Modify `--data-parallel-size`, and keep the number of entries in `--ports` and the value of `--num-servers` equal to DP
3. **EP Toggle**: Add or remove `--enable-expert-parallel`
4. **GPU Allocation**: Control which GPUs the Prefill and Decode instances use via `CUDA_VISIBLE_DEVICES`
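
When changing DP, one way to keep `--ports`, `--num-servers`, and `--data-parallel-size` in step is to derive all three from a single variable. This is a sketch only: `make_ports` is a hypothetical helper, and using consecutive port numbers starting from a base port is an assumption, not a FastDeploy requirement.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print N consecutive ports starting at BASE, comma-separated.
# Usage: make_ports BASE N
make_ports() {
    local base=$1 n=$2 ports="" i
    for (( i = 0; i < n; i++ )); do
        # Prepend a comma only when the list is already non-empty.
        ports+="${ports:+,}$(( base + i ))"
    done
    echo "$ports"
}

DP=8
PORTS=$(make_ports 8198 "$DP")
# All three values are derived from DP, so they cannot drift apart.
echo "--ports $PORTS --num-servers $DP --data-parallel-size $DP"
```

The echoed flags can then be pasted into, or expanded inside, the launch command.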

### 2.2 Startup Scripts

#### Start Router

```bash
python -m fastdeploy.router.launch \
    --port 8109 \
    --splitwise
```

Note: This uses the Python version of the router. If needed, you can also use the high-performance [Golang version router](../online_serving/router.md).

#### Start Prefill Nodes

```bash
# Prefill instance on GPUs 0-3 (TP4, DP1)
export CUDA_VISIBLE_DEVICES=0,1,2,3

python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
    --port 8188 \
    --splitwise-role "prefill" \
    --cache-transfer-protocol "rdma,ipc" \
    --router "0.0.0.0:8109" \
    --quantization wint4 \
    --tensor-parallel-size 4 \
    --data-parallel-size 1 \
    --max-model-len 8192 \
    --max-num-seqs 64
```

#### Start Decode Nodes

```bash
# Decode instance on GPUs 4-7, mirroring the Prefill configuration (TP4, DP1)
export CUDA_VISIBLE_DEVICES=4,5,6,7

python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
    --port 8200 \
    --splitwise-role "decode" \
    --cache-transfer-protocol "rdma,ipc" \
    --router "0.0.0.0:8109" \
    --quantization wint4 \
    --tensor-parallel-size 4 \
    --data-parallel-size 1 \
    --max-model-len 8192 \
    --max-num-seqs 64
```

### 2.3 Key Parameter Descriptions

| Parameter | Description |
|-----|------|
| `--splitwise` | Router option that enables PD disaggregated mode |
| `--splitwise-role` | Node role: `prefill` or `decode` |
| `--cache-transfer-protocol` | KV Cache transfer protocol: `rdma`, `ipc`, or a comma-separated list such as `rdma,ipc` |
| `--router` | Router service address, in `host:port` form |
| `--quantization` | Quantization strategy (wint4/wint8/fp8, etc.) |
| `--tensor-parallel-size` | Tensor parallelism degree (TP) |
| `--data-parallel-size` | Data parallelism degree (DP) |
| `--max-model-len` | Maximum sequence length |
| `--max-num-seqs` | Maximum number of concurrent sequences |
| `--num-gpu-blocks-override` | Override for the number of GPU KV Cache blocks |

---

## 3. Cross-Machine PD Disaggregated Deployment

### 3.1 Deployment Principles

Cross-machine PD disaggregation deploys Prefill and Decode instances on different physical machines:
- **Prefill Machine**: Runs the Router and the Prefill nodes, which handle prefill computation over the input sequence
- **Decode Machine**: Runs the Decode nodes, which handle autoregressive decoding and communicate with the Prefill machine over the RDMA network

### 3.2 Test Scenarios and Parallelism Configuration

This chapter demonstrates the **P:TP1DP8EP8|D:TP1DP8EP8** cross-machine configuration (16 GPUs total):
- **Tensor Parallelism (TP)**: 1
- **Data Parallelism (DP)**: 8. Each machine runs one instance per GPU, for a total of 8 Prefill instances and 8 Decode instances
- **Expert Parallelism (EP)**: Enabled. The experts of each MoE layer are distributed across the 8 GPUs for parallel computation

**To test other cross-machine parallelism configurations, adjust parameters as follows:**
1. **Inter-Machine Communication**: Ensure RDMA network connectivity between the machines; the Prefill machine needs the `KVCACHE_RDMA_NICS` environment variable configured
2. **Router Address**: The `--router` parameter on the Decode machine must point to the actual IP address of the Prefill machine
3. **Port Configuration**: The number of ports in the `--ports` list must match `--num-servers` and `--data-parallel-size`
4. **GPU Visibility**: Each machine specifies its local GPUs via `CUDA_VISIBLE_DEVICES`
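
The port-configuration rule in item 3 is easy to break when editing the scripts by hand. A small pre-flight check, sketched below, can catch a mismatch before the servers start. `check_ports` is a hypothetical helper for illustration and is not provided by FastDeploy.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check: the number of entries in --ports must equal
# both --num-servers and --data-parallel-size.
# Usage: check_ports PORT_LIST NUM_SERVERS DP
check_ports() {
    local ports=$1 num_servers=$2 dp=$3
    local -a list
    # Split the comma-separated port list into an array.
    IFS=',' read -r -a list <<< "$ports"
    if [ "${#list[@]}" -ne "$num_servers" ] || [ "$num_servers" -ne "$dp" ]; then
        echo "mismatch: ${#list[@]} ports, num-servers=$num_servers, DP=$dp" >&2
        return 1
    fi
}

check_ports "8198,8199,8200,8201,8202,8203,8204,8205" 8 8 && echo "port configuration OK"
```

Running the check with the values used in the startup scripts of this chapter succeeds; dropping a port from the list or changing only one of the two counts makes it return a non-zero status.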

### 3.3 Prefill Machine Startup Scripts

#### Start Router

```bash
# Unset the HTTP(S) proxy variables before launching the Router
unset http_proxy && unset https_proxy

python -m fastdeploy.router.launch \
    --port 8109 \
    --splitwise
```

#### Start Prefill Nodes

```bash
# 8 Prefill instances, one per GPU (TP1, DP8, EP enabled)
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

python -m fastdeploy.entrypoints.openai.multi_api_server \
    --ports 8198,8199,8200,8201,8202,8203,8204,8205 \
    --num-servers 8 \
    --args --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
    --splitwise-role "prefill" \
    --cache-transfer-protocol "rdma,ipc" \
    --router "<ROUTER_MACHINE_IP>:8109" \
    --quantization wint4 \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --max-model-len 8192 \
    --max-num-seqs 64
```

### 3.4 Decode Machine Startup Scripts

#### Start Decode Nodes

```bash
# 8 Decode instances, one per GPU (TP1, DP8, EP enabled)
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

python -m fastdeploy.entrypoints.openai.multi_api_server \
    --ports 8198,8199,8200,8201,8202,8203,8204,8205 \
    --num-servers 8 \
    --args --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
    --splitwise-role "decode" \
    --cache-transfer-protocol "rdma,ipc" \
    --router "<PREFILL_MACHINE_IP>:8109" \
    --quantization wint4 \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --max-model-len 8192 \
    --max-num-seqs 64
```

**Note**: Replace `<PREFILL_MACHINE_IP>` with the actual IP address of the Prefill machine. Because the Router runs on the Prefill machine, `<ROUTER_MACHINE_IP>` in section 3.3 refers to the same address.
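
Leaving a placeholder such as `<PREFILL_MACHINE_IP>` in the command is an easy mistake to make. The sketch below builds the `--router` value from a variable and refuses an empty or unreplaced placeholder. `router_addr` is a hypothetical helper, and `10.0.0.5` is an example address chosen for illustration, not a value from this guide.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the value for --router from the Prefill machine IP.
# Usage: router_addr IP [PORT]   (PORT defaults to 8109, the Router port used above)
router_addr() {
    local ip=$1 port=${2:-8109}
    case "$ip" in
        ""|*"<"*|*">"*)
            # Empty value or an unreplaced <PLACEHOLDER>.
            echo "error: set the actual IP address of the Prefill machine" >&2
            return 1
            ;;
    esac
    echo "${ip}:${port}"
}

# Example address only; substitute the IP of your own Prefill machine.
ROUTER=$(router_addr "10.0.0.5") && echo "--router \"$ROUTER\""
```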

## 4. Sending Test Requests

Send requests to the Router (port 8109 in the examples above). Run the command on the machine that hosts the Router, or replace `localhost` with that machine's IP address.

```bash
curl -X POST "http://localhost:8109/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Hello, please introduce yourself."}
        ],
        "max_tokens": 100,
        "stream": false
    }'
```
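
For repeated tests with different prompts, the request body can be generated rather than edited by hand. This is a minimal sketch: `chat_payload`, `send_chat`, and the `ROUTER_URL` variable are hypothetical names introduced here, and the prompt is assumed to contain no characters that need JSON escaping (such as double quotes or backslashes).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the JSON body for a single-turn chat request.
# Usage: chat_payload PROMPT [MAX_TOKENS]
# Assumption: PROMPT contains no characters that require JSON escaping.
chat_payload() {
    local prompt=$1 max_tokens=${2:-100}
    printf '{"messages": [{"role": "user", "content": "%s"}], "max_tokens": %d, "stream": false}\n' \
        "$prompt" "$max_tokens"
}

# Hypothetical wrapper: POST the payload to the Router endpoint shown above.
# ROUTER_URL is an assumed variable; it defaults to the local Router.
send_chat() {
    local router=${ROUTER_URL:-http://localhost:8109}
    curl -sS -X POST "$router/v1/chat/completions" \
        -H "Content-Type: application/json" \
        -d "$(chat_payload "$@")"
}

# Print the payload that would be sent, without contacting the server.
chat_payload "Hello, please introduce yourself." 100
```

Calling `send_chat "Hello, please introduce yourself." 100` sends the same request as the `curl` command above.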

## 5. Frequently Asked Questions (FAQ)

If you encounter issues during use, please refer to the [FAQ](./FAQ.md) for solutions.

docs/best_practices/README.md

Lines changed: 1 addition & 0 deletions
@@ -9,3 +9,4 @@
 - [ERNIE-4.5-VL-28B-A3B-Paddle](ERNIE-4.5-VL-28B-A3B-Paddle.md)
 - [ERNIE-4.5-VL-424B-A47B-Paddle](ERNIE-4.5-VL-424B-A47B-Paddle.md)
 - [PaddleOCR-VL-0.9B.md](PaddleOCR-VL-0.9B.md)
+- [Disaggregated.md](Disaggregated.md)

docs/features/disaggregated.md

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
 [简体中文](../zh/features/disaggregated.md)
 
+[Best Practice](../best_practices/Disaggregated.md)
+
 # Disaggregated Deployment
 
 Large Language Model (LLM) inference is divided into two phases: **Prefill** and **Decode**, which are compute-intensive and memory-bound, respectively.
