This document provides a comprehensive guide to FastDeploy's PD (Prefill-Decode) disaggregated deployment solution, covering both single-machine and cross-machine deployment modes with support for Tensor Parallelism (TP), Data Parallelism (DP), and Expert Parallelism (EP).
This guide demonstrates deployment practices using the ERNIE-4.5-300B-A47B-Paddle model on H100 80GB GPUs. Below are the minimum GPU requirements for different deployment configurations:
Single-Machine Deployment (8 GPUs, Single Node)
| Configuration | TP | DP | EP | GPUs Required |
|---|---|---|---|---|
| P:TP4DP1 D:TP4DP1 | 4 | 1 | - | 8 |
| P:TP1DP4EP4 D:TP1DP4EP4 | 1 | 4 | ✓ | 8 |
Multi-Machine Deployment (16 GPUs, Cross-Node)
| Configuration | TP | DP | EP | GPUs Required |
|---|---|---|---|---|
| P:TP8DP1 D:TP8DP1 | 8 | 1 | - | 16 |
| P:TP4DP2 D:TP4DP2 | 4 | 2 | - | 16 |
| P:TP1DP8EP8 D:TP1DP8EP8 | 1 | 8 | ✓ | 16 |
Important Notes:
- Quantization: All configurations above use WINT4 quantization, specified via `--quantization wint4`
- EP Limitations: When Expert Parallelism (EP) is enabled, only TP=1 is currently supported; multi-TP scenarios are not yet available
- Cross-Machine Network: Cross-machine deployment requires RDMA network support for high-speed KV Cache transmission
- GPU Calculation: Total GPUs = TP × DP × 2, with identical configurations for both Prefill and Decode instances
- CUDA Graph Capture: Decode instances enable CUDA Graph capture by default for inference acceleration; Prefill instances do not
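As a quick sanity check, the GPU formula in the notes above can be evaluated directly in the shell (the TP/DP values here are just the single-machine example, not fixed requirements):

```bash
# Total GPUs = TP x DP x 2, since Prefill and Decode use identical configurations.
TP=4   # --tensor-parallel-size
DP=1   # --data-parallel-size
TOTAL_GPUS=$((TP * DP * 2))
echo "Total GPUs required: $TOTAL_GPUS"   # prints: Total GPUs required: 8
```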
Please refer to the FastDeploy Installation Guide to set up your environment.
For model downloads, please check the Supported Models List.
Single-Machine Deployment Topology
```
┌──────────────────────────────┐
│ Single Machine 8×H100 80GB   │
│       ┌──────────────┐       │
│       │    Router    │       │
│       │ 0.0.0.0:8109 │       │
│       └──────────────┘       │
│              │               │
│         ┌────┴────┐          │
│         ▼         ▼          │
│   ┌─────────┐ ┌─────────┐    │
│   │Prefill  │ │Decode   │    │
│   │GPU 0-3  │ │GPU 4-7  │    │
│   └─────────┘ └─────────┘    │
└──────────────────────────────┘
```
Cross-Machine Deployment Topology
```
┌─────────────────────┐                      ┌─────────────────────┐
│  Prefill Machine    │     RDMA Network     │   Decode Machine    │
│    8×H100 80GB      │◄────────────────────►│    8×H100 80GB      │
│                     │                      │                     │
│  ┌──────────────┐   │                      │                     │
│  │    Router    │   │                      │                     │
│  │ 0.0.0.0:8109 │───┼──────────────────────┼──────────┐          │
│  └──────────────┘   │                      │          │          │
│         │           │                      │          │          │
│         ▼           │                      │          ▼          │
│  ┌──────────────┐   │                      │  ┌──────────────┐   │
│  │Prefill Nodes │   │                      │  │Decode Nodes  │   │
│  │GPU 0-7       │   │                      │  │GPU 0-7       │   │
│  └──────────────┘   │                      │  └──────────────┘   │
└─────────────────────┘                      └─────────────────────┘
```
This chapter demonstrates the P:TP4DP1 D:TP4DP1 configuration test scenario:
- Tensor Parallelism (TP): 4 — The model is sharded across 4 GPUs, so each group of 4 GPUs holds one complete copy of the model parameters
- Data Parallelism (DP): 1 — A single data-parallel replica each for Prefill and Decode
- Expert Parallelism (EP): Not enabled
To test other parallelism configurations, adjust parameters as follows:
- TP Adjustment: Modify `--tensor-parallel-size`
- DP Adjustment: Modify `--data-parallel-size`, ensuring `--ports` and `--num-servers` remain consistent with DP
- EP Toggle: Add or remove `--enable-expert-parallel`
- GPU Allocation: Control GPUs used by Prefill and Decode instances via `CUDA_VISIBLE_DEVICES`
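When adjusting DP, the port list has to grow with it. A minimal sketch of deriving a matching `--ports` value (consecutive ports counted up from a base port is an assumed convention here, not something FastDeploy requires):

```bash
# Derive a comma-separated --ports list whose length equals DP.
DP=4
BASE_PORT=8200
PORTS=$(seq -s, "$BASE_PORT" $((BASE_PORT + DP - 1)))
echo "--ports $PORTS --num-servers $DP"
# prints: --ports 8200,8201,8202,8203 --num-servers 4
```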
```bash
python -m fastdeploy.router.launch \
  --port 8109 \
  --splitwise
```

Note: This uses the Python version of the router. If needed, you can also use the high-performance Golang version router.
```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
python -m fastdeploy.entrypoints.openai.api_server \
  --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
  --port 8188 \
  --splitwise-role "prefill" \
  --cache-transfer-protocol "rdma,ipc" \
  --router "0.0.0.0:8109" \
  --quantization wint4 \
  --tensor-parallel-size 4 \
  --data-parallel-size 1 \
  --max-model-len 8192 \
  --max-num-seqs 64
```

```bash
export CUDA_VISIBLE_DEVICES=4,5,6,7
python -m fastdeploy.entrypoints.openai.api_server \
  --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
  --port 8200 \
  --splitwise-role "decode" \
  --cache-transfer-protocol "rdma,ipc" \
  --router "0.0.0.0:8109" \
  --quantization wint4 \
  --tensor-parallel-size 4 \
  --data-parallel-size 1 \
  --max-model-len 8192 \
  --max-num-seqs 64
```

| Parameter | Description |
|---|---|
| `--splitwise` | Enable PD disaggregated mode |
| `--splitwise-role` | Node role: prefill or decode |
| `--cache-transfer-protocol` | KV Cache transfer protocol: rdma or ipc |
| `--router` | Router service address |
| `--quantization` | Quantization strategy (wint4/wint8/fp8, etc.) |
| `--tensor-parallel-size` | Tensor parallelism degree (TP) |
| `--data-parallel-size` | Data parallelism degree (DP) |
| `--max-model-len` | Maximum sequence length |
| `--max-num-seqs` | Maximum concurrent sequences |
| `--num-gpu-blocks-override` | GPU KV Cache block count override |
Cross-machine PD disaggregation deploys Prefill and Decode instances on different physical machines:
- Prefill Machine: Runs the Router and Prefill nodes, responsible for processing input sequence prefill computation
- Decode Machine: Runs Decode nodes, communicates with the Prefill machine via RDMA network, responsible for autoregressive decoding generation
This chapter demonstrates the P:TP1DP8EP8 D:TP1DP8EP8 cross-machine configuration (16 GPUs total):
- Tensor Parallelism (TP): 1
- Data Parallelism (DP): 8 — 8 GPUs per machine, totaling 8 Prefill instances and 8 Decode instances
- Expert Parallelism (EP): Enabled — MoE experts are distributed across 8 GPUs for parallel computation
To test other cross-machine parallelism configurations, adjust parameters as follows:
- Inter-Machine Communication: Ensure RDMA network connectivity between machines; the Prefill machine needs the `KVCACHE_RDMA_NICS` environment variable configured
- Router Address: The `--router` parameter on the Decode machine must point to the actual IP address of the Prefill machine
- Port Configuration: The number of ports in the `--ports` list must match `--num-servers` and `--data-parallel-size`
- GPU Visibility: Each machine specifies its local GPUs via `CUDA_VISIBLE_DEVICES`
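The port/DP consistency rule above can be verified before launch. A small sketch in plain shell, using the values from this chapter's example commands (no FastDeploy involvement):

```bash
# Verify that the --ports list length matches --num-servers and DP.
PORTS="8198,8199,8200,8201,8202,8203,8204,8205"
NUM_SERVERS=8
DP=8
N_PORTS=$(echo "$PORTS" | tr ',' '\n' | wc -l)
if [ "$N_PORTS" -eq "$NUM_SERVERS" ] && [ "$NUM_SERVERS" -eq "$DP" ]; then
  echo "Port configuration is consistent"
else
  echo "Mismatch: $N_PORTS ports, $NUM_SERVERS servers, DP=$DP" >&2
  exit 1
fi
```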
```bash
unset http_proxy && unset https_proxy
python -m fastdeploy.router.launch \
  --port 8109 \
  --splitwise
```

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m fastdeploy.entrypoints.openai.multi_api_server \
  --ports 8198,8199,8200,8201,8202,8203,8204,8205 \
  --num-servers 8 \
  --args --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
  --splitwise-role "prefill" \
  --cache-transfer-protocol "rdma,ipc" \
  --router "<ROUTER_MACHINE_IP>:8109" \
  --quantization wint4 \
  --tensor-parallel-size 1 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 8192 \
  --max-num-seqs 64
```

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python -m fastdeploy.entrypoints.openai.multi_api_server \
  --ports 8198,8199,8200,8201,8202,8203,8204,8205 \
  --num-servers 8 \
  --args --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
  --splitwise-role "decode" \
  --cache-transfer-protocol "rdma,ipc" \
  --router "<PREFILL_MACHINE_IP>:8109" \
  --quantization wint4 \
  --tensor-parallel-size 1 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --max-model-len 8192 \
  --max-num-seqs 64
```

Note: Please replace <PREFILL_MACHINE_IP> with the actual IP address of the Prefill machine.
```bash
curl -X POST "http://localhost:8109/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello, please introduce yourself."}
    ],
    "max_tokens": 100,
    "stream": false
  }'
```

If you encounter issues during use, please refer to the FAQ for solutions.