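PD Disaggregated Deployment Best Practices

This document describes FastDeploy's PD (Prefill-Decode) disaggregated deployment, covering single-machine and cross-machine modes, with support for tensor parallelism (TP), data parallelism (DP), and expert parallelism (EP).

1. Deployment Overview and Environment Setup

This guide uses the ERNIE-4.5-300B-A47B-Paddle model as its example, running on H100 80GB GPUs. The tables below list the minimum number of GPUs required for each deployment mode: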

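Single-machine deployment (one 8-GPU node)

Configuration	TP	DP	EP	GPUs required
P: TP4DP1 D: TP4DP1	4	1	-	8
P: TP1DP4EP4 D: TP1DP4EP4	1	4	✓	8

Multi-machine deployment (16 GPUs across nodes)

Configuration	TP	DP	EP	GPUs required
P: TP8DP1 D: TP8DP1	8	1	-	16
P: TP4DP2 D: TP4DP2	4	2	-	16
P: TP1DP8EP8 D: TP1DP8EP8	1	8	✓	16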

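Important notes:

Quantization: all configurations above use WINT4 quantization, specified with --quantization wint4
EP restriction: with expert parallelism (EP) enabled, only TP=1 is currently supported; multi-TP setups are not yet supported
Cross-machine network: cross-machine deployment relies on an RDMA network for high-speed KV Cache transfer
GPU count: total GPUs = TP × DP × 2 (the Prefill and Decode instances use the same configuration)
CUDA Graph capture: Decode instances enable CUDA Graph capture by default to accelerate inference; Prefill instances do not enable it by default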
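
The GPU-count rule and the EP restriction above can be checked with a few lines of Python. This is only a sketch of the arithmetic stated in the notes; the function name required_gpus is hypothetical and is not part of FastDeploy.

```python
def required_gpus(tp: int, dp: int, ep: bool = False) -> int:
    """GPUs needed for one Prefill plus one Decode instance that share TP/DP.

    Hypothetical helper: it only encodes the rule "total = TP x DP x 2"
    and the "EP requires TP=1" restriction from the notes above.
    """
    if tp < 1 or dp < 1:
        raise ValueError("tp and dp must be positive integers")
    if ep and tp != 1:
        raise ValueError("expert parallelism currently requires TP=1")
    return tp * dp * 2


# The five configurations listed in the tables above.
for tp, dp, ep in [(4, 1, False), (1, 4, True), (8, 1, False), (4, 2, False), (1, 8, True)]:
    label = f"TP{tp}DP{dp}" + (f"EP{dp}" if ep else "")
    print(f"P/D {label}: {required_gpus(tp, dp, ep)} GPUs")
```

Rejecting invalid combinations up front (for example EP with TP=4) is cheaper than discovering them after a long model load.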

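1.1 Install FastDeploy

Follow the FastDeploy installation guide to set up the environment.

To download the model, see the list of supported models.

1.2 Deployment Topology

Single-machine topology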

┌──────────────────────────────┐
│  Single node, 8×H100 80GB    │
│  ┌──────────────┐            │
│  │  Router      │            │
│  │  0.0.0.0:8109│            │
│  └──────────────┘            │
│         │                    │
│    ┌────┴────┐               │
│    ▼         ▼               │
│ ┌─────────┐  ┌─────────┐     │
│ │Prefill  │  │Decode   │     │
│ │GPU 0-3  │  │GPU 4-7  │     │
│ └─────────┘  └─────────┘     │
└──────────────────────────────┘

Cross-machine topology

┌─────────────────────┐                      ┌─────────────────────┐
│   Prefill Machine   │      RDMA Network    │   Decode Machine    │
│   8×H100 80GB       │◄────────────────────►│   8×H100 80GB       │
│                     │                      │                     │
│  ┌──────────────┐   │                      │                     │
│  │  Router      │   │                      │                     │
│  │ 0.0.0.0:8109 │───┼──────────────────────┼──────────           │
│  └──────────────┘   │                      │         │           │
│         │           │                      │         │           │
│         ▼           │                      │         ▼           │
│  ┌──────────────┐   │                      │  ┌──────────────┐   │
│  │Prefill Nodes │   │                      │  │Decode Nodes  │   │
│  │GPU 0-7       │   │                      │  │GPU 0-7       │   │
│  └──────────────┘   │                      │  └──────────────┘   │
└─────────────────────┘                      └─────────────────────┘

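2. Single-Machine PD Disaggregated Deployment

2.1 Test Scenario and Parallelism Configuration

This section demonstrates the P: TP4DP1 | D: TP4DP1 configuration:

Tensor parallelism (TP): 4. Each group of 4 GPUs jointly holds one full copy of the model parameters
Data parallelism (DP): 1. Each instance runs a single data-parallel replica
Expert parallelism (EP): disabled

To test other parallelism configurations, adjust the parameters as follows:

TP: change --tensor-parallel-size
DP: change --data-parallel-size, and keep --ports and --num-servers consistent with DP
EP: add or remove --enable-expert-parallel
GPU assignment: use CUDA_VISIBLE_DEVICES to control which GPUs the Prefill and Decode instances use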
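
When changing TP or DP on a single machine, the two CUDA_VISIBLE_DEVICES values have to be recomputed so that the Prefill and Decode instances do not overlap. A minimal sketch follows, assuming GPUs are numbered contiguously and that Prefill takes the lower indices, as in the scripts in this guide; split_gpus is a hypothetical helper name.

```python
from typing import Tuple


def split_gpus(tp: int, dp: int, first_gpu: int = 0) -> Tuple[str, str]:
    """Return (prefill_devices, decode_devices) as CUDA_VISIBLE_DEVICES strings.

    Assumption: each instance needs tp * dp GPUs, and Prefill is given the
    lower-numbered block. This mirrors the layout used in this guide only.
    """
    if tp < 1 or dp < 1:
        raise ValueError("tp and dp must be positive integers")
    per_instance = tp * dp
    prefill = range(first_gpu, first_gpu + per_instance)
    decode = range(first_gpu + per_instance, first_gpu + 2 * per_instance)
    return ",".join(map(str, prefill)), ",".join(map(str, decode))


prefill_devices, decode_devices = split_gpus(tp=4, dp=1)
print(f"Prefill: export CUDA_VISIBLE_DEVICES={prefill_devices}")
print(f"Decode:  export CUDA_VISIBLE_DEVICES={decode_devices}")
```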

2.2 Launch Scripts

Start the Router

python -m fastdeploy.router.launch \
    --port 8109 \
    --splitwise

Note: this example uses the Python router. A higher-performance Golang router is also available if needed.

Start the Prefill node

export CUDA_VISIBLE_DEVICES=0,1,2,3

python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
    --port 8188 \
    --splitwise-role "prefill" \
    --cache-transfer-protocol "rdma,ipc" \
    --router "0.0.0.0:8109" \
    --quantization wint4 \
    --tensor-parallel-size 4 \
    --data-parallel-size 1 \
    --max-model-len 8192 \
    --max-num-seqs 64

Start the Decode node

export CUDA_VISIBLE_DEVICES=4,5,6,7

python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
    --port 8200 \
    --splitwise-role "decode" \
    --cache-transfer-protocol "rdma,ipc" \
    --router "0.0.0.0:8109" \
    --quantization wint4 \
    --tensor-parallel-size 4 \
    --data-parallel-size 1 \
    --max-model-len 8192 \
    --max-num-seqs 64
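
Loading a model of this size takes a while, so it can help to wait until the service ports accept connections before sending traffic. The sketch below uses only a plain TCP connect from the Python standard library. An open port is assumed to mean only that the process is listening, not that the model is fully ready, and wait_for_ports is a hypothetical helper name.

```python
import socket
import time
from typing import Iterable, List


def port_is_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def wait_for_ports(host: str, ports: Iterable[int],
                   deadline_s: float = 600.0, interval_s: float = 5.0) -> List[int]:
    """Poll until every port is open or the deadline passes.

    Returns the ports that are still closed (an empty list means all are open).
    """
    pending = list(ports)
    end = time.monotonic() + deadline_s
    while pending:
        pending = [p for p in pending if not port_is_open(host, p)]
        if not pending or time.monotonic() >= end:
            break
        time.sleep(interval_s)
    return pending


# Example call (not executed here): 8109 is the Router port and 8188 the
# Prefill port from the scripts above; add the Decode port(s) you launched.
# still_closed = wait_for_ports("127.0.0.1", [8109, 8188])
```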

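2.3 Key Parameters

Parameter	Description
`--splitwise`	Enables PD disaggregated mode (Router option)
`--splitwise-role`	Node role: `prefill` or `decode`
`--cache-transfer-protocol`	KV Cache transfer protocol: `rdma`, `ipc`, or a comma-separated list such as `rdma,ipc`
`--router`	Router service address
`--quantization`	Quantization strategy (wint4, wint8, fp8, etc.)
`--tensor-parallel-size`	Tensor parallel degree (TP)
`--data-parallel-size`	Data parallel degree (DP)
`--max-model-len`	Maximum sequence length
`--max-num-seqs`	Maximum number of concurrent sequences
`--num-gpu-blocks-override`	Override for the number of GPU KV Cache blocks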

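3. Cross-Machine PD Disaggregated Deployment

3.1 How It Works

Cross-machine PD disaggregation places the Prefill and Decode instances on different physical machines:

Prefill machine: runs the Router and the Prefill nodes, and handles prefill computation over the input sequence
Decode machine: runs the Decode nodes, communicates with the Prefill machine over the RDMA network, and handles autoregressive decoding

3.2 Test Scenario and Parallelism Configuration

This section demonstrates the P: TP1DP8EP8 | D: TP1DP8EP8 cross-machine configuration (16 GPUs in total):

Tensor parallelism (TP): 1
Data parallelism (DP): 8. Each machine has 8 GPUs, giving 8 Prefill instances and 8 Decode instances
Expert parallelism (EP): enabled. The experts of each MoE layer are distributed across the 8 GPUs and computed in parallel

To test other cross-machine parallelism configurations, adjust the parameters as follows:

Inter-machine communication: make sure the RDMA network between the two machines is reachable; the Prefill machine must set the KVCACHE_RDMA_NICS environment variable
Router address: the --router parameter on the Decode machine must point to the actual IP address of the Prefill machine
Port configuration: the number of ports in the --ports list must match --num-servers and --data-parallel-size
GPU visibility: on each machine, use CUDA_VISIBLE_DEVICES to select the local GPUs to use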
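
The port rule above (the number of entries in --ports must equal --num-servers and --data-parallel-size) is easy to get wrong when editing scripts by hand. A small sketch of generating and checking the values follows. Both function names are hypothetical, and using consecutive port numbers is an assumption that merely matches the scripts in this guide.

```python
def make_ports(base_port: int, dp: int) -> str:
    """Build a comma-separated list of dp consecutive ports for --ports."""
    if dp < 1:
        raise ValueError("dp must be a positive integer")
    return ",".join(str(base_port + i) for i in range(dp))


def check_multi_server_args(ports: str, num_servers: int, dp: int) -> None:
    """Raise ValueError unless len(--ports) == --num-servers == --data-parallel-size."""
    count = len([p for p in ports.split(",") if p.strip()])
    if not (count == num_servers == dp):
        raise ValueError(
            f"mismatch: {count} ports, --num-servers {num_servers}, "
            f"--data-parallel-size {dp}"
        )


ports = make_ports(8198, 8)
check_multi_server_args(ports, num_servers=8, dp=8)
print(f"--ports {ports} --num-servers 8")
```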

3.3 Prefill Machine Launch Scripts

Start the Router

unset http_proxy && unset https_proxy

python -m fastdeploy.router.launch \
    --port 8109 \
    --splitwise

Start the Prefill node

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

python -m fastdeploy.entrypoints.openai.multi_api_server \
    --ports 8198,8199,8200,8201,8202,8203,8204,8205 \
    --num-servers 8 \
    --args --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
    --splitwise-role "prefill" \
    --cache-transfer-protocol "rdma,ipc" \
    --router "<ROUTER_MACHINE_IP>:8109" \
    --quantization wint4 \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --max-model-len 8192 \
    --max-num-seqs 64

3.4 Decode Machine Launch Script

Start the Decode node

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

python -m fastdeploy.entrypoints.openai.multi_api_server \
    --ports 8198,8199,8200,8201,8202,8203,8204,8205 \
    --num-servers 8 \
    --args --model /path/to/ERNIE-4.5-300B-A47B-Paddle \
    --splitwise-role "decode" \
    --cache-transfer-protocol "rdma,ipc" \
    --router "<PREFILL_MACHINE_IP>:8109" \
    --quantization wint4 \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --enable-expert-parallel \
    --max-model-len 8192 \
    --max-num-seqs 64

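Note: replace <PREFILL_MACHINE_IP> with the actual IP address of the Prefill machine. Because the Router runs on the Prefill machine, <ROUTER_MACHINE_IP> in the Prefill script refers to the same address.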
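
Since the Prefill and Decode commands differ only in the role, it can be convenient to generate both from one place. The sketch below assembles the same argument list as the two scripts above and prints it as a shell command. build_launch_argv is a hypothetical helper, the address 10.0.0.1 is a placeholder, and only flags that already appear in this guide are used.

```python
import shlex
from typing import List, Sequence


def build_launch_argv(role: str, router_ip: str, ports: Sequence[int],
                      model_path: str, router_port: int = 8109) -> List[str]:
    """Assemble the multi_api_server command for the TP1 / DP=len(ports) / EP case."""
    if role not in ("prefill", "decode"):
        raise ValueError("role must be 'prefill' or 'decode'")
    dp = len(ports)
    return [
        "python", "-m", "fastdeploy.entrypoints.openai.multi_api_server",
        "--ports", ",".join(str(p) for p in ports),
        "--num-servers", str(dp),
        "--args",
        "--model", model_path,
        "--splitwise-role", role,
        "--cache-transfer-protocol", "rdma,ipc",
        "--router", f"{router_ip}:{router_port}",
        "--quantization", "wint4",
        "--tensor-parallel-size", "1",
        "--data-parallel-size", str(dp),
        "--enable-expert-parallel",
        "--max-model-len", "8192",
        "--max-num-seqs", "64",
    ]


# 10.0.0.1 is a placeholder for the Prefill machine's real IP address.
argv = build_launch_argv("decode", "10.0.0.1", range(8198, 8206),
                         "/path/to/ERNIE-4.5-300B-A47B-Paddle")
print(shlex.join(argv))
```

Deriving --num-servers and --data-parallel-size from the port list keeps the three values consistent by construction.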

4. Send a Test Request

curl -X POST "http://localhost:8109/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
  "messages": [
    {"role": "user", "content": "Hello, please introduce yourself."}
  ],
  "max_tokens": 100,
  "stream": false
}'
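
The same request can be sent from Python with only the standard library. This is a sketch that mirrors the curl call above: the function names are hypothetical, and the response is returned as parsed JSON without assuming any particular fields.

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str,
                       max_tokens: int = 100) -> urllib.request.Request:
    """Build a POST request for the Router's /v1/chat/completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(base_url: str, prompt: str, timeout_s: float = 120.0) -> dict:
    """Send the request and return the decoded JSON body."""
    request = build_chat_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (requires a running Router on port 8109):
# reply = send_chat_request("http://localhost:8109", "Hello, please introduce yourself.")
# print(json.dumps(reply, ensure_ascii=False, indent=2))
```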

5. FAQ

If you run into problems, see the FAQ for solutions.
