Dense Models
- Qwen3-32B (https://huggingface.co/Qwen/Qwen3-32B)
- meta-llama/Llama-3.1-8B-Instruct (https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- amd/Llama-3.3-70B-Instruct-FP8-KV (https://huggingface.co/amd/Llama-3.3-70B-Instruct-FP8-KV)
- amd/Llama-3.1-405B-Instruct-FP8-KV (https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV)
MoE Models
- DeepSeek-V3 (https://huggingface.co/deepseek-ai/DeepSeek-V3)
- DeepSeek-R1 (https://huggingface.co/deepseek-ai/DeepSeek-R1)
- Mixtral-8x7B-v0.1 (https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)
This repository contains scripts and documentation for launching PD (prefill/decode) disaggregation for the models listed above. It covers setup instructions, node assignment details, and benchmarking commands.
Prerequisites
- A Slurm cluster with the required nodes: xP prefill nodes + yD decode nodes (minimum size 2: xP=1 and yD=1)
- A Docker container with SGLang, MoRI, and NIC drivers built in. Refer to the "Building the Docker image" section below.
- Access to a shared filesystem for log collection (cluster-specific)
Building the Docker image
The Dockerfile is located at docker/sglang_disagg_inference.ubuntu.amd.Dockerfile.
It uses lmsysorg/sglang:v0.5.11-rocm700-mi30x as the base Docker image.
docker build -t sglang_disagg_pd_image -f sglang_disagg_inference.ubuntu.amd.Dockerfile .

| File | Description |
|---|---|
| run_xPyD_models.slurm | SLURM script to launch docker containers on all nodes via sbatch |
| sglang_disagg_mori_io_ep.sh | Container entrypoint: starts prefill/decode servers, proxy, and benchmark |
| models.yaml | Model-specific CLI flags for all supported models |
| mori_ep_env.sh | RDMA/NCCL/Gloo environment variables |
| benchmark_xPyD.sh | Concurrency sweep benchmark using sglang bench_serving |
| benchmark_parser.py | Log parser for CONCURRENCY benchmark logs |
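The prerequisites above tie the node count to xP + yD, and run_xPyD_models.slurm is submitted via sbatch. The following is a minimal sketch of deriving the sbatch arguments from those two values. `build_sbatch_cmd` is a hypothetical helper, not part of this repository, and it assumes one task per node (`-n` equal to `-N`). It only prints the command and does not submit anything.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): print the sbatch command
# for an xP + yD deployment. Assumes one Slurm task per node.
build_sbatch_cmd() {
  local xp="$1" yd="$2" nodelist="${3:-}" v

  # Both counts must be plain non-negative integers.
  for v in "$xp" "$yd"; do
    case "$v" in
      ''|*[!0-9]*)
        echo "xP and yD must be positive integers" >&2
        return 1
        ;;
    esac
  done

  # The minimum deployment is one prefill node and one decode node.
  if [ "$xp" -lt 1 ] || [ "$yd" -lt 1 ]; then
    echo "need at least xP=1 and yD=1" >&2
    return 1
  fi

  # num_nodes = xP + yD (10# forces base 10 even with leading zeros).
  local num_nodes=$(( 10#$xp + 10#$yd ))
  local cmd="sbatch -N ${num_nodes} -n ${num_nodes}"

  # --nodelist is optional; add it only when a node list was given.
  if [ -n "$nodelist" ]; then
    cmd="${cmd} --nodelist=${nodelist}"
  fi

  echo "${cmd} run_xPyD_models.slurm"
}
```

For example, `build_sbatch_cmd 1 1 node1,node2` prints `sbatch -N 2 -n 2 --nodelist=node1,node2 run_xPyD_models.slurm`, which can be reviewed before it is run by hand.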
git clone https://github.com/ROCm/MAD.git
cd MAD/scripts/sglang_disagg
export DOCKER_IMAGE_NAME=<DOCKER_IMAGE_NAME>
export xP=1
export yD=1
export MODEL_NAME=Llama-3.1-8B-Instruct
export RUN_MORI=1 # MoRI (default). Set RUN_MORI=0 for Mooncake/RIXLE (KV_TRANSFER_BACKEND=mooncake)
# num_nodes = xP + yD
sbatch -N 2 -n 2 --nodelist=<node1,node2> run_xPyD_models.slurm

Logs are written to ${LOG_PATH}/${SLURM_JOB_ID}/:

| File | Description |
|---|---|
| pd_sglang_bench_serving.sh_NODE<N>.log | Main per-node log |
| prefill_NODE<N>.log | Prefill server log |
| decode_NODE<N>.log | Decode server log |
| benchmark_*_CONCURRENCY.log / .csv | Benchmark results |
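After a job finishes, it can help to confirm that each log type listed above was actually produced before parsing anything. The sketch below checks a job's log directory against those filename patterns. `check_job_logs` is a hypothetical helper, not part of this repository, and it assumes all nodes write into the same shared directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): report which of the
# documented log types are present in a job's log directory.
check_job_logs() {
  local log_dir="$1" missing=0 pattern

  for pattern in \
    'pd_sglang_bench_serving.sh_NODE*.log' \
    'prefill_NODE*.log' \
    'decode_NODE*.log' \
    'benchmark_*_CONCURRENCY.log'; do
    # compgen -G succeeds only if the glob matches at least one path.
    if compgen -G "${log_dir}/${pattern}" > /dev/null; then
      echo "ok       ${pattern}"
    else
      echo "missing  ${pattern}"
      missing=1
    fi
  done

  # Non-zero exit status when any log type is absent.
  return "$missing"
}
```

A typical call would be `check_job_logs "${LOG_PATH}/${SLURM_JOB_ID}"`, with both variables set to the values used for the run.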
Parse benchmark results:
python3 benchmark_parser.py <log_path>/benchmark_XXX_CONCURRENCY.log

Smoke test from the proxy node:
curl -X POST http://127.0.0.1:2322/generate \
-H "Content-Type: application/json" \
-d '{"text": "Let me tell you a story", "sampling_params": {"temperature": 0.3}}'

For larger models, such as DeepSeek-V3 and Llama-3.1-405B-Instruct-FP8-KV, and at higher concurrency (512+), errors with the following signature are observed:
'<TransferEncodingError: 400, message:\n Not enough data to satisfy transfer length header.\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n '
This leads to dropped requests and lower throughput. The issue is being discussed on the SGLang forums.
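To gauge how often this error occurs in a given run, the job's logs can be scanned for the error name. The sketch below counts matching lines per log file. `count_transfer_errors` is a hypothetical helper, not part of this repository. Which log files record the error is an assumption, so it scans every .log file in the directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): count log lines that
# mention TransferEncodingError, per file and in total.
count_transfer_errors() {
  local log_dir="$1" f n total=0

  for f in "${log_dir}"/*.log; do
    # Skip the unexpanded pattern when the directory has no .log files.
    [ -f "$f" ] || continue

    # grep -c prints 0 and exits non-zero when nothing matches, so the
    # exit status is ignored and only the printed count is used.
    n="$(grep -c 'TransferEncodingError' "$f" || true)"

    if [ "$n" -gt 0 ]; then
      echo "${n} $(basename "$f")"
    fi
    total=$(( total + n ))
  done

  echo "total ${total}"
}
```

Running `count_transfer_errors "${LOG_PATH}/${SLURM_JOB_ID}"` for runs at different concurrency levels gives a rough way to see where the error starts to appear. Note that it counts matching lines, not individual failed requests.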