MAD/scripts/sglang_disagg/README.MD at develop · ROCm/MAD

List of Models - focus SGLang Disaggregated P/D inference

Dense Models

Qwen3-32B (https://huggingface.co/Qwen/Qwen3-32B)
meta-llama/Llama-3.1-8B-Instruct (https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
amd/Llama-3.3-70B-Instruct-FP8-KV (https://huggingface.co/amd/Llama-3.3-70B-Instruct-FP8-KV)
amd/Llama-3.1-405B-Instruct-FP8-KV (https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV)

MoE Models

DeepSeek-V3 (https://huggingface.co/deepseek-ai/DeepSeek-V3)
DeepSeek-R1 (https://huggingface.co/deepseek-ai/DeepSeek-R1)
Mixtral-8x7B-v0.1 (https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)

This directory contains scripts and documentation for launching P/D (prefill/decode) disaggregated inference for the models listed above. It covers setup instructions, node assignment details, and benchmarking commands.

📝 Prerequisites

A Slurm cluster with the required nodes: xP prefill nodes + yD decode nodes (minimum of 2 nodes: xP=1 and yD=1)
A Docker image with SGLang, MoRI, and NIC drivers built in. Refer to the "Building the Docker image" section below.
Access to a shared filesystem for log collection (cluster-specific)

Building the Docker image

Access the Dockerfile located at docker/sglang_disagg_inference.ubuntu.amd.Dockerfile. It uses lmsysorg/sglang:v0.5.11-rocm700-mi30x as the base docker image.

docker build -t sglang_disagg_pd_image -f sglang_disagg_inference.ubuntu.amd.Dockerfile .

Scripts

| File | Description |
|------|-------------|
| `run_xPyD_models.slurm` | SLURM script to launch docker containers on all nodes via `sbatch` |
| `sglang_disagg_mori_io_ep.sh` | Container entrypoint: starts prefill/decode servers, proxy, and benchmark |
| `models.yaml` | Model-specific CLI flags for all supported models |
| `mori_ep_env.sh` | RDMA/NCCL/Gloo environment variables |
| `benchmark_xPyD.sh` | Concurrency sweep benchmark using sglang bench_serving |
| `benchmark_parser.py` | Log parser for CONCURRENCY benchmark logs |

Quick Start

git clone https://github.com/ROCm/MAD.git
cd MAD/scripts/sglang_disagg

export DOCKER_IMAGE_NAME=<DOCKER_IMAGE_NAME>
export xP=1
export yD=1
export MODEL_NAME=Llama-3.1-8B-Instruct
export RUN_MORI=1  # MoRI (default). Set RUN_MORI=0 for Mooncake/RIXLE (KV_TRANSFER_BACKEND=mooncake)

# num_nodes = xP + yD
sbatch -N 2 -n 2 --nodelist=<node1,node2> run_xPyD_models.slurm
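
The node count passed to `sbatch` has to match `xP + yD`. The following is a minimal sketch that derives it from the two values instead of hard-coding it. The helper name `pd_num_nodes` is hypothetical and not part of this repository, and the script only prints the `sbatch` command rather than submitting it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): derive the Slurm node
# count from the prefill (xP) and decode (yD) node counts.
pd_num_nodes() {
  local xp="${1:?prefill node count required}"
  local yd="${2:?decode node count required}"
  # The prerequisites call for at least one prefill and one decode node.
  if [ "$xp" -lt 1 ] || [ "$yd" -lt 1 ]; then
    echo "xP and yD must both be >= 1" >&2
    return 1
  fi
  echo $((xp + yd))
}

# Print (rather than submit) the sbatch command for the current settings.
# Falls back to xP=1 and yD=1 when the variables are not exported.
NUM_NODES="$(pd_num_nodes "${xP:-1}" "${yD:-1}")"
echo "sbatch -N ${NUM_NODES} -n ${NUM_NODES} run_xPyD_models.slurm"
```

Add `--nodelist=<...>` to the printed command if the job must run on specific nodes.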

Log Files

Logs are written to ${LOG_PATH}/${SLURM_JOB_ID}/:

| File | Description |
|------|-------------|
| `pd_sglang_bench_serving.sh_NODE<N>.log` | Main per-node log |
| `prefill_NODE<N>.log` | Prefill server log |
| `decode_NODE<N>.log` | Decode server log |
| `benchmark_*_CONCURRENCY.log` / `.csv` | Benchmark results |

Benchmarking

Parse benchmark results:

python3 benchmark_parser.py <log_path>/benchmark_XXX_CONCURRENCY.log

Smoke test from the proxy node:

curl -X POST http://127.0.0.1:2322/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Let me tell you a story", "sampling_params": {"temperature": 0.3}}'
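
The same smoke test can be wrapped so that it reports success or failure through its exit code. This is a sketch only: the function names are hypothetical, and treating HTTP 200 as success is an assumption about the proxy rather than something stated in this README.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the JSON body used by the smoke test.
# $1: prompt text, $2: sampling temperature.
# Note: the prompt is inserted as-is, so it must not contain double quotes.
pd_smoke_payload() {
  printf '{"text": "%s", "sampling_params": {"temperature": %s}}' "$1" "$2"
}

# Hypothetical helper: POST the payload and succeed only on HTTP 200
# (assumed success code). $1: endpoint URL, defaulting to the one above.
pd_smoke_test() {
  local url="${1:-http://127.0.0.1:2322/generate}"
  local status
  status="$(curl -sS -o /dev/null -w '%{http_code}' --max-time 60 \
    -X POST "$url" \
    -H 'Content-Type: application/json' \
    -d "$(pd_smoke_payload 'Let me tell you a story' 0.3)")" || return 1
  [ "$status" = "200" ]
}
```

On the proxy node, `pd_smoke_test && echo "proxy OK"` prints the message only when the request succeeds.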

Known Issues

For larger models such as DeepSeek-V3 and Llama-3.1-405B-Instruct-FP8-KV, at higher concurrency (512+), errors with the following signature are observed:
'<TransferEncodingError: 400, message:\n Not enough data to satisfy transfer length header.\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n '
This leads to dropped requests and lower throughput. This issue is being discussed on the SGLang forums.
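
To check whether a run was affected, the job's log directory can be scanned for the error name. A minimal sketch follows, with a hypothetical function name. Which log file records the error is not stated here, so it searches every file under the directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each file under a log directory, print
# "<file>:<count>" where count is the number of lines mentioning
# TransferEncodingError. Files with no matches are omitted.
# Returns non-zero when no file contains the error.
count_transfer_errors() {
  local log_dir="${1:?log directory required}"
  # -r: recurse into the directory, -c: print a per-file match count.
  grep -rc "TransferEncodingError" "$log_dir" | grep -v ':0$'
}
```

For example, `count_transfer_errors "${LOG_PATH}/${SLURM_JOB_ID}"` lists the affected log files for the current job.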