List of Models - focus SGLang Disaggregated P/D inference

Dense Models

MoE Models

This repository contains scripts and documentation to launch PD Disaggregation for the above models. You will find setup instructions, node assignment details, and benchmarking commands.

📝 Prerequisites

  • A Slurm cluster with xP + yD nodes, where xP is the number of prefill nodes and yD the number of decode nodes (minimum size 2: xP=1 and yD=1)
  • Docker container with SGLang, MoRI, and NIC drivers built in. Refer to the "Building the Docker image" section below.
  • Access to a shared filesystem for log collection (cluster-specific)

Building the Docker image

The Dockerfile is located at docker/sglang_disagg_inference.ubuntu.amd.Dockerfile. It uses lmsysorg/sglang:v0.5.11-rocm700-mi30x as the base Docker image. Run the build from the docker/ directory:

docker build -t sglang_disagg_pd_image -f sglang_disagg_inference.ubuntu.amd.Dockerfile .

Scripts

| File | Description |
|------|-------------|
| run_xPyD_models.slurm | SLURM script to launch Docker containers on all nodes via sbatch |
| sglang_disagg_mori_io_ep.sh | Container entrypoint: starts prefill/decode servers, proxy, and benchmark |
| models.yaml | Model-specific CLI flags for all supported models |
| mori_ep_env.sh | RDMA/NCCL/Gloo environment variables |
| benchmark_xPyD.sh | Concurrency sweep benchmark using sglang bench_serving |
| benchmark_parser.py | Log parser for CONCURRENCY benchmark logs |

Quick Start

git clone https://github.com/ROCm/MAD.git
cd MAD/scripts/sglang_disagg

export DOCKER_IMAGE_NAME=<DOCKER_IMAGE_NAME>
export xP=1
export yD=1
export MODEL_NAME=Llama-3.1-8B-Instruct
export RUN_MORI=1  # MoRI (default). Set RUN_MORI=0 for Mooncake/RIXLE (KV_TRANSFER_BACKEND=mooncake)

# num_nodes = xP + yD
sbatch -N 2 -n 2 --nodelist=<node1,node2> run_xPyD_models.slurm
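
Since the node count must equal xP + yD, it can help to derive it rather than hard-code it. The following is a minimal sketch, not part of the repository: `NODELIST` is a hypothetical variable introduced here for illustration, and the script only prints the `sbatch` command so it can be reviewed before submission.

```bash
#!/usr/bin/env bash
# Sketch: keep the sbatch node count in sync with xP and yD.
set -euo pipefail

xP="${xP:-1}"   # number of prefill nodes (defaults to the Quick Start value)
yD="${yD:-1}"   # number of decode nodes (defaults to the Quick Start value)

# num_nodes = xP + yD
NUM_NODES=$(( xP + yD ))

# NODELIST is an assumption of this sketch; replace it with your cluster's node names.
NODELIST="${NODELIST:-node1,node2}"

# Build the same command as in Quick Start, with -N and -n derived from xP and yD.
SBATCH_CMD="sbatch -N ${NUM_NODES} -n ${NUM_NODES} --nodelist=${NODELIST} run_xPyD_models.slurm"

# Print instead of executing, so the command can be checked first.
echo "${SBATCH_CMD}"
```

To submit the job, run the printed command from the directory that contains run_xPyD_models.slurm.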

Log Files

Logs are written to ${LOG_PATH}/${SLURM_JOB_ID}/:

| File | Description |
|------|-------------|
| pd_sglang_bench_serving.sh_NODE<N>.log | Main per-node log |
| prefill_NODE<N>.log | Prefill server log |
| decode_NODE<N>.log | Decode server log |
| benchmark_*_CONCURRENCY.log / .csv | Benchmark results |

Benchmarking

Parse benchmark results:

python3 benchmark_parser.py <log_path>/benchmark_XXX_CONCURRENCY.log
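
When a job produces several benchmark logs, the parser can be run over each of them in a loop. The sketch below assumes the logs follow the `benchmark_*_CONCURRENCY.log` naming shown in the Log Files section and that `benchmark_parser.py` is in the current directory; the function name `parse_all_benchmarks` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: run the log parser on every benchmark log in one job directory.
set -euo pipefail

# Usage: parse_all_benchmarks <log_dir> [parser command ...]
# If no parser command is given, "python3 benchmark_parser.py" is used.
parse_all_benchmarks() {
  local log_dir="$1"
  shift
  local -a parser
  if [ "$#" -gt 0 ]; then
    parser=("$@")
  else
    parser=(python3 benchmark_parser.py)
  fi

  local f
  for f in "${log_dir}"/benchmark_*_CONCURRENCY.log; do
    # Skip the unexpanded pattern when the glob matches nothing.
    [ -e "${f}" ] || continue
    "${parser[@]}" "${f}"
  done
}
```

For example, `parse_all_benchmarks "${LOG_PATH}/${SLURM_JOB_ID}"` parses every benchmark log of one job.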

Smoke test from the proxy node:

curl -X POST http://127.0.0.1:2322/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Let me tell you a story", "sampling_params": {"temperature": 0.3}}'
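
The servers may take a while to load the model, so the smoke test can fail simply because the proxy is not up yet. The following is a minimal sketch of a retry wrapper around the same request; the function name `wait_for_proxy`, the attempt count, the delay, and the per-request timeout are assumptions of this sketch, and it only checks for an HTTP success status, not the response body.

```bash
#!/usr/bin/env bash
# Sketch: retry the smoke-test request until the proxy answers.
set -euo pipefail

# Usage: wait_for_proxy [host] [port] [attempts] [delay_seconds]
wait_for_proxy() {
  local host="${1:-127.0.0.1}"   # proxy address from the smoke test
  local port="${2:-2322}"        # proxy port from the smoke test
  local attempts="${3:-30}"      # assumed default
  local delay="${4:-10}"         # assumed default, in seconds
  local payload='{"text": "Let me tell you a story", "sampling_params": {"temperature": 0.3}}'

  local i
  for (( i = 1; i <= attempts; i++ )); do
    # -s: silent, -f: treat HTTP errors as failures, --max-time: bound each request
    if curl -sf --max-time 60 -X POST "http://${host}:${port}/generate" \
         -H "Content-Type: application/json" \
         -d "${payload}" > /dev/null; then
      echo "proxy ready after ${i} attempt(s)"
      return 0
    fi
    sleep "${delay}"
  done

  echo "proxy not ready after ${attempts} attempt(s)" >&2
  return 1
}
```

Calling `wait_for_proxy` with no arguments on the proxy node polls 127.0.0.1:2322 with the default settings.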

Known Issues

For larger models such as DeepSeekV3 and Llama-3.1-405B-Instruct-FP8-KV at higher concurrency (512+), errors with the following signature are observed:
'<TransferEncodingError: 400, message:\n Not enough data to satisfy transfer length header.\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n '
This leads to dropped requests and lower throughput. The issue is being discussed on the SGLang forums.
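
To check whether a run was affected, the job's logs can be searched for the error signature. This is a minimal sketch under the assumption that the error text appears in one of the files under the job's log directory; which log records it is not stated here, so the search covers all of them. The function name `count_transfer_errors` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: count lines containing the TransferEncodingError signature in a log directory.
set -euo pipefail

# Usage: count_transfer_errors <log_dir>
count_transfer_errors() {
  local log_dir="$1"
  # grep exits non-zero when nothing matches; "|| true" keeps that from
  # aborting the script under "set -e" and "pipefail".
  { grep -r -h "TransferEncodingError" "${log_dir}" || true; } | wc -l | tr -d ' '
}
```

For example, `count_transfer_errors "${LOG_PATH}/${SLURM_JOB_ID}"` prints the number of matching log lines for one job; a count of 0 means the signature was not found.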