Commit bb00055
[AMD] vLLM Kimi MXFP4 & MiniMax M2.5 FP8 disaggregated prefill-decode for MI355X (#1569)
* [AMD] Add vLLM disaggregated prefill-decode benchmark for MI355X
Add multi-node vLLM PD disaggregation recipe using Nixl/RIXL KV transfer
and vllm-router, mirroring the existing SGLang disagg recipe structure.
- New benchmark config: dsr1-fp8-mi355x-vllm-disagg (1P2D, TP8)
- New utils: vllm_disagg_utils/ (job.slurm, server.sh, submit.sh, etc.)
- Runner: extend launch_mi355x-amds.sh for vllm-disagg framework
* [AMD] Refactor vLLM disagg recipe: models.yaml, UCX cleanup, QoS support
Extract hardcoded model configurations from server.sh bash maps and
job.slurm VALID_MODELS into a declarative models.yaml, mirroring the
SGLang disagg recipe pattern. Adding a new model now requires no script
changes.
Also:
- Consolidate UCX transport vars in job.slurm Docker env; remove
duplicated setup_ucx_env() from server.sh
- Extract RDMA workarounds (ionic /31 route fix, Nixl UCX patch) into
setup_rdma_env() helper
- Lower UCX_LOG_LEVEL from info to warn
- Add nicctl mount and QoS/DSCP auto-detection to env.sh
- Remove stale host libionic bind-mounts (driver now built into image)
* [AMD] Update vLLM disagg recipe for v0.17.1 NixlConnector API
Adapt server.sh to vLLM v0.17.1 breaking changes:
- Use simplified kv-transfer-config (side channel via env vars instead
of kv_ip/kv_port, add kv_load_failure_policy)
- Remove deprecated --disable-log-requests (disabled by default in v0.17)
- Route NIXL side channel through RDMA IP for correct fabric path
- Fix RIXL ucx_error_handling_mode patch for updated _api.py layout
* [AMD] Make vLLM disagg recipe CI-compatible (mia1 cluster)
bench.sh: replace `vllm bench serve` (log-only output) with the shared
run_benchmark_serving helper from benchmark_lib.sh, matching the SGLang
disagg pattern. This produces the .json result files that the multinode
CI workflow expects (benchmark-multinode-tmpl.yml → process_result.py).
server.sh: make the Nixl ucx_error_handling_mode=none runtime patch
conditional on Pensando ionic RDMA devices (IBDEVICES=*ionic*). On the
mia1 cluster (ConnectX/mlx5, IBDEVICES=rdma*), UCX handles error mode
natively and the patch is skipped.
Model-path resolution and IBDEVICES/UCX/QoS auto-detection were verified
to already work on mia1 — no changes needed.
Tested locally (Job 2802, 1P+2D, ISL/OSL=1024):
conc 8 → 507 tok/s conc 32 → 1778 tok/s
conc 16 → 1004 tok/s conc 64 → 2480 tok/s
All four .json result files produced; 100% external prefix cache hit rate.
* [AMD] Co-locate vLLM disagg router with prefill on NODE_RANK=0
Move the vllm-router from a dedicated proxy node onto the first prefill
node, mirroring SGLang's co-location pattern. This reduces the node count
from xP + yD + 1 to xP + yD (e.g., 3 nodes instead of 4 for 1P+2D).
- server.sh: NODE_RANK=0 now runs both vllm serve (prefill, port 2584)
and vllm-router (port 30000); barrier waits on all nodes
- submit.sh / job.slurm: NUM_NODES = PREFILL_NODES + DECODE_NODES
- bench.sh: ROUTER_PORT default updated to 30000
Local 1P+2D benchmark (ISL/OSL=1024, DeepSeek-R1 FP8, MI355X):
- Throughput: +1.6% to +8.4% across concurrency 8-64
- Mean TTFT: -22% to -63% (prefill is local to router)
- TPOT/ITL: unchanged (within noise)
- 25% fewer nodes, no performance regression
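The node-count arithmetic above can be checked with a small sketch. The function name and the `dedicated_router` flag are illustrative only and do not appear in submit.sh or job.slurm.

```python
def nodes_required(prefill_nodes: int, decode_nodes: int, dedicated_router: bool) -> int:
    """Total Slurm nodes for an xP + yD layout.

    dedicated_router=True models the old layout (router on its own node);
    False models the co-located layout (router shares NODE_RANK=0 with prefill).
    """
    return prefill_nodes + decode_nodes + (1 if dedicated_router else 0)


before = nodes_required(1, 2, dedicated_router=True)   # old 1P+2D layout
after = nodes_required(1, 2, dedicated_router=False)   # co-located 1P+2D layout
saving = (before - after) / before                     # fraction of nodes saved
```

For 1P+2D this gives 4 nodes before and 3 after, which is the 25% reduction quoted above.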
* [AMD] Use public vLLM base image with runtime dependency install
Replace the custom Docker image (vllm_disagg_pd:latest) with the public
vllm/vllm-openai-rocm:v0.17.1 base image. Missing components (UCX, RIXL,
etcd, libionic1, vllm-router) are now installed at container start via
setup_deps.sh, which is sourced by server.sh.
This eliminates the need to build, host, and maintain a custom image —
CI nodes can pull directly from Docker Hub.
Changes:
- Add setup_deps.sh: idempotent installer for UCX (ROCm fork), RIXL,
etcd, libionic1 (Pensando ionic), and vllm-router (NODE_RANK=0 only).
Build steps run in subshells to avoid CWD pollution.
- server.sh: source setup_deps.sh before any other logic
- job.slurm: add --entrypoint "" to override the base image's vllm CLI
entrypoint, allowing bash -lc to work correctly
- env.sh: update comment (paths now set by setup_deps.sh, not image ENV)
- amd-master.yaml: image changed to vllm/vllm-openai-rocm:v0.17.1
Tested locally (Job 2807, 3 nodes, ISL/OSL=1024):
Setup overhead: ~2.5 min per node (all components built from source)
Benchmark completed successfully across concurrency 8/16/32/64
* [AMD] Enable Expert Parallelism with MoRI all-to-all on vLLM disagg decode
Enable MoRI-based Expert Parallelism (--enable-expert-parallel
--all2all-backend mori) on decode workers for DeepSeek-R1-0528,
while keeping TP=8 to preserve KV cache transfer compatibility
with the prefill node via NixlConnector. This matches SGLang's
approach of TP=8 + EP within the TP group.
KV Transfer: RIXL/NixlConnector (unchanged)
MoE All-to-All: NCCL (default) -> MoRI-EP (--all2all-backend mori)
Changes:
- models.yaml: Add --enable-expert-parallel --all2all-backend mori
to decode_flags; increase engine ready timeout to 1200s
- setup_deps.sh: Add MoRI install and vLLM v0.17.1 patches for
MoRI-EP + FP8 compatibility (AITER assertion, defer_input_quant)
- server.sh: Support decode_env from models.yaml for decode-specific
environment overrides
- dsr1_fp8_mi355x_vllm-disagg.sh: Pass NODELIST to submit.sh for
Slurm node constraints
* [AMD] Switch vLLM disagg KV transfer to MoRI-IO with protocol-aware proxy
Replace NixlConnector with MoRIIOConnector for KV cache transfer and
replace the Rust-based vllm-router with a MoRI-IO-aware Python proxy
that handles both HTTP routing and ZMQ-based RDMA endpoint discovery.
The key architectural change is that the proxy enriches each request's
kv_transfer_params with remote RDMA endpoint info (handshake_port,
notify_port, host, port) before dispatching, enabling concurrent
prefill+decode in WRITE mode — something vllm-router could not do
because it only understands HTTP, not the MoRI-IO registration protocol.
Changes:
- Add moriio_proxy.py: MoRI-IO-aware proxy with ZMQ service discovery,
request enrichment, and /health endpoint (adapted from vLLM upstream
moriio_toy_proxy_server.py)
- server.sh: switch --kv-transfer-config from NixlConnector to
MoRIIOConnector with kv_connector_extra_config (proxy_ip,
proxy_ping_port, http_port); launch proxy before prefill on NODE_RANK=0;
set VLLM_DISABLE_REQUEST_ID_RANDOMIZATION=1 as workaround for v0.17.1
completion-ID mismatch (upstream fix: vllm-project/vllm#34907)
- setup_deps.sh: replace vllm-router/Rust install with lightweight
Python deps (quart, aiohttp, msgpack, pyzmq) for the proxy
Benchmark (Job 2853 vs 2818 NixlConnector baseline, ISL/OSL=1024):
TTFT median: -37% to -55% across C8–C64 (e.g. 384→241ms @C64)
TTFT p99: -63% at C64 (6622→2469ms)
Throughput: +8% at C64 (2634→2844 tok/s)
TPOT: unchanged (~22ms @C64)
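The request enrichment described above could look roughly like the sketch below. The four keys (host, port, handshake_port, notify_port) come from the commit text. Everything else is an assumption made for illustration: the function name, the flat layout of `kv_transfer_params`, and the shape of the endpoint record. The real moriio_proxy.py may name or nest these fields differently.

```python
from typing import Any, Dict

# Keys named in the commit message; their exact placement in the real
# request payload is an assumption.
ENDPOINT_KEYS = ("host", "port", "handshake_port", "notify_port")


def enrich_kv_transfer_params(request: Dict[str, Any], endpoint: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `request` whose kv_transfer_params carry the remote
    RDMA endpoint info, leaving the caller's dict untouched."""
    enriched = dict(request)
    params = dict(enriched.get("kv_transfer_params") or {})
    for key in ENDPOINT_KEYS:
        params[key] = endpoint[key]
    enriched["kv_transfer_params"] = params
    return enriched


# Hypothetical endpoint record, as a discovery step might have registered it.
decode_endpoint = {"host": "10.0.0.2", "port": 8000, "handshake_port": 6301, "notify_port": 6302}
original = {"prompt": "hello", "max_tokens": 8}
outgoing = enrich_kv_transfer_params(original, decode_endpoint)
```

Copying instead of mutating keeps the client's original request intact if the proxy has to retry against a different endpoint.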
* [AMD] BUG fix: RANDOM_RANGE_RATIO never reaches bench.sh
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
* Bug fix: 1. With DRY_RUN=1, node 0 skipped starting proxy/prefill but still ran the first barrier; 2. kill and kill run only when DRY_RUN=0
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
* [AMD] Fix vLLM disagg hang: READ mode support + safety timeouts
Enable READ-mode KV transfer (decode-initiated RDMA reads) with a
critical scheduler assertion fix, and add safety timeouts to prevent
indefinite hangs during RDMA transfers.
Changes:
- setup_deps.sh: Add patches — save_kv_layer/start_load_kv
handshake timeouts (30s), RDMA transfer timeout (120s), deferred
write task expiry (60s), write worker error handling, and scheduler
assertion fix for READ-mode intermediate request states
- moriio_proxy.py: Add stream idle timeout (PROXY_STREAM_IDLE_TIMEOUT)
to abort stalled decode streams, and proper response.release()
- submit.sh, job.slurm: Plumb PROXY_STREAM_IDLE_TIMEOUT and
VLLM_MORIIO_CONNECTOR_READ_MODE env vars into Docker containers
Validated: 1k/1k full sweep (C8–C512), 100% success rate at all
concurrency levels, peak 8500 output tok/s at C512.
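A stream idle timeout of the kind PROXY_STREAM_IDLE_TIMEOUT controls can be sketched with `asyncio.wait_for` around each chunk read. This is a minimal illustration, not the code in moriio_proxy.py. The function names and the demo stream are made up, and the real proxy reads from an HTTP response, not an async generator.

```python
import asyncio


async def relay_with_idle_timeout(stream, idle_timeout_s: float):
    """Collect chunks from an async iterator, giving up if no chunk arrives
    within idle_timeout_s. Returns (chunks, timed_out)."""
    iterator = stream.__aiter__()
    chunks = []
    while True:
        try:
            # The timeout applies per chunk, so a long but steadily
            # streaming response is never cut off.
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=idle_timeout_s)
        except StopAsyncIteration:
            return chunks, False   # stream ended normally
        except asyncio.TimeoutError:
            return chunks, True    # stream stalled; caller should abort it
        chunks.append(chunk)


async def _stalling_stream():
    """Demo stream: one chunk, then a stall far longer than the timeout."""
    yield "token-1"
    await asyncio.sleep(10)
    yield "token-2"


result = asyncio.run(relay_with_idle_timeout(_stalling_stream(), idle_timeout_s=0.05))
```

Timing each read separately, instead of the whole response, is what distinguishes an idle timeout from a total request timeout.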
* Adapt vLLM disagg recipe for 9N mia1 cluster (mlx5 NICs)
Port the vLLM disaggregated serving pipeline from the 4N cluster
(Pensando ionic NICs) to the 9N mia1 cluster (mlx5/rdma NICs).
Key changes:
- Fix C512 deadlock: apply ucx_error_handling_mode=none universally
instead of only for ionic NICs. Under high concurrency, UCX's default
UCP_ERR_HANDLING_MODE_PEER prevents RIXL RDMA READ retries from
recovering after ibv_post_send queue exhaustion, causing prefill KV
cache saturation and pipeline deadlock.
- Force-reinstall MoRI from b645fc8 to fix PCI topology assertion
failure on nodes with Broadcom PEX890xx PCIe switches.
- Auto-detect Docker privilege (sudo vs non-sudo) for cross-cluster
portability.
- Add SLURM_EXCLUDE_NODES support to skip nodes with broken Docker
sockets.
- Increase VLLM_ENGINE_READY_TIMEOUT_S to 3600 to accommodate longer
setup times (RIXL/MoRI source builds over NFS).
* [AMD] Fix vLLM disagg sweep hang: KV cache leak + benchmark client hardening
Server-side: RIXL can lose `finished_sending` notifications under high
concurrency with ibv_post_send failures, permanently leaking prefill KV
blocks. Over multiple benchmark rounds (sweep), leaked blocks accumulate
and saturate the prefill KV cache, deadlocking C512.
- Fix finished_sending handler to unconditionally free KV blocks
(the conditional status check had no recovery path, causing leaks)
- Add idle KV block reaper: detects engine idle >5s with finished
requests still holding blocks, then force-frees them
- Add 10s cooldown between benchmark rounds for reaper activation
Client-side: SSE streaming loop did not break on the [DONE] sentinel,
causing the benchmark client to hang when the proxy held connections
open after request completion.
- Break SSE loop on [DONE] in completions and chat completions
- Share a single aiohttp.ClientSession across all requests (connection
pooling via TCPConnector instead of per-request session creation)
- Add asyncio.wait_for timeout around asyncio.gather with proper task
cancellation and partial result collection
- Reduce AIOHTTP_TIMEOUT from 6h to 30min
Verified: sweep 1K/1K C128→C256→C512 all pass (Job 6222, 9N cluster).
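Two of the client-side fixes above, stopping at the [DONE] sentinel and bounding `asyncio.gather` with a timeout while keeping partial results, can be sketched as follows. This illustrates the techniques only and is not the benchmark client's code: the function names are hypothetical, and the JSON shape assumes an OpenAI-style completions chunk.

```python
import asyncio
import json


def collect_sse_text(lines):
    """Concatenate completion text from SSE lines, stopping at [DONE]."""
    pieces = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # without this break the reader waits on a connection the server keeps open
        pieces.append(json.loads(payload)["choices"][0]["text"])
    return "".join(pieces)


async def gather_with_timeout(coros, timeout_s: float):
    """Run coroutines concurrently; on timeout cancel the stragglers and
    return finished results, with None for anything that did not finish."""
    tasks = [asyncio.ensure_future(c) for c in coros]
    try:
        await asyncio.wait_for(asyncio.gather(*tasks, return_exceptions=True), timeout=timeout_s)
    except asyncio.TimeoutError:
        for task in tasks:
            if not task.done():
                task.cancel()
        # Let the cancellations settle before reading task state.
        await asyncio.gather(*tasks, return_exceptions=True)
    return [
        t.result() if (t.done() and not t.cancelled() and t.exception() is None) else None
        for t in tasks
    ]


async def _fast():
    return 1


async def _slow():
    await asyncio.sleep(10)
    return 2


text = collect_sse_text([
    'data: {"choices": [{"text": "Hi"}]}',
    "",
    "data: [DONE]",
    'data: {"choices": [{"text": "ignored"}]}',
])
partial = asyncio.run(gather_with_timeout([_fast(), _slow()], timeout_s=0.05))
```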
* [AMD] Fix vLLM disagg Slurm job never terminating after benchmark completion
Background processes (proxy, prefill, decode, etcd) were started via
`cmd 2>&1 | tee logfile &`, causing bash $! to capture the PID of tee
rather than the actual process. `kill $pid` only killed tee, leaving the
real process running. The proxy kept port 30000 open, so decode nodes'
`sync.py wait` never detected shutdown and the Slurm job hung forever.
Additionally, etcd's stderr was not redirected, holding the Docker
container's main pipe open and preventing container exit even after
server.sh completed.
Changes:
- Redirect all background processes to log files instead of piping
through tee, so $! captures the correct PID (matches SGLang pattern)
- Redirect etcd launcher's stderr to prevent pipe leak
- Add pkill fallback cleanup for proxy, vllm, and etcd processes
- Increase barrier grace period to handle node setup time variance
- Increase container creation barrier timeout from 300s to 600s
* [AMD] Enable MoRI-IO READ mode by default for vLLM disagg
* [AMD] Fix CI checkout failure caused by root-owned __pycache__ files
Fix per-node Docker privilege detection in vLLM disagg job.slurm
* [AMD] Fix CI checkout EACCES by redirecting Python bytecache off NFS
Docker containers run as root, so __pycache__/*.pyc files created
during benchmark_serving.py import end up root-owned on the NFS
workspace. The CI runner cannot delete them, breaking checkout.
Set PYTHONPYCACHEPREFIX=/tmp/pycache in the Docker env so bytecache
stays inside the container. Remove the previous server.sh find-and-
delete workaround since the root cause is now addressed.
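The effect of PYTHONPYCACHEPREFIX can be seen from Python itself through `sys.pycache_prefix`, which the environment variable initialises at interpreter start. The sketch below sets the attribute directly to show where bytecode would land. The source path is a made-up example, not a path from the repo.

```python
import importlib.util
import sys


def bytecode_path(source_path: str, prefix: str) -> str:
    """Return where CPython would cache bytecode for source_path when the
    pycache prefix is `prefix` (what PYTHONPYCACHEPREFIX sets at startup)."""
    saved = sys.pycache_prefix
    sys.pycache_prefix = prefix
    try:
        return importlib.util.cache_from_source(source_path)
    finally:
        sys.pycache_prefix = saved  # restore interpreter state


# Hypothetical source location on the shared workspace.
cached = bytecode_path("/workspace/utils/benchmark_serving.py", "/tmp/pycache")
```

With the prefix set, the cache file sits in a mirrored tree under /tmp/pycache and no `__pycache__` directory is created next to the source, so nothing root-owned is left on the NFS workspace.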
* [AMD] Fix KV reaper deadlock on high-ISL disagg workloads
The idle KV block reaper only fired when both running=0 AND waiting=0.
Under 8K ISL at C64+, leaked blocks filled the prefill KV cache while
new requests queued in WAITING state — the non-empty wait queue
prevented the reaper from ever triggering, causing a permanent hang.
Remove the waiting-queue check so the reaper fires whenever no requests
are actively running, which is precisely when leaked blocks can be
safely reclaimed.
Verified with 8K/1K sweep (C32–C512) completing without hangs.
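The change in the reaper's trigger condition can be written out as two predicates. This is a simplified model of the logic described above, not the patch applied by setup_deps.sh. The snapshot fields are hypothetical names, and the 5 s idle threshold is taken from the earlier commit that introduced the reaper.

```python
from dataclasses import dataclass

IDLE_THRESHOLD_S = 5.0  # "engine idle >5s" from the original reaper commit


@dataclass
class SchedulerSnapshot:
    running: int                   # requests actively executing
    waiting: int                   # requests queued in WAITING state
    finished_holding_blocks: int   # finished requests whose KV blocks were never freed
    idle_seconds: float            # time since a request last ran


def should_reap_old(s: SchedulerSnapshot) -> bool:
    """Original condition: required an empty wait queue as well."""
    return (s.running == 0 and s.waiting == 0
            and s.finished_holding_blocks > 0 and s.idle_seconds > IDLE_THRESHOLD_S)


def should_reap_new(s: SchedulerSnapshot) -> bool:
    """Fixed condition: the wait queue no longer blocks the reaper."""
    return (s.running == 0
            and s.finished_holding_blocks > 0 and s.idle_seconds > IDLE_THRESHOLD_S)


# The deadlock state: cache full of leaked blocks, new requests stuck waiting.
stuck = SchedulerSnapshot(running=0, waiting=12, finished_holding_blocks=3, idle_seconds=6.0)
```

In the `stuck` state the old predicate is false and the new one is true: the queued requests cannot start until blocks are freed, and under the old rule blocks were only freed once the queue was empty.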
* [AMD] Enable reading PREFILL_TP, PREFILL_EP, PREFILL_DP_ATTN, DECODE_TP, DECODE_EP, DECODE_DP_ATTN from amd-master.yaml config.
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
* [AMD] Upgrade vLLM disagg image from v0.17.1 to v0.18.0
Bump vllm/vllm-openai-rocm to v0.18.0 for the dsr1-fp8-mi355x-vllm-disagg
config. Changes required by the new image:
- setup_deps.sh: drop aiohttp/pyzmq installs (now pre-installed in v0.18.0);
move install_mori_proxy_deps before patches and run on all nodes so msgpack
is available when patch scripts import MoRI-IO connector modules
- moriio_proxy.py: populate transfer_id in kv_transfer_params dicts (new
required field in v0.18.0's moriio_connector.update_state_after_alloc)
- MoRI PCI topology bug persists in v0.18.0; rebuild from b645fc8 retained
Tested: 1K1K C8,16,32,64,128,256 on mia1 3-node (1P+2D)
The C512 run is still in progress; no failures observed so far.
* [AMD] Add Kimi-K2.5-MXFP4 disagg inference config (1P2D)
Enable vLLM disagg serving for amd/Kimi-K2.5-MXFP4 on MI355X
with a 1P2D node topology (TP=8, decode EP=8).
Changes:
- amd-master.yaml: add kimik2.5-fp4-mi355x-vllm-disagg config with
seq-len scenarios (1K1K, 8K1K), READ mode enabled
- models.yaml: add Kimi-K2.5-MXFP4 server flags (PIECEWISE cudagraph,
--gpu-memory-utilization 0.90, --mm-encoder-tp-mode data)
- bench.sh: add --trust-remote-code for models with custom code
- setup_deps.sh: install amd-quark for MXFP4 quantization support
- Add kimik2.5_fp4_mi355x_vllm-disagg.sh entry script
Verified with full 1K/1K sweep (CONC 8–512) on SA4N and mia1 9N
cluster; all concurrency levels completed without hang.
* feat: add MiniMax M2.5 PD disaggregation recipe (1P2D, MoRI-EP + MoRI-IO)
Cherry-picked from ChuanLi1101/InferenceMAX:chuali/minimax-m25-vllm-disagg
(commit 72a0002). Resolved conflict in models.yaml to keep both
Kimi-K2.5-MXFP4 and MiniMax-M2.5 entries.
Add multi-node vLLM PD disaggregation support for MiniMax-M2.5 (FP8),
following the DeepSeek R1 disagg recipe pattern. Includes:
- models.yaml: MiniMax-M2.5 config with TP8 prefill / TP8+EP8+MoRI decode
- Entry script: minimaxm25_fp8_mi355x_vllm-disagg.sh
- amd-master.yaml: e2e test entry for 1P2D on MI355X (1k1k, 8k1k, 1k8k)
MiniMax M2.5 (230B, 256 experts, top-8 sigmoid routing, GQA) uses the
same disagg infrastructure as DSR1. Unlike DeepSeek MLA models, M2.5
uses standard GQA attention so AITER paged attention is fully supported
and no block-size/cudagraph workarounds are needed.
Co-authored-by: ChuanLi1101 <Chuan.Li2@amd.com>
Co-authored-by: Claude
Made-with: Cursor
* feat: add Dockerfile and runtime patch for MiniMax M2.5 WideEP + MoRI
Cherry-picked from ChuanLi1101/InferenceMAX:chuali/minimax-m25-vllm-disagg
(commit bb6bd0e). Adapted for v0.18.0 base: kept vllm/vllm-openai-rocm:v0.18.0
image (runtime patch via setup_deps.sh is sufficient; custom Docker image
available in docker/minimax-m25-disagg/ if needed).
Two deployment options for getting vLLM minimax_m2.py changes into the container:
Option A -- Custom Docker image (docker/minimax-m25-disagg/):
Builds from the public vLLM ROCm image and pre-installs UCX, etcd, RIXL,
and patched minimax_m2.py with WideEP + MoRI + EPLB support baked in.
Option B -- Runtime patch (setup_deps.sh):
patch_minimax_m2_wideep_mori() copies patched minimax_m2.py from the
mounted InferenceX repo into the container's vLLM installation at startup.
Co-authored-by: ChuanLi1101 <Chuan.Li2@amd.com>
Co-authored-by: Claude
Made-with: Cursor
* Fix: rename minimaxm25 to minimaxm2.5 for CI naming consistency
Align MiniMax M2.5 disagg naming with existing single-node configs
(minimaxm2.5_fp8_mi355x.sh, minimaxm2.5_fp8_mi300x.sh, etc.).
- amd-master.yaml: minimaxm25 -> minimaxm2.5 in config key + model-prefix
- Rename entry script: minimaxm25_fp8_mi355x_vllm-disagg.sh ->
minimaxm2.5_fp8_mi355x_vllm-disagg.sh
- Dockerfile: update COPY path to match renamed script
* Optimize: add --gpu-memory-utilization 0.95 and --block-size 32 to MiniMax M2.5 disagg
Align MiniMax M2.5 disagg serve parameters with the proven single-node
config (minimaxm2.5_fp8_mi355x.sh). MiniMax M2.5 uses GQA (not MLA),
so block-size 32 is optimal (vs block-size 1 for DeepSeek/Kimi MLA).
The extra 5% GPU memory (0.95 vs default 0.9) increases KV cache
capacity for high-concurrency sweeps (C256/C512).
* Fix: MiniMax M2.5 disagg — require EP=8 for prefill, fix ROCm gate dtype
MiniMax M2.5 has expert intermediate_size=1536; with TP=8 and no EP the
sharded dimension (192) is not divisible by FP8 block_n=128, crashing
the prefill node. Set prefill EP=8 (matching decode and single-node)
and add --enable-expert-parallel --all2all-backend mori to prefill_flags.
Fix GateLinear to use out_dtype=torch.float32 instead of
params_dtype=torch.float32 so the GEMM runs in bf16 (ROCm compatible)
and only the output is cast to fp32 for routing precision.
Remove the 1K/8K benchmark scenario (not needed).
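The divisibility failure described above is plain arithmetic and can be checked directly. The helper name is illustrative. Treating EP as leaving each expert's intermediate dimension unsharded (an effective shard factor of 1) is an assumption about how expert parallelism partitions the MoE weights, not something stated in the commit.

```python
def fp8_shard_ok(intermediate_size: int, shard_factor: int, block_n: int = 128) -> bool:
    """True if the per-rank slice of the expert intermediate dimension is a
    whole number of FP8 quantization blocks of width block_n."""
    shard, remainder = divmod(intermediate_size, shard_factor)
    return remainder == 0 and shard % block_n == 0


per_rank_tp8 = 1536 // 8                              # 192 columns per rank with TP=8, no EP
tp8_ok = fp8_shard_ok(1536, shard_factor=8)           # 192 % 128 == 64, so this fails
ep_ok = fp8_shard_ok(1536, shard_factor=1)            # whole experts per rank: 1536 == 12 * 128
```

This matches the reported crash: 192 is not a multiple of 128, while the unsharded 1536 is.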
* Remove unused docker/minimax-m25-disagg/ directory
The Dockerfile, build.sh, and duplicate minimax_m2.py patch were never
used by the CI pipeline or local tests.
* remove vllm disagg for dpsr1 and dpv3
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
* consolidate amd_utils for sglang and vllm
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
* use vLLM router as default router for vllm disagg
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
* fix bugs
Signed-off-by: Chun Fang <chun.fang@amd.com>
* [AMD] Bump to nightly vllm and vllm-router images (#1208)
---------
Signed-off-by: Simon Danielsson <pedaniel@amd.com>
* update vllm image and vllm router image
* update the interface prefix for tw cluster
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
* add deps for ib device auto-detection
Signed-off-by: Shan Theresa <theresa.shan@amd.com>
* update vllm image
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
* fix indentation and add missing finally block in async_request_openai_chat_completions
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix tw-eth interface detection pattern in env.sh
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix vllm-disagg config schema: use scenarios.fixed-seq-len
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix vllm-disagg routing to multi_node benchmark subdir
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix result collection to use FRAMEWORK as log directory prefix
The inline collect_latest_results.py hardcoded "sglang" as the log
directory prefix, causing "No logs directory found" for vllm-disagg
runs where bench.sh creates directories named vllm-disagg_isl_X_osl_Y.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* suppress tokenizer warnings and debug output in bench.sh
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix vllm-disagg deadlock: stop router after rank 0 container exits
The vllm-router runs as a separate container on node 0. After node 0's
main container finishes the benchmark and exits, decode nodes remain
stuck waiting for the router port to close. The router cleanup in
job.slurm can't run until srun completes, but srun can't complete
because decode nodes are blocked — deadlock.
Fix: skip exec on rank 0 for vllm-disagg so the srun bash script
continues after docker exits and can stop the router container,
allowing decode nodes to detect the port closure and exit.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* reduce vllm-disagg concurrency sweep to single point for faster iteration
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* preserve slurm logs on failure and print stderr inline
The EXIT trap deleted benchmark_logs/ before saving artifacts, making
it impossible to debug container startup failures. Now the trap always
copies slurm .out/.err to the artifact directory and prints the last
100 lines of .err inline in the CI output.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* enable set -x around docker privilege detection for CI debugging
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix docker detection: test on compute node, not batch host
The batch host has docker socket permissions but the compute nodes
do not, causing "permission denied" on all srun tasks. Move the
detection after SELECTED_NODES is known and probe via srun.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix docker detection: per-node probe since group membership varies
Export DOCKER_CMD_DETECT as a shell snippet that each srun participant
evaluates locally, instead of testing a single node and assuming all
nodes have the same docker socket permissions.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* add vllm-disagg changelog entries and update kimi conc-list
- Add perf-changelog entries for kimik2.5-fp4-mi355x-vllm-disagg and
minimaxm2.5-fp8-mi355x-vllm-disagg to trigger CI benchmarks
- Update kimi 1k1k conc-list from [8] to [16]
- Comment out kimi 8k1k config until eval pipeline is wired up
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* switch vllm-disagg to 8k1k config to trigger multi-node eval
Comment out 1k1k config and enable 8k1k with conc-list [16] so
mark_eval_entries picks it up for the eval pipeline.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* add multi-node eval feature
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
* remove start_etcd.sh
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
* change decode to 1, easier for testing
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
* add --served-model-name to vllm serve commands and wire up eval
Set --served-model-name on all prefill/decode vllm serve commands so
the model name matches what run_lm_eval sends in API requests. Also
add eval pipeline support (health check, run_eval, artifact staging)
mirroring server_sglang.sh.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix model name consistency between vllm serve and bench client
bench.sh now uses MODEL_NAME for vllm-disagg to match
--served-model-name, and MODEL_PATH for sglang to match its default.
Simplified SERVED_MODEL to use MODEL_NAME directly since MODEL env
var is not available inside the container.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* add token patch to bench for vllm
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
* add --tokenizer passthrough to run_benchmark_serving
benchmark_lib.sh rejected unknown flags — add --tokenizer support so
vllm-disagg bench can resolve the tokenizer from the local model path
instead of attempting an HF download with the short model name.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* update vllm image for kimi2.5 and Minimax disagg.
Signed-off-by: Shan Theresa <theresa.shan@amd.com>
* Update setup_deps.sh
* Update amd-master.yaml
restore the kimi k2.5 settings
* update req rate for vllm.
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
* make the sglang env consistent with upstream
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
* node blacklist
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
* fix: remove faulty minimax patch
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
* fix: remove unneeded commented-out code from setup_deps.sh
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
* fix: bump to latest nightly vllm image on minimax
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
* fix: temporarily mount /coredumps
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
* tmp: add better debugging capabilities
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
* fix: disable custom all-reduce for minimax
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
* fix: minimax segfault by avoiding M=8K fmoe kernel shape
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
* revert: fix: temporarily mount /coredumps
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
* feat: add VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1 as in single node example
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
* fix: use FRAMEWORK arg in collect_latest_results.py to match vllm-disagg log dirs
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
* remove unused vllm_disagg_utils directory
No external references to this folder exist in the codebase.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* revert: restore backend_request_func.py to match main
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* revert: restore benchmark_serving.py to match main
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* revert: fully restore benchmark_serving.py to match main
Restores import order and failure-rate check block.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* revert: fully restore backend_request_func.py to match main
Restores _resolve helper and tokenizer fix logic.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* add pr-link to vllm-disagg changelog entries
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix: sync env.sh with upstream main
- Fix IBDEVICES detection log: move info message inside success branch,
exit 1 on failure instead of silently propagating empty strings
- Add missing SGLANG_USE_AITER=1
- Set SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0 to match upstream
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix: restore SGLANG_MORI_COMBINE_DTYPE in server launch commands
The refactored server_sglang.sh dropped the per-role COMBINE_DTYPE
mapping that the old server.sh had. SGLang reads SGLANG_MORI_COMBINE_DTYPE
internally, so map it from MORI_COMBINE_DTYPE_PREFILL (fp8_direct_cast)
on prefill and MORI_COMBINE_DTYPE_DECODE (fp8) on decode.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* refactor: move static vLLM env vars to env.sh, remove dead etcd code
Move VLLM_USE_V1, VLLM_SERVER_DEV_MODE, VLLM_DISABLE_REQUEST_ID_RANDOMIZATION
to env.sh alongside other engine-specific config. Remove commented-out
etcd setup block that is no longer used.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix: pass IS_MULTINODE into Docker container
The refactored DOCKER_ENV_COMMON array dropped -e IS_MULTINODE that
the old job.slurm had. Without it, eval metadata tagging inside the
container sees an empty value.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix: improve vllm-disagg changelog descriptions
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
* fix: restore DP+EP override blocks and trailing newline in server_sglang.sh
Add BENCH_MAX_CONC_VALUE extraction and the two DP+EP override blocks
that the refactor from server.sh dropped. These adjust max-running-requests,
dispatch tokens, and MOE input tokens when both DP and EP are enabled.
Also add trailing newline for POSIX compliance. server_sglang.sh now
matches upstream server.sh exactly.
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
---------
Signed-off-by: Theresa Shan <theresa.shan@amd.com>
Signed-off-by: Chun Fang <chun.fang@amd.com>
Signed-off-by: Simon Danielsson <pedaniel@amd.com>
Signed-off-by: Shan Theresa <theresa.shan@amd.com>
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Co-authored-by: Chun Fang <chun.fang@amd.com>
Co-authored-by: ChuanLi1101 <Chuan.Li2@amd.com>
Co-authored-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com>
Co-authored-by: Claude Opus 4 <noreply@anthropic.com>
Co-authored-by: simondanielsson <simon.danielsson99@hotmail.com>
1 parent c5ff8da commit bb00055
18 files changed
Lines changed: 2984 additions & 1165 deletions
File tree
- .github/configs
- benchmarks
  - multi_node
    - amd_utils
- runners