Commit f9a1343

arygupt and claude committed
Merge origin/main into feat/measured-power-multinode
Resolve perf-changelog.yaml append conflict by keeping all three new entries: main's #1579 (qwen3.5-fp4-mi355x-sglang-disagg) plus this branch's #1574 re-trigger and the AMD multinode measured-power entry. Append-only file (process_changelog rejects deletions); no lines removed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2 parents f407f4b + 34528e3 commit f9a1343

4 files changed

Lines changed: 177 additions & 0 deletions

.github/configs/amd-master.yaml

Lines changed: 54 additions & 0 deletions
@@ -494,6 +494,60 @@ qwen3.5-fp4-mi355x-sglang-mtp:
       - { tp: 2, conc-start: 4, conc-end: 256, spec-decoding: mtp }
       - { tp: 4, conc-start: 4, conc-end: 16, spec-decoding: mtp }

+qwen3.5-fp4-mi355x-sglang-disagg:
+  image: lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523
+  model: amd/Qwen3.5-397B-A17B-MXFP4
+  model-prefix: qwen3.5
+  runner: mi355x-disagg
+  precision: fp4
+  framework: sglang-disagg
+  multinode: true
+  disagg: true
+  scenarios:
+    fixed-seq-len:
+    - isl: 1024
+      osl: 1024
+      search-space:
+      # 1P1D TP8/EP1, dp-attn false; MoRI conn.py overlay via job.slurm.
+      - spec-decoding: "none"
+        conc-list: [ 8, 16, 32, 64, 128, 256, 512 ]
+        prefill:
+          num-worker: 1
+          tp: 8
+          ep: 1
+          dp-attn: false
+          additional-settings:
+          - "PREFILL_NODES=1"
+        decode:
+          num-worker: 1
+          tp: 8
+          ep: 1
+          dp-attn: false
+          additional-settings:
+          - "DECODE_NODES=1"
+          - "DECODE_MTP_SIZE=0"
+
+    - isl: 8192
+      osl: 1024
+      search-space:
+      - spec-decoding: "none"
+        conc-list: [ 8, 16, 32, 64, 128, 256, 512 ]
+        prefill:
+          num-worker: 1
+          tp: 8
+          ep: 1
+          dp-attn: false
+          additional-settings:
+          - "PREFILL_NODES=1"
+        decode:
+          num-worker: 1
+          tp: 8
+          ep: 1
+          dp-attn: false
+          additional-settings:
+          - "DECODE_NODES=1"
+          - "DECODE_MTP_SIZE=0"
+
 qwen3.5-fp8-mi300x-sglang:
   image: lmsysorg/sglang:v0.5.12-rocm720-mi30x
   model: Qwen/Qwen3.5-397B-A17B-FP8
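
As a sanity check on the 1P1D entry above, the node counts in `additional-settings` line up with the worker and TP settings if each MI355X node hosts 8 GPUs (the figure quoted in the perf-changelog entry for #1574). The sketch below is only that arithmetic. The `nodes_for` helper and its ceiling-division rule are assumptions made for illustration, not something the repository's scripts are known to do.

```shell
#!/usr/bin/env bash
# Assumption: 8 GPUs per MI355X node, as stated in the #1574 changelog entry.
GPUS_PER_NODE=8

# Hypothetical helper: nodes needed for one role (prefill or decode).
nodes_for() {
    local num_worker=$1 tp=$2
    # Ceiling division of the role's total GPUs by GPUs per node.
    echo $(( (num_worker * tp + GPUS_PER_NODE - 1) / GPUS_PER_NODE ))
}

PREFILL_NODES=$(nodes_for 1 8)   # prefill: num-worker 1, tp 8
DECODE_NODES=$(nodes_for 1 8)    # decode:  num-worker 1, tp 8
TOTAL_GPUS=$(( (PREFILL_NODES + DECODE_NODES) * GPUS_PER_NODE ))

echo "PREFILL_NODES=$PREFILL_NODES DECODE_NODES=$DECODE_NODES TOTAL_GPUS=$TOTAL_GPUS"
```

Under these assumptions both roles resolve to one node each, matching the `PREFILL_NODES=1` and `DECODE_NODES=1` settings in the config.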

benchmarks/multi_node/amd_utils/models.yaml

Lines changed: 31 additions & 0 deletions
@@ -161,6 +161,37 @@ DeepSeek-R1-0528:
       chunked_prefill_size: 262144
       cuda_graph_bs_range: "1-128"

+Qwen3.5-397B-A17B-MXFP4:
+  base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --load-balance-method round_robin --kv-cache-dtype fp8_e4m3 --attention-backend aiter --disaggregation-transfer-backend mori --moe-dense-tp-size 1"
+  mtp_flags: ""
+  dp_flags: "--moe-a2a-backend mori --enable-dp-attention --enable-dp-lm-head"
+  prefill:
+    mem_fraction_static: 0.8
+    disable_radix_cache: true
+    dp:
+      max_running_requests: 24
+      chunked_prefill_size: "MORI_MAX_DISPATCH_TOKENS_PREFILL * PREFILL_TP_SIZE"
+      cuda_graph_bs: "1 2 3"
+    no_dp:
+      max_running_requests: 128
+      chunked_prefill_size: 262144
+      cuda_graph_bs_range: "1-128"
+  decode:
+    mem_fraction_static: 0.85
+    prefill_round_robin_balance: true
+    dp:
+      max_running_requests: 4096
+      chunked_prefill_size: "MORI_MAX_DISPATCH_TOKENS_DECODE * DECODE_TP_SIZE"
+      cuda_graph_bs_range: "1-160"
+    ep_only:
+      max_running_requests: 256
+      chunked_prefill_size: 262144
+      cuda_graph_bs_range: "1-256"
+    no_dp:
+      max_running_requests: 128
+      chunked_prefill_size: 262144
+      cuda_graph_bs_range: "1-128"
+
 Qwen3.5-397B-A17B-FP8:
   base_flags: "--decode-log-interval 1000 --log-level warning --watchdog-timeout 3600 --load-balance-method round_robin --kv-cache-dtype fp8_e4m3 --attention-backend aiter --disaggregation-transfer-backend mori --moe-dense-tp-size 1"
   mtp_flags: ""
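
The `dp` profiles above give `chunked_prefill_size` as a product of two named variables rather than a literal, and give batch sizes either as a list (`cuda_graph_bs`) or as a range (`cuda_graph_bs_range`). How amd_utils resolves these strings is not shown in this commit. The sketch below is one way a consumer could do it in bash, using a made-up value for `MORI_MAX_DISPATCH_TOKENS_PREFILL` and two hypothetical helpers, `resolve_size` and `expand_range`.

```shell
#!/usr/bin/env bash
# Assumed values for illustration only; the real ones are set elsewhere.
MORI_MAX_DISPATCH_TOKENS_PREFILL=4096
PREFILL_TP_SIZE=8

# Hypothetical helper: resolve "A * B" style strings via shell arithmetic.
# Bare names inside $(( )) are read as variables; literals pass through.
resolve_size() {
    echo $(( $1 ))
}

# Hypothetical helper: expand a "lo-hi" range into a space-separated list.
expand_range() {
    local lo=${1%-*} hi=${1#*-} out="" i
    i=$lo
    while [ "$i" -le "$hi" ]; do
        out="${out:+$out }$i"
        i=$(( i + 1 ))
    done
    echo "$out"
}

CHUNKED=$(resolve_size "MORI_MAX_DISPATCH_TOKENS_PREFILL * PREFILL_TP_SIZE")
LITERAL=$(resolve_size "262144")
BS_LIST=$(expand_range "1-4")

echo "chunked=$CHUNKED literal=$LITERAL bs_list=$BS_LIST"
```

With the assumed inputs, the product form resolves to 32768, the literal passes through unchanged, and the short range `1-4` expands to `1 2 3 4`.
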
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
+#!/usr/bin/env bash
+
+source "$(dirname "$0")/../benchmark_lib.sh"
+
+check_env_vars \
+    CONC_LIST \
+    ISL \
+    OSL \
+    IMAGE \
+    SPEC_DECODING \
+    MODEL_PATH \
+    PREFILL_NUM_WORKERS \
+    PREFILL_TP \
+    PREFILL_EP \
+    PREFILL_DP_ATTN \
+    DECODE_NUM_WORKERS \
+    DECODE_TP \
+    DECODE_EP \
+    DECODE_DP_ATTN \
+    PREFILL_NODES \
+    DECODE_NODES \
+    RANDOM_RANGE_RATIO \
+    FRAMEWORK
+
+if [[ -n "$SLURM_JOB_ID" ]]; then
+    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
+fi
+
+set -x
+
+# Use upstreamed multi_node scripts (no external clone needed)
+cd "$GITHUB_WORKSPACE/benchmarks/multi_node/amd_utils" || exit 1
+
+# Set up SGL launch script-specific environment variables
+export TIME_LIMIT="08:00:00"
+export MODEL_PATH=$MODEL_PATH
+export MODEL_NAME=$MODEL_NAME
+export CONTAINER_IMAGE=$IMAGE
+
+if [[ "${PREFILL_EP:-1}" -eq 1 ]]; then
+    export PREFILL_ENABLE_EP=false
+else
+    export PREFILL_ENABLE_EP=true
+fi
+
+if [[ "$PREFILL_DP_ATTN" == "true" ]]; then
+    export PREFILL_ENABLE_DP=true
+else
+    export PREFILL_ENABLE_DP=false
+fi
+
+if [[ "${DECODE_EP:-1}" -eq 1 ]]; then
+    export DECODE_ENABLE_EP=false
+else
+    export DECODE_ENABLE_EP=true
+fi
+
+if [[ "$DECODE_DP_ATTN" == "true" ]]; then
+    export DECODE_ENABLE_DP=true
+else
+    export DECODE_ENABLE_DP=false
+fi
+
+# Launch jobs based on ISL/OSL
+# Replace ' ' in CONC_LIST with 'x' such that the concurrency list is represented
+# by a list of numbers delimited by 'x'. This is because of how the underlying launch script
+# expects the concurrencies.
+JOB_ID=$(bash ./submit.sh $PREFILL_NODES \
+    $PREFILL_NUM_WORKERS \
+    $DECODE_NODES \
+    $DECODE_NUM_WORKERS \
+    $ISL $OSL "${CONC_LIST// /x}" inf \
+    ${PREFILL_ENABLE_EP} ${PREFILL_ENABLE_DP} \
+    ${DECODE_ENABLE_EP} ${DECODE_ENABLE_DP} \
+    ${PREFILL_TP} ${DECODE_TP} \
+    ${RANDOM_RANGE_RATIO} \
+    ${NODE_LIST:-})
+
+if [[ $? -ne 0 ]]; then
+    echo "Failed to submit job" >&2
+    exit 1
+fi
+
+echo "$JOB_ID"
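
To make the argument derivation in the launcher above concrete, the sketch below replays its EP/DP flag mapping and the `CONC_LIST` encoding for the 1P1D entry from amd-master.yaml, without calling `submit.sh`. The `enable_ep` and `enable_dp` helper names are hypothetical and only condense the four `if` blocks. The input values come from that config entry, and the assumption that the workflow passes `conc-list` as a space-separated `CONC_LIST` is inferred from the launcher's comment rather than shown in this commit.

```shell
#!/usr/bin/env bash
# Inputs mirroring the qwen3.5-fp4-mi355x-sglang-disagg entry
# (space-separated CONC_LIST is an assumption about the workflow).
CONC_LIST="8 16 32 64 128 256 512"
PREFILL_EP=1
PREFILL_DP_ATTN=false
DECODE_EP=1
DECODE_DP_ATTN=false

# Hypothetical helpers condensing the launcher's if/else blocks.
# An EP size of 1 (or unset) means expert parallelism is off.
enable_ep() { if [[ "${1:-1}" -eq 1 ]]; then echo false; else echo true; fi; }
# DP attention is on only for the exact string "true".
enable_dp() { if [[ "$1" == "true" ]]; then echo true; else echo false; fi; }

PREFILL_ENABLE_EP=$(enable_ep "$PREFILL_EP")
PREFILL_ENABLE_DP=$(enable_dp "$PREFILL_DP_ATTN")
DECODE_ENABLE_EP=$(enable_ep "$DECODE_EP")
DECODE_ENABLE_DP=$(enable_dp "$DECODE_DP_ATTN")

# Replace every space with 'x' so the list travels as one positional argument.
CONC_ARG="${CONC_LIST// /x}"

echo "conc=$CONC_ARG ep=$PREFILL_ENABLE_EP/$DECODE_ENABLE_EP dp=$PREFILL_ENABLE_DP/$DECODE_ENABLE_DP"
```

Encoding the list as a single `x`-delimited token keeps the positional-argument order of `submit.sh` stable regardless of how many concurrency values are configured.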

perf-changelog.yaml

Lines changed: 8 additions & 0 deletions
@@ -3213,3 +3213,11 @@
     - "The AMD amd_utils SLURM job has no orchestrator perfmon, so each SGLang/vLLM disagg node starts its own amd-smi monitor via start_perf_monitor (benchmarks/benchmark_lib.sh), writing perf_samples_<role>_w<idx>_<host>.csv into the NFS-shared /benchmark_logs/perfmon mount (wired in amd_utils/job.slurm). launch_mi355x-amds.sh collects the per-node CSVs into the GH workspace before the EXIT trap wipes the logs dir and sets GPU_METRICS_CSV_GLOB so the existing Process-result step runs the same vendor-agnostic utils/aggregate_power.py used for NVIDIA: per-source GPU-id namespacing (8 GPUs/node on MI355X, so a TP16 worker over 2 nodes counts 16 GPUs not 8), per-stage prefill/decode energy attribution, and per-worker temp/util/mem when amd-smi exposes those columns."
     - "Covers both engine paths: SGLang disagg (server_sglang.sh role = NODE_RANK bucketed by PREFILL_NODES_PER_WORKER / NODE_OFFSET) and vLLM disagg (server_vllm.sh one worker per node, ranks [0,xP) prefill / [xP,xP+yD) decode). Monitoring is best-effort end-to-end — a missing amd-smi or empty CSV skips power patching without failing the benchmark upload; DISAGG=true threads through to per-stage attribution while agg/non-disagg runs still get cluster-wide power."
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1574
+
+- config-keys:
+    - qwen3.5-fp4-mi355x-sglang-disagg
+  description:
+    - "Add Qwen3.5-397B-A17B-MXFP4 MI355X SGLang PD-disaggregation"
+    - "Bump image to lmsysorg/sglang-rocm:v0.5.12.post1-rocm720-mi35x-20260523, 1P1D TP8/EP1, dp-attn false, conc [8..512]"
+    - "MoRI conn.py overlay (48e459bd) via job.slurm; launcher qwen3.5_fp4_mi355x_sglang-disagg.sh"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1579
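
The GPU-id namespacing mentioned in the #1574 entry above matters because every node reports its GPUs with the same local ids, so merging per-node CSVs by GPU id alone would undercount. The sketch below illustrates the idea with synthetic `host,gpu_id` rows for one TP16 worker spanning two nodes. The row layout and host names are invented for the example; they are not the actual amd-smi CSV schema or the logic in utils/aggregate_power.py.

```shell
#!/usr/bin/env bash
# Synthetic samples: two hypothetical hosts, 8 GPUs each (local ids 0-7).
samples=$(
    for host in nodeA nodeB; do
        for gpu in 0 1 2 3 4 5 6 7; do
            echo "$host,$gpu"
        done
    done
)

# Naive count: distinct GPU ids only, so both nodes collapse onto ids 0-7.
naive=$(echo "$samples" | cut -d, -f2 | sort -u | wc -l)

# Namespaced count: distinct (host, gpu_id) pairs.
namespaced=$(echo "$samples" | sort -u | wc -l)

# $(( )) strips the padding some wc implementations add.
naive=$(( naive ))
namespaced=$(( namespaced ))

echo "naive=$naive namespaced=$namespaced"
```

Keying on the pair rather than the bare id is what lets a two-node TP16 worker count as 16 GPUs instead of 8.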
