Skip to content

Commit f407f4b

Browse files
aryguptclaude
andcommitted
feat(power): AMD multi-node measured-power telemetry (mi355x disagg)
Mirror the NVIDIA gb300/srt-slurm measured-power path on the AMD multi-node disaggregated inference path. With no orchestrator perfmon, each SGLang/vLLM disagg node starts its own amd-smi monitor via start_perf_monitor (benchmark_lib.sh), writing perf_samples_<role>_w<idx>_<host>.csv into the NFS-shared /benchmark_logs/perfmon mount; launch_mi355x-amds.sh collects them and exports GPU_METRICS_CSV_GLOB so the existing vendor-agnostic utils/aggregate_power.py produces per-worker + per-stage power. AMD perfmon wiring: - benchmark_lib.sh: start_perf_monitor helper; case-insensitive amd-smi header filter; log captured CSV header for schema-mismatch visibility - amd_utils/job.slurm: PERFMON_OUTPUT_DIR + interval into each container - amd_utils/server_sglang.sh / server_vllm.sh: per-node role + worker-idx classification (matches each engine's own placement); monitor start + stop on every exit path - runners/launch_mi355x-amds.sh: collect per-node CSVs immediately after job completion (before result-processing early-exits / EXIT-trap wipe), export GPU_METRICS_CSV_GLOB - utils/aggregate_power.py: docstring documents the AMD source (logic already vendor-agnostic) - utils/test_aggregate_power.py: AMD amd-smi multinode tests (per-worker, per-stage J/token, multi-node-per-worker collapse, vLLM topology) - perf-changelog.yaml: trigger the 6 mi355x disagg sweeps (sglang+vllm) Also lands the concurrent per-metric telemetry extension in aggregate_power.py / tests: temp/util/mem aggregation, workers[] schema, and flat per-stage scalars (prefill_avg_power_w, decode_avg_power_w, joules_per_input_token, joules_per_output_token_decode). Verified locally: 107 utils tests pass; bash syntax + shellcheck clean; role mapping + filename contract + full amd-smi->agg pipeline validated; adversarial review findings addressed (CSV collection moved ahead of early exits; case-insensitive amd-smi header). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 5b3bcbb commit f407f4b

5 files changed

Lines changed: 179 additions & 3 deletions

File tree

benchmarks/benchmark_lib.sh

Lines changed: 75 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -41,10 +41,18 @@ start_gpu_monitor() {
4141
GPU_MONITOR_PID=$!
4242
echo "[GPU Monitor] Started NVIDIA (PID=$GPU_MONITOR_PID, interval=${interval}s, output=$output)"
4343
elif command -v amd-smi &>/dev/null; then
44-
# Use amd-smi native watch mode (-w) which includes timestamps automatically.
45-
# Pipe through awk to: skip preamble lines, keep first CSV header, skip repeated headers.
44+
# amd-smi metric flags: -p power, -c clocks, -t temperature, -u usage,
45+
# -w <interval> native watch mode (emits a timestamp column per sample),
46+
# --csv. The awk filter keeps the first CSV header line and drops
47+
# amd-smi's preamble / repeated headers. Header match is case-insensitive
48+
# (tolower) so a capitalized "Timestamp," header — should amd-smi ever
49+
# emit one — still passes through; aggregate_power's column detection is
50+
# case-insensitive too. NOTE: amd-smi timestamps are node-local wall
51+
# clock, so multinode aggregation assumes cluster clocks are NTP-synced
52+
# (same assumption as nvidia-smi; aggregate_power windows by absolute
53+
# epoch from benchmark_serving.py).
4654
amd-smi metric -p -c -t -u -w "$interval" --csv 2>/dev/null \
47-
| awk '/^timestamp,/{if(!h){print;h=1};next} h{print}' > "$output" &
55+
| awk 'tolower($0) ~ /^timestamp,/{if(!h){print;h=1};next} h{print}' > "$output" &
4856
GPU_MONITOR_PID=$!
4957
echo "[GPU Monitor] Started AMD (PID=$GPU_MONITOR_PID, interval=${interval}s, output=$output)"
5058
else
@@ -63,11 +71,75 @@ stop_gpu_monitor() {
6371
local lines
6472
lines=$(wc -l < "$GPU_METRICS_CSV")
6573
echo "[GPU Monitor] Collected $lines rows -> $GPU_METRICS_CSV"
74+
# Echo the captured header so a vendor-SMI schema mismatch (the one
75+
# thing that silently yields 0 usable power samples downstream) is
76+
# visible in CI logs without re-running on hardware.
77+
echo "[GPU Monitor] CSV header: $(head -1 "$GPU_METRICS_CSV" 2>/dev/null)"
6678
fi
6779
fi
6880
GPU_MONITOR_PID=""
6981
}
7082

83+
# Start a per-node GPU power monitor for multi-node disaggregated runs.
84+
#
85+
# This is the AMD/SGLang/vLLM analogue of NVIDIA srt-slurm's per-node perfmon
86+
# (PR #35): there is no orchestrator to spawn nvidia-smi on each node, so each
87+
# node starts its own amd-smi/nvidia-smi monitor here. The output filename
88+
# encodes the worker role and index in exactly the format
89+
# utils/aggregate_power.py's _parse_perfmon_label expects:
90+
#
91+
# perf_samples_<role>_w<worker_idx>_<host>.csv
92+
#
93+
# so the downstream aggregation can attribute energy per worker and (for disagg)
94+
# per stage. role must be one of: prefill, decode, agg, frontend.
95+
#
96+
# Output goes to $PERFMON_OUTPUT_DIR, which job.slurm points at the NFS-shared
97+
# /benchmark_logs/perfmon mount so every node's CSV lands in one directory the
98+
# runner can collect. The monitor runs for the whole server lifetime;
99+
# aggregate_power.py windows the samples down to each concurrency's benchmark
100+
# load window using the timestamps benchmark_serving.py writes.
101+
#
102+
# Best-effort by design: an unset output dir, an unknown role, or a missing
103+
# amd-smi/nvidia-smi is a no-op that returns 0 — a monitoring hiccup must never
104+
# fail the benchmark.
105+
#
106+
# Usage: start_perf_monitor <role> <worker_idx> [interval_seconds]
107+
start_perf_monitor() {
108+
local role="$1"
109+
local worker_idx="$2"
110+
local interval="${3:-${PERFMON_SAMPLE_INTERVAL:-1}}"
111+
112+
local out_dir="${PERFMON_OUTPUT_DIR:-}"
113+
if [[ -z "$out_dir" ]]; then
114+
echo "[perfmon] PERFMON_OUTPUT_DIR unset — skipping per-node power monitor"
115+
return 0
116+
fi
117+
case "$role" in
118+
prefill|decode|agg|frontend) ;;
119+
*)
120+
echo "[perfmon] unknown role '$role' (expected prefill|decode|agg|frontend) — skipping monitor"
121+
return 0
122+
;;
123+
esac
124+
if ! mkdir -p "$out_dir" 2>/dev/null; then
125+
echo "[perfmon] cannot create $out_dir — skipping per-node power monitor"
126+
return 0
127+
fi
128+
129+
# Sanitize the host component so the filename stays parseable by
130+
# aggregate_power's regex (role/idx anchors are unambiguous, but keep the
131+
# host free of separators that could confuse a future tightening). Prefer
132+
# the short hostname; fall back to the FQDN.
133+
local host
134+
host=$(hostname -s 2>/dev/null || hostname)
135+
host=$(printf '%s' "$host" | tr -c 'A-Za-z0-9.-' '_')
136+
137+
local out="${out_dir}/perf_samples_${role}_w${worker_idx}_${host}.csv"
138+
echo "[perfmon] starting per-node power monitor: role=$role worker=$worker_idx host=$host interval=${interval}s -> $out"
139+
start_gpu_monitor --output "$out" --interval "$interval"
140+
return 0
141+
}
142+
71143
# Check if required environment variables are set
72144
# Usage: check_env_vars VAR1 VAR2 VAR3 ...
73145
# Exits with code 1 if any variable is not set

benchmarks/multi_node/amd_utils/job.slurm

Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -298,6 +298,16 @@ export BENCHMARK_LOGS_DIR="${BENCHMARK_LOGS_DIR:-$(pwd)/benchmark_logs}"
298298
export KEEP_CONTAINERS="${KEEP_CONTAINERS:-0}"
299299
export ENGINE=$ENGINE
300300

301+
# Per-node measured-power monitoring. Each node's server script starts an
302+
# amd-smi/nvidia-smi monitor (start_perf_monitor in benchmark_lib.sh) that
303+
# writes perf_samples_<role>_w<idx>_<host>.csv into PERFMON_OUTPUT_DIR. That
304+
# dir is the /benchmark_logs/perfmon mount, which maps to BENCHMARK_LOGS_DIR
305+
# on the (NFS-shared) host so every node's CSV lands in one place the runner
306+
# can collect. Pre-create it on the host so the directory exists before any
307+
# container writes to it.
308+
export PERFMON_SAMPLE_INTERVAL="${PERFMON_SAMPLE_INTERVAL:-1}"
309+
mkdir -p "${BENCHMARK_LOGS_DIR}/perfmon" 2>/dev/null || true
310+
301311
# Eval-related env vars (threaded from submit.sh)
302312
export RUN_EVAL="${RUN_EVAL:-false}"
303313
export EVAL_ONLY="${EVAL_ONLY:-false}"
@@ -375,6 +385,8 @@ DOCKER_ENV_COMMON=(
375385
-e RUNNER_TYPE=\$RUNNER_TYPE
376386
-e RESULT_FILENAME=\$RESULT_FILENAME
377387
-e SPEC_DECODING=\$SPEC_DECODING
388+
-e PERFMON_OUTPUT_DIR=/benchmark_logs/perfmon
389+
-e PERFMON_SAMPLE_INTERVAL=\$PERFMON_SAMPLE_INTERVAL
378390
-e PREFILL_TP_SIZE=\$PREFILL_TP_SIZE
379391
-e PREFILL_ENABLE_EP=\$PREFILL_ENABLE_EP
380392
-e PREFILL_ENABLE_DP=\$PREFILL_ENABLE_DP

benchmarks/multi_node/amd_utils/server_sglang.sh

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -48,6 +48,9 @@ GPUS_PER_NODE="${GPUS_PER_NODE:-8}"
4848
# =============================================================================
4949
source $SGLANG_WS_PATH/setup_deps.sh
5050
source $SGLANG_WS_PATH/env.sh
51+
# Power-monitoring helpers (start_perf_monitor / stop_gpu_monitor). WS_PATH is
52+
# .../benchmarks/multi_node/amd_utils, so the shared lib is two levels up.
53+
source "$SGLANG_WS_PATH/../../benchmark_lib.sh"
5154

5255
host_ip=$(ip route get 1.1.1.1 | awk '/src/ {print $7}')
5356
host_name=$(hostname)
@@ -279,6 +282,27 @@ done
279282
echo "Prefill worker headnode list: ${PREFILL_HEADNODE_URLS[@]}"
280283
echo "Decode worker headnode list: ${DECODE_HEADNODE_URLS[@]}"
281284

285+
# =============================================================================
286+
# Per-node measured-power monitor (best-effort)
287+
# =============================================================================
288+
# Classify this node into the same worker buckets the role branches below use:
289+
# NODE_RANK in [0, NODE_OFFSET) -> prefill, worker = NODE_RANK / PREFILL_NODES_PER_WORKER
290+
# NODE_RANK >= NODE_OFFSET -> decode, worker = (NODE_RANK - NODE_OFFSET) / DECODE_NODES_PER_WORKER
291+
# (NODE_OFFSET = PREFILL_NODES_PER_WORKER * xP.) Node 0 is the proxy too, but
292+
# its GPUs run the prefill head, so labeling it prefill attributes its energy
293+
# to the right stage. The monitor runs for the whole server lifetime;
294+
# aggregate_power.py windows the samples down to each concurrency's load window.
295+
if [ "$NODE_RANK" -lt "$NODE_OFFSET" ]; then
296+
PERF_ROLE="prefill"
297+
PERF_WORKER_IDX=$(( NODE_RANK / PREFILL_NODES_PER_WORKER ))
298+
else
299+
PERF_ROLE="decode"
300+
PERF_WORKER_IDX=$(( (NODE_RANK - NODE_OFFSET) / DECODE_NODES_PER_WORKER ))
301+
fi
302+
if [[ "$DRY_RUN" -ne 1 ]]; then
303+
start_perf_monitor "$PERF_ROLE" "$PERF_WORKER_IDX"
304+
fi
305+
282306
# =============================================================================
283307
# Configuration Builder Functions
284308
# =============================================================================
@@ -636,6 +660,7 @@ if [ "$NODE_RANK" -eq 0 ]; then
636660

637661
if [[ "${EVAL_FAILED:-0}" -eq 1 ]]; then
638662
echo "ERROR: eval failed; exiting node-0 with rc=1"
663+
stop_gpu_monitor
639664
exit 1
640665
fi
641666

@@ -777,5 +802,8 @@ else
777802

778803
fi
779804

805+
# Stop the per-node power monitor and flush its CSV before the container exits.
806+
stop_gpu_monitor
807+
780808
echo "Script completed successfully"
781809
exit 0

benchmarks/multi_node/amd_utils/server_vllm.sh

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -50,6 +50,9 @@ MODEL_PATH="${MODEL_PATH:-${MODEL_DIR}/${MODEL_NAME}}"
5050
# Dependencies and Environment Setup
5151
# =============================================================================
5252
source $WS_PATH/env.sh
53+
# Power-monitoring helpers (start_perf_monitor / stop_gpu_monitor). WS_PATH is
54+
# .../benchmarks/multi_node/amd_utils, so the shared lib is two levels up.
55+
source "$WS_PATH/../../benchmark_lib.sh"
5356

5457
host_ip=$(ip route get 1.1.1.1 2>/dev/null | awk '/src/ {print $7}')
5558
# RDMA IP for Nixl KV transfer (prefer 192.168.x.x subnet if available)
@@ -214,6 +217,25 @@ done
214217
echo "Prefill node IPs: ${PREFILL_ARGS}"
215218
echo "Decode node IPs: ${DECODE_ARGS}"
216219

220+
# =============================================================================
221+
# Per-node measured-power monitor (best-effort)
222+
# =============================================================================
223+
# vLLM places one worker per node: ranks [0, xP) are prefill (kv_producer),
224+
# ranks [xP, xP+yD) are decode (kv_consumer) — see the role branches below.
225+
# Node 0 is the proxy too, but its GPUs run the first prefill worker, so it is
226+
# correctly labeled prefill. The monitor runs for the whole server lifetime;
227+
# aggregate_power.py windows the samples down to each concurrency's load window.
228+
if [ "$NODE_RANK" -lt "$xP" ]; then
229+
PERF_ROLE="prefill"
230+
PERF_WORKER_IDX=$NODE_RANK
231+
else
232+
PERF_ROLE="decode"
233+
PERF_WORKER_IDX=$(( NODE_RANK - xP ))
234+
fi
235+
if [[ "$DRY_RUN" -ne 1 ]]; then
236+
start_perf_monitor "$PERF_ROLE" "$PERF_WORKER_IDX"
237+
fi
238+
217239
# MoRI-IO proxy ZMQ registration port (must match vllm-router --vllm-discovery-address)
218240
PROXY_PING_PORT="${PROXY_PING_PORT:-36367}"
219241

@@ -408,6 +430,7 @@ if [ "$NODE_RANK" -eq 0 ]; then
408430

409431
if [[ "${EVAL_FAILED:-0}" -eq 1 ]]; then
410432
echo "ERROR: eval failed; exiting node-0 with rc=1"
433+
stop_gpu_monitor
411434
exit 1
412435
fi
413436

@@ -523,5 +546,8 @@ fi
523546
# kill $etcd_pid 2>/dev/null || true
524547
# pkill -f etcd 2>/dev/null || true
525548

549+
# Stop the per-node power monitor and flush its CSV before the container exits.
550+
stop_gpu_monitor
551+
526552
echo "Script completed successfully"
527553
exit 0

runners/launch_mi355x-amds.sh

Lines changed: 38 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -117,6 +117,43 @@ if [[ "$IS_MULTINODE" == "true" ]]; then
117117

118118
set -x
119119

120+
# ── Per-node measured-power CSVs ──────────────────────────────────────
121+
# Collect these FIRST — immediately after the job completes and before the
122+
# result-processing block below, which has early `exit 1` paths (e.g. no
123+
# logs dir found). Any early exit fires the EXIT trap (cleanup_and_save_logs),
124+
# which `sudo rm -rf`s the whole $BENCHMARK_LOGS_DIR — so anything that needs
125+
# to survive must be copied out before then. This mirrors launch_gb300-cw.sh,
126+
# which collects srt-slurm's perfmon CSVs right after the job completes.
127+
#
128+
# Each node's server script (server_sglang.sh / server_vllm.sh) wrote
129+
# perf_samples_<role>_w<idx>_<host>.csv into $BENCHMARK_LOGS_DIR/perfmon
130+
# (NFS-shared, one file per node). Copy them into the GH workspace and point
131+
# the downstream "Process result" step at them via GPU_METRICS_CSV_GLOB so
132+
# utils/aggregate_power.py can do the multi-CSV per-worker / per-stage
133+
# aggregation. Best-effort: a monitoring hiccup must never fail the upload.
134+
PERFMON_SRC_DIR="$BENCHMARK_LOGS_DIR/perfmon"
135+
if ls "$PERFMON_SRC_DIR"/perf_samples_*.csv >/dev/null 2>&1; then
136+
PERFMON_DST_DIR="$GITHUB_WORKSPACE/perfmon"
137+
mkdir -p "$PERFMON_DST_DIR"
138+
cp "$PERFMON_SRC_DIR"/perf_samples_*.csv "$PERFMON_DST_DIR"/ 2>/dev/null \
139+
|| sudo cp "$PERFMON_SRC_DIR"/perf_samples_*.csv "$PERFMON_DST_DIR"/ 2>/dev/null \
140+
|| true
141+
# CSVs may be root-owned on NFS (containers run as root); make them
142+
# readable by the runner user for the Process result step.
143+
sudo chown -R "$(id -u):$(id -g)" "$PERFMON_DST_DIR" 2>/dev/null || true
144+
perf_csv_count=$(ls "$PERFMON_DST_DIR"/perf_samples_*.csv 2>/dev/null | wc -l | tr -d ' ')
145+
if [ "$perf_csv_count" -gt 0 ]; then
146+
echo "[perfmon] Collected $perf_csv_count per-node perf_samples_*.csv -> $PERFMON_DST_DIR"
147+
if [ -n "${GITHUB_ENV:-}" ]; then
148+
echo "GPU_METRICS_CSV_GLOB=$PERFMON_DST_DIR/perf_samples_*.csv" >> "$GITHUB_ENV"
149+
fi
150+
else
151+
echo "[perfmon] WARNING: perf_samples_*.csv present under $PERFMON_SRC_DIR but none copied to $PERFMON_DST_DIR — measured power aggregation will be skipped"
152+
fi
153+
else
154+
echo "[perfmon] No perf_samples_*.csv found under $PERFMON_SRC_DIR — measured power aggregation will be skipped"
155+
fi
156+
120157
# FIXME: The below is bad and is a result of the indirection of the ways in which
121158
# Dynamo jobs are launched. In a follow-up PR, the location of the result file should not
122159
# depend on the runner, it should always be in the same spot in the GH workspace.
@@ -182,6 +219,7 @@ PY
182219
fi
183220

184221
echo "All result files processed"
222+
185223
# Use sync scancel to ensure nfs file handle is released in time
186224
set +x
187225
scancel_sync $JOB_ID

0 commit comments

Comments
 (0)