feat(power): AMD multi-node measured-power telemetry (mi355x disagg)

arygupt · claude · arygupt · commit f407f4b4d653 · 2026-05-28T13:30:59.000-07:00
Mirror the NVIDIA gb300/srt-slurm measured-power path on the AMD
multi-node disaggregated inference path. With no orchestrator perfmon,
each SGLang/vLLM disagg node starts its own amd-smi monitor via
start_perf_monitor (benchmark_lib.sh), writing
perf_samples_&lt;role&gt;_w&lt;idx&gt;_&lt;host&gt;.csv into the NFS-shared
/benchmark_logs/perfmon mount; launch_mi355x-amds.sh collects them and
exports GPU_METRICS_CSV_GLOB so the existing vendor-agnostic
utils/aggregate_power.py produces per-worker + per-stage power.

AMD perfmon wiring:
- benchmark_lib.sh: start_perf_monitor helper; case-insensitive amd-smi
  header filter; log captured CSV header for schema-mismatch visibility
- amd_utils/job.slurm: PERFMON_OUTPUT_DIR + interval into each container
- amd_utils/server_sglang.sh / server_vllm.sh: per-node role + worker-idx
  classification (matches each engine's own placement); monitor start +
  stop on every exit path
- runners/launch_mi355x-amds.sh: collect per-node CSVs immediately after
  job completion (before result-processing early-exits / EXIT-trap wipe),
  export GPU_METRICS_CSV_GLOB
- utils/aggregate_power.py: docstring documents the AMD source (logic
  already vendor-agnostic)
- utils/test_aggregate_power.py: AMD amd-smi multinode tests (per-worker,
  per-stage J/token, multi-node-per-worker collapse, vLLM topology)
- perf-changelog.yaml: trigger the 6 mi355x disagg sweeps (sglang+vllm)

Also lands the concurrent per-metric telemetry extension in
aggregate_power.py / tests: temp/util/mem aggregation, workers[] schema,
and flat per-stage scalars (prefill_avg_power_w, decode_avg_power_w,
joules_per_input_token, joules_per_output_token_decode).

Verified locally: 107 utils tests pass; bash syntax + shellcheck clean;
role mapping + filename contract + full amd-smi-&gt;agg pipeline validated;
adversarial review findings addressed (CSV collection moved ahead of
early exits; case-insensitive amd-smi header).

Co-Authored-By: Claude Opus 4.8 &lt;noreply@anthropic.com&gt;
diff --git a/benchmarks/benchmark_lib.sh b/benchmarks/benchmark_lib.sh
@@ -41,10 +41,18 @@ start_gpu_monitor() {
         GPU_MONITOR_PID=$!
         echo "[GPU Monitor] Started NVIDIA (PID=$GPU_MONITOR_PID, interval=${interval}s, output=$output)"
     elif command -v amd-smi &>/dev/null; then
-        # Use amd-smi native watch mode (-w) which includes timestamps automatically.
-        # Pipe through awk to: skip preamble lines, keep first CSV header, skip repeated headers.
+        # amd-smi metric flags: -p power, -c clocks, -t temperature, -u usage,
+        # -w <interval> native watch mode (emits a timestamp column per sample),
+        # --csv. The awk filter keeps the first CSV header line and drops
+        # amd-smi's preamble / repeated headers. Header match is case-insensitive
+        # (tolower) so a capitalized "Timestamp," header — should amd-smi ever
+        # emit one — still passes through; aggregate_power's column detection is
+        # case-insensitive too. NOTE: amd-smi timestamps are node-local wall
+        # clock, so multinode aggregation assumes cluster clocks are NTP-synced
+        # (same assumption as nvidia-smi; aggregate_power windows by absolute
+        # epoch from benchmark_serving.py).
         amd-smi metric -p -c -t -u -w "$interval" --csv 2>/dev/null \
-            | awk '/^timestamp,/{if(!h){print;h=1};next} h{print}' > "$output" &
+            | awk 'tolower($0) ~ /^timestamp,/{if(!h){print;h=1};next} h{print}' > "$output" &
         GPU_MONITOR_PID=$!
         echo "[GPU Monitor] Started AMD (PID=$GPU_MONITOR_PID, interval=${interval}s, output=$output)"
     else
@@ -63,11 +71,75 @@ stop_gpu_monitor() {
             local lines
             lines=$(wc -l < "$GPU_METRICS_CSV")
             echo "[GPU Monitor] Collected $lines rows -> $GPU_METRICS_CSV"
+            # Echo the captured header so a vendor-SMI schema mismatch (the one
+            # thing that silently yields 0 usable power samples downstream) is
+            # visible in CI logs without re-running on hardware.
+            echo "[GPU Monitor] CSV header: $(head -1 "$GPU_METRICS_CSV" 2>/dev/null)"
         fi
     fi
     GPU_MONITOR_PID=""
 }
 
+# Start a per-node GPU power monitor for multi-node disaggregated runs.
+#
+# This is the AMD/SGLang/vLLM analogue of NVIDIA srt-slurm's per-node perfmon
+# (PR #35): there is no orchestrator to spawn nvidia-smi on each node, so each
+# node starts its own amd-smi/nvidia-smi monitor here. The output filename
+# encodes the worker role and index in exactly the format
+# utils/aggregate_power.py's _parse_perfmon_label expects:
+#
+#     perf_samples_<role>_w<worker_idx>_<host>.csv
+#
+# so the downstream aggregation can attribute energy per worker and (for disagg)
+# per stage. role must be one of: prefill, decode, agg, frontend.
+#
+# Output goes to $PERFMON_OUTPUT_DIR, which job.slurm points at the NFS-shared
+# /benchmark_logs/perfmon mount so every node's CSV lands in one directory the
+# runner can collect. The monitor runs for the whole server lifetime;
+# aggregate_power.py windows the samples down to each concurrency's benchmark
+# load window using the timestamps benchmark_serving.py writes.
+#
+# Best-effort by design: an unset output dir, an unknown role, or a missing
+# amd-smi/nvidia-smi is a no-op that returns 0 — a monitoring hiccup must never
+# fail the benchmark.
+#
+# Usage: start_perf_monitor <role> <worker_idx> [interval_seconds]
+start_perf_monitor() {
+    local role="$1"
+    local worker_idx="$2"
+    local interval="${3:-${PERFMON_SAMPLE_INTERVAL:-1}}"
+
+    local out_dir="${PERFMON_OUTPUT_DIR:-}"
+    if [[ -z "$out_dir" ]]; then
+        echo "[perfmon] PERFMON_OUTPUT_DIR unset — skipping per-node power monitor"
+        return 0
+    fi
+    case "$role" in
+        prefill|decode|agg|frontend) ;;
+        *)
+            echo "[perfmon] unknown role '$role' (expected prefill|decode|agg|frontend) — skipping monitor"
+            return 0
+            ;;
+    esac
+    if ! mkdir -p "$out_dir" 2>/dev/null; then
+        echo "[perfmon] cannot create $out_dir — skipping per-node power monitor"
+        return 0
+    fi
+
+    # Sanitize the host component so the filename stays parseable by
+    # aggregate_power's regex (role/idx anchors are unambiguous, but keep the
+    # host free of separators that could confuse a future tightening). Prefer
+    # the short hostname; fall back to the FQDN.
+    local host
+    host=$(hostname -s 2>/dev/null || hostname)
+    host=$(printf '%s' "$host" | tr -c 'A-Za-z0-9.-' '_')
+
+    local out="${out_dir}/perf_samples_${role}_w${worker_idx}_${host}.csv"
+    echo "[perfmon] starting per-node power monitor: role=$role worker=$worker_idx host=$host interval=${interval}s -> $out"
+    start_gpu_monitor --output "$out" --interval "$interval"
+    return 0
+}
+
 # Check if required environment variables are set
 # Usage: check_env_vars VAR1 VAR2 VAR3 ...
 # Exits with code 1 if any variable is not set
diff --git a/benchmarks/multi_node/amd_utils/job.slurm b/benchmarks/multi_node/amd_utils/job.slurm
@@ -298,6 +298,16 @@ export BENCHMARK_LOGS_DIR="${BENCHMARK_LOGS_DIR:-$(pwd)/benchmark_logs}"
 export KEEP_CONTAINERS="${KEEP_CONTAINERS:-0}"
 export ENGINE=$ENGINE
 
+# Per-node measured-power monitoring. Each node's server script starts an
+# amd-smi/nvidia-smi monitor (start_perf_monitor in benchmark_lib.sh) that
+# writes perf_samples_<role>_w<idx>_<host>.csv into PERFMON_OUTPUT_DIR. That
+# dir is the /benchmark_logs/perfmon mount, which maps to BENCHMARK_LOGS_DIR
+# on the (NFS-shared) host so every node's CSV lands in one place the runner
+# can collect. Pre-create it on the host so the directory exists before any
+# container writes to it.
+export PERFMON_SAMPLE_INTERVAL="${PERFMON_SAMPLE_INTERVAL:-1}"
+mkdir -p "${BENCHMARK_LOGS_DIR}/perfmon" 2>/dev/null || true
+
 # Eval-related env vars (threaded from submit.sh)
 export RUN_EVAL="${RUN_EVAL:-false}"
 export EVAL_ONLY="${EVAL_ONLY:-false}"
@@ -375,6 +385,8 @@ DOCKER_ENV_COMMON=(
     -e RUNNER_TYPE=\$RUNNER_TYPE
     -e RESULT_FILENAME=\$RESULT_FILENAME
     -e SPEC_DECODING=\$SPEC_DECODING
+    -e PERFMON_OUTPUT_DIR=/benchmark_logs/perfmon
+    -e PERFMON_SAMPLE_INTERVAL=\$PERFMON_SAMPLE_INTERVAL
     -e PREFILL_TP_SIZE=\$PREFILL_TP_SIZE
     -e PREFILL_ENABLE_EP=\$PREFILL_ENABLE_EP
     -e PREFILL_ENABLE_DP=\$PREFILL_ENABLE_DP
diff --git a/benchmarks/multi_node/amd_utils/server_sglang.sh b/benchmarks/multi_node/amd_utils/server_sglang.sh
@@ -48,6 +48,9 @@ GPUS_PER_NODE="${GPUS_PER_NODE:-8}"
 # =============================================================================
 source $SGLANG_WS_PATH/setup_deps.sh
 source $SGLANG_WS_PATH/env.sh
+# Power-monitoring helpers (start_perf_monitor / stop_gpu_monitor). WS_PATH is
+# .../benchmarks/multi_node/amd_utils, so the shared lib is two levels up.
+source "$SGLANG_WS_PATH/../../benchmark_lib.sh"
 
 host_ip=$(ip route get 1.1.1.1 | awk '/src/ {print $7}')
 host_name=$(hostname)
@@ -279,6 +282,27 @@ done
 echo "Prefill worker headnode list: ${PREFILL_HEADNODE_URLS[@]}"
 echo "Decode  worker headnode list: ${DECODE_HEADNODE_URLS[@]}"
 
+# =============================================================================
+# Per-node measured-power monitor (best-effort)
+# =============================================================================
+# Classify this node into the same worker buckets the role branches below use:
+#   NODE_RANK in [0, NODE_OFFSET)  -> prefill, worker = NODE_RANK / PREFILL_NODES_PER_WORKER
+#   NODE_RANK >= NODE_OFFSET       -> decode,  worker = (NODE_RANK - NODE_OFFSET) / DECODE_NODES_PER_WORKER
+# (NODE_OFFSET = PREFILL_NODES_PER_WORKER * xP.) Node 0 is the proxy too, but
+# its GPUs run the prefill head, so labeling it prefill attributes its energy
+# to the right stage. The monitor runs for the whole server lifetime;
+# aggregate_power.py windows the samples down to each concurrency's load window.
+if [ "$NODE_RANK" -lt "$NODE_OFFSET" ]; then
+    PERF_ROLE="prefill"
+    PERF_WORKER_IDX=$(( NODE_RANK / PREFILL_NODES_PER_WORKER ))
+else
+    PERF_ROLE="decode"
+    PERF_WORKER_IDX=$(( (NODE_RANK - NODE_OFFSET) / DECODE_NODES_PER_WORKER ))
+fi
+if [[ "$DRY_RUN" -ne 1 ]]; then
+    start_perf_monitor "$PERF_ROLE" "$PERF_WORKER_IDX"
+fi
+
 # =============================================================================
 # Configuration Builder Functions
 # =============================================================================
@@ -636,6 +660,7 @@ if [ "$NODE_RANK" -eq 0 ]; then
 
     if [[ "${EVAL_FAILED:-0}" -eq 1 ]]; then
         echo "ERROR: eval failed; exiting node-0 with rc=1"
+        stop_gpu_monitor
         exit 1
     fi
 
@@ -777,5 +802,8 @@ else
 
 fi
 
+# Stop the per-node power monitor and flush its CSV before the container exits.
+stop_gpu_monitor
+
 echo "Script completed successfully"
 exit 0
diff --git a/benchmarks/multi_node/amd_utils/server_vllm.sh b/benchmarks/multi_node/amd_utils/server_vllm.sh
@@ -50,6 +50,9 @@ MODEL_PATH="${MODEL_PATH:-${MODEL_DIR}/${MODEL_NAME}}"
 # Dependencies and Environment Setup
 # =============================================================================
 source $WS_PATH/env.sh
+# Power-monitoring helpers (start_perf_monitor / stop_gpu_monitor). WS_PATH is
+# .../benchmarks/multi_node/amd_utils, so the shared lib is two levels up.
+source "$WS_PATH/../../benchmark_lib.sh"
 
 host_ip=$(ip route get 1.1.1.1 2>/dev/null | awk '/src/ {print $7}')
 # RDMA IP for Nixl KV transfer (prefer 192.168.x.x subnet if available)
@@ -214,6 +217,25 @@ done
 echo "Prefill node IPs: ${PREFILL_ARGS}"
 echo "Decode  node IPs: ${DECODE_ARGS}"
 
+# =============================================================================
+# Per-node measured-power monitor (best-effort)
+# =============================================================================
+# vLLM places one worker per node: ranks [0, xP) are prefill (kv_producer),
+# ranks [xP, xP+yD) are decode (kv_consumer) — see the role branches below.
+# Node 0 is the proxy too, but its GPUs run the first prefill worker, so it is
+# correctly labeled prefill. The monitor runs for the whole server lifetime;
+# aggregate_power.py windows the samples down to each concurrency's load window.
+if [ "$NODE_RANK" -lt "$xP" ]; then
+    PERF_ROLE="prefill"
+    PERF_WORKER_IDX=$NODE_RANK
+else
+    PERF_ROLE="decode"
+    PERF_WORKER_IDX=$(( NODE_RANK - xP ))
+fi
+if [[ "$DRY_RUN" -ne 1 ]]; then
+    start_perf_monitor "$PERF_ROLE" "$PERF_WORKER_IDX"
+fi
+
 # MoRI-IO proxy ZMQ registration port (must match vllm-router --vllm-discovery-address)
 PROXY_PING_PORT="${PROXY_PING_PORT:-36367}"
 
@@ -408,6 +430,7 @@ if [ "$NODE_RANK" -eq 0 ]; then
 
     if [[ "${EVAL_FAILED:-0}" -eq 1 ]]; then
         echo "ERROR: eval failed; exiting node-0 with rc=1"
+        stop_gpu_monitor
         exit 1
     fi
 
@@ -523,5 +546,8 @@ fi
 # kill $etcd_pid 2>/dev/null || true
 # pkill -f etcd 2>/dev/null || true
 
+# Stop the per-node power monitor and flush its CSV before the container exits.
+stop_gpu_monitor
+
 echo "Script completed successfully"
 exit 0
diff --git a/runners/launch_mi355x-amds.sh b/runners/launch_mi355x-amds.sh
@@ -117,6 +117,43 @@ if [[ "$IS_MULTINODE" == "true" ]]; then
 
     set -x
 
+    # ── Per-node measured-power CSVs ──────────────────────────────────────
+    # Collect these FIRST — immediately after the job completes and before the
+    # result-processing block below, which has early `exit 1` paths (e.g. no
+    # logs dir found). Any early exit fires the EXIT trap (cleanup_and_save_logs),
+    # which `sudo rm -rf`s the whole $BENCHMARK_LOGS_DIR — so anything that needs
+    # to survive must be copied out before then. This mirrors launch_gb300-cw.sh,
+    # which collects srt-slurm's perfmon CSVs right after the job completes.
+    #
+    # Each node's server script (server_sglang.sh / server_vllm.sh) wrote
+    # perf_samples_<role>_w<idx>_<host>.csv into $BENCHMARK_LOGS_DIR/perfmon
+    # (NFS-shared, one file per node). Copy them into the GH workspace and point
+    # the downstream "Process result" step at them via GPU_METRICS_CSV_GLOB so
+    # utils/aggregate_power.py can do the multi-CSV per-worker / per-stage
+    # aggregation. Best-effort: a monitoring hiccup must never fail the upload.
+    PERFMON_SRC_DIR="$BENCHMARK_LOGS_DIR/perfmon"
+    if ls "$PERFMON_SRC_DIR"/perf_samples_*.csv >/dev/null 2>&1; then
+        PERFMON_DST_DIR="$GITHUB_WORKSPACE/perfmon"
+        mkdir -p "$PERFMON_DST_DIR"
+        cp "$PERFMON_SRC_DIR"/perf_samples_*.csv "$PERFMON_DST_DIR"/ 2>/dev/null \
+            || sudo cp "$PERFMON_SRC_DIR"/perf_samples_*.csv "$PERFMON_DST_DIR"/ 2>/dev/null \
+            || true
+        # CSVs may be root-owned on NFS (containers run as root); make them
+        # readable by the runner user for the Process result step.
+        sudo chown -R "$(id -u):$(id -g)" "$PERFMON_DST_DIR" 2>/dev/null || true
+        perf_csv_count=$(ls "$PERFMON_DST_DIR"/perf_samples_*.csv 2>/dev/null | wc -l | tr -d ' ')
+        if [ "$perf_csv_count" -gt 0 ]; then
+            echo "[perfmon] Collected $perf_csv_count per-node perf_samples_*.csv -> $PERFMON_DST_DIR"
+            if [ -n "${GITHUB_ENV:-}" ]; then
+                echo "GPU_METRICS_CSV_GLOB=$PERFMON_DST_DIR/perf_samples_*.csv" >> "$GITHUB_ENV"
+            fi
+        else
+            echo "[perfmon] WARNING: perf_samples_*.csv present under $PERFMON_SRC_DIR but none copied to $PERFMON_DST_DIR — measured power aggregation will be skipped"
+        fi
+    else
+        echo "[perfmon] No perf_samples_*.csv found under $PERFMON_SRC_DIR — measured power aggregation will be skipped"
+    fi
+
     # FIXME: The below is bad and is a result of the indirection of the ways in which
     # Dynamo jobs are launched. In a follow-up PR, the location of the result file should not
     # depend on the runner, it should always be in the same spot in the GH workspace.
@@ -182,6 +219,7 @@ PY
     fi
 
     echo "All result files processed"
+
     # Use sync scancel to ensure nfs file handle is released in time
     set +x
     scancel_sync $JOB_ID