
Commit 6849229

arygupt and claude committed
feat(power): capture AMD temp/util/mem (gfx_activity, used_vram, hotspot)
The amd-smi monitor only ran `metric -p -c -t -u`, so no VRAM column was emitted and avg_mem_used_mb never populated on AMD. It also used util/mem column matchers tuned for NVIDIA/srt-slurm names, which miss amd-smi's conventions, so avg_util_pct and avg_temp_c silently dropped too.

- benchmark_lib.sh: add `-m` (mem-usage) to the amd-smi command so a used_vram column is captured.
- aggregate_power.py column detection:
  - util: also match amd-smi `gfx_activity` (umc/mm_activity excluded).
  - mem: match positively on memory/vram + "used" instead of broad "mem" minus a growing exclude list. This picks memory.used / mem_used_mb / used_vram while rejecting mem_temperature, mem_voltage, total/free_vram, the memory clock, and utilization.memory.
  - temp: prefer hotspot/junction over the first temp column, since edge temperature reads N/A on data-center AMD parts (MI300/MI355).

NVIDIA and srt-slurm detection is unchanged (verified by existing tests). Adds AMD-header detection tests; full suite 111 passed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 488ce46 commit 6849229

3 files changed

Lines changed: 82 additions & 24 deletions

benchmarks/benchmark_lib.sh

Lines changed: 6 additions & 4 deletions
@@ -41,17 +41,19 @@ start_gpu_monitor() {
         GPU_MONITOR_PID=$!
         echo "[GPU Monitor] Started NVIDIA (PID=$GPU_MONITOR_PID, interval=${interval}s, output=$output)"
     elif command -v amd-smi &>/dev/null; then
-        # amd-smi metric flags: -p power, -c clocks, -t temperature, -u usage,
-        # -w <interval> native watch mode (emits a timestamp column per sample),
-        # --csv. The awk filter keeps the first CSV header line and drops
+        # amd-smi metric flags: -p power, -c clocks, -t temperature, -u usage
+        # (gfx_activity), -m mem-usage (used_vram), -w <interval> native watch
+        # mode (emits a timestamp column per sample), --csv. Without -m there is
+        # no VRAM column, so avg_mem_used_mb would never populate on AMD.
+        # The awk filter keeps the first CSV header line and drops
         # amd-smi's preamble / repeated headers. Header match is case-insensitive
         # (tolower) so a capitalized "Timestamp," header — should amd-smi ever
         # emit one — still passes through; aggregate_power's column detection is
         # case-insensitive too. NOTE: amd-smi timestamps are node-local wall
         # clock, so multinode aggregation assumes cluster clocks are NTP-synced
         # (same assumption as nvidia-smi; aggregate_power windows by absolute
         # epoch from benchmark_serving.py).
-        amd-smi metric -p -c -t -u -w "$interval" --csv 2>/dev/null \
+        amd-smi metric -p -c -t -u -m -w "$interval" --csv 2>/dev/null \
             | awk 'tolower($0) ~ /^timestamp,/{if(!h){print;h=1};next} h{print}' > "$output" &
         GPU_MONITOR_PID=$!
         echo "[GPU Monitor] Started AMD (PID=$GPU_MONITOR_PID, interval=${interval}s, output=$output)"
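For readers less fluent in awk, the header filter in the pipeline above can be sketched in Python. This is only an illustration of the same logic: the function name `keep_first_header` and the sample lines are hypothetical, not part of the commit.

```python
from typing import Iterable, Iterator


def keep_first_header(lines: Iterable[str]) -> Iterator[str]:
    """Mirror the awk filter: drop everything before the first header,
    emit that header once, and skip any repeated headers after it."""
    seen_header = False
    for line in lines:
        # Case-insensitive match on a line starting with "timestamp,".
        if line.lower().startswith("timestamp,"):
            if not seen_header:
                seen_header = True
                yield line
            continue  # a repeated header is never emitted
        if seen_header:
            yield line  # data rows pass through once a header was seen


# Hypothetical amd-smi-like output: a preamble line, then a repeated header.
sample = [
    "some preamble line",
    "Timestamp,gpu,socket_power",
    "1700000000,0,310",
    "timestamp,gpu,socket_power",
    "1700000001,0,312",
]
filtered = list(keep_first_header(sample))
```

With this sample, `filtered` holds the capitalized header once, followed by the two data rows.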

utils/aggregate_power.py

Lines changed: 36 additions & 20 deletions
@@ -57,14 +57,18 @@
 - Power: timestamp + column whose name contains "power" (excluding
   "limit"/"cap"/"max"/"min"). NVIDIA: "power.draw [W]". AMD: "socket_power".
   srt-slurm: "power_w".
-- Temperature: column name contains "temp". NVIDIA: "temperature.gpu". AMD:
-  "temperature". srt-slurm: "temp_c". Unit: Celsius.
-- Utilization: column name starts with "utilization" or contains "util".
+- Temperature: column name contains "temp"; hotspot/junction columns are
+  preferred over the first match because data-center AMD parts report edge
+  temperature as N/A. NVIDIA: "temperature.gpu". AMD amd-smi: "edge_temperature"
+  / "hotspot_temperature" (junction picked). srt-slurm: "temp_c". Unit: Celsius.
+- Utilization: column starts with "utilization" or contains "util", or is
+  amd-smi's "gfx_activity" (umc_activity / mm_activity are not matched).
   NVIDIA: "utilization.gpu". srt-slurm: "util_pct". Unit: percent.
-- Memory: column name contains "mem" but not "total"/"clock"/"util" — so
-  "memory.total", "clocks.current.memory" (a frequency), and
-  "utilization.memory" (a percent) are all rejected; only memory *used* is
-  picked. NVIDIA: "memory.used [MiB]". srt-slurm: "mem_used_mb". Unit: MiB/MB.
+- Memory used: column mentions memory/vram AND "used" — picks NVIDIA
+  "memory.used [MiB]", srt-slurm "mem_used_mb", amd-smi "used_vram"; rejects
+  decoys lacking "used" (memory.total / total_vram / free_vram, the memory
+  *clock* "clocks.current.memory", utilization.memory, mem_temperature,
+  mem_voltage). Unit: MiB/MB.
 
 Power is required for aggregation to fire; the other metrics degrade gracefully
 when their columns are absent (those fields are simply omitted from the output).
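To see why the memory rule moved from "contains mem, minus excludes" to a positive match, the sketch below runs both rules over the amd-smi header used in this commit's new test and over the NVIDIA header from the existing test. The three patterns are copied from the diff; the helper names `old_pick` and `new_pick` are hypothetical.

```python
import re

# Patterns copied from the diff: the removed include/exclude pair and the new
# single positive pattern.
OLD_MEM_COL_RE = re.compile(r"mem", re.IGNORECASE)
OLD_MEM_EXCLUDE_RE = re.compile(r"total|clock|util", re.IGNORECASE)
NEW_MEM_COL_RE = re.compile(r"(?:mem|vram).*used|used.*(?:mem|vram)", re.IGNORECASE)


def old_pick(header):
    """First column containing "mem" that is not excluded (pre-commit rule)."""
    return next(
        (c for c in header if OLD_MEM_COL_RE.search(c) and not OLD_MEM_EXCLUDE_RE.search(c)),
        None,
    )


def new_pick(header):
    """First column mentioning memory/vram together with "used"."""
    return next((c for c in header if NEW_MEM_COL_RE.search(c)), None)


# Header from test_detect_all_columns_amd_smi_full in this commit.
AMD_HEADER = [
    "timestamp", "gpu",
    "socket_power", "mem_voltage",
    "gfx_clock", "mem_clock",
    "edge_temperature", "hotspot_temperature", "mem_temperature",
    "gfx_activity", "umc_activity", "mm_activity",
    "total_vram", "used_vram", "free_vram",
]
# Header from test_detect_all_columns_excludes_memory_total.
NVIDIA_HEADER = ["timestamp", "index", "power.draw [W]", "memory.total [MiB]", "memory.used [MiB]"]

print("AMD   old:", old_pick(AMD_HEADER), "| new:", new_pick(AMD_HEADER))
print("NVIDIA old:", old_pick(NVIDIA_HEADER), "| new:", new_pick(NVIDIA_HEADER))
```

On the AMD header the old rule lands on `mem_voltage` (it contains "mem" and none of the excluded tokens), while the new rule picks `used_vram`. Both rules agree on `memory.used [MiB]` for the NVIDIA header.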
@@ -90,14 +94,24 @@
 _POWER_COL_RE = re.compile(r"power", re.IGNORECASE)
 _POWER_EXCLUDE_RE = re.compile(r"limit|cap|max|min", re.IGNORECASE)
 _TEMP_COL_RE = re.compile(r"temp", re.IGNORECASE)
-_UTIL_COL_RE = re.compile(r"^utilization|util", re.IGNORECASE)
-_MEM_COL_RE = re.compile(r"mem", re.IGNORECASE)
-# Exclude "total" (memory.total), "clock" (clocks.current.memory — a frequency,
-# not memory used), and "util" (utilization.memory — a percent). nvidia-smi's
-# query emits clocks.current.memory BEFORE any used-memory column, so without
-# these excludes _MEM_COL_RE would grab the memory *clock* (~2500 MHz) as
-# avg_mem_used_mb.
-_MEM_EXCLUDE_RE = re.compile(r"total|clock|util", re.IGNORECASE)
+# Data-center AMD parts (MI300/MI355) report edge temperature as N/A and expose
+# the real die temperature as hotspot/junction; prefer those when present so
+# avg_temp_c isn't computed over an all-N/A edge column. NVIDIA's single
+# "temperature.gpu" and srt-slurm's "temp_c" have neither token and fall through
+# to the first temperature column unchanged.
+_TEMP_PREFER_RE = re.compile(r"hotspot|junction", re.IGNORECASE)
+# Utilization: NVIDIA "utilization.gpu", srt-slurm "util_pct", AMD amd-smi
+# "gfx_activity" (the GPU/graphics-engine busy percent). amd-smi's other usage
+# columns — umc_activity (memory controller), mm_activity (multimedia) — are
+# intentionally NOT matched so gfx_activity is the one picked.
+_UTIL_COL_RE = re.compile(r"^utilization|util|gfx_activity", re.IGNORECASE)
+# Memory *used*: match positively on a column that mentions both memory/vram and
+# "used" rather than broad "mem" + a growing exclude list. This naturally picks
+# NVIDIA "memory.used [MiB]", srt-slurm "mem_used_mb", and amd-smi "used_vram"
+# while rejecting same-prefix decoys that lack "used": memory.total / total_vram /
+# free_vram, clocks.current.memory (a frequency), utilization.memory (a percent),
+# and amd-smi's mem_temperature / mem_voltage.
+_MEM_COL_RE = re.compile(r"(?:mem|vram).*used|used.*(?:mem|vram)", re.IGNORECASE)
 _TIMESTAMP_COL_RE = re.compile(r"time", re.IGNORECASE)
 _GPU_INDEX_COL_RE = re.compile(r"^(index|gpu|gpu_id|gpu_index|card|device)$", re.IGNORECASE)
 _NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")
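A quick check of the utilization pattern against the column names mentioned in the comments may help when reviewing this hunk. The pattern is copied from the diff; the candidate list and variable names are illustrative only.

```python
import re

# Pattern copied from the diff.
UTIL_COL_RE = re.compile(r"^utilization|util|gfx_activity", re.IGNORECASE)

# Column names taken from the comments and tests in this commit.
candidates = [
    "utilization.gpu",  # NVIDIA
    "util_pct",         # srt-slurm
    "gfx_activity",     # amd-smi graphics-engine busy percent
    "umc_activity",     # amd-smi memory controller, should not match
    "mm_activity",      # amd-smi multimedia, should not match
]

matched = [c for c in candidates if UTIL_COL_RE.search(c)]
rejected = [c for c in candidates if not UTIL_COL_RE.search(c)]
print("matched:", matched)
print("rejected:", rejected)
```

Two observations follow from how `re.search` treats alternation. First, the `^utilization` branch is subsumed by the plain `util` branch, so the pattern behaves the same as `util|gfx_activity`. Second, a name such as `utilization.memory` also matches, so which column is chosen depends on header order, because detection takes the first match.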
@@ -208,12 +222,14 @@ def _detect_all_columns(header: list[str]) -> dict[str, str | None]:
         (c for c in header if _POWER_COL_RE.search(c) and not _POWER_EXCLUDE_RE.search(c)),
         None,
     )
-    temp_col = next((c for c in header if _TEMP_COL_RE.search(c)), None)
-    util_col = next((c for c in header if _UTIL_COL_RE.search(c)), None)
-    mem_col = next(
-        (c for c in header if _MEM_COL_RE.search(c) and not _MEM_EXCLUDE_RE.search(c)),
-        None,
+    temp_cols = [c for c in header if _TEMP_COL_RE.search(c)]
+    # Prefer hotspot/junction (the real die temp on data-center AMD parts) over
+    # the first temperature column (edge on AMD, temperature.gpu on NVIDIA).
+    temp_col = next((c for c in temp_cols if _TEMP_PREFER_RE.search(c)), None) or (
+        temp_cols[0] if temp_cols else None
     )
+    util_col = next((c for c in header if _UTIL_COL_RE.search(c)), None)
+    mem_col = next((c for c in header if _MEM_COL_RE.search(c)), None)
     gpu_col = next((c for c in header if _GPU_INDEX_COL_RE.match(c.strip())), None)
     return {
         "timestamp": timestamp_col,

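The temperature selection in the hunk above is a two-step pick: prefer a hotspot/junction column, otherwise fall back to the first temperature column. A standalone sketch of that logic is shown here. The two patterns are copied from the diff; the function name `pick_temp_col` is hypothetical, since the real code does this inline in `_detect_all_columns`.

```python
import re

# Patterns copied from the diff.
TEMP_COL_RE = re.compile(r"temp", re.IGNORECASE)
TEMP_PREFER_RE = re.compile(r"hotspot|junction", re.IGNORECASE)


def pick_temp_col(header):
    """Return the preferred temperature column name, or None if there is none."""
    temp_cols = [c for c in header if TEMP_COL_RE.search(c)]
    # Step 1: a hotspot/junction column wins wherever it sits in the header.
    preferred = next((c for c in temp_cols if TEMP_PREFER_RE.search(c)), None)
    # Step 2: otherwise fall back to the first temperature column, if any.
    return preferred or (temp_cols[0] if temp_cols else None)


amd = ["timestamp", "gpu", "socket_power",
       "edge_temperature", "hotspot_temperature", "mem_temperature"]
nvidia = ["timestamp", "index", "power.draw [W]", "temperature.gpu"]
no_temp = ["timestamp", "power_w"]

print(pick_temp_col(amd))      # hotspot_temperature
print(pick_temp_col(nvidia))   # temperature.gpu
print(pick_temp_col(no_temp))  # None
```

Because NVIDIA and srt-slurm names contain neither "hotspot" nor "junction", they always take the fallback branch, which matches the commit's claim that their detection is unchanged.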
utils/test_aggregate_power.py

Lines changed: 40 additions & 0 deletions
@@ -1196,6 +1196,46 @@ def test_detect_all_columns_amd_style():
     assert cols["mem"] is None
 
 
+def test_detect_all_columns_amd_smi_full():
+    """Real amd-smi `metric -p -c -t -u -m --csv` header on a data-center part.
+
+    Exercises: gfx_activity as util (not umc/mm_activity), used_vram as mem (not
+    total/free_vram, mem_clock, mem_voltage, or mem_temperature), and
+    hotspot_temperature preferred over the N/A edge_temperature.
+    """
+    header = [
+        "timestamp", "gpu",
+        "socket_power", "mem_voltage",  # -p
+        "gfx_clock", "mem_clock",  # -c
+        "edge_temperature", "hotspot_temperature", "mem_temperature",  # -t
+        "gfx_activity", "umc_activity", "mm_activity",  # -u
+        "total_vram", "used_vram", "free_vram",  # -m
+    ]
+    cols = _detect_all_columns(header)
+    assert cols["power"] == "socket_power"
+    assert cols["util"] == "gfx_activity"
+    assert cols["mem"] == "used_vram"
+    # Hotspot/junction preferred over edge (edge reads N/A on MI300/MI355).
+    assert cols["temp"] == "hotspot_temperature"
+
+
+def test_detect_all_columns_temp_prefers_junction():
+    """junction_temperature wins over a leading edge_temperature column."""
+    header = ["timestamp", "gpu", "socket_power",
+              "edge_temperature", "junction_temperature"]
+    assert _detect_all_columns(header)["temp"] == "junction_temperature"
+
+
+def test_detect_all_columns_mem_vram_used_variants():
+    """Both used_vram and vram_used resolve; total/free_vram never do."""
+    assert _detect_all_columns(
+        ["timestamp", "power_w", "total_vram", "vram_used", "free_vram"]
+    )["mem"] == "vram_used"
+    assert _detect_all_columns(
+        ["timestamp", "power_w", "total_vram", "free_vram"]
+    )["mem"] is None
+
+
 def test_detect_all_columns_excludes_memory_total():
     """memory.total must not be picked as the memory column (we want USED memory)."""
     header = ["timestamp", "index", "power.draw [W]", "memory.total [MiB]", "memory.used [MiB]"]
