
Commit 5b3bcbb

arygupt and claude committed
feat(power): realign agg JSON fields with InferenceX-app METRIC_KEYS + add temp/util/mem
Realigns the per-worker / per-stage schema introduced in 06558b9 to match the canonical METRIC_KEYS already declared in InferenceX-app (packages/app/src/lib/metric-keys.ts). Previously this PR overrode cluster-wide joules_per_output_token for disagg runs, which would silently shift the meaning of a shared field. New per-stage values are emitted as separate flat scalars so the cluster keys stay byte-stable.

Schema changes:
- Revert disagg override on joules_per_output_token and joules_per_total_token; both are now ALWAYS cluster-wide (total_system_energy / token_count), matching single-node math and the frontend's existing axis labels.
- Add new disagg-only flat scalars (already in frontend METRIC_KEYS):
    prefill_avg_power_w             cluster mean across prefill workers
    decode_avg_power_w              cluster mean across decode workers
    joules_per_output_token_decode  decode_energy / output_tokens
    joules_per_input_token          unchanged (prefill_energy / input_tokens)
- Rename power_by_worker[] -> workers[] to match InferenceX-app's BenchmarkRow.workers / WorkerPower interface.
- Each workers[] entry extended with per-worker telemetry: avg_temp_c, peak_temp_c, avg_util_pct, avg_mem_used_mb
- Add matching cluster-wide telemetry scalars (per-GPU mean, omitted when CSV lacks the column).

Implementation:
- _read_samples + _aggregate_rows refactored to extract all metric columns in one pass (single-vendor regex per metric, gracefully degrades when a column is absent).
- aggregate_power() preserved as a thin compat wrapper returning the old (power, num_gpus) tuple so external callers don't break.
- Per-stage prefill_avg_power_w / decode_avg_power_w use weighted mean by num_gpus (matches how cluster avg_power_w is computed).
- Frontend-labeled CSVs still excluded from per-stage energy attribution; included in cluster totals.

Tests: 107/107 pass (88 existing baseline preserved, 14 new telemetry tests, 5 schema-renamed tests updated in place).
New coverage: temp / util / mem extraction across NVIDIA + AMD + srt-slurm CSV schemas, peak vs avg distinction, missing-column graceful degradation, per-worker telemetry, per-stage weighted-mean scalars.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
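
The schema rules in the commit message can be illustrated with a minimal sketch. It is not the code in utils/aggregate_power.py: the `Worker` class, `build_metrics`, and `energy_j` are hypothetical names, and two points are assumptions rather than facts from the commit: that energy is mean per-GPU power times GPU count times wall-clock duration, and that token_count means input plus output tokens. The sketch shows cluster-wide ratios that include frontend workers, per-stage scalars emitted only for disagg runs with both stages present, and the num_gpus-weighted mean.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Worker:
    # Hypothetical stand-in for one workers[] entry; only these fields are used.
    role: str           # "prefill", "decode", "agg", or "frontend"
    num_gpus: int
    avg_power_w: float  # assumed to be the mean per-GPU power over the run


def weighted_avg_power(workers: list) -> Optional[float]:
    """Per-GPU mean power, weighting each worker by its GPU count."""
    total_gpus = sum(w.num_gpus for w in workers)
    if total_gpus == 0:
        return None
    return sum(w.avg_power_w * w.num_gpus for w in workers) / total_gpus


def energy_j(workers: list, duration_s: float) -> float:
    # Assumption: energy = per-GPU mean power x GPU count x wall-clock time.
    return sum(w.avg_power_w * w.num_gpus for w in workers) * duration_s


def build_metrics(workers, duration_s, input_tokens, output_tokens, disagg):
    total_j = energy_j(workers, duration_s)  # frontend workers count here
    out = {
        "avg_power_w": weighted_avg_power(workers),
        "joules_per_output_token": total_j / output_tokens,
        # Assumption: token_count means input + output tokens.
        "joules_per_total_token": total_j / (input_tokens + output_tokens),
    }
    prefill = [w for w in workers if w.role == "prefill"]
    decode = [w for w in workers if w.role == "decode"]
    # Per-stage scalars only when both stages are present on a disagg run;
    # frontend workers are left out of per-stage attribution.
    if disagg and prefill and decode:
        out["prefill_avg_power_w"] = weighted_avg_power(prefill)
        out["decode_avg_power_w"] = weighted_avg_power(decode)
        out["joules_per_input_token"] = energy_j(prefill, duration_s) / input_tokens
        out["joules_per_output_token_decode"] = energy_j(decode, duration_s) / output_tokens
    return out
```

With two prefill workers of 4 and 12 GPUs drawing 400 W and 800 W per GPU, the weighted mean is 700 W rather than the unweighted 600 W, which is the distinction the "weighted mean by num_gpus" bullet is making.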
1 parent 1af17ab commit 5b3bcbb

4 files changed

Lines changed: 1053 additions & 182 deletions


perf-changelog.yaml

Lines changed: 14 additions & 1 deletion
@@ -3198,5 +3198,18 @@
   description:
     - "Smoke run validating multinode measured-power aggregation (PR #1574). No config change; entry exists to trigger a sweep that produces the first multinode agg JSON with avg_power_w + joules_per_*_token populated from per-node srt-slurm perfmon CSVs. Validates per-source GPU-id namespacing in aggregate_power.py (without it, 14 nodes × 4 GPUs would report num_gpus=4 instead of 56) and the GPU_METRICS_CSV_GLOB env var bridge in process_result.py. Only the gb300-cw runner has the perfmon launcher changes; any gb300-nv runs in the sweep will succeed normally without power fields, which the dashboard handles gracefully (chart gates on field presence)."
     - "Re-run after launcher recurse-glob fix (6da2f1b6) — prior sweep (#26548110246) completed green at the workflow level but produced 0 measured-power rows because the flat *.yaml glob in the monitoring-injection loop matched zero recipes (recipes live in 8k1k/ subdir). Fix uses `find -type f -name '*.yaml'`. Also re-pointed SemiAnalysisAI/srt-slurm@feat/inferencex-perfmon onto current NVIDIA/srt-slurm main so the launcher's `default_bash_preamble:` srtslurm.yaml field is accepted by srtctl schema."
-    - "Re-run after per-worker aggregation (24f46ffe) — validates new agg JSON fields: power_by_worker[] with role labels (prefill/decode/agg/frontend) parsed from srt-slurm perfmon CSV filenames, and joules_per_input_token using per-stage energy attribution (prefill_energy / input_tokens). joules_per_output_token and joules_per_total_token now use per-stage math for disagg runs. Backward compatible: single-node and non-disagg multinode keep cluster-wide ratios."
+    - "Re-run after per-worker aggregation (24f46ffe) — validates new agg JSON fields: workers[] with role labels (prefill/decode/agg/frontend) parsed from srt-slurm perfmon CSV filenames, plus per-stage scalars (prefill_avg_power_w, decode_avg_power_w, joules_per_input_token = prefill_energy / input_tokens, joules_per_output_token_decode = decode_energy / output_tokens). joules_per_output_token and joules_per_total_token stay cluster-wide on all topologies so the metric is comparable across single-node, multinode-agg, and multinode-disagg. Per-stage scalars emitted only for disagg runs with both prefill and decode workers present. workers[] entries also carry per-worker avg_temp_c/peak_temp_c/avg_util_pct/avg_mem_used_mb when the CSV exposes those columns."
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1574
+
+- config-keys:
+    - qwen3.5-fp8-mi355x-sglang-disagg
+    - glm5-fp8-mi355x-sglang-disagg
+    - dsr1-fp8-mi355x-sglang-disagg
+    - dsr1-fp4-mi355x-sglang-disagg
+    - kimik2.5-fp4-mi355x-vllm-disagg
+    - minimaxm2.5-fp8-mi355x-vllm-disagg
+  description:
+    - "Smoke run validating AMD multinode measured-power aggregation — the AMD analogue of the NVIDIA gb300/srt-slurm path (PR #1574). No config change; entry exists to trigger a sweep that produces the first AMD multinode agg JSONs with avg_power_w + joules_per_*_token + per-worker workers[] populated from per-node amd-smi perfmon CSVs."
+    - "The AMD amd_utils SLURM job has no orchestrator perfmon, so each SGLang/vLLM disagg node starts its own amd-smi monitor via start_perf_monitor (benchmarks/benchmark_lib.sh), writing perf_samples_<role>_w<idx>_<host>.csv into the NFS-shared /benchmark_logs/perfmon mount (wired in amd_utils/job.slurm). launch_mi355x-amds.sh collects the per-node CSVs into the GH workspace before the EXIT trap wipes the logs dir and sets GPU_METRICS_CSV_GLOB so the existing Process-result step runs the same vendor-agnostic utils/aggregate_power.py used for NVIDIA: per-source GPU-id namespacing (8 GPUs/node on MI355X, so a TP16 worker over 2 nodes counts 16 GPUs not 8), per-stage prefill/decode energy attribution, and per-worker temp/util/mem when amd-smi exposes those columns."
+    - "Covers both engine paths: SGLang disagg (server_sglang.sh role = NODE_RANK bucketed by PREFILL_NODES_PER_WORKER / NODE_OFFSET) and vLLM disagg (server_vllm.sh one worker per node, ranks [0,xP) prefill / [xP,xP+yD) decode). Monitoring is best-effort end-to-end — a missing amd-smi or empty CSV skips power patching without failing the benchmark upload; DISAGG=true threads through to per-stage attribution while agg/non-disagg runs still get cluster-wide power."
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1574
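
The per-source GPU-id namespacing described in these changelog entries can be sketched as follows. This is an illustration under stated assumptions, not the logic in utils/aggregate_power.py: the function names are hypothetical, each sample is assumed to reduce to a (source CSV path, GPU id) pair, and the regex assumes the perf_samples_<role>_w<idx>_<host>.csv naming quoted above with a lowercase role. The idea is that GPU ids restart at 0 on every node, so a GPU must be identified by its source file as well as its id.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed filename layout: perf_samples_<role>_w<idx>_<host>.csv
_NAME_RE = re.compile(
    r"^perf_samples_(?P<role>[a-z]+)_w(?P<idx>\d+)_(?P<host>.+)\.csv$"
)


def parse_source(path):
    """Return (role, worker_idx, host), or None if the name does not match."""
    m = _NAME_RE.match(PurePosixPath(path).name)
    if m is None:
        return None
    return m["role"], int(m["idx"]), m["host"]


def count_gpus_naive(samples):
    # Without namespacing, GPU 0 on every node collapses into one GPU.
    return len({gpu for _src, gpu in samples})


def count_gpus(samples):
    # Namespaced: a GPU is identified by (source file, local GPU id).
    return len({(src, gpu) for src, gpu in samples})


def gpus_per_worker(samples):
    """Group namespaced GPUs by (role, worker_idx) across hosts."""
    seen = defaultdict(set)
    for src, gpu in samples:
        parsed = parse_source(src)
        if parsed is None:
            continue  # unlabeled sources are skipped in this sketch
        role, idx, _host = parsed
        seen[(role, idx)].add((src, gpu))
    return {key: len(gpus) for key, gpus in seen.items()}
```

For 14 nodes with 4 GPUs each, `count_gpus_naive` gives 4 while `count_gpus` gives 56, and a worker whose CSVs come from two 8-GPU hosts is counted as 16 GPUs by `gpus_per_worker`.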
