Commit 24f46ff

arygupt and claude committed
feat(power): per-worker power + per-stage J/token for disagg
Extends multinode measured-power aggregation with a per-worker breakdown and per-stage joules attribution. The cluster-wide avg_power_w and joules_per_*_token fields stay backward-compatible; new disagg-only fields layer on top.

New agg JSON fields:
- power_by_worker: list of {role, worker_idx, hosts, num_gpus, avg_power_w} parsed from srt-slurm perfmon CSV filenames (`perf_samples_<role>_w<idx>_<host>.csv`). Roles: prefill, decode, agg, frontend. A worker spanning N nodes collapses into one entry whose num_gpus is the cross-node sum.
- joules_per_input_token: prefill_energy / total_input_tokens (disagg only; meaningless without a prefill stage).

Per-stage attribution (disagg only) replaces cluster-wide ratios for existing fields:
- joules_per_output_token = decode_energy / output_tokens
- joules_per_total_token = (prefill + decode) / all_tokens

Frontend-labeled CSVs are excluded from per-stage energy but still listed for observability. Falls back to cluster-wide math if only one stage's CSVs survived.

process_result.py now passes DISAGG through to aggregate_power.run(). launch_gb300-cw.sh's recipe-injection loop reports found/injected counts so a zero-recipes-found bug is distinguishable from the benign all-already-monitored case.

Tests: 88/88 pass (68 existing + 20 new). New coverage: label parsing across host formats, multi-node-per-worker collapse, per-stage J/token math, frontend exclusion, single-stage fallback, zero-input degenerate, end-to-end disagg wiring through process_result.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
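The per-worker collapse and per-stage J/token rules described in the message can be illustrated with a minimal sketch. This is not the code from aggregate_power; the function names, the regex, the input tuple shape, the choice to sum node power within a worker, treating all_tokens as input plus output, and the exact form of the cluster-wide fallback are all assumptions made for illustration. Only the filename pattern, the role names, and the three ratios come from the commit message.

```python
import re
from collections import defaultdict

# Filename shape from the commit message; the regex itself is an assumption.
_LABEL_RE = re.compile(
    r"^perf_samples_(prefill|decode|agg|frontend)_w(\d+)_(.+)\.csv$"
)


def parse_label(filename):
    """Return (role, worker_idx, host), or None if the name does not match."""
    m = _LABEL_RE.match(filename)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), m.group(3)


def collapse_workers(samples):
    """Merge per-node CSV summaries into one entry per (role, worker_idx).

    samples: iterable of (filename, num_gpus, avg_power_w), one per CSV
    (hypothetical input shape).
    """
    merged = defaultdict(lambda: {"hosts": [], "num_gpus": 0, "avg_power_w": 0.0})
    for filename, num_gpus, avg_power_w in samples:
        label = parse_label(filename)
        if label is None:
            continue  # not a perfmon CSV
        role, idx, host = label
        entry = merged[(role, idx)]
        entry["hosts"].append(host)
        entry["num_gpus"] += num_gpus        # cross-node sum, per the message
        entry["avg_power_w"] += avg_power_w  # assumption: node powers add up
    return [
        {"role": role, "worker_idx": idx, **entry}
        for (role, idx), entry in sorted(merged.items())
    ]


def joules_per_token(prefill_j, decode_j, cluster_j, input_tokens, output_tokens):
    """Per-stage ratios; prefill_j or decode_j is None if that stage is missing."""

    def ratio(energy, tokens):
        # Zero-token degenerate case: report None instead of dividing by zero.
        return energy / tokens if tokens > 0 else None

    total_tokens = input_tokens + output_tokens  # assumed meaning of all_tokens
    if prefill_j is None or decode_j is None:
        # Single-stage fallback: cluster-wide ratios (assumed form).
        return {
            "joules_per_output_token": ratio(cluster_j, output_tokens),
            "joules_per_total_token": ratio(cluster_j, total_tokens),
        }
    return {
        "joules_per_input_token": ratio(prefill_j, input_tokens),
        "joules_per_output_token": ratio(decode_j, output_tokens),
        "joules_per_total_token": ratio(prefill_j + decode_j, total_tokens),
    }
```

As a worked case under these assumptions: 1000 J of prefill energy over 500 input tokens and 3000 J of decode energy over 1500 output tokens give 2.0 J per input token, 2.0 J per output token, and (1000 + 3000) / 2000 = 2.0 J per total token. Frontend energy never enters these sums because only the prefill and decode energies are passed in.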
1 parent 0cf62f1 commit 24f46ff

5 files changed

Lines changed: 803 additions & 92 deletions

runners/launch_gb300-cw.sh

Lines changed: 11 additions & 2 deletions
@@ -123,16 +123,25 @@ cp -rT "$SRT_RECIPE_SRC" "$SRT_RECIPE_DST"
 # and perfmon never spawns. PR #1574's first real sweep (#26548110246) hit
 # exactly this: completed "success" with no power data because the glob
 # matched nothing and the failure was silent end-to-end.
+FOUND_COUNT=0
 INJECTED_COUNT=0
 while IFS= read -r recipe; do
+    FOUND_COUNT=$((FOUND_COUNT + 1))
     if ! grep -q '^monitoring:' "$recipe"; then
         printf '\nmonitoring:\n enabled: true\n sample_interval: 1.0\n' >> "$recipe"
         echo "[perfmon] enabled monitoring in recipe: $recipe"
         INJECTED_COUNT=$((INJECTED_COUNT + 1))
     fi
 done < <(find "$SRT_RECIPE_DST" -type f -name '*.yaml')
-if [ "$INJECTED_COUNT" -eq 0 ]; then
-    echo "[perfmon] WARNING: zero recipes received monitoring injection under $SRT_RECIPE_DST. Either every recipe already had it, or the directory layout changed — power data will be MISSING from this run." >&2
+# Distinguish "found 0 recipes" (real bug — directory wrong/empty) from "all
+# already had monitoring:" (benign — happens on reruns or if a recipe author
+# pre-declared the block). Only the former is a missing-power-data risk.
+if [ "$FOUND_COUNT" -eq 0 ]; then
+    echo "[perfmon] WARNING: zero recipe YAMLs found under $SRT_RECIPE_DST. The directory layout may have changed — power data will be MISSING from this run." >&2
+elif [ "$INJECTED_COUNT" -eq 0 ]; then
+    echo "[perfmon] all $FOUND_COUNT recipes already declared monitoring: — no injection needed."
+else
+    echo "[perfmon] injected monitoring: into $INJECTED_COUNT of $FOUND_COUNT recipes."
 fi
 
 echo "Installing srtctl..."
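The three outcomes the new branch distinguishes can be exercised in isolation with a small Python port of the loop. This is a hypothetical sketch for illustration, not part of the commit: the function name `inject_monitoring` and the status strings are invented here, and the appended block copies the `printf` text as it appears in the diff above.

```python
import re
import tempfile
from pathlib import Path

# Same text the shell printf appends (as shown in the diff).
MONITORING_BLOCK = "\nmonitoring:\n enabled: true\n sample_interval: 1.0\n"


def inject_monitoring(recipe_dir):
    """Hypothetical port of the shell loop; returns (found, injected, status)."""
    found = injected = 0
    # Equivalent of: find "$SRT_RECIPE_DST" -type f -name '*.yaml'
    for recipe in sorted(Path(recipe_dir).rglob("*.yaml")):
        if not recipe.is_file():
            continue
        found += 1
        text = recipe.read_text()
        # Equivalent of: grep -q '^monitoring:' "$recipe"
        if not re.search(r"^monitoring:", text, flags=re.MULTILINE):
            recipe.write_text(text + MONITORING_BLOCK)
            injected += 1
    if found == 0:
        status = "no-recipes-found"        # the real bug: wrong or empty directory
    elif injected == 0:
        status = "all-already-monitored"   # benign: rerun or pre-declared block
    else:
        status = "injected"
    return found, injected, status


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "a.yaml").write_text("name: a\n")
        print(inject_monitoring(tmp))  # first run injects into a.yaml
        print(inject_monitoring(tmp))  # rerun finds nothing left to inject
```

Running the function twice on the same directory shows why the old single check was ambiguous: the second run injects nothing, yet power data is not at risk, which is exactly the case the `elif` branch now reports without a warning.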
