Commit 06558b9
feat(power): per-worker power + per-stage J/token for disagg
Extends multinode measured-power aggregation with per-worker breakdown
and per-stage joules attribution. The cluster-wide avg_power_w +
joules_per_*_token fields stay backward-compatible; new disagg-only
fields layer on top.
New agg JSON fields:
- power_by_worker: list of {role, worker_idx, hosts, num_gpus,
avg_power_w} parsed from srt-slurm perfmon CSV filenames
(`perf_samples_<role>_w<idx>_<host>.csv`). Roles: prefill, decode,
agg, frontend. Workers spanning N nodes collapse into one entry whose
num_gpus is the cross-node sum.
- joules_per_input_token: prefill_energy / total_input_tokens
(disagg only — meaningless without a prefill stage).
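
A minimal sketch of the label parsing and multi-node collapse described above, not the code from this commit. The filename pattern and role names come from the message. The helper names (`parse_label`, `collapse_workers`), the per-file input tuple, and the choice to sum per-node average power into the worker's avg_power_w are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Pattern from the commit message: perf_samples_<role>_w<idx>_<host>.csv
_NAME_RE = re.compile(
    r"^perf_samples_(prefill|decode|agg|frontend)_w(\d+)_(.+)\.csv$"
)


def parse_label(filename):
    """Return (role, worker_idx, host), or None if the name does not match."""
    m = _NAME_RE.match(filename)
    if m is None:
        return None
    role, idx, host = m.groups()
    return role, int(idx), host


def collapse_workers(per_file):
    """Collapse per-node CSV summaries into one entry per (role, worker_idx).

    per_file: iterable of (filename, num_gpus, avg_power_w), one per CSV.
    This input shape is hypothetical; the real reader is not shown in the commit.
    """
    grouped = defaultdict(
        lambda: {"hosts": [], "num_gpus": 0, "avg_power_w": 0.0}
    )
    for filename, num_gpus, avg_power_w in per_file:
        label = parse_label(filename)
        if label is None:
            continue  # skip files that are not perfmon CSVs
        role, idx, host = label
        entry = grouped[(role, idx)]
        entry["hosts"].append(host)
        entry["num_gpus"] += num_gpus  # cross-node sum, as in the message
        # Assumption: worker power is the sum of its nodes' average power.
        entry["avg_power_w"] += avg_power_w
    return [
        {"role": role, "worker_idx": idx, **entry}
        for (role, idx), entry in sorted(grouped.items())
    ]
```

Keeping the host as the last, greedy capture group lets hostnames containing dots or underscores pass through unchanged.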
Per-stage attribution (disagg only) replaces cluster-wide ratios for
existing fields:
- joules_per_output_token = decode_energy / output_tokens
- joules_per_total_token = (prefill + decode) / all_tokens
Frontend-labeled CSVs are excluded from per-stage energy but still
listed for observability. Falls back to cluster-wide math if only one
stage's CSVs survived.
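
The per-stage arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the implementation in aggregate_power: the function name `joules_per_token`, its parameters, computing energy as average power times a shared run duration, the cluster-wide definitions used in the fallback, and returning None for a zero-token denominator are all hypothetical choices.

```python
def joules_per_token(power_by_worker, duration_s, input_tokens,
                     output_tokens, cluster_avg_power_w):
    """Per-stage J/token for a disagg run, with cluster-wide fallback.

    power_by_worker: list of dicts with at least "role" and "avg_power_w".
    Assumption: stage energy = summed stage power * duration_s.
    """
    def stage_power(role):
        return sum(w["avg_power_w"] for w in power_by_worker
                   if w["role"] == role)

    def ratio(energy_j, tokens):
        # Assumption: report None rather than divide by zero.
        return energy_j / tokens if tokens > 0 else None

    prefill_w = stage_power("prefill")
    decode_w = stage_power("decode")
    # Frontend entries are never summed, so they stay out of stage energy.
    total_tokens = input_tokens + output_tokens

    if prefill_w > 0 and decode_w > 0:
        prefill_j = prefill_w * duration_s
        decode_j = decode_w * duration_s
        return {
            "joules_per_input_token": ratio(prefill_j, input_tokens),
            "joules_per_output_token": ratio(decode_j, output_tokens),
            "joules_per_total_token": ratio(prefill_j + decode_j,
                                            total_tokens),
        }

    # Only one stage's data is available: fall back to cluster-wide math.
    cluster_j = cluster_avg_power_w * duration_s
    return {
        "joules_per_output_token": ratio(cluster_j, output_tokens),
        "joules_per_total_token": ratio(cluster_j, total_tokens),
    }
```

For example, a 10 s run with 1000 W of prefill power, 2000 W of decode power, 5000 input tokens and 1000 output tokens gives 2.0 J per input token, 20.0 J per output token and 5.0 J per total token under these assumptions.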
process_result.py now passes DISAGG through to aggregate_power.run().
launch_gb300-cw.sh's recipe-injection loop reports found/injected
counts so a zero-recipes-found bug is distinguishable from the
benign all-already-monitored case.
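
The found/injected distinction can be illustrated with a short sketch. The real loop lives in a shell script and is not shown here; this version is written in Python only to match the other examples, and the marker string, function names and message wording are hypothetical.

```python
def inject_monitoring(recipes, marker="perfmon: true"):
    """Append a monitoring marker to recipes that lack it.

    recipes: dict of recipe name -> recipe text (hypothetical input shape).
    Returns (updated_recipes, found, injected).
    """
    found = len(recipes)
    injected = 0
    updated = {}
    for name, text in recipes.items():
        if marker in text:
            updated[name] = text  # already monitored, leave untouched
        else:
            updated[name] = text.rstrip("\n") + "\n" + marker + "\n"
            injected += 1
    return updated, found, injected


def summarize(found, injected):
    """Make the zero-found bug distinguishable from the benign case."""
    if found == 0:
        return "error: no recipes found"
    if injected == 0:
        return f"ok: all {found} recipes already monitored"
    return f"ok: injected monitoring into {injected} of {found} recipes"
```

Reporting both counts matters because an injected count of zero on its own cannot tell an empty recipe search apart from a fully monitored recipe set.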
Tests: 88/88 pass (68 existing + 20 new). New coverage: label parsing
across host formats, multi-node-per-worker collapse, per-stage J/token
math, frontend exclusion, single-stage fallback, zero-input degenerate,
end-to-end disagg wiring through process_result.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 317049d commit 06558b9
5 files changed
Lines changed: 803 additions & 92 deletions