Commit 5b3bcbb
feat(power): realign agg JSON fields with InferenceX-app METRIC_KEYS + add temp/util/mem
Realigns the per-worker / per-stage schema introduced in 06558b9 to
match the canonical METRIC_KEYS already declared in InferenceX-app
(packages/app/src/lib/metric-keys.ts). Previously this PR overrode
cluster-wide joules_per_output_token for disagg runs, which would
silently shift the meaning of a shared field. New per-stage values are
emitted as separate flat scalars so the cluster keys stay byte-stable.
Schema changes:
- Revert disagg override on joules_per_output_token and
  joules_per_total_token: both are now ALWAYS cluster-wide
  (total_system_energy / token_count), matching single-node math
  and the frontend's existing axis labels.
- Add new disagg-only flat scalars (already in frontend METRIC_KEYS):
    prefill_avg_power_w             cluster mean across prefill workers
    decode_avg_power_w              cluster mean across decode workers
    joules_per_output_token_decode  decode_energy / output_tokens
  joules_per_input_token is unchanged (prefill_energy / input_tokens).
- Rename power_by_worker[] -> workers[] to match
  InferenceX-app's BenchmarkRow.workers / WorkerPower interface.
- Each workers[] entry extended with per-worker telemetry:
    avg_temp_c, peak_temp_c, avg_util_pct, avg_mem_used_mb
- Add matching cluster-wide telemetry scalars (per-GPU mean, omitted
  when CSV lacks the column).
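The scalar definitions above can be sketched as follows. This is a minimal illustration, not the code in this commit: the function names, the `stage` field on each worker entry, and the reading of token_count as input plus output tokens for joules_per_total_token are assumptions made for the example. The per-stage power uses the num_gpus-weighted mean described under Implementation.

```python
from typing import Optional


def weighted_mean_power(workers: list, stage: str) -> Optional[float]:
    """num_gpus-weighted mean of avg_power_w over workers of one stage.

    Assumes each entry carries 'stage', 'avg_power_w' and 'num_gpus'
    (the 'stage' field name is hypothetical).
    """
    selected = [w for w in workers if w.get("stage") == stage]
    total_gpus = sum(w["num_gpus"] for w in selected)
    if total_gpus == 0:
        return None  # no workers for this stage: omit the scalar
    weighted = sum(w["avg_power_w"] * w["num_gpus"] for w in selected)
    return weighted / total_gpus


def _ratio(energy_j: float, tokens: int) -> Optional[float]:
    """Energy per token, or None when there are no tokens to divide by."""
    return energy_j / tokens if tokens > 0 else None


def flat_scalars(workers, total_energy_j, prefill_energy_j,
                 decode_energy_j, input_tokens, output_tokens) -> dict:
    """Cluster-wide keys always use total energy; per-stage keys are separate."""
    return {
        # Cluster-wide: never overridden for disagg runs.
        "joules_per_output_token": _ratio(total_energy_j, output_tokens),
        "joules_per_total_token": _ratio(
            total_energy_j, input_tokens + output_tokens),  # assumed token_count
        # Per-stage: emitted alongside, not in place of, the cluster keys.
        "joules_per_input_token": _ratio(prefill_energy_j, input_tokens),
        "joules_per_output_token_decode": _ratio(decode_energy_j, output_tokens),
        "prefill_avg_power_w": weighted_mean_power(workers, "prefill"),
        "decode_avg_power_w": weighted_mean_power(workers, "decode"),
    }
```

Because the per-stage helpers filter on the stage label, a worker labeled anything other than prefill or decode contributes nothing to the per-stage scalars in this sketch, while its energy can still be counted in total_energy_j by the caller.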
Implementation:
- _read_samples + _aggregate_rows refactored to extract all metric
  columns in one pass (single-vendor regex per metric, gracefully
  degrades when a column is absent).
- aggregate_power() preserved as a thin compat wrapper returning the
  old (power, num_gpus) tuple so external callers don't break.
- Per-stage prefill_avg_power_w / decode_avg_power_w use weighted
  mean by num_gpus (matches how cluster avg_power_w is computed).
- Frontend-labeled CSVs still excluded from per-stage energy
  attribution; included in cluster totals.
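The one-pass extraction with graceful degradation could look roughly like the sketch below. The regex patterns, the `summarize` name, and the cell-parsing helper are illustrative assumptions; the actual patterns used by _read_samples and _aggregate_rows are not shown in this message. Only the output key names (avg_temp_c, peak_temp_c, avg_util_pct, avg_mem_used_mb) come from the schema above.

```python
import csv
import io
import re

# Illustrative header patterns only (hypothetical, one per metric).
METRIC_PATTERNS = {
    "power_w": re.compile(r"power", re.I),
    "temp_c": re.compile(r"temp", re.I),
    "util_pct": re.compile(r"util", re.I),
    "mem_used_mb": re.compile(r"mem.*used", re.I),
}


def _to_float(cell):
    """Pull the first number out of a cell such as '123.4 W'; None if absent."""
    match = re.search(r"-?\d+(?:\.\d+)?", cell or "")
    return float(match.group()) if match else None


def summarize(csv_text: str) -> dict:
    """Read every matching metric column in a single pass over the rows."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    headers = reader.fieldnames or []

    # Map each metric to the first header that matches; skip metrics
    # whose column is missing instead of raising.
    columns = {}
    for metric, pattern in METRIC_PATTERNS.items():
        header = next((h for h in headers if pattern.search(h)), None)
        if header is not None:
            columns[metric] = header

    values = {metric: [] for metric in columns}
    for row in reader:
        for metric, header in columns.items():
            number = _to_float(row.get(header))
            if number is not None:
                values[metric].append(number)

    summary = {}
    for metric, samples in values.items():
        if not samples:
            continue  # column present but unparseable: omit the key
        summary[f"avg_{metric}"] = sum(samples) / len(samples)
        if metric == "temp_c":
            summary["peak_temp_c"] = max(samples)  # peak is tracked separately from avg
    return summary
```

A CSV that has only power and temperature columns yields avg_power_w, avg_temp_c and peak_temp_c, with the utilization and memory keys simply absent from the result.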
Tests: 107/107 pass (88 existing baseline preserved, 14 new telemetry
tests, 5 schema-renamed tests updated in place). New coverage: temp /
util / mem extraction across NVIDIA + AMD + srt-slurm CSV schemas,
peak vs avg distinction, missing-column graceful degradation,
per-worker telemetry, per-stage weighted-mean scalars.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 1af17ab commit 5b3bcbb
4 files changed
Lines changed: 1053 additions & 182 deletions