Commit 183f9b1

arygupt and claude authored
feat(power): measured-power multinode support (workers[] + per-stage joules) (#405)

* feat(inference): plumb multinode measured-power fields through transform layer

  The multinode runner emits per-role power scalars (prefill_avg_power_w,
  decode_avg_power_w, joules_per_input_token, joules_per_output_token_decode)
  and a per-worker breakdown array on disagg runs. Surface them on
  BenchmarkRow -> AggDataEntry -> InferenceData so chart features can consume
  them without further transform changes. Workers ride as a sibling of the
  scalar metrics dict to keep the existing Record<string, number> index intact.

* feat(power): measured-power multinode support in the app

  App-side support for the InferenceX runner's multinode measured-power
  telemetry: per-worker power breakdown (workers[] with role / num_gpus /
  hosts), per-stage joules (input = prefill energy, output = decode energy),
  and cluster-wide temp / util / mem.

  - db: migration 006_benchmark_results_workers.sql; ETL mapper + ingest +
    queries + json-provider carry workers[] and the new metric scalars
  - constants: metric-keys lists the measured-power fields;
    joules_per_output_token is per-stage decode on disagg (no separate
    _decode key)
  - app: benchmark-transform + inference types surface the fields;
    unofficial-run route passes them through; cypress measured-power-overlay
    e2e
  - drop joules_per_output_token_decode (folded into joules_per_output_token)
    to match the runner's per-stage change

  Validated: app 60 + db 223 unit tests pass; tsc --noEmit clean.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(test): scroll measured-power dropdown options into view before asserting visibility

  The measured-power-overlay e2e asserted `.should('be.visible')` on the gated
  "Measured Energy" group + its options right after opening the Y-axis
  dropdown. That list is a scroll container (searchable-select: max-h-72
  overflow-y-auto) and the group sits below the fold, so the items are in the
  DOM but clipped by the overflow ancestor. Cypress `be.visible` does not
  auto-scroll, so it failed deterministically on both the chrome and firefox
  e2e shard-2 jobs. The sibling test passed only because `.click()`
  auto-scrolls its target into view. Add `.scrollIntoView()` before each
  visibility assertion.

  Verified locally: measured-power-overlay.cy.ts 2/2 passing against the real
  run 26312107787 artifact with E2E_FIXTURES=1.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* refactor(db): alias BenchmarkWorkerRow to WorkerPower to drop in-package dup

  The read-side BenchmarkWorkerRow interface in queries/benchmarks.ts was a
  verbatim redeclaration of the ingest-side WorkerPower in
  etl/benchmark-mapper.ts. Alias it to the canonical definition so the
  per-worker shape can't drift within the package, keeping the
  BenchmarkWorkerRow name its consumers reference. Addresses the Cursor Bugbot
  "triplicate WorkerPower interface" finding on #405.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 4c3f35e commit 183f9b1

14 files changed

Lines changed: 667 additions & 8 deletions

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+// Verifies the new measured-power Y-axis options render on the unofficial-run
+// overlay path against a real GitHub Actions artifact (run 26312107787 — the
+// on-PR sweep for PR #1558 / qwen3.5-fp8-h200-sglang). This is the canonical
+// "preview before merge" test path per CLAUDE.md's overlay requirement.
+
+describe('Measured power on unofficial-run overlay', () => {
+  beforeEach(() => {
+    cy.visit('/inference?unofficialrun=26312107787', {
+      onBeforeLoad(win) {
+        win.localStorage.setItem('inferencex-star-modal-dismissed', String(Date.now()));
+        win.localStorage.setItem('inferencex-feature-gate', '1');
+      },
+    });
+    cy.get('[data-testid="inference-chart-display"]', { timeout: 30_000 }).should('exist');
+  });
+
+  it('exposes the Measured Energy dropdown group and renders overlay points', () => {
+    // Open Y-axis dropdown
+    cy.get('[data-testid="yaxis-metric-selector"]').click();
+    cy.get('[data-slot="select-content"]').should('exist');
+
+    // Verify the gated "Measured Energy" group + both options. The select list is a
+    // scroll container (max-h-72 overflow-y-auto), and this group sits below the fold,
+    // so scroll each target into view before asserting visibility.
+    cy.contains('[data-slot="select-content"]', 'Measured Energy')
+      .scrollIntoView()
+      .should('be.visible');
+    cy.contains('[role="option"]', 'Measured Average Power per GPU')
+      .scrollIntoView()
+      .should('be.visible');
+    cy.contains('[role="option"]', 'Measured Joules per Output Token')
+      .scrollIntoView()
+      .should('be.visible');
+
+    // Select the power option
+    cy.contains('[role="option"]', 'Measured Average Power per GPU').click();
+    cy.get('[data-slot="select-content"]').should('not.exist');
+
+    // Initial-load screenshot
+    cy.screenshot('measured-power-selected', { capture: 'viewport' });
+
+    // The chart should now contain SVG <path> + <circle>/<polygon> elements
+    // (overlay points typically render as triangles). Existence is enough —
+    // visual correctness is reviewed in the screenshot.
+    cy.get('[data-testid="inference-chart-display"] svg', { timeout: 10_000 }).should('exist');
+  });
+
+  it('switches to Measured Joules per Output Token without errors', () => {
+    cy.get('[data-testid="yaxis-metric-selector"]').click();
+    cy.contains('[role="option"]', 'Measured Joules per Output Token').click();
+    cy.get('[data-slot="select-content"]').should('not.exist');
+    cy.screenshot('measured-joules-selected', { capture: 'viewport' });
+    cy.get('[data-testid="inference-chart-display"] svg').should('exist');
+  });
+});

packages/app/src/app/api/unofficial-run/route.test.ts

Lines changed: 30 additions & 0 deletions
@@ -206,6 +206,36 @@ describe('normalizeArtifactRows', () => {
     );
     expect(rows.every((r) => r.date === '2026-03-11')).toBe(true);
   });
+
+  it('surfaces the per-worker measured-power array on the BenchmarkRow', () => {
+    const workers = [
+      {
+        role: 'prefill',
+        worker_idx: 0,
+        hosts: ['pn0'],
+        num_gpus: 4,
+        avg_power_w: 612.3,
+        avg_temp_c: 71.2,
+      },
+      {
+        role: 'decode',
+        worker_idx: 0,
+        hosts: ['dn0', 'dn1'],
+        num_gpus: 8,
+        avg_power_w: 712.1,
+      },
+    ];
+    const rows = normalizeArtifactRows([rawRow({ workers })], '2026-03-01');
+    expect(rows[0].workers).toHaveLength(2);
+    expect(rows[0].workers![0].hosts).toEqual(['pn0']);
+    expect(rows[0].workers![0].avg_temp_c).toBe(71.2);
+    expect(rows[0].workers![1].role).toBe('decode');
+  });
+
+  it('leaves workers undefined when the artifact omits the field', () => {
+    const rows = normalizeArtifactRows([rawRow()], '2026-03-01');
+    expect(rows[0].workers).toBeUndefined();
+  });
 });
 
 describe('normalizeEvalArtifactRows', () => {

packages/app/src/app/api/unofficial-run/route.ts

Lines changed: 3 additions & 0 deletions
@@ -55,6 +55,9 @@ export function normalizeArtifactRows(
       conc: params.conc,
       image: params.image,
       metrics: params.metrics,
+      // Surface the same per-worker payload the DB path emits so unofficial
+      // overlays carry the multinode measured-power breakdown too.
+      workers: params.workers,
       date,
       run_url: runUrl,
     });

packages/app/src/components/inference/types.ts

Lines changed: 69 additions & 0 deletions
@@ -3,6 +3,50 @@ import type React from 'react';
 import type { HardwareEntry } from '@/lib/constants';
 import type { Model, Sequence } from '@/lib/data-mappings';
 
+/**
+ * Role of a single worker process in a multinode / disaggregated deployment.
+ * - `prefill` / `decode`: the two halves of a disaggregated serving setup
+ * - `agg`: an aggregated (non-disagg) worker that handles both phases
+ * - `frontend`: a router / load-balancer process (typically zero GPUs)
+ *
+ * Carried on `WorkerPower.role` as `string` (not the literal union) because
+ * the runner emits the role at the JSONB boundary — we can't statically
+ * guarantee the value at the type system level. Consumers that switch on the
+ * role should narrow via `if (role === 'prefill') ...` or a `WorkerRole`
+ * cast at the point of use.
+ */
+export type WorkerRole = 'prefill' | 'decode' | 'agg' | 'frontend';
+
+/**
+ * Per-worker measured power entry emitted by the runner's aggregate_power.py
+ * for multinode and disaggregated runs. The chart layer can use these to
+ * surface a stacked breakdown of where energy is spent across worker types.
+ *
+ * `hosts` lists the node hostnames whose perfmon CSVs were rolled up into
+ * this worker entry (a single-node worker has one host; a multinode decode
+ * worker spanning 4 nodes has four). Optional because pre-multinode versions
+ * of aggregate_power.py didn't emit it.
+ *
+ * `avg_temp_c`, `peak_temp_c`, `avg_util_pct`, `avg_mem_used_mb` mirror the
+ * cluster-wide telemetry scalars and are only present when the perfmon CSVs
+ * include the corresponding sample columns. Each is optional so callers can
+ * distinguish "field absent from this run" from "field present and equal to 0".
+ */
+export interface WorkerPower {
+  // `string` rather than `WorkerRole` so the type lines up with what we get
+  // from the JSONB column without an unsafe cast at every boundary. Chart
+  // code can still narrow on the literal values it understands.
+  role: string;
+  worker_idx: number;
+  hosts?: string[];
+  num_gpus: number;
+  avg_power_w: number;
+  avg_temp_c?: number;
+  peak_temp_c?: number;
+  avg_util_pct?: number;
+  avg_mem_used_mb?: number;
+}
+
 /**
  * Represents an aggregated data entry, typically from a raw data source.
  * This interface contains various performance metrics.
@@ -72,6 +116,31 @@ export interface AggDataEntry {
   avg_power_w?: number;
   joules_per_output_token?: number;
   joules_per_total_token?: number;
+  // Multinode / disagg-only measured power. The aggregate_power.py runner
+  // emits per-role energy splits when the deployment has separate prefill
+  // and decode workers (single-node disagg or multinode disagg). Single-node
+  // aggregated configs leave these undefined.
+  // - prefill_avg_power_w / decode_avg_power_w: mean per-GPU draw (W) within each role
+  // - joules_per_input_token: prefill_energy / total_input_tokens (prefill GPUs only)
+  // The disagg decode-only J/output is carried by joules_per_output_token above
+  // (the runner overrides it to decode_energy / total_output_tokens on disagg) —
+  // there is no separate _decode field.
+  prefill_avg_power_w?: number;
+  decode_avg_power_w?: number;
+  joules_per_input_token?: number;
+  // Cluster-wide GPU telemetry beyond power (temperature, utilization, memory).
+  // Emitted by aggregate_power.py when the perfmon CSVs include the matching
+  // sample columns. Optional because older runs (and runs without the relevant
+  // perfmon samples) leave them unset — the chart layer must distinguish "no
+  // measurement" from "0".
+  avg_temp_c?: number;
+  peak_temp_c?: number;
+  avg_util_pct?: number;
+  avg_mem_used_mb?: number;
+  // Per-worker measured power breakdown. Each entry is one worker process
+  // (a prefill, decode, agg, or frontend role). Optional because pre-multinode
+  // and pre-aggregate_power.py runs don't emit it.
+  workers?: WorkerPower[];
   disagg: boolean;
   num_prefill_gpu: number;
   num_decode_gpu: number;
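
The `WorkerRole` comment in this hunk leaves narrowing of `role` to the consumer. Below is a minimal sketch of how chart code might do that before building a stacked breakdown. The `isWorkerRole` guard and the `totalPowerByRole` helper are hypothetical names that do not appear in this commit, and the sketch assumes each entry's `avg_power_w` is a per-GPU mean, like the top-level scalar.

```typescript
type WorkerRole = 'prefill' | 'decode' | 'agg' | 'frontend';

// Trimmed copy of the WorkerPower shape from this hunk (telemetry fields omitted).
interface WorkerPower {
  role: string;
  worker_idx: number;
  hosts?: string[];
  num_gpus: number;
  avg_power_w: number;
}

// Hypothetical type guard: narrows the wire-level string to the literal union.
function isWorkerRole(role: string): role is WorkerRole {
  return role === 'prefill' || role === 'decode' || role === 'agg' || role === 'frontend';
}

// Hypothetical helper: total draw (W) per known role, assuming avg_power_w is
// a per-GPU mean. Entries with an unrecognized role are skipped, not coerced.
function totalPowerByRole(workers: WorkerPower[]): Partial<Record<WorkerRole, number>> {
  const totals: Partial<Record<WorkerRole, number>> = {};
  for (const w of workers) {
    const role = w.role;
    if (!isWorkerRole(role)) continue;
    totals[role] = (totals[role] ?? 0) + w.avg_power_w * w.num_gpus;
  }
  return totals;
}
```

Skipping unknown roles keeps a future runner-side role from being silently folded into an existing bucket; whether to drop or surface such entries is a design choice this commit does not settle.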

packages/app/src/lib/api.ts

Lines changed: 11 additions & 0 deletions
@@ -3,6 +3,8 @@
  * Each function is a thin fetch wrapper returning typed data.
  */
 
+import type { WorkerPower } from '@/components/inference/types';
+
 import type { SubmissionsResponse } from './submissions-types';
 
 export interface BenchmarkRow {
@@ -28,6 +30,15 @@ export interface BenchmarkRow {
   conc: number;
   image: string | null;
   metrics: Record<string, number>;
+  /**
+   * Per-worker measured power for multinode / disagg runs. The runner emits
+   * this as a JSONB sibling of the scalar metrics; the API layer surfaces it
+   * as a separate field here so the scalar `metrics` index signature can stay
+   * `Record<string, number>` and existing `m.x ?? 0` call sites keep narrowing
+   * cleanly. Undefined for single-node runs and any run predating
+   * aggregate_power.py.
+   */
+  workers?: WorkerPower[];
   date: string;
   run_url: string | null;
 }

packages/app/src/lib/benchmark-transform.test.ts

Lines changed: 110 additions & 0 deletions
@@ -133,6 +133,116 @@ describe('rowToAggDataEntry', () => {
     expect(entry.avg_power_w).toBeUndefined();
     expect(entry.joules_per_output_token).toBeUndefined();
   });
+
+  it('passes through multinode / disagg role-split power scalars when present', () => {
+    const entry = rowToAggDataEntry(
+      makeRow({
+        metrics: {
+          tput_per_gpu: 100,
+          prefill_avg_power_w: 612.3,
+          decode_avg_power_w: 701.5,
+          joules_per_input_token: 1.2,
+          // disagg: joules_per_output_token IS the per-stage decode value.
+          joules_per_output_token: 9.7,
+        },
+      }),
+    );
+    expect(entry.prefill_avg_power_w).toBe(612.3);
+    expect(entry.decode_avg_power_w).toBe(701.5);
+    expect(entry.joules_per_input_token).toBe(1.2);
+    expect(entry.joules_per_output_token).toBe(9.7);
+  });
+
+  it('passes through per-worker measured power array intact', () => {
+    const workers = [
+      { role: 'prefill' as const, worker_idx: 0, num_gpus: 4, avg_power_w: 588.4 },
+      { role: 'prefill' as const, worker_idx: 1, num_gpus: 4, avg_power_w: 601.2 },
+      { role: 'decode' as const, worker_idx: 0, num_gpus: 8, avg_power_w: 712.1 },
+      { role: 'frontend' as const, worker_idx: 0, num_gpus: 0, avg_power_w: 0 },
+    ];
+    const entry = rowToAggDataEntry(makeRow({ workers }));
+    expect(entry.workers).toEqual(workers);
+  });
+
+  it('defensively drops a non-array workers payload', () => {
+    // The DB JSONB column is untyped at the wire boundary, so guard against a
+    // malformed row reaching downstream consumers.
+    const entry = rowToAggDataEntry(
+      // eslint-disable-next-line @typescript-eslint/no-explicit-any
+      makeRow({ workers: 'oops' as any }),
+    );
+    expect(entry.workers).toBeUndefined();
+  });
+
+  it('leaves multinode role-split scalars and workers undefined for legacy rows', () => {
+    // Single-node configs predating the multinode runner don't emit any of
+    // the role-split fields; transform must yield undefined (not 0) so the
+    // chart layer can distinguish "no measurement" from a real zero.
+    const entry = rowToAggDataEntry(makeRow({ metrics: {} }));
+    expect(entry.prefill_avg_power_w).toBeUndefined();
+    expect(entry.decode_avg_power_w).toBeUndefined();
+    expect(entry.joules_per_input_token).toBeUndefined();
+    expect(entry.workers).toBeUndefined();
+  });
+
+  it('passes through cluster-wide temp/util/mem scalars when present', () => {
+    const entry = rowToAggDataEntry(
+      makeRow({
+        metrics: {
+          tput_per_gpu: 100,
+          avg_temp_c: 68.4,
+          peak_temp_c: 79.2,
+          avg_util_pct: 88.5,
+          avg_mem_used_mb: 71234.5,
+        },
+      }),
+    );
+    expect(entry.avg_temp_c).toBe(68.4);
+    expect(entry.peak_temp_c).toBe(79.2);
+    expect(entry.avg_util_pct).toBe(88.5);
+    expect(entry.avg_mem_used_mb).toBe(71234.5);
+  });
+
+  it('leaves cluster-wide temp/util/mem fields undefined when absent (legacy rows)', () => {
+    // Same undefined-vs-zero distinction as the measured-power scalars —
+    // historic rows predate the perfmon CSV scrape, so missing values must
+    // not be silently coerced to 0.
+    const entry = rowToAggDataEntry(makeRow({ metrics: {} }));
+    expect(entry.avg_temp_c).toBeUndefined();
+    expect(entry.peak_temp_c).toBeUndefined();
+    expect(entry.avg_util_pct).toBeUndefined();
+    expect(entry.avg_mem_used_mb).toBeUndefined();
+  });
+
+  it('preserves new optional WorkerPower fields (hosts, telemetry) on workers entries', () => {
+    const workers = [
+      {
+        role: 'prefill' as const,
+        worker_idx: 0,
+        hosts: ['pn0'],
+        num_gpus: 4,
+        avg_power_w: 612.3,
+        avg_temp_c: 71.2,
+        peak_temp_c: 78,
+        avg_util_pct: 92.1,
+        avg_mem_used_mb: 65432,
+      },
+      {
+        role: 'decode' as const,
+        worker_idx: 0,
+        hosts: ['dn0', 'dn1', 'dn2', 'dn3'],
+        num_gpus: 16,
+        avg_power_w: 712.1,
+      },
+    ];
+    const entry = rowToAggDataEntry(makeRow({ workers }));
+    expect(entry.workers).toEqual(workers);
+    expect(entry.workers![0].hosts).toEqual(['pn0']);
+    expect(entry.workers![0].avg_temp_c).toBe(71.2);
+    expect(entry.workers![1].hosts).toEqual(['dn0', 'dn1', 'dn2', 'dn3']);
+    // Optional telemetry fields stay undefined when source omits them.
+    expect(entry.workers![1].avg_temp_c).toBeUndefined();
+  });
 });
 
 describe('transformBenchmarkRows', () => {
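
Several of the tests above assert `toBeUndefined()` rather than `0` so that the chart layer can tell "no measurement" from a real zero. A small sketch of why that matters at the display end follows; `formatMetric` is a hypothetical helper and is not part of this commit.

```typescript
// Hypothetical display helper: renders a missing measurement differently from
// a measured zero. Coercing with `value ?? 0` upstream would erase that
// difference and show "0.0 W" for runs that never recorded power.
function formatMetric(value: number | undefined, unit: string): string {
  if (value === undefined) return 'n/a';
  return `${value.toFixed(1)} ${unit}`;
}

// A frontend worker with zero GPUs legitimately reports 0 W...
const measuredZero = formatMetric(0, 'W');
// ...while a legacy row has no measurement at all.
const notMeasured = formatMetric(undefined, 'W');
```

With the transform passing `undefined` through untouched, this kind of decision stays in the presentation code instead of being fixed at ingest time.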

packages/app/src/lib/benchmark-transform.ts

Lines changed: 17 additions & 0 deletions
@@ -55,6 +55,23 @@ export function rowToAggDataEntry(row: BenchmarkRow): AggDataEntry {
     avg_power_w: m.avg_power_w,
     joules_per_output_token: m.joules_per_output_token,
     joules_per_total_token: m.joules_per_total_token,
+    // Multinode / disagg-only role splits — same undefined-for-legacy pattern.
+    // (disagg's decode-only J/output is carried by joules_per_output_token above,
+    // which the runner overrides to the per-stage value — no separate _decode key.)
+    prefill_avg_power_w: m.prefill_avg_power_w,
+    decode_avg_power_w: m.decode_avg_power_w,
+    joules_per_input_token: m.joules_per_input_token,
+    // Cluster-wide GPU telemetry beyond power. Emitted when the perfmon CSVs
+    // include the corresponding sample columns; left undefined otherwise so
+    // the chart layer can distinguish "no measurement" from a real zero.
+    avg_temp_c: m.avg_temp_c,
+    peak_temp_c: m.peak_temp_c,
+    avg_util_pct: m.avg_util_pct,
+    avg_mem_used_mb: m.avg_mem_used_mb,
+    // Per-worker measured power. Surfaced on BenchmarkRow as a sibling of the
+    // scalar `metrics` dict (see api.ts). Narrow defensively so a malformed
+    // payload can't poison downstream consumers.
+    workers: Array.isArray(row.workers) ? row.workers : undefined,
     disagg: row.disagg,
     num_prefill_gpu: row.num_prefill_gpu,
     num_decode_gpu: row.num_decode_gpu,
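
The `Array.isArray` guard in this hunk only checks the outer container; the entries themselves are passed through unchecked. If per-entry validation were ever wanted, one possible shape is sketched below. `isWorkerPower` and `sanitizeWorkers` are hypothetical names, this is not what the commit does, and the set of required fields is taken from the non-optional members of `WorkerPower`.

```typescript
// Trimmed copy of the WorkerPower shape (optional telemetry fields omitted).
interface WorkerPower {
  role: string;
  worker_idx: number;
  hosts?: string[];
  num_gpus: number;
  avg_power_w: number;
}

// Hypothetical per-entry guard: checks only the required fields.
function isWorkerPower(value: unknown): value is WorkerPower {
  if (typeof value !== 'object' || value === null) return false;
  const r = value as Record<string, unknown>;
  return (
    typeof r.role === 'string' &&
    typeof r.worker_idx === 'number' &&
    typeof r.num_gpus === 'number' &&
    typeof r.avg_power_w === 'number'
  );
}

// Hypothetical stricter variant of the guard in this hunk: a non-array payload
// becomes undefined (same as the commit), and malformed entries are dropped.
function sanitizeWorkers(raw: unknown): WorkerPower[] | undefined {
  if (!Array.isArray(raw)) return undefined;
  return raw.filter(isWorkerPower);
}
```

The trade-off is that dropping entries silently can hide a runner regression, so a real implementation might prefer to log or reject the whole array instead.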

packages/constants/src/metric-keys.ts

Lines changed: 28 additions & 3 deletions
@@ -45,10 +45,35 @@ export const METRIC_KEYS = new Set([
   'std_intvty',
   // measured power / energy (emitted by runner's aggregate_power.py)
   // avg_power_w: mean per-GPU draw (W) during the load window
-  // joules_per_output_token: avg_power_w * num_gpus * duration / total_output_tokens
-  // joules_per_total_token: avg_power_w * num_gpus * duration / (total_input + total_output)
-  // — workload-shape-fair view that doesn't treat prompt as free
+  // joules_per_output_token: energy / total_output_tokens. CLUSTER-WIDE on
+  // single-node / non-disagg (total_system_energy);
+  // PER-STAGE decode_energy on disagg (decode GPUs only),
+  // symmetric with joules_per_input_token below.
+  // joules_per_total_token: total_system_energy / (total_input + total_output)
+  // — cluster-wide; workload-shape-fair view that
+  // doesn't treat prompt as free.
   'avg_power_w',
   'joules_per_output_token',
   'joules_per_total_token',
+  // multinode / disagg role splits (emitted only when the deployment has
+  // distinct prefill / decode workers)
+  // prefill_avg_power_w / decode_avg_power_w: mean per-GPU draw within each role
+  // joules_per_input_token: prefill_energy / total_input_tokens (prefill GPUs only).
+  // The disagg output counterpart is joules_per_output_token above (decode GPUs
+  // only) — there is no separate _decode key.
+  'prefill_avg_power_w',
+  'decode_avg_power_w',
+  'joules_per_input_token',
+  // cluster-wide GPU telemetry beyond power (emitted by aggregate_power.py when
+  // the perfmon CSVs include temperature, utilization, or memory samples).
+  // avg_temp_c: mean per-GPU temperature (Celsius) during load window
+  // peak_temp_c: max instantaneous per-GPU temperature in window
+  // avg_util_pct: mean per-GPU GPU-utilization percent (0-100)
+  // avg_mem_used_mb: mean per-GPU memory used (MiB / MB)
+  // Single-node and multinode runs both surface these as flat scalars; the
+  // per-worker breakdown carries the same fields on each entry in workers[].
+  'avg_temp_c',
+  'peak_temp_c',
+  'avg_util_pct',
+  'avg_mem_used_mb',
 ]);
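
The comments in this hunk define the joules-per-token metrics as an energy divided by a token count. The worked sketch below follows the removed comment's formula (`avg_power_w * num_gpus * duration / tokens`) and applies it per stage for the disagg case. The runner's actual computation lives in aggregate_power.py and is not shown in this commit, so the `joulesPerToken` helper, its parameter names, and the assumption that per-stage energy equals role power times role GPU count times the load-window duration are all illustrative.

```typescript
interface EnergyInputs {
  avgPowerW: number; // mean per-GPU draw (W) for the GPUs being counted
  numGpus: number;   // GPUs being counted (all GPUs, or one role's GPUs)
  durationS: number; // load-window duration in seconds (assumed shared by all roles)
  tokens: number;    // token count in the denominator
}

// Hypothetical helper: energy (J) = W * GPUs * s, divided by tokens.
// Returns undefined rather than 0 or Infinity when there is nothing to divide by.
function joulesPerToken({ avgPowerW, numGpus, durationS, tokens }: EnergyInputs): number | undefined {
  if (tokens <= 0) return undefined;
  return (avgPowerW * numGpus * durationS) / tokens;
}

// Disagg example with made-up numbers:
// decode stage: 700 W * 8 GPUs * 100 s = 560,000 J over 56,000 output tokens.
const perOutputToken = joulesPerToken({ avgPowerW: 700, numGpus: 8, durationS: 100, tokens: 56_000 });
// prefill stage: 600 W * 4 GPUs * 100 s = 240,000 J over 480,000 input tokens.
const perInputToken = joulesPerToken({ avgPowerW: 600, numGpus: 4, durationS: 100, tokens: 480_000 });
```

Under these assumptions the decode stage costs 10 J per output token and the prefill stage 0.5 J per input token, which illustrates why the per-stage split is described as symmetric: each stage's energy is charged only to the tokens that stage produced or consumed.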
