Commit a90d154

FrankD412 and claude authored
feat: Initial implementation of power metrics in aiperf. (#803)
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent abf5d23 commit a90d154

23 files changed

Lines changed: 4209 additions & 14 deletions

docs/dev/patterns.md

Lines changed: 96 additions & 0 deletions
@@ -392,6 +392,102 @@ value" fallback that both call sites (`_converter_endpoint` and
 helper still works (returns `None`); the caller is responsible for
 raising instead of substituting a literal.

+## Externally-Injected Derived Metric Pattern
+
+A normal `BaseDerivedMetric` computes its value from peer metrics in the
+`MetricResultsDict` via `_derive_value`. Some derived metrics, however, are
+computed from data that never lives in the `MetricResultsDict` at all:
+GPU power and energy come from telemetry scrapes, not from request
+records, so their values must be injected by the accumulator that owns
+the sensor data rather than derived by the standard registry walk.
+
+Reference file: [`src/aiperf/metrics/types/power_efficiency_metrics.py`](../../src/aiperf/metrics/types/power_efficiency_metrics.py).
+Injection site: `GPUTelemetryAccumulator.compute_efficiency_metrics`
+([`src/aiperf/gpu_telemetry/accumulator.py`](../../src/aiperf/gpu_telemetry/accumulator.py)).
+
+### The three-part contract
+
+A metric class that participates in registry listings but is computed
+externally must spell out the contract in three places so future agents
+don't copy-paste the shape as the canonical derived-metric pattern.
+
+**1. `Invariant:` paragraph in the class docstring.** Name the injection
+site and the catching path explicitly:
+
+```python
+class TotalGpuEnergyMetric(BaseDerivedMetric[float]):
+    """Sum of GPU energy consumed across all GPUs during the benchmark phase, in joules.
+
+    Invariant: externally injected by
+    `GPUTelemetryAccumulator.compute_efficiency_metrics` from
+    energy_consumption counter deltas. `_derive_value` is intentionally
+    non-functional; `MetricResultsProcessor.update_derived_metrics` is
+    expected to catch NoMetricValue and skip the tag during its
+    derivation walk.
+    """
+```
+
+**2. `_derive_value` returns `NoReturn`.** The body unconditionally
+raises, so the truthful annotation is `NoReturn` from `typing`. Returning
+`float` would lie to type-checkers and downstream code that assumes the
+derivation succeeded.
+
+```python
+from typing import NoReturn
+from aiperf.common.exceptions import NoMetricValue
+from aiperf.metrics.metric_dicts import MetricResultsDict
+
+def _derive_value(self, metric_results: MetricResultsDict) -> NoReturn:
+    raise NoMetricValue(
+        "Cannot derive 'total_gpu_energy' from MetricResultsDict: this metric "
+        "is externally injected by "
+        "GPUTelemetryAccumulator.compute_efficiency_metrics. If this exception "
+        "surfaces, the derivation walk is missing its NoMetricValue handler "
+        "(see MetricResultsProcessor.update_derived_metrics)."
+    )
+```
+
+**3. Error message names the operation, the injection site, and the catching
+path.** A message that only names the source ("X is computed by the GPU
+telemetry accumulator") gives debugging agents no clue where the contract
+is enforced. The recommended shape is:
+
+- *Operation*: what derivation was attempted (`Cannot derive 'X' from
+  MetricResultsDict`).
+- *Injection site*: which method is the source of truth
+  (`GPUTelemetryAccumulator.compute_efficiency_metrics`).
+- *Catching path*: where the exception is expected to be absorbed
+  (`MetricResultsProcessor.update_derived_metrics`). If this fires in
+  production, the catching path has a bug.
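
Illustrative sketch (not part of the commit): one way to keep the three parts of the message together is to build it from its parts. The helper name `injected_metric_error_message` is hypothetical and does not appear in the diff; the wording simply mirrors the `total_gpu_energy` message shown above.

```python
def injected_metric_error_message(
    tag: str, injection_site: str, catching_path: str
) -> str:
    """Hypothetical helper: compose operation, injection site, and catching path."""
    return (
        # Operation: what derivation was attempted.
        f"Cannot derive '{tag}' from MetricResultsDict: this metric "
        # Injection site: which method is the source of truth.
        f"is externally injected by {injection_site}. If this exception "
        # Catching path: where the exception is expected to be absorbed.
        "surfaces, the derivation walk is missing its NoMetricValue handler "
        f"(see {catching_path})."
    )


message = injected_metric_error_message(
    tag="total_gpu_energy",
    injection_site="GPUTelemetryAccumulator.compute_efficiency_metrics",
    catching_path="MetricResultsProcessor.update_derived_metrics",
)
```

Building the message from named parts makes it harder to drop one of the three pieces when a new externally injected metric is added, although the commit itself appears to write each message inline.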
+
+### Why not just skip the class entirely?
+
+The class is still required because the rest of the system reads class
+attributes (`tag`, `header`, `unit`, `display_order`, `flags`) when
+emitting `MetricResult`s, ordering the console table, and gating display
+behavior. The registry entry is structural metadata; the *value* is the
+external injection.
+
+### Where the injection happens
+
+`RecordsManager._apply_gpu_efficiency_metrics` calls
+`GPUTelemetryAccumulator.compute_efficiency_metrics`, which constructs
+`MetricResult` objects directly with the relevant tags and appends them
+to the records list before `ProcessRecordsResult` is built. The standard
+`update_derived_metrics` walk sees these tags too, raises `NoMetricValue`
+via `_derive_value`, catches it, and skips, so the externally-injected
+values are not overwritten.
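
Illustrative sketch (not part of the commit): the skip behavior described above can be modeled with small stand-in classes. Everything here is a simplification under stated assumptions: `NoMetricValue`, the two metric classes, and `update_derived_metrics` are local stand-ins rather than the aiperf implementations, and the injected value is placed in the same dict as the derived ones only to show that the walk leaves it untouched.

```python
from typing import NoReturn


class NoMetricValue(Exception):
    """Local stand-in for aiperf.common.exceptions.NoMetricValue."""


class InjectedEnergyMetric:
    """Stand-in for an externally injected metric: derivation always raises."""

    tag = "total_gpu_energy"

    def derive_value(self, results: dict) -> NoReturn:
        raise NoMetricValue(f"Cannot derive '{self.tag}' from MetricResultsDict")


class GoodRequestRatioMetric:
    """Stand-in for an ordinary derived metric computed from peer metrics."""

    tag = "good_request_ratio"

    def derive_value(self, results: dict) -> float:
        return results["good_requests"] / results["requests"]


def update_derived_metrics(results: dict, metrics: list) -> list[str]:
    """Hypothetical walk: derive each metric, skipping tags that raise NoMetricValue."""
    skipped = []
    for metric in metrics:
        try:
            results[metric.tag] = metric.derive_value(results)
        except NoMetricValue:
            # The walk absorbs the exception and leaves any existing value alone.
            skipped.append(metric.tag)
    return skipped


# The energy value is assumed to have been injected before the walk runs.
results = {"requests": 10, "good_requests": 8, "total_gpu_energy": 1234.5}
skipped = update_derived_metrics(
    results, [InjectedEnergyMetric(), GoodRequestRatioMetric()]
)
```

After the walk, `skipped` lists only the injected tag, the ordinary metric has been derived, and the injected energy value is unchanged.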
+
+### Test contract
+
+The error-message invariants are pinned by
+[`tests/unit/metrics/test_power_efficiency_metrics.py`](../../tests/unit/metrics/test_power_efficiency_metrics.py)
+(parametrized over the three classes): every `_derive_value` call must
+raise `NoMetricValue` with a message that names the tag, the operation
+source (`MetricResultsDict`), and the injection site
+(`compute_efficiency_metrics`). A future weakening of any message fails
+the test rather than silently drifting.
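
Illustrative sketch (not part of the commit): the kind of check the test contract describes can be expressed as a small fragment scan. The function `missing_fragments` is a hypothetical name; the actual test file is not shown in this diff, so this only mirrors the three requirements listed above (tag, `MetricResultsDict`, `compute_efficiency_metrics`).

```python
REQUIRED_SOURCE = "MetricResultsDict"
REQUIRED_SITE = "compute_efficiency_metrics"


def missing_fragments(tag: str, message: str) -> list[str]:
    """Return the required fragments that the error message fails to mention."""
    required = [tag, REQUIRED_SOURCE, REQUIRED_SITE]
    return [fragment for fragment in required if fragment not in message]


strong_message = (
    "Cannot derive 'total_gpu_energy' from MetricResultsDict: this metric "
    "is externally injected by "
    "GPUTelemetryAccumulator.compute_efficiency_metrics."
)
# A message that only names the source, as warned against in part 3.
weak_message = "total_gpu_energy is computed by the GPU telemetry accumulator"

strong_missing = missing_fragments("total_gpu_energy", strong_message)
weak_missing = missing_fragments("total_gpu_energy", weak_message)
```

A test built this way fails with the list of missing fragments, which points directly at the part of the message that was weakened.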
+
 ## Testing Pattern

 ```python

docs/environment-variables.md

Lines changed: 1 addition & 0 deletions
@@ -82,6 +82,7 @@ GPU telemetry collection configuration. Controls GPU metrics collection frequenc
 | `AIPERF_GPU_COLLECTION_INTERVAL` | `0.333` | ≥ 0.01, ≤ 300.0 | GPU telemetry metrics collection interval in seconds (default: 333ms, ~3Hz) |
 | `AIPERF_GPU_DEFAULT_DCGM_ENDPOINTS` | `['http://localhost:9400/metrics', 'http://localhost:9401/metrics']` || Default DCGM endpoint URLs to check for GPU telemetry (comma-separated string or JSON array) |
 | `AIPERF_GPU_EXPORT_BATCH_SIZE` | `100` | ≥ 1, ≤ 1000000 | Batch size for telemetry record export results processor |
+| `AIPERF_GPU_FINAL_SCRAPE_GRACE_NS` | `666000000` | ≥ 0, ≤ 60000000000 | Grace window in nanoseconds appended to phase end_ns when computing the GPU energy-counter delta. Energy is scraped on a cadence (see COLLECTION_INTERVAL), so the trailing scrape often lands after the phase ends; this grace lets it be included while bounding the window so cooldown/idle samples and subsequent-phase samples don't leak into the delta. Default 666_000_000 ns ~= 2x the default 333 ms COLLECTION_INTERVAL; raise this if you also raise COLLECTION_INTERVAL. |
 | `AIPERF_GPU_REACHABILITY_TIMEOUT` | `10` | ≥ 1, ≤ 300 | Timeout in seconds for checking GPU telemetry endpoint reachability during init |
 | `AIPERF_GPU_SHUTDOWN_DELAY` | `5.0` | ≥ 1.0, ≤ 300.0 | Delay in seconds before shutting down GPU telemetry service to allow command response transmission |
 | `AIPERF_GPU_THREAD_JOIN_TIMEOUT` | `5.0` | ≥ 1.0, ≤ 300.0 | Timeout in seconds for joining GPU telemetry collection threads during shutdown |
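
Illustrative sketch (not part of the commit): the "about 2x the collection interval" guidance in the new row can be turned into arithmetic. The function `recommended_grace_ns` is a hypothetical helper; only the 2x ratio and the `0` to `60_000_000_000` bounds come from the row above.

```python
GRACE_MIN_NS = 0
GRACE_MAX_NS = 60_000_000_000  # upper bound listed for AIPERF_GPU_FINAL_SCRAPE_GRACE_NS


def recommended_grace_ns(collection_interval_s: float, multiplier: float = 2.0) -> int:
    """Hypothetical helper: grace window of `multiplier` scrape intervals, in ns."""
    grace_ns = round(collection_interval_s * multiplier * 1_000_000_000)
    # Clamp to the documented bounds so the value is accepted by the setting.
    return min(max(grace_ns, GRACE_MIN_NS), GRACE_MAX_NS)


default_grace = recommended_grace_ns(0.333)  # the default collection interval
slow_scrape_grace = recommended_grace_ns(1.0)
```

With the default 0.333 s interval this reproduces the default of 666_000_000 ns; a 1 s interval would suggest 2_000_000_000 ns.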

docs/metrics-reference.md

Lines changed: 104 additions & 0 deletions
@@ -112,6 +112,11 @@ This document provides a comprehensive reference of all metrics available in AIP
   - [HTTP Connection Reused](#http-connection-reused)
   - [HTTP Chunks Sent](#http-chunks-sent)
   - [HTTP Chunks Received](#http-chunks-received)
+- [GPU Power Efficiency Metrics](#gpu-power-efficiency-metrics)
+  - [Total GPU Power](#total-gpu-power)
+  - [Total GPU Energy](#total-gpu-energy)
+  - [Output Tokens per Joule](#output-tokens-per-joule)
+  - [Energy per User](#energy-per-user)
 - [Metric Flags Reference](#metric-flags-reference)

 ---
@@ -1731,6 +1736,105 @@ http_req_chunks_received = trace.response_chunks_count

 ---

+## GPU Power Efficiency Metrics
+
+> [!NOTE]
+> All metrics in this section require `--gpu-telemetry` to be enabled and the underlying collector (DCGM, pynvml, or amdsmi) to expose the relevant signal (`gpu_power_usage` and/or `energy_consumption`). They are computed once per profiling phase by `GPUTelemetryAccumulator.compute_efficiency_metrics`, not by the standard derivation walk; see the [Externally-Injected Derived Metric pattern](dev/patterns.md#externally-injected-derived-metric-pattern).
+
+Each metric's header surfaces the number of GPUs that contributed valid data (e.g. `Total GPU Power (8 GPUs)`), so a partial-cohort run (where one or more GPUs failed to report) is distinguishable from a full run. Tags are emitted in this order when present: `total_gpu_power`, `total_gpu_energy`, `output_tokens_per_joule`, `energy_per_user`. Each tag is independently omitted when its underlying signal is unavailable.
+
+### Total GPU Power
+
+**Type:** [Derived Metric](#derived-metrics) (externally injected)
+
+Sum of average GPU power across all reporting GPUs during the profiling phase, in watts. Useful as a baseline for cross-run power comparisons.
+
+**Formula:**
+```python
+# Per GPU: average of gpu_power_usage gauge samples in the profiling window
+# (warmup excluded). Summed across all GPUs that reported valid samples.
+total_gpu_power_w = sum(
+    avg(gpu_power_usage[start_ns:end_ns])
+    for gpu in reporting_gpus
+)
+```
+
+**Notes:**
+- Unit: watts (`W`).
+- Time-filtered to the profiling-phase window; warmup samples are excluded.
+- Power is a gauge, so the window stays bounded: post-bench idle samples don't drag the average down.
+- Omitted when no GPU reports `gpu_power_usage` in the window.
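
Illustrative sketch (not part of the commit): a runnable reading of the formula above. The data layout (a dict of per-GPU `(timestamp_ns, watts)` lists), the inclusive window bounds, and the NaN filtering are assumptions made for the example; the accumulator's actual storage and edge handling are not shown in this diff.

```python
import math


def total_gpu_power_w(samples_by_gpu, start_ns, end_ns):
    """Sum of per-GPU average power in the window, plus the contributing GPU count.

    Returns None when no GPU has a valid sample in the window.
    """
    per_gpu_averages = []
    for samples in samples_by_gpu.values():
        in_window = [
            watts
            for ts, watts in samples
            if start_ns <= ts <= end_ns and not math.isnan(watts)
        ]
        if in_window:  # GPUs without valid samples do not contribute
            per_gpu_averages.append(sum(in_window) / len(in_window))
    if not per_gpu_averages:
        return None
    return sum(per_gpu_averages), len(per_gpu_averages)


samples_by_gpu = {
    "gpu0": [(50, 100.0), (100, 300.0), (200, 320.0)],   # first sample is warmup
    "gpu1": [(100, 280.0), (200, 300.0), (400, 50.0)],   # last sample is post-bench idle
    "gpu2": [(50, 10.0)],                                # nothing inside the window
}
result = total_gpu_power_w(samples_by_gpu, start_ns=100, end_ns=200)
```

Here `gpu0` averages 310 W and `gpu1` averages 290 W inside the window, so the total is 600 W from 2 GPUs; the GPU count is what a header such as `Total GPU Power (2 GPUs)` would surface.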
+
+---
+
+### Total GPU Energy
+
+**Type:** [Derived Metric](#derived-metrics) (externally injected)
+
+Sum of energy consumed across all reporting GPUs during the profiling phase, in joules. Computed as a counter delta (`final − baseline`) per GPU and summed.
+
+**Formula:**
+```python
+# Per GPU: delta of the energy_consumption monotonic counter over the
+# profiling window, widened on the end by FINAL_SCRAPE_GRACE_NS so the
+# trailing scrape that lands just after requests_end_ns is captured.
+grace_ns = Environment.GPU.FINAL_SCRAPE_GRACE_NS  # default 666_000_000 (~666 ms)
+total_gpu_energy_j = sum(
+    delta(energy_consumption[start_ns : end_ns + grace_ns])
+    for gpu in reporting_gpus
+)
+# Negative deltas are clamped to 0 to handle counter resets (DCGM restart).
+```
+
+**Notes:**
+- Unit: joules (`J`). Source samples are reported in megajoules and converted via `EnergyMetricUnit.MEGAJOULE.joules`.
+- The end-of-window grace is bounded (not open-ended) so cooldown samples and any subsequent-phase samples cannot leak into the delta. Tune via `AIPERF_GPU_FINAL_SCRAPE_GRACE_NS` if you also tune `AIPERF_GPU_COLLECTION_INTERVAL`; keep grace at roughly `2x` the collection cadence.
+- Per-GPU deltas use the nearest non-NaN baseline and the nearest non-NaN final sample; arrays containing transient NaN sensor failures still yield a meaningful delta.
+- Omitted when no GPU reports `energy_consumption` in the window.
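
Illustrative sketch (not part of the commit): a runnable reading of the counter-delta formula, including the bounded grace window, NaN skipping, and the clamp for counter resets. The sample layout, the choice of the first and last valid in-window samples as baseline and final, and the requirement of at least two valid samples are assumptions for the example rather than the accumulator's confirmed behavior.

```python
import math

MEGAJOULES_TO_JOULES = 1_000_000.0  # source samples are described as megajoules


def gpu_energy_delta_j(samples, start_ns, end_ns, grace_ns):
    """Counter delta in joules for one GPU, or None without two valid samples.

    `samples` is a time-ordered list of (timestamp_ns, megajoules) pairs.
    """
    window_end_ns = end_ns + grace_ns  # bounded grace, not open-ended
    valid = [
        mj
        for ts, mj in samples
        if start_ns <= ts <= window_end_ns and not math.isnan(mj)
    ]
    if len(valid) < 2:
        return None
    delta_mj = valid[-1] - valid[0]  # final minus baseline, NaN samples skipped
    return max(delta_mj, 0.0) * MEGAJOULES_TO_JOULES  # clamp counter resets to 0


def total_gpu_energy_j(samples_by_gpu, start_ns, end_ns, grace_ns):
    """Sum of per-GPU deltas; None when no GPU yields a delta."""
    deltas = [
        gpu_energy_delta_j(samples, start_ns, end_ns, grace_ns)
        for samples in samples_by_gpu.values()
    ]
    reported = [delta for delta in deltas if delta is not None]
    return sum(reported) if reported else None


samples_by_gpu = {
    # (350, 2.5) lands after end_ns but inside the grace; (500, 3.0) is excluded.
    "gpu0": [(100, 2.0), (200, float("nan")), (300, 2.25), (350, 2.5), (500, 3.0)],
    # Counter reset mid-run: the negative delta is clamped to 0.
    "gpu1": [(100, 5.0), (300, 0.25)],
}
with_grace = total_gpu_energy_j(samples_by_gpu, start_ns=100, end_ns=300, grace_ns=66)
without_grace = total_gpu_energy_j(samples_by_gpu, start_ns=100, end_ns=300, grace_ns=0)
```

With the grace window the trailing scrape is captured and the total is 500 kJ; without it the delta stops at the last in-phase sample and reports 250 kJ, which is the undercount the grace setting is meant to avoid.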
+
+---
+
+### Output Tokens per Joule
+
+**Type:** [Derived Metric](#derived-metrics) (externally injected)
+
+Inference energy efficiency: number of output tokens produced per joule of GPU energy consumed during the profiling phase. Higher is better.
+
+**Formula:**
+```python
+output_tokens_per_joule = total_output_tokens / total_gpu_energy
+```
+
+**Notes:**
+- Unit: `tokens/J`.
+- Flagged `LARGER_IS_BETTER | PRODUCES_TOKENS_ONLY`.
+- Numerator comes from the request records (`total_output_tokens`); denominator comes from the GPU telemetry counter delta above. The header reports the energy-side GPU count, since that's the cohort the metric depends on.
+- Omitted when `total_output_tokens` is absent from the records or aggregate `total_gpu_energy` is zero.
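
Illustrative sketch (not part of the commit): the ratio with the two omission conditions from the notes made explicit. Returning `None` to represent "omitted" is an assumption for the example; the function name mirrors the tag but is not taken from the source.

```python
from typing import Optional


def output_tokens_per_joule(
    total_output_tokens: Optional[int], total_gpu_energy_j: Optional[float]
) -> Optional[float]:
    """Tokens per joule, or None when the metric would be omitted."""
    if total_output_tokens is None:
        return None  # numerator absent from the request records
    if not total_gpu_energy_j:
        return None  # energy missing or zero: avoid dividing by zero
    return total_output_tokens / total_gpu_energy_j


efficiency = output_tokens_per_joule(50_000, 25_000.0)
omitted = output_tokens_per_joule(50_000, 0.0)
```

For 50,000 output tokens and 25 kJ of GPU energy the result is 2 tokens/J.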
+
+---
+
+### Energy per User
+
+**Type:** [Derived Metric](#derived-metrics) (externally injected)
+
+Per-user energy footprint during the profiling phase: total GPU energy consumed divided by the configured concurrency. Lower is better: a more efficient deployment serves the same load for less energy per concurrent user.
+
+**Formula:**
+```python
+# concurrency from the resolved profiling phase config
+# (run.cfg.get_profiling_phases()[0].concurrency).
+energy_per_user_j = total_gpu_energy / concurrency
+```
+
+**Notes:**
+- Unit: `joules/user`.
+- Flagged `MetricFlags.NONE`; smaller-is-better is the default for unflagged metrics.
+- Denominator is the profiling phase's configured `concurrency`. The resolver defaults this to `1` when `--concurrency` isn't specified in concurrency-mode runs, so the metric is emitted in the common case.
+- Header reports the energy-side GPU count (the same cohort `total_gpu_energy` reports), e.g. `Energy per User (8 GPUs)`.
+- Omitted when concurrency is unset (e.g. pure `--request-rate` mode) or aggregate GPU energy is unavailable.
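
Illustrative sketch (not part of the commit): the division with the omission cases from the notes. Using `None` for an unset concurrency and for unavailable energy, and rejecting non-positive concurrency, are assumptions made for the example.

```python
from typing import Optional


def energy_per_user_j(
    total_gpu_energy_j: Optional[float], concurrency: Optional[int]
) -> Optional[float]:
    """Joules per concurrent user, or None when the metric would be omitted."""
    if total_gpu_energy_j is None:
        return None  # aggregate GPU energy unavailable
    if concurrency is None or concurrency <= 0:
        return None  # e.g. a pure request-rate run with no concurrency set
    return total_gpu_energy_j / concurrency


per_user = energy_per_user_j(25_000.0, 8)
request_rate_mode = energy_per_user_j(25_000.0, None)
```

With 25 kJ of GPU energy and a concurrency of 8, each concurrent user accounts for 3,125 J.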
+
+---
+
 ## Multi-Run Aggregate Metrics

 > [!NOTE]

src/aiperf/common/enums/metric_enums.py

Lines changed: 2 additions & 0 deletions
@@ -191,10 +191,12 @@ class GenericMetricUnit(BaseMetricUnit):
     ERRORS = _unit("errors")
     IMAGE = _unit("image")
     IMAGES = _unit("images")
+    JOULES_PER_USER = _unit("joules/user")
     PERCENT = _unit("%")
     RATIO = _unit("ratio")
     REQUESTS = _unit("requests")
     TOKENS = _unit("tokens")
+    TOKENS_PER_JOULE = _unit("tokens/J")
     USER = _unit("user")
     USERS = _unit("users")
     VIDEO = _unit("video")

src/aiperf/common/environment.py

Lines changed: 15 additions & 0 deletions
@@ -297,6 +297,21 @@ class _GPUSettings(BaseSettings):
         default=100,
         description="Batch size for telemetry record export results processor",
     )
+    FINAL_SCRAPE_GRACE_NS: int = Field(
+        ge=0,
+        le=60_000_000_000,
+        default=666_000_000,
+        description=(
+            "Grace window in nanoseconds appended to phase end_ns when computing "
+            "the GPU energy-counter delta. Energy is scraped on a cadence "
+            "(see COLLECTION_INTERVAL), so the trailing scrape often lands after "
+            "the phase ends; this grace lets it be included while bounding the "
+            "window so cooldown/idle samples and subsequent-phase samples don't "
+            "leak into the delta. Default 666_000_000 ns ~= 2x the default "
+            "333 ms COLLECTION_INTERVAL; raise this if you also raise "
+            "COLLECTION_INTERVAL."
+        ),
+    )
     REACHABILITY_TIMEOUT: int = Field(
         ge=1,
         le=300,
