You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
|`AIPERF_GPU_DEFAULT_DCGM_ENDPOINTS`|`['http://localhost:9400/metrics', 'http://localhost:9401/metrics']`| — | Default DCGM endpoint URLs to check for GPU telemetry (comma-separated string or JSON array) |
84
84
|`AIPERF_GPU_EXPORT_BATCH_SIZE`|`100`| ≥ 1, ≤ 1000000 | Batch size for telemetry record export results processor |
85
+
|`AIPERF_GPU_FINAL_SCRAPE_GRACE_NS`|`666000000`| ≥ 0, ≤ 60000000000 | Grace window in nanoseconds appended to phase end_ns when computing the GPU energy-counter delta. Energy is scraped on a cadence (see COLLECTION_INTERVAL), so the trailing scrape often lands after the phase ends; this grace lets it be included while bounding the window so cooldown/idle samples and subsequent-phase samples don't leak into the delta. Default 666_000_000 ns ~= 2x the default 333 ms COLLECTION_INTERVAL; raise this if you also raise COLLECTION_INTERVAL. |
85
86
|`AIPERF_GPU_REACHABILITY_TIMEOUT`|`10`| ≥ 1, ≤ 300 | Timeout in seconds for checking GPU telemetry endpoint reachability during init |
86
87
|`AIPERF_GPU_SHUTDOWN_DELAY`|`5.0`| ≥ 1.0, ≤ 300.0 | Delay in seconds before shutting down GPU telemetry service to allow command response transmission |
87
88
|`AIPERF_GPU_THREAD_JOIN_TIMEOUT`|`5.0`| ≥ 1.0, ≤ 300.0 | Timeout in seconds for joining GPU telemetry collection threads during shutdown |
> All metrics in this section require `--gpu-telemetry` to be enabled and the underlying collector (DCGM, pynvml, or amdsmi) to expose the relevant signal (`gpu_power_usage` and/or `energy_consumption`). They are computed once per profiling phase by `GPUTelemetryAccumulator.compute_efficiency_metrics`, not by the standard derivation walk — see the [Externally-Injected Derived Metric pattern](dev/patterns.md#externally-injected-derived-metric-pattern).
1743
+
1744
+
Each metric's header surfaces the number of GPUs that contributed valid data (e.g. `Total GPU Power (8 GPUs)`), so a partial-cohort run (where one or more GPUs failed to report) is distinguishable from a full run. Tags are emitted in this order when present: `total_gpu_power`, `total_gpu_energy`, `output_tokens_per_joule`, `energy_per_user`. Each tag is independently omitted when its underlying signal is unavailable.
Sum of energy consumed across all reporting GPUs during the profiling phase, in joules. Computed as a counter delta (`final − baseline`) per GPU and summed.
1775
+
1776
+
**Formula:**
1777
+
```python
1778
+
# Per GPU: delta of the energy_consumption monotonic counter over the
1779
+
# profiling window, widened on the end by FINAL_SCRAPE_GRACE_NS so the
1780
+
# trailing scrape that lands just after requests_end_ns is captured.
# Negative deltas are clamped to 0 to handle counter resets (DCGM restart).
1787
+
```
1788
+
1789
+
**Notes:**
1790
+
- Unit: joules (`J`). Source samples are reported in megajoules and converted via `EnergyMetricUnit.MEGAJOULE.joules`.
1791
+
- The end-of-window grace is bounded (not open-ended) so cooldown samples and any subsequent-phase samples cannot leak into the delta. Tune via `AIPERF_GPU_FINAL_SCRAPE_GRACE_NS` if you also tune `AIPERF_GPU_COLLECTION_INTERVAL` — keep grace at roughly `2x` the collection cadence.
1792
+
- Per-GPU deltas use the nearest non-NaN baseline and the nearest non-NaN final sample; arrays containing transient NaN sensor failures still yield a meaningful delta.
1793
+
- Omitted when no GPU reports `energy_consumption` in the window.
- Numerator comes from the request records (`total_output_tokens`); denominator comes from the GPU telemetry counter delta above. The header reports the energy-side GPU count, since that's the cohort the metric depends on.
1812
+
- Omitted when `total_output_tokens` is absent from the records or aggregate `total_gpu_energy` is zero.
Per-user energy footprint during the profiling phase: total GPU energy consumed divided by the configured concurrency. Lower is better — a more efficient deployment serves the same load for less energy per concurrent user.
1821
+
1822
+
**Formula:**
1823
+
```python
1824
+
# concurrency from the resolved profiling phase config
- Flagged `MetricFlags.NONE` — smaller-is-better is the default for unflagged metrics.
1832
+
- Denominator is the profiling phase's configured `concurrency`. The resolver defaults this to `1` when `--concurrency` isn't specified in concurrency-mode runs, so the metric is emitted in the common case.
1833
+
- Header reports the energy-side GPU count (the same cohort `total_gpu_energy` reports), e.g. `Energy per User (8 GPUs)`.
1834
+
- Omitted when concurrency is unset (e.g. pure `--request-rate` mode) or aggregate GPU energy is unavailable.
0 commit comments