ai-dynamo
diff --git a/docs/dev/patterns.md b/docs/dev/patterns.md
Lines changed: 96 additions & 0 deletions
diff --git a/docs/environment-variables.md b/docs/environment-variables.md
Lines changed: 1 addition & 0 deletions
diff --git a/docs/metrics-reference.md b/docs/metrics-reference.md
Lines changed: 104 additions & 0 deletions
diff --git a/src/aiperf/common/enums/metric_enums.py b/src/aiperf/common/enums/metric_enums.py
Lines changed: 2 additions & 0 deletions
diff --git a/src/aiperf/common/environment.py b/src/aiperf/common/environment.py
Lines changed: 15 additions & 0 deletions
@@ -392,6 +392,102 @@ value" fallback that both call sites (`_converter_endpoint` and
   helper still works (returns `None`); the caller is responsible for
   raising instead of substituting a literal.
 
+## Externally-Injected Derived Metric Pattern
+
+A normal `BaseDerivedMetric` computes its value from peer metrics in the
+`MetricResultsDict` via `_derive_value`. Some derived metrics, however, are
+computed from data that never lives in the `MetricResultsDict` at all.
+GPU power and energy come from telemetry scrapes, not from request
+records, so their values must be injected by the accumulator that owns
+the sensor data rather than derived by the standard registry walk.
+
+Reference file: [`src/aiperf/metrics/types/power_efficiency_metrics.py`](../../src/aiperf/metrics/types/power_efficiency_metrics.py).
+Injection site: `GPUTelemetryAccumulator.compute_efficiency_metrics`
+([`src/aiperf/gpu_telemetry/accumulator.py`](../../src/aiperf/gpu_telemetry/accumulator.py)).
+
+### The three-part contract
+
+A metric class that participates in registry listings but is computed
+externally must spell out the contract in three places so future agents
+don't copy-paste the shape as the canonical derived-metric pattern.
+
+**1. `Invariant:` paragraph in the class docstring.** Name the injection
+site and the catching path explicitly:
+
+```python
+class TotalGpuEnergyMetric(BaseDerivedMetric[float]):
+    """Sum of GPU energy consumed across all GPUs during the benchmark phase, in joules.
+
+    Invariant: externally injected by
+    `GPUTelemetryAccumulator.compute_efficiency_metrics` from
+    energy_consumption counter deltas. `_derive_value` is intentionally
+    non-functional; `MetricResultsProcessor.update_derived_metrics` is
+    expected to catch NoMetricValue and skip the tag during its
+    derivation walk.
+    """
+```
+
+**2. `_derive_value` is annotated `-> NoReturn`.** The body unconditionally
+raises, so the truthful annotation is `NoReturn` from `typing`. Annotating
+the return as `float` would mislead type-checkers and downstream code into
+assuming the derivation can succeed.
+
+```python
+from typing import NoReturn
+from aiperf.common.exceptions import NoMetricValue
+from aiperf.metrics.metric_dicts import MetricResultsDict
+
+def _derive_value(self, metric_results: MetricResultsDict) -> NoReturn:
+    raise NoMetricValue(
+        "Cannot derive 'total_gpu_energy' from MetricResultsDict: this metric "
+        "is externally injected by "
+        "GPUTelemetryAccumulator.compute_efficiency_metrics. If this exception "
+        "surfaces, the derivation walk is missing its NoMetricValue handler "
+        "(see MetricResultsProcessor.update_derived_metrics)."
+    )
+```
+
+**3. Error message names the operation, the injection site, and the catching
+path.** A message that only names the source ("X is computed by the GPU
+telemetry accumulator") gives debugging agents no clue where the contract
+is enforced. The recommended shape is:
+
+- *Operation*: what derivation was attempted (`Cannot derive 'X' from
+  MetricResultsDict`).
+- *Injection site*: which method is the source of truth
+  (`GPUTelemetryAccumulator.compute_efficiency_metrics`).
+- *Catching path*: where the exception is expected to be absorbed
+  (`MetricResultsProcessor.update_derived_metrics`). If this fires in
+  production, the catching path has a bug.
+
+### Why not just skip the class entirely?
+
+The class is still required because the rest of the system reads class
+attributes (`tag`, `header`, `unit`, `display_order`, `flags`) when
+emitting `MetricResult`s, ordering the console table, and gating display
+behavior. The registry entry is structural metadata; the *value* is the
+external injection.
+
+### Where the injection happens
+
+`RecordsManager._apply_gpu_efficiency_metrics` calls
+`GPUTelemetryAccumulator.compute_efficiency_metrics`, which constructs
+`MetricResult` objects directly with the relevant tags and appends them
+to the records list before `ProcessRecordsResult` is built. The standard
+`update_derived_metrics` walk also visits these tags: it calls
+`_derive_value`, which raises `NoMetricValue`; the walk catches the
+exception and skips the tag, so the externally-injected values are not
+overwritten.
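The skip behaviour described above can be sketched in isolation. This is a minimal illustration, not the actual `MetricResultsProcessor` code: the classes, the `derive` method, and the plain-dict results store are hypothetical stand-ins, and the injected value is modelled as a pre-existing dict entry purely for demonstration.

```python
from __future__ import annotations


class NoMetricValue(Exception):
    """Stand-in for aiperf.common.exceptions.NoMetricValue."""


class InjectedMetric:
    """Hypothetical externally-injected metric: derivation always raises."""

    tag = "total_gpu_energy"

    def derive(self, results: dict[str, float]) -> float:
        raise NoMetricValue(f"Cannot derive {self.tag!r} from results")


class SumMetric:
    """Hypothetical ordinary derived metric computed from peer values."""

    tag = "total_tokens"

    def derive(self, results: dict[str, float]) -> float:
        return results["input_tokens"] + results["output_tokens"]


def update_derived(results: dict[str, float], metrics: list) -> list[str]:
    """Walk the metrics, storing derived values and skipping NoMetricValue."""
    skipped = []
    for metric in metrics:
        try:
            results[metric.tag] = metric.derive(results)
        except NoMetricValue:
            # Nothing is written, so an injected value stays untouched.
            skipped.append(metric.tag)
    return skipped


results = {"input_tokens": 10.0, "output_tokens": 5.0, "total_gpu_energy": 1200.0}
skipped = update_derived(results, [SumMetric(), InjectedMetric()])
```

In this sketch the ordinary metric is derived and stored, while the injected tag is reported as skipped and keeps the value it already had.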
+
+### Test contract
+
+The error-message invariants are pinned by
+[`tests/unit/metrics/test_power_efficiency_metrics.py`](../../tests/unit/metrics/test_power_efficiency_metrics.py)
+(parametrized over the three classes): every `_derive_value` call must
+raise `NoMetricValue` with a message that names the tag, the operation
+source (`MetricResultsDict`), and the injection site
+(`compute_efficiency_metrics`). A future weakening of any message fails
+the test rather than silently drifting.
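A check of this kind might look roughly like the sketch below. It does not reproduce the real test file: `derive_total_gpu_energy` and `message_violations` are hypothetical names, and the exception class is a local stand-in so the snippet runs without AIPerf installed.

```python
from __future__ import annotations

from typing import Callable


class NoMetricValue(Exception):
    """Stand-in for aiperf.common.exceptions.NoMetricValue."""


def derive_total_gpu_energy() -> None:
    """Hypothetical stand-in for TotalGpuEnergyMetric._derive_value."""
    raise NoMetricValue(
        "Cannot derive 'total_gpu_energy' from MetricResultsDict: this metric "
        "is externally injected by "
        "GPUTelemetryAccumulator.compute_efficiency_metrics."
    )


def message_violations(derive: Callable[[], None], tag: str) -> list[str]:
    """Return the required fragments that are missing from the raised message."""
    required = [tag, "MetricResultsDict", "compute_efficiency_metrics"]
    try:
        derive()
    except NoMetricValue as exc:
        return [fragment for fragment in required if fragment not in str(exc)]
    return ["<no NoMetricValue raised>"]


missing = message_violations(derive_total_gpu_energy, "total_gpu_energy")
```

An empty `missing` list means the message still names the tag, the operation source, and the injection site; any dropped fragment shows up by name.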
+
 ## Testing Pattern
 
 ```python
 
@@ -82,6 +82,7 @@ GPU telemetry collection configuration. Controls GPU metrics collection frequenc
 | `AIPERF_GPU_COLLECTION_INTERVAL` | `0.333` | ≥ 0.01, ≤ 300.0 | GPU telemetry metrics collection interval in seconds (default: 333ms, ~3Hz) |
 | `AIPERF_GPU_DEFAULT_DCGM_ENDPOINTS` | `['http://localhost:9400/metrics', 'http://localhost:9401/metrics']` | — | Default DCGM endpoint URLs to check for GPU telemetry (comma-separated string or JSON array) |
 | `AIPERF_GPU_EXPORT_BATCH_SIZE` | `100` | ≥ 1, ≤ 1000000 | Batch size for telemetry record export results processor |
+| `AIPERF_GPU_FINAL_SCRAPE_GRACE_NS` | `666000000` | ≥ 0, ≤ 60000000000 | Grace window in nanoseconds appended to phase end_ns when computing the GPU energy-counter delta. Energy is scraped on a cadence (see COLLECTION_INTERVAL), so the trailing scrape often lands after the phase ends; this grace lets it be included while bounding the window so cooldown/idle samples and subsequent-phase samples don't leak into the delta. Default 666_000_000 ns ~= 2x the default 333 ms COLLECTION_INTERVAL; raise this if you also raise COLLECTION_INTERVAL. |
 | `AIPERF_GPU_REACHABILITY_TIMEOUT` | `10` | ≥ 1, ≤ 300 | Timeout in seconds for checking GPU telemetry endpoint reachability during init |
 | `AIPERF_GPU_SHUTDOWN_DELAY` | `5.0` | ≥ 1.0, ≤ 300.0 | Delay in seconds before shutting down GPU telemetry service to allow command response transmission |
 | `AIPERF_GPU_THREAD_JOIN_TIMEOUT` | `5.0` | ≥ 1.0, ≤ 300.0 | Timeout in seconds for joining GPU telemetry collection threads during shutdown |
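
As a rough aid for keeping `AIPERF_GPU_FINAL_SCRAPE_GRACE_NS` in step with `AIPERF_GPU_COLLECTION_INTERVAL`, the helper below applies the "about 2x the collection interval" rule from the description and clamps to the documented bounds. The function name and the `multiple` parameter are hypothetical; AIPerf does not necessarily ship such a helper.

```python
GRACE_MIN_NS = 0
GRACE_MAX_NS = 60_000_000_000  # documented upper bound (60 s)


def suggested_grace_ns(collection_interval_s: float, multiple: float = 2.0) -> int:
    """Convert a collection interval in seconds to a grace window in nanoseconds."""
    grace = round(collection_interval_s * multiple * 1_000_000_000)
    # Clamp to the range the setting accepts.
    return min(max(grace, GRACE_MIN_NS), GRACE_MAX_NS)


default_grace = suggested_grace_ns(0.333)
export_line = f"export AIPERF_GPU_FINAL_SCRAPE_GRACE_NS={suggested_grace_ns(1.0)}"
```

With the default 0.333 s interval this reproduces the default grace of 666_000_000 ns.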
 
@@ -112,6 +112,11 @@ This document provides a comprehensive reference of all metrics available in AIP
     - [HTTP Connection Reused](#http-connection-reused)
     - [HTTP Chunks Sent](#http-chunks-sent)
     - [HTTP Chunks Received](#http-chunks-received)
+  - [GPU Power Efficiency Metrics](#gpu-power-efficiency-metrics)
+    - [Total GPU Power](#total-gpu-power)
+    - [Total GPU Energy](#total-gpu-energy)
+    - [Output Tokens per Joule](#output-tokens-per-joule)
+    - [Energy per User](#energy-per-user)
 - [Metric Flags Reference](#metric-flags-reference)
 
 ---
@@ -1731,6 +1736,105 @@ http_req_chunks_received = trace.response_chunks_count
 
 ---
 
+## GPU Power Efficiency Metrics
+
+> [!NOTE]
+> All metrics in this section require `--gpu-telemetry` to be enabled and the underlying collector (DCGM, pynvml, or amdsmi) to expose the relevant signal (`gpu_power_usage` and/or `energy_consumption`). They are computed once per profiling phase by `GPUTelemetryAccumulator.compute_efficiency_metrics`, not by the standard derivation walk — see the [Externally-Injected Derived Metric pattern](dev/patterns.md#externally-injected-derived-metric-pattern).
+
+Each metric's header surfaces the number of GPUs that contributed valid data (e.g. `Total GPU Power (8 GPUs)`), so a partial-cohort run (where one or more GPUs failed to report) is distinguishable from a full run. Tags are emitted in this order when present: `total_gpu_power`, `total_gpu_energy`, `output_tokens_per_joule`, `energy_per_user`. Each tag is independently omitted when its underlying signal is unavailable.
+
+### Total GPU Power
+
+**Type:** [Derived Metric](#derived-metrics) (externally injected)
+
+Sum of average GPU power across all reporting GPUs during the profiling phase, in watts. Useful as a baseline for cross-run power comparisons.
+
+**Formula:**
+```python
+# Per GPU: average of gpu_power_usage gauge samples in the profiling window
+# (warmup excluded). Summed across all GPUs that reported valid samples.
+total_gpu_power_w = sum(
+    avg(gpu_power_usage[gpu][start_ns:end_ns])
+    for gpu in reporting_gpus
+)
+```
+
+**Notes:**
+- Unit: watts (`W`).
+- Time-filtered to the profiling-phase window; warmup samples are excluded.
+- Power is a gauge, so no end-of-window grace is applied: the window stays bounded at `end_ns`, and post-benchmark idle samples cannot drag the average down.
+- Omitted when no GPU reports `gpu_power_usage` in the window.
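The formula above can be made concrete with a small runnable sketch. The sample layout (`{gpu_id: [(timestamp_ns, watts), ...]}`), the function name, and the inclusive window bounds are assumptions for illustration; the real accumulator may store and filter samples differently.

```python
from __future__ import annotations

from statistics import fmean


def total_gpu_power_w(
    samples_by_gpu: dict[str, list[tuple[int, float]]],
    start_ns: int,
    end_ns: int,
) -> float | None:
    """Sum of per-GPU average power over [start_ns, end_ns]; None if no data."""
    per_gpu_avg = []
    for samples in samples_by_gpu.values():
        in_window = [watts for ts, watts in samples if start_ns <= ts <= end_ns]
        if in_window:  # GPUs with no samples in the window do not contribute
            per_gpu_avg.append(fmean(in_window))
    return sum(per_gpu_avg) if per_gpu_avg else None


samples = {
    "gpu0": [(0, 100.0), (10, 300.0), (20, 500.0), (30, 50.0)],
    "gpu1": [(10, 200.0), (20, 400.0)],
    "gpu2": [(40, 999.0)],  # only reports after the window
}
power = total_gpu_power_w(samples, start_ns=10, end_ns=20)
```

Here `gpu0` averages 400 W and `gpu1` averages 300 W inside the window, while `gpu2` contributes nothing, so the total is 700 W from two reporting GPUs.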
+
+---
+
+### Total GPU Energy
+
+**Type:** [Derived Metric](#derived-metrics) (externally injected)
+
+Sum of energy consumed across all reporting GPUs during the profiling phase, in joules. Computed as a counter delta (`final − baseline`) per GPU and summed.
+
+**Formula:**
+```python
+# Per GPU: delta of the energy_consumption monotonic counter over the
+# profiling window, widened on the end by FINAL_SCRAPE_GRACE_NS so the
+# trailing scrape that lands just after requests_end_ns is captured.
+grace_ns = Environment.GPU.FINAL_SCRAPE_GRACE_NS  # default 666_000_000 (666 ms)
+total_gpu_energy_j = sum(
+    delta(energy_consumption[gpu][start_ns : end_ns + grace_ns])
+    for gpu in reporting_gpus
+)
+# Negative deltas are clamped to 0 to handle counter resets (DCGM restart).
+```
+
+**Notes:**
+- Unit: joules (`J`). Source samples are reported in megajoules and converted via `EnergyMetricUnit.MEGAJOULE.joules`.
+- The end-of-window grace is bounded (not open-ended) so cooldown samples and any subsequent-phase samples cannot leak into the delta. Tune via `AIPERF_GPU_FINAL_SCRAPE_GRACE_NS` if you also tune `AIPERF_GPU_COLLECTION_INTERVAL` — keep grace at roughly `2x` the collection cadence.
+- Per-GPU deltas use the nearest non-NaN baseline and the nearest non-NaN final sample; arrays containing transient NaN sensor failures still yield a meaningful delta.
+- Omitted when no GPU reports `energy_consumption` in the window.
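The per-GPU delta described in these notes can be sketched as follows. Several details are assumptions rather than the actual implementation: the sample layout (`[(timestamp_ns, megajoules), ...]` sorted by time), the inclusive window bounds, treating the first and last non-NaN samples in the window as the baseline and final readings, and returning `None` when fewer than two valid samples exist.

```python
from __future__ import annotations

import math

JOULES_PER_MEGAJOULE = 1_000_000.0


def energy_delta_j(
    samples: list[tuple[int, float]],
    start_ns: int,
    end_ns: int,
    grace_ns: int,
) -> float | None:
    """Counter delta in joules over [start_ns, end_ns + grace_ns]."""
    valid = [
        megajoules
        for ts, megajoules in samples
        if start_ns <= ts <= end_ns + grace_ns and not math.isnan(megajoules)
    ]
    if len(valid) < 2:
        return None
    delta_mj = valid[-1] - valid[0]
    # A negative delta indicates a counter reset; clamp it to zero.
    return max(0.0, delta_mj * JOULES_PER_MEGAJOULE)


samples = [
    (0, 1.0),
    (100, float("nan")),  # transient sensor failure
    (200, 1.000002),
    (900, 1.000005),
    (1000, 1.000006),     # trailing scrape, inside the grace window
    (5000, 1.000050),     # cooldown sample, outside the bounded window
]
energy = energy_delta_j(samples, start_ns=100, end_ns=800, grace_ns=200)
reset = energy_delta_j([(100, 5.0), (200, 0.1)], start_ns=0, end_ns=300, grace_ns=0)
```

The trailing scrape at `1000` is included by the grace window, the NaN sample is ignored, and the cooldown sample at `5000` is excluded, giving a delta of about 4 J. The second call shows a counter reset being clamped to zero.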
+
+---
+
+### Output Tokens per Joule
+
+**Type:** [Derived Metric](#derived-metrics) (externally injected)
+
+Inference energy efficiency: number of output tokens produced per joule of GPU energy consumed during the profiling phase. Higher is better.
+
+**Formula:**
+```python
+output_tokens_per_joule = total_output_tokens / total_gpu_energy
+```
+
+**Notes:**
+- Unit: `tokens/J`.
+- Flagged `LARGER_IS_BETTER | PRODUCES_TOKENS_ONLY`.
+- Numerator comes from the request records (`total_output_tokens`); denominator comes from the GPU telemetry counter delta above. The header reports the energy-side GPU count, since that's the cohort the metric depends on.
+- Omitted when `total_output_tokens` is absent from the records or aggregate `total_gpu_energy` is zero.
+
+---
+
+### Energy per User
+
+**Type:** [Derived Metric](#derived-metrics) (externally injected)
+
+Per-user energy footprint during the profiling phase: total GPU energy consumed divided by the configured concurrency. Lower is better — a more efficient deployment serves the same load for less energy per concurrent user.
+
+**Formula:**
+```python
+# concurrency from the resolved profiling phase config
+# (run.cfg.get_profiling_phases()[0].concurrency).
+energy_per_user_j = total_gpu_energy / concurrency
+```
+
+**Notes:**
+- Unit: `joules/user`.
+- Flagged `MetricFlags.NONE` — smaller-is-better is the default for unflagged metrics.
+- Denominator is the profiling phase's configured `concurrency`. The resolver defaults this to `1` when `--concurrency` isn't specified in concurrency-mode runs, so the metric is emitted in the common case.
+- Header reports the energy-side GPU count (the same cohort `total_gpu_energy` reports), e.g. `Energy per User (8 GPUs)`.
+- Omitted when concurrency is unset (e.g. pure `--request-rate` mode) or aggregate GPU energy is unavailable.
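The two ratio metrics and their omission rules can be illustrated with a short sketch. The function names are hypothetical, and `None` is used here to model "the tag is omitted"; the real accumulator may signal omission differently.

```python
from __future__ import annotations


def output_tokens_per_joule(
    total_output_tokens: int | None, total_gpu_energy_j: float | None
) -> float | None:
    """Tokens per joule, or None when tokens are absent or energy is zero/missing."""
    if total_output_tokens is None or not total_gpu_energy_j:
        return None
    return total_output_tokens / total_gpu_energy_j


def energy_per_user_j(
    total_gpu_energy_j: float | None, concurrency: int | None
) -> float | None:
    """Joules per concurrent user, or None when either input is unavailable."""
    if total_gpu_energy_j is None or not concurrency:
        return None
    return total_gpu_energy_j / concurrency


tokens_per_j = output_tokens_per_joule(120_000, 48_000.0)
joules_per_user = energy_per_user_j(48_000.0, 16)
```

With 120,000 output tokens, 48,000 J of GPU energy, and a concurrency of 16, the sketch yields 2.5 tokens/J and 3,000 joules/user.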
+
+---
+
 ## Multi-Run Aggregate Metrics
 
 > [!NOTE]
 
@@ -191,10 +191,12 @@ class GenericMetricUnit(BaseMetricUnit):
     ERRORS = _unit("errors")
     IMAGE = _unit("image")
     IMAGES = _unit("images")
+    JOULES_PER_USER = _unit("joules/user")
     PERCENT = _unit("%")
     RATIO = _unit("ratio")
     REQUESTS = _unit("requests")
     TOKENS = _unit("tokens")
+    TOKENS_PER_JOULE = _unit("tokens/J")
     USER = _unit("user")
     USERS = _unit("users")
     VIDEO = _unit("video")
 
@@ -297,6 +297,21 @@ class _GPUSettings(BaseSettings):
         default=100,
         description="Batch size for telemetry record export results processor",
     )
+    FINAL_SCRAPE_GRACE_NS: int = Field(
+        ge=0,
+        le=60_000_000_000,
+        default=666_000_000,
+        description=(
+            "Grace window in nanoseconds appended to phase end_ns when computing "
+            "the GPU energy-counter delta. Energy is scraped on a cadence "
+            "(see COLLECTION_INTERVAL), so the trailing scrape often lands after "
+            "the phase ends; this grace lets it be included while bounding the "
+            "window so cooldown/idle samples and subsequent-phase samples don't "
+            "leak into the delta. Default 666_000_000 ns ~= 2x the default "
+            "333 ms COLLECTION_INTERVAL; raise this if you also raise "
+            "COLLECTION_INTERVAL."
+        ),
+    )
     REACHABILITY_TIMEOUT: int = Field(
         ge=1,
         le=300,