feat(BA-5878): add window-based max/avg container live stats via PromQL#11360
feat(BA-5878): add window-based max/avg container live stats via PromQL#11360seedspirit wants to merge 1 commit intomainfrom
Conversation
Reflect the new direction: agent-side stats.max / stats.avg are exported as Prometheus value_type labels instead of computing them via window-based PromQL queries. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reflect the new direction: agent-side stats.max / stats.avg are exported as Prometheus value_type labels instead of computing them via window-based PromQL queries. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
f45c9a0 to
9f6a8ee
Compare
There was a problem hiding this comment.
Pull request overview
This PR extends the container utilization Prometheus metric to carry agent-computed rolling window statistics (stats.max / stats.avg) as additional value_type label variants (max / avg), and updates the manager-side live-stat query path to read/merge these additional samples without adding extra PromQL window queries.
Changes:
- Agent: emit
value_type=max/value_type=avgsamples for container metrics whenstats.max/stats.avgare present. - Common/Manager: add
MAX/AVGto the sharedValueTypeenum and update live-stat query merging to handle an extensible query bundle. - Tests/Changelog: add unit tests for live-stat query composition/merging and add a feature note.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/ai/backend/agent/stats.py |
Adds stats.max / stats.avg emission as Prometheus value_type samples during container stat collection. |
src/ai/backend/agent/metrics/types.py |
Introduces MAX_METRIC_KEY / AVG_METRIC_KEY and includes them in ALL_METRIC_VALUE_TYPES. |
src/ai/backend/common/clients/prometheus/types.py |
Extends ValueType with MAX / AVG. |
src/ai/backend/manager/repositories/metric/repository.py |
Refactors live-stat query merging to iterate over gathered responses. |
tests/unit/manager/services/utilization_metric/test_container_metric.py |
Adds tests for live-stat query bundle shape and merge behavior with max/avg value types. |
changes/11360.feature.md |
Documents the new agent-side export behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (max_value := metric_value.get("stats.max")) is not None: | ||
| value_pairs.append((MAX_METRIC_KEY, max_value)) | ||
| if (avg_value := metric_value.get("stats.avg")) is not None: | ||
| value_pairs.append((AVG_METRIC_KEY, avg_value)) |
There was a problem hiding this comment.
stats.max/stats.avg are taken from Metric.to_serializable_dict() and therefore reflect MovingStatistics over the raw measurement values (pre-current_hook). For counter-derived metrics like cpu_util (see docker/intrinsic.py, where current_hook uses metric.stats.rate while the raw value is the cumulative cpu time), exporting these as value_type=max/avg will produce values that are effectively max/avg of the monotonically increasing counter (i.e., not “max/avg utilization”). Consider either (a) computing max/avg over the derived current_hook value (e.g., maintain a separate MovingStatistics for the derived value), or (b) only exporting max/avg for true GAUGE metrics where the raw series is meaningful for max/avg.
| if (max_value := metric_value.get("stats.max")) is not None: | |
| value_pairs.append((MAX_METRIC_KEY, max_value)) | |
| if (avg_value := metric_value.get("stats.avg")) is not None: | |
| value_pairs.append((AVG_METRIC_KEY, avg_value)) | |
| has_current_hook = getattr(obj, "current_hook", None) is not None | |
| if not has_current_hook: | |
| if (max_value := metric_value.get("stats.max")) is not None: | |
| value_pairs.append((MAX_METRIC_KEY, max_value)) | |
| if (avg_value := metric_value.get("stats.avg")) is not None: | |
| value_pairs.append((AVG_METRIC_KEY, avg_value)) |
| self.observe_container_metric( | ||
| kernel_id, | ||
| metric_key, | ||
| measure, | ||
| ) |
Revert agent-side stats.max/avg export in favor of computing max/avg via PromQL window expressions on the manager. Container live-stat fan-out grows from 3 to 5 queries (gauge / diff / rate / max / avg). - Re-introduce MAX / AVG to ValueType plus from_legacy_live_stat_label to map "stats.max" / "stats.avg" labels back into typed value types. - Build max/avg templates that union a gauge sub-expression with a rate sub-expression and label_replace the result back to the legacy "stats.max" / "stats.avg" label. - Classify metrics into gauge-shape vs rate-shape stats sources, with exact names for built-ins and regex patterns for accelerator metrics. - Repository merges 5 query responses in a generic loop instead of unpacking three buckets. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| PCT = "pct" | ||
| MAX = "max" | ||
| AVG = "avg" | ||
| RATE = "rate" |
There was a problem hiding this comment.
Just a question, this seems PR is adding max, and avg
Why rate is included here?
Summary
max_over_time/avg_over_timePromQL queries (with subqueries for rate-shape metrics) so the legacystats.max/stats.avgfields can be filled directly from Prometheus — no agent-side change.mem,io_scratch_size,cpu_util) and regex patterns for accelerator/plugin metrics (*_util,*_mem,*_power). Each bucket renders alabel_replace-wrapped expression so results carry a syntheticvalue_type=stats.max/value_type=stats.avglabel that legacy live-stat consumers can read.MAX/AVGtoValueType, plusfrom_legacy_live_stat_label/to_live_stat_labelhelpers to bridge the legacystats.*label scheme.Test plan
pants test tests/unit/manager/services/utilization_metric/test_container_metric.py— characterization tests for the rendered max/avg PromQL and thestats.*→ValueTypemappingpants fmt fix lint check --changed-since=HEAD~1LegacyLiveStatConverterwiring lands — current PR is supply-side only.🤖 Generated with Claude Code