feat(BA-5878): add window-based max/avg container live stats via PromQL by seedspirit · Pull Request #11360 · lablup/backend.ai

seedspirit · 2026-04-28T01:18:21Z

Summary

Manager renders dedicated max_over_time / avg_over_time PromQL queries (with subqueries for rate-shape metrics) so the legacy stats.max / stats.avg fields can be filled directly from Prometheus — no agent-side change.
Container live-stat fan-out grows from 3 to 5 PromQL queries (gauge / diff / rate / max / avg). The repository merges responses in a generic loop instead of unpacking three buckets.
Metrics are split into gauge-shape vs rate-shape stats sources: exact names for built-ins (mem, io_scratch_size, cpu_util) and regex patterns for accelerator/plugin metrics (*_util, *_mem, *_power). Each bucket renders a label_replace-wrapped expression so results carry a synthetic value_type=stats.max / value_type=stats.avg label that legacy live-stat consumers can read.
Adds MAX / AVG to ValueType, plus from_legacy_live_stat_label / to_live_stat_label helpers to bridge the legacy stats.* label scheme.

Test plan

pants test tests/unit/manager/services/utilization_metric/test_container_metric.py — characterization tests for the rendered max/avg PromQL and the stats.* → ValueType mapping
pants fmt fix lint check --changed-since=HEAD~1
Live verification against Prometheus once LegacyLiveStatConverter wiring lands — current PR is supply-side only.

🤖 Generated with Claude Code

Reflect the new direction: agent-side stats.max / stats.avg are exported as Prometheus value_type labels instead of computing them via window-based PromQL queries. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

This PR extends the container utilization Prometheus metric to carry agent-computed rolling window statistics (stats.max / stats.avg) as additional value_type label variants (max / avg), and updates the manager-side live-stat query path to read/merge these additional samples without adding extra PromQL window queries.

Changes:

Agent: emit value_type=max / value_type=avg samples for container metrics when stats.max / stats.avg are present.
Common/Manager: add MAX / AVG to the shared ValueType enum and update live-stat query merging to handle an extensible query bundle.
Tests/Changelog: add unit tests for live-stat query composition/merging and add a feature note.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`src/ai/backend/agent/stats.py`	Adds `stats.max` / `stats.avg` emission as Prometheus `value_type` samples during container stat collection.
`src/ai/backend/agent/metrics/types.py`	Introduces `MAX_METRIC_KEY` / `AVG_METRIC_KEY` and includes them in `ALL_METRIC_VALUE_TYPES`.
`src/ai/backend/common/clients/prometheus/types.py`	Extends `ValueType` with `MAX` / `AVG`.
`src/ai/backend/manager/repositories/metric/repository.py`	Refactors live-stat query merging to iterate over gathered responses.
`tests/unit/manager/services/utilization_metric/test_container_metric.py`	Adds tests for live-stat query bundle shape and merge behavior with `max`/`avg` value types.
`changes/11360.feature.md`	Documents the new agent-side export behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-29T09:08:28Z

+                if (max_value := metric_value.get("stats.max")) is not None:
+                    value_pairs.append((MAX_METRIC_KEY, max_value))
+                if (avg_value := metric_value.get("stats.avg")) is not None:
+                    value_pairs.append((AVG_METRIC_KEY, avg_value))


stats.max/stats.avg are taken from Metric.to_serializable_dict() and therefore reflect MovingStatistics over the raw measurement values (pre-current_hook). For counter-derived metrics like cpu_util (see docker/intrinsic.py, where current_hook uses metric.stats.rate while the raw value is the cumulative cpu time), exporting these as value_type=max/avg will produce values that are effectively max/avg of the monotonically increasing counter (i.e., not “max/avg utilization”). Consider either (a) computing max/avg over the derived current_hook value (e.g., maintain a separate MovingStatistics for the derived value), or (b) only exporting max/avg for true GAUGE metrics where the raw series is meaningful for max/avg.

Suggested change

if (max_value := metric_value.get("stats.max")) is not None:

value_pairs.append((MAX_METRIC_KEY, max_value))

if (avg_value := metric_value.get("stats.avg")) is not None:

value_pairs.append((AVG_METRIC_KEY, avg_value))

has_current_hook = getattr(obj, "current_hook", None) is not None

if not has_current_hook:

if (max_value := metric_value.get("stats.max")) is not None:

value_pairs.append((MAX_METRIC_KEY, max_value))

if (avg_value := metric_value.get("stats.avg")) is not None:

value_pairs.append((AVG_METRIC_KEY, avg_value))

fregataa · 2026-04-30T12:12:58Z

-                    self.observe_container_metric(
-                        kernel_id,
-                        metric_key,
-                        measure,
-                    )


Is this duplicate??

Revert agent-side stats.max/avg export in favor of computing max/avg via PromQL window expressions on the manager. Container live-stat fan-out grows from 3 to 5 queries (gauge / diff / rate / max / avg). - Re-introduce MAX / AVG to ValueType plus from_legacy_live_stat_label to map "stats.max" / "stats.avg" labels back into typed value types. - Build max/avg templates that union a gauge sub-expression with a rate sub-expression and label_replace the result back to the legacy "stats.max" / "stats.avg" label. - Classify metrics into gauge-shape vs rate-shape stats sources, with exact names for built-ins and regex patterns for accelerator metrics. - Repository merges 5 query responses in a generic loop instead of unpacking three buckets. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

jopemachine · 2026-05-08T08:32:12Z

    PCT = "pct"
+    MAX = "max"
+    AVG = "avg"
+    RATE = "rate"


Just a question, this seems PR is adding max, and avg
Why rate is included here?

github-actions Bot assigned seedspirit Apr 28, 2026

github-actions Bot added size:L 100~500 LoC comp:manager Related to Manager component comp:common Related to Common component labels Apr 28, 2026

seedspirit added a commit that referenced this pull request Apr 28, 2026

changelog: add news fragment for PR #11360

5065ac2

github-actions Bot added the comp:agent Related to Agent component label Apr 29, 2026

seedspirit changed the title ~~feat(BA-5878): add window-based max/avg live stats via PromQL~~ feat(BA-5878): export agent-side max/avg container stats as Prometheus value_types Apr 29, 2026

seedspirit requested a review from a team April 29, 2026 09:00

seedspirit marked this pull request as ready for review April 29, 2026 09:00

Copilot AI review requested due to automatic review settings April 29, 2026 09:00

seedspirit added a commit that referenced this pull request Apr 29, 2026

changelog: add news fragment for PR #11360

5f03b92

seedspirit force-pushed the feat/BA-5878 branch from f45c9a0 to 9f6a8ee Compare April 29, 2026 09:02

Copilot AI reviewed Apr 29, 2026

View reviewed changes

fregataa reviewed Apr 30, 2026

View reviewed changes

seedspirit force-pushed the feat/BA-5878 branch from cfb4990 to 263b14a Compare May 8, 2026 08:27

seedspirit changed the title ~~feat(BA-5878): export agent-side max/avg container stats as Prometheus value_types~~ feat(BA-5878): add window-based max/avg container live stats via PromQL May 8, 2026

jopemachine reviewed May 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(BA-5878): add window-based max/avg container live stats via PromQL#11360

feat(BA-5878): add window-based max/avg container live stats via PromQL#11360
seedspirit wants to merge 1 commit intomainfrom
feat/BA-5878

seedspirit commented Apr 28, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 29, 2026

Uh oh!

fregataa Apr 30, 2026

Uh oh!

jopemachine May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

seedspirit commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

fregataa Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

jopemachine May 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

seedspirit commented Apr 28, 2026 •

edited

Loading