Skip to content

feat(BA-5878): add window-based max/avg container live stats via PromQL#11360

Open
seedspirit wants to merge 1 commit intomainfrom
feat/BA-5878
Open

feat(BA-5878): add window-based max/avg container live stats via PromQL#11360
seedspirit wants to merge 1 commit intomainfrom
feat/BA-5878

Conversation

@seedspirit
Copy link
Copy Markdown
Contributor

@seedspirit seedspirit commented Apr 28, 2026

Summary

  • Manager renders dedicated max_over_time / avg_over_time PromQL queries (with subqueries for rate-shape metrics) so the legacy stats.max / stats.avg fields can be filled directly from Prometheus — no agent-side change.
  • Container live-stat fan-out grows from 3 to 5 PromQL queries (gauge / diff / rate / max / avg). The repository merges responses in a generic loop instead of unpacking three buckets.
  • Metrics are split into gauge-shape vs rate-shape stats sources: exact names for built-ins (mem, io_scratch_size, cpu_util) and regex patterns for accelerator/plugin metrics (*_util, *_mem, *_power). Each bucket renders a label_replace-wrapped expression so results carry a synthetic value_type=stats.max / value_type=stats.avg label that legacy live-stat consumers can read.
  • Adds MAX / AVG to ValueType, plus from_legacy_live_stat_label / to_live_stat_label helpers to bridge the legacy stats.* label scheme.

Test plan

  • pants test tests/unit/manager/services/utilization_metric/test_container_metric.py — characterization tests for the rendered max/avg PromQL and the stats.*ValueType mapping
  • pants fmt fix lint check --changed-since=HEAD~1
  • Live verification against Prometheus once LegacyLiveStatConverter wiring lands — current PR is supply-side only.

🤖 Generated with Claude Code

@github-actions github-actions Bot added size:L 100~500 LoC comp:manager Related to Manager component comp:common Related to Common component labels Apr 28, 2026
seedspirit added a commit that referenced this pull request Apr 28, 2026
@github-actions github-actions Bot added the comp:agent Related to Agent component label Apr 29, 2026
@seedspirit seedspirit changed the title feat(BA-5878): add window-based max/avg live stats via PromQL feat(BA-5878): export agent-side max/avg container stats as Prometheus value_types Apr 29, 2026
seedspirit added a commit that referenced this pull request Apr 29, 2026
Reflect the new direction: agent-side stats.max / stats.avg are
exported as Prometheus value_type labels instead of computing them
via window-based PromQL queries.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@seedspirit seedspirit requested a review from a team April 29, 2026 09:00
@seedspirit seedspirit marked this pull request as ready for review April 29, 2026 09:00
Copilot AI review requested due to automatic review settings April 29, 2026 09:00
seedspirit added a commit that referenced this pull request Apr 29, 2026
seedspirit added a commit that referenced this pull request Apr 29, 2026
Reflect the new direction: agent-side stats.max / stats.avg are
exported as Prometheus value_type labels instead of computing them
via window-based PromQL queries.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the container utilization Prometheus metric to carry agent-computed rolling window statistics (stats.max / stats.avg) as additional value_type label variants (max / avg), and updates the manager-side live-stat query path to read/merge these additional samples without adding extra PromQL window queries.

Changes:

  • Agent: emit value_type=max / value_type=avg samples for container metrics when stats.max / stats.avg are present.
  • Common/Manager: add MAX / AVG to the shared ValueType enum and update live-stat query merging to handle an extensible query bundle.
  • Tests/Changelog: add unit tests for live-stat query composition/merging and add a feature note.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/ai/backend/agent/stats.py Adds stats.max / stats.avg emission as Prometheus value_type samples during container stat collection.
src/ai/backend/agent/metrics/types.py Introduces MAX_METRIC_KEY / AVG_METRIC_KEY and includes them in ALL_METRIC_VALUE_TYPES.
src/ai/backend/common/clients/prometheus/types.py Extends ValueType with MAX / AVG.
src/ai/backend/manager/repositories/metric/repository.py Refactors live-stat query merging to iterate over gathered responses.
tests/unit/manager/services/utilization_metric/test_container_metric.py Adds tests for live-stat query bundle shape and merge behavior with max/avg value types.
changes/11360.feature.md Documents the new agent-side export behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/ai/backend/agent/stats.py Outdated
Comment on lines +798 to +801
if (max_value := metric_value.get("stats.max")) is not None:
value_pairs.append((MAX_METRIC_KEY, max_value))
if (avg_value := metric_value.get("stats.avg")) is not None:
value_pairs.append((AVG_METRIC_KEY, avg_value))
Copy link

Copilot AI Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stats.max/stats.avg are taken from Metric.to_serializable_dict() and therefore reflect MovingStatistics over the raw measurement values (pre-current_hook). For counter-derived metrics like cpu_util (see docker/intrinsic.py, where current_hook uses metric.stats.rate while the raw value is the cumulative cpu time), exporting these as value_type=max/avg will produce values that are effectively max/avg of the monotonically increasing counter (i.e., not “max/avg utilization”). Consider either (a) computing max/avg over the derived current_hook value (e.g., maintain a separate MovingStatistics for the derived value), or (b) only exporting max/avg for true GAUGE metrics where the raw series is meaningful for max/avg.

Suggested change
if (max_value := metric_value.get("stats.max")) is not None:
value_pairs.append((MAX_METRIC_KEY, max_value))
if (avg_value := metric_value.get("stats.avg")) is not None:
value_pairs.append((AVG_METRIC_KEY, avg_value))
has_current_hook = getattr(obj, "current_hook", None) is not None
if not has_current_hook:
if (max_value := metric_value.get("stats.max")) is not None:
value_pairs.append((MAX_METRIC_KEY, max_value))
if (avg_value := metric_value.get("stats.avg")) is not None:
value_pairs.append((AVG_METRIC_KEY, avg_value))

Copilot uses AI. Check for mistakes.
Comment on lines -776 to -780
self.observe_container_metric(
kernel_id,
metric_key,
measure,
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this duplicate??

Revert agent-side stats.max/avg export in favor of computing max/avg
via PromQL window expressions on the manager. Container live-stat
fan-out grows from 3 to 5 queries (gauge / diff / rate / max / avg).

- Re-introduce MAX / AVG to ValueType plus from_legacy_live_stat_label
  to map "stats.max" / "stats.avg" labels back into typed value types.
- Build max/avg templates that union a gauge sub-expression with a
  rate sub-expression and label_replace the result back to the legacy
  "stats.max" / "stats.avg" label.
- Classify metrics into gauge-shape vs rate-shape stats sources, with
  exact names for built-ins and regex patterns for accelerator metrics.
- Repository merges 5 query responses in a generic loop instead of
  unpacking three buckets.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@seedspirit seedspirit changed the title feat(BA-5878): export agent-side max/avg container stats as Prometheus value_types feat(BA-5878): add window-based max/avg container live stats via PromQL May 8, 2026
PCT = "pct"
MAX = "max"
AVG = "avg"
RATE = "rate"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question, this seems PR is adding max, and avg
Why rate is included here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:agent Related to Agent component comp:common Related to Common component comp:manager Related to Manager component size:L 100~500 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants