feat(BA-5878): re-supply legacy live_stat stats.{max,avg,rate} from Prometheus by seedspirit · Pull Request #11360 · lablup/backend.ai

seedspirit · 2026-04-28T01:18:21Z

Why

The legacy live_stat resolvers — mem_max_bytes, cpu_using, io_max_scratch_size, net_rx_bytes, net_tx_bytes, and friends — read stats.max / stats.avg / stats.rate keys produced by an in-memory MovingStatistics accumulator on the agent. Each running container's metric carries a MovingStatistics instance holding running per-sample max / sum / count over every sample observed (sampled every 5 s via UTILIZATION_METRIC_INTERVAL = 5.0), plus the latest 2 samples for rate calculation. Each metric declares a stats_filter whitelist controlling which slots get published.

That accumulator has serious failure modes:

Agent restart wipes the history. First post-restart sample becomes the new baseline, so stats.max silently rewinds to "max since the agent process happened to last restart" — not lifetime peak.
avg flattens forever. _count is unbounded; a session that ran hot then went idle drifts toward the idle reading.
No retention across host reboot, no consistency across sessions whose agents restarted at different times.
stats.rate is computed from the last two samples only, so the first post-restart reading is 0.
For net_rx / net_tx the agent's stats_filter is empty, so the agent never actually publishes stats.rate — the legacy GraphQL net_rx_bytes / net_tx_bytes resolvers always return 0.

What this PR provides

A parallel manager-side supply for the same stats.{max,avg,rate} slots, computed on demand from Prometheus via window queries (MetricConfig.timewindow, default 5m). The legacy agent producer stays in place — this is a drop-in alternative source that future legacy compat code can route to.

The conceptual switch is "lifetime peak as remembered by the agent" → "recent peak as recorded by Prometheus." The new supply:

Survives agent / manager / host restart (data lives in Prometheus TSDB)
Is consistent across sessions (wall-clock window, not per-agent uptime)
Doesn't accumulate forever (old peaks roll out of the window)
Closes the long-standing net_rx / net_tx stats.rate hole that the agent's empty stats_filter left

The query results carry clean value_type=max / value_type=avg / value_type=rate labels (without the legacy stats. prefix). A future LegacyLiveStatConverter will re-attach the stats. prefix when serving legacy clients; the manager pipeline itself doesn't need to carry the legacy encoding.

Coverage matrix (what each pipeline publishes)

Metric	Legacy `stats.max`	This PR `stats.max`	Legacy `stats.avg`	This PR `stats.avg`	Legacy `stats.rate`	This PR `stats.rate`
`cpu_util`	✓	✓	✓	✓	—	—
`mem`	✓	✓	—	—	—	—
`io_scratch_size`	✓	✓	—	—	—	—
`io_read`	—	—	—	—	✓	✓
`io_write`	—	—	—	—	✓	✓
`net_rx`	—	—	—	—	— (see note)	✓
`net_tx`	—	—	—	—	— (see note)	✓
accel `*_mem`	✓	✓	—	—	—	—
accel `*_util`	✓	✓	✓	✓	—	—
accel `*_power`	✓	✓	✓	✓	—	—
accel `*_temperature`	✓	✓	✓	✓	—	—

Notes:

Legacy stats.rate for net_rx / net_tx is broken — always returns 0. The agent's measurement entries for these omit stats_filter (default empty), so the agent never publishes stats.rate to Valkey for net_rx/net_tx. The legacy GraphQL net_rx_bytes / net_tx_bytes resolvers read the missing key and fall back to 0. The agent's per-second rate IS published, but as current (via current_hook = lambda m: m.stats.rate), which the legacy resolver doesn't read — an internal contract mismatch.
This PR's stats.rate for net_rx / net_tx — passthrough, no rate() wrap. The current series in Prometheus already carries per-second rate (thanks to the agent's current_hook), so the new query just sums across replicas and label_replaces the value_type label from current to rate. Applying PromQL rate() to an already-rate series would compute rate-of-rate and produce nonsense.
This PR's stats.rate for io_read / io_write — rate(...[5m]) over the counter. No current_hook here, so current is the cumulative byte counter. The new query applies PromQL rate(...[5m]) to convert it to per-second bytes, then label_replaces the result to rate. Conceptually matches what the legacy stats_filter={"rate"} slot held, except computed over a 5-minute window instead of the agent's last-2-sample delta.

Live verification

Manager restarted on this branch, halfstack with Prometheus 3.1.0, one idle running kernel (1 CPU / 1 GiB):

metric	current	stats.max	stats.avg	stats.rate
`mem`	133,910,528	133,910,528	— (not produced by agent)	—
`cpu_util`	5371.729	5371.729	5371.729	—
`io_scratch_size`	0	0 (empty disk, not a missing field)	—	—
`net_rx`	27530	—	—	27530
`net_tx`	30378	—	—	30378
`io_read`	0	—	—	0
`io_write`	0	—	—	0

Idle workload across the 5-minute window → current / stats.max / stats.avg converge. Same kernel via the legacy GraphQL resolver returned live_stat: null (Valkey path empty in this env), confirming all values above came from the new pipeline.

Bug fixed during verification

Prometheus's RE2 engine rejects \- as parse error: unknown escape sequence U+002D '-'. Python's re.escape over-escapes hyphens, so every kernel-live-stat query (UUIDs always contain hyphens) would have failed to parse:

kernel_id=~"…-…"   → status: success
kernel_id=~"…\-…"  → status: error

Fixed by stripping the redundant escape before sending.

Test plan

pants test tests/unit/manager/services/utilization_metric/test_container_metric.py
pants fmt fix lint check --changed-since=HEAD~1
Live verification against Prometheus on a running kernel — stats.max / stats.avg / stats.rate populated as expected for mem / cpu_util / io_scratch_size / net_rx / net_tx / io_read / io_write. Wiring the new pipeline behind the legacy GraphQL resolvers (the LegacyLiveStatConverter) is the follow-up.

🤖 Generated with Claude Code

Reflect the new direction: agent-side stats.max / stats.avg are exported as Prometheus value_type labels instead of computing them via window-based PromQL queries. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

This PR extends the container utilization Prometheus metric to carry agent-computed rolling window statistics (stats.max / stats.avg) as additional value_type label variants (max / avg), and updates the manager-side live-stat query path to read/merge these additional samples without adding extra PromQL window queries.

Changes:

Agent: emit value_type=max / value_type=avg samples for container metrics when stats.max / stats.avg are present.
Common/Manager: add MAX / AVG to the shared ValueType enum and update live-stat query merging to handle an extensible query bundle.
Tests/Changelog: add unit tests for live-stat query composition/merging and add a feature note.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`src/ai/backend/agent/stats.py`	Adds `stats.max` / `stats.avg` emission as Prometheus `value_type` samples during container stat collection.
`src/ai/backend/agent/metrics/types.py`	Introduces `MAX_METRIC_KEY` / `AVG_METRIC_KEY` and includes them in `ALL_METRIC_VALUE_TYPES`.
`src/ai/backend/common/clients/prometheus/types.py`	Extends `ValueType` with `MAX` / `AVG`.
`src/ai/backend/manager/repositories/metric/repository.py`	Refactors live-stat query merging to iterate over gathered responses.
`tests/unit/manager/services/utilization_metric/test_container_metric.py`	Adds tests for live-stat query bundle shape and merge behavior with `max`/`avg` value types.
`changes/11360.feature.md`	Documents the new agent-side export behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Revert agent-side stats.max/avg export in favor of computing max/avg via PromQL window expressions on the manager. Container live-stat fan-out grows from 3 to 5 queries (gauge / diff / rate / max / avg). - Re-introduce MAX / AVG to ValueType plus from_legacy_live_stat_label to map "stats.max" / "stats.avg" labels back into typed value types. - Build max/avg templates that union a gauge sub-expression with a rate sub-expression and label_replace the result back to the legacy "stats.max" / "stats.avg" label. - Classify metrics into gauge-shape vs rate-shape stats sources, with exact names for built-ins and regex patterns for accelerator metrics. - Repository merges 5 query responses in a generic loop instead of unpacking three buckets. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

PromQL parser rejects \- as an unknown escape sequence inside a regex literal, so re.escape over-escaping broke every container live-stat query (kernel_id UUIDs always contain hyphens). Strip the backslash from \- after escaping so the rendered queries are RE2-acceptable. Also drop the unused ValueType.RATE — no producer ever emits it and no consumer matches on it; only MAX/AVG round-trip to the legacy stats.* labels. And include the underlying exception in the warning emitted from MetricRepository.query_container_live_stats so "empty results" no longer hide the real Prometheus failure mode. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…s_query The gauge and rate branches both produced the same label_replace + *_over_time + sum_by skeleton, differing only in whether the inner selector was wrapped with rate(...[window]). The two branches embedded that skeleton as inline f-strings, which hid the symmetry and was hard to read. Extract two methods on FixedQueryBuilder: - _utilization_selector() — builds {METRIC}{labels}, hiding the brace-doubling away from rendering callers. - _window_stat_subquery() — wraps a selector in label_replace({stat_fn}((sum by ({group_by})({selector}))[window:]), "value_type","{label}","value_type",".*"). This is the single place where the stats-subquery shape lives. _render_stats_query now reads as: pick a regex, build a selector (optionally wrapped in rate()), and feed it through the same skeleton. Rendered output is byte-for-byte unchanged (characterization tests pass without touching the fixture). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The five fields of ContainerLiveStatQueries were built through three asymmetric idioms: gauge/diff/rate constructed _LiveStatQuerySpec on the fly inside the call, max/avg used pre-built module constants but asked the caller to compute kernel_id_regex/group_by and pass them positionally. Promote the three filtered-query specs to module-level Final constants (_GAUGE_LIVE_STAT_SPEC / _DIFF_LIVE_STAT_SPEC / _RATE_LIVE_STAT_SPEC) so they sit alongside _MAX_STATS_BUCKET / _AVG_STATS_BUCKET, and rename the two builders so all five fields read as self._build_*_preset(kernel_ids, _CONSTANT). The kernel_id_regex / group_by computation moves inside _build_window_stats_preset where it is actually consumed, instead of leaking to the caller. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The match over MetricType already covers all three variants, so the case _ -> raise UnreachableError(...) arm was dead code that Pylance flagged as unreachable. mypy's exhaustiveness check now enforces the same invariant statically — adding a new MetricType variant without a case will fail type check, which is the same protection the runtime raise gave but caught earlier. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The two helpers were named "to_live_stat_label" / "from_legacy_live_stat_label", which obscured their actual asymmetry: only the encoder produces the legacy "stats.<name>" form, while the decoder accepts both legacy and current shapes. Rename them to: - ValueType.to_legacy_live_stat_label — encoder, legacy emission - ValueType.from_live_stat_label — decoder, accepts either form and document the "stats." prefix as the legacy convention so the removeprefix branch reads as historical compatibility, not generic parsing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ture The window-stat gauge patterns lagged behind what legacy accelerator plugins actually publish: - *_power: legacy emits both stats.max and stats.avg; this PR was emitting only stats.max. - *_temperature: legacy emits both stats.max and stats.avg; this PR was emitting neither. Surveyed plugins (rebellions/common, rebellions/atom_max, habana, ipu, mock): - All emit *_mem (max only) and *_util (max + avg). - Only mock currently emits *_power and *_temperature, both with {avg, max} filters. Extend STATS_MAX_GAUGE_METRIC_PATTERNS to include _temperature and STATS_AVG_GAUGE_METRIC_PATTERNS to include _power and _temperature so the new pipeline matches what every legacy plugin actually publishes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…t_rx/net_tx Legacy live_stat consumers — most importantly the legacy_compute_session.net_rx_bytes / net_tx_bytes / io_read_bytes / io_write_bytes GraphQL resolvers — read the "stats.rate" key from the per-metric live_stat dict. The new PromQL pipeline previously emitted nothing under that label, leaving those four legacy fields uncovered: - io_read / io_write: legacy agent's stats_filter={"rate"}, which the new pipeline did not produce. - net_rx / net_tx: legacy agent's stats_filter is empty, so the agent publishes no stats.rate at all even though the legacy resolver expects it — making the resolver always return 0. The new pipeline now produces a value where legacy never did. Two metric shapes flow through the new bucket: - Gauge-shape (net_rx, net_tx): the metric's `current` value is already a per-second rate (set by agent's current_hook = lambda m: m.stats.rate), so PromQL only needs to sum across replicas and label_replace to "stats.rate". - Counter-shape (io_read, io_write): the value is a cumulative byte counter, so PromQL applies rate(...[window]) before label_replace. Live verified against Prometheus on a running kernel: {net_rx, stats.rate} = 27530 {net_tx, stats.rate} = 30378 {io_read, stats.rate} = 0 {io_write, stats.rate} = 0 Re-introduces ValueType.RATE (and its to_legacy_live_stat_label / from_live_stat_label round-trip) that was removed earlier in the branch when no producer existed; the round-trip is now load-bearing again. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…bels Earlier the new pipeline emitted "stats.max" / "stats.avg" / "stats.rate" as the rewritten value_type label, mirroring the legacy live_stat dict keys. That coupling was unnecessary: the labels are synthesized at query time via label_replace and never persist in the Prometheus TSDB (only "current" / "capacity" series actually get scraped). The legacy "stats." prefix is a Valkey live_stat artifact that future legacy- compat consumers should re-attach in a dedicated converter, not something the manager pipeline needs to carry. The new pipeline now emits clean, ValueType-enum-aligned labels: stats.max -> max stats.avg -> avg stats.rate -> rate Round-trip helpers go away: - ValueType.to_legacy_live_stat_label() removed (was a 1-arm match that always returned "stats.{value}" for the three stat slots). - ValueType.from_live_stat_label() removed; KernelMetricValuesByKernel parses the response label via ValueType(value_type_str) directly. Drop trailing .value on f-string interpolation of ValueType members — StrEnum's __format__ already returns the string value, so f"{ValueType.MAX}" == "max" without the explicit attribute access. Test fixtures and the from_prometheus_response characterisation test now use the bare labels. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The original phrasing led with the implementation ("window-based max/avg queries via PromQL") rather than the user-facing motivation ("a parallel supply for legacy stats.* live_stat fields that doesn't suffer the agent-restart / accumulator failure modes"). Updates the changelog entry to lead with the compat angle, expands coverage to include stats.rate (now also produced), and notes the restart-safe / window semantics that make this a meaningful upgrade over the legacy producer. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…oint Derive STATS_MAX_GAUGE_METRIC_PATTERNS / STATS_AVG_GAUGE_METRIC_PATTERNS from a single _ACCEL_GAUGE_SUFFIXES_* source so adding a new accelerator metric kind (e.g. clock, voltage) is a one-suffix edit instead of editing two regex strings in lockstep. Functionally identical — the generated regex is the same alternation, just sorted alphabetically. Snapshot tests adjusted for the new sort order. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions Bot assigned seedspirit Apr 28, 2026

github-actions Bot added size:L 100~500 LoC comp:manager Related to Manager component comp:common Related to Common component labels Apr 28, 2026

seedspirit added a commit that referenced this pull request Apr 28, 2026

changelog: add news fragment for PR #11360

5065ac2

github-actions Bot added the comp:agent Related to Agent component label Apr 29, 2026

seedspirit changed the title ~~feat(BA-5878): add window-based max/avg live stats via PromQL~~ feat(BA-5878): export agent-side max/avg container stats as Prometheus value_types Apr 29, 2026

seedspirit requested a review from a team April 29, 2026 09:00

seedspirit marked this pull request as ready for review April 29, 2026 09:00

Copilot AI review requested due to automatic review settings April 29, 2026 09:00

seedspirit added a commit that referenced this pull request Apr 29, 2026

changelog: add news fragment for PR #11360

5f03b92

seedspirit force-pushed the feat/BA-5878 branch from f45c9a0 to 9f6a8ee Compare April 29, 2026 09:02

Copilot AI reviewed Apr 29, 2026

View reviewed changes

Comment thread src/ai/backend/agent/stats.py Outdated

fregataa reviewed Apr 30, 2026

View reviewed changes

Comment thread src/ai/backend/agent/stats.py

seedspirit force-pushed the feat/BA-5878 branch from cfb4990 to 263b14a Compare May 8, 2026 08:27

seedspirit changed the title ~~feat(BA-5878): export agent-side max/avg container stats as Prometheus value_types~~ feat(BA-5878): add window-based max/avg container live stats via PromQL May 8, 2026

jopemachine reviewed May 8, 2026

View reviewed changes

Comment thread src/ai/backend/common/clients/prometheus/types.py

seedspirit and others added 2 commits May 10, 2026 17:23

seedspirit force-pushed the feat/BA-5878 branch from 73977f6 to 520fcca Compare May 10, 2026 08:25

seedspirit and others added 4 commits May 10, 2026 17:34

seedspirit requested review from fregataa and jopemachine May 10, 2026 09:12

seedspirit and others added 2 commits May 10, 2026 18:34

seedspirit changed the title ~~feat(BA-5878): add window-based max/avg container live stats via PromQL~~ feat(BA-5878): re-supply legacy live_stat stats.{max,avg,rate} from Prometheus May 10, 2026

seedspirit requested a review from a team May 10, 2026 10:35

seedspirit mentioned this pull request May 10, 2026

refactor(BA-5744): migrate kernel live_stat from Valkey to Prometheus #11330

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(BA-5878): re-supply legacy live_stat stats.{max,avg,rate} from Prometheus#11360

feat(BA-5878): re-supply legacy live_stat stats.{max,avg,rate} from Prometheus#11360
seedspirit wants to merge 11 commits intomainfrom
feat/BA-5878

seedspirit commented Apr 28, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

seedspirit commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What this PR provides

Coverage matrix (what each pipeline publishes)

Live verification

Bug fixed during verification

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

seedspirit commented Apr 28, 2026 •

edited

Loading