Skip to content

feat(BA-5878): re-supply legacy live_stat stats.{max,avg,rate} from Prometheus#11360

Open
seedspirit wants to merge 11 commits intomainfrom
feat/BA-5878
Open

feat(BA-5878): re-supply legacy live_stat stats.{max,avg,rate} from Prometheus#11360
seedspirit wants to merge 11 commits intomainfrom
feat/BA-5878

Conversation

@seedspirit
Copy link
Copy Markdown
Contributor

@seedspirit seedspirit commented Apr 28, 2026

Why

The legacy live_stat resolvers — mem_max_bytes, cpu_using, io_max_scratch_size, net_rx_bytes, net_tx_bytes, and friends — read stats.max / stats.avg / stats.rate keys produced by an in-memory MovingStatistics accumulator on the agent. Each running container's metric carries a MovingStatistics instance holding running per-sample max / sum / count over every sample observed (sampled every 5 s via UTILIZATION_METRIC_INTERVAL = 5.0), plus the latest 2 samples for rate calculation. Each metric declares a stats_filter whitelist controlling which slots get published.

That accumulator has serious failure modes:

  • Agent restart wipes the history. First post-restart sample becomes the new baseline, so stats.max silently rewinds to "max since the agent process happened to last restart" — not lifetime peak.
  • avg flattens forever. _count is unbounded; a session that ran hot then went idle drifts toward the idle reading.
  • No retention across host reboot, no consistency across sessions whose agents restarted at different times.
  • stats.rate is computed from the last two samples only, so the first post-restart reading is 0.
  • For net_rx / net_tx the agent's stats_filter is empty, so the agent never actually publishes stats.rate — the legacy GraphQL net_rx_bytes / net_tx_bytes resolvers always return 0.

What this PR provides

A parallel manager-side supply for the same stats.{max,avg,rate} slots, computed on demand from Prometheus via window queries (MetricConfig.timewindow, default 5m). The legacy agent producer stays in place — this is a drop-in alternative source that future legacy compat code can route to.

The conceptual switch is "lifetime peak as remembered by the agent""recent peak as recorded by Prometheus." The new supply:

  • Survives agent / manager / host restart (data lives in Prometheus TSDB)
  • Is consistent across sessions (wall-clock window, not per-agent uptime)
  • Doesn't accumulate forever (old peaks roll out of the window)
  • Closes the long-standing net_rx / net_tx stats.rate hole that the agent's empty stats_filter left

The query results carry clean value_type=max / value_type=avg / value_type=rate labels (without the legacy stats. prefix). A future LegacyLiveStatConverter will re-attach the stats. prefix when serving legacy clients; the manager pipeline itself doesn't need to carry the legacy encoding.

Coverage matrix (what each pipeline publishes)

Metric Legacy stats.max This PR stats.max Legacy stats.avg This PR stats.avg Legacy stats.rate This PR stats.rate
cpu_util
mem
io_scratch_size
io_read
io_write
net_rx (see note)
net_tx (see note)
accel *_mem
accel *_util
accel *_power
accel *_temperature

Notes:

  • Legacy stats.rate for net_rx / net_tx is broken — always returns 0. The agent's measurement entries for these omit stats_filter (default empty), so the agent never publishes stats.rate to Valkey for net_rx/net_tx. The legacy GraphQL net_rx_bytes / net_tx_bytes resolvers read the missing key and fall back to 0. The agent's per-second rate IS published, but as current (via current_hook = lambda m: m.stats.rate), which the legacy resolver doesn't read — an internal contract mismatch.
  • This PR's stats.rate for net_rx / net_tx — passthrough, no rate() wrap. The current series in Prometheus already carries per-second rate (thanks to the agent's current_hook), so the new query just sums across replicas and label_replaces the value_type label from current to rate. Applying PromQL rate() to an already-rate series would compute rate-of-rate and produce nonsense.
  • This PR's stats.rate for io_read / io_writerate(...[5m]) over the counter. No current_hook here, so current is the cumulative byte counter. The new query applies PromQL rate(...[5m]) to convert it to per-second bytes, then label_replaces the result to rate. Conceptually matches what the legacy stats_filter={"rate"} slot held, except computed over a 5-minute window instead of the agent's last-2-sample delta.

Live verification

Manager restarted on this branch, halfstack with Prometheus 3.1.0, one idle running kernel (1 CPU / 1 GiB):

metric current stats.max stats.avg stats.rate
mem 133,910,528 133,910,528 (not produced by agent)
cpu_util 5371.729 5371.729 5371.729
io_scratch_size 0 0 (empty disk, not a missing field)
net_rx 27530 27530
net_tx 30378 30378
io_read 0 0
io_write 0 0

Idle workload across the 5-minute window → current / stats.max / stats.avg converge. Same kernel via the legacy GraphQL resolver returned live_stat: null (Valkey path empty in this env), confirming all values above came from the new pipeline.

Bug fixed during verification

Prometheus's RE2 engine rejects \- as parse error: unknown escape sequence U+002D '-'. Python's re.escape over-escapes hyphens, so every kernel-live-stat query (UUIDs always contain hyphens) would have failed to parse:

kernel_id=~"…-…"   → status: success
kernel_id=~"…\-…"  → status: error

Fixed by stripping the redundant escape before sending.

Test plan

  • pants test tests/unit/manager/services/utilization_metric/test_container_metric.py
  • pants fmt fix lint check --changed-since=HEAD~1
  • Live verification against Prometheus on a running kernel — stats.max / stats.avg / stats.rate populated as expected for mem / cpu_util / io_scratch_size / net_rx / net_tx / io_read / io_write. Wiring the new pipeline behind the legacy GraphQL resolvers (the LegacyLiveStatConverter) is the follow-up.

🤖 Generated with Claude Code

@github-actions github-actions Bot added size:L 100~500 LoC comp:manager Related to Manager component comp:common Related to Common component labels Apr 28, 2026
seedspirit added a commit that referenced this pull request Apr 28, 2026
@github-actions github-actions Bot added the comp:agent Related to Agent component label Apr 29, 2026
@seedspirit seedspirit changed the title feat(BA-5878): add window-based max/avg live stats via PromQL feat(BA-5878): export agent-side max/avg container stats as Prometheus value_types Apr 29, 2026
seedspirit added a commit that referenced this pull request Apr 29, 2026
Reflect the new direction: agent-side stats.max / stats.avg are
exported as Prometheus value_type labels instead of computing them
via window-based PromQL queries.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@seedspirit seedspirit requested a review from a team April 29, 2026 09:00
@seedspirit seedspirit marked this pull request as ready for review April 29, 2026 09:00
Copilot AI review requested due to automatic review settings April 29, 2026 09:00
seedspirit added a commit that referenced this pull request Apr 29, 2026
seedspirit added a commit that referenced this pull request Apr 29, 2026
Reflect the new direction: agent-side stats.max / stats.avg are
exported as Prometheus value_type labels instead of computing them
via window-based PromQL queries.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the container utilization Prometheus metric to carry agent-computed rolling window statistics (stats.max / stats.avg) as additional value_type label variants (max / avg), and updates the manager-side live-stat query path to read/merge these additional samples without adding extra PromQL window queries.

Changes:

  • Agent: emit value_type=max / value_type=avg samples for container metrics when stats.max / stats.avg are present.
  • Common/Manager: add MAX / AVG to the shared ValueType enum and update live-stat query merging to handle an extensible query bundle.
  • Tests/Changelog: add unit tests for live-stat query composition/merging and add a feature note.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/ai/backend/agent/stats.py Adds stats.max / stats.avg emission as Prometheus value_type samples during container stat collection.
src/ai/backend/agent/metrics/types.py Introduces MAX_METRIC_KEY / AVG_METRIC_KEY and includes them in ALL_METRIC_VALUE_TYPES.
src/ai/backend/common/clients/prometheus/types.py Extends ValueType with MAX / AVG.
src/ai/backend/manager/repositories/metric/repository.py Refactors live-stat query merging to iterate over gathered responses.
tests/unit/manager/services/utilization_metric/test_container_metric.py Adds tests for live-stat query bundle shape and merge behavior with max/avg value types.
changes/11360.feature.md Documents the new agent-side export behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/ai/backend/agent/stats.py Outdated
Comment thread src/ai/backend/agent/stats.py
@seedspirit seedspirit changed the title feat(BA-5878): export agent-side max/avg container stats as Prometheus value_types feat(BA-5878): add window-based max/avg container live stats via PromQL May 8, 2026
Comment thread src/ai/backend/common/clients/prometheus/types.py
seedspirit and others added 2 commits May 10, 2026 17:23
Revert agent-side stats.max/avg export in favor of computing max/avg
via PromQL window expressions on the manager. Container live-stat
fan-out grows from 3 to 5 queries (gauge / diff / rate / max / avg).

- Re-introduce MAX / AVG to ValueType plus from_legacy_live_stat_label
  to map "stats.max" / "stats.avg" labels back into typed value types.
- Build max/avg templates that union a gauge sub-expression with a
  rate sub-expression and label_replace the result back to the legacy
  "stats.max" / "stats.avg" label.
- Classify metrics into gauge-shape vs rate-shape stats sources, with
  exact names for built-ins and regex patterns for accelerator metrics.
- Repository merges 5 query responses in a generic loop instead of
  unpacking three buckets.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PromQL parser rejects \- as an unknown escape sequence inside a
regex literal, so re.escape over-escaping broke every container
live-stat query (kernel_id UUIDs always contain hyphens). Strip the
backslash from \- after escaping so the rendered queries are
RE2-acceptable.

Also drop the unused ValueType.RATE — no producer ever emits it and
no consumer matches on it; only MAX/AVG round-trip to the legacy
stats.* labels. And include the underlying exception in the
warning emitted from MetricRepository.query_container_live_stats so
"empty results" no longer hide the real Prometheus failure mode.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
seedspirit and others added 4 commits May 10, 2026 17:34
…s_query

The gauge and rate branches both produced the same
label_replace + *_over_time + sum_by skeleton, differing only in
whether the inner selector was wrapped with rate(...[window]). The
two branches embedded that skeleton as inline f-strings, which
hid the symmetry and was hard to read.

Extract two methods on FixedQueryBuilder:

- _utilization_selector() — builds {METRIC}{labels}, hiding the
  brace-doubling away from rendering callers.
- _window_stat_subquery() — wraps a selector in
  label_replace({stat_fn}((sum by ({group_by})({selector}))[window:]),
  "value_type","{label}","value_type",".*"). This is the single
  place where the stats-subquery shape lives.

_render_stats_query now reads as: pick a regex, build a selector
(optionally wrapped in rate()), and feed it through the same
skeleton. Rendered output is byte-for-byte unchanged
(characterization tests pass without touching the fixture).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The five fields of ContainerLiveStatQueries were built through three
asymmetric idioms: gauge/diff/rate constructed _LiveStatQuerySpec on
the fly inside the call, max/avg used pre-built module constants but
asked the caller to compute kernel_id_regex/group_by and pass them
positionally.

Promote the three filtered-query specs to module-level Final constants
(_GAUGE_LIVE_STAT_SPEC / _DIFF_LIVE_STAT_SPEC / _RATE_LIVE_STAT_SPEC)
so they sit alongside _MAX_STATS_BUCKET / _AVG_STATS_BUCKET, and rename
the two builders so all five fields read as
self._build_*_preset(kernel_ids, _CONSTANT). The kernel_id_regex /
group_by computation moves inside _build_window_stats_preset where it
is actually consumed, instead of leaking to the caller.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The match over MetricType already covers all three variants, so the
case _ -> raise UnreachableError(...) arm was dead code that Pylance
flagged as unreachable. mypy's exhaustiveness check now enforces the
same invariant statically — adding a new MetricType variant without a
case will fail type check, which is the same protection the runtime
raise gave but caught earlier.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The two helpers were named "to_live_stat_label" / "from_legacy_live_stat_label",
which obscured their actual asymmetry: only the encoder produces the
legacy "stats.<name>" form, while the decoder accepts both legacy and
current shapes. Rename them to:

- ValueType.to_legacy_live_stat_label — encoder, legacy emission
- ValueType.from_live_stat_label — decoder, accepts either form

and document the "stats." prefix as the legacy convention so the
removeprefix branch reads as historical compatibility, not generic
parsing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
seedspirit and others added 2 commits May 10, 2026 18:34
…ture

The window-stat gauge patterns lagged behind what legacy accelerator
plugins actually publish:

- *_power: legacy emits both stats.max and stats.avg; this PR was
  emitting only stats.max.
- *_temperature: legacy emits both stats.max and stats.avg; this PR
  was emitting neither.

Surveyed plugins (rebellions/common, rebellions/atom_max, habana, ipu,
mock):
- All emit *_mem (max only) and *_util (max + avg).
- Only mock currently emits *_power and *_temperature, both with
  {avg, max} filters.

Extend STATS_MAX_GAUGE_METRIC_PATTERNS to include _temperature and
STATS_AVG_GAUGE_METRIC_PATTERNS to include _power and _temperature so
the new pipeline matches what every legacy plugin actually publishes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t_rx/net_tx

Legacy live_stat consumers — most importantly the
legacy_compute_session.net_rx_bytes / net_tx_bytes / io_read_bytes /
io_write_bytes GraphQL resolvers — read the "stats.rate" key from the
per-metric live_stat dict. The new PromQL pipeline previously emitted
nothing under that label, leaving those four legacy fields uncovered:

- io_read / io_write: legacy agent's stats_filter={"rate"}, which the
  new pipeline did not produce.
- net_rx / net_tx: legacy agent's stats_filter is empty, so the agent
  publishes no stats.rate at all even though the legacy resolver
  expects it — making the resolver always return 0. The new pipeline
  now produces a value where legacy never did.

Two metric shapes flow through the new bucket:

- Gauge-shape (net_rx, net_tx): the metric's `current` value is
  already a per-second rate (set by agent's
  current_hook = lambda m: m.stats.rate), so PromQL only needs to
  sum across replicas and label_replace to "stats.rate".
- Counter-shape (io_read, io_write): the value is a cumulative byte
  counter, so PromQL applies rate(...[window]) before label_replace.

Live verified against Prometheus on a running kernel:

    {net_rx, stats.rate} = 27530
    {net_tx, stats.rate} = 30378
    {io_read, stats.rate} = 0
    {io_write, stats.rate} = 0

Re-introduces ValueType.RATE (and its to_legacy_live_stat_label /
from_live_stat_label round-trip) that was removed earlier in the
branch when no producer existed; the round-trip is now load-bearing
again.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…bels

Earlier the new pipeline emitted "stats.max" / "stats.avg" / "stats.rate"
as the rewritten value_type label, mirroring the legacy live_stat dict
keys. That coupling was unnecessary: the labels are synthesized at
query time via label_replace and never persist in the Prometheus TSDB
(only "current" / "capacity" series actually get scraped). The legacy
"stats." prefix is a Valkey live_stat artifact that future legacy-
compat consumers should re-attach in a dedicated converter, not
something the manager pipeline needs to carry.

The new pipeline now emits clean, ValueType-enum-aligned labels:

    stats.max  -> max
    stats.avg  -> avg
    stats.rate -> rate

Round-trip helpers go away:

- ValueType.to_legacy_live_stat_label() removed (was a 1-arm match
  that always returned "stats.{value}" for the three stat slots).
- ValueType.from_live_stat_label() removed; KernelMetricValuesByKernel
  parses the response label via ValueType(value_type_str) directly.

Drop trailing .value on f-string interpolation of ValueType members —
StrEnum's __format__ already returns the string value, so
f"{ValueType.MAX}" == "max" without the explicit attribute access.

Test fixtures and the from_prometheus_response characterisation test
now use the bare labels.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@seedspirit seedspirit changed the title feat(BA-5878): add window-based max/avg container live stats via PromQL feat(BA-5878): re-supply legacy live_stat stats.{max,avg,rate} from Prometheus May 10, 2026
The original phrasing led with the implementation ("window-based max/avg
queries via PromQL") rather than the user-facing motivation ("a
parallel supply for legacy stats.* live_stat fields that doesn't suffer
the agent-restart / accumulator failure modes"). Updates the changelog
entry to lead with the compat angle, expands coverage to include
stats.rate (now also produced), and notes the restart-safe / window
semantics that make this a meaningful upgrade over the legacy producer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…oint

Derive STATS_MAX_GAUGE_METRIC_PATTERNS / STATS_AVG_GAUGE_METRIC_PATTERNS
from a single _ACCEL_GAUGE_SUFFIXES_* source so adding a new accelerator
metric kind (e.g. clock, voltage) is a one-suffix edit instead of editing
two regex strings in lockstep.

Functionally identical — the generated regex is the same alternation, just
sorted alphabetically. Snapshot tests adjusted for the new sort order.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:agent Related to Agent component comp:common Related to Common component comp:manager Related to Manager component size:L 100~500 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants