refactor(BA-5744): migrate kernel live_stat from Valkey to Prometheus by seedspirit · Pull Request #11330 · lablup/backend.ai

seedspirit · 2026-04-27T03:55:02Z

Summary

KernelNode.batch_load_live_stat (Valkey → valkey_stat.get_session_statistics_batch) is replaced by _batch_load_kernel_live_stat, which calls the metric processor and adapts the result through a new LegacyLiveStatConverter. Wire shape (dict[metric_name, MetricValue]) is preserved so GQL/WebUI consumers stay compatible.
MetricValue / MovingStatValue move from common/types.py to common/metrics/types.py next to the new RATE_STAT_METRICS / DIFF_STAT_METRICS classifications and resolve_unit_hint() helper (with naming-convention fallback so plugin metrics still get a usable unit hint).
New: LegacyLiveStatConverter unit tests covering gauge / rate / diff / capacity-default / pct-derivation / unknown-metric / multi-kernel isolation.

Known wire-level gaps (Solve in follow-up issue)

Field	Status
`current` / `pct` (with capacity) / `unit_hint` / `stats.diff` / `stats.rate`	✅ legacy-equivalent
`capacity` for `cpu_used` / `net_` / `io_` (and dependent `pct`)	❌ "0" — no Prometheus equivalent for the legacy "first sample as capacity" hack
`stats.max` (all metrics)	❌ "0" — `max_over_time` rolled back (window peak ≠ lifetime peak, plugin coverage)
`stats.avg` (`cpu_util`, `cuda_util`)	❌ "0" — not measured
Plugin metric stats coverage (`cuda_*`)	❌ whitelist-managed in manager — fragile across plugin changes

Test plan

pants test tests/unit/manager/api/gql_legacy/test_stat_converter.py
pants check on the modified files
pants fmt / pants lint clean
A/B equivalence run with scripts/test-live-stat-equivalence.sh against a live session (stash → A label, pop → B label, diff live-stat-eqv/A vs live-stat-eqv/B)
WebUI smoke (kernel/session detail page → live_stat panel)

🤖 Generated with Claude Code

📚 Documentation preview 📚: https://sorna--11330.org.readthedocs.build/en/11330/

📚 Documentation preview 📚: https://sorna-ko--11330.org.readthedocs.build/ko/11330/

Copilot

Pull request overview

Migrates the legacy GraphQL live_stat resolver path from Valkey-based session statistics to Prometheus-backed container live-stat queries, while preserving the existing dict[metric_name, MetricValue] wire shape for WebUI/GQL consumers.

Changes:

Replaced KernelNode.batch_load_live_stat Valkey calls with a Prometheus-based batch loader (_batch_load_kernel_live_stat) and a new LegacyLiveStatConverter.
Moved legacy MetricValue / MovingStatValue into ai.backend.common.metrics.types, adding metric classifications and resolve_unit_hint() for unit derivation.
Added unit tests covering conversion behavior across gauge/rate/diff metrics, pct derivation, defaults, and multi-kernel isolation.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
tests/unit/manager/api/gql_legacy/test_stat_converter.py	Adds unit tests for the legacy live_stat converter behavior.
src/ai/backend/manager/repositories/metric/repository.py	Minor refactor of Prometheus query result unpacking for live stats.
src/ai/backend/manager/clients/prometheus/fixed_query_builder.py	Removes unreachable guard in template selection for metric types.
src/ai/backend/manager/api/gql_legacy/statistics.py	Removes an unused/legacy batch loader wrapper method.
src/ai/backend/manager/api/gql_legacy/stat_converter.py	Introduces `LegacyLiveStatConverter` to adapt Prometheus results to legacy shape.
src/ai/backend/manager/api/gql_legacy/kernel.py	Switches live_stat resolver to Prometheus action + legacy conversion via dataloader.
src/ai/backend/common/types.py	Removes legacy MetricValue/MovingStatValue from the common types module.
src/ai/backend/common/metrics/types.py	Adds MetricValue/MovingStatValue + unit hint resolution and metric classifications.
src/ai/backend/common/clients/valkey_client/valkey_stat/client.py	Updates imports to use the moved MetricValue type.
src/ai/backend/client/output/formatters.py	Updates imports to use the moved MetricValue type.
src/ai/backend/appproxy/worker/types.py	Updates imports to use the moved MetricValue/MovingStatValue types.
src/ai/backend/agent/stats.py	Updates imports to use the moved MetricValue/MovingStatValue types.
changes/11330.enhance.md	Adds changelog entry for the migration.

Comments suppressed due to low confidence (1)

src/ai/backend/manager/clients/prometheus/fixed_query_builder.py:154

_get_template() no longer has a fallback branch. If an unexpected/new MetricType value ever reaches this function (e.g., enum extended in the future), Python will implicitly return None, which then propagates into MetricPreset(template=...) and fails later with a less clear error. Please add an explicit default case that raises (e.g., UnreachableError/ValueError) so failures are immediate and actionable.

    def _get_template(self, metric_type: MetricType) -> str:
        match metric_type:
            case MetricType.GAUGE:
                return _GAUGE_TEMPLATE
            case MetricType.RATE:
                return _RATE_TEMPLATE
            case MetricType.DIFF:
                return _DIFF_TEMPLATE

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-27T04:10:25Z

+MetricValue = TypedDict(
+    "MetricValue",
+    {
+        "current": str,
+        "capacity": str,
+        "pct": str,
+        "unit_hint": str,
+        "stats.min": str,
+        "stats.max": str,
+        "stats.sum": str,
+        "stats.avg": str,
+        "stats.diff": str,
+        "stats.rate": str,
+        "stats.version": int | None,
+    },
+)


MetricValue now defines capacity as a required str, but there are still in-repo producers/consumers that treat capacity as nullable (e.g., alembic stats migration sets "capacity": None, and the CLI formatter checks metric["capacity"] is not None). This makes the TypedDict inconsistent with real payloads and will either break type-checking or force unsafe casts. Please either keep capacity as str | None (like the previous definition) or update all producers to always emit a string (e.g., "0") and remove/adjust the None handling accordingly.

jopemachine · 2026-04-27T05:31:43Z

+from ai.backend.manager.data.metric.types import KernelLiveStatBatchResult
+
+
+class LegacyLiveStatConverter:


Is there a way for me to verify that the data processed through this converter produces the same results as the existing logic?

Writing integration tests would probably be the most reliable approach, but since that’s difficult, I’m currently running the code locally by executing queries directly and comparing the results.

HyeockJinKim · 2026-05-08T04:54:43Z

+    converted = LegacyLiveStatConverter().convert(action_result.stats)
+    return [converted.get(kid) for kid in kernel_ids]


Why do we have to create a converter every time?

github-actions Bot added the size:XL 500~ LoC label Apr 27, 2026

github-actions Bot assigned seedspirit Apr 27, 2026

github-actions Bot added area:docs Documentations comp:manager Related to Manager component comp:agent Related to Agent component comp:client Related to Client component comp:common Related to Common component comp:app-proxy Related to App Proxy component labels Apr 27, 2026

seedspirit added a commit that referenced this pull request Apr 27, 2026

changelog: add news fragment for PR #11330

18676b6

refactor: migrate kernel live_stat from Valkey to Prometheus

323c1d1

seedspirit force-pushed the refactor/BA-5744 branch from 18676b6 to 323c1d1 Compare April 27, 2026 04:00

style: Add linter result

cf30ee0

seedspirit marked this pull request as ready for review April 27, 2026 04:06

Copilot AI review requested due to automatic review settings April 27, 2026 04:06

Copilot started reviewing on behalf of seedspirit April 27, 2026 04:07 View session

Copilot AI reviewed Apr 27, 2026

View reviewed changes

seedspirit requested a review from a team April 27, 2026 04:41

jopemachine reviewed Apr 27, 2026

View reviewed changes

Comment thread src/ai/backend/manager/api/gql_legacy/kernel.py

jopemachine reviewed Apr 27, 2026

View reviewed changes

Comment thread src/ai/backend/manager/api/gql_legacy/stat_converter.py Outdated

jopemachine reviewed Apr 27, 2026

View reviewed changes

style: Remove untracked file name in comment

cde8b1f

jopemachine requested a review from a team April 27, 2026 09:27

HyeockJinKim reviewed May 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(BA-5744): migrate kernel live_stat from Valkey to Prometheus#11330

refactor(BA-5744): migrate kernel live_stat from Valkey to Prometheus#11330
seedspirit wants to merge 3 commits intomainfrom
refactor/BA-5744

seedspirit commented Apr 27, 2026 •

edited by github-actions Bot

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 27, 2026

Uh oh!

Uh oh!

Uh oh!

jopemachine Apr 27, 2026

Uh oh!

seedspirit Apr 27, 2026

Uh oh!

HyeockJinKim May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		from ai.backend.manager.data.metric.types import KernelLiveStatBatchResult


		class LegacyLiveStatConverter:

		converted = LegacyLiveStatConverter().convert(action_result.stats)
		return [converted.get(kid) for kid in kernel_ids]

Conversation

seedspirit commented Apr 27, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Known wire-level gaps (Solve in follow-up issue)

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

jopemachine Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

seedspirit Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

HyeockJinKim May 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

seedspirit commented Apr 27, 2026 •

edited by github-actions Bot

Loading