refactor(BA-5744): migrate kernel live_stat from Valkey to Prometheus#11330
refactor(BA-5744): migrate kernel live_stat from Valkey to Prometheus#11330seedspirit wants to merge 3 commits intomainfrom
Conversation
18676b6 to
323c1d1
Compare
There was a problem hiding this comment.
Pull request overview
Migrates the legacy GraphQL live_stat resolver path from Valkey-based session statistics to Prometheus-backed container live-stat queries, while preserving the existing dict[metric_name, MetricValue] wire shape for WebUI/GQL consumers.
Changes:
- Replaced
KernelNode.batch_load_live_statValkey calls with a Prometheus-based batch loader (_batch_load_kernel_live_stat) and a newLegacyLiveStatConverter. - Moved legacy
MetricValue/MovingStatValueintoai.backend.common.metrics.types, adding metric classifications andresolve_unit_hint()for unit derivation. - Added unit tests covering conversion behavior across gauge/rate/diff metrics, pct derivation, defaults, and multi-kernel isolation.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/manager/api/gql_legacy/test_stat_converter.py | Adds unit tests for the legacy live_stat converter behavior. |
| src/ai/backend/manager/repositories/metric/repository.py | Minor refactor of Prometheus query result unpacking for live stats. |
| src/ai/backend/manager/clients/prometheus/fixed_query_builder.py | Removes unreachable guard in template selection for metric types. |
| src/ai/backend/manager/api/gql_legacy/statistics.py | Removes an unused/legacy batch loader wrapper method. |
| src/ai/backend/manager/api/gql_legacy/stat_converter.py | Introduces LegacyLiveStatConverter to adapt Prometheus results to legacy shape. |
| src/ai/backend/manager/api/gql_legacy/kernel.py | Switches live_stat resolver to Prometheus action + legacy conversion via dataloader. |
| src/ai/backend/common/types.py | Removes legacy MetricValue/MovingStatValue from the common types module. |
| src/ai/backend/common/metrics/types.py | Adds MetricValue/MovingStatValue + unit hint resolution and metric classifications. |
| src/ai/backend/common/clients/valkey_client/valkey_stat/client.py | Updates imports to use the moved MetricValue type. |
| src/ai/backend/client/output/formatters.py | Updates imports to use the moved MetricValue type. |
| src/ai/backend/appproxy/worker/types.py | Updates imports to use the moved MetricValue/MovingStatValue types. |
| src/ai/backend/agent/stats.py | Updates imports to use the moved MetricValue/MovingStatValue types. |
| changes/11330.enhance.md | Adds changelog entry for the migration. |
Comments suppressed due to low confidence (1)
src/ai/backend/manager/clients/prometheus/fixed_query_builder.py:154
_get_template()no longer has a fallback branch. If an unexpected/newMetricTypevalue ever reaches this function (e.g., enum extended in the future), Python will implicitly returnNone, which then propagates intoMetricPreset(template=...)and fails later with a less clear error. Please add an explicit default case that raises (e.g.,UnreachableError/ValueError) so failures are immediate and actionable.
def _get_template(self, metric_type: MetricType) -> str:
match metric_type:
case MetricType.GAUGE:
return _GAUGE_TEMPLATE
case MetricType.RATE:
return _RATE_TEMPLATE
case MetricType.DIFF:
return _DIFF_TEMPLATE
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| MetricValue = TypedDict( | ||
| "MetricValue", | ||
| { | ||
| "current": str, | ||
| "capacity": str, | ||
| "pct": str, | ||
| "unit_hint": str, | ||
| "stats.min": str, | ||
| "stats.max": str, | ||
| "stats.sum": str, | ||
| "stats.avg": str, | ||
| "stats.diff": str, | ||
| "stats.rate": str, | ||
| "stats.version": int | None, | ||
| }, | ||
| ) |
There was a problem hiding this comment.
MetricValue now defines capacity as a required str, but there are still in-repo producers/consumers that treat capacity as nullable (e.g., alembic stats migration sets "capacity": None, and the CLI formatter checks metric["capacity"] is not None). This makes the TypedDict inconsistent with real payloads and will either break type-checking or force unsafe casts. Please either keep capacity as str | None (like the previous definition) or update all producers to always emit a string (e.g., "0") and remove/adjust the None handling accordingly.
| from ai.backend.manager.data.metric.types import KernelLiveStatBatchResult | ||
|
|
||
|
|
||
| class LegacyLiveStatConverter: |
There was a problem hiding this comment.
Is there a way for me to verify that the data processed through this converter produces the same results as the existing logic?
There was a problem hiding this comment.
Writing integration tests would probably be the most reliable approach, but since that’s difficult, I’m currently running the code locally by executing queries directly and comparing the results.
| converted = LegacyLiveStatConverter().convert(action_result.stats) | ||
| return [converted.get(kid) for kid in kernel_ids] |
There was a problem hiding this comment.
Why do we have to create a converter every time?
Summary
KernelNode.batch_load_live_stat(Valkey →valkey_stat.get_session_statistics_batch) is replaced by_batch_load_kernel_live_stat, which calls the metric processor and adapts the result through a newLegacyLiveStatConverter. Wire shape (dict[metric_name, MetricValue]) is preserved so GQL/WebUI consumers stay compatible.MetricValue/MovingStatValuemove fromcommon/types.pytocommon/metrics/types.pynext to the newRATE_STAT_METRICS/DIFF_STAT_METRICSclassifications andresolve_unit_hint()helper (with naming-convention fallback so plugin metrics still get a usable unit hint).LegacyLiveStatConverterunit tests covering gauge / rate / diff / capacity-default / pct-derivation / unknown-metric / multi-kernel isolation.Known wire-level gaps (Solve in follow-up issue)
current/pct(with capacity) /unit_hint/stats.diff/stats.ratecapacityforcpu_used/net_*/io_*(and dependentpct)stats.max(all metrics)max_over_timerolled back (window peak ≠ lifetime peak, plugin coverage)stats.avg(cpu_util,cuda_util)cuda_*)Test plan
pants test tests/unit/manager/api/gql_legacy/test_stat_converter.pypants checkon the modified filespants fmt/pants lintcleanscripts/test-live-stat-equivalence.shagainst a live session (stash → A label, pop → B label, difflive-stat-eqv/Avslive-stat-eqv/B)🤖 Generated with Claude Code
📚 Documentation preview 📚: https://sorna--11330.org.readthedocs.build/en/11330/
📚 Documentation preview 📚: https://sorna-ko--11330.org.readthedocs.build/ko/11330/