Skip to content

refactor(BA-5744): migrate kernel live_stat from Valkey to Prometheus#11330

Open
seedspirit wants to merge 3 commits intomainfrom
refactor/BA-5744
Open

refactor(BA-5744): migrate kernel live_stat from Valkey to Prometheus#11330
seedspirit wants to merge 3 commits intomainfrom
refactor/BA-5744

Conversation

@seedspirit
Copy link
Copy Markdown
Contributor

@seedspirit seedspirit commented Apr 27, 2026

Summary

  • KernelNode.batch_load_live_stat (Valkey → valkey_stat.get_session_statistics_batch) is replaced by _batch_load_kernel_live_stat, which calls the metric processor and adapts the result through a new LegacyLiveStatConverter. Wire shape (dict[metric_name, MetricValue]) is preserved so GQL/WebUI consumers stay compatible.
  • MetricValue / MovingStatValue move from common/types.py to common/metrics/types.py next to the new RATE_STAT_METRICS / DIFF_STAT_METRICS classifications and resolve_unit_hint() helper (with naming-convention fallback so plugin metrics still get a usable unit hint).
  • New: LegacyLiveStatConverter unit tests covering gauge / rate / diff / capacity-default / pct-derivation / unknown-metric / multi-kernel isolation.

Known wire-level gaps (Solve in follow-up issue)

Field Status
current / pct (with capacity) / unit_hint / stats.diff / stats.rate ✅ legacy-equivalent
capacity for cpu_used / net_* / io_* (and dependent pct) ❌ "0" — no Prometheus equivalent for the legacy "first sample as capacity" hack
stats.max (all metrics) ❌ "0" — max_over_time rolled back (window peak ≠ lifetime peak, plugin coverage)
stats.avg (cpu_util, cuda_util) ❌ "0" — not measured
Plugin metric stats coverage (cuda_*) ❌ whitelist-managed in manager — fragile across plugin changes

Test plan

  • pants test tests/unit/manager/api/gql_legacy/test_stat_converter.py
  • pants check on the modified files
  • pants fmt / pants lint clean
  • A/B equivalence run with scripts/test-live-stat-equivalence.sh against a live session (stash → A label, pop → B label, diff live-stat-eqv/A vs live-stat-eqv/B)
  • WebUI smoke (kernel/session detail page → live_stat panel)

🤖 Generated with Claude Code


📚 Documentation preview 📚: https://sorna--11330.org.readthedocs.build/en/11330/


📚 Documentation preview 📚: https://sorna-ko--11330.org.readthedocs.build/ko/11330/

@github-actions github-actions Bot added the size:XL 500~ LoC label Apr 27, 2026
@github-actions github-actions Bot added area:docs Documentations comp:manager Related to Manager component comp:agent Related to Agent component comp:client Related to Client component comp:common Related to Common component comp:app-proxy Related to App Proxy component labels Apr 27, 2026
seedspirit added a commit that referenced this pull request Apr 27, 2026
@seedspirit seedspirit marked this pull request as ready for review April 27, 2026 04:06
Copilot AI review requested due to automatic review settings April 27, 2026 04:06
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Migrates the legacy GraphQL live_stat resolver path from Valkey-based session statistics to Prometheus-backed container live-stat queries, while preserving the existing dict[metric_name, MetricValue] wire shape for WebUI/GQL consumers.

Changes:

  • Replaced KernelNode.batch_load_live_stat Valkey calls with a Prometheus-based batch loader (_batch_load_kernel_live_stat) and a new LegacyLiveStatConverter.
  • Moved legacy MetricValue / MovingStatValue into ai.backend.common.metrics.types, adding metric classifications and resolve_unit_hint() for unit derivation.
  • Added unit tests covering conversion behavior across gauge/rate/diff metrics, pct derivation, defaults, and multi-kernel isolation.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
tests/unit/manager/api/gql_legacy/test_stat_converter.py Adds unit tests for the legacy live_stat converter behavior.
src/ai/backend/manager/repositories/metric/repository.py Minor refactor of Prometheus query result unpacking for live stats.
src/ai/backend/manager/clients/prometheus/fixed_query_builder.py Removes unreachable guard in template selection for metric types.
src/ai/backend/manager/api/gql_legacy/statistics.py Removes an unused/legacy batch loader wrapper method.
src/ai/backend/manager/api/gql_legacy/stat_converter.py Introduces LegacyLiveStatConverter to adapt Prometheus results to legacy shape.
src/ai/backend/manager/api/gql_legacy/kernel.py Switches live_stat resolver to Prometheus action + legacy conversion via dataloader.
src/ai/backend/common/types.py Removes legacy MetricValue/MovingStatValue from the common types module.
src/ai/backend/common/metrics/types.py Adds MetricValue/MovingStatValue + unit hint resolution and metric classifications.
src/ai/backend/common/clients/valkey_client/valkey_stat/client.py Updates imports to use the moved MetricValue type.
src/ai/backend/client/output/formatters.py Updates imports to use the moved MetricValue type.
src/ai/backend/appproxy/worker/types.py Updates imports to use the moved MetricValue/MovingStatValue types.
src/ai/backend/agent/stats.py Updates imports to use the moved MetricValue/MovingStatValue types.
changes/11330.enhance.md Adds changelog entry for the migration.
Comments suppressed due to low confidence (1)

src/ai/backend/manager/clients/prometheus/fixed_query_builder.py:154

  • _get_template() no longer has a fallback branch. If an unexpected/new MetricType value ever reaches this function (e.g., enum extended in the future), Python will implicitly return None, which then propagates into MetricPreset(template=...) and fails later with a less clear error. Please add an explicit default case that raises (e.g., UnreachableError/ValueError) so failures are immediate and actionable.
    def _get_template(self, metric_type: MetricType) -> str:
        match metric_type:
            case MetricType.GAUGE:
                return _GAUGE_TEMPLATE
            case MetricType.RATE:
                return _RATE_TEMPLATE
            case MetricType.DIFF:
                return _DIFF_TEMPLATE


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +16 to +31
MetricValue = TypedDict(
"MetricValue",
{
"current": str,
"capacity": str,
"pct": str,
"unit_hint": str,
"stats.min": str,
"stats.max": str,
"stats.sum": str,
"stats.avg": str,
"stats.diff": str,
"stats.rate": str,
"stats.version": int | None,
},
)
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetricValue now defines capacity as a required str, but there are still in-repo producers/consumers that treat capacity as nullable (e.g., alembic stats migration sets "capacity": None, and the CLI formatter checks metric["capacity"] is not None). This makes the TypedDict inconsistent with real payloads and will either break type-checking or force unsafe casts. Please either keep capacity as str | None (like the previous definition) or update all producers to always emit a string (e.g., "0") and remove/adjust the None handling accordingly.

Copilot uses AI. Check for mistakes.
@seedspirit seedspirit requested a review from a team April 27, 2026 04:41
Comment thread src/ai/backend/manager/api/gql_legacy/kernel.py
Comment thread src/ai/backend/manager/api/gql_legacy/stat_converter.py Outdated
from ai.backend.manager.data.metric.types import KernelLiveStatBatchResult


class LegacyLiveStatConverter:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way for me to verify that the data processed through this converter produces the same results as the existing logic?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writing integration tests would probably be the most reliable approach, but since that’s difficult, I’m currently running the code locally by executing queries directly and comparing the results.

@jopemachine jopemachine requested a review from a team April 27, 2026 09:27
Comment on lines +81 to +82
converted = LegacyLiveStatConverter().convert(action_result.stats)
return [converted.get(kid) for kid in kernel_ids]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have to create a converter every time?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:docs Documentations comp:agent Related to Agent component comp:app-proxy Related to App Proxy component comp:client Related to Client component comp:common Related to Common component comp:manager Related to Manager component size:XL 500~ LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants