[API] Add per-workspace/user metrics + GPU aggregation recording rules by DanielZhangQD · Pull Request #9720 · skypilot-org/skypilot

DanielZhangQD · 2026-05-26T13:05:41Z

Summary

New `WorkspaceUsageCollector` emits `sky_clusters_count{workspace,user,status,cloud}`, `sky_clusters_gpus_in_flight{...,gpu_type}`, and `sky_clusters_burn_rate_dollars` (multiplied by `launched_nodes`; replaces the unlabeled `sky_apiserver_total_burn_rate_dollars` over time).
`ManagedJobsCollector` rewritten as single `sky_managed_jobs_count{workspace,user,status,cloud}` with terminal statuses (SUCCEEDED / FAILED* / CANCELLED) filtered at the SQL layer — terminal counts on a gauge can't be usefully `rate()`-d.
Recording rules at `charts/skypilot/manifests/sky-gpu-recording-rules.yaml` aggregate the federated DCGM data on `/gpu-metrics` into 5 per-cluster, per-GPU-model series: capacity / allocated / free / utilization avg + p90.
Opt-in Helm templates: `apiService.metrics.serviceMonitor.enabled` renders a `ServiceMonitor` CRD; `apiService.metrics.prometheusRule.enabled` renders a `PrometheusRule` CRD that wraps the recording rules. Both default false so non-Operator deployments are unaffected.

Test plan

`pytest tests/unit_tests/test_sky/server/test_metrics.py -k 'workspace_usage or managed_jobs_collector'` — 9 tests pass, covering: per-workspace/user/status/cloud counts, GPU-only-for-UP-clusters, burn-rate × node count, gpu_type='cpu' fallback, null-label defaults, cost-lookup-failure handling, 30 s cache TTL, terminal-filter behaviour.
`helm template skypilot ./charts/skypilot --set apiService.metrics.enabled=true --set apiService.metrics.serviceMonitor.enabled=true --set apiService.metrics.prometheusRule.enabled=true` renders both CRDs with the 5 recording rules.
Manual: scrape `/metrics` from a live API server, confirm the new `sky_clusters_*` and `sky_managed_jobs_count` series appear with expected labels.

🤖 Generated with Claude Code

- WorkspaceUsageCollector: sky_clusters_count{workspace,user,status,cloud}, sky_clusters_gpus_in_flight{...,gpu_type}, sky_clusters_burn_rate_dollars (multiplied by launched_nodes; replaces the unlabeled sky_apiserver_total_burn_rate_dollars over time). - ManagedJobsCollector: sky_managed_jobs_count{workspace,user,status,cloud} with terminal statuses filtered at the SQL layer (monotonic-growing terminal counts can't be usefully rate()-d as a gauge). - Recording rules YAML at charts/skypilot/manifests/sky-gpu-recording-rules.yaml with 5 rules aggregating raw federated DCGM into per-cluster / per-GPU-model capacity/allocated/free/utilization series. - Opt-in Helm templates for Prometheus Operator users: apiService.metrics.serviceMonitor.enabled and .prometheusRule.enabled render ServiceMonitor and PrometheusRule CRDs. Defaults false so non-Operator deployments are unaffected. - Unit tests covering both collectors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request enhances Prometheus metrics collection by introducing per-workspace and per-user cluster usage metrics (such as cluster counts, GPUs in flight, and burn rates) via a new WorkspaceUsageCollector. It also updates the ManagedJobsCollector to support grouping by workspace, user, status, and cloud, and adds Prometheus Operator ServiceMonitor and PrometheusRule CRDs to the Helm chart. A potential crash was identified in the WorkspaceUsageCollector where the float() conversion of per_node_hourly is performed outside of the try...except block, which could fail if get_cost() returns a non-numeric value or None.

gemini-code-assist · 2026-05-26T13:07:19Z

+            try:
+                per_node_hourly = launched_resources.get_cost(
+                    self._SECONDS_PER_HOUR)
+            except Exception:  # pylint: disable=broad-except
+                # Cost lookup can fail when a catalog entry is missing for
+                # the launched instance type (e.g. on-prem clouds, or
+                # newly-added instance types). Skip silently — the count
+                # / GPU gauges are still correct.
+                per_node_hourly = 0.0
+            burn_key = (workspace, user, cloud, gpu_type)
+            burn_rate[burn_key] = (burn_rate.get(burn_key, 0.0) +
+                                   float(per_node_hourly) * num_nodes)


The float() conversion of per_node_hourly is performed outside of the try...except block. If get_cost() returns None or a non-numeric value, float(per_node_hourly) will raise a TypeError or ValueError outside the exception handler, causing the entire metrics collection to crash.

To ensure robust error handling and defensive programming, perform the float() conversion inside the try block and safely fall back to 0.0 in the except block.

Suggested change

try:

per_node_hourly = launched_resources.get_cost(

self._SECONDS_PER_HOUR)

except Exception: # pylint: disable=broad-except

# Cost lookup can fail when a catalog entry is missing for

# the launched instance type (e.g. on-prem clouds, or

# newly-added instance types). Skip silently — the count

# / GPU gauges are still correct.

per_node_hourly = 0.0

burn_key = (workspace, user, cloud, gpu_type)

burn_rate[burn_key] = (burn_rate.get(burn_key, 0.0) +

float(per_node_hourly) * num_nodes)

try:

per_node_hourly = float(launched_resources.get_cost(

self._SECONDS_PER_HOUR) or 0.0)

except Exception: # pylint: disable=broad-except

# Cost lookup can fail when a catalog entry is missing for

# the launched instance type (e.g. on-prem clouds, or

# newly-added instance types). Skip silently — the count

# / GPU gauges are still correct.

per_node_hourly = 0.0

burn_key = (workspace, user, cloud, gpu_type)

burn_rate[burn_key] = (burn_rate.get(burn_key, 0.0) +

per_node_hourly * num_nodes)

devin-ai-integration

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

devin-ai-integration · 2026-05-26T13:10:00Z

                except Exception:  # pylint: disable=broad-except
                    logger.exception('Failed to collect managed jobs metrics')


🟡 ManagedJobsCollector.collect() does not update _last_scrape_time on exception, causing retry storms

When _refresh() raises an exception in ManagedJobsCollector.collect(), _last_scrape_time is not updated (lines 316-317). This means the cache-TTL guard now - self._last_scrape_time >= self._cache_ttl will remain true on every subsequent scrape, causing the failing DB query to be retried on every Prometheus scrape rather than backing off for the 30s cache TTL.

Both BurnRateCollector.collect() (sky/server/metrics.py:223) and the newly-added WorkspaceUsageCollector.collect() (sky/server/metrics.py:472) correctly set self._last_scrape_time = now inside their except blocks. The ManagedJobsCollector is the only collector missing this, and this PR makes it call the new, more complex get_status_counts_by_workspace_user_cloud() join query — increasing the likelihood of transient failures.

Suggested change

except Exception: # pylint: disable=broad-except

logger.exception('Failed to collect managed jobs metrics')

except Exception: # pylint: disable=broad-except

logger.exception('Failed to collect managed jobs metrics')

self._last_scrape_time = now

Was this helpful? React with 👍 or 👎 to provide feedback.

gemini-code-assist Bot reviewed May 26, 2026

View reviewed changes

devin-ai-integration Bot reviewed May 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[API] Add per-workspace/user metrics + GPU aggregation recording rules#9720

[API] Add per-workspace/user metrics + GPU aggregation recording rules#9720
DanielZhangQD wants to merge 1 commit into
masterfrom
sky-4712-metrics

DanielZhangQD commented May 26, 2026 •

edited by devin-ai-integration Bot

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 26, 2026

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		except Exception: # pylint: disable=broad-except
		logger.exception('Failed to collect managed jobs metrics')

Conversation

DanielZhangQD commented May 26, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

DanielZhangQD commented May 26, 2026 •

edited by devin-ai-integration Bot

Loading