Skip to content

[API] Add per-workspace/user metrics + GPU aggregation recording rules#9720

Open
DanielZhangQD wants to merge 1 commit into
masterfrom
sky-4712-metrics
Open

[API] Add per-workspace/user metrics + GPU aggregation recording rules#9720
DanielZhangQD wants to merge 1 commit into
masterfrom
sky-4712-metrics

Conversation

@DanielZhangQD
Copy link
Copy Markdown
Collaborator

@DanielZhangQD DanielZhangQD commented May 26, 2026

Summary

  • New `WorkspaceUsageCollector` emits `sky_clusters_count{workspace,user,status,cloud}`, `sky_clusters_gpus_in_flight{...,gpu_type}`, and `sky_clusters_burn_rate_dollars` (multiplied by `launched_nodes`; replaces the unlabeled `sky_apiserver_total_burn_rate_dollars` over time).
  • `ManagedJobsCollector` rewritten as single `sky_managed_jobs_count{workspace,user,status,cloud}` with terminal statuses (SUCCEEDED / FAILED* / CANCELLED) filtered at the SQL layer — terminal counts on a gauge can't be usefully `rate()`-d.
  • Recording rules at `charts/skypilot/manifests/sky-gpu-recording-rules.yaml` aggregate the federated DCGM data on `/gpu-metrics` into 5 per-cluster, per-GPU-model series: capacity / allocated / free / utilization avg + p90.
  • Opt-in Helm templates: `apiService.metrics.serviceMonitor.enabled` renders a `ServiceMonitor` CRD; `apiService.metrics.prometheusRule.enabled` renders a `PrometheusRule` CRD that wraps the recording rules. Both default false so non-Operator deployments are unaffected.

Test plan

  • `pytest tests/unit_tests/test_sky/server/test_metrics.py -k 'workspace_usage or managed_jobs_collector'` — 9 tests pass, covering: per-workspace/user/status/cloud counts, GPU-only-for-UP-clusters, burn-rate × node count, gpu_type='cpu' fallback, null-label defaults, cost-lookup-failure handling, 30 s cache TTL, terminal-filter behaviour.
  • `helm template skypilot ./charts/skypilot --set apiService.metrics.enabled=true --set apiService.metrics.serviceMonitor.enabled=true --set apiService.metrics.prometheusRule.enabled=true` renders both CRDs with the 5 recording rules.
  • Manual: scrape `/metrics` from a live API server, confirm the new `sky_clusters_*` and `sky_managed_jobs_count` series appear with expected labels.

🤖 Generated with Claude Code


Open in Devin Review

- WorkspaceUsageCollector: sky_clusters_count{workspace,user,status,cloud},
  sky_clusters_gpus_in_flight{...,gpu_type}, sky_clusters_burn_rate_dollars
  (multiplied by launched_nodes; replaces the unlabeled
  sky_apiserver_total_burn_rate_dollars over time).
- ManagedJobsCollector: sky_managed_jobs_count{workspace,user,status,cloud}
  with terminal statuses filtered at the SQL layer (monotonic-growing
  terminal counts can't be usefully rate()-d as a gauge).
- Recording rules YAML at charts/skypilot/manifests/sky-gpu-recording-rules.yaml
  with 5 rules aggregating raw federated DCGM into per-cluster /
  per-GPU-model capacity/allocated/free/utilization series.
- Opt-in Helm templates for Prometheus Operator users:
  apiService.metrics.serviceMonitor.enabled and .prometheusRule.enabled
  render ServiceMonitor and PrometheusRule CRDs. Defaults false so
  non-Operator deployments are unaffected.
- Unit tests covering both collectors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances Prometheus metrics collection by introducing per-workspace and per-user cluster usage metrics (such as cluster counts, GPUs in flight, and burn rates) via a new WorkspaceUsageCollector. It also updates the ManagedJobsCollector to support grouping by workspace, user, status, and cloud, and adds Prometheus Operator ServiceMonitor and PrometheusRule CRDs to the Helm chart. A potential crash was identified in the WorkspaceUsageCollector where the float() conversion of per_node_hourly is performed outside of the try...except block, which could fail if get_cost() returns a non-numeric value or None.

Comment thread sky/server/metrics.py
Comment on lines +430 to +441
try:
per_node_hourly = launched_resources.get_cost(
self._SECONDS_PER_HOUR)
except Exception: # pylint: disable=broad-except
# Cost lookup can fail when a catalog entry is missing for
# the launched instance type (e.g. on-prem clouds, or
# newly-added instance types). Skip silently — the count
# / GPU gauges are still correct.
per_node_hourly = 0.0
burn_key = (workspace, user, cloud, gpu_type)
burn_rate[burn_key] = (burn_rate.get(burn_key, 0.0) +
float(per_node_hourly) * num_nodes)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The float() conversion of per_node_hourly is performed outside of the try...except block. If get_cost() returns None or a non-numeric value, float(per_node_hourly) will raise a TypeError or ValueError outside the exception handler, causing the entire metrics collection to crash.

To ensure robust error handling and defensive programming, perform the float() conversion inside the try block and safely fall back to 0.0 in the except block.

Suggested change
try:
per_node_hourly = launched_resources.get_cost(
self._SECONDS_PER_HOUR)
except Exception: # pylint: disable=broad-except
# Cost lookup can fail when a catalog entry is missing for
# the launched instance type (e.g. on-prem clouds, or
# newly-added instance types). Skip silently — the count
# / GPU gauges are still correct.
per_node_hourly = 0.0
burn_key = (workspace, user, cloud, gpu_type)
burn_rate[burn_key] = (burn_rate.get(burn_key, 0.0) +
float(per_node_hourly) * num_nodes)
try:
per_node_hourly = float(launched_resources.get_cost(
self._SECONDS_PER_HOUR) or 0.0)
except Exception: # pylint: disable=broad-except
# Cost lookup can fail when a catalog entry is missing for
# the launched instance type (e.g. on-prem clouds, or
# newly-added instance types). Skip silently — the count
# / GPU gauges are still correct.
per_node_hourly = 0.0
burn_key = (workspace, user, cloud, gpu_type)
burn_rate[burn_key] = (burn_rate.get(burn_key, 0.0) +
per_node_hourly * num_nodes)

Copy link
Copy Markdown

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.

Open in Devin Review

Comment thread sky/server/metrics.py
Comment on lines 316 to 317
except Exception: # pylint: disable=broad-except
logger.exception('Failed to collect managed jobs metrics')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 ManagedJobsCollector.collect() does not update _last_scrape_time on exception, causing retry storms

When _refresh() raises an exception in ManagedJobsCollector.collect(), _last_scrape_time is not updated (lines 316-317). This means the cache-TTL guard now - self._last_scrape_time >= self._cache_ttl will remain true on every subsequent scrape, causing the failing DB query to be retried on every Prometheus scrape rather than backing off for the 30s cache TTL.

Both BurnRateCollector.collect() (sky/server/metrics.py:223) and the newly-added WorkspaceUsageCollector.collect() (sky/server/metrics.py:472) correctly set self._last_scrape_time = now inside their except blocks. The ManagedJobsCollector is the only collector missing this, and this PR makes it call the new, more complex get_status_counts_by_workspace_user_cloud() join query — increasing the likelihood of transient failures.

Suggested change
except Exception: # pylint: disable=broad-except
logger.exception('Failed to collect managed jobs metrics')
except Exception: # pylint: disable=broad-except
logger.exception('Failed to collect managed jobs metrics')
self._last_scrape_time = now
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant