[API] Add per-workspace/user metrics + GPU aggregation recording rules#9720
[API] Add per-workspace/user metrics + GPU aggregation recording rules#9720DanielZhangQD wants to merge 1 commit into
Conversation
- WorkspaceUsageCollector: sky_clusters_count{workspace,user,status,cloud},
sky_clusters_gpus_in_flight{...,gpu_type}, sky_clusters_burn_rate_dollars
(multiplied by launched_nodes; replaces the unlabeled
sky_apiserver_total_burn_rate_dollars over time).
- ManagedJobsCollector: sky_managed_jobs_count{workspace,user,status,cloud}
with terminal statuses filtered at the SQL layer (monotonic-growing
terminal counts can't be usefully rate()-d as a gauge).
- Recording rules YAML at charts/skypilot/manifests/sky-gpu-recording-rules.yaml
with 5 rules aggregating raw federated DCGM into per-cluster /
per-GPU-model capacity/allocated/free/utilization series.
- Opt-in Helm templates for Prometheus Operator users:
apiService.metrics.serviceMonitor.enabled and .prometheusRule.enabled
render ServiceMonitor and PrometheusRule CRDs. Defaults false so
non-Operator deployments are unaffected.
- Unit tests covering both collectors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request enhances Prometheus metrics collection by introducing per-workspace and per-user cluster usage metrics (such as cluster counts, GPUs in flight, and burn rates) via a new WorkspaceUsageCollector. It also updates the ManagedJobsCollector to support grouping by workspace, user, status, and cloud, and adds Prometheus Operator ServiceMonitor and PrometheusRule CRDs to the Helm chart. A potential crash was identified in the WorkspaceUsageCollector where the float() conversion of per_node_hourly is performed outside of the try...except block, which could fail if get_cost() returns a non-numeric value or None.
| try: | ||
| per_node_hourly = launched_resources.get_cost( | ||
| self._SECONDS_PER_HOUR) | ||
| except Exception: # pylint: disable=broad-except | ||
| # Cost lookup can fail when a catalog entry is missing for | ||
| # the launched instance type (e.g. on-prem clouds, or | ||
| # newly-added instance types). Skip silently — the count | ||
| # / GPU gauges are still correct. | ||
| per_node_hourly = 0.0 | ||
| burn_key = (workspace, user, cloud, gpu_type) | ||
| burn_rate[burn_key] = (burn_rate.get(burn_key, 0.0) + | ||
| float(per_node_hourly) * num_nodes) |
There was a problem hiding this comment.
The float() conversion of per_node_hourly is performed outside of the try...except block. If get_cost() returns None or a non-numeric value, float(per_node_hourly) will raise a TypeError or ValueError outside the exception handler, causing the entire metrics collection to crash.
To ensure robust error handling and defensive programming, perform the float() conversion inside the try block and safely fall back to 0.0 in the except block.
| try: | |
| per_node_hourly = launched_resources.get_cost( | |
| self._SECONDS_PER_HOUR) | |
| except Exception: # pylint: disable=broad-except | |
| # Cost lookup can fail when a catalog entry is missing for | |
| # the launched instance type (e.g. on-prem clouds, or | |
| # newly-added instance types). Skip silently — the count | |
| # / GPU gauges are still correct. | |
| per_node_hourly = 0.0 | |
| burn_key = (workspace, user, cloud, gpu_type) | |
| burn_rate[burn_key] = (burn_rate.get(burn_key, 0.0) + | |
| float(per_node_hourly) * num_nodes) | |
| try: | |
| per_node_hourly = float(launched_resources.get_cost( | |
| self._SECONDS_PER_HOUR) or 0.0) | |
| except Exception: # pylint: disable=broad-except | |
| # Cost lookup can fail when a catalog entry is missing for | |
| # the launched instance type (e.g. on-prem clouds, or | |
| # newly-added instance types). Skip silently — the count | |
| # / GPU gauges are still correct. | |
| per_node_hourly = 0.0 | |
| burn_key = (workspace, user, cloud, gpu_type) | |
| burn_rate[burn_key] = (burn_rate.get(burn_key, 0.0) + | |
| per_node_hourly * num_nodes) |
| except Exception: # pylint: disable=broad-except | ||
| logger.exception('Failed to collect managed jobs metrics') |
There was a problem hiding this comment.
🟡 ManagedJobsCollector.collect() does not update _last_scrape_time on exception, causing retry storms
When _refresh() raises an exception in ManagedJobsCollector.collect(), _last_scrape_time is not updated (lines 316-317). This means the cache-TTL guard now - self._last_scrape_time >= self._cache_ttl will remain true on every subsequent scrape, causing the failing DB query to be retried on every Prometheus scrape rather than backing off for the 30s cache TTL.
Both BurnRateCollector.collect() (sky/server/metrics.py:223) and the newly-added WorkspaceUsageCollector.collect() (sky/server/metrics.py:472) correctly set self._last_scrape_time = now inside their except blocks. The ManagedJobsCollector is the only collector missing this, and this PR makes it call the new, more complex get_status_counts_by_workspace_user_cloud() join query — increasing the likelihood of transient failures.
| except Exception: # pylint: disable=broad-except | |
| logger.exception('Failed to collect managed jobs metrics') | |
| except Exception: # pylint: disable=broad-except | |
| logger.exception('Failed to collect managed jobs metrics') | |
| self._last_scrape_time = now |
Was this helpful? React with 👍 or 👎 to provide feedback.
Summary
Test plan
🤖 Generated with Claude Code