
Commit ec8a12a

[CLI] Rename stats command to metrics (#2462)
1 parent 1ab8fde commit ec8a12a

File tree

8 files changed: +145 −129 lines

docs/blog/posts/dstack-stats.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -15,7 +15,7 @@ categories:
 ## How it works { style="display:none"}
 
 While it's possible to use third-party monitoring tools with `dstack`, it is often more convenient to debug your run and
-track metrics out of the box. That's why, with the latest release, `dstack` introduced [`dstack stats`](../../docs/reference/cli/dstack/stats.md), a new CLI (and API)
+track metrics out of the box. That's why, with the latest release, `dstack` introduced [`dstack stats`](../../docs/reference/cli/dstack/metrics.md), a new CLI (and API)
 for monitoring container metrics, including GPU usage for `NVIDIA`, `AMD`, and other accelerators.
 
 <img src="https://github.com/dstackai/static-assets/blob/main/static-assets/images/dstack-stats-v2.png?raw=true" width="725"/>
```

docs/blog/posts/prometheus.md

Lines changed: 5 additions & 5 deletions

```diff
@@ -17,7 +17,7 @@ Effective AI infrastructure management requires full visibility into compute per
 detailed insights into container- and GPU-level performance, while managers rely on cost metrics to track resource usage
 across projects.
 
-While `dstack` provides key metrics through its UI and [`dstack stats`](dstack-stats.md) CLI, teams often need more granular data and prefer
+While `dstack` provides key metrics through its UI and [`dstack metrics`](dstack-stats.md) CLI, teams often need more granular data and prefer
 using their own monitoring tools. To support this, we’ve introduced a new endpoint that allows real-time exporting all collected
 metrics—covering fleets and runs—directly to Prometheus.
@@ -49,19 +49,19 @@ Overall, `dstack` collects three groups of metrics:
 For a full list of available metrics and labels, check out the [Monitoring](../../docs/guides/monitoring.md) guide.
 
 ??? info "NVIDIA"
-    NVIDIA DCGM metrics are automatically collected for `aws`, `azure`, `gcp`, and `oci` backends,
+    NVIDIA DCGM metrics are automatically collected for `aws`, `azure`, `gcp`, and `oci` backends,
     as well as for [SSH fleets](../../docs/concepts/fleets.md#ssh).
-
+
     To ensure NVIDIA DCGM metrics are collected from SSH fleets, ensure the `datacenter-gpu-manager-4-core`,
     `datacenter-gpu-manager-4-proprietary`, and `datacenter-gpu-manager-exporter` packages are installed on the hosts.
 
 ??? info "AMD"
     AMD device metrics are not yet collected for any backends. This support will be available soon. For now, AMD metrics are
-    only accessible through the UI and the [`dstack stats`](dstack-stats.md) CLI.
+    only accessible through the UI and the [`dstack metrics`](dstack-stats.md) CLI.
 
 !!! info "What's next?"
     1. See the [Monitoring](../../docs/guides/monitoring.md) guide
-    1. Check [dev environments](../../docs/concepts/dev-environments.md),
+    1. Check [dev environments](../../docs/concepts/dev-environments.md),
       [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md),
       and [fleets](../../docs/concepts/fleets.md)
    2. Join [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd){:target="_blank"}
```

docs/docs/guides/protips.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -312,7 +312,7 @@ The GPU vendor is indicated by one of the following case-insensitive values:
 
 While `dstack` allows the use of any third-party monitoring tools (e.g., Weights and Biases), you can also
 monitor container metrics such as CPU, memory, and GPU usage using the [built-in
-`dstack stats` CLI command](../../blog/posts/dstack-stats.md) or the corresponding API.
+`dstack metrics` CLI command](../../blog/posts/dstack-stats.md) or the corresponding API.
 
 ## Service quotas
 
```
Lines changed: 2 additions & 2 deletions

````diff
@@ -1,4 +1,4 @@
-# dstack stats
+# dstack metrics
 
 This command shows run hardware metrics such as CPU, memory, and GPU utilization.
 
@@ -7,7 +7,7 @@ This command shows run hardware metrics such as CPU, memory, and GPU utilization
 <div class="termy">
 
 ```shell
-$ dstack stats --help
+$ dstack metrics --help
 #GENERATE#
 ```
 
````
mkdocs.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -253,7 +253,7 @@ nav:
 - dstack stop: docs/reference/cli/dstack/stop.md
 - dstack attach: docs/reference/cli/dstack/attach.md
 - dstack logs: docs/reference/cli/dstack/logs.md
-- dstack stats: docs/reference/cli/dstack/stats.md
+- dstack metrics: docs/reference/cli/dstack/metrics.md
 - dstack config: docs/reference/cli/dstack/config.md
 - dstack fleet: docs/reference/cli/dstack/fleet.md
 - dstack volume: docs/reference/cli/dstack/volume.md
```
Lines changed: 128 additions & 0 deletions

New file (all 128 lines added; indentation reconstructed from the flattened diff):

```python
import argparse
import time
from typing import Any, Dict, List, Optional, Union

from rich.live import Live
from rich.table import Table

from dstack._internal.cli.commands import APIBaseCommand
from dstack._internal.cli.services.completion import RunNameCompleter
from dstack._internal.cli.utils.common import (
    LIVE_TABLE_PROVISION_INTERVAL_SECS,
    LIVE_TABLE_REFRESH_RATE_PER_SEC,
    add_row_from_dict,
    console,
)
from dstack._internal.core.errors import CLIError
from dstack._internal.core.models.metrics import JobMetrics
from dstack.api._public import Client
from dstack.api._public.runs import Run


class MetricsCommand(APIBaseCommand):
    NAME = "metrics"
    DESCRIPTION = "Show run metrics"

    def _register(self):
        super()._register()
        self._parser.add_argument("run_name").completer = RunNameCompleter()
        self._parser.add_argument(
            "-w",
            "--watch",
            help="Watch run metrics in realtime",
            action="store_true",
        )

    def _command(self, args: argparse.Namespace):
        super()._command(args)
        run = self.api.runs.get(run_name=args.run_name)
        if run is None:
            raise CLIError(f"Run {args.run_name} not found")
        if run.status.is_finished():
            raise CLIError(f"Run {args.run_name} is finished")
        metrics = _get_run_jobs_metrics(api=self.api, run=run)

        if not args.watch:
            console.print(_get_metrics_table(run, metrics))
            return

        try:
            with Live(console=console, refresh_per_second=LIVE_TABLE_REFRESH_RATE_PER_SEC) as live:
                while True:
                    live.update(_get_metrics_table(run, metrics))
                    time.sleep(LIVE_TABLE_PROVISION_INTERVAL_SECS)
                    run = self.api.runs.get(run_name=args.run_name)
                    if run is None:
                        raise CLIError(f"Run {args.run_name} not found")
                    if run.status.is_finished():
                        raise CLIError(f"Run {args.run_name} is finished")
                    metrics = _get_run_jobs_metrics(api=self.api, run=run)
        except KeyboardInterrupt:
            pass


def _get_run_jobs_metrics(api: Client, run: Run) -> List[JobMetrics]:
    metrics = []
    for job in run._run.jobs:
        job_metrics = api.client.metrics.get_job_metrics(
            project_name=api.project,
            run_name=run.name,
            replica_num=job.job_spec.replica_num,
            job_num=job.job_spec.job_num,
        )
        metrics.append(job_metrics)
    return metrics


def _get_metrics_table(run: Run, metrics: List[JobMetrics]) -> Table:
    table = Table(box=None)
    table.add_column("NAME", style="bold", no_wrap=True)
    table.add_column("CPU")
    table.add_column("MEMORY")
    table.add_column("GPU")

    run_row: Dict[Union[str, int], Any] = {"NAME": run.name}
    if len(run._run.jobs) != 1:
        add_row_from_dict(table, run_row)

    for job, job_metrics in zip(run._run.jobs, metrics):
        cpu_usage = _get_metric_value(job_metrics, "cpu_usage_percent")
        if cpu_usage is not None:
            cpu_usage = f"{cpu_usage}%"
        memory_usage = _get_metric_value(job_metrics, "memory_working_set_bytes")
        if memory_usage is not None:
            memory_usage = f"{round(memory_usage / 1024 / 1024)}MB"
            if job.job_submissions[-1].job_provisioning_data is not None:
                memory_usage += f"/{job.job_submissions[-1].job_provisioning_data.instance_type.resources.memory_mib}MB"
        gpu_metrics = ""
        gpus_detected_num = _get_metric_value(job_metrics, "gpus_detected_num")
        if gpus_detected_num is not None:
            for i in range(gpus_detected_num):
                gpu_memory_usage = _get_metric_value(job_metrics, f"gpu_memory_usage_bytes_gpu{i}")
                gpu_util_percent = _get_metric_value(job_metrics, f"gpu_util_percent_gpu{i}")
                if gpu_memory_usage is not None:
                    if i != 0:
                        gpu_metrics += "\n"
                    gpu_metrics += f"#{i} {round(gpu_memory_usage / 1024 / 1024)}MB"
                    if job.job_submissions[-1].job_provisioning_data is not None:
                        gpu_metrics += f"/{job.job_submissions[-1].job_provisioning_data.instance_type.resources.gpus[i].memory_mib}MB"
                    gpu_metrics += f" {gpu_util_percent}% Util"

        job_row: Dict[Union[str, int], Any] = {
            "NAME": f"  replica={job.job_spec.replica_num} job={job.job_spec.job_num}",
            "CPU": cpu_usage or "-",
            "MEMORY": memory_usage or "-",
            "GPU": gpu_metrics or "-",
        }
        if len(run._run.jobs) == 1:
            job_row.update(run_row)
        add_row_from_dict(table, job_row)

    return table


def _get_metric_value(job_metrics: JobMetrics, name: str) -> Optional[Any]:
    for metric in job_metrics.metrics:
        if metric.name == name:
            return metric.values[-1]
    return None
```
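To illustrate the lookup logic in `_get_metric_value` above, here is a minimal standalone sketch. `FakeMetric` and `FakeJobMetrics` are hypothetical stand-ins for the real `dstack` metric models, assumed only to expose the `metrics`, `name`, and `values` attributes that the code above reads.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class FakeMetric:
    # Hypothetical stand-in: the real model also carries timestamps.
    name: str
    values: List[Any]


@dataclass
class FakeJobMetrics:
    metrics: List[FakeMetric] = field(default_factory=list)


def get_metric_value(job_metrics: FakeJobMetrics, name: str) -> Optional[Any]:
    # Same logic as _get_metric_value above: return the most recent
    # sample of the first metric whose name matches, else None.
    for metric in job_metrics.metrics:
        if metric.name == name:
            return metric.values[-1]
    return None


jm = FakeJobMetrics(
    metrics=[
        FakeMetric(name="cpu_usage_percent", values=[12.0, 35.5]),
        FakeMetric(name="memory_working_set_bytes", values=[256 * 1024 * 1024]),
    ]
)

print(get_metric_value(jm, "cpu_usage_percent"))  # latest sample: 35.5
# Bytes-to-MB conversion as done in _get_metrics_table:
print(round(get_metric_value(jm, "memory_working_set_bytes") / 1024 / 1024))  # 256
print(get_metric_value(jm, "gpus_detected_num"))  # missing metric: None
```

Because only the last element of `values` is returned, the table always shows the newest sample; a metric that was never collected simply renders as `-`.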
Lines changed: 5 additions & 119 deletions

The 119 deleted lines are the previous `stats` implementation, a near-verbatim duplicate (with `stats` naming) of the code added in the new metrics module above. The file is reduced to a thin deprecated alias:

```python
import argparse

from dstack._internal.cli.commands.metrics import MetricsCommand
from dstack._internal.utils.logging import get_logger

logger = get_logger(__name__)


class StatsCommand(MetricsCommand):
    NAME = "stats"

    def _command(self, args: argparse.Namespace):
        logger.warning("`dstack stats` is deprecated in favor of `dstack metrics`")
        super()._command(args)
```
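The rewrite above keeps the old command working via subclassing: `StatsCommand` inherits everything from `MetricsCommand`, overrides only `NAME`, and emits a warning before delegating. A minimal self-contained sketch of that deprecation-alias pattern, using simplified stand-in classes rather than dstack's actual base classes:

```python
import warnings


class MetricsCommand:
    NAME = "metrics"

    def run(self) -> str:
        # Shared implementation; reports which concrete class ran it.
        return f"running {type(self).__name__}"


class StatsCommand(MetricsCommand):
    # Deprecated alias: same behavior, different NAME, plus a warning.
    NAME = "stats"

    def run(self) -> str:
        warnings.warn(
            "`stats` is deprecated in favor of `metrics`",
            DeprecationWarning,
            stacklevel=2,
        )
        return super().run()


print(StatsCommand.NAME)     # "stats"
print(StatsCommand().run())  # "running StatsCommand"
```

The alias stays a single override away from the real implementation, so any future fix to the metrics logic automatically applies to `stats` until the alias is removed.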

src/dstack/_internal/cli/main.py

Lines changed: 2 additions & 0 deletions

```diff
@@ -13,6 +13,7 @@
 from dstack._internal.cli.commands.gateway import GatewayCommand
 from dstack._internal.cli.commands.init import InitCommand
 from dstack._internal.cli.commands.logs import LogsCommand
+from dstack._internal.cli.commands.metrics import MetricsCommand
 from dstack._internal.cli.commands.ps import PsCommand
 from dstack._internal.cli.commands.server import ServerCommand
 from dstack._internal.cli.commands.stats import StatsCommand
@@ -65,6 +66,7 @@ def main():
     GatewayCommand.register(subparsers)
     InitCommand.register(subparsers)
     LogsCommand.register(subparsers)
+    MetricsCommand.register(subparsers)
     PsCommand.register(subparsers)
     ServerCommand.register(subparsers)
     StatsCommand.register(subparsers)
```
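`main.py` registers both command classes against the same argparse subparser group, which is why `dstack metrics` and the deprecated `dstack stats` can coexist. A rough sketch of that registration shape, with simplified stand-ins (`Command.register` and `run` here are hypothetical, not dstack's real API):

```python
import argparse


class Command:
    NAME = "base"
    DESCRIPTION = ""

    @classmethod
    def register(cls, subparsers) -> None:
        # Each command contributes its own subparser, keyed by NAME.
        parser = subparsers.add_parser(cls.NAME, help=cls.DESCRIPTION)
        parser.set_defaults(command=cls())

    def run(self, args) -> str:
        return self.NAME


class MetricsCommand(Command):
    NAME = "metrics"
    DESCRIPTION = "Show run metrics"


class StatsCommand(MetricsCommand):
    NAME = "stats"  # deprecated alias, registered alongside `metrics`


root = argparse.ArgumentParser(prog="dstack")
subparsers = root.add_subparsers()
MetricsCommand.register(subparsers)
StatsCommand.register(subparsers)

args = root.parse_args(["stats"])
print(args.command.run(args))  # "stats"
```

Both subcommands dispatch to the same inherited logic; only the parser key and the stored command instance differ.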
