
Commit 2b3e95e

Add instance and job cost/usage Prometheus metrics (#2432)
* `dstack_instance_duration_seconds_total`
* `dstack_instance_price_dollars_per_hour`
* `dstack_instance_gpu_count`
* `dstack_job_duration_seconds_total`
* `dstack_job_price_dollars_per_hour`
* `dstack_job_gpu_count`

`/metrics/project/<project-name>` is deprecated and does not include the new metrics.

Part-of: #2431
1 parent e31b609 commit 2b3e95e

File tree

7 files changed: +382 −67 lines changed


docs/docs/guides/metrics.md

Lines changed: 73 additions & 0 deletions

# Prometheus metrics

If enabled, `dstack` collects and exports Prometheus metrics. Metrics are available at the `/metrics` path.

By default, metrics are disabled. To enable, set the `DSTACK_ENABLE_PROMETHEUS_METRICS` variable.

!!! info "Convention"
    *type?* denotes an optional type. If a type is optional, an empty string is a valid value.

## Instance metrics

| Metric | Type | Description | Examples |
|---|---|---|---|
| `dstack_instance_duration_seconds_total` | *counter* | Total seconds the instance is running | `1123763.22` |
| `dstack_instance_price_dollars_per_hour` | *gauge* | Instance price, USD/hour | `16.0` |
| `dstack_instance_gpu_count` | *gauge* | Instance GPU count | `4.0`, `0.0` |

| Label | Type | Examples |
|---|---|---|
| `dstack_project_name` | *string* | `main` |
| `dstack_fleet_name` | *string?* | `my-fleet` |
| `dstack_fleet_id` | *string?* | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
| `dstack_instance_name` | *string* | `my-fleet-0` |
| `dstack_instance_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
| `dstack_instance_type` | *string?* | `g4dn.xlarge` |
| `dstack_backend` | *string?* | `aws`, `runpod` |
| `dstack_gpu` | *string?* | `T4` |

## Job metrics

| Metric | Type | Description | Examples |
|---|---|---|---|
| `dstack_job_duration_seconds_total` | *counter* | Total seconds the job is running | `520.37` |
| `dstack_job_price_dollars_per_hour` | *gauge* | Job instance price, USD/hour | `8.0` |
| `dstack_job_gpu_count` | *gauge* | Job GPU count | `2.0`, `0.0` |

| Label | Type | Examples |
|---|---|---|
| `dstack_project_name` | *string* | `main` |
| `dstack_user_name` | *string* | `alice` |
| `dstack_run_name` | *string* | `nccl-tests` |
| `dstack_run_id` | *string* | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
| `dstack_job_name` | *string* | `nccl-tests-0-0` |
| `dstack_job_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
| `dstack_job_num` | *integer* | `0` |
| `dstack_replica_num` | *integer* | `0` |
| `dstack_run_type` | *string* | `task`, `dev-environment` |
| `dstack_backend` | *string* | `aws`, `runpod` |
| `dstack_gpu` | *string?* | `T4` |

## NVIDIA DCGM job metrics

A fixed subset of NVIDIA GPU metrics from [DCGM Exporter :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html){:target="_blank"} on supported cloud backends (AWS, Azure, GCP, OCI) and SSH fleets.

??? info "SSH fleets"
    In order for DCGM metrics to work, the following packages must be installed on the instances:

    * `datacenter-gpu-manager-4-core`
    * `datacenter-gpu-manager-4-proprietary`
    * `datacenter-gpu-manager-exporter`

Check [`dcgm/exporter.go`](https://github.com/dstackai/dstack/blob/master/runner/internal/shim/dcgm/exporter.go) for the list of metrics.

| Label | Type | Examples |
|---|---|---|
| `dstack_project_name` | *string* | `main` |
| `dstack_user_name` | *string* | `alice` |
| `dstack_run_name` | *string* | `nccl-tests` |
| `dstack_run_id` | *string* | `51e837bf-fae9-4a37-ac9c-85c005606c22` |
| `dstack_job_name` | *string* | `nccl-tests-0-0` |
| `dstack_job_id` | *string* | `8c28c52c-2f94-4a19-8c06-12f1dfee4dd2` |
| `dstack_job_num` | *integer* | `0` |
| `dstack_replica_num` | *integer* | `0` |
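Because the duration counter and price gauge carry the same label sets, accrued cost can be estimated on the consumer side by joining the two series on a label. Below is a minimal sketch, not part of dstack: the sample exposition text and the `estimate_instance_cost` helper are illustrative, but the parsing uses the same `prometheus_client` parser the server itself relies on.

```python
from prometheus_client.parser import text_string_to_metric_families

# Illustrative exposition text in the format served at /metrics.
SAMPLE = """\
# HELP dstack_instance_duration_seconds_total Total seconds the instance is running
# TYPE dstack_instance_duration_seconds_total counter
dstack_instance_duration_seconds_total{dstack_instance_name="my-fleet-0"} 7200.0
# HELP dstack_instance_price_dollars_per_hour Instance price, USD/hour
# TYPE dstack_instance_price_dollars_per_hour gauge
dstack_instance_price_dollars_per_hour{dstack_instance_name="my-fleet-0"} 16.0
"""


def estimate_instance_cost(text: str) -> dict[str, float]:
    """Estimate accrued cost per instance: duration / 3600 * hourly price."""
    durations: dict[str, float] = {}
    prices: dict[str, float] = {}
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            key = sample.labels.get("dstack_instance_name", "")
            # Sample names keep the _total suffix even though the family name drops it.
            if sample.name == "dstack_instance_duration_seconds_total":
                durations[key] = sample.value
            elif sample.name == "dstack_instance_price_dollars_per_hour":
                prices[key] = sample.value
    return {k: durations[k] / 3600 * prices.get(k, 0.0) for k in durations}


print(estimate_instance_cost(SAMPLE))  # {'my-fleet-0': 32.0}
```

In practice a Prometheus recording rule over the scraped series would do the same join; this sketch only shows the arithmetic the metric pair makes possible.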

docs/docs/guides/server-deployment.md

Lines changed: 0 additions & 15 deletions
```diff
@@ -202,21 +202,6 @@ To store logs using GCP Logging, set the `DSTACK_SERVER_GCP_LOGGING_PROJECT` env
 
 </div>
 
-## Metrics
-
-If enabled, `dstack` collects and exports Prometheus metrics from running jobs. Metrics for jobs from all projects are available
-at the `/metrics` path, and metrics for jobs from a specific project are available at the `/metrics/project/<project-name>` path.
-
-By default, metrics are disabled. To enable, set the `DSTACK_ENABLE_PROMETHEUS_METRICS` variable.
-
-Each sample includes a set of `dstack_*` labels, e.g., `dstack_project_name="main"`, `dstack_run_name="vllm-llama32"`.
-
-Currently, `dstack` collects the following metrics:
-
-* A fixed subset of NVIDIA GPU metrics from [DCGM Exporter :material-arrow-top-right-thin:{ .external }](https://docs.nvidia.com/datacenter/dcgm/latest/gpu-telemetry/dcgm-exporter.html){:target="_blank"} on supported cloud backends — AWS, Azure, GCP, OCI — and SSH fleets.
-  On supported cloud backends the required packages are already installed.
-  If you use SSH fleets, install `datacenter-gpu-manager-4-core`, `datacenter-gpu-manager-4-proprietary` and `datacenter-gpu-manager-exporter`.
 ## Encryption
 
 By default, `dstack` stores data in plaintext. To enforce encryption, you
```
mkdocs.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -225,6 +225,7 @@ nav:
   - Guides:
     - Protips: docs/guides/protips.md
     - Server deployment: docs/guides/server-deployment.md
+    - Metrics: docs/guides/metrics.md
     - Troubleshooting: docs/guides/troubleshooting.md
     - Administration: docs/guides/administration.md
   - Reference:
```

src/dstack/_internal/server/routers/prometheus.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -26,7 +26,7 @@ async def get_prometheus_metrics(
     return await prometheus.get_metrics(session=session)
 
 
-@router.get("/metrics/project/{project_name}")
+@router.get("/metrics/project/{project_name}", deprecated=True)
 async def get_project_prometheus_metrics(
     session: Annotated[AsyncSession, Depends(get_session)],
     project: Annotated[ProjectModel, Depends(Project())],
```

src/dstack/_internal/server/services/prometheus.py

Lines changed: 176 additions & 18 deletions
```diff
@@ -1,4 +1,6 @@
+import itertools
 from collections.abc import Generator, Iterable
+from datetime import timezone
 
 from prometheus_client import Metric
 from prometheus_client.parser import text_string_to_metric_families
@@ -7,21 +9,172 @@
 from sqlalchemy.ext.asyncio import AsyncSession
 from sqlalchemy.orm import joinedload
 
-from dstack._internal.core.models.runs import JobStatus
-from dstack._internal.server.models import JobModel, JobPrometheusMetrics, ProjectModel
+from dstack._internal.core.models.instances import InstanceStatus
+from dstack._internal.core.models.runs import JobStatus, RunSpec
+from dstack._internal.server.models import (
+    InstanceModel,
+    JobModel,
+    JobPrometheusMetrics,
+    ProjectModel,
+    RunModel,
+)
+from dstack._internal.server.services.instances import get_instance_offer
+from dstack._internal.server.services.jobs import get_job_provisioning_data, get_job_runtime_data
+from dstack._internal.utils.common import get_current_datetime
+
+_INSTANCE_DURATION = "dstack_instance_duration_seconds_total"
+_INSTANCE_PRICE = "dstack_instance_price_dollars_per_hour"
+_INSTANCE_GPU_COUNT = "dstack_instance_gpu_count"
+_JOB_DURATION = "dstack_job_duration_seconds_total"
+_JOB_PRICE = "dstack_job_price_dollars_per_hour"
+_JOB_GPU_COUNT = "dstack_job_gpu_count"
 
 
 async def get_metrics(session: AsyncSession) -> str:
+    metrics_iter = itertools.chain(
+        await get_instance_metrics(session),
+        await get_job_metrics(session),
+        await get_job_gpu_metrics(session),
+    )
+    return "\n".join(_render_metrics(metrics_iter)) + "\n"
+
+
+async def get_instance_metrics(session: AsyncSession) -> Iterable[Metric]:
+    res = await session.execute(
+        select(InstanceModel)
+        .join(ProjectModel)
+        .where(
+            InstanceModel.deleted == False,
+            InstanceModel.status.in_(
+                [
+                    InstanceStatus.PROVISIONING,
+                    InstanceStatus.IDLE,
+                    InstanceStatus.BUSY,
+                    InstanceStatus.TERMINATING,
+                ]
+            ),
+        )
+        .order_by(ProjectModel.name, InstanceModel.name)
+        .options(
+            joinedload(InstanceModel.project),
+            joinedload(InstanceModel.fleet),
+        )
+    )
+    instances = res.unique().scalars().all()
+    metrics: dict[str, Metric] = {
+        _INSTANCE_DURATION: Metric(
+            name=_INSTANCE_DURATION,
+            documentation="Total seconds the instance is running",
+            typ="counter",
+        ),
+        _INSTANCE_PRICE: Metric(
+            name=_INSTANCE_PRICE, documentation="Instance price, USD/hour", typ="gauge"
+        ),
+        _INSTANCE_GPU_COUNT: Metric(
+            name=_INSTANCE_GPU_COUNT, documentation="Instance GPU count", typ="gauge"
+        ),
+    }
+    now = get_current_datetime()
+    for instance in instances:
+        fleet = instance.fleet
+        offer = get_instance_offer(instance)
+        gpu = ""
+        gpu_count = 0
+        if offer is not None and len(offer.instance.resources.gpus) > 0:
+            gpu = offer.instance.resources.gpus[0].name
+            gpu_count = len(offer.instance.resources.gpus)
+        labels: dict[str, str] = {
+            "dstack_project_name": instance.project.name,
+            "dstack_fleet_name": fleet.name if fleet is not None else "",
+            "dstack_fleet_id": str(fleet.id) if fleet is not None else "",
+            "dstack_instance_name": str(instance.name),
+            "dstack_instance_id": str(instance.id),
+            "dstack_instance_type": offer.instance.name if offer is not None else "",
+            "dstack_backend": instance.backend.value if instance.backend is not None else "",
+            "dstack_gpu": gpu,
+        }
+        duration = (now - instance.created_at.replace(tzinfo=timezone.utc)).total_seconds()
+        metrics[_INSTANCE_DURATION].add_sample(
+            name=_INSTANCE_DURATION, labels=labels, value=duration
+        )
+        metrics[_INSTANCE_PRICE].add_sample(
+            name=_INSTANCE_PRICE, labels=labels, value=instance.price or 0.0
+        )
+        metrics[_INSTANCE_GPU_COUNT].add_sample(
+            name=_INSTANCE_GPU_COUNT, labels=labels, value=gpu_count
+        )
+    return metrics.values()
+
+
+async def get_job_metrics(session: AsyncSession) -> Iterable[Metric]:
+    res = await session.execute(
+        select(JobModel)
+        .join(ProjectModel)
+        .where(
+            JobModel.status.in_(
+                [
+                    JobStatus.PROVISIONING,
+                    JobStatus.PULLING,
+                    JobStatus.RUNNING,
+                    JobStatus.TERMINATING,
+                ]
+            )
+        )
+        .order_by(ProjectModel.name, JobModel.job_name)
+        .options(
+            joinedload(JobModel.project),
+            joinedload(JobModel.run).joinedload(RunModel.user),
+        )
+    )
+    jobs = res.scalars().all()
+    metrics: dict[str, Metric] = {
+        _JOB_DURATION: Metric(
+            name=_JOB_DURATION, documentation="Total seconds the job is running", typ="counter"
+        ),
+        _JOB_PRICE: Metric(
+            name=_JOB_PRICE, documentation="Job instance price, USD/hour", typ="gauge"
+        ),
+        _JOB_GPU_COUNT: Metric(name=_JOB_GPU_COUNT, documentation="Job GPU count", typ="gauge"),
+    }
+    now = get_current_datetime()
+    for job in jobs:
+        jpd = get_job_provisioning_data(job)
+        if jpd is None:
+            continue
+        jrd = get_job_runtime_data(job)
+        gpus = jpd.instance_type.resources.gpus
+        price = jpd.price
+        if jrd is not None and jrd.offer is not None:
+            gpus = jrd.offer.instance.resources.gpus
+            price = jrd.offer.price
+        run_spec = RunSpec.__response__.parse_raw(job.run.run_spec)
+        labels = _get_job_labels(job)
+        labels["dstack_run_type"] = run_spec.configuration.type
+        labels["dstack_backend"] = jpd.get_base_backend().value
+        labels["dstack_gpu"] = gpus[0].name if gpus else ""
+        duration = (now - job.submitted_at.replace(tzinfo=timezone.utc)).total_seconds()
+        metrics[_JOB_DURATION].add_sample(name=_JOB_DURATION, labels=labels, value=duration)
+        metrics[_JOB_PRICE].add_sample(name=_JOB_PRICE, labels=labels, value=price)
+        metrics[_JOB_GPU_COUNT].add_sample(name=_JOB_GPU_COUNT, labels=labels, value=len(gpus))
+    return metrics.values()
+
+
+async def get_job_gpu_metrics(session: AsyncSession) -> Iterable[Metric]:
     res = await session.execute(
         select(JobPrometheusMetrics)
         .join(JobModel)
         .join(ProjectModel)
         .where(JobModel.status.in_([JobStatus.RUNNING]))
         .order_by(ProjectModel.name, JobModel.job_name)
-        .options(joinedload(JobPrometheusMetrics.job).joinedload(JobModel.project))
+        .options(
+            joinedload(JobPrometheusMetrics.job).joinedload(JobModel.project),
+            joinedload(JobPrometheusMetrics.job)
+            .joinedload(JobModel.run)
+            .joinedload(RunModel.user),
+        )
     )
     metrics_models = res.scalars().all()
-    return _process_metrics(metrics_models)
+    return _parse_and_enrich_job_gpu_metrics(metrics_models)
 
 
 async def get_project_metrics(session: AsyncSession, project: ProjectModel) -> str:
@@ -33,20 +186,20 @@ async def get_project_metrics(session: AsyncSession, project: ProjectModel) -> s
             JobModel.status.in_([JobStatus.RUNNING]),
         )
         .order_by(JobModel.job_name)
-        .options(joinedload(JobPrometheusMetrics.job).joinedload(JobModel.project))
+        .options(
+            joinedload(JobPrometheusMetrics.job).joinedload(JobModel.project),
+            joinedload(JobPrometheusMetrics.job)
+            .joinedload(JobModel.run)
+            .joinedload(RunModel.user),
+        )
     )
     metrics_models = res.scalars().all()
-    return _process_metrics(metrics_models)
-
-
-def _process_metrics(metrics_models: Iterable[JobPrometheusMetrics]) -> str:
-    metrics = _parse_and_enrich_metrics(metrics_models)
-    if not metrics:
-        return ""
-    return "\n".join(_render_metrics(metrics)) + "\n"
+    return "\n".join(_render_metrics(_parse_and_enrich_job_gpu_metrics(metrics_models))) + "\n"
 
 
-def _parse_and_enrich_metrics(metrics_models: Iterable[JobPrometheusMetrics]) -> list[Metric]:
+def _parse_and_enrich_job_gpu_metrics(
+    metrics_models: Iterable[JobPrometheusMetrics],
+) -> Iterable[Metric]:
     metrics: dict[str, Metric] = {}
     for metrics_model in metrics_models:
         for metric in text_string_to_metric_families(metrics_model.text):
@@ -56,31 +209,36 @@ def _parse_and_enrich_metrics(metrics_models: Iterable[JobPrometheusMetrics]) ->
             metric = metrics.setdefault(name, metric)
             for sample in samples:
                 labels = sample.labels
-                labels.update(_get_dstack_labels(metrics_model.job))
+                labels.update(_get_job_labels(metrics_model.job))
                 # text_string_to_metric_families "fixes" counter names appending _total,
                 # we rebuild Sample to revert this
                 metric.samples.append(Sample(name, labels, *sample[2:]))
-    return list(metrics.values())
+    return metrics.values()
 
 
-def _get_dstack_labels(job: JobModel) -> dict[str, str]:
+def _get_job_labels(job: JobModel) -> dict[str, str]:
     return {
         "dstack_project_name": job.project.name,
+        "dstack_user_name": job.run.user.name,
         "dstack_run_name": job.run_name,
+        "dstack_run_id": str(job.run_id),
         "dstack_job_name": job.job_name,
+        "dstack_job_id": str(job.id),
         "dstack_job_num": str(job.job_num),
         "dstack_replica_num": str(job.replica_num),
     }
 
 
 def _render_metrics(metrics: Iterable[Metric]) -> Generator[str, None, None]:
     for metric in metrics:
+        if not metric.samples:
+            continue
         yield f"# HELP {metric.name} {metric.documentation}"
         yield f"# TYPE {metric.name} {metric.type}"
         for sample in metric.samples:
             parts: list[str] = [f"{sample.name}{{"]
             parts.extend(",".join(f'{name}="{value}"' for name, value in sample.labels.items()))
-            parts.append(f"}} {sample.value}")
+            parts.append(f"}} {float(sample.value)}")
             # text_string_to_metric_families converts milliseconds to float seconds
             if isinstance(sample.timestamp, float):
                 parts.append(f" {int(sample.timestamp * 1000)}")
```