Commit d6dece8

Merge branch 'master' into pr_offers_with_availability_cache
2 parents e14c6fb + 88e7146 commit d6dece8

12 files changed

Lines changed: 196 additions & 252 deletions

File tree

docs/assets/stylesheets/extra.css

Lines changed: 6 additions & 2 deletions
@@ -804,7 +804,7 @@ body {
     display: inline-block;
     font-size: 17px;
     font-weight: 600;
-    line-height: 1.4rem;
+    /* line-height: 1.4rem; */
     /*letter-spacing: -0.5px;*/
     position: relative;
     left: -11px;
@@ -866,7 +866,7 @@ body {
 }
 
 .md-sidebar--primary .md-nav__link, .md-sidebar--post .md-nav__link {
-    padding: 5px 15px 4px;
+    padding: 4px 15px 4px;
     margin-top: 0;
 }
 
@@ -989,6 +989,10 @@ html .md-footer-meta.md-typeset a:is(:focus,:hover) {
 .md-nav--integrated>.md-nav__list>.md-nav__item--active .md-nav--secondary {
     margin-bottom: 0;
 }
+
+.md-nav--primary .md-nav__list {
+    padding-bottom: .2rem;
+}
 }
 
 .md-typeset :where(ol, ul) {
Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
---
title: "The state of cloud GPUs in 2025: costs, performance, playbooks"
date: 2025-09-10
description: "TBA"
slug: state-of-cloud-gpu-2025
image: https://dstack.ai/static-assets/static-assets/images/cloud-gpu-providers.png
# categories:
#   - Benchmarks
---

# The state of cloud GPUs in 2025: costs, performance, playbooks

This is a practical map for teams renting GPUs — whether you’re a single-project team fine-tuning models or a production-scale team managing thousand-GPU workloads. We’ll break down where providers fit, what actually drives performance, how pricing really works, and how to design a control plane that makes multi-cloud not just possible, but a competitive advantage.

<!-- more -->
## A quick map of the market

Two forces define the market: **target scale** (from single nodes → racks → multi-rack pods) and **automation maturity** (manual VMs → basic Kubernetes → API-first orchestration).

<img src="https://dstack.ai/static-assets/static-assets/images/cloud-gpu-providers.png" width="750"/>

These axes split providers into distinct archetypes, each with different economics, fabrics, and operational realities.

### Categories at a glance

| Category | Description | Examples |
| :---- | :---- | :---- |
| **Classical hyperscalers** | General-purpose clouds with GPU SKUs bolted on | AWS, Google Cloud, Azure, OCI |
| **Massive neoclouds** | GPU-first operators built around dense HGX or MI-series clusters | CoreWeave, Lambda, Nebius, Crusoe |
| **Rapidly catching-up neoclouds** | Smaller GPU-first players building out aggressively | RunPod, DataCrunch, Voltage Park, TensorWave, Hot Aisle |
| **Cloud marketplaces** | Don’t own capacity; sell orchestration and a unified API over multiple backends | NVIDIA DGX Cloud (Lepton), Modal, Lightning AI, dstack Sky |
| **DC aggregators** | Aggregate idle capacity from third-party datacenters, with market-driven pricing | Vast.ai |

> Massive neoclouds lead at extreme GPU scales. Hyperscalers may even procure GPU capacity from these GPU-first operators for both training and inference.
## Silicon reality check

=== "NVIDIA"

    **NVIDIA** remains the path of least resistance for most teams — CUDA and the NVIDIA Container Toolkit still lead in framework compatibility and tooling maturity. H100 is now table stakes and widely available across clouds, a reflection of billions in GPU capex flowing into the open market. GB200 takes it further with tightly coupled domains ideal for memory- and bandwidth-heavy prefill, while cheaper pools can handle lighter decode phases.

=== "AMD"

    **AMD** has now crossed the viability threshold with ROCm 6/7 — native PyTorch wheels, ROCm containers, and upstream support in vLLM/SGLang mean OSS stacks work on day 0 if you standardize on ROCm images. MI300X (192 GB) and MI350X (288 GB HBM3E) match or exceed NVIDIA on per-GPU memory and are increasingly listed by neoclouds. The new MI355X pushes further — designed for rack-scale AI, it packs massive HBM3E pools into high-density systems for ultra-large-model throughput.

=== "TPU & Trainium"

    **TPUs** and **Trainium** excel in tightly coupled training when you’re all-in on one provider, letting you amortize integration work over years. The trade-offs — vendor lock-in, slower OSS support, and smaller ecosystems — make them viable mainly for multi-year, hyperscale workloads where efficiency outweighs migration cost.

> **AMD** vs **NVIDIA** fit. MI300X exceeds H200 in capacity (192 GB vs 141 GB), leaving more headroom for long-context prefill. MI325X (256 GB) is rolling out slowly, with many providers jumping to MI350X/MI355X (288 GB HBM3E). These top models exceed B200’s 192 GB, making them viable drop-ins where ROCm is ready; GB200/NVL systems still lead for ultra-low-latency collectives.
## What you’re really buying

The GPU SKU is only one piece; real throughput depends on the system around it. Clusters are optional — until your workload forces them.

| Dimension | Why it matters | Examples |
| :---- | :---- | :---- |
| **GPU memory** | Governs max batch size and KV-cache headroom, reducing parallelism overhead | H100 (80 GB), H200 (~141 GB), B200 (~192 GB), MI300X (192 GB), MI325X (256 GB), MI350X/MI355X (288 GB) |
| **Fabric bandwidth** | Dictates all-reduce speed and MoE routing efficiency; matters beyond a few nodes | 400 Gb/s – 3.2 Tb/s (e.g., 8×400 Gb/s NICs) |
| **Topology** | Low-diameter, uniform interconnect pods beat ad-hoc multi-rack setups for scaling efficiency | HGX islands |
| **Local NVMe** | Hides object-store latency for shards and checkpoints | Multi-TB local SSD per node is common on training SKUs |
| **Network volumes** | Remove “copy to every node” overhead | FSx for Lustre, Filestore, managed NFS; in HPC/neocloud setups, VAST and WEKA are common |
| **Orchestration** | Containers, placement, gang scheduling, autoscaling | K8s + Kueue, KubeRay, dstack, Slurm, vendor schedulers |
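To make the GPU-memory row concrete, here is a rough KV-cache sizing sketch. The model shape (80 layers, 8 grouped KV heads, head dimension 128, fp16) is an assumed 70B-class configuration for illustration, not a figure from this post:

```python
# Hypothetical KV-cache sizing: 2 tensors (K and V) per layer, per token.
# The model shape below is an assumed 70B-class config -- check the real model card.
def kv_cache_gib(layers, kv_heads, head_dim, ctx_len, batch, dtype_bytes=2):
    bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return bytes_per_token * ctx_len * batch / 2**30

# One 32k-token sequence in fp16:
print(f"{kv_cache_gib(80, 8, 128, 32_768, 1):.1f} GiB")  # -> 10.0 GiB
```

Under these assumptions, a single 32k-token sequence already consumes an eighth of an 80 GB H100; 192–288 GB parts leave correspondingly more headroom for batch size and context length.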
## Pricing models – and what they hide

Price tables don’t show availability risk. Commitments lower cost and increase the odds you get the hardware when you need it.

| With commitments | Without commitments |
| ----- | ----- |
| **Long-term (1–3 years)** Reserved or savings plans. 30–70% below on-demand. High capacity assurance, but utilization risk if needs shift. | **On-demand** Launch instantly — if quota allows. Highest $/hr. Limited availability for hot SKUs. |
| **Short-term (6–12 months)** Private offers, common with neoclouds. 20–60% off. Often includes hard capacity guarantees. | **Flex / queued** Starts when supply frees up. Cheaper than on-demand; runs capped in duration. |
| **Calendar capacity** Fixed-date bookings (AWS Capacity Blocks, GCP Calendar). Guarantees start time for planned runs. | **Spot / preemptible** 60–90% off. Eviction-prone; needs checkpointing and stateless design. |

!!! info "Playbook"
    Lock in calendar or reserved capacity for steady base load and planned long runs. Keep urgent, interactive, and development/CI/CD work on on-demand. Push experiments and ephemeral runs to spot/flex. Always leave exit ramps to pivot to new SKUs.
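To see why spot can still win despite evictions, here is a back-of-envelope model of effective cost. The prices, eviction rate, and checkpoint cadence below are illustrative assumptions, not quotes from any provider:

```python
# Effective $/GPU-hr on spot, assuming periodic checkpointing and random evictions.
# All numbers are hypothetical.
def effective_spot_cost(spot_price, evictions_per_hr, ckpt_interval_hr, restore_hr):
    # Each eviction loses, on average, half a checkpoint interval plus restore time.
    lost_fraction = evictions_per_hr * (ckpt_interval_hr / 2 + restore_hr)
    goodput = max(1.0 - lost_fraction, 1e-6)  # useful fraction of each billed hour
    return spot_price / goodput

on_demand = 3.00  # hypothetical on-demand $/GPU-hr
spot = effective_spot_cost(1.00, evictions_per_hr=0.1,
                           ckpt_interval_hr=0.5, restore_hr=0.2)
print(f"effective spot: ${spot:.2f}/GPU-hr vs ${on_demand:.2f} on-demand")
```

With one eviction every ten hours and half-hourly checkpoints, the effective rate stays close to the list spot price — which is why the playbook above pushes checkpointed, ephemeral work to spot.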
### Quotas, approvals, and the human factor

Even listed SKUs may be gated. Hyperscalers and neoclouds enforce quotas and manual approvals — region by region — especially for new accounts on credits. If you can’t clear those gates, multi-cloud isn’t optional; it’s survival.

### H100 pricing example

Below is the price range for a single H100 SXM GPU across providers.

<img src="https://dstack.ai/static-assets/static-assets/images/cloud-providers-single-h100.png" width="750"/>

> Price is per GPU and excludes differences in CPU count, disk size and type, and network configuration. 8×GPU multi-node setups with fast interconnects will cost more.

For comparison, below is the price range for H100 GPU clusters across providers.

<img src="https://dstack.ai/static-assets/static-assets/images/cloud-providers-cluster-h100.png" width="750"/>

> Most hyperscalers and neoclouds require short- or long-term contracts, though providers like RunPod, DataCrunch, and Nebius offer on-demand clusters. Larger capacity and longer commitments bring bigger discounts — Nebius offers up to 35% off for longer terms.
## New GPU generations – why they matter

* **Memory and bandwidth scaling.** Higher HBM capacity and faster interconnects expand batch size, context length, and per-node throughput. NVIDIA’s B300 and AMD’s MI355X push this further with massive HBM3E capacity and rack-scale fabrics, targeting ultra-large training runs.
* **Fabrics.** Each new generation often brings major interconnect upgrades — GB200 with NVLink 5 (1.8 TB/s) and 800 Gb/s InfiniBand, MI355X with PCIe Gen6 and NDR. These cut all-reduce and MoE latency, but only if the cloud deploys matching network infrastructure. Pairing new GPUs with legacy 400 Gb/s links can erase much of the gain.
* **Prefill vs decode.** Prefill (memory- and bandwidth-heavy) thrives on large HBM and tightly coupled GPUs like GB200 NVL72. Decode can run cheaper, on high-concurrency pools. Splitting them is a major cost lever.
* **Cascade.** Top-end SKUs arrive roughly every 18–24 months, with mid-cycle refreshes in between. Each launch pushes older SKUs down the price curve — locking in for years right before a release risks overpaying within months.

!!! info "Prices"
    H100 prices have dropped significantly in recent years, driven by new GPU generations such as the H200 and B200 and by models like DeepSeek that demand more memory. AWS alone has reduced H100 instance prices by 44%. H200 and, later, B200 prices are expected to follow the same trend.

**AMD** MI300X pricing is also softening as MI350X/MI355X roll out, with some neoclouds undercutting H100/H200 on $/GPU-hr while offering more memory per GPU.
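The fabric point above can be quantified with a simple ring all-reduce estimate. This is a bandwidth-only sketch (latency terms and compute overlap ignored), and the 140 GB gradient payload assumes a hypothetical 70B-parameter model in fp16:

```python
# Ring all-reduce moves 2*(n-1)/n of the payload over each link (bandwidth term only;
# real collectives add latency and may overlap with compute).
def ring_allreduce_seconds(payload_gb, n_gpus, link_gbps):
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * payload_gb
    return traffic_gb * 8 / link_gbps  # GB -> Gb, then divide by Gb/s

# Hypothetical 70B fp16 gradients (~140 GB) across 64 GPUs:
for link in (400, 3200):  # single legacy NIC vs 8x400 Gb/s rails
    print(f"{link} Gb/s: {ring_allreduce_seconds(140, 64, link):.2f} s")
```

Under these assumptions the legacy link is roughly 8× slower per gradient exchange — the gap a new-generation GPU cannot close if the cloud ships it behind old networking.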
## Where provisioning is going

The shift is from ad-hoc starts to time-bound allocations.

Large runs are booked ahead; daily work rides elastic pools. Placement engines increasingly decide on region, provider, and interconnect before the SKU. The mindset moves from “more GPUs” to “higher sustained utilization.”

## Control plane as the force multiplier

A real multi-cloud control plane should:

* **Be quota-aware and cost-aware** – place jobs where they’ll start fastest at the best $/SLO.
* **Maximize utilization** – keep GPUs busy with checkpointing, resumable pipelines, and efficient gang scheduling.
* **Enforce portability** – one container spec, CUDA+ROCm images, upstream framework compatibility, state in object storage.

This turns capacity from individual silos into one fungible pool.
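As a sketch of what such portability looks like in practice, here is a hypothetical dstack task that can land on any of several backends. Field names follow dstack’s run configuration, but the name, image, bucket, and limits are illustrative placeholders:

```yaml
type: task
name: train-sketch                        # hypothetical run name
image: ghcr.io/example/train:latest       # assumed image, not a real one
commands:
  - python train.py --checkpoint-dir s3://example-bucket/ckpts  # state in object storage
resources:
  gpu: 80GB..                             # any GPU with at least 80 GB of memory
backends: [aws, runpod, nebius]           # let the control plane pick the pool
spot_policy: auto                         # use spot when available, else on-demand
max_price: 2.5                            # $/hr ceiling per instance
retry:
  on_events: [no-capacity, interruption]  # resubmit on eviction or capacity gaps
  duration: 2d
```

With checkpoints in object storage and retries on interruption, the same spec rides spot pools when they are cheap and falls back elsewhere when they are not.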
## Final takeaways

* **Price ≠ cost** – list price often explains <50% of total job cost for multi-node training; fabric and storage dominate at scale.
* **Match commitments to workload reality** – and leave room for next-gen hardware.
* **Multi-cloud isn’t backup, it’s strategy** – keep a warm secondary.
* **Watch AMD’s ramp-up** – the MI series is becoming production-ready, and MI355X availability is set to expand quickly as providers bring it online.
* **Control plane is leverage** – define once, run anywhere, at the cheapest viable pool.
??? info "Scope & limitations of this report"

    - **Provider coverage.** The vendor set is a curated sample aligned with the dstack team’s view of the market. A limited group of community members and domain experts reviewed drafts. Corrections, reproducibility notes, and additional data points are welcome.
    - **Methodology gaps.** We did not perform cross-vendor **price normalization** (CPU/RAM/NVMe/fabric adjustments, region effects, egress), controlled **microbenchmarks** (NCCL/all-reduce, MoE routing latency, KV-cache behavior, object store vs. parallel FS), or a full **orchestration capability matrix** (scheduler semantics, gang scheduling, quota APIs, preemption, multi-tenancy).
    - **Next steps.** We plan to publish price normalization, hardware/network microbenchmarks, and a scheduler capability matrix; preliminary harnesses are linked in the appendix. Contributors welcome.

> If you need a lighter, simpler orchestration and control-plane alternative to Kubernetes or Slurm, consider [dstack :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/){:target="_blank"}. It’s open-source and self-hosted.

??? info "dstack Sky"
    If you want unified access to low-cost on-demand and spot GPUs across multiple clouds, try [dstack Sky :material-arrow-top-right-thin:{ .external }](https://sky.dstack.ai/){:target="_blank"}.

    <img src="https://dstack.ai/static-assets/static-assets/images/dstack-sky-offers.png" width="750"/>

    You can use it with your own cloud accounts or through the cloud marketplace.

docs/docs/guides/protips.md

Lines changed: 27 additions & 0 deletions
@@ -321,6 +321,33 @@ retry:
 
 </div>
 
+## Profiles
+
+Sometimes, you may want to reuse parameters across runs or set defaults so you don’t have to repeat them in every configuration. You can do this by defining a profile.
+
+??? info ".dstack/profiles.yml"
+    A profile file can be created either globally in `~/.dstack/profiles.yml` or locally in `.dstack/profiles.yml`:
+
+    ```yaml
+    profiles:
+      - name: my-profile
+        # If set to true, this profile will be applied automatically
+        default: true
+
+        # The spot policy can be "spot", "on-demand", or "auto"
+        spot_policy: auto
+        # Limit the maximum price of the instance per hour
+        max_price: 1.5
+        # Stop any run that runs longer than this duration
+        max_duration: 1d
+        # Use only these backends
+        backends: [azure, lambda]
+    ```
+
+    Check [`.dstack/profiles.yml`](../reference/profiles.yml.md) to see what properties can be defined there.
+
+A profile can be set as `default` to apply automatically to any run, or specified with `--profile NAME` in `dstack apply`.
+
 ## Projects
 
 If you're using multiple `dstack` projects (e.g., from different `dstack` servers),
Lines changed: 13 additions & 17 deletions
@@ -1,42 +1,32 @@
1-
# profiles.yml
1+
# .dstack/profiles.yml
22

3-
Sometimes, you may want to reuse the same parameters across different [`.dstack.yml`](dstack.yml.md) configurations.
3+
Sometimes, you may want to reuse the same parameters across runs or set your own defaults so you don’t have to repeat them in every run configuration. You can do this by defining a profile, either globally in `~/.dstack/profiles.yml` or locally in `.dstack/profiles.yml`.
44

5-
This can be achieved by defining those parameters in a profile.
5+
A profile can be set as `default` to apply automatically to any run, or specified with `--profile NAME` in `dstack apply`.
66

7-
Profiles can be defined on the repository level (via the `.dstack/profiles.yml` file in the root directory of the
8-
repository) or on the global level (via the `~/.dstack/profiles.yml` file).
9-
10-
Any profile can be marked as default so that it will be applied automatically for any run. Otherwise, you can refer to a specific profile
11-
via `--profile NAME` in `dstack apply`.
12-
13-
### Example
7+
Example:
148

159
<div editor-title=".dstack/profiles.yml">
1610

1711
```yaml
1812
profiles:
1913
- name: my-profile
14+
# If set to true, this profile will be applied automatically
15+
default: true
2016

2117
# The spot pololicy can be "spot", "on-demand", or "auto"
2218
spot_policy: auto
23-
2419
# Limit the maximum price of the instance per hour
2520
max_price: 1.5
26-
2721
# Stop any run if it runs longer that this duration
2822
max_duration: 1d
29-
3023
# Use only these backends
3124
backends: [azure, lambda]
32-
33-
# If set to true, this profile will be applied automatically
34-
default: true
3525
```
3626
3727
</div>
3828
39-
The profile configuration supports many properties. See below.
29+
The profile configuration supports most properties that a run configuration supports — see below.
4030
4131
### Root reference
4232
@@ -51,3 +41,9 @@ The profile configuration supports many properties. See below.
5141
#SCHEMA# dstack._internal.core.models.profiles.ProfileRetry
5242
overrides:
5343
show_root_heading: false
44+
45+
### `utilization_policy`
46+
47+
#SCHEMA# dstack._internal.core.models.profiles.UtilizationPolicy
48+
overrides:
49+
show_root_heading: false

src/dstack/_internal/core/backends/tensordock/__init__.py

Whitespace-only changes.

src/dstack/_internal/core/backends/tensordock/api_client.py

Lines changed: 0 additions & 104 deletions
This file was deleted.
