Commit 733a17c

Add Crusoe Cloud backend (#3602)
* Add Crusoe Cloud backend

  Add a VM-based Crusoe Cloud backend supporting single-node and multi-node
  (cluster) provisioning with InfiniBand.

  Key features:
  - gpuhunt online provider for offers with project quota filtering
  - HMAC-SHA256 authenticated REST API client
  - Image selection based on GPU type (SXM/PCIe/ROCm/CPU)
  - Storage: persistent data disk for types without ephemeral NVMe;
    auto-detects and RAID-0s NVMe for types with ephemeral storage;
    moves containerd storage so containers get the full disk space
  - Cluster support via IB partitions
  - Two-phase termination with data disk cleanup

  Tested end-to-end:
  - L40S: fleet, dev env, GPU, configurable disk (200GB), clean termination
  - A100-PCIe: fleet, dev env, GPU, NVMe auto-mount (880GB), clean termination
  - A100-SXM-IB cluster: IB partition created, 1 node provisioned with IB
    and 8x NVMe RAID-0 (7TB); 2nd node failed on capacity (out_of_stock)
  - Offers: quota enforcement, disk sizes correct per instance type

  Not tested (no capacity/quota):
  - H100-SXM-IB, MI300X-IB, MI355X-RoCE (no hardware available)
  - CPU-only instances c1a/s1a (no quota)
  - Spot provisioning (disabled in gpuhunt, see TODO)
  - Full 2-node cluster with IB connectivity test

  TODOs:
  - Spot: disabled until Crusoe confirms how to request spot billing
    via the VM create API endpoint
  - gpuhunt dependency: currently installed from PR branch; switch to
    pinned version after gpuhunt PR #211 is merged and released

  AI Assistance: This implementation was developed with AI assistance.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* Fetch Crusoe locations dynamically instead of hardcoding

  Co-authored-by: Cursor <cursoragent@cursor.com>

* Fix VM image selection for SXM instance types

  The _get_image function checked gpu_type (e.g. 'A100') for 'SXM', but
  gpuhunt normalizes GPU names and strips the SXM qualifier. Check the
  instance type name instead (e.g. 'a100-80gb-sxm-ib.8x') which preserves
  the '-sxm' indicator.

  Without this fix, SXM-IB instances used the PCIe docker image which lacks
  IB drivers, HPC-X, and NCCL topology files.

  Verified with a 2-node A100-SXM-IB NCCL all_reduce test: 193 GB/s bus
  bandwidth.

  Made-with: Cursor

* Switch gpuhunt dependency from PR branch to main

  Made-with: Cursor

* Add TODOs to pin gpuhunt and remove allow-direct-references before merging

  Made-with: Cursor

* Pin gpuhunt==0.1.17 (matches master)

  Made-with: Cursor

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
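The SXM image-selection fix described in the commit message can be illustrated with a small sketch. It is not the backend's actual `_get_image` function: the name `pick_image_flavor`, the flavor labels, and the `is_amd` flag are hypothetical and only serve to show why matching on the instance type name works where matching on the normalized GPU name does not. Only the values `'A100'` and `'a100-80gb-sxm-ib.8x'` come from the message.

```python
from typing import Optional


def pick_image_flavor(
    instance_type: str, gpu_name: Optional[str], is_amd: bool = False
) -> str:
    """Return a hypothetical image flavor label: cpu, rocm, sxm, or pcie."""
    if gpu_name is None:
        # No GPU on this instance type: use a CPU image.
        return "cpu"
    if is_amd:
        # AMD GPUs need a ROCm-based image (flag is an assumption of this sketch).
        return "rocm"
    # The normalized GPU name (e.g. "A100") no longer carries "SXM",
    # so the check looks at the instance type name instead.
    if "-sxm" in instance_type.lower():
        return "sxm"
    return "pcie"


# The check that the commit message describes as broken: never true for "A100".
old_check = "SXM" in "A100"

# The instance type name from the commit message keeps the "-sxm" marker.
flavor = pick_image_flavor("a100-80gb-sxm-ib.8x", "A100")
```

With these inputs `old_check` is `False` while `flavor` is `"sxm"`, which matches the behavior the message describes: the old check silently fell through to the PCIe image.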
1 parent 03f6838 commit 733a17c

14 files changed: +887 -1 lines changed

docs/docs/concepts/backends.md

Lines changed: 28 additions & 0 deletions
@@ -929,6 +929,34 @@ projects:
 * `sizes` - read
 * `ssh_key` - create, read, update,delete

+### Crusoe Cloud
+
+Log into your [Crusoe Cloud](https://console.crusoecloud.com/) console and create an API key
+under your account settings. Note your project ID from the project settings page.
+
+Then, go ahead and configure the backend:
+
+<div editor-title="~/.dstack/server/config.yml">
+
+```yaml
+projects:
+- name: main
+  backends:
+    - type: crusoe
+      project_id: your-project-id
+      creds:
+        type: access_key
+        access_key: your-access-key
+        secret_key: your-secret-key
+      regions:
+        - us-east1-a
+        - us-southcentral1-a
+```
+
+</div>
+
+`regions` is optional. If not specified, all available Crusoe regions are used.
+
 ### Hot Aisle

 Log in to the SSH TUI as described in the [Hot Aisle Quick Start](https://hotaisle.xyz/quick-start/).
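The Crusoe Cloud section in this hunk states that `regions` is optional and that all available regions are used when it is omitted. A minimal sketch of that rule follows. How the backend actually applies it is not shown in this diff, so the function name `effective_regions` and its signature are hypothetical.

```python
from typing import List, Optional


def effective_regions(
    configured: Optional[List[str]], available: List[str]
) -> List[str]:
    """Return the regions to use, given the optional `regions` setting."""
    if configured is None:
        # `regions` omitted: fall back to every available region.
        return list(available)
    allowed = set(configured)
    # Keep only configured regions that are actually available.
    return [region for region in available if region in allowed]


available = ["us-east1-a", "us-southcentral1-a"]
all_regions = effective_regions(None, available)
one_region = effective_regions(["us-east1-a"], available)
```

Here `all_regions` contains both regions and `one_region` contains only `us-east1-a`.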

docs/docs/reference/server/config.yml.md

Lines changed: 17 additions & 0 deletions
@@ -335,6 +335,23 @@ to configure [backends](../../concepts/backends.md) and other [server-level sett
         type:
             required: true

+##### `projects[n].backends[type=crusoe]` { #crusoe data-toc-label="crusoe" }
+
+#SCHEMA# dstack._internal.core.backends.crusoe.models.CrusoeBackendConfigWithCreds
+    overrides:
+        show_root_heading: false
+        type:
+            required: true
+    item_id_prefix: crusoe-
+
+###### `projects[n].backends[type=crusoe].creds` { #crusoe-creds data-toc-label="creds" }
+
+#SCHEMA# dstack._internal.core.backends.crusoe.models.CrusoeAccessKeyCreds
+    overrides:
+        show_root_heading: false
+        type:
+            required: true
+
 ##### `projects[n].backends[type=hotaisle]` { #hotaisle data-toc-label="hotaisle" }

 #SCHEMA# dstack._internal.core.backends.hotaisle.models.HotAisleBackendConfigWithCreds

frontend/src/types/backend.d.ts

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 declare type TBackendType =
     | 'aws'
     | 'azure'
+    | 'crusoe'
     | 'cudo'
     | 'datacrunch'
     | 'dstack'

pyproject.toml

Lines changed: 4 additions & 1 deletion
@@ -256,6 +256,9 @@ fluentbit = [
     "elasticsearch>=8.0.0",
     "dstack[server]",
 ]
+crusoe = [
+    "dstack[server]",
+]
 all = [
-    "dstack[gateway,server,aws,azure,gcp,verda,kubernetes,lambda,nebius,oci,fluentbit]",
+    "dstack[gateway,server,aws,azure,gcp,verda,kubernetes,lambda,nebius,oci,crusoe,fluentbit]",
 ]

src/dstack/_internal/core/backends/configurators.py

Lines changed: 9 additions & 0 deletions
@@ -35,6 +35,15 @@
 except ImportError:
     pass

+try:
+    from dstack._internal.core.backends.crusoe.configurator import (
+        CrusoeConfigurator,
+    )
+
+    _CONFIGURATOR_CLASSES.append(CrusoeConfigurator)
+except ImportError:
+    pass
+
 try:
     from dstack._internal.core.backends.cudo.configurator import (
         CudoConfigurator,
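The hunk above registers the Crusoe configurator only when its module imports successfully, so a missing optional dependency does not break server startup. The same idea can be sketched in a self-contained way. This is a variation that uses `importlib` rather than the `try`/`from ... import` form in the diff, and the names `registered_classes` and `register_optional` are hypothetical. Standard-library modules stand in for backend packages.

```python
import importlib
from typing import List

# Hypothetical registry, standing in for a list of configurator classes.
registered_classes: List[type] = []


def register_optional(module_name: str, class_name: str) -> bool:
    """Register a class if its module can be imported; skip it otherwise."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Module (or one of its dependencies) is not installed: skip quietly.
        return False
    registered_classes.append(getattr(module, class_name))
    return True


# A module that exists is registered; a missing one is skipped without error.
found = register_optional("json", "JSONDecoder")
missing = register_optional("no_such_backend_package_xyz", "MissingConfigurator")
```

After this runs, `found` is `True`, `missing` is `False`, and the registry holds a single class. One difference from the diff's form: `getattr` raises `AttributeError` for a missing class name, whereas `from module import Name` raises `ImportError`.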
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+from dstack._internal.core.backends.base.backend import Backend
+from dstack._internal.core.backends.crusoe.compute import CrusoeCompute
+from dstack._internal.core.backends.crusoe.models import CrusoeConfig
+from dstack._internal.core.models.backends.base import BackendType
+
+
+class CrusoeBackend(Backend):
+    TYPE = BackendType.CRUSOE
+    COMPUTE_CLASS = CrusoeCompute
+
+    def __init__(self, config: CrusoeConfig):
+        self.config = config
+        self._compute = CrusoeCompute(self.config)
+
+    def compute(self) -> CrusoeCompute:
+        return self._compute
