Commit 733a17c
Add Crusoe Cloud backend (#3602)
* Add Crusoe Cloud backend
Add a VM-based Crusoe Cloud backend supporting single-node and
multi-node (cluster) provisioning with InfiniBand.
Key features:
- gpuhunt online provider for offers with project quota filtering
- HMAC-SHA256 authenticated REST API client
- Image selection based on GPU type (SXM/PCIe/ROCm/CPU)
- Storage: persistent data disk for types without ephemeral NVMe;
  for types with ephemeral NVMe, auto-detects the drives and stripes
  them as RAID-0; moves containerd storage so containers get the full
  disk space
- Cluster support via IB partitions
- Two-phase termination with data disk cleanup
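
The HMAC-SHA256 request signing named in the feature list can be illustrated with a minimal sketch. This is not Crusoe's documented signing scheme: the canonical payload layout, the field order, the function names, and the example path are all assumptions chosen for illustration. Only the use of HMAC-SHA256 comes from the commit message.

```python
import base64
import hashlib
import hmac


def sign_request(secret_key: bytes, method: str, path: str, query: str, timestamp: str) -> str:
    """Return a URL-safe base64 HMAC-SHA256 signature over the request fields.

    The payload layout (one field per line) is a hypothetical canonical form,
    not the one any particular provider requires.
    """
    payload = "\n".join([path, query, method.upper(), timestamp]) + "\n"
    digest = hmac.new(secret_key, payload.encode("utf-8"), hashlib.sha256).digest()
    # Strip base64 padding so the signature is safe to embed in a header value.
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def verify_signature(
    secret_key: bytes, method: str, path: str, query: str, timestamp: str, signature: str
) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_request(secret_key, method, path, query, timestamp)
    return hmac.compare_digest(expected, signature)
```

Including the timestamp in the signed payload lets a server reject stale requests, and `hmac.compare_digest` avoids leaking timing information during verification.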
Tested end-to-end:
- L40S: fleet, dev env, GPU, configurable disk (200 GB), clean termination
- A100-PCIe: fleet, dev env, GPU, NVMe auto-mount (880 GB), clean termination
- A100-SXM-IB cluster: IB partition created, 1 node provisioned with IB
  and 8x NVMe RAID-0 (7 TB); 2nd node failed on capacity (out_of_stock)
- Offers: quota enforcement, disk sizes correct per instance type
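
The NVMe auto-mount and RAID-0 behaviour exercised in the tests above could be driven by a command sequence like the one this sketch builds. The backend's actual setup script is not shown in this commit message, so the function name, the `/dev/md0` array device, the ext4 filesystem, and the `/mnt/nvme` mount point are all assumptions. The sketch only builds the command strings and does not run them.

```python
import shlex
from typing import List


def build_nvme_setup_commands(devices: List[str], mount_point: str = "/mnt/nvme") -> List[str]:
    """Build shell commands that format and mount ephemeral NVMe drives.

    One drive is formatted directly; several drives are first striped
    into a RAID-0 array with mdadm. All paths here are illustrative.
    """
    if not devices:
        return []
    if len(devices) == 1:
        target = devices[0]
        commands: List[str] = []
    else:
        target = "/dev/md0"  # hypothetical array device name
        quoted = " ".join(shlex.quote(d) for d in devices)
        commands = [
            f"mdadm --create {target} --level=0 --raid-devices={len(devices)} {quoted}"
        ]
    commands += [
        f"mkfs.ext4 -F {shlex.quote(target)}",
        f"mkdir -p {shlex.quote(mount_point)}",
        f"mount {shlex.quote(target)} {shlex.quote(mount_point)}",
    ]
    return commands
```

For example, passing eight device paths yields an `mdadm --create` command with `--raid-devices=8` followed by the format and mount steps, while a single device skips the array creation entirely.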
Not tested (no capacity/quota):
- H100-SXM-IB, MI300X-IB, MI355X-RoCE (no hardware available)
- CPU-only instances c1a/s1a (no quota)
- Spot provisioning (disabled in gpuhunt, see TODO)
- Full 2-node cluster with IB connectivity test
TODOs:
- Spot: disabled until Crusoe confirms how to request spot billing
  via the VM create API endpoint
- gpuhunt dependency: currently installed from PR branch; switch to
  pinned version after gpuhunt PR #211 is merged and released
AI Assistance: This implementation was developed with AI assistance.
Co-authored-by: Cursor <cursoragent@cursor.com>
* Fetch Crusoe locations dynamically instead of hardcoding
Co-authored-by: Cursor <cursoragent@cursor.com>
* Fix VM image selection for SXM instance types
The _get_image function checked gpu_type (e.g. 'A100') for 'SXM', but
gpuhunt normalizes GPU names and strips the SXM qualifier. Check the
instance type name instead (e.g. 'a100-80gb-sxm-ib.8x') which
preserves the '-sxm' indicator.
Without this fix, SXM-IB instances used the PCIe docker image which
lacks IB drivers, HPC-X, and NCCL topology files. Verified with a
2-node A100-SXM-IB NCCL all_reduce test: 193 GB/s bus bandwidth.
Made-with: Cursor
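
The SXM fix described above amounts to keying the image choice on the instance type name rather than the normalized GPU name. A minimal sketch of that idea follows. It is not the actual `_get_image` implementation: the function name, the `gpu_vendor` parameter, and the returned flavor labels are hypothetical, and only the `'-sxm'` substring check and the example instance type `a100-80gb-sxm-ib.8x` come from the commit message.

```python
from typing import Optional


def select_image_flavor(instance_type: str, gpu_vendor: Optional[str]) -> str:
    """Pick an image flavor (hypothetical labels) for an instance type.

    The SXM check looks at the instance type name because a normalized
    GPU name such as 'A100' no longer carries the SXM qualifier.
    """
    if gpu_vendor is None:
        return "cpu"
    if gpu_vendor.lower() == "amd":
        return "rocm"
    if "-sxm" in instance_type.lower():
        return "sxm"
    return "pcie"
```

With this shape, `select_image_flavor("a100-80gb-sxm-ib.8x", "nvidia")` selects the SXM flavor, whereas a check against the GPU name `'A100'` alone would fall through to PCIe.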
* Switch gpuhunt dependency from PR branch to main
Made-with: Cursor
* Add TODOs to pin gpuhunt and remove allow-direct-references before merging
Made-with: Cursor
* Pin gpuhunt==0.1.17 (matches master)
Made-with: Cursor
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 03f6838 commit 733a17c
File tree
14 files changed, +887 -1 lines changed
- docs/docs
  - concepts
  - reference/server
- frontend/src/types
- src
  - dstack/_internal/core
    - backends
      - crusoe
    - models/backends
  - tests/_internal/server
    - routers
    - services