Commit 4c3fce5

Fix VM image selection for SXM instance types
The _get_image function checked gpu_type (e.g. 'A100') for 'SXM', but gpuhunt normalizes GPU names and strips the SXM qualifier. Check the instance type name instead (e.g. 'a100-80gb-sxm-ib.8x') which preserves the '-sxm' indicator.

Without this fix, SXM-IB instances used the PCIe docker image which lacks IB drivers, HPC-X, and NCCL topology files.

Verified with a 2-node A100-SXM-IB NCCL all_reduce test: 193 GB/s bus bandwidth.

Made-with: Cursor
1 parent 8cd5ec1 commit 4c3fce5

1 file changed

Lines changed: 5 additions & 3 deletions

src/dstack/_internal/core/backends/crusoe/compute.py

@@ -94,10 +94,12 @@
 IMAGE_BASE = "ubuntu22.04:latest"
 
 
-def _get_image(gpu_type: str) -> str:
+def _get_image(instance_name: str, gpu_type: str) -> str:
     if not gpu_type:
         return IMAGE_BASE
-    if "SXM" in gpu_type:
+    # Check instance name for SXM -- gpu_type from gpuhunt is normalized (e.g. "A100")
+    # and doesn't contain "SXM", but instance names like "a100-80gb-sxm-ib.8x" do.
+    if "-sxm" in instance_name.lower():
         return IMAGE_SXM_DOCKER
     if "MI3" in gpu_type:
         return IMAGE_ROCM
@@ -216,7 +218,7 @@ def create_instance(
         gpus = instance_offer.instance.resources.gpus
         gpu_type = gpus[0].name if gpus else ""
         instance_type_name = instance_offer.instance.name
-        image = _get_image(gpu_type)
+        image = _get_image(instance_type_name, gpu_type)
 
         needs_data_disk = not _has_ephemeral_disk(instance_offer)
         # Always include storage setup: it auto-detects /dev/vdb (data disk) or
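
The behavior change in the first hunk can be reproduced in isolation. The following is a minimal sketch under stated assumptions, not the code from compute.py: the values of IMAGE_SXM_DOCKER and IMAGE_ROCM are placeholders (the diff does not show them), IMAGE_PCIE_DOCKER and the final fallthrough are hypothetical stand-ins for the part of _get_image below the hunk (the commit message only says a "PCIe docker image" was selected), and the names get_image_old and get_image_new exist only for this side-by-side comparison.

```python
IMAGE_BASE = "ubuntu22.04:latest"
# Placeholder values: the real image names are not visible in this diff.
IMAGE_SXM_DOCKER = "<sxm-docker-image>"
IMAGE_ROCM = "<rocm-image>"
# Hypothetical name for the PCIe image mentioned in the commit message.
IMAGE_PCIE_DOCKER = "<pcie-docker-image>"


def get_image_old(gpu_type: str) -> str:
    """Pre-fix check: looks for "SXM" in the normalized GPU name."""
    if not gpu_type:
        return IMAGE_BASE
    if "SXM" in gpu_type:
        return IMAGE_SXM_DOCKER
    if "MI3" in gpu_type:
        return IMAGE_ROCM
    return IMAGE_PCIE_DOCKER  # assumed fallthrough, not shown in the hunk


def get_image_new(instance_name: str, gpu_type: str) -> str:
    """Post-fix check: looks for "-sxm" in the instance type name."""
    if not gpu_type:
        return IMAGE_BASE
    if "-sxm" in instance_name.lower():
        return IMAGE_SXM_DOCKER
    if "MI3" in gpu_type:
        return IMAGE_ROCM
    return IMAGE_PCIE_DOCKER  # assumed fallthrough, not shown in the hunk


# Values taken from the commit message: normalized GPU name and instance type.
gpu_type = "A100"
instance_name = "a100-80gb-sxm-ib.8x"

old_image = get_image_old(gpu_type)
new_image = get_image_new(instance_name, gpu_type)
print("old:", old_image)
print("new:", new_image)
```

With the normalized name "A100", the old check never matches "SXM" and falls through to the non-SXM image, while the new check matches on the instance type name. The `.lower()` call also makes the match case-insensitive, so an upper-case instance name would still select the SXM image.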
