An InfiniBand cluster is now created automatically
when provisioning 8xH100 or 8xH200 instances using
a fleet configuration with `placement: cluster`.
Nebius cluster support is implemented using
`dstack` placement groups. This required several
changes to the placement group management logic:
- The offer for the master instance of the fleet
is passed to `Compute.create_placement_group`,
which allows setting different placement group
settings based on the offer. Nebius requires
different settings for H100 and H200 clusters.
- `Compute.is_suitable_placement_group` is
introduced to allow choosing an appropriate
placement group when creating the master
instance and filtering offers for non-master
instances based on backend-specific placement
group properties. Nebius currently only provides
homogeneous clusters, so offers need to be
filtered based on the placement group.
- The placement group object is passed to
`Compute.create_instance` to allow adding the
instance to the placement group using its
backend-specific properties, such as cluster ID
on Nebius.
- The placement group name is generated at master
instance provisioning time, not at fleet
creation time. This allows the same fleet to have
different placement group names and avoids name
conflicts, since multiple placement groups can be
created while `dstack` is trying different offers
for the master instance.
- Placement groups that were created during master
instance provisioning but didn't end up being
used are now cleaned up. Nebius quotas limit the
number of clusters, so unused clusters need to
be cleaned up quickly, without waiting for fleet
deletion.
- If all offers fail for the master instance,
`dstack` no longer attempts to provision the
other fleet instances, so that they are not
provisioned without a placement group or without
connectivity at all.
- Placement group creation errors are now handled
gracefully, so that `dstack` can move on to
other master instance offers, which may lead to
creating different placement groups. For
example, if `dstack` cannot create a cluster in
one Nebius region because of a missing quota, it
may attempt to create a cluster in another
region.
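
The flow described above can be sketched roughly as
follows. This is a toy model, not the actual `dstack`
code: only the three `Compute` method names come from
this change, while their signatures, the `Offer`,
`PlacementGroup`, `Instance` and `ToyCompute` classes,
`delete_placement_group`, `provision_master`,
`filter_offers`, and the region names are all
hypothetical.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    region: str
    gpu: str  # e.g. "H100" or "H200"


@dataclass
class PlacementGroup:
    name: str
    region: str
    gpu: str  # stands in for backend-specific data, e.g. a cluster ID


@dataclass
class Instance:
    offer: Offer
    placement_group_name: str


class ToyCompute:
    """Hypothetical backend; real dstack signatures may differ."""

    def __init__(self, regions_without_quota=(), no_capacity=()):
        self.regions_without_quota = set(regions_without_quota)
        self.no_capacity = set(no_capacity)
        self.groups = {}  # name -> PlacementGroup that currently exists

    def create_placement_group(self, name, master_offer):
        # Group settings are derived from the master instance offer.
        if master_offer.region in self.regions_without_quota:
            raise RuntimeError(f"no cluster quota in {master_offer.region}")
        pg = PlacementGroup(name, master_offer.region, master_offer.gpu)
        self.groups[name] = pg
        return pg

    def is_suitable_placement_group(self, pg, offer):
        # Homogeneous clusters: same region and same GPU type only.
        return pg.region == offer.region and pg.gpu == offer.gpu

    def create_instance(self, offer, pg):
        if offer in self.no_capacity:
            raise RuntimeError(f"no capacity for {offer}")
        return Instance(offer, pg.name)

    def delete_placement_group(self, pg):
        del self.groups[pg.name]


def provision_master(compute, fleet_name, offers):
    """Try offers in order; return (instance, group) or (None, None)."""
    created, used, instance = [], None, None
    for offer in offers:
        # Reuse a group created for an earlier offer if it fits this one.
        pg = next(
            (g for g in created
             if compute.is_suitable_placement_group(g, offer)),
            None,
        )
        if pg is None:
            # The name is generated per attempt, not per fleet.
            name = f"{fleet_name}-{uuid.uuid4().hex[:8]}"
            try:
                pg = compute.create_placement_group(name, offer)
            except RuntimeError:
                continue  # e.g. missing quota: move on to the next offer
            created.append(pg)
        try:
            instance = compute.create_instance(offer, pg)
        except RuntimeError:
            continue
        used = pg
        break
    # Clean up groups that were created but did not end up being used.
    for g in created:
        if g is not used:
            compute.delete_placement_group(g)
    return instance, used


def filter_offers(compute, pg, offers):
    """Keep only the non-master offers that fit the chosen group."""
    return [o for o in offers if compute.is_suitable_placement_group(pg, o)]


offers = [
    Offer("region-a", "H100"),
    Offer("region-b", "H100"),
    Offer("region-b", "H200"),
]
compute = ToyCompute(
    regions_without_quota={"region-a"},
    no_capacity={Offer("region-b", "H100")},
)
master, group = provision_master(compute, "my-fleet", offers)
if master is None:
    worker_offers = []  # no master: do not provision other instances
else:
    worker_offers = filter_offers(compute, group, offers)
```

In this run the first offer fails on quota, the second
creates a group but gets no instance, and the third
succeeds with a new group. The group left over from the
second offer is deleted right away, and the remaining
instances only see offers that match the chosen group.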