Commit 8a72c8c

Kubernetes: add multi-node support (#3141)
* Discover and set instance's internal_ip (PodIP)
* Fix region mismatch
* Add `privileged: true` support
* [runner] Set RLIMIT_MEMLOCK to unlimited. Fixes issues with InfiniBand/RDMA

Part-of: #3126
1 parent f7ef485 commit 8a72c8c

22 files changed: +129 -18 lines changed
contributing/BACKENDS.md

Lines changed: 13 additions & 10 deletions
@@ -27,15 +27,15 @@ git clone https://github.com/dstackai/gpuhunt.git
 
 - **Offline providers** offer static machine configurations that are not frequently updated.
 `gpuhunt` collects offline providers' instance offers on an hourly basis.
-Examples: `aws`, `gcp`, `azure`, etc.
+Examples: `aws`, `gcp`, `azure`, etc.
 - **Online providers** offer dynamic machine configurations that are available at the very moment
 when you fetch configurations (e.g., GPU marketplaces).
 `gpuhunt` collects online providers' instance offers each time a `dstack` user provisions a new instance.
 Examples: `tensordock`, `vastai`, etc.
 
 ### 1.3. Create the provider class
 
-Create the provider class file under `src/gpuhunt/providers`.
+Create the provider class file under `src/gpuhunt/providers`.
 
 Make sure your class extends the [`AbstractProvider`](https://github.com/dstackai/gpuhunt/blob/main/src/gpuhunt/providers/__init__.py)
 base class. See its docstrings for descriptions of the methods that your class should implement.
@@ -69,13 +69,13 @@ Refer to examples: [test_datacrunch.py](https://github.com/dstackai/gpuhunt/blob
 
 ### 1.6. Submit a pull request
 
-Once the cloud provider is added, submit a pull request.
+Once the cloud provider is added, submit a pull request.
 
 > Anything unclear? Ask questions on the [Discord server](https://discord.gg/u8SmfwPpMd).
 
 ## 2. Integrate the cloud provider to dstackai/dstack
 
-Once the provider is added to `gpuhunt`, we can proceed with implementing
+Once the provider is added to `gpuhunt`, we can proceed with implementing
 the corresponding backend with `dstack`. Follow the steps below.
 
 ### 2.1. Determine if you will implement a VM-based or a container-based backend
@@ -124,10 +124,10 @@ Then add these models to `AnyBackendConfig*` unions in [`src/dstack/_internal/co
 
 The script also generates `*BackendStoredConfig` that extends `*BackendConfig` to be able to store extra parameters in the DB. By the same logic, it generates `*Config` that extends `*BackendStoredConfig` with creds and uses it as the main `Backend` and `Compute` config instead of using `*BackendConfigWithCreds` directly.
 
-Refer to examples:
-[datacrunch](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/datacrunch/models.py),
-[aws](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/aws/models.py),
-[gcp](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/gcp/models.py),
+Refer to examples:
+[datacrunch](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/datacrunch/models.py),
+[aws](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/aws/models.py),
+[gcp](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/gcp/models.py),
 [azure](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/models.py), etc.
 
 ### 2.7. Implement the backend compute class
@@ -147,8 +147,8 @@ Go to `configurator.py` and implement custom `Configurator` logic. At minimum, y
 You may also need to validate other config parameters if there are any.
 
 Refer to examples: [datacrunch](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/datacrunch/configurator.py),
-[aws](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/aws/configurator.py),
-[gcp](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/gcp/configurator.py),
+[aws](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/aws/configurator.py),
+[gcp](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/gcp/configurator.py),
 [azure](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/azure/configurator.py), etc.
 
 Register configurator by appending it to `_CONFIGURATOR_CLASSES` in [`src/dstack/_internal/core/backends/configurators.py`](https://github.com/dstackai/dstack/blob/master/src/dstack/_internal/core/backends/configurators.py).
@@ -181,6 +181,9 @@ The agent controls the VM and starts Docker containers for users' jobs.
 Since `dstack` controls the entire VM, VM-based backends can support more features,
 such as blocks, instance volumes, privileged containers, and reusable instances.
 
+Note, all VM-based backend `Compute`s should subclass the `ComputeWithPrivilegedSupport` mixin,
+as the `dstack-shim` agent provides this functionality out of the box.
+
 To support a VM-based backend, `dstack` expects the following:
 
 - An API for creating and terminating VMs

runner/internal/executor/executor.go

Lines changed: 16 additions & 0 deletions
@@ -29,6 +29,7 @@ import (
 	"github.com/dstackai/dstack/runner/internal/schemas"
 	"github.com/dstackai/dstack/runner/internal/types"
 	"github.com/prometheus/procfs"
+	"golang.org/x/sys/unix"
 )
 
 // TODO: Tune these parameters for optimal experience/performance
@@ -518,6 +519,21 @@ func (ex *RunExecutor) execJob(ctx context.Context, jobLogFile io.Writer) error
 
 	cmd.Env = envMap.Render()
 
+	// Configure process resource limits
+	// TODO: Make rlimits customizable in the run configuration. Currently, we only set max locked memory
+	// to unlimited to fix the issue with InfiniBand/RDMA: "Cannot allocate memory".
+	// See: https://github.com/ofiwg/libfabric/issues/6437
+	// See: https://github.com/openucx/ucx/issues/8229
+	// Note: we already set RLIMIT_MEMLOCK to unlimited in the shim if we've detected IB devices
+	// (see configureHpcNetworkingIfAvailable() function), but, as it's on the shim side, it only works
+	// with VM-based backends.
+	rlimitMemlock := unix.Rlimit{Cur: unix.RLIM_INFINITY, Max: unix.RLIM_INFINITY}
+	// TODO: Check if we have CAP_SYS_RESOURCE. In container environments, even root usually doesn't have
+	// this capability.
+	if err := unix.Setrlimit(unix.RLIMIT_MEMLOCK, &rlimitMemlock); err != nil {
+		log.Error(ctx, "Failed to set resource limits", "err", err)
+	}
+
 	log.Trace(ctx, "Starting exec", "cmd", cmd.String(), "working_dir", cmd.Dir, "env", cmd.Env)
 
 	ptm, err := startCommand(cmd)
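
For readers more at home in Python, the best-effort pattern used in the hunk above can be sketched with the standard `resource` module. This is an illustration only and not code from the commit: the function name `raise_memlock_limit` is hypothetical, and the sketch assumes a Unix-like system where `RLIMIT_MEMLOCK` is defined. As in the Go code, a failure is reported and execution continues, since raising the hard limit normally requires `CAP_SYS_RESOURCE`.

```python
import resource


def raise_memlock_limit() -> bool:
    """Try to set RLIMIT_MEMLOCK (soft and hard) to unlimited.

    Returns True if the limit was changed, False if the request was refused.
    The name of this helper is hypothetical; it only mirrors the idea of the diff.
    """
    unlimited = (resource.RLIM_INFINITY, resource.RLIM_INFINITY)
    try:
        resource.setrlimit(resource.RLIMIT_MEMLOCK, unlimited)
    except (ValueError, OSError) as e:
        # An unprivileged process cannot raise its hard limit, so this branch
        # is the expected outcome in many container environments.
        print(f"Failed to set resource limits: {e}")
        return False
    return True


changed = raise_memlock_limit()
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
print(f"changed={changed} soft={soft} hard={hard}")
```

Child processes started afterwards inherit the limits of the calling process, which is why the commit sets the limit in the runner before starting the job command.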

src/dstack/_internal/core/backends/aws/compute.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,7 @@
2424
ComputeWithMultinodeSupport,
2525
ComputeWithPlacementGroupSupport,
2626
ComputeWithPrivateGatewaySupport,
27+
ComputeWithPrivilegedSupport,
2728
ComputeWithReservationSupport,
2829
ComputeWithVolumeSupport,
2930
generate_unique_gateway_instance_name,
@@ -90,6 +91,7 @@ def _ec2client_cache_methodkey(self, ec2_client, *args, **kwargs):
9091
class AWSCompute(
9192
ComputeWithAllOffersCached,
9293
ComputeWithCreateInstanceSupport,
94+
ComputeWithPrivilegedSupport,
9395
ComputeWithMultinodeSupport,
9496
ComputeWithReservationSupport,
9597
ComputeWithPlacementGroupSupport,

src/dstack/_internal/core/backends/azure/compute.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,7 @@
4343
ComputeWithCreateInstanceSupport,
4444
ComputeWithGatewaySupport,
4545
ComputeWithMultinodeSupport,
46+
ComputeWithPrivilegedSupport,
4647
generate_unique_gateway_instance_name,
4748
generate_unique_instance_name,
4849
get_gateway_user_data,
@@ -78,6 +79,7 @@
7879
class AzureCompute(
7980
ComputeWithAllOffersCached,
8081
ComputeWithCreateInstanceSupport,
82+
ComputeWithPrivilegedSupport,
8183
ComputeWithMultinodeSupport,
8284
ComputeWithGatewaySupport,
8385
Compute,

src/dstack/_internal/core/backends/base/compute.py

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -320,6 +320,15 @@ def _restrict_instance_offer_az_to_volumes_az(
320320
]
321321

322322

323+
class ComputeWithPrivilegedSupport:
324+
"""
325+
Must be subclassed to support runs with `privileged: true`.
326+
All VM-based Computes (that is, Computes that use the shim) should subclass this mixin.
327+
"""
328+
329+
pass
330+
331+
323332
class ComputeWithMultinodeSupport:
324333
"""
325334
Must be subclassed to support multinode tasks and cluster fleets.
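
The new class is a marker mixin: it defines no methods, and only its presence in a class's bases carries meaning. A minimal sketch of how such a marker could be consulted at run time follows. Apart from `ComputeWithPrivilegedSupport`, every name here (`Compute` as an empty stand-in, `ExampleVMCompute`, `ExampleContainerCompute`, `check_privileged`) is hypothetical and not taken from the `dstack` code base.

```python
class Compute:
    """Hypothetical stand-in for the base compute class."""


class ComputeWithPrivilegedSupport:
    """Marker mixin: declares that runs with `privileged: true` are supported."""


class ExampleVMCompute(ComputeWithPrivilegedSupport, Compute):
    """Hypothetical VM-based compute (uses the shim), so it carries the marker."""


class ExampleContainerCompute(Compute):
    """Hypothetical compute that does not declare privileged support."""


def check_privileged(compute: Compute, privileged: bool) -> None:
    """Reject a privileged run if the compute does not carry the marker mixin."""
    if privileged and not isinstance(compute, ComputeWithPrivilegedSupport):
        raise ValueError(f"{type(compute).__name__} does not support `privileged: true`")


check_privileged(ExampleVMCompute(), privileged=True)          # accepted
check_privileged(ExampleContainerCompute(), privileged=False)  # accepted
try:
    check_privileged(ExampleContainerCompute(), privileged=True)
except ValueError as e:
    print(e)
```

Because the marker has no behavior of its own, adding it to an existing compute class, as the per-backend hunks in this commit do, changes only what the class advertises.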

src/dstack/_internal/core/backends/cloudrift/compute.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@
44
Compute,
55
ComputeWithAllOffersCached,
66
ComputeWithCreateInstanceSupport,
7+
ComputeWithPrivilegedSupport,
78
get_shim_commands,
89
)
910
from dstack._internal.core.backends.base.offers import get_catalog_offers
@@ -27,6 +28,7 @@
2728
class CloudRiftCompute(
2829
ComputeWithAllOffersCached,
2930
ComputeWithCreateInstanceSupport,
31+
ComputeWithPrivilegedSupport,
3032
Compute,
3133
):
3234
def __init__(self, config: CloudRiftConfig):

src/dstack/_internal/core/backends/cudo/compute.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@
66
from dstack._internal.core.backends.base.compute import (
77
ComputeWithCreateInstanceSupport,
88
ComputeWithFilteredOffersCached,
9+
ComputeWithPrivilegedSupport,
910
generate_unique_instance_name,
1011
get_shim_commands,
1112
)
@@ -32,6 +33,7 @@
3233
class CudoCompute(
3334
ComputeWithFilteredOffersCached,
3435
ComputeWithCreateInstanceSupport,
36+
ComputeWithPrivilegedSupport,
3537
Compute,
3638
):
3739
def __init__(self, config: CudoConfig):

src/dstack/_internal/core/backends/datacrunch/compute.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@
88
from dstack._internal.core.backends.base.compute import (
99
ComputeWithAllOffersCached,
1010
ComputeWithCreateInstanceSupport,
11+
ComputeWithPrivilegedSupport,
1112
generate_unique_instance_name,
1213
get_shim_commands,
1314
)
@@ -39,6 +40,7 @@
3940
class DataCrunchCompute(
4041
ComputeWithAllOffersCached,
4142
ComputeWithCreateInstanceSupport,
43+
ComputeWithPrivilegedSupport,
4244
Compute,
4345
):
4446
def __init__(self, config: DataCrunchConfig):

src/dstack/_internal/core/backends/digitalocean_base/compute.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@
77
from dstack._internal.core.backends.base.compute import (
88
ComputeWithAllOffersCached,
99
ComputeWithCreateInstanceSupport,
10+
ComputeWithPrivilegedSupport,
1011
generate_unique_instance_name,
1112
get_user_data,
1213
)
@@ -40,6 +41,7 @@
4041
class BaseDigitalOceanCompute(
4142
ComputeWithAllOffersCached,
4243
ComputeWithCreateInstanceSupport,
44+
ComputeWithPrivilegedSupport,
4345
Compute,
4446
):
4547
def __init__(self, config: BaseDigitalOceanConfig, api_url: str, type: BackendType):

src/dstack/_internal/core/backends/features.py

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@
44
ComputeWithMultinodeSupport,
55
ComputeWithPlacementGroupSupport,
66
ComputeWithPrivateGatewaySupport,
7+
ComputeWithPrivilegedSupport,
78
ComputeWithReservationSupport,
89
ComputeWithVolumeSupport,
910
)
@@ -38,6 +39,10 @@ def _get_backends_with_compute_feature(
3839
configurator_classes=_configurator_classes,
3940
compute_feature_class=ComputeWithCreateInstanceSupport,
4041
)
42+
BACKENDS_WITH_PRIVILEGED_SUPPORT = _get_backends_with_compute_feature(
43+
configurator_classes=_configurator_classes,
44+
compute_feature_class=ComputeWithPrivilegedSupport,
45+
)
4146
BACKENDS_WITH_MULTINODE_SUPPORT = [BackendType.REMOTE] + _get_backends_with_compute_feature(
4247
configurator_classes=_configurator_classes,
4348
compute_feature_class=ComputeWithMultinodeSupport,
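
The body of `_get_backends_with_compute_feature` is not part of this diff. The sketch below shows one plausible way such a helper could derive a feature list from a registry of configurator classes by testing each compute class with `issubclass`. The attribute names `TYPE` and `COMPUTE_CLASS`, the two example configurators, and the backend names are assumptions made for illustration, not the actual `dstack` implementation.

```python
from typing import List, Sequence, Type


class ComputeWithPrivilegedSupport:
    """Marker mixin, as introduced in this commit."""


class ExampleVMCompute(ComputeWithPrivilegedSupport):
    """Hypothetical compute class that declares privileged support."""


class ExampleMarketplaceCompute:
    """Hypothetical compute class without the marker."""


class ExampleVMConfigurator:
    # Hypothetical registry entry: a backend name plus its compute class.
    TYPE = "example-vm"
    COMPUTE_CLASS = ExampleVMCompute


class ExampleMarketplaceConfigurator:
    TYPE = "example-marketplace"
    COMPUTE_CLASS = ExampleMarketplaceCompute


def get_backends_with_compute_feature(
    configurator_classes: Sequence[type],
    compute_feature_class: Type,
) -> List[str]:
    """Return the backend names whose compute class subclasses the feature mixin."""
    return [
        configurator.TYPE
        for configurator in configurator_classes
        if issubclass(configurator.COMPUTE_CLASS, compute_feature_class)
    ]


CONFIGURATOR_CLASSES = [ExampleVMConfigurator, ExampleMarketplaceConfigurator]
BACKENDS_WITH_PRIVILEGED_SUPPORT = get_backends_with_compute_feature(
    configurator_classes=CONFIGURATOR_CLASSES,
    compute_feature_class=ComputeWithPrivilegedSupport,
)
print(BACKENDS_WITH_PRIVILEGED_SUPPORT)  # ['example-vm']
```

With this shape, a backend appears in the list as soon as its compute class gains the mixin, so no separate list has to be maintained by hand.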
