Add RayCluster support for DGX Cloud Lepton (#389)
* Add Ray support for DGX Cloud Lepton
Add support for launching a RayCluster on DGX Cloud Lepton and submitting
RayJobs to it using the Lepton SDK. This uses the new RayCluster feature on
DGX Cloud Lepton to spin clusters up and down dynamically via the Python
SDK; jobs can be submitted directly to deployed clusters.
Signed-Off-By: Robert Clark <roclark@nvidia.com>
* making name unique and add resource shape for lepton raycluster
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* adding head node reference in RayCluster, support defining secrets in RayCluster and linting
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
* Remove Slurm packager comments from RayCluster
Removed the placeholder Slurm packager handling comments from the Lepton
RayCluster code. For now, the "workdir" parameter should be used for
transferring local data to the remote Ray cluster.
Signed-Off-By: Robert Clark <roclark@nvidia.com>
* Fix RayCluster head resource shape
Fix issue to ensure the proper head node resource shape is used if it
isn't explicitly given by the user.
Signed-Off-By: Robert Clark <roclark@nvidia.com>
* Update LeptonRay comments
Updated the comments in the LeptonRayCluster and LeptonRayJob classes to
accurately reflect the code.
Signed-Off-By: Robert Clark <roclark@nvidia.com>
* Fix RayJob logs streaming connection dropping
The RayJob log stream would sometimes time out and reset, flooding the
terminal with repeated log output as the connection continually reset.
Signed-Off-By: Robert Clark <roclark@nvidia.com>
* Make RayCluster head resource shape optional
The head node resource shape for a LeptonRayCluster should be optional. If
it isn't specified by the user, it should default to the same shape used
for the worker nodes.
Signed-Off-By: Robert Clark <roclark@nvidia.com>
* Add doc for DGXC Lepton RayClusters
Added an example to the Ray quick-start guide on how to use RayClusters
and RayJobs with NeMo-Run on DGX Cloud Lepton.
Signed-Off-By: Robert Clark <roclark@nvidia.com>
* Update license date
Signed-Off-By: Robert Clark <roclark@nvidia.com>
* Fix Ray guide typo
Signed-Off-By: Robert Clark <roclark@nvidia.com>
* Make cluster readiness timeout a variable
Allows users to specify how long to wait for a RayCluster to be created
on DGX Cloud Lepton.
Signed-Off-By: Robert Clark <roclark@nvidia.com>
* Remove implicit returns
Signed-Off-By: Robert Clark <roclark@nvidia.com>
* Remove unused local variable
Signed-Off-By: Robert Clark <roclark@nvidia.com>
* Fix linting errors
Signed-Off-By: Robert Clark <roclark@nvidia.com>
* Fix formatting errors
Signed-Off-By: Robert Clark <roclark@nvidia.com>
* Move LeptonExecutor parameters to definition
Move the RayCluster-specific settings to the LeptonExecutor class for a
more seamless interface for launching and interacting with RayClusters
on DGX Cloud Lepton.
Signed-Off-By: Robert Clark <roclark@nvidia.com>
* Updated leptonai package version
Need a newer version of the leptonai SDK to support RayClusters.
Signed-Off-By: Robert Clark <roclark@nvidia.com>
* Add Lepton RayCluster tests
Signed-Off-By: Robert Clark <roclark@nvidia.com>
---------
Signed-off-by: Robert Clark <roclark@nvidia.com>
Signed-off-by: Zoey Zhang <zozhang@nvidia.com>
Co-authored-by: Zoey Zhang <zozhang@nvidia.com>
-|`run.ray.cluster.RayCluster`| Lifecycle of a Ray **cluster** (create ⇒ wait ⇢ status ⇢ port-forward ⇢ delete). |`KubeRayExecutor`, `SlurmExecutor`|
+|`run.ray.cluster.RayCluster`| Lifecycle of a Ray **cluster** (create ⇒ wait ⇢ status ⇢ port-forward ⇢ delete). |`KubeRayExecutor`, `SlurmExecutor`, `LeptonExecutor`|
|`run.ray.job.RayJob`| Lifecycle of a Ray **job** (submit ⇒ monitor ⇢ logs ⇢ cancel). | same |

-The two helpers share a uniform API; the chosen *Executor* decides whether we talk to the **KubeRay** operator (K8s) or a **Slurm** job under the hood.
+The two helpers share a uniform API; the chosen *Executor* decides whether we talk to the **KubeRay** operator (K8s), **DGX Cloud Lepton's RayCluster**, or a **Slurm** job under the hood.

```mermaid
classDiagram
    RayCluster <|-- KubeRayCluster
    RayCluster <|-- SlurmRayCluster
    RayCluster <|-- LeptonRayCluster
    RayJob <|-- KubeRayJob
    RayJob <|-- SlurmRayJob
    RayJob <|-- LeptonRayJob
```

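One way to picture the "Executor decides the backend" idea is a lookup from executor type to backend class. The sketch below uses stand-in classes only: the stub executors, the empty backend classes, the `_BACKENDS` table, and `backend_for` are hypothetical names for illustration and are not taken from NeMo-Run's implementation.

```python
from dataclasses import dataclass


# Stand-in executor types (assumption: the real ones take many more options).
@dataclass
class KubeRayExecutor:
    namespace: str = "default"


@dataclass
class SlurmExecutor:
    account: str = "demo"


@dataclass
class LeptonExecutor:
    node_group: str = "my-node-group"


# Empty stand-ins for the backend classes named in the diagram.
class KubeRayCluster:
    """Placeholder for the KubeRay backend."""


class SlurmRayCluster:
    """Placeholder for the Slurm backend."""


class LeptonRayCluster:
    """Placeholder for the DGX Cloud Lepton backend."""


# Hypothetical lookup table: executor type -> backend class.
_BACKENDS = {
    KubeRayExecutor: KubeRayCluster,
    SlurmExecutor: SlurmRayCluster,
    LeptonExecutor: LeptonRayCluster,
}


def backend_for(executor):
    """Return the backend class that matches the executor's type."""
    for executor_type, backend in _BACKENDS.items():
        if isinstance(executor, executor_type):
            return backend
    raise ValueError(f"Unsupported executor: {type(executor).__name__}")
```

With this shape, supporting a new platform means adding one executor type and one backend class, while user code keeps calling the same helper.
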
## 2. KubeRay quick-start
@@ -183,7 +185,77 @@ cluster.stop()
* `executor.packager = run.GitArchivePackager()` if you prefer packaging a git tree instead of rsync.
* `cluster.port_forward()` opens an SSH tunnel from *your laptop* to the Ray dashboard running on the head node.

-## 4. API reference cheat-sheet
+## 4. DGX Cloud Lepton RayCluster quick-start

```python
import os
from pathlib import Path

import nemo_run as run
from nemo_run.core.execution.lepton import LeptonExecutor
from nemo_run.run.ray.cluster import RayCluster
from nemo_run.run.ray.job import RayJob

# 1) Create a LeptonExecutor and tweak defaults
mounts = [
    {
        "path": "/",
        "mount_path": "/nemo-workspace",
        "from": "node-nfs:lepton-shared-fs",
    }
]

executor = LeptonExecutor(
    resource_shape="gpu.8xh100",
    container_image="rayproject/ray:2.49.2-gpu",
    nemo_run_dir="/nemo-workspace/nemo-run",
    head_resource_shape="cpu.large",
    ray_version="2.49.2",
    mounts=mounts,
    node_group="my-node-group",
    nodes=1,
    nprocs_per_node=8,
    env_vars={
        "TORCH_HOME": "/nemo-workspace/.cache",
    },
    secret_vars=[
        {"WANDB_API_KEY": "WANDB_API_KEY"},
        {"HF_TOKEN": "HUGGING_FACE_HUB_TOKEN"},
    ],
    launcher="torchrun",
    image_pull_secrets=[],
    pre_launch_commands=[],
)

# 2) Bring up the RayCluster on DGX Cloud Lepton and show the status
cluster = RayCluster(
    name="lepton-ray-cluster",
    executor=executor,
)
cluster.start(timeout=1800)
cluster.status(display=True)

# 3) Submit a RayJob that runs inside the created RayCluster
job = RayJob(
    name="demo-lepton-ray-job",
    executor=executor,
    cluster_name="lepton-ray-cluster",
)
job.start(
    command="uv run python train.py --config cfgs/train.yaml cluster.num_nodes=2",
    workdir="/path/to/project/",  # rsync'ed from local to the RayCluster
)
job.status(display=True)  # Display the RayJob status
job.logs(follow=True)  # Tail the job logs as it runs

# 4) Tear down the RayCluster and free up resources
cluster.stop()
```

### Tips for DGX Cloud Lepton users
* The example above assumes the [DGX Cloud Lepton CLI](https://docs.nvidia.com/dgx-cloud/lepton/reference/cli/get-started/) is installed and authenticated.
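
A quick pre-flight check can catch a missing CLI before any cluster is requested. The sketch below assumes the CLI executable is named `lep`; that name is an assumption not stated in this guide, so adjust it if your installation differs. It verifies only that the executable is on `PATH`, not that it has been authenticated. `find_lepton_cli` is a hypothetical helper.

```python
import shutil


def find_lepton_cli(binary="lep"):
    """Return the full path of the CLI executable, or None if it is not on PATH.

    The default name `lep` is an assumption; pass a different name if needed.
    """
    return shutil.which(binary)


cli_path = find_lepton_cli()
if cli_path is None:
    print("DGX Cloud Lepton CLI not found on PATH; install and log in first.")
else:
    print(f"Found CLI at {cli_path}")
```
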

## 5. API reference cheat-sheet

```python
cluster = RayCluster(name, executor)
@@ -201,7 +273,7 @@ job.stop()

All methods are synchronous and **return as soon as** their work is done; the helpers hide the messy details (kubectl, squeue, ssh, …).

-## 5. Rolling your own CLI
+## 6. Rolling your own CLI

Because `RayCluster` and `RayJob` are plain Python, you can compose them inside **argparse**, **Typer**, **Click** – anything. Here is a minimal **argparse** script: