Add RayJob and Slurm support for Ray APIs + integration with run.Experiment #236

Merged: hemildesai merged 18 commits into main from hemil/ray-slurm on May 23, 2025

Conversation

@hemildesai (Contributor)

Usage with RayCluster API:

cluster = RayCluster(
    name="test",
    executor=your_slurm_executor,
    # pre_ray_start_commands=[
    #     "pip install uv && echo 'unset RAY_RUNTIME_ENV_HOOK' >> /home/ray/.bashrc"
    # ],
)
job_id = cluster.schedule_job(
    name="test_job",
    executor=executor,
    command=command,
    workdir="path-to-your-workdir",
)
cluster.start(...)
cluster.stop(...)
cluster.port_forward(...)

Usage with run.Experiment:

task = run.Script(inline=command, metadata={"use_with_ray_cluster": True})

with run.Experiment("test_ray_experiment") as exp:
    exp.add(task, executor=your_slurm_executor, name="test_ray_task")
    exp.run(detach=False, tail_logs=True)

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Comment thread nemo_run/core/execution/slurm.py Fixed
Comment thread nemo_run/core/execution/utils.py Fixed
Comment thread nemo_run/run/ray/slurm.py Fixed
Comment thread nemo_run/run/ray/slurm.py Fixed
hemildesai and others added 3 commits May 19, 2025 12:40
…ple times

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Signed-off-by: Hemil Desai <hemil.desai10@gmail.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
… autoescape=False

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Signed-off-by: Hemil Desai <hemil.desai10@gmail.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
… with implicit (fall through) returns

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Signed-off-by: Hemil Desai <hemil.desai10@gmail.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Comment thread nemo_run/run/ray/slurm.py Fixed
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Comment on lines +821 to +828
def start(
    self,
    command: str,
    workdir: str | None = None,
    runtime_env_yaml: str | None = None,
    pre_ray_start_commands: Optional[list[str]] = None,
    dryrun: bool = False,
):

Check notice

Code scanning / CodeQL

Explicit returns mixed with implicit (fall through) returns Note

Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.

Copilot Autofix

To fix the issue, we will add an explicit return None at the end of the start function. This ensures that the function's return behavior is clear and consistent, even if it is not intended to return a meaningful value. The change will be made at the end of the start function, after all existing logic.
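
For readers unfamiliar with this CodeQL note, the pattern it flags can be sketched in isolation. The functions below are hypothetical illustrations, not the `start` method from this PR: the first mixes an explicit `return` with a silent fall-through, and the second makes every exit path explicit, which is the kind of change the autofix proposes.

```python
from typing import Optional


def find_port_mixed(ports: list[int], wanted: int) -> Optional[int]:
    """Hypothetical example: one explicit return plus a silent fall-through."""
    for port in ports:
        if port == wanted:
            return port
    # Execution falls off the end here, so the caller implicitly gets None.


def find_port_explicit(ports: list[int], wanted: int) -> Optional[int]:
    """Same behavior, but every exit path states its return value."""
    for port in ports:
        if port == wanted:
            return port
    return None


if __name__ == "__main__":
    # Both variants behave identically; only the readability differs.
    print(find_port_mixed([8265, 6379], 6379))
    print(find_port_explicit([8265, 6379], 1234))
```

Both versions return the same values, so the explicit `return None` changes readability rather than behavior. A function that is never meant to return a value could instead drop its explicit `return <value>` statements.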


Suggested changeset 1
nemo_run/run/ray/kuberay.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/nemo_run/run/ray/kuberay.py b/nemo_run/run/ray/kuberay.py
--- a/nemo_run/run/ray/kuberay.py
+++ b/nemo_run/run/ray/kuberay.py
@@ -894,2 +894,3 @@
         # *job* helpers, keeping cluster classes focused on cluster lifecycle
+        return None
         # only.
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
Signed-off-by: Hemil Desai <hemild@nvidia.com>

@gwarmstrong left a comment

Works for me now!

@hemildesai changed the title from "Add SlurmRayCluster + integration with run.Experiment" to "Add RayJob and Slurm support for Ray APIs + integration with run.Experiment" on May 23, 2025
@hemildesai merged commit 252edfb into main on May 23, 2025
19 of 20 checks passed

4 participants