Rename dgx cloud executor to runai executor #253

Closed
hemildesai wants to merge 3 commits into main from hemil/rename-dgxc-exec

Conversation

@hemildesai

Contributor

Closes #241

@roclark left a comment

Contributor

Thanks for doing this, @hemildesai! Looks good, just a branding update to NVIDIA Run:ai in the applicable places.

Comment thread docs/source/guides/execution.md Outdated
-def your_dgx_executor(nodes: int, gpus_per_node: int, container_image: str):
-# Ensure these are set correctly for your DGX Cloud environment
+def your_runai_executor(nodes: int, gpus_per_node: int, container_image: str):
+# Ensure these are set correctly for your RunAI environment

Contributor

This should technically be NVIDIA Run:ai. I'll add a comment to each of these I find.

Comment thread docs/source/guides/execution.md Outdated
```

-For a complete end-to-end example using DGX Cloud with NeMo, refer to the [NVIDIA DGX Cloud NeMo End-to-End Workflow Example](https://docs.nvidia.com/dgx-cloud/run-ai/latest/nemo-e2e-example.html).
+For a complete end-to-end example using RunAI with NeMo, refer to the [NVIDIA RunAI NeMo End-to-End Workflow Example](https://docs.nvidia.com/dgx-cloud/run-ai/latest/nemo-e2e-example.html).

Contributor

Should be NVIDIA Run:ai for both

Comment thread docs/source/guides/execution.md Outdated
The `DGXCloudExecutor` integrates with a DGX Cloud cluster's Run:ai API to launch distributed jobs. It uses REST API calls to authenticate, identify the target project and cluster, and submit the job specification.

> **_WARNING:_** Currently, the `DGXCloudExecutor` is only supported when launching experiments *from* a pod running on the DGX Cloud cluster itself. Furthermore, this launching pod must have access to a Persistent Volume Claim (PVC) where the experiment/job directories will be created, and this same PVC must also be configured to be mounted by the job being launched.
The `RunAIExecutor` integrates with the Run:ai API to launch distributed jobs. It uses REST API calls to authenticate, identify the target project and cluster, and submit the job specification.

Contributor

This should be NVIDIA Run:ai
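
The executor description in this thread (authenticate, identify the target project and cluster, submit the job specification) can be pictured with a small offline sketch. Everything in it is hypothetical: `submit_job`, `fake_send`, the `/token`, `/projects`, and `/jobs` routes, and the payload field names are placeholders chosen for illustration. They are not the real NVIDIA Run:ai REST API, nor the code in `nemo_run/core/execution/runai.py`.

```python
from typing import Any, Callable, Dict

# Hypothetical transport signature: (method, path, payload, headers) -> JSON dict.
Send = Callable[[str, str, Dict[str, Any], Dict[str, str]], Dict[str, Any]]


def submit_job(send: Send, app_id: str, app_secret: str,
               project_name: str, job_spec: Dict[str, Any]) -> str:
    """Run the three steps in order and return the new job's id."""
    # Step 1: authenticate and build the headers reused by later calls.
    creds = {"app_id": app_id, "app_secret": app_secret}
    token = send("POST", "/token", creds, {})["token"]
    headers = {"Authorization": f"Bearer {token}"}

    # Step 2: identify the target project (and the cluster it belongs to).
    projects = send("GET", "/projects", {}, headers)["projects"]
    project = next((p for p in projects if p["name"] == project_name), None)
    if project is None:
        raise LookupError(f"project {project_name!r} not found")

    # Step 3: submit the job specification, scoped to that project and cluster.
    payload = {**job_spec, "projectId": project["id"], "clusterId": project["clusterId"]}
    return send("POST", "/jobs", payload, headers)["job_id"]


def fake_send(method: str, path: str, payload: Dict[str, Any],
              headers: Dict[str, str]) -> Dict[str, Any]:
    """In-memory stand-in for an HTTP client, so the sketch runs offline."""
    if (method, path) == ("POST", "/token"):
        return {"token": "t0k3n"}
    if headers.get("Authorization") != "Bearer t0k3n":
        raise PermissionError("missing or invalid token")
    if (method, path) == ("GET", "/projects"):
        return {"projects": [{"name": "demo", "id": "p-1", "clusterId": "c-1"}]}
    if (method, path) == ("POST", "/jobs"):
        return {"job_id": f"{payload['projectId']}/{payload['name']}"}
    raise ValueError(f"unexpected call: {method} {path}")


job_id = submit_job(fake_send, "my-app", "my-secret", "demo", {"name": "train"})
```

Passing the transport in as a function keeps the ordering of the three steps visible and lets the flow be exercised without a cluster; a real client would replace `fake_send` with actual HTTP calls.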

Comment thread nemo_run/core/execution/runai.py Outdated
class RunAIExecutor(Executor):
"""
-Dataclass to configure a DGX Executor.
+Dataclass to configure a RunAI Executor.

Contributor

NVIDIA Run:ai

Comment thread nemo_run/core/execution/runai.py Outdated
Dataclass to configure a RunAI Executor.

-This executor integrates with a DGX cloud endpoint for launching jobs
+This executor integrates with a RunAI cloud endpoint for launching jobs

Contributor

NVIDIA Run:ai

DGXCloudState.COMPLETED: AppState.SUCCEEDED,
DGXCloudState.TERMINATING: AppState.RUNNING,
DGXCloudState.UNKNOWN: AppState.FAILED,
# Local placeholder for storing RunAI job states

Contributor

NVIDIA Run:ai
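
For readers unfamiliar with the mapping whose tail appears in the hunk above, here is a minimal sketch of how executor job states might be translated into scheduler states. Only the three pairs visible in the diff (`COMPLETED`, `TERMINATING`, `UNKNOWN`) come from the PR. The `JobState` and `AppState` enums are simplified stand-ins defined locally (the real code uses `DGXCloudState` and torchx's `AppState`), and the `PENDING` member, the `to_app_state` helper, and its fallback to `AppState.UNKNOWN` are assumptions made for illustration.

```python
from enum import Enum


class JobState(Enum):
    """Hypothetical stand-in for the executor's job-state enum."""
    COMPLETED = "Completed"
    TERMINATING = "Terminating"
    UNKNOWN = "Unknown"
    PENDING = "Pending"  # hypothetical member, deliberately left unmapped


class AppState(Enum):
    """Simplified stand-in for torchx's AppState."""
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"


# The three pairs shown in the diff hunk.
STATE_MAP = {
    JobState.COMPLETED: AppState.SUCCEEDED,
    JobState.TERMINATING: AppState.RUNNING,
    JobState.UNKNOWN: AppState.FAILED,
}


def to_app_state(state: JobState) -> AppState:
    """Translate a job state, falling back to UNKNOWN (an assumption) if unmapped."""
    return STATE_MAP.get(state, AppState.UNKNOWN)
```

Note that, per the hunk, a job that is still terminating is reported as running, and an unknown job state is treated as a failure.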

class RunAIRequest:
"""
-Wrapper around the torchx AppDef and the DGX executor.
+Wrapper around the torchx AppDef and the RunAI executor.

Contributor

NVIDIA Run:ai

RunAIRequest(app=app, executor=executor, cmd=cmd, name=role.name),
# Minimal function to show the config, if any
-lambda req: f"DGX job for app: {req.app.name}, cmd: {' '.join(cmd)}, executor: {executor}",
+lambda req: f"RunAI job for app: {req.app.name}, cmd: {' '.join(cmd)}, executor: {executor}",

Contributor

NVIDIA Run:ai

def schedule(self, dryrun_info: AppDryRunInfo[RunAIRequest]) -> str:
"""
-Launches a job on DGX using the DGXExecutor. Returns an app_id
+Launches a job on RunAI using the RunAIExecutor. Returns an app_id

Contributor

NVIDIA Run:ai

job_id, status = executor.launch(name=req.name, cmd=req.cmd)
if not job_id:
-raise RuntimeError("Failed scheduling run on DGX: no job_id returned")
+raise RuntimeError("Failed scheduling run on RunAI: no job_id returned")

Contributor

NVIDIA Run:ai
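
The launch-and-check pattern in the hunk above can be exercised in isolation with a stub. This is a sketch under stated assumptions rather than the scheduler's actual code: `RunRequest` and `StubExecutor` are hypothetical stand-ins, the function returns the raw `job_id` instead of building a full app_id, and the error text uses the reviewer's requested "NVIDIA Run:ai" wording, which may differ from what the final commit contains.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RunRequest:
    """Hypothetical, trimmed-down request: only the fields used below."""
    name: str
    cmd: List[str]


class StubExecutor:
    """Hypothetical executor whose launch() returns a preset job id."""

    def __init__(self, job_id: Optional[str]):
        self._job_id = job_id

    def launch(self, name: str, cmd: List[str]) -> Tuple[Optional[str], Optional[str]]:
        # Mirror the (job_id, status) pair unpacked in the diff hunk.
        status = "Pending" if self._job_id else None
        return self._job_id, status


def schedule(executor: StubExecutor, req: RunRequest) -> str:
    """Launch the job and fail loudly if no job id comes back."""
    job_id, status = executor.launch(name=req.name, cmd=req.cmd)
    if not job_id:
        raise RuntimeError("Failed scheduling run on NVIDIA Run:ai: no job_id returned")
    return job_id


req = RunRequest(name="train", cmd=["python", "train.py"])
launched = schedule(StubExecutor("job-123"), req)
```

Raising as soon as `launch` returns no id keeps a failed submission from being recorded as a valid run.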

@hemildesai

Contributor Author

@roclark Updated everything in cbf2c53

Signed-off-by: Hemil Desai <hemild@nvidia.com>
@hemildesai force-pushed the hemil/rename-dgxc-exec branch from cbf2c53 to 06b4bbf on June 4, 2025 18:35
@github-actions

Contributor

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions bot added the Stale label on Jul 24, 2025
@github-actions

Contributor

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions bot closed this on Jul 31, 2025
Development

Successfully merging this pull request may close these issues.

Rename DGXCloudExecutor to RunAIExecutor

2 participants