Rename dgx cloud executor to runai executor #253
Closed
hemildesai wants to merge 3 commits into
roclark
requested changes
Jun 4, 2025
roclark
left a comment
Contributor
Thanks for doing this, @hemildesai! Looks good overall; the only change needed is a branding update to NVIDIA Run:ai in the applicable places.
| def your_dgx_executor(nodes: int, gpus_per_node: int, container_image: str): | ||
| # Ensure these are set correctly for your DGX Cloud environment | ||
| def your_runai_executor(nodes: int, gpus_per_node: int, container_image: str): | ||
| # Ensure these are set correctly for your RunAI environment |
Contributor
This should technically be NVIDIA Run:ai. I'll add a comment to each of these I find.
| ``` | ||
|
|
||
| For a complete end-to-end example using DGX Cloud with NeMo, refer to the [NVIDIA DGX Cloud NeMo End-to-End Workflow Example](https://docs.nvidia.com/dgx-cloud/run-ai/latest/nemo-e2e-example.html). | ||
| For a complete end-to-end example using RunAI with NeMo, refer to the [NVIDIA RunAI NeMo End-to-End Workflow Example](https://docs.nvidia.com/dgx-cloud/run-ai/latest/nemo-e2e-example.html). |
Contributor
Should be NVIDIA Run:ai for both
| The `DGXCloudExecutor` integrates with a DGX Cloud cluster's Run:ai API to launch distributed jobs. It uses REST API calls to authenticate, identify the target project and cluster, and submit the job specification. | ||
|
|
||
| > **_WARNING:_** Currently, the `DGXCloudExecutor` is only supported when launching experiments *from* a pod running on the DGX Cloud cluster itself. Furthermore, this launching pod must have access to a Persistent Volume Claim (PVC) where the experiment/job directories will be created, and this same PVC must also be configured to be mounted by the job being launched. | ||
| The `RunAIExecutor` integrates with the Run:ai API to launch distributed jobs. It uses REST API calls to authenticate, identify the target project and cluster, and submit the job specification. |
Contributor
This should be NVIDIA Run:ai
| class RunAIExecutor(Executor): | ||
| """ | ||
| Dataclass to configure a DGX Executor. | ||
| Dataclass to configure a RunAI Executor. |
| Dataclass to configure a RunAI Executor. | ||
|
|
||
| This executor integrates with a DGX cloud endpoint for launching jobs | ||
| This executor integrates with a RunAI cloud endpoint for launching jobs |
| DGXCloudState.COMPLETED: AppState.SUCCEEDED, | ||
| DGXCloudState.TERMINATING: AppState.RUNNING, | ||
| DGXCloudState.UNKNOWN: AppState.FAILED, | ||
| # Local placeholder for storing RunAI job states |
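For readers following the state-mapping hunk above, here is a minimal sketch of how the renamed states could map onto application states. Only the three mappings shown in the diff (COMPLETED, TERMINATING, UNKNOWN) come from the PR. The `RunAIState` name, its string values, the local `AppState` stand-in (the real code uses the torchx `AppState`), and the `to_app_state` helper are assumptions made for illustration.

```python
from enum import Enum


class AppState(Enum):
    """Local stand-in for the torchx AppState used in the PR (assumption)."""
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


class RunAIState(Enum):
    """Hypothetical renamed counterpart of DGXCloudState; values are illustrative."""
    COMPLETED = "Completed"
    TERMINATING = "Terminating"
    UNKNOWN = "Unknown"


# The three mappings below mirror the ones visible in the diff hunk.
RUNAI_STATES = {
    RunAIState.COMPLETED: AppState.SUCCEEDED,
    RunAIState.TERMINATING: AppState.RUNNING,
    RunAIState.UNKNOWN: AppState.FAILED,
}


def to_app_state(state):
    """Translate a job state, treating anything unmapped as FAILED (an assumption)."""
    return RUNAI_STATES.get(state, AppState.FAILED)
```

Falling back to `FAILED` for unmapped states is a design choice of this sketch, not something the PR states.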
| class RunAIRequest: | ||
| """ | ||
| Wrapper around the torchx AppDef and the DGX executor. | ||
| Wrapper around the torchx AppDef and the RunAI executor. |
| RunAIRequest(app=app, executor=executor, cmd=cmd, name=role.name), | ||
| # Minimal function to show the config, if any | ||
| lambda req: f"DGX job for app: {req.app.name}, cmd: {' '.join(cmd)}, executor: {executor}", | ||
| lambda req: f"RunAI job for app: {req.app.name}, cmd: {' '.join(cmd)}, executor: {executor}", |
| def schedule(self, dryrun_info: AppDryRunInfo[RunAIRequest]) -> str: | ||
| """ | ||
| Launches a job on DGX using the DGXExecutor. Returns an app_id | ||
| Launches a job on RunAI using the RunAIExecutor. Returns an app_id |
| job_id, status = executor.launch(name=req.name, cmd=req.cmd) | ||
| if not job_id: | ||
| raise RuntimeError("Failed scheduling run on DGX: no job_id returned") | ||
| raise RuntimeError("Failed scheduling run on RunAI: no job_id returned") |
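The guard in the hunk above can be exercised in isolation. This is a minimal sketch, not the scheduler's actual code: `FakeExecutor` and `schedule_job` are hypothetical names, and returning the bare `job_id` is an assumption, since the format of the `app_id` returned by the real `schedule` method is not shown in the diff.

```python
class FakeExecutor:
    """Hypothetical stand-in for the executor; returns a preset job_id."""

    def __init__(self, job_id):
        self._job_id = job_id

    def launch(self, name, cmd):
        # Mirrors the (job_id, status) tuple unpacked in the diff hunk.
        return self._job_id, "Pending"


def schedule_job(executor, name, cmd):
    """Launch a job and fail loudly when no job_id comes back."""
    job_id, status = executor.launch(name=name, cmd=cmd)
    if not job_id:
        raise RuntimeError("Failed scheduling run on RunAI: no job_id returned")
    return job_id
```

A caller would see the job identifier on success and a `RuntimeError` when the executor returns an empty or missing `job_id`.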
Contributor
Author
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
cbf2c53 to
06b4bbf
Compare
Contributor
This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.
Contributor
This PR was closed because it has been inactive for 7 days since being marked as stale.
Closes #241