change the create dist job function to support creating a single nod… #240

Merged
hemildesai merged 3 commits into NVIDIA-NeMo:main from delgadof:delgadof/fix-single-node-training-job on Jun 5, 2025

@delgadof delgadof commented May 20, 2025

Contributor

This pull request refactors the create_distributed_job method to a more general-purpose create_training_job method in the DGXCloudExecutor class, improving functionality and clarity. It also updates related test cases to align with the new method.
closes #239

Core functionality changes:

  • Renamed create_distributed_job to create_training_job in nemo_run/core/execution/dgxcloud.py, adding support for both single-node and multi-node training jobs on DGX Cloud. The method now validates inputs, determines the appropriate endpoint based on node count, and constructs payloads accordingly.
  • Updated the launch method to call the new create_training_job method instead of the old create_distributed_job.
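
The flow described above (validate inputs, pick an endpoint from the node count, build the payload) could look roughly like the sketch below. This is not the code from this PR: the function name `build_training_job_request`, the endpoint paths, and the payload field names are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical endpoint paths, for illustration only (not confirmed by this PR).
SINGLE_NODE_ENDPOINT = "/workloads/trainings"
MULTI_NODE_ENDPOINT = "/workloads/distributed"


@dataclass
class TrainingJobRequest:
    url: str
    payload: dict


def build_training_job_request(base_url, name, cmd, nodes, gpus_per_node):
    """Validate inputs, then choose the endpoint and payload from the node count."""
    if not name:
        raise ValueError("name must be non-empty")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")

    # Fields shared by both job types (field names are assumptions).
    spec = {"command": " ".join(cmd), "gpus": gpus_per_node}

    if nodes > 1:
        # Multi-node: distributed endpoint plus a worker count.
        endpoint = MULTI_NODE_ENDPOINT
        spec["numWorkers"] = nodes
    else:
        # Single-node: plain training endpoint, no distributed fields.
        endpoint = SINGLE_NODE_ENDPOINT

    return TrainingJobRequest(
        url=base_url.rstrip("/") + endpoint,
        payload={"name": name, "spec": spec},
    )
```

Keeping the branch on `nodes` in one place means a single-node job never carries distributed-only fields, which is the case issue #239 is about.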

Test updates:

  • Renamed and updated the test method test_create_distributed_job to test_create_training_job in test/core/execution/test_dgxcloud.py to reflect the new method name.
  • Updated test cases that previously mocked create_distributed_job to mock create_training_job instead, ensuring compatibility with the refactored method.
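
As a rough illustration of the mocking change, a test that patches `create_training_job` might be shaped like the sketch below. `FakeExecutor` is a hypothetical stand-in, not the real `DGXCloudExecutor`; its `launch` signature and the `workloadId` response key are assumptions.

```python
from unittest.mock import MagicMock, patch


class FakeExecutor:
    """Hypothetical stand-in for DGXCloudExecutor; not the real class."""

    def __init__(self, nodes):
        self.nodes = nodes

    def create_training_job(self, name, cmd):
        # The real method would call the DGX Cloud API; tests mock it out.
        raise RuntimeError("network call not available in tests")

    def launch(self, name, cmd):
        response = self.create_training_job(name, cmd)
        if response.status_code not in (200, 202):
            raise RuntimeError(f"job creation failed: {response.status_code}")
        return response.json()["workloadId"]  # assumed response key


def test_launch_calls_create_training_job():
    executor = FakeExecutor(nodes=1)
    fake_response = MagicMock(status_code=200)
    fake_response.json.return_value = {"workloadId": "job-123"}

    # Patch the renamed method rather than the old create_distributed_job.
    with patch.object(
        FakeExecutor, "create_training_job", return_value=fake_response
    ) as mock_create:
        job_id = executor.launch("demo", ["python", "train.py"])

    mock_create.assert_called_once_with("demo", ["python", "train.py"])
    assert job_id == "job-123"
    return job_id
```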

…e job and distributed jobs

Signed-off-by: Francisco Delgado <fdelgadolope@fdelgadolope-mlt.client.nvidia.com>
@delgadof delgadof marked this pull request as ready for review May 20, 2025 21:05
@hemildesai hemildesai requested a review from roclark May 21, 2025 19:56
Signed-off-by: Francisco Delgado <fdelgadolope@fdelgadolope-mlt.client.nvidia.com>
@delgadof

Contributor Author

Modified the formatting to meet the repository requirements

roclark previously approved these changes May 22, 2025

@roclark roclark left a comment

Contributor

LGTM! Do we also want to add a test for launching a single-node job, to ensure it goes down the single-node path in create_training_job? Thanks for putting this together!

Signed-off-by: Francisco Delgado <fdelgadolope@fdelgadolope-mlt.client.nvidia.com>
@delgadof

Contributor Author

@roclark You are right. Commit 7962d92 adds that test.

roclark commented May 22, 2025

Contributor

Awesome work, thanks!!

@hemildesai hemildesai merged commit 2aa0a60 into NVIDIA-NeMo:main Jun 5, 2025
18 of 20 checks passed

Development

Successfully merging this pull request may close these issues.

Nemo Fine-tunning fails in DGXC with Run:AI after finishes when creating a single node, 8 h100 job using DGXC Executor

3 participants