Update DDP example #1364

Merged: soumith merged 4 commits into pytorch:main from jafraustro:jafraust/ddp on Jul 14, 2025

jafraustro (Contributor) commented Jul 8, 2025

Update the DDP example to use the accelerator API and switch to torchrun for distributed launches.

CC: @dvrogozh, @msaroufim
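
For readers unfamiliar with the accelerator API mentioned in the description, the following is a minimal sketch of device-agnostic device selection. It is not the code from this PR: the helper name `pick_device_type` is hypothetical, and the fallbacks to `"cpu"` (when PyTorch is missing or predates `torch.accelerator`) are assumptions made so the snippet runs anywhere.

```python
def pick_device_type() -> str:
    """Return the accelerator type (e.g. "cuda" or "xpu"), or "cpu" as a fallback.

    Hypothetical helper for illustration only; not taken from the PR.
    """
    try:
        import torch
    except ImportError:
        # Assumption: treat a missing PyTorch install as CPU-only.
        return "cpu"

    # torch.accelerator is only present in recent PyTorch releases.
    accelerator = getattr(torch, "accelerator", None)
    if accelerator is not None and accelerator.is_available():
        # current_accelerator() returns a torch.device; .type is its backend name.
        return accelerator.current_accelerator().type
    return "cpu"


if __name__ == "__main__":
    print(f"Selected device type: {pick_device_type()}")
```

The point of this style is that the same script can run on CUDA, XPU, or other backends without hard-coding `"cuda"`.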

jafraustro marked this pull request as ready for review on July 8, 2025 at 15:06

netlify (bot) commented Jul 8, 2025

Deploy Preview for pytorch-examples-preview canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | afdd3ce |
| 🔍 Latest deploy log | https://app.netlify.com/projects/pytorch-examples-preview/deploys/68712b124833f100080d2c69 |

dvrogozh (Contributor) left a comment

LGTM. @jafraustro: CC reviewers in the PR description.

soumith (Contributor) commented Jul 10, 2025

The CI is failing for the Distributed examples because something can't find numpy.

jafraustro (Contributor, Author) commented

> The CI is failing for the Distributed examples because something can't find numpy.

Hi, I changed the torch version in the requirements.txt file.

× No solution found when resolving dependencies:
╰─▶ Because only torch<=2.7.1 is available and you require torch>=2.8

- Replace deprecated launch utility with torchrun (see PyTorch docs: https://pytorch.org/docs/stable/distributed.html#launch-utility)
- Update README to reflect torchrun usage
- Remove main.py (no longer referenced in documentation)
- Update CI to test example.py script instead

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>
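
As background for the switch to torchrun described in the commit message above: torchrun passes each worker its rank information through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The sketch below shows one way a script could read them. The `LaunchInfo` class and `read_launch_info` function are hypothetical names, not code from this PR, and the single-process defaults are an assumption so the script also runs without torchrun.

```python
import os
from dataclasses import dataclass

# Example launch (the process count of 2 is an assumption):
#   torchrun --nproc_per_node=2 example.py


@dataclass(frozen=True)
class LaunchInfo:
    """Rank information for one worker process (hypothetical helper type)."""
    rank: int
    local_rank: int
    world_size: int


def read_launch_info(env=None) -> LaunchInfo:
    """Read the variables torchrun sets, defaulting to a single-process run."""
    env = os.environ if env is None else env
    return LaunchInfo(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


if __name__ == "__main__":
    info = read_launch_info()
    print(f"rank {info.rank} of {info.world_size} (local rank {info.local_rank})")
```

Reading from the environment means the script no longer needs a `--local_rank` command-line argument, which is what the deprecated launch utility passed.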

soumith (Contributor) commented Jul 11, 2025

It's failing now with some new errors.

Signed-off-by: jafraustro <jaime.fraustro.valdez@intel.com>

jafraustro (Contributor, Author) commented

> It's failing now with some new errors.

Hello @soumith,

The errors occurred because there were not enough GPUs available. To address this, I added a minimum GPU verification step, similar to the approach used in the tensor_parallel_example.py example. This ensures the script only runs when the required number of GPUs is present.
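
A minimum-GPU check of the kind described above might look roughly like the following sketch. It is an illustration under stated assumptions, not the code merged in this PR: the function names, the default of two devices, and the fallback to a count of 0 when PyTorch is not installed are all hypothetical choices.

```python
import sys


def available_device_count() -> int:
    """Count visible accelerators; returns 0 if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        # Assumption: no PyTorch means no usable devices.
        return 0

    # Prefer the device-agnostic accelerator API when this PyTorch has it.
    accelerator = getattr(torch, "accelerator", None)
    if accelerator is not None and accelerator.is_available():
        return accelerator.device_count()
    # Older PyTorch: fall back to the CUDA-specific count (0 without CUDA).
    return torch.cuda.device_count()


def has_enough_devices(required: int, available: int) -> bool:
    """Pure comparison, kept separate so it is easy to test."""
    return available >= required


def verify_min_gpu_count(min_gpus: int = 2) -> bool:
    """Return True only if at least `min_gpus` devices are visible."""
    return has_enough_devices(min_gpus, available_device_count())


if __name__ == "__main__":
    min_gpus = 2  # assumed requirement for illustration
    if not verify_min_gpu_count(min_gpus):
        print(f"Need at least {min_gpus} GPUs to run; exiting.")
        sys.exit()
    print("Enough GPUs available; continuing.")
```

Exiting early with a message, rather than failing inside the distributed setup, keeps the run from erroring out on machines with too few devices.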

soumith merged commit f84bcb3 into pytorch:main on Jul 14, 2025. 8 checks passed.

soumith (Contributor) commented Jul 14, 2025

Thank you!
