Skip to content

Fix Distributed Initialization by Validating Legacy tcp:// and Updating Tests#3758

Draft
TahaZahid05 wants to merge 11 commits into
masterfrom
fix-torchrun-ci
Draft

Fix Distributed Initialization by Validating Legacy tcp:// and Updating Tests#3758
TahaZahid05 wants to merge 11 commits into
masterfrom
fix-torchrun-ci

Conversation

@TahaZahid05
Copy link
Copy Markdown
Collaborator

@TahaZahid05 TahaZahid05 commented May 25, 2026

Problem

Using tcp:// directly in PyTorch's init_method causes multi-process communication to hang and fail in modern PyTorch versions.

Solution

  1. Validation & User Error: Added explicit validation in Parallel and _NativeDistModel that detects legacy tcp:// initialization methods and raises a helpful, actionable ValueError guiding the user on how to use PyTorch's native TCPStore instead.
  2. Updated Tests: Cleaned up the distributed test suite to remove legacy tcp:// connections from test parameterizations and updated low-level context tests to use modern dist.TCPStore objects directly.

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

@github-actions github-actions Bot added the module: distributed Distributed module label May 25, 2026
@TahaZahid05 TahaZahid05 requested a review from vfdev-5 May 26, 2026 19:38
@TahaZahid05 TahaZahid05 changed the title parse tcp:// init_method into TCPStore and improve keyword forwardingFix torchrun ci Fix Distributed Initialization by Validating Legacy tcp:// and Updating Tests May 26, 2026
Comment thread ignite/distributed/comp_models/native.py Outdated
Comment thread ignite/distributed/launcher.py Outdated
Comment thread ignite/distributed/comp_models/native.py Outdated
Comment thread ignite/distributed/comp_models/native.py Outdated
Comment thread ignite/distributed/launcher.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates Ignite’s PyTorch-native distributed initialization to proactively reject legacy init_method="tcp://..." (which can hang on newer PyTorch versions) and introduces/validates TCPStore-based initialization paths, with accompanying test suite updates.

Changes:

  • Added tcp:// init_method validation with actionable ValueError guidance in idist.Parallel and _NativeDistModel.
  • Plumbed store through idist.Parallelidist.initialize → native dist model initialization.
  • Refactored distributed tests/fixtures to avoid legacy tcp:// and to use env:// or TCPStore.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
ignite/distributed/launcher.py Adds store support to Parallel and blocks legacy tcp:// init methods.
ignite/distributed/comp_models/native.py Adds store support and blocks legacy tcp:// init methods in native model initialization.
ignite/distributed/utils.py Updates initialize() API docs to include store usage.
tests/ignite/conftest.py Switches single-node test contexts from tcp:// to env:// and sets MASTER_ADDR/PORT.
tests/ignite/distributed/test_launcher.py Removes tcp:// parameterizations; adds regression tests for tcp validation and TCPStore init.
tests/ignite/distributed/comp_models/test_native.py Replaces tcp:// init with TCPStore in low-level/native context tests; adds new regression tests.
tests/ignite/distributed/utils/test_native.py Removes legacy tcp:// from parameterized init_method test cases.
Comments suppressed due to low confidence (1)

ignite/distributed/launcher.py:261

  • When nproc_per_node is provided, Parallel uses idist.spawn() and does not pass the store anywhere (store is only forwarded to idist.initialize() in the non-spawn path). Accepting a non-None store in this mode is misleading; it should be rejected to avoid silently ignoring user configuration.
        if self.backend is not None:
            if nproc_per_node is not None:
                self._spawn_params = self._setup_spawn_params(
                    nproc_per_node, nnodes, node_rank, master_addr, master_port, init_method, **spawn_kwargs
                )

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 233 to 238
else:
arg_names = ["nproc_per_node", "nnodes", "node_rank", "master_addr", "master_port"]
arg_values = [nproc_per_node, nnodes, node_rank, master_addr, master_port]
for name, value in zip(arg_names, arg_values):
if value is not None:
raise ValueError(f"If backend is None, argument '{name}' should be also None, but given {value}")
Comment on lines +240 to +251
if init_method is not None and init_method.startswith("tcp://"):
raise ValueError(
f"TCP initialization via init_method='{init_method}' will hang. "
"To fix this, please configure MASTER_ADDR and MASTER_PORT in the environment and "
"use 'env://' (or omit init_method). Alternatively, you can configure a TCPStore and "
"pass it using the 'store' argument. For example:\n\n"
" import torch.distributed as dist\n"
' store = dist.TCPStore("<master_addr>", <master_port>, world_size, is_master)\n'
" # Then pass the store to idist.Parallel:\n"
" # idist.Parallel(backend=backend, store=store, ...)"
)

Comment thread tests/ignite/conftest.py
Comment on lines 232 to 235
free_port = _setup_free_port(local_rank)
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = str(free_port)

Comment thread tests/ignite/conftest.py
Comment on lines 268 to +270
free_port = _setup_free_port(local_rank)
init_method = f"tcp://localhost:{free_port}"
temp_file = None
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = str(free_port)
Comment thread tests/ignite/conftest.py
Comment on lines 495 to +502
free_port = _setup_free_port(local_rank)
init_method = f"tcp://localhost:{free_port}"

dist_info = {
"world_size": world_size,
"rank": local_rank,
"init_method": init_method,
}
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = str(free_port)
dist_info = {
"world_size": world_size,
"rank": local_rank,
"init_method": "env://",
}
Comment on lines +362 to +364
is_master = rank == 0
store = dist.TCPStore("0.0.0.0", 2222, world_size=world_size, is_master=is_master)
dist.init_process_group(true_backend, store=store, world_size=world_size, rank=rank)
Comment on lines +450 to +452
is_master = local_rank == 0
store = dist.TCPStore("0.0.0.0", 2222, world_size=world_size, is_master=is_master)
dist.init_process_group("nccl", store=store, world_size=world_size, rank=local_rank)
Comment on lines +747 to +751
try:
store = dist.TCPStore("0.0.0.0", 2222, world_size=1, is_master=True, use_libuv=False)
except TypeError:
# PyTorch < 2.4 doesn't support use_libuv
store = dist.TCPStore("0.0.0.0", 2222, world_size=1, is_master=True)
Comment thread ignite/distributed/comp_models/native.py
Comment thread tests/ignite/conftest.py
Comment on lines 513 to +517
if temp_file:
temp_file.close()
else:
os.environ.pop("MASTER_ADDR", None)
os.environ.pop("MASTER_PORT", None)
@github-actions github-actions Bot added the ci CI label May 30, 2026
@TahaZahid05 TahaZahid05 marked this pull request as draft June 2, 2026 17:21
if backend is not None:
self._create_from_backend(
backend, timeout=timeout, init_method=init_method, world_size=world_size, rank=rank, **kwargs
backend,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no change in this, let's revert the code formatting.

assert _NativeDistModel.create_from_context() is None

dist.init_process_group(true_backend, "tcp://0.0.0.0:2222", world_size=1, rank=0)
store = dist.TCPStore("0.0.0.0", 2222, world_size=1, is_master=True)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a fixture to get a freeport in the conf.py, can we use it here?


@pytest.mark.distributed
@pytest.mark.parametrize("init_method", [None, "tcp://0.0.0.0:22334", "FILE"])
@pytest.mark.parametrize("init_method", [None, "FILE"])
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of removing a test, let's do store arg instead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci CI module: distributed Distributed module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants