Fix Distributed Initialization by Validating Legacy tcp:// and Updating Tests#3758
Fix Distributed Initialization by Validating Legacy tcp:// and Updating Tests#3758TahaZahid05 wants to merge 11 commits into
Conversation
aff62c9 to
1127fd7
Compare
There was a problem hiding this comment.
Pull request overview
This PR updates Ignite’s PyTorch-native distributed initialization to proactively reject legacy init_method="tcp://..." (which can hang on newer PyTorch versions) and introduces/validates TCPStore-based initialization paths, with accompanying test suite updates.
Changes:
- Added
tcp://init_method validation with actionableValueErrorguidance inidist.Paralleland_NativeDistModel. - Plumbed
storethroughidist.Parallel→idist.initialize→ native dist model initialization. - Refactored distributed tests/fixtures to avoid legacy
tcp://and to useenv://orTCPStore.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
ignite/distributed/launcher.py |
Adds store support to Parallel and blocks legacy tcp:// init methods. |
ignite/distributed/comp_models/native.py |
Adds store support and blocks legacy tcp:// init methods in native model initialization. |
ignite/distributed/utils.py |
Updates initialize() API docs to include store usage. |
tests/ignite/conftest.py |
Switches single-node test contexts from tcp:// to env:// and sets MASTER_ADDR/PORT. |
tests/ignite/distributed/test_launcher.py |
Removes tcp:// parameterizations; adds regression tests for tcp validation and TCPStore init. |
tests/ignite/distributed/comp_models/test_native.py |
Replaces tcp:// init with TCPStore in low-level/native context tests; adds new regression tests. |
tests/ignite/distributed/utils/test_native.py |
Removes legacy tcp:// from parameterized init_method test cases. |
Comments suppressed due to low confidence (1)
ignite/distributed/launcher.py:261
- When nproc_per_node is provided, Parallel uses idist.spawn() and does not pass the store anywhere (store is only forwarded to idist.initialize() in the non-spawn path). Accepting a non-None store in this mode is misleading; it should be rejected to avoid silently ignoring user configuration.
if self.backend is not None:
if nproc_per_node is not None:
self._spawn_params = self._setup_spawn_params(
nproc_per_node, nnodes, node_rank, master_addr, master_port, init_method, **spawn_kwargs
)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| else: | ||
| arg_names = ["nproc_per_node", "nnodes", "node_rank", "master_addr", "master_port"] | ||
| arg_values = [nproc_per_node, nnodes, node_rank, master_addr, master_port] | ||
| for name, value in zip(arg_names, arg_values): | ||
| if value is not None: | ||
| raise ValueError(f"If backend is None, argument '{name}' should be also None, but given {value}") |
| if init_method is not None and init_method.startswith("tcp://"): | ||
| raise ValueError( | ||
| f"TCP initialization via init_method='{init_method}' will hang. " | ||
| "To fix this, please configure MASTER_ADDR and MASTER_PORT in the environment and " | ||
| "use 'env://' (or omit init_method). Alternatively, you can configure a TCPStore and " | ||
| "pass it using the 'store' argument. For example:\n\n" | ||
| " import torch.distributed as dist\n" | ||
| ' store = dist.TCPStore("<master_addr>", <master_port>, world_size, is_master)\n' | ||
| " # Then pass the store to idist.Parallel:\n" | ||
| " # idist.Parallel(backend=backend, store=store, ...)" | ||
| ) | ||
|
|
| free_port = _setup_free_port(local_rank) | ||
| os.environ["MASTER_ADDR"] = "localhost" | ||
| os.environ["MASTER_PORT"] = str(free_port) | ||
|
|
| free_port = _setup_free_port(local_rank) | ||
| init_method = f"tcp://localhost:{free_port}" | ||
| temp_file = None | ||
| os.environ["MASTER_ADDR"] = "localhost" | ||
| os.environ["MASTER_PORT"] = str(free_port) |
| free_port = _setup_free_port(local_rank) | ||
| init_method = f"tcp://localhost:{free_port}" | ||
|
|
||
| dist_info = { | ||
| "world_size": world_size, | ||
| "rank": local_rank, | ||
| "init_method": init_method, | ||
| } | ||
| os.environ["MASTER_ADDR"] = "localhost" | ||
| os.environ["MASTER_PORT"] = str(free_port) | ||
| dist_info = { | ||
| "world_size": world_size, | ||
| "rank": local_rank, | ||
| "init_method": "env://", | ||
| } |
| is_master = rank == 0 | ||
| store = dist.TCPStore("0.0.0.0", 2222, world_size=world_size, is_master=is_master) | ||
| dist.init_process_group(true_backend, store=store, world_size=world_size, rank=rank) |
| is_master = local_rank == 0 | ||
| store = dist.TCPStore("0.0.0.0", 2222, world_size=world_size, is_master=is_master) | ||
| dist.init_process_group("nccl", store=store, world_size=world_size, rank=local_rank) |
| try: | ||
| store = dist.TCPStore("0.0.0.0", 2222, world_size=1, is_master=True, use_libuv=False) | ||
| except TypeError: | ||
| # PyTorch < 2.4 doesn't support use_libuv | ||
| store = dist.TCPStore("0.0.0.0", 2222, world_size=1, is_master=True) |
| if temp_file: | ||
| temp_file.close() | ||
| else: | ||
| os.environ.pop("MASTER_ADDR", None) | ||
| os.environ.pop("MASTER_PORT", None) |
| if backend is not None: | ||
| self._create_from_backend( | ||
| backend, timeout=timeout, init_method=init_method, world_size=world_size, rank=rank, **kwargs | ||
| backend, |
There was a problem hiding this comment.
There is no change in this, let's revert the code formatting.
| assert _NativeDistModel.create_from_context() is None | ||
|
|
||
| dist.init_process_group(true_backend, "tcp://0.0.0.0:2222", world_size=1, rank=0) | ||
| store = dist.TCPStore("0.0.0.0", 2222, world_size=1, is_master=True) |
There was a problem hiding this comment.
We had a fixture to get a freeport in the conf.py, can we use it here?
|
|
||
| @pytest.mark.distributed | ||
| @pytest.mark.parametrize("init_method", [None, "tcp://0.0.0.0:22334", "FILE"]) | ||
| @pytest.mark.parametrize("init_method", [None, "FILE"]) |
There was a problem hiding this comment.
Instead of removing a test, let's do store arg instead
Problem
Using
tcp://directly in PyTorch'sinit_methodcauses multi-process communication to hang and fail in modern PyTorch versions.Solution
Paralleland_NativeDistModelthat detects legacytcp://initialization methods and raises a helpful, actionableValueErrorguiding the user on how to use PyTorch's nativeTCPStoreinstead.tcp://connections from test parameterizations and updated low-level context tests to use moderndist.TCPStoreobjects directly.Check list: