Commit bce64ff
Fix parallel CP test port conflicts in L1 distributed test.sh
When NUM_GPUS >= 8, test.sh runs two pytest sessions in parallel:
non-deterministic on GPUs 0-3 and deterministic on GPUs 4-7. Each
session launches its own torchrun (or pool worker), which reads
MASTER_PORT from the environment. Without explicit ports, both
sessions inherit the same default (29500), and the second fails with
EADDRINUSE on every torchrun spawn.
Assign distinct ports: 29500 for the non-deterministic session,
29501 for the deterministic session.
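The port assignment described above could look roughly like the sketch below. The changed lines of test.sh are not preserved in this view, so everything here is an assumption for illustration: the helper names `run_session` and `report_env` are hypothetical, `report_env` stands in for the real pytest invocation, and the default `NUM_GPUS=8` is chosen only so the sketch runs on its own.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the actual test.sh: give each parallel
# session its own MASTER_PORT so the two rendezvous ports cannot collide.

# Run a command with a session-specific port and GPU set.
# Arguments: <master_port> <cuda_visible_devices> <command...>
run_session() {
    local port="$1" gpus="$2"
    shift 2
    MASTER_PORT="$port" CUDA_VISIBLE_DEVICES="$gpus" "$@"
}

# Stand-in for the pytest call; it only reports the environment it received.
report_env() {
    echo "$1: MASTER_PORT=${MASTER_PORT} CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
}

# Assumed default so the sketch is self-contained.
NUM_GPUS="${NUM_GPUS:-8}"

if [ "$NUM_GPUS" -ge 8 ]; then
    # Non-deterministic session: GPUs 0-3, port 29500.
    run_session 29500 0,1,2,3 report_env non-deterministic &
    pid_nondet=$!
    # Deterministic session: GPUs 4-7, port 29501.
    run_session 29501 4,5,6,7 report_env deterministic &
    pid_det=$!
    # Wait for both background sessions before continuing.
    wait "$pid_nondet"
    wait "$pid_det"
fi
```

Scoping each variable to a single command invocation, rather than exporting it globally, keeps one session's port from leaking into the other when both run from the same shell.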
The OOM-skip half of fdf32d5 was picked up separately during the
pool-batching rebase; this commit closes the remaining gap.
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
1 parent c1c6d0e commit bce64ff
1 file changed
Lines changed: 2 additions & 2 deletions
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 28 | 28 | |
| 29 | 29 | |
| 30 | 30 | |
| 31 | | - |
| | 31 | + |
| 32 | 32 | |
| 33 | | - |
| | 33 | + |
| 34 | 34 | |
| 35 | 35 | |
| 36 | 36 | |