Commit bce64ff

Fix parallel CP test port conflicts in L1 distributed test.sh
When NUM_GPUS >= 8, test.sh runs two parallel pytest sessions: non-deterministic on GPUs 0-3 and deterministic on GPUs 4-7. Each session launches its own torchrun (or pool worker), which picks MASTER_PORT from the environment. Without explicit ports, both inherit the same default (29500) and the second session fails with EADDRINUSE on every torchrun spawn.

Assign distinct ports: 29500 for the non-deterministic session, 29501 for the deterministic session.

Picked up the OOM skip half of fdf32d5 separately during the pool-batching rebase; this commit closes the remaining gap.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
1 parent c1c6d0e commit bce64ff

1 file changed

Lines changed: 2 additions & 2 deletions

qa/L1_pytorch_distributed_unittest/test.sh

@@ -28,9 +28,9 @@ NUM_GPUS=$(python3 -c "import torch; print(torch.cuda.device_count())")
 echo "Detected $NUM_GPUS GPU(s)"
 if [ "$NUM_GPUS" -ge 8 ]; then
 echo "Running CP tests in parallel: non-deterministic on GPUs 0-3, deterministic on GPUs 4-7"
-CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m pytest -v -s --junitxml=$XML_LOG_DIR/pytest_test_attention_with_cp.xml $TE_PATH/tests/pytorch/attention/test_attention_with_cp.py &
+CUDA_VISIBLE_DEVICES=0,1,2,3 MASTER_PORT=29500 python3 -m pytest -v -s --junitxml=$XML_LOG_DIR/pytest_test_attention_with_cp.xml $TE_PATH/tests/pytorch/attention/test_attention_with_cp.py &
 PID_CP_NONDET=$!
-CUDA_VISIBLE_DEVICES=4,5,6,7 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 python3 -m pytest -v -s --junitxml=$XML_LOG_DIR/pytest_test_attention_deterministic_with_cp.xml $TE_PATH/tests/pytorch/attention/test_attention_with_cp.py &
+CUDA_VISIBLE_DEVICES=4,5,6,7 MASTER_PORT=29501 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 python3 -m pytest -v -s --junitxml=$XML_LOG_DIR/pytest_test_attention_deterministic_with_cp.xml $TE_PATH/tests/pytorch/attention/test_attention_with_cp.py &
 PID_CP_DET=$!
 wait $PID_CP_NONDET || test_fail "test_attention_with_cp.py"
 wait $PID_CP_DET || test_fail "NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 test_attention_with_cp.py"
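
The effect of the per-command `MASTER_PORT=...` prefix can be checked in isolation with a small sketch. This is not part of the commit. It is an illustration under stated assumptions: `sh -c` stands in for the pytest sessions, the `report_port` snippet and the temporary file names are hypothetical, and the fallback to 29500 mirrors the default cited in the commit message rather than calling torchrun itself.

```shell
#!/usr/bin/env bash
# Sketch only: `sh -c` is a stand-in for the pytest sessions in test.sh.
# Each child prints the port it would resolve, falling back to 29500
# when MASTER_PORT is unset (the default named in the commit message).
report_port='echo "${MASTER_PORT:-29500}"'

out_dir=$(mktemp -d)

# Without explicit ports, both background children resolve the same default.
( unset MASTER_PORT; sh -c "$report_port" ) > "$out_dir/a_default" &
( unset MASTER_PORT; sh -c "$report_port" ) > "$out_dir/b_default" &
wait

# A VAR=value prefix applies only to that one command's environment,
# so each background child sees its own port.
MASTER_PORT=29500 sh -c "$report_port" > "$out_dir/a_fixed" &
pid_a=$!
MASTER_PORT=29501 sh -c "$report_port" > "$out_dir/b_fixed" &
pid_b=$!

# Wait on each PID separately so a failure can be attributed to one session.
wait "$pid_a" || echo "session a failed"
wait "$pid_b" || echo "session b failed"

default_ports="$(cat "$out_dir/a_default") $(cat "$out_dir/b_default")"
fixed_ports="$(cat "$out_dir/a_fixed") $(cat "$out_dir/b_fixed")"
echo "default: $default_ports"
echo "fixed:   $fixed_ports"
rm -rf "$out_dir"
```

The first pair shows both children landing on the same port, which is the collision the commit describes. The second pair shows the prefix form keeping the two sessions apart without exporting anything into the parent shell.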
