Commit ade421a

Add persistent NCCL pool CP test infra

Each (world_size) is served by one long-lived torchrun running run_attention_with_cp_pool.py. Tests submit work over rank-0 stdin as JSON and read results from rank-0 stdout, replacing the per-test torchrun launch path. NCCL init/destroy happens once per pool, not once per case, eliminating ~9s overhead per test and fixing L3 timeouts.

Why two pool sizes: cp_comm_type="a2a+p2p" needs world_size=4; everything else uses world_size=2. We can't resize an active PG, so one pool per world_size, routed by num_gpus. Pools spawn lazily on first use so a session that only exercises 2-GPU cases never pays the 4-GPU init cost.

Includes:
- PoolWorker class with sentinel-prefixed JSON protocol over rank-0 stdio (sentinel filters out torchrun status / library prints that share the stdout fd)
- Stderr ring buffer (200 lines / ~4 KB tail) attached to crash-path AssertionErrors so CI JUnit XML shows the real failure cause
- POOL_SUBMIT_TIMEOUT_SEC defaulting to 90 s (~6x p50 case wall on H100); override via NVTE_CP_POOL_TIMEOUT_SEC
- Stream race fix on max_logit_per_step in all-gather CP forward: wait_stream(flash_attn_streams[i-1]) before torch.maximum, so the read on the default stream doesn't race with the write on cp_stream in iteration i=2. The pool's persistent process exposed this latent race; per-process subprocess design happened to schedule it safely.
- Deep-copy of model_configs_flash_attn[model] to prevent in-place attn_mask_type mutation from leaking across pool cases
- Deterministic-mode skips for FusedAttention configs that OOM on sm90 under NVTE_ALLOW_NONDETERMINISTIC_ALGO=0

Preserves PR NVIDIA#2596 pad_between_seqs additions (fa_pad_between_seqs parameter through generate_input_shapes and run_dpa_with_cp, THD padding cleanup for FA3 tile-spillover comparison).

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
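
For readers unfamiliar with the pattern, here is a minimal sketch of how a sentinel-prefixed JSON protocol and a bounded stderr tail could work together. It is an illustration only, not the code in this commit: the sentinel string, the helper names (encode_message, parse_line, StderrTail), and the message fields are all hypothetical. Only the 200-line cap comes from the commit message.

```python
import json
from collections import deque

# Hypothetical sentinel; the commit message does not state the real prefix.
SENTINEL = "@@CP_POOL@@ "


def encode_message(payload: dict) -> str:
    """Serialize one message as a single sentinel-prefixed line."""
    return SENTINEL + json.dumps(payload)


def parse_line(line: str):
    """Return the decoded payload, or None for non-protocol output."""
    if not line.startswith(SENTINEL):
        return None  # e.g. torchrun status or a library print; skip it
    return json.loads(line[len(SENTINEL):])


class StderrTail:
    """Keep only the most recent stderr lines for crash reports."""

    def __init__(self, max_lines: int = 200):
        # deque with maxlen silently drops the oldest entry when full.
        self._lines = deque(maxlen=max_lines)

    def feed(self, line: str) -> None:
        self._lines.append(line.rstrip("\n"))

    def tail(self) -> str:
        return "\n".join(self._lines)


# Mixed stdout as rank 0 might emit it: only the prefixed line is a result.
stdout_lines = [
    "W0101 torchrun: setting OMP_NUM_THREADS=1",
    encode_message({"case": "cp_2gpu", "status": "ok"}),
    "some library banner",
]
results = [m for m in map(parse_line, stdout_lines) if m is not None]

# A small cap here makes the eviction behaviour easy to see.
stderr_tail = StderrTail(max_lines=3)
for i in range(5):
    stderr_tail.feed(f"stderr line {i}\n")
```

Under these assumptions, a test harness would attach stderr_tail.tail() to the AssertionError it raises when the pool process dies, so the failure report carries the last lines the worker wrote instead of only a timeout.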

1 parent 77941e0 commit ade421a

4 files changed

tests/pytorch/attention
transformer_engine/pytorch/attention/dot_product_attention
- context_parallel.py
