Commit ade421a
Add persistent NCCL pool CP test infra
Each world_size is served by one long-lived torchrun process running
run_attention_with_cp_pool.py. Tests submit work over rank-0 stdin
as JSON and read results from rank-0 stdout, replacing the
per-test torchrun launch path. NCCL init/destroy happens once
per pool rather than once per case, which eliminates ~9 s of overhead
per test and fixes the L3 timeouts.
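The request/response framing described above can be sketched as follows. This is a minimal illustration, not the code in run_attention_with_cp_pool.py: the `SENTINEL` value and the helper names `encode_request`, `encode_result`, and `read_result` are hypothetical, and the real payload fields are not shown in this commit message.

```python
import json

# Hypothetical marker; the actual sentinel string is not given in this message.
SENTINEL = "@@CP_POOL@@ "


def encode_request(case: dict) -> str:
    """Test side: serialize one case as a single JSON line for rank-0 stdin."""
    return json.dumps(case) + "\n"


def encode_result(result: dict) -> str:
    """Worker side: prefix the JSON result so it stands out from other prints."""
    return SENTINEL + json.dumps(result) + "\n"


def read_result(lines):
    """Return the first sentinel-prefixed payload, skipping unrelated stdout lines."""
    for line in lines:
        if line.startswith(SENTINEL):
            return json.loads(line[len(SENTINEL):])
    raise AssertionError("worker stdout closed before a result arrived")
```

In a real pool, `lines` would be the rank-0 stdout pipe of the torchrun process; any iterable of text lines works for the sketch.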
Why two pool sizes: cp_comm_type="a2a+p2p" needs world_size=4;
everything else uses world_size=2. An active process group cannot be
resized, so there is one pool per world_size, routed by num_gpus. Pools spawn lazily on
first use so a session that only exercises 2-GPU cases never pays
the 4-GPU init cost.
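A rough sketch of the lazy, per-world_size routing, assuming a caller-supplied `spawn` callable that starts a pool. The names `LazyPools` and `required_world_size` are illustrative and do not appear in the commit.

```python
from typing import Callable, Dict


def required_world_size(cp_comm_type: str) -> int:
    """Per the message: "a2a+p2p" needs 4 ranks, everything else uses 2."""
    return 4 if cp_comm_type == "a2a+p2p" else 2


class LazyPools:
    """Holds at most one pool per world size, created on first request."""

    def __init__(self, spawn: Callable[[int], object]):
        self._spawn = spawn  # hypothetical factory, e.g. launches torchrun
        self._pools: Dict[int, object] = {}

    def get(self, num_gpus: int):
        # Spawn only when a case of this size first arrives, so a session
        # that never needs 4 GPUs never pays the 4-GPU init cost.
        if num_gpus not in self._pools:
            self._pools[num_gpus] = self._spawn(num_gpus)
        return self._pools[num_gpus]
```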
Includes:
- PoolWorker class with sentinel-prefixed JSON protocol over rank-0
stdio (sentinel filters out torchrun status / library prints that
share the stdout fd)
- Stderr ring buffer (200 lines / ~4 KB tail) attached to crash-path
AssertionErrors so CI JUnit XML shows the real failure cause
- POOL_SUBMIT_TIMEOUT_SEC defaulting to 90 s (~6x p50 case wall on
H100); override via NVTE_CP_POOL_TIMEOUT_SEC
- Stream race fix on max_logit_per_step in all-gather CP forward:
wait_stream(flash_attn_streams[i-1]) before torch.maximum, so the
read on the default stream doesn't race with the write on cp_stream
in iteration i=2. The pool's persistent process exposed this latent
race; the previous per-test subprocess design happened to schedule it safely.
- Deep-copy of model_configs_flash_attn[model] to prevent in-place
attn_mask_type mutation from leaking across pool cases
- Deterministic-mode skips for FusedAttention configs that OOM on
sm90 under NVTE_ALLOW_NONDETERMINISTIC_ALGO=0
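The stderr ring buffer from the list above can be approximated with a bounded deque. This sketch models only the 200-line cap, not the ~4 KB byte cap, and the names `StderrTail` and `fail_with_tail` are assumptions rather than the identifiers used in the test infra.

```python
from collections import deque


class StderrTail:
    """Keeps only the most recent stderr lines from a worker."""

    def __init__(self, max_lines: int = 200):
        # deque with maxlen silently drops the oldest entry when full.
        self._lines = deque(maxlen=max_lines)

    def feed(self, line: str) -> None:
        self._lines.append(line.rstrip("\n"))

    def text(self) -> str:
        return "\n".join(self._lines)


def fail_with_tail(message: str, tail: StderrTail) -> None:
    """Raise an AssertionError that carries the worker's last stderr lines."""
    raise AssertionError(f"{message}\n--- worker stderr tail ---\n{tail.text()}")
```

Attaching the tail to the exception text means the failure cause travels with the assertion message into whatever report the test runner writes.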
Preserves PR NVIDIA#2596 pad_between_seqs additions (fa_pad_between_seqs
parameter through generate_input_shapes and run_dpa_with_cp, THD
padding cleanup for FA3 tile-spillover comparison).
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
1 parent 77941e0 commit ade421a
4 files changed
Lines changed: 958 additions & 557 deletions
File tree
- tests/pytorch/attention
- transformer_engine/pytorch/attention/dot_product_attention