debug: add diagnostic logging to EFA NCCL test #6114
Open
Eren-Jeager123 wants to merge 32 commits into
EFA installer >= 1.44.0 installs the aws-ofi-nccl plugin as libnccl-net-ofi.so (not libnccl-net.so). NCCL's default plugin search looks for libnccl-net.so which no longer exists, causing NCCL to fall back to sockets and fail on EFA-only instances. Setting NCCL_NET_PLUGIN=ofi tells NCCL to look for libnccl-net-ofi.so instead, which is what the EFA installer provides. Also adds a build-time verification that the OFI plugin .so exists after EFA installation, matching the pattern in scripts/common/.
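The build-time check described above could look roughly like the following. This is a minimal sketch, not the verification code from the PR: the directories in `CANDIDATE_DIRS` and both function names are assumptions for illustration, and the real install location depends on the EFA installer version and base image.

```python
import os
from pathlib import Path

# Assumption: illustrative search locations only; adjust to the actual image layout.
CANDIDATE_DIRS = ["/opt/amazon/ofi-nccl/lib", "/opt/amazon/ofi-nccl/lib64"]


def find_ofi_plugin(search_dirs, name="libnccl-net-ofi.so"):
    """Return the first existing path to the OFI plugin, or None if absent."""
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None


def nccl_env_for_plugin(plugin_path, base_env=None):
    """Return a copy of the environment with NCCL_NET_PLUGIN=ofi set.

    Raises FileNotFoundError when the plugin was not found, so a build
    or test fails early instead of silently falling back to sockets.
    """
    if plugin_path is None:
        raise FileNotFoundError("libnccl-net-ofi.so not found after EFA installation")
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_NET_PLUGIN"] = "ofi"
    return env
```

Failing on a missing file keeps the error close to its cause: a missing plugin is reported at build time rather than surfacing later as a socket fallback on an EFA-only instance.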
Print plugin file existence, ldd output, and env vars before running the NCCL test. Also explicitly pass NCCL_NET_PLUGIN via mpirun -x to ensure it reaches all ranks.
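As a rough illustration of forwarding variables with `mpirun -x`, the sketch below assembles an Open MPI command line from a dictionary of environment variables. The helper name `build_mpirun_cmd` and the example arguments are hypothetical; the actual invocation in nccl_allreduce.sh is not shown in this PR text.

```python
import shlex


def build_mpirun_cmd(binary, num_procs, forwarded_env, binary_args=()):
    """Build an Open MPI command that exports each variable to all ranks.

    Open MPI's -x option accepts NAME=value, so the value does not need
    to be present in the launching shell's environment.
    """
    cmd = ["mpirun", "-n", str(num_procs)]
    for name, value in forwarded_env.items():
        cmd += ["-x", f"{name}={value}"]
    cmd.append(binary)
    cmd += list(binary_args)
    # shlex.join quotes arguments so the result is safe to pass to a shell.
    return shlex.join(cmd)
```

Passing the value explicitly avoids relying on each remote rank inheriting the variable from its own login environment.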
Shell diagnostics in nccl_allreduce.sh weren't visible in CI logs because Fabric only captures the final command's stdout/stderr on failure. Move diagnostics to the Python test as a separate run_on_container call whose output goes to pytest's captured log.
- Use print() instead of LOGGER.info() so output appears in pytest's "Captured stdout call" section (logger output was being swallowed)
- Use warn=True on the allreduce call so we can capture and print the full stdout/stderr/log file content on failure instead of just getting the UnexpectedExit traceback
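The warn=True pattern described above can be sketched as follows. This is an illustration under assumptions, not the code in test_efa.py: `run_and_report` is a hypothetical helper, and `run` stands for any callable with Fabric/Invoke-style semantics (accepts `warn=` and `hide=`, returns an object with `.failed`, `.exited`, `.stdout`, `.stderr`).

```python
def run_and_report(run, command, log_path=None):
    """Run `command` without raising on failure; print full context if it fails.

    With warn=True a non-zero exit returns a result object instead of
    raising UnexpectedExit, so stdout, stderr and an optional log file
    can be printed where pytest captures them.
    """
    result = run(command, warn=True, hide=True)
    if result.failed:
        print(f"command exited with code {result.exited}: {command}")
        print("---- stdout ----")
        print(result.stdout)
        print("---- stderr ----")
        print(result.stderr)
        if log_path is not None:
            # Reading the log is best-effort, so it also uses warn=True.
            log = run(f"cat {log_path}", warn=True, hide=True)
            print(f"---- {log_path} ----")
            print(log.stdout)
    return result
```

With Fabric, `run` could be a connection's `run` method. The caller still needs to assert on `result.failed` afterwards, otherwise the test would pass despite the failure.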
Previous LOGGER.info wasn't visible in CI. Use print() which pytest -s captures. Also dump SG rules to check if the all-traffic self-ref rule is missing (known previous issue). Use warn=True on allreduce so we can capture and print stdout/stderr/log file on failure.
The real failure is 'CUDA system not yet initialized': the p4d host driver is older than what CUDA 13.0.2 requires. The cuda-compat package (installed via dnf upgrade cuda-compat-*) provides a forward-compatible libcuda.so at /usr/local/cuda/compat/. Prepend /usr/local/cuda/compat to LD_LIBRARY_PATH in the container ENV so the compat driver is always used, regardless of whether the command runs via the entrypoint or via docker exec. Also add the compat path in the allreduce test script as a belt-and-suspenders measure.
The cuda-compat block unconditionally prepends /usr/local/cuda/compat to LD_LIBRARY_PATH when libcuda.so.1 exists there. On p4d instances with --runtime=nvidia, the real host driver libcuda.so is mounted by the NVIDIA container runtime. The compat stub overrides it, causing 'CUDA system not yet initialized' because the compat library can't communicate with the actual GPU hardware. The entrypoint.sh already handles cuda-compat correctly (comparing driver versions), but docker exec commands bypass the entrypoint. The test script should not touch cuda-compat at all.
The cuda-compat RPM registers /usr/local/cuda/compat in the ldconfig cache via /etc/ld.so.conf.d/cuda-compat.conf. This makes the compat libcuda.so visible system-wide, overriding the real host driver mounted by --runtime=nvidia. Result: 'system not yet initialized'. Fix: remove the ldconfig conf file after installing cuda-compat so the compat libs are only used when explicitly prepended to LD_LIBRARY_PATH (which the entrypoint does after driver version comparison).
Host driver (580.150) is older than container cuda-compat (580.159.04). The --runtime=nvidia mounts host libcuda.so which overrides LD_LIBRARY_PATH. Use LD_PRELOAD to force the cuda-compat version for forward compatibility.
dnf upgrade -y cuda-compat-* pulled 580.159.04 which is newer than the DLAMI host driver (580.150). The 580.159 userspace can't initialize with the 580.150 kernel module, causing "system not yet initialized". Remove the cuda-compat upgrade — the base image already ships a compatible version. The host AMI's driver is the ceiling.
The nvidia container runtime uses ldconfig (not LD_LIBRARY_PATH) to resolve libcuda.so. Previously we removed cuda-compat.conf causing ldconfig to resolve to the host driver at /lib64/ instead of the cuda-compat version at /usr/local/cuda/compat/. This broke EFA tests when the container's cuda-compat (580.159) is newer than the host driver (580.150). Fix: write /usr/local/cuda/compat to cuda-compat.conf so ldconfig prefers the forward-compatible library. Also restores dnf upgrade cuda-compat-* for security patching.
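To see which libcuda.so.1 entries the ldconfig cache actually holds, the output of `ldconfig -p` can be parsed as in the sketch below. The helper name `cached_paths_for` is hypothetical, and the sketch only lists cache entries in the order printed; it does not model how LD_LIBRARY_PATH, LD_PRELOAD, or the NVIDIA container runtime affect the final choice.

```python
def cached_paths_for(ldconfig_output, soname="libcuda.so.1"):
    """Return the paths that `ldconfig -p` lists for `soname`, in printed order.

    Each cache line has the form:
        <soname> (<flags>) => <path>
    Lines without " => " (such as the summary header) are ignored.
    """
    paths = []
    for line in ldconfig_output.splitlines():
        name_part, sep, path = line.strip().partition(" => ")
        if not sep:
            continue
        # The soname is the first whitespace-delimited token before the flags.
        tokens = name_part.split()
        if tokens and tokens[0] == soname:
            paths.append(path.strip())
    return paths
```

Comparing this list inside the failing image and inside the base image is one way to check whether the compat directory or the host-mounted driver appears in the cache.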
cuda-compat 580.159 is incompatible with DLAMI host driver 580.150 — the newer userspace cannot talk to the older kernel module, causing "system not yet initialized" in all CUDA operations. Exclude cuda-compat from dnf upgrade --security and allowlist CVE-2025-33219 until the DLAMI is updated to driver >= 580.159. Keep cuda-compat.conf for forward compat when AMI does get updated.
dnf upgrade -y cuda-compat-* pulls 580.159 which is newer than the host driver (580.150) on both DLAMI and CodeBuild GPU runners. The newer userspace library cannot communicate with the older kernel module, causing "system not yet initialized" on all CUDA operations. Remove the cuda-compat upgrade and allowlist CVE-2025-33219 until host drivers are updated to >= 580.159.
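The version mismatch these commits describe (cuda-compat 580.159.04 against host driver 580.150) can be detected with a simple comparison. The sketch below only encodes that observation; the function names are hypothetical, and it is not a statement of NVIDIA's compatibility policy.

```python
def parse_driver_version(text):
    """Turn a dotted driver version such as '580.159.04' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def compat_exceeds_host(compat_version, host_version):
    """Return True when the cuda-compat version is newer than the host driver.

    Versions are zero-padded to equal length so that '580.150' and
    '580.150.00' compare as equal.
    """
    compat = parse_driver_version(compat_version)
    host = parse_driver_version(host_version)
    width = max(len(compat), len(host))
    compat += (0,) * (width - len(compat))
    host += (0,) * (width - len(host))
    return compat > host
```

A diagnostic step could print the result next to the raw version strings, so a CI log shows immediately whether the image ships a compat library ahead of the host driver.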
…llreduce capture)
single-gpu passes (CodeBuild) but EFA fails (DLAMI p4d) with same image. Add torch.cuda.is_available() + nvidia-smi -L inside the container on p4d to determine if CUDA is accessible at all. Remove CACHE_BUST ARG since image is confirmed correct (single-gpu passes).
Revert all Dockerfile/allowlist changes. Keep only EFA test diagnostics:
- test_efa.py: CUDA driver info, ldconfig, nvidia-smi, SG rules
- nccl_allreduce.sh: pre-flight checks, full log dump on failure
Confirmed: base nvidia/cuda:13.0.2 image works on DLAMI p4d (A100, driver 580.150). Our image fails because dnf upgrade cuda-compat-* pulls 580.159 which is incompatible with the DLAMI's embargo driver. Remove the upgrade and allowlist CVE-2025-33219 until DLAMI ships the public 580.159+ driver.
Summary
EFA NCCL test fails with Test CUDA failure util.cu:557 'system not yet initialized' on p4d.24xlarge instances. Adding diagnostic logging to capture the failure context for root cause analysis.
Changes
Known observations
- nvidia-smi works inside container (GPU visible)
- all_reduce_perf crashes immediately with 'system not yet initialized' before producing any NCCL debug output
Test plan