debug: add diagnostic logging to EFA NCCL test #6114

Open
Eren-Jeager123 wants to merge 32 commits into main from debug-efa

Conversation

Contributor

Eren-Jeager123 commented May 20, 2026

Summary

The EFA NCCL test fails with "Test CUDA failure util.cu:557 'system not yet initialized'" on p4d.24xlarge instances. This PR adds diagnostic logging to capture the failure context for root-cause analysis.

Changes

  • test_efa.py: Dump CUDA driver info (ldconfig, cuda-compat libs, nvidia-smi, driver version) and security group rules before running tests
  • nccl_allreduce.sh: Pre-flight checks (LD_LIBRARY_PATH, ofi-nccl plugin, nvidia-smi) and full NCCL log dump on failure
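
A minimal sketch of the kind of diagnostic dump described in the list above. The command strings, the labels, and the helper names `run_shell` and `dump_diagnostics` are assumptions for illustration; the PR names the tools but not the exact invocations, and the real test runs its commands inside the container rather than locally.

```python
import subprocess

# Hypothetical command list: the PR names these tools, not these exact invocations.
DIAGNOSTIC_COMMANDS = {
    "ldconfig libcuda entries": "ldconfig -p | grep libcuda",
    "cuda-compat libs": "ls -l /usr/local/cuda/compat/",
    "nvidia-smi": "nvidia-smi",
    "driver version": "nvidia-smi --query-gpu=driver_version --format=csv,noheader",
}


def run_shell(command):
    """Run a shell command locally and return (exit_code, combined output)."""
    proc = subprocess.run(command, shell=True, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def dump_diagnostics(run=run_shell, commands=DIAGNOSTIC_COMMANDS):
    """Run each command, print its output, and never raise on failure."""
    report = {}
    for label, command in commands.items():
        try:
            code, output = run(command)
        except Exception as exc:  # a diagnostic step must not mask the real failure
            code, output = -1, f"could not run: {exc}"
        report[label] = (code, output)
        print(f"=== {label} (exit {code}) ===\n{output.rstrip()}")
    return report
```

Passing a different `run` callable (for example, one that executes the command in the container) keeps the reporting logic unchanged.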

Known observations

  • nvidia-smi works inside container (GPU visible)
  • Host driver: 580.150
  • cuda-compat: 580.95.05 (base image version)
  • all_reduce_perf crashes immediately with 'system not yet initialized' before producing any NCCL debug output
  • Same image passes single-gpu-test on CodeBuild (L4 GPU)

Test plan

  • EFA test runs and produces diagnostic output for analysis

EFA installer >= 1.44.0 installs the aws-ofi-nccl plugin as
libnccl-net-ofi.so (not libnccl-net.so). NCCL's default plugin
search looks for libnccl-net.so which no longer exists, causing
NCCL to fall back to sockets and fail on EFA-only instances.

Setting NCCL_NET_PLUGIN=ofi tells NCCL to look for
libnccl-net-ofi.so instead, which is what the EFA installer provides.
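
The naming rule described above can be sketched as follows. This is a simplified model, assuming NCCL_NET_PLUGIN holds a plain suffix such as `ofi`; real NCCL resolves the library through the dynamic loader rather than a directory scan, and the function names and search directories here are hypothetical.

```python
import os


def expected_plugin_filename(env):
    """Return the plugin file name implied by NCCL_NET_PLUGIN (suffix form only)."""
    suffix = env.get("NCCL_NET_PLUGIN")
    if not suffix:
        return "libnccl-net.so"  # default name when the variable is unset
    return f"libnccl-net-{suffix}.so"


def find_plugin(env, search_dirs):
    """Return the first existing path to the expected plugin, or None."""
    filename = expected_plugin_filename(env)
    for directory in search_dirs:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None
```

A build-time check in the spirit of the verification mentioned in this commit could fail the build when `find_plugin` returns None for the directories where the EFA installer places its libraries.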

Also adds a build-time verification that the OFI plugin .so exists
after EFA installation, matching the pattern in scripts/common/.

Print plugin file existence, ldd output, and env vars before
running the NCCL test. Also explicitly pass NCCL_NET_PLUGIN
via mpirun -x to ensure it reaches all ranks.

Shell diagnostics in nccl_allreduce.sh weren't visible in CI logs
because Fabric only captures the final command's stdout/stderr on
failure. Move diagnostics to the Python test as a separate
run_on_container call whose output goes to pytest's captured log.

- Use print() instead of LOGGER.info() so output appears in pytest's
  "Captured stdout call" section (logger output was being swallowed)
- Use warn=True on allreduce call so we can capture and print the full
  stdout/stderr/log file content on failure instead of just getting
  the UnexpectedExit traceback
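
A sketch of the warn=True pattern from the list above, assuming `conn` behaves like a Fabric Connection: `run()` accepts `warn` and `hide`, and the result exposes `ok`, `exited`, `stdout`, and `stderr`. The helper name `run_and_report` and the `log_path` argument are hypothetical; the PR's own helper is run_on_container, whose signature is not shown here.

```python
def run_and_report(conn, command, log_path=None):
    """Run a command without raising on failure; print full context if it fails."""
    # warn=True returns the result instead of raising UnexpectedExit.
    result = conn.run(command, warn=True, hide=True)
    if result.ok:
        return True
    print(f"command failed with exit code {result.exited}: {command}")
    print("--- stdout ---")
    print(result.stdout)
    print("--- stderr ---")
    print(result.stderr)
    if log_path:
        # Fetch the log file as a separate call so its content reaches the test output.
        log = conn.run(f"cat {log_path}", warn=True, hide=True)
        print(f"--- {log_path} ---")
        print(log.stdout)
    return False
```

The test can then assert on the returned boolean, so the failure still fails the test but the captured output holds the full context.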

Previous LOGGER.info wasn't visible in CI. Use print() which pytest -s
captures. Also dump SG rules to check if the all-traffic self-ref rule
is missing (known previous issue). Use warn=True on allreduce so we can
capture and print stdout/stderr/log file on failure.
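
The security group check mentioned above can be sketched as a pure function over one entry of a boto3 `describe_security_groups` response. The function name is hypothetical, and the sketch only inspects group-pair rules with protocol "-1" (all traffic); it is not the PR's actual check.

```python
def has_self_referencing_all_traffic_rule(group, direction="IpPermissions"):
    """Return True if the group allows all traffic from (or to) itself.

    direction is "IpPermissions" for inbound rules or
    "IpPermissionsEgress" for outbound rules.
    """
    group_id = group["GroupId"]
    for permission in group.get(direction, []):
        if permission.get("IpProtocol") != "-1":  # "-1" means all protocols
            continue
        for pair in permission.get("UserIdGroupPairs", []):
            if pair.get("GroupId") == group_id:
                return True
    return False
```

Calling it once per direction covers both the inbound and the outbound rule.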

The real failure is 'CUDA system not yet initialized': the p4d host
driver is older than what CUDA 13.0.2 requires. The cuda-compat package
(installed via dnf upgrade cuda-compat-*) provides a forward-compatible
libcuda.so at /usr/local/cuda/compat/.

Prepend /usr/local/cuda/compat to LD_LIBRARY_PATH in the container ENV
so the compat driver is always used, regardless of whether the command
runs via the entrypoint or via docker exec.
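
For illustration only, the prepend described above amounts to the following path manipulation. The helper name is hypothetical, and dropping empty entries and duplicates is a design choice of this sketch rather than something the commit specifies.

```python
def prepend_library_path(current, new_dir):
    """Return an LD_LIBRARY_PATH value with new_dir first and listed only once."""
    parts = [p for p in current.split(":") if p and p != new_dir]
    return ":".join([new_dir] + parts)
```

Removing an existing occurrence before prepending keeps the value stable when the same step runs more than once.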

Also add the compat path in the allreduce test script as a
belt-and-suspenders measure.

The cuda-compat block unconditionally prepends /usr/local/cuda/compat
to LD_LIBRARY_PATH when libcuda.so.1 exists there. On p4d instances
with --runtime=nvidia, the real host driver libcuda.so is mounted by
the NVIDIA container runtime. The compat stub overrides it, causing
'CUDA system not yet initialized' because the compat library can't
communicate with the actual GPU hardware.

The entrypoint.sh already handles cuda-compat correctly (comparing
driver versions), but docker exec commands bypass the entrypoint.
The test script should not touch cuda-compat at all.

The cuda-compat RPM registers /usr/local/cuda/compat in the ldconfig
cache via /etc/ld.so.conf.d/cuda-compat.conf. This makes the compat
libcuda.so visible system-wide, overriding the real host driver
mounted by --runtime=nvidia. Result: 'system not yet initialized'.
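
One way to see what the ldconfig cache holds is to parse the output of `ldconfig -p`. The sketch below assumes the usual `name (flags) => path` line format; the function name is hypothetical.

```python
def libcuda_entries(ldconfig_output, soname="libcuda.so.1"):
    """Return the paths registered for soname, in the order ldconfig -p prints them."""
    paths = []
    for line in ldconfig_output.splitlines():
        name, arrow, path = line.strip().partition(" => ")
        # Lines without " => " (such as the summary header) are skipped.
        if arrow and name.split(" ", 1)[0] == soname:
            paths.append(path.strip())
    return paths
```

If the result contains a path under /usr/local/cuda/compat/, the compat library is registered system-wide, which is the situation this commit describes.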

Fix: remove the ldconfig conf file after installing cuda-compat so the
compat libs are only used when explicitly prepended to LD_LIBRARY_PATH
(which the entrypoint does after driver version comparison).

Host driver (580.150) is older than container cuda-compat (580.159.04).
The --runtime=nvidia mounts host libcuda.so which overrides
LD_LIBRARY_PATH. Use LD_PRELOAD to force the cuda-compat version
for forward compatibility.

dnf upgrade -y cuda-compat-* pulled 580.159.04 which is newer than the
DLAMI host driver (580.150). The 580.159 userspace can't initialize
with the 580.150 kernel module, causing "system not yet initialized".
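
The version relationship above has to be checked numerically, component by component; a plain string comparison would rank 580.95.05 above 580.150. A minimal sketch, with hypothetical function names:

```python
def parse_driver_version(text):
    """Turn a version such as '580.159.04' into a tuple of ints: (580, 159, 4)."""
    return tuple(int(part) for part in text.strip().split("."))


def compat_is_newer_than_host(host_version, compat_version):
    """Return True when the cuda-compat version is strictly newer than the host driver."""
    return parse_driver_version(compat_version) > parse_driver_version(host_version)
```

With the versions quoted in this thread, `compat_is_newer_than_host("580.150", "580.159.04")` is True and `compat_is_newer_than_host("580.150", "580.95.05")` is False.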

Remove the cuda-compat upgrade — the base image already ships a
compatible version. The host AMI's driver is the ceiling.

The nvidia container runtime uses ldconfig (not LD_LIBRARY_PATH) to
resolve libcuda.so. Previously we removed cuda-compat.conf causing
ldconfig to resolve to the host driver at /lib64/ instead of the
cuda-compat version at /usr/local/cuda/compat/. This broke EFA tests
when the container's cuda-compat (580.159) is newer than the host
driver (580.150).

Fix: write /usr/local/cuda/compat to cuda-compat.conf so ldconfig
prefers the forward-compatible library. Also restores dnf upgrade
cuda-compat-* for security patching.

cuda-compat 580.159 is incompatible with DLAMI host driver 580.150:
the newer userspace cannot talk to the older kernel module, causing
"system not yet initialized" in all CUDA operations.

Exclude cuda-compat from dnf upgrade --security and allowlist
CVE-2025-33219 until the DLAMI is updated to driver >= 580.159.
Keep cuda-compat.conf for forward compat when AMI does get updated.

Eren-Jeager123 changed the title from "Trigger EFA Test" to "fix: add cuda-compat to LD_LIBRARY_PATH for driver forward compatibility" on May 20, 2026

dnf upgrade -y cuda-compat-* pulls 580.159 which is newer than the
host driver (580.150) on both DLAMI and CodeBuild GPU runners. The
newer userspace library cannot communicate with the older kernel
module, causing "system not yet initialized" on all CUDA operations.

Remove the cuda-compat upgrade and allowlist CVE-2025-33219 until
host drivers are updated to >= 580.159.

Eren-Jeager123 changed the title from "fix: add cuda-compat to LD_LIBRARY_PATH for driver forward compatibility" to "fix: remove cuda-compat upgrade to prevent driver version mismatch" on May 20, 2026

single-gpu passes (CodeBuild) but EFA fails (DLAMI p4d) with same
image. Add torch.cuda.is_available() + nvidia-smi -L inside the
container on p4d to determine if CUDA is accessible at all.

Remove CACHE_BUST ARG since image is confirmed correct (single-gpu
passes).

Eren-Jeager123 changed the title from "fix: remove cuda-compat upgrade to prevent driver version mismatch" to "Fix EFA Tests" on May 20, 2026

Revert all Dockerfile/allowlist changes. Keep only EFA test diagnostics:
- test_efa.py: CUDA driver info, ldconfig, nvidia-smi, SG rules
- nccl_allreduce.sh: pre-flight checks, full log dump on failure

Eren-Jeager123 changed the title from "Fix EFA Tests" to "debug: add diagnostic logging to EFA NCCL test" on May 20, 2026

Confirmed: base nvidia/cuda:13.0.2 image works on DLAMI p4d (A100,
driver 580.150). Our image fails because dnf upgrade cuda-compat-*
pulls 580.159 which is incompatible with the DLAMI's embargo driver.

Remove the upgrade and allowlist CVE-2025-33219 until DLAMI ships
the public 580.159+ driver.