debug: add diagnostic logging to EFA NCCL test by Eren-Jeager123 · Pull Request #6114 · aws/deep-learning-containers

Eren-Jeager123 · 2026-05-20T01:01:39Z

Summary

The EFA NCCL test fails with "Test CUDA failure util.cu:557 'system not yet initialized'" on p4d.24xlarge instances. This PR adds diagnostic logging to capture the failure context for root cause analysis.

Changes

test_efa.py: Dump CUDA driver info (ldconfig, cuda-compat libs, nvidia-smi, driver version) and security group rules before running tests
nccl_allreduce.sh: Pre-flight checks (LD_LIBRARY_PATH, ofi-nccl plugin, nvidia-smi) and full NCCL log dump on failure

Known observations

nvidia-smi works inside container (GPU visible)
Host driver: 580.150
cuda-compat: 580.95.05 (base image version)
all_reduce_perf crashes immediately with 'system not yet initialized', before producing any NCCL debug output
Same image passes single-gpu-test on CodeBuild (L4 GPU)

Test plan

EFA test runs and produces diagnostic output for analysis

EFA installer >= 1.44.0 installs the aws-ofi-nccl plugin as libnccl-net-ofi.so (not libnccl-net.so). NCCL's default plugin search looks for libnccl-net.so which no longer exists, causing NCCL to fall back to sockets and fail on EFA-only instances. Setting NCCL_NET_PLUGIN=ofi tells NCCL to look for libnccl-net-ofi.so instead, which is what the EFA installer provides. Also adds a build-time verification that the OFI plugin .so exists after EFA installation, matching the pattern in scripts/common/.
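The naming rule described above can be sketched in Python. This is a simplified illustration, not the PR's actual build-time check: the function names and the idea of passing a list of search directories are assumptions, and real NCCL plugin lookup has more fallbacks than shown here.

```python
from pathlib import Path
from typing import Iterable, Optional


def plugin_filename(net_plugin: Optional[str]) -> str:
    """Map an NCCL_NET_PLUGIN value to the library name NCCL would look for.

    Simplified: unset -> libnccl-net.so, "ofi" -> libnccl-net-ofi.so.
    """
    if not net_plugin:
        return "libnccl-net.so"
    return f"libnccl-net-{net_plugin}.so"


def find_plugin(net_plugin: Optional[str], search_dirs: Iterable[str]) -> Optional[Path]:
    """Return the first matching plugin file in search_dirs, or None.

    search_dirs is a hypothetical list supplied by the caller; the
    directories the EFA installer actually uses are not given in this PR.
    """
    name = plugin_filename(net_plugin)
    for directory in search_dirs:
        candidate = Path(directory) / name
        if candidate.is_file():
            return candidate
    return None
```

A build-time verification along these lines would fail the build when `find_plugin("ofi", dirs)` returns `None`.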

Print plugin file existence, ldd output, and env vars before running the NCCL test. Also explicitly pass NCCL_NET_PLUGIN via mpirun -x to ensure it reaches all ranks.

Shell diagnostics in nccl_allreduce.sh weren't visible in CI logs because Fabric only captures the final command's stdout/stderr on failure. Move diagnostics to the Python test as a separate run_on_container call whose output goes to pytest's captured log.

- Use print() instead of LOGGER.info() so output appears in pytest's "Captured stdout call" section (logger output was being swallowed)
- Use warn=True on allreduce call so we can capture and print the full stdout/stderr/log file content on failure instead of just getting the UnexpectedExit traceback

Previous LOGGER.info wasn't visible in CI. Use print() which pytest -s captures. Also dump SG rules to check if the all-traffic self-ref rule is missing (known previous issue). Use warn=True on allreduce so we can capture and print stdout/stderr/log file on failure.
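The capture-and-print pattern described above can be sketched as follows. The test in this PR uses Fabric's warn=True; the sketch below swaps in the standard library's subprocess with check=False as a stand-in so it runs without Fabric. The helper name run_and_report and the log_path parameter are hypothetical.

```python
import subprocess
import sys


def run_and_report(cmd, log_path=None):
    """Run cmd without raising on a non-zero exit; print full context on failure."""
    # check=False plays the role of warn=True: a failing command returns a
    # result object instead of raising.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        # print() rather than a logger, so pytest shows it under captured stdout.
        print(f"command failed with exit code {result.returncode}: {cmd}")
        print("--- stdout ---")
        print(result.stdout)
        print("--- stderr ---")
        print(result.stderr)
        if log_path is not None:
            try:
                with open(log_path) as fh:
                    print(f"--- {log_path} ---")
                    print(fh.read())
            except OSError as exc:
                print(f"could not read {log_path}: {exc}")
    return result
```

The caller can still assert on `result.returncode` afterwards, so the test fails as before but with the diagnostics already printed.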

The real failure is 'CUDA system not yet initialized': the p4d host driver is older than what CUDA 13.0.2 requires. The cuda-compat package (installed via dnf upgrade cuda-compat-*) provides a forward-compatible libcuda.so at /usr/local/cuda/compat/. Prepend /usr/local/cuda/compat to LD_LIBRARY_PATH in the container ENV so the compat driver is always used, regardless of whether the command runs via the entrypoint or via docker exec. Also add the compat path in the allreduce test script as a belt-and-suspenders measure.

The cuda-compat block unconditionally prepends /usr/local/cuda/compat to LD_LIBRARY_PATH when libcuda.so.1 exists there. On p4d instances with --runtime=nvidia, the real host driver libcuda.so is mounted by the NVIDIA container runtime. The compat stub overrides it, causing 'CUDA system not yet initialized' because the compat library can't communicate with the actual GPU hardware. The entrypoint.sh already handles cuda-compat correctly (comparing driver versions), but docker exec commands bypass the entrypoint. The test script should not touch cuda-compat at all.

The cuda-compat RPM registers /usr/local/cuda/compat in the ldconfig cache via /etc/ld.so.conf.d/cuda-compat.conf. This makes the compat libcuda.so visible system-wide, overriding the real host driver mounted by --runtime=nvidia. Result: 'system not yet initialized'. Fix: remove the ldconfig conf file after installing cuda-compat so the compat libs are only used when explicitly prepended to LD_LIBRARY_PATH (which the entrypoint does after driver version comparison).
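One way to see which libcuda.so.1 the ldconfig cache exposes is to parse the output of ldconfig -p. The sketch below works on the text of that output; the function name and the sample text are made up for illustration, and it assumes the usual "name (flags) => path" line layout.

```python
def resolve_from_ldconfig(ldconfig_output: str, soname: str):
    """Return every cached path for soname, in the order `ldconfig -p` lists them."""
    paths = []
    for line in ldconfig_output.splitlines():
        if "=>" not in line:
            continue  # header line or anything that is not a cache entry
        left, path = line.split("=>", 1)
        fields = left.split()
        if fields and fields[0] == soname:
            paths.append(path.strip())
    return paths


# Illustrative sample only, not output captured from the failing container.
SAMPLE = (
    "2 libs found in cache\n"
    "\tlibcuda.so.1 (libc6,x86-64) => /usr/local/cuda/compat/libcuda.so.1\n"
    "\tlibcuda.so.1 (libc6,x86-64) => /lib64/libcuda.so.1\n"
)

print(resolve_from_ldconfig(SAMPLE, "libcuda.so.1"))
```

If the compat path shows up in this list, the compat library is visible system-wide, which is the condition the commit message above describes.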

Host driver (580.150) is older than container cuda-compat (580.159.04). The --runtime=nvidia mounts host libcuda.so which overrides LD_LIBRARY_PATH. Use LD_PRELOAD to force the cuda-compat version for forward compatibility.

dnf upgrade -y cuda-compat-* pulled 580.159.04 which is newer than the DLAMI host driver (580.150). The 580.159 userspace can't initialize with the 580.150 kernel module, causing "system not yet initialized". Remove the cuda-compat upgrade — the base image already ships a compatible version. The host AMI's driver is the ceiling.

The nvidia container runtime uses ldconfig (not LD_LIBRARY_PATH) to resolve libcuda.so. Previously we removed cuda-compat.conf causing ldconfig to resolve to the host driver at /lib64/ instead of the cuda-compat version at /usr/local/cuda/compat/. This broke EFA tests when the container's cuda-compat (580.159) is newer than the host driver (580.150). Fix: write /usr/local/cuda/compat to cuda-compat.conf so ldconfig prefers the forward-compatible library. Also restores dnf upgrade cuda-compat-* for security patching.

cuda-compat 580.159 is incompatible with DLAMI host driver 580.150 — the newer userspace cannot talk to the older kernel module, causing "system not yet initialized" in all CUDA operations. Exclude cuda-compat from dnf upgrade --security and allowlist CVE-2025-33219 until the DLAMI is updated to driver >= 580.159. Keep cuda-compat.conf for forward compat when AMI does get updated.

dnf upgrade -y cuda-compat-* pulls 580.159 which is newer than the host driver (580.150) on both DLAMI and CodeBuild GPU runners. The newer userspace library cannot communicate with the older kernel module, causing "system not yet initialized" on all CUDA operations. Remove the cuda-compat upgrade and allowlist CVE-2025-33219 until host drivers are updated to >= 580.159.
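The version check implied above (is the container's cuda-compat newer than the host driver?) can be sketched as a numeric comparison of dotted version strings. The function names are hypothetical; the version strings in the usage lines are the ones quoted in this PR.

```python
from itertools import zip_longest


def parse_version(version: str):
    """Turn '580.159.04' into (580, 159, 4)."""
    return tuple(int(part) for part in version.strip().split("."))


def compat_newer_than_host(compat: str, host: str) -> bool:
    """True if the cuda-compat version is strictly newer than the host driver."""
    # Missing components are treated as 0, so 580.150 equals 580.150.0.
    for c, h in zip_longest(parse_version(compat), parse_version(host), fillvalue=0):
        if c != h:
            return c > h
    return False


print(compat_newer_than_host("580.159.04", "580.150"))  # upgraded cuda-compat
print(compat_newer_than_host("580.95.05", "580.150"))   # base image cuda-compat
```

Comparing components as integers matters here: a plain string comparison would rank 580.95.05 above 580.150.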

single-gpu passes (CodeBuild) but EFA fails (DLAMI p4d) with same image. Add torch.cuda.is_available() + nvidia-smi -L inside the container on p4d to determine if CUDA is accessible at all. Remove CACHE_BUST ARG since image is confirmed correct (single-gpu passes).
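Running a Python one-liner inside the container means nesting quotes inside a shell command, which is easy to get wrong. A minimal sketch of one way to avoid that, assuming the command is run through docker exec and bash -c, is to let shlex do the quoting. The helper name and the container name "c1" are hypothetical.

```python
import shlex


def docker_exec_command(container: str, inner_command: str) -> str:
    """Build a shell-safe `docker exec <container> bash -c <inner_command>` string."""
    # shlex.join quotes each argument, so quotes inside inner_command survive intact.
    return shlex.join(["docker", "exec", container, "bash", "-c", inner_command])


INNER = 'python -c "import torch; print(torch.cuda.is_available())" && nvidia-smi -L'
print(docker_exec_command("c1", INNER))
```

Splitting the resulting string with `shlex.split` returns the original arguments, which is a quick way to confirm the quoting round-trips.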

Revert all Dockerfile/allowlist changes. Keep only EFA test diagnostics:
- test_efa.py: CUDA driver info, ldconfig, nvidia-smi, SG rules
- nccl_allreduce.sh: pre-flight checks, full log dump on failure

Confirmed: base nvidia/cuda:13.0.2 image works on DLAMI p4d (A100, driver 580.150). Our image fails because dnf upgrade cuda-compat-* pulls 580.159 which is incompatible with the DLAMI's embargo driver. Remove the upgrade and allowlist CVE-2025-33219 until DLAMI ships the public 580.159+ driver.

Trigger EFA Test

77675f7

aws-deep-learning-containers-ci Bot added the authorized label May 20, 2026

Eren-Jeager123 added 15 commits May 20, 2026 02:49

debug: add EFA/NCCL plugin diagnostics to allreduce test

6eda96f

Print plugin file existence, ldd output, and env vars before running the NCCL test. Also explicitly pass NCCL_NET_PLUGIN via mpirun -x to ensure it reaches all ranks.

fix: use LD_PRELOAD for cuda-compat in EFA NCCL test

401badb

Host driver (580.150) is older than container cuda-compat (580.159.04). The --runtime=nvidia mounts host libcuda.so which overrides LD_LIBRARY_PATH. Use LD_PRELOAD to force the cuda-compat version for forward compatibility.

fix: use resolved path for LD_PRELOAD cuda-compat and add load verification

8d9f5f1

cleanup: remove LD_PRELOAD hack, real fix is in Dockerfile cuda-compat

8b71703

chore: retrigger CI

4dbc5bd

Eren-Jeager123 changed the title ~~Trigger EFA Test~~ fix: add cuda-compat to LD_LIBRARY_PATH for driver forward compatibility May 20, 2026

Eren-Jeager123 added 5 commits May 20, 2026 16:35

fix: add CACHE_REFRESH ARG to invalidate cached security patch layer

d672e8b

cleanup: remove NCCL_NET_PLUGIN=ofi debug env var

d211053

fix: remove trailing backslash from ENV after NCCL_NET_PLUGIN removal

fdc20ba

remove .claude from tracking

50e12ca

Eren-Jeager123 changed the title ~~fix: add cuda-compat to LD_LIBRARY_PATH for driver forward compatibility~~ fix: remove cuda-compat upgrade to prevent driver version mismatch May 20, 2026

Eren-Jeager123 added 6 commits May 20, 2026 17:11

debug: add pre-flight checks and log dump for EFA NCCL test

f1db9ad

fix: add CACHE_BUST ARG to force rebuild without stale cuda-compat

0a626a5

fix: restore EFA test diagnostics (CUDA driver info, SG rules, NCCL allreduce capture)

792ef2c

fix: nested quote syntax error in CUDA diagnostic command

46db310

fix: remove python torch check that breaks nested quoting, keep nvidia-smi only

058afca

Eren-Jeager123 changed the title ~~fix: remove cuda-compat upgrade to prevent driver version mismatch~~ Fix EFA Tests May 20, 2026

Eren-Jeager123 added 2 commits May 20, 2026 20:15

fix: add diagnostic logging to EFA test for failure analysis

3fc5b6e

Revert all Dockerfile/allowlist changes. Keep only EFA test diagnostics:
- test_efa.py: CUDA driver info, ldconfig, nvidia-smi, SG rules
- nccl_allreduce.sh: pre-flight checks, full log dump on failure

remove .claude from tracking

3e7cfc1

Eren-Jeager123 changed the title ~~Fix EFA Tests~~ debug: add diagnostic logging to EFA NCCL test May 20, 2026

Eren-Jeager123 added 3 commits May 20, 2026 20:17

chore: trigger EFA test rebuild

51c24c6

fix: pin EFA test to pre-embargo DLAMI (before 2026-05-05)

495a4c9

Eren-Jeager123 wants to merge 32 commits into main from debug-efa
