[https://nvbugs/6062416][fix] Cache NCCL window allocation failures by size#15596
[https://nvbugs/6062416][fix] Cache NCCL window allocation failures by size#15596nv-lschneider wants to merge 4 commits into
Conversation
📝 WalkthroughWalkthrough
ChangesNCCL Window Allocator: Symmetric Failure Tracking & CUDA Error Clearing
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
cpp/tests/unit_tests/multi_gpu/ncclUtilsTest.cpp (1)
63-63: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueMinor naming-convention deviations.
failureSize(Line 360) is a function-scope literal constant, which the guidelines name with akprefix (kFailureSize). The anonymous-namespace countergCudaGetLastErrorCallCount(Line 63) has internal linkage, where ansprefix is the encouraged convention rather thang.As per coding guidelines: "function-scope magic-number/literal constants are camelCase with prefix 'k'" and file-scope/anonymous-namespace variables use an
sprefix.Also applies to: 360-360
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tests/unit_tests/multi_gpu/ncclUtilsTest.cpp` at line 63, Rename the anonymous-namespace counter gCudaGetLastErrorCallCount to follow the internal-linkage naming convention with an s prefix, and update the function-scope literal constant failureSize to use the k prefix as required by the guidelines. Make the changes in ncclUtilsTest.cpp around the symbols gCudaGetLastErrorCallCount and failureSize, and keep any references consistent after renaming.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/tensorrt_llm/common/ncclUtils.cpp`:
- Around line 379-391: Only cache fallback thresholds for symmetric memory
allocation failures, not all invalid buffers. In ncclUtils.cpp, update
allocateAndRegisterBuffer() to distinguish why it returns an empty buffer (e.g.
ncclMemAlloc failure vs ncclCommWindowRegister failure), and only call
recordSymmetricFailureLocked from the invalid-buffer path when the failure came
from the symmetric-allocation fallback. Keep the relevant logic in
allocateAndRegisterBuffer, recordSymmetricFailureLocked, and the commBuffers
handling aligned so registration errors do not poison future window allocation
attempts for the communicator.
---
Nitpick comments:
In `@cpp/tests/unit_tests/multi_gpu/ncclUtilsTest.cpp`:
- Line 63: Rename the anonymous-namespace counter gCudaGetLastErrorCallCount to
follow the internal-linkage naming convention with an s prefix, and update the
function-scope literal constant failureSize to use the k prefix as required by
the guidelines. Make the changes in ncclUtilsTest.cpp around the symbols
gCudaGetLastErrorCallCount and failureSize, and keep any references consistent
after renaming.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ff5ba59d-2505-4f74-ac02-c8d13f665b35
📒 Files selected for processing (3)
cpp/tensorrt_llm/common/ncclUtils.cppcpp/tensorrt_llm/common/ncclUtils.hcpp/tests/unit_tests/multi_gpu/ncclUtilsTest.cpp
5664998 to
38507b4
Compare
…e sticky ncclMemAlloc errors NCCLWindowAllocator::allocateAndRegisterBuffer used a per-rank cudaMalloc'd flag + ncclAllReduce(min) to make a per-rank ncclMemAlloc failure non-hanging. The intent was that all ranks reach the collective allreduce regardless of ncclMemAlloc's outcome, so the failing rank can signal "abort" instead of silently skipping the collective. However the cudaMalloc for the rank-sync flag ran AFTER ncclMemAlloc. When ncclMemAlloc returns ncclUnhandledCudaError (the symptom in the MGMN Llama-3.3-70B TP=2 run), the CUDA runtime can be left with a sticky last-error. The next runtime CUDA call (cudaMalloc for the rank-sync flag) then inherits that error, TLLM_CUDA_CHECK throws, and the failing rank unwinds out of allocateAndRegisterBuffer without ever reaching the collective ncclAllReduce(min). Healthy ranks proceed into the collective and block, producing the hang the bug reports. Fix: - Pre-allocate rankSyncFlag before ncclMemAlloc, so the flag is always ready on every rank irrespective of what ncclMemAlloc does. - After a failed ncclMemAlloc, drain any sticky CUDA last-error via cudaGetLastError() so the subsequent cudaMemcpy/ncclAllReduce on the CUDA stream see a clean device state. Both changes are confined to allocateAndRegisterBuffer. On the success path they are no-ops; on the failure path they keep the failing rank on the same collective control flow as healthy ranks, restoring the graceful fallback to plain NCCL allreduce that PR NVIDIA#12125 originally introduced. Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
…mAlloc fallback fires When ncclMemAlloc fails on any rank, allocateAndRegisterBuffer collectively agrees on the failure and returns an empty NCCLWindowBuffer so all ranks fall back to plain NCCL. Previously requestBuffer pushed that empty buffer back into the per-comm pool with inUse=true; releaseBuffer is a no-op for a nullptr pointer, so the entry stayed pinned forever. Repeated calls (e.g., autotuner warmup steps) accumulated stale, permanently "in use" entries. Only cache valid buffers. This complements the rank-sync ordering fix in allocateAndRegisterBuffer. Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
…lity After the collective ncclAllReduce(min) inside allocateAndRegisterBuffer agrees that at least one rank could not ncclMemAlloc, all subsequent requestBuffer() calls for the same comm short-circuit to an empty NCCLWindowBuffer instead of re-running the failing ncclMemAlloc + rank-sync allreduce on every autotuner trial. This is cluster-safe because the unavailable decision is driven by a collective allreduce, so all ranks agree and remain in sync without further communication. The cache is invalidated when the comm is torn down. Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>
38507b4 to
663ac00
Compare
Summary by CodeRabbit
Bug Fixes
Tests
Description
Follow-up to #14315.
This builds on the original NCCL window allocation OOM fix and addresses the review feedback from that PR. The previous approach cached symmetric-memory unavailability per communicator, which could disable NCCL window buffers too broadly after a single large allocation failure.
This PR makes the cache size-aware:
ncclMemAllocso every rank can still reach the collective fallback path.It also adds tests for the size-aware failure cache, pooled-buffer reuse after failure, smallest-threshold tracking, and the CUDA error-drain branch.
Test Coverage
SBSA TRT-LLM container build via
sbatch:cmake --build cpp/build --target ncclUtilsTest -- -j32Two-rank NCCL allocator tests via
sbatch:mpirun --allow-run-as-root --oversubscribe -np 2 --bind-to none ./cpp/build/tests/unit_tests/multi_gpu/ncclUtilsTest --gtest_filter=NCCLWindowAllocatorTest.*Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.