Skip to content

[https://nvbugs/6062416][fix] Cache NCCL window allocation failures by size#15596

Open
nv-lschneider wants to merge 4 commits into
NVIDIA:mainfrom
nv-lschneider:pr-14315-oom-size-aware-cache
Open

[https://nvbugs/6062416][fix] Cache NCCL window allocation failures by size#15596
nv-lschneider wants to merge 4 commits into
NVIDIA:mainfrom
nv-lschneider:pr-14315-oom-size-aware-cache

Conversation

@nv-lschneider

@nv-lschneider nv-lschneider commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • Bug Fixes

    • Improved reliability of large GPU buffer allocation and cleanup.
    • Reduced repeated allocation failures by remembering the smallest known failing request size.
    • Prevented stale error state from interfering with later distributed operations.
    • Ensured previously created buffers can still be reused safely after a failure.
  • Tests

    • Added coverage for failure tracking, buffer reuse, and CUDA error clearing behavior.

Description

Follow-up to #14315.

This builds on the original NCCL window allocation OOM fix and addresses the review feedback from that PR. The previous approach cached symmetric-memory unavailability per communicator, which could disable NCCL window buffers too broadly after a single large allocation failure.

This PR makes the cache size-aware:

  • Reuses already-pooled valid NCCL window buffers before consulting the failure cache.
  • Records the smallest known failing symmetric allocation size per communicator.
  • Skips only new symmetric allocations at or above that known failing size.
  • Keeps smaller requests eligible, since symmetric-memory availability can be size/resource dependent.
  • Avoids caching invalid window buffers after fallback.
  • Clears sticky CUDA error state after a failed local ncclMemAlloc so every rank can still reach the collective fallback path.

It also adds tests for the size-aware failure cache, pooled-buffer reuse after failure, smallest-threshold tracking, and the CUDA error-drain branch.

Test Coverage

  • SBSA TRT-LLM container build via sbatch: cmake --build cpp/build --target ncclUtilsTest -- -j32

  • Two-rank NCCL allocator tests via sbatch:

    • mpirun --allow-run-as-root --oversubscribe -np 2 --bind-to none ./cpp/build/tests/unit_tests/multi_gpu/ncclUtilsTest --gtest_filter=NCCLWindowAllocatorTest.*
    • 15 tests passed per rank
  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

NCCLWindowAllocator is extended with a per-communicator mMinSymmetricFailureSize map to skip window-buffer allocations at or above a known-failing size. A new clearCudaErrorIfSymmetricAllocationFailed helper prevents sticky CUDA errors from blocking collectives. allocateAndRegisterBuffer is restructured to pre-allocate rankSyncFlag before ncclMemAlloc. Cleanup paths erase stale failure entries. Four new unit tests cover the added behaviors.

Changes

NCCL Window Allocator: Symmetric Failure Tracking & CUDA Error Clearing

Layer / File(s) Summary
Header: failure-size cache and helper declarations
cpp/tensorrt_llm/common/ncclUtils.h
Adds mMinSymmetricFailureSize private member, recordSymmetricFailureLocked and clearCudaErrorIfSymmetricAllocationFailed declarations, a cudaGetLastError function pointer type alias, and a friend test-access class to NCCLWindowAllocator.
requestBuffer: early-exit and failure recording
cpp/tensorrt_llm/common/ncclUtils.cpp
requestBuffer consults mMinSymmetricFailureSize to return an empty buffer immediately when the requested size meets or exceeds the recorded threshold; after a failed allocateAndRegisterBuffer, it records the smallest failing size and skips caching the invalid buffer.
allocateAndRegisterBuffer: pre-allocate rankSyncFlag and clear CUDA errors
cpp/tensorrt_llm/common/ncclUtils.cpp
rankSyncFlag is now allocated via cudaMalloc before the ncclMemAlloc call; when the symmetric allocation fails locally, clearCudaErrorIfSymmetricAllocationFailed is invoked to prevent sticky CUDA errors from propagating to subsequent collectives.
cleanupBuffersForComm: erase failure thresholds
cpp/tensorrt_llm/common/ncclUtils.cpp
Both cleanup code paths in cleanupBuffersForComm now erase the mMinSymmetricFailureSize entry for the communicator on teardown.
Unit tests: failure-cache behavior and CUDA error clearing
cpp/tests/unit_tests/multi_gpu/ncclUtilsTest.cpp
Adds NCCLWindowAllocatorTestAccess, a fake cudaGetLastError with a call counter, and four test cases covering size-aware failure caching, buffer reuse, smallest-size retention, and CUDA error clearing verification.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • hyukn
  • Tabrizian
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title matches the required ticket/type format and clearly summarizes the main change: size-aware NCCL window allocation failure caching.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description includes a clear problem/solution summary, test coverage, and checklist confirmation, matching the required template well.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
cpp/tests/unit_tests/multi_gpu/ncclUtilsTest.cpp (1)

63-63: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Minor naming-convention deviations.

failureSize (Line 360) is a function-scope literal constant, which the guidelines name with a k prefix (kFailureSize). The anonymous-namespace counter gCudaGetLastErrorCallCount (Line 63) has internal linkage, where an s prefix is the encouraged convention rather than g.

As per coding guidelines: "function-scope magic-number/literal constants are camelCase with prefix 'k'" and file-scope/anonymous-namespace variables use an s prefix.

Also applies to: 360-360

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/unit_tests/multi_gpu/ncclUtilsTest.cpp` at line 63, Rename the
anonymous-namespace counter gCudaGetLastErrorCallCount to follow the
internal-linkage naming convention with an s prefix, and update the
function-scope literal constant failureSize to use the k prefix as required by
the guidelines. Make the changes in ncclUtilsTest.cpp around the symbols
gCudaGetLastErrorCallCount and failureSize, and keep any references consistent
after renaming.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/common/ncclUtils.cpp`:
- Around line 379-391: Only cache fallback thresholds for symmetric memory
allocation failures, not all invalid buffers. In ncclUtils.cpp, update
allocateAndRegisterBuffer() to distinguish why it returns an empty buffer (e.g.
ncclMemAlloc failure vs ncclCommWindowRegister failure), and only call
recordSymmetricFailureLocked from the invalid-buffer path when the failure came
from the symmetric-allocation fallback. Keep the relevant logic in
allocateAndRegisterBuffer, recordSymmetricFailureLocked, and the commBuffers
handling aligned so registration errors do not poison future window allocation
attempts for the communicator.

---

Nitpick comments:
In `@cpp/tests/unit_tests/multi_gpu/ncclUtilsTest.cpp`:
- Line 63: Rename the anonymous-namespace counter gCudaGetLastErrorCallCount to
follow the internal-linkage naming convention with an s prefix, and update the
function-scope literal constant failureSize to use the k prefix as required by
the guidelines. Make the changes in ncclUtilsTest.cpp around the symbols
gCudaGetLastErrorCallCount and failureSize, and keep any references consistent
after renaming.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ff5ba59d-2505-4f74-ac02-c8d13f665b35

📥 Commits

Reviewing files that changed from the base of the PR and between eb0cbdb and b3b4c93.

📒 Files selected for processing (3)
  • cpp/tensorrt_llm/common/ncclUtils.cpp
  • cpp/tensorrt_llm/common/ncclUtils.h
  • cpp/tests/unit_tests/multi_gpu/ncclUtilsTest.cpp

Comment thread cpp/tensorrt_llm/common/ncclUtils.cpp
@nv-lschneider nv-lschneider force-pushed the pr-14315-oom-size-aware-cache branch 4 times, most recently from 663ac00 to a79e22a Compare June 25, 2026 12:56
tensorrt-cicd and others added 4 commits June 25, 2026 08:06
…e sticky ncclMemAlloc errors

NCCLWindowAllocator::allocateAndRegisterBuffer used a per-rank cudaMalloc'd
flag + ncclAllReduce(min) to make a per-rank ncclMemAlloc failure non-hanging.
The intent was that all ranks reach the collective allreduce regardless of
ncclMemAlloc's outcome, so the failing rank can signal "abort" instead of
silently skipping the collective.

However the cudaMalloc for the rank-sync flag ran AFTER ncclMemAlloc.  When
ncclMemAlloc returns ncclUnhandledCudaError (the symptom in the MGMN
Llama-3.3-70B TP=2 run), the CUDA runtime can be left with a sticky
last-error.  The next runtime CUDA call (cudaMalloc for the rank-sync flag)
then inherits that error, TLLM_CUDA_CHECK throws, and the failing rank
unwinds out of allocateAndRegisterBuffer without ever reaching the collective
ncclAllReduce(min).  Healthy ranks proceed into the collective and block,
producing the hang the bug reports.

Fix:
- Pre-allocate rankSyncFlag before ncclMemAlloc, so the flag is always ready
  on every rank irrespective of what ncclMemAlloc does.
- After a failed ncclMemAlloc, drain any sticky CUDA last-error via
  cudaGetLastError() so the subsequent cudaMemcpy/ncclAllReduce on the CUDA
  stream see a clean device state.

Both changes are confined to allocateAndRegisterBuffer.  On the success path
they are no-ops; on the failure path they keep the failing rank on the same
collective control flow as healthy ranks, restoring the graceful fallback to
plain NCCL allreduce that PR NVIDIA#12125 originally introduced.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
…mAlloc fallback fires

When ncclMemAlloc fails on any rank, allocateAndRegisterBuffer collectively
agrees on the failure and returns an empty NCCLWindowBuffer so all ranks fall
back to plain NCCL. Previously requestBuffer pushed that empty buffer back into
the per-comm pool with inUse=true; releaseBuffer is a no-op for a nullptr
pointer, so the entry stayed pinned forever. Repeated calls (e.g., autotuner
warmup steps) accumulated stale, permanently "in use" entries.

Only cache valid buffers. This complements the rank-sync ordering fix in
allocateAndRegisterBuffer.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
…lity

After the collective ncclAllReduce(min) inside allocateAndRegisterBuffer agrees that
at least one rank could not ncclMemAlloc, all subsequent requestBuffer() calls for the
same comm short-circuit to an empty NCCLWindowBuffer instead of re-running the failing
ncclMemAlloc + rank-sync allreduce on every autotuner trial. This is cluster-safe
because the unavailable decision is driven by a collective allreduce, so all ranks
agree and remain in sync without further communication. The cache is invalidated when
the comm is torn down.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>
@nv-lschneider nv-lschneider force-pushed the pr-14315-oom-size-aware-cache branch from a79e22a to bcccd8d Compare June 25, 2026 13:06
@nv-lschneider

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55787 [ run ] triggered by Bot. Commit: bcccd8d Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants