[https://nvbugs/6062416][fix] Cache NCCL window allocation failures by size by nv-lschneider · Pull Request #15596 · NVIDIA/TensorRT-LLM

nv-lschneider · 2026-06-24T15:28:26Z

Summary by CodeRabbit

Bug Fixes
- Improved reliability of large GPU buffer allocation and cleanup.
- Reduced repeated allocation failures by remembering the smallest known failing request size.
- Prevented stale error state from interfering with later distributed operations.
- Ensured previously created buffers can still be reused safely after a failure.
Tests
- Added coverage for failure tracking, buffer reuse, and CUDA error clearing behavior.

Description

Follow-up to #14315.

This builds on the original NCCL window allocation OOM fix and addresses the review feedback from that PR. The previous approach cached symmetric-memory unavailability per communicator, which could disable NCCL window buffers too broadly after a single large allocation failure.

This PR makes the cache size-aware:

Reuses already-pooled valid NCCL window buffers before consulting the failure cache.
Records the smallest known failing symmetric allocation size per communicator.
Skips only new symmetric allocations at or above that known failing size.
Keeps smaller requests eligible, since symmetric-memory availability can be size/resource dependent.
Avoids caching invalid window buffers after fallback.
Clears sticky CUDA error state after a failed local ncclMemAlloc so every rank can still reach the collective fallback path.

It also adds tests for the size-aware failure cache, pooled-buffer reuse after failure, smallest-threshold tracking, and the CUDA error-drain branch.

Test Coverage

SBSA TRT-LLM container build via sbatch: cmake --build cpp/build --target ncclUtilsTest -- -j32
Two-rank NCCL allocator tests via sbatch:
- mpirun --allow-run-as-root --oversubscribe -np 2 --bind-to none ./cpp/build/tests/unit_tests/multi_gpu/ncclUtilsTest --gtest_filter=NCCLWindowAllocatorTest.*
- 15 tests passed per rank
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-06-24T15:34:15Z

📝 Walkthrough

Walkthrough

NCCLWindowAllocator is extended with a per-communicator mMinSymmetricFailureSize map to skip window-buffer allocations at or above a known-failing size. A new clearCudaErrorIfSymmetricAllocationFailed helper prevents sticky CUDA errors from blocking collectives. allocateAndRegisterBuffer is restructured to pre-allocate rankSyncFlag before ncclMemAlloc. Cleanup paths erase stale failure entries. Four new unit tests cover the added behaviors.

Changes

NCCL Window Allocator: Symmetric Failure Tracking & CUDA Error Clearing

Layer / File(s)	Summary
Header: failure-size cache and helper declarations `cpp/tensorrt_llm/common/ncclUtils.h`	Adds `mMinSymmetricFailureSize` private member, `recordSymmetricFailureLocked` and `clearCudaErrorIfSymmetricAllocationFailed` declarations, a `cudaGetLastError` function pointer type alias, and a `friend` test-access class to `NCCLWindowAllocator`.
requestBuffer: early-exit and failure recording `cpp/tensorrt_llm/common/ncclUtils.cpp`	`requestBuffer` consults `mMinSymmetricFailureSize` to return an empty buffer immediately when the requested size meets or exceeds the recorded threshold; after a failed `allocateAndRegisterBuffer`, it records the smallest failing size and skips caching the invalid buffer.
allocateAndRegisterBuffer: pre-allocate rankSyncFlag and clear CUDA errors `cpp/tensorrt_llm/common/ncclUtils.cpp`	`rankSyncFlag` is now allocated via `cudaMalloc` before the `ncclMemAlloc` call; when the symmetric allocation fails locally, `clearCudaErrorIfSymmetricAllocationFailed` is invoked to prevent sticky CUDA errors from propagating to subsequent collectives.
cleanupBuffersForComm: erase failure thresholds `cpp/tensorrt_llm/common/ncclUtils.cpp`	Both cleanup code paths in `cleanupBuffersForComm` now erase the `mMinSymmetricFailureSize` entry for the communicator on teardown.
Unit tests: failure-cache behavior and CUDA error clearing `cpp/tests/unit_tests/multi_gpu/ncclUtilsTest.cpp`	Adds `NCCLWindowAllocatorTestAccess`, a fake `cudaGetLastError` with a call counter, and four test cases covering size-aware failure caching, buffer reuse, smallest-size retention, and CUDA error clearing verification.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

hyukn
Tabrizian

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title matches the required ticket/type format and clearly summarizes the main change: size-aware NCCL window allocation failure caching.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description includes a clear problem/solution summary, test coverage, and checklist confirmation, matching the required template well.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

cpp/tests/unit_tests/multi_gpu/ncclUtilsTest.cpp (1)
63-63: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Minor naming-convention deviations.

failureSize (Line 360) is a function-scope literal constant, which the guidelines name with a k prefix (kFailureSize). The anonymous-namespace counter gCudaGetLastErrorCallCount (Line 63) has internal linkage, where an s prefix is the encouraged convention rather than g.

As per coding guidelines: "function-scope magic-number/literal constants are camelCase with prefix 'k'" and file-scope/anonymous-namespace variables use an s prefix.

Also applies to: 360-360
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tests/unit_tests/multi_gpu/ncclUtilsTest.cpp` at line 63, Rename the
anonymous-namespace counter gCudaGetLastErrorCallCount to follow the
internal-linkage naming convention with an s prefix, and update the
function-scope literal constant failureSize to use the k prefix as required by
the guidelines. Make the changes in ncclUtilsTest.cpp around the symbols
gCudaGetLastErrorCallCount and failureSize, and keep any references consistent
after renaming.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/common/ncclUtils.cpp`:
- Around line 379-391: Only cache fallback thresholds for symmetric memory
allocation failures, not all invalid buffers. In ncclUtils.cpp, update
allocateAndRegisterBuffer() to distinguish why it returns an empty buffer (e.g.
ncclMemAlloc failure vs ncclCommWindowRegister failure), and only call
recordSymmetricFailureLocked from the invalid-buffer path when the failure came
from the symmetric-allocation fallback. Keep the relevant logic in
allocateAndRegisterBuffer, recordSymmetricFailureLocked, and the commBuffers
handling aligned so registration errors do not poison future window allocation
attempts for the communicator.

---

Nitpick comments:
In `@cpp/tests/unit_tests/multi_gpu/ncclUtilsTest.cpp`:
- Line 63: Rename the anonymous-namespace counter gCudaGetLastErrorCallCount to
follow the internal-linkage naming convention with an s prefix, and update the
function-scope literal constant failureSize to use the k prefix as required by
the guidelines. Make the changes in ncclUtilsTest.cpp around the symbols
gCudaGetLastErrorCallCount and failureSize, and keep any references consistent
after renaming.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ff5ba59d-2505-4f74-ac02-c8d13f665b35

📥 Commits

Reviewing files that changed from the base of the PR and between eb0cbdb and b3b4c93.

📒 Files selected for processing (3)

cpp/tensorrt_llm/common/ncclUtils.cpp
cpp/tensorrt_llm/common/ncclUtils.h
cpp/tests/unit_tests/multi_gpu/ncclUtilsTest.cpp

…e sticky ncclMemAlloc errors NCCLWindowAllocator::allocateAndRegisterBuffer used a per-rank cudaMalloc'd flag + ncclAllReduce(min) to make a per-rank ncclMemAlloc failure non-hanging. The intent was that all ranks reach the collective allreduce regardless of ncclMemAlloc's outcome, so the failing rank can signal "abort" instead of silently skipping the collective. However the cudaMalloc for the rank-sync flag ran AFTER ncclMemAlloc. When ncclMemAlloc returns ncclUnhandledCudaError (the symptom in the MGMN Llama-3.3-70B TP=2 run), the CUDA runtime can be left with a sticky last-error. The next runtime CUDA call (cudaMalloc for the rank-sync flag) then inherits that error, TLLM_CUDA_CHECK throws, and the failing rank unwinds out of allocateAndRegisterBuffer without ever reaching the collective ncclAllReduce(min). Healthy ranks proceed into the collective and block, producing the hang the bug reports. Fix: - Pre-allocate rankSyncFlag before ncclMemAlloc, so the flag is always ready on every rank irrespective of what ncclMemAlloc does. - After a failed ncclMemAlloc, drain any sticky CUDA last-error via cudaGetLastError() so the subsequent cudaMemcpy/ncclAllReduce on the CUDA stream see a clean device state. Both changes are confined to allocateAndRegisterBuffer. On the success path they are no-ops; on the failure path they keep the failing rank on the same collective control flow as healthy ranks, restoring the graceful fallback to plain NCCL allreduce that PR NVIDIA#12125 originally introduced. Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

…mAlloc fallback fires When ncclMemAlloc fails on any rank, allocateAndRegisterBuffer collectively agrees on the failure and returns an empty NCCLWindowBuffer so all ranks fall back to plain NCCL. Previously requestBuffer pushed that empty buffer back into the per-comm pool with inUse=true; releaseBuffer is a no-op for a nullptr pointer, so the entry stayed pinned forever. Repeated calls (e.g., autotuner warmup steps) accumulated stale, permanently "in use" entries. Only cache valid buffers. This complements the rank-sync ordering fix in allocateAndRegisterBuffer. Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

…lity After the collective ncclAllReduce(min) inside allocateAndRegisterBuffer agrees that at least one rank could not ncclMemAlloc, all subsequent requestBuffer() calls for the same comm short-circuit to an empty NCCLWindowBuffer instead of re-running the failing ncclMemAlloc + rank-sync allreduce on every autotuner trial. This is cluster-safe because the unavailable decision is driven by a collective allreduce, so all ranks agree and remain in sync without further communication. The cache is invalidated when the comm is torn down. Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>

Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>

github-actions Bot assigned nv-lschneider Jun 24, 2026

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

Comment thread cpp/tensorrt_llm/common/ncclUtils.cpp

nv-lschneider requested review from Superjomn and Tabrizian June 24, 2026 15:38

nv-lschneider force-pushed the pr-14315-oom-size-aware-cache branch 2 times, most recently from 5664998 to 38507b4 Compare June 24, 2026 15:54

tensorrt-cicd and others added 4 commits June 24, 2026 15:11

[nvbugs/6062416][fix] Cache NCCL window allocation failures by size

663ac00

Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>

nv-lschneider force-pushed the pr-14315-oom-size-aware-cache branch from 38507b4 to 663ac00 Compare June 24, 2026 20:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[https://nvbugs/6062416][fix] Cache NCCL window allocation failures by size#15596

[https://nvbugs/6062416][fix] Cache NCCL window allocation failures by size#15596
nv-lschneider wants to merge 4 commits into
NVIDIA:mainfrom
nv-lschneider:pr-14315-oom-size-aware-cache

nv-lschneider commented Jun 24, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

nv-lschneider commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

GitHub Bot Help

Uh oh!

coderabbitai Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nv-lschneider commented Jun 24, 2026 •

edited

Loading

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading