Commit 467abbb

tensorrt-cicd authored and nv-lschneider committed
[nvbugs/6062416][fix] Cache per-comm NCCL symmetric-memory unavailability
After the collective ncclAllReduce(min) inside allocateAndRegisterBuffer agrees that at least one rank could not ncclMemAlloc, all subsequent requestBuffer() calls for the same comm short-circuit to an empty NCCLWindowBuffer instead of re-running the failing ncclMemAlloc + rank-sync allreduce on every autotuner trial.

This is cluster-safe because the unavailable decision is driven by a collective allreduce, so all ranks agree and remain in sync without further communication. The cache is invalidated when the comm is torn down.

Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
1 parent fb0c8cc commit 467abbb
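
The "all ranks agree" argument in the commit message can be illustrated without NCCL. The sketch below is a single-process simulation, not the code from this commit: `simulateAllReduceMin`, `symmetricMemoryUsable`, and the 1/0 status encoding are hypothetical names and conventions chosen for illustration. It shows that when every rank decides from the reduced minimum rather than from its own local result, one failing rank forces the same "unavailable" decision everywhere.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the per-rank outcome of the local allocation:
// 1 means the allocation succeeded on that rank, 0 means it failed.
using LocalStatus = int;

// Simulates what a min-allreduce delivers: every rank receives the minimum
// over all ranks' inputs, so every rank ends up holding the same value.
std::vector<int> simulateAllReduceMin(std::vector<LocalStatus> const& perRank)
{
    int const globalMin = perRank.empty() ? 1 : *std::min_element(perRank.begin(), perRank.end());
    return std::vector<int>(perRank.size(), globalMin);
}

// Each rank decides from the reduced value only, never from its local status,
// which is why the decision is identical on all ranks.
bool symmetricMemoryUsable(int reducedValue)
{
    return reducedValue == 1;
}
```

Because the cached flag is derived from the reduced value, each rank can consult its own local cache afterwards and still take the same branch as its peers.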

2 files changed

Lines changed: 26 additions & 0 deletions

cpp/tensorrt_llm/common/ncclUtils.cpp

Lines changed: 20 additions & 0 deletions
@@ -320,6 +320,17 @@ NCCLWindowBuffer NCCLWindowAllocator::requestBuffer(ncclComm_t comm, size_t size
     // This is cheap even if no buffers exist yet - cleanup will just return early
     registerBufferCleanup(comm);
 
+    // If a previous allocateAndRegisterBuffer call collectively concluded that this comm
+    // cannot use NCCL symmetric memory, short-circuit so callers transparently fall back to
+    // regular allreduce. This avoids re-running ncclMemAlloc + the rank-sync allreduce on
+    // every autotuner trial, which would otherwise spam warnings and stress the failing path.
+    // The decision is collective (driven by an ncclAllReduce(min) inside allocateAndRegisterBuffer),
+    // so all ranks reach the same conclusion and stay in sync without further communication.
+    if (mSymmetricUnavailable.find(comm) != mSymmetricUnavailable.end())
+    {
+        return NCCLWindowBuffer();
+    }
+
     // Check if we have an available buffer of at least the requested size for this communicator
     // Use best-fit: find the smallest buffer that's >= requested size
     auto& commBuffers = mBufferPool[comm];
@@ -371,6 +382,13 @@ NCCLWindowBuffer NCCLWindowAllocator::requestBuffer(ncclComm_t comm, size_t size
     {
         commBuffers.push_back({buffer, true});
     }
+    else
+    {
+        // The collective allreduce inside allocateAndRegisterBuffer agreed that at least one
+        // rank could not allocate symmetric memory. Mark this comm so future requests don't
+        // retry the failing path on every autotuner trial.
+        mSymmetricUnavailable.insert(comm);
+    }
 
     return buffer;
 }
@@ -604,6 +622,7 @@ void NCCLWindowAllocator::cleanupBuffersForComm(ncclComm_t comm) noexcept
     {
         // No buffers to clean up, but mark as cleaned
         mRegisteredComms.erase(comm);
+        mSymmetricUnavailable.erase(comm);
         return;
     }
 
@@ -679,6 +698,7 @@ void NCCLWindowAllocator::cleanupBuffersForComm(ncclComm_t comm) noexcept
 
     mBufferPool.erase(commIt);
     mRegisteredComms.erase(comm);
+    mSymmetricUnavailable.erase(comm);
 }
 
 #endif // NCCL_VERSION_CODE >= NCCL_VERSION(2, 28, 0)
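
The three pieces of this change (check the set first, record a failure, erase on teardown) form a small negative cache. The following is a minimal standalone sketch of that pattern, not the allocator itself: `NegativeCacheSketch`, `CommHandle`, `onCommDestroyed`, and the `expensiveAttemptSucceeds` parameter are hypothetical and stand in for `NCCLWindowAllocator`, `ncclComm_t`, `cleanupBuffersForComm`, and the real allocate-and-agree step. The counter exists only to make the saved work observable.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_set>

// Hypothetical communicator handle; the real code keys on ncclComm_t.
using CommHandle = int;

class NegativeCacheSketch
{
public:
    // Returns true if a buffer was obtained, false if the caller should fall back.
    // `expensiveAttemptSucceeds` stands in for the outcome of the costly
    // allocate-and-agree step (an assumption made for this illustration).
    bool request(CommHandle comm, bool expensiveAttemptSucceeds)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        if (mUnavailable.count(comm) != 0)
        {
            return false; // Known-bad comm: skip the expensive attempt entirely.
        }
        ++mExpensiveAttempts;
        if (!expensiveAttemptSucceeds)
        {
            mUnavailable.insert(comm); // Remember the failure for later requests.
        }
        return expensiveAttemptSucceeds;
    }

    // Forget the recorded failure when the comm is torn down.
    void onCommDestroyed(CommHandle comm)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mUnavailable.erase(comm);
    }

    // Number of times the expensive path actually ran.
    std::size_t expensiveAttempts() const
    {
        std::lock_guard<std::mutex> lock(mMutex);
        return mExpensiveAttempts;
    }

private:
    mutable std::mutex mMutex;
    std::unordered_set<CommHandle> mUnavailable;
    std::size_t mExpensiveAttempts = 0;
};
```

In this sketch, a second `request` for a comm that already failed returns immediately and leaves the attempt counter unchanged. After `onCommDestroyed`, the next request runs the expensive path again, which mirrors the invalidation-on-teardown behavior described in the commit message.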

cpp/tensorrt_llm/common/ncclUtils.h

Lines changed: 6 additions & 0 deletions
@@ -289,6 +289,12 @@ class NCCLWindowAllocator
     mutable std::mutex mMutex;
     std::unordered_map<ncclComm_t, std::vector<BufferEntry>> mBufferPool;
     std::unordered_set<ncclComm_t> mRegisteredComms;
+    // Comms whose symmetric memory path is known to fail collectively (e.g. H100 PCIe without
+    // NVLink fabric where ncclMemAlloc returns ncclUnhandledCudaError on at least one rank).
+    // Once recorded, subsequent requestBuffer() calls short-circuit to NCCLWindowBuffer{} so we
+    // don't repeatedly trigger the warning, the rank-sync allreduce, and the sticky-error drain
+    // for every autotuner trial.
+    std::unordered_set<ncclComm_t> mSymmetricUnavailable;
 };
 
 // RAII wrapper for NCCL window buffers
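
The comments in this change say callers "transparently fall back to regular allreduce" when they receive an empty buffer. How a caller detects that is not shown in this diff, so the sketch below is an assumption: `WindowBufferSketch`, its `isValid()` check, and `chooseAllReducePath` are hypothetical names used only to illustrate dispatching on a default-constructed (empty) buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical mirror of a window buffer: a default-constructed instance
// (null pointer, zero size) means "no symmetric buffer available".
struct WindowBufferSketch
{
    void* ptr = nullptr;
    std::size_t size = 0;

    bool isValid() const
    {
        return ptr != nullptr && size > 0;
    }
};

// Hypothetical caller-side dispatch: use the symmetric-memory path only when
// a valid buffer came back, otherwise take the regular allreduce path.
std::string chooseAllReducePath(WindowBufferSketch const& buffer)
{
    return buffer.isValid() ? "symmetric" : "regular";
}
```

With this shape, the short-circuit added to requestBuffer() needs no change on the caller side: an empty return value from the cache looks the same as an empty return value from a failed allocation.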
