Commit 467abbb
[nvbugs/6062416][fix] Cache per-comm NCCL symmetric-memory unavailability
Once the collective ncclAllReduce(min) inside allocateAndRegisterBuffer reports that
at least one rank failed ncclMemAlloc, every subsequent requestBuffer() call for the
same comm short-circuits to an empty NCCLWindowBuffer instead of re-running the failing
ncclMemAlloc and the rank-sync allreduce on every autotuner trial. This is cluster-safe
because the unavailability decision comes from a collective allreduce: all ranks see
the same result, so they stay in sync without further communication. The cache entry
is invalidated when the comm is torn down.
Signed-off-by: tensorrt-cicd <90828364+tensorrt-cicd@users.noreply.github.com>
1 parent fb0c8cc
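The caching behaviour described in the commit message can be sketched roughly as follows. This is a minimal single-process simulation, not the code from this commit: it makes no NCCL calls, and `CommId`, `WindowBuffer`, `SymmetricBufferAllocator`, `allReduceMin`, `onCommDestroyed`, and `allocAttempts` are hypothetical names introduced only for illustration. Only `requestBuffer` takes its name from the message. The vector of per-rank flags stands in for what each rank would contribute to the real min-allreduce.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a communicator handle (not the real ncclComm_t).
using CommId = int;

// Hypothetical stand-in for NCCLWindowBuffer: size 0 means "empty, fall back".
struct WindowBuffer
{
    std::size_t size = 0;
    bool valid() const { return size != 0; }
};

// Simulates the collective min-allreduce over per-rank success flags.
// A single 0 (failed allocation) makes the result 0 for every rank, which is
// why all ranks reach the same decision.
inline int allReduceMin(std::vector<int> const& perRankFlags)
{
    assert(!perRankFlags.empty());
    return *std::min_element(perRankFlags.begin(), perRankFlags.end());
}

class SymmetricBufferAllocator
{
public:
    // perRankCanAlloc[i] is 1 if the simulated allocation on rank i succeeds.
    explicit SymmetricBufferAllocator(std::vector<int> perRankCanAlloc)
        : mPerRankCanAlloc(std::move(perRankCanAlloc))
    {
    }

    WindowBuffer requestBuffer(CommId comm, std::size_t bytes)
    {
        // Cached negative result: skip both the allocation and the allreduce.
        if (mUnavailable.count(comm) != 0)
        {
            return WindowBuffer{};
        }

        ++mAllocAttempts; // one simulated alloc + rank-sync round
        if (allReduceMin(mPerRankCanAlloc) == 0)
        {
            // At least one rank failed, so remember that for this comm.
            mUnavailable.insert(comm);
            return WindowBuffer{};
        }

        WindowBuffer buffer;
        buffer.size = bytes;
        return buffer;
    }

    // Invalidate the cached decision when the comm is torn down.
    void onCommDestroyed(CommId comm) { mUnavailable.erase(comm); }

    int allocAttempts() const { return mAllocAttempts; }

private:
    std::vector<int> mPerRankCanAlloc;
    std::unordered_set<CommId> mUnavailable;
    int mAllocAttempts = 0;
};
```

In this sketch, repeated `requestBuffer` calls on a comm where one rank cannot allocate perform only one simulated allocation round; a further round happens only after `onCommDestroyed` clears the entry. In a real multi-rank job each rank would hold its own copy of the cache, and the copies stay consistent because every rank receives the same allreduce result.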
2 files changed
Lines changed: 26 additions & 0 deletions
| Changed file | Original file lines | Diff lines | Lines added |
|---|---|---|---|
| 1 | 320-325 | 320-336 | 11 (diff lines 323-333) |
| 1 | 371-376 | 382-394 | 7 (diff lines 385-391) |
| 1 | 604-609 | 622-628 | 1 (diff line 625) |
| 1 | 679-684 | 698-704 | 1 (diff line 701) |
| 2 | 289-294 | 289-300 | 6 (diff lines 292-297) |