Commit 9141b2e

DistArray::lazy_deleter: skip lazy_sync when invoked from fence's do_cleanup
Use the new MADNESS `WorldGopInterface::is_in_do_cleanup()` flag to short-circuit the cross-rank `lazy_sync` handshake when `lazy_deleter` is called from inside `fence_impl`'s deferred-cleanup phase: `delete pimpl` directly, decrement `cleanup_counter_`, return.

Why it is safe:
- `fence_impl` runs the global-termination protocol before calling `deferred_->do_cleanup()`, so all ranks are at the same point with no AM in flight.
- `defer_deleter_to_next_fence` is, by contract, used collectively, so every rank's deferred list holds the same set of pimpls at this point and every rank performs the matching delete in lockstep.
- The `lazy_sync` handshake exists to guarantee that no peer is still about to send AM addressed to this object before we delete it; the fence already establishes that.

Why it matters: the original `lazy_sync` path enqueues a `lazy_sync_children` task on this world's taskq *after* the fence's drain loop has exited. Such tasks survive the fence and are picked up later by some other fence that drives the global ThreadPool. If the world is destroyed in the meantime (e.g. einsum's per-Hadamard sub-Worlds torn down at function exit or during exception unwind), the stranded task runs `delete pimpl` against a world whose taskq / gop are already freed; `~WorldObject` then trips its `World::exists(&world)` assertion and aborts, masking any real error. The fast path avoids ever scheduling that task.

The general (non-deferred) path is unchanged: `lazy_deleter` invoked outside `do_cleanup` still goes through `lazy_sync` because we cannot rely on synchronization with peers in that case.
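
The ordering hazard described above can be illustrated with a minimal single-process sketch. Everything in it is a hypothetical stand-in, not MADNESS or TiledArray code: `MockWorld` models a world with a task queue and a deferred-deleter list, its `in_do_cleanup` member plays the role of `is_in_do_cleanup()`, and `lazy_deleter` here takes an extra `use_fast_path` switch purely so both behaviors can be compared. The slow path models `lazy_sync` as "enqueue a task that deletes later"; the real handshake is cross-rank and is not reproduced.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a world; NOT the MADNESS API.
struct MockWorld {
  std::vector<std::function<void()>> taskq;     // pending tasks
  std::vector<std::function<void()>> deferred;  // deleters deferred to the next fence
  bool in_do_cleanup = false;                   // models is_in_do_cleanup()

  // Models the fence ordering from the commit message:
  // drain the task queue first, then run deferred cleanup.
  void fence() {
    while (!taskq.empty()) {
      auto task = std::move(taskq.back());
      taskq.pop_back();
      task();
    }
    in_do_cleanup = true;
    auto pending = std::move(deferred);
    deferred.clear();
    for (auto& deleter : pending) deleter();
    in_do_cleanup = false;
  }
};

// Hypothetical object that tracks how many instances are still alive.
struct MockImpl {
  int* live;
  explicit MockImpl(int* l) : live(l) { ++*live; }
  ~MockImpl() { --*live; }
};

// Sketch of the two paths. `use_fast_path` exists only for comparison.
void lazy_deleter(MockWorld& world, MockImpl* pimpl, bool use_fast_path) {
  if (use_fast_path && world.in_do_cleanup) {
    delete pimpl;  // fast path: the fence already established quiescence
    return;
  }
  // Slow path: models lazy_sync scheduling a task that deletes later.
  world.taskq.push_back([pimpl]() { delete pimpl; });
}

struct Outcome {
  std::size_t stranded_tasks;  // tasks left on the queue after the fence
  int live_objects;            // objects not yet deleted after the fence
};

Outcome run_fence(bool use_fast_path) {
  int live = 0;
  MockWorld world;
  auto* pimpl = new MockImpl(&live);
  world.deferred.push_back([&world, pimpl, use_fast_path]() {
    lazy_deleter(world, pimpl, use_fast_path);
  });
  world.fence();
  Outcome out{world.taskq.size(), live};
  // Drain leftovers so the sketch itself does not leak.
  for (auto& task : world.taskq) task();
  world.taskq.clear();
  assert(live == 0);
  return out;
}
```

Calling `run_fence(false)` leaves one task on the queue and one live object after the fence returns, which is the stranded-task situation; `run_fence(true)` leaves neither. The sketch only demonstrates the ordering issue and says nothing about the cross-rank guarantees that make the real fast path safe.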
1 parent 7d18300 commit 9141b2e

1 file changed

Lines changed: 18 additions & 0 deletions

src/TiledArray/array_impl.h

@@ -480,6 +480,24 @@ class ArrayImpl : public TensorImpl<Policy>,
     // wait for all DelayedSet's to vanish
     world.await([&]() { return (pimpl->num_live_ds() == 0); }, true);
 
+    // Fast path when invoked from inside the fence's deferred-cleanup
+    // phase: the global-termination protocol has already established
+    // global quiescence (no in-flight AM, all ranks at the same point),
+    // and symmetric collective use of `defer_deleter_to_next_fence()`
+    // guarantees every rank has this same pimpl in its deferred list
+    // and so reaches this same delete in lockstep. The cross-rank
+    // lazy_sync handshake below is therefore redundant; it would also
+    // schedule a lazy_sync_children task on this world's taskq that the
+    // fence cannot drain (do_cleanup runs after the drain loop) and
+    // that would later be run by some unrelated fence -- against freed
+    // state if this world is destroyed before then (e.g. einsum's
+    // per-Hadamard sub-Worlds).
+    if (world.gop.is_in_do_cleanup()) {
+      delete pimpl;
+      cleanup_counter_--;
+      return;
+    }
+
     try {
       world.gop.lazy_sync(id, [pimpl]() {
         delete pimpl;
