[Data] Replace on_exit hook with __ray_shutdown__ to fix UDF cleanup race condition (ray-project#61700)

rayhhome · web-flow · commit 16eaed0ec487 · 2026-03-20T16:41:02.000-07:00
## Description

Replaces `_MapWorker.on_exit()` with `_MapWorker.__ray_shutdown__()` and removes the `DataContext._enable_actor_pool_on_exit_hook` workaround flag.

### What changed and why:

The old approach called `actor.on_exit.remote()` (a regular actor task) in `_release_running_actor`, then called `ray.wait(...)` with a 30-second timeout in `_release_running_actors` to block until the hook finished. This had two problems:

- Opt-in only. The hook was gated behind `DataContext._enable_actor_pool_on_exit_hook`, which defaulted to `False`. UDF cleanup was silently skipped unless users knew to set the private flag.
- Fault-tolerance race condition. Because `on_exit` was submitted as a regular task, a lineage-reconstruction retry could be routed to the same actor after `on_exit` had already deleted the UDF. This may cause the retried task to execute against a `None` UDF instance.

### The new approach:

- Renames `on_exit()` to `__ray_shutdown__()` on `_MapWorker`, using Ray Core's native actor shutdown hook, which is called directly by the worker process before it exits.
- Replaces `.options().remote()` with `._remote()` for actor task submission. `ActorMethod.options()` creates a `FuncWrapper` closure that captures the `ActorMethod` (and therefore the `ActorHandle`) in a closure cell, forming a reference cycle. This cycle prevents actor handles from being collected by reference counting alone, meaning `__ray_shutdown__` would never fire without an explicit `gc.collect()`. Calling `._remote()` directly avoids the `FuncWrapper` entirely, so actor handles are collected by reference counting once all strong references are dropped.
- Relies on passive GC (reference counting) to trigger `__ray_shutdown__`. During graceful shutdown, the actor pool drops its references to actor handles in `_release_running_actor`.
- UDF cleanup is now unconditional. `__ray_shutdown__` is always called on graceful actor exit with no flag, no timeout, and no explicit termination task.
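The reference-cycle point can be reproduced without Ray. The following is a minimal sketch on standard CPython, not Ray's actual implementation: `FakeHandle`, `FakeMethod`, and `probe_handle_lifetime` are hypothetical stand-ins, and `FakeMethod.options()` only mimics the shape described above (a locally defined `FuncWrapper` class whose method closes over the method object). A locally defined class is itself part of a reference cycle, so anything its methods capture stays alive until the cyclic garbage collector runs.

```python
import gc
import weakref


class FakeHandle:
    """Hypothetical stand-in for an actor handle (not Ray's ActorHandle)."""


class FakeMethod:
    """Hypothetical stand-in for an actor method holding a strong handle ref."""

    def __init__(self, handle):
        self._handle = handle

    def _remote(self, args=None, kwargs=None, **options):
        # Pretend to submit a task; no wrapper object is created here.
        return ("submitted", tuple(args or ()), dict(kwargs or {}), options)

    def options(self, **options):
        method = self

        # The class object is self-referential (e.g. via its MRO), and its
        # `remote` function captures `method` in a closure cell.
        class FuncWrapper:
            def remote(self, *args, **kwargs):
                return method._remote(args=args, kwargs=kwargs, **options)

        return FuncWrapper()


def probe_handle_lifetime(use_options: bool):
    """Return (alive after dropping refs, alive after gc.collect())."""
    gc_was_enabled = gc.isenabled()
    gc.disable()  # Keep the cyclic collector from running on its own.
    try:
        handle = FakeHandle()
        probe = weakref.ref(handle)
        method = FakeMethod(handle)
        if use_options:
            method.options(num_returns="streaming").remote(1, slices=None)
        else:
            method._remote(
                args=[1], kwargs={"slices": None}, num_returns="streaming"
            )
        del handle, method  # Drop every strong reference we hold.
        alive_after_refcounting = probe() is not None
        gc.collect()
        alive_after_gc = probe() is not None
        return alive_after_refcounting, alive_after_gc
    finally:
        if gc_was_enabled:
            gc.enable()
```

In this sketch, the `options()` path should leave the handle alive until `gc.collect()` runs, while the `_remote()` path should release it as soon as the last strong reference is dropped, which is the behavior the change relies on for `__ray_shutdown__` to fire promptly.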
### Removed:

- `DataContext._enable_actor_pool_on_exit_hook` (the flag is no longer needed because cleanup is now zero-cost and unconditional).
- `_MapWorker.on_exit()` (replaced by `__ray_shutdown__()`).
- The `on_exit_refs` collection and `ray.wait()` call in `_release_running_actors`.
- `_ActorPool._ACTOR_POOL_GRACEFUL_SHUTDOWN_TIMEOUT_S`.

## Related issues

Related to ray-project#53249 and partially resolves ray-project#60453.

## Additional information

The race condition in the old `on_exit` approach:

- Actor A is processing Task T.
- `_release_running_actor` submits `actor.on_exit.remote()`, which is added to the actor's task queue.
- Task T fails and its retry is routed back to Actor A.
- `on_exit` runs and deletes the UDF.
- The retry arrives and executes against a `None` UDF, which crashes.

With the new approach:

- `_release_running_actor` drops all pool references to the actor handle.
- Once `_data_tasks` are cleared during shutdown, the actor handle's refcount reaches zero and the actor exits gracefully.
- Ray Core calls `__ray_shutdown__` directly in the worker process before exit, after all pending tasks complete.
- `__ray_shutdown__` runs as part of the actor's exit sequence and is guaranteed to be the last thing to run before the process terminates. Because it is never queued behind or ahead of regular tasks, the FIFO ordering problem (and with it the race condition) cannot occur.

The old `_enable_actor_pool_on_exit_hook` was a private, temporary workaround documented as having this race condition. It has been removed entirely because UDF cleanup is now unconditional and safe by default. Users who were setting `ctx._enable_actor_pool_on_exit_hook = True` will get the same behavior automatically with no code changes.

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: HFFuture <ray.huang@anyscale.com>
diff --git a/python/ray/data/_internal/execution/operators/actor_pool_map_operator.py b/python/ray/data/_internal/execution/operators/actor_pool_map_operator.py
@@ -185,7 +185,6 @@ def _create_actor_pool(
             create_actor_fn=self._start_actor,
             config=config,
             map_worker_cls_name=self._map_worker_cls_name,
-            _enable_actor_pool_on_exit_hook=self.data_context._enable_actor_pool_on_exit_hook,
         )
 
     def _create_actor_pool_config(
@@ -392,16 +391,17 @@ def _try_schedule_tasks_internal(self) -> int:
             )
             actor_task_args = dict(self._ray_actor_task_remote_args)
             extra_labels = actor_task_args.pop("_labels", None) or {}
-            gen = actor.submit.options(
+
+            # Call _remote() directly instead of .options().remote() to
+            # avoid the FuncWrapper closure in ActorMethod.options(), which
+            # creates a reference cycle that prevents the ActorHandle from
+            # being collected by reference counting alone.
+            gen = actor.submit._remote(
+                args=[self.data_context, ctx, *input_blocks],
+                kwargs={"slices": bundle.slices, **self.get_map_task_kwargs()},
                 num_returns="streaming",
                 _labels={self._OPERATOR_ID_LABEL_KEY: self.id, **extra_labels},
                 **actor_task_args,
-            ).remote(
-                self.data_context,
-                ctx,
-                *input_blocks,
-                slices=bundle.slices,
-                **self.get_map_task_kwargs(),
             )
 
             def _task_done_callback(actor_to_return):
@@ -700,12 +700,19 @@ def __repr__(self):
         # This can happen during actor restarts or initialization failures.
         return f"MapWorker({getattr(self, 'src_fn_name', '<initializing>')})"
 
-    def on_exit(self):
-        """Called when the actor is about to exist.
-        This enables performing cleanup operations via `UDF.__del__`.
+    def __ray_shutdown__(self):
+        """Called by Ray Core when the actor exits gracefully.
+
+        Triggered when all Python actor handles go out of scope and the handle
+        is collected by reference counting.
+
+        During graceful shutdown, ActorPoolMapOperator clears _data_tasks and
+        drops pool references so handles become collectible immediately.
+        Ray Core guarantees this is called after all pending tasks complete
+        and before the actor process exits.
 
-        Note, this only ensures cleanup is performed when the job exists gracefully.
-        If the driver or the actor is forcefully killed, `__del__` will not be called.
+        Note: this is NOT called if the actor is forcefully killed (e.g. via
+        `ray.kill(actor)`) or crashes unexpectedly.
         """
         # `_map_actor_context` is a global variable that references the UDF object.
         # Delete it to trigger `UDF.__del__`.
@@ -738,15 +745,13 @@ class _ActorPool(AutoscalingActorPool):
     """
 
     _ACTOR_POOL_SCALE_DOWN_DEBOUNCE_PERIOD_S = 10
-    _ACTOR_POOL_GRACEFUL_SHUTDOWN_TIMEOUT_S = 30
 
     def __init__(
         self,
         create_actor_fn: Callable[[Dict[str, str]], Tuple[ActorHandle, ObjectRef[Any]]],
         config: AutoscalingActorConfig,
         map_worker_cls_name: str = "MapWorker",
         debounce_period_s: int = _ACTOR_POOL_SCALE_DOWN_DEBOUNCE_PERIOD_S,
-        _enable_actor_pool_on_exit_hook: bool = False,
     ):
         """Initialize the actor pool.
 
@@ -760,13 +765,10 @@ def __init__(
                 purposes.
             debounce_period_s: Debounce period for scaling down after scaling
                 up.
-            _enable_actor_pool_on_exit_hook: Whether to enable the actor pool
-                on exit hook.
         """
         super().__init__(config=config)
 
         self._create_actor_fn = create_actor_fn
-        self._enable_actor_pool_on_exit_hook = _enable_actor_pool_on_exit_hook
         self._map_worker_cls_name = map_worker_cls_name
         self._debounce_period_s = debounce_period_s
         # Timestamp of the last scale up action
@@ -1095,35 +1097,24 @@ def _release_pending_actors(self, force: bool):
     def _release_running_actors(self, force: bool):
         running = list(self._running_actors.keys())
 
-        on_exit_refs = []
-
-        # First release actors and collect their shutdown hook object-refs
         for actor in running:
-            ref = self._release_running_actor(actor)
-            if ref:
-                on_exit_refs.append(ref)
-
-        # Wait for all actors to shutdown gracefully before killing them
-        ray.wait(on_exit_refs, timeout=self._ACTOR_POOL_GRACEFUL_SHUTDOWN_TIMEOUT_S)
+            self._release_running_actor(actor)
 
         # NOTE: Actors can't be brought back after being ``ray.kill``-ed,
         #       hence we're only doing that if this is a forced release
         if force:
             for actor in running:
                 ray.kill(actor)
 
-    def _release_running_actor(self, actor: ActorHandle) -> Optional[ObjectRef]:
-        """Remove the given actor from the pool and trigger its `on_exit` callback.
-
-        This method returns a ``ref`` to the result
-        """
+    def _release_running_actor(self, actor: ActorHandle):
+        """Remove the given actor from the pool by dropping all pool references."""
         # NOTE: By default, we remove references to the actor and let ref counting
         # garbage collect the actor, instead of using ray.kill.
         #
         # Otherwise, actor cannot be reconstructed for the purposes of produced
         # object's lineage reconstruction.
         if actor not in self._running_actors:
-            return None
+            return
 
         # Update cached statistics before removing the actor
         actor_state = self._running_actors[actor]
@@ -1139,17 +1130,9 @@ def _release_running_actor(self, actor: ActorHandle) -> Optional[ObjectRef]:
         if actor_state.is_restarting:
             self._num_restarting_actors -= 1
 
-        if self._enable_actor_pool_on_exit_hook:
-            # Call `on_exit` to trigger `UDF.__del__` which may perform
-            # cleanup operations.
-            ref = actor.on_exit.remote()
-        else:
-            ref = None
         del self._running_actors[actor]
         del self._actor_to_logical_id[actor]
 
-        return ref
-
     def _rank_actors(
         self,
         actors: List[ActorHandle],
diff --git a/python/ray/data/context.py b/python/ray/data/context.py
@@ -723,12 +723,6 @@ class DataContext:
     override_object_store_memory_limit_fraction: float = None
     memory_usage_poll_interval_s: Optional[float] = 1
     dataset_logger_id: Optional[str] = None
-    # This is a temporary workaround to allow actors to perform cleanup
-    # until https://github.com/ray-project/ray/issues/53169 is fixed.
-    # This hook is known to have a race condition bug in fault tolerance.
-    # I.E., after the hook is triggered and the UDF is deleted, another
-    # retry task may still be scheduled to this actor and it will fail.
-    _enable_actor_pool_on_exit_hook: bool = False
 
     issue_detectors_config: "IssueDetectorsConfiguration" = field(
         default_factory=_issue_detectors_config_factory
diff --git a/python/ray/data/tests/test_actor_pool_map_operator.py b/python/ray/data/tests/test_actor_pool_map_operator.py
@@ -76,7 +76,7 @@ def __init__(self, node_id: str = "node1"):
     def get_location(self) -> str:
         return self.node_id
 
-    def on_exit(self):
+    def __ray_shutdown__(self):
         pass
 
 
@@ -170,7 +170,6 @@ def _create_actor_pool(
             create_actor_fn=self._create_actor_fn,
             map_worker_cls_name=map_worker_cls_name,
             config=config,
-            _enable_actor_pool_on_exit_hook=False,
         )
         return pool
 
@@ -805,7 +804,6 @@ def create_actor_fn(
         map_worker_cls_name="MapWorker(TestOp)",
         config=config,
         debounce_period_s=0,
-        _enable_actor_pool_on_exit_hook=False,
     )
 
     with caplog.at_level(logging.DEBUG, logger=logger_name):
diff --git a/python/ray/data/tests/test_map.py b/python/ray/data/tests/test_map.py
@@ -888,8 +888,6 @@ def test_actor_udf_cleanup(
     """Test that for the actor map operator, the UDF object is deleted properly."""
     ray.shutdown()
     ray.init(num_cpus=2)
-    ctx = DataContext.get_current()
-    ctx._enable_actor_pool_on_exit_hook = True
 
     test_file = tmp_path / "test.txt"