Commit 16eaed0
[Data] Replace on_exit hook with __ray_shutdown__ to fix UDF cleanup race condition (ray-project#61700)
## Description
Replaces `_MapWorker.on_exit()` with `_MapWorker.__ray_shutdown__()` and
removes the `DataContext._enable_actor_pool_on_exit_hook` workaround
flag.
### What changed and why:
The old approach called `actor.on_exit.remote()` (a regular actor task)
in `_release_running_actor`, then used `ray.wait(..., timeout=30s)` to
block until the hook finished. This had two problems:
- Opt-in only. The hook was gated behind
`DataContext._enable_actor_pool_on_exit_hook`, which defaulted to
`False`. UDF cleanup was silently skipped unless users knew to set the
private flag.
- Fault-tolerance race condition. Because `on_exit` was submitted as a
regular task, a lineage-reconstruction retry could be routed to the same
actor after `on_exit` had already deleted the UDF. This may cause the
retried task to execute against a `None` UDF instance.
### The new approach:
- Renames `on_exit()` to `__ray_shutdown__()` on `_MapWorker`, using Ray
Core's native actor shutdown hook, which is called directly by the
worker process before it exits.
- Replaces `.options().remote()` with `._remote()` for actor task
submission. `ActorMethod.options()` creates a `FuncWrapper` closure that
captures the `ActorMethod` (and therefore the `ActorHandle`) in a
closure cell, forming a reference cycle. This cycle prevents actor
handles from being collected by reference counting alone, meaning
`__ray_shutdown__` would never fire without explicit `gc.collect()`.
Using `._remote()` directly avoids the `FuncWrapper` entirely, so actor
handles are collected properly by reference counting once all strong
references are dropped.
- Relies on passive GC (reference counting) to trigger
`__ray_shutdown__`. During graceful shutdown, the actor pool drops its
references to actor handles in `_release_running_actor`.
- UDF cleanup is now unconditional. `__ray_shutdown__` is always called
on graceful actor exit with no flag, no timeout, and no explicit
termination task.
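
The reference-cycle point above can be reproduced without Ray. The following is a minimal sketch, not Ray's actual code: `FakeHandle`, `FakeMethod`, and `handle_survives` are hypothetical stand-ins, and only the shape (a locally defined `FuncWrapper` class whose method closes over the method object) is taken from the description. A class object always participates in reference cycles of its own, so anything its methods capture stays alive until the cyclic garbage collector runs.

```python
import gc
import weakref


class FakeHandle:
    """Hypothetical stand-in for an actor handle."""


class FakeMethod:
    """Hypothetical stand-in for an actor method that holds its handle."""

    def __init__(self, handle):
        self._handle = handle

    def _remote(self):
        # Direct submission path: no wrapper object is created.
        return "submitted"

    def options(self):
        method = self  # captured in a closure cell by FuncWrapper.remote

        class FuncWrapper:
            def remote(self):
                return method._remote()

        return FuncWrapper()


def handle_survives(use_options: bool) -> bool:
    """Return True if the handle outlives every named reference to it."""
    gc.collect()
    gc.disable()  # make the outcome depend on reference counting only
    try:
        handle = FakeHandle()
        probe = weakref.ref(handle)
        method = FakeMethod(handle)
        if use_options:
            method.options().remote()
        else:
            method._remote()
        del handle, method
        return probe() is not None
    finally:
        gc.enable()


print(handle_survives(use_options=True))
print(handle_survives(use_options=False))
```

In this model the `options()` path leaves the handle alive after the last named reference is dropped, while the direct `_remote()` path lets reference counting free it immediately, which is the property the shutdown hook depends on.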
### Removed:
- `DataContext._enable_actor_pool_on_exit_hook` (the flag is no longer
needed because cleanup is now zero-cost and unconditional).
- `_MapWorker.on_exit()` (replaced by `__ray_shutdown__()`).
- The `on_exit_refs` collection and `ray.wait()` call in
`_release_running_actors`.
- `_ActorPool._ACTOR_POOL_GRACEFUL_SHUTDOWN_TIMEOUT_S`.
## Related issues
Related to ray-project#53249 and partially resolves ray-project#60453.
## Additional information
The race condition in the old `on_exit` approach:
- Actor A is processing Task T.
- `_release_running_actor` submits `actor.on_exit.remote()`; the task is
added to the actor's queue.
- Task T fails and its retry is routed back to Actor A.
- `on_exit` runs and deletes the UDF.
- The retry arrives and executes against a `None` UDF, leading to a crash.
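
The ordering problem in the steps above can be sketched with a toy FIFO worker. This is an illustrative model only: `ToyWorker`, `submit`, and `drain` are hypothetical names, and a plain `deque` stands in for the actor's task queue.

```python
from collections import deque


class ToyWorker:
    """Hypothetical single-threaded worker that runs queued tasks in FIFO order."""

    def __init__(self, udf):
        self._udf = udf
        self._queue = deque()

    def submit(self, name, *args):
        self._queue.append((name, args))

    def on_exit(self):
        # Old-style cleanup submitted as a regular task: drops the UDF.
        self._udf = None

    def map(self, x):
        return self._udf(x)

    def drain(self):
        results = []
        while self._queue:
            name, args = self._queue.popleft()
            try:
                results.append((name, getattr(self, name)(*args)))
            except TypeError as exc:
                # Calling a None UDF raises TypeError.
                results.append((name, f"failed: {type(exc).__name__}"))
        return results


worker = ToyWorker(udf=lambda x: x * 2)
worker.submit("on_exit")  # cleanup is queued first
worker.submit("map", 21)  # the retry is queued behind it
outcome = worker.drain()
print(outcome)  # [('on_exit', None), ('map', 'failed: TypeError')]
```

Because cleanup is just another queued task here, nothing prevents later work from landing behind it.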
With the new approach:
- `_release_running_actor` drops all pool references to the actor
handle.
- Once `_data_tasks` are cleared during shutdown, the actor handle's
refcount reaches zero and the actor exits gracefully.
- Ray Core calls `__ray_shutdown__` directly in the worker process
before exit, after all pending tasks complete.
- `__ray_shutdown__` runs as part of the actor's exit sequence and is
guaranteed to be the last thing that runs before the process terminates.
Because it is never placed in the actor's FIFO task queue, no task can be
ordered after it, which removes the race condition.
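
The new sequence can be modelled in plain Python as well. This is a rough sketch under stated assumptions, not Ray's implementation: `ToyActorProcess` and `ToyHandle` are hypothetical, and `weakref.finalize` stands in for Ray Core noticing that the last handle reference is gone. It relies on CPython's immediate reference counting.

```python
import weakref


class ToyActorProcess:
    """Hypothetical worker process: finishes pending tasks, then runs the hook."""

    def __init__(self, udf):
        self.udf = udf
        self.pending = []
        self.log = []

    def exit_gracefully(self):
        # Pending tasks complete first, while the UDF still exists.
        for item in self.pending:
            self.log.append(("task", self.udf(item)))
        self.pending.clear()
        # The hook is invoked directly as the final step, never enqueued.
        self.__ray_shutdown__()

    def __ray_shutdown__(self):
        self.udf = None
        self.log.append(("shutdown", None))


class ToyHandle:
    """Hypothetical handle: the process exits once the last reference is dropped."""

    def __init__(self, process):
        # The callback must not reference the handle itself.
        self._finalizer = weakref.finalize(self, process.exit_gracefully)


process = ToyActorProcess(udf=str.upper)
pool = [ToyHandle(process)]  # the pool holds the only strong reference
process.pending.append("retry")
pool.clear()  # dropping the last reference triggers the graceful exit
print(process.log)  # [('task', 'RETRY'), ('shutdown', None)]
```

In this model the retried task always sees a live UDF, because cleanup is tied to process exit rather than to a position in the queue.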
The old `_enable_actor_pool_on_exit_hook` was a private, temporary
workaround documented as having this race condition. It has been removed
entirely as UDF cleanup is now unconditional and safe by default. Users
who were setting `ctx._enable_actor_pool_on_exit_hook = True` will get
the same behavior automatically with no code changes.
---------
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: HFFuture <ray.huang@anyscale.com>
1 parent f996fa0 commit 16eaed0
4 files changed
Lines changed: 25 additions & 52 deletions
File tree
- python/ray/data
- _internal/execution/operators
- tests
Lines changed: 24 additions & 41 deletions