Lock the goal-handle / result-future registry in ActionServer#1668
Lock the goal-handle / result-future registry in ActionServer#1668zeon01 wants to merge 1 commit into
Conversation
fujitatomoya
left a comment
There was a problem hiding this comment.
@zeon01 thanks for the patch.
i believe this is a defensive patch, this could mitigate the situation a bit, but not a real fix.
to address and get to the root cause, i think that the proper fix is an asyncio.Lock (or threading.Lock depending on which executor path can touch these) that covers the pair of dicts as a single unit of state.
|
CC: @Barry-Xu-2018 |
`ActionServer` tracks goals in two dicts, `_goal_handles` and `_result_futures`, that must be mutated and read as a pair. With the MultiThreadedExecutor, `_execute_expire_goals` can remove a goal from both dicts on one thread while `_execute_goal`, `_execute_get_result_request`, `_execute_cancel_request` or `ServerGoalHandle._set_result` reads them on another, raising `KeyError` out of the executor and crashing the node. Add `_goal_registry_lock` and take it around every access to the dict pair so a reader never observes one dict updated without the other. The lock is only ever held across synchronous sections (no `await`) and is kept disjoint from the existing `self._lock`, so it cannot deadlock. With the registry kept consistent, `_execute_get_result_request` reliably detects an expired goal and returns `STATUS_UNKNOWN` instead of crashing, and `_execute_goal` / `_set_result` skip a result future that has already been removed. Signed-off-by: Saad Sharif Ahmed <saad.sharifahmed@gmail.com>
ca3ce21 to
44f4473
Compare
|
Reworked to a lock-based fix — It's a The lock is taken around every access to the pair ( Verified: regression tests for #1236 and #1667 fail with Two out-of-scope notes:
|
Summary
Closes #1236. Closes #1667.
ActionServertracks goals in two dicts —_goal_handlesand_result_futures— that are mutated and read as a pair. With theMultiThreadedExecutor,_execute_expire_goalscan remove a goal from both dicts on one thread while_execute_goal,_execute_get_result_request,_execute_cancel_requestorServerGoalHandle._set_resultreads them on another. The reader then hits aKeyErrorand the executor crashes.This adds
_goal_registry_lockand takes it around every access to the dict pair, so a reader never observes one dict updated without the other.Design notes
threading.Lock, notasyncio.Lock.MultiThreadedExecutorsteps these coroutines on real OS threads;asyncio.Lockis not thread-safe. The race holds even underMutuallyExclusiveCallbackGroup, because_execute_goalis scheduled as its own executor task vianotify_execute(executor.create_task) and is therefore not serialized against the waitable's_execute_expire_goalsby the callback group.await, no user-callback invocation. Where a coroutine acts on aFuture, the reference is fetched under the lock andset_result()/add_done_callback()is called after release.self._lockand the two are never nested, so it cannot deadlock._execute_get_result_requestreliably detects an expired goal and returnsSTATUS_UNKNOWNinstead of crashing.This replaces the earlier defensive-lookup-only revision of this PR, per review feedback that the root-cause fix is locking the dict pair as a single unit of state.
Test plan
Two regression tests added to
test_action_server.py:test_get_result_request_handles_expired_future(KeyError in ActionServer._execute_goal #1236) — removes the result future, sends a get-result request; asserts aSTATUS_UNKNOWNresponse instead of a crash.test_execute_goal_handles_expired_future(KeyError in ActionServer._execute_goal under goal-timeout abort (race on _result_futures) #1667) — removes the result future while the execute callback runs; asserts_execute_goaldoes not raise.Verified against
osrf/ros:rolling-desktop:KeyError(_execute_goal,_execute_get_result_request).test_action_server.py— 28 tests — pass.ament_flake8/ament_pep257/ament_copyrightpass.Out of scope / follow-ups
Dict[bytes, GoalEntry]; kept out of this PR to stay minimal — can follow up if preferred.succeed(response)/abort(response)set the result future and_execute_goalthen sets it again; rclpy'sFuture.set_result()overwrites + re-fires callbacks rather than raising, so the result can be sent twice. Separate issue from this race.