Skip to content

Lock the goal-handle / result-future registry in ActionServer#1668

Open
zeon01 wants to merge 1 commit into
ros2:rollingfrom
zeon01:fix/action-server-expired-result-future
Open

Lock the goal-handle / result-future registry in ActionServer#1668
zeon01 wants to merge 1 commit into
ros2:rollingfrom
zeon01:fix/action-server-expired-result-future

Conversation

@zeon01
Copy link
Copy Markdown

@zeon01 zeon01 commented May 17, 2026

Summary

Closes #1236. Closes #1667.

ActionServer tracks goals in two dicts — _goal_handles and _result_futures — that are mutated and read as a pair. With the MultiThreadedExecutor, _execute_expire_goals can remove a goal from both dicts on one thread while _execute_goal, _execute_get_result_request, _execute_cancel_request or ServerGoalHandle._set_result reads them on another. The reader then hits a KeyError and the executor crashes.

This adds _goal_registry_lock and takes it around every access to the dict pair, so a reader never observes one dict updated without the other.

Design notes

  • threading.Lock, not asyncio.Lock. MultiThreadedExecutor steps these coroutines on real OS threads; asyncio.Lock is not thread-safe. The race holds even under MutuallyExclusiveCallbackGroup, because _execute_goal is scheduled as its own executor task via notify_execute (executor.create_task) and is therefore not serialized against the waitable's _execute_expire_goals by the callback group.
  • The lock is only ever held across synchronous sections — no await, no user-callback invocation. Where a coroutine acts on a Future, the reference is fetched under the lock and set_result() / add_done_callback() is called after release.
  • It is a separate lock from the existing self._lock and the two are never nested, so it cannot deadlock.
  • With the registry consistent, _execute_get_result_request reliably detects an expired goal and returns STATUS_UNKNOWN instead of crashing.

This replaces the earlier defensive-lookup-only revision of this PR, per review feedback that the root-cause fix is locking the dict pair as a single unit of state.

Test plan

Two regression tests added to test_action_server.py:

Verified against osrf/ros:rolling-desktop:

  • Without the patch: both new tests fail with KeyError (_execute_goal, _execute_get_result_request).
  • With the patch: both pass.
  • Full test_action_server.py — 28 tests — pass. ament_flake8 / ament_pep257 / ament_copyright pass.

Out of scope / follow-ups

  • A cleaner long-term shape is to collapse the two dicts into a single Dict[bytes, GoalEntry]; kept out of this PR to stay minimal — can follow up if preferred.
  • succeed(response) / abort(response) set the result future and _execute_goal then sets it again; rclpy's Future.set_result() overwrites + re-fires callbacks rather than raising, so the result can be sent twice. Separate issue from this race.

Copy link
Copy Markdown
Collaborator

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zeon01 thanks for the patch.

i believe this is a defensive patch, this could mitigate the situation a bit, but not a real fix.
to address and get to the root cause, i think that the proper fix is an asyncio.Lock (or threading.Lock depending on which executor path can touch these) that covers the pair of dicts as a single unit of state.

@fujitatomoya
Copy link
Copy Markdown
Collaborator

CC: @Barry-Xu-2018

`ActionServer` tracks goals in two dicts, `_goal_handles` and
`_result_futures`, that must be mutated and read as a pair. With the
MultiThreadedExecutor, `_execute_expire_goals` can remove a goal from
both dicts on one thread while `_execute_goal`,
`_execute_get_result_request`, `_execute_cancel_request` or
`ServerGoalHandle._set_result` reads them on another, raising
`KeyError` out of the executor and crashing the node.

Add `_goal_registry_lock` and take it around every access to the dict
pair so a reader never observes one dict updated without the other.
The lock is only ever held across synchronous sections (no `await`)
and is kept disjoint from the existing `self._lock`, so it cannot
deadlock.

With the registry kept consistent, `_execute_get_result_request`
reliably detects an expired goal and returns `STATUS_UNKNOWN` instead
of crashing, and `_execute_goal` / `_set_result` skip a result future
that has already been removed.

Signed-off-by: Saad Sharif Ahmed <saad.sharifahmed@gmail.com>
@zeon01 zeon01 force-pushed the fix/action-server-expired-result-future branch from ca3ce21 to 44f4473 Compare May 20, 2026 03:16
@zeon01 zeon01 changed the title Defensively look up result future in _execute_get_result_request Lock the goal-handle / result-future registry in ActionServer May 20, 2026
@zeon01
Copy link
Copy Markdown
Author

zeon01 commented May 20, 2026

Reworked to a lock-based fix — _goal_registry_lock now covers the _goal_handles / _result_futures pair as a single unit of state.

It's a threading.Lock: MultiThreadedExecutor steps these coroutines on real OS threads, and asyncio.Lock is not thread-safe. The race holds even under MutuallyExclusiveCallbackGroup_execute_goal is scheduled as its own executor task via notify_execute (executor.create_task), so the callback group does not serialize it against the waitable's _execute_expire_goals.

The lock is taken around every access to the pair (_execute_goal_request, _execute_goal, _execute_get_result_request, _execute_cancel_request, _execute_expire_goals, ServerGoalHandle._set_result, destroy()). It is only ever held across synchronous sections — no await, no user-callback invocation: a Future reference is fetched under the lock and set_result() / add_done_callback() called after release, so done-callbacks can't re-enter it. It is disjoint from self._lock and never nested.

Verified: regression tests for #1236 and #1667 fail with KeyError against the current code and pass with the lock; the full test_action_server.py suite (28 tests) and the flake8/pep257/copyright linters pass.

Two out-of-scope notes:

  • A cleaner long-term shape would collapse _goal_handles and _result_futures into a single Dict[bytes, GoalEntry] so the pair can't desync structurally. I kept this PR to the lock to stay minimal — happy to follow up if you'd prefer.
  • ServerGoalHandle.succeed(response) / abort(response) call _set_result, and _execute_goal then sets the same future again, so passing a non-None response sets the future twice. Future.set_result() doesn't raise on a second call — it overwrites the result and re-fires done callbacks — so the result response can be sent twice. Separate from this race; happy to file an issue if it isn't already tracked.

@christophebedard christophebedard added the more-information-needed Further information is required label May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

more-information-needed Further information is required

Projects

None yet

Development

Successfully merging this pull request may close these issues.

KeyError in ActionServer._execute_goal under goal-timeout abort (race on _result_futures) KeyError in ActionServer._execute_goal

3 participants