Skip to content

Commit 575ece8

Browse files
committed
fix(action_status_bridge): address PR review feedback
Attribute faults to the action server node FQN (from the status publisher) so they associate with the SOVD entity; heal a still-failed action before pruning it. Per-action lifecycle, sanitized prefix, param validation, CANCELED codes. Unique test domain range, docs-build wiring, and a launch_testing integration test.
1 parent 4fda00c commit 575ece8

12 files changed

Lines changed: 844 additions & 102 deletions

File tree

docs/Doxyfile

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,8 @@ INPUT = ../src/ros2_medkit_gateway/include \
1616
../src/ros2_medkit_diagnostic_bridge/src \
1717
../src/ros2_medkit_log_bridge/include \
1818
../src/ros2_medkit_log_bridge/src \
19+
../src/ros2_medkit_action_status_bridge/include \
20+
../src/ros2_medkit_action_status_bridge/src \
1921
../src/ros2_medkit_serialization/include \
2022
../src/ros2_medkit_serialization/src \
2123
../src/ros2_medkit_msgs

docs/api/cpp.rst

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -91,6 +91,14 @@ Bridge node that promotes ROS 2 /rosout log entries to FaultManager faults.
9191
.. doxygenclass:: ros2_medkit_log_bridge::LogBridgeNode
9292
:members:
9393

94+
ros2_medkit_action_status_bridge
95+
--------------------------------
96+
97+
Bridge node that turns terminal ROS 2 action goal states into FaultManager faults.
98+
99+
.. doxygenclass:: ros2_medkit_action_status_bridge::ActionStatusBridgeNode
100+
:members:
101+
94102
ros2_medkit_serialization
95103
-------------------------
96104

docs/changelog.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,8 @@ This page aggregates changelogs from all ros2_medkit packages.
1616

1717
.. include:: ../src/ros2_medkit_log_bridge/CHANGELOG.rst
1818

19+
.. include:: ../src/ros2_medkit_action_status_bridge/CHANGELOG.rst
20+
1921
.. include:: ../src/ros2_medkit_serialization/CHANGELOG.rst
2022

2123
.. include:: ../src/ros2_medkit_msgs/CHANGELOG.rst

docs/introduction.rst

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -62,6 +62,9 @@ ros2_medkit consists of several ROS 2 packages:
6262
**ros2_medkit_log_bridge**
6363
Bridge node that promotes ROS 2 ``/rosout`` log entries to fault manager faults.
6464

65+
**ros2_medkit_action_status_bridge**
66+
Bridge node that turns terminal ROS 2 action goal states (aborted) into fault manager faults.
67+
6568
**ros2_medkit_msgs**
6669
Message and service definitions for fault management.
6770

src/ros2_medkit_action_status_bridge/CHANGELOG.rst

Lines changed: 14 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,17 @@ Forthcoming
66
-----------
77
* Initial release: generic action-status bridge. Watches every
88
``/<action>/_action/status`` topic and turns ABORTED goals into FaultManager
9-
faults (``<PREFIX>_<ACTION>_ABORTED``), heals on SUCCEEDED, with per-goal
10-
dedup. Captures the terminal action verdict that the diagnostic and log
11-
bridges cannot see.
9+
faults (``<PREFIX>_<ACTION>_ABORTED``), heals on a non-failing terminal state.
10+
Captures the terminal action verdict that the diagnostic and log bridges
11+
cannot see.
12+
* Fault state is per-ACTION, derived from the whole ``GoalStatusArray`` on each
13+
message: a fault is raised only on the ``healthy -> failed`` transition and
14+
healed only on ``failed -> healthy``, so it is order-independent and resilient
15+
to dropped terminal messages. Per-goal dedup now only suppresses duplicate log
16+
lines.
17+
* ``canceled_is_fault`` emits a status-aware ``<PREFIX>_<ACTION>_CANCELED`` code;
18+
such a fault heals on the next non-failing terminal or when the canceled goal
19+
ages out of the retained status array.
20+
* Vanished actions (lifecycle deactivate, one-shot nodes) are pruned on rescan.
21+
* Parameters are range-checked/normalized at load; an incompatible action status
22+
QoS is now warned about instead of silently dropping faults.

src/ros2_medkit_action_status_bridge/CMakeLists.txt

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -86,6 +86,7 @@ ament_export_dependencies(rclcpp action_msgs ros2_medkit_msgs ros2_medkit_fault_
8686
if(BUILD_TESTING)
8787
find_package(ament_lint_auto REQUIRED)
8888
find_package(ament_cmake_gtest REQUIRED)
89+
find_package(launch_testing_ament_cmake REQUIRED)
8990

9091
set(ament_cmake_clang_format_CONFIG_FILE "${CMAKE_CURRENT_SOURCE_DIR}/../../.clang-format")
9192
list(APPEND AMENT_LINT_AUTO_EXCLUDE ament_cmake_uncrustify ament_cmake_cpplint ament_cmake_clang_tidy)
@@ -94,7 +95,7 @@ if(BUILD_TESTING)
9495
ros2_medkit_clang_tidy()
9596

9697
include(ROS2MedkitTestDomain)
97-
medkit_init_test_domains(START 90 END 99)
98+
medkit_init_test_domains(START 215 END 219)
9899

99100
ament_add_gtest(test_action_status_bridge test/test_action_status_bridge.cpp)
100101
target_link_libraries(test_action_status_bridge action_status_bridge_lib)
@@ -106,6 +107,16 @@ if(BUILD_TESTING)
106107
target_link_options(test_action_status_bridge PRIVATE --coverage)
107108
endif()
108109

110+
# Integration test (launch_testing): fault_manager + bridge + synthetic status
111+
install(DIRECTORY test
112+
DESTINATION share/${PROJECT_NAME}
113+
)
114+
add_launch_test(
115+
test/test_integration.test.py
116+
TARGET test_integration
117+
TIMEOUT 120
118+
)
119+
109120
ros2_medkit_relax_vendor_warnings()
110121
endif()
111122

src/ros2_medkit_action_status_bridge/README.md

Lines changed: 59 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -21,15 +21,33 @@ generically, by observing the goal-status topic.
2121
## What it does
2222

2323
- Discovers every action on the graph by scanning for `*/_action/status`
24-
topics (re-scanned on a timer to catch actions that appear later).
24+
topics (re-scanned on a timer to catch actions that appear later, and to drop
25+
actions that vanish, e.g. a Nav2 lifecycle deactivate or a one-shot node).
2526
- `ABORTED (6)` -> fault (`SEVERITY_ERROR` by default).
2627
- `CANCELED (5)` -> fault only if `canceled_is_fault` (off by default; cancel is
27-
usually intentional).
28-
- `SUCCEEDED (4)` -> `PASSED` to heal the action's ABORTED code (if enabled).
29-
- `fault_code` is `<PREFIX>_<ACTION>_ABORTED`, e.g.
30-
`ACTION_NAVIGATE_TO_POSE_ABORTED`. `source_id` is the action name.
31-
- Per-goal dedup keyed on `goal_id` so a latched terminal status is reported
32-
once, not on every status publication.
28+
usually intentional). When enabled it emits a `_CANCELED` code.
29+
- `SUCCEEDED (4)` -> `PASSED` to heal the action's fault code (if enabled).
30+
- `fault_code` is `<PREFIX>_<ACTION>_ABORTED` (or `_CANCELED`), e.g.
31+
`ACTION_NAVIGATE_TO_POSE_ABORTED`. `source_id` is the action **server's node
32+
FQN** (resolved from the status topic's publisher, e.g. `/bt_navigator`), so
33+
the fault associates with that node's SOVD entity; it falls back to the action
34+
name only if no publisher is visible.
35+
36+
## Per-action state, not per-goal
37+
38+
The fault is a property of the **action**, not of an individual goal. On every
39+
status message the whole `GoalStatusArray` is scanned for the net state:
40+
41+
- if **any** goal is `ABORTED` (or `CANCELED` when `canceled_is_fault`), the
42+
action is **failed** (order of goals in the array does not matter);
43+
- otherwise, if the array has terminal goals and none are failing, the action is
44+
**healthy**.
45+
46+
A fault is raised only on the `healthy -> failed` transition and healed only on
47+
`failed -> healthy`. This means one failed goal cannot heal an action while
48+
another goal is still failed, and a dropped terminal message cannot leave a
49+
fault stuck. Per-goal dedup (keyed on `goal_id:status`) only suppresses
50+
duplicate log lines; it never gates the transition.
3351

3452
## Scope: the terminal verdict, not the reason
3553

@@ -57,4 +75,37 @@ ros2 launch ros2_medkit_action_status_bridge action_status_bridge.launch.py
5775
| `code_prefix` | `ACTION` | prefix for generated codes |
5876
| `exclude_actions` | `[]` | action-name substrings to skip |
5977
| `include_only_actions` | `[]` | if set, only watch these |
60-
| `dedup_capacity` | `4096` | remembered goal/status keys |
78+
| `dedup_capacity` | `4096` | remembered goal/status log keys |
79+
80+
`exclude_actions` / `include_only_actions` match as **unanchored substrings** of
81+
the fully-qualified action name (e.g. `nav` matches `/navigate_to_pose`). Use a
82+
longer fragment (or the full name) for exact targeting.
83+
84+
`aborted_severity`, `code_prefix` and `dedup_capacity` are range-checked and
85+
normalized at load (out-of-range severity clamps to ERROR, a non-snake-case
86+
`code_prefix` is upper-snake-cased, a non-positive `dedup_capacity` falls back to
87+
the default), with a warning.
88+
89+
## Limitations
90+
91+
- **Flapping**: a retry loop that issues a fresh `goal_id` each attempt does not
92+
re-raise while the action stays failed (only net-state changes transition),
93+
but a true flap (fail -> succeed -> fail) does produce one raise + heal per
94+
cycle. There is no per-code rate throttle in this bridge; lower noise via the
95+
FaultReporter local filter if needed.
96+
- **No per-code throttle**: every action-level transition is forwarded.
97+
- **CANCELED heal semantics**: with `canceled_is_fault`, a canceled action heals
98+
on the next non-failing terminal (a later SUCCEEDED), or when the canceled goal
99+
ages out of the action server's retained status array and a non-failed terminal
100+
remains. It is not healed by an explicit "uncancel" because none exists.
101+
- **QoS**: the bridge requests the standard action status QoS (reliable +
102+
transient_local). A non-standard server (volatile/best-effort) is logged with
103+
an incompatible-QoS warning rather than silently yielding zero faults.
104+
- **Vanish while failed**: if an action that is currently failed disappears from
105+
the graph (lifecycle deactivate, one-shot node), the bridge heals its fault
106+
before forgetting it (when `heal_on_succeeded` is set), so it is not left stuck
107+
active. With healing disabled the fault persists by design.
108+
- **Healing threshold**: the bridge emits one `PASSED` per recovery (a discrete
109+
event, not a stream), so a fault reaches `HEALED` only when the FaultManager's
110+
`healing_threshold` is `0`. With a higher threshold the fault stays `CONFIRMED`
111+
after the action recovers.

src/ros2_medkit_action_status_bridge/config/action_status_bridge.yaml

Lines changed: 8 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -2,18 +2,19 @@ action_status_bridge:
22
ros__parameters:
33
# Severity for an ABORTED goal (0=INFO 1=WARN 2=ERROR 3=CRITICAL).
44
aborted_severity: 2
5-
# Treat a CANCELED goal as a fault. Off by default (cancel is usually
6-
# an intentional operator/client action, not a failure).
5+
# Treat a CANCELED goal as a fault (emits a _CANCELED code). Off by default
6+
# (cancel is usually an intentional operator/client action, not a failure).
77
canceled_is_fault: false
8-
# On a SUCCEEDED goal, send PASSED to heal the action's ABORTED fault code.
8+
# On a non-failing terminal state, send PASSED to heal the action's fault.
99
heal_on_succeeded: true
10-
# How often to rescan the graph for new action status topics.
10+
# How often to rescan the graph for new (and vanished) action status topics.
1111
rescan_period_sec: 2.0
12-
# Prefix for generated fault codes (<PREFIX>_<ACTION>_ABORTED).
12+
# Prefix for generated fault codes (<PREFIX>_<ACTION>_ABORTED/_CANCELED).
13+
# Normalized to UPPER_SNAKE at load.
1314
code_prefix: "ACTION"
14-
# Action-name substrings to skip.
15+
# Action-name substrings to skip (unanchored substring match).
1516
exclude_actions: []
1617
# If non-empty, ONLY watch actions matching these substrings.
1718
include_only_actions: []
18-
# Max remembered (goal_id:status) keys for dedup.
19+
# Max remembered (goal_id:status) keys for log dedup.
1920
dedup_capacity: 4096

src/ros2_medkit_action_status_bridge/include/ros2_medkit_action_status_bridge/action_status_bridge_node.hpp

Lines changed: 76 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,7 @@
1414

1515
#pragma once
1616

17+
#include <array>
1718
#include <cstdint>
1819
#include <deque>
1920
#include <map>
@@ -43,37 +44,71 @@ namespace ros2_medkit_action_status_bridge {
4344
/// Status mapping (action_msgs/msg/GoalStatus):
4445
/// - ABORTED (6) -> fault (severity configurable, default ERROR)
4546
/// - CANCELED (5) -> fault only if canceled_is_fault (usually intentional)
46-
/// - SUCCEEDED (4)-> PASSED (heals the per-action ABORTED code) if enabled
47+
/// - SUCCEEDED (4)-> PASSED (heals the per-action fault code) if enabled
48+
///
49+
/// Fault state is per-ACTION, not per-goal: every message is scanned for the net
50+
/// state of the whole GoalStatusArray and a fault is raised/healed only on the
51+
/// action-level transition. See `derive_state`.
4752
class ActionStatusBridgeNode : public rclcpp::Node {
4853
public:
4954
explicit ActionStatusBridgeNode(const rclcpp::NodeOptions & options = rclcpp::NodeOptions());
5055

56+
/// Net fault state of an action, derived from a whole GoalStatusArray.
57+
/// - kUnknown: no terminal goal in the array yet (no transition)
58+
/// - kHealthy: array has terminal goals and none of them are failing
59+
/// - kFailed: at least one goal is failing (ABORTED, or CANCELED when
60+
/// canceled_is_fault)
61+
enum class ActionState { kUnknown, kHealthy, kFailed };
62+
5163
/// Derive the action name from a `/<action>/_action/status` topic name.
5264
/// Returns empty when the topic is not an action status topic.
5365
static std::string action_name_from_status_topic(const std::string & topic);
5466

55-
/// Build the ABORTED fault code for an action name.
56-
/// Format: <PREFIX>_<ACTION>_ABORTED, charset/length per medkit rules.
57-
std::string aborted_fault_code(const std::string & action_name) const;
67+
/// Build the fault code for an action name and terminal status.
68+
/// Format: <PREFIX>_<ACTION>_<ABORTED|CANCELED>, charset/length per medkit
69+
/// rules. `canceled` selects the CANCELED suffix.
70+
std::string fault_code_for(const std::string & action_name, bool canceled) const;
5871

5972
/// Lowercase hex of a 16-byte goal UUID, for dedup keys and short display.
6073
static std::string uuid_to_hex(const std::array<uint8_t, 16> & uuid);
6174

75+
/// Scan a whole GoalStatusArray and return the action-level net state.
76+
/// `canceled_is_fault` decides whether CANCELED counts as failing. Order of
77+
/// the goals in the array does not affect the result (any failing goal wins).
78+
static ActionState derive_state(const action_msgs::msg::GoalStatusArray & msg, bool canceled_is_fault);
79+
80+
/// Update per-action state from a message and act on the transition only.
81+
/// Returns the state that was reported on (kFailed on raise, kHealthy on
82+
/// heal) or kUnknown when nothing was reported. Side-effect free w.r.t.
83+
/// reporting when `reporter` is null (test seam): the transition decision and
84+
/// stored per-action state still update, so tests assert on the return value.
85+
ActionState apply_message(const std::string & action_name, const action_msgs::msg::GoalStatusArray & msg,
86+
ros2_medkit_fault_reporter::FaultReporter * reporter);
87+
6288
private:
6389
void rescan_actions();
64-
void status_callback(const std::string & action_name,
65-
const action_msgs::msg::GoalStatusArray::ConstSharedPtr & msg);
90+
void status_callback(const std::string & action_name, const action_msgs::msg::GoalStatusArray::ConstSharedPtr & msg);
6691

6792
ros2_medkit_fault_reporter::FaultReporter * reporter_for(const std::string & action_name);
6893

69-
/// Returns true if this (goal, status) pair was not handled before, marking
70-
/// it handled. Bounded to avoid unbounded growth.
71-
bool mark_handled(const std::string & goal_status_key);
94+
/// Resolve the action server's node FQN from its status-topic publisher, for
95+
/// use as the fault source_id so faults associate with the gateway's SOVD
96+
/// entity. Falls back to the action name if no publisher is visible.
97+
std::string server_fqn_for_action(const std::string & action_name);
98+
99+
/// Returns true if this (goal, status) pair was not logged before, marking it
100+
/// logged. Bounded to avoid unbounded growth. Suppresses duplicate LOG lines
101+
/// only; never gates the action-level state transition.
102+
bool mark_logged(const std::string & goal_status_key);
72103

73104
bool action_is_eligible(const std::string & action_name) const;
74105

75106
void load_parameters();
76107

108+
/// Drop subscriptions, reporters and per-action state for actions whose status
109+
/// topic has vanished from `present_topics`.
110+
void prune_vanished(const std::map<std::string, std::string> & present_topics);
111+
77112
static std::string to_upper_snake(const std::string & in, size_t max_len);
78113

79114
rclcpp::TimerBase::SharedPtr rescan_timer_;
@@ -82,11 +117,15 @@ class ActionStatusBridgeNode : public rclcpp::Node {
82117
std::map<std::string, std::unique_ptr<ros2_medkit_fault_reporter::FaultReporter>> reporters_;
83118
std::mutex reporters_mutex_;
84119

85-
// Bounded dedup of handled (goal_id:status) keys.
86-
std::unordered_set<std::string> handled_;
87-
std::deque<std::string> handled_order_;
88-
std::mutex handled_mutex_;
89-
size_t handled_capacity_;
120+
// Last reported action-level state, keyed by action name. Drives transitions.
121+
std::map<std::string, ActionState> last_reported_state_;
122+
std::mutex state_mutex_;
123+
124+
// Bounded dedup of logged (goal_id:status) keys (LOG suppression only).
125+
std::unordered_set<std::string> logged_;
126+
std::deque<std::string> logged_order_;
127+
std::mutex logged_mutex_;
128+
size_t logged_capacity_;
90129

91130
// Configuration
92131
uint8_t aborted_severity_;
@@ -96,6 +135,29 @@ class ActionStatusBridgeNode : public rclcpp::Node {
96135
std::string code_prefix_;
97136
std::vector<std::string> exclude_actions_;
98137
std::vector<std::string> include_only_actions_;
138+
139+
friend class ActionStatusBridgeTestAccess;
140+
};
141+
142+
/// Test-only accessor for the bridge's private maps and prune path. Lets unit
143+
/// tests exercise rescan add+prune without a live action graph.
144+
class ActionStatusBridgeTestAccess {
145+
public:
146+
explicit ActionStatusBridgeTestAccess(ActionStatusBridgeNode * node) : node_(node) {
147+
}
148+
149+
/// Seed a watched action (subscription placeholder + per-action state) as if
150+
/// rescan had discovered its `/<action>/_action/status` topic.
151+
void add_watched(const std::string & action_name);
152+
153+
/// Run the prune pass against a set of still-present action names.
154+
void prune_to(const std::vector<std::string> & present_action_names);
155+
156+
bool is_watched(const std::string & action_name) const;
157+
bool has_state(const std::string & action_name) const;
158+
159+
private:
160+
ActionStatusBridgeNode * node_;
99161
};
100162

101163
} // namespace ros2_medkit_action_status_bridge

0 commit comments

Comments
 (0)