Skip to content

Commit 8569f21

Browse files
mfaferek93bburda
authored andcommitted
feat(fault_manager): entity-scoped rosbag capture by default
Default "entity" mode buffers broadly but writes only the faulting source node's topics (+ /tf) on confirm, resolved from the fault's reporting source and intersected with the subscribed set. Broad modes (all/auto/entity) keep re-resolving so topics whose publishers appear after startup are still captured. Adds faithful per-topic QoS matching, a ring-buffer RAM cap, and segment-anchored sensor-topic auto-exclude. Entity resolution is guarded so a sqlite get_fault() throw on the capture thread cannot crash the node. Closes #426
1 parent cc0e22d commit 8569f21

12 files changed

Lines changed: 659 additions & 30 deletions

File tree

docs/config/fault-manager.rst

Lines changed: 19 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -217,12 +217,15 @@ Capture continuous rosbag recordings around fault events.
217217
enabled: false # Enable rosbag recording
218218
duration_sec: 5.0 # Pre-fault buffer duration
219219
duration_after_sec: 1.0 # Post-fault recording duration
220-
topics: "config" # Topic selection: "config", "all", or "none"
220+
topics: "entity" # Topic selection: "entity" (default), "config", "all", "explicit"
221221
include_topics: [] # Additional topics to include
222222
exclude_topics: [] # Topics to exclude
223+
exclude_sensor_topics: true # Auto-exclude image/points/depth/compressed in broad modes
223224
lazy_start: false # Start recording on first fault
224225
format: "sqlite3" # Storage format
226+
qos_match: true # Match each topic's publisher QoS
225227
storage_path: "" # Custom storage path
228+
max_buffer_mb: 256 # Ring-buffer RAM cap
226229
max_bag_size_mb: 50 # Max size per bag file
227230
max_total_storage_mb: 500 # Max total storage
228231
auto_cleanup: true # Auto-delete old bags
@@ -244,8 +247,21 @@ Capture continuous rosbag recordings around fault events.
244247
- ``1.0``
245248
- How long to record after fault.
246249
* - ``rosbag.topics``
247-
- ``config``
248-
- Topic selection mode: ``config`` (per-fault), ``all``, or ``none``.
250+
- ``entity``
251+
- Topic selection mode: ``entity`` (default; write only the faulting node's
252+
topics + ``/tf``), ``config`` (per-fault), ``all``, or ``explicit``.
253+
* - ``rosbag.exclude_sensor_topics``
254+
- ``true``
255+
- In broad modes (``all``/``entity``), auto-exclude high-bandwidth sensor
256+
topics (image/points/depth/compressed) to bound memory. Excluded topics are
257+
dropped silently; ``include_topics`` re-adds any you need.
258+
* - ``rosbag.qos_match``
259+
- ``true``
260+
- Subscribe with each topic's publisher-offered QoS for faithful capture
261+
instead of forcing best-effort.
262+
* - ``rosbag.max_buffer_mb``
263+
- ``256``
264+
- Ring-buffer RAM cap; oldest buffered messages drop past it.
249265
* - ``rosbag.lazy_start``
250266
- ``false``
251267
- Start recording only when first fault occurs.

docs/tutorials/snapshots.rst

Lines changed: 29 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -332,12 +332,27 @@ Rosbag Configuration Options
332332
- Post-fault recording duration. After a fault is confirmed, recording
333333
continues for this many seconds to capture immediate system response.
334334
* - ``snapshots.rosbag.topics``
335-
- ``"config"``
335+
- ``"entity"``
336336
- Topic selection mode:
337337

338+
- ``"entity"`` - Default. Subscribe broadly for pre-roll, but on fault
339+
confirmation write only the faulting source node's topics (resolved from
340+
the fault's reporting source) plus ``/tf`` and ``/tf_static``. Falls back
341+
to the full buffer if the source is not a live node.
338342
- ``"config"`` - Use same topics as JSON snapshots (from config file)
339343
- ``"all"`` or ``"auto"`` - Auto-discover and record all available topics
340344
- ``"explicit"`` - Use only topics from ``include_topics`` list
345+
346+
.. note::
347+
348+
The default changed from ``"config"`` to ``"entity"`` so an enabled black
349+
box produces a useful, fault-scoped bag with no per-topic configuration.
350+
Set ``topics: "config"`` to restore the previous behavior.
351+
352+
Faults reported by the ``diagnostic_bridge`` carry the bridge's own node
353+
as the source, so in ``entity`` mode the bag is scoped to the bridge
354+
(e.g. ``/diagnostics``) rather than the node that actually failed. Use a
355+
manual mode if you need a different scope for bridge-sourced faults.
341356
* - ``snapshots.rosbag.include_topics``
342357
- ``[]``
343358
- Explicit list of topics to record (only used when ``topics: "explicit"``).
@@ -346,6 +361,19 @@ Rosbag Configuration Options
346361
- ``[]``
347362
- Topics to exclude from recording (applies to all modes). Useful for
348363
filtering high-bandwidth topics like camera images.
364+
* - ``snapshots.rosbag.exclude_sensor_topics``
365+
- ``true``
366+
- In broad modes (``all``/``entity``), auto-exclude high-bandwidth sensor
367+
topics (image/points/depth/compressed) to bound memory. Dropped silently;
368+
``include_topics`` re-adds any you specifically need.
369+
* - ``snapshots.rosbag.qos_match``
370+
- ``true``
371+
- Subscribe with each topic's publisher-offered QoS (reliable/transient-local
372+
where offered) for faithful capture, instead of forcing best-effort.
373+
* - ``snapshots.rosbag.max_buffer_mb``
374+
- ``256``
375+
- In-memory ring-buffer cap; oldest buffered messages drop once exceeded, so a
376+
broad subscribe set cannot grow memory without bound.
349377
* - ``snapshots.rosbag.format``
350378
- ``"sqlite3"``
351379
- Bag storage format: ``"sqlite3"`` (default, widely compatible) or

src/ros2_medkit_fault_manager/CMakeLists.txt

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -217,6 +217,13 @@ if(BUILD_TESTING)
217217
TIMEOUT 60
218218
)
219219
set_tests_properties(test_entity_thresholds_integration PROPERTIES LABELS "integration")
220+
221+
add_launch_test(
222+
test/test_rosbag_entity_scope.test.py
223+
TARGET test_rosbag_entity_scope
224+
TIMEOUT 120
225+
)
226+
set_tests_properties(test_rosbag_entity_scope PROPERTIES LABELS "integration")
220227
ros2_medkit_relax_vendor_warnings()
221228
endif()
222229

src/ros2_medkit_fault_manager/config/snapshots.yaml

Lines changed: 23 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -86,13 +86,18 @@ rosbag:
8686
# Continue recording after fault is confirmed to capture aftermath
8787
duration_after_sec: 1.0
8888

89-
# Topic selection mode (default: "config")
89+
# Topic selection mode (default: "entity")
9090
# Options:
91+
# "entity" - DEFAULT. Subscribe broadly for pre-roll, but on fault confirmation
92+
# write only the faulting source node's topics (+ /tf, /tf_static),
93+
# resolved from the fault's reporting source. Useful black box with
94+
# zero per-topic config. Falls back to the full buffer if the source
95+
# cannot be resolved to a live node.
9196
# "config" - Use same topics as JSON snapshots (from this config file)
92-
# "all" - Record all available topics (WARNING: high memory usage!)
97+
# "all" - Record all available topics
9398
# "auto" - Alias for "all" (auto-discover all topics)
9499
# "explicit" - Use only topics from include_topics list below
95-
topics: "config"
100+
topics: "entity"
96101

97102
# Explicit topic list (only used when topics: "explicit")
98103
# include_topics:
@@ -107,6 +112,11 @@ rosbag:
107112
# - /camera/depth/image_raw
108113
# - /pointcloud
109114

115+
# Auto-exclude high-bandwidth sensor topics (image/points/depth/compressed) in
116+
# broad modes ("all"/"auto"/"entity") to bound memory (default: true).
117+
# Excluded topics are dropped silently; list them in include_topics to re-add.
118+
exclude_sensor_topics: true
119+
110120
# Lazy start mode (default: false)
111121
# When true, ring buffer only starts when fault enters PREFAILED state
112122
# Saves resources but may miss early context if fault confirms quickly
@@ -134,6 +144,16 @@ rosbag:
134144
# Oldest bags are deleted when this limit is exceeded
135145
max_total_storage_mb: 500
136146

147+
# Maximum in-memory ring buffer size in MB (default: 256)
148+
# Oldest buffered messages are dropped once the buffer exceeds this, so a broad
149+
# subscribe set on a busy robot cannot grow memory without bound.
150+
max_buffer_mb: 256
151+
152+
# Match each topic's publisher-offered QoS for faithful capture (default: true)
153+
# Captures reliable / transient-local topics without dropping; set false to force
154+
# best-effort SensorDataQoS on every subscription.
155+
qos_match: true
156+
137157
# Auto-cleanup bag files when fault is cleared (default: true)
138158
# When true, bag files are deleted when ClearFault is called
139159
# When false, bag files persist until manually deleted or storage limit hit

src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/rosbag_capture.hpp

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -109,6 +109,10 @@ class RosbagCapture {
109109
return config_.enabled;
110110
}
111111

112+
/// Whether a topic is a high-bandwidth sensor stream (image/points/depth/compressed),
113+
/// auto-excluded from broad-mode capture. Static + public so it is directly testable.
114+
static bool is_high_bandwidth_topic(const std::string & topic);
115+
112116
private:
113117
/// Initialize subscriptions for configured topics
114118
void init_subscriptions();
@@ -123,6 +127,17 @@ class RosbagCapture {
123127
/// Resolve which topics to record based on config
124128
std::vector<std::string> resolve_topics() const;
125129

130+
/// Resolve the publisher-offered QoS for a topic so capture is faithful
131+
/// (falls back to SensorDataQoS when no publisher is known or qos_match is off)
132+
rclcpp::QoS resolve_topic_qos(const std::string & topic) const;
133+
134+
/// In "entity" mode, compute the set of topics to write for a confirmed fault
135+
/// (the faulting source node's pub/sub topics + /tf). Empty set = write all.
136+
void resolve_entity_topics(const std::string & fault_code);
137+
138+
/// Whether a topic should be written to the bag given the active entity filter
139+
bool should_capture_topic(const std::string & topic) const;
140+
126141
/// Get message type for a topic
127142
std::string get_topic_type(const std::string & topic) const;
128143

@@ -167,6 +182,13 @@ class RosbagCapture {
167182
/// Ring buffer for messages
168183
mutable std::mutex buffer_mutex_;
169184
std::deque<BufferedMessage> message_buffer_;
185+
/// Running byte size of message_buffer_ (guarded by buffer_mutex_), for the RAM cap
186+
size_t buffer_bytes_{0};
187+
188+
/// Topics to write for the in-flight capture in "entity" mode (guarded below).
189+
/// Empty = no entity filter, write everything buffered (manual modes / fallback).
190+
mutable std::mutex capture_topics_mutex_;
191+
std::set<std::string> active_capture_topics_;
170192

171193
/// Subscriptions (kept alive for continuous recording)
172194
std::vector<rclcpp::GenericSubscription::SharedPtr> subscriptions_;
@@ -197,6 +219,16 @@ class RosbagCapture {
197219
rclcpp::TimerBase::SharedPtr discovery_retry_timer_;
198220
std::vector<std::string> pending_topics_;
199221
int discovery_retry_count_{0};
222+
223+
/// Topics already subscribed (guards against duplicate subscriptions when the
224+
/// broad-mode discovery timer re-resolves the topic set).
225+
std::set<std::string> subscribed_topics_;
226+
227+
/// True in broad modes ("all"/"auto"/"entity"): the discovery timer keeps
228+
/// re-resolving for the capture's lifetime so topics whose publishers appear
229+
/// after startup are still subscribed (dynamic capture). False in fixed modes
230+
/// (config/explicit/list), where only the initial set is retried.
231+
bool dynamic_discovery_{false};
200232
};
201233

202234
} // namespace ros2_medkit_fault_manager

src/ros2_medkit_fault_manager/include/ros2_medkit_fault_manager/snapshot_capture.hpp

Lines changed: 20 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -38,15 +38,26 @@ struct RosbagConfig {
3838
/// Duration in seconds to continue recording after fault confirmation
3939
double duration_after_sec{1.0};
4040

41-
/// Topic selection mode: "config" (reuse JSON config), "all", or comma-separated list
42-
std::string topics{"config"};
41+
/// Topic selection mode (default "entity"):
42+
/// "entity" - subscribe broadly for pre-roll, but on confirm write only the
43+
/// faulting entity's topics (resolved from the fault's reporting
44+
/// source node) plus always-on context (/tf, /tf_static)
45+
/// "config" - reuse JSON snapshot config topics
46+
/// "all"/"auto" - every discovered topic
47+
/// "explicit" - only include_topics
48+
/// "<a,b,c>" - comma-separated topic list
49+
std::string topics{"entity"};
4350

4451
/// Additional topics to include (added to resolved list)
4552
std::vector<std::string> include_topics;
4653

4754
/// Topics to exclude from recording
4855
std::vector<std::string> exclude_topics;
4956

57+
/// In broad modes ("all"/"auto"/"entity"), skip high-bandwidth sensor topics
58+
/// (image/points/depth/compressed) to bound memory; include_topics re-adds them.
59+
bool exclude_sensor_topics{true};
60+
5061
/// If true, start ring buffer only when fault enters PREFAILED state
5162
/// If false (default), ring buffer runs continuously from startup
5263
bool lazy_start{false};
@@ -64,6 +75,13 @@ struct RosbagConfig {
6475
/// Maximum total storage for all bag files in MB
6576
size_t max_total_storage_mb{500};
6677

78+
/// Cap on the in-memory ring buffer in MB; oldest messages drop past it
79+
size_t max_buffer_mb{256};
80+
81+
/// Subscribe with each topic's publisher-offered QoS for faithful capture
82+
/// (reliable/transient-local where offered) instead of forcing best-effort
83+
bool qos_match{true};
84+
6785
/// If true, delete bag file when fault is cleared
6886
bool auto_cleanup{true};
6987
};

src/ros2_medkit_fault_manager/package.xml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -27,8 +27,10 @@
2727
<test_depend>ament_cmake_clang_format</test_depend>
2828
<test_depend>ament_cmake_clang_tidy</test_depend>
2929
<test_depend>ament_cmake_gtest</test_depend>
30+
<test_depend>launch_ros</test_depend>
3031
<test_depend>launch_testing_ament_cmake</test_depend>
3132
<test_depend>launch_testing_ros</test_depend>
33+
<test_depend>rosbag2_py</test_depend>
3234
<test_depend>sensor_msgs</test_depend>
3335
<test_depend>std_msgs</test_depend>
3436

src/ros2_medkit_fault_manager/src/fault_manager_node.cpp

Lines changed: 15 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -744,11 +744,13 @@ SnapshotConfig FaultManagerNode::create_snapshot_config() {
744744
config.rosbag.duration_after_sec = 0.0;
745745
}
746746

747-
config.rosbag.topics = declare_parameter<std::string>("snapshots.rosbag.topics", "config");
747+
config.rosbag.topics = declare_parameter<std::string>("snapshots.rosbag.topics", "entity");
748748
config.rosbag.include_topics =
749749
declare_parameter<std::vector<std::string>>("snapshots.rosbag.include_topics", std::vector<std::string>{});
750750
config.rosbag.exclude_topics =
751751
declare_parameter<std::vector<std::string>>("snapshots.rosbag.exclude_topics", std::vector<std::string>{});
752+
config.rosbag.exclude_sensor_topics = declare_parameter<bool>("snapshots.rosbag.exclude_sensor_topics", true);
753+
config.rosbag.qos_match = declare_parameter<bool>("snapshots.rosbag.qos_match", true);
752754

753755
config.rosbag.lazy_start = declare_parameter<bool>("snapshots.rosbag.lazy_start", false);
754756
config.rosbag.format = declare_parameter<std::string>("snapshots.rosbag.format", "sqlite3");
@@ -768,14 +770,22 @@ SnapshotConfig FaultManagerNode::create_snapshot_config() {
768770
}
769771
config.rosbag.max_total_storage_mb = static_cast<size_t>(max_total_storage);
770772

773+
int64_t max_buffer = declare_parameter<int64_t>("snapshots.rosbag.max_buffer_mb", 256);
774+
if (max_buffer <= 0) {
775+
RCLCPP_WARN(get_logger(), "snapshots.rosbag.max_buffer_mb must be positive. Using 256MB");
776+
max_buffer = 256;
777+
}
778+
config.rosbag.max_buffer_mb = static_cast<size_t>(max_buffer);
779+
771780
config.rosbag.auto_cleanup = declare_parameter<bool>("snapshots.rosbag.auto_cleanup", true);
772781

773782
RCLCPP_INFO(get_logger(),
774-
"Rosbag capture enabled (duration=%.1fs+%.1fs, topics=%s, lazy=%s, format=%s, "
775-
"max_bag=%zuMB, max_total=%zuMB)",
783+
"Rosbag capture enabled (duration=%.1fs+%.1fs, topics=%s, qos_match=%s, lazy=%s, format=%s, "
784+
"max_buffer=%zuMB, max_bag=%zuMB, max_total=%zuMB)",
776785
config.rosbag.duration_sec, config.rosbag.duration_after_sec, config.rosbag.topics.c_str(),
777-
config.rosbag.lazy_start ? "true" : "false", config.rosbag.format.c_str(),
778-
config.rosbag.max_bag_size_mb, config.rosbag.max_total_storage_mb);
786+
config.rosbag.qos_match ? "true" : "false", config.rosbag.lazy_start ? "true" : "false",
787+
config.rosbag.format.c_str(), config.rosbag.max_buffer_mb, config.rosbag.max_bag_size_mb,
788+
config.rosbag.max_total_storage_mb);
779789
}
780790

781791
if (config.enabled) {

0 commit comments

Comments
 (0)