Skip to content

Commit 547c6bb

Browse files
committed
fix(log_bridge): address PR review feedback
Stable FNV-1a fault_code, uniform node_source_id for code/eligibility/self-exclusion, sanitized code_prefix, LRU-bounded reporters, per-code forward cooldown, severity_floor validation, and a null guard in map_level_to_severity. Unique test domain range, dropped unused test_depends, added unit coverage; docs corrected (WARN two-stage debounce, namespaced-node limit).
1 parent 260b374 commit 547c6bb

7 files changed

Lines changed: 410 additions & 85 deletions

File tree

src/ros2_medkit_log_bridge/CMakeLists.txt

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -96,7 +96,7 @@ if(BUILD_TESTING)
9696
ros2_medkit_clang_tidy()
9797

9898
include(ROS2MedkitTestDomain)
99-
medkit_init_test_domains(START 90 END 99)
99+
medkit_init_test_domains(START 130 END 134)
100100

101101
ament_add_gtest(test_log_bridge test/test_log_bridge.cpp)
102102
target_link_libraries(test_log_bridge log_bridge_lib)

src/ros2_medkit_log_bridge/README.md

Lines changed: 54 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -21,22 +21,49 @@ above a severity floor to the FaultManager:
2121
| ERROR (40) | `SEVERITY_ERROR` |
2222
| FATAL (50) | `SEVERITY_CRITICAL` |
2323

24-
- `source_id` of each fault is the originating node (the `Log.name` field), via
25-
a per-node `FaultReporter`, so faults attribute correctly and each node gets
26-
its own local debounce.
27-
- `fault_code` is auto-generated as `<PREFIX>_<NODE>_<HASH>`, where the hash is
28-
taken over a normalized message template (numbers / hex / paths stripped) so
29-
the same logical message maps to the same code across occurrences.
30-
31-
## WARN as PREFAILED
32-
33-
The bridge forwards WARN immediately. Whether a WARN shows as `PREFAILED`
34-
(suspected, kept out of the confirmed-fault list) or `CONFIRMED` is decided by
35-
the FaultManager's `confirmation_threshold`, not the bridge. For the
36-
visible-but-quiet behaviour, launch the FaultManager with
37-
`confirmation_threshold:=-2` or lower (or an entity threshold for `LOG_*`
38-
codes). With the shipped default (`-1`), every WARN confirms on first
39-
occurrence.
24+
- `source_id` of each fault is the originating node's fully-qualified name. It
25+
is derived from `Log.name` by taking the first dotted segment (a `Log.name`
26+
may carry a sub-logger suffix, e.g. `controller_manager.resource_manager`, and
27+
node names cannot contain `.`) and prefixing `/`, giving e.g.
28+
`/controller_manager`. The gateway discovers entities by node FQN, so this is
29+
the form that lets a fault (and its snapshots / rosbag) associate with the
30+
entity in the SOVD tree. Each node gets its own per-node `FaultReporter` and
31+
therefore its own client-side debounce.
32+
- `fault_code` is auto-generated as `<PREFIX>_<NODE>_<HASH>`. `<HASH>` is a fixed
33+
FNV-1a 32-bit digest (8 lowercase hex) of a normalized message template
34+
(numbers / hex / paths stripped, isolated single-letter tokens dropped) so the
35+
same logical message maps to the same code across occurrences. `<NODE>` is the
36+
upper-snake of `source_id`. The 8-hex hash is never truncated; if the 64-char
37+
cap is hit the node part is trimmed instead.
38+
39+
> Namespaced-node limitation: `Log.name` encodes a node's namespace with the same
40+
> `.` separator as a sub-logger suffix, so the two are indistinguishable from the
41+
> string alone. `source_id` takes the first dotted segment, which is right for a
42+
> non-namespaced node with a sub-logger but collapses a namespaced node
43+
> (`robot1.planner_server` -> `/robot1`) to its namespace, so same-named nodes in
44+
> different namespaces share one code. Multi-robot fleets typically isolate robots
45+
> by `ROS_DOMAIN_ID` (one gateway per robot, federated by peer aggregation), which
46+
> sidesteps this.
47+
48+
## Forwarding, the LocalFilter, and confirmation
49+
50+
Two independent debounces sit between a log line and a confirmed fault:
51+
52+
1. Per-node `FaultReporter` `LocalFilter` (client-side). WARN is held until
53+
`default_threshold` (3) occurrences within `default_window_sec` (10s).
54+
ERROR/FATAL have severity `>= bypass_severity` (2) and bypass the filter,
55+
forwarding immediately.
56+
2. Bridge per-fault_code `report_cooldown_sec` cooldown. Because ERROR/FATAL
57+
bypass the LocalFilter, a flood would otherwise hit the FaultManager on every
58+
line; the bridge forwards the first occurrence of a code immediately and
59+
suppresses the same code for `report_cooldown_sec` (default 5s, `0.0`
60+
disables).
61+
62+
Whether a forwarded fault then shows as `PREFAILED` (suspected) or `CONFIRMED`
63+
is a separate, gateway-side decision driven by the FaultManager's
64+
`confirmation_threshold` - not by this bridge and not by the client-side
65+
LocalFilter. For visible-but-quiet WARNs, launch the FaultManager with a low
66+
`confirmation_threshold` (or an entity threshold for `LOG_*` codes).
4067

4168
## Hard limitations (by construction)
4269

@@ -59,7 +86,14 @@ ros2 launch ros2_medkit_log_bridge log_bridge.launch.py
5986
| Param | Default | Meaning |
6087
|-------|---------|---------|
6188
| `rosout_topic` | `/rosout` | log topic to subscribe |
62-
| `severity_floor` | `30` (WARN) | minimum level promoted; raise to `40` on chatty / constrained targets |
63-
| `code_prefix` | `LOG` | prefix for generated fault codes |
64-
| `exclude_nodes` | `[]` | node-name substrings to skip |
65-
| `include_only_nodes` | `[]` | if set, only promote these nodes |
89+
| `severity_floor` | `30` (WARN) | minimum level promoted; raise to `40` on chatty / constrained targets. Clamped to `[0, 50]` at load (a value out of range is corrected with a warning) |
90+
| `code_prefix` | `LOG` | prefix for generated fault codes; normalized to `[A-Z0-9_]` at load |
91+
| `exclude_nodes` | `[]` | node-FQN substrings to skip |
92+
| `include_only_nodes` | `[]` | if set, only promote nodes whose FQN matches |
93+
| `max_tracked_nodes` | `512` | cap on per-node reporters; least-recently-used nodes evicted past this |
94+
| `report_cooldown_sec` | `5.0` | per-fault_code forward debounce; `0.0` disables |
95+
96+
`exclude_nodes` / `include_only_nodes` match as **unanchored substrings**
97+
against the node FQN: `planner` matches `/planner_server` and
98+
`/robot1/planner_server`. Use a longer, more specific substring (e.g.
99+
`/planner_server`) to avoid accidental matches.

src/ros2_medkit_log_bridge/config/log_bridge.yaml

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,14 @@ log_bridge:
88
severity_floor: 30
99
# Prefix for auto-generated fault codes (<PREFIX>_<NODE>_<HASH>).
1010
code_prefix: "LOG"
11-
# Originating-node name substrings to skip (e.g. noisy debug nodes).
11+
# Originating-node FQN substrings to skip (e.g. noisy debug nodes).
1212
exclude_nodes: []
1313
# If non-empty, ONLY promote logs from nodes matching these substrings.
1414
include_only_nodes: []
15+
# Cap on per-node FaultReporters; least-recently-used nodes are evicted
16+
# past this to bound memory under transient-node churn.
17+
max_tracked_nodes: 512
18+
# Per-fault_code forward debounce (seconds). First occurrence of a code is
19+
# forwarded immediately; the same code within the window is suppressed.
20+
# 0.0 disables. Tames ERROR/FATAL floods, which bypass the per-node filter.
21+
report_cooldown_sec: 5.0

src/ros2_medkit_log_bridge/include/ros2_medkit_log_bridge/log_bridge_node.hpp

Lines changed: 39 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -14,10 +14,11 @@
1414

1515
#pragma once
1616

17-
#include <map>
17+
#include <list>
1818
#include <memory>
1919
#include <mutex>
2020
#include <string>
21+
#include <unordered_map>
2122
#include <vector>
2223

2324
#include "rclcpp/rclcpp.hpp"
@@ -30,11 +31,11 @@ namespace ros2_medkit_log_bridge {
3031
///
3132
/// Subscribes to /rosout (rcl_interfaces/msg/Log) and forwards entries at or
3233
/// above a configurable severity floor to the FaultManager, attributing each
33-
/// fault to the originating node (the Log.name field) via a per-source
34+
/// fault to the originating node's fully-qualified name via a per-source
3435
/// FaultReporter. Drop-in compat adapter, same category as
3536
/// ros2_medkit_diagnostic_bridge: native FaultReporter instrumentation stays
3637
/// the canonical path; this bridge is the fallback for nodes that only log.
37-
/// Level mapping and the WARN-as-PREFAILED caveat are documented in README.md.
38+
/// Level mapping and the WARN/LocalFilter caveat are documented in README.md.
3839
///
3940
/// Hard limitation by construction: only sees rclcpp logs that reach /rosout
4041
/// from a still-alive node. Console-only loggers and crash-before-flush are out
@@ -47,19 +48,20 @@ class LogBridgeNode : public rclcpp::Node {
4748
/// Returns false when the level is below the floor / not promotable.
4849
static bool map_level_to_severity(uint8_t log_level, uint8_t severity_floor, uint8_t * severity_out);
4950

50-
/// Auto-generate a stable fault code from the originating node name and the
51+
/// Auto-generate a stable fault code from the originating node's FQN and the
5152
/// log message. Numbers/hex/paths in the message are normalized away so the
5253
/// same logical message maps to the same code across occurrences.
5354
/// Format: <PREFIX>_<NODE>_<HASH>, clamped to medkit's [A-Z0-9_] / 64-char rule.
54-
std::string generate_fault_code(const std::string & node_name, const std::string & message) const;
55+
std::string generate_fault_code(const std::string & source_id, const std::string & message) const;
5556

5657
/// Normalize a log message into a stable template (lowercased, digit/hex/path
57-
/// runs stripped, whitespace collapsed). Exposed for unit testing.
58+
/// runs stripped, whitespace collapsed, isolated single-letter tokens dropped).
59+
/// Exposed for unit testing.
5860
static std::string normalize_message(const std::string & message);
5961

6062
/// Whether a given originating node should be promoted, honouring the
6163
/// include/exclude lists. Exposed for unit testing.
62-
bool node_is_eligible(const std::string & node_name) const;
64+
bool node_is_eligible(const std::string & source_id) const;
6365

6466
/// Map an rcl_interfaces/msg/Log.name (a logger name, e.g. "bt_navigator" or
6567
/// "controller_manager.resource_manager") to the originating node's
@@ -69,29 +71,54 @@ class LogBridgeNode : public rclcpp::Node {
6971
/// the entity in the SOVD tree. Exposed for unit testing.
7072
static std::string node_source_id(const std::string & log_name);
7173

72-
private:
73-
void log_callback(const rcl_interfaces::msg::Log::ConstSharedPtr & msg);
74+
/// FNV-1a 32-bit hash, fixed spec, emitted as 8 lowercase hex chars. Exposed
75+
/// for unit testing so a known input asserts a known constant.
76+
static std::string fnv1a_hex(const std::string & in);
77+
78+
/// Whether a fault_code may be forwarded now under the per-code cooldown
79+
/// (first occurrence passes; same code within report_cooldown_sec is
80+
/// suppressed; 0.0 disables). Exposed for unit testing.
81+
bool cooldown_allows(const std::string & fault_code, rclcpp::Time now);
7482

7583
/// Fetch (or lazily create) the per-source FaultReporter for an originating
7684
/// node, so the fault's source_id is the node that logged, not the bridge.
77-
ros2_medkit_fault_reporter::FaultReporter * reporter_for(const std::string & node_name);
85+
/// Bounded by max_tracked_nodes_ with LRU eviction. Exposed for unit testing.
86+
ros2_medkit_fault_reporter::FaultReporter * reporter_for(const std::string & source_id);
87+
88+
/// Number of currently tracked per-node reporters. Exposed for unit testing.
89+
size_t tracked_reporter_count();
90+
91+
private:
92+
void log_callback(const rcl_interfaces::msg::Log::ConstSharedPtr & msg);
7893

7994
void load_parameters();
8095

8196
static std::string to_upper_snake(const std::string & in, size_t max_len);
8297

8398
rclcpp::Subscription<rcl_interfaces::msg::Log>::SharedPtr log_sub_;
8499

85-
// One FaultReporter per originating node (correct source_id + own LocalFilter).
86-
std::map<std::string, std::unique_ptr<ros2_medkit_fault_reporter::FaultReporter>> reporters_;
100+
// One FaultReporter per originating node (correct source_id + own LocalFilter),
101+
// LRU-ordered: front() is the least-recently-used entry.
102+
struct ReporterEntry {
103+
std::string source_id;
104+
std::unique_ptr<ros2_medkit_fault_reporter::FaultReporter> reporter;
105+
};
106+
std::list<ReporterEntry> reporters_lru_;
107+
std::unordered_map<std::string, std::list<ReporterEntry>::iterator> reporters_;
87108
std::mutex reporters_mutex_;
88109

110+
// Per-fault_code forward cooldown: last time a code was forwarded.
111+
std::unordered_map<std::string, rclcpp::Time> last_forward_;
112+
std::mutex cooldown_mutex_;
113+
89114
// Configuration
90115
std::string rosout_topic_;
91116
uint8_t severity_floor_;
92117
std::string code_prefix_;
93118
std::vector<std::string> exclude_nodes_;
94119
std::vector<std::string> include_only_nodes_;
120+
int max_tracked_nodes_;
121+
double report_cooldown_sec_;
95122
std::string own_node_name_;
96123
};
97124

src/ros2_medkit_log_bridge/package.xml

Lines changed: 0 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -21,9 +21,6 @@
2121
<test_depend>ament_cmake_clang_format</test_depend>
2222
<test_depend>ament_cmake_clang_tidy</test_depend>
2323
<test_depend>ament_cmake_gtest</test_depend>
24-
<test_depend>launch_testing_ament_cmake</test_depend>
25-
<test_depend>launch_testing_ros</test_depend>
26-
<test_depend>ros2_medkit_fault_manager</test_depend>
2724

2825
<export>
2926
<build_type>ament_cmake</build_type>

0 commit comments

Comments
 (0)