Skip to content

Commit dab79c9

Browse files
committed
feat(demos/ota): reactive fault narrative + reproducible artefact build
Round of work to make the OTA demo's fault story actually visible AND to lock the build/test path against regression. Reactive SCAN_PHANTOM_RETURN fault: - broken_lidar subscribes /cmd_vel (Twist + TwistStamped, since Nav2 Jazzy uses the stamped variant) and only reports the fault while the controller is actively driving. Empty dashboard at boot, lights up the moment the operator publishes /goal_pose, stays active through recovery behaviors. The phantom itself is now a 21-ray, 0.5 m wedge centered on index 180 - one ray at 1 m the planner just routes around because the costmap raytraces it away. - fixed_lidar fires EVENT_PASSED at 500 ms (4x the broken_lidar tick) so the FaultManager debounce counter overtakes whatever broken_lidar accumulated, no matter how long the operator was driving before the OTA. Stops being load-bearing once the fault flips to HEALED. - demo.launch.py sets fault_manager to in-memory storage (clean dashboard at every boot - SQLite was persisting last session's CONFIRMED entry) and turns on healing_enabled with threshold 2. Reproducible artefact build: - ota_update_server/Dockerfile is now multi-stage. Stage 1 (ros:jazzy) clones ros2_medkit at the same ref the gateway image pins, builds ros2_medkit_msgs, builds fixed_lidar + obstacle_classifier_v2 from the demo's ros2_packages/, and runs pack_artifact.py to produce tarballs + catalog.json. Stage 2 ships the slim FastAPI server + the tarballs via COPY --from. `docker compose build` is now the reproducible path - no "build artefacts on host first" step. - CI: dropped the separate "Build artifacts inside ros:jazzy" step; `docker compose up -d --build` does it atomically. - scripts/build_artifacts.sh stays as an opt-in dev convenience. It no longer silently bootstraps msgs - if the env doesn't have it on the prefix path it errors out with instructions to either source an overlay or use `docker compose build`. Less hidden state. Demo narrative regression test: - tests/smoke_test_demo_narrative.sh exercises the full beat: publish /goal_pose, assert SCAN_PHANTOM_RETURN appears (reactive proof - if the fault appears, broken_lidar's /cmd_vel subscription saw motion, so we don't need a brittle `ros2 topic echo` snapshot), trigger OTA prepare+execute, assert process flip, assert fault clears, assert /cmd_vel settles. CI gets a new ota-demo-narrative job that runs this in isolation from the API-only smoke check.
1 parent 08b0c21 commit dab79c9

11 files changed

Lines changed: 635 additions & 24 deletions

File tree

.github/workflows/ci.yml

Lines changed: 52 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -184,6 +184,55 @@ jobs:
184184
- name: Checkout repository
185185
uses: actions/checkout@v4
186186

187+
- name: Build and start OTA demo
188+
working-directory: demos/ota_nav2_sensor_fix
189+
run: docker compose up -d --build
190+
191+
- name: Run smoke tests
192+
run: ./tests/smoke_test_ota.sh
193+
194+
- name: Show gateway logs on failure
195+
if: failure()
196+
working-directory: demos/ota_nav2_sensor_fix
197+
run: docker compose logs gateway --tail=200
198+
199+
- name: Show update server logs on failure
200+
if: failure()
201+
working-directory: demos/ota_nav2_sensor_fix
202+
run: docker compose logs ota_update_server --tail=200
203+
204+
- name: Teardown
205+
if: always()
206+
working-directory: demos/ota_nav2_sensor_fix
207+
run: docker compose down
208+
209+
# Separate job from build-and-test-ota: this one publishes /goal_pose,
210+
# waits for the controller to actually try to drive, asserts the
211+
# reactive SCAN_PHANTOM_RETURN fault appears, and only then runs the
212+
# OTA swap. Catches regressions in the demo narrative (broken_lidar
213+
# subscribing /cmd_vel, fault_manager debounce, fixed_lidar not
214+
# reporting). Slower than the API-only smoke job because it has to
215+
# wait for nav2 lifecycle to settle and for /cmd_vel to actually fire,
216+
# so it's split out and can fail in isolation without blocking the
217+
# quick OTA-endpoint check.
218+
ota-demo-narrative:
219+
needs: lint
220+
runs-on: ubuntu-24.04
221+
steps:
222+
- name: Show triggering source
223+
if: github.event_name == 'repository_dispatch'
224+
run: |
225+
SHA="${{ github.event.client_payload.sha }}"
226+
RUN_URL="${{ github.event.client_payload.run_url }}"
227+
echo "## Triggered by ros2_medkit" >> "$GITHUB_STEP_SUMMARY"
228+
echo "- Commit: \`${SHA:-unknown}\`" >> "$GITHUB_STEP_SUMMARY"
229+
if [ -n "$RUN_URL" ]; then
230+
echo "- Run: [View triggering run]($RUN_URL)" >> "$GITHUB_STEP_SUMMARY"
231+
fi
232+
233+
- name: Checkout repository
234+
uses: actions/checkout@v4
235+
187236
- name: Build artifacts (catalog + tarballs) inside ros:jazzy
188237
working-directory: demos/ota_nav2_sensor_fix
189238
run: |
@@ -211,25 +260,19 @@ jobs:
211260
cd ..
212261
./scripts/build_artifacts.sh
213262
'
214-
# Restore ownership of files the container created as root.
215263
sudo chown -R "$USER:$USER" .
216264
217265
- name: Build and start OTA demo
218266
working-directory: demos/ota_nav2_sensor_fix
219267
run: docker compose up -d --build
220268

221-
- name: Run smoke tests
222-
run: ./tests/smoke_test_ota.sh
269+
- name: Run demo narrative smoke
270+
run: ./tests/smoke_test_demo_narrative.sh
223271

224272
- name: Show gateway logs on failure
225273
if: failure()
226274
working-directory: demos/ota_nav2_sensor_fix
227-
run: docker compose logs gateway --tail=200
228-
229-
- name: Show update server logs on failure
230-
if: failure()
231-
working-directory: demos/ota_nav2_sensor_fix
232-
run: docker compose logs ota_update_server --tail=200
275+
run: docker compose logs gateway --tail=300
233276

234277
- name: Teardown
235278
if: always()
Lines changed: 89 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -1,21 +1,100 @@
1-
FROM python:3.11-slim
1+
# Copyright 2026 bburda
2+
# Apache 2.0
3+
#
4+
# Multi-stage build: stage 1 produces the SOVD catalog + tarballs from
5+
# the demo's ros2_packages/ + the gateway repo (for ros2_medkit_msgs);
6+
# stage 2 ships a slim FastAPI server that serves those artefacts. This
7+
# keeps `docker compose build` self-contained and reproducible on CI -
8+
# no separate "build artifacts on host" step required.
9+
#
10+
# Build context is the demo root (so we can pull in ros2_packages/ +
11+
# scripts/ + ota_update_server/). docker-compose.yml wires that up.
12+
13+
# =============================================================================
14+
# Stage 1: build artefacts + catalog inside ros:jazzy
15+
# =============================================================================
16+
FROM ros:jazzy AS artefact_builder
17+
18+
ARG GATEWAY_REPO=https://github.com/selfpatch/ros2_medkit.git
19+
ARG GATEWAY_REF=fix/component-logs-aggregation
20+
21+
RUN apt-get update && apt-get install -y --no-install-recommends \
22+
git \
23+
python3-colcon-common-extensions \
24+
python3-catkin-pkg \
25+
python3-venv \
26+
python3-pip \
27+
build-essential \
28+
cmake \
29+
ros-jazzy-rclcpp \
30+
ros-jazzy-sensor-msgs \
31+
ros-jazzy-geometry-msgs \
32+
ros-jazzy-visualization-msgs \
33+
&& rm -rf /var/lib/apt/lists/*
34+
35+
# ros2_medkit_msgs is the SOVD ReportFault service def - cloned at the
36+
# same ref the gateway image pins so the message ABI stays in lockstep.
37+
RUN git clone --depth=1 --branch ${GATEWAY_REF} ${GATEWAY_REPO} /tmp/ros2_medkit
38+
RUN . /opt/ros/jazzy/setup.sh && \
39+
cd /tmp/ros2_medkit && \
40+
colcon build --packages-select ros2_medkit_msgs
241

3-
# Build context expected to be the demo root (so we can pull in artifacts/)
4-
# rather than ota_update_server/. Compose wires this up.
42+
# Now bring in the demo's source packages + pack_artifact.py.
43+
WORKDIR /demo
44+
COPY ros2_packages /demo/ros2_packages
45+
COPY scripts /demo/scripts
46+
47+
# Build fixed_lidar + obstacle_classifier_v2 (the only artefacts that
48+
# need a binary - broken_lidar_legacy is uninstall-only). Tarballs +
49+
# catalog.json land in /demo/artifacts.
50+
RUN mkdir -p /demo/ros2_ws/src && \
51+
for pkg in fixed_lidar obstacle_classifier_v2; do \
52+
ln -sfn /demo/ros2_packages/$pkg /demo/ros2_ws/src/$pkg; \
53+
done && \
54+
. /opt/ros/jazzy/setup.sh && \
55+
. /tmp/ros2_medkit/install/setup.sh && \
56+
cd /demo/ros2_ws && \
57+
colcon build --packages-select fixed_lidar obstacle_classifier_v2
58+
59+
# pack_artifact.py uses pure stdlib (json/tarfile/argparse) so we don't
60+
# need the .venv/pytest dance build_artifacts.sh does locally.
61+
RUN mkdir -p /demo/artifacts && \
62+
rm -f /demo/artifacts/catalog.json && \
63+
PACK="python3 /demo/scripts/pack_artifact.py" && \
64+
$PACK --package fixed_lidar --version 2.1.0 \
65+
--kind update --target-component scan_sensor_node \
66+
--executable fixed_lidar_node \
67+
--replaces-executable broken_lidar_node \
68+
--notes "Fix /scan noise filter" \
69+
--skip-build --workspace /demo/ros2_ws \
70+
--out-dir /demo/artifacts --catalog /demo/artifacts/catalog.json && \
71+
$PACK --package obstacle_classifier_v2 --version 1.0.0 \
72+
--kind install --target-component obstacle_classifier \
73+
--executable obstacle_classifier_node \
74+
--notes "Extra safety layer for nav2" \
75+
--skip-build --workspace /demo/ros2_ws \
76+
--out-dir /demo/artifacts --catalog /demo/artifacts/catalog.json && \
77+
$PACK --package broken_lidar_legacy --version "" \
78+
--kind uninstall --target-component broken_lidar_legacy \
79+
--notes "Clean up deprecated package" \
80+
--skip-build --workspace /demo/ros2_ws \
81+
--out-dir /demo/artifacts --catalog /demo/artifacts/catalog.json
82+
83+
# =============================================================================
84+
# Stage 2: slim runtime image
85+
# =============================================================================
86+
FROM python:3.11-slim
587

688
WORKDIR /app
789
COPY ota_update_server/pyproject.toml ./
890
COPY ota_update_server/ota_update_server ./ota_update_server
991
RUN pip install --no-cache-dir .
1092

11-
# Bake the demo catalog + tarballs into the image so the container is
12-
# self-contained. Bind-mounting artifacts/ at runtime is unreliable on
13-
# WSL2 + Docker Desktop, so we ship them in the image instead.
14-
COPY artifacts /artifacts
93+
COPY --from=artefact_builder /demo/artifacts /artifacts
1594

16-
ENV OTA_ARTIFACTS_DIR=/artifacts
17-
ENV OTA_HOST=0.0.0.0
18-
ENV OTA_PORT=9000
95+
ENV OTA_ARTIFACTS_DIR=/artifacts \
96+
OTA_HOST=0.0.0.0 \
97+
OTA_PORT=9000
1998
EXPOSE 9000
2099

21100
CMD ["ota-update-server"]

demos/ota_nav2_sensor_fix/ros2_packages/broken_lidar/CMakeLists.txt

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -12,9 +12,11 @@ endif()
1212
find_package(ament_cmake REQUIRED)
1313
find_package(rclcpp REQUIRED)
1414
find_package(sensor_msgs REQUIRED)
15+
find_package(geometry_msgs REQUIRED)
16+
find_package(ros2_medkit_msgs REQUIRED)
1517

1618
add_executable(broken_lidar_node src/broken_lidar_node.cpp)
17-
ament_target_dependencies(broken_lidar_node rclcpp sensor_msgs)
19+
ament_target_dependencies(broken_lidar_node rclcpp sensor_msgs geometry_msgs ros2_medkit_msgs)
1820

1921
install(TARGETS broken_lidar_node DESTINATION lib/${PROJECT_NAME})
2022

demos/ota_nav2_sensor_fix/ros2_packages/broken_lidar/package.xml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,8 @@
1010

1111
<depend>rclcpp</depend>
1212
<depend>sensor_msgs</depend>
13+
<depend>geometry_msgs</depend>
14+
<depend>ros2_medkit_msgs</depend>
1315

1416
<export>
1517
<build_type>ament_cmake</build_type>

demos/ota_nav2_sensor_fix/ros2_packages/broken_lidar/src/broken_lidar_node.cpp

Lines changed: 116 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,18 +1,66 @@
11
// Copyright 2026 bburda. Apache-2.0.
2+
//
3+
// Pre-OTA scan publisher: emits a 360-ray LaserScan at 10 Hz with a
4+
// phantom 1 m return at index 180 (straight ahead). Nav2's local
5+
// costmap traces it as an obstacle and the planner refuses to drive
6+
// through.
7+
//
8+
// Reactive fault reporting (the demo narrative pivots on this beat):
9+
// the node subscribes to /cmd_vel. While the controller is actively
10+
// commanding non-zero motion - i.e. the operator just sent a goal and
11+
// nav2 is trying to drive - we raise SCAN_PHANTOM_RETURN against
12+
// scan_sensor_node every 5 s. After 10 s of idle /cmd_vel we emit
13+
// report_passed and the fault clears from the dashboard.
14+
//
15+
// Result on screen: the Faults Dashboard is empty when the robot is
16+
// idle, lights up the moment the operator publishes /goal_pose and
17+
// the controller starts spinning, and clears once they stop driving
18+
// or the OTA swap installs fixed_lidar (which doesn't make any of
19+
// these reports).
20+
221
#include <chrono>
322
#include <cmath>
423
#include <memory>
24+
#include <string>
525

626
#include <rclcpp/rclcpp.hpp>
727
#include <sensor_msgs/msg/laser_scan.hpp>
28+
#include <geometry_msgs/msg/twist.hpp>
29+
#include <geometry_msgs/msg/twist_stamped.hpp>
30+
#include <ros2_medkit_msgs/srv/report_fault.hpp>
831

932
using std::chrono_literals::operator""ms;
33+
using std::chrono_literals::operator""s;
1034

1135
class BrokenLidarNode : public rclcpp::Node {
1236
public:
1337
BrokenLidarNode() : Node("scan_sensor_node") {
1438
pub_ = create_publisher<sensor_msgs::msg::LaserScan>("scan", 10);
1539
timer_ = create_wall_timer(100ms, [this]() { publish_scan(); });
40+
41+
// Nav2 Jazzy publishes /cmd_vel as TwistStamped; older stacks (and
42+
// teleop) still use plain Twist. Subscribe to both so the reactive
43+
// fault works regardless of which side is driving.
44+
constexpr double kThresh = 0.01;
45+
auto handle_motion = [this](double linear_x, double angular_z) {
46+
if (std::fabs(linear_x) > kThresh || std::fabs(angular_z) > kThresh) {
47+
last_motion_command_ = now();
48+
}
49+
};
50+
cmd_vel_sub_ = create_subscription<geometry_msgs::msg::Twist>(
51+
"cmd_vel", 10,
52+
[handle_motion](const geometry_msgs::msg::Twist::SharedPtr msg) {
53+
handle_motion(msg->linear.x, msg->angular.z);
54+
});
55+
cmd_vel_stamped_sub_ = create_subscription<geometry_msgs::msg::TwistStamped>(
56+
"cmd_vel", 10,
57+
[handle_motion](const geometry_msgs::msg::TwistStamped::SharedPtr msg) {
58+
handle_motion(msg->twist.linear.x, msg->twist.angular.z);
59+
});
60+
61+
fault_client_ = create_client<ros2_medkit_msgs::srv::ReportFault>(
62+
"/fault_manager/report_fault");
63+
fault_timer_ = create_wall_timer(2s, [this]() { tick_fault(); });
1664
}
1765

1866
private:
@@ -27,13 +75,79 @@ class BrokenLidarNode : public rclcpp::Node {
2775
msg.range_max = 10.0f;
2876
constexpr int kRays = 360;
2977
msg.ranges.assign(kRays, msg.range_max);
30-
// Inject a 1 m phantom return at angle 0 (straight ahead, ray index 180)
31-
msg.ranges[180] = 1.0f;
78+
// Phantom obstacle: a 21-ray (~20 degree) wedge of 0.5 m returns
79+
// centered straight ahead. A single ray at 1.0 m the local costmap
80+
// happily plans around because nav2 marks one cell, raytracing
81+
// clears it on the next sweep, and the controller drives forward
82+
// anyway. A close wide wedge gets stamped into the local costmap
83+
// as a continuous wall the planner has to swerve to avoid - which
84+
// it can't reliably do because the wedge stays anchored to base_scan
85+
// (it rotates with the robot). End result: the controller stalls,
86+
// BT navigator times out, NavigateToPose returns ABORTED.
87+
constexpr int kPhantomCenter = 180;
88+
constexpr int kPhantomHalfWidth = 10;
89+
constexpr float kPhantomRange = 0.5f;
90+
for (int i = kPhantomCenter - kPhantomHalfWidth;
91+
i <= kPhantomCenter + kPhantomHalfWidth;
92+
++i) {
93+
msg.ranges[i] = kPhantomRange;
94+
}
3295
pub_->publish(msg);
3396
}
3497

98+
void tick_fault() {
99+
if (!fault_client_->service_is_ready()) return;
100+
101+
const auto now_ts = now();
102+
const bool driving = last_motion_command_.nanoseconds() > 0
103+
&& (now_ts - last_motion_command_).seconds() < kIdleTimeoutSec;
104+
105+
if (driving) {
106+
send_report(false); // EVENT_FAILED - keep fault active
107+
fault_active_ = true;
108+
} else if (fault_active_) {
109+
send_report(true); // EVENT_PASSED - clear fault
110+
fault_active_ = false;
111+
}
112+
}
113+
114+
void send_report(bool passed) {
115+
auto req = std::make_shared<ros2_medkit_msgs::srv::ReportFault::Request>();
116+
req->fault_code = "SCAN_PHANTOM_RETURN";
117+
req->event_type = passed
118+
? ros2_medkit_msgs::srv::ReportFault::Request::EVENT_PASSED
119+
: ros2_medkit_msgs::srv::ReportFault::Request::EVENT_FAILED;
120+
req->severity = 3; // ERROR
121+
req->description = "LaserScan ray index 180 reports a constant 1.0 m return "
122+
"(straight ahead). Nav2 traces it as a phantom obstacle "
123+
"and the controller cannot make progress.";
124+
req->source_id = "scan_sensor_node";
125+
126+
auto cb = [this, passed](rclcpp::Client<ros2_medkit_msgs::srv::ReportFault>::SharedFuture fut) {
127+
try {
128+
auto result = fut.get();
129+
if (!result->accepted) {
130+
RCLCPP_DEBUG(get_logger(), "FaultManager rejected SCAN_PHANTOM_RETURN report (passed=%d)",
131+
passed ? 1 : 0);
132+
}
133+
} catch (const std::exception & e) {
134+
RCLCPP_DEBUG(get_logger(), "ReportFault call failed: %s", e.what());
135+
}
136+
};
137+
auto fut = fault_client_->async_send_request(req, cb);
138+
(void)fut;
139+
}
140+
141+
static constexpr double kIdleTimeoutSec = 10.0;
142+
35143
rclcpp::Publisher<sensor_msgs::msg::LaserScan>::SharedPtr pub_;
36144
rclcpp::TimerBase::SharedPtr timer_;
145+
rclcpp::Subscription<geometry_msgs::msg::Twist>::SharedPtr cmd_vel_sub_;
146+
rclcpp::Subscription<geometry_msgs::msg::TwistStamped>::SharedPtr cmd_vel_stamped_sub_;
147+
rclcpp::Client<ros2_medkit_msgs::srv::ReportFault>::SharedPtr fault_client_;
148+
rclcpp::TimerBase::SharedPtr fault_timer_;
149+
rclcpp::Time last_motion_command_{0, 0, RCL_ROS_TIME};
150+
bool fault_active_{false};
37151
};
38152

39153
int main(int argc, char ** argv) {

demos/ota_nav2_sensor_fix/ros2_packages/fixed_lidar/CMakeLists.txt

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -12,9 +12,10 @@ endif()
1212
find_package(ament_cmake REQUIRED)
1313
find_package(rclcpp REQUIRED)
1414
find_package(sensor_msgs REQUIRED)
15+
find_package(ros2_medkit_msgs REQUIRED)
1516

1617
add_executable(fixed_lidar_node src/fixed_lidar_node.cpp)
17-
ament_target_dependencies(fixed_lidar_node rclcpp sensor_msgs)
18+
ament_target_dependencies(fixed_lidar_node rclcpp sensor_msgs ros2_medkit_msgs)
1819

1920
install(TARGETS fixed_lidar_node DESTINATION lib/${PROJECT_NAME})
2021

demos/ota_nav2_sensor_fix/ros2_packages/fixed_lidar/package.xml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,7 @@
1010

1111
<depend>rclcpp</depend>
1212
<depend>sensor_msgs</depend>
13+
<depend>ros2_medkit_msgs</depend>
1314

1415
<export>
1516
<build_type>ament_cmake</build_type>

0 commit comments

Comments
 (0)