System-level specification for the pure DDS approach of the robot fleet demo. This document maps the high-level scenario requirements (see top-level README) to a pure-DDS implementation — pub-sub for all data flows, DDS request-reply for commands and charging-station slot negotiation. No gRPC, no Zeroconf, no bolted-on discovery.
robot_node_spec.md ·
robot_ui_spec.md ·
charging_station_spec.md
```
source ../setup.sourceme dds      # set NDDSHOME, LD_LIBRARY_PATH, etc.
./demo_start.sh all               # launch stations + robots + UI → http://localhost:5000
./demo_stop.sh                    # stop everything
```

Individual components:

```
./demo_start.sh stations          # all charging stations
./demo_start.sh robots            # all robots
./demo_start.sh ui                # dashboard UI
./demo_start.sh robot tractor1    # single robot
./demo_start.sh station station1  # single station (dock coords auto-derived)
```

The CLI is uniform across all three approaches — same commands, same arguments. Robot/station names and dock coordinates come from `../shared/fleet_common.sh`. No port assignments needed — DDS SPDP handles discovery.
This approach implements the Data-Centric Pub-Sub and Services architecture described in the top-level README:
- Pure Python, RTI Connext DDS for all communication.
- DDS pub-sub topics for KinematicState, OperationalState, Intent, Telemetry, and Video.
- DDS request-reply for commands and charging-station slot management.
- No gRPC, no Zeroconf — DDS provides discovery, presence, QoS, and reliable delivery as first-class features.
- Every participant joins the same DDS domain — no manual peer addresses, no port lists, no connection management.
| Requirement | How the DDS approach handles it |
|---|---|
| Late joiners converge quickly | DDS TRANSIENT_LOCAL durability on OperationalState and Intent — a new participant receives the last published value for every key immediately on discovery (sub-second, typically < 100 ms). |
| Presence detection ≤ 100 ms | DDS AUTOMATIC_LIVELINESS_QOS with a 100 ms lease duration. The middleware fires an `on_liveliness_changed` callback — zero application code. |
| Robots appear dynamically | DDS Simple Participant Discovery Protocol (SPDP) uses UDP multicast — no configuration, no registry, no mDNS. |
| KinematicState known within tolerance | KinematicState topic published at 10 Hz with BEST_EFFORT / VOLATILE QoS. Each robot reads the latest value per key — always fresh, never queued. |
| OperationalState delivered reliably | OperationalState topic with RELIABLE / TRANSIENT_LOCAL QoS — guaranteed delivery plus late-joiner convergence. Published on change. |
| Intent delivered reliably | Intent topic with RELIABLE / TRANSIENT_LOCAL QoS. Includes path waypoints and path index. Published on change. |
| Telemetry loss tolerable | Telemetry topic with BEST_EFFORT / VOLATILE QoS — 1 Hz, no retransmissions. |
| Commands are reliable | DDS request-reply over RELIABLE / KEEP_ALL topics with correlation IDs — guaranteed delivery with explicit success/failure response. |
| Robots come and go | DDS handles this natively. Participant departure triggers liveliness callbacks on all peers. No reconnect loops, no cleanup code. |
| IP addresses change | DDS discovery is transport-agnostic — participants re-discover each other automatically after IP change (built-in SPDP re-announcement). |
| Video streaming | VideoFrame topic — keyed by robot_id, BEST_EFFORT / VOLATILE. Each robot publishes JPEG frames; the UI subscribes with a content filter for the selected robot. |
| Charging stations | DDS request-reply for slot negotiation (RequestSlot / ConfirmSlot / ReleaseSlot). Station status published as a StationStatus topic — the UI subscribes to see live queue state. |
- Runtime dependency — requires RTI Connext DDS 7.6.0+ installed and licensed.
- Learning curve — QoS policy combinations are powerful but take time to master. Mismatched QoS can silently prevent applications from communicating. gRPC is simpler: it effectively has a single QoS (TCP semantics).
- Web deployment friction — less friendly to web stacks: harder to deploy behind load balancers, and no native web proxying.
The goal is not to port the gRPC approach line-for-line, but to solve the same scenario requirements in the most natural way for DDS:
- No delivery loops — each robot calls `writer.write()` once; the middleware delivers to all matched readers via multicast. There are no per-subscriber streaming threads.
- No reconnect logic — DDS discovery and liveliness are built-in. There is no `_connect_to_peer` retry loop.
- No Zeroconf / mDNS — DDS SPDP replaces the bolted-on `fleet_discovery.py` from the gRPC approach.
- QoS per topic — each data flow gets the reliability, durability, and history policy that matches its semantics (see §5).
- Separate topics for separate concerns — `KinematicState` (10 Hz, best-effort) and `OperationalState` (on change, reliable) are distinct topics with distinct QoS, not a single `RobotState` blob.
- Keyed topics — `robot_id` is the DDS key on every topic. The middleware manages per-key state automatically: one instance per robot, last value cached, liveliness tracked per instance.
- Content filtering — the UI can subscribe to video from a single robot without receiving (and discarding) frames from all others. The filter expression is evaluated in the middleware, not in application code.
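The keyed-topic behaviour can be illustrated with a small plain-Python sketch of what the reader-side cache does for a keyed topic with KEEP_LAST depth=1. This is purely illustrative — the class and names below are hypothetical stand-ins for logic the middleware performs internally, not part of the Connext API:

```python
# Hypothetical sketch of a reader-side cache for a keyed topic with
# KEEP_LAST depth=1: one slot per key (robot_id), newest sample wins.
class KeyedLastValueCache:
    def __init__(self):
        self._instances = {}  # robot_id -> latest sample

    def on_sample(self, sample):
        # A new sample replaces the previous one for the same key only;
        # samples for other keys are untouched.
        self._instances[sample["robot_id"]] = sample

    def read(self, robot_id):
        return self._instances.get(robot_id)

cache = KeyedLastValueCache()
cache.on_sample({"robot_id": "tractor1", "x": 10.0, "y": 5.0})
cache.on_sample({"robot_id": "tractor2", "x": 3.0, "y": 7.0})
cache.on_sample({"robot_id": "tractor1", "x": 11.0, "y": 5.5})  # overwrites tractor1 only
print(cache.read("tractor1")["x"])  # → 11.0
```

In real code none of this is written by the application: the DDS reader cache already holds the latest sample per instance, and a `read()`/`take()` call returns it.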
```
┌─────────────┐                            ┌─────────────┐
│ robot_node  │        DDS Domain 0        │ robot_node  │
│  (robot1)   │◄═══════════════════════►   │  (robot2)   │
└──────┬──────┘     multicast pub-sub      └──────┬──────┘
       ║              ┌─────────────┐             ║
       ╚══════════════│ robot_node  │═════════════╝
                      │  (robot3)   │
                      └──────┬──────┘
                             ║
       ══════════════════════╩═══════════════════════
       ║                     ║                      ║
┌──────────────┐  ┌──────────────────┐  ┌──────────────────┐
│   robot_ui   │  │ charging_station │  │ charging_station │
│  (Flask UI)  │  │    (station1)    │  │    (station2)    │
└──────────────┘  └──────────────────┘  └──────────────────┘
       │
       │ HTTP / SSE / MJPEG
       ▼
┌──────────────┐
│   Browser    │
└──────────────┘
```
Key difference from the gRPC approach: there are no point-to-point connections. All participants join the same DDS domain. The middleware handles discovery, matching, and delivery. The double lines (═) represent logical topic subscriptions, not TCP connections.
| Program | Count | Role |
|---|---|---|
| `robot_node.py` | 1 per tractor (5 in default config: tractor1, tractor2, tractor3, tractor4, tractor5) | Autonomous tractor: publishes KinematicState / OperationalState / Intent / Telemetry / Video; subscribes to peer state for collision avoidance; handles command requests; negotiates charging slots |
| `charging_station.py` | 1 per station (2 in default config) | Charging dock manager: FIFO queue, slot negotiation via DDS request-reply, publishes StationStatus |
| `robot_ui.py` | 1 | Fleet dashboard: subscribes to all robot topics + station status, serves web UI with live map + tables + command panel + video feed + charging panel |
Every participant is a DDS DomainParticipant — it discovers all others automatically via SPDP multicast. There is no client/server distinction.
Shared data space — every participant reads and writes the topics it cares about. The middleware handles fan-out. Charging stations and the UI are not special — they are just participants that subscribe to different topics and serve different request-reply endpoints.
| File | Purpose | Detailed Spec |
|---|---|---|
| `robot_types.idl` | DDS type definitions (IDL) — single source of truth for all topic types, enums, and structs | — |
| `robot_types.py` | Python dataclasses generated by `rtiddsgen` — do not edit by hand | — (auto-generated) |
| `robot_qos.xml` | QoS profiles for each topic type (Pattern-based), incl. CoverageProfile (PERSISTENT durability) | — |
| `robot_node.py` | Robot node — publishes own state, subscribes to peers, handles commands, path following, collision avoidance, charging, video rendering, coverage tracking | robot_node_spec.md |
| `charging_station.py` | Charging station — FIFO dock queue, slot negotiation via DDS request-reply, status publishing | charging_station_spec.md |
| `robot_ui.py` | Flask web dashboard — DDS subscriber (7 readers incl. CoveragePoint), SSE publisher, command proxy, video proxy, charging panel, coverage map overlay | robot_ui_spec.md |
| `fleet_config.sh` | Transport config for this approach (DDS domain ID); sources `../shared/fleet_common.sh` for scenario data | — |
| `demo_start.sh` | Uniform launch script: `all`, `robots`, `stations`, `ui`, `persistence`, `robot <name>`, `station <name>` | — |
| `demo_stop.sh` | Kill all running robot, station, and UI processes | — |
| `types_generate.sh` | Regenerate `robot_types.py` from `robot_types.idl` via `rtiddsgen` | — |
| `../setup.sourceme` | Environment setup — run `source ../setup.sourceme dds` from repo root | — |
Shared assets (in ../shared/):
| File | Purpose |
|---|---|
| `fleet_common.sh` | Single source of truth for robot names, station positions, UI port — shared by all approaches |
| `video_renderer.py` | PyBullet headless 3D renderer — compound tractor bodies, waypoint beacons, text overlay |
| `arena_ground.urdf` | URDF arena ground plane for PyBullet |
| `field1_map.jpg` / `field1_texture.jpg` | Ground textures for canvas map and 3D renderer |
```
robot_node                                  all other robot_nodes + robot_ui
──────────                                  ─────────────────────────────────
writer.write(KinematicState) ──10 Hz──►     reader takes: position, velocity, heading
    BEST_EFFORT / VOLATILE                  (multicast — single write, N deliveries)
    KEEP_LAST depth=1
```
DDS advantage: the robot calls write() once. The middleware delivers to
all matched readers via multicast. No per-subscriber threads, no delivery
loops.
```
robot_node                                  all other robot_nodes + robot_ui
──────────                                  ─────────────────────────────────
writer.write(OperationalState) ─on change─► reader takes: status, battery_level
    RELIABLE / TRANSIENT_LOCAL              (late joiners get last value)
    KEEP_LAST depth=1
```
DDS advantage: TRANSIENT_LOCAL durability means a newly discovered robot
immediately receives the current OperationalState of every live peer — no
manual sync, no "catch-up" protocol.
```
robot_node                                  all other robot_nodes + robot_ui
──────────                                  ─────────────────────────────────
writer.write(Intent) ─on change──►          reader takes: intent_type, target, waypoints
    RELIABLE / TRANSIENT_LOCAL              (late joiners get last value)
    KEEP_LAST depth=1
```
```
robot_node                                  robot_ui
──────────                                  ────────
writer.write(Telemetry) ── 1 Hz ──►         reader takes: cpu, memory, temperature, signal strength
    BEST_EFFORT / VOLATILE                  (loss tolerable)
```
```
robot_node                                  robot_ui
──────────                                  ────────
writer.write(VideoFrame) ── ~5 fps──►       reader takes: frame_data (JPEG bytes)
    BEST_EFFORT / VOLATILE                  content-filtered by robot_id
    KEEP_LAST depth=1
```
DDS advantage: the UI uses a ContentFilteredTopic to subscribe to video
from only the selected robot. The filter is evaluated in the middleware — the
publisher doesn't need to know who's watching.
```
robot_node                                  robot_ui
──────────                                  ────────
writer.write(CoveragePoint) ──on move──►    reader takes: robot_id, x, y
    RELIABLE / PERSISTENT                   (via Persistence Service)
    KEEP_LAST depth=1000
```
Each robot publishes CoveragePoint as it moves along its path. The UI
accumulates these into per-robot polylines drawn on the map canvas (lineWidth 4,
alpha 0.45, rounded joins). A toggle button and "Clear Coverage" button control
display. PERSISTENT durability (via RTI Persistence Service) allows late-joining
UIs to see historical coverage.
```
robot_ui                                    robot_node
────────                                    ──────────
writer.write(CommandRequest) ──────────►    reader takes: command, params
    RELIABLE / VOLATILE                     (keyed by request_id)
    KEEP_ALL
             ◄──────────                    writer.write(CommandResponse)
                                            (correlated by request_id)
```
DDS advantage: reliable delivery is a QoS policy, not a transport guarantee. If the target robot is temporarily unreachable, the middleware retransmits automatically — no application-level retry logic.
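The correlation mechanics can be sketched in plain Python. This is an illustrative model of the requester side only — `CommandRequester` and its methods are hypothetical names, not from the codebase; in the real system the middleware carries requests and responses over the RELIABLE topics described above:

```python
# Illustrative sketch (not the Connext API) of request-reply correlation:
# each request carries a unique request_id; a response is matched back to
# its pending request by that id and unrelated replies are ignored.
import uuid

class CommandRequester:
    def __init__(self, send):
        self._send = send          # callable that "publishes" the request
        self._pending = {}         # request_id -> response (or None)

    def send_command(self, robot_id, command):
        request_id = str(uuid.uuid4())
        self._pending[request_id] = None
        self._send({"request_id": request_id,
                    "robot_id": robot_id,
                    "command": command})
        return request_id

    def on_response(self, response):
        # Only replies correlating with a pending request are recorded.
        if response["request_id"] in self._pending:
            self._pending[response["request_id"]] = response

    def result(self, request_id):
        return self._pending.get(request_id)

sent = []
req = CommandRequester(sent.append)
rid = req.send_command("tractor1", "CMD_STOP")
req.on_response({"request_id": rid, "success": True})
print(req.result(rid)["success"])  # → True
```

With DDS request-reply the bookkeeping above is handled for you; the application just sends a request and waits for the correlated reply.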
```
robot_node                                  charging_station
──────────                                  ────────────────
writer.write(SlotRequest) ──────────►       reader takes: robot_id, battery, info_only
    RELIABLE / VOLATILE                     (keyed by request_id)
             ◄──────────                    writer.write(SlotOffer)
                                            dock position, wait time, slot_id
writer.write(SlotConfirm) ──────────►       reader: commit to queue
             ◄──────────                    writer.write(SlotAssignment)
                                            granted, dock/wait position, rank
writer.write(SlotRelease) ──────────►       reader: release dock or cancel
```
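The station-side queue behaviour behind this exchange can be sketched in plain Python. The class below is an illustrative model, not the actual `charging_station.py`; field names follow the spec's slot types:

```python
# Plain-Python sketch of the station-side FIFO logic behind the slot
# negotiation flow (RequestSlot -> SlotOffer -> SlotConfirm ->
# SlotAssignment -> SlotRelease). Illustrative only.
class ChargingStation:
    def __init__(self, dock_x, dock_y):
        self.dock = (dock_x, dock_y)
        self.docked = None       # robot currently on the dock
        self.queue = []          # FIFO of waiting robots

    def request_slot(self, robot_id):
        # Offer reports the robot's prospective queue rank (0 = dock free).
        rank = 0 if self.docked is None else 1 + len(self.queue)
        return {"robot_id": robot_id, "queue_rank": rank,
                "dock_x": self.dock[0], "dock_y": self.dock[1]}

    def confirm_slot(self, robot_id):
        if self.docked is None:
            self.docked = robot_id
            return {"granted": True, "queue_rank": 0}
        self.queue.append(robot_id)
        return {"granted": True, "queue_rank": len(self.queue)}

    def release_slot(self, robot_id):
        # Releasing the dock promotes the head of the FIFO queue.
        if self.docked == robot_id:
            self.docked = self.queue.pop(0) if self.queue else None
        elif robot_id in self.queue:
            self.queue.remove(robot_id)

station = ChargingStation(8, 92)
station.confirm_slot("tractor1")   # docks immediately (rank 0)
station.confirm_slot("tractor2")   # queued at rank 1
station.release_slot("tractor1")   # tractor2 promoted to the dock
print(station.docked)  # → tractor2
```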
Station status (live queue state) is published as a topic:
```
charging_station                            robot_ui
────────────────                            ────────
writer.write(StationStatus) ── 1 Hz ──►     reader takes: queue entries, dock state
    RELIABLE / TRANSIENT_LOCAL              (late-joining UI gets current state)
```
```
Flask /stream ──SSE 5 Hz──► Browser EventSource
```
Same as the gRPC approach — the Flask backend snapshots the DDS reader caches and pushes JSON SSE events. This is the only non-DDS data flow.
```
                    UDP multicast (239.255.0.1)
              ┌──────────────────────────────────┐
robot_node ◄──┤ DDS Simple Participant           ├──► robot_node
station    ◄──┤ Discovery Protocol (SPDP)        ├──► robot_ui
              └──────────────────────────────────┘
```
No application code required. Every DomainParticipant on the same domain
ID discovers all others automatically. Topic matching (which reader gets which
writer's data) is handled by the Simple Endpoint Discovery Protocol (SEDP).
The central advantage of DDS over gRPC is per-topic QoS. Each data flow gets exactly the guarantees it needs — no more, no less.
All profiles inherit from RTI Pattern profiles (BuiltinQosLib::Pattern.*)
rather than raw Generic profiles. Each Pattern already encodes the right
reliability / durability / history for its communication pattern — our overrides
are minimal (liveliness, deadline tuning).
| Topic | QoS Profile | Base Pattern | Overrides | Rationale |
|---|---|---|---|---|
| `KinematicState` | `KinematicStateProfile` | `Pattern.PeriodicData` | deadline 200 ms writer / 500 ms reader, liveliness 100 ms | High rate, latest-value-only. Loss of one sample is harmless — the next arrives in 100 ms. |
| `OperationalState` | `OperationalStateProfile` | `Pattern.Status` | liveliness 100 ms | Must not be lost. Late joiners need current status + battery. |
| `Intent` | `IntentProfile` | `Pattern.LastValueCache` | liveliness 100 ms | Must not be lost. Late joiners need current intent + waypoints. |
| `Telemetry` | `TelemetryProfile` | `Pattern.Streaming` | deadline 2 s writer / 5 s reader | Loss tolerable. No liveliness needed (covered by KinematicState). |
| `VideoFrame` | `VideoProfile` | `Pattern.Streaming` | deadline 1 s writer / 2 s reader | High bandwidth. Dropped frames are replaced by the next one. |
| `CommandRequest` | `CommandProfile` | `Pattern.RPC` | (none — pattern defaults) | Every command must be delivered. Tuned heartbeat/NACK for low-latency reply. |
| `CommandResponse` | `CommandProfile` | `Pattern.RPC` | (none) | Every response must be delivered. |
| `SlotRequest`/`Confirm`/`Release` | `SlotNegotiationProfile` | `Pattern.RPC` | (none) | Charging negotiation is correlated request/reply. |
| `StationStatus` | `StationStatusProfile` | `Pattern.Status` | (none) | UI late joiners need current station state. |
| `CoveragePoint` | `CoverageProfile` | `Pattern.Status` | PERSISTENT durability, KEEP_LAST 1000, max_samples 10000, max_samples_per_instance 2000 | Full coverage trail history survives restarts via RTI Persistence Service. |
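As a concrete illustration, a Pattern-based profile in `robot_qos.xml` could look like the sketch below. Element names follow RTI's XML QoS schema; the snippet shows only the `OperationalStateProfile` liveliness override from the table and is not the actual file:

```xml
<!-- Illustrative sketch: OperationalStateProfile inherits the built-in
     Status pattern and overrides only liveliness (100 ms lease). -->
<qos_library name="RobotFleetQosLib">
  <qos_profile name="OperationalStateProfile"
               base_name="BuiltinQosLib::Pattern.Status">
    <datawriter_qos>
      <liveliness>
        <kind>AUTOMATIC_LIVELINESS_QOS</kind>
        <lease_duration>
          <sec>0</sec>
          <nanosec>100000000</nanosec>  <!-- 100 ms -->
        </lease_duration>
      </liveliness>
    </datawriter_qos>
    <datareader_qos>
      <liveliness>
        <kind>AUTOMATIC_LIVELINESS_QOS</kind>
        <lease_duration>
          <sec>0</sec>
          <nanosec>100000000</nanosec>
        </lease_duration>
      </liveliness>
    </datareader_qos>
  </qos_profile>
</qos_library>
```

The Pattern base profile already supplies RELIABLE / TRANSIENT_LOCAL / KEEP_LAST 1, so the override stays minimal, as described above.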
DDS liveliness replaces the manual heartbeat / reconnect logic from the gRPC approach:
```
# QoS configuration (in robot_qos.xml):
#   liveliness.kind = AUTOMATIC_LIVELINESS_QOS
#   liveliness.lease_duration = 100 ms
#
# Application code:
class MyListener(dds.DataReaderListener):
    def on_liveliness_changed(self, reader, status):
        if status.alive_count_change < 0:
            print(f"Robot DEAD — {status.last_publication_handle}")
        elif status.alive_count_change > 0:
            print(f"Robot ALIVE — {status.last_publication_handle}")
```

Zero application polling. The middleware asserts liveliness automatically based on write activity. If a writer stops (crash, network loss), the lease expires and all readers are notified within 100 ms.
The types are fully implemented in `robot_types.idl`. Key data types, separated by concern and keyed by `robot_id`:
No `timestamp` fields. DDS provides `source_timestamp` and `reception_timestamp` in the `SampleInfo` metadata of every sample — nanosecond precision, zero application code. The gRPC approach must put an explicit `int64 timestamp` in every protobuf message; the DDS approach gets it for free.
| Type | Key | Fields | Notes |
|---|---|---|---|
| `KinematicState` | `robot_id` | `position{x,y,z}`, `velocity{x,y,z}`, `heading` | 10 Hz, matches the gRPC approach |
| `OperationalState` | `robot_id` | `status` (enum), `battery_level` | On change; status = MOVING / IDLE / HOLDING / CHARGING |
| `Intent` | `robot_id` | `intent_type` (enum), `target_x`, `target_y`, `waypoints[]`, `path_index` | On change; intent = FOLLOW_PATH / GOTO / IDLE / CHARGE_QUEUE / CHARGE_DOCK |
| `Telemetry` | `robot_id` | `cpu_usage`, `memory_usage`, `temperature`, `signal_strength` | 1 Hz; no `tcp_connections` / `delivery_loops` (those are gRPC artifacts) |
| `VideoFrame` | `robot_id` | `frame_data` (octet sequence) | ~5 fps JPEG |
| `CoveragePoint` | `robot_id` | `x`, `y` | Published on move; PERSISTENT durability via Persistence Service |
| `CommandRequest` | `request_id` | `robot_id`, `command` (enum), `parameters` (JSON string) | Reliable request |
| `CommandResponse` | `request_id` | `robot_id`, `success`, `description`, `resulting_status`, `resulting_intent` | Reliable response |
| `SlotRequest` | `request_id` | `robot_id`, `battery_level`, `info_only` | Charging negotiation |
| `SlotOffer` | `request_id` | `station_id`, `slot_id`, `queue_rank`, `wait_time_s`, `dock_x`, `dock_y` | Station response |
| `SlotConfirm` | `request_id` | `robot_id`, `slot_id` | Robot commits |
| `SlotAssignment` | `request_id` | `granted`, `queue_rank`, `wait_x`, `wait_y`, `dock_x`, `dock_y` | Station assignment |
| `SlotRelease` | `request_id` | `robot_id`, `slot_id` | Release / cancel |
| `SlotReleaseAck` | `request_id` | `success` | Release acknowledgement |
| `StationStatus` | `station_id` | `dock_x`, `dock_y`, `is_occupied`, `docked_robot_id`, `queue[]` | Live station state for UI |
Key design difference from the gRPC approach: Telemetry does not include `tcp_connections` or `active_delivery_loops` — those metrics are gRPC-specific artifacts. In DDS there are no per-subscriber delivery loops and the "connection" concept doesn't apply (multicast pub-sub).
- `RobotStatus`: STATUS_UNKNOWN / STATUS_MOVING / STATUS_IDLE / STATUS_HOLDING / STATUS_CHARGING
- `RobotIntentType`: INTENT_UNKNOWN / INTENT_FOLLOW_PATH / INTENT_GOTO / INTENT_IDLE / INTENT_CHARGE_QUEUE / INTENT_CHARGE_DOCK
- `Command`: COMMAND_UNKNOWN / CMD_STOP / CMD_FOLLOW_PATH / CMD_GOTO / CMD_RESUME / CMD_SET_PATH / CMD_CHARGE
These mirror the gRPC approach's protobuf enums exactly — same values, same semantics —
expressed as IDL enumerations in robot_types.idl.
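For illustration, an IDL excerpt consistent with the enum and type tables above might read as follows. The field widths and numeric types shown here are assumptions; the authoritative definitions are in `robot_types.idl`:

```idl
// Illustrative excerpt — the real definitions live in robot_types.idl.
enum RobotStatus {
  STATUS_UNKNOWN,
  STATUS_MOVING,
  STATUS_IDLE,
  STATUS_HOLDING,
  STATUS_CHARGING
};

struct OperationalState {
  @key string<64> robot_id;   // DDS key: one instance per robot
  RobotStatus status;
  float battery_level;
};
```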
`robot_types.idl` is the single source of truth for all data types. Python dataclasses are generated — never hand-written:

```
$NDDSHOME/bin/rtiddsgen -language python robot_types.idl
```

This produces `robot_types.py` (~300 lines, 23 types) with all the correct `@idl.struct` decorators, `field(default_factory=...)` for mutable defaults, `Sequence[T]` with `idl.bound()` for bounded sequences, `idl.key` annotations, and `idl.octet` mappings. The generated file carries a DO NOT MODIFY header.
Workflow when types change:
- Edit `robot_types.idl` (add/remove/rename fields, types, or enums)
- Re-run `./types_generate.sh` (or `rtiddsgen -language python robot_types.idl`)
- All Python code that imports from `robot_types` picks up the change — no manual struct editing, no missed fields.
Parallel to the gRPC approach: the gRPC approach uses `protoc` to generate `robot_pb2.py` from `robot.proto`. The DDS approach uses `rtiddsgen` to generate `robot_types.py` from `robot_types.idl`. Same workflow — different IDL, different middleware, same single-source-of-truth principle.
- Python 3.14.3+
- RTI Connext DDS 7.6.0+ (Professional or Community edition)
- RTI Code Generator (`rtiddsgen`) — included in the Connext installation
- Virtual environment with packages: `rti.connext`, `flask`, `pybullet`, `Pillow`, `numpy`
```
source ../setup.sourceme dds
# or manually:
export NDDSHOME=/path/to/rti_connext_dds-7.6.0
export PATH=$NDDSHOME/bin:$PATH
export LD_LIBRARY_PATH=$NDDSHOME/lib/x64Darwin17clang9.0:$LD_LIBRARY_PATH  # macOS
pip install -r requirements.txt

# (Re-)generate Python types from IDL (only needed after editing robot_types.idl):
$NDDSHOME/bin/rtiddsgen -language python robot_types.idl
```
```
# All 5 tractors + 2 charging stations:
./demo_start.sh

# Individual tractor:
python robot_node.py --id tractor1

# Or specific tractors in separate terminals:
python robot_node.py --id tractor1
python robot_node.py --id tractor2
python robot_node.py --id tractor3
python robot_node.py --id tractor4
python robot_node.py --id tractor5

# Charging stations:
python charging_station.py --id station1 --dock-x 8 --dock-y 92
python charging_station.py --id station2 --dock-x 92 --dock-y 8

# Dashboard UI:
python robot_ui.py   # http://localhost:5000
```

No static mode needed. DDS discovery is automatic — all participants on the same domain ID find each other via SPDP multicast. No port lists, no peer addresses.

```
./demo_stop.sh   # kills all robot_node, charging_station, and robot_ui processes
```

DDS eliminates the N² thread explosion of the gRPC approach. Each robot node needs approximately six application threads (plus ~3 DDS-internal threads, ~9 total) regardless of fleet size:
| Thread | Count | Purpose |
|---|---|---|
| Main | 1 | Blocks on sleep(1) loop |
| `update_position` | 1 | Movement tick at 10 Hz (path following, collision avoidance) |
| `publish_state` | 1 | Publishes KinematicState at 10 Hz |
| `publish_intent` | 1 | Publishes Intent on change (or 1 Hz heartbeat) |
| `publish_telemetry` | 1 | Publishes Telemetry at 1 Hz |
| `print_status` | 1 | Log line every 5 s |
| DDS internal threads | ~3 | Receive, event, database (managed by middleware) |
Compare with the gRPC approach: a 5-robot fleet needs ~19 threads per robot (4 reader threads × 4 peers + connector threads). A 100-robot fleet would need ~400 threads per robot. In DDS it's still ~9.
No delivery loops — in the gRPC approach, each streaming RPC occupies a thread that runs `while: yield; sleep` for every subscriber. In DDS, `writer.write()` returns immediately; the middleware handles serialisation and multicast delivery internally.

No reconnect threads — in the gRPC approach, each peer gets a `_connect_to_peer` thread that retries every 3 s on failure. In DDS, discovery and reconnection are handled by the middleware's internal SPDP/SEDP threads.
| Aspect | gRPC (`grpc/`) | Hybrid (`grpc-dds/`) | DDS (this) |
|---|---|---|---|
| Topology | Full mesh (N²) | Peer-to-peer multicast + gRPC | Peer-to-peer multicast |
| Transport | gRPC / TCP | DDS / UDP + gRPC / TCP | DDS / UDP multicast |
| Discovery | Zeroconf mDNS (bolted on) | DDS SPDP + Zeroconf | DDS SPDP only |
| QoS | None (TCP reliable) | Per-topic (pub-sub only) | Per-topic (everything) |
| Commands | gRPC unary | gRPC unary | DDS request-reply |
| Charging | gRPC unary RPCs | gRPC unary RPCs | DDS request-reply |
| Scalability | Poor (N²) | Good (multicast) | Good (multicast) |
| Presence detection | Seconds (TCP) | Milliseconds (DDS liveliness) | Milliseconds (DDS liveliness) |
| Delivery loops per robot | 4 × (N−1) | 0 | 0 |
| Threads per robot (N=100) | ~400 | ~10 | ~9 |
| Extra protocols | Zeroconf, mDNS | Zeroconf (for gRPC peers) | None |
These are concrete pieces of the gRPC approach code that do not exist in the DDS approach because the middleware handles them:
| gRPC approach code | Lines | DDS approach equivalent |
|---|---|---|
| `fleet_discovery.py` (Zeroconf mDNS wrapper) | ~200 | DDS SPDP — zero code |
| `_connect_to_peer` reconnect loop | ~60 | DDS automatic reconnection — zero code |
| `_start_kinematic_stream` + reader thread | ~20 × 4 topics | `reader.take()` in listener callback — 5 lines |
| Per-subscriber `while: yield; sleep` delivery loops | ~15 × 4 topics | `writer.write()` — 1 line |
| `_delivery_loop_count` / `_enter/_exit_delivery_loop` | ~20 | Does not exist (no concept of delivery loops) |
| Channel connectivity monitoring | ~30 | `on_liveliness_changed` callback — 5 lines |
| `active_video_streams` counter | ~10 | Content-filtered subscription — middleware manages |
| Manual `robot_liveliness` timestamp polling | ~25 | DDS liveliness QoS — zero code |
| `int64 timestamp` field in every protobuf message | 8 B × 13 types | `SampleInfo.source_timestamp` / `reception_timestamp` — zero payload overhead, nanosecond precision |
| `protoc` → `robot_pb2.py` + `robot_pb2_grpc.py` | 2 generated files | `rtiddsgen` → `robot_types.py` — 1 generated file (no separate service stubs; DDS topics are declared at runtime) |
| Thread pool sizing (`max_workers=50`) | — | Middleware manages its own threads |
Estimated code reduction: ~400+ lines of networking / discovery / reconnect / threading / timestamping plumbing eliminated.
These are the observable failure scenarios under DDS — contrast with the gRPC approach's §9.
| Trigger | What happens | Why it's better than gRPC |
|---|---|---|
| WiFi blip | DDS internally buffers reliable samples during the outage and retransmits on reconnect. Best-effort samples are lost (harmless — next one arrives in 100 ms). | No reconnection storm. No per-peer retry loops. No CPU spike. |
| Robot crash / kill | Liveliness lease expires within 100 ms. All readers receive `on_liveliness_changed` callback. | Detection in 100 ms vs seconds with TCP keepalive. |
| Slow peer | DDS flow control and writer history depth prevent unbounded buffering. The slow reader loses best-effort samples gracefully. | No cascading failures. No thread pool starvation. |
| Startup burst | All robots discover each other via multicast in parallel. No TCP handshakes, no connection storms. | O(1) discovery messages vs O(N²) TCP connections. |
| Port exhaustion | Does not apply — DDS uses a small fixed number of UDP ports per participant, not one TCP connection per peer. | No ulimit tuning required. |
| IP address change | DDS SPDP re-announces with new address. Peers re-match automatically. | No stale peer addresses. No manual re-registration. |
| Metric | Expected (DDS) | gRPC approach (for comparison) |
|---|---|---|
| Connections per robot | ~3 UDP sockets (fixed) | N−1 TCP connections (grows with fleet) |
| Threads per robot | ~9 (fixed) | ~4N (grows with fleet) |
| Delivery loops | 0 | 4 × (N−1) per robot |
| Startup time to full mesh | < 2 s (SPDP discovery) | 3–15 s (sequential TCP connects) |
| Presence detection latency | ≤ 100 ms (liveliness QoS) | 3–10 s (TCP keepalive) |
| Late-joiner convergence | < 100 ms (TRANSIENT_LOCAL) | < 100 ms (first stream yield) |
| CPU during reconnection | Negligible | Spike (N² TCP handshakes) |
| Memory per robot | ~50 MB (fixed) | ~50 MB + buffers (grows with N) |
Subscribe to state from robots in a specific area only:
```
cft = dds.ContentFilteredTopic(
    participant, "NearbyRobots", state_topic,
    dds.Filter("x > 40.0 AND x < 60.0 AND y > 40.0 AND y < 60.0")
)
reader = dds.DataReader(subscriber, cft)
```

Receive telemetry at most once per second (even if published faster):

```
reader_qos.time_based_filter.minimum_separation = dds.Duration(sec=1)
```

Separate robot fleets on the same network:

```
publisher_qos.partition.name = ["FleetA"]  # Fleet A robots
publisher_qos.partition.name = ["FleetB"]  # Fleet B robots
```

- Check firewall allows UDP multicast (port 7400+ for SPDP)
- Ensure all robots use the same domain ID (`--domain` flag)
- Verify `NDDSHOME` is set and `rti.connext` is installed
- On macOS: check that the loopback interface allows multicast (`sudo route add -net 239.0.0.0/8 -interface lo0`)
- Verify QoS compatibility: a `RELIABLE` reader requires a `RELIABLE` writer
- Check durability: a `TRANSIENT_LOCAL` reader needs a `TRANSIENT_LOCAL` writer
- Verify topic names and type names match exactly
```
# Prints discovery data. Shows samples being published
$NDDSHOME/bin/rtiddsspy -domainId 0

# Prints discovery data and data content
$NDDSHOME/bin/rtiddsspy -domainId 0 -print
```

This shows every DDS sample published on the domain — useful for verifying that topics, types, services, and QoS are matching correctly.