refactor(core): replace unwrap() with proper error handling in node API and core libs by LeonRust · Pull Request #1608 · dora-rs/dora

LeonRust · 2026-04-11T03:18:39Z

Replace 6 instances of unwrap() with graceful error handling to prevent unnecessary panics in production.

…PI and core libs Replace 6 instances of unwrap() with graceful error handling to prevent unnecessary panics in production. This change improves fault tolerance while maintaining 100% functional equivalence. Changes: - apis/rust/node/src/node/mod.rs: • Handle Mutex PoisonError in tracing setup (recover lock instead of panic) • Handle OTLP tracing setup failure (fallback to stdout logging) • Handle tracing subscriber build failure (warn and continue) • Replace unwrap() with expect() for cache index removal (better diagnostics) - apis/rust/node/src/event_stream/mod.rs: • Handle event recording failure (warn and continue scheduling) • Ensures observability never breaks the main event processing logic - libraries/core/src/topics.rs: • Handle Zenoh config insert failures (warn and use defaults) • Applied to 3 insert_json5 calls in build_zenoh_config() Important: NO FUNCTIONAL LOGIC CHANGED - All modifications are limited to error handling paths only - Normal execution paths behave identically to before - Only failure scenarios changed from panic → warn + graceful degradation - Follows Rust best practices: expect() over unwrap(), recover from PoisonError, observability should not break main logic Verified: - cargo clippy passes with no new warnings - All 47 unit tests pass (dora-node-api, dora-core, dora-message) - Doc tests pass

phil-opp · 2026-04-13T13:59:27Z

Wow, this PR description feels like a new record for unnecessary verboseness. I trimmed it down to the once sentence that would be needed for such a simple change. Please keep your PR descriptions short in the future, nobody has time to read that wall of text.

haixuanTao · 2026-04-13T14:08:58Z

@LeonRust it's failing formatting, could you fix it?

phil-opp · 2026-04-13T14:01:50Z

+                self.cache.remove(i).expect(
+                    "shared memory cache index should exist: \
+                    found via iter() but missing on remove() indicates cache corruption",
+                )


The unwrap already gives us file and line info, which should be enough to understand the issue. We definitely don't need to put the reason for the expect in the comment, this is just extra noise.

phil-opp · 2026-04-13T14:06:11Z

+        // Tracing subscriber setup failure should not crash the node.
+        // Log the error and continue with default logging.


Log the error where? If there is no tracing subscriber, nobody will receive the logs? I think it's probably better to error than to log nothing at all.

phil-opp · 2026-04-13T14:07:25Z

+        // Handle PoisonError gracefully: tracing setup failure should
+        // not crash the node. If the mutex is poisoned, recover the lock
+        // and continue (data is still valid).
+        match clone.lock() {
+            Ok(mut guard) => *guard = builder.guard.take(),
+            Err(poisoned) => {
+                tracing::warn!(
+                    "tracing guard mutex was poisoned, recovering: \
+                    another task may have panicked while holding the lock"
+                );
+                let mut guard = poisoned.into_inner();
+                *guard = builder.guard.take();
+            }
+        }


I don't think that we need that. This is very unlikely to ever happen

phil-opp · 2026-04-13T14:10:21Z

    if let Some(port) = listen_port {
-        zenoh_config
+        if let Err(err) = zenoh_config
            .insert_json5("listen/endpoints", &format!(r#"["tcp/0.0.0.0:{port}"]"#))
-            .unwrap();
+        {
+            tracing::warn!("failed to set zenoh listen endpoint: {err}");
+        }
    }


If a certain port was explicitly asked for, it should be treated as an error if that doesn't work.

phil-opp · 2026-04-13T14:10:55Z

                )
-                .unwrap();
+            {
+                tracing::warn!("failed to set zenoh connect endpoint: {err}");


This means that the node/daemon might not reach other nodes. I think it would be better to keep this an error.

LeonRust · 2026-04-22T13:23:40Z

OK, I'll readjust it again.

heyong4725 · 2026-05-18T22:04:50Z

Thanks for putting this cleanup together. I re-checked it against current main, and I’m going to close this as stale/superseded rather than ask you to keep rebasing it.

A couple of the original targets have already changed: the tracing setup unwrap handling has been fixed independently, and the old shared-memory cache path this touched no longer exists in the current node API code. The branch now conflicts with the current node/core architecture, so it isn’t a good merge candidate as-is.

There is still one useful idea here: making EventStream event-log write failures non-fatal could be worth revisiting as a small targeted PR with a regression test. Thanks again for the contribution.

Pre-landing /review noted the trailing "Rescue of #1608." in the add_event comment is non-conventional. Recent rescue PRs (#1850, #1854) keep "rescue of #NNNN" exclusively in commit messages and PR titles; code comments rot when issue numbers stop being load-bearing. The preceding two sentences explain WHY the failure must not panic; that is the part future readers need. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@LeonRust

…config (rescue of #1608) (#1855) * refactor: replace unwrap with warn-and-continue in event log + zenoh config (rescue of #1608) Rescues the still-applicable parts of @LeonRust's #1608 against current main. The original PR touched three sites; only two remain valid. apis/rust/node/src/event_stream/mod.rs EventStream::add_event() called self.record_event(&event).unwrap(). record_event writes the optional write_events_to event-log file -- observability only. A failed write (e.g. disk full, file handle revoked) panicked the entire event loop. Replace with a tracing::warn and continue, so observability never breaks the main path. libraries/core/src/topics.rs open_zenoh_session_with_listen() called .unwrap() on six zenoh_config.insert_json5(...) results. The keys are static and the zenoh internals make these effectively infallible today, but they are also static-config hygiene noise that flags in audit / unwrap-budget. Replace each with `if let Err(err) = ... { warn!(...) }` so a future zenoh schema change degrades to a warning rather than a daemon panic. Dropped from #1608: - apis/rust/node/src/node/mod.rs (cache.remove + init_tracing unwraps): the entire shared-memory cache was removed by #1745, and init_tracing's unwraps were already replaced with graceful match / if-let-Err handling on current main. PR 1608's diff there is now either dead code or strictly redundant. Local QA: cargo fmt --all -- --check, cargo clippy -p dora-node-api -p dora-core --all-targets -- -D warnings, cargo test -p dora-node-api -p dora-core (79 tests pass), cargo check --examples. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: drop inline issue ref from event_stream comment Pre-landing /review noted the trailing "Rescue of #1608." in the add_event comment is non-conventional. Recent rescue PRs (#1850, #1854) keep "rescue of #NNNN" exclusively in commit messages and PR titles; code comments rot when issue numbers stop being load-bearing. The preceding two sentences explain WHY the failure must not panic; that is the part future readers need. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(core): skip dependent zenoh config when its primary insert fails (interim for #1856) Codex/reviewer flagged that after the panic→warn conversion in this PR, two sequenced zenoh config writes had a regression-shaped failure mode: - DORA_ZENOH_CONNECT path: if connect/endpoints insert fails, the next line still disables multicast scouting, leaving the node with neither a daemon connection nor multicast discovery — an isolated session masquerading as healthy. - listen_endpoint path: if listen/endpoints insert fails, the next line still sets listen/exit_on_failure=false — tuning a listener that never got configured. Minimal interim mitigation: track whether each primary insert succeeded and gate the dependent insert on it. The DORA_ZENOH_CONNECT warn message now states "leaving multicast scouting enabled as fallback" so operators see the recovery path. Does NOT do the full required-vs-optional classification (returning Err for required-topology failures from open_zenoh_session_with_listen so the caller can choose policy). That belongs in #1856, kept open. Local QA: cargo fmt --all -- --check, cargo clippy -p dora-node-api -p dora-core --all-targets -- -D warnings, cargo test -p dora-node-api -p dora-core (79 tests pass), cargo check --examples. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(core, daemon): make open_zenoh_session_with_listen the source of truth for the listener (closes #1856) Reviewer escalated: even with the prior commit's sequenced-dependency mitigation, the failure mode is still daemon-visible. If insert_json5("listen/endpoints", ...) fails inside the helper, the helper now returns Ok(Session), but the daemon keeps its locally reserved `zenoh_listen_endpoint` in state and later injects it into spawned nodes via `DORA_ZENOH_CONNECT`. Spawned nodes then try to connect to an endpoint that nothing is listening on. Fix: make the helper the source of truth. * `open_zenoh_session_with_listen` now returns `eyre::Result<(zenoh::Session, Option<String>)>`. The second element is `Some(ep)` only if `listen_endpoint` was requested AND zenoh accepted the `listen/endpoints` insert. It is `None` otherwise (no listen requested, insert failed, or `ZENOH_CONFIG_PATH`-from-file branch where we have no visibility). * The thin `open_zenoh_session` wrapper destructures and discards the endpoint to preserve its existing signature. * `dora-daemon::Daemon::run_internal` now reads `zenoh_listen_endpoint` from the helper's return rather than from its locally reserved port. Spawned nodes only receive `DORA_ZENOH_CONNECT` when the listener actually bound. If the request was made but the bind didn't materialize, the daemon logs a warning explaining the fallback to multicast scouting. Only one external call site (`binaries/daemon/src/lib.rs:798`); the internal call from `open_zenoh_session` and the two `open_zenoh_session` callers (`apis/rust/node/src/node/mod.rs:592`, `binaries/coordinator/src/ws_control.rs:465`) are unaffected. Local QA: cargo fmt --all -- --check, cargo clippy --all --exclude dora-{node-api,operator-api,ros2-bridge}-python -- -D warnings, cargo test -p dora-core -p dora-node-api -p dora-daemon (lib), cargo check --examples. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(core): don't advertise listen endpoint when fallback to default zenoh config Reviewer caught: `effective_listen_endpoint` was assigned during config-build (when `insert_json5("listen/endpoints", ...)` succeeded), *before* `zenoh::open(zenoh_config)` was attempted. If that configured open fails and the helper falls back to `zenoh::open(Config::default())`, the default config has no listener — but the function still returned the stale `Some(ep)`. The daemon would then inject `DORA_ZENOH_CONNECT` into spawned nodes for an endpoint not present in the actual session. Fix: defer promotion. Track listen acceptance into a local `listen_inserted_into_configured` during config-build, and only assign `effective_listen_endpoint = listen_inserted_into_configured` in the `Ok(zenoh_session)` arm of the configured open. The fallback else branch leaves `effective_listen_endpoint` at its `None` default — which is structurally what "default config has no listener" should mean. Makes the invariant local: `effective_listen_endpoint` is `Some(ep)` iff the returned session actually has `ep` bound. Any future fallback path stays correct without having to remember a separate clear step. Local QA: cargo fmt --all -- --check, cargo clippy --all --exclude dora-{node-api,operator-api,ros2-bridge}-python -- -D warnings, cargo test -p dora-core -p dora-node-api -p dora-daemon (lib), cargo check --examples. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

LeonRust and others added 2 commits April 11, 2026 11:05

Merge branch 'dora-rs:main' into main

db90756

phil-opp reviewed Apr 13, 2026

View reviewed changes

github-actions Bot added the waiting-for-author The pull request requires adjustments by the PR author. label Apr 13, 2026

Merge branch 'dora-rs:main' into main

2de179b

github-actions Bot removed the waiting-for-author The pull request requires adjustments by the PR author. label Apr 22, 2026

heyong4725 closed this May 18, 2026

heyong4725 mentioned this pull request May 18, 2026

refactor: replace unwrap with warn-and-continue in event log + zenoh config (rescue of #1608) #1855

Merged

3 tasks

heyong4725 mentioned this pull request May 19, 2026

fix(node): mark write_events_to recording poisoned on first record_event failure (closes #1857) #1865

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(core): replace unwrap() with proper error handling in node API and core libs#1608

refactor(core): replace unwrap() with proper error handling in node API and core libs#1608
LeonRust wants to merge 3 commits into
dora-rs:mainfrom
DoraCN:main

LeonRust commented Apr 11, 2026 •

edited by phil-opp

Loading

Uh oh!

phil-opp commented Apr 13, 2026

Uh oh!

haixuanTao commented Apr 13, 2026

Uh oh!

phil-opp Apr 13, 2026

Uh oh!

phil-opp Apr 13, 2026

Uh oh!

phil-opp Apr 13, 2026

Uh oh!

phil-opp Apr 13, 2026

Uh oh!

phil-opp Apr 13, 2026

Uh oh!

LeonRust commented Apr 22, 2026 •

edited

Loading

Uh oh!

heyong4725 commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		// Tracing subscriber setup failure should not crash the node.
		// Log the error and continue with default logging.

Conversation

LeonRust commented Apr 11, 2026 • edited by phil-opp Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

phil-opp commented Apr 13, 2026

Uh oh!

haixuanTao commented Apr 13, 2026

Uh oh!

phil-opp Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

phil-opp Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

phil-opp Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

phil-opp Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

phil-opp Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

LeonRust commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

heyong4725 commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

LeonRust commented Apr 11, 2026 •

edited by phil-opp

Loading

LeonRust commented Apr 22, 2026 •

edited

Loading