Skip to content

refactor(core): replace unwrap() with proper error handling in node API and core libs#1608

Closed
LeonRust wants to merge 3 commits into
dora-rs:mainfrom
DoraCN:main
Closed

refactor(core): replace unwrap() with proper error handling in node API and core libs#1608
LeonRust wants to merge 3 commits into
dora-rs:mainfrom
DoraCN:main

Conversation

@LeonRust
Copy link
Copy Markdown
Collaborator

@LeonRust LeonRust commented Apr 11, 2026

Replace 6 instances of unwrap() with graceful error handling to prevent unnecessary panics in production.

LeonRust and others added 2 commits April 11, 2026 11:05
…PI and core libs

Replace 6 instances of unwrap() with graceful error handling to prevent
unnecessary panics in production. This change improves fault tolerance
while maintaining 100% functional equivalence.

Changes:
- apis/rust/node/src/node/mod.rs:
  • Handle Mutex PoisonError in tracing setup (recover lock instead of panic)
  • Handle OTLP tracing setup failure (fallback to stdout logging)
  • Handle tracing subscriber build failure (warn and continue)
  • Replace unwrap() with expect() for cache index removal (better diagnostics)

- apis/rust/node/src/event_stream/mod.rs:
  • Handle event recording failure (warn and continue scheduling)
  • Ensures observability never breaks the main event processing logic

- libraries/core/src/topics.rs:
  • Handle Zenoh config insert failures (warn and use defaults)
  • Applied to 3 insert_json5 calls in build_zenoh_config()

Important: NO FUNCTIONAL LOGIC CHANGED
- All modifications are limited to error handling paths only
- Normal execution paths behave identically to before
- Only failure scenarios changed from panic → warn + graceful degradation
- Follows Rust best practices: expect() over unwrap(), recover from PoisonError,
  observability should not break main logic

Verified:
- cargo clippy passes with no new warnings
- All 47 unit tests pass (dora-node-api, dora-core, dora-message)
- Doc tests pass
@phil-opp
Copy link
Copy Markdown
Collaborator

Wow, this PR description feels like a new record for unnecessary verboseness. I trimmed it down to the once sentence that would be needed for such a simple change. Please keep your PR descriptions short in the future, nobody has time to read that wall of text.

@haixuanTao
Copy link
Copy Markdown
Collaborator

@LeonRust it's failing formatting, could you fix it?

Comment on lines +765 to +768
self.cache.remove(i).expect(
"shared memory cache index should exist: \
found via iter() but missing on remove() indicates cache corruption",
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unwrap already gives us file and line info, which should be enough to understand the issue. We definitely don't need to put the reason for the expect in the comment, this is just extra noise.

Comment on lines +1017 to +1018
// Tracing subscriber setup failure should not crash the node.
// Log the error and continue with default logging.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log the error where? If there is no tracing subscriber, nobody will receive the logs? I think it's probably better to error than to log nothing at all.

Comment on lines +1002 to +1015
// Handle PoisonError gracefully: tracing setup failure should
// not crash the node. If the mutex is poisoned, recover the lock
// and continue (data is still valid).
match clone.lock() {
Ok(mut guard) => *guard = builder.guard.take(),
Err(poisoned) => {
tracing::warn!(
"tracing guard mutex was poisoned, recovering: \
another task may have panicked while holding the lock"
);
let mut guard = poisoned.into_inner();
*guard = builder.guard.take();
}
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we need that. This is very unlikely to ever happen

Comment on lines 36 to 42
if let Some(port) = listen_port {
zenoh_config
if let Err(err) = zenoh_config
.insert_json5("listen/endpoints", &format!(r#"["tcp/0.0.0.0:{port}"]"#))
.unwrap();
{
tracing::warn!("failed to set zenoh listen endpoint: {err}");
}
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a certain port was explicitly asked for, it should be treated as an error if that doesn't work.

)
.unwrap();
{
tracing::warn!("failed to set zenoh connect endpoint: {err}");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that the node/daemon might not reach other nodes. I think it would be better to keep this an error.

@github-actions github-actions Bot added the waiting-for-author The pull request requires adjustments by the PR author. label Apr 13, 2026
@LeonRust
Copy link
Copy Markdown
Collaborator Author

LeonRust commented Apr 22, 2026

OK, I'll readjust it again.

@github-actions github-actions Bot removed the waiting-for-author The pull request requires adjustments by the PR author. label Apr 22, 2026
@heyong4725
Copy link
Copy Markdown
Collaborator

Thanks for putting this cleanup together. I re-checked it against current main, and I’m going to close this as stale/superseded rather than ask you to keep rebasing it.

A couple of the original targets have already changed: the tracing setup unwrap handling has been fixed independently, and the old shared-memory cache path this touched no longer exists in the current node API code. The branch now conflicts with the current node/core architecture, so it isn’t a good merge candidate as-is.

There is still one useful idea here: making EventStream event-log write failures non-fatal could be worth revisiting as a small targeted PR with a regression test. Thanks again for the contribution.

@heyong4725 heyong4725 closed this May 18, 2026
heyong4725 added a commit that referenced this pull request May 18, 2026
Pre-landing /review noted the trailing "Rescue of #1608." in the
add_event comment is non-conventional. Recent rescue PRs (#1850,
#1854) keep "rescue of #NNNN" exclusively in commit messages and PR
titles; code comments rot when issue numbers stop being load-bearing.

The preceding two sentences explain WHY the failure must not panic;
that is the part future readers need.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
heyong4725 added a commit that referenced this pull request May 18, 2026
…config (rescue of #1608) (#1855)

* refactor: replace unwrap with warn-and-continue in event log + zenoh config (rescue of #1608)

Rescues the still-applicable parts of @LeonRust's #1608 against current
main. The original PR touched three sites; only two remain valid.

apis/rust/node/src/event_stream/mod.rs

  EventStream::add_event() called self.record_event(&event).unwrap().
  record_event writes the optional write_events_to event-log file --
  observability only. A failed write (e.g. disk full, file handle
  revoked) panicked the entire event loop. Replace with a tracing::warn
  and continue, so observability never breaks the main path.

libraries/core/src/topics.rs

  open_zenoh_session_with_listen() called .unwrap() on six
  zenoh_config.insert_json5(...) results. The keys are static and the
  zenoh internals make these effectively infallible today, but they are
  also static-config hygiene noise that flags in audit / unwrap-budget.
  Replace each with `if let Err(err) = ... { warn!(...) }` so a future
  zenoh schema change degrades to a warning rather than a daemon panic.

Dropped from #1608:

  - apis/rust/node/src/node/mod.rs (cache.remove + init_tracing
    unwraps): the entire shared-memory cache was removed by #1745, and
    init_tracing's unwraps were already replaced with graceful match /
    if-let-Err handling on current main. PR 1608's diff there is now
    either dead code or strictly redundant.

Local QA: cargo fmt --all -- --check, cargo clippy -p dora-node-api
-p dora-core --all-targets -- -D warnings, cargo test -p dora-node-api
-p dora-core (79 tests pass), cargo check --examples.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: drop inline issue ref from event_stream comment

Pre-landing /review noted the trailing "Rescue of #1608." in the
add_event comment is non-conventional. Recent rescue PRs (#1850,
#1854) keep "rescue of #NNNN" exclusively in commit messages and PR
titles; code comments rot when issue numbers stop being load-bearing.

The preceding two sentences explain WHY the failure must not panic;
that is the part future readers need.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(core): skip dependent zenoh config when its primary insert fails (interim for #1856)

Codex/reviewer flagged that after the panic→warn conversion in this PR,
two sequenced zenoh config writes had a regression-shaped failure mode:

  - DORA_ZENOH_CONNECT path: if connect/endpoints insert fails, the
    next line still disables multicast scouting, leaving the node with
    neither a daemon connection nor multicast discovery — an isolated
    session masquerading as healthy.
  - listen_endpoint path: if listen/endpoints insert fails, the next
    line still sets listen/exit_on_failure=false — tuning a listener
    that never got configured.

Minimal interim mitigation: track whether each primary insert
succeeded and gate the dependent insert on it. The DORA_ZENOH_CONNECT
warn message now states "leaving multicast scouting enabled as
fallback" so operators see the recovery path.

Does NOT do the full required-vs-optional classification (returning
Err for required-topology failures from open_zenoh_session_with_listen
so the caller can choose policy). That belongs in #1856, kept open.

Local QA: cargo fmt --all -- --check, cargo clippy -p dora-node-api
-p dora-core --all-targets -- -D warnings, cargo test -p dora-node-api
-p dora-core (79 tests pass), cargo check --examples.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(core, daemon): make open_zenoh_session_with_listen the source of truth for the listener (closes #1856)

Reviewer escalated: even with the prior commit's sequenced-dependency
mitigation, the failure mode is still daemon-visible. If
insert_json5("listen/endpoints", ...) fails inside the helper, the
helper now returns Ok(Session), but the daemon keeps its locally
reserved `zenoh_listen_endpoint` in state and later injects it into
spawned nodes via `DORA_ZENOH_CONNECT`. Spawned nodes then try to
connect to an endpoint that nothing is listening on.

Fix: make the helper the source of truth.

  * `open_zenoh_session_with_listen` now returns
    `eyre::Result<(zenoh::Session, Option<String>)>`. The second
    element is `Some(ep)` only if `listen_endpoint` was requested AND
    zenoh accepted the `listen/endpoints` insert. It is `None`
    otherwise (no listen requested, insert failed, or
    `ZENOH_CONFIG_PATH`-from-file branch where we have no visibility).
  * The thin `open_zenoh_session` wrapper destructures and discards
    the endpoint to preserve its existing signature.
  * `dora-daemon::Daemon::run_internal` now reads `zenoh_listen_endpoint`
    from the helper's return rather than from its locally reserved
    port. Spawned nodes only receive `DORA_ZENOH_CONNECT` when the
    listener actually bound. If the request was made but the bind
    didn't materialize, the daemon logs a warning explaining the
    fallback to multicast scouting.

Only one external call site (`binaries/daemon/src/lib.rs:798`); the
internal call from `open_zenoh_session` and the two `open_zenoh_session`
callers (`apis/rust/node/src/node/mod.rs:592`,
`binaries/coordinator/src/ws_control.rs:465`) are unaffected.

Local QA: cargo fmt --all -- --check, cargo clippy --all --exclude
dora-{node-api,operator-api,ros2-bridge}-python -- -D warnings,
cargo test -p dora-core -p dora-node-api -p dora-daemon (lib),
cargo check --examples.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(core): don't advertise listen endpoint when fallback to default zenoh config

Reviewer caught: `effective_listen_endpoint` was assigned during
config-build (when `insert_json5("listen/endpoints", ...)` succeeded),
*before* `zenoh::open(zenoh_config)` was attempted. If that configured
open fails and the helper falls back to `zenoh::open(Config::default())`,
the default config has no listener — but the function still returned
the stale `Some(ep)`. The daemon would then inject `DORA_ZENOH_CONNECT`
into spawned nodes for an endpoint not present in the actual session.

Fix: defer promotion. Track listen acceptance into a local
`listen_inserted_into_configured` during config-build, and only assign
`effective_listen_endpoint = listen_inserted_into_configured` in the
`Ok(zenoh_session)` arm of the configured open. The fallback else
branch leaves `effective_listen_endpoint` at its `None` default —
which is structurally what "default config has no listener" should
mean.

Makes the invariant local: `effective_listen_endpoint` is `Some(ep)`
iff the returned session actually has `ep` bound. Any future fallback
path stays correct without having to remember a separate clear step.

Local QA: cargo fmt --all -- --check, cargo clippy --all --exclude
dora-{node-api,operator-api,ros2-bridge}-python -- -D warnings,
cargo test -p dora-core -p dora-node-api -p dora-daemon (lib),
cargo check --examples.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants