Skip to content

fix(coordinator): validate add-node daemon reply#1757

Closed
imwyvern wants to merge 1 commit into
dora-rs:mainfrom
imwyvern:clawoss/fix/dora-node-add-reply
Closed

fix(coordinator): validate add-node daemon reply#1757
imwyvern wants to merge 1 commit into
dora-rs:mainfrom
imwyvern:clawoss/fix/dora-node-add-reply

Conversation

@imwyvern
Copy link
Copy Markdown
Contributor

add-node treats any daemon response as success, so a SetParamResult or an explicit daemon failure can still be reported as applied by the coordinator.

Changed the coordinator to validate that the daemon replied with a successful DaemonReply::AddNodeResult before committing the add-node state update. The daemon now returns that reply variant when handling AddNode.

Verified with:

  • cargo fmt --check
  • cargo test -p dora-coordinator add_node_forward_reply --lib

Fixes #1682

@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io Bot commented Apr 28, 2026

Merging to main in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

Ok(reply_raw) => {
ensure_add_node_applied(
&reply_raw, &node_id,
)?;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This brings the whole coordinator down on error, no? We should instead forward the result to the CLI and let it deal with the failure.

.expect_err("unexpected reply variant should fail AddNode forwarding");
assert!(err.to_string().contains("unexpected daemon reply"));
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are not testing anything non-trivial. Ideally, there would be a test that fails before this PR and succeeds after this PR.

@phil-opp phil-opp added the waiting-for-author The pull request requires adjustments by the PR author. label Apr 28, 2026
@heyong4725
Copy link
Copy Markdown
Collaborator

Closing in favor of #1873 — rescue of your work against current main, addressing both items in @phil-opp's review.

The new PR keeps your protocol additions and daemon-side change verbatim (AddNodeResult variant + reply_tx.send(Some(reply)) in the daemon's AddNode handler). The coordinator-side change was reshaped to address phil's two concerns:

  1. Error forwarding instead of ? propagation. The validator's error now flows into the existing Err arm of result, which becomes a ControlRequestReply::Error to the CLI. The coordinator's main loop stays alive on recoverable per-request failures. (phil's Set up a basic Rust project #1)
  2. Regression-style test. add_node_reply_rejects_wrong_reply_variant directly captures the dora node add times out + corrupts dataflow state (dynamic-topology CLI broken) #1682 failure mode — a stale SetParamResult(Ok) reply must be rejected with "unexpected daemon reply" instead of silently committing state. If the validator is weakened, the test fails. (phil's #2)

You're credited as Co-authored-by on the rescue commit. Thank you for the original work — the diagnosis and protocol shape were correct from the start, only the coordinator-side error handling needed reshaping.

cc @phil-opp

@heyong4725 heyong4725 closed this May 19, 2026
heyong4725 added a commit that referenced this pull request May 19, 2026
…I (rescue of #1757)

Rescues @imwyvern's #1757 against current main, addressing both items
in @phil-opp's 2026-04-28 review (changes-requested).

The bug (#1682)
================

`binaries/coordinator/src/lib.rs:1558` (pre-fix) had:

  Ok(_) => {
      dataflow.node_to_daemon.insert(...);
      dataflow.descriptor.nodes.push(...);
      dataflow.nodes.insert(...);
      Ok(ControlRequestReply::NodeAdded { ... })
  }

The `Ok(_)` arm accepted ANY successful `send_and_receive` reply from
the daemon as proof that AddNode applied. The daemon's actual handler
returned `reply_tx.send(None)` — a null reply — and the coordinator
would still happily commit state.

When the daemon's TCP connection happened to be carrying a stale or
out-of-order `SetParamResult` / other variant (which can happen under
concurrent CLI requests against a flaky daemon), the coordinator
committed AddNode state for a node the daemon never actually added.
The user-visible symptom from #1682:

  $ dora node add  <node>
  <times out>
  $ dora node list
  <node> shows as present in descriptor
  <subsequent commands corrupt or hang>

The fix
========

Three coordinated changes:

1. `libraries/message/src/daemon_to_coordinator.rs`: new
   `DaemonCoordinatorReply::AddNodeResult(Result<(), String>)` variant
   so the daemon can identify the reply specifically.

2. `binaries/daemon/src/lib.rs`: the `DaemonCoordinatorEvent::AddNode`
   handler now sends `AddNodeResult(result.map_err(|e| format!("{e:?}")))`
   instead of the previous `None` placeholder.

3. `binaries/coordinator/src/lib.rs`:
   * `Ok(reply_raw) =>` arm now calls a new `ensure_add_node_applied`
     helper that pattern-matches the reply against `AddNodeResult`.
   * The validator returns `Err(eyre!(...))` on either an explicit
     daemon failure or a wrong-variant reply.
   * The validator's error is funneled into the existing `Err` branch
     of the `result` binding, which the coordinator's main loop sends
     back to the CLI as a `ControlRequestReply::Error`. **The error
     does NOT propagate via `?` past the dispatch arm** — addresses
     phil's concern that the original PR's `?` would tear down the
     coordinator's main loop on a recoverable per-request failure.

Addressing @phil-opp's review
==============================

Two items from the 2026-04-28 review:

* "This brings the whole coordinator down on error, no? We should
  instead forward the result to the CLI and let it deal with the
  failure." — Fixed. The helper returns `Err`, the call site converts
  it into the `Err` arm of `result`, which becomes a
  `ControlRequestReply::Error` sent back to the CLI. The coordinator
  loop continues handling subsequent requests.

* "These tests are not testing anything non-trivial. Ideally, there
  would be a test that fails before this PR and succeeds after this
  PR." — The three new tests now explicitly cover:
    - happy path (`AddNodeResult(Ok)` accepted)
    - daemon-rejection path (`AddNodeResult(Err)` rejected with named
      operation + node id)
    - **regression scenario for #1682**: a wrong-variant reply
      (`SetParamResult(Ok)`) is rejected with "unexpected daemon
      reply" instead of silently committing state. This is the
      specific failure mode that caused #1682's state corruption.
  A full end-to-end test against a mock daemon would require
  extracting the AddNode handler from the `start_inner` async loop —
  out of scope for the rescue. The helper-level tests cover the
  contract regression-style: if the validator is removed or weakened,
  these tests fail.

What did NOT change
====================

* No new public API. The CLI/daemon protocol gains one reply variant,
  consistent with the existing `SetParamResult` / `DeleteParamResult`
  pattern in the same enum.
* No semver-affecting changes to user-facing types.
* No behavioral change for the happy path — a successful AddNode
  still returns `ControlRequestReply::NodeAdded { dataflow_id,
  node_id }` to the CLI exactly as before.
* No new dependencies.

Verification
=============

  cargo fmt --all -- --check
  cargo clippy --all --exclude dora-{node-api,operator-api,ros2-bridge}-python -- -D warnings
  cargo test -p dora-coordinator --lib add_node_reply
    (3/3 new tests pass)
  cargo test -p dora-coordinator -p dora-daemon -p dora-message --lib
    (103+/all pass)
  cargo check --examples

Manual verification of the original #1682 repro recipe (`dora up` →
`dora build` → `dora start --detach` → `dora node list` → `dora node
add`) should now surface daemon errors as CLI-visible failures
instead of timeouts + state corruption.

Co-authored-by: imwyvern <imwyvern@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
heyong4725 added a commit that referenced this pull request May 19, 2026
…I (rescue of #1757) (#1873)

Rescues @imwyvern's #1757 against current main, addressing both items
in @phil-opp's 2026-04-28 review (changes-requested).

The bug (#1682)
================

`binaries/coordinator/src/lib.rs:1558` (pre-fix) had:

  Ok(_) => {
      dataflow.node_to_daemon.insert(...);
      dataflow.descriptor.nodes.push(...);
      dataflow.nodes.insert(...);
      Ok(ControlRequestReply::NodeAdded { ... })
  }

The `Ok(_)` arm accepted ANY successful `send_and_receive` reply from
the daemon as proof that AddNode applied. The daemon's actual handler
returned `reply_tx.send(None)` — a null reply — and the coordinator
would still happily commit state.

When the daemon's TCP connection happened to be carrying a stale or
out-of-order `SetParamResult` / other variant (which can happen under
concurrent CLI requests against a flaky daemon), the coordinator
committed AddNode state for a node the daemon never actually added.
The user-visible symptom from #1682:

  $ dora node add  <node>
  <times out>
  $ dora node list
  <node> shows as present in descriptor
  <subsequent commands corrupt or hang>

The fix
========

Three coordinated changes:

1. `libraries/message/src/daemon_to_coordinator.rs`: new
   `DaemonCoordinatorReply::AddNodeResult(Result<(), String>)` variant
   so the daemon can identify the reply specifically.

2. `binaries/daemon/src/lib.rs`: the `DaemonCoordinatorEvent::AddNode`
   handler now sends `AddNodeResult(result.map_err(|e| format!("{e:?}")))`
   instead of the previous `None` placeholder.

3. `binaries/coordinator/src/lib.rs`:
   * `Ok(reply_raw) =>` arm now calls a new `ensure_add_node_applied`
     helper that pattern-matches the reply against `AddNodeResult`.
   * The validator returns `Err(eyre!(...))` on either an explicit
     daemon failure or a wrong-variant reply.
   * The validator's error is funneled into the existing `Err` branch
     of the `result` binding, which the coordinator's main loop sends
     back to the CLI as a `ControlRequestReply::Error`. **The error
     does NOT propagate via `?` past the dispatch arm** — addresses
     phil's concern that the original PR's `?` would tear down the
     coordinator's main loop on a recoverable per-request failure.

Addressing @phil-opp's review
==============================

Two items from the 2026-04-28 review:

* "This brings the whole coordinator down on error, no? We should
  instead forward the result to the CLI and let it deal with the
  failure." — Fixed. The helper returns `Err`, the call site converts
  it into the `Err` arm of `result`, which becomes a
  `ControlRequestReply::Error` sent back to the CLI. The coordinator
  loop continues handling subsequent requests.

* "These tests are not testing anything non-trivial. Ideally, there
  would be a test that fails before this PR and succeeds after this
  PR." — The three new tests now explicitly cover:
    - happy path (`AddNodeResult(Ok)` accepted)
    - daemon-rejection path (`AddNodeResult(Err)` rejected with named
      operation + node id)
    - **regression scenario for #1682**: a wrong-variant reply
      (`SetParamResult(Ok)`) is rejected with "unexpected daemon
      reply" instead of silently committing state. This is the
      specific failure mode that caused #1682's state corruption.
  A full end-to-end test against a mock daemon would require
  extracting the AddNode handler from the `start_inner` async loop —
  out of scope for the rescue. The helper-level tests cover the
  contract regression-style: if the validator is removed or weakened,
  these tests fail.

What did NOT change
====================

* No new public API. The CLI/daemon protocol gains one reply variant,
  consistent with the existing `SetParamResult` / `DeleteParamResult`
  pattern in the same enum.
* No semver-affecting changes to user-facing types.
* No behavioral change for the happy path — a successful AddNode
  still returns `ControlRequestReply::NodeAdded { dataflow_id,
  node_id }` to the CLI exactly as before.
* No new dependencies.

Verification
=============

  cargo fmt --all -- --check
  cargo clippy --all --exclude dora-{node-api,operator-api,ros2-bridge}-python -- -D warnings
  cargo test -p dora-coordinator --lib add_node_reply
    (3/3 new tests pass)
  cargo test -p dora-coordinator -p dora-daemon -p dora-message --lib
    (103+/all pass)
  cargo check --examples

Manual verification of the original #1682 repro recipe (`dora up` →
`dora build` → `dora start --detach` → `dora node list` → `dora node
add`) should now surface daemon errors as CLI-visible failures
instead of timeouts + state corruption.

Co-authored-by: imwyvern <imwyvern@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

waiting-for-author The pull request requires adjustments by the PR author.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

dora node add times out + corrupts dataflow state (dynamic-topology CLI broken)

3 participants