fix(coordinator): validate add-node daemon reply by imwyvern · Pull Request #1757 · dora-rs/dora

imwyvern · 2026-04-28T06:26:15Z

add-node treats any daemon response as success, so a SetParamResult or an explicit daemon failure can still be reported as applied by the coordinator.

Changed the coordinator to validate that the daemon replied with a successful DaemonReply::AddNodeResult before committing the add-node state update. The daemon now returns that reply variant when handling AddNode.

Verified with:

cargo fmt --check
cargo test -p dora-coordinator add_node_forward_reply --lib

Fixes #1682

trunk-io · 2026-04-28T06:26:20Z

Merging to main in this repository is managed by Trunk.

To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

phil-opp · 2026-04-28T07:03:16Z

+                                                                    Ok(reply_raw) => {
+                                                                        ensure_add_node_applied(
+                                                                            &reply_raw, &node_id,
+                                                                        )?;


This brings the whole coordinator down on error, no? We should instead forward the result to the CLI and let it deal with the failure.

phil-opp · 2026-04-28T07:04:46Z

+            .expect_err("unexpected reply variant should fail AddNode forwarding");
+        assert!(err.to_string().contains("unexpected daemon reply"));
+    }
+


These tests are not testing anything non-trivial. Ideally, there would be a test that fails before this PR and succeeds after this PR.

heyong4725 · 2026-05-19T22:05:15Z

Closing in favor of #1873 — rescue of your work against current main, addressing both items in @phil-opp's review.

The new PR keeps your protocol additions and daemon-side change verbatim (AddNodeResult variant + reply_tx.send(Some(reply)) in the daemon's AddNode handler). The coordinator-side change was reshaped to address phil's two concerns:

Error forwarding instead of ? propagation. The validator's error now flows into the existing Err arm of result, which becomes a ControlRequestReply::Error to the CLI. The coordinator's main loop stays alive on recoverable per-request failures. (phil's Set up a basic Rust project #1)
Regression-style test. add_node_reply_rejects_wrong_reply_variant directly captures the dora node add times out + corrupts dataflow state (dynamic-topology CLI broken) #1682 failure mode — a stale SetParamResult(Ok) reply must be rejected with "unexpected daemon reply" instead of silently committing state. If the validator is weakened, the test fails. (phil's #2)

You're credited as Co-authored-by on the rescue commit. Thank you for the original work — the diagnosis and protocol shape were correct from the start, only the coordinator-side error handling needed reshaping.

cc @phil-opp

@imwyvern

…I (rescue of #1757) Rescues @imwyvern's #1757 against current main, addressing both items in @phil-opp's 2026-04-28 review (changes-requested). The bug (#1682) ================ `binaries/coordinator/src/lib.rs:1558` (pre-fix) had: Ok(_) => { dataflow.node_to_daemon.insert(...); dataflow.descriptor.nodes.push(...); dataflow.nodes.insert(...); Ok(ControlRequestReply::NodeAdded { ... }) } The `Ok(_)` arm accepted ANY successful `send_and_receive` reply from the daemon as proof that AddNode applied. The daemon's actual handler returned `reply_tx.send(None)` — a null reply — and the coordinator would still happily commit state. When the daemon's TCP connection happened to be carrying a stale or out-of-order `SetParamResult` / other variant (which can happen under concurrent CLI requests against a flaky daemon), the coordinator committed AddNode state for a node the daemon never actually added. The user-visible symptom from #1682: $ dora node add <node> <times out> $ dora node list <node> shows as present in descriptor <subsequent commands corrupt or hang> The fix ======== Three coordinated changes: 1. `libraries/message/src/daemon_to_coordinator.rs`: new `DaemonCoordinatorReply::AddNodeResult(Result<(), String>)` variant so the daemon can identify the reply specifically. 2. `binaries/daemon/src/lib.rs`: the `DaemonCoordinatorEvent::AddNode` handler now sends `AddNodeResult(result.map_err(|e| format!("{e:?}")))` instead of the previous `None` placeholder. 3. `binaries/coordinator/src/lib.rs`: * `Ok(reply_raw) =>` arm now calls a new `ensure_add_node_applied` helper that pattern-matches the reply against `AddNodeResult`. * The validator returns `Err(eyre!(...))` on either an explicit daemon failure or a wrong-variant reply. * The validator's error is funneled into the existing `Err` branch of the `result` binding, which the coordinator's main loop sends back to the CLI as a `ControlRequestReply::Error`. **The error does NOT propagate via `?` past the dispatch arm** — addresses phil's concern that the original PR's `?` would tear down the coordinator's main loop on a recoverable per-request failure. Addressing @phil-opp's review ============================== Two items from the 2026-04-28 review: * "This brings the whole coordinator down on error, no? We should instead forward the result to the CLI and let it deal with the failure." — Fixed. The helper returns `Err`, the call site converts it into the `Err` arm of `result`, which becomes a `ControlRequestReply::Error` sent back to the CLI. The coordinator loop continues handling subsequent requests. * "These tests are not testing anything non-trivial. Ideally, there would be a test that fails before this PR and succeeds after this PR." — The three new tests now explicitly cover: - happy path (`AddNodeResult(Ok)` accepted) - daemon-rejection path (`AddNodeResult(Err)` rejected with named operation + node id) - **regression scenario for #1682**: a wrong-variant reply (`SetParamResult(Ok)`) is rejected with "unexpected daemon reply" instead of silently committing state. This is the specific failure mode that caused #1682's state corruption. A full end-to-end test against a mock daemon would require extracting the AddNode handler from the `start_inner` async loop — out of scope for the rescue. The helper-level tests cover the contract regression-style: if the validator is removed or weakened, these tests fail. What did NOT change ==================== * No new public API. The CLI/daemon protocol gains one reply variant, consistent with the existing `SetParamResult` / `DeleteParamResult` pattern in the same enum. * No semver-affecting changes to user-facing types. * No behavioral change for the happy path — a successful AddNode still returns `ControlRequestReply::NodeAdded { dataflow_id, node_id }` to the CLI exactly as before. * No new dependencies. Verification ============= cargo fmt --all -- --check cargo clippy --all --exclude dora-{node-api,operator-api,ros2-bridge}-python -- -D warnings cargo test -p dora-coordinator --lib add_node_reply (3/3 new tests pass) cargo test -p dora-coordinator -p dora-daemon -p dora-message --lib (103+/all pass) cargo check --examples Manual verification of the original #1682 repro recipe (`dora up` → `dora build` → `dora start --detach` → `dora node list` → `dora node add`) should now surface daemon errors as CLI-visible failures instead of timeouts + state corruption. Co-authored-by: imwyvern <imwyvern@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@imwyvern

…I (rescue of #1757) (#1873) Rescues @imwyvern's #1757 against current main, addressing both items in @phil-opp's 2026-04-28 review (changes-requested). The bug (#1682) ================ `binaries/coordinator/src/lib.rs:1558` (pre-fix) had: Ok(_) => { dataflow.node_to_daemon.insert(...); dataflow.descriptor.nodes.push(...); dataflow.nodes.insert(...); Ok(ControlRequestReply::NodeAdded { ... }) } The `Ok(_)` arm accepted ANY successful `send_and_receive` reply from the daemon as proof that AddNode applied. The daemon's actual handler returned `reply_tx.send(None)` — a null reply — and the coordinator would still happily commit state. When the daemon's TCP connection happened to be carrying a stale or out-of-order `SetParamResult` / other variant (which can happen under concurrent CLI requests against a flaky daemon), the coordinator committed AddNode state for a node the daemon never actually added. The user-visible symptom from #1682: $ dora node add <node> <times out> $ dora node list <node> shows as present in descriptor <subsequent commands corrupt or hang> The fix ======== Three coordinated changes: 1. `libraries/message/src/daemon_to_coordinator.rs`: new `DaemonCoordinatorReply::AddNodeResult(Result<(), String>)` variant so the daemon can identify the reply specifically. 2. `binaries/daemon/src/lib.rs`: the `DaemonCoordinatorEvent::AddNode` handler now sends `AddNodeResult(result.map_err(|e| format!("{e:?}")))` instead of the previous `None` placeholder. 3. `binaries/coordinator/src/lib.rs`: * `Ok(reply_raw) =>` arm now calls a new `ensure_add_node_applied` helper that pattern-matches the reply against `AddNodeResult`. * The validator returns `Err(eyre!(...))` on either an explicit daemon failure or a wrong-variant reply. * The validator's error is funneled into the existing `Err` branch of the `result` binding, which the coordinator's main loop sends back to the CLI as a `ControlRequestReply::Error`. **The error does NOT propagate via `?` past the dispatch arm** — addresses phil's concern that the original PR's `?` would tear down the coordinator's main loop on a recoverable per-request failure. Addressing @phil-opp's review ============================== Two items from the 2026-04-28 review: * "This brings the whole coordinator down on error, no? We should instead forward the result to the CLI and let it deal with the failure." — Fixed. The helper returns `Err`, the call site converts it into the `Err` arm of `result`, which becomes a `ControlRequestReply::Error` sent back to the CLI. The coordinator loop continues handling subsequent requests. * "These tests are not testing anything non-trivial. Ideally, there would be a test that fails before this PR and succeeds after this PR." — The three new tests now explicitly cover: - happy path (`AddNodeResult(Ok)` accepted) - daemon-rejection path (`AddNodeResult(Err)` rejected with named operation + node id) - **regression scenario for #1682**: a wrong-variant reply (`SetParamResult(Ok)`) is rejected with "unexpected daemon reply" instead of silently committing state. This is the specific failure mode that caused #1682's state corruption. A full end-to-end test against a mock daemon would require extracting the AddNode handler from the `start_inner` async loop — out of scope for the rescue. The helper-level tests cover the contract regression-style: if the validator is removed or weakened, these tests fail. What did NOT change ==================== * No new public API. The CLI/daemon protocol gains one reply variant, consistent with the existing `SetParamResult` / `DeleteParamResult` pattern in the same enum. * No semver-affecting changes to user-facing types. * No behavioral change for the happy path — a successful AddNode still returns `ControlRequestReply::NodeAdded { dataflow_id, node_id }` to the CLI exactly as before. * No new dependencies. Verification ============= cargo fmt --all -- --check cargo clippy --all --exclude dora-{node-api,operator-api,ros2-bridge}-python -- -D warnings cargo test -p dora-coordinator --lib add_node_reply (3/3 new tests pass) cargo test -p dora-coordinator -p dora-daemon -p dora-message --lib (103+/all pass) cargo check --examples Manual verification of the original #1682 repro recipe (`dora up` → `dora build` → `dora start --detach` → `dora node list` → `dora node add`) should now surface daemon errors as CLI-visible failures instead of timeouts + state corruption. Co-authored-by: imwyvern <imwyvern@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix(coordinator): validate add-node daemon reply

ccb55cf

phil-opp requested changes Apr 28, 2026

View reviewed changes

phil-opp added the waiting-for-author The pull request requires adjustments by the PR author. label Apr 28, 2026

heyong4725 mentioned this pull request May 19, 2026

fix(coordinator): validate AddNode daemon reply, forward errors to CLI (rescue of #1757) #1873

Merged

7 tasks

heyong4725 closed this May 19, 2026

heyong4725 mentioned this pull request May 19, 2026

coordinator: RemoveNode accepts any daemon reply as success (parallel to #1682) #1874

Closed

5 tasks

heyong4725 mentioned this pull request May 20, 2026

fix(coordinator): validate remove node daemon replies #1876

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(coordinator): validate add-node daemon reply#1757

fix(coordinator): validate add-node daemon reply#1757
imwyvern wants to merge 1 commit into
dora-rs:mainfrom
imwyvern:clawoss/fix/dora-node-add-reply

imwyvern commented Apr 28, 2026

Uh oh!

trunk-io Bot commented Apr 28, 2026

Uh oh!

phil-opp Apr 28, 2026

Uh oh!

phil-opp Apr 28, 2026

Uh oh!

heyong4725 commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

imwyvern commented Apr 28, 2026

Uh oh!

trunk-io Bot commented Apr 28, 2026

Uh oh!

phil-opp Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

phil-opp Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

heyong4725 commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants