fix(coordinator): validate add-node daemon reply#1757
Conversation
|
Merging to
After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here |
| Ok(reply_raw) => { | ||
| ensure_add_node_applied( | ||
| &reply_raw, &node_id, | ||
| )?; |
There was a problem hiding this comment.
This brings the whole coordinator down on error, no? We should instead forward the result to the CLI and let it deal with the failure.
| .expect_err("unexpected reply variant should fail AddNode forwarding"); | ||
| assert!(err.to_string().contains("unexpected daemon reply")); | ||
| } | ||
|
|
There was a problem hiding this comment.
These tests are not testing anything non-trivial. Ideally, there would be a test that fails before this PR and succeeds after this PR.
|
Closing in favor of #1873 — rescue of your work against current main, addressing both items in @phil-opp's review. The new PR keeps your protocol additions and daemon-side change verbatim (
You're credited as cc @phil-opp |
…I (rescue of #1757) Rescues @imwyvern's #1757 against current main, addressing both items in @phil-opp's 2026-04-28 review (changes-requested). The bug (#1682) ================ `binaries/coordinator/src/lib.rs:1558` (pre-fix) had: Ok(_) => { dataflow.node_to_daemon.insert(...); dataflow.descriptor.nodes.push(...); dataflow.nodes.insert(...); Ok(ControlRequestReply::NodeAdded { ... }) } The `Ok(_)` arm accepted ANY successful `send_and_receive` reply from the daemon as proof that AddNode applied. The daemon's actual handler returned `reply_tx.send(None)` — a null reply — and the coordinator would still happily commit state. When the daemon's TCP connection happened to be carrying a stale or out-of-order `SetParamResult` / other variant (which can happen under concurrent CLI requests against a flaky daemon), the coordinator committed AddNode state for a node the daemon never actually added. The user-visible symptom from #1682: $ dora node add <node> <times out> $ dora node list <node> shows as present in descriptor <subsequent commands corrupt or hang> The fix ======== Three coordinated changes: 1. `libraries/message/src/daemon_to_coordinator.rs`: new `DaemonCoordinatorReply::AddNodeResult(Result<(), String>)` variant so the daemon can identify the reply specifically. 2. `binaries/daemon/src/lib.rs`: the `DaemonCoordinatorEvent::AddNode` handler now sends `AddNodeResult(result.map_err(|e| format!("{e:?}")))` instead of the previous `None` placeholder. 3. `binaries/coordinator/src/lib.rs`: * `Ok(reply_raw) =>` arm now calls a new `ensure_add_node_applied` helper that pattern-matches the reply against `AddNodeResult`. * The validator returns `Err(eyre!(...))` on either an explicit daemon failure or a wrong-variant reply. * The validator's error is funneled into the existing `Err` branch of the `result` binding, which the coordinator's main loop sends back to the CLI as a `ControlRequestReply::Error`. **The error does NOT propagate via `?` past the dispatch arm** — addresses phil's concern that the original PR's `?` would tear down the coordinator's main loop on a recoverable per-request failure. Addressing @phil-opp's review ============================== Two items from the 2026-04-28 review: * "This brings the whole coordinator down on error, no? We should instead forward the result to the CLI and let it deal with the failure." — Fixed. The helper returns `Err`, the call site converts it into the `Err` arm of `result`, which becomes a `ControlRequestReply::Error` sent back to the CLI. The coordinator loop continues handling subsequent requests. * "These tests are not testing anything non-trivial. Ideally, there would be a test that fails before this PR and succeeds after this PR." — The three new tests now explicitly cover: - happy path (`AddNodeResult(Ok)` accepted) - daemon-rejection path (`AddNodeResult(Err)` rejected with named operation + node id) - **regression scenario for #1682**: a wrong-variant reply (`SetParamResult(Ok)`) is rejected with "unexpected daemon reply" instead of silently committing state. This is the specific failure mode that caused #1682's state corruption. A full end-to-end test against a mock daemon would require extracting the AddNode handler from the `start_inner` async loop — out of scope for the rescue. The helper-level tests cover the contract regression-style: if the validator is removed or weakened, these tests fail. What did NOT change ==================== * No new public API. The CLI/daemon protocol gains one reply variant, consistent with the existing `SetParamResult` / `DeleteParamResult` pattern in the same enum. * No semver-affecting changes to user-facing types. * No behavioral change for the happy path — a successful AddNode still returns `ControlRequestReply::NodeAdded { dataflow_id, node_id }` to the CLI exactly as before. * No new dependencies. Verification ============= cargo fmt --all -- --check cargo clippy --all --exclude dora-{node-api,operator-api,ros2-bridge}-python -- -D warnings cargo test -p dora-coordinator --lib add_node_reply (3/3 new tests pass) cargo test -p dora-coordinator -p dora-daemon -p dora-message --lib (103+/all pass) cargo check --examples Manual verification of the original #1682 repro recipe (`dora up` → `dora build` → `dora start --detach` → `dora node list` → `dora node add`) should now surface daemon errors as CLI-visible failures instead of timeouts + state corruption. Co-authored-by: imwyvern <imwyvern@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…I (rescue of #1757) (#1873) Rescues @imwyvern's #1757 against current main, addressing both items in @phil-opp's 2026-04-28 review (changes-requested). The bug (#1682) ================ `binaries/coordinator/src/lib.rs:1558` (pre-fix) had: Ok(_) => { dataflow.node_to_daemon.insert(...); dataflow.descriptor.nodes.push(...); dataflow.nodes.insert(...); Ok(ControlRequestReply::NodeAdded { ... }) } The `Ok(_)` arm accepted ANY successful `send_and_receive` reply from the daemon as proof that AddNode applied. The daemon's actual handler returned `reply_tx.send(None)` — a null reply — and the coordinator would still happily commit state. When the daemon's TCP connection happened to be carrying a stale or out-of-order `SetParamResult` / other variant (which can happen under concurrent CLI requests against a flaky daemon), the coordinator committed AddNode state for a node the daemon never actually added. The user-visible symptom from #1682: $ dora node add <node> <times out> $ dora node list <node> shows as present in descriptor <subsequent commands corrupt or hang> The fix ======== Three coordinated changes: 1. `libraries/message/src/daemon_to_coordinator.rs`: new `DaemonCoordinatorReply::AddNodeResult(Result<(), String>)` variant so the daemon can identify the reply specifically. 2. `binaries/daemon/src/lib.rs`: the `DaemonCoordinatorEvent::AddNode` handler now sends `AddNodeResult(result.map_err(|e| format!("{e:?}")))` instead of the previous `None` placeholder. 3. `binaries/coordinator/src/lib.rs`: * `Ok(reply_raw) =>` arm now calls a new `ensure_add_node_applied` helper that pattern-matches the reply against `AddNodeResult`. * The validator returns `Err(eyre!(...))` on either an explicit daemon failure or a wrong-variant reply. * The validator's error is funneled into the existing `Err` branch of the `result` binding, which the coordinator's main loop sends back to the CLI as a `ControlRequestReply::Error`. **The error does NOT propagate via `?` past the dispatch arm** — addresses phil's concern that the original PR's `?` would tear down the coordinator's main loop on a recoverable per-request failure. Addressing @phil-opp's review ============================== Two items from the 2026-04-28 review: * "This brings the whole coordinator down on error, no? We should instead forward the result to the CLI and let it deal with the failure." — Fixed. The helper returns `Err`, the call site converts it into the `Err` arm of `result`, which becomes a `ControlRequestReply::Error` sent back to the CLI. The coordinator loop continues handling subsequent requests. * "These tests are not testing anything non-trivial. Ideally, there would be a test that fails before this PR and succeeds after this PR." — The three new tests now explicitly cover: - happy path (`AddNodeResult(Ok)` accepted) - daemon-rejection path (`AddNodeResult(Err)` rejected with named operation + node id) - **regression scenario for #1682**: a wrong-variant reply (`SetParamResult(Ok)`) is rejected with "unexpected daemon reply" instead of silently committing state. This is the specific failure mode that caused #1682's state corruption. A full end-to-end test against a mock daemon would require extracting the AddNode handler from the `start_inner` async loop — out of scope for the rescue. The helper-level tests cover the contract regression-style: if the validator is removed or weakened, these tests fail. What did NOT change ==================== * No new public API. The CLI/daemon protocol gains one reply variant, consistent with the existing `SetParamResult` / `DeleteParamResult` pattern in the same enum. * No semver-affecting changes to user-facing types. * No behavioral change for the happy path — a successful AddNode still returns `ControlRequestReply::NodeAdded { dataflow_id, node_id }` to the CLI exactly as before. * No new dependencies. Verification ============= cargo fmt --all -- --check cargo clippy --all --exclude dora-{node-api,operator-api,ros2-bridge}-python -- -D warnings cargo test -p dora-coordinator --lib add_node_reply (3/3 new tests pass) cargo test -p dora-coordinator -p dora-daemon -p dora-message --lib (103+/all pass) cargo check --examples Manual verification of the original #1682 repro recipe (`dora up` → `dora build` → `dora start --detach` → `dora node list` → `dora node add`) should now surface daemon errors as CLI-visible failures instead of timeouts + state corruption. Co-authored-by: imwyvern <imwyvern@users.noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
add-nodetreats any daemon response as success, so aSetParamResultor an explicit daemon failure can still be reported as applied by the coordinator.Changed the coordinator to validate that the daemon replied with a successful
DaemonReply::AddNodeResultbefore committing the add-node state update. The daemon now returns that reply variant when handlingAddNode.Verified with:
cargo fmt --checkcargo test -p dora-coordinator add_node_forward_reply --libFixes #1682