Skip to content

fix(coordinator): validate remove node daemon replies#1876

Merged
heyong4725 merged 1 commit into
dora-rs:mainfrom
GHX5T-SOL:fix-1874-remove-node-reply
May 20, 2026
Merged

fix(coordinator): validate remove node daemon replies#1876
heyong4725 merged 1 commit into
dora-rs:mainfrom
GHX5T-SOL:fix-1874-remove-node-reply

Conversation

@GHX5T-SOL
Copy link
Copy Markdown
Contributor

@GHX5T-SOL GHX5T-SOL commented May 19, 2026

Closes #1874.

Summary

  • adds an explicit RemoveNodeResult daemon reply
  • returns RemoveNode success/failure from the daemon instead of None
  • validates the exact RemoveNodeResult(Ok(())) reply before the coordinator removes node state
  • adds regression tests for success, daemon rejection, and wrong reply variants

Validation

  • cargo fmt --all
  • cargo fmt --all -- --check -> passed after rebase
  • cargo check -p dora-daemon -p dora-coordinator -p dora-message -> passed after rebase
  • cargo test -p dora-coordinator remove_node_reply --lib -> 3 passed after rebase
  • cargo test -p dora-coordinator --lib -> 66 passed after rebase
  • cargo test -p dora-daemon --lib -> 32 passed after rebase
  • cargo clippy -p dora-daemon -p dora-message -- -D warnings -> passed after rebase
  • cargo clippy -p dora-coordinator -- -D warnings -A clippy::unnecessary_sort_by -> passed after rebase
  • git diff --check -> passed
  • git diff origin/main..HEAD --check -> passed after rebase
  • git diff origin/main..HEAD | gitleaks stdin --no-banner --redact --exit-code 1 -> no leaks found after rebase
  • gitleaks protect --redact --verbose -> no leaks found
  • gitleaks protect --staged --redact --verbose -> no leaks found before the original commit

Not run

  • End-to-end daemon/coordinator integration was not run locally.
  • Strict cargo clippy -p dora-daemon -p dora-coordinator -p dora-message -- -D warnings on local rustc 1.95 fails on an unrelated existing clippy::unnecessary_sort_by lint at binaries/coordinator/src/lib.rs:3644; I did not change that unrelated code.

@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io Bot commented May 19, 2026

Merging to main in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@GHX5T-SOL GHX5T-SOL force-pushed the fix-1874-remove-node-reply branch from eb74c95 to e300caa Compare May 19, 2026 23:24
@GHX5T-SOL
Copy link
Copy Markdown
Contributor Author

/trunk merge

@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io Bot commented May 19, 2026

An error occurred while submitting your PR to the queue: Only users that are a part of this repo's Trunk organization or have write permissions to the repo can submit a PR to the queue

@GHX5T-SOL
Copy link
Copy Markdown
Contributor Author

Looks like I do not have Trunk queue permission here. The visible CI checks are green, so I will leave queue submission to someone with repo access.

Copy link
Copy Markdown
Collaborator

@heyong4725 heyong4725 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Ghxst — this is a faithful mirror of the #1873 AddNode fix and the load-bearing pieces are all correct. Two small polish items before merge, then a rebase, then we're good.

What I verified

  • ✅ Helper ensure_remove_node_applied is structurally identical to ensure_add_node_applied (signature, error shape, wrong-variant arm).
  • ✅ Dispatch arm uses explicit Err(e) => Err(e) — no ? propagation past the arm, satisfying @phil-opp's review concern from #1757.
  • ✅ Coordinator state cleanup is gated behind the validator's Ok(()) arm, not before it.
  • ✅ Daemon-side closure pattern (|| { ... })() actually improves on my #1873 AddNode handler — it correctly surfaces "dataflow not found" and stop_single_node errors that the pre-fix code was silently swallowing. Nice touch.
  • ✅ Three regression tests pass locally (cargo test -p dora-coordinator --lib remove_node_reply).
  • ✅ TODO marker I left at the RemoveNode call site in #1873 is correctly removed.

Two small polish items requested

1. Add a doc comment to ensure_remove_node_applied

ensure_add_node_applied (just above it) carries a 7-line contract doc explaining why ? is forbidden at the call site and pointing at the originating issues. ensure_remove_node_applied has zero doc. A reviewer hitting this for the first time has no signal that the explicit Err(e) => Err(e) pattern is load-bearing. Suggested wording (parallel to the AddNode helper):

/// Validate that the daemon's reply to `DaemonCoordinatorEvent::RemoveNode`
/// is a successful `RemoveNodeResult`. Returns `Err` for both an explicit
/// daemon failure and an unexpected reply variant; callers should forward
/// the error to the CLI as a `ControlRequestReply::Error` and NOT use `?`
/// to bubble out of the coordinator's main loop. Parallel to
/// `ensure_add_node_applied` (#1873). Closes #1874.
fn ensure_remove_node_applied(
    reply_raw: &[u8],
    node_id: &dora_core::config::NodeId,
) -> eyre::Result<()> { ... }

2. Extend the wrong-variant test to also probe AddNodeResult(Ok)

remove_node_reply_rejects_wrong_reply_variant currently only tests SetParamResult(Ok) as the wrong variant. That catches the generic "any other reply" case, but doesn't pin the specific cross-contamination scenario this PR exists to prevent: an AddNodeResult(Ok) reply for a different dataflow event slipping through RemoveNode validation. Suggested addition (one extra assert! block at the bottom of the existing test):

// The literal cross-contamination scenario: an AddNodeResult reply
// stranded on the daemon connection from a concurrent request must
// not be accepted by RemoveNode's validator.
let reply = serde_json::to_vec(&DaemonCoordinatorReply::AddNodeResult(Ok(()))).unwrap();
let err = ensure_remove_node_applied(&reply, &node_id)
    .expect_err("AddNodeResult reply must not be accepted by RemoveNode validator");
assert!(
    err.to_string().contains("unexpected daemon reply"),
    "error must call out the wrong-reply-type failure mode: {err}"
);

(Optional, but I'd appreciate it: same probe added symmetrically to add_node_reply_rejects_wrong_reply_variant testing RemoveNodeResult(Ok). Two-line change.)

Heads up: rebase needed

Your branch is 2 commits behind main (#1875 find_package, #1877 node.timestamp). Both touch entirely unrelated files, so the rebase will be conflict-free. Just git fetch && git rebase origin/main && git push --force-with-lease.

What I did NOT flag

  • Cross-version compat (opaque deserialize error if an old daemon sends None): this is parity with ensure_add_node_applied. Same trade-off was accepted in #1873. If we want a version-mismatch hint, that's a follow-up PR touching both helpers, not a blocker here.
  • Missing doc on the new RemoveNodeResult enum variant: my AddNodeResult has one, yours doesn't. Nice to have but not blocking.

Path to merge

  1. Apply the two polish items above (~15 lines total).
  2. Rebase on main.
  3. Push — I'll re-review and merge.

Excellent first contribution. The closure-based daemon handler is the kind of thing I would NOT have noticed if you'd just blindly copied my AddNode shape; the surfaced "dataflow not found" error path is a strict improvement.

🤖 Review generated with Claude Code

Signed-off-by: GHX5T-SOL <200635707+GHX5T-SOL@users.noreply.github.com>
@GHX5T-SOL GHX5T-SOL force-pushed the fix-1874-remove-node-reply branch from e300caa to 995620e Compare May 20, 2026 03:03
@GHX5T-SOL
Copy link
Copy Markdown
Contributor Author

Updated in 995620e65a29157d8eaa5aed49c463a61b787c10:

  • Added the ensure_remove_node_applied contract doc comment.
  • Added the requested AddNodeResult(Ok) wrong-variant probe for RemoveNode.
  • Added the symmetric RemoveNodeResult(Ok) wrong-variant probe for AddNode.
  • Rebased onto current origin/main (d96a338f).

Validation:

  • cargo test -p dora-coordinator --lib node_reply -> 6 passed, 0 failed, 60 filtered out.
  • cargo fmt --all -- --check -> passed.
  • git diff origin/main..HEAD --check -> passed.
  • git show --format= --patch HEAD | gitleaks stdin --no-banner --redact --timeout 30 -> no leaks found.
  • cargo clippy -p dora-coordinator --lib --tests -- -A clippy::unnecessary-sort-by -D warnings -> passed.

Limitation: strict local cargo clippy -p dora-coordinator --lib --tests -- -D warnings on Rust 1.95 still trips the existing unnecessary_sort_by lint at binaries/coordinator/src/lib.rs:3698, outside this patch.

Copy link
Copy Markdown
Collaborator

@heyong4725 heyong4725 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All three polish items addressed verbatim, plus the optional symmetric probe I asked for. Re-reviewed against the new commit 995620e6. Approving.

What you addressed

  1. Doc comment on ensure_remove_node_applied — exact suggested wording, cites #1873 + #1874.
  2. AddNodeResult(Ok) cross-contamination probe in remove_node_reply_rejects_wrong_reply_variant — the load-bearing test that pins the specific bug class this PR exists to prevent.
  3. Symmetric RemoveNodeResult(Ok) probe in add_node_reply_rejects_wrong_reply_variant — the optional one. Both validators now have explicit cross-contamination coverage in both directions. Nice.
  4. Rebased on main — picked up #1873, #1875, #1877 cleanly.

Verification (re-run locally)

cargo fmt --all -- --check                                                          ✓
PYO3_NO_PYTHON=1 cargo test -p dora-coordinator --lib node_reply                    ✓ (6/6 pass)
PYO3_NO_PYTHON=1 cargo clippy -p dora-daemon -p dora-coordinator -p dora-message    ✓

GitHub: mergeStateStatus: CLEAN, mergeable: MERGEABLE. No conflicts against current main.

Heads up

main advanced one more commit since your rebase (#1881, CI-only fix in Cargo.toml + .circleci/config.yml). No conflict with your diff, GitHub will fast-forward or auto-merge. No action needed from you.

Excellent first contribution. The daemon-side closure pattern that properly surfaces the "dataflow not found" and stop_single_node errors (which the pre-fix code was silently swallowing) is a strict improvement over the AddNode equivalent I merged in #1873 — that's the kind of thing that makes a maintainer want to keep working with you.

Merging momentarily.

🤖 Review generated with Claude Code

@heyong4725 heyong4725 merged commit 679e55f into dora-rs:main May 20, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

coordinator: RemoveNode accepts any daemon reply as success (parallel to #1682)

2 participants