fix(coordinator): validate remove node daemon replies#1876
Conversation
|
Merging to
After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here |
eb74c95 to
e300caa
Compare
|
/trunk merge |
|
An error occurred while submitting your PR to the queue: |
|
Looks like I do not have Trunk queue permission here. The visible CI checks are green, so I will leave queue submission to someone with repo access. |
heyong4725
left a comment
There was a problem hiding this comment.
Thanks @Ghxst — this is a faithful mirror of the #1873 AddNode fix and the load-bearing pieces are all correct. Two small polish items before merge, then a rebase, then we're good.
What I verified
- ✅ Helper
ensure_remove_node_appliedis structurally identical toensure_add_node_applied(signature, error shape, wrong-variant arm). - ✅ Dispatch arm uses explicit
Err(e) => Err(e)— no?propagation past the arm, satisfying @phil-opp's review concern from #1757. - ✅ Coordinator state cleanup is gated behind the validator's
Ok(())arm, not before it. - ✅ Daemon-side closure pattern
(|| { ... })()actually improves on my #1873 AddNode handler — it correctly surfaces "dataflow not found" andstop_single_nodeerrors that the pre-fix code was silently swallowing. Nice touch. - ✅ Three regression tests pass locally (
cargo test -p dora-coordinator --lib remove_node_reply). - ✅ TODO marker I left at the RemoveNode call site in #1873 is correctly removed.
Two small polish items requested
1. Add a doc comment to ensure_remove_node_applied
ensure_add_node_applied (just above it) carries a 7-line contract doc explaining why ? is forbidden at the call site and pointing at the originating issues. ensure_remove_node_applied has zero doc. A reviewer hitting this for the first time has no signal that the explicit Err(e) => Err(e) pattern is load-bearing. Suggested wording (parallel to the AddNode helper):
/// Validate that the daemon's reply to `DaemonCoordinatorEvent::RemoveNode`
/// is a successful `RemoveNodeResult`. Returns `Err` for both an explicit
/// daemon failure and an unexpected reply variant; callers should forward
/// the error to the CLI as a `ControlRequestReply::Error` and NOT use `?`
/// to bubble out of the coordinator's main loop. Parallel to
/// `ensure_add_node_applied` (#1873). Closes #1874.
fn ensure_remove_node_applied(
reply_raw: &[u8],
node_id: &dora_core::config::NodeId,
) -> eyre::Result<()> { ... }2. Extend the wrong-variant test to also probe AddNodeResult(Ok)
remove_node_reply_rejects_wrong_reply_variant currently only tests SetParamResult(Ok) as the wrong variant. That catches the generic "any other reply" case, but doesn't pin the specific cross-contamination scenario this PR exists to prevent: an AddNodeResult(Ok) reply for a different dataflow event slipping through RemoveNode validation. Suggested addition (one extra assert! block at the bottom of the existing test):
// The literal cross-contamination scenario: an AddNodeResult reply
// stranded on the daemon connection from a concurrent request must
// not be accepted by RemoveNode's validator.
let reply = serde_json::to_vec(&DaemonCoordinatorReply::AddNodeResult(Ok(()))).unwrap();
let err = ensure_remove_node_applied(&reply, &node_id)
.expect_err("AddNodeResult reply must not be accepted by RemoveNode validator");
assert!(
err.to_string().contains("unexpected daemon reply"),
"error must call out the wrong-reply-type failure mode: {err}"
);(Optional, but I'd appreciate it: same probe added symmetrically to add_node_reply_rejects_wrong_reply_variant testing RemoveNodeResult(Ok). Two-line change.)
Heads up: rebase needed
Your branch is 2 commits behind main (#1875 find_package, #1877 node.timestamp). Both touch entirely unrelated files, so the rebase will be conflict-free. Just git fetch && git rebase origin/main && git push --force-with-lease.
What I did NOT flag
- Cross-version compat (opaque deserialize error if an old daemon sends
None): this is parity withensure_add_node_applied. Same trade-off was accepted in #1873. If we want a version-mismatch hint, that's a follow-up PR touching both helpers, not a blocker here. - Missing doc on the new
RemoveNodeResultenum variant: myAddNodeResulthas one, yours doesn't. Nice to have but not blocking.
Path to merge
- Apply the two polish items above (~15 lines total).
- Rebase on main.
- Push — I'll re-review and merge.
Excellent first contribution. The closure-based daemon handler is the kind of thing I would NOT have noticed if you'd just blindly copied my AddNode shape; the surfaced "dataflow not found" error path is a strict improvement.
🤖 Review generated with Claude Code
Signed-off-by: GHX5T-SOL <200635707+GHX5T-SOL@users.noreply.github.com>
e300caa to
995620e
Compare
|
Updated in
Validation:
Limitation: strict local |
heyong4725
left a comment
There was a problem hiding this comment.
All three polish items addressed verbatim, plus the optional symmetric probe I asked for. Re-reviewed against the new commit 995620e6. Approving.
What you addressed
- ✅ Doc comment on
ensure_remove_node_applied— exact suggested wording, cites #1873 + #1874. - ✅
AddNodeResult(Ok)cross-contamination probe inremove_node_reply_rejects_wrong_reply_variant— the load-bearing test that pins the specific bug class this PR exists to prevent. - ✅ Symmetric
RemoveNodeResult(Ok)probe inadd_node_reply_rejects_wrong_reply_variant— the optional one. Both validators now have explicit cross-contamination coverage in both directions. Nice. - ✅ Rebased on main — picked up #1873, #1875, #1877 cleanly.
Verification (re-run locally)
cargo fmt --all -- --check ✓
PYO3_NO_PYTHON=1 cargo test -p dora-coordinator --lib node_reply ✓ (6/6 pass)
PYO3_NO_PYTHON=1 cargo clippy -p dora-daemon -p dora-coordinator -p dora-message ✓
GitHub: mergeStateStatus: CLEAN, mergeable: MERGEABLE. No conflicts against current main.
Heads up
main advanced one more commit since your rebase (#1881, CI-only fix in Cargo.toml + .circleci/config.yml). No conflict with your diff, GitHub will fast-forward or auto-merge. No action needed from you.
Excellent first contribution. The daemon-side closure pattern that properly surfaces the "dataflow not found" and stop_single_node errors (which the pre-fix code was silently swallowing) is a strict improvement over the AddNode equivalent I merged in #1873 — that's the kind of thing that makes a maintainer want to keep working with you.
Merging momentarily.
🤖 Review generated with Claude Code
Closes #1874.
Summary
RemoveNodeResultdaemon replyNoneRemoveNodeResult(Ok(()))reply before the coordinator removes node stateValidation
cargo fmt --allcargo fmt --all -- --check-> passed after rebasecargo check -p dora-daemon -p dora-coordinator -p dora-message-> passed after rebasecargo test -p dora-coordinator remove_node_reply --lib-> 3 passed after rebasecargo test -p dora-coordinator --lib-> 66 passed after rebasecargo test -p dora-daemon --lib-> 32 passed after rebasecargo clippy -p dora-daemon -p dora-message -- -D warnings-> passed after rebasecargo clippy -p dora-coordinator -- -D warnings -A clippy::unnecessary_sort_by-> passed after rebasegit diff --check-> passedgit diff origin/main..HEAD --check-> passed after rebasegit diff origin/main..HEAD | gitleaks stdin --no-banner --redact --exit-code 1-> no leaks found after rebasegitleaks protect --redact --verbose-> no leaks foundgitleaks protect --staged --redact --verbose-> no leaks found before the original commitNot run
cargo clippy -p dora-daemon -p dora-coordinator -p dora-message -- -D warningson local rustc 1.95 fails on an unrelated existingclippy::unnecessary_sort_bylint atbinaries/coordinator/src/lib.rs:3644; I did not change that unrelated code.