Skip to content

compute-domain-daemon: skip SIGUSR1 when DNS mapping update only removes nodes#919

Open
AkshatDudeja77 wants to merge 2 commits into
kubernetes-sigs:mainfrom
AkshatDudeja77:fix-skip-sigusr1-on-node-removal
Open

compute-domain-daemon: skip SIGUSR1 when DNS mapping update only removes nodes#919
AkshatDudeja77 wants to merge 2 commits into
kubernetes-sigs:mainfrom
AkshatDudeja77:fix-skip-sigusr1-on-node-removal

Conversation

@AkshatDudeja77
Copy link
Copy Markdown

Fixes #913

The IMEXDaemonUpdateLoopWithDNSNames currently sends SIGUSR1 to the IMEX
daemon whenever the DNS/IP mapping changes.

However, when the update only removes nodes (and does not add new ones),
forcing the daemon to re-resolve and reconnect is unnecessary.

This change tracks the previous mapping size and skips sending SIGUSR1 when
the updated mapping strictly removes nodes.

Before:
SIGUSR1 triggered whenever the mapping changed.

After:
SIGUSR1 triggered only when nodes are added or mappings otherwise change,
but not when nodes are only removed.

This implements the TODO in the code suggesting that reconnects can be
skipped when the new IP set only removes addresses compared to the old set.

Signed-off-by: AkshatDudeja77 <akshat.dudeja77@gmail.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Mar 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@AkshatDudeja77 AkshatDudeja77 force-pushed the fix-skip-sigusr1-on-node-removal branch from a620c5d to f8ad4c0 Compare March 6, 2026 07:20
@jgehrcke
Copy link
Copy Markdown
Contributor

jgehrcke commented Mar 7, 2026

when the update only removes nodes (and does not add new ones)

That's a great goal. But how do we detect this situation with sets being our central data structures?

Write two sets down, on a piece of paper. This could look like this:

Set A:  item1, item2, item3
Set B:  item1, item2, item4

What do you need to do in order to find out if set A has an item removed compared to set B?

Do you need to look at more than just set size?

I genuinely wonder what your goals and incentives are. This is I think your sixth pull request of this style against this repository, and I believe we need to think about providing candid feedback soon :). These are wild times!

Copy link
Copy Markdown
Contributor

@jgehrcke jgehrcke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logically incorrect, as expressed in regular comment

@jgehrcke
Copy link
Copy Markdown
Contributor

jgehrcke commented Mar 7, 2026

For context, I still find it quite likely that this is AI-generated noise. Last time I looked into this, I noted:
#863 (comment)

@AkshatDudeja77
Copy link
Copy Markdown
Author

AkshatDudeja77 commented Mar 8, 2026

@jgehrcke thank you for your point about comparing sets rather than just sizes, I finally do understand what you're saying.

My intention with the current change was only to avoid sending SIGUSR1 when the update strictly removes nodes (i.e. when the mapping shrinks). In cases where
the size stays the same but the elements differ (like old set: A,B,C and new set: A,B,D), we would still send SIGUSR1 because a new peer appeared.

For transparency: I used AI to draft the initial idea here, but I’ve reviewed and adjusted the code and tests myself.

@AkshatDudeja77 AkshatDudeja77 force-pushed the fix-skip-sigusr1-on-node-removal branch from f8ad4c0 to a7ddddf Compare March 8, 2026 19:05
@AkshatDudeja77 AkshatDudeja77 requested a review from jgehrcke March 8, 2026 19:18
@rajatchopra
Copy link
Copy Markdown
Contributor

@AkshatDudeja77 Can you write some unit testcases around it

@AkshatDudeja77 AkshatDudeja77 force-pushed the fix-skip-sigusr1-on-node-removal branch from 13461a6 to e40c7ed Compare March 24, 2026 06:21
@AkshatDudeja77
Copy link
Copy Markdown
Author

@AkshatDudeja77 Can you write some unit testcases around it

@rajatchopra Refactored the SIGUSR1 decision into a helper and added unit tests covering no-change, pure-removal, additions, replacements, remove+add, and fresh-process cases.

This now explicitly checks for newly added peers rather than relying on size or subset assumptions.

Please let me know if there are additional edge cases you'd like covered

Comment thread internal/common/util.go Outdated
Comment thread cmd/compute-domain-daemon/main.go Outdated
Comment thread cmd/compute-domain-daemon/main.go Outdated
Comment thread cmd/compute-domain-daemon/main.go Outdated
Comment thread cmd/compute-domain-daemon/main.go Outdated
// - the mapping was updated,
// - the process is not fresh, and
// - at least one new IP (peer) was added.
func shouldSendSIGUSR1(oldIPs, newIPs IPSet, updated, fresh bool) bool {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need updated as an argument into this function? Is it directly implied by oldIPs and newIPs?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, updated is handled at the call site now, so i removed it from the helper and aligned the tests accordingly

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the docstring to reflect the current function behavior after removing updated.

Also fixed the IPSet usage at the call site by explicitly constructing the new IP set from the DNS mapping to ensure correct types

@AkshatDudeja77 AkshatDudeja77 requested a review from jgehrcke April 2, 2026 05:56
@k8s-triage-robot
Copy link
Copy Markdown

Unknown CLA label state. Rechecking for CLA labels.

Send feedback to sig-contributor-experience at kubernetes/community.

/check-cla
/easycla

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 2, 2026
Copy link
Copy Markdown
Contributor

@rajatchopra rajatchopra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@AkshatDudeja77 please squash the commits

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 14, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

LGTM label has been added.

DetailsGit tree hash: be76cc6eec49013d0f97eb3c7c4c8180775a0392

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AkshatDudeja77, rajatchopra
Once this PR has been reviewed and has the lgtm label, please assign dims for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: AkshatDudeja77 <akshat.dudeja77@gmail.com>
@AkshatDudeja77 AkshatDudeja77 force-pushed the fix-skip-sigusr1-on-node-removal branch from 418084a to 65cdb23 Compare May 15, 2026 06:19
@netlify
Copy link
Copy Markdown

netlify Bot commented May 15, 2026

Deploy Preview for dra-driver-nvidia-gpu failed. Why did it fail? →

Name Link
🔨 Latest commit 65cdb23
🔍 Latest deploy log https://app.netlify.com/projects/dra-driver-nvidia-gpu/deploys/6a06bb0f1e6b5200082d9e76

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 15, 2026
@AkshatDudeja77
Copy link
Copy Markdown
Author

@rajatchopra squashed commits as requested, thanks for the review

@AkshatDudeja77
Copy link
Copy Markdown
Author

@dims please take a look for approval when you get the chance

@dims
Copy link
Copy Markdown
Member

dims commented May 15, 2026

@jgehrcke can you please chime in?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avoid sending SIGUSR1 when ComputeDomain update only removes nodes

7 participants