Skip to content

Fix node drain deadlock when device plugin holds kernel module#1273

Closed
TomerNewman wants to merge 1 commit into
kubernetes-sigs:mainfrom
TomerNewman:bugfix/fix-drain-deadlock-with-device-plugin
Closed

Fix node drain deadlock when device plugin holds kernel module#1273
TomerNewman wants to merge 1 commit into
kubernetes-sigs:mainfrom
TomerNewman:bugfix/fix-drain-deadlock-with-device-plugin

Conversation

@TomerNewman

Copy link
Copy Markdown
Collaborator

Problem

When draining a node running a KMM kernel module with an associated device plugin, a race condition between the Module reconciler and NMC reconciler can cause a permanent deadlock where the node drain hangs indefinitely.

The deadlock cycle:

  1. Unloader pod needs device file released → requires device plugin terminated
  2. Device plugin termination → requires kmm-ready label removed from node
  3. Label removal (removeOrphanedLabels) → requires NMC status absent
  4. Status removal → requires unloader to succeed (back to step 1)

Root Cause

When the Module reconciler processes the drain first and removes the module from NMC.spec, the NMC reconciler enters the "orphan status" path, creating an unloader pod. However, removeOrphanedLabels only removes the kmm-ready label when the module is absent from both spec AND status. Since the status persists until the unloader succeeds, the label stays, keeping the device plugin DaemonSet pod running and holding the device file open.

There is a "lucky" path where the NMC reconciler processes the drain first (while spec is still present), hitting the existing "unschedulable fast path" that removes labels immediately. But this depends on controller scheduling order — a race condition.

Fix

After processing orphan statuses, check if the node is unschedulable for each orphan module (considering its tolerations). If so, add the module's kmm-ready labels to the removal set. This triggers the existing early-return label-removal path, ensuring the device plugin is evicted regardless of which controller processes the drain first.

Testing

  • Added regression test: "should remove kmod labels for orphan statuses on unschedulable node to prevent drain deadlock"
  • Verified the test fails without the fix (confirms it catches the bug)
  • Verified the test passes with the fix
  • Full test suite: 296/296 controller specs pass, all project tests green

Confidence

High (95%) — code paths traced end-to-end, race condition fully characterized, fix is minimal and uses existing mechanisms.

Rollback

Revert this commit. The previous behavior (non-deterministic depending on controller ordering) will return — manual workaround is to remove the kmm-ready label from the node.

Risk Assessment

Low — the fix only adds label removal for orphan statuses on unschedulable nodes. Normal (schedulable) code paths are unaffected. The change reuses the same readyLabelsToRemove + early-return mechanism already used by the existing "unschedulable fast path" for spec modules.


This fix was written by bug buddy - ai workflow

Made with Cursor

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 25, 2026
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: TomerNewman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 25, 2026
@linux-foundation-easycla

linux-foundation-easycla Bot commented May 25, 2026

Copy link
Copy Markdown

CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ login: TomerNewman / name: TomerNewman (4f38ea8)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label May 25, 2026
@netlify

netlify Bot commented May 25, 2026

Copy link
Copy Markdown

Deploy Preview for kubernetes-sigs-kmm ready!

Name Link
🔨 Latest commit 4f38ea8
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-kmm/deploys/6a13fae4a3622e00081d917b
😎 Deploy Preview https://deploy-preview-1273--kubernetes-sigs-kmm.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 25, 2026
@codecov-commenter

codecov-commenter commented May 25, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.28%. Comparing base (fa23a9b) to head (4f38ea8).
⚠️ Report is 372 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1273      +/-   ##
==========================================
- Coverage   79.09%   73.28%   -5.82%     
==========================================
  Files          51       66      +15     
  Lines        5109     4656     -453     
==========================================
- Hits         4041     3412     -629     
- Misses        882     1083     +201     
+ Partials      186      161      -25     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

When draining a node running a KMM kernel module with an associated device
plugin, a race condition between the Module reconciler and NMC reconciler
can cause a deadlock:

1. Module reconciler removes the module from NMC spec (node unschedulable)
2. NMC reconciler creates an unloader pod (orphan status path)
3. The kmm-ready label stays on the node (removeOrphanedLabels requires
   both spec AND status to be absent, but status persists until unload
   succeeds)
4. Device plugin DaemonSet pod stays running (nodeSelector still matches
   and DaemonSet pods tolerate the unschedulable taint)
5. Unloader's modprobe -r fails indefinitely (device file held open)

Fix: after processing orphan statuses, check if the node is unschedulable
for each orphan module. If so, add the kmm-ready labels to the removal set.
This ensures the device plugin pod is evicted, releasing the device file so
the unloader can succeed.

This fix was written by bug buddy - ai workflow
@TomerNewman TomerNewman force-pushed the bugfix/fix-drain-deadlock-with-device-plugin branch from 0ec299e to 4f38ea8 Compare May 25, 2026 07:31
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 25, 2026
@TomerNewman TomerNewman marked this pull request as ready for review May 25, 2026 07:32
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 25, 2026
@TomerNewman TomerNewman deleted the bugfix/fix-drain-deadlock-with-device-plugin branch May 25, 2026 08:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants