Fix node drain deadlock when device plugin holds kernel module #1273
TomerNewman wants to merge 1 commit into
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: TomerNewman
✅ Deploy Preview for kubernetes-sigs-kmm ready!
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #1273 +/- ##
==========================================
- Coverage 79.09% 73.28% -5.82%
==========================================
Files 51 66 +15
Lines 5109 4656 -453
==========================================
- Hits 4041 3412 -629
- Misses 882 1083 +201
+ Partials 186 161 -25
☔ View full report in Codecov by Sentry.
When draining a node running a KMM kernel module with an associated device plugin, a race condition between the Module reconciler and NMC reconciler can cause a deadlock:

1. Module reconciler removes the module from NMC spec (node unschedulable)
2. NMC reconciler creates an unloader pod (orphan status path)
3. The kmm-ready label stays on the node (removeOrphanedLabels requires both spec AND status to be absent, but status persists until unload succeeds)
4. Device plugin DaemonSet pod stays running (nodeSelector still matches and DaemonSet pods tolerate the unschedulable taint)
5. Unloader's modprobe -r fails indefinitely (device file held open)

Fix: after processing orphan statuses, check if the node is unschedulable for each orphan module. If so, add the kmm-ready labels to the removal set. This ensures the device plugin pod is evicted, releasing the device file so the unloader can succeed.

This fix was written by bug buddy - ai workflow
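The fix described in this commit message can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual KMM code: the types `node` and `orphanStatus`, the helper `readyLabel`, the label key format, and the boolean `ToleratesUnschedulable` (a stand-in for the real toleration check) are all hypothetical names chosen for the example.

```go
package main

import "fmt"

// orphanStatus is a hypothetical stand-in for a module that still appears
// in the NMC status but no longer has a matching spec entry.
type orphanStatus struct {
	Module string
	// ToleratesUnschedulable is an assumed simplification of the
	// per-module toleration check mentioned in the PR description.
	ToleratesUnschedulable bool
}

// node is a hypothetical, trimmed-down view of a Kubernetes node.
type node struct {
	Unschedulable bool
	Labels        map[string]string
}

// readyLabel builds an illustrative label key; the real key format
// used by KMM is not shown in this PR and is assumed here.
func readyLabel(module string) string {
	return "kmm-ready/" + module
}

// labelsToRemove collects the ready labels of orphan modules for which
// the node is unschedulable, so the device plugin pod stops matching
// its nodeSelector and releases the device file.
func labelsToRemove(n node, orphans []orphanStatus) []string {
	var out []string
	for _, o := range orphans {
		if !n.Unschedulable || o.ToleratesUnschedulable {
			continue // node is still schedulable for this module
		}
		key := readyLabel(o.Module)
		if _, present := n.Labels[key]; present {
			out = append(out, key)
		}
	}
	return out
}

func main() {
	n := node{
		Unschedulable: true,
		Labels:        map[string]string{readyLabel("kmod-a"): ""},
	}
	fmt.Println(labelsToRemove(n, []orphanStatus{{Module: "kmod-a"}}))
}
```

In this sketch a schedulable node, or a module that tolerates the unschedulable state, yields an empty removal set, which matches the claim that normal code paths are unaffected.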
0ec299e to 4f38ea8
Problem
When draining a node running a KMM kernel module with an associated device plugin, a race condition between the Module reconciler and NMC reconciler can cause a permanent deadlock where the node drain hangs indefinitely.
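The deadlock cycle:
device plugin pod evicted → requires kmm-ready label removed from node (removeOrphanedLabels) → requires NMC status absent → requires unloader's modprobe -r to succeed → requires device file released → requires device plugin pod evicted
Root Cause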
When the Module reconciler processes the drain first and removes the module from NMC.spec, the NMC reconciler enters the "orphan status" path, creating an unloader pod. However, removeOrphanedLabels only removes the kmm-ready label when the module is absent from both spec AND status. Since the status persists until the unloader succeeds, the label stays, keeping the device plugin DaemonSet pod running and holding the device file open.

There is a "lucky" path where the NMC reconciler processes the drain first (while spec is still present), hitting the existing "unschedulable fast path" that removes labels immediately. But this depends on controller scheduling order, which is what makes it a race condition.
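To make the root cause concrete, here is a small sketch of the "absent from both spec AND status" condition. It is an assumption-laden model rather than the real removeOrphanedLabels: the `nmc` type and the `labelIsOrphaned` function are hypothetical and reduce the NMC to two sets of module names.

```go
package main

import "fmt"

// nmc is a hypothetical, simplified NodeModulesConfig that only records
// which module names are present in spec and in status.
type nmc struct {
	spec   map[string]bool
	status map[string]bool
}

// labelIsOrphaned models the condition described above: a module's
// kmm-ready label is treated as orphaned only when the module is
// absent from BOTH spec and status.
func labelIsOrphaned(n nmc, module string) bool {
	return !n.spec[module] && !n.status[module]
}

func main() {
	// Drain in progress: spec entry already removed by the Module
	// reconciler, status entry still present until unload succeeds.
	draining := nmc{
		spec:   map[string]bool{},
		status: map[string]bool{"my-kmod": true},
	}
	fmt.Println(labelIsOrphaned(draining, "my-kmod")) // false: label stays

	// Only after a successful unload does the status entry disappear.
	unloaded := nmc{spec: map[string]bool{}, status: map[string]bool{}}
	fmt.Println(labelIsOrphaned(unloaded, "my-kmod")) // true: label removed
}
```

The first case is the stuck state: the label cannot be removed until the status clears, and the status cannot clear while the labeled node keeps the device plugin pod running.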
Fix
After processing orphan statuses, check if the node is unschedulable for each orphan module (considering its tolerations). If so, add the module's kmm-ready labels to the removal set. This triggers the existing early-return label-removal path, ensuring the device plugin is evicted regardless of which controller processes the drain first.
Testing
"should remove kmod labels for orphan statuses on unschedulable node to prevent drain deadlock"
Confidence
High (95%) — code paths traced end-to-end, race condition fully characterized, fix is minimal and uses existing mechanisms.
Rollback
Revert this commit. The previous behavior (non-deterministic, depending on controller ordering) will return. The manual workaround is to remove the kmm-ready label from the node.
Risk Assessment
Low. The fix only adds label removal for orphan statuses on unschedulable nodes. Normal (schedulable) code paths are unaffected. The change reuses the same readyLabelsToRemove + early-return mechanism already used by the existing "unschedulable fast path" for spec modules.
This fix was written by bug buddy - ai workflow
Made with Cursor