[karpenter]: rolling restart consolidation for PDB-blocked nodes#2
Open
jeremywgleeson wants to merge 12 commits into
Open
[karpenter]: rolling restart consolidation for PDB-blocked nodes#2jeremywgleeson wants to merge 12 commits into
jeremywgleeson wants to merge 12 commits into
Conversation
b66d5a8 to
e2682c0
Compare
…eps group (kubernetes-sigs#2845) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
0e9f0e4 to
4e4baf1
Compare
…kubernetes-sigs#2656) Co-authored-by: Jason Deal <jmdeal@amazon.com>
… API Port the rolling restart feature from the old API (v1beta1/v1alpha5) to the new API (v1/v1alpha1) after main underwent massive refactoring. Changes: - Add AllowRollingRestartAnnotationKey annotation to labels.go - Add pdbBlocked field to Candidate and isSinglePDBBlockError helper - Add rolling restart metrics (gauge, counters) - Add rolling_restart.go with workload restart logic and dedup - Update single/multi-node consolidation to handle PDB-blocked candidates - Update queue to trigger rolling restarts after cordoning - Add PDB status consistency validation - Add comprehensive rolling_restart_test.go (22 tests) - Update 5 existing suite_test.go PDB tests for new behavior All 242 disruption tests pass. Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
PDB-blocked nodes whose workloads lack the AllowRollingRestartAnnotationKey annotation are now rejected at NewCandidate time, identical to the pre-rolling-restart behavior. This prevents: - Single-node: wasted scheduling work and incorrect metrics - Multi-node: PDB-blocked candidates without annotation poisoning binary search Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
- Revert SingleNodeConsolidationTimeoutDuration to 3min (was 60min) - Replace brittle string matching in isSinglePDBBlockError with typed PDBBlockReason - Add ClearNodeClaimsCondition to rolling restart failure cleanup in queue.go - Use IsReschedulable consistently in collectRestartTargets (was IsEvictable) - Document performance tradeoff for podsEligibleForRollingRestart API calls Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
68a06b6 to
8ea507b
Compare
Pull Request Test Coverage Report for Build 23100567941Details
💛 - Coveralls |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Ports the rolling restart consolidation feature onto the current
mainAPI (v1/v1alpha1), replacing the previous implementation that targeted the oldv1beta1/v1alpha5API. Instead of rejecting PDB-blocked nodes outright, single-PDB-blocked candidates whose workloads have opted in viakarpenter.sh/allow-rolling-restartare markedpdbBlocked=trueand consolidated via rolling restart of owning workloads (Deployments/StatefulSets) instead of pod eviction. Nodes without the annotation are rejected at candidate creation time, identical to pre-rolling-restart behavior.Key behaviors
PDBBlockReasononPodBlockEvictionError—ValidatePodsDisruptablenow classifies errors asPDBBlockReasonSinglePDB,PDBBlockReasonMultiplePDB, orPDBBlockReasonOther(do-not-disrupt).isSinglePDBBlockErrorusesstate.GetPDBBlockReason()instead of string matching.AllowRollingRestartAnnotationKey(karpenter.sh/allow-rolling-restart) opt-in annotation on workloadsNewCandidate:podsEligibleForRollingRestartverifies all reschedulable pods have annotated owners before marking a candidate as PDB-blocked. Without the annotation, the node is rejected identically to before — never visible to consolidationcollectRestartTargetscheckskubectl.kubernetes.io/restartedAtwithin a 10-minute window before building the commandcomputeRollingRestartCommandre-fetches PDB limits before issuing a rolling restart, guarding against stalepdbBlockedstateStartCommandnow clears both the NoSchedule taint and the disruption reason condition on NodeClaimsFiles changed
PDBBlockReasonenum,NewPodBlockEvictionErrorWithReason,GetPDBBlockReason;ValidatePodsDisruptableuses typed reasonsAllowRollingRestartAnnotationKeyannotationpdbBlockedfield onCandidate,isSinglePDBBlockError(typed check),Restartsfield onCommand, early annotation eligibility check inNewCandidatepodsEligibleForRollingRestart, owner resolution, opt-in check, dedup, trigger logic. BothpodsEligibleForRollingRestartandcollectRestartTargetsuseIsReschedulableconsistently.computeRollingRestartCommandwith PDB re-verification; timeout remains at 3 minutesClearNodeClaimsConditionUpdates since last revision
Latest (gofmt fix):
gofmtalignment inPDBBlockReasonconst block that was causing CImake verifyto failPrevious (review fixes):
SingleNodeConsolidationTimeoutDurationfrom 60min back to 3min (20x increase was unintentional)isSinglePDBBlockErrorwith typedPDBBlockReasonenum onPodBlockEvictionErrorClearNodeClaimsConditionto rolling restart failure cleanup inqueue.go(was only clearing taint)collectRestartTargetsto useIsReschedulablefor consistency withpodsEligibleForRollingRestartpodsEligibleForRollingRestartAPI calls in hot pathPrevious (annotation check at candidate creation):
allow-rolling-restartannotation were incorrectly included in consolidationpodsEligibleForRollingRestartfunction to check annotation eligibility atNewCandidatetime before marking aspdbBlockedIsReschedulable(notIsEvictable) to properly handle DaemonSet/mirror pods — nodes with only non-reschedulable PDB-blocked pods are rejectedInitial (port to main):
main(new API). During the port, three test failures were discovered and fixed:Node(not justNodeClaim) sinceStateNode.Annotations()returnsNode.Annotationswhen registeredkubectl.kubernetes.io/restartedAt; dedup check added at command computation time (was only at execution time)computeRollingRestartCommandviapdb.NewLimitsso deleted PDBs are detected before issuing a rolling restartrolling_restart.goto satisfy staticcheck ST1005CI status: all presubmit checks pass (1.29.x–1.34.x). The 1.35.x failure is pre-existing infrastructure (
go-licensesbuild error + Coveralls 530), unrelated to this PR.Review & Testing Checklist for Human
podsEligibleForRollingRestartmakes 2 API calls per reschedulable pod inNewCandidatefor every single-PDB-blocked node. Monitor candidate creation latency in clusters with many PDB-blocked nodes. Consider adding caching or batch-fetching if this becomes a bottleneck.allow-rolling-restartannotation is removed between candidate creation (podsEligibleForRollingRestart) and command computation (collectRestartTargets), the command will fail. Verify this is handled gracefully and doesn't leave nodes in a bad state.SingleNodeConsolidationTimeoutDurationis 3 minutes. In clusters with many PDB-blocked candidates, rolling restart operations (owner resolution + PDB re-verification) take longer than standard consolidation. Monitor timeout metrics to ensure 3 minutes is adequate.PDBBlockReasonbackward compatibility — The newNewPodBlockEvictionErrorWithReasonfunction is used inValidatePodsDisruptable, but the oldNewPodBlockEvictionErrorstill exists and defaults toPDBBlockReasonOther. Confirm no other callers ofNewPodBlockEvictionErrorare affected by the newPDBReasonfield.Notes
All 244 disruption tests pass locally. The typed error approach eliminates the brittle string matching that was flagged in review. The
IsReschedulableconsistency fix ensures DaemonSet pods are properly excluded from both eligibility checks and target collection.Link to Devin run: https://app.devin.ai/sessions/a822da3f61094a318e3d05a6db48aefd
Requested by: @jeremywgleeson