Skip to content

[karpenter]: rolling restart consolidation for PDB-blocked nodes#2

Open
jeremywgleeson wants to merge 12 commits into
mainfrom
jeremy/rolling-restart-consolidation
Open

[karpenter]: rolling restart consolidation for PDB-blocked nodes#2
jeremywgleeson wants to merge 12 commits into
mainfrom
jeremy/rolling-restart-consolidation

Conversation

@jeremywgleeson
Copy link
Copy Markdown

@jeremywgleeson jeremywgleeson commented Feb 22, 2026

Summary

Ports the rolling restart consolidation feature onto the current main API (v1/v1alpha1), replacing the previous implementation that targeted the old v1beta1/v1alpha5 API. Instead of rejecting PDB-blocked nodes outright, single-PDB-blocked candidates whose workloads have opted in via karpenter.sh/allow-rolling-restart are marked pdbBlocked=true and consolidated via rolling restart of owning workloads (Deployments/StatefulSets) instead of pod eviction. Nodes without the annotation are rejected at candidate creation time, identical to pre-rolling-restart behavior.

Key behaviors

  • Typed PDBBlockReason on PodBlockEvictionErrorValidatePodsDisruptable now classifies errors as PDBBlockReasonSinglePDB, PDBBlockReasonMultiplePDB, or PDBBlockReasonOther (do-not-disrupt). isSinglePDBBlockError uses state.GetPDBBlockReason() instead of string matching.
  • AllowRollingRestartAnnotationKey (karpenter.sh/allow-rolling-restart) opt-in annotation on workloads
  • Early annotation check at NewCandidate: podsEligibleForRollingRestart verifies all reschedulable pods have annotated owners before marking a candidate as PDB-blocked. Without the annotation, the node is rejected identically to before — never visible to consolidation
  • Dedup at computation time: collectRestartTargets checks kubectl.kubernetes.io/restartedAt within a 10-minute window before building the command
  • PDB re-verification: computeRollingRestartCommand re-fetches PDB limits before issuing a rolling restart, guarding against stale pdbBlocked state
  • Validation: detects PDB status changes during TTL wait to reject stale commands
  • Cleanup on failure: rolling restart trigger failure in StartCommand now clears both the NoSchedule taint and the disruption reason condition on NodeClaims

Files changed

  • statenode.go: PDBBlockReason enum, NewPodBlockEvictionErrorWithReason, GetPDBBlockReason; ValidatePodsDisruptable uses typed reasons
  • labels.go: AllowRollingRestartAnnotationKey annotation
  • types.go: pdbBlocked field on Candidate, isSinglePDBBlockError (typed check), Restarts field on Command, early annotation eligibility check in NewCandidate
  • metrics.go: 4 new rolling restart metrics (triggered, skipped/dedup, candidates gauge, errors)
  • rolling_restart.go: new file — podsEligibleForRollingRestart, owner resolution, opt-in check, dedup, trigger logic. Both podsEligibleForRollingRestart and collectRestartTargets use IsReschedulable consistently.
  • singlenodeconsolidation.go: computeRollingRestartCommand with PDB re-verification; timeout remains at 3 minutes
  • multinodeconsolidation.go: restart target collection for PDB-blocked candidates
  • queue.go: trigger rolling restarts after cordoning; failure path clears disruption condition via ClearNodeClaimsCondition
  • validation.go: PDB status consistency check
  • suite_test.go: existing PDB tests unchanged (PDB-blocked nodes without annotation still rejected as errors)
  • rolling_restart_test.go: new file — 24 tests covering candidate creation, annotation gating, consolidation, dedup, validation

Updates since last revision

Latest (gofmt fix):

  • Fixed gofmt alignment in PDBBlockReason const block that was causing CI make verify to fail

Previous (review fixes):

  • Reverted SingleNodeConsolidationTimeoutDuration from 60min back to 3min (20x increase was unintentional)
  • Replaced brittle string matching in isSinglePDBBlockError with typed PDBBlockReason enum on PodBlockEvictionError
  • Added ClearNodeClaimsCondition to rolling restart failure cleanup in queue.go (was only clearing taint)
  • Changed collectRestartTargets to use IsReschedulable for consistency with podsEligibleForRollingRestart
  • Documented performance tradeoff for podsEligibleForRollingRestart API calls in hot path

Previous (annotation check at candidate creation):

  • Fixed bug where PDB-blocked candidates without allow-rolling-restart annotation were incorrectly included in consolidation
  • Added podsEligibleForRollingRestart function to check annotation eligibility at NewCandidate time before marking as pdbBlocked
  • Nodes with PDB-blocked pods but no annotated owners are now rejected identically to pre-rolling-restart behavior (never become candidates)
  • Uses IsReschedulable (not IsEvictable) to properly handle DaemonSet/mirror pods — nodes with only non-reschedulable PDB-blocked pods are rejected
  • Added negative tests for annotation gating in candidate creation and consolidation
  • Test count: 244/244 pass (was 242, added 2 new negative tests)

Initial (port to main):

  • Rebased onto current main (new API). During the port, three test failures were discovered and fixed:
    1. do-not-disrupt test: annotation must be set on Node (not just NodeClaim) since StateNode.Annotations() returns Node.Annotations when registered
    2. dedup test: annotation key corrected to kubectl.kubernetes.io/restartedAt; dedup check added at command computation time (was only at execution time)
    3. validation test: added PDB re-verification in computeRollingRestartCommand via pdb.NewLimits so deleted PDBs are detected before issuing a rolling restart
    4. lint fix: lowercased 3 error strings in rolling_restart.go to satisfy staticcheck ST1005

CI status: all presubmit checks pass (1.29.x–1.34.x). The 1.35.x failure is pre-existing infrastructure (go-licenses build error + Coveralls 530), unrelated to this PR.

Review & Testing Checklist for Human

  • End-to-end cluster testing — Unit tests pass (244/244), but the rolling restart flow hasn't been tested in a real cluster. Deploy to a test cluster with PDB-protected workloads and verify: (a) workloads with the annotation restart correctly, (b) nodes without annotation are rejected at candidate creation, (c) dedup prevents cascading restarts, (d) PDB re-verification catches stale commands, (e) metrics are emitted correctly, (f) failure cleanup properly clears both taint and disruption condition.
  • API call performance in hot pathpodsEligibleForRollingRestart makes 2 API calls per reschedulable pod in NewCandidate for every single-PDB-blocked node. Monitor candidate creation latency in clusters with many PDB-blocked nodes. Consider adding caching or batch-fetching if this becomes a bottleneck.
  • Annotation removal race — If a workload's allow-rolling-restart annotation is removed between candidate creation (podsEligibleForRollingRestart) and command computation (collectRestartTargets), the command will fail. Verify this is handled gracefully and doesn't leave nodes in a bad state.
  • 3-minute timeout sufficiencySingleNodeConsolidationTimeoutDuration is 3 minutes. In clusters with many PDB-blocked candidates, rolling restart operations (owner resolution + PDB re-verification) take longer than standard consolidation. Monitor timeout metrics to ensure 3 minutes is adequate.
  • Verify typed PDBBlockReason backward compatibility — The new NewPodBlockEvictionErrorWithReason function is used in ValidatePodsDisruptable, but the old NewPodBlockEvictionError still exists and defaults to PDBBlockReasonOther. Confirm no other callers of NewPodBlockEvictionError are affected by the new PDBReason field.

Notes

All 244 disruption tests pass locally. The typed error approach eliminates the brittle string matching that was flagged in review. The IsReschedulable consistency fix ensures DaemonSet pods are properly excluded from both eligibility checks and target collection.


Link to Devin run: https://app.devin.ai/sessions/a822da3f61094a318e3d05a6db48aefd
Requested by: @jeremywgleeson

dependabot Bot and others added 3 commits February 25, 2026 05:39
@devin-ai-integration devin-ai-integration Bot force-pushed the jeremy/rolling-restart-consolidation branch from 0e9f0e4 to 4e4baf1 Compare February 26, 2026 20:07
moko-poi and others added 9 commits March 4, 2026 23:16
… API

Port the rolling restart feature from the old API (v1beta1/v1alpha5) to
the new API (v1/v1alpha1) after main underwent massive refactoring.

Changes:
- Add AllowRollingRestartAnnotationKey annotation to labels.go
- Add pdbBlocked field to Candidate and isSinglePDBBlockError helper
- Add rolling restart metrics (gauge, counters)
- Add rolling_restart.go with workload restart logic and dedup
- Update single/multi-node consolidation to handle PDB-blocked candidates
- Update queue to trigger rolling restarts after cordoning
- Add PDB status consistency validation
- Add comprehensive rolling_restart_test.go (22 tests)
- Update 5 existing suite_test.go PDB tests for new behavior

All 242 disruption tests pass.

Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
PDB-blocked nodes whose workloads lack the AllowRollingRestartAnnotationKey
annotation are now rejected at NewCandidate time, identical to the
pre-rolling-restart behavior. This prevents:
- Single-node: wasted scheduling work and incorrect metrics
- Multi-node: PDB-blocked candidates without annotation poisoning binary search

Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
- Revert SingleNodeConsolidationTimeoutDuration to 3min (was 60min)
- Replace brittle string matching in isSinglePDBBlockError with typed PDBBlockReason
- Add ClearNodeClaimsCondition to rolling restart failure cleanup in queue.go
- Use IsReschedulable consistently in collectRestartTargets (was IsEvictable)
- Document performance tradeoff for podsEligibleForRollingRestart API calls

Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
Co-Authored-By: jeremy <jeremywgleeson@gmail.com>
@devin-ai-integration devin-ai-integration Bot force-pushed the jeremy/rolling-restart-consolidation branch from 68a06b6 to 8ea507b Compare March 15, 2026 01:24
@coveralls
Copy link
Copy Markdown

Pull Request Test Coverage Report for Build 23100567941

Details

  • 435 of 612 (71.08%) changed or added relevant lines in 15 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.2%) to 80.296%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/controllers/disruption/validation.go 6 9 66.67%
pkg/controllers/state/statenode.go 12 15 80.0%
pkg/controllers/disruption/multinodeconsolidation.go 15 19 78.95%
pkg/controllers/state/informer/nodeclaim.go 4 8 50.0%
pkg/controllers/controllers.go 0 5 0.0%
pkg/controllers/disruption/queue.go 5 12 41.67%
pkg/controllers/state/informer/nodepool.go 6 13 46.15%
pkg/controllers/disruption/singlenodeconsolidation.go 34 48 70.83%
pkg/controllers/state/informer/pricing.go 18 59 30.51%
pkg/state/cost/cost.go 170 212 80.19%
Files with Coverage Reduction New Missed Lines %
pkg/controllers/state/informer/nodepool.go 1 60.0%
Totals Coverage Status
Change from base Build 22327604045: -0.2%
Covered Lines: 12311
Relevant Lines: 15332

💛 - Coveralls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants