Restart driver pods in place when driver config is unchanged #2527
Draft
rajathagasthya wants to merge 1 commit into
cdesiniotis reviewed Jun 9, 2026
A patch chart upgrade can change only cosmetic pod-template metadata (e.g. the helm.sh/chart label) without changing the driver itself. The upgrade controller keys on the DaemonSet's controller revision hash, so such a change still evicts running GPU workloads and drains the node -- for no driver benefit.

Register a RestartOnlyPredicate on the upgrade state manager (from the UpgradeReconciler) that compares DRIVER_CONFIG_DIGEST -- a hash of the install-relevant driver config, already set on the driver pod template -- between the running pod and the desired DaemonSet. When the digests match, the node is cordoned and the driver pod restarted in place, with no workload eviction or drain; the driver fast-path keeps the kernel modules loaded across the restart, so running GPU workloads are not disrupted. Cordoning keeps the node unschedulable if the restart fails, and the node is uncordoned on success. A missing or differing digest falls back to the full upgrade flow.

The digest env name and a reader for it live in internal/config beside the digest definition; the restart-only routing decision is a method on the upgrade controller, registered in SetupWithManager.

Depends on the RestartOnlyPredicate hook in k8s-operator-libs; the vendored dependency bump follows once that change is released.

Signed-off-by: Rajath Agasthya <ragasthya@nvidia.com>
ce94262 to 5165328
5165328 to 6144fb2
Description
A patch chart upgrade can change only cosmetic pod-template metadata (e.g. the `helm.sh/chart` label) without changing the driver itself. The upgrade controller keys on the DaemonSet's controller revision hash, so such a change still evicts running GPU workloads and drains the node, disrupting those workloads for no driver benefit.

Register a `RestartOnlyPredicate` on the upgrade state manager (from the `UpgradeReconciler`) that compares `DRIVER_CONFIG_DIGEST` (a hash of the install-relevant driver config, already set on the driver pod template) between the running pod and the desired DaemonSet. When the digests match, the node is cordoned and the driver pod is restarted in place, with no workload eviction or drain; the driver fast-path keeps the kernel modules loaded across the restart, so running GPU workloads are not disrupted. Cordoning keeps the node unschedulable if the restart fails (same as the full flow), and the node is uncordoned on success. A missing or differing digest falls back to the full upgrade flow.

If the predicate returns an error or the cordon fails, the node stays in `upgrade-required` and is retried on a later reconcile (with a Warning event), rather than being routed to the disruptive flow on an unknown answer.

Known limitation: the first upgrade from a release without restart-only is still disruptive, because the old operator holds the leader-election lease and routes the upgrade before the new operator becomes leader. Steady state (both sides have the code) is non-disruptive.
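
The digest comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `PodSpec`, `Container`, and `EnvVar` are simplified stand-ins for the Kubernetes API types, the function names `digestFromPodSpec` and `restartOnly` are hypothetical, and the "first container that sets the variable wins" precedence is an assumption.

```go
package main

// Simplified, hypothetical stand-ins for the Kubernetes API types.
// The real implementation would work with the pod spec of the running
// driver pod and the pod template of the desired DaemonSet.
type EnvVar struct {
	Name  string
	Value string
}

type Container struct {
	Name string
	Env  []EnvVar
}

type PodSpec struct {
	Containers []Container
}

// Env name taken from the PR description.
const driverConfigDigestEnv = "DRIVER_CONFIG_DIGEST"

// digestFromPodSpec returns the first non-empty DRIVER_CONFIG_DIGEST value
// found on any container, or "" when spec is nil or no container sets it.
// The first-container-wins precedence is an assumption of this sketch.
func digestFromPodSpec(spec *PodSpec) string {
	if spec == nil {
		return ""
	}
	for _, c := range spec.Containers {
		for _, e := range c.Env {
			if e.Name == driverConfigDigestEnv && e.Value != "" {
				return e.Value
			}
		}
	}
	return ""
}

// restartOnly reports whether the running pod and the desired template carry
// the same non-empty digest. A missing or differing digest yields false,
// which corresponds to falling back to the full upgrade flow.
func restartOnly(running, desired *PodSpec) bool {
	have := digestFromPodSpec(running)
	want := digestFromPodSpec(desired)
	return have != "" && have == want
}
```

Treating an empty digest as "not equal" keeps pods created before the digest existed on the full upgrade flow, which matches the fallback behaviour described above.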
Related to #349
Checklist
- `make lint`
- `make validate-generated-assets`
- `make validate-modules`

Testing
Unit tests:
- `internal/config`: `TestDriverConfigDigestFromPodSpec`: digest reader, incl. nil/empty/container-precedence cases.
- `controllers`: `TestDriverPodRestartOnly`: predicate routing, incl. nil pod/DS and missing/equal/differing digests.

Manual testing (single-node cluster, GPU workload running throughout):
`upgrade-required → pod-restart-required` (cordoned, never `cordon-required`), the driver pod restarts via the fast path, and the GPU workload is not evicted.
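
The routing observed in the manual test, together with the error handling from the description, can be sketched as a small decision function. This is an illustrative sketch only: the state names come from the text above, while `nextState`, the `cordon` callback, and `errExample` are hypothetical, and the assumption that the full flow begins at `cordon-required` is inferred from the "never `cordon-required`" remark rather than confirmed by this PR.

```go
package main

import "errors"

// State names as they appear in the PR description and test notes.
const (
	stateUpgradeRequired    = "upgrade-required"
	statePodRestartRequired = "pod-restart-required"
	stateCordonRequired     = "cordon-required" // assumed entry point of the full flow
)

// errExample is a placeholder failure used only to exercise the sketch.
var errExample = errors.New("example failure")

// nextState picks the next state for a node currently in upgrade-required.
//   - predicate error: stay in upgrade-required and retry on a later reconcile
//   - digests differ or are missing: take the full (disruptive) flow
//   - digests match: cordon first, then restart the driver pod in place;
//     if the cordon fails, stay in upgrade-required and retry
func nextState(restartOnly bool, predicateErr error, cordon func() error) string {
	if predicateErr != nil {
		return stateUpgradeRequired
	}
	if !restartOnly {
		return stateCordonRequired
	}
	if err := cordon(); err != nil {
		return stateUpgradeRequired
	}
	return statePodRestartRequired
}
```

Returning `upgrade-required` on an unknown answer means an error never pushes a node into eviction and drain; the cost is at most one extra reconcile.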