Restart driver pods in place when driver config is unchanged by rajathagasthya · Pull Request #2527 · NVIDIA/gpu-operator

rajathagasthya · 2026-06-09T21:50:10Z

Draft for discussion. Depends on the RestartOnlyPredicate hook in NVIDIA/k8s-operator-libs#145 (draft). The library is not vendored here yet, so CI builds fail until that lands and the dependency is bumped; local testing uses a temporary go.mod replace + vendor of the library branch.

Description

A patch-level chart upgrade may change only cosmetic pod-template metadata (e.g. the helm.sh/chart label) without changing the driver itself. The upgrade controller keys on the DaemonSet's controller revision hash, so even such a change triggers the full upgrade flow: running GPU workloads are evicted and the node is drained, with no driver benefit.

Register a RestartOnlyPredicate on the upgrade state manager (from the UpgradeReconciler) that compares DRIVER_CONFIG_DIGEST — a hash of the install-relevant driver config, already set on the driver pod template — between the running pod and the desired DaemonSet. When the digests match, the node is cordoned and the driver pod restarted in place, with no workload eviction or drain; the driver fast-path keeps the kernel modules loaded across the restart, so running GPU workloads are not disrupted. Cordoning keeps the node unschedulable if the restart fails (same as the full flow), and the node is uncordoned on success. A missing or differing digest falls back to the full upgrade flow.
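For discussion, the digest comparison described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `EnvVar`, `Container`, and `PodSpec` are local stand-ins for the corev1 types so the snippet builds without Kubernetes dependencies, and `digestFromPodSpec`, `restartOnly`, `specWithDigest`, and the container name `driver` are hypothetical. It also assumes the first container carrying a non-empty digest wins, which may not match the real reader's precedence rules.

```go
package main

import "fmt"

// driverConfigDigestEnv is the env var name given in the description.
const driverConfigDigestEnv = "DRIVER_CONFIG_DIGEST"

// EnvVar, Container and PodSpec are minimal stand-ins for the corev1 types,
// kept local so the sketch compiles without Kubernetes dependencies.
type EnvVar struct{ Name, Value string }

type Container struct {
	Name string
	Env  []EnvVar
}

type PodSpec struct{ Containers []Container }

// digestFromPodSpec returns the first non-empty digest found, scanning
// containers in order. A nil spec or a missing variable yields "".
// Assumption: first match wins; the real reader may use another precedence.
func digestFromPodSpec(spec *PodSpec) string {
	if spec == nil {
		return ""
	}
	for _, c := range spec.Containers {
		for _, e := range c.Env {
			if e.Name == driverConfigDigestEnv && e.Value != "" {
				return e.Value
			}
		}
	}
	return ""
}

// restartOnly reports whether the running pod and the desired template
// carry the same non-empty digest. Missing or differing digests return
// false, which corresponds to falling back to the full upgrade flow.
func restartOnly(running, desired *PodSpec) bool {
	have := digestFromPodSpec(running)
	want := digestFromPodSpec(desired)
	return have != "" && have == want
}

// specWithDigest builds a one-container spec for the usage example.
func specWithDigest(digest string) *PodSpec {
	return &PodSpec{Containers: []Container{{
		Name: "driver", // hypothetical container name
		Env:  []EnvVar{{Name: driverConfigDigestEnv, Value: digest}},
	}}}
}

func main() {
	fmt.Println(restartOnly(specWithDigest("abc"), specWithDigest("abc"))) // true
	fmt.Println(restartOnly(specWithDigest("abc"), specWithDigest("def"))) // false
	fmt.Println(restartOnly(nil, specWithDigest("abc")))                   // false
}
```

Treating an empty digest as "no answer" rather than as a value that can match is what keeps pods created before the digest existed on the full flow.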

If the predicate returns an error or the cordon fails, the node stays in upgrade-required and is retried on a later reconcile (with a Warning event), rather than being routed to the disruptive flow on an unknown answer.
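The retry behaviour above amounts to a three-way decision, sketched below under stated assumptions. The state names are taken from this description and the manual test notes; `nextState`, the injected `predicate` and `cordon` functions, and the helper functions are hypothetical and do not reflect the library's actual signatures. Emitting the Warning event is left out.

```go
package main

import (
	"errors"
	"fmt"
)

type nodeState string

const (
	stateUpgradeRequired    nodeState = "upgrade-required"
	stateCordonRequired     nodeState = "cordon-required"
	statePodRestartRequired nodeState = "pod-restart-required"
)

// nextState decides where an upgrade-required node goes next.
// predicate and cordon are injected so the sketch stays dependency-free.
func nextState(predicate func() (bool, error), cordon func() error) (nodeState, error) {
	restartOnly, err := predicate()
	if err != nil {
		// Unknown answer: stay put and retry on a later reconcile
		// instead of routing to the disruptive flow.
		return stateUpgradeRequired, fmt.Errorf("restart-only predicate: %w", err)
	}
	if !restartOnly {
		// Definite "no": enter the full upgrade flow.
		return stateCordonRequired, nil
	}
	if err := cordon(); err != nil {
		// Cordon failed: also retry rather than restart an uncordoned node.
		return stateUpgradeRequired, fmt.Errorf("cordon: %w", err)
	}
	return statePodRestartRequired, nil
}

// Helpers for the usage example.
func always(v bool) func() (bool, error) {
	return func() (bool, error) { return v, nil }
}
func failingPredicate() (bool, error) { return false, errors.New("API unavailable") }
func cordonOK() error                 { return nil }
func cordonFails() error              { return errors.New("cordon failed") }

func main() {
	s, err := nextState(always(true), cordonOK)
	fmt.Println(s, err) // pod-restart-required <nil>

	s, err = nextState(always(false), cordonOK)
	fmt.Println(s, err) // cordon-required <nil>

	s, err = nextState(failingPredicate, cordonOK)
	fmt.Println(s, err != nil) // upgrade-required true

	s, err = nextState(always(true), cordonFails)
	fmt.Println(s, err != nil) // upgrade-required true
}
```

The point of returning the error alongside the unchanged state is that only a definite "no" from the predicate reaches the disruptive path.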

Known limitation: the first upgrade from a release without restart-only is still disruptive, because the old operator holds the leader-election lease and routes the upgrade before the new operator becomes leader. Steady-state (both sides have the code) is non-disruptive.

Related to #349

Checklist

No secrets, sensitive information, or unrelated changes
Lint checks passing (make lint)
Generated assets in-sync (make validate-generated-assets)
Go mod artifacts in-sync (make validate-modules)
Test cases are added for new code paths

Testing

Unit tests:

internal/config: TestDriverConfigDigestFromPodSpec — digest reader, incl. nil/empty/container-precedence cases.
controllers: TestDriverPodRestartOnly — predicate routing, incl. nil pod/DS and missing/equal/differing digests.
k8s-operator-libs#145 (Add optional restart-only predicate to inplace upgrade flow): Ginkgo specs for restart-only routing (cordon + pod-restart), retry-on-error for predicate and cordon failures, orphaned/upgrade-requested/safe-driver-load skips, and the maxParallelUpgrades throttle.
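As a rough picture of the throttle mentioned in the last item, the budget of nodes allowed to start in one reconcile could be computed like this. It is a simplified sketch: `upgradesAvailable` and its parameters are hypothetical, it assumes a maxParallelUpgrades of 0 means no limit, and it ignores any other limits the library may apply.

```go
package main

import "fmt"

// upgradesAvailable returns how many pending nodes may start now.
// Assumption: maxParallel == 0 means "no limit".
func upgradesAvailable(maxParallel, inProgress, pending int) int {
	if maxParallel == 0 {
		return pending
	}
	avail := maxParallel - inProgress
	if avail < 0 {
		avail = 0
	}
	if avail > pending {
		avail = pending
	}
	return avail
}

func main() {
	fmt.Println(upgradesAvailable(1, 0, 3)) // 1
	fmt.Println(upgradesAvailable(1, 1, 3)) // 0
	fmt.Println(upgradesAvailable(0, 2, 3)) // 3
}
```

Under this model a restart-only node consumes the same budget as a node going through the full flow, which is what the throttle specs are meant to check.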

Manual testing (single-node cluster, GPU workload running throughout):

Without Helm: deploy an operator image with this change, then patch the driver DS pod template with a label-only change. The node goes upgrade-required → pod-restart-required (cordoned, but never in cordon-required), the driver pod restarts via the fast path, and the GPU workload is not evicted.
Helm, first adoption: install v26.3.2, then upgrade to a chart built with this change. Full upgrade flow, as expected: the driver version also changed, and the first upgrade is routed by the old operator (see known limitation above).
Helm, patch upgrade: install a chart built with this change, then upgrade to another chart also built with this change, same driver version. Restart-only flow — the GPU workload is not evicted.
Helm, real driver change: install a chart built with this change, then upgrade to another chart also built with this change but a different driver version. Full upgrade flow (digest differs).

A patch chart upgrade can change only cosmetic pod-template metadata (e.g. the helm.sh/chart label) without changing the driver itself. The upgrade controller keys on the DaemonSet's controller revision hash, so such a change still evicts running GPU workloads and drains the node -- for no driver benefit.

Register a RestartOnlyPredicate on the upgrade state manager (from the UpgradeReconciler) that compares DRIVER_CONFIG_DIGEST -- a hash of the install-relevant driver config, already set on the driver pod template -- between the running pod and the desired DaemonSet. When the digests match, the node is cordoned and the driver pod restarted in place, with no workload eviction or drain; the driver fast-path keeps the kernel modules loaded across the restart, so running GPU workloads are not disrupted. Cordoning keeps the node unschedulable if the restart fails, and the node is uncordoned on success. A missing or differing digest falls back to the full upgrade flow.

The digest env name and a reader for it live in internal/config beside the digest definition; the restart-only routing decision is a method on the upgrade controller, registered in SetupWithManager.

Depends on the RestartOnlyPredicate hook in k8s-operator-libs; the vendored dependency bump follows once that change is released.

Signed-off-by: Rajath Agasthya <ragasthya@nvidia.com>

cdesiniotis reviewed Jun 9, 2026

Comment thread vendor/github.com/NVIDIA/k8s-operator-libs/pkg/upgrade/common_manager.go Outdated

rajathagasthya force-pushed the worktree-minor-version-driver-no-upgrade branch from ce94262 to 5165328 Compare June 10, 2026 18:01

rajathagasthya mentioned this pull request Jun 11, 2026

Add optional restart-only predicate to inplace upgrade flow NVIDIA/k8s-operator-libs#145

Open

rajathagasthya force-pushed the worktree-minor-version-driver-no-upgrade branch from 5165328 to 6144fb2 Compare June 11, 2026 17:19

rajathagasthya wants to merge 1 commit into NVIDIA:main from rajathagasthya:worktree-minor-version-driver-no-upgrade
