Restart driver pods in place when driver config is unchanged#2527

Draft
rajathagasthya wants to merge 1 commit into
NVIDIA:mainfrom
rajathagasthya:worktree-minor-version-driver-no-upgrade
Conversation

@rajathagasthya

@rajathagasthya rajathagasthya commented Jun 9, 2026

Contributor

Draft for discussion. Depends on the RestartOnlyPredicate hook in NVIDIA/k8s-operator-libs#145 (draft). The library is not vendored here yet, so CI builds fail until that lands and the dependency is bumped; local testing uses a temporary go.mod replace + vendor of the library branch.

Description

A patch chart upgrade can change only cosmetic pod-template metadata (e.g. the helm.sh/chart label) without changing the driver itself. The upgrade controller keys on the DaemonSet's controller revision hash, so such a change still drains the node and evicts running GPU workloads, for no driver benefit.
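To illustrate why a label-only change triggers the full flow, here is a minimal sketch contrasting a hash over the whole pod template with a digest over the driver config alone. The `PodTemplate` type, the helper names, and the label and config values are hypothetical, and SHA-256 is used as a stand-in; this is not how Kubernetes or the operator actually computes either value.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// PodTemplate is a hypothetical, heavily simplified pod template:
// cosmetic labels plus the install-relevant driver config.
type PodTemplate struct {
	Labels       map[string]string
	DriverConfig map[string]string
}

// writeSorted appends the map entries in key order so the output is deterministic.
func writeSorted(b *strings.Builder, prefix string, m map[string]string) {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Fprintf(b, "%s/%s=%s\n", prefix, k, m[k])
	}
}

// shortSum returns a truncated hex SHA-256 of s (stand-in hash, an assumption).
func shortSum(s string) string {
	d := sha256.Sum256([]byte(s))
	return hex.EncodeToString(d[:8])
}

// templateHash covers labels and config, like a revision hash over the whole template.
func templateHash(t PodTemplate) string {
	var b strings.Builder
	writeSorted(&b, "label", t.Labels)
	writeSorted(&b, "config", t.DriverConfig)
	return shortSum(b.String())
}

// configDigest covers only the install-relevant driver config.
func configDigest(t PodTemplate) string {
	var b strings.Builder
	writeSorted(&b, "config", t.DriverConfig)
	return shortSum(b.String())
}

func main() {
	cfg := map[string]string{"driverVersion": "example-1"}
	before := PodTemplate{Labels: map[string]string{"helm.sh/chart": "chart-1.0.0"}, DriverConfig: cfg}
	after := PodTemplate{Labels: map[string]string{"helm.sh/chart": "chart-1.0.1"}, DriverConfig: cfg}

	fmt.Println("template hash changed:", templateHash(before) != templateHash(after))
	fmt.Println("config digest changed:", configDigest(before) != configDigest(after))
}
```

In this sketch the template hash changes with the chart label while the config digest stays the same, which is the gap the restart-only path is meant to exploit.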

Register a RestartOnlyPredicate on the upgrade state manager (from the UpgradeReconciler) that compares DRIVER_CONFIG_DIGEST — a hash of the install-relevant driver config, already set on the driver pod template — between the running pod and the desired DaemonSet. When the digests match, the node is cordoned and the driver pod restarted in place, with no workload eviction or drain; the driver fast-path keeps the kernel modules loaded across the restart, so running GPU workloads are not disrupted. Cordoning keeps the node unschedulable if the restart fails (same as the full flow), and the node is uncordoned on success. A missing or differing digest falls back to the full upgrade flow.
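The digest comparison described above can be sketched roughly as follows. The `EnvVar`, `Container`, and `PodSpec` types are simplified stand-ins for the Kubernetes core types, and `digestFromPodSpec` and `restartOnly` are hypothetical names; the scan order across containers is an assumption and may differ from the precedence rules in the actual reader.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes core/v1 types (assumption:
// the real code operates on corev1.PodSpec).
type EnvVar struct{ Name, Value string }

type Container struct {
	Name string
	Env  []EnvVar
}

type PodSpec struct {
	Containers []Container
}

// Env var name taken from the PR description.
const driverConfigDigestEnv = "DRIVER_CONFIG_DIGEST"

// digestFromPodSpec returns the first non-empty digest found on any
// container, or "" if the spec is nil or carries no digest.
func digestFromPodSpec(spec *PodSpec) string {
	if spec == nil {
		return ""
	}
	for _, c := range spec.Containers {
		for _, e := range c.Env {
			if e.Name == driverConfigDigestEnv && e.Value != "" {
				return e.Value
			}
		}
	}
	return ""
}

// restartOnly reports whether the running pod and the desired DaemonSet
// template carry the same non-empty digest. A missing or differing
// digest returns false, which means "use the full upgrade flow".
func restartOnly(running, desired *PodSpec) bool {
	r := digestFromPodSpec(running)
	d := digestFromPodSpec(desired)
	return r != "" && r == d
}

func withDigest(v string) *PodSpec {
	return &PodSpec{Containers: []Container{{
		Name: "driver",
		Env:  []EnvVar{{Name: driverConfigDigestEnv, Value: v}},
	}}}
}

func main() {
	fmt.Println("equal digests:    ", restartOnly(withDigest("abc"), withDigest("abc")))
	fmt.Println("differing digests:", restartOnly(withDigest("abc"), withDigest("def")))
	fmt.Println("missing digest:   ", restartOnly(&PodSpec{}, withDigest("abc")))
}
```

Treating an empty digest as "not equal" keeps the sketch conservative: two pods that both lack the env var are never routed to the restart-only path.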

If the predicate returns an error or the cordon fails, the node stays in upgrade-required and is retried on a later reconcile (with a Warning event), rather than being routed to the disruptive flow on an unknown answer.
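A minimal sketch of this routing, under the assumption that the decision can be modeled as a pure function of the predicate and cordon results. The `routeNode` function and `outcome` type are hypothetical and do not reflect the library's API; the state names follow the ones used in the testing notes below.

```go
package main

import (
	"errors"
	"fmt"
)

// Node upgrade states named in this PR.
const (
	stateUpgradeRequired    = "upgrade-required"
	stateCordonRequired     = "cordon-required" // entry to the full, disruptive flow
	statePodRestartRequired = "pod-restart-required"
)

// errAPI is an example failure used in main.
var errAPI = errors.New("api unavailable")

// outcome is a hypothetical result type: the node's next state plus an
// optional Warning event message.
type outcome struct {
	NextState string
	Warning   string
}

// routeNode decides the next state for a node in upgrade-required.
// Any error keeps the node in upgrade-required so a later reconcile
// retries, instead of guessing and taking the disruptive path.
func routeNode(predicate func() (bool, error), cordon func() error) outcome {
	restartOnly, err := predicate()
	if err != nil {
		return outcome{NextState: stateUpgradeRequired, Warning: "restart-only check failed: " + err.Error()}
	}
	if !restartOnly {
		// Digest missing or different: full upgrade flow.
		return outcome{NextState: stateCordonRequired}
	}
	if err := cordon(); err != nil {
		return outcome{NextState: stateUpgradeRequired, Warning: "cordon failed: " + err.Error()}
	}
	// Cordoned successfully: skip eviction and drain, restart the pod.
	return outcome{NextState: statePodRestartRequired}
}

func main() {
	cordonOK := func() error { return nil }
	cordonFail := func() error { return errAPI }

	fmt.Println(routeNode(func() (bool, error) { return true, nil }, cordonOK))
	fmt.Println(routeNode(func() (bool, error) { return false, nil }, cordonOK))
	fmt.Println(routeNode(func() (bool, error) { return false, errAPI }, cordonOK))
	fmt.Println(routeNode(func() (bool, error) { return true, nil }, cordonFail))
}
```

The point of the sketch is the asymmetry: only a definite "false" from the predicate selects the disruptive flow, while an unknown answer leaves the node where it is.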

Known limitation: the first upgrade from a release without restart-only is still disruptive, because the old operator holds the leader-election lease and routes the upgrade before the new operator becomes leader. Steady-state upgrades, where both the old and the new operator include the restart-only code, are non-disruptive.

Related to #349

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

Unit tests:

  • internal/config: TestDriverConfigDigestFromPodSpec — digest reader, incl. nil/empty/container-precedence cases.
  • controllers: TestDriverPodRestartOnly — predicate routing, incl. nil pod/DS and missing/equal/differing digests.
  • NVIDIA/k8s-operator-libs#145 (Add optional restart-only predicate to inplace upgrade flow): Ginkgo specs for restart-only routing (cordon + pod-restart), retry-on-error for predicate and cordon failures, orphaned/upgrade-requested/safe-driver-load skips, and the maxParallelUpgrades throttle.

Manual testing (single-node cluster, GPU workload running throughout):

  1. Without helm: deploy an operator image with this change, then patch the driver DS pod template with a label-only change. Node goes upgrade-required → pod-restart-required (cordoned, never cordon-required), the driver pod restarts via the fast path, and the GPU workload is not evicted.
  2. Helm, first adoption: install v26.3.2, then upgrade to a chart built with this change. Full upgrade flow, as expected: the driver version also changed, and the first upgrade is routed by the old operator (see known limitation above).
  3. Helm, patch upgrade: install a chart built with this change, then upgrade to another chart also built with this change, same driver version. Restart-only flow — the GPU workload is not evicted.
  4. Helm, real driver change: install a chart built with this change, then upgrade to another chart also built with this change but a different driver version. Full upgrade flow (digest differs).

A patch chart upgrade can change only cosmetic pod-template metadata
(e.g. the helm.sh/chart label) without changing the driver itself. The
upgrade controller keys on the DaemonSet's controller revision hash, so
such a change still evicts running GPU workloads and drains the node --
for no driver benefit.

Register a RestartOnlyPredicate on the upgrade state manager (from the
UpgradeReconciler) that compares DRIVER_CONFIG_DIGEST -- a hash of the
install-relevant driver config, already set on the driver pod
template -- between the running pod and the desired DaemonSet. When the
digests match, the node is cordoned and the driver pod restarted in
place, with no workload eviction or drain; the driver fast-path keeps
the kernel modules loaded across the restart, so running GPU workloads
are not disrupted. Cordoning keeps the node unschedulable if the
restart fails, and the node is uncordoned on success. A missing or
differing digest falls back to the full upgrade flow.

The digest env name and a reader for it live in internal/config beside
the digest definition; the restart-only routing decision is a method on
the upgrade controller, registered in SetupWithManager. Depends on the
RestartOnlyPredicate hook in k8s-operator-libs; the vendored dependency
bump follows once that change is released.

Signed-off-by: Rajath Agasthya <ragasthya@nvidia.com>