Skip to content

fix: support seamless upgrades with pod uid in socket path names + default helm chart values#113

Merged
k8s-ci-robot merged 1 commit intokubernetes-sigs:mainfrom
alimaazamat:seamless-upgrade-socket-discovery
Apr 27, 2026
Merged

fix: support seamless upgrades with pod uid in socket path names + default helm chart values#113
k8s-ci-robot merged 1 commit intokubernetes-sigs:mainfrom
alimaazamat:seamless-upgrade-socket-discovery

Conversation

@alimaazamat
Copy link
Copy Markdown
Contributor

@alimaazamat alimaazamat commented Jul 29, 2025

Fixes issue #107
Socket path names in the format {baseName}{suffix} works fine for legacy deployments but if we have seamless upgrades (max surge 1 maxUnavailable 0) then it doesn't work because the socket path names include uid {baseName}-{uid}{suffix}.
Without UIDs in socket names: Two pods can't run simultaneously because they'd conflict trying to create the same socket file names.

Solution:
Include pod UID in socket paths. Define default seamless upgrade values. Chart now defaults to seamless upgrades (max surge 1 maxUnavailable 0) which users can override by setting kubeletPlugin.updateStrategy.rollingUpdate values directly.

How to test:

# Build the kubelet plugin binary with the changes
make cmd-dra-example-kubeletplugin
# in demo dir and build/load image
cd demo
./scripts/build-driver-image.sh
./scripts/load-driver-image-into-kind.sh
cd ..
# made a new namespace
kubectl create namespace dra-example-driver

# Test Default Config where seamless upgrades disabled
helm install dra-example-driver deployments/helm/dra-example-driver --namespace dra-example-driver
# Check deployment
kubectl get pods --namespace dra-example-driver
# I verified DaemonSet update strategy correctly has **maxSurge: 1, maxUnavailable: 0**
# Check logs for legacy socket paths
kubectl logs -n dra-example-driver $POD_NAME | grep -E "(socket|endpoint)"
 
# Upgrade deployment triggering rolling update to enable seamless upgrades
kubectl patch daemonset dra-example-driver-kubeletplugin -n dra-example-driver \
  -p '{"spec":{"template":{"metadata":{"annotations":{"test-rolling-update":"'$(date +%s)'"}}}}}'
# verify uid socket paths in new pods
NEW_POD_NAME=$(kubectl get pods -n dra-example-driver -o jsonpath='{.items[0].metadata.name}')
kubectl logs -n dra-example-driver $NEW_POD_NAME | grep -E "(socket|endpoint)" | head -5

# Test non-seamless (overriding defaults)
helm upgrade dra-example-driver deployments/helm/dra-example-driver \
  --namespace dra-example-driver \
  --set kubeletPlugin.updateStrategy.rollingUpdate.maxSurge=0 \
  --set kubeletPlugin.updateStrategy.rollingUpdate.maxUnavailable=1

# Verify old rolling update strategy {"maxSurge": 0, "maxUnavailable": 1}
kubectl get daemonset dra-example-driver-kubeletplugin -n dra-example-driver \
  -o jsonpath='{.spec.updateStrategy.rollingUpdate}' | jq .

Results:
Seamless Upgrades Enabled default behavior:
maxSurge: 1, maxUnavailable: 0

I0422 23:21:04.522190 1 health.go:71] "connecting to registration socket" path="unix:///var/lib/kubelet/plugins_registry/gpu.example.com-<POD_UID>-reg.sock"
I0422 23:21:04.522536 1 health.go:89] "connecting to DRA socket" path="unix:///var/lib/kubelet/plugins/gpu.example.com/dra-<POD_UID>.sock"

Seamless Upgrades Disabled by user overriding helm chart:
maxSurge: 0, maxUnavailable: 1

  • socket paths still include pod uid
I0422 23:18:56.245888 1 health.go:71] "connecting to registration socket" path="unix:///var/lib/kubelet/plugins_registry/gpu.example.com-<POD_UID>-reg.sock"
I0422 23:18:56.246339 1 health.go:89] "connecting to DRA socket" path="unix:///var/lib/kubelet/plugins/gpu.example.com/dra-<POD_UID>.sock"

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 29, 2025
@k8s-ci-robot k8s-ci-robot requested review from klueska and nojnhuh July 29, 2025 23:24
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 29, 2025
Comment thread cmd/dra-example-kubeletplugin/driver.go Outdated
Comment thread deployments/helm/dra-example-driver/templates/kubeletplugin.yaml Outdated
Comment thread deployments/helm/dra-example-driver/templates/kubeletplugin.yaml Outdated
Comment thread deployments/helm/dra-example-driver/values.yaml Outdated
@alimaazamat alimaazamat changed the title fix: supports seamless upgrade socket path name with uid [WIP] fix: supports seamless upgrade socket path name with uid Aug 15, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 15, 2025
@alimaazamat
Copy link
Copy Markdown
Contributor Author

/assign

@pohly pohly moved this from 🆕 New to 🏗 In progress in Dynamic Resource Allocation Aug 22, 2025
@alimaazamat alimaazamat force-pushed the seamless-upgrade-socket-discovery branch 2 times, most recently from 39da3da to 28f1e83 Compare August 26, 2025 23:54
@alimaazamat alimaazamat changed the title [WIP] fix: supports seamless upgrade socket path name with uid fix: supports seamless upgrade socket path name with uid Aug 27, 2025
@alimaazamat alimaazamat requested a review from nojnhuh August 27, 2025 22:16
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 27, 2025
@alimaazamat alimaazamat force-pushed the seamless-upgrade-socket-discovery branch from 28f1e83 to ce936de Compare August 27, 2025 23:04
@pohly pohly moved this from 🏗 In progress to 👀 In review in Dynamic Resource Allocation Sep 22, 2025
@k8s-triage-robot
Copy link
Copy Markdown

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 26, 2025
@nojnhuh
Copy link
Copy Markdown
Contributor

nojnhuh commented Nov 26, 2025

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 26, 2025
@k8s-triage-robot
Copy link
Copy Markdown

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 24, 2026
@k8s-triage-robot
Copy link
Copy Markdown

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 26, 2026
@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 26, 2026
@nojnhuh
Copy link
Copy Markdown
Contributor

nojnhuh commented Mar 26, 2026

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 26, 2026
@alimaazamat alimaazamat force-pushed the seamless-upgrade-socket-discovery branch from ed5054b to a68f70d Compare April 22, 2026 22:36
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 22, 2026
Comment thread cmd/dra-example-kubeletplugin/driver.go Outdated
Comment thread cmd/dra-example-kubeletplugin/main.go Outdated
Comment thread cmd/dra-example-kubeletplugin/driver.go Outdated
Comment thread deployments/helm/dra-example-driver/values.yaml Outdated
Comment thread deployments/helm/dra-example-driver/values.yaml
Comment thread cmd/dra-example-kubeletplugin/driver.go
Comment thread deployments/helm/dra-example-driver/templates/kubeletplugin.yaml Outdated
Copy link
Copy Markdown
Contributor

@nojnhuh nojnhuh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Thanks @alimaazamat! Could you please squash down to one commit?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 27, 2026
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

LGTM label has been added.

DetailsGit tree hash: f3f1c421dc5b8f7c738ee6c521b0e31781ba3808

@alimaazamat alimaazamat force-pushed the seamless-upgrade-socket-discovery branch from 5e788f8 to f48a596 Compare April 27, 2026 16:40
@alimaazamat alimaazamat changed the title fix: supports seamless upgrade socket path name with uid fix: support seamless upgrades with pod uid in socket path names + default helm chart values Apr 27, 2026
@alimaazamat alimaazamat force-pushed the seamless-upgrade-socket-discovery branch from f48a596 to a56e95c Compare April 27, 2026 16:47
Copy link
Copy Markdown
Contributor

@nojnhuh nojnhuh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alimaazamat, nojnhuh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 27, 2026
@k8s-ci-robot k8s-ci-robot merged commit a538bc8 into kubernetes-sigs:main Apr 27, 2026
6 checks passed
@pohly pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Apr 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants