fix(spawner): set fsGroupChangePolicy: OnRootMismatch on singleuser pods#83

Open
tylerpotts wants to merge 2 commits into main from fix/56-fsgroup-onrootmismatch

Conversation

@tylerpotts
Contributor

Summary

  • Adds securityContext.fsGroupChangePolicy: OnRootMismatch to c.KubeSpawner.extra_pod_config in config/jupyterhub/01-spawner.py.
  • Prevents the kubelet's default recursive chown (Always policy) from running on every spawn, which on populated home PVCs (large pixi envs, HuggingFace caches) takes 30–150s and times out the spawn at JupyterHub's 60s default.
  • OnRootMismatch runs the chown at most once per PVC (when the root GID is wrong), so subsequent spawns skip it entirely.

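The new key must sit in the same securityContext block as fsGroup: 100. kubespawner applies extra_pod_config as a top-level overwrite rather than a deep merge, so a securityContext set there replaces the one produced by c.KubeSpawner.fs_gid.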

Closes #56

Test plan

  • helm lint . --set nebariapp.enabled=false passes
  • uvx ruff check config/ passes
  • helm template renders both fsGroup: 100 and fsGroupChangePolicy: OnRootMismatch in the rendered hub ConfigMap
  • All 50 unit tests in tests/unit/ pass
  • Manual cluster verification (cannot run in CI): spawn a user with a populated home PVC and confirm
    • kubectl get pod jupyter-<user> -o jsonpath='{.spec.securityContext}' shows both fsGroup and fsGroupChangePolicy
    • kubectl describe pod jupyter-<user> no longer emits the "Setting volume ownership ... is taking longer than expected" warning after the first post-deploy spawn
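
The manual check above could be scripted. The following is a minimal sketch, assuming the pod has been fetched as JSON (for example with `kubectl get pod jupyter-<user> -o json`) and parsed into a dict. The helper name `missing_security_context_fields` is hypothetical and not part of this PR.

```python
# Expected pod-level securityContext values, taken from the PR description.
REQUIRED = {"fsGroup": 100, "fsGroupChangePolicy": "OnRootMismatch"}


def missing_security_context_fields(pod: dict, required: dict = REQUIRED) -> dict:
    """Return {field: (expected, actual)} for every field that does not match.

    `pod` is the parsed JSON of a Pod object. An empty result means the
    securityContext carries both expected values.
    """
    security_context = pod.get("spec", {}).get("securityContext") or {}
    return {
        key: (expected, security_context.get(key))
        for key, expected in required.items()
        if security_context.get(key) != expected
    }


# Example: a pod where fsGroup was dropped, as in the first attempt.
broken_pod = {"spec": {"securityContext": {"fsGroupChangePolicy": "OnRootMismatch"}}}
problems = missing_security_context_fields(broken_pod)
```

An empty dict means both keys are present with the expected values; any entry names the field that is missing or wrong.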

Out of scope

The architectural issue called out in #56 and #54 — that 01-spawner.py directly assigns c.KubeSpawner.extra_pod_config and clobbers any operator-provided singleuser.extraPodConfig from values — is not addressed here. That belongs in its own PR.

The kubelet's default fsGroupChangePolicy: Always recursively chowns every
file on the mounted home PVC to match fsGroup at every spawn. On home PVCs
with large pixi envs and model caches (observed: 750k files), this takes
30-150s and times out the spawn at the JupyterHub 60s default.

OnRootMismatch only performs the chown when the volume's root GID is wrong,
so it runs at most once per PVC and subsequent spawns skip it entirely.
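
The difference between the two policies can be modelled in a few lines. This is a simplified sketch, not kubelet code: it compares only the root GID, whereas the real kubelet also checks the root directory's permission bits. The function name is hypothetical.

```python
def needs_recursive_chown(policy: str, root_gid: int, fs_group: int) -> bool:
    """Simplified model of when the volume is recursively re-owned at mount."""
    if policy == "Always":
        # Default policy: walk the whole volume on every mount.
        return True
    if policy == "OnRootMismatch":
        # Only walk the volume if the root directory's group is wrong.
        return root_gid != fs_group
    raise ValueError(f"unknown fsGroupChangePolicy: {policy!r}")


FS_GROUP = 100

# First spawn after deploy: the volume root is assumed to be owned by GID 0.
first_spawn = needs_recursive_chown("OnRootMismatch", root_gid=0, fs_group=FS_GROUP)

# Later spawns: the root already has GID 100, so the walk is skipped.
later_spawn = needs_recursive_chown("OnRootMismatch", root_gid=FS_GROUP, fs_group=FS_GROUP)

# With the default policy the walk happens regardless of the root's group.
default_policy = needs_recursive_chown("Always", root_gid=FS_GROUP, fs_group=FS_GROUP)
```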

Closes #56

The first attempt set securityContext.fsGroupChangePolicy via
extra_pod_config but left fs_gid as a separate KubeSpawner trait,
assuming kubespawner would deep-merge the two into the pod's
securityContext. It doesn't — extra_pod_config does a top-level
attribute overwrite, so the securityContext set here replaced the one
fs_gid produced and pod.spec.securityContext.fsGroup was dropped.

Kubespawner logged exactly this at spawn time:
  'pod.spec.security_context' current value: '{'fsGroup': 100}' is
  overridden with '{'fsGroupChangePolicy': 'OnRootMismatch'}'
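
The overwrite can be reproduced with plain dictionaries. This is a toy model of the behavior described above, not kubespawner's implementation; both helper functions are hypothetical.

```python
import copy


def overwrite_top_level(pod_spec: dict, extra: dict) -> dict:
    """Model of the observed behavior: each top-level key in `extra` replaces
    the existing value wholesale."""
    result = copy.deepcopy(pod_spec)
    result.update(copy.deepcopy(extra))
    return result


def deep_merge(base: dict, extra: dict) -> dict:
    """What the first attempt assumed: nested dicts are merged key by key."""
    result = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# Pod spec as produced by fs_gid, and the extra_pod_config from the first attempt.
spec_from_fs_gid = {"securityContext": {"fsGroup": 100}}
extra_pod_config = {"securityContext": {"fsGroupChangePolicy": "OnRootMismatch"}}

overwritten = overwrite_top_level(spec_from_fs_gid, extra_pod_config)
merged = deep_merge(spec_from_fs_gid, extra_pod_config)
```

In this model `overwritten` loses fsGroup, matching the log line above, while `merged` keeps both keys.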

Effect: users no longer had GID 100 as a supplemental group, breaking
shared-storage writes and failing 5 e2e tests on PR #83.

Fix: set fsGroup and fsGroupChangePolicy together in extra_pod_config,
remove the now-redundant fs_gid trait, and document the overwrite
behavior in a code comment so the mistake is not reintroduced.
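
A sketch of what the fix might look like in the spawner config. The actual contents of 01-spawner.py are not shown here, so the helper `build_extra_pod_config` and the constant `FS_GID` are assumptions for illustration.

```python
FS_GID = 100  # group used for shared-storage writes, per the commit message


def build_extra_pod_config(fs_gid: int = FS_GID) -> dict:
    """Build a pod-level securityContext with both keys set together.

    extra_pod_config overwrites pod.spec.securityContext as a whole, so
    fsGroup must live here rather than in the separate fs_gid trait.
    """
    return {
        "securityContext": {
            "fsGroup": fs_gid,
            "fsGroupChangePolicy": "OnRootMismatch",
        }
    }


extra_pod_config = build_extra_pod_config()

# In a JupyterHub config file, where `c` is provided by the config loader,
# the assignment would be:
#     c.KubeSpawner.extra_pod_config = extra_pod_config
```

Keeping both keys in one dict means there is a single source for the pod's securityContext, so nothing is left for the overwrite to drop.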