Description
Deleting a NodeSet CRD while Slurm jobs still reference its partition/node causes `slurmctld` to crash with SIGSEGV on the next reconfigure, entering an infinite CrashLoopBackOff.
The crash loop is self-perpetuating because `job_state` persists on PVC. Every restart causes slurmctld to attempt state recovery from the same corrupted `job_state`, hit the same null pointer dereference, and crash again.
Root Cause
NodeSet deleted
→ Controller reconciler regenerates slurm.conf (partition/node removed)
→ ConfigMap updated → /etc/slurm hash changes
→ reconfigure.sh sidecar detects hash change
→ scontrol reconfigure
→ slurmctld forks child to reload config + recover job state
→ child reads job_state: job references partition no longer in config
→ NULL pointer dereference → SIGSEGV (child)
→ restart → reads same job_state → crash again ...
Note that this is not limited to running jobs. Slurm retains job records in `job_state` even after jobs have ended, for up to `MinJobAge` seconds (default 300s) for accounting sync and dependency resolution. Deleting the NodeSet while any job record still references its partition triggers the SIGSEGV.
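The recovery step in the chain above (the child reads a `job_state` record naming a partition that is gone from the regenerated config) can be sketched as a minimal Go model of the hazard. This is illustrative only: slurmctld is C, and these type and function names are not Slurm's.

```go
package main

import "fmt"

// Partition as loaded from the regenerated slurm.conf.
type Partition struct{ Name string }

// Job record as deserialized from job_state on the PVC.
type Job struct {
	ID        int
	Partition string
}

// Partition table after the NodeSet (and its partition) was removed.
var partTable = map[string]*Partition{
	"debug": {Name: "debug"},
	// "test-partition" is gone: its NodeSet was deleted.
}

func findPart(name string) *Partition { return partTable[name] }

// Unguarded recovery dereferences the lookup result directly.
// This is the SIGSEGV-shaped bug: a nil deref here, a NULL deref in C.
func recoverUnguarded(j Job) string {
	p := findPart(j.Partition)
	return p.Name // panics when the partition was removed
}

// Guarded recovery detects the stale reference and degrades gracefully
// instead of crashing.
func recoverGuarded(j Job) string {
	p := findPart(j.Partition)
	if p == nil {
		return fmt.Sprintf("error: Invalid partition (%s) for JobId=%d", j.Partition, j.ID)
	}
	return p.Name
}

func main() {
	stale := Job{ID: 69, Partition: "test-partition"}
	fmt.Println(recoverGuarded(stale))
}
```

Because the stale record is re-read from the PVC on every restart, the unguarded path fails identically each time, which is what makes the crash loop self-perpetuating.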
Steps to Reproduce
- Install slurm-operator with a `StateSaveLocation` on a PVC
- Create a NodeSet (e.g., `test-partition`, `replicas=1`)
- Wait for the partition to appear in `sinfo`
- Submit a job: `sbatch --partition=test-partition --wrap="sleep 300"`
- Delete the NodeSet CRD directly (no `scancel`, no scale-down)
- Observe slurmctld crashing with SIGSEGV on the next reconfigure
Variant (MinJobAge race): Even if all jobs on the partition have already ended, deleting the NodeSet before `MinJobAge` expires causes the same crash.
Reproduction Log
Environment: slurm-operator v1.0.1, Slurm 25.11, Kubernetes v1.31.7
Timeline:
| Time (UTC) | Event |
|---|---|
| 06:13:26 | `sbatch --partition=test-partition` → job submitted (RUNNING) |
| 06:13:29 | `kubectl delete nodeset test-partition` |
| 06:14:03 | reconfigure.sh detects hash change → `scontrol reconfigure` |
| 06:14:08 | SIGSEGV — slurmctld child crashes |
| 06:15:03 | Container killed by Kubernetes (SIGTERM) |
| 06:15:04 | Restart #1 — on recovery reads stale `job_state` |
| 06:15:08 | `error: Invalid partition (test-partition) for JobId=69` |
| 06:15~06:17 | Crash loop: 6 total restarts |
| 06:17:35 | Recovery: NodeSet re-created with `replicas: 0` → partition restored → slurmctld stabilizes |
`slurm-controller-0` supervisor log:
# Reconfigure triggered after NodeSet deletion
[2026-03-10 06:14:03+00:00] fakesystemd.sh: received PID=26947
2026-03-10 06:14:03,998 INFO reaped unknown pid 26804 (exit status 0)
2026-03-10 06:14:03,998 INFO reaped unknown pid 26838 (exit status 0)
# SIGSEGV — child process crashes during state recovery
2026-03-10 06:14:08,002 INFO reaped unknown pid 26947 (terminated by SIGSEGV (core dumped))
2026-03-10 06:14:08,002 INFO reaped unknown pid 26981 (exit status 0)
# Kubernetes kills the destabilized container
2026-03-10 06:15:03,060 WARN received SIGTERM indicating exit request
2026-03-10 06:15:04,062 WARN stopped: fakesystemd (terminated by SIGTERM)
slurmctld log on restart (reading stale `job_state`):
[2026-03-10T06:15:08] error: Invalid partition (test-partition) for JobId=69
[2026-03-10T06:15:08] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 4 partitions
[2026-03-10T06:15:08] Running as primary controller
Environment
- slurm-operator: v1.0.1 (image: `ghcr.io/slinkyproject/slurm-operator:1.0.1`)
- Slurm: 25.11 (image: `ghcr.io/slinkyproject/slurmctld:25.11-ubuntu24.04`)
- Kubernetes: v1.31.7
- `StateSaveLocation` on persistent PVC
- `MinJobAge = 300` (default)
Expected Behavior
Deleting a NodeSet should not cause slurmctld to crash, regardless of whether jobs referencing its partition exist or have recently completed.
One possible approach would be to ensure that partition/node definitions are not removed from `slurm.conf` while `job_state` still holds records referencing them — for example, via a finalizer that defers deletion until stale job records have been purged.
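The proposed finalizer check could be sketched like this. The decision logic is reduced to a pure function; `JobRecord` and `safeToRemove` are hypothetical names for illustration, not the slurm-operator API.

```go
package main

import "fmt"

// JobRecord is a simplified view of an entry in job_state.
type JobRecord struct {
	ID        int
	Partition string
}

// safeToRemove reports whether a partition can be dropped from
// slurm.conf: true only when no job record (running or retained
// within the MinJobAge window) still references it.
func safeToRemove(records []JobRecord, partition string) bool {
	for _, r := range records {
		if r.Partition == partition {
			return false
		}
	}
	return true
}

func main() {
	records := []JobRecord{{ID: 69, Partition: "test-partition"}}

	// A finalizer-style reconciler would keep the NodeSet (and its
	// partition in slurm.conf) until this check passes, then requeue.
	if !safeToRemove(records, "test-partition") {
		fmt.Println("defer deletion: stale job records remain")
	}

	// Once MinJobAge has expired and the records are purged:
	fmt.Println(safeToRemove(nil, "test-partition")) // true
}
```

With such a gate, deleting the NodeSet CRD would merely delay the `slurm.conf` regeneration until the stale records are gone, instead of handing slurmctld a config that its own saved state contradicts.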
Additional Context
- Scale-down handling (`processCondemned`) already has drain-before-delete logic. The NodeSet deletion path could potentially benefit from equivalent safety guarantees against the `job_state` race on full NodeSet deletion.