Skip to content

[slice] NodeSelector in STS webhook#1176

Open
pajakd wants to merge 1 commit intoAI-Hypercomputer:slice-mainfrom
pajakd:lws_selector
Open

[slice] NodeSelector in STS webhook#1176
pajakd wants to merge 1 commit intoAI-Hypercomputer:slice-mainfrom
pajakd:lws_selector

Conversation

@pajakd
Copy link
Copy Markdown
Collaborator

@pajakd pajakd commented Apr 15, 2026

Description

Add a StatefulSet webhook, that (for relevant tpuv7 pods) makes sure that the topology NodeSelector is present.

If the selector is not there, it adds it with the value taken from the "cloud.google.com/gke-tpu-topology" annotation.

This is to make sure that LWS workloads are not blocked by the GKE warden.

Issue

Testing

@pajakd pajakd marked this pull request as draft April 15, 2026 13:18
@pajakd pajakd changed the title [slice] NodeSelector in pod webhook [slice] NodeSelector in sts webhook Apr 16, 2026
@pajakd pajakd marked this pull request as ready for review April 16, 2026 07:13
Copy link
Copy Markdown

@mwielgus mwielgus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Fix the typo in the commit.

sts.Spec.Template.Spec.NodeSelector = make(map[string]string)
}
if _, ok := sts.Spec.Template.Spec.NodeSelector[core.TPUTopologyAnnotation]; !ok {
sts.Spec.Template.Spec.NodeSelector[core.TPUTopologyAnnotation] = tpuTopology
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about health selector?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The health labels are always at the nodes (unlike the topology labels that are added only after the slice is Active), so it is not interfering with the scheduling to just add them in the LWS/Job/Jobset webhook.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was asking whether we should add node selector for healthy/degraded nodes or not.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We add this healthy/degraded selector in LWS webhook and it is propagated to kueue workload, so TAS knows which nodes to exclude. I think it should be propagated from LWS to STS but even if its not then is it a problem?

Comment thread slice/internal/webhooks/statefulset_webhook.go Outdated
Comment thread slice/internal/webhooks/statefulset_webhook.go Outdated
if len(w.Spec.LeaderWorkerTemplate.WorkerTemplate.Spec.Containers) == 0 {
w.Spec.LeaderWorkerTemplate.WorkerTemplate.Spec.Containers = []corev1.Container{{}}
}
w.Spec.LeaderWorkerTemplate.WorkerTemplate.Spec.Containers[0].Name = name
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if there is more than 1 container?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm but this is only for e2e tests. Making it work for any number of containers would introduce additional complexity and in kueue also we have 1 container hardcoded in such wrappers (https://github.com/kubernetes-sigs/kueue/blob/90375d4a3e8c363f7187512c8f5ad4145d041610/pkg/util/testingjobs/leaderworkerset/wrappers.go#L189-L211)

Comment thread slice/internal/webhooks/statefulset_webhook_test.go
@pajakd pajakd changed the title [slice] NodeSelector in sts webhook [slice] NodeSelector in STS webhook Apr 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants