KEP-666: Gang Scheduling in LWS#844
I was working on a prototype at https://github.com/Edwinhr716/lws/tree/was-poc, happy to collaborate here.
| kind: PodGroup | ||
| metadata: | ||
| name: leaderworkerset-sample-pg-0 | ||
| ownerReferences: |
There was a problem hiding this comment.
Thoughts on having the leader pod own the PodGroup instead? That way it gets cleaned up when LWS is autoscaled. It also simplifies the logic of managing the life cycle during maxSurge and rolling updates
There was a problem hiding this comment.
Thanks @Edwinhr716. My one concern is forward compatibility: KEP-5832 §Risks and Mitigations plans a validating admission controller for Workload → PodGroup → Pod creation order, with UnschedulableAndUnresolvable kept only as a "last line of defense". If that lands and is on by default, pod-owned would break at admission instead of degrading.
That said, the cleanup simplification is real and you're closer to the implementation — happy to go with pod-owned, just want to note this as a known follow-up.
There was a problem hiding this comment.
Thanks for pointing that out, I wasn't aware that a validating admission controller enforcing the creation order was planned.
That makes it trickier for the LWS controller to manage the lifecycle of PodGroups, and is something we need to think about in the design.
Pushed an update aligning the KEP with the original PoC Google Doc:
|
|
Something we need to discuss further is whether it makes sense to integrate with the PodGroup and Workload APIs now, or to wait for kubernetes/enhancements#6017 to address the limitations that I flagged here: https://docs.google.com/document/d/1VqfNB1u8cmrRhMe0DKycX-bfaLgDm5cgHGWdWu-94cM/edit?tab=t.0#bookmark=kix.e0bqf1kap91e. If the former, we also need to think about the migration from simple PodGroups to the CompositePodGroup APIs.
|
|
||
| ## Proposal | ||
|
|
||
| When the LWS object carries `leaderworkerset.sigs.k8s.io/gang-scheduling: "true"`, LWS creates and owns one `scheduling.k8s.io/v1alpha2` Workload (holding a gang PodGroup template) plus one standalone `PodGroup` per replica; each PodGroup's `MinCount` defaults to `LeaderWorkerTemplate.Size`, so all pods of a replica co-schedule by default. The pod webhook sets each pod's `spec.schedulingGroup.podGroupName` from its `leaderworkerset.sigs.k8s.io/group-index` label. |
There was a problem hiding this comment.
Will there be any API discovery on k8s clusters for this feature?
I.e., will LWS check that this API is available, in addition to checking the annotation?
There was a problem hiding this comment.
Yes — at admission the webhook resolves v1alpha2 Workload + PodGroup via a cached RESTMapper, missing → reject with an error naming the missing GVK. See the new API Discovery and Prerequisites section. Caveat: the upstream GenericWorkload gate itself can't be discovered, so it stays an install-time prereq.
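The discovery check described in the reply above can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the KEP's implementation: `GVK`, `kindResolver`, `staticResolver`, `errNoMatch`, and `checkGangAPIs` are hypothetical names standing in for the cached RESTMapper lookup, and only the group, version, and kind strings come from the proposal text.

```go
package main

import (
	"errors"
	"fmt"
)

// GVK is a simplified stand-in for a Kubernetes GroupVersionKind.
type GVK struct{ Group, Version, Kind string }

func (g GVK) String() string {
	return g.Group + "/" + g.Version + ", Kind=" + g.Kind
}

// kindResolver is a hypothetical abstraction over the cached RESTMapper
// mentioned in the thread; the real webhook wiring may look different.
type kindResolver interface {
	Resolve(gvk GVK) error
}

// errNoMatch models the "kind is not registered" lookup failure.
var errNoMatch = errors.New("no match for kind")

// staticResolver resolves only the kinds present in its set.
type staticResolver map[GVK]bool

func (s staticResolver) Resolve(gvk GVK) error {
	if s[gvk] {
		return nil
	}
	return errNoMatch
}

// requiredKinds lists the v1alpha2 kinds named in the proposal.
var requiredKinds = []GVK{
	{Group: "scheduling.k8s.io", Version: "v1alpha2", Kind: "Workload"},
	{Group: "scheduling.k8s.io", Version: "v1alpha2", Kind: "PodGroup"},
}

// checkGangAPIs returns an error naming the first missing kind, or nil
// when every required kind resolves.
func checkGangAPIs(r kindResolver) error {
	for _, gvk := range requiredKinds {
		if err := r.Resolve(gvk); err != nil {
			return fmt.Errorf("gang scheduling requires %s, which is not registered: %w", gvk, err)
		}
	}
	return nil
}

func main() {
	// A cluster that serves Workload but not PodGroup is rejected.
	onlyWorkload := staticResolver{requiredKinds[0]: true}
	fmt.Println(checkGangAPIs(onlyWorkload))
}
```

As the reply notes, a check of this shape only sees which kinds are registered; it cannot observe whether the upstream GenericWorkload gate is enabled, which is why that stays an install-time prerequisite.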
There was a problem hiding this comment.
Should we go with an API for alpha with a feature gate instead?
Annotations are a bit hacky and difficult to deprecate. Plus I'm not sure how you plan to support ResourceClaims or TopologyAwareScheduling without going with an API.
There was a problem hiding this comment.
I'm fine with creating an API field. I did an annotation because it was an easy way to enable the prototype, but if we want to have an actual integration I agree we should add an API field
There was a problem hiding this comment.
> Should we go with an API for alpha with a feature gate instead?
> Annotations are a bit hacky and difficult to deprecate. Plus I'm not sure how you plan to support ResourceClaims or TopologyAwareScheduling without going with an API.

Hi @kannon92, quick check: did "feature gate" mean an LWS-side gate, or upstream GenericWorkload as the de-facto guard? The latest push takes the latter: typed alpha spec.gangScheduling, no LWS gate (matches SubGroupPolicy / RolloutStrategy.MaxSurge). The LWS-side feature-gate scaffold question is tracked separately in #850; happy to flip back if you meant the LWS-gate reading.
There was a problem hiding this comment.
> I'm fine with creating an API field. I did an annotation because it was an easy way to enable the prototype, but if we want to have an actual integration I agree we should add an API field

Done. The typed alpha spec.gangScheduling field replaces the annotation. It is an empty struct in alpha (presence = opt-in); future TAS / DRA / RC knobs are added additively. There is no LWS-side feature gate: webhook API discovery against upstream GenericWorkload is the guard. The scaffold question is tracked separately in #850.
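To illustrate what "presence = opt-in" means for an empty-struct pointer field, here is a small sketch. `GangSchedulingPolicy` and the `gangScheduling` JSON tag follow the KEP snippet quoted in this PR; `specSketch`, `gangEnabled`, `mustJSON`, and `parseSpec` are hypothetical helpers, and `specSketch` is a deliberately trimmed stand-in rather than the real LWS spec type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// GangSchedulingPolicy is empty in alpha; only its presence matters.
type GangSchedulingPolicy struct{}

// specSketch is a trimmed, hypothetical stand-in for the LWS spec.
type specSketch struct {
	Replicas       int32                 `json:"replicas"`
	GangScheduling *GangSchedulingPolicy `json:"gangScheduling,omitempty"`
}

// gangEnabled reports opt-in purely from pointer presence.
func gangEnabled(s specSketch) bool { return s.GangScheduling != nil }

// mustJSON serializes a spec, panicking on the (unexpected) error path.
func mustJSON(s specSketch) string {
	b, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	return string(b)
}

// parseSpec decodes a spec from its JSON form.
func parseSpec(raw string) (specSketch, error) {
	var s specSketch
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

func main() {
	off := specSketch{Replicas: 2}
	on := specSketch{Replicas: 2, GangScheduling: &GangSchedulingPolicy{}}
	fmt.Println(mustJSON(off), gangEnabled(off)) // {"replicas":2} false
	fmt.Println(mustJSON(on), gangEnabled(on))   // {"replicas":2,"gangScheduling":{}} true
}
```

Because a nil pointer is omitted and a non-nil pointer serializes as `{}`, fields added to the struct later do not change how existing objects opt in, which matches the "added additively" intent above.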
| - Validate [KEP-4671][kep4671] (Workload / PodGroup APIs) for multi-host inference use cases. | ||
| - Support autoscaling at the replica level. | ||
|
|
||
| ### Non-Goals |
There was a problem hiding this comment.
You should maybe call out the other WAS features as Non-Goals:
TAS, workload disruption, PodGroupResourceClaims.
There was a problem hiding this comment.
Added under Non-Goals: TAS (KEP-5732), workload-aware preemption / disruption (KEP-5710), PodGroup-shared ResourceClaims (KEP-5729). Escape Hatch is the alpha workaround.
|
|
||
| [kep6012]: https://github.com/kubernetes/enhancements/issues/6012 | ||
|
|
||
| ### Future Work: Hierarchical Gang via CompositePodGroup |
There was a problem hiding this comment.
Something else to call out here: a composite gang that places the leader and the workers in separate PodGroups.
CompositePodGroup serving-root
├─ PodGroup leader minCount = 1 (parentRef=serving-root)
└─ PodGroup workers minCount = lws.size - 1 (parentRef=serving-root)
There was a problem hiding this comment.
This covers the use case where the leader requests different resources from the worker, and the use case where we want to give priority to the leader when it comes to preemption
There was a problem hiding this comment.
Done in 15ed9ce — Limitations §Per-role gang policy + Future Work §Single-LWS per-role split (per-replica CompositePodGroup tree).
Open question: any concrete LWS workload where splitting leader and workers into separate PodGroups actually helps? In vLLM head and workers all sit inside the same TP group — same GPU shape, all-or-nothing — so two leaf PodGroups give the same scheduling outcome as a single MinCount = size PodGroup. Heterogeneous resources / leader preemption priority make sense in the abstract but I can't map them to a real LWS workload yet.
There was a problem hiding this comment.
> any concrete LWS workload where splitting leader and workers into separate PodGroups actually helps?

Yes, take a look at axlearn, for example: https://github.com/apple/axlearn/blob/main/axlearn/cloud/gcp/pathways_utils.py#L1611. Their leader only requests CPU resources, while the workers request TPUs.
To guarantee that the workers all fall into the same TPU slice, while also being able to run the leader in a separate CPU nodepool, they use the `LeaderOnly` subgroup policy + subgroup-exclusive-topology. That use case would be covered by having separate PodGroups for leader and workers.
There was a problem hiding this comment.
> > any concrete LWS workload where splitting leader and workers into separate PodGroups actually helps?
>
> Yes, take a look at axlearn for example https://github.com/apple/axlearn/blob/main/axlearn/cloud/gcp/pathways_utils.py#L1611. Their leader only requests CPU resources, while the workers request TPUs.
> To guarantee that the workers all fall into the same TPU slice, while also being able to run the leader in a separate CPU nodepool, they use `LeaderOnly` subgroup policy + subgroup-exclusive-topology. That use case would be covered by having separate PodGroups for leader and workers

Thanks, pulled into the KEP as the heterogeneous-role motivating shape (CPU leader + single-accelerator-slice workers). Concrete reference: axlearn pathways_utils.py#L1611.
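The leader/worker split proposed in this thread (leader `minCount = 1`, workers `minCount = lws.size - 1`, both pointing at a shared parent) can be written down as a small sketch. `podGroupSketch`, `perRoleSplit`, and the `-leader` / `-workers` name suffixes are hypothetical; the concrete CompositePodGroup shape is left to upstream KEP-6012.

```go
package main

import (
	"errors"
	"fmt"
)

// podGroupSketch is a hypothetical leaf of a per-replica composite tree.
type podGroupSketch struct {
	Name      string
	ParentRef string
	MinCount  int
}

// perRoleSplit derives the two leaf groups for one replica of the given size:
// the leader gangs alone and the workers gang together.
func perRoleSplit(root string, size int) ([]podGroupSketch, error) {
	if size < 2 {
		return nil, errors.New("per-role split needs a leader and at least one worker")
	}
	return []podGroupSketch{
		{Name: root + "-leader", ParentRef: root, MinCount: 1},
		{Name: root + "-workers", ParentRef: root, MinCount: size - 1},
	}, nil
}

func main() {
	groups, err := perRoleSplit("serving-root", 4)
	if err != nil {
		panic(err)
	}
	for _, g := range groups {
		fmt.Printf("%s minCount=%d parentRef=%s\n", g.Name, g.MinCount, g.ParentRef)
	}
}
```

The two minimums always sum to the replica size, so the split only changes the scheduling outcome when the roles differ in resources or priority, as in the axlearn example.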
Per kubernetes-sigs#844 review (kannon92, Edwinhr716):
- Replace the gang-scheduling annotation with a typed spec.gangScheduling *GangSchedulingPolicy field, gated by the GangScheduling feature gate (off by default; empty struct = opt-in).
- Promote API discovery to a first-class "API Discovery and Prerequisites" design subsection.
- Split the project-wide pkg/features scaffold to kubernetes-sigs#850 as a prerequisite; KEP-666 only adds the GangScheduling constant.

Signed-off-by: Kay Yan <kay.yan@daocloud.io>

Expand the single `minCount < size` Limitations bullet into a `Per-role gang policy` umbrella with three sub-cases (leader-first gang, leader preemption priority, heterogeneous role minimums) and add a matching `Single-LWS per-role split` Future Work subsection with the per-replica CompositePodGroup tree. Addresses Edwinhr716's review on PR kubernetes-sigs#844.

Signed-off-by: Kay Yan <kay.yan@daocloud.io>

Per kubernetes-sigs#844 review (kannon92, Edwinhr716):
- Replace the gang-scheduling annotation with a typed alpha spec.gangScheduling *GangSchedulingPolicy field (presence = opt-in; TAS / DRA / ResourceClaims / hierarchical knobs added additively as upstream stabilizes them).
- Promote API discovery to a first-class "API Discovery and Prerequisites" subsection.
- Drop the LWS-side feature gate; upstream GenericWorkload is the de-facto kill switch, propagated to admission via webhook discovery. pkg/features scaffold tracked separately in kubernetes-sigs#850.
- Split the single `minCount < size` Limitations bullet into a Per-role gang policy umbrella (leader-first, leader preemption priority, heterogeneous role minimums) with a matching Single-LWS per-role split Future Work subsection.
- Add other WAS features to Non-Goals: TAS (KEP-5732), workload-aware preemption (KEP-5710), PodGroup-shared ResourceClaims (KEP-5729); Escape Hatch is the alpha workaround.
- Trim Implementation History to milestone entries.

Signed-off-by: Kay Yan <kay.yan@daocloud.io>
Document the upstream Workload and PodGroup API integration for LWS as a parallel path to KEP-407.

Design highlights:
- Opt-in via a typed alpha spec.gangScheduling *GangSchedulingPolicy field (presence = opt-in; TAS / DRA / ResourceClaims / hierarchical knobs added additively as upstream stabilizes them).
- spec.gangScheduling is the umbrella opt-in for both LWS-managed mode and the escape-hatch sub-mode; admission rejects a pre-set pod.spec.schedulingGroup without it, keeping spec.gangScheduling as the single source of truth for "this LWS uses gang scheduling".
- Lifecycle: lws_controller creates one Workload <lws-name>; pod_controller creates one PodGroup per replica named <lws-name>-<group-index> (= leader pod name).
- Escape Hatch: spec.gangScheduling set together with a pre-set pod.spec.schedulingGroup opts out of LWS-managed lifecycle (Job controller pattern), enabling external owners (e.g. Kueue, DisaggregatedSet) without LWS-side API surface.
- Admission rules: reject mutation of spec.gangScheduling, gang + LeaderReady, gang + exclusive-topology, Size mutation in LWS-managed mode, gang when v1alpha2 API resources are not registered, and a pre-set pod.spec.schedulingGroup without spec.gangScheduling.
- API discovery against upstream GenericWorkload at admission, with cached RESTMapper invalidation on NoMatchError so installing the API takes effect on the next admission without an LWS restart.
- No LWS-side feature gate; upstream GenericWorkload is the de-facto kill switch (project-wide pkg/features scaffold tracked separately in kubernetes-sigs#850).
- Per-role gang policy split (leader-first, leader preemption priority, heterogeneous role minimums) documented as Limitations of alpha and Single-LWS per-role split Future Work via KEP-6012.
- Forward-compat sketch for KEP-6012 (CompositePodGroup): cross-LWS gangs (e.g. DisaggregatedSet) layer on the escape hatch via an external Workload owner; concrete tree shape owned by KEP-766.
- Other WAS features as Non-Goals: TAS (KEP-5732), workload-aware preemption (KEP-5710), PodGroup-shared ResourceClaims (KEP-5729); Escape Hatch is the alpha workaround.

Signed-off-by: Kay Yan <kay.yan@daocloud.io>
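The admission rules listed in this commit message can be read as a pair of validation functions. The sketch below is only an illustration under stated assumptions: `lwsSketch` and its boolean fields are hypothetical flattenings of the real LWS spec, the error strings are invented, and "LWS-managed mode" is taken to mean `spec.gangScheduling` set without a pre-set `pod.spec.schedulingGroup`.

```go
package main

import "fmt"

// lwsSketch is a hypothetical, flattened view of the fields the rules inspect.
type lwsSketch struct {
	Gang                  bool // spec.gangScheduling is set
	PresetSchedulingGroup bool // pod template carries pod.spec.schedulingGroup
	Size                  int
	LeaderReadyStartup    bool
	ExclusiveTopology     bool
}

// managed reports LWS-managed mode: gang opt-in without the escape hatch.
func (l lwsSketch) managed() bool { return l.Gang && !l.PresetSchedulingGroup }

// validateCreate applies the rules that depend on a single object.
func validateCreate(l lwsSketch, apisRegistered bool) []string {
	var errs []string
	if l.PresetSchedulingGroup && !l.Gang {
		errs = append(errs, "pod.spec.schedulingGroup is pre-set without spec.gangScheduling")
	}
	if !l.Gang {
		return errs
	}
	if !apisRegistered {
		errs = append(errs, "v1alpha2 Workload/PodGroup resources are not registered")
	}
	if l.LeaderReadyStartup {
		errs = append(errs, "gang scheduling cannot be combined with LeaderReady")
	}
	if l.ExclusiveTopology {
		errs = append(errs, "gang scheduling cannot be combined with exclusive-topology")
	}
	return errs
}

// validateUpdate adds the rules that compare the old and current objects.
func validateUpdate(old, cur lwsSketch, apisRegistered bool) []string {
	errs := validateCreate(cur, apisRegistered)
	if old.Gang != cur.Gang {
		errs = append(errs, "spec.gangScheduling is immutable")
	}
	if old.managed() && cur.managed() && old.Size != cur.Size {
		errs = append(errs, "size is immutable in LWS-managed mode")
	}
	return errs
}

func main() {
	old := lwsSketch{Gang: true, Size: 4}
	cur := lwsSketch{Gang: true, Size: 8}
	fmt.Println(validateUpdate(old, cur, true))
}
```

In this reading, an escape-hatch object (both flags set) may still change `Size`, since an external owner manages the PodGroups in that mode.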
| type GangSchedulingPolicy struct{} | ||
| ``` | ||
|
|
||
| No LWS-side feature gate: the upstream `GenericWorkload` gate already controls whether `kube-apiserver` preserves `pod.spec.schedulingGroup`, and the webhook's [API discovery](#api-discovery-and-prerequisites) propagates that into LWS admission. Matches how LWS handles other typed alpha fields (`SubGroupPolicy`, `RolloutStrategy.MaxSurge`); a project-wide `pkg/features` scaffold, if ever needed, is tracked in [#850][lws-feature-gate-issue]. |
There was a problem hiding this comment.
So once the GenericWorkload feature gate is promoted to stable in Kubernetes, will it be GA'ed here too?
There was a problem hiding this comment.
Yes — LWS GangScheduling tracks the upstream GenericWorkload lifecycle: alpha → beta (default-on, aligned with upstream beta) → removed at GA. See updated §Graduation Criteria.
| // upstream Workload / PodGroup APIs (one PodGroup per replica, | ||
| // MinCount = LeaderWorkerTemplate.Size). Alpha; subject to change. | ||
| // +optional | ||
| GangScheduling *GangSchedulingPolicy `json:"gangScheduling,omitempty"` |
There was a problem hiding this comment.
This field cannot be set if the feature gate is not enabled on the cluster, right?
There was a problem hiding this comment.
What would happen if LWS (including this feature) is deployed onto an old Kubernetes cluster? As far as I understand, this is backwards compatible, right?
There was a problem hiding this comment.
> What would happen if LWS (including this feature) is deployed onto an old Kubernetes cluster? As far as I understand, this is backwards compatible, right?

Yes. Added an explicit Backwards Compatibility section covering the three cases:
- Field unset → zero behavior change.
- Field set, v1alpha2 APIs missing → admission rejects with the missing GVK named; no half-created objects.
- APIs registered but GenericWorkload gate off → install-time prerequisite, not a runtime failure.
There was a problem hiding this comment.
> This field cannot be set if the feature gate is not enabled on the cluster, right?

Yes. spec.gangScheduling is an LWS field, but the LWS validating webhook rejects it at admission if the upstream v1alpha2 Workload/PodGroup GVKs aren't registered (see API Discovery).
| Rejected: `MinCount` only requires M co-scheduled pods, with no notion of which replica they belong to — the scheduler may legally pick M pods from different replicas, none complete, and the model still cannot start. Per-replica PodGroups make each replica an independent all-or-nothing unit. | ||
|
|
||
| **Rely on [KEP-407][kep407] only**. | ||
| KEP-407 targets third-party schedulers (Volcano / coscheduling / YuniKorn) via their own PodGroup CRDs; this KEP targets the upstream-native `scheduling.k8s.io/v1alpha2` Workload and PodGroup APIs. The two evolve independently — different prerequisites, different API surfaces, no shared data path — and a single LWS object opts into at most one. |
There was a problem hiding this comment.
As described above, we have already added Volcano-based gang scheduling. Are users expected to choose one of them (Kubernetes gang scheduling or Volcano)?
There was a problem hiding this comment.
Good point — pushed an update so they're not mutually exclusive. New Unified Provider Model makes spec.gangScheduling the single opt-in across all gang backends (this KEP + KEP-407's Volcano; KAI / coscheduling possible later), following Grove's Backend framework.
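For the rejected single-shared-PodGroup alternative quoted in the diff context of this thread, a tiny worked example may help show why a scalar `MinCount` is not enough. This is a simplified model, not scheduler code: `placeable[i]` is a hypothetical count of pods from replica `i` that currently fit, and `sharedGroupSatisfied` / `completeReplicas` are illustrative helpers.

```go
package main

import "fmt"

// sharedGroupSatisfied models one shared PodGroup: it only counts pods,
// with no notion of which replica each pod belongs to.
func sharedGroupSatisfied(placeable []int, minCount int) bool {
	total := 0
	for _, n := range placeable {
		total += n
	}
	return total >= minCount
}

// completeReplicas models per-replica PodGroups with MinCount = size:
// a replica counts only when all of its pods fit.
func completeReplicas(placeable []int, size int) int {
	complete := 0
	for _, n := range placeable {
		if n >= size {
			complete++
		}
	}
	return complete
}

func main() {
	// Two replicas of size 4; only two pods of each replica fit right now.
	placeable := []int{2, 2}
	size := 4

	fmt.Println(sharedGroupSatisfied(placeable, size)) // true: 4 pods fit in total
	fmt.Println(completeReplicas(placeable, size))     // 0: no replica is complete
}
```

The shared group is satisfied even though no replica can start, which is the partial-replica outcome that per-replica PodGroups are meant to rule out.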
- Unified Provider Model: spec.gangScheduling as the single opt-in across all gang backends; upstream v1alpha2 schema is the reference, third-party backends honor a subset. Backend chosen via KEP-407's existing --gang-scheduler-provider flag.
- Backwards Compatibility section: covers field-unset, gate-off, and v1alpha2-APIs-missing cases.
- Adopt LWS GangScheduling feature gate (alpha=false -> beta=true -> removed at GA) tracking upstream GenericWorkload lifecycle, after the pkg/features scaffold landed. Reverses the earlier no-LWS-gate decision.

Signed-off-by: Kay Yan <kay.yan@daocloud.io>
What type of PR is this?
/kind feature
/kind documentation
What this PR does / why we need it
Initial draft of KEP-666: Gang Scheduling in LWS, integrating the upstream Workload and PodGroup APIs (alpha, kubernetes/enhancements#5558, kubernetes/enhancements#5832) as a gang-scheduling provider for `LeaderWorkerSet`, alongside the existing third-party path in KEP-407.
Key design points
- `replicas: N, size: M` produces N PodGroups, each with `MinCount: M`. A single shared PodGroup with `Replicas=N, MinCount=M` cannot express "every replica is complete" under the alpha API, so it would not actually prevent partial-replica scheduling.
- `<lws-name>-<i>` is created before any pod with `group-index=i` (covering first creation, scale-up, and `maxSurge` bursts); LWS owns the Workload and steady-state PodGroups; the Workload `podGroupTemplate` is created once and never mutated; PodGroups are keyed by `group-index` and reused across revisions, with `minCount` updated in place when `Size` changes.
- […] `gang.podGroupNamePrefix`) was considered and dropped; see Alternatives. It can be revisited once a concrete external consumer needs it.
See also "Limitations of the alpha PodGroup API" in the KEP for why a single `MinCount` scalar cannot express the per-role availability needed by KEP-766 DisaggregatedSet.
Related work
Parallel efforts on the same upstream APIs (KEP-4671 + KEP-5832) in sibling projects, listed for reviewer context:
- `GangConfig` KEP (currently on the older alpha1 API; under design discussion).
- […] `pod.spec.schedulingGroup`) leaves room for Kueue or any future external owner to drive `Workload` lifecycle with zero new LWS-side API surface.
LWS targets the newer `scheduling.k8s.io/v1alpha2` decoupled API; reviewer questions raised on JobSet #1068 (owner refs, Workload lifecycle, defaulting/validation, feature-gate posture) are addressed in the corresponding KEP-666 sections.
Which issue(s) this PR fixes
Fixes #666
Special notes for your reviewer
Does this PR introduce a user-facing change?