Commit 2f42cdd

kep75 topology based role coordinated scheduler
Signed-off-by: bcfre <guo0xiong1feng@gmail.com>
1 parent ccb66a6 commit 2f42cdd

2 files changed

Lines changed: 324 additions & 0 deletions

Lines changed: 306 additions & 0 deletions
@@ -0,0 +1,306 @@
1+
# KEP-75: Enhanced Topology-Based Multi-Role Coordinated Scheduling
2+
3+
4+
5+
## Summary
6+
This KEP extends [KEP-30 (Role Coordination for RoleBasedGroup)](../30-role-coordination/README.md) by introducing proportional, multi-role rolling deployment (Rolling Deploy). This feature enables users to define cross-role deployment steps, ensuring that Pods across different roles maintain the desired ratio even when cluster resources are constrained. It also provides fine-grained scheduling control at the role level.
7+
8+
## Motivation
9+
10+
1. **Need for Fine-Grained Placement Policies**: Serving jobs are long-running and do not support rescheduling. Therefore, fine-grained placement policies are required during initial scheduling to achieve a globally optimal placement strategy.
11+
12+
2. **Network Topology Optimization**: In prefill/decode (PD) disaggregated deployments, placing the prefill (P) and decode (D) instances that serve the same request in nearby network topology domains reduces KV cache transfer latency and increases throughput.
13+
14+
3. **Improved Service Fault Tolerance**: For inference services, dispersing homogeneous instances of the same role across different topology domains enhances the service's fault tolerance.
15+
16+
### Goals
17+
1. **Cross-Role Topology Affinity Deployment**: Support topology affinity deployment for instances of specific roles (e.g., P:D) within corresponding sub-batches.
18+
19+
2. **Intra-Role Anti-Affinity Deployment**: Support topology anti-affinity deployment among instances within the same role to enhance service fault tolerance.
20+
21+
3. **Group Scheduling Capability**: Support group scheduling capability for each minimal topology affinity batch of Pods.
22+
23+
### Non-Goals
24+
1. **Scenario Limitation**: Currently, only non-scaling scenarios are considered.
25+
26+
2. **Cross-Group Coordination**: Coordination across multiple RoleBasedGroupSets is not supported.
27+
28+
3. **Non-Workload Resources**: Coordination for non-workload resources such as ConfigMaps or Secrets is not supported.
29+
30+
31+
## Proposal
32+
33+
### User Stories
34+
#### User Story 1: Ensuring Optimal PD Instance Ratio Within a Topology Domain
35+
36+
**Scenario**: When the P and D instances in each topology domain maintain the optimal ratio, each P instance can prefer D instances in its own topology domain when selecting a decode target.
37+
38+
**Example**: With an optimal ratio P:D=2:1:
39+
- Zone A: Deploy P1, P2, D1
40+
- Zone B: Deploy P3, P4, D2
41+
42+
P1, P2, and D1 form a virtual subgroup where the network is not a bottleneck for KV transmission between P and D. Furthermore, PD roles from subgroup1 and subgroup2 can still pair with each other in the router.
43+
44+
#### User Story 2: Cluster Fault Tolerance Across Topology Domains
45+
46+
**Problem**: Cluster physical resources can fail at any time. If all P or D nodes are deployed in a specific topology domain (e.g., Node1), and Node1 fails, the entire LLM inference service crashes due to the lack of a critical role.
47+
48+
**Solution**: Disperse instances of the same role across different topology domains. If one topology domain fails, the remaining P and D instances can still serve inference requests in a degraded mode.
49+
50+
#### User Story 3: Resource-Constrained Environment
51+
52+
**Scenario**: A user wants to deploy a 4P4D service (minimum viable ratio P:D=2:2).
53+
54+
**Problems with Uncoordinated Rolling Deployment**:
55+
- Only schedule 4P Pods → Service cannot run
56+
- Only schedule 4D Pods → Service cannot run
57+
- Group schedule all 8 Pods → Startup fails due to insufficient resources
58+
59+
**Advantages of Coordinated Rolling Deployment**:
60+
The operator deploys Pods progressively in steps of 2P:2D:
61+
- Step 1: 2P + 2D
62+
- Step 2: Next group of 2P + 2D
63+
- Continue until all replicas are deployed
64+
65+
This ensures the service remains operational at all intermediate stages and avoids resource wastage.
66+
67+
68+
## Design Details
69+
70+
71+
### Cluster Topology Definition
72+
73+
- **New CRD**: Introduce a Custom Resource Definition (CRD) to describe the cluster topology hierarchy.
74+
- **Administrative Control**: This CRD object is created by the cluster administrator. Users can directly reference the topology name.
75+
- **Resource Protection**: The RBG operator validates the topology configuration and, once validation succeeds, adds a finalizer to protect the object.
76+
- **Immutability**: A ClusterTopology object is immutable after creation and cannot be deleted while an RBG references it.
77+
78+
#### API Extensions
79+
```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +kubebuilder:printcolumn:name="CurrentTopologyLevel",type="string",JSONPath=".status.currentTopologyLevel"
type ClusterTopology struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ClusterTopologySpec   `json:"spec"`
	Status ClusterTopologyStatus `json:"status,omitempty"`
}

type ClusterTopologySpec struct {
	// Layers lists the topology layers.
	// If TopologyScope is empty, the slice index maps from smaller to larger network domains.
	Layers []TopologyLayer `json:"layers"`
	// TopologyScope ranks each layer. Larger numbers indicate larger managed network
	// domains with worse communication performance.
	// The field is exported so that it is serialized with the object.
	TopologyScope map[TopologyLayerName]int `json:"topologyScope,omitempty"`
}

type TopologyLayer struct {
	// TopologyLayerName is the name users reference, for example "zone".
	TopologyLayerName TopologyLayerName `json:"name"`

	// MatchLabelKey is the node label key that identifies domains of this layer.
	MatchLabelKey string `json:"key"`
}

type TopologyLayerName string

type ClusterTopologyStatus struct {
	// Output format example: Numa < Host < TOR < Pod < DataCenter < Zone < Region < Vendor < ALL
	CurrentTopologyLevel string `json:"currentTopologyLevel,omitempty"`
	// Validate reports whether the topology structure definition is correct.
	// If validate = false, this topology cannot be referenced in RBG. Normally this should be handled by Webhook validation.
	Validate bool `json:"validate"`
}
```
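The status comment above describes an ordered string such as `Numa < Host < TOR`. The following is a minimal sketch of how such a string could be derived from the layers, assuming that an empty scope map means "use the slice index". The helper `renderTopologyLevel` and the trimmed `TopologyLayer` struct are hypothetical and only illustrate the ordering rule; the node label keys are standard Kubernetes labels chosen for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// TopologyLayer is a trimmed stand-in for the proposed API type.
type TopologyLayer struct {
	Name string // layer name, e.g. "Host"
	Key  string // node label key that identifies domains of this layer
}

// renderTopologyLevel is a hypothetical helper. It orders layers from the
// smallest to the largest network domain and joins their names with " < ".
// When scope is empty, the slice index defines the order.
func renderTopologyLevel(layers []TopologyLayer, scope map[string]int) string {
	ordered := make([]TopologyLayer, len(layers))
	copy(ordered, layers)
	if len(scope) > 0 {
		// A stable sort keeps the declared order for layers with equal rank.
		sort.SliceStable(ordered, func(i, j int) bool {
			return scope[ordered[i].Name] < scope[ordered[j].Name]
		})
	}
	names := make([]string, 0, len(ordered))
	for _, l := range ordered {
		names = append(names, l.Name)
	}
	return strings.Join(names, " < ")
}

func main() {
	layers := []TopologyLayer{
		{Name: "Zone", Key: "topology.kubernetes.io/zone"},
		{Name: "Host", Key: "kubernetes.io/hostname"},
	}
	// Without a scope map the declared order is used as-is.
	fmt.Println(renderTopologyLevel(layers, nil))
	// With a scope map the layers are sorted by rank.
	fmt.Println(renderTopologyLevel(layers, map[string]int{"Host": 1, "Zone": 2}))
}
```

Declaring layers in index order keeps the common case free of extra configuration; the scope map is only needed when the declared order does not already run from small to large domains.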
119+
120+
121+
### RBG Operator Enhancements
122+
123+
124+
In large model inference services, considering service fault tolerance and resource-constrained cluster scenarios, it is often impractical to deploy all RBG workloads within a single network domain. Optimization strategies include:
125+
126+
1. **Intra-Role-Instance Aggregation**: Aggregate P or D instances within high-performance networks.
127+
2. **Cross-Role Optimization**: P and D roles need high-speed communication, but not all instances can be placed in the same topology domain. Instead, place P and D Pods at the optimal ratio within each network topology domain.
128+
129+
#### API Extensions
130+
131+
To schedule instances of different roles with topology affinity at a fixed ratio, each role is further partitioned internally into virtual objects called RoleSubGroups.
132+
133+
```go
// RoleBasedGroupSpec defines the desired state of RoleBasedGroup.
type RoleBasedGroupSpec struct {
	// +kubebuilder:pruning:PreserveUnknownFields
	// +kubebuilder:validation:MinItems=1
	// +kubebuilder:validation:Required
	// +patchMergeKey=name
	// +patchStrategy=merge
	// +listType=map
	// +listMapKey=name
	Roles []RoleSpec `json:"roles" patchStrategy:"merge" patchMergeKey:"name"`

	// Configuration for the PodGroup to enable gang-scheduling via supported plugins.
	PodGroupPolicy *PodGroupPolicy `json:"podGroupPolicy,omitempty"`

	// CoordinationRequirements describes the requirements of coordination strategies for some specified roles.
	// +patchMergeKey=name
	// +patchStrategy=merge
	// +listType=map
	// +listMapKey=name
	CoordinationRequirements []Coordination `json:"coordination,omitempty" patchStrategy:"merge" patchMergeKey:"name"`
}

// Coordination describes the requirements of coordination strategies for roles.
type Coordination struct {
	// Name of the coordination.
	Name string `json:"name"`

	// Roles that should be constrained by this coordination.
	Roles []string `json:"roles"`

	// Strategy describes the coordination strategies.
	Strategy *CoordinationStrategy `json:"strategy,omitempty"`
}

type CoordinationStrategy struct {
	// RollingUpdate defines the coordination strategies about rolling update.
	RollingUpdate *CoordinationRollingUpdate `json:"rollingUpdate,omitempty"`

	// new field
	// RollingDeploy defines the coordination strategies about rolling deploy.
	RollingDeploy *CoordinationRollingDeploy `json:"rollingDeploy,omitempty"`
}

// new type
// CoordinationRollingDeploy describes the rolling deploy coordination strategy.
type CoordinationRollingDeploy struct {
	// Number of Pods to deploy in each step, keyed by role name. The values also give
	// the P:D ratio that should form a logical group during deployment.
	RoleDeployStep map[string]int `json:"roleDeployStep"`
	// Topology requirements for the current logical group
	DeployTopologyConstraint *DeployTopologyConstraint `json:"deployTopologyConstraint,omitempty"`
}

type DeployTopologyConstraint struct {
	// TODO: Consider whether this should be globally unique or allow users to customize the configuration.
	// If not configured by user, use default value: rbg-cluster-topology
	// The field is exported so that it is serialized with the object.
	ClusterTopologyName *string `json:"clusterTopologyName,omitempty"`

	// Declares which topology level the current batch of Pods should be constrained to
	LayerName *TopologyLayerName `json:"topologyLayerName,omitempty"`

	// Default value: hard
	InternalConstraintMode TopologyConstraintMode `json:"internalConstraintMode,omitempty"`
	// Default value: soft
	ExternalConstraintMode TopologyConstraintMode `json:"externalConstraintMode,omitempty"`

	// Configuration for the PodGroup to enable gang-scheduling via supported plugins.
	PodGroupPolicy *PodGroupPolicy `json:"podGroupPolicy,omitempty"`
}

// +enum
type TopologyConstraintMode string

const (
	// HardTopologyConstraintMode represents a strict network topology constraint that workload must adhere to.
	HardTopologyConstraintMode TopologyConstraintMode = "hard"

	// SoftTopologyConstraintMode represents a flexible network topology constraint that allows workload
	// to cross network boundaries under certain conditions.
	SoftTopologyConstraintMode TopologyConstraintMode = "soft"
)
```
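The field comments above name three defaults: the cluster topology `rbg-cluster-topology`, `hard` for the internal constraint mode, and `soft` for the external one. A minimal sketch of how a defaulting step could apply them follows. The `constraint` struct is a trimmed stand-in for `DeployTopologyConstraint`, and `applyDefaults` is a hypothetical helper; whether defaulting happens in a webhook or in the controller is not specified by this KEP.

```go
package main

import "fmt"

type TopologyConstraintMode string

const (
	HardTopologyConstraintMode TopologyConstraintMode = "hard"
	SoftTopologyConstraintMode TopologyConstraintMode = "soft"
)

// defaultClusterTopologyName is the default named in the field comment.
const defaultClusterTopologyName = "rbg-cluster-topology"

// constraint is a trimmed, illustrative stand-in for DeployTopologyConstraint.
type constraint struct {
	ClusterTopologyName    *string
	InternalConstraintMode TopologyConstraintMode
	ExternalConstraintMode TopologyConstraintMode
}

// applyDefaults is a hypothetical helper that fills only the unset fields,
// so explicit user choices are never overwritten.
func applyDefaults(c *constraint) {
	if c.ClusterTopologyName == nil {
		name := defaultClusterTopologyName
		c.ClusterTopologyName = &name
	}
	if c.InternalConstraintMode == "" {
		c.InternalConstraintMode = HardTopologyConstraintMode
	}
	if c.ExternalConstraintMode == "" {
		c.ExternalConstraintMode = SoftTopologyConstraintMode
	}
}

func main() {
	// The user sets only the external mode; the rest is defaulted.
	c := &constraint{ExternalConstraintMode: HardTopologyConstraintMode}
	applyDefaults(c)
	fmt.Println(*c.ClusterTopologyName, c.InternalConstraintMode, c.ExternalConstraintMode)
}
```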
215+
216+
217+
#### Coordinated-Roles Deploy YAML Example
218+
The following example coordinates the rolling deploy of the prefill and decode roles:
219+
```yaml
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: nginx-cluster
spec:
  coordination:
    - name: pd-rollout-deploy
      # roles is a required field of Coordination
      roles:
        - prefill
        - decode
      strategy:
        rollingDeploy:
          # key name assumed from the RoleDeployStep field of CoordinationRollingDeploy
          roleDeployStep:
            prefill: 4
            decode: 2
          deployTopologyConstraint:
            clusterTopologyName: rbg-cluster-topology
            topologyLayerName: zone
            internalConstraintMode: hard
            externalConstraintMode: soft
            podGroupPolicy: null
    - name: pd-rollout
      roles:
        - prefill
        - decode
      strategy:
        rollingUpdate:
          maxSkew: 1%
          maxUnavailable: 10%
  roles:
    - name: prefill
      replicas: 8
      template:
        spec:
          containers:
            - image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
              name: nginx-leader
    - name: decode
      replicas: 4
      template:
        spec:
          containers:
            - image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
              name: nginx-worker
```
262+
263+
**Example Scenario**:
264+
- RBG object with 8P, 4D
265+
- Minimum optimal ratio between roles is P:D = 4:2
266+
- Each 4P:2D group must satisfy the same topology affinity requirements
267+
- Divided into two deployment batches
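The batch count in this scenario follows from dividing each role's replicas by its step size. The sketch below works through that arithmetic under one assumption that the KEP does not state: replica counts that are not an exact multiple of the step, or roles that disagree on the batch count, are rejected. `planBatches` is a hypothetical helper, not part of the proposed API.

```go
package main

import (
	"errors"
	"fmt"
)

// planBatches is a hypothetical helper. Given per-role replica counts and
// per-role step sizes, it returns the number of deployment batches.
// Assumption: every role must split into the same whole number of batches.
func planBatches(replicas, step map[string]int) (int, error) {
	batches := -1
	for role, size := range step {
		if size <= 0 {
			return 0, fmt.Errorf("role %q: step must be positive", role)
		}
		total, ok := replicas[role]
		if !ok {
			return 0, fmt.Errorf("role %q: no replica count", role)
		}
		if total%size != 0 {
			return 0, fmt.Errorf("role %q: %d replicas is not a multiple of step %d", role, total, size)
		}
		n := total / size
		if batches != -1 && n != batches {
			return 0, errors.New("roles disagree on the number of batches")
		}
		batches = n
	}
	if batches == -1 {
		return 0, errors.New("no step sizes given")
	}
	return batches, nil
}

func main() {
	// 8P and 4D deployed in steps of 4P:2D.
	n, err := planBatches(
		map[string]int{"prefill": 8, "decode": 4},
		map[string]int{"prefill": 4, "decode": 2},
	)
	fmt.Println(n, err)
}
```

For the scenario above, both roles divide into two batches (8/4 and 4/2), which matches the two deployment batches listed.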
268+
269+
#### Controller Logic
270+
271+
**Label Propagation Mechanism**:
272+
The RBG Operator propagates the role subgroup size to the underlying workloads as a label. The workload controller (the InstanceSet Operator) computes each instance's subgroup from the subgroup size and the role instance index, then configures the corresponding affinity policy on the role's Pods.
273+
274+
**Workload Label Example**:
275+
```yaml
# Label values must be strings, so numeric values are quoted.
rolebasedgroup.workloads.x-k8s.io/name: rbgs-test-1
rolebasedgroup.workloads.x-k8s.io/role: p
rolebasedgroup.workloads.x-k8s.io/role-subgroup-size: "4"
rolebasedgroup.workloads.x-k8s.io/topology-coordination-name: p-d-4-2-deploy-test
```
281+
282+
**Deploy Topology Coordination Name Generation Strategy**:
283+
```go
// Integer division maps a role instance index to its subgroup index.
hash(fmt.Sprintf("%s-%d", topologyCoordinationName, roleInstanceIndex/roleSubgroupSize))
```
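To make the formula concrete, the sketch below computes the subgroup index by integer division and builds the per-subgroup name. The function names are hypothetical, and FNV-1a is used only as a stand-in, because the KEP does not say which hash function is intended.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// subgroupIndex returns the subgroup a role instance belongs to.
// subgroupSize must be positive.
func subgroupIndex(roleInstanceIndex, subgroupSize int) int {
	return roleInstanceIndex / subgroupSize
}

// deployCoordinationName builds the readable per-subgroup name,
// for example "p-d-4-2-deploy-test-0".
func deployCoordinationName(coordinationName string, roleInstanceIndex, subgroupSize int) string {
	return fmt.Sprintf("%s-%d", coordinationName, subgroupIndex(roleInstanceIndex, subgroupSize))
}

// hashName shortens a name to 8 hex characters.
// Assumption: FNV-1a (32-bit) stands in for the unspecified hash.
func hashName(name string) string {
	h := fnv.New32a()
	_, _ = h.Write([]byte(name))
	return fmt.Sprintf("%08x", h.Sum32())
}

func main() {
	const name = "p-d-4-2-deploy-test"
	// Prefill instance 5 with subgroup size 4 and decode instance 2 with
	// subgroup size 2 both land in subgroup 1, so they share one name.
	p := deployCoordinationName(name, 5, 4)
	d := deployCoordinationName(name, 2, 2)
	fmt.Println(p, d, p == d)
	fmt.Println(len(hashName(p)))
}
```

Because P and D instances of the same subgroup resolve to the same name, an affinity rule keyed on that value can co-locate one 4P:2D group without knowing anything about the other groups.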
286+
287+
**Pod Label Example**:
288+
```yaml
# Label values must be strings, so numeric values are quoted.
rolebasedgroup.workloads.x-k8s.io/name: rbgs-test-1
rolebasedgroup.workloads.x-k8s.io/role: p
rolebasedgroup.workloads.x-k8s.io/role-subgroup-size: "4"
rolebasedgroup.workloads.x-k8s.io/role-instance-index: "1"
rolebasedgroup.workloads.x-k8s.io/topology-coordination-name: p-d-4-2-deploy-test
rolebasedgroup.workloads.x-k8s.io/deploy-topology-coordination-name: p-d-4-2-deploy-test-0
```
296+
297+
### Progressive Scheduling Deployment
298+
299+
**Implementation Mechanism**: Batch scheduling relies on Pod scheduling gates (`spec.schedulingGates`). Batches are released in order:
300+
- Wait until the Pods of the previous batch are successfully scheduled
301+
- Remove the PodSchedulingGate for the Pods of the next batch
302+
- Trigger the scheduling process for the next batch of Pods
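The steps above can be sketched as a small release loop. This is an illustration under stated assumptions, not the controller's code: `pod` is a minimal stand-in for a Kubernetes Pod, the gate name `rbg.example/rolling-deploy` is hypothetical, and a Pod counts as scheduled once its node name is set.

```go
package main

import "fmt"

// batchGate is a hypothetical scheduling gate name.
const batchGate = "rbg.example/rolling-deploy"

// pod is a minimal stand-in for a Kubernetes Pod.
type pod struct {
	Name            string
	Batch           int      // deployment batch the Pod belongs to
	NodeName        string   // non-empty once the Pod is scheduled
	SchedulingGates []string // gates that block scheduling while present
}

func hasGate(p *pod) bool {
	for _, g := range p.SchedulingGates {
		if g == batchGate {
			return true
		}
	}
	return false
}

// removeGate drops the batch gate and keeps any other gates.
func removeGate(p *pod) {
	kept := p.SchedulingGates[:0]
	for _, g := range p.SchedulingGates {
		if g != batchGate {
			kept = append(kept, g)
		}
	}
	p.SchedulingGates = kept
}

// releaseNextBatch ungates the first batch that is still gated, but only if
// every earlier batch is fully scheduled. It returns the released batch
// index, or -1 when nothing can be released yet.
func releaseNextBatch(pods []*pod, batches int) int {
	for b := 0; b < batches; b++ {
		gated, scheduled := false, true
		for _, p := range pods {
			if p.Batch != b {
				continue
			}
			gated = gated || hasGate(p)
			scheduled = scheduled && p.NodeName != ""
		}
		if gated {
			for _, p := range pods {
				if p.Batch == b {
					removeGate(p)
				}
			}
			return b
		}
		if !scheduled {
			return -1 // an earlier batch is still waiting for nodes
		}
	}
	return -1
}

func main() {
	pods := []*pod{
		{Name: "prefill-0", Batch: 0, SchedulingGates: []string{batchGate}},
		{Name: "prefill-4", Batch: 1, SchedulingGates: []string{batchGate}},
	}
	fmt.Println(releaseNextBatch(pods, 2)) // first batch is released
	fmt.Println(releaseNextBatch(pods, 2)) // first batch not scheduled yet
	pods[0].NodeName = "node-a"
	fmt.Println(releaseNextBatch(pods, 2)) // second batch is released
}
```

Removing a gate is the only action the loop takes; the scheduler then picks up the ungated Pods on its own, which is what triggers scheduling for the next batch.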
303+
304+
### Risks and Mitigations
305+
- Rolling deploy and rolling update are mutually exclusive within a single reconciliation cycle
306+
- If scaling occurs during deploy coordination, update coordination is skipped for that cycle
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
1+
title: RBG Role Deploy Coordination
2+
kep-number: 75
3+
authors:
4+
- "@bcfre"
5+
status: provisional
6+
creation-date: 2025-12-10
7+
reviewers:
8+
- "@cheyang"
9+
- "@Syspretor"
10+
11+
stage: alpha
12+
13+
latest-milestone: "v0.6.0"
14+
15+
milestone:
16+
alpha: "v0.6.0"
17+
beta: "v0.6.0"
18+
stable: "v0.6.0"
