Commit 2f3198a

author
柏存
committed
kep75 topology based role coordinated scheduler
Signed-off-by: 柏存 <guoxiongfeng.gxf@alibaba-inc.com>
1 parent ccb66a6 commit 2f3198a

2 files changed

Lines changed: 349 additions & 0 deletions

Lines changed: 331 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,331 @@
1+
# KEP-75: Enhanced Topology-Based Multi-Role Coordinated Scheduling
2+
3+
4+
5+
## Summary
6+
This KEP extends [KEP-30 (Role Coordination for RoleBasedGroup)](../30-role-coordination/README.md) by introducing proportional, multi-role rolling deployment (Rolling Deploy). Users can define cross-role deployment steps so that, even when cluster resources are limited, Pods of different roles keep the required proportions within a given cluster topology domain. The RBG Operator provides this fine-grained scheduling control between roles.
7+
8+
## Motivation
9+
10+
1. **Need for Fine-Grained Placement Policies**: Serving jobs are long-running and do not support rescheduling. Therefore, fine-grained placement policies are required during initial scheduling to achieve a globally optimal placement strategy.
11+
12+
2. **Network Topology Optimization**: In PD-separated (prefill/decode) deployment architectures, placing the P and D instances that process the same request within close network topology domains (NVLink > RDMA > TCP) reduces KV cache transfer latency and increases throughput. Placement across different network switches can reduce the available bandwidth for KV cache transfer by approximately 20% [[1](https://arxiv.org/pdf/2508.19559)].
13+
14+
3. In certain scenarios, PD separation ...
15+
16+
4. **Improved Service Fault Tolerance**: For inference services, dispersing homogeneous instances of the same role across different topology domains enhances the service's fault tolerance.
17+
18+
### Goals
19+
1. **Cross-Role Topology Affinity Deployment**: Support topology-affinity deployment for instances of specific roles (e.g., P:D) within the corresponding sub-batches.
20+
21+
2. **Intra-Role Anti-Affinity Deployment**: Support topology anti-affinity deployment among instances within the same role to enhance service fault tolerance.
22+
23+
3. **Group Scheduling Capability**: Support group scheduling capability for each minimal topology affinity batch of Pods.
24+
25+
### Non-Goals
26+
1. **Scenario Limitations**: This KEP covers only the initial deployment scenario. Subsequent scale-up and scale-down scenarios are out of scope.
27+
28+
2. **Cross-Group Coordination**: Coordination across multiple RoleBasedGroups is not supported.
29+
30+
3. **Non-Workload Resources**: Coordination for non-workload resources such as ConfigMaps or Secrets is not supported.
31+
32+
33+
## Proposal
34+
35+
### User Stories
36+
#### User Story 1: Ensuring Optimal PD Instance Ratio Within a Topology Domain
37+
38+
**Scenario**: When the P and D instances within each topology domain are kept at the optimal ratio, each P instance can prefer D instances in its own topology domain when selecting a decode target.
39+
40+
**Example**: With an optimal ratio P:D=2:1:
41+
- Node A: Deploy P1, P2, D1
42+
- Node B: Deploy P3, P4, D2
43+
44+
P1, P2, and D1 form a virtual subgroup in which the network is not a bottleneck for KV cache transfer between P and D. P and D instances from different subgroups can still be paired with each other by the router.
45+
46+
#### User Story 2: Cluster Fault Tolerance Across Topology Domains
47+
48+
**Problem**: Cluster physical resources can fail at any time. If all P or D nodes are deployed in a specific topology domain (e.g., NodeA), and NodeA fails, the entire LLM inference service crashes due to the lack of a critical role.
49+
50+
**Solution**: Disperse instances of the same role across different topology domains. If one topology domain fails, the remaining P and D instances can still serve inference requests in a degraded mode.
51+
52+
#### User Story 3: Resource-Constrained Environment
53+
54+
**Scenario**: A user wants to deploy a 4P4D service whose minimum viable unit is 2P + 2D.
55+
56+
**Problems with Uncoordinated Rolling Deployment**:
57+
- Only schedule 4P Pods → Service cannot run
58+
- Only schedule 4D Pods → Service cannot run
59+
- Group schedule all 8 Pods → Startup fails due to insufficient resources
60+
61+
**Advantages of Coordinated Rolling Deployment**:
62+
The operator deploys Pods progressively in steps of 2P:2D:
63+
- Step 1: 2P + 2D
64+
- Step 2: Next group of 2P + 2D
65+
- Continue until all replicas are deployed
66+
67+
This ensures the service remains operational at all intermediate stages and avoids resource wastage.
68+
69+
70+
## Design Details
71+
72+
73+
### Cluster Topology Definition
74+
75+
- **New CRD**: Introduce a Custom Resource Definition (CRD) to describe the cluster topology hierarchy.
76+
- **Administrative Control**: This CRD object is created by the cluster administrator. Users can directly reference the topology name.
77+
- **Resource Protection**: The RBG Operator validates the topology configuration and, once validation succeeds, adds a finalizer to protect the resource object.
- **Immutability**: A topology object is immutable after creation and cannot be deleted while it is referenced by an RBG.
79+
80+
#### API Extensions
81+
```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +kubebuilder:printcolumn:name="CurrentTopologyLevel",type="string",JSONPath=".status.currentTopologyLevel"
type ClusterTopology struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ClusterTopologySpec   `json:"spec"`
	Status ClusterTopologyStatus `json:"status,omitempty"`
}

type ClusterTopologySpec struct {
	// If TopologyScope is empty, the index in Layers orders the layers from
	// the smallest to the largest network domain.
	Layers []TopologyLayer `json:"Layers"`
	// Larger numbers indicate larger managed network domains with worse communication performance.
	// Exported so that the field is serialized; the JSON name is taken from the original field name.
	TopologyScope map[TopologyLayerName]int `json:"topologyScope,omitempty"`
}

type TopologyLayer struct {
	TopologyLayerName TopologyLayerName `json:"name"`

	MatchLabelKey string `json:"key"`
}

type TopologyLayerName string

type ClusterTopologyStatus struct {
	// Output format example: Host < TOR < Pod
	CurrentTopologyLevel string `json:"currentTopologyLevel,omitempty"`
	// Validate reports whether the topology structure definition is correct.
	// If Validate is false, this topology cannot be referenced in an RBG.
	// Normally this should be handled by webhook validation.
	Validate bool `json:"validate"`
}
```
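The validation and the `Host < TOR < Pod` status string described above could be computed along the following lines. This is a minimal sketch, not the operator's implementation: `validateLayers` and `levelString` are hypothetical helpers, the validation rules (non-empty list, non-empty name and key, unique names) are assumptions, and the `example.com/...` label keys are placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// TopologyLayer is a simplified copy of the KEP type: a layer name plus the
// node label key that identifies the domain a node belongs to at that layer.
type TopologyLayer struct {
	Name          string
	MatchLabelKey string
}

// validateLayers is a hypothetical helper. It rejects an empty list, empty
// names or keys, and duplicate layer names (assumed rules, not from the KEP).
func validateLayers(layers []TopologyLayer) error {
	if len(layers) == 0 {
		return fmt.Errorf("at least one topology layer is required")
	}
	seen := make(map[string]bool, len(layers))
	for i, l := range layers {
		if l.Name == "" || l.MatchLabelKey == "" {
			return fmt.Errorf("layer %d: name and key must be non-empty", i)
		}
		if seen[l.Name] {
			return fmt.Errorf("layer %d: duplicate name %q", i, l.Name)
		}
		seen[l.Name] = true
	}
	return nil
}

// levelString joins the layer names in list order, which is assumed to run
// from the smallest to the largest network domain.
func levelString(layers []TopologyLayer) string {
	names := make([]string, 0, len(layers))
	for _, l := range layers {
		names = append(names, l.Name)
	}
	return strings.Join(names, " < ")
}

func main() {
	layers := []TopologyLayer{
		{Name: "Host", MatchLabelKey: "kubernetes.io/hostname"},
		{Name: "TOR", MatchLabelKey: "example.com/tor"}, // placeholder label key
		{Name: "Pod", MatchLabelKey: "example.com/pod"}, // placeholder label key
	}
	if err := validateLayers(layers); err != nil {
		fmt.Println("invalid topology:", err)
		return
	}
	fmt.Println(levelString(layers))
}
```

If `TopologyScope` is set, the ordering would presumably come from its values rather than from the list index; the sketch covers only the index-ordered case.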
121+
122+
123+
### RBG Operator Enhancements
124+
125+
126+
In large model inference services, considering service fault tolerance and resource-constrained cluster scenarios, it is often impractical to deploy all RBG workloads within a single network domain. Optimization strategies include:
127+
128+
1. **Intra-Role-Instance Aggregation**: Aggregate the Pods of each P or D instance within a high-performance network domain.
2. **Cross-Role Optimization**: P and D roles need high-speed communication, but not all instances fit in the same topology domain. Place an optimally proportioned set of P and D Pods within each topology domain instead.
130+
131+
#### API Extensions
132+
133+
To achieve topology affinity scheduling between instances of different roles in a fixed ratio, it is necessary to further partition the role internally using a virtual object called `RoleSubGroup`.
134+
135+
```go
// RoleBasedGroupSpec defines the desired state of RoleBasedGroup.
type RoleBasedGroupSpec struct {
	// +kubebuilder:pruning:PreserveUnknownFields
	// +kubebuilder:validation:MinItems=1
	// +kubebuilder:validation:Required
	// +patchMergeKey=name
	// +patchStrategy=merge
	// +listType=map
	// +listMapKey=name
	Roles []RoleSpec `json:"roles" patchStrategy:"merge" patchMergeKey:"name"`

	// Configuration for the PodGroup to enable gang-scheduling via supported plugins.
	PodGroupPolicy *PodGroupPolicy `json:"podGroupPolicy,omitempty"`

	// CoordinationRequirements describes the requirements of coordination strategies for some specified roles.
	// +patchMergeKey=name
	// +patchStrategy=merge
	// +listType=map
	// +listMapKey=name
	CoordinationRequirements []Coordination `json:"coordination,omitempty" patchStrategy:"merge" patchMergeKey:"name"`
}

// Coordination describes the requirements of coordination strategies for roles.
type Coordination struct {
	// Name of the coordination.
	Name string `json:"name"`

	// Roles that should be constrained by this coordination.
	Roles []string `json:"roles"`

	// Strategy describes the coordination strategies.
	Strategy *CoordinationStrategy `json:"strategy,omitempty"`
}

type CoordinationStrategy struct {
	// RollingUpdate defines the coordination strategies about rolling update.
	RollingUpdate *CoordinationRollingUpdate `json:"rollingUpdate,omitempty"`

	// new field
	// RollingDeploy defines the coordination strategies about rolling deploy.
	RollingDeploy *CoordinationRollingDeploy `json:"rollingDeploy,omitempty"`
}

// new type
// CoordinationRollingDeploy describes the rolling deploy coordination strategy.
type CoordinationRollingDeploy struct {
	// Number of instances of each role to deploy in each step. Together they form the
	// minimum P:D logical group during the deployment process.
	// e.g. RoleDeployStep = {"prefill": 4, "decode": 2}
	RoleDeployStep map[string]int `json:"roleDeployStep"`
	// Topology requirements for the current logical group.
	DeployTopologyConstraint *DeployTopologyConstraint `json:"deployTopologyConstraint,omitempty"`
}

type DeployTopologyConstraint struct {
	// TODO: Consider whether this should be globally unique or allow users to customize the configuration.
	// If not configured by the user, the default value rbg-cluster-topology is used.
	ClusterTopologyName *string `json:"clusterTopologyName,omitempty"`

	// Declares which topology level the current batch of Pods should be constrained to.
	LayerName *TopologyLayerName `json:"topologyLayerName,omitempty"`

	// Default value: hard
	ConstraintMode TopologyConstraintMode `json:"constraintMode,omitempty"`

	// Configuration for the PodGroup to enable gang-scheduling via supported plugins.
	PodGroupPolicy *PodGroupPolicy `json:"podGroupPolicy,omitempty"`
}

// +enum
type TopologyConstraintMode string

const (
	// HardTopologyConstraintMode represents a strict network topology constraint that the workload must adhere to.
	HardTopologyConstraintMode TopologyConstraintMode = "hard"

	// SoftTopologyConstraintMode represents a flexible network topology constraint that allows the workload
	// to cross network boundaries under certain conditions.
	SoftTopologyConstraintMode TopologyConstraintMode = "soft"
)
```
216+
217+
218+
#### Coordinated-Roles Deploy YAML Example

The prefill and decode roles are deployed with a coordinated `rollingDeploy` strategy:
220+
```yaml
apiVersion: workloads.x-k8s.io/v1alpha1
kind: RoleBasedGroup
metadata:
  name: nginx-cluster
spec:
  coordination:
    - name: pd-rollout-deploy
      roles:
        - prefill
        - decode
      strategy:
        rollingDeploy:
          roleDeployStep:
            prefill: 4
            decode: 2
          deployTopologyConstraint:
            clusterTopologyName: rbg-cluster-topology
            topologyLayerName: host
            constraintMode: hard
            podGroupPolicy: null
    - name: pd-rollout
      roles:
        - prefill
        - decode
      strategy:
        rollingUpdate:
          maxSkew: 1%
          maxUnavailable: 10%
  roles:
    - name: prefill
      replicas: 8
      template:
        spec:
          containers:
            - image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
              name: nginx
    - name: decode
      replicas: 4
      template:
        spec:
          containers:
            - image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6
              name: nginx
```
262+
263+
**Example Scenario**:
264+
- RBG object with 8P, 4D
265+
- The minimum optimal deploy unit across roles is P:D = 4:2
266+
- Each 4P:2D group must satisfy the same topology affinity requirements
267+
- Divided into two deploy batches
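
The batch split in this scenario can be sketched in Go as follows. `planBatches` is a hypothetical helper, not part of the RBG Operator, and it assumes that every role's replica count is an exact multiple of its step and that all roles need the same number of batches; the KEP does not specify how remainders are handled.

```go
package main

import "fmt"

// planBatches splits each role's replicas into deploy batches of the given
// step size. It returns an error when a role has no positive step, when the
// replicas are not a multiple of the step, or when the roles would need
// different numbers of batches (all assumed restrictions for this sketch).
func planBatches(replicas, step map[string]int) ([]map[string]int, error) {
	if len(replicas) == 0 {
		return nil, nil
	}
	batches := -1
	for role, n := range replicas {
		s := step[role]
		if s <= 0 {
			return nil, fmt.Errorf("role %q: step must be positive", role)
		}
		if n <= 0 || n%s != 0 {
			return nil, fmt.Errorf("role %q: %d replicas is not a positive multiple of step %d", role, n, s)
		}
		b := n / s
		if batches == -1 {
			batches = b
		} else if b != batches {
			return nil, fmt.Errorf("role %q needs %d batches, other roles need %d", role, b, batches)
		}
	}
	plan := make([]map[string]int, 0, batches)
	for i := 0; i < batches; i++ {
		batch := make(map[string]int, len(replicas))
		for role := range replicas {
			batch[role] = step[role]
		}
		plan = append(plan, batch)
	}
	return plan, nil
}

func main() {
	replicas := map[string]int{"prefill": 8, "decode": 4}
	step := map[string]int{"prefill": 4, "decode": 2}
	plan, err := planBatches(replicas, step)
	if err != nil {
		fmt.Println("cannot plan batches:", err)
		return
	}
	for i, b := range plan {
		fmt.Printf("batch %d: %v\n", i, b)
	}
}
```

With 8 prefill and 4 decode replicas and a step of 4:2, the plan contains two batches of 4P + 2D each, matching the scenario above.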
268+
269+
#### Controller Logic
270+
271+
**Label Propagation Mechanism**:
272+
The RBG Operator propagates the role subgroup size to the underlying workloads as a label. The workload controller (the InstanceSet Operator) derives the subgroup of each instance from the subgroup size and the role instance index, and configures the corresponding affinity policy on the role's Pods.
273+
274+
**Workload Label Example**:
275+
```yaml
# Label values must be strings, so numeric values are quoted.
rolebasedgroup.workloads.x-k8s.io/name: rbgs-test-1
rolebasedgroup.workloads.x-k8s.io/role: p
rolebasedgroup.workloads.x-k8s.io/role-subgroup-size: "4"
rolebasedgroup.workloads.x-k8s.io/topology-coordination-name: p-d-4-2-deploy-test
```
281+
282+
**Deploy Topology Coordination Name Generation Strategy**:
```go
// Integer division maps a role instance index to its subgroup index.
hash(fmt.Sprintf("%s-%d", topologyCoordinationName, roleInstanceIndex/roleSubgroupSize))
```
286+
287+
**Pod Label Example**:
288+
```yaml
# Label values must be strings, so numeric values are quoted.
rolebasedgroup.workloads.x-k8s.io/name: rbgs-test-1
rolebasedgroup.workloads.x-k8s.io/role: p
rolebasedgroup.workloads.x-k8s.io/role-subgroup-size: "4"
rolebasedgroup.workloads.x-k8s.io/role-instance-index: "1"
rolebasedgroup.workloads.x-k8s.io/topology-coordination-name: p-d-4-2-deploy-test
rolebasedgroup.workloads.x-k8s.io/deploy-topology-coordination-name: p-d-4-2-deploy-test-0
```
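
The mapping from role instance index to the `deploy-topology-coordination-name` label can be sketched as below. The function names are hypothetical, and the sketch deliberately omits the `hash(...)` step of the generation strategy so that its output matches the unhashed value shown in the Pod label example; whether the final label is hashed is left open here.

```go
package main

import "fmt"

// subgroupIndex maps a role instance index to its subgroup by integer
// division, as in the name generation strategy.
func subgroupIndex(roleInstanceIndex, subgroupSize int) (int, error) {
	if subgroupSize <= 0 {
		return 0, fmt.Errorf("subgroup size must be positive, got %d", subgroupSize)
	}
	if roleInstanceIndex < 0 {
		return 0, fmt.Errorf("role instance index must be non-negative, got %d", roleInstanceIndex)
	}
	return roleInstanceIndex / subgroupSize, nil
}

// deployCoordinationName builds the per-subgroup label value. The hash step
// is left out in this sketch (assumption, see the lead-in).
func deployCoordinationName(coordName string, roleInstanceIndex, subgroupSize int) (string, error) {
	idx, err := subgroupIndex(roleInstanceIndex, subgroupSize)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s-%d", coordName, idx), nil
}

func main() {
	const coord = "p-d-4-2-deploy-test"
	// Prefill uses subgroup size 4, decode uses subgroup size 2.
	for _, i := range []int{0, 1, 3, 4, 7} {
		name, err := deployCoordinationName(coord, i, 4)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("prefill-%d -> %s\n", i, name)
	}
	for _, i := range []int{0, 1, 2, 3} {
		name, err := deployCoordinationName(coord, i, 2)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("decode-%d -> %s\n", i, name)
	}
}
```

Because prefill instances 0 to 3 and decode instances 0 to 1 resolve to the same label value, a Pod affinity term that selects on this label groups one 4P:2D unit into the same topology domain.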
296+
297+
**Pod Affinity Example**:
298+
```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: rolebasedgroup.workloads.x-k8s.io/deploy-topology-coordination-name
              operator: In
              values:
                - p-d-4-2-deploy-test-0
        topologyKey: kubernetes.io/hostname
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: rolebasedgroup.workloads.x-k8s.io/role
                operator: In
                values:
                  - p
          topologyKey: kubernetes.io/hostname
```
321+
322+
### Progressive Scheduling Deployment
323+
324+
**Implementation Mechanism**: Batches are released one at a time using Kubernetes Pod scheduling gates (`spec.schedulingGates`):
- Pods of later batches carry a scheduling gate, so the scheduler does not consider them yet.
- After the Pods of the previous batch are successfully scheduled, the operator removes the scheduling gate from the Pods of the next batch.
- Removing the gate triggers the scheduling process for that batch.
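
The release decision described above can be modelled without any Kubernetes dependency. In this sketch the `pod` struct is a simplified stand-in for a real Pod, and `nextBatchToRelease` and `releaseBatch` are hypothetical helpers; a real controller would instead inspect and patch `spec.schedulingGates` on the Pod objects.

```go
package main

import "fmt"

// pod is a simplified stand-in for a Kubernetes Pod (assumption for this sketch).
type pod struct {
	Name      string
	Batch     int
	Gated     bool // true while the scheduling gate is still present
	Scheduled bool // true once the Pod has been bound to a node
}

// nextBatchToRelease returns the lowest batch that still has gated Pods, but
// only if every Pod in all lower batches has already been scheduled.
func nextBatchToRelease(pods []pod) (int, bool) {
	next, found := 0, false
	for _, p := range pods {
		if p.Gated && (!found || p.Batch < next) {
			next, found = p.Batch, true
		}
	}
	if !found {
		return 0, false // nothing left to release
	}
	for _, p := range pods {
		if p.Batch < next && !p.Scheduled {
			return 0, false // an earlier batch is still pending
		}
	}
	return next, true
}

// releaseBatch clears the gate on every Pod of the given batch.
func releaseBatch(pods []pod, batch int) {
	for i := range pods {
		if pods[i].Batch == batch {
			pods[i].Gated = false
		}
	}
}

func main() {
	pods := []pod{
		{Name: "prefill-0", Batch: 0, Scheduled: true},
		{Name: "decode-0", Batch: 0, Scheduled: false},
		{Name: "prefill-4", Batch: 1, Gated: true},
		{Name: "decode-2", Batch: 1, Gated: true},
	}
	if _, ok := nextBatchToRelease(pods); !ok {
		fmt.Println("batch 0 not fully scheduled, batch 1 stays gated")
	}
	pods[1].Scheduled = true
	if b, ok := nextBatchToRelease(pods); ok {
		releaseBatch(pods, b)
		fmt.Printf("released batch %d\n", b)
	}
}
```

Keeping the decision a pure function of the observed Pod state makes it safe to re-evaluate on every reconciliation.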
328+
329+
### Risks and Mitigations
330+
- Rolling deploy and rolling update are mutually exclusive within a single reconciliation cycle
331+
- If scaling occurs during deploy coordination, update coordination is skipped for that cycle
Lines changed: 18 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,18 @@
1+
title: RBG Role Deploy Coordination
2+
kep-number: 75
3+
authors:
4+
- "@bcfre"
5+
status: provisional
6+
creation-date: 2025-12-10
7+
reviewers:
8+
- "@cheyang"
9+
- "@Syspretor"
10+
11+
stage: alpha
12+
13+
latest-milestone: "v0.6.0"
14+
15+
milestone:
16+
alpha: "v0.6.0"
17+
beta: "v0.6.0"
18+
stable: "v0.6.0"
