| 1 | +# KEP-76: Enhanced Topology-Based Multi-Role Coordinated Scheduling |
| 2 | + |
| 3 | + |
| 4 | + |
| 5 | +## Summary |
| 6 | +This KEP extends [KEP-30 (Role Coordination for RoleBasedGroup)](../30-role-coordination/README.md) by introducing proportional, multi-role rolling deployment (Rolling Deploy). This feature enables users to define cross-role deployment steps, ensuring that Pods across different roles maintain the desired ratio even when cluster resources are constrained. It also provides fine-grained scheduling control at the role level. |
| 7 | + |
| 8 | +## Motivation |
| 9 | + |
| 10 | +1. **Need for Fine-Grained Placement Policies**: Serving jobs are long-running and do not support rescheduling. Therefore, fine-grained placement policies are required during initial scheduling to achieve a globally optimal placement strategy. |
| 11 | + |
| 12 | +2. **Network Topology Optimization**: In prefill/decode (PD) disaggregated deployment architectures, placing the prefill (P) and decode (D) instances that process the same request within close network topology domains can reduce KV cache transmission latency and increase throughput. |
| 13 | + |
| 14 | +3. **Improved Service Fault Tolerance**: For inference services, dispersing homogeneous instances of the same role across different topology domains enhances the service's fault tolerance. |
| 15 | + |
| 16 | +### Goals |
| 17 | +1. **Cross-Role Topology Affinity Deployment**: Support topology affinity deployment for instances of specific roles (e.g., P:D) within corresponding sub-batches. |
| 18 | + |
| 19 | +2. **Intra-Role Anti-Affinity Deployment**: Support topology anti-affinity deployment among instances within the same role to enhance service fault tolerance. |
| 20 | + |
| 21 | +3. **Group Scheduling Capability**: Support group scheduling capability for each minimal topology affinity batch of Pods. |
| 22 | + |
| 23 | +### Non-Goals |
| 24 | +1. **Scenario Limitation**: Scaling scenarios are out of scope; only non-scaling scenarios are currently considered. |
| 25 | + |
| 26 | +2. **Cross-Group Coordination**: Coordination across multiple RoleBasedGroupSets is not supported. |
| 27 | + |
| 28 | +3. **Non-Workload Resources**: Coordination for non-workload resources such as ConfigMaps or Secrets is not supported. |
| 29 | + |
| 30 | + |
| 31 | +## Proposal |
| 32 | + |
| 33 | +### User Stories |
| 34 | +#### User Story 1: Ensuring Optimal PD Instance Ratio Within a Topology Domain |
| 35 | + |
| 36 | +**Scenario**: When the P and D instances within each topology domain maintain the optimal ratio, each P instance can prefer D instances in the same topology domain when selecting a decode target. |
| 37 | + |
| 38 | +**Example**: With an optimal ratio P:D=2:1: |
| 39 | +- Zone A: Deploy P1, P2, D1 |
| 40 | +- Zone B: Deploy P3, P4, D2 |
| 41 | + |
| 42 | +P1, P2, and D1 form a virtual subgroup where the network is not a bottleneck for KV transmission between P and D. Furthermore, PD roles from subgroup1 and subgroup2 can still pair with each other in the router. |
| 43 | + |
| 44 | +#### User Story 2: Cluster Fault Tolerance Across Topology Domains |
| 45 | + |
| 46 | +**Problem**: Cluster physical resources can fail at any time. If all P or D nodes are deployed in a specific topology domain (e.g., Node1), and Node1 fails, the entire LLM inference service crashes due to the lack of a critical role. |
| 47 | + |
| 48 | +**Solution**: Disperse instances of the same role across different topology domains. Even if a topology domain fails, the remaining PD roles can still connect to handle request inference in a degraded service mode. |
| 49 | + |
| 50 | +#### User Story 3: Resource-Constrained Environment |
| 51 | + |
| 52 | +**Scenario**: A user wants to deploy a 4P4D service (minimum viable ratio P:D=2:2). |
| 53 | + |
| 54 | +**Problems with Uncoordinated Rolling Deployment**: |
| 55 | +- Only schedule 4P Pods → Service cannot run |
| 56 | +- Only schedule 4D Pods → Service cannot run |
| 57 | +- Group schedule all 8 Pods → Startup fails due to insufficient resources |
| 58 | + |
| 59 | +**Advantages of Coordinated Rolling Deployment**: |
| 60 | +The operator deploys Pods progressively in steps of 2P:2D: |
| 61 | +- Step 1: 2P + 2D |
| 62 | +- Step 2: Next group of 2P + 2D |
| 63 | +- Continue until all replicas are deployed |
| 64 | + |
| 65 | +This ensures the service remains operational at all intermediate stages and avoids resource wastage. |
| 66 | + |
| 67 | + |
| 68 | +## Design Details |
| 69 | + |
| 70 | + |
| 71 | +### Cluster Topology Definition |
| 72 | + |
| 73 | +- **New CRD**: Introduce a Custom Resource Definition (CRD) to describe the cluster topology hierarchy. |
| 74 | +- **Administrative Control**: This CRD object is created by the cluster administrator. Users can directly reference the topology name. |
| 75 | +- **Resource Protection**: The RBG validates the topology configuration and adds a finalizer to protect the resource object upon successful validation. |
| 76 | +- **Immutability**: A ClusterTopology object is immutable after creation and cannot be deleted while it is referenced by an RBG. |
| 77 | + |
| 78 | +#### API Extensions |
| 79 | +```go |
| 80 | +package v1alpha1 |
| 81 | + |
| 82 | +import ( |
| 83 | + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" |
| 84 | +) |
| 85 | + |
| 86 | +// +kubebuilder:printcolumn:name="CurrentTopologyLevel",type="string",JSONPath=".status.currentTopologyLevel" |
| 87 | +type ClusterTopology struct { |
| 88 | + metav1.TypeMeta `json:",inline"` |
| 89 | + metav1.ObjectMeta `json:"metadata,omitempty"` |
| 90 | + |
| 91 | + Spec ClusterTopologySpec `json:"spec"` |
| 92 | + Status ClusterTopologyStatus `json:"status,omitempty"` |
| 93 | +} |
| 94 | + |
| 95 | +type ClusterTopologySpec struct { |
| 96 | + // Layers lists the topology layers. If TopologyScope is empty, the index orders layers from smaller to larger network domains. |
| 97 | + Layers []TopologyLayer `json:"layers"` |
| 98 | + // TopologyScope ranks the layers. Larger numbers indicate larger network domains with worse communication performance. |
| 99 | + TopologyScope map[TopologyLayerName]int `json:"topologyScope,omitempty"` |
| 100 | +} |
| 101 | + |
| 102 | +type TopologyLayer struct { |
| 103 | + TopologyLayerName TopologyLayerName `json:"name"` |
| 104 | + |
| 105 | + MatchLabelKey string `json:"key"` |
| 106 | +} |
| 107 | + |
| 108 | +type TopologyLayerName string |
| 109 | + |
| 110 | +type ClusterTopologyStatus struct { |
| 111 | + // Output format example: Numa < Host < TOR < Pod < DataCenter < Zone < Region < Vendor < ALL |
| 112 | + CurrentTopologyLevel string `json:"currentTopologyLevel,omitempty"` |
| 113 | + // Validate reports whether the topology structure definition is correct. |
| 114 | + // If Validate is false, this topology cannot be referenced in an RBG. Normally this should be handled by webhook validation. |
| 115 | + Validate bool `json:"validate"` |
| 116 | +} |
| 117 | + |
| 118 | +``` |
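
The validation step and the `CurrentTopologyLevel` output described above could look roughly like the following sketch. It is only an illustration under assumptions: the `layer` type, the `topologyLevels` helper, and the specific validation rules (non-empty, unique names and label keys, full coverage by the scope map) are hypothetical and not part of the proposed API.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// layer mirrors TopologyLayer: a layer name plus the node label key that identifies it.
type layer struct {
	Name string
	Key  string
}

// topologyLevels validates the layers and renders them from the smallest to
// the largest network domain, e.g. "Host < Zone".
// Hypothetical helper: the validation rules are assumptions for illustration.
func topologyLevels(layers []layer, scope map[string]int) (string, error) {
	if len(layers) == 0 {
		return "", fmt.Errorf("at least one topology layer is required")
	}
	seenNames := map[string]bool{}
	seenKeys := map[string]bool{}
	names := make([]string, 0, len(layers))
	for _, l := range layers {
		if l.Name == "" || l.Key == "" {
			return "", fmt.Errorf("layer name and label key must be non-empty")
		}
		if seenNames[l.Name] || seenKeys[l.Key] {
			return "", fmt.Errorf("duplicate layer name or label key: %s/%s", l.Name, l.Key)
		}
		if _, ok := scope[l.Name]; len(scope) > 0 && !ok {
			return "", fmt.Errorf("layer %q has no entry in topologyScope", l.Name)
		}
		seenNames[l.Name], seenKeys[l.Key] = true, true
		names = append(names, l.Name)
	}
	if len(scope) > 0 {
		// Larger scope values mean larger network domains, so sort ascending.
		sort.SliceStable(names, func(i, j int) bool { return scope[names[i]] < scope[names[j]] })
	}
	// With an empty scope, the slice index already orders small to large.
	return strings.Join(names, " < "), nil
}

func main() {
	layers := []layer{
		{Name: "Zone", Key: "topology.kubernetes.io/zone"},
		{Name: "Host", Key: "kubernetes.io/hostname"},
	}
	levels, err := topologyLevels(layers, map[string]int{"Host": 1, "Zone": 2})
	fmt.Println(levels, err) // Host < Zone <nil>
}
```

In a real operator this check would more likely live in a validating webhook, as the status comment above suggests.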
| 119 | + |
| 120 | + |
| 121 | +### RBG Operator Enhancements |
| 122 | + |
| 123 | + |
| 124 | +In large model inference services, considering service fault tolerance and resource-constrained cluster scenarios, it is often impractical to deploy all RBG workloads within a single network domain. Optimization strategies include: |
| 125 | + |
| 126 | +1. **Intra-Role-Instance Aggregation**: Aggregate P or D instances within high-performance networks. |
| 127 | +2. **Cross-Role Optimization**: Due to the high-speed communication requirements between P and D roles and the inability to place all instances in the same topology domain, place the optimally proportioned number of Pods within the same network topology domain. |
| 128 | + |
| 129 | +### API Extensions |
| 130 | + |
| 131 | +To achieve topology affinity scheduling between instances of different roles in a fixed ratio, it is necessary to further partition the role internally using a virtual object called RoleSubGroup. |
| 132 | + |
| 133 | +```go |
| 134 | +// RoleBasedGroupSpec defines the desired state of RoleBasedGroup. |
| 135 | +type RoleBasedGroupSpec struct { |
| 136 | + // +kubebuilder:pruning:PreserveUnknownFields |
| 137 | + // +kubebuilder:validation:MinItems=1 |
| 138 | + // +kubebuilder:validation:Required |
| 139 | + // +patchMergeKey=name |
| 140 | + // +patchStrategy=merge |
| 141 | + // +listType=map |
| 142 | + // +listMapKey=name |
| 143 | + Roles []RoleSpec `json:"roles" patchStrategy:"merge" patchMergeKey:"name"` |
| 144 | + |
| 145 | + // Configuration for the PodGroup to enable gang-scheduling via supported plugins. |
| 146 | + PodGroupPolicy *PodGroupPolicy `json:"podGroupPolicy,omitempty"` |
| 147 | + |
| 148 | + // CoordinationRequirements describes the requirements of coordination strategies for some specified roles. |
| 149 | + // +patchMergeKey=name |
| 150 | + // +patchStrategy=merge |
| 151 | + // +listType=map |
| 152 | + // +listMapKey=name |
| 153 | + CoordinationRequirements []Coordination `json:"coordination,omitempty" patchStrategy:"merge" patchMergeKey:"name"` |
| 154 | +} |
| 155 | + |
| 156 | +// Coordination describes the requirements of coordination strategies for roles. |
| 157 | +type Coordination struct { |
| 158 | + // Name of the coordination. |
| 159 | + Name string `json:"name"` |
| 160 | + |
| 161 | + // Roles that should be constrained by this coordination. |
| 162 | + Roles []string `json:"roles"` |
| 163 | + |
| 164 | + // Strategy describes the coordination strategies. |
| 165 | + Strategy *CoordinationStrategy `json:"strategy,omitempty"` |
| 166 | +} |
| 167 | + |
| 168 | +type CoordinationStrategy struct { |
| 169 | + // RollingUpdate defines the coordination strategies about rolling update. |
| 170 | + RollingUpdate *CoordinationRollingUpdate `json:"rollingUpdate,omitempty"` |
| 171 | + |
| 172 | + // new field |
| 173 | + // RollingDeploy defines the coordination strategies about rolling deploy. |
| 174 | + RollingDeploy *CoordinationRollingDeploy `json:"rollingDeploy,omitempty"` |
| 175 | +} |
| 176 | + |
| 177 | +// new type |
| 178 | +// CoordinationRollingDeploy describes the rolling deploy coordination strategy. |
| 179 | +type CoordinationRollingDeploy struct { |
| 180 | + // RoleDeployStep is the number of Pods per role to deploy in each step; it also defines the P:D ratio that forms a logical group during deployment. |
| 181 | + RoleDeployStep map[string]int `json:"roleDeployStep"` |
| 182 | + // Topology requirements for the current logical group |
| 183 | + DeployTopologyConstraint *DeployTopologyConstraint `json:"deployTopologyConstraint,omitempty"` |
| 184 | +} |
| 185 | + |
| 186 | +type DeployTopologyConstraint struct { |
| 187 | + // TODO: Consider whether this should be globally unique or allow users to customize the configuration. |
| 188 | + // If not configured by the user, the default value rbg-cluster-topology is used. |
| 189 | + ClusterTopologyName *string `json:"clusterTopologyName,omitempty"` |
| 190 | + |
| 191 | + // Declares which topology level the current batch of Pods should be constrained to |
| 192 | + LayerName *TopologyLayerName `json:"topologyLayerName,omitempty"` |
| 193 | + |
| 194 | + // Default value: hard |
| 195 | + InternalConstraintMode TopologyConstraintMode `json:"internalConstraintMode,omitempty"` |
| 196 | + // Default value: soft |
| 197 | + ExternalConstraintMode TopologyConstraintMode `json:"externalConstraintMode,omitempty"` |
| 198 | + |
| 199 | + // Configuration for the PodGroup to enable gang-scheduling via supported plugins. |
| 200 | + PodGroupPolicy *PodGroupPolicy `json:"podGroupPolicy,omitempty"` |
| 201 | +} |
| 202 | + |
| 203 | +// +enum |
| 204 | +type TopologyConstraintMode string |
| 205 | + |
| 206 | +const ( |
| 207 | + // HardTopologyConstraintMode represents a strict network topology constraint that workload must adhere to. |
| 208 | + HardTopologyConstraintMode TopologyConstraintMode = "hard" |
| 209 | + |
| 210 | + // SoftTopologyConstraintMode represents a flexible network topology constraint that allows workload |
| 211 | + // to cross network boundaries under certain conditions. |
| 212 | + SoftTopologyConstraintMode TopologyConstraintMode = "soft" |
| 213 | +) |
| 214 | +``` |
| 215 | + |
| 216 | + |
| 217 | +#### Coordinated-Roles Deploy YAML Example |
| 218 | +The prefill and decode roles are deployed with a coordinated rollingDeploy strategy: |
| 219 | +```yaml |
| 220 | +apiVersion: workloads.x-k8s.io/v1alpha1 |
| 221 | +kind: RoleBasedGroup |
| 222 | +metadata: |
| 223 | +  name: nginx-cluster |
| 224 | +spec: |
| 225 | +  coordination: |
| 226 | +    - name: pd-rollout-deploy |
| 227 | +      roles: [prefill, decode] |
| 228 | +      strategy: |
| 229 | +        rollingDeploy: |
| 230 | +          roleDeployStep: {prefill: 4, decode: 2} |
| 231 | +          deployTopologyConstraint: |
| 232 | +            clusterTopologyName: rbg-cluster-topology |
| 233 | +            topologyLayerName: zone |
| 234 | +            internalConstraintMode: hard |
| 235 | +            externalConstraintMode: soft |
| 236 | +            podGroupPolicy: null |
| 237 | +    - name: pd-rollout |
| 238 | +      roles: |
| 239 | +        - prefill |
| 240 | +        - decode |
| 241 | +      strategy: |
| 242 | +        rollingUpdate: |
| 243 | +          maxSkew: 1% |
| 244 | +          maxUnavailable: 10% |
| 245 | +  roles: |
| 246 | +    - name: prefill |
| 247 | +      replicas: 8 |
| 248 | +      template: |
| 249 | +        spec: |
| 250 | +          containers: |
| 251 | +            - image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6 |
| 252 | +              name: nginx-leader |
| 253 | +    - name: decode |
| 254 | +      replicas: 4 |
| 255 | +      template: |
| 256 | +        spec: |
| 257 | +          containers: |
| 258 | +            - image: anolis-registry.cn-zhangjiakou.cr.aliyuncs.com/openanolis/nginx:1.14.1-8.6 |
| 259 | +              name: nginx-worker |
| 260 | + |
| 261 | +``` |
| 262 | + |
| 263 | +**Example Scenario**: |
| 264 | +- RBG object with 8P, 4D |
| 265 | +- Minimum optimal ratio between roles is P:D = 4:2 |
| 266 | +- Each 4P:2D group must satisfy the same topology affinity requirements |
| 267 | +- Divided into two deployment batches |
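
The batch split in this scenario can be sketched as below. This is a minimal illustration, not the operator's implementation: `planBatches` is a hypothetical helper, and letting the last batch take whatever remainder is left when replicas are not a multiple of the step is an assumption the KEP does not specify.

```go
package main

import "fmt"

// planBatches splits each role's replicas into deployment batches of at most
// step[role] Pods. Hypothetical helper, not part of the RBG API.
// Assumption: a final, smaller batch absorbs any remainder.
func planBatches(replicas, step map[string]int) ([]map[string]int, error) {
	remaining := make(map[string]int, len(replicas))
	for role, n := range replicas {
		if s, ok := step[role]; !ok || s <= 0 {
			return nil, fmt.Errorf("role %q needs a positive deploy step", role)
		}
		if n < 0 {
			return nil, fmt.Errorf("role %q has negative replicas", role)
		}
		remaining[role] = n
	}
	var batches []map[string]int
	for {
		batch := map[string]int{}
		for role, left := range remaining {
			if left == 0 {
				continue
			}
			n := step[role]
			if left < n {
				n = left
			}
			batch[role] = n
			remaining[role] = left - n // updating an existing key while ranging is safe
		}
		if len(batch) == 0 {
			return batches, nil
		}
		batches = append(batches, batch)
	}
}

func main() {
	batches, err := planBatches(
		map[string]int{"prefill": 8, "decode": 4},
		map[string]int{"prefill": 4, "decode": 2},
	)
	if err != nil {
		panic(err)
	}
	for i, b := range batches {
		fmt.Printf("batch %d: %v\n", i, b) // batch 0: map[decode:2 prefill:4], then batch 1 with the same counts
	}
}
```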
| 268 | + |
| 269 | +#### Controller Logic |
| 270 | + |
| 271 | +**Label Propagation Mechanism**: |
| 272 | +The RBG Operator propagates the role subgroup size to the underlying workloads. The workload controller (InstanceSet Operator) computes each instance's subgroup from the subgroup size and the role instance index, and configures the corresponding affinity policy on the role's Pods. |
| 273 | + |
| 274 | +**Workload Label Example**: |
| 275 | +```yaml |
| 276 | +rolebasedgroup.workloads.x-k8s.io/name: rbgs-test-1 |
| 277 | +rolebasedgroup.workloads.x-k8s.io/role: p |
| 278 | +rolebasedgroup.workloads.x-k8s.io/role-subgroup-size: "4" |
| 279 | +rolebasedgroup.workloads.x-k8s.io/topology-coordination-name: p-d-4-2-deploy-test |
| 280 | +``` |
| 281 | + |
| 282 | +**Topology Coordination Name Generation Strategy**: |
| 283 | +```shell |
| 284 | +hash(fmt.Sprintf("%s-%d", ${topology-coordination-name}, ${role-instance-index} / ${role-subgroup-size})) |
| 285 | +``` |
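
As a rough sketch of the generation strategy above: the subgroup index is the integer division of the role instance index by the subgroup size, so P and D instances of the same batch end up with the same name. The helper names are hypothetical, and FNV-1a is only a stand-in, since the KEP does not say which hash function is used.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// subgroupIndex maps a role instance index to its subgroup by integer division.
func subgroupIndex(roleInstanceIndex, subgroupSize int) (int, error) {
	if subgroupSize <= 0 || roleInstanceIndex < 0 {
		return 0, fmt.Errorf("invalid index %d or subgroup size %d", roleInstanceIndex, subgroupSize)
	}
	return roleInstanceIndex / subgroupSize, nil
}

// deployCoordinationName builds "<coordination-name>-<subgroup-index>".
// Hypothetical helper, not part of the RBG API.
func deployCoordinationName(coordinationName string, roleInstanceIndex, subgroupSize int) (string, error) {
	idx, err := subgroupIndex(roleInstanceIndex, subgroupSize)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s-%d", coordinationName, idx), nil
}

// hashName shortens a name to a fixed-width hex string.
// Assumption: FNV-1a is used here only because the hash function is unspecified.
func hashName(s string) string {
	h := fnv.New32a()
	h.Write([]byte(s)) // Write on a hash.Hash never returns an error
	return fmt.Sprintf("%08x", h.Sum32())
}

func main() {
	// Prefill instance 5 (subgroup size 4) and decode instance 3 (subgroup size 2)
	// both fall into subgroup 1, so they share one coordination name.
	p, _ := deployCoordinationName("p-d-4-2-deploy-test", 5, 4)
	d, _ := deployCoordinationName("p-d-4-2-deploy-test", 3, 2)
	fmt.Println(p, d, p == d) // p-d-4-2-deploy-test-1 p-d-4-2-deploy-test-1 true
	fmt.Println(hashName(p))
}
```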
| 286 | + |
| 287 | +**Pod Label Example**: |
| 288 | +```yaml |
| 289 | +rolebasedgroup.workloads.x-k8s.io/name: rbgs-test-1 |
| 290 | +rolebasedgroup.workloads.x-k8s.io/role: p |
| 291 | +rolebasedgroup.workloads.x-k8s.io/role-subgroup-size: "4" |
| 292 | +rolebasedgroup.workloads.x-k8s.io/role-instance-index: "1" |
| 293 | +rolebasedgroup.workloads.x-k8s.io/topology-coordination-name: p-d-4-2-deploy-test |
| 294 | +rolebasedgroup.workloads.x-k8s.io/deploy-topology-coordination-name: p-d-4-2-deploy-test-0 |
| 295 | +``` |
| 296 | + |
| 297 | +### Progressive Scheduling Deployment |
| 298 | + |
| 299 | +**Implementation Mechanism**: Batch scheduling is implemented with the PodSchedulingGate feature: |
| 300 | +- Wait until the Pods of the previous batch are successfully scheduled. |
| 301 | +- Remove the PodSchedulingGate from the Pods of the next batch. |
| 302 | +- The scheduler then starts scheduling the next batch of Pods. |
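
The gating loop above can be sketched with plain Go types, without a Kubernetes client. The `pod` struct and `releaseNextBatch` function are hypothetical; in a real controller the `Gated` flag would correspond to an entry in the Pod's `spec.schedulingGates`, and `Scheduled` to the Pod having been bound to a node.

```go
package main

import "fmt"

// pod is a simplified stand-in for a Kubernetes Pod (illustrative only).
type pod struct {
	Name      string
	Batch     int  // deployment batch the Pod belongs to
	Gated     bool // true while a scheduling gate is still set
	Scheduled bool // true once the scheduler has placed the Pod
}

// releaseNextBatch removes the gate from the lowest gated batch, but only if
// every Pod in all earlier batches is already scheduled. It returns the names
// of the Pods it released, or nil when nothing can be released yet.
func releaseNextBatch(pods []pod) []string {
	next := -1
	for _, p := range pods {
		if p.Gated && (next == -1 || p.Batch < next) {
			next = p.Batch
		}
	}
	if next == -1 {
		return nil // no gated Pods left
	}
	for _, p := range pods {
		if p.Batch < next && !p.Scheduled {
			return nil // an earlier batch is still pending
		}
	}
	var released []string
	for i := range pods {
		if pods[i].Batch == next && pods[i].Gated {
			pods[i].Gated = false
			released = append(released, pods[i].Name)
		}
	}
	return released
}

func main() {
	pods := []pod{
		{Name: "prefill-0", Batch: 0, Scheduled: true},
		{Name: "decode-0", Batch: 0, Scheduled: true},
		{Name: "prefill-4", Batch: 1, Gated: true},
		{Name: "decode-2", Batch: 1, Gated: true},
	}
	fmt.Println(releaseNextBatch(pods)) // [prefill-4 decode-2]
}
```

Calling this once per reconciliation keeps at most one batch in flight, which matches the step-by-step behavior described in User Story 3.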
| 303 | + |
| 304 | +### Risks and Mitigations |
| 305 | +- Rolling deploy and rolling update are mutually exclusive within a single reconciliation cycle |
| 306 | +- If scaling occurs during deploy coordination, update coordination is skipped for that cycle |