Checklist
Describe the bug
When using CoordinatedPolicy with rollingUpdate.maxSkew, a RoleInstanceSet rollout can get stuck if the coordinated roles have only one replica.
In this case the RBG controller calculates partition: 1 for a role with replicas: 1. The RoleInstanceSet controller then selects instances to update with ordinal >= partition, so no RoleInstance is eligible for update: the only instance has ordinal 0, and 0 >= 1 is false. The workload stays stuck indefinitely with currentRevision != updateRevision.
This is confusing because RoleInstanceSet.spec.roleInstanceTemplate is updated, but the existing RoleInstance.spec remains on the old revision. For example, the probes and container command differ between the RoleInstanceSet and the RoleInstance.
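To make the selection rule concrete, here is a minimal Go sketch of the ordinal >= partition check described above. It is not the controller's code: ordinalsToUpdate is a hypothetical helper written for this report, and it assumes ordinals start at 0, as the behavior above implies.

```go
package main

import "fmt"

// ordinalsToUpdate is a hypothetical helper (not taken from the RBG code base).
// It returns every ordinal in [0, replicas) that satisfies ordinal >= partition,
// which is the selection rule described in this report.
func ordinalsToUpdate(replicas, partition int) []int {
	selected := []int{}
	for ordinal := 0; ordinal < replicas; ordinal++ {
		if ordinal >= partition {
			selected = append(selected, ordinal)
		}
	}
	return selected
}

func main() {
	// The case from this report: replicas: 1 with partition: 1.
	fmt.Println(ordinalsToUpdate(1, 1)) // prints []

	// For comparison, partition: 0 would select the only instance.
	fmt.Println(ordinalsToUpdate(1, 0)) // prints [0]
}
```

With replicas: 1, any partition of 1 or more selects nothing, so the rollout cannot progress until the partition drops to 0.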
Reproduction
Create an RBG with two roles, e.g. prefill and decode, each with replicas: 1, and a CoordinatedPolicy:
apiVersion: workloads.x-k8s.io/v1alpha2
kind: CoordinatedPolicy
metadata:
  name: <rbg-name>
  namespace: <namespace>
spec:
  policies:
    - name: auto-all-predictors
      roles:
        - prefill
        - decode
      strategy:
        rollingUpdate:
          maxSkew: 10%
Then update the role template, for example by changing the command, model path, probe delay, or any other field that creates a new RoleInstanceSet revision.
Environment
latest 0.7.0 alpha3