[Bug] CoordinatedPolicy rollingUpdate maxSkew can stall RoleInstanceSet #307

@JasonHe-WQ

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

When using CoordinatedPolicy with rollingUpdate.maxSkew, a RoleInstanceSet rollout can get stuck if the coordinated roles have only one replica.

In this case, the RBG controller calculates partition: 1 for a role with replicas: 1. The RoleInstanceSet controller then selects instances to update with ordinal >= partition, so no RoleInstance is eligible for update. The workload stays stuck indefinitely with currentRevision != updateRevision.
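The selection rule described above can be illustrated with a minimal sketch. This is not the controller's actual code: the function name `instances_to_update` is hypothetical, and it assumes zero-based ordinals (the StatefulSet convention), which is consistent with a single replica at ordinal 0 being excluded by partition: 1.

```python
from typing import List


def instances_to_update(replicas: int, partition: int) -> List[int]:
    """Return the ordinals selected by the `ordinal >= partition` rule.

    Assumption: ordinals are zero-based, so a role with `replicas`
    instances has ordinals 0 .. replicas - 1.
    """
    return [ordinal for ordinal in range(replicas) if ordinal >= partition]


# The reported case: replicas: 1 with partition: 1 selects nothing,
# so the rollout never progresses.
print(instances_to_update(replicas=1, partition=1))  # []

# With partition: 0, the single instance would be eligible.
print(instances_to_update(replicas=1, partition=0))  # [0]
```

Under these assumptions, any partition equal to or greater than replicas yields an empty selection, which matches the observed stall.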

This is confusing because RoleInstanceSet.spec.roleInstanceTemplate is updated, but the existing RoleInstance.spec remains on the old revision. For example, the probes and container command differ between the RoleInstanceSet and the RoleInstance.

Reproduction

Create an RBG with two roles, e.g. prefill and decode, each with replicas: 1, and a CoordinatedPolicy:

apiVersion: workloads.x-k8s.io/v1alpha2
kind: CoordinatedPolicy
metadata:
  name: <rbg-name>
  namespace: <namespace>
spec:
  policies:
  - name: auto-all-predictors
    roles:
    - prefill
    - decode
    strategy:
      rollingUpdate:
        maxSkew: 10%

Then update the role template, for example by changing the command, model path, probe delay, or any other field that creates a new RoleInstanceSet revision.

Environment

latest 0.7.0 alpha3
