Skip to content

AutoscalingListener and EphemeralRunnerSet retain stale controller image/labels after upgrade — manual intervention required #4513

@xakaitetoia

Description

@xakaitetoia

Describe the bug

When upgrading the gha-runner-scale-set-controller, two controller-managed objects are not updated to reflect the new version:

  1. AutoscalingListener CRs — retain the old controller image in spec.image
  2. EphemeralRunnerSet objects — retain old version labels (app.kubernetes.io/version, helm.sh/chart)

The controller gates reconciliation on a spec hash. A controller-only upgrade does not change any AutoscalingRunnerSet spec, so the hash is identical and the controller skips reconciliation of both objects entirely. The updateStrategy flag has no effect here — it only governs spec-change rollout behaviour, not controller version upgrades.

Object hierarchy affected

AutoscalingRunnerSet   (Helm-managed)
├── AutoscalingListener   ← stale image after controller upgrade
└── EphemeralRunnerSet    ← stale labels/spec after controller upgrade
        └── EphemeralRunner

Why this matters

For minor bumps (e.g. 0.14.1 → 0.14.2) the EphemeralRunnerSet staleness may appear cosmetic. For major upgrades where the EphemeralRunnerSet or EphemeralRunner spec has breaking changes (new required fields, removed fields, changed defaults), stale objects under a new controller is a real correctness risk.

Steps to reproduce

  1. Deploy ARC controller + scale sets at version N
  2. Upgrade the controller chart to version N+1 (scale set chart also upgraded)
  3. Check listeners: kubectl get autoscalinglisteners -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.image}{"\n"}{end}'
  4. Check runner sets: kubectl get ephemeralrunnersets -A -o custom-columns='NAME:.metadata.name,VERSION:.metadata.labels.app\.kubernetes\.io/version'
  5. Both still show version N

Expected behaviour

After a controller upgrade, AutoscalingListener CRs and EphemeralRunnerSet objects should be reconciled to reflect the new version without manual intervention.

Actual behaviour

Both objects retain the old version. The controller sees the spec hash as unchanged and takes no action.

Workaround

For AutoscalingListener: delete all CRs — the controller recreates them immediately with the new image. Runner pods are unaffected.

kubectl delete autoscalinglisteners -A --all

For EphemeralRunnerSet: a dummy change to minRunners is not sufficient — it only affects the AutoscalingListener hash, not the EphemeralRunnerSet hash. A change to the runner pod template (spec.template.spec) is required, e.g. adding a dummy annotation:

spec:
  template:
    metadata:
      annotations:
        upgrade-trigger: "0.14.2"

This triggers a graceful transition — the old EphemeralRunnerSet drains (in-progress jobs complete) while the new one starts accepting jobs immediately. The annotation can be removed in a follow-up commit.

Full upgrade procedure (per scale set)

  1. Upgrade controller chart
  2. Delete all AutoscalingListener CRs
  3. Add a dummy annotation to spec.template.spec in each AutoscalingRunnerSet HelmRelease, then remove it

None of these steps should be manual.

Version: controller 0.14.2, scale-set 0.14.2, Kubernetes: RKE2

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions