Skip to content

fix(rbg-controller): override role.Replicas from bound RBGSA#306

Open
sebest wants to merge 1 commit into
sgl-project:mainfrom
sebest:fix/rbg-prevent-helm-replicas-stomp
Open

fix(rbg-controller): override role.Replicas from bound RBGSA#306
sebest wants to merge 1 commit into
sgl-project:mainfrom
sebest:fix/rbg-prevent-helm-replicas-stomp

Conversation

@sebest
Copy link
Copy Markdown
Contributor

@sebest sebest commented Apr 25, 2026

Summary

When a chart that gates spec.roles[i].replicas on Release.IsInstall is upgraded with the default client-side merge, the rendered manifest no longer carries the field, the apiserver fills it with the CRD default of 1 (+kubebuilder:default=1 on RoleSpec.Replicas), the RBG controller scales the workload to 1, and pods churn. The same class of bug applies to any external mutator that drops the field (kubectl edit from a manifest that omits replicas, non-SSA scripts, ArgoCD without ServerSideApply=true, etc.).

When a RoleBasedGroupScalingAdapter (RBGSA) owns a role's replicas, its spec.replicas is the runtime source of truth. This change adds applyRBGSAReplicasOverride at the top of RoleBasedGroupReconciler.Reconcile:

  • For each role with scalingAdapter.enable=true whose bound RBGSA has a spec.replicas that differs from role.Replicas, role.Replicas is overridden in-memory to the adapter's value before any reconcile step reads it.
  • The on-disk RBG field becomes advisory when an autoscaler is bound; the workload is never sized from a stale spec value.
  • A Warning Event ReplicasOverriddenByAdapter is emitted on each override so operators see external-mutator drift in kubectl describe.

The override is gated so callers that intentionally manage replicas without an autoscaler are unaffected: when scalingAdapter.enable is false, when the adapter is missing or unbound, or when adapter.spec.replicas is nil, the role's own value is kept.

The companion deploy-side recommendation — helm install/upgrade --server-side (or ArgoCD ServerSideApply=true) — remains the right fix at the protocol layer. This operator change is the safety net for clients that don't use server-side apply.

Test plan

  • make test — all unit tests pass, including the two new ones.
    • TestRoleBasedGroupReconciler_applyRBGSAReplicasOverride — 7-case table-driven suite covering every gating branch (disabled adapter, missing adapter, unbound adapter, nil adapter spec, equal/different replicas, role.Replicas nil), with record.FakeRecorder assertions that the Warning event fires exactly when the override does.
    • TestRoleBasedGroupReconciler_applyRBGSAReplicasOverride_GetError — interceptor-based test for the apiserver Get error path; verifies the wrapped error preserves the original via %w.
  • go test -run applyRBGSAReplicasOverride -count=10 ./internal/controller/workloads/... — stable across 10 runs.
  • make fmt vet — clean.

Notes

  • No CRD or API changes. No new metrics (consistent with the rest of the rbg operator's observability conventions: Events + structured logs + conditions).
  • Reuses existing helpers: scale.IsScalingAdapterEnable, scale.GenerateScalingAdapterName, constants.AdapterPhaseBound. No new abstractions.
  • A latency-optimization follow-up is possible (waking the RBGSA controller's writeback immediately on a spec.replicas stomp via a new Watches(&RoleBasedGroup{}, ..., predicate) on its SetupWithManager). It is intentionally not bundled here since it is a separate design discussion (new watch infrastructure) and the in-memory override on the next reconcile already prevents the workload from being sized to the stale value.

When a chart that gates `spec.roles[i].replicas` on `Release.IsInstall`
is upgraded with the default client-side merge, the rendered manifest
no longer carries the field, the apiserver fills it with the CRD
default of 1 (`+kubebuilder:default=1`), the RBG controller scales
the workload to 1, and pods churn. The same class of bug applies to
any external mutator that drops the field (`kubectl edit` from a
manifest that omits replicas, non-SSA scripts, ArgoCD without
`ServerSideApply=true`, etc.).

When a `RoleBasedGroupScalingAdapter` (RBGSA) owns a role's replicas,
its `spec.replicas` is the runtime source of truth. This change adds
`applyRBGSAReplicasOverride` at the top of `Reconcile`: for each role
with a bound RBGSA whose `spec.replicas` differs from `role.Replicas`,
the role value is corrected in-memory before any reconcile step reads
it, so the workload is never sized from a stale spec value. A Warning
Event `ReplicasOverriddenByAdapter` is emitted on each override so
operators see external-mutator drift in `kubectl describe`.

The override is gated so callers that intentionally manage replicas
without an autoscaler are unaffected: when `scalingAdapter.enable`
is false, the adapter is missing or unbound, or `adapter.spec.replicas`
is nil, the role's own value is kept.

Tests:

- `TestRoleBasedGroupReconciler_applyRBGSAReplicasOverride`: 7-case
  table-driven suite covering every gating branch (disabled adapter,
  missing adapter, unbound adapter, nil adapter spec, equal/different
  replicas, role.Replicas nil), with `record.FakeRecorder` assertions
  that the Warning event fires exactly when the override does.
- `TestRoleBasedGroupReconciler_applyRBGSAReplicasOverride_GetError`:
  interceptor-based test for the apiserver Get error path; verifies
  the wrapped error preserves the original via %w.

Companion deploy-side recommendation: `helm install/upgrade
--server-side` (or ArgoCD `ServerSideApply=true`) is the right fix
at the protocol layer. The operator change is the safety net for
clients that don't use server-side apply.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to prevent replica count drift in RoleBasedGroup workloads when using a scaling adapter. It adds the applyRBGSAReplicasOverride method to the reconciler, which ensures that the scaling adapter's desired replica count takes precedence over values potentially reset by external mutators like Helm. A new event, ReplicasOverriddenByAdapter, is emitted when such an override occurs. Comprehensive unit tests have been added to verify the logic across various states of the scaling adapter. I have no feedback to provide.

@sebest sebest marked this pull request as ready for review April 25, 2026 20:50
sebest added a commit to sebest/rbg that referenced this pull request Apr 26, 2026
Extend RBGRoleSpecOrStatusPredicate (renamed from RBGRoleStatusPredicate)
to also fire when any role's spec.replicas value changes between the old
and new RBG. Without this, the RBGSA controller's writeback path only
wakes on status changes, so an external mutator that stomps
spec.roles[].replicas (e.g. helm upgrade dropping the field, kubectl
edit, non-SSA scripts) waits for the next status-change event before
the apiserver-stored spec is corrected.

Pairs with the in-memory override added in the RBG controller (PR sgl-project#306):
the override prevents the workload from being sized to the stale value
on the next reconcile, and this predicate change closes the latency
gap by waking the writeback immediately on the stomp instead of when
status next changes.

The new rolesReplicasDiffer helper matches roles by name; added,
removed, and renamed roles all count as a difference. We deliberately
do not fire on every spec change — other fields (image, env, command)
churn frequently and the adapter reconcile is irrelevant to them.
@cheyang cheyang requested a review from Copilot April 28, 2026 02:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR prevents unintended scaling to the CRD default replicas (1) when external mutators (e.g., Helm client-side merge) drop spec.roles[i].replicas on upgrade, by treating a bound RoleBasedGroupScalingAdapter’s spec.replicas as the runtime source of truth during reconciliation.

Changes:

  • Add an in-memory override step (applyRBGSAReplicasOverride) at the start of RoleBasedGroupReconciler.Reconcile to align role.Replicas with a bound adapter’s spec.replicas before any downstream logic reads it.
  • Emit a Warning Event (ReplicasOverriddenByAdapter) whenever an override occurs to surface drift to operators.
  • Add unit tests covering gating branches and the adapter Get error path (via controller-runtime fake client interceptors).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
internal/controller/workloads/rolebasedgroup_controller.go Adds the replicas override helper and invokes it early in Reconcile.
internal/controller/workloads/event.go Introduces the ReplicasOverriddenByAdapter event reason constant and documentation.
internal/controller/workloads/rolebasedgroup_controller_test.go Adds table-driven tests for override behavior and an interceptor-based error-path test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +827 to +836
adapter := &workloadsv1alpha2.RoleBasedGroupScalingAdapter{}
adapterName := scale.GenerateScalingAdapterName(rbg.Name, role.Name)
if err := r.client.Get(ctx, types.NamespacedName{
Namespace: rbg.Namespace, Name: adapterName,
}, adapter); err != nil {
if apierrors.IsNotFound(err) {
continue
}
return fmt.Errorf("get scaling adapter %s/%s: %w", rbg.Namespace, adapterName, err)
}
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

applyRBGSAReplicasOverride reads the scaling adapter using r.client.Get, which may be served from the controller-runtime cache. If the cache is stale (or the informer hasn’t observed the latest update yet), this can override role.Replicas to an out-of-date adapter value and potentially scale workloads incorrectly. Consider using r.apiReader.Get (similar to updateRBGStatus, which explicitly bypasses cache for freshness) for this read so the adapter is treated as the authoritative source of truth.

Copilot uses AI. Check for mistakes.
Comment on lines +1207 to +1213
err := r.applyRBGSAReplicasOverride(context.Background(), rbg)
if err == nil {
t.Fatal("expected error, got nil")
}
if !strings.Contains(err.Error(), "get scaling adapter") {
t.Fatalf("expected wrapped error mentioning 'get scaling adapter', got %v", err)
}
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test claims (in the PR description) to verify that the returned error preserves the original via %w, but it only asserts on the error string. To actually validate wrapping, assert errors.Is(err, getErr) (or errors.Unwrap(err) == getErr) in addition to the message check.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants