fix(rbg-controller): override role.Replicas from bound RBGSA#306
Conversation
When a chart that gates `spec.roles[i].replicas` on `Release.IsInstall` is upgraded with the default client-side merge, the rendered manifest no longer carries the field, the apiserver fills it with the CRD default of 1 (`+kubebuilder:default=1`), the RBG controller scales the workload to 1, and pods churn. The same class of bug applies to any external mutator that drops the field (`kubectl edit` from a manifest that omits replicas, non-SSA scripts, ArgoCD without `ServerSideApply=true`, etc.). When a `RoleBasedGroupScalingAdapter` (RBGSA) owns a role's replicas, its `spec.replicas` is the runtime source of truth. This change adds `applyRBGSAReplicasOverride` at the top of `Reconcile`: for each role with a bound RBGSA whose `spec.replicas` differs from `role.Replicas`, the role value is corrected in-memory before any reconcile step reads it, so the workload is never sized from a stale spec value. A Warning Event `ReplicasOverriddenByAdapter` is emitted on each override so operators see external-mutator drift in `kubectl describe`. The override is gated so callers that intentionally manage replicas without an autoscaler are unaffected: when `scalingAdapter.enable` is false, the adapter is missing or unbound, or `adapter.spec.replicas` is nil, the role's own value is kept. Tests: - `TestRoleBasedGroupReconciler_applyRBGSAReplicasOverride`: 7-case table-driven suite covering every gating branch (disabled adapter, missing adapter, unbound adapter, nil adapter spec, equal/different replicas, role.Replicas nil), with `record.FakeRecorder` assertions that the Warning event fires exactly when the override does. - `TestRoleBasedGroupReconciler_applyRBGSAReplicasOverride_GetError`: interceptor-based test for the apiserver Get error path; verifies the wrapped error preserves the original via %w. Companion deploy-side recommendation: `helm install/upgrade --server-side` (or ArgoCD `ServerSideApply=true`) is the right fix at the protocol layer. The operator change is the safety net for clients that don't use server-side apply.
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to prevent replica count drift in RoleBasedGroup workloads when using a scaling adapter. It adds the applyRBGSAReplicasOverride method to the reconciler, which ensures that the scaling adapter's desired replica count takes precedence over values potentially reset by external mutators like Helm. A new event, ReplicasOverriddenByAdapter, is emitted when such an override occurs. Comprehensive unit tests have been added to verify the logic across various states of the scaling adapter. I have no feedback to provide.
Extend RBGRoleSpecOrStatusPredicate (renamed from RBGRoleStatusPredicate) to also fire when any role's spec.replicas value changes between the old and new RBG. Without this, the RBGSA controller's writeback path only wakes on status changes, so an external mutator that stomps spec.roles[].replicas (e.g. helm upgrade dropping the field, kubectl edit, non-SSA scripts) waits for the next status-change event before the apiserver-stored spec is corrected. Pairs with the in-memory override added in the RBG controller (PR sgl-project#306): the override prevents the workload from being sized to the stale value on the next reconcile, and this predicate change closes the latency gap by waking the writeback immediately on the stomp instead of when status next changes. The new rolesReplicasDiffer helper matches roles by name; added, removed, and renamed roles all count as a difference. We deliberately do not fire on every spec change — other fields (image, env, command) churn frequently and the adapter reconcile is irrelevant to them.
There was a problem hiding this comment.
Pull request overview
This PR prevents unintended scaling to the CRD default replicas (1) when external mutators (e.g., Helm client-side merge) drop spec.roles[i].replicas on upgrade, by treating a bound RoleBasedGroupScalingAdapter’s spec.replicas as the runtime source of truth during reconciliation.
Changes:
- Add an in-memory override step (
applyRBGSAReplicasOverride) at the start ofRoleBasedGroupReconciler.Reconcileto alignrole.Replicaswith a bound adapter’sspec.replicasbefore any downstream logic reads it. - Emit a Warning Event (
ReplicasOverriddenByAdapter) whenever an override occurs to surface drift to operators. - Add unit tests covering gating branches and the adapter
Geterror path (via controller-runtime fake client interceptors).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| internal/controller/workloads/rolebasedgroup_controller.go | Adds the replicas override helper and invokes it early in Reconcile. |
| internal/controller/workloads/event.go | Introduces the ReplicasOverriddenByAdapter event reason constant and documentation. |
| internal/controller/workloads/rolebasedgroup_controller_test.go | Adds table-driven tests for override behavior and an interceptor-based error-path test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| adapter := &workloadsv1alpha2.RoleBasedGroupScalingAdapter{} | ||
| adapterName := scale.GenerateScalingAdapterName(rbg.Name, role.Name) | ||
| if err := r.client.Get(ctx, types.NamespacedName{ | ||
| Namespace: rbg.Namespace, Name: adapterName, | ||
| }, adapter); err != nil { | ||
| if apierrors.IsNotFound(err) { | ||
| continue | ||
| } | ||
| return fmt.Errorf("get scaling adapter %s/%s: %w", rbg.Namespace, adapterName, err) | ||
| } |
There was a problem hiding this comment.
applyRBGSAReplicasOverride reads the scaling adapter using r.client.Get, which may be served from the controller-runtime cache. If the cache is stale (or the informer hasn’t observed the latest update yet), this can override role.Replicas to an out-of-date adapter value and potentially scale workloads incorrectly. Consider using r.apiReader.Get (similar to updateRBGStatus, which explicitly bypasses cache for freshness) for this read so the adapter is treated as the authoritative source of truth.
| err := r.applyRBGSAReplicasOverride(context.Background(), rbg) | ||
| if err == nil { | ||
| t.Fatal("expected error, got nil") | ||
| } | ||
| if !strings.Contains(err.Error(), "get scaling adapter") { | ||
| t.Fatalf("expected wrapped error mentioning 'get scaling adapter', got %v", err) | ||
| } |
There was a problem hiding this comment.
This test claims (in the PR description) to verify that the returned error preserves the original via %w, but it only asserts on the error string. To actually validate wrapping, assert errors.Is(err, getErr) (or errors.Unwrap(err) == getErr) in addition to the message check.
Summary
When a chart that gates
spec.roles[i].replicasonRelease.IsInstallis upgraded with the default client-side merge, the rendered manifest no longer carries the field, the apiserver fills it with the CRD default of1(+kubebuilder:default=1onRoleSpec.Replicas), the RBG controller scales the workload to 1, and pods churn. The same class of bug applies to any external mutator that drops the field (kubectl editfrom a manifest that omits replicas, non-SSA scripts, ArgoCD withoutServerSideApply=true, etc.).When a
RoleBasedGroupScalingAdapter(RBGSA) owns a role's replicas, itsspec.replicasis the runtime source of truth. This change addsapplyRBGSAReplicasOverrideat the top ofRoleBasedGroupReconciler.Reconcile:scalingAdapter.enable=truewhose bound RBGSA has aspec.replicasthat differs fromrole.Replicas,role.Replicasis overridden in-memory to the adapter's value before any reconcile step reads it.ReplicasOverriddenByAdapteris emitted on each override so operators see external-mutator drift inkubectl describe.The override is gated so callers that intentionally manage replicas without an autoscaler are unaffected: when
scalingAdapter.enableis false, when the adapter is missing or unbound, or whenadapter.spec.replicasis nil, the role's own value is kept.The companion deploy-side recommendation —
helm install/upgrade --server-side(or ArgoCDServerSideApply=true) — remains the right fix at the protocol layer. This operator change is the safety net for clients that don't use server-side apply.Test plan
make test— all unit tests pass, including the two new ones.TestRoleBasedGroupReconciler_applyRBGSAReplicasOverride— 7-case table-driven suite covering every gating branch (disabled adapter, missing adapter, unbound adapter, nil adapter spec, equal/different replicas,role.Replicasnil), withrecord.FakeRecorderassertions that the Warning event fires exactly when the override does.TestRoleBasedGroupReconciler_applyRBGSAReplicasOverride_GetError— interceptor-based test for the apiserverGeterror path; verifies the wrapped error preserves the original via%w.go test -run applyRBGSAReplicasOverride -count=10 ./internal/controller/workloads/...— stable across 10 runs.make fmt vet— clean.Notes
scale.IsScalingAdapterEnable,scale.GenerateScalingAdapterName,constants.AdapterPhaseBound. No new abstractions.spec.replicasstomp via a newWatches(&RoleBasedGroup{}, ..., predicate)on itsSetupWithManager). It is intentionally not bundled here since it is a separate design discussion (new watch infrastructure) and the in-memory override on the next reconcile already prevents the workload from being sized to the stale value.