K8SPG-957: backup controller must watch VolumeSnapshots only if API is installed#1464
Conversation
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
There was a problem hiding this comment.
Pull request overview
This PR prevents the PerconaPGBackup controller from crashing on clusters (e.g., EKS) where the VolumeSnapshot API is not installed by conditionally registering the VolumeSnapshot watch only when the feature gate is enabled and the API is discoverable.
Changes:
- Added a Kubernetes discovery helper (
GroupVersionKindExists) to detect whether a GroupVersion/Kind is available on the API server. - Updated the pgbackup controller setup to only
Owns(VolumeSnapshot)whenfeature.BackupSnapshotsis enabled and the VolumeSnapshot API is installed. - Updated operator wiring to pass
ctxinto the updated pgbackup controllerSetupWithManager.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
percona/k8s/util.go |
Adds API discovery helper to check for GroupVersion/Kind availability. |
percona/controller/pgbackup/controller.go |
Gates the VolumeSnapshot watch behind feature flag + API discovery to avoid cache sync failure. |
cmd/postgres-operator/main.go |
Updates pgbackup controller initialization to pass context into SetupWithManager. |
| if r.DiscoveryClient == nil { | ||
| var err error | ||
| r.DiscoveryClient, err = discovery.NewDiscoveryClientForConfig(mgr.GetConfig()) | ||
| if err != nil { | ||
| return errors.Wrap(err, "failed to create discovery client") | ||
| } | ||
| } |
There was a problem hiding this comment.
- Problem:
DiscoveryClientis always initialized inSetupWithManager, even whenfeature.BackupSnapshotsis disabled (the only current use site). - Why it matters: This adds unnecessary client construction and discovery traffic during startup for the default/disabled case.
- Fix: Only create the discovery client inside the
feature.Enabled(ctx, feature.BackupSnapshots)block (or create a local client there) so it is only initialized when the conditional watch is actually needed.
| // Watch VolumeSnapshots if the feature is enabled and the API is available. | ||
| if feature.Enabled(ctx, feature.BackupSnapshots) { | ||
| gvk := volumesnapshotv1.SchemeGroupVersion.WithKind(pNaming.KindVolumeSnapshot) | ||
| gv := gvk.GroupVersion().String() | ||
| if ok, err := k8s.GroupVersionKindExists(r.DiscoveryClient, gv, gvk.Kind); err != nil { | ||
| return errors.Wrap(err, "check VolumeSnapshot API availability") | ||
| } else if ok { | ||
| b = b.Owns(&volumesnapshotv1.VolumeSnapshot{}) | ||
| } | ||
| } |
There was a problem hiding this comment.
- Problem: The new conditional watch behavior for VolumeSnapshots (to prevent startup crashes when the API is missing) is not covered by tests.
- Why it matters: This is a regression fix for an operator crash path; without a test, it’s easy to reintroduce the unconditional watch and break EKS again.
- Fix: Add a regression test (preferably envtest) that starts a manager without the VolumeSnapshot CRDs installed and asserts
PGBackupReconciler.SetupWithManager(...)succeeds (and/or that it only adds the watch when the feature gate is enabled and the API is present).
| // GroupVersionKindExists checks to see whether a given Kind for a given | ||
| // GroupVersion exists in the Kubernetes API Server. | ||
| func GroupVersionKindExists(dc *discovery.DiscoveryClient, groupVersion, kind string) (bool, error) { | ||
| if dc == nil { | ||
| return false, errors.New("discovery client is nil") | ||
| } | ||
|
|
||
| resourceList, err := dc.ServerResourcesForGroupVersion(groupVersion) | ||
| if err != nil { | ||
| if k8serrors.IsNotFound(err) { | ||
| return false, nil | ||
| } | ||
| return false, err | ||
| } | ||
|
|
||
| for _, resource := range resourceList.APIResources { | ||
| if resource.Kind == kind { | ||
| return true, nil | ||
| } | ||
| } | ||
|
|
||
| return false, nil | ||
| } |
There was a problem hiding this comment.
- Problem:
GroupVersionKindExistsduplicates the existingReconciler.GroupVersionKindExistshelper ininternal/controller/postgrescluster/controller.gowith near-identical discovery logic. - Why it matters: Duplicated helpers tend to drift (different nil/return behavior, error handling, etc.), making future changes harder and increasing the chance of inconsistent behavior.
- Fix: Consolidate this logic into a single shared helper (e.g., keep it in
percona/k8sand have the PostgresCluster reconciler call it), or remove this helper and reuse the existing one where appropriate.
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
| return (builder.ControllerManagedBy(mgr). | ||
|
|
||
| if r.DiscoveryClient == nil { | ||
| var err error |
There was a problem hiding this comment.
what's the reason for having this var instead of doing r.DiscoveryClient, err := discovery.NewDiscoveryClientForConfig(mgr.GetConfig())? The error is returned, and it is already scoped in the if-clause
There was a problem hiding this comment.
That's invalid syntax, you cannot use := with r.DiscoveryClient since it is not a new variable. That means err also must be declared first in order to use with =:
basically..
non-name r.DiscoveryClient on left side of := [BadDecl]
Signed-off-by: Mayank Shah <mayank.shah@percona.com>
commit: 0abbe2f |
CHANGE DESCRIPTION
Problem:
PG operator crash on EKS.
Cause:
The backup controller tries to watch VolumeSnapshots, but the the API does not come pre-installed in EKS (unlike GKE). As a result, the caches fail to start.
Solution:
The watcher must be enabled only if the API is installed and feature gate enabled.
CHECKLIST
Jira
Needs Doc) and QA (Needs QA)?Tests
Config/Logging/Testability