K8SPG-374: handle standby lag detection errors #1462
Pull request overview
This pull request enhances error handling for standby lag detection in PostgreSQL cluster replication. It introduces sentinel errors and graceful error handling for transient conditions that can occur during cluster initialization or when replication is not yet established.
Changes:
- Added sentinel errors `ErrPrimaryPodNotFound` and `ErrInvalidLagQueryOutput` for better error classification
- Enhanced error handling in `reconcileStandbyLag` to set condition status to Unknown for recoverable error scenarios
- Added empty string validation before parsing lag values from database queries
- Reduced logging verbosity for periodic requeue operations
- Removed unused `fmt` import from pgbackup controller
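The empty-string validation mentioned in the list above could look roughly like the following. This is a minimal sketch, not the PR's actual code: `parseLagBytes` is a hypothetical helper name, and the assumption is that the lag query returns its result as a plain string that is empty when no WAL receiver row exists.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// ErrInvalidLagQueryOutput mirrors the sentinel error named in this PR.
var ErrInvalidLagQueryOutput = errors.New("invalid lag query output")

// parseLagBytes is a hypothetical helper: it rejects empty query output
// with the sentinel error before attempting to parse the lag value.
func parseLagBytes(out string) (int64, error) {
	trimmed := strings.TrimSpace(out)
	if trimmed == "" {
		// No rows or NULL: the caller can treat the lag as unknown.
		return 0, ErrInvalidLagQueryOutput
	}
	lag, err := strconv.ParseInt(trimmed, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse lag %q: %w", trimmed, err)
	}
	return lag, nil
}

func main() {
	for _, out := range []string{"1024\n", "", "abc"} {
		lag, err := parseLagBytes(out)
		switch {
		case errors.Is(err, ErrInvalidLagQueryOutput):
			fmt.Println("lag unknown: empty query output")
		case err != nil:
			fmt.Println("error:", err)
		default:
			fmt.Println("lag bytes:", lag)
		}
	}
}
```

Checking for the empty string first keeps the "replication not established yet" case distinguishable from a genuinely malformed value, which still surfaces as an ordinary parse error.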
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| percona/controller/pgcluster/standby.go | Implements improved error handling for standby lag detection with sentinel errors, graceful handling of transient conditions, and empty string validation for query outputs |
| percona/controller/pgbackup/controller.go | Removes unused fmt import (cleanup) |
lagBytes, err := r.getStandbyLag(ctx, cr)
if err != nil {
	if errors.Is(err, ErrPrimaryPodNotFound) && cr.Status.State != v2.AppStateReady {
Is it possible to have app state ready while the primary cannot be found?
It refers to the primary pod of the main site, not the standby.
if errors.Is(err, ErrPrimaryPodNotFound) && cr.Status.State != v2.AppStateReady {
	meta.SetStatusCondition(&cr.Status.Conditions, metav1.Condition{
		Type:    postgrescluster.ConditionStandbyLagging,
		Status:  metav1.ConditionUnknown,
		Reason:  "PrimaryNotFound",
		Message: "Cannot find primary for replication lag calculation",
	})
	return nil
}
if errors.Is(err, ErrInvalidLagQueryOutput) {
	meta.SetStatusCondition(&cr.Status.Conditions, metav1.Condition{
		Type:    postgrescluster.ConditionStandbyLagging,
		Status:  metav1.ConditionUnknown,
		Reason:  "InvalidLagQueryOutput",
		Message: "Invalid output from lag query. The WAL receiver is probably not active",
	})
	return nil
}
I think we should unconditionally set metav1.ConditionUnknown for this condition whenever we get a non-nil error, regardless of what the error actually is.
We can have a standard reason, ErrorGettingLag, and the message can be the error string. WDYT?
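The suggestion above might be sketched as follows. This is an illustration only: `Condition` and `setCondition` are simplified local stand-ins for `metav1.Condition` and `meta.SetStatusCondition` so that the example runs without Kubernetes dependencies, `markLagUnknown` is a hypothetical helper, and the `"StandbyLagging"` type string is an assumed value for `postgrescluster.ConditionStandbyLagging`.

```go
package main

import (
	"errors"
	"fmt"
)

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type, Status, Reason, Message string
}

// setCondition is a simplified stand-in for meta.SetStatusCondition:
// it replaces the condition with the same Type or appends a new one.
func setCondition(conds *[]Condition, c Condition) {
	for i := range *conds {
		if (*conds)[i].Type == c.Type {
			(*conds)[i] = c
			return
		}
	}
	*conds = append(*conds, c)
}

// Assumed value; the real constant lives in the postgrescluster package.
const conditionStandbyLagging = "StandbyLagging"

var (
	errPrimaryPodNotFound    = errors.New("primary pod not found")
	errInvalidLagQueryOutput = errors.New("invalid lag query output")
)

// markLagUnknown (hypothetical) records any lag-detection error as an
// Unknown condition with one standard reason and the error text as message.
func markLagUnknown(conds *[]Condition, err error) {
	setCondition(conds, Condition{
		Type:    conditionStandbyLagging,
		Status:  "Unknown",
		Reason:  "ErrorGettingLag",
		Message: err.Error(),
	})
}

func main() {
	var conds []Condition
	markLagUnknown(&conds, errPrimaryPodNotFound)
	markLagUnknown(&conds, errInvalidLagQueryOutput)
	fmt.Printf("%+v\n", conds)
}
```

With a single reason, callers no longer need one branch per sentinel error; the trade-off is that consumers who want to distinguish the causes have to read the message text.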
var (
	ErrPrimaryPodNotFound    = errors.New("primary pod not found")
	ErrInvalidLagQueryOutput = errors.New("invalid lag query output")
)
Problem: ErrPrimaryPodNotFound / ErrInvalidLagQueryOutput are exported but (currently) only used within this file.
Why it matters: Exporting expands the package API surface and makes future refactors harder.
Fix: Make these errors unexported (e.g., errPrimaryPodNotFound, errInvalidLagQueryOutput) unless they are intended to be referenced from other packages.
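A minimal sketch of the proposed fix, assuming the errors stay package-private. `findPrimary` and `isTransient` are hypothetical names used only for illustration; the point is that `errors.Is` keeps working when an unexported sentinel is wrapped with `%w`.

```go
package main

import (
	"errors"
	"fmt"
)

// Unexported sentinels: usable inside the package, invisible to importers.
var (
	errPrimaryPodNotFound    = errors.New("primary pod not found")
	errInvalidLagQueryOutput = errors.New("invalid lag query output")
)

// findPrimary is a hypothetical lookup that wraps the sentinel with context.
func findPrimary(pods []string) (string, error) {
	if len(pods) == 0 {
		return "", fmt.Errorf("lookup in cluster: %w", errPrimaryPodNotFound)
	}
	return pods[0], nil
}

// isTransient reports whether err matches one of the sentinel errors,
// including when it has been wrapped.
func isTransient(err error) bool {
	return errors.Is(err, errPrimaryPodNotFound) ||
		errors.Is(err, errInvalidLagQueryOutput)
}

func main() {
	_, err := findPrimary(nil)
	fmt.Println("error:", err)
	fmt.Println("transient:", isTransient(err))
}
```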
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
commit: b00d26a
https://perconadev.atlassian.net/browse/K8SPG-374
DESCRIPTION
This PR improves standby lag detection by handling two errors that can occur when the source cluster is paused.
If the primary pod cannot be identified during lag detection, the operator sets a corresponding condition on the `PerconaPGCluster` resource.
If the lag detection query returns no rows or NULL (for example, when `pg_stat_wal_receiver` is empty), the operator likewise sets a condition on the `PerconaPGCluster` resource.
Additionally, this PR moves the log message "Requeuing standby cluster for lag check" from `INFO` to `DEBUG`.