K8SPG-374: handle standby lag detection errors #1462

Merged
hors merged 9 commits into main from K8SPG-374-fix on Mar 4, 2026
Contributor

@pooknull pooknull commented Feb 26, 2026

https://perconadev.atlassian.net/browse/K8SPG-374

DESCRIPTION

This PR improves standby lag detection by handling two errors that can occur when the source cluster is paused.

  1. If the primary pod cannot be identified during lag detection, the operator sets the following condition on the PerconaPGCluster resource:

     Type:    postgrescluster.ConditionStandbyLagging,
     Status:  metav1.ConditionUnknown,
     Reason:  "ErrorGettingLag",
     Message: "Cannot find primary for replication lag calculation",

  2. If the lag detection query returns no rows/NULL (for example, when pg_stat_wal_receiver is empty), the operator sets the following condition on the PerconaPGCluster resource:

     Type:    postgrescluster.ConditionStandbyLagging,
     Status:  metav1.ConditionUnknown,
     Reason:  "ErrorGettingLag",
     Message: "Invalid output from lag query. The WAL receiver is probably not active",

Additionally, this PR moves the log message "Requeuing standby cluster for lag check" from INFO to DEBUG.
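
The two cases above can be sketched in plain Go as follows. This is only an illustration, not the operator's code: `Condition` is a simplified stand-in for `metav1.Condition`, the string `"StandbyLagging"` is an assumed value for `postgrescluster.ConditionStandbyLagging`, and the sentinel error names and the `lagErrorCondition` helper are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Condition is a simplified, hypothetical stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// Sentinel errors; the names are assumptions made for this sketch.
var (
	errPrimaryPodNotFound    = errors.New("primary pod not found")
	errInvalidLagQueryOutput = errors.New("invalid lag query output")
)

// lagErrorCondition maps a lag detection error to the condition from the
// description. It reports false when the error is not one of the two cases.
func lagErrorCondition(err error) (Condition, bool) {
	c := Condition{
		Type:   "StandbyLagging", // assumed value of postgrescluster.ConditionStandbyLagging
		Status: "Unknown",
		Reason: "ErrorGettingLag",
	}
	switch {
	case errors.Is(err, errPrimaryPodNotFound):
		c.Message = "Cannot find primary for replication lag calculation"
	case errors.Is(err, errInvalidLagQueryOutput):
		c.Message = "Invalid output from lag query. The WAL receiver is probably not active"
	default:
		return Condition{}, false
	}
	return c, true
}

func main() {
	// Wrapping with %w keeps the sentinel visible to errors.Is.
	err := fmt.Errorf("get standby lag: %w", errInvalidLagQueryOutput)
	if c, ok := lagErrorCondition(err); ok {
		fmt.Printf("%s=%s (%s): %s\n", c.Type, c.Status, c.Reason, c.Message)
	}
}
```

In the real controller the condition would presumably be written with `meta.SetStatusCondition`, as the reviewed snippet further down shows.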

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PG version?
  • Does the change support oldest and newest supported Kubernetes version?

Copilot AI review requested due to automatic review settings February 26, 2026 12:24
Contributor

Copilot AI left a comment

Pull request overview

This pull request enhances error handling for standby lag detection in PostgreSQL cluster replication. It introduces sentinel errors and graceful error handling for transient conditions that can occur during cluster initialization or when replication is not yet established.

Changes:

  • Added sentinel errors ErrPrimaryPodNotFound and ErrInvalidLagQueryOutput for better error classification
  • Enhanced error handling in reconcileStandbyLag to set condition status to Unknown for recoverable error scenarios
  • Added empty string validation before parsing lag values from database queries
  • Reduced logging verbosity for periodic requeue operations
  • Removed unused fmt import from pgbackup controller
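
As a rough illustration of the "empty string validation before parsing" item, the sketch below rejects blank query output with a sentinel error before attempting to parse it. The function name `parseLagBytes`, the unexported sentinel, and the assumption that the query returns the lag as integer text are all hypothetical and not taken from the PR diff.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// errInvalidLagQueryOutput marks output that cannot represent a lag value.
// The name mirrors the sentinel mentioned above but is an assumption here.
var errInvalidLagQueryOutput = errors.New("invalid lag query output")

// parseLagBytes converts raw query output into a byte count.
// Blank output (no rows or NULL) is reported through the sentinel error,
// so callers can distinguish it from a malformed number.
func parseLagBytes(raw string) (int64, error) {
	s := strings.TrimSpace(raw)
	if s == "" {
		return 0, errInvalidLagQueryOutput
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse lag %q: %w", s, err)
	}
	return n, nil
}

func main() {
	for _, raw := range []string{"1048576\n", "", "abc"} {
		n, err := parseLagBytes(raw)
		fmt.Println(n, err)
	}
}
```

Checking for the empty string first avoids surfacing a generic `strconv` syntax error for what is really a "WAL receiver not active" situation.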

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| percona/controller/pgcluster/standby.go | Implements improved error handling for standby lag detection with sentinel errors, graceful handling of transient conditions, and empty string validation for query outputs |
| percona/controller/pgbackup/controller.go | Removes unused fmt import (cleanup) |

Comment thread percona/controller/pgcluster/standby.go
Comment thread percona/controller/pgcluster/standby.go Outdated
Comment thread percona/controller/pgcluster/standby.go
Comment thread percona/controller/pgcluster/standby.go Outdated
Copilot AI review requested due to automatic review settings February 26, 2026 14:11
@pooknull pooknull marked this pull request as ready for review February 26, 2026 14:11
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

egegunes previously approved these changes Feb 27, 2026
Comment thread percona/controller/pgcluster/standby.go Outdated

lagBytes, err := r.getStandbyLag(ctx, cr)
if err != nil {
	if errors.Is(err, ErrPrimaryPodNotFound) && cr.Status.State != v2.AppStateReady {
Contributor

@gkech gkech Mar 2, 2026

Is it possible for the app state to be ready while the primary cannot be found?

Member

It refers to the primary pod of the main site, not the standby.

Comment thread percona/controller/pgcluster/standby.go Outdated
Comment on lines +75 to +92
if errors.Is(err, ErrPrimaryPodNotFound) && cr.Status.State != v2.AppStateReady {
	meta.SetStatusCondition(&cr.Status.Conditions, metav1.Condition{
		Type:    postgrescluster.ConditionStandbyLagging,
		Status:  metav1.ConditionUnknown,
		Reason:  "PrimaryNotFound",
		Message: "Cannot find primary for replication lag calculation",
	})
	return nil
}
if errors.Is(err, ErrInvalidLagQueryOutput) {
	meta.SetStatusCondition(&cr.Status.Conditions, metav1.Condition{
		Type:    postgrescluster.ConditionStandbyLagging,
		Status:  metav1.ConditionUnknown,
		Reason:  "InvalidLagQueryOutput",
		Message: "Invalid output from lag query. The WAL receiver is probably not active",
	})
	return nil
}
Member

I think we should unconditionally set metav1.ConditionUnknown for this condition whenever we have a non-nil error, regardless of what the error actually is.

We can have a standard reason, ErrorGettingLag, and the message can be the error string. WDYT?
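
A minimal sketch of this suggestion, under stated assumptions: `Condition` and `upsertCondition` are simplified stand-ins for `metav1.Condition` and `meta.SetStatusCondition`, `"StandbyLagging"` is an assumed value for `postgrescluster.ConditionStandbyLagging`, and `markLagUnknown` is a hypothetical helper name.

```go
package main

import (
	"errors"
	"fmt"
)

// Condition is a simplified, hypothetical stand-in for metav1.Condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// errPrimaryPodNotFound is an example error used only for the demonstration.
var errPrimaryPodNotFound = errors.New("primary pod not found")

// upsertCondition replaces the condition with the same Type or appends it.
// It imitates only that part of meta.SetStatusCondition.
func upsertCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// markLagUnknown records any non-nil error with one standard reason and
// the error string as the message, without branching on the error kind.
func markLagUnknown(conds []Condition, err error) []Condition {
	if err == nil {
		return conds
	}
	return upsertCondition(conds, Condition{
		Type:    "StandbyLagging", // assumed value of the real constant
		Status:  "Unknown",
		Reason:  "ErrorGettingLag",
		Message: err.Error(),
	})
}

func main() {
	conds := markLagUnknown(nil, errPrimaryPodNotFound)
	fmt.Printf("%+v\n", conds)
}
```

The trade-off is that callers lose a machine-readable reason per failure mode, but every failure path is guaranteed to update the condition.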

Contributor Author

@egegunes egegunes added this to the v2.9.0 milestone Mar 3, 2026
Copilot AI review requested due to automatic review settings March 3, 2026 09:31
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment on lines +167 to +170
var (
	ErrPrimaryPodNotFound    = errors.New("primary pod not found")
	ErrInvalidLagQueryOutput = errors.New("invalid lag query output")
)

Copilot AI Mar 3, 2026

Problem: ErrPrimaryPodNotFound / ErrInvalidLagQueryOutput are exported but (currently) only used within this file.
Why it matters: Exporting expands the package API surface and makes future refactors harder.
Fix: Make these errors unexported (e.g., errPrimaryPodNotFound, errInvalidLagQueryOutput) unless they are intended to be referenced from other packages.
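
To illustrate the point, here is a small sketch showing that unexported sentinels still work with `errors.Is` inside the package, including through `%w` wrapping. The helpers `findPrimary` and `isPrimaryNotFound` are hypothetical and exist only for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// Unexported sentinels stay out of the package's public API.
var (
	errPrimaryPodNotFound    = errors.New("primary pod not found")
	errInvalidLagQueryOutput = errors.New("invalid lag query output")
)

// findPrimary is a hypothetical lookup that fails when no pod is available.
func findPrimary(pods []string) (string, error) {
	if len(pods) == 0 {
		return "", fmt.Errorf("cluster has no pods: %w", errPrimaryPodNotFound)
	}
	return pods[0], nil
}

// isPrimaryNotFound reports whether err wraps errPrimaryPodNotFound.
func isPrimaryNotFound(err error) bool {
	return errors.Is(err, errPrimaryPodNotFound)
}

func main() {
	_, err := findPrimary(nil)
	fmt.Println(isPrimaryNotFound(err))
	fmt.Println(errors.Is(err, errInvalidLagQueryOutput))
}
```

If another package later needs to classify these errors, exporting a predicate such as `isPrimaryNotFound` is one option that keeps the sentinel values themselves private.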

Comment thread percona/controller/pgcluster/standby.go Outdated
Comment thread percona/controller/pgcluster/standby.go
gkech previously approved these changes Mar 3, 2026
egegunes previously approved these changes Mar 4, 2026
Copilot AI review requested due to automatic review settings March 4, 2026 07:10
@pooknull pooknull dismissed stale reviews from egegunes and gkech via b00d26a March 4, 2026 07:10
@pooknull pooknull requested a review from gkech March 4, 2026 07:11
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


@JNKPercona
Collaborator

| Test Name | Result | Time |
| --- | --- | --- |
| backup-enable-disable | passed | 00:11:15 |
| builtin-extensions | passed | 00:04:51 |
| cert-manager-tls | passed | 00:05:05 |
| custom-envs | passed | 00:21:41 |
| custom-extensions | passed | 00:14:56 |
| custom-tls | passed | 00:06:40 |
| database-init-sql | passed | 00:02:10 |
| demand-backup | passed | 00:26:55 |
| demand-backup-offline-snapshot | passed | 00:17:56 |
| dynamic-configuration | passed | 00:05:04 |
| finalizers | passed | 00:04:41 |
| init-deploy | passed | 00:02:56 |
| huge-pages | passed | 00:02:57 |
| monitoring | passed | 00:07:41 |
| monitoring-pmm3 | passed | 00:08:19 |
| one-pod | passed | 00:05:44 |
| operator-self-healing | passed | 00:10:38 |
| pg-tde | passed | 00:08:46 |
| pitr | passed | 00:12:05 |
| scaling | passed | 00:05:09 |
| scheduled-backup | passed | 00:30:17 |
| self-healing | passed | 00:09:04 |
| sidecars | passed | 00:02:43 |
| standby-pgbackrest | passed | 00:15:23 |
| standby-streaming | passed | 00:09:34 |
| start-from-backup | passed | 00:14:14 |
| tablespaces | passed | 00:08:11 |
| telemetry-transfer | passed | 00:06:17 |
| upgrade-consistency | passed | 00:05:50 |
| upgrade-minor | passed | 00:06:43 |
| users | passed | 00:04:31 |

| Summary | Value |
| --- | --- |
| Tests Run | 31/31 |
| Job Duration | 01:41:49 |
| Total Test Time | 04:58:30 |

commit: b00d26a
image: perconalab/percona-postgresql-operator:PR-1462-b00d26a70

@hors hors merged commit 909fcd1 into main Mar 4, 2026
11 checks passed
@hors hors deleted the K8SPG-374-fix branch March 4, 2026 13:20
7 participants