K8SPG-1012 stanza timing fix #1575

Open
hors wants to merge 16 commits into crd-rename from stanza-timing-fix

@hors
Collaborator

@hors hors commented Apr 28, 2026

CHANGE DESCRIPTION

Problem:

During a dataSource bootstrap restore, postgres promotes from TL1 to TL2
and immediately passes 00000002.history to archive_command. pgBackRest's
async archiver silently drops the push (error 103) when archive.info does
not yet exist, and postgres never retries. Without 00000002.history in
the archive, pg_rewind on replicas fails with "could not find common
ancestor" after any subsequent PITR restore.
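
The race described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `repo` type, `asyncPush`, and `errStanzaMissing` are hypothetical names invented for illustration and are not pgBackRest or operator APIs. It shows why a push that is dropped before the stanza exists stays missing until something pushes it again.

```go
package main

import (
	"errors"
	"fmt"
)

// errStanzaMissing is a hypothetical stand-in for the failure that occurs
// when archive.info does not exist yet.
var errStanzaMissing = errors.New("archive.info missing: stanza not created yet")

// repo is a toy model of an archive repository (assumption, not a real API).
type repo struct {
	stanzaCreated bool
	archived      map[string]bool
}

func newRepo() *repo { return &repo{archived: map[string]bool{}} }

// push stores a file only if the stanza already exists.
func (r *repo) push(name string) error {
	if !r.stanzaCreated {
		return errStanzaMissing
	}
	r.archived[name] = true
	return nil
}

// asyncPush models the behaviour described in this PR: the error is
// swallowed, so the caller never learns that it should retry.
func (r *repo) asyncPush(name string) {
	_ = r.push(name)
}

func main() {
	r := newRepo()

	// Promotion to TL2 happens before the stanza is created.
	r.asyncPush("00000002.history")

	// The stanza is created later; the history file is still missing.
	r.stanzaCreated = true
	fmt.Println("archived after promotion:", r.archived["00000002.history"]) // false

	// A recovery pass that pushes the file again closes the gap.
	_ = r.push("00000002.history")
	fmt.Println("archived after re-push:", r.archived["00000002.history"]) // true
}
```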

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PG version?
  • Does the change support oldest and newest supported Kubernetes version?

hors and others added 15 commits April 29, 2026 00:01
Bumps [github.com/Azure/go-ntlmssp](https://github.com/Azure/go-ntlmssp) from 0.0.0-20221128193559-754e69321358 to 0.1.1.
- [Release notes](https://github.com/Azure/go-ntlmssp/releases)
- [Commits](https://github.com/Azure/go-ntlmssp/commits/v0.1.1)

---
updated-dependencies:
- dependency-name: github.com/Azure/go-ntlmssp
  dependency-version: 0.1.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Bumps [k8s.io/apimachinery](https://github.com/kubernetes/apimachinery) from 0.35.4 to 0.36.0.
- [Commits](kubernetes/apimachinery@v0.35.4...v0.36.0)

---
updated-dependencies:
- dependency-name: k8s.io/apimachinery
  dependency-version: 0.36.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@hors hors changed the title Stanza timing fix K8SPG-1012 stanza timing fix Apr 29, 2026
@hors hors marked this pull request as ready for review April 29, 2026 10:58
Comment on lines +3030 to +3043
// Re-push any timeline history files stranded by the async-archiver race:
// postgres archives 00000002.history during bootstrap promotion before the
// stanza exists; pgBackRest drops it silently (error 103) and postgres
// never retries. Without it pg_rewind fails on replicas after PITR.
log := logging.FromContext(ctx)
historyOut, historyErr := pgbackrest.Executor(exec).ArchivePushHistoryFiles(ctx)
if historyErr != nil {
	r.Recorder.Event(postgresCluster, corev1.EventTypeWarning,
		"ArchivePushHistoryFilesFailed", historyErr.Error())
	log.Error(historyErr, "timeline history file recovery failed",
		"pod", writableInstanceName, "output", historyOut)
} else if historyOut != "" {
	log.Info("timeline history file recovery", "output", historyOut)
}
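
For readers unfamiliar with timeline history files, here is a minimal sketch of how candidates for a re-push could be picked out of a WAL directory listing. It does not show what `ArchivePushHistoryFiles` does in this PR; `filterHistoryFiles` and `historyFileRE` are hypothetical names. The only fact it relies on is that PostgreSQL names these files with eight uppercase hex digits followed by `.history`.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// historyFileRE matches PostgreSQL timeline history file names:
// eight uppercase hex digits followed by ".history".
var historyFileRE = regexp.MustCompile(`^[0-9A-F]{8}\.history$`)

// filterHistoryFiles is a hypothetical helper. It returns the timeline
// history files among names, sorted so that lower timelines come first
// (fixed-width hex sorts the same way lexically and numerically).
func filterHistoryFiles(names []string) []string {
	var out []string
	for _, n := range names {
		if historyFileRE.MatchString(n) {
			out = append(out, n)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	names := []string{
		"000000020000000000000004", // WAL segment, ignored
		"00000003.history",
		"00000002.history",
		"00000002.history.tmp", // illustrative partial file, ignored
	}
	fmt.Println(filterHistoryFiles(names)) // [00000002.history 00000003.history]
}
```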
I understand we need to do this after the stanza is created, but I wonder if we can do it in the caller of this function, reconcilePGBackRest, after line 1642 and if configHashMismatch is false.
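
A rough sketch of the control flow this suggestion implies, assuming the caller learns from the stanza-creation step whether the config hash matched. `reconcileBackups`, `createStanza`, and `repushHistory` are hypothetical stand-ins and do not reproduce the operator's actual functions or signatures.

```go
package main

import "fmt"

// reconcileBackups is a hypothetical stand-in for the caller. It runs the
// stanza-creation step first and re-pushes history files only when that
// step succeeded and reported no config hash mismatch.
func reconcileBackups(
	createStanza func() (configHashMismatch bool, err error),
	repushHistory func() error,
) (bool, error) {
	mismatch, err := createStanza()
	if err != nil {
		return false, fmt.Errorf("stanza create: %w", err)
	}
	if mismatch {
		// Configuration is not settled yet; skip and let a later
		// reconcile attempt the re-push.
		return false, nil
	}
	if err := repushHistory(); err != nil {
		return false, fmt.Errorf("history re-push: %w", err)
	}
	return true, nil
}

func main() {
	calls := 0
	repush := func() error { calls++; return nil }

	_, _ = reconcileBackups(func() (bool, error) { return true, nil }, repush)
	fmt.Println("re-push calls after mismatch:", calls) // 0

	_, _ = reconcileBackups(func() (bool, error) { return false, nil }, repush)
	fmt.Println("re-push calls after match:", calls) // 1
}
```

Keeping the gate in the caller would leave the stanza-creation function focused on one job, at the cost of the caller needing to know that the re-push depends on a settled configuration.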

@JNKPercona
Collaborator

| Test Name | Result | Time |
|---|---|---|
| backup-enable-disable | passed | 00:14:32 |
| builtin-extensions | passed | 00:05:52 |
| cert-manager-tls | passed | 00:06:59 |
| custom-envs | passed | 00:18:44 |
| custom-extensions | passed | 00:14:38 |
| custom-tls | passed | 00:07:33 |
| database-init-sql | passed | 00:02:10 |
| demand-backup | passed | 00:26:58 |
| demand-backup-offline-snapshot | passed | 00:13:25 |
| dynamic-configuration | passed | 00:03:04 |
| finalizers | passed | 00:03:28 |
| init-deploy | passed | 00:02:51 |
| huge-pages | passed | 00:02:37 |
| major-upgrade-13-to-14 | passed | 00:10:15 |
| major-upgrade-14-to-15 | passed | 00:10:50 |
| major-upgrade-15-to-16 | passed | 00:11:10 |
| major-upgrade-16-to-17 | passed | 00:12:06 |
| major-upgrade-17-to-18 | passed | 00:10:38 |
| ldap | passed | 00:03:31 |
| ldap-tls | passed | 00:07:36 |
| monitoring | passed | 00:07:19 |
| monitoring-pmm3 | passed | 00:08:09 |
| one-pod | passed | 00:05:56 |
| operator-self-healing | passed | 00:10:29 |
| pitr | passed | 00:11:47 |
| scaling | passed | 00:05:12 |
| scheduled-backup | passed | 00:31:50 |
| self-healing | passed | 00:08:48 |
| sidecars | passed | 00:02:37 |
| standby-pgbackrest | passed | 00:17:32 |
| standby-streaming | passed | 00:12:59 |
| start-from-backup | passed | 00:13:11 |
| tablespaces | passed | 00:06:47 |
| telemetry-transfer | passed | 00:04:12 |
| upgrade-consistency | passed | 00:05:32 |
| upgrade-minor | failure | 00:15:49 |
| users | passed | 00:04:13 |
| migration-from-crunchy-standby | failure | 00:22:02 |
| migration-from-crunchy-pv | failure | 00:07:42 |
| migration-from-crunchy-backup-restore | failure | 00:07:37 |

| Summary | Value |
|---|---|
| Tests Run | 40/40 |
| Job Duration | 02:51:07 |
| Total Test Time | 06:38:58 |

commit: 783a445
image: perconalab/percona-postgresql-operator:PR-1575-783a44524
