
K8SPG-740 fix error message #1411

Merged
nmarukovich merged 11 commits into main from K8SPG-740
Feb 12, 2026

Conversation

@nmarukovich nmarukovich commented Jan 19, 2026

CHANGE DESCRIPTION

Problem:
The operator logs the following error:
2025-03-04T11:01:27.724Z ERROR failed to cleanup outdated backups

Cause:
The operator attempted to delete outdated backups while the repo host pod was not ready.

Solution:
Check the repo host pod status before attempting backup cleanup, and skip the cleanup if the pod is not ready.
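
For readers following along, the readiness check described above can be sketched in isolation. This is a minimal, self-contained illustration and not the operator's code: `Condition`, `findCondition`, and `repoHostReady` are hypothetical stand-ins for `metav1.Condition`, `meta.FindStatusCondition`, and the check added in backup.go, and the condition type string is an assumed value.

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition (assumption:
// only the two fields needed for this sketch are modelled).
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// conditionRepoHostReady is an assumed value used for illustration only.
const conditionRepoHostReady = "PGBackRestRepoHostReady"

// findCondition returns a pointer to the first condition of the given type,
// or nil when no such condition exists.
func findCondition(conds []Condition, condType string) *Condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// repoHostReady reports whether the repo host condition is present and true.
// A missing condition is treated the same as "not ready".
func repoHostReady(conds []Condition) bool {
	c := findCondition(conds, conditionRepoHostReady)
	return c != nil && c.Status == "True"
}

func main() {
	notReady := []Condition{{Type: conditionRepoHostReady, Status: "False"}}
	ready := []Condition{{Type: conditionRepoHostReady, Status: "True"}}

	fmt.Println(repoHostReady(nil))      // false: condition not reported yet
	fmt.Println(repoHostReady(notReady)) // false: cleanup would be skipped
	fmt.Println(repoHostReady(ready))    // true: cleanup may proceed
}
```

Treating a missing condition as "not ready" means the cleanup is simply retried on a later reconcile instead of failing with an error.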

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PG version?
  • Does the change support oldest and newest supported Kubernetes version?

Comment thread percona/controller/pgcluster/backup.go Outdated
if repoCondition == nil || repoCondition.Status != metav1.ConditionTrue {
log.Info("pgBackRest repo host not ready, skipping backup cleanup")
return nil

Contributor

unnecessary empty line

@nmarukovich nmarukovich requested a review from egegunes January 20, 2026 11:38
egegunes previously approved these changes Jan 21, 2026
Comment thread percona/controller/pgcluster/backup.go Outdated
return errors.Wrap(err, "reconcile backup jobs")
}

repoCondition := meta.FindStatusCondition(cr.Status.Conditions, postgrescluster.ConditionRepoHostReady)
Contributor

I think we should move this check to cleanupOutdatedBackups method

@github-actions github-actions Bot left a comment

Remaining comments which cannot be posted as a review comment to avoid GitHub Rate Limit

shfmt

[shfmt] reported by reviewdog 🐶

[ $? -eq 0 ] || return 1


[shfmt] reported by reviewdog 🐶

echo "MachineConfig created"
echo "Waiting for worker pool to update (~10 minutes)..."


[shfmt] reported by reviewdog 🐶

kubectl wait --for=condition=Updated mcp/worker --timeout=900s 2>/dev/null || {
echo "Update taking longer than expected"
return 1
}


[shfmt] reported by reviewdog 🐶

echo "Worker pool updated"


[shfmt] reported by reviewdog 🐶

sleep 10
verify_hugepages_on_nodes


[shfmt] reported by reviewdog 🐶

echo "Verifying hugepages on nodes"


[shfmt] reported by reviewdog 🐶

# Get first worker node, fallback to first non-master, fallback to any node
local node_name=$(
kubectl get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || \
kubectl get nodes -l '!node-role.kubernetes.io/master,!node-role.kubernetes.io/control-plane' -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || \
kubectl get nodes -o jsonpath='{.items[0].metadata.name}'
)


[shfmt] reported by reviewdog 🐶

if [ -z "${node_name}" ]; then
echo "No nodes found"
return 1
fi


[shfmt] reported by reviewdog 🐶

echo "Checking node: ${node_name}"


[shfmt] reported by reviewdog 🐶

local hugepages_capacity=$(kubectl get node ${node_name} \
-o jsonpath='{.status.capacity.hugepages-2Mi}')


[shfmt] reported by reviewdog 🐶

if [ -n "${hugepages_capacity}" ] && [ "${hugepages_capacity}" != "0" ]; then
echo "Node has hugepages capacity: ${hugepages_capacity}"
return 0
else
echo "No hugepages capacity found on node ${node_name}"
return 1
fi


[shfmt] reported by reviewdog 🐶

local pod_name=$1
local namespace=$2
local container=${3:-postgres}


[shfmt] reported by reviewdog 🐶

echo "Verifying hugepages in pod ${pod_name}"


[shfmt] reported by reviewdog 🐶

# Check /proc/meminfo
local hugepages_total=$(kubectl exec ${pod_name} -n ${namespace} -c ${container} -- \
grep HugePages_Total /proc/meminfo | awk '{print $2}')


[shfmt] reported by reviewdog 🐶

local hugepages_free=$(kubectl exec ${pod_name} -n ${namespace} -c ${container} -- \
grep HugePages_Free /proc/meminfo | awk '{print $2}')


[shfmt] reported by reviewdog 🐶

echo "HugePages_Total: ${hugepages_total}"
echo "HugePages_Free: ${hugepages_free}"


[shfmt] reported by reviewdog 🐶

if [ "${hugepages_total}" -gt 0 ]; then
echo "Hugepages are available in pod"
return 0
else
echo "No hugepages in pod"
return 1
fi


[shfmt] reported by reviewdog 🐶

local cluster_name=$1
local expected_value=${2:-try}


[shfmt] reported by reviewdog 🐶

echo "Verifying PostgreSQL huge_pages setting..."


[shfmt] reported by reviewdog 🐶

local huge_pages=$(run_psql_local \
"SHOW huge_pages;" \
"postgres:$(get_psql_user_pass ${cluster_name}-pguser-postgres)@$(get_psql_user_host ${cluster_name}-pguser-postgres)")


[shfmt] reported by reviewdog 🐶

echo "huge_pages: ${huge_pages}"


[shfmt] reported by reviewdog 🐶

if [[ "${huge_pages}" == *"${expected_value}"* ]]; then
echo "PostgreSQL huge_pages is set to '${expected_value}'"
return 0
else
echo "PostgreSQL huge_pages not set to '${expected_value}' (value: ${huge_pages})"
return 1
fi


[shfmt] reported by reviewdog 🐶

local pod_name=$1
local namespace=$2
local container=${3:-database}


[shfmt] reported by reviewdog 🐶

echo "Checking hugepages usage..."


[shfmt] reported by reviewdog 🐶

kubectl -n ${namespace} exec ${pod_name} -c ${container} -- \
grep HugePages /proc/meminfo


[shfmt] reported by reviewdog 🐶

local hugepages_total=$(kubectl -n ${namespace} exec ${pod_name} -c ${container} -- \
grep HugePages_Total /proc/meminfo | awk '{print $2}')


[shfmt] reported by reviewdog 🐶

local hugepages_free=$(kubectl -n ${namespace} exec ${pod_name} -c ${container} -- \
grep HugePages_Free /proc/meminfo | awk '{print $2}')


[shfmt] reported by reviewdog 🐶

local hugepages_used=$((hugepages_total - hugepages_free))


[shfmt] reported by reviewdog 🐶

echo ""
echo "HugePages usage:"
echo " Total: ${hugepages_total}"
echo " Used: ${hugepages_used}"


[shfmt] reported by reviewdog 🐶

if [ "${hugepages_used}" -gt 0 ]; then
echo "PostgreSQL is using hugepages"
return 0
else
echo "Hugepages available but NOT being used by PostgreSQL"
return 1
fi
}

@nmarukovich nmarukovich force-pushed the K8SPG-740 branch 2 times, most recently from 5441cbd to e337bf8 February 2, 2026 13:13
Comment thread percona/controller/pgcluster/backup.go Outdated
Comment on lines +52 to +56
repoCondition := meta.FindStatusCondition(cr.Status.Conditions, postgrescluster.ConditionRepoHostReady)
if repoCondition == nil || repoCondition.Status != metav1.ConditionTrue {
log.Info("pgBackRest repo host not ready, skipping backup cleanup")
return nil
}
Contributor

since this function is not using repo host, i am confused how this fixes the issue

Contributor Author

Here is the original error with its full stack trace:

ERROR   failed to cleanup outdated backups      {"controller": "perconapgcluster", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGCluster", "PerconaPGCluster": {"name":"cluster1","namespace":"pg2502"}, "namespace": "pg2502", "name": "cluster1", "reconcileID": "bf3f58e4-3d58-4112-b6ec-6af38241dcb7", "error": "get pgBackRest info: pgBackRest info command failed with code 99: other", "errorVerbose": "pgBackRest info command failed with code 99:

We call

info, err = pgbackrest.GetInfo(ctx, readyPod, repo.Name)

which executes pgbackrest info --repo=repo1

If the repo host is not ready, this command fails with the error above.

Contributor

does this assume that repo1 is a PVC and stored in repo host? what happens if repo1 is s3?

Contributor Author

Yes, we assume that repo1 is a PVC.
We can rewrite the check so that cleanup is skipped only if repo1 is a PVC and the repo host is not ready; let me know if that works for you.
If repo1 is S3, you're right: with the current check we wait until the repo host is ready (even though we don't need to) and delete the backups on the next iteration.

Contributor Author

I've verified that RepoHost is only used for pvc. When using S3/Azure repos, pgBackRest connects directly to cloud storage.
Therefore, we should only wait for RepoHost to be ready when the repository type is volume.
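
A rough sketch of that rule, for illustration only: `Repo`, `needsRepoHost`, and `skipCleanup` are hypothetical names, and the boolean `Volume` field is an assumed simplification of however the real repository spec distinguishes PVC-backed repos from S3/Azure ones.

```go
package main

import "fmt"

// Repo is a simplified, hypothetical stand-in for a pgBackRest repository
// spec. Volume is true when the repo is PVC-backed and therefore served
// through the repo host; cloud repos (S3, Azure) leave it false.
type Repo struct {
	Name   string
	Volume bool
}

// needsRepoHost reports whether reading this repo depends on the repo host.
func needsRepoHost(r Repo) bool {
	return r.Volume
}

// skipCleanup reports whether backup cleanup should be postponed: only when
// the repo depends on the repo host and the repo host is not ready.
func skipCleanup(r Repo, repoHostReady bool) bool {
	return needsRepoHost(r) && !repoHostReady
}

func main() {
	repos := []Repo{
		{Name: "repo1", Volume: true}, // PVC-backed
		{Name: "repo2"},               // cloud storage
	}
	for _, r := range repos {
		for _, ready := range []bool{false, true} {
			fmt.Printf("%s repoHostReady=%t skip=%t\n", r.Name, ready, skipCleanup(r, ready))
		}
	}
}
```

With this shape, cleanup for cloud repos never waits on the repo host, while cleanup for volume repos is retried on a later reconcile once the repo host is ready.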

@nmarukovich nmarukovich requested a review from egegunes February 10, 2026 15:55
@egegunes egegunes added this to the v2.9.0 milestone Feb 12, 2026
@JNKPercona

Test Name Result Time
backup-enable-disable passed 00:07:02
builtin-extensions passed 00:05:02
custom-envs passed 00:17:57
custom-extensions passed 00:13:34
custom-tls passed 00:04:58
database-init-sql passed 00:02:22
demand-backup passed 00:22:04
finalizers passed 00:03:54
init-deploy passed 00:03:07
huge-pages passed 00:02:56
monitoring passed 00:07:10
monitoring-pmm3 passed 00:08:06
one-pod passed 00:05:43
operator-self-healing passed 00:08:00
pitr passed 00:11:44
scaling passed 00:04:43
scheduled-backup passed 00:24:15
self-healing passed 00:08:18
sidecars passed 00:02:39
standby-pgbackrest passed 00:11:40
standby-streaming passed 00:09:12
start-from-backup passed 00:10:43
tablespaces passed 00:06:55
telemetry-transfer passed 00:03:33
upgrade-consistency passed 00:05:40
upgrade-minor passed 00:05:25
users passed 00:04:50

Summary Value
Tests Run 27/27
Job Duration 01:17:58
Total Test Time 03:41:46

commit: cab2526
image: perconalab/percona-postgresql-operator:PR-1411-cab252661

@nmarukovich nmarukovich merged commit 6cdcf3d into main Feb 12, 2026
16 checks passed
@nmarukovich nmarukovich deleted the K8SPG-740 branch February 12, 2026 18:03