Manual TLS fails to update certificates on SAN changes and causes cluster stuck on secret regeneration #2278

@myJamong

Report

When cert-manager is not installed, the operator's manual TLS path does not handle splitHorizon additions or secret regeneration safely, resulting in either stale certificates or cluster unavailability during rolling restarts.

More about the problem

  1. When splitHorizons are added to an existing cluster without cert-manager, the operator does not detect the SAN change because reconcileSSL skips processing entirely if both TLS secrets already exist (ssl.go:89-90). The new horizon DNS names are never added to the certificate SANs.
  2. If a user manually deletes the TLS secrets to force regeneration, createSSLManually generates a completely new CA each time via tls.Issue(). During the SmartUpdate rolling restart, the first restarted pod gets a certificate signed by the new CA while the remaining pods still trust only the old CA. The pods cannot verify each other's certificates, so the restarted pod fails readiness and the cluster remains stuck in the SmartUpdate state indefinitely.
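
To illustrate the first point, here is a minimal sketch of a SAN coverage check that could run even when both secrets already exist. It is not the operator's code: `missingSANs` and `selfSignedPEM` are hypothetical helpers, the host names are made up, and only the Go standard library is used.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// missingSANs (hypothetical helper) returns the names in required that the
// certificate in certPEM does not cover. VerifyHostname honours wildcard SANs.
func missingSANs(certPEM []byte, required []string) ([]string, error) {
	block, _ := pem.Decode(certPEM)
	if block == nil {
		return nil, errors.New("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return nil, err
	}
	var missing []string
	for _, name := range required {
		if cert.VerifyHostname(name) != nil {
			missing = append(missing, name)
		}
	}
	return missing, nil
}

// selfSignedPEM builds a throwaway self-signed certificate for the demo only.
func selfSignedPEM(dnsNames []string) ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "demo"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     dnsNames,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

func main() {
	// Certificate issued before the horizon was added (made-up names).
	certPEM, err := selfSignedPEM([]string{"my-cluster-rs0-0.ns.svc"})
	if err != nil {
		panic(err)
	}
	missing, err := missingSANs(certPEM, []string{"my-cluster-rs0-0.ns.svc", "rs0-0.example.com"})
	if err != nil {
		panic(err)
	}
	fmt.Println("missing SANs:", missing)
}
```

A non-empty result would be the signal to re-issue the certificate instead of skipping reconciliation.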

Steps to reproduce

  1. Deploy a PSMDB cluster with TLS enabled and without cert-manager installed.
  2. Wait for the cluster to be running with 3 members (all healthy).
  3. Add splitHorizons configuration to the CR (e.g., external DNS names for each pod).
  4. Observe that the TLS certificate SANs are not updated — the new horizon DNS names are missing from the certificate.
  5. Manually delete the TLS secrets ({name}-ssl and {name}-ssl-internal) to force regeneration.
  6. The operator regenerates the secrets with a new CA, triggering a SmartUpdate rolling restart.
  7. The first secondary pod restarts with the new CA/certificate.
  8. The restarted pod cannot rejoin the replica set because the remaining members still have the old CA and fail TLS verification.
  9. The pod never becomes Ready, and SmartUpdate waits indefinitely — 1 member remains in a non-healthy state and the cluster is stuck.
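
The failure in steps 6 to 8 can be reproduced outside the cluster. The sketch below, which uses only the Go standard library and hypothetical names (`authority`, `issue`, `trusts`), generates two independent CAs and shows that a leaf signed by the new one does not verify against a trust pool that contains only the old one. It assumes ECDSA keys and made-up host names; the operator's actual key type and subjects may differ.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// authority bundles a CA certificate with its private key (hypothetical type).
type authority struct {
	cert *x509.Certificate
	key  *ecdsa.PrivateKey
}

// must panics on error; it keeps the demo short.
func must[T any](v T, err error) T {
	if err != nil {
		panic(err)
	}
	return v
}

// sign creates a certificate from tmpl signed by parent and parses it back.
func sign(tmpl, parent *x509.Certificate, pub *ecdsa.PublicKey, parentKey *ecdsa.PrivateKey) (*x509.Certificate, error) {
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, pub, parentKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// newAuthority generates a fresh self-signed CA with a brand-new key.
func newAuthority() (*authority, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(time.Now().UnixNano()),
		Subject:               pkix.Name{CommonName: "Root CA"},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	cert, err := sign(tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return &authority{cert: cert, key: key}, nil
}

// issue signs a leaf certificate for dnsName with this CA.
func (a *authority) issue(dnsName string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: dnsName},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
		DNSNames:     []string{dnsName},
	}
	return sign(tmpl, a.cert, &key.PublicKey, a.key)
}

// trusts reports whether leaf chains to the single trusted CA certificate.
func trusts(trusted, leaf *x509.Certificate) bool {
	pool := x509.NewCertPool()
	pool.AddCert(trusted)
	_, err := leaf.Verify(x509.VerifyOptions{Roots: pool})
	return err == nil
}

func main() {
	oldCA := must(newAuthority()) // CA still mounted in the pods that have not restarted
	newCA := must(newAuthority()) // CA produced by the regeneration
	leaf := must(newCA.issue("rs0-0.example.com"))
	fmt.Println("restarted pod trusted by old members:", trusts(oldCA.cert, leaf))
	fmt.Println("restarted pod trusted under new CA:", trusts(newCA.cert, leaf))
}
```

Both CAs share the same subject name here, so the mismatch comes purely from the different signing keys.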

Versions

  1. Kubernetes 1.34
  2. Operator 1.22.0
  3. Database 8.0.19-7

Anything else?

I've submitted a PR that attempts to fix this issue: #2277

The approach introduces a persistent CA secret ({name}-ca-cert) that stores the CA private key, so the operator can re-sign TLS certificates with the same CA when SANs change — eliminating the need for CA merging during rolling restarts.
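
As a rough illustration of the idea (not the code from the PR), the sketch below creates a CA once, keeps it as PEM the way it might be stored in a secret, and re-signs a leaf with extended SANs using that same CA. The function names, the ECDSA key type, and the host names are assumptions for the example; only the Go standard library is used.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// generateCAPEM creates a CA once and returns it PEM-encoded, ready to persist.
func generateCAPEM() (certPEM, keyPEM []byte, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(time.Now().UnixNano()),
		Subject:               pkix.Name{CommonName: "Root CA"},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return nil, nil, err
	}
	certPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return certPEM, keyPEM, nil
}

// signWithCA loads the persisted CA from PEM and issues a leaf with the given SANs.
func signWithCA(caCertPEM, caKeyPEM []byte, sans []string) (*x509.Certificate, error) {
	cb, _ := pem.Decode(caCertPEM)
	kb, _ := pem.Decode(caKeyPEM)
	if cb == nil || kb == nil {
		return nil, errors.New("invalid CA PEM input")
	}
	caCert, err := x509.ParseCertificate(cb.Bytes)
	if err != nil {
		return nil, err
	}
	caKey, err := x509.ParseECPrivateKey(kb.Bytes)
	if err != nil {
		return nil, err
	}
	leafKey, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: "psmdb"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
		DNSNames:     sans,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, caCert, &leafKey.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// verifyHost checks that leaf chains to the CA and is valid for host.
func verifyHost(caCertPEM []byte, leaf *x509.Certificate, host string) error {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caCertPEM) {
		return errors.New("no CA certificate in PEM")
	}
	_, err := leaf.Verify(x509.VerifyOptions{Roots: pool, DNSName: host})
	return err
}

func main() {
	caCert, caKey, err := generateCAPEM() // done once, then stored
	if err != nil {
		panic(err)
	}
	// SANs changed: re-sign with the same CA instead of creating a new one.
	leaf, err := signWithCA(caCert, caKey, []string{"rs0-0.ns.svc", "rs0-0.example.com"})
	if err != nil {
		panic(err)
	}
	fmt.Println("new horizon verifies:", verifyHost(caCert, leaf, "rs0-0.example.com") == nil)
}
```

Because the trust anchor never changes, pods holding the old leaf and pods holding the re-signed leaf can keep verifying each other throughout the rolling restart.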

I'm not entirely sure if this is the best approach to solve the problem, so please feel free to discard it if there's a better solution in mind. Any feedback would be greatly appreciated.
