Conversation
There was a problem hiding this comment.
Pull request overview
This pull request adds pg_tde (Transparent Data Encryption) support for PostgreSQL clusters, integrating with HashiCorp Vault for key management. The implementation follows the existing extension pattern in the codebase and includes API changes, reconciliation logic, and CRD updates.
Changes:
- Added pg_tde extension support with Vault integration for managing encryption keys
- Introduced new API types (PGTDESpec, PGTDEVaultSpec, PGTDESecretObjectReference) for configuring pg_tde
- Refactored extension configuration to use individual extension structs instead of the deprecated builtin structure
- Added reconciliation logic for configuring pg_tde providers and managing encryption keys
Reviewed changes
Copilot reviewed 20 out of 22 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/apis/postgres-operator.crunchydata.com/v1beta1/postgrescluster_types.go | Added PGTDESpec, PGTDEVaultSpec, and PGTDESecretObjectReference types for pg_tde configuration |
| pkg/apis/postgres-operator.crunchydata.com/v1beta1/zz_generated.deepcopy.go | Auto-generated DeepCopy methods for new pg_tde types |
| pkg/apis/pgv2.percona.com/v2/perconapgcluster_types.go | Added BuiltInExtensionSpec and pg_tde support; refactored extension defaults handling |
| pkg/apis/pgv2.percona.com/v2/zz_generated.deepcopy.go | Auto-generated DeepCopy methods for BuiltInExtensionSpec |
| internal/pgtde/postgres.go | New package implementing pg_tde enable/disable and Vault provider configuration |
| internal/postgres/reconcile.go | Added pg_tde volume and volume mount configuration for PostgreSQL instance pods |
| internal/controller/postgrescluster/postgres.go | Added reconcilePGTDEProviders function to configure pg_tde Vault providers |
| internal/controller/postgrescluster/controller.go | Integrated pg_tde parameters and provider reconciliation into main reconciliation flow |
| internal/controller/postgrescluster/instance.go | Added TDEInstalledAnnotation to instance pods when pg_tde is enabled |
| internal/naming/names.go | Added constants for pg_tde volume names, mount paths, and provider identifiers |
| internal/naming/annotations.go | Added TDEInstalledAnnotation constant |
| internal/pgvector/postgres.go | Fixed comments to correctly reference pgvector instead of pgAudit |
| percona/controller/pgbackup/controller.go | Added pg.Default() call to ensure extension defaults before backup |
| deploy/.yaml and config/crd/bases/.yaml | Updated CRD definitions with pg_tde specifications |
| deploy/cr.yaml | Added example pg_tde configuration with Vault settings |
| pkg/apis/pgv2.percona.com/v2/perconapgbackup_types_test.go | Replaced custom ptr() function with k8s.io/utils/ptr.To |
Comments suppressed due to low confidence (1)
pkg/apis/pgv2.percona.com/v2/perconapgcluster_types.go:436
- Inconsistent usage of extension enabled flags. Line 432 uses cr.Spec.Extensions.PGStatMonitor.Enabled (the new field structure), while lines 433-436 still use cr.Spec.Extensions.BuiltIn.* (the deprecated field structure). All lines should be updated to use the new field structure consistently for maintainability.
postgresCluster.Spec.Extensions.PGStatMonitor = *cr.Spec.Extensions.PGStatMonitor.Enabled
postgresCluster.Spec.Extensions.PGStatStatements = *cr.Spec.Extensions.BuiltIn.PGStatStatements
postgresCluster.Spec.Extensions.PGAudit = *cr.Spec.Extensions.BuiltIn.PGAudit
postgresCluster.Spec.Extensions.PGVector = *cr.Spec.Extensions.BuiltIn.PGVector
postgresCluster.Spec.Extensions.PGRepack = *cr.Spec.Extensions.BuiltIn.PGRepack
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 43 out of 45 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
pkg/apis/pgv2.percona.com/v2/perconapgcluster_types.go:451
- Inconsistent use of new vs deprecated extension fields when converting to Crunchy format. Line 447 uses the new field
cr.Spec.Extensions.PGStatMonitor.Enabled, but lines 448-451 use the deprecatedcr.Spec.Extensions.BuiltIn.*fields. While SetExtensionDefaults ensures both are populated for backwards compatibility, this inconsistency is error-prone and makes the code harder to maintain. Consider using the new field structure consistently (e.g.,*cr.Spec.Extensions.PGStatStatements.Enabledinstead of*cr.Spec.Extensions.BuiltIn.PGStatStatements).
|
This looks really interesting, definitely would help to reduce doing some workarounds. As KMIP is the standard protocol which is also supported by Vault and most cloud providers integrate it into their secret manager, it should/must be supported besides Vault. |
@matserix we decided to just start with Vault. KMIP support will be added in future releases. |
This commit adds native pg_tde extension support into operator.
**This commit only adds Vault KMS support for pg_tde. KMIP support will
be added in future releases.**
When pg_tde is enabled and Vault configuration is provided, the operator:
- appends pg_tde into shared_preload_libraries,
- mounts Vault token and CA secrets into database containers,
- runs CREATE EXTENSION in all databases,
- creates Vault provider by running pg_tde_add_global_key_provider_vault_v2,
- create a global key by running pg_tde_create_key_using_global_key_provider,
- sets the default key by running pg_tde_set_default_key_using_global_key_provider.
-> Example configuration
pg_tde:
enabled: true
vault:
host: https://vault-service.vault-service.svc:8200
mountPath: tde
tokenSecret:
name: vault-secret
key: token
caSecret:
name: vault-secret
key: ca.crt
Note that:
- Mount path needs to be a KV v2 storage engine.
- caSecret is optional and can be omitted if you want to use http. But
in my testing I couldn't manage the make vault work without TLS. It
responds with HTTP 405 if I disable TLS in vault.
- tokenSecret and caSecret can be the same secret or different. Operator
doesn't assume anything about the contents of the secrets since you'll
need to set secret keys in cr.yaml yourself.
- Using a non-root token requires more configuration. Check out pg_tde
docs for that. But don't forget to add these in the Vault policy:
```
path "sys/internal/ui/mounts/*" {
capabilities = ["read"]
}
path "sys/mounts/*" {
capabilities = ["read"]
}
```
-> API changes
pg_tde requires more configuration options than other extensions
operator supports. This required us make some changes in the extensions
API. With these changes, 'spec.extensions.builtin' section is deprecated
and all builtin extensions are moved to 'spec.extensions.<extension>'
(i.e. 'spec.extensions.pg_stat_monitor'). Right now extensions can be
enabled/disabled with the old and the new method. If two methods are
used at the same time, 'spec.extensions.builtin' takes precedence.
-> Status changes
A hash will be calculated using pg_tde configuration provided by user.
Operator uses this hash to understand if config is changed and it should
reconfigure pg_tde. The hash can be found in status.pgTDERevision field
of **PostgresCluster** object. This hash will be removed when pg_tde is
disabled.
Operator also communicates the status of pg_tde with conditions. The
condition with type=PGTDEEnabled can be found in both PerconaPGCluster
and PostgresCluster statuses.
-> Disabling pg_tde
Disabling pg_tde is more complex than other extensions:
- First of all any encrypted objects must be dropped before disabling.
Otherwise DROP EXTENSION will fail with a descriptive error message.
**Operator won't drop anything, user needs to do this manually.**
- The extension needs to be disabled in two steps:
1. First set pg_tde.enabled=false without removing the vault section.
Operator will drop the extension and restart the pods.
2. Then you can remove pg_tde.vault. Database pods will be restarted
again to remove secret mounts from containers.
- It's recommended to run CHECKPOINT before removing pg_tde.vault. Even
though extension is dropped, Postgres might still try to use encrypted
objects during recovery after restart and it might try to access token
secret. CHECKPOINT helps you prevent this failure case.
-> Deleting and recreating clusters
If cluster with pg_tde enabled is deleted but PVCs are retained, on
recreation you'll see some errors about pg_tde in operator logs. They
happen because the vault provider and/or global key already exists.
Operator will handle these errors gracefully and configure pg_tde. Same
thing applies when pg_tde is disabled and re-enabled. Since both vault
provider and global key already exists, operator will handle "already
exists" errors and configure pg_tde.
The global key name is determined by cluster's .metadata.uid. For
example 'global-master-key-ad19534a-d778-460e-ac87-ca38ef5e6755'. This
means the key will be changed if cluster is deleted and recreated. As
long as the old key and the new key is accessible to pg_tde, this won't
cause any issues. pg_tde will handle it as it handles key rotation.
-> Validations
- You can't set pg_tde.enabled=true without setting pg_tde.vault.
- If you already had pg_tde.enabled, you can't remove pg_tde section
completely.
- If you already had pg_tde.enabled, you can't remove pg_tde.vault
section completely.
eleo007
left a comment
There was a problem hiding this comment.
Use new extensions structure in e2e-tests/tests/builtin-extensions/03-install-all-ext.yaml
| } | ||
| } | ||
|
|
||
| // K8SPG-911 |
There was a problem hiding this comment.
If possible, let's extract all this pgtde logic from the reconcilePostgresDatabases function. We are essentially doing 2 things here that are scattered around the function.
- run the enable/disable flow
- some lines later we set the status for this feature through
SetStatusCondition
We could have a function like func (r *Reconciler) reconcilePGTDE( ctx context.Context, cluster *v1beta1.PostgresCluster, exec postgres.Executor, ) bool
that handles all this pgtde logic in a single place, calling it through on this like
// K8SPG-911
pgTdeOK = r.reconcilePGTDE(ctx, cluster, exec)
There was a problem hiding this comment.
enabling extension needs to be done in reconcilePostgresDatabases because it's affected by DatabaseRevision. reconcilePGTDEProviders only configures pg_tde. I think it should stay like this.
| instances *observedInstances, | ||
| patchStatus func() error, | ||
| ) error { | ||
| var podExecutor postgres.Executor |
There was a problem hiding this comment.
Not sure why we need to define this var here. Couldn't we just podExecutor := func( ctx context.Context, stdin io.Reader, stdout, stderr io.Writer, command ...string, ) error { return r.PodExec(ctx, pod.Namespace, pod.Name, container, stdin, stdout, stderr, command...) } some lines later?
| log := logging.FromContext(ctx).WithName("PGTDE") | ||
|
|
||
| ctx = logging.NewContext(ctx, logging.FromContext(ctx).WithValues("pod", pod.Name)) |
There was a problem hiding this comment.
Maybe something more clear since fields are split across log and ctx
log := logging.FromContext(ctx).WithName("PGTDE").WithValues("pod", pod.Name)
ctx = logging.NewContext(ctx, log)
?
| log := log.WithValues("revision", revision) | ||
| log.Info("configuring pg_tde", "pod", pod.Name) | ||
| ctx := logging.NewContext(ctx, log) | ||
| if cluster.Status.PGTDERevision == "" { | ||
| err = pgtde.AddVaultProvider(ctx, podExecutor, vault) | ||
| if err == nil || errors.Is(err, pgtde.ErrAlreadyExists) { | ||
| err = pgtde.CreateGlobalKey(ctx, podExecutor, cluster.UID) | ||
| } | ||
| if err == nil || errors.Is(err, pgtde.ErrAlreadyExists) { | ||
| err = pgtde.SetDefaultKey(ctx, podExecutor, cluster.UID) | ||
| } | ||
| } else { | ||
| err = pgtde.ChangeVaultProvider(ctx, podExecutor, vault) | ||
| } | ||
| err = errors.WithStack(err) |
There was a problem hiding this comment.
Recommend we extract this to a function, we set the err in 4 different places and then we overwrite it with errors.WithStack(err) to handle it afterwards with the next if clause. It will be easier to reason.
| return expected | ||
| } | ||
|
|
||
| ctx := context.Background() |
There was a problem hiding this comment.
we can use the ctx of the test t.Context()
| } | ||
|
|
||
| func TestCreateGlobalKey(t *testing.T) { | ||
| t.Run("success", func(t *testing.T) { |
There was a problem hiding this comment.
TestCreateGlobalKey and TestSetDefaultKey "success" subtests are misnamed most probably since both only verify the whoops case. Is that correct?
There was a problem hiding this comment.
should we also test the nil error case in general?
There was a problem hiding this comment.
I don't know why but this pattern was used in upstream and we use the same in every unit tests for extensions.
I don't think testing the nil case will make any difference. Do you have a case in mind?
| if cluster.Spec.Extensions.PGTDE.Enabled || pgTDEEnabled { | ||
| pgtde.PostgreSQLParameters(&pgParameters) | ||
| } | ||
|
|
There was a problem hiding this comment.
What if the cluster is recreated with existing PVCs where the pg_tde extension had previously been enabled? As I understand it, the new cluster would not have the status condition set, causing this check to be false. As a result, pg_tde would be removed from shared libraries without the extension being dropped first.
There was a problem hiding this comment.
this condition checks two things: either cluster.Spec.Extensions.PGTDE.Enabled is true or condition is true.
if cluster is deleted and recreated with pg_tde enabled, it'll be added into shared libraries.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 50 out of 52 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (2)
pkg/apis/pgv2.percona.com/v2/perconapgcluster_types.go:454
- Inconsistent use of extension APIs when converting PerconaPGCluster to PostgresCluster. Line 450 correctly uses the new API (cr.Spec.Extensions.PGStatMonitor.Enabled), but lines 451-454 still use the deprecated BuiltIn API (cr.Spec.Extensions.BuiltIn.*). This should be updated to use the new API consistently: cr.Spec.Extensions.PGStatStatements.Enabled, cr.Spec.Extensions.PGAudit.Enabled, cr.Spec.Extensions.PGVector.Enabled, and cr.Spec.Extensions.PGRepack.Enabled.
pkg/apis/pgv2.percona.com/v2/perconapgcluster_types.go:454 - Potential nil pointer dereference. The ToCrunchy method at lines 451-454 dereferences pointers (e.g., *cr.Spec.Extensions.BuiltIn.PGStatStatements) without checking if they are nil first. While SetExtensionDefaults ensures these values are set when called during Default(), if ToCrunchy is called on a PerconaPGCluster object that hasn't gone through the Default() webhook, this will panic. Add nil checks or ensure Default() is always called before ToCrunchy, or use ptr.Deref with a default value.
| if cluster.Status.PGTDERevision == "" { | ||
| err := addVaultProvider(ctx, exec, vault) | ||
|
|
||
| if err == nil || errors.Is(err, errAlreadyExists) { | ||
| err = createGlobalKey(ctx, exec, cluster.UID) | ||
| } | ||
|
|
||
| if err == nil || errors.Is(err, errAlreadyExists) { | ||
| err = setDefaultKey(ctx, exec, cluster.UID) | ||
| } | ||
|
|
||
| return err |
There was a problem hiding this comment.
The error handling in ReconcileVaultProvider has a subtle bug. If addVaultProvider succeeds (err == nil), the code will call createGlobalKey. If createGlobalKey returns errAlreadyExists, the condition on line 248 (err == nil || errors.Is(err, errAlreadyExists)) will be true, so setDefaultKey will be called. However, err is still errAlreadyExists at this point. When setDefaultKey succeeds (returns nil), err will be set to nil on line 249. But if setDefaultKey fails with a different error, that error will be returned. The issue is that if createGlobalKey returns errAlreadyExists and setDefaultKey succeeds, the function will return nil (success) even though the global key wasn't actually created in this reconciliation. This might be intentional for idempotency, but it should be documented or the logic should be clearer.
There was a problem hiding this comment.
i don't get it. if createGlobalKey returns errAlreadyExists, it already exists.
commit: fc2a47f |
CHANGE DESCRIPTION
This commit adds native pg_tde extension support into operator.
This commit only adds Vault KMS support for pg_tde. KMIP support will be added in future releases.
When pg_tde is enabled and Vault configuration is provided, the operator:
Example configuration
Note that:
API changes
pg_tde requires more configuration options than other extensions operator supports. This required us make some changes in the extensions API. With these changes, 'spec.extensions.builtin' section is deprecated and all builtin extensions are moved to 'spec.extensions.' (i.e. 'spec.extensions.pg_stat_monitor'). Right now extensions can be enabled/disabled with the old and the new method. If two methods are used at the same time, 'spec.extensions.builtin' takes precedence.
Status changes
A hash will be calculated using pg_tde configuration provided by user. Operator uses this hash to understand if config is changed and it should reconfigure pg_tde. The hash can be found in status.pgTDERevision field of PostgresCluster object. This hash will be removed when pg_tde is disabled.
Operator also communicates the status of pg_tde with conditions. The condition with type=PGTDEEnabled can be found in both PerconaPGCluster and PostgresCluster statuses.
Disabling pg_tde
Disabling pg_tde is more complex than other extensions:
Deleting and recreating clusters
If cluster with pg_tde enabled is deleted but PVCs are retained, on recreation you'll see some errors about pg_tde in operator logs. They happen because the vault provider and/or global key already exists. Operator will handle these errors gracefully and configure pg_tde. Same thing applies when pg_tde is disabled and re-enabled. Since both vault provider and global key already exists, operator will handle "already exists" errors and configure pg_tde.
The global key name is determined by cluster's .metadata.uid. For example 'global-master-key-ad19534a-d778-460e-ac87-ca38ef5e6755'. This means the key will be changed if cluster is deleted and recreated. As long as the old key and the new key is accessible to pg_tde, this won't cause any issues. pg_tde will handle it as it handles key rotation.
Validations
Multi platform testing