Skip to content

PG-2234 Fix pg_rewind with encrypted tables#559

Open
dAdAbird wants to merge 5 commits intopercona:mainfrom
dAdAbird:fix_pg_rewind
Open

PG-2234 Fix pg_rewind with encrypted tables#559
dAdAbird wants to merge 5 commits intopercona:mainfrom
dAdAbird:fix_pg_rewind

Conversation

@dAdAbird
Copy link
Copy Markdown
Member

@dAdAbird dAdAbird commented Mar 31, 2026

The rewind might update blocks on target's file with that of the source.
In that case, encrypted files will end up with different blocks
encrypted with different internal keys, leading to data corruption.
To prevent this, we have to decryp blocks coming from the source and
re-encrypt them with the target's internal key.

Along with that, providers might have been changed, primary keys
rotated, etc., after the divergence of clusters. Generally, it's not a
problem because we would replace the keys and providers on the target
with those of the source. But it will render internal keys that we have
used for the partial blocks unreadable. To fix this, we have to
re-encrypt such internal keys with source's primary keys.

So the sequence is next:

  1. Copy source's pg_tde dir into a tmp location. All following reads
    and writes of source keys and providers happen in this tmp dir.
  2. When we have partial updates, check for the source key. And
    re-encrypt blocks if the key exists.
  3. Save the target key into the source _keys file replacning the
    existing one.
  4. Replace target's pg_tde dir with the tmp one.

See commit messages for additional info.

There is no need to do anything with WAL keys. Although rewind
may keep a segment from the destination, it might only be a segment
that was created before common history (before the replica was created);
therefore, the key for such a segment would be copied on the replica
while pg_basebackup. The rest of the segments would be wiped out from
the destination and replaced by the source's ones (along with their keys).

All changes only for pg18 so far. I'll port them to pg17 when I get approvals

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 76.65198% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.92%. Comparing base (58d1018) to head (33c7ba2).
⚠️ Report is 8 commits behind head on main.

❌ Your project status has failed because the head coverage (76.44%) is below the target coverage (90.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #559      +/-   ##
==========================================
+ Coverage   57.67%   57.92%   +0.24%     
==========================================
  Files          68       69       +1     
  Lines       10753    10934     +181     
  Branches     2648     2689      +41     
==========================================
+ Hits         6202     6333     +131     
- Misses       3281     3310      +29     
- Partials     1270     1291      +21     
Components Coverage Δ
access 81.77% <100.00%> (-0.19%) ⬇️
bin 63.76% <ø> (ø)
catalog 78.33% <ø> (+0.73%) ⬆️
common 88.23% <ø> (ø)
encryption 63.63% <100.00%> (+5.65%) ⬆️
keyring 65.55% <ø> (ø)
src 87.33% <ø> (ø)
smgr 90.20% <100.00%> (+0.66%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dAdAbird dAdAbird force-pushed the fix_pg_rewind branch 14 times, most recently from 83d50fb to 4d3c4ba Compare April 10, 2026 19:00
@dAdAbird dAdAbird changed the title Fix pg rewind PG-2234 Fix pg_rewind with encrypted tables Apr 10, 2026
@dAdAbird dAdAbird marked this pull request as ready for review April 10, 2026 19:04
Copy link
Copy Markdown
Collaborator

@jeltz jeltz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some quick review comments.

{
char *check_path = datasegpath(rlocator, MAIN_FORKNUM, segNo);

if (strcmp(check_path, path) != 0)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like you accidentally removed the comment explaining this part of the code.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is in inside path_rlocator where the path parsing is

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No?

	/*
	 * The sscanf tests above can match files that have extra characters at
	 * the end. To eliminate such cases, cross-check that GetRelationPath
	 * creates the exact same filename, when passed the RelFileLocator
	 * information we extracted from the filename.
	 */

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I thought you are talking about another comment. Yes, indeed, this one got disappeared somehow. I'll fix, thanks!

Comment thread fetools/pg18/pg_rewind/filemap.c Outdated
/*
* Sets rlocator and segNo based on given path. Returns false if didn't find
* a match.
*
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace.

Comment thread fetools/pg18/pg_rewind/filemap.c Outdated
if (strstr(path, ".DS_Store") != NULL)
return FILE_ACTION_NONE;

/*
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trailing whitespace.

Comment thread fetools/pg18/pg_rewind/pg_rewind.c Outdated
while (datapagemap_next(iter, &blkno))
{
offset = blkno * BLCKSZ;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental extra change.

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
static current_file_data current_tde_file =
{
0
};
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we usually write these kind of zero initializers as just {0} on the same line as the delcaration.

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
static char tde_tmp_scource[MAXPGPATH] = "/tmp/pg_tde_rewindXXXXXX";

void
flush_current_key(void)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be named something related to tde.

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
}

void
encrypt_block(unsigned char *buf, off_t file_offset)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be named something with tde.

Copy link
Copy Markdown
Collaborator

@artemgavrilov artemgavrilov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review comments

Comment thread src/encryption/enc_tde.c Outdated
}

/*
* The intialization vector of a block is its block number conmverted to a
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The intialization vector of a block is its block number conmverted to a
* The initialization vector of a block is its block number converted to a

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
};

/* Dir for an operational copy of source's tde files (_keys, etc) */
static char tde_tmp_scource[MAXPGPATH] = "/tmp/pg_tde_rewindXXXXXX";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
static char tde_tmp_scource[MAXPGPATH] = "/tmp/pg_tde_rewindXXXXXX";
static char tde_tmp_source[MAXPGPATH] = "/tmp/pg_tde_rewindXXXXXX";

Comment thread fetools/pg18/pg_rewind/libpq_source.c Outdated
PQresultErrorMessage(res));

/* no tde dir, nothing to do */
if (PQnfields(res) == 0)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm... it seems that it should be PQntuples (rows), not PQnfields (columns) here

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
}

void
encrypt_block(unsigned char *buf, off_t file_offset, ForkNumber fork)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that func should be named reencrypt

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
if (read_len < 0)
pg_fatal("could not read file \"%s\": %m", srcpath);
else if (read_len == 0)
break; /* EOF reached */
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also check that we read entire block here, because later in crypt/decrypt we expect that we have exactly BLCKSZ bytes.

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
{
ssize_t read_len;

read_len = read(srcfd, buf.data, sizeof(buf));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sizeof(buf) is bigger than buf.data. It should be just BLCKSZ I think.

Comment thread fetools/pg18/pg_rewind/local_source.c Outdated
return false;
if (S_ISDIR(st.st_mode))
return true;
return false;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be simplified a bit as return stat(dir, &st) == 0 && S_ISDIR(st.st_mode);.

{
/*
* Nothing to do, local_queue_fetch_range() copies the ranges immediately.
*/
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe write a comment replacing this.

Comment thread src/access/pg_tde_tdemap.c
Comment thread src/encryption/enc_tde.c
CalcBlockIv(forknum, blocknum, relKey->base_iv, iv);

AesEncrypt(relKey->key, relKey->key_len, iv, in, BLCKSZ, out);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this change should be in a separate commit since this refactoring seems to make sense anyway. Turning it more int a library.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, makes sense

@jeltz
Copy link
Copy Markdown
Collaborator

jeltz commented Apr 28, 2026

The second commit message needs to be improved and the first contains some typos, e.g. "blok" instead of "block".

@@ -354,6 +364,17 @@ libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len)
static void
libpq_queue_fetch_range(rewind_source *source, const char *path, off_t off,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have this wrapper instead of just changing all callsites to pass either true or false?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what rewind_source interface demands. For the "local source", queue_fetch_range and queue_fetch_range are actually two different functions. I'm not eager to change this interface tbh, as it might bring additional pain with upstream changes.

return;

create_tde_tmp_dir();
init_tde();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this change should be moved to the first commit?

source_has_tde = true;
create_tde_tmp_dir();
atexit(destroy_tde_tmp_dir);
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with this. Maybe atexti() should be in the frist commit?

@@ -0,0 +1,82 @@

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this extra newline.

@@ -0,0 +1,82 @@

# Copyright (c) 2021-2024, PostgreSQL Global Development Group
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally do not add copyright headers to our files.

@@ -0,0 +1,65 @@

# Copyright (c) 2021-2024, PostgreSQL Global Development Group
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about empty line and copyright notice.

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
static bool source_has_tde = false;

static void
recrypt_fork(ForkNumber fork)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is "reencrypt".

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
* Server can recover from wrecked VM/FSM, hence only warnings here
* and in the rest of the function
*/
pg_log_warning("could not open file for reading \"%s\": %m", srcpath);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these error messages maybe should be improved with some context.

Comment thread fetools/pg18/pg_rewind/tde_ops.c Outdated
read_len = read(srcfd, buf.data, sizeof(buf));

if (read_len < 0)
pg_fatal("could not read file \"%s\": %m", srcpath);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this fatal and not jsut a warning? Doesn't PostgreSQL recreate them if they are corrupt?

Comment thread t/pg_rewind_enc_fsm.pl Outdated
@@ -0,0 +1,68 @@

# Copyright (c) 2021-2024, PostgreSQL Global Development Group
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue with empty line and copyright.

@dAdAbird dAdAbird force-pushed the fix_pg_rewind branch 5 times, most recently from ed1d813 to 84bd5bb Compare May 4, 2026 16:13
Make it possible to encrypt and decrypt relation blocks outside of the
SMGR interface. This will be needed for data re-encryption in pg_rewind.
dAdAbird added 4 commits May 4, 2026 19:48
The rewind might update blocks on the target's file with those of the
source. In that case, encrypted files will end up with different blocks
encrypted with different internal keys, leading to data corruption.
To prevent this, we have to decrypt blocks coming from the source and
re-encrypt them with the target's internal key.

Along with that, providers might have been changed, primary keys
rotated, etc., after the divergence of clusters. Generally, it's not a
problem because we would replace the keys and providers on the target
with those of the source. But it will render internal keys that we have
used for the partial blocks unreadable. To fix this, we have to
re-encrypt such internal keys with the source's primary keys.

So the sequence is next:
1. Copy the source's pg_tde dir into a tmp location. All following reads
and writes of source keys and providers happen in this tmp dir.
2. When we have partial updates, check for the source key. And
re-encrypt blocks if the key exists.
3. Save the target key into the source _keys file, replacing the
existing one.
4. Replace the target's pg_tde dir with the tmp one.
Extends changes made in the prev commit to work with the remote source.
This adds fetching `pgdata/pg_tde` content from the remote source.
Also, remote source fetches data in batches, with blocks from different
files queued and processed later. So we need to mark blocks that might
require re-encryption upon download.
Encrypted files might remain unchanged or truncated by rewind on the
target. That leaves them encrypted with the target's key wich vanishes
after the pg_tde replace. So we need to ensure keys for such files.
Meaning if this is an encrypted relation, we save its target key into
the source's keys.
We rewrite the internal key for relations when partially re-encrypting
blocks. That makes its FSM and VM fork unreadable as they are still
encrypted with the old key. To fix that, we re-encrypt such forks with
the proper key after we finish processing the main fork file.

As pg_rewind processes files in the order of operation types (see
file_action_t) and whole-file copies occur before any partial writes,
we assume that for files already in the target datadir, we rewrite
them in-place.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants