PG-2234 Fix pg_rewind with encrypted tables#559
PG-2234 Fix pg_rewind with encrypted tables#559dAdAbird wants to merge 5 commits intopercona:mainfrom
Conversation
Codecov Report❌ Patch coverage is ❌ Your project status has failed because the head coverage (76.44%) is below the target coverage (90.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #559 +/- ##
==========================================
+ Coverage 57.67% 57.92% +0.24%
==========================================
Files 68 69 +1
Lines 10753 10934 +181
Branches 2648 2689 +41
==========================================
+ Hits 6202 6333 +131
- Misses 3281 3310 +29
- Partials 1270 1291 +21
🚀 New features to boost your workflow:
|
83d50fb to
4d3c4ba
Compare
jeltz
left a comment
There was a problem hiding this comment.
Just some quick review comments.
| { | ||
| char *check_path = datasegpath(rlocator, MAIN_FORKNUM, segNo); | ||
|
|
||
| if (strcmp(check_path, path) != 0) |
There was a problem hiding this comment.
Seems like you accidentally removed the comment explaining this part of the code.
There was a problem hiding this comment.
The comment is in inside path_rlocator where the path parsing is
There was a problem hiding this comment.
No?
/*
* The sscanf tests above can match files that have extra characters at
* the end. To eliminate such cases, cross-check that GetRelationPath
* creates the exact same filename, when passed the RelFileLocator
* information we extracted from the filename.
*/
There was a problem hiding this comment.
Ah, I thought you are talking about another comment. Yes, indeed, this one got disappeared somehow. I'll fix, thanks!
| /* | ||
| * Sets rlocator and segNo based on given path. Returns false if didn't find | ||
| * a match. | ||
| * |
| if (strstr(path, ".DS_Store") != NULL) | ||
| return FILE_ACTION_NONE; | ||
|
|
||
| /* |
| while (datapagemap_next(iter, &blkno)) | ||
| { | ||
| offset = blkno * BLCKSZ; | ||
|
|
| static current_file_data current_tde_file = | ||
| { | ||
| 0 | ||
| }; |
There was a problem hiding this comment.
I think we usually write these kind of zero initializers as just {0} on the same line as the delcaration.
| static char tde_tmp_scource[MAXPGPATH] = "/tmp/pg_tde_rewindXXXXXX"; | ||
|
|
||
| void | ||
| flush_current_key(void) |
There was a problem hiding this comment.
This should be named something related to tde.
| } | ||
|
|
||
| void | ||
| encrypt_block(unsigned char *buf, off_t file_offset) |
There was a problem hiding this comment.
This should be named something with tde.
artemgavrilov
left a comment
There was a problem hiding this comment.
Partial review comments
| } | ||
|
|
||
| /* | ||
| * The intialization vector of a block is its block number conmverted to a |
There was a problem hiding this comment.
| * The intialization vector of a block is its block number conmverted to a | |
| * The initialization vector of a block is its block number converted to a |
| }; | ||
|
|
||
| /* Dir for an operational copy of source's tde files (_keys, etc) */ | ||
| static char tde_tmp_scource[MAXPGPATH] = "/tmp/pg_tde_rewindXXXXXX"; |
There was a problem hiding this comment.
| static char tde_tmp_scource[MAXPGPATH] = "/tmp/pg_tde_rewindXXXXXX"; | |
| static char tde_tmp_source[MAXPGPATH] = "/tmp/pg_tde_rewindXXXXXX"; |
| PQresultErrorMessage(res)); | ||
|
|
||
| /* no tde dir, nothing to do */ | ||
| if (PQnfields(res) == 0) |
There was a problem hiding this comment.
Mmm... it seems that it should be PQntuples (rows), not PQnfields (columns) here
| } | ||
|
|
||
| void | ||
| encrypt_block(unsigned char *buf, off_t file_offset, ForkNumber fork) |
There was a problem hiding this comment.
It seems that func should be named reencrypt
| if (read_len < 0) | ||
| pg_fatal("could not read file \"%s\": %m", srcpath); | ||
| else if (read_len == 0) | ||
| break; /* EOF reached */ |
There was a problem hiding this comment.
I think we should also check that we read entire block here, because later in crypt/decrypt we expect that we have exactly BLCKSZ bytes.
| { | ||
| ssize_t read_len; | ||
|
|
||
| read_len = read(srcfd, buf.data, sizeof(buf)); |
There was a problem hiding this comment.
sizeof(buf) is bigger than buf.data. It should be just BLCKSZ I think.
| return false; | ||
| if (S_ISDIR(st.st_mode)) | ||
| return true; | ||
| return false; |
There was a problem hiding this comment.
I think this could be simplified a bit as return stat(dir, &st) == 0 && S_ISDIR(st.st_mode);.
| { | ||
| /* | ||
| * Nothing to do, local_queue_fetch_range() copies the ranges immediately. | ||
| */ |
There was a problem hiding this comment.
Maybe write a comment replacing this.
| CalcBlockIv(forknum, blocknum, relKey->base_iv, iv); | ||
|
|
||
| AesEncrypt(relKey->key, relKey->key_len, iv, in, BLCKSZ, out); | ||
| } |
There was a problem hiding this comment.
Maybe this change should be in a separate commit since this refactoring seems to make sense anyway. Turning it more int a library.
|
The second commit message needs to be improved and the first contains some typos, e.g. "blok" instead of "block". |
| @@ -354,6 +364,17 @@ libpq_queue_fetch_file(rewind_source *source, const char *path, size_t len) | |||
| static void | |||
| libpq_queue_fetch_range(rewind_source *source, const char *path, off_t off, | |||
There was a problem hiding this comment.
Why have this wrapper instead of just changing all callsites to pass either true or false?
There was a problem hiding this comment.
This is what rewind_source interface demands. For the "local source", queue_fetch_range and queue_fetch_range are actually two different functions. I'm not eager to change this interface tbh, as it might bring additional pain with upstream changes.
| return; | ||
|
|
||
| create_tde_tmp_dir(); | ||
| init_tde(); |
There was a problem hiding this comment.
Maybe this change should be moved to the first commit?
| source_has_tde = true; | ||
| create_tde_tmp_dir(); | ||
| atexit(destroy_tde_tmp_dir); | ||
| } |
There was a problem hiding this comment.
Same with this. Maybe atexti() should be in the frist commit?
| @@ -0,0 +1,82 @@ | |||
|
|
|||
| @@ -0,0 +1,82 @@ | |||
|
|
|||
| # Copyright (c) 2021-2024, PostgreSQL Global Development Group | |||
There was a problem hiding this comment.
We generally do not add copyright headers to our files.
| @@ -0,0 +1,65 @@ | |||
|
|
|||
| # Copyright (c) 2021-2024, PostgreSQL Global Development Group | |||
There was a problem hiding this comment.
Same comment about empty line and copyright notice.
| static bool source_has_tde = false; | ||
|
|
||
| static void | ||
| recrypt_fork(ForkNumber fork) |
| * Server can recover from wrecked VM/FSM, hence only warnings here | ||
| * and in the rest of the function | ||
| */ | ||
| pg_log_warning("could not open file for reading \"%s\": %m", srcpath); |
There was a problem hiding this comment.
I think these error messages maybe should be improved with some context.
| read_len = read(srcfd, buf.data, sizeof(buf)); | ||
|
|
||
| if (read_len < 0) | ||
| pg_fatal("could not read file \"%s\": %m", srcpath); |
There was a problem hiding this comment.
Why is this fatal and not jsut a warning? Doesn't PostgreSQL recreate them if they are corrupt?
| @@ -0,0 +1,68 @@ | |||
|
|
|||
| # Copyright (c) 2021-2024, PostgreSQL Global Development Group | |||
There was a problem hiding this comment.
Same issue with empty line and copyright.
ed1d813 to
84bd5bb
Compare
Make it possible to encrypt and decrypt relation blocks outside of the SMGR interface. This will be needed for data re-encryption in pg_rewind.
The rewind might update blocks on the target's file with those of the source. In that case, encrypted files will end up with different blocks encrypted with different internal keys, leading to data corruption. To prevent this, we have to decrypt blocks coming from the source and re-encrypt them with the target's internal key. Along with that, providers might have been changed, primary keys rotated, etc., after the divergence of clusters. Generally, it's not a problem because we would replace the keys and providers on the target with those of the source. But it will render internal keys that we have used for the partial blocks unreadable. To fix this, we have to re-encrypt such internal keys with the source's primary keys. So the sequence is next: 1. Copy the source's pg_tde dir into a tmp location. All following reads and writes of source keys and providers happen in this tmp dir. 2. When we have partial updates, check for the source key. And re-encrypt blocks if the key exists. 3. Save the target key into the source _keys file, replacing the existing one. 4. Replace the target's pg_tde dir with the tmp one.
Extends changes made in the prev commit to work with the remote source. This adds fetching `pgdata/pg_tde` content from the remote source. Also, remote source fetches data in batches, with blocks from different files queued and processed later. So we need to mark blocks that might require re-encryption upon download.
Encrypted files might remain unchanged or truncated by rewind on the target. That leaves them encrypted with the target's key wich vanishes after the pg_tde replace. So we need to ensure keys for such files. Meaning if this is an encrypted relation, we save its target key into the source's keys.
We rewrite the internal key for relations when partially re-encrypting blocks. That makes its FSM and VM fork unreadable as they are still encrypted with the old key. To fix that, we re-encrypt such forks with the proper key after we finish processing the main fork file. As pg_rewind processes files in the order of operation types (see file_action_t) and whole-file copies occur before any partial writes, we assume that for files already in the target datadir, we rewrite them in-place.
The rewind might update blocks on target's file with that of the source.
In that case, encrypted files will end up with different blocks
encrypted with different internal keys, leading to data corruption.
To prevent this, we have to decryp blocks coming from the source and
re-encrypt them with the target's internal key.
Along with that, providers might have been changed, primary keys
rotated, etc., after the divergence of clusters. Generally, it's not a
problem because we would replace the keys and providers on the target
with those of the source. But it will render internal keys that we have
used for the partial blocks unreadable. To fix this, we have to
re-encrypt such internal keys with source's primary keys.
So the sequence is next:
and writes of source keys and providers happen in this tmp dir.
re-encrypt blocks if the key exists.
existing one.
See commit messages for additional info.
There is no need to do anything with WAL keys. Although rewind
may keep a segment from the destination, it might only be a segment
that was created before common history (before the replica was created);
therefore, the key for such a segment would be copied on the replica
while pg_basebackup. The rest of the segments would be wiped out from
the destination and replaced by the source's ones (along with their keys).
All changes only for pg18 so far. I'll port them to pg17 when I get approvals