MDEV-34358 Encryption threads consume CPU when no work available#5040
Conversation
|
|
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements an exponential backoff mechanism for InnoDB encryption threads to reduce CPU usage when tablespaces are temporarily unavailable, such as during DDL operations. It introduces a versioning system for encryption settings to trigger iteration restarts and adds debug status variables to track wait behavior. The review feedback highlights several improvement opportunities: addressing dead code wrapped in #if 0, correcting the timeout threshold logic to match the intended behavior described in comments, removing redundant counter resets, and increasing the bit-width of the settings version counter to prevent potential overflow.
| Incremented each time innodb_encrypt_tables or | ||
| innodb_encryption_rotate_key_age is modified to signal | ||
| encryption threads to restart iteration */ | ||
| static Atomic_counter<uint16_t> fil_crypt_settings_version; |
There was a problem hiding this comment.
Using uint16_t for a version counter is risky as it can wrap around relatively quickly (65535 changes). While unlikely to cause an issue in typical operation, using uint32_t or uint64_t is safer and generally has the same performance characteristics for atomic operations on modern architectures.
static Atomic_counter<uint32_t> fil_crypt_settings_version;There was a problem hiding this comment.
If user changes innodb_encrypt_tables variable every hour then this variable may ran out in 7.5 years. If we restart the server in the mean time then this variable becomes 0. This datatype is enough for practical usage. I don't think sane people is going to toggle encryption for every hour.
There was a problem hiding this comment.
While I agree with you Thiru, it's true that Atomic_counter<uint16_t> shows only one result in the new code. Atomic_counter<uint32_t> is more common, and behaves likely similar while being wider. I think it may be considered to use Atomic_counter<uint32_t>. It will maybe cause inefficiency in rotate_thread_t structure due to widening of rotate_thread_t::settings_version. Could you please check in gdb with ptype /o rotate_thread_t if there's any hole/padding that allows widening without size increase?
dr-m
left a comment
There was a problem hiding this comment.
I think that we need to consider two paths forward. I see that this pull request is currently targeting the 10.6 branch, to which only minimal changes should be done, given that the branch will soon reach its end of life.
There is also the hang MDEV-37946, which I expect we should be hitting when testing this extensively enough. I think that the hang needs to be fixed as part of a minimal fix; it should be fairly easy.
For newer branches, I would like to see a more comprehensive solution, which I am outlining below.
I’d like to see some testing of this subsystem. I have some doubts of the efficacy of the current multi-threaded implementation. We currently have multiple threads that can increase the size of buf_pool.flush_list, It could be the case that multiple threads are concurrently calling buf_flush_list_space(), which could repeatedly invalidate buf_pool.flush_hp.
As a starting point, I think that we have to test if there is any benefit of having multiple "encryption threads" which perform dummy writes to data pages. The name "encryption thread" is misleading, because the actual encryption will take place in the single buf_flush_page_cleaner() thread, in the function buf_page_encrypt(). Yes, this CPU intensive operation will be executed in a single thread!
How can it possibly be helpful to have multiple "producers" (fil_crypt_thread) that are dirtying the buffer pool, when we have a single "consumer" (buf_flush_page_cleaner) that is doing some very CPU intensive work (buf_page_encrypt() and computing CRC-32C)?
Could we implement tighter coupling between the current "encryption" (page-dirtying) and the buf_flush_page_cleaner()? That thread is already regularly scanning potentially the entire buffer pool. So, why not implement an "encryption step" that would process a number of pages that were missed by buf_flush_list_holding_mutex() and buf_flush_LRU()? Such a design could trivially follow the innodb_io_capacity budget and scale back the key rotation activity when there are bursts of application workload (some threads are blocked in buf_flush_wait() or buf_LRU_get_free_block()).
When a clean page that we wish to re-encrypt is in the buffer pool, the buf_flush_page_cleaner() thread can simply write a dummy log record and move the page to the buf_pool.flush_list.
When a page will need to be read into the buffer pool, the current implementation is invoking synchronous buf_page_read() via buf_page_get_gen(). It would be much better to submit asynchronous reads, implementing the logic with a future implementation of MDEV-11378 in mind. The read completion callback would have to detect that this is a special read for encryption, and it would write the dummy record so that the page will be re-encrypted.
The relevant parts of fil_space_rotate_state_t and key_state_t may have to be made part of fil_space_t or fil_space_crypt_t.
A final part of this would be to repurpose the parameter innodb_encrypt_threads to refer to worker tasks that the buf_flush_page_cleaner() would employ to encrypt pages or compute CRC-32C. The function buf_page_t::flush() would be split. Only initial part would be invoked by the buf_flush_page_cleaner() thread; the rest would be invoked by worker tasks, which would process a number of buf_page_t* from their queue.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces an exponential backoff mechanism for InnoDB encryption threads to optimize CPU usage and a versioning system to manage configuration changes, including updates to status variable tests. The review feedback identifies several improvements: using 32-bit integers for timeout and version counters to prevent overflow and wrap-around, correcting the wait logic for the default encryption list, resetting timeout states when work is found, and addressing a busy-wait loop in IOPS allocation.
| Used only when iterating fil_system.space_list to periodically recheck for | ||
| spaces that become available after DDL operations complete. Reset to 5000ms | ||
| when woken by signal or when work is found. */ | ||
| uint16_t sleep_timeout_ms= 5000; |
There was a problem hiding this comment.
The variable sleep_timeout_ms should be changed to uint32_t to avoid potential overflow issues during calculations in increase_sleep_timeout(). When sleep_timeout_ms is 40,000, sleep_timeout_ms * 2 results in 80,000, which exceeds the uint16_t range and causes incorrect behavior when passed to std::min<uint16_t>.
uint32_t sleep_timeout_ms= 5000;There was a problem hiding this comment.
The writes are bounded, no need to widen IMHO.
| timed wait expires without being woken by a signal, indicating no work | ||
| became available during the timeout period. */ | ||
| void increase_sleep_timeout() { | ||
| sleep_timeout_ms = std::min<uint16_t>(sleep_timeout_ms * 2, 60000); |
There was a problem hiding this comment.
There was a problem hiding this comment.
Please, address this: you can widen sleep_timeout_ms to 32-bit before doing * 2 to avoid 16-bit overflow, then clamp it to the max of 60 seconds.
There was a problem hiding this comment.
Or, change to sleep_timeout_s and avoid the /1000 in set_timespec(abstime, sleep_timeout_ms / 1000);.
| Incremented each time innodb_encrypt_tables or | ||
| innodb_encryption_rotate_key_age is modified to signal | ||
| encryption threads to restart iteration */ | ||
| static Atomic_counter<uint16_t> fil_crypt_settings_version; |
There was a problem hiding this comment.
Using uint16_t for a version counter is risky in a long-running database server as it can wrap around relatively quickly (65,535 changes). While the logic only checks for inequality, using uint32_t is safer and more standard for versioning to avoid any collision during a single iteration.
static Atomic_counter<uint32_t> fil_crypt_settings_version;| /** Config version when thread started current iteration. | ||
| Used to detect innodb_encrypt_tables changes during iteration | ||
| and restart from beginning to ensure complete encryption coverage. */ | ||
| uint16_t settings_version; |
| return; | ||
| } | ||
|
|
||
| if (space == fil_system.space_list.end()) { |
There was a problem hiding this comment.
There is a discrepancy between the code and the intended design described in the PR and the comments (lines 1131-1133). The code currently performs a timed wait when space == .end(), regardless of whether it's iterating the default_encrypt_list. To follow the stated design of using an indefinite wait for the default list (where other threads are expected to wake this one), the condition should check the default_encrypt_list flag.
if (space == fil_system.space_list.end() && !default_encrypt_list) {There was a problem hiding this comment.
Are you kidding me, right?
Please check the callers for wait_for_work()
if (thr.default_encrypt_list) {
pthread_cond_broadcast(
&fil_crypt_threads_cond);
}
goto wait_for_work;
}
| goto wait_for_work; | ||
| } | ||
|
|
||
| /* we found a space to rotate */ |
There was a problem hiding this comment.
When a tablespace is successfully found for rotation, the sleep timeout and timed wait count should be reset. This ensures that the exponential backoff starts fresh the next time the thread becomes idle, rather than potentially jumping straight to an indefinite wait due to stale state from previous idle periods.
/* we found a space to rotate */
thr.reset_sleep_timeout();| do { | ||
| if (thr.should_shutdown()) { | ||
| thr.space->release(); | ||
| thr.space = fil_system.space_list.end(); | ||
| goto func_exit; | ||
| } | ||
| } while (!fil_crypt_alloc_iops(&thr)); | ||
|
|
There was a problem hiding this comment.
This loop for IOPS allocation is a busy-wait that consumes 100% CPU if fil_crypt_alloc_iops returns false (e.g., when the global encryption IOPS limit is reached). While this pattern existed in the previous version, it remains a performance bottleneck. Consider adding a short sleep or yielding the thread if IOPS are unavailable.
iMineLink
left a comment
There was a problem hiding this comment.
Leaved few comments, one about potential stats missing update, one about iteration restarts on heavy IOPS usage, few about comments improvements, and the most important one about a missing mtr test to "smoke-test" the new behavior measurable with the INNODB_ENCRYPTION_INDEFINITE_WAITS and INNODB_ENCRYPTION_TIMED_WAITS variables. I also commented previous Gemini review comments, please check, I think only the overflow on 16-bit timeout variable is important.
02f8903 to
b709e6a
Compare
iMineLink
left a comment
There was a problem hiding this comment.
Requesting a few touchups, the only critical one (please review if necessary) is in fil_crypt_find_space_to_rotate(). Thanks.
b709e6a to
4b43218
Compare
iMineLink
left a comment
There was a problem hiding this comment.
Thanks for addressing the previous review comments.
I found a minor stale comment and a potential case in which "polling" behavior would be still observed: the safest change is to throttle it with a sleep I guess, please take a look.
4b43218 to
6940997
Compare
iMineLink
left a comment
There was a problem hiding this comment.
There is a failure in encryption.create_or_replace which needs attention and some reasoning needed around the indefinite wait in wait_for_work() now that polling is removed. Thank you.
6940997 to
4a40c0d
Compare
4a40c0d to
05a2ff1
Compare
iMineLink
left a comment
There was a problem hiding this comment.
Thanks for the changes. I did some testing and everything was OK:
- 3x
./build/Debug/mysql-test/mtr --mem encryption.create_or_replace{,,,,,,,,,} --parallel=auto --repeat=100 ./build/Debug/mysql-test/mtr --mem --parallel=auto
I have one final comment (even though no failure was seen in mtr), please check.
I think this is a pretty thorough rework of the encryption subsystem, and will need dedicated testing too.
05a2ff1 to
7f2e88f
Compare
iMineLink
left a comment
There was a problem hiding this comment.
Thanks for addressing my last comment.
I did another round of stress testing with ./build/Debug/mysql-test/mtr --mem encryption.create_or_replace{,,,,,,,,,} --parallel=auto --repeat=100 and it passed.
The changes now look good to me, but they will need dedicated testing before merge because of the wide impact on encryption.
7f2e88f to
28e67ab
Compare
28e67ab to
3e67de1
Compare
| settings_version(fil_crypt_settings_version), thread_no(no){} | ||
|
|
||
| uint thread_no; | ||
| bool first = true; /*!< is position before first space */ |
There was a problem hiding this comment.
I’d find the name before_first more descriptive. I don’t insist that it be changed as part of this.
3e67de1 to
f8a932d
Compare
f8a932d to
a63fa8d
Compare
975b1d6 to
a5bef95
Compare
| if (done && !(crypt_data->rotate_state.active_threads | ||
| | crypt_data->rotate_state.aborted)) { |
There was a problem hiding this comment.
warning C4805: '|': unsafe mix of type 'ulint' and type 'bool' in operation
I think we need ulint{} around the second operand.
There was a problem hiding this comment.
Instead of implementing my suggestion, you added another conditional branch. I tested my suggestion on https://godbolt.org (using the built-in type unsigned instead of a type alias ulint):
unsigned t(bool a, unsigned b) {
return !(unsigned{a} | b);
}
unsigned w(bool a, unsigned b) {
return !(a | b);
}
unsigned bloat(bool a, unsigned b) {
return !a || !b;
}The first variant does not result in a warning; the second one does:
example.cpp
<source>(5): warning C4805: '|': unsafe mix of type 'bool' and type 'unsigned int' in operation
The third variant results in a conditional branch being emitted.
d8427cf to
3c5b8ce
Compare
dr-m
left a comment
There was a problem hiding this comment.
OK to push after eliminating the conditional branch.
…E/purge Problem: ======== 1. Encryption threads busy-wait when no work is available: When reaching fil_system.space_list.end(), fil_crypt_return_iops() is called with wake=true, causing pthread_cond_broadcast() to wake all threads unnecessarily, leading to CPU waste. 2. Tablespaces with CLOSING/STOPPING flags skipped during iteration: Since DDL completion doesn't wake encryption threads, these spaces may never be encrypted if threads sleep indefinitely. 3. For default_encrypt_list iteration, when spaces exist but none are acquirable, threads need to wake others for cooperative retry, but this case was not distinguished from fil_system.space_list.end(). 4. IOPS are allocated before searching for tablespaces, wasting resources during iteration when no I/O occurs. 5. Encryption threads use fil_crypt_threads_cond for two different purposes: waiting for encryption work and waiting for IOPS allocation. When fil_crypt_return_iops() or fil_crypt_realloc_iops() broadcasts after releasing IOPS, it wakes ALL threads including those correctly waiting for work, causing spurious wakeups and CPU waste. 6. When innodb_encrypt_tables or innodb_encryption_rotate_key_age is changed during encryption thread iteration, threads continue with stale configuration values, potentially missing tablespaces that should be encrypted or rotated under the new settings. 7. The InnoDB encryption thread could deadlock with DROP TABLE and purge operations in a three-way deadlock scenario: - DROP TABLE thread holds lock_sys.latch and waits for the tablespace pending reference count to reach zero before dropping the space. - Encryption thread holds a tablespace reference and waits to acquire the tablespace allocation latch in exclusive mode to call fseg_page_is_allocated() for checking if a page is allocated before encrypting it. - Purge coordinator thread holds the tablespace allocation latch in exclusive mode and waits to acquire lock_sys.latch in shared mode for record lock operations. This creates a circular dependency and leads to deadlock. Solution: ========= 1. Implement timed wait with exponential backoff: When space == fil_system.space_list.end() (applies to both default_encrypt_list and space_list iteration when no acquirable spaces are found): - First timeout: 5 seconds - Subsequent timeouts: (timed_wait_count + 1) * 5 seconds (10s, 20s, 40s, 60s) - After 5 consecutive timeouts (~135 seconds total), continue with 60-second timed waits to ensure threads periodically recheck for state changes - Timeout counter resets to 0 when woken by signal or when work is found 2. Move IOPS allocation from before tablespace search to after finding a space that needs rotation. If allocation fails, set recheck=true to skip waiting and immediately try next space. Encryption threads would hold space references while waiting in fil_crypt_alloc_iops(), blocking DROP TABLE. To prevent this, encyption should do release-wait-reacquire pattern. fil_crypt_alloc_iops(): Added nowait parameter (default false). When nowait=true, returns immediately if IOPS not available instead of waiting. fil_crypt_thread(): Try non-blocking IOPS allocation first with nowait=true. Only if IOPS not immediately available: - Save space ID and release space reference - Wait for IOPS with nowait=false - Reacquire space using fil_space_get_by_id() and space->acquire() - If space dropped or stopping, release IOPS and skip This ensures encryption threads never hold space references while waiting for IOPS, allowing DROP TABLE operations to proceed without deadlock. 3. Introduce separate condition variable fil_crypt_iops_cond specifically for IOPS allocation synchronization to prevent spurious wakeups: - fil_crypt_threads_cond: Used in wait_for_work() for waiting when no tablespaces need encryption. Signaled when settings change, new tablespaces are created, or thread count changes. - fil_crypt_iops_cond: Used in fil_crypt_alloc_iops() for waiting when IOPS limit is reached. Signaled when IOPS are returned via fil_crypt_return_iops(), released via fil_crypt_realloc_iops(), or when srv_n_fil_crypt_iops is increased. 4. Added atomic version counter fil_crypt_settings_version that is incremented whenever innodb_encrypt_tables or innodb_encryption_rotate_key_age changes. Encryption threads capture the version at iteration start and check for changes during iteration. If config changed, threads immediately restart iteration from the beginning to ensure complete coverage with new settings. New fields: - fil_crypt_settings_version: Atomic counter to track configuration changes - rotate_thread_t::timed_wait_count (uint8_t): Counts consecutive timeouts for exponential backoff - rotate_thread_t::wait_for_work(): Implements timed/indefinite wait strategy - rotate_thread_t::settings_version: Compares with fil_crypt_settings_version to restart encryption from the beginning 5. Fix three-way deadlock with trylock mechanism: fseg_page_is_allocated(): Added optional caller_mtr parameter. Encryption thread passes an mtr, allocation bitmap page latch is now correctly held in that mtr until the caller commits it. fil_crypt_get_page_throttle(): Use fil_space_t::x_lock_try() to acquire the tablespace allocation latch non-blockingly. If the trylock fails, back off and retry the same page to avoid deadlock. The bitmap page S-latch is added to the mtr and held throughout the page read and encryption operations. Release the tablespace exclusive latch immediately after the allocation check to minimize contention, while the bitmap page S-latch remains held in the mtr. fil_crypt_rotate_page(): After 10 retries on page latch acquisition, release IOPS, sleep 10ms, then try to reacquire IOPS with nowait=true. If IOPS not immediately available, exit gracefully by setting first=true. This periodic IOPS release allows other encryption threads working on tablespaces being dropped to make progress and release their space references, helping to break the deadlock scenario. rotate_state.aborted: set under crypt_data->mutex when a thread bails in fil_crypt_rotate_pages(). Gate should_flush in fil_crypt_complete_rotate_space() on !aborted so a partial pass cannot commit min_key_version. Cleared when the next pass initializes the fresh pass restarts.
3c5b8ce to
8473a85
Compare
Problem:
space that needs rotation.