Skip to content

MDEV-34358 Encryption threads consume CPU when no work available#5040

Merged
Thirunarayanan merged 1 commit into
10.11from
MDEV-34358
Jun 8, 2026
Merged

MDEV-34358 Encryption threads consume CPU when no work available#5040
Thirunarayanan merged 1 commit into
10.11from
MDEV-34358

Conversation

@Thirunarayanan

@Thirunarayanan Thirunarayanan commented May 4, 2026

Copy link
Copy Markdown
Member

Problem:

1. Encryption threads busy-wait when no work is available. When reaching
fil_system.space_list.end(), fil_crypt_return_iops() is called with
wake=true, causing pthread_cond_broadcast() to wake all threads
unnecessarily, leading to CPU waste.

2. Tablespaces with CLOSING/STOPPING flags (set during DDL operations)
are skipped during iteration. Since DDL completion doesn't wake
encryption threads, these spaces may never be encrypted if threads
sleep indefinitely.

3. For default_encrypt_list iteration, when spaces exist but none are
acquirable, threads need to wake others for
cooperative retry, but this case was not distinguished from
fil_system.space_list.end().

4. IOPS are allocated before searching for tablespaces, wasting resources
during iteration when no I/O occurs.

Solution:
========
1. Implement timed wait with exponential backoff (5s -> 10s -> 20s -> 40s
-> 60s, max 5 attempts) for fil_system.space_list iteration. After
~135 seconds, switch to indefinite wait. This periodically rechecks
for spaces that become available after DDL completes.

2. Use indefinite wait for default_encrypt_list iteration since other
threads will retry and wake when needed.

3. fil_space_t::next(): Add default_encrypt_list flag to distinguish between
the two iteration modes. Wake other threads only
when this flag is true (spaces exist but unacquirable).
  1. Move IOPS allocation from before tablespace search to after finding a
    space that needs rotation.
5. Handle wake logic explicitly at call site based on default_encrypt_list flag.

rotate_thread_t changes
====================
- default_encrypt_list (bool): Indicates if default_encrypt_list has unacquirable spaces
- timed_wait_count (uint8_t): Counts consecutive timeouts for exponential backoff
- sleep_timeout_ms (uint16_t): Current timeout in ms (5s -> 60s max)
- wait_for_work(): Implements timed/indefinite wait based on iteration mode
- increase_sleep_timeout(): Doubles timeout up to 60s
- reset_sleep_timeout(): Resets timeout to 5s and clears count

Added debug-only status variables Innodb_encryption_timed_waits
and Innodb_encryption_indefinite_waits to track encryption thread
wait types, enabling verification that threads use timed wait with
exponential backoff instead of busy-waiting when idle.

The counters are incremented in rotate_thread_t::wait_for_work()
and exposed via SHOW STATUS in debug builds only. Also added a
debug sync point 'rotate_only_2_timed_waits' to reduce the timed
wait threshold from 5 to 2 for faster testing of the indefinite
wait transition.

@Thirunarayanan Thirunarayanan requested a review from dr-m May 4, 2026 19:15
@CLAassistant

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@vuvova

vuvova commented May 4, 2026

Copy link
Copy Markdown
Member

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements an exponential backoff mechanism for InnoDB encryption threads to reduce CPU usage when tablespaces are temporarily unavailable, such as during DDL operations. It introduces a versioning system for encryption settings to trigger iteration restarts and adds debug status variables to track wait behavior. The review feedback highlights several improvement opportunities: addressing dead code wrapped in #if 0, correcting the timeout threshold logic to match the intended behavior described in comments, removing redundant counter resets, and increasing the bit-width of the settings version counter to prevent potential overflow.

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Incremented each time innodb_encrypt_tables or
innodb_encryption_rotate_key_age is modified to signal
encryption threads to restart iteration */
static Atomic_counter<uint16_t> fil_crypt_settings_version;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using uint16_t for a version counter is risky as it can wrap around relatively quickly (65535 changes). While unlikely to cause an issue in typical operation, using uint32_t or uint64_t is safer and generally has the same performance characteristics for atomic operations on modern architectures.

static Atomic_counter<uint32_t> fil_crypt_settings_version;

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If user changes innodb_encrypt_tables variable every hour then this variable may ran out in 7.5 years. If we restart the server in the mean time then this variable becomes 0. This datatype is enough for practical usage. I don't think sane people is going to toggle encryption for every hour.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree with you Thiru, it's true that Atomic_counter<uint16_t> shows only one result in the new code. Atomic_counter<uint32_t> is more common, and behaves likely similar while being wider. I think it may be considered to use Atomic_counter<uint32_t>. It will maybe cause inefficiency in rotate_thread_t structure due to widening of rotate_thread_t::settings_version. Could you please check in gdb with ptype /o rotate_thread_t if there's any hole/padding that allows widening without size increase?

@dr-m dr-m left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we need to consider two paths forward. I see that this pull request is currently targeting the 10.6 branch, to which only minimal changes should be done, given that the branch will soon reach its end of life.

There is also the hang MDEV-37946, which I expect we should be hitting when testing this extensively enough. I think that the hang needs to be fixed as part of a minimal fix; it should be fairly easy.

For newer branches, I would like to see a more comprehensive solution, which I am outlining below.

I’d like to see some testing of this subsystem. I have some doubts of the efficacy of the current multi-threaded implementation. We currently have multiple threads that can increase the size of buf_pool.flush_list, It could be the case that multiple threads are concurrently calling buf_flush_list_space(), which could repeatedly invalidate buf_pool.flush_hp.

As a starting point, I think that we have to test if there is any benefit of having multiple "encryption threads" which perform dummy writes to data pages. The name "encryption thread" is misleading, because the actual encryption will take place in the single buf_flush_page_cleaner() thread, in the function buf_page_encrypt(). Yes, this CPU intensive operation will be executed in a single thread!

How can it possibly be helpful to have multiple "producers" (fil_crypt_thread) that are dirtying the buffer pool, when we have a single "consumer" (buf_flush_page_cleaner) that is doing some very CPU intensive work (buf_page_encrypt() and computing CRC-32C)?

Could we implement tighter coupling between the current "encryption" (page-dirtying) and the buf_flush_page_cleaner()? That thread is already regularly scanning potentially the entire buffer pool. So, why not implement an "encryption step" that would process a number of pages that were missed by buf_flush_list_holding_mutex() and buf_flush_LRU()? Such a design could trivially follow the innodb_io_capacity budget and scale back the key rotation activity when there are bursts of application workload (some threads are blocked in buf_flush_wait() or buf_LRU_get_free_block()).

When a clean page that we wish to re-encrypt is in the buffer pool, the buf_flush_page_cleaner() thread can simply write a dummy log record and move the page to the buf_pool.flush_list.

When a page will need to be read into the buffer pool, the current implementation is invoking synchronous buf_page_read() via buf_page_get_gen(). It would be much better to submit asynchronous reads, implementing the logic with a future implementation of MDEV-11378 in mind. The read completion callback would have to detect that this is a special read for encryption, and it would write the dummy record so that the page will be re-encrypted.

The relevant parts of fil_space_rotate_state_t and key_state_t may have to be made part of fil_space_t or fil_space_crypt_t.

A final part of this would be to repurpose the parameter innodb_encrypt_threads to refer to worker tasks that the buf_flush_page_cleaner() would employ to encrypt pages or compute CRC-32C. The function buf_page_t::flush() would be split. Only initial part would be invoked by the buf_flush_page_cleaner() thread; the rest would be invoked by worker tasks, which would process a number of buf_page_t* from their queue.

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc
@Thirunarayanan

Copy link
Copy Markdown
Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces an exponential backoff mechanism for InnoDB encryption threads to optimize CPU usage and a versioning system to manage configuration changes, including updates to status variable tests. The review feedback identifies several improvements: using 32-bit integers for timeout and version counters to prevent overflow and wrap-around, correcting the wait logic for the default encryption list, resetting timeout states when work is found, and addressing a busy-wait loop in IOPS allocation.

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Used only when iterating fil_system.space_list to periodically recheck for
spaces that become available after DDL operations complete. Reset to 5000ms
when woken by signal or when work is found. */
uint16_t sleep_timeout_ms= 5000;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The variable sleep_timeout_ms should be changed to uint32_t to avoid potential overflow issues during calculations in increase_sleep_timeout(). When sleep_timeout_ms is 40,000, sleep_timeout_ms * 2 results in 80,000, which exceeds the uint16_t range and causes incorrect behavior when passed to std::min<uint16_t>.

  uint32_t sleep_timeout_ms= 5000;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The writes are bounded, no need to widen IMHO.

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
timed wait expires without being woken by a signal, indicating no work
became available during the timeout period. */
void increase_sleep_timeout() {
sleep_timeout_ms = std::min<uint16_t>(sleep_timeout_ms * 2, 60000);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

As noted previously, sleep_timeout_ms * 2 can exceed 65,535. If sleep_timeout_ms is changed to uint32_t, the explicit template parameter for std::min can be removed or updated to ensure correct comparison.

	sleep_timeout_ms = std::min(sleep_timeout_ms * 2, 60000U);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, address this: you can widen sleep_timeout_ms to 32-bit before doing * 2 to avoid 16-bit overflow, then clamp it to the max of 60 seconds.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or, change to sleep_timeout_s and avoid the /1000 in set_timespec(abstime, sleep_timeout_ms / 1000);.

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Incremented each time innodb_encrypt_tables or
innodb_encryption_rotate_key_age is modified to signal
encryption threads to restart iteration */
static Atomic_counter<uint16_t> fil_crypt_settings_version;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using uint16_t for a version counter is risky in a long-running database server as it can wrap around relatively quickly (65,535 changes). While the logic only checks for inequality, using uint32_t is safer and more standard for versioning to avoid any collision during a single iteration.

static Atomic_counter<uint32_t> fil_crypt_settings_version;

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
/** Config version when thread started current iteration.
Used to detect innodb_encrypt_tables changes during iteration
and restart from beginning to ensure complete encryption coverage. */
uint16_t settings_version;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Update this to uint32_t to match the suggested change for the global fil_crypt_settings_version.

  uint32_t settings_version;

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
return;
}

if (space == fil_system.space_list.end()) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is a discrepancy between the code and the intended design described in the PR and the comments (lines 1131-1133). The code currently performs a timed wait when space == .end(), regardless of whether it's iterating the default_encrypt_list. To follow the stated design of using an indefinite wait for the default list (where other threads are expected to wake this one), the condition should check the default_encrypt_list flag.

		if (space == fil_system.space_list.end() && !default_encrypt_list) {

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you kidding me, right?
Please check the callers for wait_for_work()

                                if (thr.default_encrypt_list) {
                                        pthread_cond_broadcast(
                                                &fil_crypt_threads_cond);
                                }
                
                                goto wait_for_work;
                        }

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
goto wait_for_work;
}

/* we found a space to rotate */

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When a tablespace is successfully found for rotation, the sleep timeout and timed wait count should be reset. This ensures that the exponential backoff starts fresh the next time the thread becomes idle, rather than potentially jumping straight to an indefinite wait due to stale state from previous idle periods.

			/* we found a space to rotate */
			thr.reset_sleep_timeout();

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment on lines +2213 to +2220
do {
if (thr.should_shutdown()) {
thr.space->release();
thr.space = fil_system.space_list.end();
goto func_exit;
}
} while (!fil_crypt_alloc_iops(&thr));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This loop for IOPS allocation is a busy-wait that consumes 100% CPU if fil_crypt_alloc_iops returns false (e.g., when the global encryption IOPS limit is reached). While this pattern existed in the previous version, it remains a performance bottleneck. Consider adding a short sleep or yielding the thread if IOPS are unavailable.

@iMineLink iMineLink left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaved few comments, one about potential stats missing update, one about iteration restarts on heavy IOPS usage, few about comments improvements, and the most important one about a missing mtr test to "smoke-test" the new behavior measurable with the INNODB_ENCRYPTION_INDEFINITE_WAITS and INNODB_ENCRYPTION_TIMED_WAITS variables. I also commented previous Gemini review comments, please check, I think only the overflow on 16-bit timeout variable is important.

Comment thread storage/innobase/fil/fil0crypt.cc
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/include/fil0fil.h Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
@Thirunarayanan Thirunarayanan force-pushed the MDEV-34358 branch 4 times, most recently from 02f8903 to b709e6a Compare May 19, 2026 08:32
@Thirunarayanan Thirunarayanan requested a review from iMineLink May 19, 2026 15:04

@iMineLink iMineLink left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting a few touchups, the only critical one (please review if necessary) is in fil_crypt_find_space_to_rotate(). Thanks.

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc
Comment thread storage/innobase/fil/fil0crypt.cc
Comment thread storage/innobase/fil/fil0crypt.cc Outdated

@iMineLink iMineLink left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the previous review comments.
I found a minor stale comment and a potential case in which "polling" behavior would be still observed: the safest change is to throttle it with a sleep I guess, please take a look.

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated

@iMineLink iMineLink left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a failure in encryption.create_or_replace which needs attention and some reasoning needed around the indefinite wait in wait_for_work() now that polling is removed. Thank you.

Comment thread storage/innobase/fil/fil0crypt.cc
Comment thread storage/innobase/fil/fil0crypt.cc Outdated

@iMineLink iMineLink left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes. I did some testing and everything was OK:

  • 3x ./build/Debug/mysql-test/mtr --mem encryption.create_or_replace{,,,,,,,,,} --parallel=auto --repeat=100
  • ./build/Debug/mysql-test/mtr --mem --parallel=auto

I have one final comment (even though no failure was seen in mtr), please check.
I think this is a pretty thorough rework of the encryption subsystem, and will need dedicated testing too.

Comment thread storage/innobase/fil/fil0crypt.cc Outdated

@iMineLink iMineLink left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my last comment.
I did another round of stress testing with ./build/Debug/mysql-test/mtr --mem encryption.create_or_replace{,,,,,,,,,} --parallel=auto --repeat=100 and it passed.
The changes now look good to me, but they will need dedicated testing before merge because of the wide impact on encryption.

@Thirunarayanan Thirunarayanan changed the base branch from 10.6 to 10.11 May 29, 2026 07:09
@Thirunarayanan Thirunarayanan requested a review from dr-m June 2, 2026 05:03
Comment thread storage/innobase/include/fsp0fsp.h
Comment thread storage/innobase/include/fsp0fsp.h
Comment thread storage/innobase/fsp/fsp0fsp.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc
Comment thread storage/innobase/fil/fil0crypt.cc
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
settings_version(fil_crypt_settings_version), thread_no(no){}

uint thread_no;
bool first = true; /*!< is position before first space */

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’d find the name before_first more descriptive. I don’t insist that it be changed as part of this.

Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/include/fil0crypt.h
@Thirunarayanan Thirunarayanan force-pushed the MDEV-34358 branch 2 times, most recently from 975b1d6 to a5bef95 Compare June 5, 2026 17:41
@Thirunarayanan Thirunarayanan requested a review from dr-m June 5, 2026 17:42
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
Comment on lines +2157 to +2158
if (done && !(crypt_data->rotate_state.active_threads
| crypt_data->rotate_state.aborted)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning C4805: '|': unsafe mix of type 'ulint' and type 'bool' in operation

I think we need ulint{} around the second operand.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of implementing my suggestion, you added another conditional branch. I tested my suggestion on https://godbolt.org (using the built-in type unsigned instead of a type alias ulint):

unsigned t(bool a, unsigned b) {
    return !(unsigned{a} | b);
}
unsigned w(bool a, unsigned b) {
    return !(a | b);
}
unsigned bloat(bool a, unsigned b) {
    return !a || !b;
}

The first variant does not result in a warning; the second one does:

example.cpp
<source>(5): warning C4805: '|': unsafe mix of type 'bool' and type 'unsigned int' in operation

The third variant results in a conditional branch being emitted.

@Thirunarayanan Thirunarayanan force-pushed the MDEV-34358 branch 2 times, most recently from d8427cf to 3c5b8ce Compare June 8, 2026 05:28

@dr-m dr-m left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK to push after eliminating the conditional branch.

Comment thread storage/innobase/fil/fil0crypt.cc
Comment thread storage/innobase/fil/fil0crypt.cc Outdated
…E/purge

Problem:
========
1. Encryption threads busy-wait when no work is available:

When reaching fil_system.space_list.end(), fil_crypt_return_iops() is called
with wake=true, causing pthread_cond_broadcast() to wake all threads
unnecessarily, leading to CPU waste.

2. Tablespaces with CLOSING/STOPPING flags skipped during iteration:

Since DDL completion doesn't wake encryption threads, these spaces may never
be encrypted if threads sleep indefinitely.

3. For default_encrypt_list iteration, when spaces exist but none are
acquirable, threads need to wake others for cooperative retry, but this
case was not distinguished from fil_system.space_list.end().

4. IOPS are allocated before searching for tablespaces, wasting resources
during iteration when no I/O occurs.

5. Encryption threads use fil_crypt_threads_cond for two different purposes:
waiting for encryption work and waiting for IOPS allocation. When
fil_crypt_return_iops() or fil_crypt_realloc_iops() broadcasts after
releasing IOPS, it wakes ALL threads including those correctly waiting
for work, causing spurious wakeups and CPU waste.

6. When innodb_encrypt_tables or innodb_encryption_rotate_key_age is changed
during encryption thread iteration, threads continue with stale configuration
values, potentially missing tablespaces that should be encrypted or rotated
under the new settings.

7. The InnoDB encryption thread could deadlock with DROP TABLE and purge
operations in a three-way deadlock scenario:

- DROP TABLE thread holds lock_sys.latch and waits for the tablespace
pending reference count to reach zero before dropping the space.

- Encryption thread holds a tablespace reference and waits to acquire
the tablespace allocation latch in exclusive mode to call
fseg_page_is_allocated() for checking if a page is allocated before
encrypting it.

- Purge coordinator thread holds the tablespace allocation latch
in exclusive mode and waits to acquire lock_sys.latch
in shared mode for record lock operations.

This creates a circular dependency and leads to deadlock.

Solution:
=========
1. Implement timed wait with exponential backoff:

When space == fil_system.space_list.end() (applies to both default_encrypt_list
and space_list iteration when no acquirable spaces are found):
- First timeout: 5 seconds
- Subsequent timeouts: (timed_wait_count + 1) * 5 seconds (10s, 20s, 40s, 60s)
- After 5 consecutive timeouts (~135 seconds total), continue with 60-second
timed waits to ensure threads periodically recheck for state changes
- Timeout counter resets to 0 when woken by signal or when work is found

2. Move IOPS allocation from before tablespace search to after finding a space
that needs rotation. If allocation fails, set recheck=true to skip waiting
and immediately try next space.

Encryption threads would hold space references while waiting in
fil_crypt_alloc_iops(), blocking DROP TABLE. To prevent this,
encyption should do release-wait-reacquire pattern.

fil_crypt_alloc_iops(): Added nowait parameter (default false). When
nowait=true, returns immediately if IOPS not available instead of waiting.

fil_crypt_thread(): Try non-blocking IOPS allocation first with nowait=true.
Only if IOPS not immediately available:
- Save space ID and release space reference
- Wait for IOPS with nowait=false
- Reacquire space using fil_space_get_by_id() and space->acquire()
- If space dropped or stopping, release IOPS and skip

This ensures encryption threads never hold space references while waiting
for IOPS, allowing DROP TABLE operations to proceed without deadlock.

3. Introduce separate condition variable fil_crypt_iops_cond specifically for
IOPS allocation synchronization to prevent spurious wakeups:

- fil_crypt_threads_cond: Used in wait_for_work() for waiting when no
tablespaces need encryption. Signaled when settings change, new
tablespaces are created, or thread count changes.

- fil_crypt_iops_cond: Used in fil_crypt_alloc_iops() for waiting
when IOPS limit is reached. Signaled when IOPS are returned via
fil_crypt_return_iops(), released via fil_crypt_realloc_iops(), or
when srv_n_fil_crypt_iops is increased.

4. Added atomic version counter fil_crypt_settings_version that is incremented
whenever innodb_encrypt_tables or innodb_encryption_rotate_key_age changes.
Encryption threads capture the version at iteration start and check for
changes during iteration. If config changed, threads immediately restart
iteration from the beginning to ensure complete coverage with new settings.

New fields:
- fil_crypt_settings_version: Atomic counter to track configuration changes
- rotate_thread_t::timed_wait_count (uint8_t): Counts consecutive timeouts
  for exponential backoff
- rotate_thread_t::wait_for_work(): Implements timed/indefinite wait strategy
- rotate_thread_t::settings_version: Compares with fil_crypt_settings_version
  to restart encryption from the beginning

5. Fix three-way deadlock with trylock mechanism:
fseg_page_is_allocated(): Added optional caller_mtr parameter. Encryption
thread passes an mtr, allocation bitmap page latch is now correctly held
in that mtr until the caller commits it.

fil_crypt_get_page_throttle(): Use fil_space_t::x_lock_try() to acquire
the tablespace allocation latch non-blockingly. If the trylock fails,
back off and retry the same page to avoid deadlock. The bitmap page
S-latch is added to the mtr and held throughout the page read and
encryption operations. Release the tablespace exclusive latch immediately
after the allocation check to minimize contention, while the bitmap page
S-latch remains held in the mtr.

fil_crypt_rotate_page(): After 10 retries on page latch acquisition,
release IOPS, sleep 10ms, then try to reacquire IOPS with nowait=true.
If IOPS not immediately available, exit gracefully by setting first=true.
This periodic IOPS release allows other encryption threads working on
tablespaces being dropped to make progress and release their space
references, helping to break the deadlock scenario.

rotate_state.aborted: set under crypt_data->mutex when a
thread bails in fil_crypt_rotate_pages(). Gate should_flush in
fil_crypt_complete_rotate_space() on !aborted so a partial pass
cannot commit min_key_version. Cleared when the next pass
initializes the fresh pass restarts.
@Thirunarayanan Thirunarayanan merged commit dfdc4b8 into 10.11 Jun 8, 2026
15 of 17 checks passed
@Thirunarayanan Thirunarayanan deleted the MDEV-34358 branch June 8, 2026 09:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

6 participants