Add timeout to QuietDatabase recovery wait loop #13300
The waitForQuietDatabase recovery loop waits indefinitely for dbInfo->onChange() to signal the FULLY_RECOVERED state. In simulation tests with RandomClogging and Attrition workloads, the tester process can lose its connection to the cluster controller after the test workload completes. When this happens, the AsyncVar<ServerDBInfo> never receives another update, so dbInfo->onChange() blocks forever. The simulation then runs for 22,000+ seconds and produces over 1M trace lines until TracedTooManyLines kills it.
Fix by polling with a 5-second delay and breaking out after 300 seconds. The subsequent quiet database checks will handle retries if recovery has not actually completed.
Fixes: Sideband.toml simulation failure with seed 2881621233
Pull request overview
This PR prevents waitForQuietDatabase() from potentially blocking forever while waiting for ServerDBInfo to reach RecoveryState::FULLY_RECOVERED, which can occur in simulation when the tester loses connectivity to the cluster controller and dbInfo->onChange() never fires again. The change makes the recovery wait loop periodically wake up and imposes a hard timeout, allowing the simulation to progress to later quiet-database checks rather than running indefinitely.
Changes:
- Add a 300-second timeout to the "wait for FULLY_RECOVERED" loop in `waitForQuietDatabase()`.
- Replace the indefinite `co_await dbInfo->onChange()` wait with a `dbInfo->onChange() || delay(5.0)` polling pattern.
- Emit a warning `TraceEvent` when the recovery wait times out, including the current phase and recovery state.
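The polling pattern in the list above can be illustrated outside of Flow. What follows is a minimal sketch in standard C++, not the code from this PR: `StateVar` is a hypothetical stand-in for `AsyncVar<ServerDBInfo>`, `waitForRecovery` is a hypothetical helper, and a condition variable plays the role of `onChange() || delay(...)`. The poll interval and timeout are parameters here; the PR uses 5 seconds and 300 seconds.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the recovery state carried by ServerDBInfo.
enum class RecoveryState { Recruiting, FullyRecovered };

// Hypothetical stand-in for AsyncVar<ServerDBInfo>: a value plus a change signal.
struct StateVar {
    std::mutex m;
    std::condition_variable changed;
    RecoveryState state = RecoveryState::Recruiting;

    void set(RecoveryState s) {
        {
            std::lock_guard<std::mutex> guard(m);
            state = s;
        }
        changed.notify_all();
    }
};

// Returns true if FullyRecovered was observed, false if the timeout expired first.
// The caller decides what to do on timeout (the PR logs a warning and breaks out).
bool waitForRecovery(StateVar& v,
                     std::chrono::milliseconds pollInterval,
                     std::chrono::milliseconds timeout) {
    using Clock = std::chrono::steady_clock;
    const auto deadline = Clock::now() + timeout;
    std::unique_lock<std::mutex> lock(v.m);
    while (v.state != RecoveryState::FullyRecovered) {
        if (Clock::now() >= deadline) {
            return false; // give up instead of blocking forever
        }
        // Wake on a change notification or after pollInterval, whichever comes
        // first, so a lost notification cannot stall the loop indefinitely.
        v.changed.wait_for(lock, pollInterval);
    }
    return true;
}
```

With this shape, a variable that never changes makes `waitForRecovery` return `false` once the timeout has elapsed, while a variable already in `FullyRecovered` returns `true` immediately.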
// Use a timeout because the tester can lose its connection to the cluster controller
// (e.g. due to RandomClogging), causing dbInfo to never update again.
I thought that at some point later in simulation we switch to an accelerated mode with faults disabled. The idea was to revert the chaos and give breathing room for the final checks.
At this point in the code, have we not enabled that mode?
TraceEvent(SevWarn, "QuietDatabaseRecoveryTimeout")
    .detail("Phase", phase)
    .detail("RecoveryState", (int)dbInfo->get().recoveryState);
break;
If we break, we run the follow-up checks while the db is not fully recovered, right? If so, can't that cause more issues?
So it sounds like in the failing case the db was fully recovered and the tester just didn't get the new ServerDBInfo. Did we confirm that the db was fully recovered? If not, that's a separate issue, and this change will mask it.
Fixes: Sideband.toml simulation failure with seed 2881621233 and gcc
20260529-054545-stack_QuietDatabaseRetry-7b72a8e39b781a9c compressed=True data_size=41247877 duration=2329457 ended=100000 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:46:29 sanity=False started=100000 stopped=20260529-063214 submitted=20260529-054545 timeout=5400 username=stack_QuietDatabaseRetry