Skip to content

Add timeout to QuietDatabase recovery wait loop#13300

Open
saintstack wants to merge 1 commit into
apple:mainfrom
saintstack:QuietDatabaseRetry
Open

Add timeout to QuietDatabase recovery wait loop#13300
saintstack wants to merge 1 commit into
apple:mainfrom
saintstack:QuietDatabaseRetry

Conversation

@saintstack
Copy link
Copy Markdown
Contributor

@saintstack saintstack commented May 29, 2026

The waitForQuietDatabase recovery loop waits indefinitely for dbInfo->onChange() to signal FULLY_RECOVERED state. In simulation tests with RandomClogging and Attrition workloads, the tester process can lose its connection to the cluster controller after the test workload completes. When this happens, the AsyncVar never receives another update, causing dbInfo->onChange() to block forever. This leads to the simulation running for 22,000+ seconds producing over 1M trace lines until TracedTooManyLines kills it.

Fix by polling with a 5-second delay and breaking out after 300 seconds. The subsequent quiet database checks will handle retries if recovery has not actually completed.

Fixes: Sideband.toml simulation failure with seed 2881621233 and gcc

20260529-054545-stack_QuietDatabaseRetry-7b72a8e39b781a9c compressed=True data_size=41247877 duration=2329457 ended=100000 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:46:29 sanity=False started=100000 stopped=20260529-063214 submitted=20260529-054545 timeout=5400 username=stack_QuietDatabaseRetry

The waitForQuietDatabase recovery loop waits indefinitely for
dbInfo->onChange() to signal FULLY_RECOVERED state. In simulation tests
with RandomClogging and Attrition workloads, the tester process can lose
its connection to the cluster controller after the test workload
completes. When this happens, the AsyncVar<ServerDBInfo> never receives
another update, causing dbInfo->onChange() to block forever. This leads
to the simulation running for 22,000+ seconds producing over 1M trace
lines until TracedTooManyLines kills it.

Fix by polling with a 5-second delay and breaking out after 300 seconds.
The subsequent quiet database checks will handle retries if recovery
has not actually completed.

Fixes: Sideband.toml simulation failure with seed 2881621233
@saintstack saintstack requested a review from Copilot May 29, 2026 03:24
@saintstack saintstack added the nightlies Issues to address failures in the nighty runs. label May 29, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR prevents waitForQuietDatabase() from potentially blocking forever while waiting for ServerDBInfo to reach RecoveryState::FULLY_RECOVERED, which can occur in simulation when the tester loses connectivity to the cluster controller and dbInfo->onChange() never fires again. The change makes the recovery wait loop periodically wake up and imposes a hard timeout, allowing the simulation to progress to later quiet-database checks rather than running indefinitely.

Changes:

  • Add a 300-second timeout to the “wait for FULLY_RECOVERED” loop in waitForQuietDatabase().
  • Replace the indefinite co_await dbInfo->onChange() wait with a dbInfo->onChange() || delay(5.0) polling pattern.
  • Emit a warning TraceEvent when the recovery wait times out, including the current phase and recovery state.

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: c67adc6
  • Duration 0:22:15
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: c67adc6
  • Duration 0:33:59
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@saintstack saintstack requested a review from spraza May 29, 2026 03:58
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: c67adc6
  • Duration 0:44:32
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: c67adc6
  • Duration 0:45:27
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: c67adc6
  • Duration 0:48:03
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: c67adc6
  • Duration 0:51:29
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: c67adc6
  • Duration 1:07:09
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

Copy link
Copy Markdown
Contributor

@Ronitsabhaya75 Ronitsabhaya75 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me

Comment on lines +1001 to +1002
// Use a timeout because the tester can lose its connection to the cluster controller
// (e.g. due to RandomClogging), causing dbInfo to never update again.
Copy link
Copy Markdown
Collaborator

@spraza spraza May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought at some point later in simulation we go in an accelerated mode while disabling faults. The idea was to revert the chaos and give breathing room for final checks.

At this point in code, have we not enabled that mode?

TraceEvent(SevWarn, "QuietDatabaseRecoveryTimeout")
.detail("Phase", phase)
.detail("RecoveryState", (int)dbInfo->get().recoveryState);
break;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we break, we do follow-up checks while the db is not fully recovered, right? If so, can't it cause more issues?

@spraza
Copy link
Copy Markdown
Collaborator

spraza commented May 29, 2026

The waitForQuietDatabase recovery loop waits indefinitely for dbInfo->onChange() to signal FULLY_RECOVERED state. In simulation tests with RandomClogging and Attrition workloads, the tester process can lose its connection to the cluster controller after the test workload completes.

So sounds like in the failing case, the db was fully recovered, just that tester didn't get the new serverdbinfo. Did we confirm that the db was fully recovered? Because if not, that's another issue and this change will mask that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

nightlies Issues to address failures in the nighty runs.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants