Skip to content

Remove premature ASSERT on gray failure status check in ClogRemoteTLog#13301

Open
saintstack wants to merge 1 commit into
apple:mainfrom
saintstack:clogremotetlog
Open

Remove premature ASSERT on gray failure status check in ClogRemoteTLog#13301
saintstack wants to merge 1 commit into
apple:mainfrom
saintstack:clogremotetlog

Conversation

@saintstack
Copy link
Copy Markdown
Contributor

The gray failure status check asserted that the clogged remote TLog must appear in the status JSON immediately upon being excluded from dbInfo. However, gray failure detection has inherent latency: worker health monitors must detect the degraded peer, report to the cluster controller, and the CC must populate the status response. With certain seeds the test loop detects the TLog exclusion from dbInfo before the gray failure pipeline has propagated to the status JSON, causing a spurious assertion failure.

Remove the ASSERT and let the test loop retry the status check on subsequent iterations. The core test validation (state path matching) is unaffected since the state transitions to CLOGGED_REMOTE_TLOG_EXCLUDED regardless of the status check result.

Fixes: ClogRemoteTLog.toml simulation failure with seed 428562693 and gcc

The gray failure status check asserted that the clogged remote TLog
must appear in the status JSON immediately upon being excluded from
dbInfo. However, gray failure detection has inherent latency: worker
health monitors must detect the degraded peer, report to the cluster
controller, and the CC must populate the status response. With certain
seeds the test loop detects the TLog exclusion from dbInfo before the
gray failure pipeline has propagated to the status JSON, causing a
spurious assertion failure.

Remove the ASSERT and let the test loop retry the status check on
subsequent iterations. The core test validation (state path matching)
is unaffected since the state transitions to CLOGGED_REMOTE_TLOG_EXCLUDED
regardless of the status check result.

Fixes: ClogRemoteTLog.toml simulation failure with seed 428562693 and
gcc
@saintstack saintstack requested review from Copilot and spraza May 29, 2026 03:55
@saintstack saintstack added the nightlies Issues to address failures in the nighty runs. label May 29, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts the ClogRemoteTLog simulation workload to avoid a flaky failure caused by asserting on gray-failure status JSON too early after a remote TLog is excluded from dbInfo, acknowledging the inherent propagation latency in the gray-failure reporting pipeline.

Changes:

  • Removed an immediate ASSERT(statusCheckPassed) after calling grayFailureStatusCheck().
  • Allows the workload loop to continue and retry the gray-failure status check on later iterations.

Comment on lines 402 to 405
localState = TestState::CLOGGED_REMOTE_TLOG_EXCLUDED;
if (!statusCheckPassed) {
statusCheckPassed = co_await grayFailureStatusCheck(db, self->cloggedRemoteTLog.get());
ASSERT(statusCheckPassed);
}
@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 60ae865
  • Duration 0:22:02
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 60ae865
  • Duration 0:33:46
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 60ae865
  • Duration 0:43:53
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 60ae865
  • Duration 0:48:39
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 60ae865
  • Duration 0:51:58
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 60ae865
  • Duration 0:54:33
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Copy Markdown
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 60ae865
  • Duration 1:04:52
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

Copy link
Copy Markdown
Collaborator

@spraza spraza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gray failure status check asserted that the clogged remote TLog must appear in the status JSON immediately upon being excluded from dbInfo. However, gray failure detection has inherent latency: worker health monitors must detect the degraded peer, report to the cluster controller, and the CC must populate the status response.

I need to recollect my memory around this (it's been a while). But my understanding is that there's a causal relationship here that this statement is missing.

The causal order of events is:

  1. CC excludes a worker W based on complaints
  2. CC updates its in-memory state (this is what status actor uses)
  3. Recovery is triggered (because W was in txn subsystem)
  4. We don't recruit W during recovery
  5. Post recovery, the new serverdbinfo does not have W

So "being excluded from dbinfo" i.e. (5) implies that (2) has happened, which implies that status json should have gray_failure field and W in it.

Furthermore, the test assert is under this condition:

remoteTLogNotInDbInfo(self->cloggedRemoteTLog.get(), self->dbInfo->get()))

Continuing the events from above:
6. CC broadcasts new serverdbinfo, that eventually reaches the tester client (running the workload)

So the if remote tlog (W) was not in the new serverdbinfo, that means CC should have it in excluded set, and therefore status json should return it.

Perhaps there is another edge case because of which W is not in status json but I don't understand the serverdbinfo reasoning. For example, if sequencer reports W as unhealthy before gray failure in CC could make the decision, recovery would still be triggered, just not via gray failure. In that case, the test condition would be true (W is not in new serverdbinfo), but since this was not gray failure triggered recovery, status json won't show W. Maybe something along those lines is happening.

@saintstack
Copy link
Copy Markdown
Contributor Author

Thanks for helpful review @spraza ... digesting....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

nightlies Issues to address failures in the nighty runs.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants