VTOrc: fix `PrimaryIsReadOnly` recovery deadlock against `PrimarySemiSyncBlocked` #20015
Merged
timvaillancourt merged 8 commits on May 12, 2026
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
The pre-scan in `GetDetectionAnalysis` was added to address a Codex review concern that `BeforeAnalyses` on `PrimaryIsReadOnly` could leave `ca.hasShardWideAction` unset when both `PrimaryIsReadOnly` and a shard-wide problem matched the same primary, allowing replicas' non-dependent analyses to bypass cross-tablet suppression.

On reflection the framing was wrong. The shard-wide suppression model exists to defer tablet recoveries during a disruptive shard-wide reparent (ERS). When `BeforeAnalyses` promotes `PrimaryIsReadOnly` ahead of the shard-wide problem, the chosen recovery is `fixPrimary` (a local operation, not ERS), so the suppression's underlying purpose no longer applies. Replicas' tablet-level recoveries can run safely alongside `fixPrimary`.

Reverts the `analysis_dao.go` change and removes the `TestSameTabletShardWidePreservesSuppression` unit test added to document it.

Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Title changed from "`PrimaryIsReadOnly` and `PrimaryDiskStalled` recovery deadlocks" to "`PrimaryIsReadOnly` recovery deadlock against `PrimarySemiSyncBlocked`"
mattlord (Member) approved these changes on May 12, 2026
LGTM. Thanks, @timvaillancourt! ❤️
shlomi-noach approved these changes on May 12, 2026
timvaillancourt added a commit to timvaillancourt/vitess that referenced this pull request on May 12, 2026:
…SyncBlocked` (vitessio#20015)
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
timvaillancourt added a commit that referenced this pull request on May 14, 2026:
…st `PrimarySemiSyncBlocked` (#20015) (#20082)
Signed-off-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
Co-authored-by: Tim Vaillancourt <tim@timvaillancourt.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Description
Extends #19925 to close one more variant of the same `recheckPrimaryHealth` deadlock pattern. #19925 fixed `ReplicationStopped` × `PrimarySemiSyncBlocked`; this PR adds:

- `PrimaryIsReadOnly` × `PrimarySemiSyncBlocked`: issue #20011's primary report. A read-only primary on a semi-sync-blocked shard never recovers because `recheckPrimaryHealth` aborts `fixPrimary` every cycle.

What changed
- `PrimaryIsReadOnly` declares `BeforeAnalyses: [PrimarySemiSyncBlocked]`, mirroring the pattern from #19925 ("VTOrc: fix `ReplicationStopped` + `PrimarySemiSyncBlocked` recovery deadlock"). This routes `fixPrimary` through `GetDetectionAnalysis`'s suppression bypass so `recheckPrimaryHealth` no longer aborts it when the shard also has `PrimarySemiSyncBlocked`. Once `fixPrimary` clears the read-only state, `PrimarySemiSyncBlocked` is handled by ERS on the next cycle if it still persists.

That's the whole production change: one `BeforeAnalyses` declaration. The existing `GetDetectionAnalysis` machinery handles it unchanged.

Why not `PrimaryDiskStalled` too?
An earlier version of this PR also widened `BeforeAnalyses` to include `PrimaryDiskStalled` (for both `PrimaryIsReadOnly` and `ReplicationStopped`). I dropped that after review feedback because:

- `PrimaryDiskStalled` only matches when `!LastCheckValid && IsDiskStalled`, i.e. the primary is unreachable. In that regime, `fixPrimary`'s `SetReadWrite` and `fixReplica`'s `CHANGE REPLICATION SOURCE` either fail outright (primary unreachable) or can't fix the underlying disk problem. ERS is the correct action.
- `GetDetectionAnalysis` picks one problem per tablet. If the primary matches both `PrimaryIsReadOnly` and `PrimaryDiskStalled`, declaring `PrimaryIsReadOnly` Before `PrimaryDiskStalled` causes `PrimaryIsReadOnly` to be selected and `PrimaryDiskStalled` to be masked entirely on that tablet, so `recoverDeadPrimaryFunc` is never dispatched. If `fixPrimary` then fails to clear read-only (plausible on an unreachable primary), ERS is never scheduled and the shard stays wedged.
- `LastCheckValid` flips true and `PrimaryDiskStalled` stops matching, so the cached read-only state would surface as plain `PrimaryIsReadOnly`, with no pairing to deadlock on.

Net: for any pair involving `PrimaryDiskStalled`, we want ERS to dispatch directly, which is exactly the pre-#20015 behaviour.

Testing
Unit tests:

- `TestDeclaresBefore`/`TestDeclaresAfter` in `analysis_dao_test.go` pin down both directions: `PrimaryIsReadOnly` declares Before `PrimarySemiSyncBlocked` (positive), and does NOT declare Before `PrimaryDiskStalled` (negative; this guards against accidentally re-introducing the masking issue above).
- `TestRecheckPrimaryHealth` in `topology_recovery_test.go` exercises the `recheckPrimaryHealth` path that was aborting `fixPrimary` for the new pairing.

E2E:

- `TestRecoveryDeadlocks` in `go/test/endtoend/vtorc/general/vtorc_test.go` exercises the pairing on a real cluster. The test stops the acker's IO thread, hangs a write on the semi-sync wait, flips `super_read_only=ON` on the primary, and asserts that `SuccessfulRecoveries[FixPrimary]` increments (pre-fix this counter never incremented because `recheckPrimaryHealth` aborted the recovery mid-flight) and that `RecoverDeadPrimary` does not. The test inherits the same timing caveat as #19925's `TestReplicationStoppedWithSemiSyncBlocked`: `PrimarySemiSyncBlocked` is hard to assert deterministically because VTOrc usually fixes the replica faster than we can sustain a blocked write, but the `FixPrimary` counter increment is sufficient to distinguish pre-fix from post-fix behaviour when the dual condition does coincide during an analysis cycle.

Why backport
Same range as #19925: the bug was introduced by #18234 (`recheckPrimaryHealth`), first released in `v0.23.0`. Should be backported to at least `release-23.0` and `release-24.0`.

Related Issue(s)
Resolves: #20011
Related: #19925
Related: #19941
Checklist
Deployment Notes
No new flags or configuration. The fix changes VTOrc's internal recovery ordering behaviour: when a primary is both `PrimaryIsReadOnly` and `PrimarySemiSyncBlocked`, `fixPrimary` now runs first instead of being indefinitely blocked by `recheckPrimaryHealth`. ERS still runs on the next cycle if the shard-wide problem persists.

AI Disclosure
Claude Code assisted with implementation and testing; I committed the change manually after reviewing each step. Claude prepared this PR summary.
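
Illustrative sketch
To make the selection behaviour from "What changed" and "Why not `PrimaryDiskStalled` too?" concrete, here is a minimal, self-contained Go sketch. It is not VTOrc's implementation: the `problem` type, the `declaresBefore` and `selectProblem` helpers, and the assumption that matched problems arrive sorted by default priority are all hypothetical. Only the analysis names and the idea of a `BeforeAnalyses` declaration come from this PR.

```go
package main

import "fmt"

// problem is a hypothetical stand-in for one detection analysis on a tablet.
type problem struct {
	name           string
	beforeAnalyses []string // analyses this problem asks to be handled ahead of
}

// declaresBefore reports whether p lists other in its beforeAnalyses.
func declaresBefore(p problem, other string) bool {
	for _, n := range p.beforeAnalyses {
		if n == other {
			return true
		}
	}
	return false
}

// selectProblem picks exactly one problem for a tablet. matched is assumed to
// be sorted by default priority, highest first (an assumption of this sketch).
// A later problem replaces the current pick only if it declares itself before it.
func selectProblem(matched []problem) (problem, bool) {
	if len(matched) == 0 {
		return problem{}, false
	}
	chosen := matched[0]
	for _, p := range matched[1:] {
		if declaresBefore(p, chosen.name) {
			chosen = p
		}
	}
	return chosen, true
}

func main() {
	semiSync := problem{name: "PrimarySemiSyncBlocked"}
	diskStalled := problem{name: "PrimaryDiskStalled"}

	// The declaration this PR adds.
	readOnly := problem{
		name:           "PrimaryIsReadOnly",
		beforeAnalyses: []string{"PrimarySemiSyncBlocked"},
	}
	// The widened declaration that was dropped after review.
	widened := problem{
		name:           "PrimaryIsReadOnly",
		beforeAnalyses: []string{"PrimarySemiSyncBlocked", "PrimaryDiskStalled"},
	}

	a, _ := selectProblem([]problem{semiSync, readOnly})
	b, _ := selectProblem([]problem{diskStalled, readOnly})
	c, _ := selectProblem([]problem{diskStalled, widened})

	fmt.Println(a.name) // local fix is promoted ahead of the shard-wide problem
	fmt.Println(b.name) // disk stall still wins, so ERS can be dispatched
	fmt.Println(c.name) // widened declaration masks the disk stall
}
```

In this sketch the third case is the hazard described above: because only one problem is selected per tablet, promoting `PrimaryIsReadOnly` over `PrimaryDiskStalled` hides the disk stall entirely.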