Skip to content

vttablet: wait on _vt.dt_state in WaitForPreparedTwoPCTransactions#20060

Open
johnlcox wants to merge 4 commits into
vitessio:mainfrom
johnlcox:fix/issue-20002-2pc-coordinator-drain
Open

vttablet: wait on _vt.dt_state in WaitForPreparedTwoPCTransactions#20060
johnlcox wants to merge 4 commits into
vitessio:mainfrom
johnlcox:fix/issue-20002-2pc-coordinator-drain

Conversation

@johnlcox
Copy link
Copy Markdown

@johnlcox johnlcox commented May 7, 2026

Description

This fixes a bug where a Reshard SwitchTraffic can permanently orphan participant redo logs.

It fixes it by gating CreateTransaction on IsTwoPCAllowed() (matching the existing Prepare gate), and also by waiting in WaitForPreparedTwoPCTransactions for _vt.dt_state to drain to empty. I'm also looking into the defense in depth changes mentioned in the bug report as a followup.

Backport request to release-23.0 and release-24.0. We are running 23.0.3 and this bug was found in that release.

Related Issue(s)

Fixes #20002

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

  • Reshard.SwitchTraffic (and any caller of tmState.prepareForDisableQueryService)
    now blocks until _vt.dt_state is empty in addition to the prepared pool.
    Cutovers without in-flight 2PC are unaffected; cutovers with in-flight
    coordinator-role 2PC will wait for TxResolver to drive resolution. The
    existing context deadline still bounds the wait, so a stuck cutover fails with
    the same Prepared transactions have not been resolved yet error.

  • DTExecutor.CreateTransaction now returns VT10002: atomic distributed transaction not allowed: 2pc is enabled, but not currently allowed when
    IsTwoPCAllowed() is false (mirrors the existing Prepare behavior).

  • Drive-by: Prepare's gate error message changed from "two-pc is enabled, but semi-sync is not" to "2pc is enabled, but not currently allowed". The old
    message was misleading because IsTwoPCAllowed() returns false for either
    semi-sync or tablet-controls reasons.

AI Disclosure

The main code was written primarily by me with Claude Code providing guidance to identify existing patterns and Claude Code writing the tests with some touch-ups by me. Claude Code wrote the deployment notes on this PR description.

Signed-off-by: John Leacox <jleacox@backblaze.com>
Copilot AI review requested due to automatic review settings May 7, 2026 18:41
@github-actions github-actions Bot added this to the v25.0.0 milestone May 7, 2026
@vitess-bot vitess-bot Bot added NeedsWebsiteDocsUpdate What it says NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels May 7, 2026
@vitess-bot
Copy link
Copy Markdown
Contributor

vitess-bot Bot commented May 7, 2026

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a Reshard SwitchTraffic cutover safety gap with 2PC by preventing new coordinator-role 2PC metadata from being created when 2PC is disallowed, and by making the cutover drain wait account for unresolved coordinator state in _vt.dt_state (not just the prepared pool).

Changes:

  • Gate DTExecutor.CreateTransaction on IsTwoPCAllowed() (matching existing Prepare behavior).
  • Extend TabletServer.WaitForPreparedTwoPCTransactions to wait until both the prepared pool is empty and _vt.dt_state reports no unresolved transactions.
  • Add unit tests covering the coordinator-only _vt.dt_state drain behavior and the new CreateTransaction gate.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
go/vt/vttablet/tabletserver/tabletserver.go Adds dt_state-based unresolved 2PC detection and uses it in the cutover wait loop.
go/vt/vttablet/tabletserver/tabletserver_test.go Adds a regression test asserting the wait does not succeed while _vt.dt_state is non-empty.
go/vt/vttablet/tabletserver/dt_executor.go Gates CreateTransaction on IsTwoPCAllowed() and updates the not-allowed error message.
go/vt/vttablet/tabletserver/dt_executor_test.go Adds test coverage for Prepare/CreateTransaction when 2PC is disallowed.
go/vt/vttablet/tabletserver/controller.go Updates the interface comment to reflect the broader “unresolved 2PC” semantics.

Comment on lines +776 to +784
func (tsv *TabletServer) hasUnresolvedTwoPCTransactions(ctx context.Context) bool {
if !tsv.te.preparedPool.IsEmpty() {
return true
}
count, err := tsv.te.twoPC.CountUnresolvedTransaction(ctx, time.Now())
if err != nil {
log.Error(fmt.Sprintf("Error reading unresolved transactions: %v", err))
return true
}
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an early return if twoPC is not enabled.

Comment on lines +780 to +785
count, err := tsv.te.twoPC.CountUnresolvedTransaction(ctx, time.Now())
if err != nil {
log.Error(fmt.Sprintf("Error reading unresolved transactions: %v", err))
return true
}
return count > 0
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used a timestamp a year in the future.

Comment on lines +780 to +784
count, err := tsv.te.twoPC.CountUnresolvedTransaction(ctx, time.Now())
if err != nil {
log.Error(fmt.Sprintf("Error reading unresolved transactions: %v", err))
return true
}
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the log from the boolean method and returned the error instead. The wait method now tracks the last error and logs it only when the waiting has timed out.

}
if !dte.te.IsTwoPCAllowed() {
return vterrors.VT10002("two-pc is enabled, but semi-sync is not")
return vterrors.VT10002("2pc is enabled, but not currently allowed")
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the message in the e2e test since it's a misleading message. If it's better to keep the misleading message for backwards compatibility I can flip to that direction instead.

Signed-off-by: John Leacox <jleacox@backblaze.com>
@johnlcox johnlcox requested a review from frouioui as a code owner May 7, 2026 19:38
}
count, err := tsv.te.twoPC.CountUnresolvedTransaction(ctx, time.Now().Add(365*24*time.Hour))
if err != nil {
return true, err
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be return false, err, no?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to false in this case. Also had Claude add 3 tests.

func (tsv *TabletServer) WaitForPreparedTwoPCTransactions(ctx context.Context) error {
if tsv.te.preparedPool.IsEmpty() {
hasUnresolved, lastErr := tsv.hasUnresolvedTwoPCTransactions(ctx)
if !hasUnresolved {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we change the return value as I requested above, we should check for the error first here.

Copy link
Copy Markdown
Author

@johnlcox johnlcox May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an error check.

Signed-off-by: John Leacox <jleacox@backblaze.com>
Copilot AI review requested due to automatic review settings May 8, 2026 16:59
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated no new comments.

Signed-off-by: John Leacox <jleacox@backblaze.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Component: VTTablet NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug Report: Reshard SwitchTraffic only drains the prepared pool, ignores _vt.dt_state, can permanently orphan participant redo logs

3 participants