vttablet: wait on _vt.dt_state in WaitForPreparedTwoPCTransactions#20060
vttablet: wait on _vt.dt_state in WaitForPreparedTwoPCTransactions#20060johnlcox wants to merge 4 commits into
Conversation
Signed-off-by: John Leacox <jleacox@backblaze.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
There was a problem hiding this comment.
Pull request overview
This PR addresses a Reshard SwitchTraffic cutover safety gap with 2PC by preventing new coordinator-role 2PC metadata from being created when 2PC is disallowed, and by making the cutover drain wait account for unresolved coordinator state in _vt.dt_state (not just the prepared pool).
Changes:
- Gate
DTExecutor.CreateTransactiononIsTwoPCAllowed()(matching existingPreparebehavior). - Extend
TabletServer.WaitForPreparedTwoPCTransactionsto wait until both the prepared pool is empty and_vt.dt_statereports no unresolved transactions. - Add unit tests covering the coordinator-only
_vt.dt_statedrain behavior and the newCreateTransactiongate.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| go/vt/vttablet/tabletserver/tabletserver.go | Adds dt_state-based unresolved 2PC detection and uses it in the cutover wait loop. |
| go/vt/vttablet/tabletserver/tabletserver_test.go | Adds a regression test asserting the wait does not succeed while _vt.dt_state is non-empty. |
| go/vt/vttablet/tabletserver/dt_executor.go | Gates CreateTransaction on IsTwoPCAllowed() and updates the not-allowed error message. |
| go/vt/vttablet/tabletserver/dt_executor_test.go | Adds test coverage for Prepare/CreateTransaction when 2PC is disallowed. |
| go/vt/vttablet/tabletserver/controller.go | Updates the interface comment to reflect the broader “unresolved 2PC” semantics. |
| func (tsv *TabletServer) hasUnresolvedTwoPCTransactions(ctx context.Context) bool { | ||
| if !tsv.te.preparedPool.IsEmpty() { | ||
| return true | ||
| } | ||
| count, err := tsv.te.twoPC.CountUnresolvedTransaction(ctx, time.Now()) | ||
| if err != nil { | ||
| log.Error(fmt.Sprintf("Error reading unresolved transactions: %v", err)) | ||
| return true | ||
| } |
There was a problem hiding this comment.
Added an early return if twoPC is not enabled.
| count, err := tsv.te.twoPC.CountUnresolvedTransaction(ctx, time.Now()) | ||
| if err != nil { | ||
| log.Error(fmt.Sprintf("Error reading unresolved transactions: %v", err)) | ||
| return true | ||
| } | ||
| return count > 0 |
There was a problem hiding this comment.
Used a timestamp a year in the future.
| count, err := tsv.te.twoPC.CountUnresolvedTransaction(ctx, time.Now()) | ||
| if err != nil { | ||
| log.Error(fmt.Sprintf("Error reading unresolved transactions: %v", err)) | ||
| return true | ||
| } |
There was a problem hiding this comment.
Removed the log from the boolean method and returned the error instead. The wait method now tracks the last error and logs it only when the waiting has timed out.
| } | ||
| if !dte.te.IsTwoPCAllowed() { | ||
| return vterrors.VT10002("two-pc is enabled, but semi-sync is not") | ||
| return vterrors.VT10002("2pc is enabled, but not currently allowed") |
There was a problem hiding this comment.
Updated the message in the e2e test since it's a misleading message. If it's better to keep the misleading message for backwards compatibility I can flip to that direction instead.
Signed-off-by: John Leacox <jleacox@backblaze.com>
| } | ||
| count, err := tsv.te.twoPC.CountUnresolvedTransaction(ctx, time.Now().Add(365*24*time.Hour)) | ||
| if err != nil { | ||
| return true, err |
There was a problem hiding this comment.
I think this should be return false, err, no?
There was a problem hiding this comment.
Changed to false in this case. Also had Claude add 3 tests.
| func (tsv *TabletServer) WaitForPreparedTwoPCTransactions(ctx context.Context) error { | ||
| if tsv.te.preparedPool.IsEmpty() { | ||
| hasUnresolved, lastErr := tsv.hasUnresolvedTwoPCTransactions(ctx) | ||
| if !hasUnresolved { |
There was a problem hiding this comment.
If we change the return value as I requested above, we should check for the error first here.
Signed-off-by: John Leacox <jleacox@backblaze.com>
Signed-off-by: John Leacox <jleacox@backblaze.com>
Description
This fixes a bug where a Reshard SwitchTraffic can permanently orphan participant redo logs.
It fixes it by gating
CreateTransactionon IsTwoPCAllowed() (matching the existing Prepare gate), and also by waiting inWaitForPreparedTwoPCTransactionsfor_vt.dt_stateto drain to empty. I'm also looking into the defense in depth changes mentioned in the bug report as a followup.Backport request to release-23.0 and release-24.0. We are running 23.0.3 and this bug was found in that release.
Related Issue(s)
Fixes #20002
Checklist
Deployment Notes
Reshard.SwitchTraffic(and any caller oftmState.prepareForDisableQueryService)now blocks until
_vt.dt_stateis empty in addition to the prepared pool.Cutovers without in-flight 2PC are unaffected; cutovers with in-flight
coordinator-role 2PC will wait for
TxResolverto drive resolution. Theexisting context deadline still bounds the wait, so a stuck cutover fails with
the same
Prepared transactions have not been resolved yeterror.DTExecutor.CreateTransactionnow returnsVT10002: atomic distributed transaction not allowed: 2pc is enabled, but not currently allowedwhenIsTwoPCAllowed()is false (mirrors the existingPreparebehavior).Drive-by:
Prepare's gate error message changed from"two-pc is enabled, but semi-sync is not"to"2pc is enabled, but not currently allowed". The oldmessage was misleading because
IsTwoPCAllowed()returns false for eithersemi-sync or tablet-controls reasons.
AI Disclosure
The main code was written primarily by me with Claude Code providing guidance to identify existing patterns and Claude Code writing the tests with some touch-ups by me. Claude Code wrote the deployment notes on this PR description.