Keep prepared transactions rollbackable after commit write failure#14778
Keep prepared transactions rollbackable after commit write failure#14778xingbowang wants to merge 1 commit into
Conversation
✅ clang-tidy: No findings on changed linesCompleted in 311.2s. |
Stress tests with WritePrepared transactions and retryable write fault injection can hit an injected write error while writing the commit marker for a prepared transaction. In that case the transaction has already been added to PreparedHeap, but PessimisticTransaction::Commit() left the transaction state as AWAITING_COMMIT after CommitInternal() returned the error. A later Rollback() would then reject the transaction as not rollbackable, leaving the prepared entry unresolved until WritePreparedTxnDB teardown asserted that PreparedHeap was empty. Restore the transaction state to PREPARED when a prepared commit write fails so callers can roll the transaction back. Do the same when rollback of a prepared transaction fails, since rollback itself can also hit retryable injected write errors and must remain retryable after DB::Resume() recovers the background error. Update db_stress to clean up prepared transactions whose commit returned non-OK. It first attempts Rollback(), then for retryable injected rollback errors it repeatedly calls DB::Resume() and retries rollback with a fixed 10ms sleep, bounded to 100 attempts. If cleanup still fails, log the transaction name, commit status, rollback status, retry count, Resume() busy count, and last Resume() status so the stress failure points at the unresolved prepared transaction instead of only failing later during close or reopen. Add a WritePrepared regression test that injects a retryable write error on Commit(), verifies rollback also remains retryable across an injected rollback write error, then retries rollback and reopens the DB. This covers the PreparedHeap teardown failure mode from T272425212. Test Plan: make -j128 db_stress write_prepared_transaction_test timeout 60s ./write_prepared_transaction_test --gtest_filter='*RollbackPreparedAfterCommitWriteFailure*' --gtest_repeat=5 python3 "/home/xbw/workspace/gtest-parallel/gtest_parallel.py" -w 32 --timeout=60 ./write_prepared_transaction_test make check-sources git diff --check -- db_stress_tool/db_stress_test_base.cc utilities/transactions/pessimistic_transaction.cc utilities/transactions/write_prepared_transaction_test.cc
9efb7bd to
ac31737
Compare
|
@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D106202437. |
🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit ac31737 ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit ac31737 SummaryWell-designed fix that correctly restores transaction state to PREPARED after failed commit/rollback, enabling retry. The core state machine changes are sound. One pre-existing issue in WriteUnprepared becomes more relevant with this change. High-severity findings (0): No high-severity findings. Full review (click to expand)Findings🟡 MEDIUMM1. Pre-existing
|
| Context | Does code execute? | Assumptions hold? | Action needed? |
|---|---|---|---|
| WritePreparedTxnDB | YES | YES — CommitInternal single-write path; no partial commit concern | Safe |
| WriteUnpreparedTxnDB | YES | MOSTLY — CommitInternal single-write; RollbackInternal two-write path has pre-existing unprep_seqs_ issue (M1) | Follow-up for M1 |
| WriteCommittedTxnDB | YES | YES — CommitInternal/RollbackInternal are simple single writes | Safe |
| ReadOnly DB | NO | N/A | N/A |
| Concurrent observers (GetState) | Possible | YES — AWAITING_COMMIT/AWAITING_ROLLBACK were already transient; no code branches on them | Safe |
| Transaction expiration | NO — expiration_time_ cleared during Prepare | N/A | N/A |
Assumption stress-test results:
The claim "CommitInternal two-write path could partially commit" was investigated and found to be unreachable for valid inputs. In both WritePrepared and WriteUnprepared CommitInternal, the two-write path requires includes_data=true, which requires a non-empty commit-time batch without for_recovery=true. But if the batch is non-empty and for_recovery is false, the code returns InvalidArgument before reaching the write path. Therefore do_one_write is always true for valid CommitInternal calls, and partial commit is not a concern.
For RollbackInternal, the two-write path IS reachable (controlled only by two_write_queues option). However, if the first write fails, RollbackInternal returns early before any cleanup, and retry is safe. The M1 finding applies only when the first write succeeds and the second fails.
Positive Observations
- The core fix is minimal and well-targeted: two
txn_state_.store(PREPARED)calls in error paths. - The db_stress retry logic includes good diagnostics (txn name, status strings, retry counts).
- The new test covers the full cycle: commit failure -> resume -> rollback failure -> resume -> rollback success -> reopen.
- The fix correctly does NOT call
Clear()on commit failure, preserving locks and write batch for subsequent rollback. - The serialization reviewer confirmed WAL consistency is maintained in all failure scenarios.
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
Summary
PREPAREDstate when writing the commit marker fails, so callers can still roll them back.PREPAREDstate when rollback of a prepared transaction hits a retryable write error, allowing rollback to be retried afterDB::Resume().db_stressto clean up prepared transactions after failed commits and report detailed rollback cleanup failure diagnostics.Test Plan
CI