[Internal] DTS: Adds double-commit guardrail to prevent idempotency token bypass#5884
Conversation
…event idempotency token bypass Adds an Interlocked-based execute-once guard to DistributedWriteTransactionCore.CommitTransactionAsync() that throws InvalidOperationException on subsequent calls. Each CommitTransactionAsync() generates a new idempotency token (Guid), so calling it twice on the same instance would bypass server-side duplicate detection and risk a double-commit. The guard ensures only the first call proceeds. Changes: - DistributedWriteTransactionCore: Added 'commitCalled' field with Interlocked.CompareExchange guard - DistributedTransaction: Updated CommitTransactionAsync xmldoc with remarks and exception documentation - DistributedWriteTransactionTests: Added 3 tests covering sequential double-call, post-failure double-call, and concurrent race condition scenarios Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
📝 Changelog reminder (non-blocking)This PR touches shipped source but does not appear to update the Touched (and missing entry for):
How to decideUse the rubric in
This check is non-blocking — merge is not gated on it. The |
kushagraThapar
left a comment
There was a problem hiding this comment.
Thanks @Meghana-Palaparthi — really nice, focused guardrail PR. The Interlocked.CompareExchange(ref this.commitCalled, 1, 0) != 0 pattern is exactly right (returns the original value), the exception message is excellent (states what / why / how-to-recover / when-not-to-blindly-retry — this is template-worthy for the rest of the DTS surface), and the four "permanently-consume" branches (success, error response, network exception, cancellation) are all covered in the test matrix.
Approving — two suggestions inline, neither blocking:
- The "ConcurrentCalls" test doesn't actually race the
CompareExchange(both tasks run sequentially on the same thread before either hitsawait) — would you mind hardening it so it actually exercises concurrent racers? - The new
<remarks>lives on the abstractDistributedTransaction.CommitTransactionAsync, butDistributedReadTransactionCore.CommitTransactionAsyncdoesn't enforce single-use. A caller writing a helper against the abstract base gets one behavior for write and another for read with no compile-time signal — could you please help me understand how you'd like to resolve this asymmetry?
Also wanted to flag a few small positives for the record:
- The "fail-once-and-done" semantics (the gate is NOT reset on failure) is the deliberate conservative choice, well-documented in both the exception message and the
<remarks>. Different fromHttpClient.SendAsync/DbCommand.ExecuteNonQuery(which let you retry after failure), but the right call here because the idempotency-token contract makes "did it succeed?" unknowable from the client side on a timeout/cancellation. - The existing
CommitTransaction_SameBodyAndTokenSentOnEveryRetryAttempttest inDistributedTransactionCommitterTests.csalready establishes the threat model this PR closes (one token reused across SDK-internal retries). Nice tie-in. - No cross-SDK parity gap — Java and Python don't have DTS yet, so this is the canonical place to set the contract.
NaluTripician
left a comment
There was a problem hiding this comment.
Approving. The Interlocked.CompareExchange execute-once guard is correctly implemented — right field type (int, no volatile needed), right architectural layer above the committer's internal retry loop (which reuses one IdempotencyToken across attempts, so the guard doesn't interfere with legitimate retries), correct CAS semantics. InvalidOperationException matches the Cosmos lifecycle-guard convention (ChangeFeedProcessorBuilder, ItemBatchOperation, etc.). The threat model is real and well-explained in the xmldoc: each CommitTransactionAsync calls new DistributedTransactionServerRequest(...) which unconditionally sets IdempotencyToken = Guid.NewGuid(), so a naïve caller retry would bypass server-side duplicate detection. Consuming the transaction even on a failed first attempt is the correct call given the SDK can't distinguish "request never sent" from "response lost." Changelog skip is policy-compliant — the type is internal in the shipping NuGet and absent from all API_*.txt contracts.
A few non-blocking follow-ups worth considering in a later PR:
-
Asymmetric enforcement of the documented contract. The new
<remarks>and<exception cref="InvalidOperationException">tags sit on the abstractDistributedTransaction.CommitTransactionAsyncbase method, so they apply universally — but onlyDistributedWriteTransactionCoreenforces the guard.DistributedReadTransactionCore.CommitTransactionAsynchas nocommitCalledfield and no CAS, so a read transaction called twice will issue two reads with two different idempotency tokens, contradicting the documented contract. Reads are idempotent at the data layer today so there's no immediate data-loss risk, but the asymmetry will silently become a real bug the moment the read path picks up any side effect. Two clean fixes: lift the guard into the abstract base via a template-method split (sealed override+protected abstract …InternalAsync) so the contract is honored universally, or narrow the xmldoc down to the Write override only. -
CommitAsync_ConcurrentCalls_OnlyOneSucceedsisn't actually concurrent. BothCommitTransactionAsynccalls run on the test thread; the first synchronous prefix (the CAS +new DistributedTransactionCommitter(...)) completes before the second call even starts, so the two CAS operations serialize deterministically through theasyncstate machine. The whole purpose ofInterlocked.CompareExchangeover a plainif (commitCalled == 0) { commitCalled = 1; … }is correctness under a real cross-thread race — the current test would pass against a naïve non-atomic implementation just as easily. Wrap each call inTask.Runwith aBarrier(2).SignalAndWait()to actually exercise the CAS. While you're in there, assert which task won and that exactly one succeeded (task.Status == TaskStatus.RanToCompletion && task.Result.IsSuccessStatusCode) vs. faulted with IOE — today the test only checks that some IOE was caught and thatinvocationCount == 1, so a regression where the first call threw and the second faulted with IOE would still pass. -
Operation-builder methods remain mutable after commit.
CreateItem,UpsertItem,ReplaceItem,DeleteItem,PatchItem(and the*Streamvariants) don't checkcommitCalled, so a consumed transaction will happily accept new operations and returnthis. The next commit attempt throws, but the object visibly behaves like a fresh transaction in the meantime. Given the new xmldoc explicitly tells users to "construct a newDistributedWriteTransaction", consider gating the builders too — or at least calling this out in the docs so it isn't a surprise.
Minor nits:
- The exception-message substring assertions (
ex.Message.Contains("already been called"), etc.) are brittle relative to team precedent in #5793, which asserts exception type only. Either drop the substring asserts or extract the message to aprivate const stringin the SUT and reference it from both sides. - Magic
0/1inInterlocked.CompareExchange(ref this.commitCalled, 1, 0)would read more cleanly with named constants (CommitNotCalled = 0,CommitCalled = 1). commitCalledis a past-participle; the file right next door (DistributedTransactionResponse.cs) usesisDisposed, and Cosmos lifecycle flags generally followis*(e.g.,isBuilt,isStarted).isCommitInvokedwould be more consistent.- The new
<exception>xmldoc documents onlyInvalidOperationException, butCommitTransactionAsyncclearly also surfacesOperationCanceledException(one of the new tests asserts this) plus transport/CosmosExceptionfailures. Worth adding for intellisense honesty. - PR description says "Added 3 tests" but the diff adds 5 (cancellation + transient-exception tests are missing from the body summary).
- Fix concurrent test to use real thread-level concurrency (16 Task.Run racers + ManualResetEventSlim) instead of sequential async calls - Add double-commit guard to DistributedReadTransactionCore for API consistency (both subclasses now enforce single-use contract) - Extract CAS constants (CommitNotStarted/CommitStarted) to DistributedTransactionConstants - Rename commitCalled -> isCommitInvoked (follows is* convention) - Replace brittle .Contains() assertions with Assert.AreEqual against internal const CommitAlreadyCalledMessage - Add <exception cref="OperationCanceledException"> xmldoc - Move shared single-use contract docs to abstract base class Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>
7b1f782
kushagraThapar
left a comment
There was a problem hiding this comment.
LGTM, thanks @Meghana-Palaparthi
|
@copilot resolve the merge conflicts in this pull request |
Resolved the merge conflicts in |
Pull Request Template
Description
Adds an Interlocked-based execute-once guard to DistributedWriteTransactionCore.CommitTransactionAsync() that throws InvalidOperationException on subsequent calls. Each CommitTransactionAsync() generates a new idempotency token (Guid), so calling it twice on the same instance would bypass server-side duplicate detection and risk a double-commit. The guard ensures only the first call proceeds.
Changes:
Type of change
Please delete options that are not relevant.
Closing issues
To automatically close an issue: closes #IssueNumber
Changelog
### Unreleasedinchangelog.mdfor the user-facing impact of this change.
documentation-only, test-only, CI / build-only, or a pure internal refactor
with no observable customer impact.
If the second box is checked, briefly justify here: