Skip to content

[Internal] DTS: Adds double-commit guardrail to prevent idempotency token bypass#5884

Merged
Meghana-Palaparthi merged 3 commits into
mainfrom
users/Meghana-Palaparthi/dtx_double_commit_guardrail
May 28, 2026
Merged

[Internal] DTS: Adds double-commit guardrail to prevent idempotency token bypass#5884
Meghana-Palaparthi merged 3 commits into
mainfrom
users/Meghana-Palaparthi/dtx_double_commit_guardrail

Conversation

@Meghana-Palaparthi
Copy link
Copy Markdown
Contributor

Pull Request Template

Description

Adds an Interlocked-based execute-once guard to DistributedWriteTransactionCore.CommitTransactionAsync() that throws InvalidOperationException on subsequent calls. Each CommitTransactionAsync() generates a new idempotency token (Guid), so calling it twice on the same instance would bypass server-side duplicate detection and risk a double-commit. The guard ensures only the first call proceeds.

Changes:

  • DistributedWriteTransactionCore: Added 'commitCalled' field with Interlocked.CompareExchange guard
  • DistributedTransaction: Updated CommitTransactionAsync xmldoc with remarks and exception documentation
  • DistributedWriteTransactionTests: Added 3 tests covering sequential double-call, post-failure double-call, and concurrent race condition scenarios

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • [] New feature (non-breaking change which adds functionality)
  • [] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [] This change requires a documentation update

Closing issues

To automatically close an issue: closes #IssueNumber

Changelog

  • I have added a changelog entry under ### Unreleased in changelog.md
    for the user-facing impact of this change.
  • No changelog entry is required because this PR is one of:
    documentation-only, test-only, CI / build-only, or a pure internal refactor
    with no observable customer impact.

If the second box is checked, briefly justify here:

Feature not public yet

…event idempotency token bypass

Adds an Interlocked-based execute-once guard to DistributedWriteTransactionCore.CommitTransactionAsync()
that throws InvalidOperationException on subsequent calls. Each CommitTransactionAsync() generates a new
idempotency token (Guid), so calling it twice on the same instance would bypass server-side duplicate
detection and risk a double-commit. The guard ensures only the first call proceeds.

Changes:
- DistributedWriteTransactionCore: Added 'commitCalled' field with Interlocked.CompareExchange guard
- DistributedTransaction: Updated CommitTransactionAsync xmldoc with remarks and exception documentation
- DistributedWriteTransactionTests: Added 3 tests covering sequential double-call, post-failure double-call,
  and concurrent race condition scenarios

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 19, 2026

📝 Changelog reminder (non-blocking)

This PR touches shipped source but does not appear to update the
corresponding changelog.md.

Touched (and missing entry for):

  • Microsoft.Azure.Cosmos/src/** ⇒ expected an entry in changelog.md (### Unreleased)

How to decide

Use the rubric in .github/copilot-instructions.md ("Changelog
classifier") or in CONTRIBUTING.md ("Changelog entry"). Quick
version:

  • Customer-observable change (behavior, perf, memory, API) ⇒ add an entry, even if the PR is [Internal].
  • Test-only / CI-only / doc-only / pure-internal-refactor with zero customer-observable effect ⇒ no entry, add the no-changelog-needed label to silence this check.
  • Unsure? Default to adding a one-line entry under #### Other Changes. Reviewer will adjust.

This check is non-blocking — merge is not gated on it. The
reviewer is responsible for the final classification.

kushagraThapar
kushagraThapar previously approved these changes May 19, 2026
Copy link
Copy Markdown
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Meghana-Palaparthi — really nice, focused guardrail PR. The Interlocked.CompareExchange(ref this.commitCalled, 1, 0) != 0 pattern is exactly right (returns the original value), the exception message is excellent (states what / why / how-to-recover / when-not-to-blindly-retry — this is template-worthy for the rest of the DTS surface), and the four "permanently-consume" branches (success, error response, network exception, cancellation) are all covered in the test matrix.

Approving — two suggestions inline, neither blocking:

  1. The "ConcurrentCalls" test doesn't actually race the CompareExchange (both tasks run sequentially on the same thread before either hits await) — would you mind hardening it so it actually exercises concurrent racers?
  2. The new <remarks> lives on the abstract DistributedTransaction.CommitTransactionAsync, but DistributedReadTransactionCore.CommitTransactionAsync doesn't enforce single-use. A caller writing a helper against the abstract base gets one behavior for write and another for read with no compile-time signal — could you please help me understand how you'd like to resolve this asymmetry?

Also wanted to flag a few small positives for the record:

  • The "fail-once-and-done" semantics (the gate is NOT reset on failure) is the deliberate conservative choice, well-documented in both the exception message and the <remarks>. Different from HttpClient.SendAsync / DbCommand.ExecuteNonQuery (which let you retry after failure), but the right call here because the idempotency-token contract makes "did it succeed?" unknowable from the client side on a timeout/cancellation.
  • The existing CommitTransaction_SameBodyAndTokenSentOnEveryRetryAttempt test in DistributedTransactionCommitterTests.cs already establishes the threat model this PR closes (one token reused across SDK-internal retries). Nice tie-in.
  • No cross-SDK parity gap — Java and Python don't have DTS yet, so this is the canonical place to set the contract.

NaluTripician
NaluTripician previously approved these changes May 19, 2026
Copy link
Copy Markdown
Contributor

@NaluTripician NaluTripician left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. The Interlocked.CompareExchange execute-once guard is correctly implemented — right field type (int, no volatile needed), right architectural layer above the committer's internal retry loop (which reuses one IdempotencyToken across attempts, so the guard doesn't interfere with legitimate retries), correct CAS semantics. InvalidOperationException matches the Cosmos lifecycle-guard convention (ChangeFeedProcessorBuilder, ItemBatchOperation, etc.). The threat model is real and well-explained in the xmldoc: each CommitTransactionAsync calls new DistributedTransactionServerRequest(...) which unconditionally sets IdempotencyToken = Guid.NewGuid(), so a naïve caller retry would bypass server-side duplicate detection. Consuming the transaction even on a failed first attempt is the correct call given the SDK can't distinguish "request never sent" from "response lost." Changelog skip is policy-compliant — the type is internal in the shipping NuGet and absent from all API_*.txt contracts.

A few non-blocking follow-ups worth considering in a later PR:

  1. Asymmetric enforcement of the documented contract. The new <remarks> and <exception cref="InvalidOperationException"> tags sit on the abstract DistributedTransaction.CommitTransactionAsync base method, so they apply universally — but only DistributedWriteTransactionCore enforces the guard. DistributedReadTransactionCore.CommitTransactionAsync has no commitCalled field and no CAS, so a read transaction called twice will issue two reads with two different idempotency tokens, contradicting the documented contract. Reads are idempotent at the data layer today so there's no immediate data-loss risk, but the asymmetry will silently become a real bug the moment the read path picks up any side effect. Two clean fixes: lift the guard into the abstract base via a template-method split (sealed override + protected abstract …InternalAsync) so the contract is honored universally, or narrow the xmldoc down to the Write override only.

  2. CommitAsync_ConcurrentCalls_OnlyOneSucceeds isn't actually concurrent. Both CommitTransactionAsync calls run on the test thread; the first synchronous prefix (the CAS + new DistributedTransactionCommitter(...)) completes before the second call even starts, so the two CAS operations serialize deterministically through the async state machine. The whole purpose of Interlocked.CompareExchange over a plain if (commitCalled == 0) { commitCalled = 1; … } is correctness under a real cross-thread race — the current test would pass against a naïve non-atomic implementation just as easily. Wrap each call in Task.Run with a Barrier(2).SignalAndWait() to actually exercise the CAS. While you're in there, assert which task won and that exactly one succeeded (task.Status == TaskStatus.RanToCompletion && task.Result.IsSuccessStatusCode) vs. faulted with IOE — today the test only checks that some IOE was caught and that invocationCount == 1, so a regression where the first call threw and the second faulted with IOE would still pass.

  3. Operation-builder methods remain mutable after commit. CreateItem, UpsertItem, ReplaceItem, DeleteItem, PatchItem (and the *Stream variants) don't check commitCalled, so a consumed transaction will happily accept new operations and return this. The next commit attempt throws, but the object visibly behaves like a fresh transaction in the meantime. Given the new xmldoc explicitly tells users to "construct a new DistributedWriteTransaction", consider gating the builders too — or at least calling this out in the docs so it isn't a surprise.

Minor nits:

  • The exception-message substring assertions (ex.Message.Contains("already been called"), etc.) are brittle relative to team precedent in #5793, which asserts exception type only. Either drop the substring asserts or extract the message to a private const string in the SUT and reference it from both sides.
  • Magic 0/1 in Interlocked.CompareExchange(ref this.commitCalled, 1, 0) would read more cleanly with named constants (CommitNotCalled = 0, CommitCalled = 1).
  • commitCalled is a past-participle; the file right next door (DistributedTransactionResponse.cs) uses isDisposed, and Cosmos lifecycle flags generally follow is* (e.g., isBuilt, isStarted). isCommitInvoked would be more consistent.
  • The new <exception> xmldoc documents only InvalidOperationException, but CommitTransactionAsync clearly also surfaces OperationCanceledException (one of the new tests asserts this) plus transport/CosmosException failures. Worth adding for intellisense honesty.
  • PR description says "Added 3 tests" but the diff adds 5 (cancellation + transient-exception tests are missing from the body summary).

- Fix concurrent test to use real thread-level concurrency (16 Task.Run racers + ManualResetEventSlim) instead of sequential async calls
- Add double-commit guard to DistributedReadTransactionCore for API
  consistency (both subclasses now enforce single-use contract)
- Extract CAS constants (CommitNotStarted/CommitStarted) to
  DistributedTransactionConstants
- Rename commitCalled -> isCommitInvoked (follows is* convention)
- Replace brittle .Contains() assertions with Assert.AreEqual against
  internal const CommitAlreadyCalledMessage
- Add <exception cref="OperationCanceledException"> xmldoc
- Move shared single-use contract docs to abstract base class

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>
kushagraThapar
kushagraThapar previously approved these changes May 26, 2026
Copy link
Copy Markdown
Member

@kushagraThapar kushagraThapar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @Meghana-Palaparthi

@Meghana-Palaparthi
Copy link
Copy Markdown
Contributor Author

@copilot resolve the merge conflicts in this pull request

Copy link
Copy Markdown
Contributor

Copilot AI commented May 27, 2026

@copilot resolve the merge conflicts in this pull request

Resolved the merge conflicts in 8d8b4398.

@Meghana-Palaparthi Meghana-Palaparthi merged commit daa1597 into main May 28, 2026
33 checks passed
@Meghana-Palaparthi Meghana-Palaparthi deleted the users/Meghana-Palaparthi/dtx_double_commit_guardrail branch May 28, 2026 17:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants