Skip to content

Adds caller cancellation token propagation in hedging and timeout strategies#3094

Open
DaRosenberg wants to merge 1 commit into
App-vNext:mainfrom
OrgFlow:main
Open

Adds caller cancellation token propagation in hedging and timeout strategies#3094
DaRosenberg wants to merge 1 commit into
App-vNext:mainfrom
OrgFlow:main

Conversation

@DaRosenberg
Copy link
Copy Markdown

Pull Request

The issue or feature being addressed

Fixes #3086 (Timeout strategy does not propagate the caller's CancellationToken)

When a resilience pipeline contains a strategy that substitutes the execution CancellationToken with an internal one (timeout, and also hedging), a caller-initiated cancellation surfaces an OperationCanceledException whose CancellationToken is Polly's internal token rather than the caller's. This breaks the common pattern of letting caller cancellation pass through unchanged while wrapping other failures, because callers cannot reliably compare OperationCanceledException.CancellationToken to their own token.

Details on the issue fix or feature implementation

Goal

Polly should throw an OperationCanceledException carrying the caller's token if and only if the cancellation was actually caused by a cancellation request on that token — for any pipeline, regardless of which strategies it is composed of or how they are nested.

Approach

A repo-wide audit shows that within Polly.Core exactly two strategies substitute the execution token: timeout (TimeoutResilienceStrategy) and hedging (TaskExecution via ResilienceContext.InitializeFrom). Every other strategy and the pipeline plumbing only read context.CancellationToken, so they already emit the correct token at their own level.

The fix therefore lives in those two strategies, via a small shared helper:

// Polly.Core/Utils/OutcomeUtilities.cs
public static Outcome<T> WithCallerCancellation<T>(this Outcome<T> outcome, CancellationToken callerToken)
{
    if (callerToken.IsCancellationRequested
        && outcome.Exception is OperationCanceledException oce
        && oce.CancellationToken != callerToken)
    {
        return Outcome.FromException<T>(new OperationCanceledException(callerToken).TrySetStackTrace());
    }

    return outcome;
}
  • Timeout applies it to the outcome it returns, using the token that was on the context before it substituted its own. The existing timeout-detection branch (which produces TimeoutRejectedException) is unchanged.
  • Hedging applies it at its accepted-outcome return sites, using the upstream token it already captures for the duration of the hedged execution.

Because each substituting strategy normalizes back to its own previous token, the behavior composes correctly through arbitrary nesting: an inner timeout rewrites to the mid-level token, the outer timeout rewrites that to the caller's token, and so on. The simplest case (AddTimeout only) and deeply nested cases both end up with the caller's token.

Design decisions and trade-offs

  • Surgical fix We considered normalizing the exception once at the outermost pipeline execution boundary. That would also cover hypothetical third-party strategies that substitute the token, but it adds work to the universal execution hot path. We chose the surgical approach because it adds zero overhead to the common path (work happens only inside the two strategies, only on the cancellation branch) and provably covers every composition of the built-in strategies. The trade-off is that a custom strategy that substitutes the token would remain responsible for its own token handling.
  • Exception shape When the token must be corrected we create a bare new OperationCanceledException(callerToken).TrySetStackTrace(), matching the existing convention in DelegatingComponent, CompositeComponent, and hedging's pre-execution cancellation check. We deliberately did not chain the original exception as an InnerException; it keeps the behavior consistent with the rest of the codebase, at the cost of not preserving the original deep stack trace in this specific path.
  • Scope: v8 (Polly.Core) only. The legacy v7 Policy API uses a combined linked token and already documents (in AsyncTimeoutEngine/TimeoutEngine) that the token on the exception is not reliable for this determination. We left v7 untouched to avoid changing long-stable behavior.
  • Deliberately ignoring a benign race Detection relies on callerToken.IsCancellationRequested, read after the exception has been produced. Because a CancellationToken is a monotonic latch with no record of when or why it fired, there is an inherent, unavoidable race: if a non-caller cause produces the exception and the caller then cancels within the small window before we inspect the token, the cancellation is attributed to the caller. We chose not to add machinery to fight this, for three reasons:
    1. The timeout-vs-caller variant of this race is pre-existing — Polly's existing timeout classification already uses the same previousToken.IsCancellationRequested proxy, so this change does not make it worse (it only makes the resulting token more coherent).
    2. It cannot be fully eliminated: the token model exposes no causal/temporal information after the fact.
    3. In every misfire the caller's token genuinely is cancelled, so attributing the cancellation to the caller is defensible rather than harmful. The "only if" guarantee that matters in practice — not claiming caller cancellation when the caller's token was never cancelled — is fully preserved.

Tests

  • A regression test file under Issues (IssuesTests.CancellationTokenPropagation_3086.cs) with end-to-end coverage: the exact issue repro, timeout±retry in both orders, hedging, nested timeouts, a no-substitution baseline, and two "only if" guards (a real timeout still throws TimeoutRejectedException; an unrelated OperationCanceledException is preserved when the caller token is not cancelled)
  • Unit tests for the new helper covering all branches
  • Focused token-identity assertions added to the timeout and hedging strategy tests

All Polly.Core.Tests pass; branch and method coverage remain at 100% and the new code is fully covered.

Confirm the following

  • I started this PR by branching from the head of the default branch
  • I have targeted the PR to merge into the default branch
  • I have included unit tests for the issue/feature
  • I have successfully run a local build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Timeout strategy does not propagate correct CancellationToken

1 participant