Skip to content

fix: add asyncio.Lock to prevent concurrent refresh races#286

Open
nix-tkobayashi wants to merge 8 commits into
aws:mainfrom
nix-tkobayashi:fix/async-refresh-lock
Open

fix: add asyncio.Lock to prevent concurrent refresh races#286
nix-tkobayashi wants to merge 8 commits into
aws:mainfrom
nix-tkobayashi:fix/async-refresh-lock

Conversation

@nix-tkobayashi
Copy link
Copy Markdown

@nix-tkobayashi nix-tkobayashi commented May 20, 2026

Summary

Changes

Follow-up to #270 (per review feedback).

SessionHolder.async_refresh_if_needed() can be called concurrently by multiple in-flight requests when the session is marked for refresh. Without synchronization, all callers race to create a new boto3 session simultaneously — wasting STS calls and risking inconsistent state.

This PR:

  • Adds an asyncio.Lock to SessionHolder with a double-check pattern: the outer _needs_refresh check avoids lock overhead on the fast path; the inner check ensures only one caller actually performs the refresh
  • Shields the to_thread call from task cancellation so that a cancelled caller does not release the lock while the worker thread is still running (prevents duplicate refresh on cancellation)
  • Adds two unit tests:
    • Concurrent-callers test using threading.Event barriers to assert create_aws_session is called exactly once
    • Cancellation test that verifies the lock is held until the refresh completes, even when the awaiting task is cancelled

User experience

No user-visible behavior change. This is a correctness hardening for the credential refresh path introduced in #270.

Checklist

  • I have reviewed the contributing guidelines
  • I have performed a self-review of this change
  • Changes have been tested
  • Changes are documented

Is this a breaking change? (Y/N)

  • Yes
  • No

Please add details about how this change was tested.

  • All 185 unit tests pass with 96% coverage (80% required)
  • Pre-commit hooks pass
  • Ruff linting and formatting pass
  • Pyright passes (no new errors)
  • Concurrent test uses threading.Event barriers to assert single refresh
  • Cancellation test verified to FAIL without shield and PASS with shield
  • Integration tests not run locally (requires AWS OIDC credentials)

Acknowledgment

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

nix-tkobayashi and others added 5 commits May 8, 2026 19:29
… hook

`_sign_request_hook` is an async httpx event hook, but it calls
`session_holder.session.get_credentials()` and
`session_holder.refresh_if_needed()` synchronously. For profiles that
use assumed IAM roles (chained credentials), `get_credentials()` triggers
a blocking STS `AssumeRole` call that stalls the event loop and causes
connection timeouts (~60 s).

Add `SessionHolder.async_get_credentials()` and
`SessionHolder.async_refresh_if_needed()` which delegate to
`asyncio.to_thread`, keeping the event loop responsive. Update
`_sign_request_hook` to call the async variants.

Fixes aws#176

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address review feedback:
- Clarify that get_credentials() is the primary blocking path;
  async_refresh_if_needed() is a defensive measure
- Replace timeout-based non-blocking test with ticker coroutine that
  proves the event loop stays responsive during credential resolution

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevent concurrent refresh races when multiple in-flight requests
trigger credential refresh simultaneously. Uses a double-check
pattern: the outer check avoids lock overhead on the fast path,
the inner check ensures only one caller performs the refresh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@nix-tkobayashi nix-tkobayashi requested a review from a team as a code owner May 20, 2026 12:30
nix-tkobayashi and others added 2 commits May 20, 2026 21:46
When the task holding _refresh_lock is cancelled during
asyncio.to_thread(), the lock would be released while the worker
thread is still running.  A second caller could then acquire the
lock, see _needs_refresh == True, and start a duplicate refresh.

Shield the to_thread call so that cancellation waits for the
refresh to complete before releasing the lock and re-raising
CancelledError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aswhit1
Copy link
Copy Markdown

aswhit1 commented May 26, 2026

Independent evidence for the refresh-race shape

Thanks for the response on #270 -- adding the cluster-shape
evidence specific to the refresh race this PR addresses, for the
same audit window, since you confirmed #270 and #286 are
complementary rather than interchangeable. Sharing as the same
downstream user, via the
opencode MCP client.

What clustered-within-batch looks like in our data

Same n=64 audit window referenced on #270, narrowed to the failure
shape this PR addresses: when a credential refresh fires mid-fanout,
multiple concurrent _sign_request_hook calls each observe
_needs_refresh=True and each call create_aws_session(profile).
The losing-side signers complete signing against credentials that
were replaced under them; AWS responds 401/403; proxy surfaces
InvalidClientTokenId / AuthFailure / InvalidToken flavors.

The shape we'd predict from this race -- and what we observed --
is clustering: failures don't distribute independently across
batches; they bunch within a batch when refresh fires during it,
and stay zero in batches where refresh doesn't fire.

Batch size Trials Total calls Failed batches Failures concentrated in failed batches
5 (mixed-service, warm) 5 25 3 of 5 All 3 failures in the 3 failed batches; remaining 2 batches 0/10
6, 7, 8 (mixed-service, warm) 1 each 21 0 of 3 n/a

If failures were independent per call, a 12% per-call rate (3/25)
would distribute as roughly 1 failure per batch, not 3 batches with
~1 failure each and 2 clean batches. The clustering is consistent
with the refresh-race shape this PR describes -- once one call
401s, the second-and-subsequent concurrent fanout calls each
observe _needs_refresh=True and race the refresh.

Why locked + double-checked is the right shape

The double-check pattern in this PR (outer _needs_refresh check
avoids lock overhead on the fast path; inner check ensures only
one caller actually performs the refresh) lines up with what our
data shows we need:

  • Fast path stays fast: in the 21/21 PASS warm batches above,
    no refresh fired, no lock contention -- the outer check returns
    immediately
  • Refresh path serializes: in the 3 batches that did refresh,
    all signers would queue on the lock, the first completes the
    refresh, the rest see _needs_refresh=False on the inner check
    and skip the redundant create_aws_session -- which is exactly
    the cluster of failures we observe today

The cancellation shield matters for our setup too: opencode cancels
in-flight tool calls when the user interrupts. Without the shield,
a cancelled caller mid-refresh would release the lock while the
worker thread is still running, and the next batch would race a
half-completed refresh.

In-the-wild reproducer eventIDs (refresh-race subset)

From the same audit window referenced on #270, narrowed to the
cluster-shape variant:

  • 053a0ced-..., 1f464cfa-... (mixed-service size-5 batch,
    same Trial 2.5 -- two failures in one batch with
    InvalidToken and InvalidClientTokenId variants
    respectively, consistent with two concurrent signers losing
    the refresh race against the same cycle)

The single-shot 502-from-runtime cluster on #270 (dbbc78f0 /
ae01a432 / 270c11a4 / 6d60eaf2) is likely the concurrent-signing
territory addressed by #270; the pair above is what we'd attribute
to the refresh race specifically.

Caveat I want to be honest about

We have not directly instrumented the lock contention or
measured how often _needs_refresh actually fires within a
batch -- our evidence is the failure shape (clustering),
which is consistent with the refresh race but doesn't prove
it over a hypothetical alternative race in the same code path.
If you'd find instrumented evidence useful (e.g., a debug log
counting create_aws_session invocations per batch), we can
patch our local install and rerun the bench script.

Happy to share additional CloudTrail evidence or the bench
script if it would help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants