fix: add asyncio.Lock to prevent concurrent refresh races#286
fix: add asyncio.Lock to prevent concurrent refresh races#286nix-tkobayashi wants to merge 8 commits into
Conversation
… hook `_sign_request_hook` is an async httpx event hook, but it calls `session_holder.session.get_credentials()` and `session_holder.refresh_if_needed()` synchronously. For profiles that use assumed IAM roles (chained credentials), `get_credentials()` triggers a blocking STS `AssumeRole` call that stalls the event loop and causes connection timeouts (~60 s). Add `SessionHolder.async_get_credentials()` and `SessionHolder.async_refresh_if_needed()` which delegate to `asyncio.to_thread`, keeping the event loop responsive. Update `_sign_request_hook` to call the async variants. Fixes aws#176 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address review feedback: - Clarify that get_credentials() is the primary blocking path; async_refresh_if_needed() is a defensive measure - Replace timeout-based non-blocking test with ticker coroutine that proves the event loop stays responsive during credential resolution Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevent concurrent refresh races when multiple in-flight requests trigger credential refresh simultaneously. Uses a double-check pattern: the outer check avoids lock overhead on the fast path, the inner check ensures only one caller performs the refresh. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the task holding _refresh_lock is cancelled during asyncio.to_thread(), the lock would be released while the worker thread is still running. A second caller could then acquire the lock, see _needs_refresh == True, and start a duplicate refresh. Shield the to_thread call so that cancellation waits for the refresh to complete before releasing the lock and re-raising CancelledError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Conflicts: # tests/unit/test_hooks.py
Independent evidence for the refresh-race shapeThanks for the response on #270 -- adding the cluster-shape What clustered-within-batch looks like in our dataSame n=64 audit window referenced on #270, narrowed to the failure The shape we'd predict from this race -- and what we observed --
If failures were independent per call, a 12% per-call rate (3/25) Why locked + double-checked is the right shapeThe double-check pattern in this PR (outer
The cancellation shield matters for our setup too: opencode cancels In-the-wild reproducer eventIDs (refresh-race subset)From the same audit window referenced on #270, narrowed to the
The single-shot 502-from-runtime cluster on #270 (dbbc78f0 / Caveat I want to be honest aboutWe have not directly instrumented the lock contention or Happy to share additional CloudTrail evidence or the bench |
Summary
Changes
Follow-up to #270 (per review feedback).
SessionHolder.async_refresh_if_needed()can be called concurrently by multiple in-flight requests when the session is marked for refresh. Without synchronization, all callers race to create a new boto3 session simultaneously — wasting STS calls and risking inconsistent state.This PR:
asyncio.LocktoSessionHolderwith a double-check pattern: the outer_needs_refreshcheck avoids lock overhead on the fast path; the inner check ensures only one caller actually performs the refreshto_threadcall from task cancellation so that a cancelled caller does not release the lock while the worker thread is still running (prevents duplicate refresh on cancellation)threading.Eventbarriers to assertcreate_aws_sessionis called exactly onceUser experience
No user-visible behavior change. This is a correctness hardening for the credential refresh path introduced in #270.
Checklist
Is this a breaking change? (Y/N)
Please add details about how this change was tested.
Acknowledgment
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.