feat(replay): refresh-rotation race grace window#24
Merged
Conversation
Closes T2.3 from misc/next-steps.md. Adds a configurable grace
window during which a refresh-rotation claim collision is treated
as a benign concurrent submit (parallel-tab refresh, slow-network
double-submit) and surfaces as 429 + error_code
refresh_concurrent_submit + Retry-After: 2 instead of revoking
the family.
Outside the window the strict pre-T2.3 behavior applies — every
collision still revokes the lineage (RFC 6749 §10.4 / OAuth 2.1
§6.1). REFRESH_RACE_GRACE_SEC tunes the window: default 2s,
clamped to [0, 10] at startup. The 10s ceiling is a security cap
(wider windows are statistically attacker-shaped); 0 disables
the grace and keeps the strict pre-T2.3 contract for operators
who prefer it.
Implementation:
- replay.Store.ClaimOrCheckFamily gains a graceWindow parameter
and a third return bool `racing`. Atomicity preserved: the
Lua script on Redis and the single-mutex section on
MemoryStore both classify {revoked, racing, claimed, fresh}
in one linearizable step, so the new branch cannot deadlock
or double-decide under concurrent load.
- The Lua script now stores epoch-ms timestamps (passed by Go
so the proxy is authoritative about its clock) as claim
values; legacy "1" placeholder values from pre-T2.3 binaries
fall through to the strict revoke path during a rolling
deploy. Documented in the script docstring + threat-model
row 5.
- MemoryStore gets a memoryEntry struct (expiresAt + setAt) so
the racing branch can compare against the original claim
time without polluting Mark semantics.
- 429 on /token is a deliberate deviation from RFC 6749 §5.2
(which defines 400/401 for token-endpoint errors). Most OAuth
client libraries treat 429+Retry-After as "back off and
retry", which is exactly the desired behavior. Documented in
specs.md.
- Tests: 2 handler-level (racing happy path + zero disables
grace), 2 per store (racing within grace + past-grace
revokes), 6 config (default + zero + custom + non-int +
negative + above-ceiling).
- All four replay_store_error log sites already carry the `op`
field from PR #22; this change preserves it.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes T2.3 from `misc/next-steps.md`. Adds a configurable grace window during which a refresh-rotation claim collision is treated as a benign concurrent submit (parallel-tab refresh, slow-network double-submit) and surfaces as 429 `refresh_concurrent_submit` + `Retry-After: 2` instead of revoking the family.
Outside the window the strict pre-T2.3 behavior applies — every collision still revokes the lineage (RFC 6749 §10.4 / OAuth 2.1 §6.1).
`REFRESH_RACE_GRACE_SEC` tunes the window: default 2s, clamped to [0, 10] at startup. The 10s ceiling is a security cap (wider windows are statistically attacker-shaped); 0 disables the grace and keeps the strict pre-T2.3 contract for operators who prefer it.
Why this matters
Pre-T2.3: a user with two browser tabs hitting the same API at expiry has a non-zero chance of one tab triggering reuse-detection, killing the family, and silently logging out both tabs. After this change: the racing tab gets a retry-friendly 429, the legitimate rotation succeeds, the family stays alive. Attacker replay past 10s still revokes (security boundary preserved).
Implementation
Test plan
Docs