Skip to content

feat(replay): refresh-rotation race grace window#24

Merged
babs merged 1 commit into
masterfrom
feat/refresh-race-grace
Apr 27, 2026
Merged

feat(replay): refresh-rotation race grace window#24
babs merged 1 commit into
masterfrom
feat/refresh-race-grace

Conversation

@babs
Copy link
Copy Markdown
Owner

@babs babs commented Apr 27, 2026

Summary

Closes T2.3 from `misc/next-steps.md`. Adds a configurable grace window during which a refresh-rotation claim collision is treated as a benign concurrent submit (parallel-tab refresh, slow-network double-submit) and surfaces as 429 `refresh_concurrent_submit` + `Retry-After: 2` instead of revoking the family.

Outside the window the strict pre-T2.3 behavior applies — every collision still revokes the lineage (RFC 6749 §10.4 / OAuth 2.1 §6.1).

`REFRESH_RACE_GRACE_SEC` tunes the window: default 2s, clamped to [0, 10] at startup. The 10s ceiling is a security cap (wider windows are statistically attacker-shaped); 0 disables the grace and keeps the strict pre-T2.3 contract for operators who prefer it.

Why this matters

Pre-T2.3: a user with two browser tabs hitting the same API at expiry has a non-zero chance of one tab triggering reuse-detection, killing the family, and silently logging out both tabs. After this change: the racing tab gets a retry-friendly 429, the legitimate rotation succeeds, the family stays alive. Attacker replay past 10s still revokes (security boundary preserved).

Implementation

  • `replay.Store.ClaimOrCheckFamily` signature: adds `graceWindow time.Duration` parameter + `racing bool` return. At most one of {familyRevoked, racing, alreadyClaimed} is true on a given call.
  • Atomicity preserved: the Redis Lua script and the MemoryStore single-mutex section both classify {revoked, racing, claimed, fresh} in one linearizable step.
  • Lua script stores epoch-ms timestamps as claim values (passed by Go so the proxy is authoritative about its clock). Legacy `"1"` placeholders from pre-T2.3 binaries fall through to the strict revoke path during a rolling deploy — documented in the script docstring + threat-model row 5.
  • MemoryStore: `memoryEntry` struct (expiresAt + setAt) so the racing branch can compare against the original claim time without polluting `Mark` semantics.
  • 429 on /token is a deliberate deviation from RFC 6749 §5.2 (which defines 400/401). Most OAuth client libraries treat 429+`Retry-After` as "back off and retry", which is exactly the desired behavior. Documented in `specs.md`.
  • All four `replay_store_error` log sites already carry the `op` field (from PR feat(replay): single-use claim on consent + callback state #22); preserved here with `op=claim_refresh_family` on the racing-store path.

Test plan

  • CI green (`test`, `lint`, `keycloak-e2e`, `fuzz-smoke`, `manifest-prod`, `govulncheck`, `build`).
  • Handler tests (`handlers/single_use_replay_test.go`):
    • `TestTokenRefresh_RaceGrace_RacingReturns429` — second submit within grace returns 429 + `refresh_concurrent_submit`, `Retry-After` set, `access_denied{reason=refresh_concurrent_submit}++`, family NOT revoked, no `replay_detected{kind=refresh}` increment.
    • `TestTokenRefresh_RaceGrace_ZeroDisablesGrace` — `RefreshRaceGrace=0` keeps strict pre-T2.3 behavior (every collision = `refresh_reuse_detected`).
  • Store tests (`replay/memory_test.go`, `replay/redis_test.go`):
    • Racing within grace: `racing=true`, family NOT marked.
    • Past grace: `alreadyClaimed=true`, family marker written atomically.
  • Config tests (`config/config_test.go`):
    • Default (2s), zero (disables), custom (5s), non-int rejected, negative rejected, > 10 rejected.
  • No regression in any prior-shipped functionality — existing tests pass `TokenConfig{}` (zero `RefreshRaceGrace`) and get the strict pre-T2.3 contract.

Docs

  • `specs.md` — `/token` refresh flow step 5b extended with the grace-window classification + a note explaining the deliberate 429 deviation from RFC 6749 §5.2.
  • `README.md` — new `REFRESH_RACE_GRACE_SEC` env-var entry under the configuration table.
  • `docs/threat-model.md` — row 5 (refresh-token replay) extended with the grace-window mitigation + rolling-deploy transient note.

Closes T2.3 from misc/next-steps.md. Adds a configurable grace
window during which a refresh-rotation claim collision is treated
as a benign concurrent submit (parallel-tab refresh, slow-network
double-submit) and surfaces as 429 + error_code
refresh_concurrent_submit + Retry-After: 2 instead of revoking
the family.

Outside the window the strict pre-T2.3 behavior applies — every
collision still revokes the lineage (RFC 6749 §10.4 / OAuth 2.1
§6.1). REFRESH_RACE_GRACE_SEC tunes the window: default 2s,
clamped to [0, 10] at startup. The 10s ceiling is a security cap
(wider windows are statistically attacker-shaped); 0 disables
the grace and keeps the strict pre-T2.3 contract for operators
who prefer it.

Implementation:

  - replay.Store.ClaimOrCheckFamily gains a graceWindow parameter
    and a third return bool `racing`. Atomicity preserved: the
    Lua script on Redis and the single-mutex section on
    MemoryStore both classify {revoked, racing, claimed, fresh}
    in one linearizable step, so the new branch cannot deadlock
    or double-decide under concurrent load.

  - The Lua script now stores epoch-ms timestamps (passed by Go
    so the proxy is authoritative about its clock) as claim
    values; legacy "1" placeholder values from pre-T2.3 binaries
    fall through to the strict revoke path during a rolling
    deploy. Documented in the script docstring + threat-model
    row 5.

  - MemoryStore gets a memoryEntry struct (expiresAt + setAt) so
    the racing branch can compare against the original claim
    time without polluting Mark semantics.

  - 429 on /token is a deliberate deviation from RFC 6749 §5.2
    (which defines 400/401 for token-endpoint errors). Most OAuth
    client libraries treat 429+Retry-After as "back off and
    retry", which is exactly the desired behavior. Documented in
    specs.md.

  - Tests: 2 handler-level (racing happy path + zero disables
    grace), 2 per store (racing within grace + past-grace
    revokes), 6 config (default + zero + custom + non-int +
    negative + above-ceiling).

  - All four replay_store_error log sites already carry the `op`
    field from PR #22; this change preserves it.
@babs babs merged commit 008e400 into master Apr 27, 2026
7 checks passed
@babs babs deleted the feat/refresh-race-grace branch April 27, 2026 22:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant