Replace async-disabling mechanism with retry backoff on refresh failure (#1315)

mihaimitrea-db · claude · web-flow · commit 0d0e80782e9b · 2026-03-19T09:14:33.000Z
## Summary

Replace the async-disabling mechanism on token refresh failure with a
1-minute retry backoff, allowing the SDK to recover from transient
errors without waiting for a full token expiry.

## Why

When an asynchronous token refresh failed, the `Refreshable` class set a
`_refresh_err` flag that completely disabled async refresh. The only way
to clear this flag was through a blocking refresh, which only triggers
when the token fully expires. This meant the SDK could not recover from
transient refresh failures (e.g. a brief network blip) until the token
expired — potentially tens of minutes later — even though the underlying
issue may have resolved in seconds.

This PR replaces the binary disable flag with a short cooldown: after a
failed async refresh, the `_stale_after` threshold is reset to 1 minute
after the current time so the token appears fresh for a brief backoff period.
Once the cooldown elapses the token becomes stale again and a new async
refresh is attempted, giving the SDK a chance to recover proactively.
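
The cooldown described above can be sketched in isolation. This is a minimal illustration, not the SDK's code: `State`, `token_state`, and `stale_after_on_failure` are hypothetical stand-ins for `_TokenState`, `_token_state()`, and `_handle_failed_async_refresh()`, and the timestamps are made up.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

# Mirrors the 1-minute value of _ASYNC_REFRESH_RETRY_BACKOFF in this PR.
BACKOFF = timedelta(minutes=1)


class State(Enum):
    FRESH = "fresh"
    STALE = "stale"
    EXPIRED = "expired"


def token_state(expiry: datetime, stale_after: Optional[datetime], now: datetime) -> State:
    """Expiry is checked first, so a stale_after beyond expiry is harmless."""
    if expiry < now:
        return State.EXPIRED
    if stale_after is not None and stale_after < now:
        return State.STALE
    return State.FRESH


def stale_after_on_failure(now: datetime) -> datetime:
    """After a failed async refresh, the token looks fresh until now + BACKOFF."""
    return now + BACKOFF


t0 = datetime(2026, 1, 1, 12, 0, 0)
expiry = t0 + timedelta(minutes=30)

# The token is already stale at t0, so an async refresh would be attempted.
before_failure = token_state(expiry, t0 - timedelta(minutes=1), t0)

# Suppose that refresh fails at t0: the threshold moves to t0 + 1 minute.
cooldown_until = stale_after_on_failure(t0)
during_cooldown = token_state(expiry, cooldown_until, t0 + timedelta(seconds=30))
after_cooldown = token_state(expiry, cooldown_until, t0 + timedelta(seconds=61))
```

During the cooldown the state is `FRESH`, so no new async refresh is submitted; once the minute has passed the state returns to `STALE` and the next call can retry.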

## What changed

### Interface changes

None.

### Behavioral changes

- **Async refresh retry on failure** — Previously, a failed async
refresh disabled all future async attempts until a blocking refresh on
expiry. Now, the SDK waits 1 minute (`_ASYNC_REFRESH_RETRY_BACKOFF`) and
retries the async refresh on the next token request after that cooldown.
This makes token refresh more resilient to transient errors.
- **Late async result guard** — When a slow async refresh completes
after a blocking refresh already obtained a newer token, the stale async
result is now discarded instead of overwriting the fresher token.
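
The late-result guard can be illustrated with a generation counter, which is how the diff implements it (`_token_generation` compared against `gen_at_submit`). The sketch below is a simplified, hypothetical `TokenCache` class rather than the real `Refreshable`; tokens are plain strings here.

```python
import threading


class TokenCache:
    """Hypothetical cache that rejects results computed against an older token."""

    def __init__(self, token: str):
        self._lock = threading.Lock()
        self._token = token
        self._generation = 0

    def generation(self) -> int:
        with self._lock:
            return self._generation

    def token(self) -> str:
        with self._lock:
            return self._token

    def store(self, token: str) -> None:
        """Unconditional store, as a blocking refresh would do."""
        with self._lock:
            self._token = token
            self._generation += 1

    def store_if_current(self, token: str, gen_at_submit: int) -> bool:
        """Store only if no other refresh has landed since gen_at_submit."""
        with self._lock:
            if self._generation != gen_at_submit:
                return False  # a newer token is already in place; discard
            self._token = token
            self._generation += 1
            return True


cache = TokenCache("old")
gen_at_submit = cache.generation()        # async refresh is submitted here
cache.store("from-blocking-refresh")      # blocking refresh finishes first
accepted = cache.store_if_current("from-slow-async-refresh", gen_at_submit)
```

Comparing generations rather than expiry timestamps means the guard does not depend on which token happens to expire later, only on whether the cached token changed while the async task was running.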

### Internal changes

- **`_stale_after` replaces `_stale_duration`** — Staleness is now
tracked as an absolute timestamp (`_stale_after`) instead of a relative
`timedelta` (`_stale_duration`). This simplifies `_token_state()` to a
direct comparison rather than computing `expiry - now` and comparing
against a duration.
- **`_handle_failed_async_refresh()`** — New method that advances
`_stale_after` to the current time plus the backoff period
(`_ASYNC_REFRESH_RETRY_BACKOFF`), replacing the `_refresh_err` flag.
- **`_now()` helper** — Centralises "current time" so that naive and
timezone-aware `datetime` objects from different token sources are
compared consistently.
- **`_use_dynamic_stale_duration` renamed to
`_use_legacy_stale_duration`** — Inverted boolean to clarify intent: the
legacy path is the one where callers supply an explicit
`stale_duration`.
- **`_MockRefreshable.refresh()` no longer mutates `self._token`** — The
mock now returns the token without setting `self._token` as a side
effect, avoiding a data race between async and blocking refresh threads.
The production code's `_update_token` handles storage.
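
As a rough worked example of the `_stale_after` computation and the `_now()` idea, the sketch below reimplements both as free functions. `compute_stale_after` and `now_matching` are hypothetical names; the 20-minute cap mirrors `_MAX_STALE_DURATION` from the diff, and the sample timestamps are invented.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_STALE_DURATION = timedelta(minutes=20)  # same value as _MAX_STALE_DURATION


def compute_stale_after(
    expiry: datetime,
    now: datetime,
    legacy_stale_duration: Optional[timedelta] = None,
) -> datetime:
    """Return the absolute time after which the token counts as stale."""
    if legacy_stale_duration is not None:
        # Legacy path: the caller supplied an explicit stale_duration.
        return expiry - legacy_stale_duration
    # Dynamic path: half the remaining TTL, capped, and never negative.
    ttl = expiry - now
    stale_duration = max(timedelta(0), min(ttl // 2, MAX_STALE_DURATION))
    return expiry - stale_duration


def now_matching(reference: datetime) -> datetime:
    """Current time with the same tz-awareness as `reference`."""
    return datetime.now(tz=reference.tzinfo)


now = datetime(2026, 1, 1, 12, 0)
short_lived = compute_stale_after(now + timedelta(minutes=10), now)  # 12:05
standard = compute_stale_after(now + timedelta(hours=1), now)        # 12:40
already_expired = compute_stale_after(now - timedelta(minutes=10), now)
legacy = compute_stale_after(now + timedelta(hours=1), now, timedelta(minutes=5))
```

Python raises `TypeError` when a naive `datetime` is compared with an aware one, which is why the current time is derived from the tz-awareness of the value it will be compared against.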

## How is this tested?

Tests are rewritten to be fully deterministic by introducing a
`_ManualExecutor` that replaces the real `ThreadPoolExecutor`. Async
refreshes are queued but only execute when `executor.run_all()` is
called, eliminating all `time.sleep()` calls and thread synchronization
from async-path tests. This makes the test suite faster and removes
flakiness from timing-dependent assertions.
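
The test file is not shown in this excerpt, so the following is only a guess at the shape of such an executor. The class name `ManualExecutor` and its internals are assumptions; only `submit()` and `run_all()` are taken from the description above. Unlike `ThreadPoolExecutor.submit`, this sketch does not return a `Future`.

```python
from typing import Any, Callable, List, Tuple


class ManualExecutor:
    """Queues submitted callables and runs them only when told to."""

    def __init__(self) -> None:
        self._queue: List[Tuple[Callable[..., Any], tuple, dict]] = []

    def submit(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> None:
        # Nothing runs here; the task is only recorded.
        self._queue.append((fn, args, kwargs))

    def run_all(self) -> int:
        """Run every queued task on the calling thread; return how many ran."""
        pending, self._queue = self._queue, []
        for fn, args, kwargs in pending:
            fn(*args, **kwargs)
        return len(pending)


executor = ManualExecutor()
results: List[int] = []

executor.submit(results.append, 1)
executor.submit(results.append, 2)
results_before_run = list(results)  # tasks are queued but have not run yet

ran = executor.run_all()
```

Because tasks run synchronously on the test thread, a test can assert on the state before and after the "async" refresh without sleeps or thread joins.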

New test cases:

- `test_repeated_calls_during_async_failure_cooldown_do_not_refresh` —
verifies that calls during the cooldown period do not trigger additional
async refreshes.
- `test_call_after_async_failure_cooldown_refreshes_token_async` —
verifies that a call after the cooldown elapses triggers a new async
refresh that succeeds.
- `test_late_async_refresh_does_not_overwrite_blocking_refresh` —
verifies that a slow async refresh completing after a blocking refresh
does not overwrite the newer token.
- `test_stale_after_is_recomputed_after_blocking_refresh` — verifies
that `_stale_after` is recomputed from the refreshed token after a
blocking refresh.
- `test_stale_after_computation` — verifies that `_stale_after` is
computed correctly for both the dynamic and legacy stale-duration paths.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff --git a/NEXT_CHANGELOG.md b/NEXT_CHANGELOG.md
@@ -11,5 +11,6 @@
 ### Documentation
 
 ### Internal Changes
+* Replace the async-disabling mechanism on token refresh failure with a 1-minute retry backoff. Previously, a single failed async refresh would disable proactive token renewal until the token expired. Now, the SDK waits a short cooldown period and retries, improving resilience to transient errors.
 
 ### API Changes
diff --git a/databricks/sdk/oauth.py b/databricks/sdk/oauth.py
@@ -248,12 +248,11 @@ class Refreshable(TokenSource):
 
     _EXECUTOR = None
     _EXECUTOR_LOCK = threading.Lock()
-    # Legacy default duration for the stale period. This value is chosen to cover the
-    # maximum monthly downtime allowed by a 99.99% uptime SLA (~4.38 minutes).
-    _DEFAULT_STALE_DURATION = timedelta(minutes=5)
     # Default maximum stale duration. Chosen to cover the maximum monthly downtime
     # allowed by a 99.99% uptime SLA (~4.38 minutes) with generous overhead guarantees
     _MAX_STALE_DURATION = timedelta(minutes=20)
+    # Backoff time after an async refresh failure before trying another one.
+    _ASYNC_REFRESH_RETRY_BACKOFF = timedelta(minutes=1)
 
     @classmethod
     def _get_executor(cls):
@@ -272,15 +271,25 @@ def __init__(
         stale_duration: Optional[timedelta] = None,
     ):
         # Config properties
-        self._use_dynamic_stale_duration = stale_duration is None
+        self._use_legacy_stale_duration = stale_duration is not None
+        # Only read on the legacy path (when _use_legacy_stale_duration is True).
         self._stale_duration = stale_duration if stale_duration is not None else timedelta(seconds=0)
         self._disable_async = disable_async
         # Lock
         self._lock = threading.Lock()
         # Non Thread safe properties. They should be accessed only when protected by the lock above.
+        self._stale_after: Optional[datetime] = None
+        self._token_generation: int = 0
         self._update_token(token or Token(""))
         self._is_refreshing = False
-        self._refresh_err = False
+
+    def _now(self) -> datetime:
+        """Return the current time, matching the tz-awareness of the cached token."""
+        if self._token.expiry:
+            return datetime.now(tz=self._token.expiry.tzinfo)
+        if self._stale_after:
+            return datetime.now(tz=self._stale_after.tzinfo)
+        return datetime.now()
 
     def _update_token(self, token: Token) -> None:
         """Stores the new token and pre-computes the stale threshold.
@@ -290,17 +299,28 @@ def _update_token(self, token: Token) -> None:
 
         This ensures short-lived tokens (e.g. FastPath with 10-minute TTL) get a
         proportionally smaller stale window, while standard OAuth tokens (≥1 hour TTL)
-        use the full cap of _DEFAULT_STALE_DURATION.
+        use the full cap of _MAX_STALE_DURATION.
         """
         self._token = token
+        self._token_generation += 1
+        self._stale_after = None
 
-        if self._use_dynamic_stale_duration and self._token.expiry:
-            ttl = self._token.expiry - datetime.now()
-
-            if ttl < timedelta(seconds=0):
-                self._stale_duration = timedelta(seconds=0)
+        if self._token.expiry:
+            if self._use_legacy_stale_duration:
+                self._stale_after = self._token.expiry - self._stale_duration
             else:
-                self._stale_duration = min(ttl // 2, self._MAX_STALE_DURATION)
+                ttl = self._token.expiry - self._now()
+                stale_duration = max(timedelta(seconds=0), min(ttl // 2, self._MAX_STALE_DURATION))
+                self._stale_after = self._token.expiry - stale_duration
+
+    def _handle_failed_async_refresh(self) -> None:
+        """Pushes _stale_after forward by the retry backoff, making the token appear fresh temporarily.
+
+        This may set _stale_after past the token's expiry; that is safe because
+        _token_state() checks expiry before staleness.
+        """
+        if self._stale_after:
+            self._stale_after = self._now() + self._ASYNC_REFRESH_RETRY_BACKOFF
 
     # This is the main entry point for the Token. Do not access the token
     # using any of the internal functions.
@@ -334,19 +354,16 @@ def _token_state(self) -> _TokenState:
         if not self._token.expiry:
             return _TokenState.FRESH
 
-        lifespan = self._token.expiry - datetime.now()
-        if lifespan < timedelta(seconds=0):
+        now = self._now()
+        if self._token.expiry < now:
             return _TokenState.EXPIRED
-        if lifespan < self._stale_duration:
+        if self._stale_after and self._stale_after < now:
             return _TokenState.STALE
         return _TokenState.FRESH
 
     def _blocking_token(self) -> Token:
         """Returns a token, blocking if necessary to refresh it."""
         state = self._token_state()
-        # This is important to recover from potential previous failed attempts
-        # to refresh the token asynchronously.
-        self._refresh_err = False
         self._is_refreshing = False
 
         # It's possible that the token got refreshed (either by a _blocking_refresh or
@@ -360,28 +377,31 @@ def _blocking_token(self) -> Token:
 
     def _trigger_async_refresh(self):
         """Starts an asynchronous refresh if none is in progress."""
+        gen_at_submit = self._token_generation
 
         def _refresh_internal():
             new_token = None
             try:
                 new_token = self.refresh()
             except Exception as e:
                 # This happens on a thread, so we don't want to propagate the error.
-                # Instead, if there is no new_token for any reason, we will disable async refresh below
-                # But we will do it inside the lock.
+                # Instead, if there is no new_token for any reason, we apply a retry
+                # backoff below so the token appears fresh for a short cooldown period.
                 logger.warning(f"Tried to refresh token asynchronously, but failed: {e}")
 
             with self._lock:
-                if new_token is not None:
+                if self._token_generation != gen_at_submit:
+                    logger.debug("Async refresh completed but token was already updated; discarding result.")
+                elif new_token is not None:
                     self._update_token(new_token)
                 else:
-                    self._refresh_err = True
+                    self._handle_failed_async_refresh()
                 self._is_refreshing = False
 
         # The token may have been refreshed by another thread.
         if self._token_state() == _TokenState.FRESH:
             return
-        if not self._is_refreshing and not self._refresh_err:
+        if not self._is_refreshing:
             self._is_refreshing = True
             Refreshable._get_executor().submit(_refresh_internal)
 
diff --git a/tests/test_refreshable.py b/tests/test_refreshable.py