Skip to content

Replace async-disabling mechanism with retry backoff on refresh failure#1315

Merged
mihaimitrea-db merged 8 commits intomainfrom
mihaimitrea-db/async-refresh-retry
Mar 19, 2026
Merged

Replace async-disabling mechanism with retry backoff on refresh failure#1315
mihaimitrea-db merged 8 commits intomainfrom
mihaimitrea-db/async-refresh-retry

Conversation

@mihaimitrea-db
Copy link
Copy Markdown
Contributor

@mihaimitrea-db mihaimitrea-db commented Mar 10, 2026

Summary

Replace the async-disabling mechanism on token refresh failure with a 1-minute retry backoff, allowing the SDK to recover from transient errors without waiting for a full token expiry.

Why

When an asynchronous token refresh failed, the Refreshable class set a _refresh_err flag that completely disabled async refresh. The only way to clear this flag was through a blocking refresh, which only triggers when the token fully expires. This meant the SDK could not recover from transient refresh failures (e.g. a brief network blip) until the token expired — potentially tens of minutes later — even though the underlying issue may have resolved in seconds.

This PR replaces the binary disable flag with a short cooldown: after a failed async refresh, the _stale_after threshold is pushed 1 minute into the future so the token appears fresh for a brief backoff period. Once the cooldown elapses the token becomes stale again and a new async refresh is attempted, giving the SDK a chance to recover proactively.

What changed

Interface changes

None.

Behavioral changes

  • Async refresh retry on failure — Previously, a failed async refresh disabled all future async attempts until a blocking refresh on expiry. Now, the SDK waits 1 minute (_ASYNC_REFRESH_RETRY_BACKOFF) and then retries the async refresh. This makes token refresh more resilient to transient errors.
  • Late async result guard — When a slow async refresh completes after a blocking refresh already obtained a newer token, the stale async result is now discarded instead of overwriting the fresher token.

Internal changes

  • _stale_after replaces _stale_duration — Staleness is now tracked as an absolute timestamp (_stale_after) instead of a relative timedelta (_stale_duration). This simplifies _token_state() to a direct comparison rather than computing expiry - now and comparing against a duration.
  • _handle_failed_async_refresh() — New method that advances _stale_after by the backoff period, replacing the _refresh_err flag.
  • _now() helper — Centralises "current time" so that naive and timezone-aware datetime objects from different token sources are compared consistently.
  • _use_dynamic_stale_duration renamed to _use_legacy_stale_duration — Inverted boolean to clarify intent: the legacy path is the one where callers supply an explicit stale_duration.
  • _MockRefreshable.refresh() no longer mutates self._token — The mock now returns the token without setting self._token as a side effect, avoiding a data race between async and blocking refresh threads. The production code's _update_token handles storage.

How is this tested?

Tests are rewritten to be fully deterministic by introducing a _ManualExecutor that replaces the real ThreadPoolExecutor. Async refreshes are queued but only execute when executor.run_all() is called, eliminating all time.sleep() calls and thread synchronization from async-path tests. This makes the test suite faster and removes flakiness from timing-dependent assertions.

New test cases:

  • test_repeated_calls_during_async_failure_cooldown_do_not_refresh — verifies that calls during the cooldown period do not trigger additional async refreshes.
  • test_call_after_async_failure_cooldown_refreshes_token_async — verifies that a call after the cooldown elapses triggers a new async refresh that succeeds.
  • test_late_async_refresh_does_not_overwrite_blocking_refresh — verifies that a slow async refresh completing after a blocking refresh does not overwrite the newer token.
  • test_stale_after_is_recomputed_after_blocking_refresh — verifies that _stale_after is recomputed from the refreshed token after a blocking refresh.
  • test_stale_after_computation — verifies that _stale_after is computed correctly for both the dynamic and legacy stale-duration paths.

Previously, a failed asynchronous token refresh would set a `_refresh_err`
flag that completely disabled async refresh until a blocking refresh (on
token expiry) cleared it. This meant the SDK could not recover from
transient errors without first letting the token expire.

Replace this with a 1-minute cooldown: on async failure, advance the
`_stale_after` threshold so the token appears fresh for a short backoff
period, then allow another async attempt. This lets the SDK retry
proactively instead of waiting for a full token expiry.

Supporting changes:
- Store staleness as an absolute `_stale_after` timestamp instead of a
  relative `_stale_duration` offset, simplifying `_token_state()` comparisons.
- Add `_now()` helper to keep naive/aware datetime comparisons consistent.

Rewrite tests to be fully deterministic by introducing a _ManualExecutor
that replaces the real thread pool, eliminating all time.sleep() calls and
thread synchronization. Test suite runtime drops from ~9.5s to <0.1s.
…token

If a token expires while an async refresh is in flight, a blocking
refresh runs and caches a fresh token. Without this guard the async
thread could then overwrite it with an older one. Skip the async
update when the new token expires before the currently cached token.
Also fix stale comments and docstrings left over from the previous
async-disabling mechanism.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@renaudhartert-db renaudhartert-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core idea is sound -- replacing a permanent disable flag with a timed backoff is strictly better. The _ManualExecutor in the tests is a nice improvement too. A few things inline.

- Adapt tests to use new Refreshable.__init__ signature (stale_duration param)
- Replace _refresh_err-based tests with _stale_after cooldown tests
- Replace _stale_duration assertions with _stale_after assertions
- Add test for late async refresh not overwriting a blocking refresh
- Fix _MockRefreshable.refresh() to not mutate self._token directly,
  avoiding a data race between async and blocking refresh threads
@mihaimitrea-db mihaimitrea-db force-pushed the mihaimitrea-db/async-refresh-retry branch from b1919eb to ce28c1d Compare March 11, 2026 16:34
- Reset _stale_after to None at the top of _update_token so it does not
  retain a stale value when a token transitions from having an expiry to
  not having one.
- Replace expiry comparison in the async refresh guard with a monotonic
  _token_generation counter, avoiding the assumption that newer tokens
  always have a later expiry (handles TTL policy changes correctly).
- Document the tz-awareness invariant on _now(): all tokens supplied to
  a single Refreshable must use the same convention (all naive or all aware).
A queued async refresh could still apply retry backoff after a blocking
refresh had already installed a newer token, perturbing the fresh token's
staleness window. Ignore async completions from older token generations and
add a regression test for the late-failure race.
mihaimitrea-db

This comment was marked as resolved.

@github-actions
Copy link
Copy Markdown

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/sdk-py

Inputs:

  • PR number: 1315
  • Commit SHA: b02ac6128ef169ae7ab428eea90dd5e0b74b64c7

Checks will be approved automatically on success.

@mihaimitrea-db mihaimitrea-db added this pull request to the merge queue Mar 19, 2026
Merged via the queue into main with commit 0d0e807 Mar 19, 2026
17 checks passed
@mihaimitrea-db mihaimitrea-db deleted the mihaimitrea-db/async-refresh-retry branch March 19, 2026 09:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants