Replace async-disabling mechanism with retry backoff on refresh failure#1315
Merged
mihaimitrea-db merged 8 commits intomainfrom Mar 19, 2026
Merged
Replace async-disabling mechanism with retry backoff on refresh failure#1315mihaimitrea-db merged 8 commits intomainfrom
mihaimitrea-db merged 8 commits intomainfrom
Conversation
Previously, a failed asynchronous token refresh would set a `_refresh_err` flag that completely disabled async refresh until a blocking refresh (on token expiry) cleared it. This meant the SDK could not recover from transient errors without first letting the token expire. Replace this with a 1-minute cooldown: on async failure, advance the `_stale_after` threshold so the token appears fresh for a short backoff period, then allow another async attempt. This lets the SDK retry proactively instead of waiting for a full token expiry. Supporting changes: - Store staleness as an absolute `_stale_after` timestamp instead of a relative `_stale_duration` offset, simplifying `_token_state()` comparisons. - Add `_now()` helper to keep naive/aware datetime comparisons consistent. Rewrite tests to be fully deterministic by introducing a _ManualExecutor that replaces the real thread pool, eliminating all time.sleep() calls and thread synchronization. Test suite runtime drops from ~9.5s to <0.1s.
…token If a token expires while an async refresh is in flight, a blocking refresh runs and caches a fresh token. Without this guard the async thread could then overwrite it with an older one. Skip the async update when the new token expires before the currently cached token. Also fix stale comments and docstrings left over from the previous async-disabling mechanism. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
renaudhartert-db
requested changes
Mar 11, 2026
Contributor
renaudhartert-db
left a comment
There was a problem hiding this comment.
The core idea is sound -- replacing a permanent disable flag with a timed backoff is strictly better. The _ManualExecutor in the tests is a nice improvement too. A few things inline.
- Adapt tests to use new Refreshable.__init__ signature (stale_duration param) - Replace _refresh_err-based tests with _stale_after cooldown tests - Replace _stale_duration assertions with _stale_after assertions - Add test for late async refresh not overwriting a blocking refresh - Fix _MockRefreshable.refresh() to not mutate self._token directly, avoiding a data race between async and blocking refresh threads
b1919eb to
ce28c1d
Compare
- Reset _stale_after to None at the top of _update_token so it does not retain a stale value when a token transitions from having an expiry to not having one. - Replace expiry comparison in the async refresh guard with a monotonic _token_generation counter, avoiding the assumption that newer tokens always have a later expiry (handles TTL policy changes correctly). - Document the tz-awareness invariant on _now(): all tokens supplied to a single Refreshable must use the same convention (all naive or all aware).
A queued async refresh could still apply retry backoff after a blocking refresh had already installed a newer token, perturbing the fresh token's staleness window. Ignore async completions from older token generations and add a regression test for the late-failure race.
|
If integration tests don't run automatically, an authorized user can run them manually by following the instructions below: Trigger: Inputs:
Checks will be approved automatically on success. |
renaudhartert-db
approved these changes
Mar 18, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replace the async-disabling mechanism on token refresh failure with a 1-minute retry backoff, allowing the SDK to recover from transient errors without waiting for a full token expiry.
Why
When an asynchronous token refresh failed, the
Refreshableclass set a_refresh_errflag that completely disabled async refresh. The only way to clear this flag was through a blocking refresh, which only triggers when the token fully expires. This meant the SDK could not recover from transient refresh failures (e.g. a brief network blip) until the token expired — potentially tens of minutes later — even though the underlying issue may have resolved in seconds.This PR replaces the binary disable flag with a short cooldown: after a failed async refresh, the
_stale_afterthreshold is pushed 1 minute into the future so the token appears fresh for a brief backoff period. Once the cooldown elapses the token becomes stale again and a new async refresh is attempted, giving the SDK a chance to recover proactively.What changed
Interface changes
None.
Behavioral changes
_ASYNC_REFRESH_RETRY_BACKOFF) and then retries the async refresh. This makes token refresh more resilient to transient errors.Internal changes
_stale_afterreplaces_stale_duration— Staleness is now tracked as an absolute timestamp (_stale_after) instead of a relativetimedelta(_stale_duration). This simplifies_token_state()to a direct comparison rather than computingexpiry - nowand comparing against a duration._handle_failed_async_refresh()— New method that advances_stale_afterby the backoff period, replacing the_refresh_errflag._now()helper — Centralises "current time" so that naive and timezone-awaredatetimeobjects from different token sources are compared consistently._use_dynamic_stale_durationrenamed to_use_legacy_stale_duration— Inverted boolean to clarify intent: the legacy path is the one where callers supply an explicitstale_duration._MockRefreshable.refresh()no longer mutatesself._token— The mock now returns the token without settingself._tokenas a side effect, avoiding a data race between async and blocking refresh threads. The production code's_update_tokenhandles storage.How is this tested?
Tests are rewritten to be fully deterministic by introducing a
_ManualExecutorthat replaces the realThreadPoolExecutor. Async refreshes are queued but only execute whenexecutor.run_all()is called, eliminating alltime.sleep()calls and thread synchronization from async-path tests. This makes the test suite faster and removes flakiness from timing-dependent assertions.New test cases:
test_repeated_calls_during_async_failure_cooldown_do_not_refresh— verifies that calls during the cooldown period do not trigger additional async refreshes.test_call_after_async_failure_cooldown_refreshes_token_async— verifies that a call after the cooldown elapses triggers a new async refresh that succeeds.test_late_async_refresh_does_not_overwrite_blocking_refresh— verifies that a slow async refresh completing after a blocking refresh does not overwrite the newer token.test_stale_after_is_recomputed_after_blocking_refresh— verifies that_stale_afteris recomputed from the refreshed token after a blocking refresh.test_stale_after_computation— verifies that_stale_afteris computed correctly for both the dynamic and legacy stale-duration paths.