Replace async-disabling mechanism with retry backoff on refresh failure by mihaimitrea-db · Pull Request #1315 · databricks/databricks-sdk-py

mihaimitrea-db · 2026-03-10T09:41:09Z

Summary

Replace the async-disabling mechanism on token refresh failure with a 1-minute retry backoff, allowing the SDK to recover from transient errors without waiting for a full token expiry.

Why

When an asynchronous token refresh failed, the Refreshable class set a _refresh_err flag that completely disabled async refresh. The only way to clear this flag was through a blocking refresh, which only triggers when the token fully expires. This meant the SDK could not recover from transient refresh failures (e.g. a brief network blip) until the token expired — potentially tens of minutes later — even though the underlying issue may have resolved in seconds.

This PR replaces the binary disable flag with a short cooldown: after a failed async refresh, the _stale_after threshold is pushed 1 minute into the future so the token appears fresh for a brief backoff period. Once the cooldown elapses the token becomes stale again and a new async refresh is attempted, giving the SDK a chance to recover proactively.
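The cooldown described above can be sketched in a few lines. This is a minimal illustration, not the SDK's code: `StalenessTracker` and its methods are hypothetical stand-ins for the relevant part of `Refreshable`, and computing the new threshold as "now plus backoff" is an assumption about how the threshold is pushed forward. Only the constant name `_ASYNC_REFRESH_RETRY_BACKOFF` and its 1-minute value come from this PR.

```python
from datetime import datetime, timedelta

# Constant name and value come from the PR description.
_ASYNC_REFRESH_RETRY_BACKOFF = timedelta(minutes=1)


class StalenessTracker:
    """Hypothetical, simplified stand-in for the staleness logic in Refreshable."""

    def __init__(self, stale_after: datetime):
        # Absolute point in time after which the token counts as stale.
        self.stale_after = stale_after

    def is_stale(self, now: datetime) -> bool:
        return now >= self.stale_after

    def handle_failed_async_refresh(self, now: datetime) -> None:
        # Assumption: the threshold moves to "now + backoff", so the token
        # looks fresh during the cooldown and becomes stale again afterwards.
        self.stale_after = now + _ASYNC_REFRESH_RETRY_BACKOFF
```

With this shape, callers during the cooldown see a fresh token and schedule nothing, while the first caller after the cooldown sees a stale token and triggers a new async attempt.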

What changed

Interface changes

None.

Behavioral changes

Async refresh retry on failure — Previously, a failed async refresh disabled all future async attempts until a blocking refresh on expiry. Now, the SDK waits 1 minute (_ASYNC_REFRESH_RETRY_BACKOFF) and then retries the async refresh. This makes token refresh more resilient to transient errors.
Late async result guard — When a slow async refresh completes after a blocking refresh already obtained a newer token, the stale async result is now discarded instead of overwriting the fresher token.

Internal changes

_stale_after replaces _stale_duration — Staleness is now tracked as an absolute timestamp (_stale_after) instead of a relative timedelta (_stale_duration). This simplifies _token_state() to a direct comparison rather than computing expiry - now and comparing against a duration.
_handle_failed_async_refresh() — New method that advances _stale_after by the backoff period, replacing the _refresh_err flag.
_now() helper — Centralises "current time" so that naive and timezone-aware datetime objects from different token sources are compared consistently.
_use_dynamic_stale_duration renamed to _use_legacy_stale_duration — Inverted boolean to clarify intent: the legacy path is the one where callers supply an explicit stale_duration.
_MockRefreshable.refresh() no longer mutates self._token — The mock now returns the token without setting self._token as a side effect, avoiding a data race between async and blocking refresh threads. The production code's _update_token handles storage.
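To illustrate the first and third items above, here is a rough sketch of a state check driven by an absolute `_stale_after`-style timestamp, together with a "current time" helper that matches the token's datetime convention. The names `TokenState`, `token_state`, and `now_like` are hypothetical, and how the real `_now()` chooses between naive and aware datetimes is not shown in this PR text, so the rule used here is an assumption.

```python
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class TokenState(Enum):
    """Hypothetical state names, used only for this sketch."""

    FRESH = 1
    STALE = 2
    EXPIRED = 3


def now_like(expiry: datetime) -> datetime:
    """Return the current time in the same convention (naive or aware) as expiry.

    Assumption: naive expiries are treated as local time.
    """
    if expiry.tzinfo is None:
        return datetime.now()
    return datetime.now(tz=timezone.utc)


def token_state(expiry: datetime, stale_after: Optional[datetime], now: datetime) -> TokenState:
    # With an absolute threshold, each check is a single comparison.
    if now >= expiry:
        return TokenState.EXPIRED
    if stale_after is not None and now >= stale_after:
        return TokenState.STALE
    return TokenState.FRESH
```

Mixing naive and aware datetimes in one comparison raises `TypeError` in Python, which is why a helper of this kind has to follow a single convention per token source.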

How is this tested?

Tests are rewritten to be fully deterministic by introducing a _ManualExecutor that replaces the real ThreadPoolExecutor. Async refreshes are queued but only execute when executor.run_all() is called, eliminating all time.sleep() calls and thread synchronization from async-path tests. This makes the test suite faster and removes flakiness from timing-dependent assertions.
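A deterministic executor of the kind described above might look roughly like the following. This is a sketch under assumptions: `ManualExecutor` is a hypothetical stand-in for the PR's `_ManualExecutor`, whose actual implementation is not shown here. Unlike `ThreadPoolExecutor.submit`, this version does not return a `Future`.

```python
from typing import Any, Callable, Dict, List, Tuple


class ManualExecutor:
    """Hypothetical stand-in: queues submitted work until run_all() is called."""

    def __init__(self) -> None:
        self._pending: List[Tuple[Callable[..., Any], Tuple[Any, ...], Dict[str, Any]]] = []

    def submit(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> None:
        # Record the call instead of running it on a worker thread.
        self._pending.append((fn, args, kwargs))

    def run_all(self) -> int:
        """Run every queued task in submission order and return how many ran."""
        ran = 0
        while self._pending:
            fn, args, kwargs = self._pending.pop(0)
            fn(*args, **kwargs)
            ran += 1
        return ran
```

Because nothing runs until the test calls `run_all()`, a test can assert on the state before and after an async refresh without sleeping or synchronizing threads.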

New test cases:

test_repeated_calls_during_async_failure_cooldown_do_not_refresh — verifies that calls during the cooldown period do not trigger additional async refreshes.
test_call_after_async_failure_cooldown_refreshes_token_async — verifies that a call after the cooldown elapses triggers a new async refresh that succeeds.
test_late_async_refresh_does_not_overwrite_blocking_refresh — verifies that a slow async refresh completing after a blocking refresh does not overwrite the newer token.
test_stale_after_is_recomputed_after_blocking_refresh — verifies that _stale_after is recomputed from the refreshed token after a blocking refresh.
test_stale_after_computation — verifies that _stale_after is computed correctly for both the dynamic and legacy stale-duration paths.

Previously, a failed asynchronous token refresh would set a `_refresh_err` flag that completely disabled async refresh until a blocking refresh (on token expiry) cleared it. This meant the SDK could not recover from transient errors without first letting the token expire.

Replace this with a 1-minute cooldown: on async failure, advance the `_stale_after` threshold so the token appears fresh for a short backoff period, then allow another async attempt. This lets the SDK retry proactively instead of waiting for a full token expiry.

Supporting changes:

- Store staleness as an absolute `_stale_after` timestamp instead of a relative `_stale_duration` offset, simplifying `_token_state()` comparisons.
- Add `_now()` helper to keep naive/aware datetime comparisons consistent.

Rewrite tests to be fully deterministic by introducing a `_ManualExecutor` that replaces the real thread pool, eliminating all `time.sleep()` calls and thread synchronization. Test suite runtime drops from ~9.5s to <0.1s.

…token

If a token expires while an async refresh is in flight, a blocking refresh runs and caches a fresh token. Without this guard the async thread could then overwrite it with an older one. Skip the async update when the new token expires before the currently cached token.

Also fix stale comments and docstrings left over from the previous async-disabling mechanism.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

renaudhartert-db

The core idea is sound -- replacing a permanent disable flag with a timed backoff is strictly better. The _ManualExecutor in the tests is a nice improvement too. A few things inline.

databricks/sdk/oauth.py

- Adapt tests to use new Refreshable.__init__ signature (stale_duration param)
- Replace _refresh_err-based tests with _stale_after cooldown tests
- Replace _stale_duration assertions with _stale_after assertions
- Add test for late async refresh not overwriting a blocking refresh
- Fix _MockRefreshable.refresh() to not mutate self._token directly, avoiding a data race between async and blocking refresh threads

- Reset _stale_after to None at the top of _update_token so it does not retain a stale value when a token transitions from having an expiry to not having one.
- Replace expiry comparison in the async refresh guard with a monotonic _token_generation counter, avoiding the assumption that newer tokens always have a later expiry (handles TTL policy changes correctly).
- Document the tz-awareness invariant on _now(): all tokens supplied to a single Refreshable must use the same convention (all naive or all aware).

A queued async refresh could still apply retry backoff after a blocking refresh had already installed a newer token, perturbing the fresh token's staleness window. Ignore async completions from older token generations and add a regression test for the late-failure race.
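The generation-based guard from the commits above can be sketched as follows. This is an illustrative outline only: `GenerationGuardedCache` and its method names are hypothetical, and the way the real `Refreshable` records and checks `_token_generation` may differ. The idea shown is that an async task remembers the generation it started from and is ignored if a blocking refresh has bumped the counter in the meantime.

```python
import threading
from typing import Optional


class GenerationGuardedCache:
    """Hypothetical token cache guarded by a monotonic generation counter."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._generation = 0
        self.token: Optional[str] = None

    def snapshot_generation(self) -> int:
        # An async refresh records this value when it is scheduled.
        with self._lock:
            return self._generation

    def update(self, token: str) -> None:
        # Blocking path: always installs the token and bumps the generation.
        with self._lock:
            self.token = token
            self._generation += 1

    def update_if_current(self, token: str, started_at_generation: int) -> bool:
        # Async path: apply the result only if no newer token was installed
        # since the refresh started; otherwise discard it.
        with self._lock:
            if started_at_generation != self._generation:
                return False
            self.token = token
            self._generation += 1
            return True
```

The same generation check could gate the failure path, so that a late async failure does not apply retry backoff to a token installed by a newer blocking refresh.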

github-actions · 2026-03-18T13:52:20Z

If integration tests don't run automatically, an authorized user can run them manually by following the instructions below:

Trigger:
go/deco-tests-run/sdk-py

Inputs:

PR number: 1315
Commit SHA: b02ac6128ef169ae7ab428eea90dd5e0b74b64c7

Checks will be approved automatically on success.

mihaimitrea-db temporarily deployed to test-trigger-is March 10, 2026 09:41 — with GitHub Actions Inactive

Fix linting errors and add changelog description

624e76b

mihaimitrea-db temporarily deployed to test-trigger-is March 10, 2026 10:02 — with GitHub Actions Inactive

mihaimitrea-db temporarily deployed to test-trigger-is March 10, 2026 10:03 — with GitHub Actions Inactive

mihaimitrea-db temporarily deployed to test-trigger-is March 10, 2026 10:45 — with GitHub Actions Inactive

mihaimitrea-db requested a review from renaudhartert-db March 10, 2026 11:43

mihaimitrea-db self-assigned this Mar 10, 2026

mihaimitrea-db temporarily deployed to test-trigger-is March 11, 2026 11:57 — with GitHub Actions Inactive

renaudhartert-db requested changes Mar 11, 2026

databricks/sdk/oauth.py (resolved)

databricks/sdk/oauth.py (outdated, resolved)

databricks/sdk/oauth.py (resolved)

mihaimitrea-db force-pushed the mihaimitrea-db/async-refresh-retry branch from b1919eb to ce28c1d Compare March 11, 2026 16:34

mihaimitrea-db temporarily deployed to test-trigger-is March 11, 2026 16:34 — with GitHub Actions Inactive

mihaimitrea-db temporarily deployed to test-trigger-is March 11, 2026 16:40 — with GitHub Actions Inactive

Fix black formatting in test_refreshable.py

108ebec

mihaimitrea-db temporarily deployed to test-trigger-is March 11, 2026 16:46 — with GitHub Actions Inactive

mihaimitrea-db temporarily deployed to test-trigger-is March 11, 2026 16:47 — with GitHub Actions Inactive

mihaimitrea-db temporarily deployed to test-trigger-is March 13, 2026 09:59 — with GitHub Actions Inactive

mihaimitrea-db requested a review from renaudhartert-db March 13, 2026 14:44

Clean up Refreshable: trim docstrings, cache _now(), and improve logging

b02ac61

mihaimitrea-db temporarily deployed to test-trigger-is March 18, 2026 13:52 — with GitHub Actions Inactive

mihaimitrea-db requested a review from hectorcast-db March 18, 2026 15:02

renaudhartert-db approved these changes Mar 18, 2026

mihaimitrea-db added this pull request to the merge queue Mar 19, 2026

Merged via the queue into main with commit 0d0e807 Mar 19, 2026
17 checks passed

mihaimitrea-db deleted the mihaimitrea-db/async-refresh-retry branch March 19, 2026 09:24

mihaimitrea-db merged 8 commits into main from mihaimitrea-db/async-refresh-retry
