Fix long overflow in CachedSupplier.maxStaleFailureJitter() that permanently disables credential refresh after 58 consecutive failures #6806

Open
vinubhagavath-dev wants to merge 1 commit into aws:master from vinubhagavath-dev:fix-cached-supplier-overflow

Description

CachedSupplier.maxStaleFailureJitter() has a long overflow that permanently disables credential refresh after 58 consecutive failures.

When numFailures reaches 58, (1L << 57) * 100 overflows a signed long: 2^57 * 100 is roughly 1.44e19, above Long.MAX_VALUE (roughly 9.22e18), so the product wraps to a negative value.
ComparableUtils.minimum(negativeDuration, 10s) returns the negative duration (since it is "less than" 10s), which flows into
jitterTime() and produces a stale time millions of years in the future. The SDK never attempts to refresh credentials again.
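
The wraparound can be reproduced in isolation. The sketch below assumes the backoff is computed as `(1L << (numFailures - 1)) * 100` milliseconds, which is inferred from this description rather than copied from the SDK source. `backoffMillis` and `minimum` are hypothetical stand-ins, the latter for `ComparableUtils.minimum()`.

```java
import java.time.Duration;

final class OverflowDemo {
    // Assumed shape of the backoff arithmetic (not taken from the SDK source).
    static long backoffMillis(int numFailures) {
        return (1L << (numFailures - 1)) * 100;
    }

    // Hypothetical stand-in for ComparableUtils.minimum(): returns the smaller Duration.
    static Duration minimum(Duration a, Duration b) {
        return a.compareTo(b) <= 0 ? a : b;
    }

    public static void main(String[] args) {
        Duration cap = Duration.ofSeconds(10);
        for (int n : new int[] {7, 8, 57, 58}) {
            long millis = backoffMillis(n);
            Duration capped = minimum(Duration.ofMillis(millis), cap);
            // For n = 58 the product wraps negative, so the "minimum" is the negative value.
            System.out.println("failures=" + n + " millis=" + millis + " capped=" + capped);
        }
    }
}
```

At 57 failures the product is still positive and the cap applies. At 58 it is negative, so the comparison picks it over the 10-second cap.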

Observed in production

During the recent outage, InstanceProfileCredentialsProvider with StaleValueBehavior.ALLOW hit 58 consecutive failures:

(consecutive failures: 57) Cached value expiration has been extended to 2026-03-18T04:45:56Z
(consecutive failures: 58) Cached value expiration has been extended to +75191085-06-13T12:51:54Z

A full application restart was required to restore credential refresh.

Fix

Clamp overflowed negative values to Long.MAX_VALUE, ensuring ComparableUtils.minimum() always caps at 10 seconds as intended.
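
A minimal sketch of how such a clamp might look, assuming the same backoff formula as in the description. This is an illustration, not the patch itself; the class name `ClampedBackoff` and the constant `CAP` are hypothetical, and the inline comparison stands in for `ComparableUtils.minimum()`.

```java
import java.time.Duration;

final class ClampedBackoff {
    static final Duration CAP = Duration.ofSeconds(10);

    static Duration maxStaleFailureJitter(int numFailures) {
        long millis = (1L << (numFailures - 1)) * 100;
        if (millis < 0) {
            // Overflow wrapped to a negative value: treat it as "very large"
            // so the 10-second cap below wins the comparison.
            millis = Long.MAX_VALUE;
        }
        Duration backoff = Duration.ofMillis(millis);
        // Stand-in for ComparableUtils.minimum(backoff, CAP).
        return backoff.compareTo(CAP) <= 0 ? backoff : CAP;
    }
}
```

With the clamp in place, `maxStaleFailureJitter(58)` yields the 10-second cap instead of a negative duration, while small failure counts such as 1 still yield 100 ms.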

Testing

Verified that for numFailures values 1 through 100, maxStaleFailureJitter() always returns a positive duration ≤ 10 seconds.
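
A range check of this kind could be sketched as follows. It reuses a hypothetical clamped calculation based on the formula in the description, not the SDK's actual method, and `countViolations` is an illustrative helper. In this sketch the check is for a non-negative result: the wrapped product is exactly zero at numFailures 63 and 64, and the patch itself may handle that case differently.

```java
import java.time.Duration;

final class JitterRangeCheck {
    static final Duration CAP = Duration.ofSeconds(10);

    // Hypothetical clamped calculation (assumed formula, not the SDK source).
    static Duration maxStaleFailureJitter(int numFailures) {
        long millis = (1L << (numFailures - 1)) * 100;
        if (millis < 0) {
            millis = Long.MAX_VALUE;
        }
        Duration backoff = Duration.ofMillis(millis);
        return backoff.compareTo(CAP) <= 0 ? backoff : CAP;
    }

    // Counts failure counts in [from, to] whose bound is negative or above the cap.
    static int countViolations(int from, int to) {
        int violations = 0;
        for (int n = from; n <= to; n++) {
            Duration d = maxStaleFailureJitter(n);
            if (d.isNegative() || d.compareTo(CAP) > 0) {
                violations++;
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        System.out.println("violations: " + countViolations(1, 100));
    }
}
```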

When numFailures reaches 58, (1L << 57) * 100 overflows signed long,
producing a negative duration that bypasses the 10-second cap in
ComparableUtils.minimum(). This permanently sets the cached value's
stale time to millions of years in the future, preventing any further
credential refresh for the lifetime of the process.

Clamp overflowed values to Long.MAX_VALUE so the 10-second cap is
always respected.

vinubhagavath-dev requested a review from a team as a code owner on March 20, 2026, 16:52