Skip to content

fix(run): cap retry backoff at max_delay on OverflowError#3226

Draft
adityasingh2400 wants to merge 2 commits into
openai:mainfrom
adityasingh2400:fix/retry-backoff-overflow
Draft

fix(run): cap retry backoff at max_delay on OverflowError#3226
adityasingh2400 wants to merge 2 commits into
openai:mainfrom
adityasingh2400:fix/retry-backoff-overflow

Conversation

@adityasingh2400
Copy link
Copy Markdown
Contributor

Summary

_default_retry_delay in src/agents/run_internal/model_retry.py computes initial_delay * (multiplier ** max(attempt - 1, 0)) and then caps it with min(..., max_delay). The intermediate exponential can overflow Python's float range before the cap is applied, raising OverflowError even though the final value would be immediately clamped.

This happens with realistic-ish misconfigurations:

  • multiplier=100.0, attempt=200 (e.g. an aggressive retry policy on a long-running operation)
  • Default multiplier=2.0, attempt >= 1075 (a stuck retry loop)

The OverflowError escapes the helper and crashes the retry loop instead of just returning max_delay.

The fix catches OverflowError and treats it as saturation - the helper now behaves the same way at the overflow point as it does just before it. Two unit tests cover both scenarios.

Test plan

  • pytest tests/models/test_model_retry.py - all 54 existing + 2 new tests pass
  • ruff check and ruff format --check clean
  • Manual repro: _default_retry_delay(200, ModelRetryBackoffSettings(initial_delay=1.0, max_delay=5.0, multiplier=100.0, jitter=False)) returned OverflowError before, returns 5.0 after

@github-actions github-actions Bot added bug Something isn't working feature:core labels May 8, 2026
@seratch
Copy link
Copy Markdown
Member

seratch commented May 8, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a9c515d1ae

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/run_internal/model_retry.py Outdated
Comment on lines +292 to +295
try:
scaled = initial_delay * (multiplier**exponent)
except OverflowError:
scaled = max_delay
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve zero-delay backoff when exponent overflows

The overflow handler always substitutes max_delay, but initial_delay=0.0 is a valid way to request immediate retries. Once multiplier ** exponent overflows, _default_retry_delay(200, ModelRetryBackoffSettings(initial_delay=0, max_delay=5, multiplier=100, jitter=False)) returns 5.0 even though every non-overflowing attempt returns 0.0, unexpectedly stalling retries.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in b4278a35: Preserve zero initial_delay through the OverflowError handler. When base delay is 0 the cap is 0; otherwise it stays max_delay. Added a regression test for initial_delay=0.

@seratch seratch marked this pull request as draft May 8, 2026 17:35
@seratch seratch removed the bug Something isn't working label May 12, 2026
When _default_retry_delay computes initial_delay * (multiplier ** exponent)
with a large attempt count or multiplier (e.g. multiplier=100, attempt>=200,
or default multiplier=2.0 with attempt>=1075), the intermediate exponential
overflows Python's float range and raises OverflowError - even though the
final value would be immediately capped at max_delay. The exception then
escapes the retry helper and crashes the retry loop.

Catch OverflowError and short-circuit to max_delay so the backoff helper
behaves the same way at the saturation point as it does just before it.
initial_delay=0 is a valid way to request immediate retries; the
OverflowError fallback was substituting max_delay even though every
non-overflowing attempt returns 0. Short-circuit when initial_delay
is 0 so retries stay immediate at any attempt count.
@adityasingh2400 adityasingh2400 force-pushed the fix/retry-backoff-overflow branch from 80b0bc0 to b4278a3 Compare May 19, 2026 23:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants