fix: eliminate retry storm on 429/TPM rate limits (issue #1120) #1184
Merged
limityan merged 1 commit on Jun 13, 2026
Problem
Issue #1120 reports that complex tasks cause the app to freeze / appear dead for several minutes, requiring an app restart to recover.
Log analysis of the logs.zip attached to the issue revealed the root cause: TPM (Tokens Per Minute) rate limiting causes a retry storm of up to 100 attempts.
What happens
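- The provider responds with 429 Too Many Requests: TPM limit reached.
- The SSE layer retries until its budget is spent, then returns "failed after 10 attempts: ... 429 ...".
- RoundExecutor::is_transient_network_error() sees "429" / "rate limit" in the error text and classifies it as transient.
- The round executor calls send_message_stream() again, which triggers another 10 SSE-layer retries. Each round-level retry multiplies the SSE-layer retries, which is how a single turn reaches up to 100 attempts.
Evidence from logs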
2476c221 had 86 occurrences of "TPM limit reached" across a single conversation
Fix
1. Stop retry-storm at the round executor layer
round_executor.rs: is_transient_network_error() now checks for budget-exhausted error patterns before falling through to keyword matching. To avoid false positives, it requires both "failed after " and "attempts:" to co-occur (the exact format produced by the SSE layer and the round executor itself).
2. Raise Retry-After cap to 60s
sse.rs: MAX_RETRY_AFTER_DELAY_MS raised from 10s to 60s. Some providers (e.g. NVIDIA integrate API) return Retry-After values of 30-60s for TPM limits. The 10s cap caused tight retry loops that burned through the request budget without actually waiting for the TPM window to reset.
The existing fallback (exponential backoff when the Retry-After header is absent) is unchanged and still works correctly.
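For reviewers who want a concrete picture of changes 1 and 2, here is a minimal sketch, not the code from this PR. The helper is_retry_budget_exhausted, the retry_delay function, the BASE_BACKOFF_MS value, the "connection reset" keyword, and the signatures are all assumptions made for illustration. Only the "failed after " / "attempts:" pattern, the "429" / "rate limit" keywords, and the 60s cap come from the description above.

```rust
use std::time::Duration;

/// Cap on honored Retry-After values (60s, as described in change 2).
const MAX_RETRY_AFTER_DELAY_MS: u64 = 60_000;
/// Assumed base delay for the fallback backoff; the real value is not stated in the PR.
const BASE_BACKOFF_MS: u64 = 500;

/// Hypothetical helper: true when the message looks like a retry budget that
/// has already been spent, i.e. it contains both "failed after " and "attempts:".
fn is_retry_budget_exhausted(msg: &str) -> bool {
    msg.contains("failed after ") && msg.contains("attempts:")
}

/// Simplified stand-in for the classification in round_executor.rs.
/// The budget-exhausted check runs before keyword matching, so an error that
/// already went through a full retry cycle is never retried again.
fn is_transient_network_error(msg: &str) -> bool {
    if is_retry_budget_exhausted(msg) {
        return false;
    }
    let lower = msg.to_lowercase();
    // "429" and "rate limit" come from the PR text; "connection reset" is an assumed extra keyword.
    ["429", "rate limit", "connection reset"]
        .iter()
        .any(|keyword| lower.contains(keyword))
}

/// Hypothetical delay calculation: honor Retry-After (in seconds) up to the cap,
/// otherwise fall back to capped exponential backoff.
fn retry_delay(retry_after_secs: Option<u64>, attempt: u32) -> Duration {
    let ms = match retry_after_secs {
        Some(secs) => secs.saturating_mul(1_000).min(MAX_RETRY_AFTER_DELAY_MS),
        None => BASE_BACKOFF_MS
            .saturating_mul(1u64 << attempt.min(16))
            .min(MAX_RETRY_AFTER_DELAY_MS),
    };
    Duration::from_millis(ms)
}
```

Under these assumptions, a raw 429 message is still treated as transient, while "failed after 10 attempts: ... 429 ..." is not, and a Retry-After of 45 seconds is honored in full instead of being cut to 10 seconds.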
3. Improve rate limit error messages
Locale resources (en-US, zh-CN, zh-TW) updated to mention TPM as a possible cause and give actionable guidance.
What was deliberately NOT changed
The DialogTurnFailed event with ErrorCategory::RateLimit already flows to the frontend, which has wait_and_retry / switch_model action buttons. With the retry storm fixed, users now get this feedback within seconds instead of after minutes of silence.
Verification
- cargo test -p bitfun-core --lib -- round_executor::tests - 12 tests pass (5 new)
- cargo test -p bitfun-ai-adapters --lib -- sse - 11 tests pass (1 new)
- pnpm run type-check:web - pass
- pnpm run i18n:audit - pass
- pnpm run fmt:rs - applied
Fixes #1120