fix: eliminate retry storm on 429/TPM rate limits (issue #1120) by limityan · Pull Request #1184 · GCWing/BitFun

limityan · 2026-06-13T10:32:38Z

Problem

Issue #1120 reports that complex tasks cause the app to freeze / appear dead for several minutes, requiring an app restart to recover.

Log analysis from the issue's attached logs.zip revealed the root cause: TPM (Tokens Per Minute) rate limiting causes a retry storm of up to 100 attempts.

What happens

1. Provider returns 429 Too Many Requests: TPM limit reached
2. SSE layer retries up to 10 times (with exponential backoff + Retry-After parsing)
3. After the SSE layer exhausts its budget, it returns an error like "failed after 10 attempts: ... 429 ..."
4. Bug: RoundExecutor::is_transient_network_error() sees "429" / "rate limit" in the error text and classifies it as transient
5. RoundExecutor retries by calling send_message_stream() again, which triggers another 10 SSE-layer retries
6. Total: up to 100 attempts (every round-executor retry restarts the 10-attempt SSE budget), during which the app is completely silent for several minutes
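
The multiplication described above can be reproduced with a small simulation. This is only a sketch: `sse_send`, `run_round`, `is_transient_old`, and both budget constants are hypothetical names, not the actual BitFun code, and the outer budget of 10 is an assumption inferred from the "up to 100" figure.

```rust
/// Hypothetical budgets. The SSE figure comes from the description above;
/// the round-executor figure is an assumption inferred from "up to 100".
const SSE_MAX_ATTEMPTS: u32 = 10;
const ROUND_MAX_ATTEMPTS: u32 = 10;

/// Simulates the inner (SSE) layer against a provider that always answers 429.
/// Every attempt counts as one request; the call always ends in an error.
fn sse_send(requests: &mut u32) -> Result<(), String> {
    for _ in 0..SSE_MAX_ATTEMPTS {
        *requests += 1;
    }
    Err(format!(
        "failed after {} attempts: 429 Too Many Requests: TPM limit reached",
        SSE_MAX_ATTEMPTS
    ))
}

/// Old (buggy) classification: any mention of 429 or "rate limit" is transient.
fn is_transient_old(err: &str) -> bool {
    err.contains("429") || err.to_lowercase().contains("rate limit")
}

/// Simulates the outer layer: retry while the classifier says "transient".
/// Returns the total number of requests sent to the provider.
fn run_round(classify: fn(&str) -> bool) -> u32 {
    let mut requests = 0;
    for _ in 0..ROUND_MAX_ATTEMPTS {
        match sse_send(&mut requests) {
            Ok(()) => break,
            Err(e) if classify(&e) => continue, // retry the whole SSE budget
            Err(_) => break,                    // surface the error to the user
        }
    }
    requests
}
```

With the old classifier, `run_round(is_transient_old)` sends 100 requests; a classifier that refuses to retry the exhausted-budget error stops after the first 10.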

Evidence from logs

Session 2476c221 had 86 occurrences of TPM limit reached across a single conversation
Token usage never exceeded 50% — the 80% compression threshold was never reached
The only successful compression was triggered manually by the user

Fix

1. Stop retry-storm at the round executor layer

round_executor.rs — is_transient_network_error() now checks for budget-exhausted error patterns before falling through to keyword matching. To avoid false positives, it requires both "failed after " and "attempts:" to co-occur (the exact format produced by the SSE layer and round executor itself).
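
A minimal sketch of that ordering, assuming the classifier works on the error message as a string. The helper name `is_retry_budget_exhausted` and the keyword list are illustrative assumptions; only the "failed after " / "attempts:" pair and the "429" / "rate limit" keywords come from this description.

```rust
/// True when the message looks like an already-exhausted retry budget.
/// Both markers must be present, so an unrelated message that merely
/// contains "failed after " is not caught by accident.
fn is_retry_budget_exhausted(msg: &str) -> bool {
    msg.contains("failed after ") && msg.contains("attempts:")
}

/// Sketch of the classifier: the budget check runs first, keyword matching second.
fn is_transient_network_error(msg: &str) -> bool {
    if is_retry_budget_exhausted(msg) {
        // A lower layer already retried; retrying again would multiply attempts.
        return false;
    }
    // Illustrative keyword list; the real list in round_executor.rs may differ.
    const KEYWORDS: [&str; 4] = ["429", "rate limit", "timeout", "connection reset"];
    let lower = msg.to_lowercase();
    KEYWORDS.iter().any(|k| lower.contains(k))
}
```

Under these assumptions, "failed after 10 attempts: ... 429 ..." is no longer retried, while a bare "429 Too Many Requests" or a message such as "request failed after a timeout" still counts as transient.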

2. Raise Retry-After cap to 60s

sse.rs — MAX_RETRY_AFTER_DELAY_MS raised from 10s to 60s. Some providers (e.g. NVIDIA integrate API) return Retry-After values of 30-60s for TPM limits. The 10s cap caused tight retry loops that burned through the request budget without actually waiting for the TPM window to reset.

The existing fallback (exponential backoff when Retry-After header is absent) is unchanged and still works correctly.
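
How the cap and the fallback interact can be sketched as follows. Only `MAX_RETRY_AFTER_DELAY_MS` and its new value come from this PR; the function names, the backoff constants, and the choice to handle only the delta-seconds form of Retry-After are assumptions made for illustration.

```rust
use std::time::Duration;

const MAX_RETRY_AFTER_DELAY_MS: u64 = 60_000; // raised from 10_000 by this PR
const BASE_BACKOFF_MS: u64 = 500; // hypothetical value
const MAX_BACKOFF_MS: u64 = 30_000; // hypothetical value

/// Parses a Retry-After header given in delta-seconds form.
/// The HTTP-date form is not handled in this sketch and yields None.
fn parse_retry_after_ms(header: Option<&str>) -> Option<u64> {
    header?
        .trim()
        .parse::<u64>()
        .ok()
        .map(|secs| secs.saturating_mul(1000))
}

/// Picks the delay before the next attempt (attempt is zero-based).
fn retry_delay(retry_after: Option<&str>, attempt: u32) -> Duration {
    let ms = match parse_retry_after_ms(retry_after) {
        // Honor the provider's hint, clamped to the cap.
        Some(ms) => ms.min(MAX_RETRY_AFTER_DELAY_MS),
        // No usable header: exponential backoff, also clamped.
        None => {
            let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
            BASE_BACKOFF_MS.saturating_mul(factor).min(MAX_BACKOFF_MS)
        }
    };
    Duration::from_millis(ms)
}
```

With the old 10s cap, a provider hint of 45 seconds would have been cut to 10 seconds, so several attempts would land inside the same TPM window. With the 60s cap the hint is honored in full.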

3. Improve rate limit error messages

Locale resources (en-US, zh-CN, zh-TW) updated to mention TPM as a possible cause and give actionable guidance.

What was deliberately NOT changed

No TPM-aware compression threshold adjustment: TPM limits are account-level, not session-level. Lowering the compression threshold would harm all users, because more frequent compression causes context loss, which degrades model performance. The compression call itself also consumes tokens, so it would add to TPM pressure rather than relieve it.
No new event types added: The existing DialogTurnFailed event with ErrorCategory::RateLimit already flows to the frontend, which has wait_and_retry / switch_model action buttons. With the retry storm fixed, users now get this feedback within seconds instead of after minutes of silence.

Verification

cargo test -p bitfun-core --lib -- round_executor::tests — 12 tests pass (5 new)
cargo test -p bitfun-ai-adapters --lib -- sse — 11 tests pass (1 new)
pnpm run type-check:web — pass
pnpm run i18n:audit — pass
pnpm run fmt:rs — applied

Fixes #1120

fix: eliminate retry storm on 429/TPM rate limits (issue GCWing#1120)

d6831c2

limityan mentioned this pull request Jun 13, 2026

[bug] Context management needs to be strengthened #1120

Closed

limityan merged commit 00e03b2 into GCWing:main Jun 13, 2026
4 checks passed

limityan merged 1 commit into GCWing:main from limityan:fix/issue-1120-rate-limit-retry-storm
