fix: eliminate retry storm on 429/TPM rate limits (issue #1120) #1184

Merged: limityan merged 1 commit into GCWing:main from limityan:fix/issue-1120-rate-limit-retry-storm on Jun 13, 2026

@limityan
Collaborator

Problem

Issue #1120 reports that complex tasks cause the app to freeze or appear unresponsive for several minutes, requiring an app restart to recover.

Analysis of the logs.zip attached to the issue revealed the root cause: TPM (Tokens Per Minute) rate limiting triggers a retry storm of up to 100 attempts.

What happens

  1. Provider returns 429 Too Many Requests: TPM limit reached
  2. SSE layer retries up to 10 times (with exponential backoff + Retry-After parsing)
  3. After SSE layer exhausts its budget, it returns an error like "failed after 10 attempts: ... 429 ..."
  4. Bug: RoundExecutor::is_transient_network_error() sees "429" / "rate limit" in the error text and classifies it as transient
  5. RoundExecutor then retries by calling send_message_stream() again, which triggers another 10 SSE-layer retries
  6. Total: up to 100 retries, lasting several minutes of complete silence
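The multiplication in the steps above can be sketched as two nested retry loops. This is a simplified model, not the actual code: it assumes every request is rejected with 429, and it assumes a round-executor budget of 10 attempts, which is what the "up to 100" total implies. `total_requests` and both constants are hypothetical names.

```rust
/// The SSE budget of 10 comes from this PR; the round-executor budget of 10
/// is an assumption consistent with the "up to 100 retries" total.
const SSE_MAX_ATTEMPTS: u32 = 10;
const ROUND_MAX_ATTEMPTS: u32 = 10;

/// Hypothetical model: counts HTTP requests sent when every request gets a 429.
/// `round_retries_rate_limit = true` models the old (buggy) classification.
fn total_requests(round_retries_rate_limit: bool) -> u32 {
    let mut requests = 0;
    for _round_attempt in 0..ROUND_MAX_ATTEMPTS {
        // Inner SSE layer: every attempt is rate-limited, so it exhausts its budget.
        for _sse_attempt in 0..SSE_MAX_ATTEMPTS {
            requests += 1;
        }
        // The SSE layer now reports "failed after 10 attempts: ... 429 ...".
        if !round_retries_rate_limit {
            break; // fixed behavior: a budget-exhausted error is not retried again
        }
    }
    requests
}
```

Under this model the old behavior sends 100 requests before surfacing an error, while the fixed behavior stops after the SSE layer's 10.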

Evidence from logs

  • Session 2476c221 had 86 occurrences of TPM limit reached across a single conversation
  • Token usage never exceeded 50%, so the 80% compression threshold was never reached
  • The only successful compression was triggered manually by the user

Fix

1. Stop the retry storm at the round-executor layer

round_executor.rs: is_transient_network_error() now checks for budget-exhausted error patterns before falling through to keyword matching. To avoid false positives, it requires both "failed after " and "attempts:" to co-occur (the exact format produced by the SSE layer and by the round executor itself).
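A minimal sketch of the classification order described above, assuming the error is available as a plain string. Only the function name and the two co-occurring fragments come from this PR; the keyword list, the `is_budget_exhausted` helper, and the `&str` signature are hypothetical.

```rust
/// Hypothetical keyword list; the real list in round_executor.rs is not shown in this PR.
const TRANSIENT_KEYWORDS: &[&str] = &["429", "rate limit", "timeout", "connection reset"];

/// True when a lower layer reports that it already spent its retry budget.
/// Both fragments must co-occur, so a message containing only one does not match.
fn is_budget_exhausted(msg: &str) -> bool {
    msg.contains("failed after ") && msg.contains("attempts:")
}

fn is_transient_network_error(msg: &str) -> bool {
    let lower = msg.to_lowercase();
    // Check budget exhaustion first, before keyword matching can see "429".
    if is_budget_exhausted(&lower) {
        return false;
    }
    TRANSIENT_KEYWORDS.iter().any(|kw| lower.contains(kw))
}
```

A bare 429 is still treated as transient, but once it is wrapped in "failed after N attempts:" the round executor stops retrying and lets the error surface.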

2. Raise Retry-After cap to 60s

sse.rs: MAX_RETRY_AFTER_DELAY_MS raised from 10s to 60s. Some providers (e.g. the NVIDIA integrate API) return Retry-After values of 30-60s for TPM limits. The 10s cap caused tight retry loops that burned through the request budget without actually waiting for the TPM window to reset.
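The effect of the cap can be sketched as follows. This assumes Retry-After carries delta-seconds (HTTP also allows an HTTP-date, which this sketch simply rejects). Only the constant name and its 60s value come from this PR; `retry_after_delay` is a hypothetical helper.

```rust
use std::time::Duration;

/// Cap from this PR: raised from 10_000 ms to 60_000 ms.
const MAX_RETRY_AFTER_DELAY_MS: u64 = 60_000;

/// Hypothetical helper: parses a delta-seconds Retry-After value (e.g. "45")
/// and clamps it to the cap. HTTP-date values are not handled and yield None.
fn retry_after_delay(header: &str) -> Option<Duration> {
    let secs: u64 = header.trim().parse().ok()?;
    let ms = secs.saturating_mul(1000).min(MAX_RETRY_AFTER_DELAY_MS);
    Some(Duration::from_millis(ms))
}
```

With the old 10s cap, a Retry-After of 45 would have been shortened to 10s, so the retry arrived while the TPM window was still exhausted; with the 60s cap it is honored as sent.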

The existing fallback (exponential backoff when Retry-After header is absent) is unchanged and still works correctly.
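For comparison, a sketch of how a present Retry-After value can take precedence over the exponential-backoff fallback. The base delay, the backoff cap, and `next_delay_ms` are hypothetical, since this PR does not state the values sse.rs uses.

```rust
/// Hypothetical backoff parameters; the actual values in sse.rs are not stated in this PR.
const BASE_DELAY_MS: u64 = 1_000;
const MAX_BACKOFF_MS: u64 = 30_000;

/// Delay before retry number `attempt` (0-based): use the (already clamped)
/// Retry-After value if the provider sent one, otherwise back off exponentially.
fn next_delay_ms(retry_after_ms: Option<u64>, attempt: u32) -> u64 {
    match retry_after_ms {
        Some(ms) => ms,
        None => {
            // 2^attempt grows quickly; limit the shift so it cannot overflow.
            let factor = 1u64 << attempt.min(16);
            BASE_DELAY_MS.saturating_mul(factor).min(MAX_BACKOFF_MS)
        }
    }
}
```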

3. Improve rate limit error messages

Locale resources (en-US, zh-CN, zh-TW) updated to mention TPM as a possible cause and give actionable guidance.

What was deliberately NOT changed

  • No TPM-aware compression threshold adjustment: TPM limits are account-level, not session-level. Lowering the compression threshold would harm all users (more frequent compression causes context loss, which degrades model performance), and the compression call itself consumes tokens, adding to TPM pressure.
  • No new event types added: The existing DialogTurnFailed event with ErrorCategory::RateLimit already flows to the frontend, which has wait_and_retry / switch_model action buttons. With the retry storm fixed, users now get this feedback within seconds instead of after minutes of silence.

Verification

  • cargo test -p bitfun-core --lib -- round_executor::tests — 12 tests pass (5 new)
  • cargo test -p bitfun-ai-adapters --lib -- sse — 11 tests pass (1 new)
  • pnpm run type-check:web — pass
  • pnpm run i18n:audit — pass
  • pnpm run fmt:rs — applied

Fixes #1120

@limityan merged commit 00e03b2 into GCWing:main on Jun 13, 2026
4 checks passed

Development

Successfully merging this pull request may close these issues.

[bug] Context management should be strengthened