
[Bug] LLM call fails with ReadTimeout after 3 retry attempts #43

@Clawiee


Tags: bug, api, performance


Reporter: 董江涵

Description

When the digital employee (Agent) processes complex or long-context requests, the LLM API call fails with the following error:

[LLM call error] ReadTimeout: Connection failed after 3 attempts

The error occurs intermittently, particularly when:

  • The conversation context is long (multi-turn with rich content)
  • The agent needs to perform complex reasoning or generate lengthy responses
  • Multiple tool calls are involved in a single turn

The system retries 3 times and then gives up, returning the raw error message to the user without any graceful fallback or user-friendly explanation.

Steps to Reproduce

  1. Start a conversation with a digital employee (Agent) on Clawith platform
  2. Build up a long conversation context (e.g., request a comprehensive research report)
  3. Send a follow-up message that requires the agent to process the full context
  4. Observe that the agent returns [LLM call error] ReadTimeout: Connection failed after 3 attempts

Expected Behavior

  1. The LLM call should succeed; if it does time out, the system should:
    • Use a longer timeout for complex requests
    • Implement exponential backoff retry strategy (e.g., 2s → 4s → 8s intervals)
    • Provide a user-friendly error message instead of exposing the raw error
    • Optionally offer to retry or simplify the request
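The retry behavior described above could be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: `call_llm_with_backoff` and its parameters are hypothetical names, and the real implementation would catch the HTTP client's specific timeout exception rather than the built-in `TimeoutError` used here.

```python
import time

def call_llm_with_backoff(call, max_attempts=4, base_delay=2.0):
    """Retry a flaky LLM call with exponential backoff.

    Sleeps base_delay * 2**(attempt - 1) between attempts,
    i.e. 2s, then 4s, then 8s with the defaults.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TimeoutError:
            if attempt == max_attempts:
                # Surface a friendly message instead of the raw error string
                raise RuntimeError(
                    "The AI is taking longer than expected. "
                    "Please try again or simplify your request."
                )
            time.sleep(base_delay * 2 ** (attempt - 1))
```

With the defaults this yields the 2s → 4s → 8s schedule suggested below; a production version would likely also add jitter to avoid thundering-herd retries during peak load.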

Actual Behavior

  • The LLM call times out and fails after 3 consecutive attempts
  • The raw error [LLM call error] ReadTimeout: Connection failed after 3 attempts is displayed directly to the user
  • No graceful degradation or recovery mechanism is triggered
  • The user has to manually resend the message and hope it works

Suggested Improvements

  1. Increase timeout — Raise HTTP timeout from default (likely 30s) to 60–120s for complex requests
  2. Exponential backoff — Implement retry with increasing intervals (e.g., 2s → 4s → 8s) instead of immediate retries
  3. Increase retry count — Consider 5 retries instead of 3
  4. Streaming support — Use streaming mode for LLM responses to avoid long-wait timeouts
  5. Context-aware timeout — Dynamically adjust timeout based on prompt length / token count
  6. User-friendly error — Show a helpful message like "The AI is taking longer than expected. Please try again or simplify your request."
  7. Auto-retry with context trimming — On timeout, automatically retry with a trimmed context
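Suggestion 5 (context-aware timeout) could look something like the sketch below. The function name, the chars-per-token heuristic, and the scaling constants are all assumptions for illustration; real code should estimate tokens with the provider's tokenizer rather than `len(prompt) / 4`.

```python
def context_aware_timeout(prompt: str,
                          base_timeout: float = 30.0,
                          max_timeout: float = 120.0,
                          secs_per_1k_tokens: float = 5.0) -> float:
    """Scale the HTTP read timeout with prompt size.

    Tokens are crudely estimated as len(prompt) / 4 (a common
    rule of thumb for English text), then the timeout grows
    linearly with estimated tokens, capped at max_timeout.
    """
    est_tokens = len(prompt) / 4
    timeout = base_timeout + (est_tokens / 1000) * secs_per_1k_tokens
    return min(timeout, max_timeout)
```

A short prompt keeps the default 30s, while a very long multi-turn context saturates at the 120s ceiling, matching the 60–120s range proposed in suggestion 1.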

Additional Context

  • This error was observed multiple times during a single conversation session on 2026-03-12
  • The conversation involved research tasks requiring extensive web searches and long-form generation
  • The error appears to be more frequent during peak hours, suggesting possible server-side load issues
