The failure handling strategy provides automatic retry and failover behavior for backend errors such as rate limits (429), connection errors, and authentication failures. It enables the proxy to silently recover from transient errors, improving reliability for agentic workflows.
When a backend request fails, the failure handling strategy decides whether to:
- Wait and Retry - If the error is recoverable (e.g., rate limit) and the wait time is short, the proxy waits silently and retries the same backend.
- Failover Immediately - If the wait would be too long or an alternative backend is available, the proxy switches to another backend instance that can serve the same model.
- Surface the Error - If no recovery options are available, the error is returned to the client.
This happens transparently to the client, so agentic workflows continue without interruption from transient errors.
# Disable failure handling entirely
--disable-failure-handling
# Max seconds to wait before attempting failover (default: 30)
--max-silent-wait 30
# Total timeout budget across all failover attempts (default: 90)
--total-timeout-budget 90
# Seconds between SSE keepalive comments during waits (default: 8)
--keepalive-interval 8
# Maximum backend instances to try in failover chain (default: 5)
--max-failover-hops 5
# Minimum retry wait even for sub-second retry-after (default: 1)
--min-retry-wait 1

| Variable | Default | Description |
|---|---|---|
| DISABLE_FAILURE_HANDLING | 0 | Set to 1 to disable automatic failure handling |
| FAILURE_HANDLING_MAX_SILENT_WAIT | 30.0 | Max seconds to wait before failover |
| FAILURE_HANDLING_TOTAL_TIMEOUT_BUDGET | 90.0 | Total timeout budget for all attempts |
| FAILURE_HANDLING_KEEPALIVE_INTERVAL | 8.0 | SSE keepalive interval during waits |
| FAILURE_HANDLING_MAX_FAILOVER_HOPS | 5 | Maximum backend instances to try |
| FAILURE_HANDLING_MIN_RETRY_WAIT | 1.0 | Minimum retry wait time |
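
The table above can be read as a mapping from environment variables to typed settings. The following is a minimal sketch of that mapping, not the proxy's actual loader: the `FailureHandlingConfig` dataclass and the `load_from_env` function are hypothetical names introduced for illustration, and only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureHandlingConfig:
    """Hypothetical container; field defaults mirror the documented defaults."""
    enabled: bool = True
    max_silent_wait: float = 30.0
    total_timeout_budget: float = 90.0
    keepalive_interval: float = 8.0
    max_failover_hops: int = 5
    min_retry_wait: float = 1.0


def load_from_env(env=None) -> FailureHandlingConfig:
    """Build a config from the documented environment variables."""
    env = os.environ if env is None else env
    prefix = "FAILURE_HANDLING_"
    return FailureHandlingConfig(
        # DISABLE_FAILURE_HANDLING=1 switches the feature off.
        enabled=env.get("DISABLE_FAILURE_HANDLING", "0") != "1",
        max_silent_wait=float(env.get(prefix + "MAX_SILENT_WAIT", 30.0)),
        total_timeout_budget=float(env.get(prefix + "TOTAL_TIMEOUT_BUDGET", 90.0)),
        keepalive_interval=float(env.get(prefix + "KEEPALIVE_INTERVAL", 8.0)),
        max_failover_hops=int(env.get(prefix + "MAX_FAILOVER_HOPS", 5)),
        min_retry_wait=float(env.get(prefix + "MIN_RETRY_WAIT", 1.0)),
    )
```

Passing a plain dictionary instead of `os.environ` makes the mapping easy to check in isolation.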
Add to your config/config.yaml:
failure_handling:
  # Master switch to enable/disable failure handling
  enabled: true

  # Maximum seconds to wait for retry-after before failover
  # If retry-after <= this value, proxy waits silently
  # If > this value, it attempts failover to another backend
  max_silent_wait: 30.0

  # Maximum total seconds across all failover attempts
  # After this time, errors are surfaced to the client
  total_timeout_budget: 90.0

  # Seconds between SSE keepalive comments during waits
  # Prevents client/connection timeouts during retry periods
  keepalive_interval: 8.0

  # Maximum number of backend instances to try in failover chain
  # Limits failover depth to prevent infinite loops
  max_failover_hops: 5

  # Minimum wait time even for sub-second retry-after
  # Prevents tight retry loops that could overwhelm backends
  min_retry_wait: 1.0

max_silent_wait default: 30 seconds
This is the threshold that determines whether to wait-and-retry or failover:
- If retry-after <= max_silent_wait: The proxy waits silently and retries the same backend. The client sees no error, only a longer response time.
- If retry-after > max_silent_wait: The proxy immediately attempts failover to an alternative backend instance.
Example scenarios:
- Backend returns retry-after: 10s → Proxy waits 10s and retries (within threshold)
- Backend returns retry-after: 60s → Proxy immediately fails over to another backend (exceeds threshold)
- Backend returns retry-after: 600s → Proxy fails over if possible, otherwise surfaces error
Lower values (e.g., 10-15s) provide faster failover but may cause unnecessary backend switching. Higher values (e.g., 45-60s) are more patient but may cause longer delays for the client.
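
The threshold rule and the three example scenarios can be sketched as a small decision function. This is an illustration under stated assumptions rather than the proxy's code: the `Action` enum, the `choose_action` function, and the `alternative_available` flag are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    WAIT_AND_RETRY = "wait_and_retry"
    FAILOVER = "failover"
    SURFACE_ERROR = "surface_error"


def choose_action(retry_after: float,
                  max_silent_wait: float = 30.0,
                  alternative_available: bool = True) -> Action:
    """Apply the max_silent_wait threshold to a retry-after value in seconds."""
    if retry_after <= max_silent_wait:
        # Short wait: stay on the same backend.
        return Action.WAIT_AND_RETRY
    if alternative_available:
        # Wait exceeds the threshold and another instance can serve the model.
        return Action.FAILOVER
    # Nothing left to try.
    return Action.SURFACE_ERROR
```

With the default threshold, `choose_action(10)` selects a wait, `choose_action(60)` selects failover, and `choose_action(600, alternative_available=False)` surfaces the error.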
total_timeout_budget default: 90 seconds
The maximum total time the proxy will spend attempting recovery before surfacing an error to the client. This includes:
- Time spent waiting for retry-after delays
- Time spent on failover attempts
- Time spent on actual backend requests
Once this budget is exhausted, any subsequent errors are immediately surfaced to the client.
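
One way to picture the budget is as a single deadline shared by every wait, failover attempt, and backend request for one client call. The sketch below assumes that model; the `TimeoutBudget` class and its methods are hypothetical, and the injectable `clock` exists only to make the behavior easy to test.

```python
import time


class TimeoutBudget:
    """Hypothetical deadline tracker shared by all recovery attempts."""

    def __init__(self, total_seconds: float = 90.0, clock=time.monotonic):
        self._clock = clock
        # The deadline is fixed once, when the client request starts.
        self._deadline = clock() + total_seconds

    def remaining(self) -> float:
        """Seconds left before errors must be surfaced."""
        return max(0.0, self._deadline - self._clock())

    def exhausted(self) -> bool:
        return self.remaining() == 0.0

    def can_afford(self, wait_seconds: float) -> bool:
        """True if a wait of this length still fits inside the budget."""
        return wait_seconds <= self.remaining()
```

A monotonic clock is used so that wall-clock adjustments cannot lengthen or shorten the budget.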
keepalive_interval default: 8 seconds
During wait periods (e.g., waiting for a retry-after delay), the proxy emits SSE keepalive comments to prevent client/connection timeouts. This is especially important for streaming responses.
The keepalive comments look like:
: keepalive
Clients can ignore these lines: the SSE format treats any line that begins with a colon as a comment.
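
A wait that emits keepalives can be sketched as a loop that sleeps in slices of `keepalive_interval` and writes a comment between slices. The function name `wait_with_keepalives` and the `send` and `sleep` parameters are assumptions made for this example; `send` stands in for whatever coroutine writes to the client connection.

```python
import asyncio


async def wait_with_keepalives(wait_seconds, send, interval=8.0,
                               sleep=asyncio.sleep):
    """Sleep for wait_seconds, sending an SSE comment between sleep slices."""
    remaining = wait_seconds
    while remaining > 0:
        step = min(interval, remaining)
        await sleep(step)
        remaining -= step
        if remaining > 0:
            # A line starting with ":" is an SSE comment. The trailing blank
            # line is the usual frame terminator and dispatches no event.
            await send(": keepalive\n\n")
```

For a 20 second wait with the default interval, this sketch sleeps for 8, 8, and 4 seconds and sends two keepalive comments. No comment is sent after the final slice because the retry follows immediately.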
max_failover_hops default: 5
The maximum number of different backend instances to try before giving up. This prevents infinite failover loops when all backends are experiencing issues.
For example, with max_failover_hops: 3:
- Try openai-1 → fails
- Try openai-2 → fails
- Try openai-3 → fails
- Surface error to client (max hops reached)
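
The hop limit in the example above can be sketched as a bounded loop over candidate backends. Everything named here is hypothetical: `BackendError`, `call_with_failover`, and the `send` callable are illustrative stand-ins, and the sketch counts the first backend as a hop, which matches the three attempts shown in the example.

```python
from typing import Callable, Sequence


class BackendError(Exception):
    """Hypothetical error raised when a backend attempt fails."""


def call_with_failover(backends: Sequence[str],
                       send: Callable[[str], str],
                       max_failover_hops: int = 5) -> str:
    """Try backends in order, giving up after max_failover_hops attempts."""
    last_error = None
    # Slicing caps the number of instances tried, so the loop always ends.
    for backend in backends[:max_failover_hops]:
        try:
            return send(backend)
        except BackendError as exc:
            last_error = exc
    raise BackendError("max failover hops reached") from last_error
```

With four backends and `max_failover_hops=3`, the fourth backend is never contacted.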
min_retry_wait default: 1 second
The minimum wait time enforced even when a backend returns a very short retry-after value (e.g., 0.1 seconds). This prevents tight retry loops that could:
- Overwhelm the backend
- Consume excessive CPU
- Create retry storms
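
The floor can be sketched as a clamp applied after the retry-after value is parsed. The `effective_wait` function is a hypothetical name. HTTP allows the Retry-After header to carry either a number of seconds or an HTTP-date; whether the proxy accepts the date form is an assumption made here for completeness.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def effective_wait(retry_after: str,
                   min_retry_wait: float = 1.0,
                   now: Optional[datetime] = None) -> float:
    """Return the wait in seconds, never shorter than min_retry_wait."""
    try:
        # Common case: a number of seconds such as "10" or "0.1".
        seconds = float(retry_after)
    except ValueError:
        # Assumed fallback: an HTTP-date such as "Wed, 21 Oct 2015 07:28:00 GMT".
        now = now or datetime.now(timezone.utc)
        seconds = (parsedate_to_datetime(retry_after) - now).total_seconds()
    # The floor prevents tight retry loops on very small or negative values.
    return max(seconds, min_retry_wait)
```

A backend that answers with `retry-after: 0.1` therefore still yields a 1 second wait under the default setting.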
- 429 Too Many Requests - Rate limit errors. Uses retry-after header if available.
- 503 Service Unavailable - Temporary unavailability. Short default wait applied.
- Connection Errors - Network issues. Short wait then retry/failover.
- Timeout Errors - Request timeouts. Immediate failover preferred.
- 401 Unauthorized - Authentication failure. Immediate failover, no retry.
- 403 Forbidden - Authorization failure. Immediate failover, no retry.
- 500 Internal Server Error - Backend error. Immediate failover, no retry.
- 400 Bad Request - Invalid request. Surfaced immediately (client error).
If the backend has already started sending content to the client (e.g., streaming has begun), the error is always surfaced immediately. Partial responses cannot be transparently recovered.
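
The status-code handling listed above, together with the rule about content that has already started, can be summarized as a lookup. This is a sketch only: the `Recovery` enum and `classify` function are hypothetical, connection and timeout errors are left out because they carry no status code, and surfacing any unlisted status is an assumption of this example.

```python
from enum import Enum


class Recovery(Enum):
    WAIT_THEN_RETRY = "wait_then_retry"
    FAILOVER = "failover"
    SURFACE = "surface"


# Status codes taken from the list above.
_BY_STATUS = {
    429: Recovery.WAIT_THEN_RETRY,  # uses retry-after when present
    503: Recovery.WAIT_THEN_RETRY,  # short default wait
    401: Recovery.FAILOVER,
    403: Recovery.FAILOVER,
    500: Recovery.FAILOVER,
    400: Recovery.SURFACE,          # client error
}


def classify(status: int, content_started: bool = False) -> Recovery:
    """Map an HTTP status to a recovery action."""
    if content_started:
        # A partial response cannot be recovered transparently.
        return Recovery.SURFACE
    # Assumption: statuses not listed in the documentation are surfaced.
    return _BY_STATUS.get(status, Recovery.SURFACE)
```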
During streaming responses, the failure handling strategy behaves slightly differently:
- Before content starts: Full retry/failover capability. Client sees no error.
- After content starts: No recovery possible. Error is surfaced to client.
Keepalive comments are emitted during wait periods to prevent streaming timeouts:
: keepalive
: retrying in 5s
: retrying now
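
On the client side, comment lines like the ones above need no special handling beyond being skipped. The following simplified reader illustrates that; `data_lines` is a hypothetical helper, and it deliberately ignores parts of the SSE format such as `event:` fields and multi-line `data:` payloads.

```python
def data_lines(stream_lines):
    """Yield SSE data payloads, skipping comments such as ': keepalive'."""
    for raw in stream_lines:
        line = raw.rstrip("\r\n")
        # Blank lines separate events; lines starting with ":" are comments.
        if not line or line.startswith(":"):
            continue
        if line.startswith("data:"):
            payload = line[len("data:"):]
            # The SSE format drops one optional space after the colon.
            yield payload[1:] if payload.startswith(" ") else payload
```

Feeding it the three comment lines shown above followed by `data: hello` yields only `hello`.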
The failure handling strategy logs its decisions at INFO level:
INFO Failure strategy: waiting 10.0s before retrying backend-1/gpt-4o
INFO Failure strategy: failing over from backend-1 to backend-2 for model gpt-4o
For workflows that can tolerate longer delays but want maximum retry attempts:
failure_handling:
  enabled: true
  max_silent_wait: 60.0
  total_timeout_budget: 180.0
  max_failover_hops: 10
  min_retry_wait: 2.0

For latency-sensitive workflows that prefer quick failover:
failure_handling:
  enabled: true
  max_silent_wait: 10.0
  total_timeout_budget: 45.0
  max_failover_hops: 3
  min_retry_wait: 0.5

When debugging backend issues, you may want to see raw errors:
--disable-failure-handling

Or via environment:

export DISABLE_FAILURE_HANDLING=1

- Request Deduplication - Prevent duplicate requests from exhausting rate limits
- Health Checks - Proactive backend health monitoring
- Backends Overview - Available backend configurations
- Troubleshooting - Debugging common issues