Skip to content

Conversation

@diyuyi-agora
Copy link
Contributor

No description provided.

@claude
Copy link

claude bot commented Jan 13, 2026

PR Review: Fix error code (0) and HTTP status codes (400-599) reconnection handling

Summary

This PR modifies the is_reconnectable_error() function to treat error code 0 and HTTP status codes in the 400-599 range as reconnectable errors. The changes are minimal and focused.


✅ Positive Aspects

  1. Clear intent: The PR addresses a specific issue where error code 0 (unknown errors) and HTTP status codes should trigger reconnection attempts
  2. Proper version bump: The manifest.json version is correctly incremented from 0.3.12 to 0.3.13
  3. Good code comments: The added logic includes clear explanatory comments

⚠️ Issues and Concerns

1. Critical Logic Issue: 4xx Client Errors Should NOT Be Retried

The most significant problem is treating ALL HTTP 400-599 status codes as reconnectable. This is incorrect:

  • 4xx errors (400-499) are client errors indicating problems with the request itself (bad request, unauthorized, forbidden, not found, etc.)
  • Retrying 4xx errors is generally futile and can cause unnecessary load
  • 5xx errors (500-599) are server errors and make sense to retry

Issue location: const.py:83

# Current code - PROBLEMATIC
if 400 <= error_code <= 599:
    return True

Recommendation: Only retry 5xx server errors:

# HTTP 5xx status codes (server errors) should be retried
# 4xx client errors should NOT be retried as they indicate request problems
if 500 <= error_code <= 599:
    return True

2. Potential Conflict with Volcengine Error Code Ranges

Looking at the Volcengine error codes in the file:

  • Client errors: 4xxxxxxx (e.g., 45000000)
  • Server errors: 5xxxxxxx (e.g., 55000000)

The HTTP status code range 400-599 will overlap with the existing error code ranges:

  • HTTP 400-599 will match
  • But Volcengine codes like 45000000 will NOT match this range (since 45000000 > 599)

This creates inconsistent behavior:

  • A Volcengine error 45000001 won't match the HTTP range check
  • But an actual HTTP 400 will match and be retried

Questions to clarify:

  1. Are you mixing HTTP status codes and Volcengine-specific error codes in the same error_code parameter?
  2. Where do HTTP status codes (400-599) actually come from in this codebase?
  3. Based on line extension.py:911, it seems error_code defaults to 0 for unknown errors and may contain HTTP status codes from exception messages

3. Error Code 0 Handling May Be Too Broad

Treating error code 0 as reconnectable might mask real issues:

  • Error code 0 typically means "unknown error"
  • Blindly retrying unknown errors could hide bugs or configuration issues
  • Consider logging when this happens to understand the root cause

Recommendation: Add debug logging when error code 0 triggers reconnection

4. Missing Test Coverage

There are no unit tests for the modified is_reconnectable_error() function. The PR should include tests covering error code 0, HTTP status codes, Volcengine error codes, and non-reconnectable errors.


🔍 Additional Observations

  1. Code organization: The function logic is clear with the early returns, which is good
  2. No security concerns: This is error handling logic with no security implications
  3. Performance: The sequential checks are fine for this use case; no performance issues

📋 Recommendations

Must Fix:

  1. Change HTTP status code range from 400-599 to 500-599 (only retry server errors)
  2. Add unit tests for the new logic

Should Consider:

  1. Add logging when error code 0 is encountered to help diagnose unknown errors
  2. Document the expected error code formats (HTTP vs Volcengine) in the function docstring
  3. Clarify in the PR description why HTTP 4xx client errors should be retried (if there's a specific reason)

Verdict

⚠️ Needs Changes - The PR addresses a valid issue but has a critical logic flaw that would cause 4xx client errors to be retried unnecessarily. Please address the recommendations above before merging.

@diyuyi-agora diyuyi-agora force-pushed the bugfix/yuyidi/bytedance_reconnect branch from 21bf335 to 8d2c961 Compare January 14, 2026 08:46
@diyuyi-agora diyuyi-agora changed the title fix: error code (0) and HTTP status codes (400-599) to trigger reconnection attempts. update error handling logic for reconnection attempts Jan 14, 2026
wangyoucao577
wangyoucao577 previously approved these changes Jan 14, 2026
@diyuyi-agora diyuyi-agora changed the title update error handling logic for reconnection attempts fix: update error handling logic for reconnection attempts Jan 14, 2026
@diyuyi-agora diyuyi-agora reopened this Jan 14, 2026
@diyuyi-agora diyuyi-agora force-pushed the bugfix/yuyidi/bytedance_reconnect branch from 8d2c961 to a3bac72 Compare January 14, 2026 09:40
@claude
Copy link

claude bot commented Jan 14, 2026

Code Review: PR #1978 - Update Error Handling Logic for Reconnection

Summary

This PR changes the error handling strategy from an opt-in (whitelist) approach to an opt-out (blacklist) approach. While this may handle transient errors more gracefully, it introduces several significant risks that need to be addressed.


🔴 Critical Issues

1. Overly Aggressive Reconnection Logic

The new logic assumes all errors are retryable by default unless explicitly blacklisted. This is a risky assumption that could lead to:

  • Resource exhaustion: Reconnecting on fatal errors (authentication failures, quota exceeded, service unavailable)
  • Rate limiting violations: Repeatedly hitting the API with non-retryable errors
  • Cost implications: Unnecessary API calls for errors that cannot be resolved by retry
  • Cascading failures: If the service is truly down, aggressive retries from many clients could worsen the situation

Recommendation: Consider a hybrid approach:

  • Maintain an explicit whitelist of known retryable errors
  • Add a catch-all for specific error ranges that are documented as transient (e.g., 5xx server errors)
  • Default to False for unknown errors, logging them for investigation

2. Incomplete Error Code Blacklist

Only INVALID_PARAMETER is marked as non-retryable. Many other error codes should also be blacklisted:

NON_RECONNECTABLE_ERROR_CODES = [
    VOLCENGINE_ERROR_CODES["INVALID_PARAMETER"],      # 400xxxxx
    VOLCENGINE_ERROR_CODES["AUTHENTICATION_FAILED"],   # 401xxxxx - Auth won't fix itself
    VOLCENGINE_ERROR_CODES["PERMISSION_DENIED"],      # 403xxxxx - Permission issues
    VOLCENGINE_ERROR_CODES["NOT_FOUND"],              # 404xxxxx - Resource doesn't exist
    VOLCENGINE_ERROR_CODES["UNSUPPORTED_OPERATION"],  # 405xxxxx - Invalid operation
    VOLCENGINE_ERROR_CODES["REQUEST_TIMEOUT"],        # Could debate this one
    VOLCENGINE_ERROR_CODES["CONFLICT"],               # 409xxxxx - State conflict
    VOLCENGINE_ERROR_CODES["QUOTA_EXCEEDED"],         # 429xxxxx - Rate limit reached
    # Add more based on API documentation
]

3. Loss of Range-Based Error Handling

The old code had logic for handling error ranges (e.g., 55000000 <= error_code <= 55099999 for internal errors). This was removed without replacement. If the API uses error code ranges for categorization, this information is lost.

Question: Does the Volcengine API document error code ranges? If so, we should preserve this capability.


⚠️ Significant Concerns

4. Missing Circuit Breaker Pattern

With the new "retry everything" approach, there's no protection against infinite retry loops. Consider implementing:

  • Max retry count per error type
  • Exponential backoff with jitter
  • Circuit breaker to stop retries after N consecutive failures
  • Dead letter queue for errors that fail repeatedly

5. No Error Monitoring/Alerting

When unknown errors are automatically retried, you lose visibility. Recommendations:

def is_reconnectable_error(error_code: int) -> bool:
    if error_code in NON_RECONNECTABLE_ERROR_CODES:
        return False
    
    # Log unknown error codes for investigation
    if error_code not in RECONNECTABLE_ERROR_CODES:
        logger.warning(
            f"Unknown error code {error_code} encountered. "
            f"Attempting reconnection by default. Please update error handling logic."
        )
    
    return True

6. Version Bump Rationale

The version bumped from 0.3.12 to 0.3.13 (patch version). Given this is a breaking behavioral change, consider:

  • Is this a patch (bug fix)?
  • Or a minor version change (behavior change)?

If this fixes a critical bug where legitimate retries weren't happening, document this in the PR description.


💡 Best Practices & Code Quality

7. Documentation Improvements

The new docstring is good but could be more specific:

def is_reconnectable_error(error_code: int) -> bool:
    """
    Check if an error code should trigger reconnection.
    
    Uses a blacklist approach: all errors are considered retryable unless
    explicitly marked as non-retryable. This handles transient and unknown
    errors gracefully, but may cause unnecessary retries for fatal errors.
    
    Args:
        error_code: The Volcengine API error code to check
    
    Returns:
        True if reconnection should be attempted (default for unknown errors)
        False if the error indicates a permanent failure (e.g., invalid parameters)
        
    Note:
        Consider implementing rate limiting and circuit breaker patterns
        when using this function to avoid retry storms.
    """

8. Testing Concerns

Questions to address:

  • Are there unit tests for is_reconnectable_error()?
  • Do tests cover both known and unknown error codes?
  • Are there integration tests that verify retry behavior doesn't cause issues?
  • Has this been tested against actual API failures?

🔒 Security Considerations

9. Denial of Service Risk

If an attacker can trigger errors that cause infinite retries, they could:

  • Exhaust your API quota
  • Increase costs
  • Create resource exhaustion

Mitigation: Implement rate limiting, max retry counts, and monitoring.


🚀 Performance Considerations

10. Latency Impact

Retrying non-retryable errors adds latency to failure scenarios. Users might wait longer for definitive failures.

Recommendation: Fast-fail on known non-retryable errors, implement timeouts.


📋 Suggested Action Items

Before merging, please address:

  1. Expand NON_RECONNECTABLE_ERROR_CODES to include all permanently failing error types
  2. Add logging for unknown error codes to build a comprehensive list over time
  3. Implement retry limits (max attempts, exponential backoff)
  4. Add/update unit tests covering edge cases
  5. Document the reasoning for this change in the PR description (what bug does it fix?)
  6. Consider a phased rollout with monitoring to catch unexpected retry storms
  7. Review Volcengine API documentation to ensure all error codes are properly categorized

📝 Summary

Approval Status: ⚠️ Request Changes

The approach has merit for handling transient errors, but the implementation is too permissive. The risks of resource exhaustion, cost increase, and cascading failures outweigh the benefits without proper guardrails.

Recommended Path Forward:

  1. Expand the non-retryable error list significantly
  2. Add retry limiting mechanisms
  3. Include monitoring/logging for unknown errors
  4. Provide comprehensive test coverage
  5. Document the specific problem this solves

Happy to discuss this further or review a revised approach!

@claude
Copy link

claude bot commented Jan 16, 2026

Code Review: PR #1978 - Update Error Handling Logic for Reconnection

Summary

This PR fundamentally changes the error reconnection strategy from an opt-in (whitelist) approach to an opt-out (blacklist) approach. The new logic assumes all errors are retryable by default unless explicitly marked as non-retryable. While this approach has merit for handling unknown transient errors, there are several concerns that should be addressed.


📊 Changes Overview

Files Modified:

  • const.py: Complete rewrite of is_reconnectable_error() logic (14 additions, 11 deletions)
  • manifest.json: Version bump from 0.3.13 to 0.3.14

Core Logic Change:

# OLD (whitelist): Only retry known reconnectable errors
if error_code in RECONNECTABLE_ERROR_CODES:
    return True
if 55000000 <= error_code <= 55099999:  # Internal errors range
    return True
return False

# NEW (blacklist): Retry everything except known non-retryable errors
if error_code in NON_RECONNECTABLE_ERROR_CODES:
    return False
return True  # Default: retry

✅ Positive Aspects

  1. Resilience to Unknown Errors: The new approach handles transient errors that weren't explicitly whitelisted, which could improve reliability in production
  2. Clear Documentation: The updated docstring clearly explains the new default-retry behavior
  3. Code Simplicity: The new logic is more concise and easier to understand
  4. Proper Version Bump: Manifest version correctly incremented

⚠️ Critical Concerns

1. Incomplete Non-Retryable Error List 🔴

Currently only INVALID_PARAMETER (45000001) is blacklisted. Based on the error codes defined in the file, several others should likely be non-retryable:

NON_RECONNECTABLE_ERROR_CODES = [
    VOLCENGINE_ERROR_CODES["INVALID_PARAMETER"],      # 45000001 - Bad request
    VOLCENGINE_ERROR_CODES["INVALID_AUDIO_FORMAT"],   # 45000151 - Format won't fix with retry
    # Consider adding if defined in API:
    # - Authentication/authorization errors (won't fix without credential changes)
    # - Resource not found errors (resource doesn't exist)
    # - Quota exceeded errors (need to wait or increase limits)
]

Impact: Without a comprehensive blacklist, the system may repeatedly retry truly fatal errors, wasting resources and potentially hitting rate limits.

Recommendation: Review the Volcengine ASR LLM API documentation to identify all permanently failing error codes and add them to the blacklist.


2. Loss of Range-Based Error Handling 🟡

The original code had range-based matching for internal errors:

# Old code handled all 550xxxxx errors
if 55000000 <= error_code <= 55099999:
    return True

This logic is now gone. If the Volcengine API uses error code ranges to categorize errors (which appears to be the case based on the 4xxxxxxx and 5xxxxxxx patterns), we've lost this capability.

Questions:

  • Does the API documentation specify error ranges?
  • Should we preserve range checking for unknown errors within the 5xxxxxxx (server error) range?

Potential Solution:

def is_reconnectable_error(error_code: int) -> bool:
    # Explicit blacklist
    if error_code in NON_RECONNECTABLE_ERROR_CODES:
        return False
    
    # Explicit whitelist for known reconnectable errors
    if error_code in RECONNECTABLE_ERROR_CODES:
        return True
    
    # Range-based fallback: 5xxxxxxx are server errors (retryable)
    # 4xxxxxxx are client errors (generally not retryable)
    if 50000000 <= error_code <= 59999999:
        return True
    
    # Conservative default for truly unknown errors
    # Log these for investigation
    return False

3. Risk of Retry Storms 🟡

With aggressive default-retry behavior, there's increased risk of:

  • Resource exhaustion from infinite retry loops
  • API rate limiting violations
  • Increased costs from unnecessary API calls
  • Cascading failures if many clients retry simultaneously

Current Mitigation: I reviewed extension.py:435-450 and confirmed there is exponential backoff with configurable max_retry_delay and retry count limits, which helps mitigate this risk.

Status: ✅ Adequate retry limiting exists, but should be documented in this PR


4. No Test Coverage for New Logic 🔴

The test file test_volcengine_asr.py has no tests for is_reconnectable_error(). The PR should include unit tests covering:

def test_is_reconnectable_error():
    # Non-retryable errors
    assert not is_reconnectable_error(VOLCENGINE_ERROR_CODES["INVALID_PARAMETER"])
    
    # Known retryable errors (from RECONNECTABLE_ERROR_CODES)
    assert is_reconnectable_error(VOLCENGINE_ERROR_CODES["SERVER_BUSY"])
    assert is_reconnectable_error(VOLCENGINE_ERROR_CODES["INTERNAL_ERROR"])
    assert is_reconnectable_error(VOLCENGINE_ERROR_CODES["PACKET_TIMEOUT"])
    
    # Unknown errors (test new default behavior)
    assert is_reconnectable_error(99999999)  # Unknown code
    assert is_reconnectable_error(0)  # Zero code

Recommendation: Add comprehensive unit tests before merging.


💡 Code Quality Observations

5. RECONNECTABLE_ERROR_CODES Is Now Unused

The RECONNECTABLE_ERROR_CODES set (lines 45-64) is still defined but no longer referenced in is_reconnectable_error(). This creates dead code.

Options:

  1. Remove it if it's truly unused (check if other code references it)
  2. Keep it for documentation purposes (shows which errors are known to be retryable)
  3. Use it in a hybrid approach (whitelist + blacklist)

Recommendation: Check for references with grep, then either remove or document its purpose.


6. Data Structure Mismatch

Minor inconsistency:

  • RECONNECTABLE_ERROR_CODES is a set (lines 45-64)
  • NON_RECONNECTABLE_ERROR_CODES is a list (lines 66-70)

Both use in operator which works fine, but for consistency and performance (O(1) vs O(n) lookup), both should be sets.

Recommendation: Change to NON_RECONNECTABLE_ERROR_CODES = { (use curly braces)


🔍 Additional Observations

7. Missing Context in PR Description

The PR title mentions "reconnection attempts" but the PR body is empty. For future maintainability, please document:

  • What problem does this solve? (e.g., "Errors X and Y were not being retried, causing service interruptions")
  • Why was the approach changed from whitelist to blacklist?
  • What testing was performed?
  • Are there any risks or behavioral changes users should be aware of?

8. Integration with Extension Usage

I reviewed extension.py (lines 419 and 887) where is_reconnectable_error() is called. The integration looks correct:

# extension.py:419
if is_reconnectable_error(error_code) and not self.stopped:
    await self._handle_reconnect()
else:
    await self.send_asr_error(...)

# extension.py:887
if is_reconnectable_error(error_code) and not self.stopped:
    await self._handle_reconnect()
else:
    await self.send_asr_error(...)

The not self.stopped check provides a circuit breaker, which is good. The reconnection logic in _handle_reconnect() has exponential backoff with max retries, which mitigates retry storm risks.


📋 Recommendations Summary

Must Fix (Before Merging):

  1. Expand NON_RECONNECTABLE_ERROR_CODES - Add all error codes that represent permanent failures
  2. Add unit tests - Test both retryable and non-retryable error codes, plus unknown codes
  3. Change list to set - NON_RECONNECTABLE_ERROR_CODES should be a set for consistency and performance

Should Consider:
4. 🤔 Document in PR description - Explain what problem this solves and why the approach changed
5. 🤔 Consider range-based fallback - Preserve handling for 5xxxxxxx error ranges if documented by API
6. 🤔 Clean up unused code - Remove or document the now-unused RECONNECTABLE_ERROR_CODES set
7. 🤔 Add logging - Log when unknown error codes trigger retry (for monitoring/debugging)

Nice to Have:
8. 💭 Review Volcengine API documentation to ensure all error codes are properly categorized
9. 💭 Consider a hybrid whitelist+blacklist approach for better control
10. 💭 Add integration test that verifies retry behavior under various error conditions


🎯 Verdict

Status: ⚠️ Request Changes

The approach has merit and could improve reliability, but the implementation needs hardening before production use. The most critical issues are:

  1. Incomplete blacklist of non-retryable errors
  2. Missing test coverage
  3. Loss of range-based error handling

Risk Level: Medium - The existing retry limiting in extension.py prevents catastrophic failures, but the overly permissive retry logic could still cause resource waste and increased costs.

Path Forward: Address the "Must Fix" items, then I'd be comfortable approving. The "Should Consider" items would make this production-ready.

Happy to discuss or review a revised approach! 🚀

@wangyoucao577 wangyoucao577 merged commit e278442 into main Jan 16, 2026
34 checks passed
@wangyoucao577 wangyoucao577 deleted the bugfix/yuyidi/bytedance_reconnect branch January 16, 2026 08:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants