Retry OAuth token refresh on server errors #4513

Open
gkatz2 wants to merge 2 commits into stacklok:main from gkatz2:fix/oauth-transient-error-classification
@gkatz2 gkatz2 commented Apr 3, 2026

Summary

  • Load balancers and CDNs can return HTML error pages (or HTTP 5xx) during OAuth token refresh. isTransientNetworkError() treats these as permanent auth failures, immediately marking the workload as "unauthenticated" with no retry. Remote MCP servers become permanently broken until manually restarted.
  • Classify *oauth2.RetrieveError with 5xx status as transient (retry with backoff). 4xx errors remain permanent.
  • Detect unparseable token responses (HTML-on-200) from the oauth2 library as transient.

Fixes #4512

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)

Does this introduce a user-facing change?

Remote MCP servers with OAuth authentication now survive transient token endpoint outages (5xx errors, HTML error pages from load balancers) instead of permanently breaking.

Special notes for reviewers

The isOAuthParseError helper uses string matching against "oauth2: cannot parse json" and "oauth2: cannot parse response" because the oauth2 library wraps these with fmt.Errorf("%v", err) (not %w), making type-based detection impossible. These strings have been stable across oauth2 v0.33.0 through v0.36.0. If they ever change, the worst case is a return to current behavior (no regression).

Generated with Claude Code

When a load balancer or CDN returns an HTML error page during
token refresh, the workload is immediately marked as
unauthenticated with no retry. Remote MCP servers become
permanently broken until manually restarted, even when the
OAuth server recovers seconds later.

Fixes stacklok#4512

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Apr 3, 2026
Signed-off-by: Greg Katz <gkatz@indeed.com>
@gkatz2 gkatz2 force-pushed the fix/oauth-transient-error-classification branch from de82747 to 540a487 Compare April 3, 2026 00:42
@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Apr 3, 2026
codecov bot commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 87.50000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.07%. Comparing base (e69251a) to head (540a487).
⚠️ Report is 28 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/auth/monitored_token_source.go | 87.50% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4513      +/-   ##
==========================================
- Coverage   69.64%   69.07%   -0.58%     
==========================================
  Files         491      502      +11     
  Lines       50304    51973    +1669     
==========================================
+ Hits        35036    35900     +864     
- Misses      12580    13285     +705     
- Partials     2688     2788     +100     

Labels

size/XS Extra small PR: < 100 lines changed

Development

Successfully merging this pull request may close these issues.

Token refresh treats server errors as permanent auth failures

2 participants