Skip to content

Conversation

@SkyeYoung
Copy link
Member

@SkyeYoung SkyeYoung commented Jan 14, 2026

Description

Summary

Fix eureka service discovery to retry all configured endpoints when one fails, preventing 503 errors on APISIX restart when some eureka services are unhealthy.

Problem

When APISIX restarts with multiple eureka services configured (some healthy, some unhealthy), each worker randomly selects one eureka node. Workers that hit an unhealthy node fail silently, leaving their applications variable as nil, causing 503 errors for requests routed to those workers.

Changes

  • Add build_endpoints() function - Converts all eureka hosts from config to endpoint objects with URL and auth info, handling prefix and trailing slash normalization.

  • Implement endpoint failover in fetch_full_registry() - Start from a random endpoint position for load balancing, then try endpoints sequentially until one succeeds. If all fail, log error and return without clearing existing applications data.

  • Add test case - Added TEST 4 in t/discovery/eureka.t to verify failover behavior when first eureka host returns 502.

refs

#12734

Which issue(s) this PR fixes:

Fixes #12610

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@SkyeYoung SkyeYoung marked this pull request as ready for review January 14, 2026 11:21
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels Jan 14, 2026
@moonming moonming requested a review from Copilot January 14, 2026 11:36
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

local fetch_interval = local_conf.discovery.eureka.fetch_interval or 30
log.info("fetch_interval:", fetch_interval, ".")

endpoints = build_endpoints()
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If build_endpoints() returns nil due to missing configuration, the timers will still be started, causing them to fail silently on every execution. Consider adding error handling to prevent timer initialization when endpoints cannot be built, similar to how other discovery services handle init failures.

Suggested change
endpoints = build_endpoints()
endpoints = build_endpoints()
if not endpoints or #endpoints == 0 then
log.error("failed to init eureka discovery: no valid endpoints " ..
"could be built, please check discovery.eureka configuration")
return
end

Copilot uses AI. Check for mistakes.
applications = up_apps
log.info("successfully updated service registry, services count=",
core.table.nkeys(up_apps), "; source=",
selected_endpoint and selected_endpoint.url or "unknown")
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition selected_endpoint and selected_endpoint.url or 'unknown' is redundant since this log line is only reached when selected_endpoint is not nil (line 170 returns early if nil). Simplify to just selected_endpoint.url.

Suggested change
selected_endpoint and selected_endpoint.url or "unknown")
selected_endpoint.url)

Copilot uses AI. Check for mistakes.
--- request
GET /eureka/apps/APISIX-EUREKA
--- response_body_like
.*<name>APISIX-EUREKA</name>.*
Copy link

Copilot AI Jan 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test verifies the response body but doesn't verify that the request was actually routed through the working endpoint (127.0.0.1:8761) rather than the failing one (127.0.0.1:20997). Consider adding an assertion to check that the upstream actually received the request.

Copilot uses AI. Check for mistakes.
Comment on lines +164 to +168
if r and r.body and r.status == 200 then
selected_endpoint = endpoint
selected_body = r.body
break
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This modification essentially ensures that if a registry unexpectedly fails, it will retry until a good registry is used. While this may introduce some additional latency, is it necessary to document it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should add a similar description to all service discovery documentation.

membphis
membphis previously approved these changes Jan 15, 2026
if not res.body or res.status ~= 200 then
log.error("failed to fetch registry, status = ", res.status)
if not selected_endpoint then
log.error("failed to fetch registry from all eureka hosts")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need more information, eg: no healthy registry, the count of all hosts

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

moonming
moonming previously approved these changes Jan 15, 2026
@SkyeYoung SkyeYoung dismissed stale reviews from moonming and membphis via b70606c January 15, 2026 10:10
@SkyeYoung SkyeYoung requested review from bzp2010 and moonming January 15, 2026 10:30
@SkyeYoung SkyeYoung merged commit 9718c69 into apache:master Jan 16, 2026
23 checks passed
@SkyeYoung SkyeYoung deleted the young/fix/eureka-requests branch January 16, 2026 02:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: dynamic upstream, one Eureka node is unavailable, half of the requests are lost after reloading

4 participants