Skip to content

Conversation

@onewland
Copy link
Contributor

@onewland onewland commented Dec 16, 2025

We've had a lot of issues over the years with the health check being an issue that takes Snuba API pods down in a way that reduces stability rather than increasing it. Recently, there's an issue where a too-slow health check response, not being caught by our own timeout handling behavior, causes API pods to be taken out of rotation even when we have no evidence that new pods would behave any better.

Rather than scale down when we have connection issues to ClickHouse, this essentially adds an option to keep accepting requests on the pod for as long as uwsgi has capacity. This still is some degradation to the customer (at this time, some of their requests will time out) but assuming some kind of network or envoy issue is causing bad behavior that should recover faster than a pod restart. This option (health_check_ignore_clickhouse) is going to be disabled by default, but I want to try enabling this behavior in us for a while and see how it goes.

Internal doc on this: https://www.notion.so/sentry/investigation-pod-availability-drop-2ca8b10e4b5d808796c6ecd2e3ad1700

@onewland onewland marked this pull request as ready for review December 16, 2025 19:46
@onewland onewland requested a review from a team as a code owner December 16, 2025 19:46
@volokluev
Copy link
Member

why not just pass thorough=false and be done with it?

@onewland
Copy link
Contributor Author

onewland commented Dec 16, 2025

why not just pass thorough=false and be done with it?

thorough=false does something else, though we could change its behavior if you want.

thorough=true checks that every table exists on every ClickHouse cluster we define.
thorough=false checks that at least one ClickHouse cluster is connectable.

I'm trying to define a check that's even looser than that (basically: is the process alive?)

Copy link
Member

@MeredithAnya MeredithAnya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this seems reasonable to me

@onewland onewland merged commit 4d7e519 into master Jan 7, 2026
34 checks passed
@onewland onewland deleted the allow-noop-health-check branch January 7, 2026 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants