feat(health): allow clickhouse check bypass #7600

onewland · 2025-12-16T19:44:57Z

We've had a lot of issues over the years with the health check being an issue that takes Snuba API pods down in a way that reduces stability rather than increasing it. Recently, there's an issue where a too-slow health check response, not being caught by our own timeout handling behavior, causes API pods to be taken out of rotation even when we have no evidence that new pods would behave any better.

Rather than scale down when we have connection issues to ClickHouse, this essentially adds an option to keep accepting requests on the pod for as long as uwsgi has capacity. This still is some degradation to the customer (at this time, some of their requests will time out) but assuming some kind of network or envoy issue is causing bad behavior that should recover faster than a pod restart. This option (health_check_ignore_clickhouse) is going to be disabled by default, but I want to try enabling this behavior in us for a while and see how it goes.

Internal doc on this: https://www.notion.so/sentry/investigation-pod-availability-drop-2ca8b10e4b5d808796c6ecd2e3ad1700

volokluev · 2025-12-16T20:04:19Z

why not just pass thorough=false and be done with it?

onewland · 2025-12-16T20:27:03Z

why not just pass thorough=false and be done with it?

thorough=false does something else, though we could change its behavior if you want.

thorough=true checks that every table exists on every ClickHouse cluster we define.
thorough=false checks that at least one ClickHouse cluster is connectable.

I'm trying to define a check that's even looser than that (basically: is the process alive?)

MeredithAnya

i think this seems reasonable to me

feat(health): allow clickhouse check bypass

6ace875

onewland marked this pull request as ready for review December 16, 2025 19:46

onewland requested a review from a team as a code owner December 16, 2025 19:46

MeredithAnya approved these changes Dec 16, 2025

View reviewed changes

phacops approved these changes Jan 7, 2026

View reviewed changes

onewland merged commit 4d7e519 into master Jan 7, 2026
34 checks passed

onewland deleted the allow-noop-health-check branch January 7, 2026 18:35

github-actions bot mentioned this pull request Jan 9, 2026

ci(release): Switch from action-prepare-release to Craft #7631

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

feat(health): allow clickhouse check bypass #7600

feat(health): allow clickhouse check bypass #7600

onewland commented Dec 16, 2025 •

edited

Loading

Uh oh!

volokluev commented Dec 16, 2025

Uh oh!

onewland commented Dec 16, 2025 •

edited

Loading

Uh oh!

MeredithAnya left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

feat(health): allow clickhouse check bypass #7600

feat(health): allow clickhouse check bypass #7600

Conversation

onewland commented Dec 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

volokluev commented Dec 16, 2025

Uh oh!

onewland commented Dec 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MeredithAnya left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

onewland commented Dec 16, 2025 •

edited

Loading

onewland commented Dec 16, 2025 •

edited

Loading