Make transparent proxy health check parameters configurable by gkatz2 · Pull Request #4108 · stacklok/toolhive

gkatz2 · 2026-03-11T16:36:17Z

Summary

The transparent proxy health check has mostly hardcoded parameters (ping timeout, retry delay, failure threshold), making it impossible to tune resilience for environments with transient network disruptions. When the proxy kills itself due to false positives, the client's MCP session is invalidated, requiring a full client restart to recover.
Add three new environment variables (TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT, TOOLHIVE_HEALTH_CHECK_RETRY_DELAY, TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD) following the existing TOOLHIVE_HEALTH_CHECK_INTERVAL pattern.
Raise the default failure threshold from 3 to 5, extending the tolerance window from ~40s to ~60s out of the box.
Add warning logs when invalid env var values are provided, and debug logs when custom values are successfully applied.

Type of change

Test plan

Unit tests (task test)
E2E tests (task test-e2e)
Linting (task lint-fix)
Manual testing (describe below)

Built a test binary and started an MCP server with invalid env vars (TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD=banana, TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT=not-a-duration). Verified warning logs appeared in the server log file with the env var name, invalid value, and default fallback. Confirmed default=5 in the threshold warning, showing the new default is active.

Changes

File	Change
`pkg/transport/proxy/transparent/transparent_proxy.go`	Add env var constants, getter functions, struct field for failure threshold, wire getters in constructor, replace package-level `healthCheckRetryCount` constant with struct field
`pkg/transport/proxy/transparent/transparent_test.go`	Table-driven tests for all 4 getters, constructor wiring tests, option function tests, test helper `newMinimalProxy()`
`docs/arch/03-transport-architecture.md`	Add "Health Check Tuning Parameters" section documenting all 4 env vars

Does this introduce a user-facing change?

Yes. Three new environment variables for tuning health check behavior:

Environment Variable	Description	Default	Type
`TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT`	Timeout for each health check ping	`5s`	duration
`TOOLHIVE_HEALTH_CHECK_RETRY_DELAY`	Delay between retry attempts after a failure	`5s`	duration
`TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD`	Consecutive failures before proxy shutdown	`5`	integer

Behavioral change: The default failure threshold increases from 3 to 5. This means the proxy tolerates approximately 60 seconds of consecutive health check failures before shutting down (previously ~40 seconds). This improves resilience to transient network disruptions without significantly delaying detection of genuinely dead servers.

Special notes for reviewers

The existing getHealthCheckInterval() was also updated to add slog.Warn on invalid values and slog.Debug on custom values, for consistency with the new getter functions.
The healthCheckRetryCount package-level constant was replaced with a healthCheckFailureThreshold struct field to allow per-proxy configuration via the options pattern.
Per-server flags (e.g., --health-check-failure-threshold on thv run) would be a natural follow-up, following the precedence pattern suggested in Make proxy response timeout configurable #4063. Happy to include that in this PR if you'd prefer it bundled, or as a separate follow-up.

codecov · 2026-03-11T16:39:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.77%. Comparing base (d851c69) to head (41974d7).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4108      +/-   ##
==========================================
+ Coverage   68.71%   68.77%   +0.06%     
==========================================
  Files         517      517              
  Lines       54817    54846      +29     
==========================================
+ Hits        37666    37723      +57     
+ Misses      14252    14234      -18     
+ Partials     2899     2889      -10

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

amirejaz · 2026-03-11T23:25:29Z

@claude could you review this PR?

claude · 2026-03-11T23:26:09Z

Claude Code is working…

I'll analyze this and get back to you.

View job run

Copilot

Pull request overview

This PR makes the transparent proxy health check behavior configurable via environment variables (ping timeout, retry delay, failure threshold), improving resilience in environments with transient network disruptions and increasing the default failure threshold from 3 to 5.

Changes:

Add env var constants + resolution helpers for transparent proxy health check parameters, and wire resolved values into TransparentProxy.
Replace the package-level retry-count constant with a per-proxy healthCheckFailureThreshold field (options + env var).
Add unit tests for env var resolution and constructor wiring, and document the new tuning parameters.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

File	Description
pkg/transport/proxy/transparent/transparent_proxy.go	Adds env var resolution for health-check parameters and stores the failure threshold on the proxy instance.
pkg/transport/proxy/transparent/transparent_test.go	Adds table-driven tests for the getter functions and constructor/option wiring.
docs/arch/03-transport-architecture.md	Documents the new transparent proxy health check tuning environment variables.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

amirejaz · 2026-03-12T00:34:55Z

@claude Review this PR focusing on the new TOOLHIVE_HEALTH_CHECK_* env vars and their interactions, edge cases, and documentation.

claude · 2026-03-12T00:35:39Z

Claude Code is working…

I'll analyze this and get back to you.

View job run

amirejaz

Good, focused PR. The env var pattern is consistent and the tests cover the happy/sad paths well. I have four findings: one test reliability bug, a docs inaccuracy, an undocumented edge case, and a now-redundant guard worth cleaning up.

gkatz2 · 2026-03-18T06:01:21Z

Hi @amirejaz — thanks for the thorough review. Your 5 inline findings plus the Copilot bot's suggestions have all been addressed in the last two commits (9e89dc30 and a41f8a87):

Test hermeticity: all getter tests and the default-values test now explicitly set env vars (including empty string for default cases)
Failure window formula corrected to ~60s
pingTimeout > interval interaction documented
threshold=1 behavior (immediate shutdown, no retries) documented
Dead retryDelay == 0 guard removed

Would you have time for a re-review? Thanks!

jhrozek

A couple of minor observations from reviewing this PR:

Hardcoded 5s context timeout in pkg/healthcheck/healthcheck.go:110 — checkMCPStatus wraps every Ping() call in context.WithTimeout(ctx, 5*time.Second), which silently caps the new TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT env var at 5s. At the current default (also 5s) this is a non-issue, but if someone sets a higher timeout for slow backends, the hardcoded context will cancel first. Worth a note or follow-up.

jhrozek · 2026-04-09T22:13:16Z

@gkatz2 again I'm sorry for letting you wait so long for a review, this is bad contributor experience. We need to make sure work in-flight is handed over better.

I think the PR is in good shape overall and normally I would have acked but seeing there are conflicts, I added some nits inline - don't feel forced to resolve them, they are mostly observations.

The health check ping timeout, retry delay, and failure threshold were hardcoded, making it impossible to tune resilience for environments with transient network disruptions. Only the check interval was configurable via TOOLHIVE_HEALTH_CHECK_INTERVAL. Add three new environment variables following the same pattern: TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT, TOOLHIVE_HEALTH_CHECK_RETRY_DELAY, and TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD. Raise the default failure threshold from 3 to 5 to extend the tolerance window from ~40s to ~60s, reducing false positives without significantly delaying detection of genuinely dead servers. Fixes stacklok#4084 Signed-off-by: Greg Katz <gkatz@indeed.com>

Signed-off-by: Greg Katz <gkatz@indeed.com>

- Always set env vars in getter tests (including empty string for default cases) so tests are hermetic against ambient environment - Clear all health check env vars in TestNewTransparentProxyDefaultValues - Fix failure window formula in docs to match actual behavior - Remove reference to not-yet-merged env var name in docs Signed-off-by: Greg Katz <gkatz@indeed.com>

- Fix failure window estimate: 70s → 60s to match the formula - Document threshold=1 behavior (immediate shutdown, no retries) - Note that pingTimeout > interval extends the failure window - Remove pre-existing dead code: retryDelay == 0 guard was unreachable before this PR (constructor always set a positive value via DefaultHealthCheckRetryDelay) - Fix test hermeticity: isolate env vars in option function test Signed-off-by: Greg Katz <gkatz@indeed.com>

gkatz2 · 2026-04-10T16:24:32Z

@gkatz2 again I'm sorry for letting you wait so long for a review, this is bad contributor experience. We need to make sure work in-flight is handed over better.

I think the PR is in good shape overall and normally I would have acked but seeing there are conflicts, I added some nits inline - don't feel forced to resolve them, they are mostly observations.

Whoops, I missed this comment. @jhrozek, how do you typically handle addressing missed nits like these? Do you want a separate issue for them? Do you want to let them go for now?

jhrozek · 2026-04-13T08:46:51Z

@gkatz2 again I'm sorry for letting you wait so long for a review, this is bad contributor experience. We need to make sure work in-flight is handed over better.
I think the PR is in good shape overall and normally I would have acked but seeing there are conflicts, I added some nits inline - don't feel forced to resolve them, they are mostly observations.

Whoops, I missed this comment. @jhrozek, how do you typically handle addressing missed nits like these? Do you want a separate issue for them? Do you want to let them go for now?

No problem @gkatz2 the main decision point is the approved/changes requested. The nits are something you can either fix in the same PR if you feel like it or send a follow-up.

gkatz2 · 2026-04-13T17:13:04Z

@gkatz2 again I'm sorry for letting you wait so long for a review, this is bad contributor experience. We need to make sure work in-flight is handed over better.
I think the PR is in good shape overall and normally I would have acked but seeing there are conflicts, I added some nits inline - don't feel forced to resolve them, they are mostly observations.

Whoops, I missed this comment. @jhrozek, how do you typically handle addressing missed nits like these? Do you want a separate issue for them? Do you want to let them go for now?

No problem @gkatz2 the main decision point is the approved/changes requested. The nits are something you can either fix in the same PR if you feel like it or send a follow-up.

Sounds good. Follow-up PR: #4791

gkatz2 requested review from ChrisJBurns, JAORMX, amirejaz, blkt, jhrozek, rdimitrov and yrobla as code owners March 11, 2026 16:36

github-actions bot added the size/M Medium PR: 300-599 lines changed label Mar 11, 2026

gkatz2 force-pushed the fix/configurable-health-check-4084 branch from 0656ee6 to 7d879e7 Compare March 11, 2026 18:54

github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 11, 2026

amirejaz requested a review from Copilot March 11, 2026 23:41

Copilot started reviewing on behalf of amirejaz March 11, 2026 23:41 View session

Copilot AI reviewed Mar 11, 2026

View reviewed changes

github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 11, 2026

amirejaz reviewed Mar 12, 2026

View reviewed changes

gkatz2 closed this Mar 15, 2026

gkatz2 reopened this Mar 15, 2026

jhrozek reviewed Apr 9, 2026

View reviewed changes

Comment thread pkg/transport/proxy/transparent/transparent_proxy.go

gkatz2 added 4 commits April 9, 2026 18:16

Retrigger CI

db4406e

Signed-off-by: Greg Katz <gkatz@indeed.com>

gkatz2 force-pushed the fix/configurable-health-check-4084 branch from a41f8a8 to 41974d7 Compare April 10, 2026 01:32

github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 10, 2026

jhrozek approved these changes Apr 10, 2026

View reviewed changes

jhrozek merged commit 63a2f62 into stacklok:main Apr 10, 2026
66 of 68 checks passed

This was referenced Apr 13, 2026

Remove dead code and redundant ping timeout in health check #4790

Closed

Clean up health check dead code and timeout #4791

Merged

Conversation

gkatz2 commented Mar 11, 2026

Summary

Type of change

Test plan

Changes

Does this introduce a user-facing change?

Special notes for reviewers

Uh oh!

codecov bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

amirejaz commented Mar 11, 2026

Uh oh!

claude bot commented Mar 11, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

amirejaz commented Mar 12, 2026

Uh oh!

claude bot commented Mar 12, 2026

Uh oh!

amirejaz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

gkatz2 commented Mar 18, 2026

Uh oh!

jhrozek left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jhrozek commented Apr 9, 2026

Uh oh!

Uh oh!

gkatz2 commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jhrozek commented Apr 13, 2026

Uh oh!

gkatz2 commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov bot commented Mar 11, 2026 •

edited

Loading

gkatz2 commented Apr 10, 2026 •

edited

Loading

gkatz2 commented Apr 13, 2026 •

edited

Loading