Make transparent proxy health check parameters configurable#4108
Make transparent proxy health check parameters configurable#4108jhrozek merged 4 commits intostacklok:mainfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4108 +/- ##
==========================================
+ Coverage 68.71% 68.77% +0.06%
==========================================
Files 517 517
Lines 54817 54846 +29
==========================================
+ Hits 37666 37723 +57
+ Misses 14252 14234 -18
+ Partials 2899 2889 -10 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
0656ee6 to
7d879e7
Compare
|
@claude could you review this PR? |
|
I'll analyze this and get back to you. |
There was a problem hiding this comment.
Pull request overview
This PR makes the transparent proxy health check behavior configurable via environment variables (ping timeout, retry delay, failure threshold), improving resilience in environments with transient network disruptions and increasing the default failure threshold from 3 to 5.
Changes:
- Add env var constants + resolution helpers for transparent proxy health check parameters, and wire resolved values into
TransparentProxy. - Replace the package-level retry-count constant with a per-proxy
healthCheckFailureThresholdfield (options + env var). - Add unit tests for env var resolution and constructor wiring, and document the new tuning parameters.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| pkg/transport/proxy/transparent/transparent_proxy.go | Adds env var resolution for health-check parameters and stores the failure threshold on the proxy instance. |
| pkg/transport/proxy/transparent/transparent_test.go | Adds table-driven tests for the getter functions and constructor/option wiring. |
| docs/arch/03-transport-architecture.md | Documents the new transparent proxy health check tuning environment variables. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
|
@claude Review this PR focusing on the new TOOLHIVE_HEALTH_CHECK_* env vars and their interactions, edge cases, and documentation. |
|
I'll analyze this and get back to you. |
amirejaz
left a comment
There was a problem hiding this comment.
Good, focused PR. The env var pattern is consistent and the tests cover the happy/sad paths well. I have four findings: one test reliability bug, a docs inaccuracy, an undocumented edge case, and a now-redundant guard worth cleaning up.
|
Hi @amirejaz — thanks for the thorough review. Your 5 inline findings plus the Copilot bot's suggestions have all been addressed in the last two commits (
Would you have time for a re-review? Thanks! |
jhrozek
left a comment
There was a problem hiding this comment.
A couple of minor observations from reviewing this PR:
Hardcoded 5s context timeout in pkg/healthcheck/healthcheck.go:110 — checkMCPStatus wraps every Ping() call in context.WithTimeout(ctx, 5*time.Second), which silently caps the new TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT env var at 5s. At the current default (also 5s) this is a non-issue, but if someone sets a higher timeout for slow backends, the hardcoded context will cancel first. Worth a note or follow-up.
|
@gkatz2 again I'm sorry for letting you wait so long for a review, this is bad contributor experience. We need to make sure work in-flight is handed over better. I think the PR is in good shape overall and normally I would have acked but seeing there are conflicts, I added some nits inline - don't feel forced to resolve them, they are mostly observations. |
The health check ping timeout, retry delay, and failure threshold were hardcoded, making it impossible to tune resilience for environments with transient network disruptions. Only the check interval was configurable via TOOLHIVE_HEALTH_CHECK_INTERVAL. Add three new environment variables following the same pattern: TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT, TOOLHIVE_HEALTH_CHECK_RETRY_DELAY, and TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD. Raise the default failure threshold from 3 to 5 to extend the tolerance window from ~40s to ~60s, reducing false positives without significantly delaying detection of genuinely dead servers. Fixes stacklok#4084 Signed-off-by: Greg Katz <gkatz@indeed.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>
- Always set env vars in getter tests (including empty string for default cases) so tests are hermetic against ambient environment - Clear all health check env vars in TestNewTransparentProxyDefaultValues - Fix failure window formula in docs to match actual behavior - Remove reference to not-yet-merged env var name in docs Signed-off-by: Greg Katz <gkatz@indeed.com>
- Fix failure window estimate: 70s → 60s to match the formula - Document threshold=1 behavior (immediate shutdown, no retries) - Note that pingTimeout > interval extends the failure window - Remove pre-existing dead code: retryDelay == 0 guard was unreachable before this PR (constructor always set a positive value via DefaultHealthCheckRetryDelay) - Fix test hermeticity: isolate env vars in option function test Signed-off-by: Greg Katz <gkatz@indeed.com>
a41f8a8 to
41974d7
Compare
Whoops, I missed this comment. @jhrozek, how do you typically handle addressing missed nits like these? Do you want a separate issue for them? Do you want to let them go for now? |
No problem @gkatz2 the main decision point is the approved/changes requested. The nits are something you can either fix in the same PR if you feel like it or send a follow-up. |
Sounds good. Follow-up PR: #4791 |

Summary
TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT,TOOLHIVE_HEALTH_CHECK_RETRY_DELAY,TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD) following the existingTOOLHIVE_HEALTH_CHECK_INTERVALpattern.Fixes #4084
Type of change
Test plan
task test)task test-e2e)task lint-fix)Built a test binary and started an MCP server with invalid env vars (
TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD=banana,TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT=not-a-duration). Verified warning logs appeared in the server log file with the env var name, invalid value, and default fallback. Confirmeddefault=5in the threshold warning, showing the new default is active.Changes
pkg/transport/proxy/transparent/transparent_proxy.gohealthCheckRetryCountconstant with struct fieldpkg/transport/proxy/transparent/transparent_test.gonewMinimalProxy()docs/arch/03-transport-architecture.mdDoes this introduce a user-facing change?
Yes. Three new environment variables for tuning health check behavior:
TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT5sTOOLHIVE_HEALTH_CHECK_RETRY_DELAY5sTOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD5Behavioral change: The default failure threshold increases from 3 to 5. This means the proxy tolerates approximately 60 seconds of consecutive health check failures before shutting down (previously ~40 seconds). This improves resilience to transient network disruptions without significantly delaying detection of genuinely dead servers.
Special notes for reviewers
getHealthCheckInterval()was also updated to addslog.Warnon invalid values andslog.Debugon custom values, for consistency with the new getter functions.healthCheckRetryCountpackage-level constant was replaced with ahealthCheckFailureThresholdstruct field to allow per-proxy configuration via the options pattern.--health-check-failure-thresholdonthv run) would be a natural follow-up, following the precedence pattern suggested in Make proxy response timeout configurable #4063. Happy to include that in this PR if you'd prefer it bundled, or as a separate follow-up.