Skip to content

Make transparent proxy health check parameters configurable#4108

Merged
jhrozek merged 4 commits intostacklok:mainfrom
gkatz2:fix/configurable-health-check-4084
Apr 10, 2026
Merged

Make transparent proxy health check parameters configurable#4108
jhrozek merged 4 commits intostacklok:mainfrom
gkatz2:fix/configurable-health-check-4084

Conversation

@gkatz2
Copy link
Copy Markdown
Contributor

@gkatz2 gkatz2 commented Mar 11, 2026

Summary

  • The transparent proxy health check has mostly hardcoded parameters (ping timeout, retry delay, failure threshold), making it impossible to tune resilience for environments with transient network disruptions. When the proxy kills itself due to false positives, the client's MCP session is invalidated, requiring a full client restart to recover.
  • Add three new environment variables (TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT, TOOLHIVE_HEALTH_CHECK_RETRY_DELAY, TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD) following the existing TOOLHIVE_HEALTH_CHECK_INTERVAL pattern.
  • Raise the default failure threshold from 3 to 5, extending the tolerance window from ~40s to ~60s out of the box.
  • Add warning logs when invalid env var values are provided, and debug logs when custom values are successfully applied.

Fixes #4084

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Built a test binary and started an MCP server with invalid env vars (TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD=banana, TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT=not-a-duration). Verified warning logs appeared in the server log file with the env var name, invalid value, and default fallback. Confirmed default=5 in the threshold warning, showing the new default is active.

Changes

File Change
pkg/transport/proxy/transparent/transparent_proxy.go Add env var constants, getter functions, struct field for failure threshold, wire getters in constructor, replace package-level healthCheckRetryCount constant with struct field
pkg/transport/proxy/transparent/transparent_test.go Table-driven tests for all 4 getters, constructor wiring tests, option function tests, test helper newMinimalProxy()
docs/arch/03-transport-architecture.md Add "Health Check Tuning Parameters" section documenting all 4 env vars

Does this introduce a user-facing change?

Yes. Three new environment variables for tuning health check behavior:

Environment Variable Description Default Type
TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT Timeout for each health check ping 5s duration
TOOLHIVE_HEALTH_CHECK_RETRY_DELAY Delay between retry attempts after a failure 5s duration
TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD Consecutive failures before proxy shutdown 5 integer

Behavioral change: The default failure threshold increases from 3 to 5. This means the proxy tolerates approximately 60 seconds of consecutive health check failures before shutting down (previously ~40 seconds). This improves resilience to transient network disruptions without significantly delaying detection of genuinely dead servers.

Special notes for reviewers

  • The existing getHealthCheckInterval() was also updated to add slog.Warn on invalid values and slog.Debug on custom values, for consistency with the new getter functions.
  • The healthCheckRetryCount package-level constant was replaced with a healthCheckFailureThreshold struct field to allow per-proxy configuration via the options pattern.
  • Per-server flags (e.g., --health-check-failure-threshold on thv run) would be a natural follow-up, following the precedence pattern suggested in Make proxy response timeout configurable #4063. Happy to include that in this PR if you'd prefer it bundled, or as a separate follow-up.

@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Mar 11, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.77%. Comparing base (d851c69) to head (41974d7).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4108      +/-   ##
==========================================
+ Coverage   68.71%   68.77%   +0.06%     
==========================================
  Files         517      517              
  Lines       54817    54846      +29     
==========================================
+ Hits        37666    37723      +57     
+ Misses      14252    14234      -18     
+ Partials     2899     2889      -10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@gkatz2 gkatz2 force-pushed the fix/configurable-health-check-4084 branch from 0656ee6 to 7d879e7 Compare March 11, 2026 18:54
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 11, 2026
@amirejaz
Copy link
Copy Markdown
Contributor

@claude could you review this PR?

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Mar 11, 2026

Claude Code is working…

I'll analyze this and get back to you.

View job run

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes the transparent proxy health check behavior configurable via environment variables (ping timeout, retry delay, failure threshold), improving resilience in environments with transient network disruptions and increasing the default failure threshold from 3 to 5.

Changes:

  • Add env var constants + resolution helpers for transparent proxy health check parameters, and wire resolved values into TransparentProxy.
  • Replace the package-level retry-count constant with a per-proxy healthCheckFailureThreshold field (options + env var).
  • Add unit tests for env var resolution and constructor wiring, and document the new tuning parameters.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

File Description
pkg/transport/proxy/transparent/transparent_proxy.go Adds env var resolution for health-check parameters and stores the failure threshold on the proxy instance.
pkg/transport/proxy/transparent/transparent_test.go Adds table-driven tests for the getter functions and constructor/option wiring.
docs/arch/03-transport-architecture.md Documents the new transparent proxy health check tuning environment variables.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread docs/arch/03-transport-architecture.md Outdated
Comment thread pkg/transport/proxy/transparent/transparent_test.go
Comment thread pkg/transport/proxy/transparent/transparent_test.go
Comment thread pkg/transport/proxy/transparent/transparent_test.go
Comment thread pkg/transport/proxy/transparent/transparent_test.go
Comment thread pkg/transport/proxy/transparent/transparent_test.go
Comment thread docs/arch/03-transport-architecture.md Outdated
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 11, 2026
@amirejaz
Copy link
Copy Markdown
Contributor

@claude Review this PR focusing on the new TOOLHIVE_HEALTH_CHECK_* env vars and their interactions, edge cases, and documentation.

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Mar 12, 2026

Claude Code is working…

I'll analyze this and get back to you.

View job run

Copy link
Copy Markdown
Contributor

@amirejaz amirejaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good, focused PR. The env var pattern is consistent and the tests cover the happy/sad paths well. I have four findings: one test reliability bug, a docs inaccuracy, an undocumented edge case, and a now-redundant guard worth cleaning up.

Comment thread docs/arch/03-transport-architecture.md Outdated
Comment thread docs/arch/03-transport-architecture.md
Comment thread pkg/transport/proxy/transparent/transparent_proxy.go
Comment thread pkg/transport/proxy/transparent/transparent_proxy.go
Comment thread pkg/transport/proxy/transparent/transparent_test.go
@gkatz2 gkatz2 closed this Mar 15, 2026
@gkatz2 gkatz2 reopened this Mar 15, 2026
@gkatz2
Copy link
Copy Markdown
Contributor Author

gkatz2 commented Mar 18, 2026

Hi @amirejaz — thanks for the thorough review. Your 5 inline findings plus the Copilot bot's suggestions have all been addressed in the last two commits (9e89dc30 and a41f8a87):

  • Test hermeticity: all getter tests and the default-values test now explicitly set env vars (including empty string for default cases)
  • Failure window formula corrected to ~60s
  • pingTimeout > interval interaction documented
  • threshold=1 behavior (immediate shutdown, no retries) documented
  • Dead retryDelay == 0 guard removed

Would you have time for a re-review? Thanks!

Copy link
Copy Markdown
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of minor observations from reviewing this PR:

Hardcoded 5s context timeout in pkg/healthcheck/healthcheck.go:110checkMCPStatus wraps every Ping() call in context.WithTimeout(ctx, 5*time.Second), which silently caps the new TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT env var at 5s. At the current default (also 5s) this is a non-issue, but if someone sets a higher timeout for slow backends, the hardcoded context will cancel first. Worth a note or follow-up.

Comment thread pkg/transport/proxy/transparent/transparent_proxy.go
@jhrozek
Copy link
Copy Markdown
Contributor

jhrozek commented Apr 9, 2026

@gkatz2 again I'm sorry for letting you wait so long for a review, this is bad contributor experience. We need to make sure work in-flight is handed over better.

I think the PR is in good shape overall and normally I would have acked but seeing there are conflicts, I added some nits inline - don't feel forced to resolve them, they are mostly observations.

gkatz2 added 4 commits April 9, 2026 18:16
The health check ping timeout, retry delay, and failure threshold
were hardcoded, making it impossible to tune resilience for
environments with transient network disruptions. Only the check
interval was configurable via TOOLHIVE_HEALTH_CHECK_INTERVAL.

Add three new environment variables following the same pattern:
TOOLHIVE_HEALTH_CHECK_PING_TIMEOUT, TOOLHIVE_HEALTH_CHECK_RETRY_DELAY,
and TOOLHIVE_HEALTH_CHECK_FAILURE_THRESHOLD. Raise the default failure
threshold from 3 to 5 to extend the tolerance window from ~40s to ~60s,
reducing false positives without significantly delaying detection of
genuinely dead servers.

Fixes stacklok#4084

Signed-off-by: Greg Katz <gkatz@indeed.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>
- Always set env vars in getter tests (including empty string for
  default cases) so tests are hermetic against ambient environment
- Clear all health check env vars in TestNewTransparentProxyDefaultValues
- Fix failure window formula in docs to match actual behavior
- Remove reference to not-yet-merged env var name in docs

Signed-off-by: Greg Katz <gkatz@indeed.com>
- Fix failure window estimate: 70s → 60s to match the formula
- Document threshold=1 behavior (immediate shutdown, no retries)
- Note that pingTimeout > interval extends the failure window
- Remove pre-existing dead code: retryDelay == 0 guard was
  unreachable before this PR (constructor always set a positive
  value via DefaultHealthCheckRetryDelay)
- Fix test hermeticity: isolate env vars in option function test

Signed-off-by: Greg Katz <gkatz@indeed.com>
@gkatz2 gkatz2 force-pushed the fix/configurable-health-check-4084 branch from a41f8a8 to 41974d7 Compare April 10, 2026 01:32
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 10, 2026
@jhrozek jhrozek merged commit 63a2f62 into stacklok:main Apr 10, 2026
66 of 68 checks passed
@gkatz2
Copy link
Copy Markdown
Contributor Author

gkatz2 commented Apr 10, 2026

@gkatz2 again I'm sorry for letting you wait so long for a review, this is bad contributor experience. We need to make sure work in-flight is handed over better.

I think the PR is in good shape overall and normally I would have acked but seeing there are conflicts, I added some nits inline - don't feel forced to resolve them, they are mostly observations.

Whoops, I missed this comment. @jhrozek, how do you typically handle addressing missed nits like these? Do you want a separate issue for them? Do you want to let them go for now?

@jhrozek
Copy link
Copy Markdown
Contributor

jhrozek commented Apr 13, 2026

@gkatz2 again I'm sorry for letting you wait so long for a review, this is bad contributor experience. We need to make sure work in-flight is handed over better.
I think the PR is in good shape overall and normally I would have acked but seeing there are conflicts, I added some nits inline - don't feel forced to resolve them, they are mostly observations.

Whoops, I missed this comment. @jhrozek, how do you typically handle addressing missed nits like these? Do you want a separate issue for them? Do you want to let them go for now?

No problem @gkatz2 the main decision point is the approved/changes requested. The nits are something you can either fix in the same PR if you feel like it or send a follow-up.

@gkatz2
Copy link
Copy Markdown
Contributor Author

gkatz2 commented Apr 13, 2026

@gkatz2 again I'm sorry for letting you wait so long for a review, this is bad contributor experience. We need to make sure work in-flight is handed over better.
I think the PR is in good shape overall and normally I would have acked but seeing there are conflicts, I added some nits inline - don't feel forced to resolve them, they are mostly observations.

Whoops, I missed this comment. @jhrozek, how do you typically handle addressing missed nits like these? Do you want a separate issue for them? Do you want to let them go for now?

No problem @gkatz2 the main decision point is the approved/changes requested. The nits are something you can either fix in the same PR if you feel like it or send a follow-up.

Sounds good. Follow-up PR: #4791

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Medium PR: 300-599 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make transparent proxy health check parameters configurable

4 participants