Skip to content

Conversation

@flfeurmou-indeed
Copy link
Contributor

Summary

This PR implements health-aware restart logic for remote workloads (like Glean and Datadog). It enhances the supervisor check to verify not just that the process exists, but that it is responsive to HTTP health checks.

Problem

After laptop sleep/wake cycles, remote MCP server proxies often end up in a "zombie" state where the process exists (PID is valid) but the network socket is broken or the application is hung. The current logic only checks for PID existence, so it assumes the workload is healthy and refuses to restart it, or fails to clean it up properly.

Solution

Updated isSupervisorHealthy (formerly isSupervisorProcessAlive) to:

  1. Check if the process exists (using process.FindProcess)
  2. For remote workloads, perform an HTTP GET request to the health endpoint
  3. Return false if either check fails, triggering the restart/cleanup logic

This allows ToolHive to automatically detect and recover from "alive but unresponsive" states.

Changes

  • Updated DefaultManager to support dependency injection for health checks
  • Implemented checkHTTPHealth helper
  • Updated isSupervisorHealthy to use the new check
  • Updated unit tests to cover the new logic and mock the health checks

Testing

  • Added unit tests for healthy/unhealthy supervisor scenarios
  • Verified that existing tests pass
  • (Manual verification recommended on a real sleep/wake cycle)

This commit enhances the restart logic for remote workloads (like Glean/Datadog)
to be health-aware. Instead of just checking if the supervisor process exists
(PID check), it now also performs an HTTP health check against the proxy.

This addresses the "alive but unresponsive" state that often occurs after
laptop sleep/wake cycles, where the process exists but the network socket
is broken or the application is hung.

Changes:
- Added  field to  for dependency injection
- Updated  to perform HTTP health checks for remote workloads
- Updated tests to mock health checks and verify correct behavior
- Fixed test expectations for workload name vs base name usage in PID checks

This is a step towards a more robust supervisor that can auto-heal broken
workloads.

Signed-off-by: Frederic Le Feurmou <flfeurmou@indeed.com>
Copy link
Member

@dmjb dmjb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have healthchecks implemented elsewhere in the codebase, but we removed them to fix another bug. We have an issue open to reinstate them, I would prefer to close the PR and fix the original healthcheck mechanism.

@@ -809,25 +775,125 @@ func (d *DefaultManager) getWorkloadContainer(ctx context.Context, name string)
return &container, nil
}

// checkHTTPHealth performs a simple HTTP GET request to check health
func checkHTTPHealth(ctx context.Context, url string) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have these sorts of health checks implemented further down the stack (in the proxy and transport layer).

@flfeurmou-indeed
Copy link
Contributor Author

Closing this PR in favor of the existing healthcheck mechanism. I'll test with the env var enabled once PR #3462 is included in a release, as suggested by @dmjb.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants