feat: implement health-aware restart logic for remote workloads #3433

flfeurmou-indeed · 2026-01-24T04:54:23Z

Summary

This PR implements health-aware restart logic for remote workloads (like Glean and Datadog). It enhances the supervisor check to verify not just that the process exists, but that it is responsive to HTTP health checks.

Problem

After laptop sleep/wake cycles, remote MCP server proxies often end up in a "zombie" state where the process exists (PID is valid) but the network socket is broken or the application is hung. The current logic only checks for PID existence, so it assumes the workload is healthy and refuses to restart it, or fails to clean it up properly.

Solution

Updated isSupervisorHealthy (formerly isSupervisorProcessAlive) to:

Check if the process exists (using process.FindProcess)
For remote workloads, perform an HTTP GET request to the health endpoint
Return false if either check fails, triggering the restart/cleanup logic

This allows ToolHive to automatically detect and recover from "alive but unresponsive" states.

Changes

Updated DefaultManager to support dependency injection for health checks
Implemented checkHTTPHealth helper
Updated isSupervisorHealthy to use the new check
Updated unit tests to cover the new logic and mock the health checks

Testing

Added unit tests for healthy/unhealthy supervisor scenarios
Verified that existing tests pass
(Manual verification recommended on a real sleep/wake cycle)

This commit enhances the restart logic for remote workloads (like Glean/Datadog) to be health-aware. Instead of just checking if the supervisor process exists (PID check), it now also performs an HTTP health check against the proxy. This addresses the "alive but unresponsive" state that often occurs after laptop sleep/wake cycles, where the process exists but the network socket is broken or the application is hung. Changes: - Added field to for dependency injection - Updated to perform HTTP health checks for remote workloads - Updated tests to mock health checks and verify correct behavior - Fixed test expectations for workload name vs base name usage in PID checks This is a step towards a more robust supervisor that can auto-heal broken workloads. Signed-off-by: Frederic Le Feurmou <flfeurmou@indeed.com>

dmjb

We have healthchecks implemented elsewhere in the codebase, but we removed them to fix another bug. We have an issue open to reinstate them, I would prefer to close the PR and fix the original healthcheck mechanism.

dmjb · 2026-01-27T13:38:59Z

pkg/workloads/manager.go

@@ -809,25 +775,125 @@ func (d *DefaultManager) getWorkloadContainer(ctx context.Context, name string)
 	return &container, nil
 }

+// checkHTTPHealth performs a simple HTTP GET request to check health
+func checkHTTPHealth(ctx context.Context, url string) bool {


We have these sorts of health checks implemented further down the stack (in the proxy and transport layer).

flfeurmou-indeed · 2026-01-28T02:27:17Z

Closing this PR in favor of the existing healthcheck mechanism. I'll test with the env var enabled once PR #3462 is included in a release, as suggested by @dmjb.

dmjb requested changes Jan 27, 2026

View reviewed changes

flfeurmou-indeed closed this Jan 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: implement health-aware restart logic for remote workloads #3433

feat: implement health-aware restart logic for remote workloads #3433

flfeurmou-indeed commented Jan 24, 2026

Uh oh!

dmjb left a comment

Uh oh!

dmjb Jan 27, 2026

Uh oh!

flfeurmou-indeed commented Jan 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: implement health-aware restart logic for remote workloads #3433

feat: implement health-aware restart logic for remote workloads #3433

Conversation

flfeurmou-indeed commented Jan 24, 2026

Summary

Problem

Solution

Changes

Testing

Uh oh!

dmjb left a comment

Choose a reason for hiding this comment

Uh oh!

dmjb Jan 27, 2026

Choose a reason for hiding this comment

Uh oh!

flfeurmou-indeed commented Jan 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants