fix(runtime): drain in-flight reconcile actions before client disconnect (MCP-783) #558
Merged
Supervisor reconcile dispatches each action (Connect/Disconnect/Reconnect/
Remove) in a bare, untracked goroutine ('no waiting!'). Supervisor.Stop()
cancelled the context and waited only on the three long-lived loops (s.wg),
so it could return while a ConnectServer -> client.Connect() was still in
flight. runtime.Close then called ShutdownAll -> Disconnect, overlapping
Connect and Disconnect on the same client — the root cause of the MCP-770
race cascade (five symptoms, each unmasking the next).
Track action goroutines in a new actionWg and gate dispatch with a stopping
flag set under stateMu, so Stop() drains all in-flight actions before
disconnecting clients. The drain is bounded by a 35s backstop (> the 30s
per-action context timeout) so a wedged Connect can't hang shutdown.
Confined to internal/runtime/supervisor. Adds a -race regression guard that
asserts upstream.Close() never overlaps an in-flight Connect.
Related #556
Co-Authored-By: Paperclip <noreply@paperclip.ing>
Deploying mcpproxy-docs
| Latest commit: | 9f7b2c1 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://55819e63.mcpproxy-docs.pages.dev |
| Branch Preview URL: | https://fix-mcp783-drain-reconcile-a.mcpproxy-docs.pages.dev |
Root fix for the MCP-770 race cascade
MCP-770 surfaced five distinct `-race`/test failures one by one, each unmasking the next. All shared one defect: `runtime.Close` → `ShutdownAll` disconnects upstream clients while the supervisor reconcile loop is still connecting the just-added server. Connect and Disconnect overlapped on one `*managed.Client`, racing every lifecycle field that assumed no overlap (Config pointer, stderr/process-monitor ctx + waitgroup, …). The symptoms were each fixed individually in #555/#556; this removes the source.

Sharpened root cause
The reconcile action goroutines are fire-and-forget and untracked:
- `reconcile()` dispatches each action in a bare goroutine with an explicit `// no waiting!` comment (`supervisor.go`), running `executeAction` → `ConnectServer` → `client.Connect()`.
- `Supervisor.Stop()` cancelled `s.ctx` and waited only on `s.wg`, which tracks the three long-lived loops, not the per-action goroutines.
- `Stop()` could return with a `Connect()` still in flight; `runtime.Close` then ran `ShutdownAll` → `Disconnect()`, overlapping on the same client.

Cancelling the context alone is insufficient: cancellation only signals; the in-flight `Connect()` keeps touching client fields while `Disconnect()` starts. The fix must wait.

Change (confined to `internal/runtime/supervisor/`)
- `actionWg sync.WaitGroup` tracking in-flight reconcile action goroutines (`Add` under `stateMu`, before `go`; `defer Done` inside).
- `stopping` flag set under `stateMu` in `Stop()`; `reconcile()` skips dispatch once stopping, so no `actionWg.Add` can happen after the drain (no Add-after-Wait).
- `Stop()` now: set `stopping` → `cancel()` → `wg.Wait()` → `drainActions()` → `upstream.Close()`. The drain is bounded by a 35s backstop (> the 30s per-action context timeout) so a wedged Connect can't hang shutdown.

Tests / verification
- `-race` regression guard `TestSupervisor_Stop_DrainsInFlightConnectBeforeClose`: holds a Connect in flight, asserts `Stop()` blocks until it completes and that `upstream.Close()` never overlaps an in-flight Connect. Verified red before the fix, green after (`-count=2 -race`).
- `go test ./internal/runtime/... -race` ✅
- `go test ./internal/upstream/... -race` ✅
- `go test ./internal/server/ -run TestHandleUpstreamServers_AddFromRegistry -race -count=3` (MCP-770 amplifier) ✅
- `make build` ✅ · `./scripts/run-linter.sh` → 0 issues ✅

Notes
- `main` is already green via the symptom fixes, so revert is safe (single PR).
- `Related #556` (no auto-close).

🤖 Generated with Claude Code
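
The root-cause point that cancelling a context only signals, and the kind of overlap check the regression guard relies on, can be illustrated with a small self-contained program. This is a sketch under assumptions, not the test from the PR: `fakeClient`, its counters, and `shutdown` are hypothetical names invented for the example, and the 30ms sleep stands in for a connect step that does not observe cancellation promptly.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// fakeClient is a hypothetical stand-in for an upstream client.
type fakeClient struct {
	inFlight int32 // number of Connect calls currently running
	overlaps int32 // Disconnect calls that observed a running Connect
}

// Connect marks itself in flight, closes started, then simulates a step
// that keeps running after ctx is cancelled.
func (c *fakeClient) Connect(ctx context.Context, started chan<- struct{}) {
	atomic.AddInt32(&c.inFlight, 1)
	defer atomic.AddInt32(&c.inFlight, -1)
	close(started)
	time.Sleep(30 * time.Millisecond)
}

// Disconnect records an overlap if any Connect is still running.
func (c *fakeClient) Disconnect() {
	if atomic.LoadInt32(&c.inFlight) > 0 {
		atomic.AddInt32(&c.overlaps, 1)
	}
}

// shutdown starts one Connect, cancels its context, optionally waits for
// the goroutine, then disconnects. It reports whether Disconnect overlapped
// the in-flight Connect.
func shutdown(waitForActions bool) bool {
	ctx, cancel := context.WithCancel(context.Background())
	c := &fakeClient{}
	var wg sync.WaitGroup
	started := make(chan struct{})

	wg.Add(1)
	go func() {
		defer wg.Done()
		c.Connect(ctx, started)
	}()
	<-started // Connect is now in flight

	cancel() // signals only; does not wait for Connect to return
	if waitForActions {
		wg.Wait()
	}
	c.Disconnect()

	wg.Wait() // let the goroutine finish before reading the result
	return atomic.LoadInt32(&c.overlaps) > 0
}

func main() {
	fmt.Println("cancel only overlapped:", shutdown(false))
	fmt.Println("cancel + wait overlapped:", shutdown(true))
}
```

With cancellation alone the disconnect lands while the simulated connect is still running; waiting on the WaitGroup first removes the overlap, which is the property the regression guard asserts for `upstream.Close()`.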