Skip to content

fix(runtime): drain in-flight reconcile actions before client disconnect (MCP-783)#558

Merged
Dumbris merged 1 commit into
mainfrom
fix/mcp783-drain-reconcile-actions
Jun 1, 2026
Merged

fix(runtime): drain in-flight reconcile actions before client disconnect (MCP-783)#558
Dumbris merged 1 commit into
mainfrom
fix/mcp783-drain-reconcile-actions

Conversation

@Dumbris

@Dumbris Dumbris commented Jun 1, 2026

Copy link
Copy Markdown
Member

Root fix for the MCP-770 race cascade

MCP-770 surfaced five distinct -race/test failures one-by-one, each unmasking the next. All shared one defect: runtime.CloseShutdownAll disconnects upstream clients while the supervisor reconcile loop is still connecting the just-added server. Connect and Disconnect overlapped on one *managed.Client, racing every lifecycle field that assumed no overlap (Config pointer, stderr/process-monitor ctx + waitgroup, …). The symptoms were each fixed individually in #555/#556; this removes the source.

Sharpened root cause

The reconcile action goroutines are fire-and-forget and untracked:

  • reconcile() dispatches each action in a bare goroutine with an explicit // no waiting! comment (supervisor.go), running executeActionConnectServerclient.Connect().
  • Supervisor.Stop() cancelled s.ctx and waited only on s.wg — which tracks the three long-lived loops, not the per-action goroutines.
  • So Stop() could return with a Connect() still in flight; runtime.Close then ran ShutdownAllDisconnect(), overlapping on the same client.

Cancelling the context alone is insufficient — cancellation only signals; the in-flight Connect() keeps touching client fields while Disconnect() starts. The fix must wait.

Change (confined to internal/runtime/supervisor/)

  • Add actionWg sync.WaitGroup tracking in-flight reconcile action goroutines (Add under stateMu, before go; defer Done inside).
  • Add a stopping flag set under stateMu in Stop(); reconcile() skips dispatch once stopping, so no actionWg.Add can happen after the drain (no Add-after-Wait).
  • Stop() now: set stoppingcancel()wg.Wait()drainActions()upstream.Close(). The drain is bounded by a 35s backstop (> the 30s per-action context timeout) so a wedged Connect can't hang shutdown.

Tests / verification

  • New -race regression guard TestSupervisor_Stop_DrainsInFlightConnectBeforeClose: holds a Connect in flight, asserts Stop() blocks until it completes and that upstream.Close() never overlaps an in-flight Connect. Verified red before the fix, green after (-count=2 -race).
  • go test ./internal/runtime/... -race
  • go test ./internal/upstream/... -race
  • go test ./internal/server/ -run TestHandleUpstreamServers_AddFromRegistry -race -count=3 (MCP-770 amplifier) ✅
  • make build ✅ · ./scripts/run-linter.sh → 0 issues ✅

Notes

  • Defense-in-depth: main is already green via the symptom fixes, so revert is safe (single PR).
  • Option (b) from the issue (hermetic add-op tests) is a separate follow-on.
  • Related #556 (no auto-close).

🤖 Generated with Claude Code

Supervisor reconcile dispatches each action (Connect/Disconnect/Reconnect/
Remove) in a bare, untracked goroutine ('no waiting!'). Supervisor.Stop()
cancelled the context and waited only on the three long-lived loops (s.wg),
so it could return while a ConnectServer -> client.Connect() was still in
flight. runtime.Close then called ShutdownAll -> Disconnect, overlapping
Connect and Disconnect on the same client — the root cause of the MCP-770
race cascade (five symptoms, each unmasking the next).

Track action goroutines in a new actionWg and gate dispatch with a stopping
flag set under stateMu, so Stop() drains all in-flight actions before
disconnecting clients. The drain is bounded by a 35s backstop (> the 30s
per-action context timeout) so a wedged Connect can't hang shutdown.

Confined to internal/runtime/supervisor. Adds a -race regression guard that
asserts upstream.Close() never overlaps an in-flight Connect.

Related #556

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@cloudflare-workers-and-pages

Copy link
Copy Markdown

Deploying mcpproxy-docs with  Cloudflare Pages  Cloudflare Pages

Latest commit: 9f7b2c1
Status: ✅  Deploy successful!
Preview URL: https://55819e63.mcpproxy-docs.pages.dev
Branch Preview URL: https://fix-mcp783-drain-reconcile-a.mcpproxy-docs.pages.dev

View logs

@codecov-commenter

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 65.21739% with 8 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
internal/runtime/supervisor/supervisor.go 65.21% 7 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown

📦 Build Artifacts

Workflow Run: View Run
Branch: fix/mcp783-drain-reconcile-actions

Available Artifacts

  • archive-darwin-amd64 (28 MB)
  • archive-darwin-arm64 (25 MB)
  • archive-linux-amd64 (16 MB)
  • archive-linux-arm64 (14 MB)
  • archive-windows-amd64 (27 MB)
  • archive-windows-arm64 (24 MB)
  • frontend-dist-pr (0 MB)
  • installer-dmg-darwin-amd64 (21 MB)
  • installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

  1. Go to the workflow run page linked above
  2. Scroll to the bottom "Artifacts" section
  3. Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 26744915501 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

@Dumbris Dumbris merged commit c17f262 into main Jun 1, 2026
28 checks passed
@Dumbris Dumbris deleted the fix/mcp783-drain-reconcile-actions branch June 1, 2026 09:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants