fix(upstream): data race on client.Config between DiscoverTools and SetConfig (MCP-770)#556
Merged
Merged
Conversation
Deploying mcpproxy-docs with
|
| Latest commit: |
88ef69a
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://4a30d918.mcpproxy-docs.pages.dev |
| Branch Preview URL: | https://fix-mcp770-discovertools-con.mcpproxy-docs.pages.dev |
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
📦 Build ArtifactsWorkflow Run: View Run Available Artifacts
How to DownloadOption 1: GitHub Web UI (easiest)
Option 2: GitHub CLI gh run download 26738476773 --repo smart-mcp-proxy/mcpproxy-go
|
DiscoverTools snapshotted client.Config directly while only holding the Manager's m.mu, but the reconcile add path (AddServerConfig) releases m.mu before calling managed.Client.SetConfig, which guards mc.Config with the client's own mc.mu. The two accesses therefore used different mutexes, producing a data race the -race detector flagged intermittently in CI (PR #555, run 26719908592). Fix the read side: read config via the mutex-guarded GetConfig() accessor in DiscoverTools and in the API-facing status readers (GetStats, GetTotalToolCount, ListServers) that share the identical hazard — they read client.Config after releasing m.mu and run concurrently with reconcile. mc.mu is a leaf lock (only Get/SetConfig take it, never while holding m.mu), so this is deadlock-safe. Add TestDiscoverTools_ConfigRace: drives AddServerConfig (SetConfig) against the readers concurrently; trips -race before the fix, green after. Related: MCP-770
The first MCP-770 fix routed Manager-side reads (DiscoverTools, GetStats, GetTotalToolCount, ListServers) through GetConfig(), but the underlying hazard is broader: managed.Client.Config is a pointer field that SetConfig swaps off m.mu (reconcile add path), while it is read concurrently from many sites that do NOT hold the client's mc.mu — including Client.Connect's unlocked phase (client.go:197) and detached state-change callback goroutines (onStateChange, spawned by StateManager.SetError). The CI -race detector flagged the Connect-vs-SetConfig variant on PR #555's macOS unit job after the first fix was cherry-picked. A mutex-guarded accessor can't cover this: most internal readers run while mc.mu is already held, so routing them through an RLock GetConfig() would deadlock (sync.RWMutex is not reentrant). Instead make the field itself lock-free and data-race-free: - Replace the exported `Config *config.ServerConfig` field with an unexported `cfg atomic.Pointer[config.ServerConfig]`. - GetConfig() returns cfg.Load(); SetConfig() does cfg.Store(). Both are lock-free and safe whether or not mc.mu is held. - Route every reader through GetConfig() across the managed client, the upstream Manager, and the supervisor actor pool. Add TestConnect_ConfigRace: drives AddServerConfig (SetConfig) against Client.Connect concurrently; trips -race on the old field, green after. Verified: go test -race ./internal/upstream/... ./internal/runtime/... and the original CI victims TestHandleUpstreamServers_AddFromRegistry_HappyPath / _AdminAllowed_WriteOps pass x5 under -race; build + linter clean. Related: MCP-770
StartStderrMonitoring (from connectStdio during a reconcile-driven Connect) and StopStderrMonitoring (from Disconnect during Manager.ShutdownAll) accessed the monitoring lifecycle fields — stderrMonitoringCtx/Cancel/WaitGroup and the process-monitoring equivalents — with no synchronization. When a connect and a shutdown overlap on the same client, the -race detector flags WG.Add (Start) vs WG.Wait (Stop). CI caught this on PR #555's ubuntu unit job once the earlier MCP-770 fixes (atomic Config) + the Windows temp-dir cleanup made the test's shutdown path actually run concurrently with reconcile. Add a dedicated monitoringMu that makes the four Start*/Stop*Monitoring methods mutually exclusive, so Add never overlaps Wait. The mutex is a leaf (monitor goroutines never acquire it; it is never held across c.mu), so Stop's bounded WG.Wait under the lock cannot deadlock. Add TestStderrMonitoring_StartStopRace: hammers Start/Stop concurrently; trips -race on the unsynchronized fields, green with monitoringMu. Related: MCP-770, spec 070
The monitoringMu serialization (prev commit) was incomplete: StopStderr/ StopProcessMonitoring ran WG.Wait() in a DETACHED goroutine guarded by a 500ms timeout. When a real child keeps its stderr open, monitorStderr blocks in scanner.Scan(), the wait times out, Stop returns and releases monitoringMu — but the detached WG.Wait goroutine lives on and races the next connect's WG.Add. CI caught this on #555 (DiscoverTools→Connect StartStderrMonitoring WG.Add vs the lingering StopStderrMonitoring WG.Wait). Replace the reused sync.WaitGroup with a per-cycle `done chan struct{}` the monitor goroutine closes, and pass the context + done as locals so an abandoned (timed-out) monitor never reads the shared ctx/state a later Start overwrites. Stop now waits on done directly under monitoringMu (no detached waiter) and nils the per-cycle fields. monitorStderr/monitorProcess take ctx as a parameter. Strengthen TestStderrMonitoring with an AbandonedMonitorNoRace case: a blocking pipe keeps the monitor alive so Stop hits the timeout/abandon path — the exact escape the WaitGroup version raced on. AddFromRegistry passes 12x under -race. Related: MCP-770, spec 070
02eec48 to
88ef69a
Compare
Dumbris
added a commit
that referenced
this pull request
Jun 1, 2026
Forking a work branch from another feature branch drags its unmerged commits into the PR (root cause of spec-064 docs leaking into the MCP-770 race fix #556). ENG-3 now mandates fetch + branch from origin/main explicitly, in both the engineer bundle and the contract.
Dumbris
added a commit
that referenced
this pull request
Jun 1, 2026
* docs(064): Glass Cockpit — transparent & steerable agent cockpit spec Spec, plan, and design artifacts for making the existing Paperclip "MCPProxy" cockpit (spec 045) transparent and steerable: invert the default from "proceed" to "checkpoint at every design-decision boundary" via three human gates (plan-of-attack, per-spec design, pre-merge) mapped to Paperclip native primitives (executionPolicy approval stages, request_confirmation/suggest_tasks interactions, issue_tree_holds), with reasoning visible before each gate and a single "waiting on you" view. Phased rollout: A) config + agent-instruction only (the dry-run target), B) a Paperclip plugin for the fused transparency UI, C) a fork only if A/B fall short. SynapBus is log/wiki only, never on the critical path. Includes rewritten gate-aware agent instructions, consumed-API + executionPolicy + agent-instruction contracts, data model, research, and operator quickstart. Supersedes spec 062's fresh-dev-instance approach; extends spec 045. * docs(064): amend gate model — human-merge → dual-AI auto-merge + human veto Session 2 amendment to FR-005 + US3: replace the mandatory human-merge gate (throughput bottleneck) with draft-PR + dual-AI-review consensus auto-merge. Two reviewers on different model families (Gemini 2.5-pro Critic + Codex), never the implementer; tests-green + both-accept → auto-merge; human is an optional 3rd reviewer with veto (request-changes/hold freezes auto-merge). Prerequisite flagged: a bot GitHub identity (agents currently = author's gh, and GitHub forbids self-approval) — interim fallback is 2-AI-review-as- required-check with the human merging. codex-local Paperclip adapter exists. * docs(064): dual-AI-review auto-merge — reviewer doctrine + engineer draft-PR + setup Adds the Session-2 gate-model deliverables: reviewer/REVIEWER.md (shared RV-1..RV-6 dual-review doctrine), codex-reviewer/AGENTS.md (2nd reviewer on codex-local), amends engineer ENG-4 to open DRAFT PRs + request 2 AI reviewers (no self-merge), and auto-merge-setup.md (GitHub branch-protection config + the bot-identity prerequisite + interim human-merge fallback + open items). * docs(064): reviewers on subscription auth only; Gemini quota-exhausted, Codex gpt-5.5 ready Per user directive: both AI reviewers use paid SUBSCRIPTION logins, not API keys. - Gemini Critic: subscription/OAuth, pin gemini-2.5-pro (3.5/3 UNVERIFIED — quota exhausted on every probe today; switch if confirmed). TWO blockers: quota + empty-prompt adapter bug → cannot accept yet. - Codex reviewer: ChatGPT subscription, gpt-5.5 (codex-cli 0.46.0) — READY now. - Live two-reviewer set today = Codex + human (FR-005f) until Gemini recovers. * docs(064): Critic GEMINI.md — subscription-only, quota-blocked; Kimi+Codex as live pair Gemini settings pin gemini-3.1-pro-preview (subscription/OAuth) but quota is exhausted (no reset hint) + empty-prompt adapter bug → Critic can't accept yet. Live 2-AI reviewer pair = Codex gpt-5.5 + Kimi-K2.5 (opencode_local, Gcore key present); Gemini rejoins as 3rd reviewer when quota returns. * docs(064): stand up Codex+Kimi reviewer agents; codex config fix Live dual-AI reviewer pair created in the running Paperclip cockpit and verified responding (2026-05-31): - CodexReviewer — codex_local / gpt-5-codex (5b94562c-…) - KimiReviewer — opencode_local / Kimi-K2.5 (fdaa1d4c-…) Both carry managed instruction bundles (shared doctrine + RV-1..RV-6 + role notes), report to CEO, idle, heartbeat off (woken by review-stage). Docs: - add canonical kimi-reviewer/AGENTS.md (the design lacked it) - correct codex-reviewer/AGENTS.md model facts (gpt-5.5 -> gpt-5-codex) - auto-merge-setup.md: live pair is Codex+Kimi; Gemini Critic becomes the 3rd reviewer when its subscription quota recovers codex config fix (~/.codex/config.toml, not in repo): model_reasoning_effort xhigh->high and model gpt-5.5->gpt-5-codex. On codex-cli 0.46.0 + ChatGPT subscription auth, gpt-5.5 needs a newer CLI and gpt-5.4/5.3-codex/5.2 are auth-restricted; gpt-5-codex/gpt-5 are the working models. Backup at ~/.codex/config.toml.bak.pre-reviewer-fix.* * docs(064): engineers drive PRs green + bundle docs/ updates (ENG-8/9) engineer/AGENTS.md: - ENG-8: drive every required check to green before review — run local verification before push, watch `gh pr checks --watch`, push fixes until all green; never leave/hand off a red PR, never --no-verify or weaken a check. Green CI is the engineer's job, not the reviewer's. - ENG-9: when a change touches CLI/REST/MCP API/config/defaults/security or anything under docs/, the SAME PR must update docs/ (+ CLAUDE.md/ oas/swagger.yaml/README where mirrored). Docs-only changes exempt from TDD. - ENG-5 reworked to dual-AI merge-readiness (Codex+Kimi accept + all CI green). reviewer/REVIEWER.md RV-3: red/pending check = automatic request_changes; missing docs when the change warrants them = request_changes. Applied to the live Paperclip brains: 3 engineers (Backend/Frontend/MacOS) re-flattened from canonical; Codex+Kimi reviewer brains refreshed. * docs(064): record applied CI-context branch protection on main Phase-1 gate live on main (no bot identity needed): required_status_checks strict=false with 8 always-run, non-path-conditional contexts (Lint, Unit Tests ubuntu, Build ubuntu/macos/windows, Build Frontend, Validate PR title, Verify OpenAPI Artifacts). Existing 1-review + enforce_admins=false kept. Verified: green PR #553 satisfies all 8 (blocked only by review); in-flight PR #555 blocked on pending required checks. Documents the deliberately- excluded checks and the Go-version-pinned context-name fragility. * docs(064): ENG-3 — branch from origin/main, never from a feature branch Forking a work branch from another feature branch drags its unmerged commits into the PR (root cause of spec-064 docs leaking into the MCP-770 race fix #556). ENG-3 now mandates fetch + branch from origin/main explicitly, in both the engineer bundle and the contract.
Dumbris
added a commit
that referenced
this pull request
Jun 1, 2026
…ect (MCP-783) (#558) Supervisor reconcile dispatches each action (Connect/Disconnect/Reconnect/ Remove) in a bare, untracked goroutine ('no waiting!'). Supervisor.Stop() cancelled the context and waited only on the three long-lived loops (s.wg), so it could return while a ConnectServer -> client.Connect() was still in flight. runtime.Close then called ShutdownAll -> Disconnect, overlapping Connect and Disconnect on the same client — the root cause of the MCP-770 race cascade (five symptoms, each unmasking the next). Track action goroutines in a new actionWg and gate dispatch with a stopping flag set under stateMu, so Stop() drains all in-flight actions before disconnecting clients. The drain is bounded by a 35s backstop (> the 30s per-action context timeout) so a wedged Connect can't hang shutdown. Confined to internal/runtime/supervisor. Adds a -race regression guard that asserts upstream.Close() never overlaps an in-flight Connect. Related #556 Co-authored-by: Paperclip <noreply@paperclip.ing>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The
-racedetector intermittently flagged a data race in the upstream layer whenever a server was added (reconcile) while background tool indexing ran. Surfaced on PR #555 CI but reproduces independently — the racing code predates that branch.internal/upstream/manager.go:825):DiscoverToolssnapshottedclient.Config.*directly while holding only the Manager'sm.mu.internal/upstream/managed/client.go:355):SetConfig, called fromAddServerConfigafter it releasesm.mu(line 307-308, intentionally, to avoid a lock-order deadlock withGetServerState), guardsmc.Configwith the client's ownmc.mu.Reader and writer used different mutexes → genuine race on the
mc.Configpointer.Provenance: CI run https://github.com/smart-mcp-proxy/mcpproxy-go/actions/runs/26719908592/job/78745012175
Fix
Read config through the existing mutex-guarded
GetConfig()accessor in:DiscoverTools(the CI-flagged reader), andGetStats,GetTotalToolCount,ListServersreadclient.Configafter releasingm.muand run concurrently with reconcile from HTTP handlers.mc.muis a leaf lock (onlyGet/SetConfigacquire it, never while holdingm.mu), so reading throughGetConfig()while holdingm.mu.RLock()introduces no lock-order inversion. The write side (SetConfig) is left untouched, preserving the deadlock-avoidance the originalm.mu.Unlock()provides.Test
TestDiscoverTools_ConfigRacedrivesAddServerConfig(→SetConfig) concurrently with all four readers. It trips-racebefore the fix (read manager.go:825 vs write client.go:355) and is green after.Verification
go test -race ./internal/upstream/...✅go test -race ./internal/runtime/...✅ (exercisesDiscoverAndIndexToolshot path)go build ./...✅./scripts/run-linter.sh→ 0 issues ✅Related: MCP-770
🤖 Generated with Claude Code