fix(upstream): data race on client.Config between DiscoverTools and SetConfig (MCP-770) by Dumbris · Pull Request #556 · smart-mcp-proxy/mcpproxy-go

Dumbris · 2026-06-01T03:09:07Z

Problem

The -race detector intermittently flagged a data race in the upstream layer whenever a server was added (reconcile) while background tool indexing ran. Surfaced on PR #555 CI but reproduces independently — the racing code predates that branch.

Read (internal/upstream/manager.go:825): DiscoverTools snapshotted client.Config.* directly while holding only the Manager's m.mu.
Write (internal/upstream/managed/client.go:355): SetConfig, called from AddServerConfig after it releases m.mu (line 307-308, intentionally, to avoid a lock-order deadlock with GetServerState), guards mc.Config with the client's own mc.mu.

Reader and writer used different mutexes → genuine race on the mc.Config pointer.

Provenance: CI run https://github.com/smart-mcp-proxy/mcpproxy-go/actions/runs/26719908592/job/78745012175

Fix

Read config through the existing mutex-guarded GetConfig() accessor in:

DiscoverTools (the CI-flagged reader), and
the API-facing status readers that share the identical hazard — GetStats, GetTotalToolCount, ListServers read client.Config after releasing m.mu and run concurrently with reconcile from HTTP handlers.

mc.mu is a leaf lock (only Get/SetConfig acquire it, never while holding m.mu), so reading through GetConfig() while holding m.mu.RLock() introduces no lock-order inversion. The write side (SetConfig) is left untouched, preserving the deadlock-avoidance the original m.mu.Unlock() provides.

Test

TestDiscoverTools_ConfigRace drives AddServerConfig (→ SetConfig) concurrently with all four readers. It trips -race before the fix (read manager.go:825 vs write client.go:355) and is green after.

Verification

go test -race ./internal/upstream/... ✅
go test -race ./internal/runtime/... ✅ (exercises DiscoverAndIndexTools hot path)
go build ./... ✅
./scripts/run-linter.sh → 0 issues ✅

Related: MCP-770

🤖 Generated with Claude Code

cloudflare-workers-and-pages · 2026-06-01T03:11:44Z

Deploying mcpproxy-docs with Cloudflare Pages

Latest commit:	`88ef69a`
Status:	✅ Deploy successful!
Preview URL:	https://4a30d918.mcpproxy-docs.pages.dev
Branch Preview URL:	https://fix-mcp770-discovertools-con.mcpproxy-docs.pages.dev

View logs

codecov-commenter · 2026-06-01T03:13:27Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 26.59574% with 138 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
internal/upstream/managed/client.go	5.26%	72 Missing ⚠️
internal/upstream/core/monitoring.go	44.44%	29 Missing and 1 partial ⚠️
internal/upstream/manager.go	47.82%	23 Missing and 1 partial ⚠️
...runtime/supervisor/actor_pool_complex_reference.go	0.00%	9 Missing ⚠️
internal/runtime/supervisor/actor_pool.go	0.00%	3 Missing ⚠️

📢 Thoughts on this report? Let us know!

github-actions · 2026-06-01T03:19:01Z

📦 Build Artifacts

Workflow Run: View Run
Branch: fix/mcp770-discovertools-config-race

Available Artifacts

archive-darwin-amd64 (28 MB)
archive-darwin-arm64 (25 MB)
archive-linux-amd64 (16 MB)
archive-linux-arm64 (14 MB)
archive-windows-amd64 (27 MB)
archive-windows-arm64 (24 MB)
frontend-dist-pr (0 MB)
installer-dmg-darwin-amd64 (21 MB)
installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

Go to the workflow run page linked above
Scroll to the bottom "Artifacts" section
Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 26738476773 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

DiscoverTools snapshotted client.Config directly while only holding the Manager's m.mu, but the reconcile add path (AddServerConfig) releases m.mu before calling managed.Client.SetConfig, which guards mc.Config with the client's own mc.mu. The two accesses therefore used different mutexes, producing a data race the -race detector flagged intermittently in CI (PR #555, run 26719908592). Fix the read side: read config via the mutex-guarded GetConfig() accessor in DiscoverTools and in the API-facing status readers (GetStats, GetTotalToolCount, ListServers) that share the identical hazard — they read client.Config after releasing m.mu and run concurrently with reconcile. mc.mu is a leaf lock (only Get/SetConfig take it, never while holding m.mu), so this is deadlock-safe. Add TestDiscoverTools_ConfigRace: drives AddServerConfig (SetConfig) against the readers concurrently; trips -race before the fix, green after. Related: MCP-770

The first MCP-770 fix routed Manager-side reads (DiscoverTools, GetStats, GetTotalToolCount, ListServers) through GetConfig(), but the underlying hazard is broader: managed.Client.Config is a pointer field that SetConfig swaps off m.mu (reconcile add path), while it is read concurrently from many sites that do NOT hold the client's mc.mu — including Client.Connect's unlocked phase (client.go:197) and detached state-change callback goroutines (onStateChange, spawned by StateManager.SetError). The CI -race detector flagged the Connect-vs-SetConfig variant on PR #555's macOS unit job after the first fix was cherry-picked. A mutex-guarded accessor can't cover this: most internal readers run while mc.mu is already held, so routing them through an RLock GetConfig() would deadlock (sync.RWMutex is not reentrant). Instead make the field itself lock-free and data-race-free: - Replace the exported `Config *config.ServerConfig` field with an unexported `cfg atomic.Pointer[config.ServerConfig]`. - GetConfig() returns cfg.Load(); SetConfig() does cfg.Store(). Both are lock-free and safe whether or not mc.mu is held. - Route every reader through GetConfig() across the managed client, the upstream Manager, and the supervisor actor pool. Add TestConnect_ConfigRace: drives AddServerConfig (SetConfig) against Client.Connect concurrently; trips -race on the old field, green after. Verified: go test -race ./internal/upstream/... ./internal/runtime/... and the original CI victims TestHandleUpstreamServers_AddFromRegistry_HappyPath / _AdminAllowed_WriteOps pass x5 under -race; build + linter clean. Related: MCP-770

StartStderrMonitoring (from connectStdio during a reconcile-driven Connect) and StopStderrMonitoring (from Disconnect during Manager.ShutdownAll) accessed the monitoring lifecycle fields — stderrMonitoringCtx/Cancel/WaitGroup and the process-monitoring equivalents — with no synchronization. When a connect and a shutdown overlap on the same client, the -race detector flags WG.Add (Start) vs WG.Wait (Stop). CI caught this on PR #555's ubuntu unit job once the earlier MCP-770 fixes (atomic Config) + the Windows temp-dir cleanup made the test's shutdown path actually run concurrently with reconcile. Add a dedicated monitoringMu that makes the four Start*/Stop*Monitoring methods mutually exclusive, so Add never overlaps Wait. The mutex is a leaf (monitor goroutines never acquire it; it is never held across c.mu), so Stop's bounded WG.Wait under the lock cannot deadlock. Add TestStderrMonitoring_StartStopRace: hammers Start/Stop concurrently; trips -race on the unsynchronized fields, green with monitoringMu. Related: MCP-770, spec 070

The monitoringMu serialization (prev commit) was incomplete: StopStderr/ StopProcessMonitoring ran WG.Wait() in a DETACHED goroutine guarded by a 500ms timeout. When a real child keeps its stderr open, monitorStderr blocks in scanner.Scan(), the wait times out, Stop returns and releases monitoringMu — but the detached WG.Wait goroutine lives on and races the next connect's WG.Add. CI caught this on #555 (DiscoverTools→Connect StartStderrMonitoring WG.Add vs the lingering StopStderrMonitoring WG.Wait). Replace the reused sync.WaitGroup with a per-cycle `done chan struct{}` the monitor goroutine closes, and pass the context + done as locals so an abandoned (timed-out) monitor never reads the shared ctx/state a later Start overwrites. Stop now waits on done directly under monitoringMu (no detached waiter) and nils the per-cycle fields. monitorStderr/monitorProcess take ctx as a parameter. Strengthen TestStderrMonitoring with an AbandonedMonitorNoRace case: a blocking pipe keeps the monitor alive so Stop hits the timeout/abandon path — the exact escape the WaitGroup version raced on. AddFromRegistry passes 12x under -race. Related: MCP-770, spec 070

Forking a work branch from another feature branch drags its unmerged commits into the PR (root cause of spec-064 docs leaking into the MCP-770 race fix #556). ENG-3 now mandates fetch + branch from origin/main explicitly, in both the engineer bundle and the contract.

* docs(064): Glass Cockpit — transparent & steerable agent cockpit spec Spec, plan, and design artifacts for making the existing Paperclip "MCPProxy" cockpit (spec 045) transparent and steerable: invert the default from "proceed" to "checkpoint at every design-decision boundary" via three human gates (plan-of-attack, per-spec design, pre-merge) mapped to Paperclip native primitives (executionPolicy approval stages, request_confirmation/suggest_tasks interactions, issue_tree_holds), with reasoning visible before each gate and a single "waiting on you" view. Phased rollout: A) config + agent-instruction only (the dry-run target), B) a Paperclip plugin for the fused transparency UI, C) a fork only if A/B fall short. SynapBus is log/wiki only, never on the critical path. Includes rewritten gate-aware agent instructions, consumed-API + executionPolicy + agent-instruction contracts, data model, research, and operator quickstart. Supersedes spec 062's fresh-dev-instance approach; extends spec 045. * docs(064): amend gate model — human-merge → dual-AI auto-merge + human veto Session 2 amendment to FR-005 + US3: replace the mandatory human-merge gate (throughput bottleneck) with draft-PR + dual-AI-review consensus auto-merge. Two reviewers on different model families (Gemini 2.5-pro Critic + Codex), never the implementer; tests-green + both-accept → auto-merge; human is an optional 3rd reviewer with veto (request-changes/hold freezes auto-merge). Prerequisite flagged: a bot GitHub identity (agents currently = author's gh, and GitHub forbids self-approval) — interim fallback is 2-AI-review-as- required-check with the human merging. codex-local Paperclip adapter exists. * docs(064): dual-AI-review auto-merge — reviewer doctrine + engineer draft-PR + setup Adds the Session-2 gate-model deliverables: reviewer/REVIEWER.md (shared RV-1..RV-6 dual-review doctrine), codex-reviewer/AGENTS.md (2nd reviewer on codex-local), amends engineer ENG-4 to open DRAFT PRs + request 2 AI reviewers (no self-merge), and auto-merge-setup.md (GitHub branch-protection config + the bot-identity prerequisite + interim human-merge fallback + open items). * docs(064): reviewers on subscription auth only; Gemini quota-exhausted, Codex gpt-5.5 ready Per user directive: both AI reviewers use paid SUBSCRIPTION logins, not API keys. - Gemini Critic: subscription/OAuth, pin gemini-2.5-pro (3.5/3 UNVERIFIED — quota exhausted on every probe today; switch if confirmed). TWO blockers: quota + empty-prompt adapter bug → cannot accept yet. - Codex reviewer: ChatGPT subscription, gpt-5.5 (codex-cli 0.46.0) — READY now. - Live two-reviewer set today = Codex + human (FR-005f) until Gemini recovers. * docs(064): Critic GEMINI.md — subscription-only, quota-blocked; Kimi+Codex as live pair Gemini settings pin gemini-3.1-pro-preview (subscription/OAuth) but quota is exhausted (no reset hint) + empty-prompt adapter bug → Critic can't accept yet. Live 2-AI reviewer pair = Codex gpt-5.5 + Kimi-K2.5 (opencode_local, Gcore key present); Gemini rejoins as 3rd reviewer when quota returns. * docs(064): stand up Codex+Kimi reviewer agents; codex config fix Live dual-AI reviewer pair created in the running Paperclip cockpit and verified responding (2026-05-31): - CodexReviewer — codex_local / gpt-5-codex (5b94562c-…) - KimiReviewer — opencode_local / Kimi-K2.5 (fdaa1d4c-…) Both carry managed instruction bundles (shared doctrine + RV-1..RV-6 + role notes), report to CEO, idle, heartbeat off (woken by review-stage). Docs: - add canonical kimi-reviewer/AGENTS.md (the design lacked it) - correct codex-reviewer/AGENTS.md model facts (gpt-5.5 -> gpt-5-codex) - auto-merge-setup.md: live pair is Codex+Kimi; Gemini Critic becomes the 3rd reviewer when its subscription quota recovers codex config fix (~/.codex/config.toml, not in repo): model_reasoning_effort xhigh->high and model gpt-5.5->gpt-5-codex. On codex-cli 0.46.0 + ChatGPT subscription auth, gpt-5.5 needs a newer CLI and gpt-5.4/5.3-codex/5.2 are auth-restricted; gpt-5-codex/gpt-5 are the working models. Backup at ~/.codex/config.toml.bak.pre-reviewer-fix.* * docs(064): engineers drive PRs green + bundle docs/ updates (ENG-8/9) engineer/AGENTS.md: - ENG-8: drive every required check to green before review — run local verification before push, watch `gh pr checks --watch`, push fixes until all green; never leave/hand off a red PR, never --no-verify or weaken a check. Green CI is the engineer's job, not the reviewer's. - ENG-9: when a change touches CLI/REST/MCP API/config/defaults/security or anything under docs/, the SAME PR must update docs/ (+ CLAUDE.md/ oas/swagger.yaml/README where mirrored). Docs-only changes exempt from TDD. - ENG-5 reworked to dual-AI merge-readiness (Codex+Kimi accept + all CI green). reviewer/REVIEWER.md RV-3: red/pending check = automatic request_changes; missing docs when the change warrants them = request_changes. Applied to the live Paperclip brains: 3 engineers (Backend/Frontend/MacOS) re-flattened from canonical; Codex+Kimi reviewer brains refreshed. * docs(064): record applied CI-context branch protection on main Phase-1 gate live on main (no bot identity needed): required_status_checks strict=false with 8 always-run, non-path-conditional contexts (Lint, Unit Tests ubuntu, Build ubuntu/macos/windows, Build Frontend, Validate PR title, Verify OpenAPI Artifacts). Existing 1-review + enforce_admins=false kept. Verified: green PR #553 satisfies all 8 (blocked only by review); in-flight PR #555 blocked on pending required checks. Documents the deliberately- excluded checks and the Go-version-pinned context-name fragility. * docs(064): ENG-3 — branch from origin/main, never from a feature branch Forking a work branch from another feature branch drags its unmerged commits into the PR (root cause of spec-064 docs leaking into the MCP-770 race fix #556). ENG-3 now mandates fetch + branch from origin/main explicitly, in both the engineer bundle and the contract.

…ect (MCP-783) (#558) Supervisor reconcile dispatches each action (Connect/Disconnect/Reconnect/ Remove) in a bare, untracked goroutine ('no waiting!'). Supervisor.Stop() cancelled the context and waited only on the three long-lived loops (s.wg), so it could return while a ConnectServer -> client.Connect() was still in flight. runtime.Close then called ShutdownAll -> Disconnect, overlapping Connect and Disconnect on the same client — the root cause of the MCP-770 race cascade (five symptoms, each unmasking the next). Track action goroutines in a new actionWg and gate dispatch with a stopping flag set under stateMu, so Stop() drains all in-flight actions before disconnecting clients. The drain is bounded by a 35s backstop (> the 30s per-action context timeout) so a wedged Connect can't hang shutdown. Confined to internal/runtime/supervisor. Adds a -race regression guard that asserts upstream.Close() never overlaps an in-flight Connect. Related #556 Co-authored-by: Paperclip <noreply@paperclip.ing>

Dumbris added 4 commits June 1, 2026 09:14

Dumbris force-pushed the fix/mcp770-discovertools-config-race branch from 02eec48 to 88ef69a Compare June 1, 2026 06:14

Dumbris mentioned this pull request Jun 1, 2026

docs(064): Glass Cockpit spec + agent-instruction bundles #557

Merged

Dumbris merged commit e9b4325 into main Jun 1, 2026
28 checks passed

Dumbris mentioned this pull request Jun 1, 2026

fix(runtime): drain in-flight reconcile actions before client disconnect (MCP-783) #558

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(upstream): data race on client.Config between DiscoverTools and SetConfig (MCP-770)#556

fix(upstream): data race on client.Config between DiscoverTools and SetConfig (MCP-770)#556
Dumbris merged 4 commits into
mainfrom
fix/mcp770-discovertools-config-race

Dumbris commented Jun 1, 2026

Uh oh!

cloudflare-workers-and-pages Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jun 1, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Dumbris commented Jun 1, 2026

Problem

Fix

Test

Verification

Uh oh!

cloudflare-workers-and-pages Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying mcpproxy-docs with Cloudflare Pages

Uh oh!

codecov-commenter commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📦 Build Artifacts

Available Artifacts

How to Download

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cloudflare-workers-and-pages Bot commented Jun 1, 2026 •

edited

Loading

codecov-commenter commented Jun 1, 2026 •

edited

Loading

github-actions Bot commented Jun 1, 2026 •

edited

Loading