Skip to content

fix(upstream): data race on client.Config between DiscoverTools and SetConfig (MCP-770)#556

Merged
Dumbris merged 4 commits into
mainfrom
fix/mcp770-discovertools-config-race
Jun 1, 2026
Merged

fix(upstream): data race on client.Config between DiscoverTools and SetConfig (MCP-770)#556
Dumbris merged 4 commits into
mainfrom
fix/mcp770-discovertools-config-race

Conversation

@Dumbris

@Dumbris Dumbris commented Jun 1, 2026

Copy link
Copy Markdown
Member

Problem

The -race detector intermittently flagged a data race in the upstream layer whenever a server was added (reconcile) while background tool indexing ran. Surfaced on PR #555 CI but reproduces independently — the racing code predates that branch.

  • Read (internal/upstream/manager.go:825): DiscoverTools snapshotted client.Config.* directly while holding only the Manager's m.mu.
  • Write (internal/upstream/managed/client.go:355): SetConfig, called from AddServerConfig after it releases m.mu (line 307-308, intentionally, to avoid a lock-order deadlock with GetServerState), guards mc.Config with the client's own mc.mu.

Reader and writer used different mutexes → genuine race on the mc.Config pointer.

Provenance: CI run https://github.com/smart-mcp-proxy/mcpproxy-go/actions/runs/26719908592/job/78745012175

Fix

Read config through the existing mutex-guarded GetConfig() accessor in:

  • DiscoverTools (the CI-flagged reader), and
  • the API-facing status readers that share the identical hazard — GetStats, GetTotalToolCount, ListServers read client.Config after releasing m.mu and run concurrently with reconcile from HTTP handlers.

mc.mu is a leaf lock (only Get/SetConfig acquire it, never while holding m.mu), so reading through GetConfig() while holding m.mu.RLock() introduces no lock-order inversion. The write side (SetConfig) is left untouched, preserving the deadlock-avoidance the original m.mu.Unlock() provides.

Test

TestDiscoverTools_ConfigRace drives AddServerConfig (→ SetConfig) concurrently with all four readers. It trips -race before the fix (read manager.go:825 vs write client.go:355) and is green after.

Verification

  • go test -race ./internal/upstream/...
  • go test -race ./internal/runtime/... ✅ (exercises DiscoverAndIndexTools hot path)
  • go build ./...
  • ./scripts/run-linter.sh → 0 issues ✅

Related: MCP-770

🤖 Generated with Claude Code

@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 1, 2026

Copy link
Copy Markdown

Deploying mcpproxy-docs with  Cloudflare Pages  Cloudflare Pages

Latest commit: 88ef69a
Status: ✅  Deploy successful!
Preview URL: https://4a30d918.mcpproxy-docs.pages.dev
Branch Preview URL: https://fix-mcp770-discovertools-con.mcpproxy-docs.pages.dev

View logs

@codecov-commenter

codecov-commenter commented Jun 1, 2026

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 26.59574% with 138 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
internal/upstream/managed/client.go 5.26% 72 Missing ⚠️
internal/upstream/core/monitoring.go 44.44% 29 Missing and 1 partial ⚠️
internal/upstream/manager.go 47.82% 23 Missing and 1 partial ⚠️
...runtime/supervisor/actor_pool_complex_reference.go 0.00% 9 Missing ⚠️
internal/runtime/supervisor/actor_pool.go 0.00% 3 Missing ⚠️

📢 Thoughts on this report? Let us know!

@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown

📦 Build Artifacts

Workflow Run: View Run
Branch: fix/mcp770-discovertools-config-race

Available Artifacts

  • archive-darwin-amd64 (28 MB)
  • archive-darwin-arm64 (25 MB)
  • archive-linux-amd64 (16 MB)
  • archive-linux-arm64 (14 MB)
  • archive-windows-amd64 (27 MB)
  • archive-windows-arm64 (24 MB)
  • frontend-dist-pr (0 MB)
  • installer-dmg-darwin-amd64 (21 MB)
  • installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

  1. Go to the workflow run page linked above
  2. Scroll to the bottom "Artifacts" section
  3. Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 26738476773 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

Dumbris added 4 commits June 1, 2026 09:14
DiscoverTools snapshotted client.Config directly while only holding the
Manager's m.mu, but the reconcile add path (AddServerConfig) releases m.mu
before calling managed.Client.SetConfig, which guards mc.Config with the
client's own mc.mu. The two accesses therefore used different mutexes,
producing a data race the -race detector flagged intermittently in CI
(PR #555, run 26719908592).

Fix the read side: read config via the mutex-guarded GetConfig() accessor in
DiscoverTools and in the API-facing status readers (GetStats,
GetTotalToolCount, ListServers) that share the identical hazard — they read
client.Config after releasing m.mu and run concurrently with reconcile.
mc.mu is a leaf lock (only Get/SetConfig take it, never while holding m.mu),
so this is deadlock-safe.

Add TestDiscoverTools_ConfigRace: drives AddServerConfig (SetConfig) against
the readers concurrently; trips -race before the fix, green after.

Related: MCP-770
The first MCP-770 fix routed Manager-side reads (DiscoverTools, GetStats,
GetTotalToolCount, ListServers) through GetConfig(), but the underlying
hazard is broader: managed.Client.Config is a pointer field that SetConfig
swaps off m.mu (reconcile add path), while it is read concurrently from many
sites that do NOT hold the client's mc.mu — including Client.Connect's
unlocked phase (client.go:197) and detached state-change callback goroutines
(onStateChange, spawned by StateManager.SetError). The CI -race detector
flagged the Connect-vs-SetConfig variant on PR #555's macOS unit job after
the first fix was cherry-picked.

A mutex-guarded accessor can't cover this: most internal readers run while
mc.mu is already held, so routing them through an RLock GetConfig() would
deadlock (sync.RWMutex is not reentrant). Instead make the field itself
lock-free and data-race-free:

- Replace the exported `Config *config.ServerConfig` field with an
  unexported `cfg atomic.Pointer[config.ServerConfig]`.
- GetConfig() returns cfg.Load(); SetConfig() does cfg.Store(). Both are
  lock-free and safe whether or not mc.mu is held.
- Route every reader through GetConfig() across the managed client, the
  upstream Manager, and the supervisor actor pool.

Add TestConnect_ConfigRace: drives AddServerConfig (SetConfig) against
Client.Connect concurrently; trips -race on the old field, green after.

Verified: go test -race ./internal/upstream/... ./internal/runtime/... and
the original CI victims TestHandleUpstreamServers_AddFromRegistry_HappyPath /
_AdminAllowed_WriteOps pass x5 under -race; build + linter clean.

Related: MCP-770
StartStderrMonitoring (from connectStdio during a reconcile-driven Connect) and
StopStderrMonitoring (from Disconnect during Manager.ShutdownAll) accessed the
monitoring lifecycle fields — stderrMonitoringCtx/Cancel/WaitGroup and the
process-monitoring equivalents — with no synchronization. When a connect and a
shutdown overlap on the same client, the -race detector flags WG.Add (Start)
vs WG.Wait (Stop). CI caught this on PR #555's ubuntu unit job once the earlier
MCP-770 fixes (atomic Config) + the Windows temp-dir cleanup made the test's
shutdown path actually run concurrently with reconcile.

Add a dedicated monitoringMu that makes the four Start*/Stop*Monitoring methods
mutually exclusive, so Add never overlaps Wait. The mutex is a leaf (monitor
goroutines never acquire it; it is never held across c.mu), so Stop's bounded
WG.Wait under the lock cannot deadlock.

Add TestStderrMonitoring_StartStopRace: hammers Start/Stop concurrently; trips
-race on the unsynchronized fields, green with monitoringMu.

Related: MCP-770, spec 070
The monitoringMu serialization (prev commit) was incomplete: StopStderr/
StopProcessMonitoring ran WG.Wait() in a DETACHED goroutine guarded by a 500ms
timeout. When a real child keeps its stderr open, monitorStderr blocks in
scanner.Scan(), the wait times out, Stop returns and releases monitoringMu —
but the detached WG.Wait goroutine lives on and races the next connect's
WG.Add. CI caught this on #555 (DiscoverTools→Connect StartStderrMonitoring
WG.Add vs the lingering StopStderrMonitoring WG.Wait).

Replace the reused sync.WaitGroup with a per-cycle `done chan struct{}` the
monitor goroutine closes, and pass the context + done as locals so an
abandoned (timed-out) monitor never reads the shared ctx/state a later Start
overwrites. Stop now waits on done directly under monitoringMu (no detached
waiter) and nils the per-cycle fields. monitorStderr/monitorProcess take ctx
as a parameter.

Strengthen TestStderrMonitoring with an AbandonedMonitorNoRace case: a blocking
pipe keeps the monitor alive so Stop hits the timeout/abandon path — the exact
escape the WaitGroup version raced on. AddFromRegistry passes 12x under -race.

Related: MCP-770, spec 070
@Dumbris Dumbris force-pushed the fix/mcp770-discovertools-config-race branch from 02eec48 to 88ef69a Compare June 1, 2026 06:14
Dumbris added a commit that referenced this pull request Jun 1, 2026
Forking a work branch from another feature branch drags its unmerged
commits into the PR (root cause of spec-064 docs leaking into the
MCP-770 race fix #556). ENG-3 now mandates fetch + branch from
origin/main explicitly, in both the engineer bundle and the contract.
@Dumbris Dumbris merged commit e9b4325 into main Jun 1, 2026
28 checks passed
Dumbris added a commit that referenced this pull request Jun 1, 2026
* docs(064): Glass Cockpit — transparent & steerable agent cockpit spec

Spec, plan, and design artifacts for making the existing Paperclip
"MCPProxy" cockpit (spec 045) transparent and steerable: invert the
default from "proceed" to "checkpoint at every design-decision
boundary" via three human gates (plan-of-attack, per-spec design,
pre-merge) mapped to Paperclip native primitives (executionPolicy
approval stages, request_confirmation/suggest_tasks interactions,
issue_tree_holds), with reasoning visible before each gate and a
single "waiting on you" view.

Phased rollout: A) config + agent-instruction only (the dry-run
target), B) a Paperclip plugin for the fused transparency UI, C) a
fork only if A/B fall short. SynapBus is log/wiki only, never on the
critical path. Includes rewritten gate-aware agent instructions,
consumed-API + executionPolicy + agent-instruction contracts, data
model, research, and operator quickstart.

Supersedes spec 062's fresh-dev-instance approach; extends spec 045.

* docs(064): amend gate model — human-merge → dual-AI auto-merge + human veto

Session 2 amendment to FR-005 + US3: replace the mandatory human-merge gate
(throughput bottleneck) with draft-PR + dual-AI-review consensus auto-merge.
Two reviewers on different model families (Gemini 2.5-pro Critic + Codex),
never the implementer; tests-green + both-accept → auto-merge; human is an
optional 3rd reviewer with veto (request-changes/hold freezes auto-merge).
Prerequisite flagged: a bot GitHub identity (agents currently = author's gh,
and GitHub forbids self-approval) — interim fallback is 2-AI-review-as-
required-check with the human merging. codex-local Paperclip adapter exists.

* docs(064): dual-AI-review auto-merge — reviewer doctrine + engineer draft-PR + setup

Adds the Session-2 gate-model deliverables: reviewer/REVIEWER.md (shared
RV-1..RV-6 dual-review doctrine), codex-reviewer/AGENTS.md (2nd reviewer on
codex-local), amends engineer ENG-4 to open DRAFT PRs + request 2 AI reviewers
(no self-merge), and auto-merge-setup.md (GitHub branch-protection config +
the bot-identity prerequisite + interim human-merge fallback + open items).

* docs(064): reviewers on subscription auth only; Gemini quota-exhausted, Codex gpt-5.5 ready

Per user directive: both AI reviewers use paid SUBSCRIPTION logins, not API keys.
- Gemini Critic: subscription/OAuth, pin gemini-2.5-pro (3.5/3 UNVERIFIED — quota
  exhausted on every probe today; switch if confirmed). TWO blockers: quota +
  empty-prompt adapter bug → cannot accept yet.
- Codex reviewer: ChatGPT subscription, gpt-5.5 (codex-cli 0.46.0) — READY now.
- Live two-reviewer set today = Codex + human (FR-005f) until Gemini recovers.

* docs(064): Critic GEMINI.md — subscription-only, quota-blocked; Kimi+Codex as live pair

Gemini settings pin gemini-3.1-pro-preview (subscription/OAuth) but quota is
exhausted (no reset hint) + empty-prompt adapter bug → Critic can't accept yet.
Live 2-AI reviewer pair = Codex gpt-5.5 + Kimi-K2.5 (opencode_local, Gcore key
present); Gemini rejoins as 3rd reviewer when quota returns.

* docs(064): stand up Codex+Kimi reviewer agents; codex config fix

Live dual-AI reviewer pair created in the running Paperclip cockpit and
verified responding (2026-05-31):
- CodexReviewer  — codex_local / gpt-5-codex   (5b94562c-…)
- KimiReviewer   — opencode_local / Kimi-K2.5  (fdaa1d4c-…)
Both carry managed instruction bundles (shared doctrine + RV-1..RV-6 +
role notes), report to CEO, idle, heartbeat off (woken by review-stage).

Docs:
- add canonical kimi-reviewer/AGENTS.md (the design lacked it)
- correct codex-reviewer/AGENTS.md model facts (gpt-5.5 -> gpt-5-codex)
- auto-merge-setup.md: live pair is Codex+Kimi; Gemini Critic becomes the
  3rd reviewer when its subscription quota recovers

codex config fix (~/.codex/config.toml, not in repo): model_reasoning_effort
xhigh->high and model gpt-5.5->gpt-5-codex. On codex-cli 0.46.0 + ChatGPT
subscription auth, gpt-5.5 needs a newer CLI and gpt-5.4/5.3-codex/5.2 are
auth-restricted; gpt-5-codex/gpt-5 are the working models. Backup at
~/.codex/config.toml.bak.pre-reviewer-fix.*

* docs(064): engineers drive PRs green + bundle docs/ updates (ENG-8/9)

engineer/AGENTS.md:
- ENG-8: drive every required check to green before review — run local
  verification before push, watch `gh pr checks --watch`, push fixes until
  all green; never leave/hand off a red PR, never --no-verify or weaken a
  check. Green CI is the engineer's job, not the reviewer's.
- ENG-9: when a change touches CLI/REST/MCP API/config/defaults/security or
  anything under docs/, the SAME PR must update docs/ (+ CLAUDE.md/
  oas/swagger.yaml/README where mirrored). Docs-only changes exempt from TDD.
- ENG-5 reworked to dual-AI merge-readiness (Codex+Kimi accept + all CI green).

reviewer/REVIEWER.md RV-3: red/pending check = automatic request_changes;
missing docs when the change warrants them = request_changes.

Applied to the live Paperclip brains: 3 engineers (Backend/Frontend/MacOS)
re-flattened from canonical; Codex+Kimi reviewer brains refreshed.

* docs(064): record applied CI-context branch protection on main

Phase-1 gate live on main (no bot identity needed): required_status_checks
strict=false with 8 always-run, non-path-conditional contexts (Lint, Unit
Tests ubuntu, Build ubuntu/macos/windows, Build Frontend, Validate PR title,
Verify OpenAPI Artifacts). Existing 1-review + enforce_admins=false kept.
Verified: green PR #553 satisfies all 8 (blocked only by review); in-flight
PR #555 blocked on pending required checks. Documents the deliberately-
excluded checks and the Go-version-pinned context-name fragility.

* docs(064): ENG-3 — branch from origin/main, never from a feature branch

Forking a work branch from another feature branch drags its unmerged
commits into the PR (root cause of spec-064 docs leaking into the
MCP-770 race fix #556). ENG-3 now mandates fetch + branch from
origin/main explicitly, in both the engineer bundle and the contract.
Dumbris added a commit that referenced this pull request Jun 1, 2026
…ect (MCP-783) (#558)

Supervisor reconcile dispatches each action (Connect/Disconnect/Reconnect/
Remove) in a bare, untracked goroutine ('no waiting!'). Supervisor.Stop()
cancelled the context and waited only on the three long-lived loops (s.wg),
so it could return while a ConnectServer -> client.Connect() was still in
flight. runtime.Close then called ShutdownAll -> Disconnect, overlapping
Connect and Disconnect on the same client — the root cause of the MCP-770
race cascade (five symptoms, each unmasking the next).

Track action goroutines in a new actionWg and gate dispatch with a stopping
flag set under stateMu, so Stop() drains all in-flight actions before
disconnecting clients. The drain is bounded by a 35s backstop (> the 30s
per-action context timeout) so a wedged Connect can't hang shutdown.

Confined to internal/runtime/supervisor. Adds a -race regression guard that
asserts upstream.Close() never overlaps an in-flight Connect.

Related #556

Co-authored-by: Paperclip <noreply@paperclip.ing>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants