Skip to content

Route MCP sessions to specific StatefulSet pods via headless DNS#4638

Open
yrobla wants to merge 1 commit intomainfrom
issue-4575
Open

Route MCP sessions to specific StatefulSet pods via headless DNS#4638
yrobla wants to merge 1 commit intomainfrom
issue-4575

Conversation

@yrobla
Copy link
Copy Markdown
Contributor

@yrobla yrobla commented Apr 7, 2026

Summary

When backendReplicas > 1, the proxy was storing the ClusterIP service URL as backend_url on session initialize. After a proxy-runner restart, Redis-recovered sessions were routed through the ClusterIP, which sent requests to a random pod — one that had never seen the initialize and returned -32001 ("session not found").

Two root causes fixed:

  1. StatefulSet serviceName was set to containerName instead of mcp--headless, so Kubernetes never created pod DNS records.

  2. On initialize, the proxy now picks a random pod ordinal and stores its headless DNS URL (e.g. myserver-0.mcp-myserver-headless.default .svc.cluster.local) as backend_url instead of the ClusterIP. The Rewrite closure already reads backend_url to route follow-up requests, so sessions now survive proxy-runner restarts.

The operator populates a new HeadlessServiceConfig (StatefulSet name, headless service name, namespace, replica count) in ScalingConfig when backendReplicas > 1. Single-replica and non-Kubernetes deployments are unaffected.

Fixes #4575

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Changes

File Change

Does this introduce a user-facing change?

Special notes for reviewers

Large PR Justification

This is a complete PR that solves a bug, and is an isolated functionality. Cannot be split.

@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Apr 7, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 7, 2026

Codecov Report

❌ Patch coverage is 36.28319% with 72 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.80%. Comparing base (1ee7107) to head (3b743b9).

Files with missing lines Patch % Lines
pkg/runner/runner.go 0.00% 24 Missing ⚠️
pkg/transport/http.go 29.41% 20 Missing and 4 partials ⚠️
pkg/transport/factory.go 0.00% 18 Missing ⚠️
...g/transport/proxy/transparent/transparent_proxy.go 80.00% 3 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4638      +/-   ##
==========================================
- Coverage   68.88%   68.80%   -0.09%     
==========================================
  Files         505      505              
  Lines       52437    52521      +84     
==========================================
+ Hits        36121    36135      +14     
- Misses      13525    13589      +64     
- Partials     2791     2797       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/S Small PR: 100-299 lines changed labels Apr 7, 2026
@yrobla yrobla requested a review from Copilot April 7, 2026 15:58
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Apr 7, 2026
When backendReplicas > 1, the proxy was storing the ClusterIP service
URL as backend_url on session initialize. After a proxy-runner restart,
Redis-recovered sessions were routed through the ClusterIP, which sent
requests to a random pod — one that had never seen the initialize and
returned -32001 ("session not found").

Two root causes fixed:

1. StatefulSet serviceName was set to containerName instead of
   mcp-<name>-headless, so Kubernetes never created pod DNS records.

2. On initialize, the proxy now picks a random pod ordinal and stores
   its headless DNS URL (e.g. myserver-0.mcp-myserver-headless.default
   .svc.cluster.local) as backend_url instead of the ClusterIP. The
   Rewrite closure already reads backend_url to route follow-up
   requests, so sessions now survive proxy-runner restarts.

The operator populates a new HeadlessServiceConfig (StatefulSet name,
headless service name, namespace, replica count) in ScalingConfig when
backendReplicas > 1. Single-replica and non-Kubernetes deployments are
unaffected.

Fixes #4575
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/L Large PR: 600-999 lines changed labels Apr 7, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 7, 2026
@github-actions github-actions bot dismissed their stale review April 7, 2026 16:04

Large PR justification has been provided. Thank you!

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 7, 2026

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes MCP session routing in Kubernetes when backendReplicas > 1 by ensuring sessions can be routed back to a specific StatefulSet pod (via headless DNS) after a proxy-runner restart, instead of being re-randomized through the ClusterIP service.

Changes:

  • Fix StatefulSet serviceName to match the created headless service name so per-pod DNS records are generated.
  • Add headless-service routing config to the operator/runconfig path and transport layer, and attempt to store a pod-specific headless DNS URL in session metadata.
  • Add unit + E2E coverage around multi-replica session routing and proxy-runner restart behavior.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
test/e2e/thv-operator/virtualmcp/mcpserver_scaling_test.go New E2E tests for Redis-backed sessions across replicas and proxy-runner restart routing.
pkg/transport/types/transport.go Adds pluggable SessionStorage to transport config for non-local session stores (e.g., Redis).
pkg/transport/proxy/transparent/transparent_proxy.go Adds headless-service routing option and stores per-session backend_url metadata.
pkg/transport/proxy/transparent/backend_routing_test.go Unit tests for storing pod-specific backend_url when headless routing is configured.
pkg/transport/http.go Wires session storage + headless routing options into the HTTP transport’s proxy creation.
pkg/transport/factory.go Threads SessionStorage into transports; adds a transport option for headless routing.
pkg/runner/runner.go Creates Redis session storage from scaling config and passes headless routing settings into transports.
pkg/runner/config.go Extends scaling config with HeadlessService metadata for building per-pod DNS URLs.
pkg/container/kubernetes/client.go Fixes StatefulSet serviceName to mcp-<name>-headless to enable StatefulSet pod DNS.
cmd/thv-operator/controllers/mcpserver_runconfig.go Operator populates new headless routing config into the generated RunConfig.
.codespellrc Adds clientA to ignore list for the new tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 240 to +260
}
}

// WithPodHeadlessService configures pod-specific routing for multi-replica StatefulSet deployments.
// When set, each new MCP session is pinned to a randomly selected pod via its headless DNS name
// (e.g. myserver-0.mcp-myserver-headless.default.svc.cluster.local) so that session routing
// survives proxy-runner restarts without being sent to the wrong pod.
// The option is a no-op when replicas <= 1 or when any required field is empty.
func WithPodHeadlessService(statefulSetName, serviceName, namespace string, replicas int32) Option {
return func(p *TransparentProxy) {
if statefulSetName == "" || serviceName == "" || namespace == "" || replicas <= 1 {
return
}
p.headlessService = &podHeadlessService{
statefulSetName: statefulSetName,
serviceName: serviceName,
namespace: namespace,
replicas: replicas,
}
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocker: Make headless DNS routing the default for all Kubernetes deployments, and remove the headlessSetter type assertion pattern.

Two related issues:

1. Headless routing should be unconditional for Kubernetes. For a single-replica StatefulSet, myserver-0.mcp-myserver-headless.ns.svc.cluster.local is deterministic (always ordinal 0) and the headless service already exists because the operator creates it. This would eliminate the if val > 1 guard in populateScalingConfig, remove the replicas <= 1 no-op here, and give you one routing code path instead of two. pickPodBackendURL degenerates correctly — rand.Int(rand.Reader, big.NewInt(1)) always returns 0. Non-Kubernetes deployments are unaffected since HeadlessServiceConfig stays nil.

2. The headlessSetter anonymous interface in factory.go is unsafe. If a transport doesn't implement the unexported method, the option silently no-ops — no compile-time check, no runtime warning, no error returned. This means a misconfiguration (e.g. applying the option to a transport type that doesn't support it) is invisible. Put this on types.Config like SessionStorage, so it flows through the same plumbing and doesn't rely on type assertions against unexported methods.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Backend routing breaks after proxy runner restart with backendReplicas > 1

4 participants