Skip to content

Add E2E acceptance test for backend routing after proxy restart#4574

Open
jerm-dro wants to merge 6 commits intomainfrom
e2e-backend-routing-acceptance-test
Open

Add E2E acceptance test for backend routing after proxy restart#4574
jerm-dro wants to merge 6 commits intomainfrom
e2e-backend-routing-acceptance-test

Conversation

@jerm-dro
Copy link
Copy Markdown
Contributor

@jerm-dro jerm-dro commented Apr 6, 2026

Summary

  • When backendReplicas > 1 and the proxy runner restarts, the recovered session's backend_url points to the ClusterIP service. Kube-proxy may route to a backend pod that never handled the session's initialize, causing HTTP 404 / JSON-RPC -32001 "session not found" errors.
  • This PR adds an E2E acceptance test that demonstrates the bug: it creates an MCPServer with backendReplicas=2, establishes a session, deletes the proxy runner pod, and sends 5 requests with the same session ID. With 2 backends and random routing, P(all 5 hit correct pod) ≈ 3%, making the failure reliably detectable.
  • This test is expected to fail until pod-specific headless DNS routing is implemented.

Type of change

  • New feature

Test plan

  • E2E tests (task test-e2e) — this PR adds the test; it is expected to fail on CI, proving the bug exists
  • Unit tests (task test) — no production code changed, existing tests unaffected

Does this introduce a user-facing change?

No

Special notes for reviewers

This test is intentionally expected to fail on CI. It demonstrates the backend routing bug that will be fixed in a follow-up PR implementing pod-specific headless DNS routing. The test context "backendReplicas=2 and proxy runner restarts" will pass once the fix lands.

Generated with Claude Code

jerm-dro and others added 5 commits April 3, 2026 08:52
MCPServer supports horizontal scaling with Redis session storage, but
there was no E2E test verifying that a session established on one pod
is accessible from a different pod. This test deploys an MCPServer with
replicas=2 and Redis session storage, initializes an MCP session, then
sends raw JSON-RPC requests directly to each pod IP using the same
Mcp-Session-Id header to prove sessions are shared via Redis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pod IPs are not reachable from the CI runner host in Kind clusters.
Replace direct pod IP HTTP calls with kubectl port-forward to each
pod, which tunnels through the Kind node's network.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The MCPServer CRD's sessionStorage config was populated by the operator
into RunConfig but the proxy runner never read it — sessions always used
in-memory LocalStorage, making cross-replica routing non-functional.

Add WithSessionStorage transport option and wire ScalingConfig.SessionRedis
from RunConfig into the transport layer so both StdioTransport and
HTTPTransport (transparent proxy) use Redis-backed session storage when
configured.

Rewrite the E2E test to use mcp-go clients throughout, including
transport.WithSession to create a client on pod B that reuses the
session established on pod A.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pass SessionStorage through types.Config instead of a factory option
with interface assertion. The factory now sets the field directly on
each transport type during construction.

Add clientA to codespell ignore list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When backendReplicas > 1 and the proxy runner restarts, the recovered
session's backend_url points to the ClusterIP service. Kube-proxy may
route to a backend pod that never handled the session's initialize,
causing HTTP 404 / JSON-RPC -32001 "session not found" errors.

This test creates an MCPServer with backendReplicas=2, establishes a
session, deletes the proxy runner pod, and sends 5 requests with the
same session ID. With 2 backends and random routing, P(all 5 hit the
correct pod) ≈ 3%, making the failure reliably detectable.

The test is expected to fail until pod-specific headless DNS routing
is implemented in the proxy runner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Apr 6, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 6, 2026

Codecov Report

❌ Patch coverage is 0% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.84%. Comparing base (2e8e487) to head (c26ef73).
⚠️ Report is 18 commits behind head on main.

Files with missing lines Patch % Lines
pkg/runner/runner.go 0.00% 19 Missing ⚠️
pkg/transport/factory.go 0.00% 11 Missing ⚠️
pkg/transport/http.go 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4574      +/-   ##
==========================================
- Coverage   69.06%   68.84%   -0.23%     
==========================================
  Files         502      505       +3     
  Lines       51997    52408     +411     
==========================================
+ Hits        35913    36078     +165     
- Misses      13300    13542     +242     
- Partials     2784     2788       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

The namespace parameter is always defaultNamespace in current tests
but is kept for reusability across future test contexts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Medium PR: 300-599 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant