Skip to content

Wire Redis session storage and fix MultiSession lookup#4402

Closed
yrobla wants to merge 2 commits intomainfrom
issue-4220-v1
Closed

Wire Redis session storage and fix MultiSession lookup#4402
yrobla wants to merge 2 commits intomainfrom
issue-4220-v1

Conversation

@yrobla
Copy link
Copy Markdown
Contributor

@yrobla yrobla commented Mar 27, 2026

Summary

Two bugs prevented Redis-backed VirtualMCPServer from working correctly across pods:

  1. The vmcp process never read the SessionStorage field from its config, always using in-memory (local) storage regardless of operator config. Fixed by adding SessionStorageConfig to server.Config and a buildRedisConfig helper in commands.go that reads the provider, address, DB, and password (via THV_SESSION_REDIS_PASSWORD) from the vmcp config at startup.

  2. With Redis-backed storage, sessionmanager.Manager.GetMultiSession() always returned (nil, false) for fully-formed sessions. The transportsession.Manager.Get() deserialises from Redis into a plain *StreamableSession, losing the defaultMultiSession type. The type assertion sess.(vmcpsession.MultiSession) therefore always failed, causing GetAdaptedTools to error, which triggered the cleanup defer to call Terminate(), marking the session terminated=true in Redis. The client's next request got a 404 "session terminated".

    Fixed by adding a node-local sync.Map (multiSessions) to sessionmanager.Manager as the authoritative store for live MultiSession objects. CreateSession stores the live session there after UpsertSession succeeds; GetMultiSession and Terminate consult the local map first before falling back to the pluggable backend.

Also adds:

  • E2E test: VirtualMCPServer Redis Session Continuity — verifies that MCP sessions survive complete pod replacement when Redis is configured (ServiceAffinityNone, 2 replicas, all pods deleted and replaced).
  • Redis image constant and DeployRedis/CleanupRedis helpers in the e2e test package.

Fixes #4220

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Does this introduce a user-facing change?

Breaking behavior change: HTTP 401 → HTTP 404 for unknown/expired sessions

Previously, the discovery middleware returned HTTP 401 for subsequent requests (requests carrying an Mcp-Session-Id header) when the session was not found or not yet fully initialized. With this change, the middleware passes those requests through to the MCP SDK, which returns HTTP 404 ("session not found").

The new behavior is more semantically accurate — an unknown session ID is a "not found" condition, not an authorization failure — but any monitoring, alerting, or client code that matches on HTTP 401 specifically for session-expiry detection will need to be updated to also handle HTTP 404.

Special notes for reviewers

  • multiSessions local map: intentionally node-local (not shared via Redis). Redis holds serialisable metadata only; live MultiSession objects (which hold open HTTP connections to backends) cannot be serialised. Cross-pod recovery is handled by RestoreSession, which reopens backend connections from stored metadata on cache miss.
  • Two-phase session creation: Generate() creates a placeholder; CreateSession() (called from OnRegisterSession) upgrades it. This window means a session can exist as a placeholder with no MultiSession in the local map — handleSubsequentRequest accounts for this by skipping capability injection rather than erroring.
  • Race window in Terminate(): noted inline with a TODO for a "terminating" tombstone if needed.
  • context.Background() in GetMultiSession: MultiSessionGetter carries no context (SDK interface constraint); noted inline with a TODO to add context propagation.

Large PR Justification

This is a complete PR that covers a single functionality. It includes comprehensive tests. Cannot be split.

@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@yrobla yrobla requested a review from Copilot March 27, 2026 10:09
@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Mar 27, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 55.79568% with 225 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.37%. Comparing base (feef6ed) to head (802c8ce).
⚠️ Report is 26 commits behind head on main.

Files with missing lines Patch % Lines
...kg/transport/session/session_data_storage_redis.go 37.23% 53 Missing and 6 partials ⚠️
pkg/vmcp/server/sessionmanager/session_manager.go 76.19% 22 Missing and 13 partials ⚠️
pkg/vmcp/session/factory.go 41.50% 31 Missing ⚠️
pkg/vmcp/session/internal/security/security.go 0.00% 23 Missing ⚠️
cmd/vmcp/app/commands.go 0.00% 22 Missing ⚠️
pkg/vmcp/session/decorating_factory.go 0.00% 17 Missing ⚠️
pkg/vmcp/server/server.go 71.05% 8 Missing and 3 partials ⚠️
...kg/transport/session/session_data_storage_local.go 90.36% 4 Missing and 4 partials ⚠️
pkg/transport/session/storage_redis.go 0.00% 7 Missing ⚠️
pkg/transport/session/manager.go 0.00% 6 Missing ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4402      +/-   ##
==========================================
- Coverage   69.50%   69.37%   -0.14%     
==========================================
  Files         486      489       +3     
  Lines       50017    50433     +416     
==========================================
+ Hits        34766    34987     +221     
- Misses      12570    12743     +173     
- Partials     2681     2703      +22     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes Redis-backed session storage for VirtualMCPServer so session metadata is actually persisted across pods, and addresses a MultiSession lookup issue caused by Redis deserialization losing the concrete MultiSession type.

Changes:

  • Wire SessionStorage config into vmcp server startup and create a Redis-backed transport session manager when configured.
  • Fix GetMultiSession() behavior for Redis-backed storage by maintaining a node-local authoritative map of live MultiSession objects.
  • Add E2E coverage for Redis session continuity across full pod replacement, plus Redis deploy/cleanup helpers and image constants.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
cmd/vmcp/app/commands.go Builds Redis session config from vmcp config/env and passes it into server config.
pkg/vmcp/server/server.go Adds SessionStorageConfig and instantiates Redis-backed transport session storage when set.
pkg/vmcp/server/sessionmanager/session_manager.go Introduces node-local multiSessions map and updates Create/Get/Terminate/Decorate paths for Redis compatibility.
test/e2e/thv-operator/virtualmcp/virtualmcpserver_scaling_test.go Refactors scaling test into a single ordered lifecycle and adds a scale helper.
test/e2e/thv-operator/virtualmcp/virtualmcpserver_redis_session_test.go New E2E test verifying sessions survive complete pod replacement with Redis configured.
test/e2e/thv-operator/virtualmcp/helpers.go Adds DeployRedis / CleanupRedis helpers for E2E tests.
test/e2e/images/images.go Adds a Redis image constant for E2E usage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 27, 2026
@yrobla yrobla requested a review from blkt as a code owner March 27, 2026 10:24
@yrobla yrobla requested a review from Copilot March 27, 2026 10:24
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Mar 27, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Mar 27, 2026
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Mar 27, 2026
@yrobla yrobla requested a review from Copilot March 27, 2026 10:58
@github-actions github-actions bot added the size/XL Extra large PR: 1000+ lines changed label Mar 30, 2026
@yrobla yrobla requested a review from Copilot March 30, 2026 08:37
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yrobla yrobla requested a review from Copilot March 30, 2026 08:56
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 30, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 30, 2026
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 30, 2026
JAORMX
JAORMX previously requested changes Mar 30, 2026
Copy link
Copy Markdown
Collaborator

@JAORMX JAORMX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work on the architectural split here. Separating serializable metadata from live in-process state is the right call, and the singleflight dedup for RestoreSession is a smart touch to avoid thundering-herd reconnections.

The main things I'd like addressed before merging:

  1. Exists + Store in Generate() isn't atomic -- the collision guard doesn't actually guard. Use SetNX or equivalent.
  2. context.Background() with no timeout in Validate() and Generate() -- these are on the hot path. If Redis is slow, request goroutines hang. The old code used context.WithTimeout.
  3. Peek on Storage interface has no caller -- let's not add methods to stable interfaces without a consumer.
  4. Missing unit tests for DataStorage implementations -- both local and Redis have non-trivial logic that should be tested directly.
  5. HTTP 401 -> 404 behavior change -- worth calling out in the PR description so consumers know.

The Terminate() ordering, eviction-during-use, and GetMultiSession context propagation items are more "worth thinking about" than blockers... see inline comments for details.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 30, 2026
@yrobla yrobla requested review from JAORMX and Copilot March 30, 2026 12:27
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Mar 30, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

@jerm-dro jerm-dro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the iteration here — the direction is looking good. The scope has grown significantly though (2300+ lines across 27 files), and I think splitting this up would make it easier to review and land incrementally.

Some possible splits, but I'll leave the exact boundaries to you:

  1. DataStorage interface + implementations + tests — Pure additive code with no existing behavior changes, so it can merge independently with minimal risk.

  2. Session manager migration to DataStorage — The core behavioral change: swap the transport Manager dependency, implement RestoreSession, update tests.

  3. Wire into server + discovery middlewarebuildRedisConfig, SessionStorageConfig, MultiSessionGetter interface, and integration test updates (including the 401 → 404 change).

  4. E2E tests — Redis session continuity test, helpers, scaling test refactor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Write unit tests for horizontal scaling operator behaviors (THV-0047)

5 participants