Wire Redis session storage and fix MultiSession lookup#4402
Wire Redis session storage and fix MultiSession lookup#4402
Conversation
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4402 +/- ##
==========================================
- Coverage 69.50% 69.37% -0.14%
==========================================
Files 486 489 +3
Lines 50017 50433 +416
==========================================
+ Hits 34766 34987 +221
- Misses 12570 12743 +173
- Partials 2681 2703 +22 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR fixes Redis-backed session storage for VirtualMCPServer so session metadata is actually persisted across pods, and addresses a MultiSession lookup issue caused by Redis deserialization losing the concrete MultiSession type.
Changes:
- Wire
SessionStorageconfig intovmcpserver startup and create a Redis-backed transport session manager when configured. - Fix
GetMultiSession()behavior for Redis-backed storage by maintaining a node-local authoritative map of liveMultiSessionobjects. - Add E2E coverage for Redis session continuity across full pod replacement, plus Redis deploy/cleanup helpers and image constants.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
cmd/vmcp/app/commands.go |
Builds Redis session config from vmcp config/env and passes it into server config. |
pkg/vmcp/server/server.go |
Adds SessionStorageConfig and instantiates Redis-backed transport session storage when set. |
pkg/vmcp/server/sessionmanager/session_manager.go |
Introduces node-local multiSessions map and updates Create/Get/Terminate/Decorate paths for Redis compatibility. |
test/e2e/thv-operator/virtualmcp/virtualmcpserver_scaling_test.go |
Refactors scaling test into a single ordered lifecycle and adds a scale helper. |
test/e2e/thv-operator/virtualmcp/virtualmcpserver_redis_session_test.go |
New E2E test verifying sessions survive complete pod replacement with Redis configured. |
test/e2e/thv-operator/virtualmcp/helpers.go |
Adds DeployRedis / CleanupRedis helpers for E2E tests. |
test/e2e/images/images.go |
Adds a Redis image constant for E2E usage. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
JAORMX
left a comment
There was a problem hiding this comment.
Nice work on the architectural split here. Separating serializable metadata from live in-process state is the right call, and the singleflight dedup for RestoreSession is a smart touch to avoid thundering-herd reconnections.
The main things I'd like addressed before merging:
Exists+StoreinGenerate()isn't atomic -- the collision guard doesn't actually guard. UseSetNXor equivalent.context.Background()with no timeout inValidate()andGenerate()-- these are on the hot path. If Redis is slow, request goroutines hang. The old code usedcontext.WithTimeout.PeekonStorageinterface has no caller -- let's not add methods to stable interfaces without a consumer.- Missing unit tests for
DataStorageimplementations -- both local and Redis have non-trivial logic that should be tested directly. - HTTP 401 -> 404 behavior change -- worth calling out in the PR description so consumers know.
The Terminate() ordering, eviction-during-use, and GetMultiSession context propagation items are more "worth thinking about" than blockers... see inline comments for details.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 27 out of 27 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 27 out of 27 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
jerm-dro
left a comment
There was a problem hiding this comment.
Thanks for the iteration here — the direction is looking good. The scope has grown significantly though (2300+ lines across 27 files), and I think splitting this up would make it easier to review and land incrementally.
Some possible splits, but I'll leave the exact boundaries to you:
-
DataStorageinterface + implementations + tests — Pure additive code with no existing behavior changes, so it can merge independently with minimal risk. -
Session manager migration to
DataStorage— The core behavioral change: swap the transportManagerdependency, implementRestoreSession, update tests. -
Wire into server + discovery middleware —
buildRedisConfig,SessionStorageConfig,MultiSessionGetterinterface, and integration test updates (including the 401 → 404 change). -
E2E tests — Redis session continuity test, helpers, scaling test refactor.
Summary
Two bugs prevented Redis-backed VirtualMCPServer from working correctly across pods:
The vmcp process never read the SessionStorage field from its config, always using in-memory (local) storage regardless of operator config. Fixed by adding SessionStorageConfig to server.Config and a buildRedisConfig helper in commands.go that reads the provider, address, DB, and password (via THV_SESSION_REDIS_PASSWORD) from the vmcp config at startup.
With Redis-backed storage, sessionmanager.Manager.GetMultiSession() always returned (nil, false) for fully-formed sessions. The transportsession.Manager.Get() deserialises from Redis into a plain *StreamableSession, losing the defaultMultiSession type. The type assertion sess.(vmcpsession.MultiSession) therefore always failed, causing GetAdaptedTools to error, which triggered the cleanup defer to call Terminate(), marking the session terminated=true in Redis. The client's next request got a 404 "session terminated".
Fixed by adding a node-local sync.Map (multiSessions) to sessionmanager.Manager as the authoritative store for live MultiSession objects. CreateSession stores the live session there after UpsertSession succeeds; GetMultiSession and Terminate consult the local map first before falling back to the pluggable backend.
Also adds:
Fixes #4220
Type of change
Test plan
task test)task test-e2e)task lint-fix)Does this introduce a user-facing change?
Breaking behavior change: HTTP 401 → HTTP 404 for unknown/expired sessions
Previously, the discovery middleware returned HTTP 401 for subsequent requests (requests carrying an
Mcp-Session-Idheader) when the session was not found or not yet fully initialized. With this change, the middleware passes those requests through to the MCP SDK, which returns HTTP 404 ("session not found").The new behavior is more semantically accurate — an unknown session ID is a "not found" condition, not an authorization failure — but any monitoring, alerting, or client code that matches on HTTP 401 specifically for session-expiry detection will need to be updated to also handle HTTP 404.
Special notes for reviewers
multiSessionslocal map: intentionally node-local (not shared via Redis). Redis holds serialisable metadata only; liveMultiSessionobjects (which hold open HTTP connections to backends) cannot be serialised. Cross-pod recovery is handled byRestoreSession, which reopens backend connections from stored metadata on cache miss.Generate()creates a placeholder;CreateSession()(called fromOnRegisterSession) upgrades it. This window means a session can exist as a placeholder with noMultiSessionin the local map —handleSubsequentRequestaccounts for this by skipping capability injection rather than erroring.Terminate(): noted inline with a TODO for a "terminating" tombstone if needed.context.Background()inGetMultiSession:MultiSessionGettercarries no context (SDK interface constraint); noted inline with a TODO to add context propagation.Large PR Justification
This is a complete PR that covers a single functionality. It includes comprehensive tests. Cannot be split.