Update Redis metadata when backend sessions expire (RC-16)#4404
Update Redis metadata when backend sessions expire (RC-16)#4404
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds a mechanism to keep transport-session metadata (and therefore Redis, when configured) consistent with the actual set of connected backends when an individual backend session expires or is lost.
Changes:
- Extend
MultiSessionwithRemoveBackendFromMetadata(workloadID string)and implement it indefaultMultiSession. - Add
sessionmanager.Manager.NotifyBackendExpired()to apply the metadata update and persist it viastorage.UpsertSession. - Add unit tests for backend-metadata removal and for
NotifyBackendExpired, plus mock/test-session updates.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/vmcp/session/types/session.go | Adds RemoveBackendFromMetadata to the MultiSession interface. |
| pkg/vmcp/session/types/mocks/mock_session.go | Updates the generated gomock MockMultiSession with the new method. |
| pkg/vmcp/session/default_session.go | Implements RemoveBackendFromMetadata by clearing per-backend metadata and rebuilding vmcp.backend.ids. |
| pkg/vmcp/session/remove_backend_test.go | New tests validating metadata updates when backends are removed. |
| pkg/vmcp/server/sessionmanager/session_manager.go | Adds NotifyBackendExpired to mutate metadata and persist it to storage. |
| pkg/vmcp/server/sessionmanager/session_manager_test.go | New tests covering NotifyBackendExpired behavior. |
| pkg/vmcp/server/telemetry_integration_test.go | Updates a test MultiSession implementation to satisfy the new interface. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4404 +/- ##
==========================================
+ Coverage 68.84% 68.88% +0.03%
==========================================
Files 505 505
Lines 52403 52456 +53
==========================================
+ Hits 36079 36134 +55
+ Misses 13533 13527 -6
- Partials 2791 2795 +4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
jerm-dro
left a comment
There was a problem hiding this comment.
blocker: This PR introduces mutability to MultiSession — adding RemoveBackendFromMetadata to the interface, a sync.RWMutex (bsMu) to guard backendSessions, and in-place mutation of a previously read-only map. This is a consequence of sessions being long-lived cached objects (the sync.Map from #4402) that need to be updated in place when a backend expires.
Per the feedback on #4402, if sessions are reconstructed from metadata on every access (the SessionDataStorage + RestoreSession model), then sessions become immutable snapshots and this mutation isn't needed. When a backend expires, the fix is simpler:
- Update the metadata in
SessionDataStorage— deletevmcp.backend.session.{workloadID}and rebuildMetadataKeyBackendIDs. - The next request reconstructs the session via
RestoreSession, which reads the updated metadata and naturally produces a session without the expired backend.
No interface change, no mutex, no in-place mutation. NotifyBackendExpired becomes a metadata-only storage update rather than a session method.
I'd suggest revisiting this after the #4402 rework lands — the approach here will likely simplify significantly.
PR size has been reduced below the XL threshold. Thank you for splitting this up!
|
✅ PR size has been reduced below the XL threshold. The size review has been dismissed and this PR can now proceed with normal review. Thank you for splitting this up! |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
jerm-dro
left a comment
There was a problem hiding this comment.
The NotifyBackendExpired storage logic is on the right track, but I think this can be simpler.
Drop RemoveBackendFromMetadata from MultiSession
If NotifyBackendExpired evicts the session from the node-local cache after updating storage, there are no stale in-memory copies left. The next GetMultiSession call triggers RestoreSession, which reconstructs a fresh snapshot from the updated metadata in storage. The old session object is simply discarded and replaced — no in-place mutation needed.
Simplify the storage operation
The Load → mutate → Upsert pattern introduces a TOCTOU race (Terminate could delete between Load and Upsert, resurrecting the session). Consider a generic metadata update primitive on Manager instead:
func (sm *Manager) UpdateMetadata(sessionID string, metadata map[string]string)Conditional write the new state, evict from cache. The caller builds the complete metadata — no load, no merge, no diff. This also aligns with the cache simplification work in #4495.
Cross-pod propagation
When pod A updates metadata in storage and evicts locally, pod B still serves its stale cached session until that cache entry is naturally evicted. Pod B would continue routing to the expired backend in the meantime. The fix: the validating cache's check callback should diff the cached session's metadata against what's in storage, and if it's out of date, replace the cached entry with a fresh snapshot built from the updated metadata. This way storage updates propagate to all pods on the next cache access, not just the pod that initiated the update.
Parallel writers
Neither this approach nor the current Load → mutate → Upsert is safe against parallel writers — concurrent updates to the same session's metadata will squash each other (last write wins). A fully correct solution would need CAS semantics, distributed locks, or a generation/version counter on the session state. For now we're assuming parallel writes within a session don't happen, and we're not painting ourselves into a corner — we could retrofit the simplified storage/cache update pattern with any of these methods later. This assumption should be documented in a code comment.
Add RemoveBackendFromMetadata to the MultiSession interface and implement it on defaultMultiSession to clear the per-backend session ID key and rebuild MetadataKeyBackendIDs when a backend disconnects or expires. Add NotifyBackendExpired to sessionmanager.Manager to call RemoveBackendFromMetadata and flush the updated metadata to storage (Redis when configured), keeping Redis consistent with the actual set of connected backends. Closes #4213
NotifyBackendExpired now evicts the session from the node-local multiSessions cache after a successful storage update, and forgets the singleflight key to prevent a concurrent restore from re-populating the cache with stale data. This ensures same-pod requests trigger RestoreSession with the updated metadata (without the expired backend), not just cross-pod ones. On storage error the cache is intentionally not evicted to avoid serving requests against metadata that still describes the expired backend. Also update session_data_storage_test.go to match the API renamed in the "fixes from review" commit: Upsert→Store, Create→Exists. Add contract tests for Exists (false when absent, true when present, does not refresh TTL for either local and Redis implementations). Remove the error-return assertion from NewLocalSessionDataStorage.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Summary
Add RemoveBackendFromMetadata to the MultiSession interface and implement it on defaultMultiSession to clear the per-backend session ID key and rebuild MetadataKeyBackendIDs when a backend disconnects or expires.
Add NotifyBackendExpired to sessionmanager.Manager to call RemoveBackendFromMetadata and flush the updated metadata to storage (Redis when configured), keeping Redis consistent with the actual set of connected backends.
Fixes #4213
Type of change
Test plan
task test)task test-e2e)task lint-fix)Changes
Does this introduce a user-facing change?
Special notes for reviewers