Skip to content

Update Redis metadata when backend sessions expire (RC-16)#4404

Open
yrobla wants to merge 3 commits intomainfrom
issue-4213
Open

Update Redis metadata when backend sessions expire (RC-16)#4404
yrobla wants to merge 3 commits intomainfrom
issue-4213

Conversation

@yrobla
Copy link
Copy Markdown
Contributor

@yrobla yrobla commented Mar 27, 2026

Summary

Add RemoveBackendFromMetadata to the MultiSession interface and implement it on defaultMultiSession to clear the per-backend session ID key and rebuild MetadataKeyBackendIDs when a backend disconnects or expires.

Add NotifyBackendExpired to sessionmanager.Manager to call RemoveBackendFromMetadata and flush the updated metadata to storage (Redis when configured), keeping Redis consistent with the actual set of connected backends.

Fixes #4213

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Changes

File Change

Does this introduce a user-facing change?

Special notes for reviewers

@yrobla yrobla requested a review from Copilot March 27, 2026 12:22
@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Mar 27, 2026
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 27, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a mechanism to keep transport-session metadata (and therefore Redis, when configured) consistent with the actual set of connected backends when an individual backend session expires or is lost.

Changes:

  • Extend MultiSession with RemoveBackendFromMetadata(workloadID string) and implement it in defaultMultiSession.
  • Add sessionmanager.Manager.NotifyBackendExpired() to apply the metadata update and persist it via storage.UpsertSession.
  • Add unit tests for backend-metadata removal and for NotifyBackendExpired, plus mock/test-session updates.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pkg/vmcp/session/types/session.go Adds RemoveBackendFromMetadata to the MultiSession interface.
pkg/vmcp/session/types/mocks/mock_session.go Updates the generated gomock MockMultiSession with the new method.
pkg/vmcp/session/default_session.go Implements RemoveBackendFromMetadata by clearing per-backend metadata and rebuilding vmcp.backend.ids.
pkg/vmcp/session/remove_backend_test.go New tests validating metadata updates when backends are removed.
pkg/vmcp/server/sessionmanager/session_manager.go Adds NotifyBackendExpired to mutate metadata and persist it to storage.
pkg/vmcp/server/sessionmanager/session_manager_test.go New tests covering NotifyBackendExpired behavior.
pkg/vmcp/server/telemetry_integration_test.go Updates a test MultiSession implementation to satisfy the new interface.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 66.66667% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.88%. Comparing base (8b36d99) to head (0615b2c).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
pkg/vmcp/server/sessionmanager/session_manager.go 63.26% 13 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4404      +/-   ##
==========================================
+ Coverage   68.84%   68.88%   +0.03%     
==========================================
  Files         505      505              
  Lines       52403    52456      +53     
==========================================
+ Hits        36079    36134      +55     
+ Misses      13533    13527       -6     
- Partials     2791     2795       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/S Small PR: 100-299 lines changed labels Mar 27, 2026
Copy link
Copy Markdown
Contributor

@jerm-dro jerm-dro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocker: This PR introduces mutability to MultiSession — adding RemoveBackendFromMetadata to the interface, a sync.RWMutex (bsMu) to guard backendSessions, and in-place mutation of a previously read-only map. This is a consequence of sessions being long-lived cached objects (the sync.Map from #4402) that need to be updated in place when a backend expires.

Per the feedback on #4402, if sessions are reconstructed from metadata on every access (the SessionDataStorage + RestoreSession model), then sessions become immutable snapshots and this mutation isn't needed. When a backend expires, the fix is simpler:

  1. Update the metadata in SessionDataStorage — delete vmcp.backend.session.{workloadID} and rebuild MetadataKeyBackendIDs.
  2. The next request reconstructs the session via RestoreSession, which reads the updated metadata and naturally produces a session without the expired backend.

No interface change, no mutex, no in-place mutation. NotifyBackendExpired becomes a metadata-only storage update rather than a session method.

I'd suggest revisiting this after the #4402 rework lands — the approach here will likely simplify significantly.

Base automatically changed from issue-4211 to main March 30, 2026 09:12
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 30, 2026
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 30, 2026
@yrobla yrobla changed the base branch from main to issue-4220-v1 March 30, 2026 10:22
@github-actions github-actions bot removed the size/M Medium PR: 300-599 lines changed label Mar 30, 2026
@jerm-dro
Copy link
Copy Markdown
Contributor

jerm-dro commented Apr 1, 2026

This PR overlaps significantly with #4464 (which has been approved). Once #4464 merges, this will need a rebase — expect substantial conflicts in session_manager.go, factory.go, server.go, middleware.go, and others. Happy to re-review after rebase.

@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 2, 2026
@github-actions github-actions bot dismissed their stale review April 2, 2026 13:32

PR size has been reduced below the XL threshold. Thank you for splitting this up!

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

✅ PR size has been reduced below the XL threshold. The size review has been dismissed and this PR can now proceed with normal review. Thank you for splitting this up!

@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 2, 2026
@yrobla yrobla requested a review from Copilot April 2, 2026 13:57
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels Apr 2, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

@jerm-dro jerm-dro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NotifyBackendExpired storage logic is on the right track, but I think this can be simpler.

Drop RemoveBackendFromMetadata from MultiSession

If NotifyBackendExpired evicts the session from the node-local cache after updating storage, there are no stale in-memory copies left. The next GetMultiSession call triggers RestoreSession, which reconstructs a fresh snapshot from the updated metadata in storage. The old session object is simply discarded and replaced — no in-place mutation needed.

Simplify the storage operation

The Load → mutate → Upsert pattern introduces a TOCTOU race (Terminate could delete between Load and Upsert, resurrecting the session). Consider a generic metadata update primitive on Manager instead:

func (sm *Manager) UpdateMetadata(sessionID string, metadata map[string]string)

Conditional write the new state, evict from cache. The caller builds the complete metadata — no load, no merge, no diff. This also aligns with the cache simplification work in #4495.

Cross-pod propagation

When pod A updates metadata in storage and evicts locally, pod B still serves its stale cached session until that cache entry is naturally evicted. Pod B would continue routing to the expired backend in the meantime. The fix: the validating cache's check callback should diff the cached session's metadata against what's in storage, and if it's out of date, replace the cached entry with a fresh snapshot built from the updated metadata. This way storage updates propagate to all pods on the next cache access, not just the pod that initiated the update.

Parallel writers

Neither this approach nor the current Load → mutate → Upsert is safe against parallel writers — concurrent updates to the same session's metadata will squash each other (last write wins). A fully correct solution would need CAS semantics, distributed locks, or a generation/version counter on the session state. For now we're assuming parallel writes within a session don't happen, and we're not painting ourselves into a corner — we could retrofit the simplified storage/cache update pattern with any of these methods later. This assumption should be documented in a code comment.

taskbot added 2 commits April 7, 2026 11:28
Add RemoveBackendFromMetadata to the MultiSession interface and implement
it on defaultMultiSession to clear the per-backend session ID key and
rebuild MetadataKeyBackendIDs when a backend disconnects or expires.

Add NotifyBackendExpired to sessionmanager.Manager to call
RemoveBackendFromMetadata and flush the updated metadata to storage
(Redis when configured), keeping Redis consistent with the actual set
of connected backends.

Closes #4213
NotifyBackendExpired now evicts the session from the node-local
multiSessions cache after a successful storage update, and forgets
the singleflight key to prevent a concurrent restore from
re-populating the cache with stale data. This ensures same-pod
requests trigger RestoreSession with the updated metadata (without
the expired backend), not just cross-pod ones.

On storage error the cache is intentionally not evicted to avoid
serving requests against metadata that still describes the expired
backend.

Also update session_data_storage_test.go to match the API renamed in
the "fixes from review" commit: Upsert→Store, Create→Exists. Add
contract tests for Exists (false when absent, true when present, does
not refresh TTL for either local and Redis implementations). Remove
the error-return assertion from NewLocalSessionDataStorage.
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed size/M Medium PR: 300-599 lines changed and removed size/S Small PR: 100-299 lines changed labels Apr 7, 2026
@yrobla yrobla requested a review from jerm-dro April 7, 2026 10:17
@yrobla yrobla requested a review from Copilot April 7, 2026 10:22
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Medium PR: 300-599 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Update Redis Metadata When Backend Sessions Expire (RC-16)

4 participants