@therealnb therealnb commented Jan 26, 2026

Infrastructure Improvements and Bugfixes for vMCP

This PR contains infrastructure improvements and bugfixes that enable better observability, reliability, and performance for vMCP. These changes are independent of the optimizer implementation and can be merged separately.

Changes

Observability & Tracing

OpenTelemetry Tracing in Capability Aggregation (pkg/vmcp/aggregator/default_aggregator.go)

  • Added comprehensive tracing spans to all aggregator methods
  • Includes spans for AggregateCapabilities, QueryAllCapabilities, QueryCapabilities, ResolveConflicts, and MergeCapabilities
  • All spans include relevant attributes (backend counts, tool/resource/prompt counts, error recording)
  • Why needed: Provides visibility into capability aggregation performance and helps diagnose issues using distributed tracing systems such as Jaeger

Performance Improvements

Singleflight Deduplication for Discovery (pkg/vmcp/discovery/manager.go)

  • Added singleflight group to prevent duplicate aggregation work when multiple requests arrive concurrently
  • Ensures only one aggregation happens per cache key at a time
  • Re-checks the cache inside the singleflight call (double-checked caching), so a request that lost the race reuses the freshly cached result instead of aggregating again
  • Why needed: Prevents wasted CPU cycles and reduces backend load when multiple clients request capabilities simultaneously. Critical for high-traffic scenarios.
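
The deduplication described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in pkg/vmcp/discovery/manager.go: `dedupGroup` is a hand-rolled, simplified stand-in for `singleflight.Group` from golang.org/x/sync, and `manager`, `Discover`, `countAggregations`, and the cache key are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// call tracks one in-flight aggregation for a cache key.
type call struct {
	done chan struct{}
	val  string
	err  error
}

// dedupGroup is a simplified stand-in for singleflight.Group (assumption:
// the PR uses golang.org/x/sync/singleflight; this sketch avoids the dependency).
type dedupGroup struct {
	mu       sync.Mutex
	inflight map[string]*call
}

// Do runs fn at most once per key at a time; concurrent callers for the
// same key wait for the leader and share its result.
func (g *dedupGroup) Do(key string, fn func() (string, error)) (string, error) {
	g.mu.Lock()
	if g.inflight == nil {
		g.inflight = make(map[string]*call)
	}
	if c, ok := g.inflight[key]; ok {
		g.mu.Unlock()
		<-c.done // wait for the leader to finish
		return c.val, c.err
	}
	c := &call{done: make(chan struct{})}
	g.inflight[key] = c
	g.mu.Unlock()

	c.val, c.err = fn()

	g.mu.Lock()
	delete(g.inflight, key)
	g.mu.Unlock()
	close(c.done)
	return c.val, c.err
}

// manager is a hypothetical discovery manager with a capability cache.
type manager struct {
	mu           sync.Mutex
	cache        map[string]string
	aggregations int
	group        dedupGroup
}

func (m *manager) cached(key string) (string, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	v, ok := m.cache[key]
	return v, ok
}

// Discover returns cached capabilities or aggregates them exactly once.
func (m *manager) Discover(key string) (string, error) {
	if v, ok := m.cached(key); ok {
		return v, nil
	}
	return m.group.Do(key, func() (string, error) {
		// Double-check: another caller may have filled the cache between
		// our first miss and entering the deduplicated section.
		if v, ok := m.cached(key); ok {
			return v, nil
		}
		v := "capabilities:" + key // placeholder for the real aggregation
		m.mu.Lock()
		m.aggregations++
		m.cache[key] = v
		m.mu.Unlock()
		return v, nil
	})
}

// countAggregations fires n concurrent Discover calls for one key and
// reports how many aggregations actually ran.
func countAggregations(n int) int {
	m := &manager{cache: make(map[string]string)}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = m.Discover("session-key")
		}()
	}
	wg.Wait()
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.aggregations
}

func main() {
	fmt.Println("aggregations:", countAggregations(50))
}
```

The second cache lookup is what keeps a late caller from repeating the work: if it misses the cache just before the leader finishes, it becomes a new leader but finds the cached value immediately.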

Health Check Improvements

Self-Check Prevention (pkg/vmcp/health/checker.go)

  • Added selfURL field to prevent server from health-checking itself
  • Implements URL normalization for comparison (handles localhost/127.0.0.1 variations)
  • Short-circuits health checks targeting the server's own URL
  • Why needed: Prevents wasteful self-checks and potential connection issues during startup. Improves server efficiency.
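
A rough sketch of the URL comparison described above, assuming that normalization means mapping localhost to 127.0.0.1, filling in default ports, and ignoring a trailing slash. The function names `normalizeURL` and `isSelf` and the port in the example are illustrative assumptions, not taken from pkg/vmcp/health/checker.go. Note that a later commit in this PR removes the self-check again.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// normalizeURL reduces a URL to scheme://host:port/path with localhost
// mapped to 127.0.0.1 and any trailing slash dropped.
func normalizeURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	host := strings.ToLower(u.Hostname())
	if host == "localhost" {
		host = "127.0.0.1"
	}
	port := u.Port()
	if port == "" {
		// Fill in the scheme's default port so ":80" and "" compare equal.
		switch u.Scheme {
		case "http":
			port = "80"
		case "https":
			port = "443"
		}
	}
	hostport := host
	if port != "" {
		hostport = net.JoinHostPort(host, port)
	}
	return u.Scheme + "://" + hostport + strings.TrimSuffix(u.Path, "/"), nil
}

// isSelf reports whether target points at the server's own URL.
// An empty selfURL disables the check.
func isSelf(target, selfURL string) bool {
	if selfURL == "" {
		return false
	}
	t, err := normalizeURL(target)
	if err != nil {
		return false
	}
	s, err := normalizeURL(selfURL)
	if err != nil {
		return false
	}
	return t == s
}

func main() {
	fmt.Println(isSelf("http://localhost:4483/mcp", "http://127.0.0.1:4483/mcp/"))
	fmt.Println(isSelf("http://10.0.0.5:4483/mcp", "http://127.0.0.1:4483/mcp"))
}
```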

Health Check Context Marker (pkg/vmcp/health/checker.go)

  • Added WithHealthCheckMarker to bypass authentication logging for health checks
  • Health checks verify backend availability and should not require user credentials
  • Why needed: Reduces noise in authentication logs and clarifies that health checks are system-level operations, not user requests.
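
One common way to implement such a marker is a context value under an unexported key type. The sketch below assumes that approach; only the name `WithHealthCheckMarker` comes from the description, while its signature, `IsHealthCheck`, `authLogLine`, and the log text are hypothetical.

```go
package main

import (
	"context"
	"fmt"
)

// healthCheckKey is an unexported key type, so no other package can
// collide with or forge the marker.
type healthCheckKey struct{}

// WithHealthCheckMarker returns a context flagged as a health check.
// (Signature assumed; the PR description only names the function.)
func WithHealthCheckMarker(ctx context.Context) context.Context {
	return context.WithValue(ctx, healthCheckKey{}, true)
}

// IsHealthCheck reports whether ctx carries the marker (hypothetical helper).
func IsHealthCheck(ctx context.Context) bool {
	marked, _ := ctx.Value(healthCheckKey{}).(bool)
	return marked
}

// authLogLine returns the log line an auth layer would emit for a request
// without user credentials, or "" when the request is a health check.
func authLogLine(ctx context.Context, backend string) string {
	if IsHealthCheck(ctx) {
		return "" // system-level probe: nothing to log
	}
	return "no user identity in context for backend " + backend
}

// demo returns the log lines for a marked and an unmarked context.
func demo() (healthLine, userLine string) {
	ctx := context.Background()
	return authLogLine(WithHealthCheckMarker(ctx), "backend-a"),
		authLogLine(ctx, "backend-a")
}

func main() {
	h, u := demo()
	fmt.Printf("health check: %q\nuser request: %q\n", h, u)
}
```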

Client Reliability Fixes

HTTP Client Timeout (pkg/vmcp/client/client.go)

  • Added 30-second timeout to HTTP client to prevent hanging connections
  • Removed WithContinuousListening() in favor of timeout-based approach
  • Set timeout for streamable HTTP client as well
  • Why needed: Prevents connections from hanging indefinitely, which can cause resource exhaustion and make the system unresponsive. Critical for production reliability.
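
For illustration, the effect of a client-level timeout can be shown with the standard library alone. This is a sketch, not the code in pkg/vmcp/client/client.go: `newBackendClient`, `isTimeout`, and `hungBackendTimesOut` are hypothetical helpers, and the hung backend is simulated with a local test server.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"time"
)

// newBackendClient returns an HTTP client that abandons any request,
// including reading the response body, after the given timeout.
func newBackendClient(timeout time.Duration) *http.Client {
	return &http.Client{Timeout: timeout}
}

// isTimeout reports whether err came from the client giving up on a request.
func isTimeout(err error) bool {
	var urlErr *url.Error
	return errors.As(err, &urlErr) && urlErr.Timeout()
}

// hungBackendTimesOut starts a local server that never answers in time and
// reports whether the client returned a timeout error.
func hungBackendTimesOut(timeout time.Duration) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Simulate a hung backend: hold the request until the client
		// disconnects (or a safety limit passes).
		select {
		case <-r.Context().Done():
		case <-time.After(2 * time.Second):
		}
	}))
	defer srv.Close()

	resp, err := newBackendClient(timeout).Get(srv.URL)
	if err == nil {
		resp.Body.Close()
		return false
	}
	return isTimeout(err)
}

func main() {
	client := newBackendClient(30 * time.Second) // value from the PR description
	fmt.Println("configured timeout:", client.Timeout)
	fmt.Println("hung backend timed out:", hungBackendTimesOut(100*time.Millisecond))
}
```

Because `http.Client.Timeout` covers the whole exchange, it also ends long-lived streaming responses, which may be one reason it is paired with removing continuous listening here.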

Test Reliability Improvements

E2E Test Pod Readiness Checks (test/e2e/thv-operator/virtualmcp/helpers.go)

  • Skip completed pods (Succeeded, Failed) in readiness checks
  • Only check pods in Running phase
  • Added validation to ensure at least one running pod exists
  • Why needed: Prevents flaky test failures when pods from older deployments still exist. Tests were failing because they checked all pods, including completed ones from previous test runs.
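
The filtering rules above might look like this in outline. The `pod` struct is a simplified stand-in for the Kubernetes pod type, and this `checkPodsReady` is an assumed shape rather than the helper in test/e2e/thv-operator/virtualmcp/helpers.go; in particular, requiring every Running pod to be ready is an assumption of the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a minimal stand-in for a Kubernetes pod (hypothetical type).
type pod struct {
	Name  string
	Phase string // "Pending", "Running", "Succeeded", or "Failed"
	Ready bool   // value of the pod's Ready condition
}

// checkPodsReady ignores pods that are not Running (for example completed
// pods left over from earlier runs) and fails if no Running pod exists or
// a Running pod is not ready.
func checkPodsReady(pods []pod) error {
	running := 0
	for _, p := range pods {
		if p.Phase != "Running" {
			continue // skip Succeeded, Failed, and Pending pods
		}
		running++
		if !p.Ready {
			return fmt.Errorf("pod %s is running but not ready", p.Name)
		}
	}
	if running == 0 {
		return errors.New("no running pods found")
	}
	return nil
}

func main() {
	pods := []pod{
		{Name: "vmcp-old", Phase: "Succeeded", Ready: false},
		{Name: "vmcp-new", Phase: "Running", Ready: true},
	}
	fmt.Println("readiness error:", checkPodsReady(pods))
}
```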

E2E Test HTTP Client Timeout (test/e2e/thv-operator/virtualmcp/virtualmcp_auth_discovery_test.go)

  • Added HTTP client with 10-second timeout for health checks
  • Added pod readiness verification before health endpoint checks
  • Improved error messages for connection reset issues
  • Why needed: Prevents test timeouts and makes test failures more diagnosable. Ensures pods are actually ready before attempting health checks.

Build & Infrastructure

Gitignore Updates (.gitignore)

  • Added cmd/vmcp/__debug_bin* patterns for debug binaries
  • Why needed: Prevents debug binaries from being accidentally committed

Linter Configuration (.golangci.yml)

  • Excluded the scripts directory (pattern scripts$) from linting
  • Why needed: Scripts directory contains helper scripts that don't need to follow strict Go linting rules

Code Coverage Configuration (codecov.yaml)

  • Excluded test coverage files (*_test_coverage.go) from coverage reports
  • Why needed: These are generated coverage test files and shouldn't be included in coverage calculations

Chart Version Bump (deploy/charts/operator-crds/Chart.yaml)

  • Updated version to 0.0.97
  • Why needed: Standard version bump after changes

Testing

All existing tests pass. The changes improve test reliability and don't break any existing functionality.

Related

This PR is split from #3373. The optimizer implementation will follow in a separate PR that depends on these infrastructure improvements.

Large PR Justification

  • These changes were requested as one PR

- Add OpenTelemetry tracing to capability aggregation
- Add singleflight deduplication for discovery requests
- Add health checker self-check prevention
- Add HTTP client timeout fixes
- Improve E2E test reliability
- Various build and infrastructure improvements
@github-actions github-actions bot added the size/XL Extra large PR: 1000+ lines changed label Jan 26, 2026
@github-actions github-actions bot left a comment

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

…interface

- Add conversion import for meta field handling
- Update CallTool to accept meta parameter and return *vmcp.ToolCallResult
- Update GetPrompt to return *vmcp.PromptGetResult
- Add convertContent helper function
- Update ReadResource to return *vmcp.ResourceReadResult instead of []byte
- Extract and include meta field from backend response
- Include MIME type in result
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed size/L Large PR: 600-999 lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jan 26, 2026
@github-actions github-actions bot dismissed their stale review January 26, 2026 12:20

PR size has been reduced below the XL threshold. Thank you for splitting this up!

@github-actions

✅ PR size has been reduced below the XL threshold. The size review has been dismissed and this PR can now proceed with normal review. Thank you for splitting this up!

- Construct selfURL from Host, Port, and EndpointPath
- Prevents health checker from checking itself
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Jan 26, 2026
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/L Large PR: 600-999 lines changed labels Jan 26, 2026
All 10 calls to NewMonitor in monitor_test.go were missing the new selfURL parameter that was added to the function signature. This was causing compilation failures in CI.
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jan 26, 2026
@codecov

codecov bot commented Jan 26, 2026

Codecov Report

❌ Patch coverage is 77.27273% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.97%. Comparing base (2acfcfc) to head (0b7873c).
⚠️ Report is 12 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| cmd/vmcp/app/commands.go | 0.00% | 13 Missing ⚠️ |
| pkg/vmcp/aggregator/default_aggregator.go | 86.51% | 7 Missing and 5 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3439      +/-   ##
==========================================
+ Coverage   64.95%   64.97%   +0.02%     
==========================================
  Files         396      396              
  Lines       38492    38579      +87     
==========================================
+ Hits        25001    25067      +66     
- Misses      11542    11556      +14     
- Partials     1949     1956       +7     

Fixed import ordering in:
- pkg/vmcp/client/client.go
- pkg/vmcp/health/checker_test.go
- pkg/vmcp/health/monitor_test.go
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jan 26, 2026
The version was incorrectly downgraded to 0.0.102. Restore it to 0.0.103 to match main branch.
@github-actions github-actions bot removed the size/XL Extra large PR: 1000+ lines changed label Jan 26, 2026
- Clarify that checkPodsReady waits for at least one pod (not all pods)
- Add context that helpers are used for single replica deployments
- Update comments on WaitForPodsReady and WaitForVirtualMCPServerReady

Addresses code review feedback from PR review.
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Jan 27, 2026
- Add named return value (retErr error) to MergeCapabilities
- Add error capture in defer statement with span.RecordError and span.SetStatus
- Ensures consistent error handling pattern across all aggregator methods

This completes the implementation of the error capture pattern suggested
in code review for all methods with tracing spans.
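
The pattern named in this commit message can be sketched as follows. To stay runnable with only the standard library, `recordingSpan` is a hypothetical stand-in for an OpenTelemetry span; the real `span.SetStatus` takes a `codes.Code` and a description rather than two strings. `mergeCapabilities` and its inputs are likewise illustrative, not the aggregator's actual signature.

```go
package main

import (
	"errors"
	"fmt"
)

// recordingSpan is a hypothetical stand-in for an OpenTelemetry span that
// only remembers what was reported to it.
type recordingSpan struct {
	errs   []error
	status string
	ended  bool
}

func (s *recordingSpan) RecordError(err error)          { s.errs = append(s.errs, err) }
func (s *recordingSpan) SetStatus(code, desc string)    { s.status = code + ": " + desc }
func (s *recordingSpan) End()                           { s.ended = true }

var errNoBackends = errors.New("no backends to merge")

// mergeCapabilities uses a named return value (retErr) so the deferred
// function sees whatever error the function finally returns, on every path.
func mergeCapabilities(span *recordingSpan, toolsByBackend map[string][]string) (merged []string, retErr error) {
	defer func() {
		if retErr != nil {
			span.RecordError(retErr)
			span.SetStatus("Error", retErr.Error())
		}
		span.End()
	}()

	if len(toolsByBackend) == 0 {
		return nil, errNoBackends
	}
	for _, tools := range toolsByBackend {
		merged = append(merged, tools...)
	}
	return merged, nil
}

func main() {
	span := &recordingSpan{}
	_, err := mergeCapabilities(span, nil)
	fmt.Println("returned:", err)
	fmt.Println("span status:", span.status, "ended:", span.ended)
}
```

Putting the recording in a single defer means new early-return paths are covered automatically, which is the consistency the commit is after.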
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Jan 27, 2026
@therealnb therealnb requested a review from jerm-dro January 27, 2026 12:20
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Jan 27, 2026
Moving the singleflight deduplication logic to a separate PR
as it addresses a different race condition from the one fixed in #3450.

The fix prevents duplicate capability aggregation when multiple
concurrent requests arrive simultaneously at startup.
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Jan 27, 2026
@therealnb therealnb merged commit ebf0f55 into main Jan 28, 2026
35 checks passed
@therealnb therealnb deleted the optimizer-enablers branch January 28, 2026 11:23
dmjb pushed a commit that referenced this pull request Jan 28, 2026
* Infrastructure improvements and bugfixes for vMCP

- Add OpenTelemetry tracing to capability aggregation
- Add singleflight deduplication for discovery requests
- Add health checker self-check prevention
- Add HTTP client timeout fixes
- Improve E2E test reliability
- Various build and infrastructure improvements

* fix: Update CallTool and GetPrompt signatures to match BackendClient interface

- Add conversion import for meta field handling
- Update CallTool to accept meta parameter and return *vmcp.ToolCallResult
- Update GetPrompt to return *vmcp.PromptGetResult
- Add convertContent helper function

* fix: Update ReadResource signature to match BackendClient interface

- Update ReadResource to return *vmcp.ResourceReadResult instead of []byte
- Extract and include meta field from backend response
- Include MIME type in result

* fix: Pass selfURL parameter to health.NewMonitor

- Construct selfURL from Host, Port, and EndpointPath
- Prevents health checker from checking itself

* Fix NewHealthChecker calls in checker_test.go to include selfURL parameter

* Fix NewMonitor calls in monitor_test.go to include selfURL parameter

All 10 calls to NewMonitor in monitor_test.go were missing the new selfURL parameter that was added to the function signature. This was causing compilation failures in CI.

* Fix Go import formatting issues (gci linter)

Fixed import ordering in:
- pkg/vmcp/client/client.go
- pkg/vmcp/health/checker_test.go
- pkg/vmcp/health/monitor_test.go

* Fix Chart.yaml version - restore to 0.0.103

The version was incorrectly downgraded to 0.0.102. Restore it to 0.0.103 to match main branch.

* Bump Chart.yaml version to 0.0.104

The chart-testing tool requires version bumps to be higher than the base branch version (0.0.103).

* Update README.md version badge to 0.0.104

Match the Chart.yaml version update to satisfy helm-docs pre-commit hook.

* Refactor vMCP tracing and remove health checker self-check

Move telemetry provider initialization earlier in vmcp serve command to
enable distributed tracing in the aggregator. The aggregator now accepts
an explicit tracer provider parameter instead of using the global otel
tracer, following dependency injection best practices.

Improve tracing error handling by using named return values and deferred
error recording in aggregator methods, ensuring errors are properly
captured in traces.

Remove health checker self-check functionality that prevented the server
from checking its own health endpoint. This simplifies the implementation
and removes unnecessary URL normalization logic.

Changes:
- Add tracerProvider parameter to aggregator.NewDefaultAggregator
- Use noop tracer when provider is nil
- Improve span error handling with deferred recording
- Remove selfURL parameter from health.NewHealthChecker
- Delete pkg/vmcp/health/checker_selfcheck_test.go
- Update all tests to match new function signatures
- Add debug logging for auth strategy application in client

* Add explanatory comment for MCP SDK Meta limitations

Restores comment explaining why Meta field preservation is important
for ReadResource, in anticipation of future SDK improvements.

This addresses PR feedback to maintain context about the SDK's
current limitations regarding Meta field handling.

* Update test helper comments to clarify pod readiness contract

- Clarify that checkPodsReady waits for at least one pod (not all pods)
- Add context that helpers are used for single replica deployments
- Update comments on WaitForPodsReady and WaitForVirtualMCPServerReady

Addresses code review feedback from PR review.

* Complete error capture pattern in MergeCapabilities defer

- Add named return value (retErr error) to MergeCapabilities
- Add error capture in defer statement with span.RecordError and span.SetStatus
- Ensures consistent error handling pattern across all aggregator methods

This completes the implementation of the error capture pattern suggested
in code review for all methods with tracing spans.

* Remove singleflight race condition fix

Moving the singleflight deduplication logic to a separate PR
as it addresses a different race condition from the one fixed in #3450.

The fix prevents duplicate capability aggregation when multiple
concurrent requests arrive simultaneously at startup.

* Add SPDX license headers to manager.go
dmjb added a commit that referenced this pull request Jan 28, 2026
* Infrastructure improvements and bugfixes for vMCP (#3439)

* Update E2E tests to reflect new registry error behavior

This change updates E2E tests to match the new HTTP status codes and
error messages introduced in the registry API improvements.

Changes:
- Update expected status codes:
  - 502 Bad Gateway: For validation errors (invalid JSON, missing servers)
  - 504 Gateway Timeout: For connectivity errors (unreachable hosts)
- Update expected error messages:
  - "Will use built-in registry" instead of "reset to default"
- Update test for api_url validation:
  - api_url now validates reachability (returns 504 for unreachable hosts)
  - Previously it only validated URL format

Updated tests:
1. "should reset to default with empty request"
   - Expected message: "Will use built-in registry"
2. "should return 502 for invalid JSON file"
   - Expected status: 502 (was 400)
3. "should return 502 for file without servers"
   - Expected status: 502 (was 400)
4. "should return 504 for URL pointing to unreachable host"
   - Expected status: 504 (was 400)
5. "should return 504 for api_url pointing to unreachable host"
   - Expected status: 504 (was 200)
   - Updated test name and comment to reflect new behavior

These changes validate that the registry API now properly distinguishes
between validation errors (502) and connectivity errors (504), providing
better semantics and user experience.
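
As a rough illustration of the distinction drawn here, an error-to-status mapping could be written like this. The sentinel errors, `statusFor`, and `wrap` are hypothetical names, and the 500 fallback for unclassified errors is an assumption of the sketch; only the 502 and 504 mappings come from the message above.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Hypothetical sentinel errors for the two failure classes.
var (
	errValidation  = errors.New("registry validation failed")
	errUnreachable = errors.New("registry unreachable")
)

// wrap adds context while keeping the sentinel discoverable via errors.Is.
func wrap(sentinel error, detail string) error {
	return fmt.Errorf("%s: %w", detail, sentinel)
}

// statusFor maps a registry error to an HTTP status code.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, errUnreachable):
		return http.StatusGatewayTimeout // 504: connectivity error
	case errors.Is(err, errValidation):
		return http.StatusBadGateway // 502: invalid JSON, missing servers
	default:
		return http.StatusInternalServerError // assumption: not stated in the source
	}
}

func main() {
	fmt.Println(statusFor(wrap(errValidation, "file has no servers")))
	fmt.Println(statusFor(wrap(errUnreachable, "dial api_url")))
}
```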

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Nigel Brown <nigel@stacklok.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>