Infrastructure improvements and bugfixes for vMCP #3439
- Add OpenTelemetry tracing to capability aggregation
- Add singleflight deduplication for discovery requests
- Add health checker self-check prevention
- Add HTTP client timeout fixes
- Improve E2E test reliability
- Various build and infrastructure improvements
a5aa704 to 9e28406
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation
Alternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
…interface
- Add conversion import for meta field handling
- Update CallTool to accept meta parameter and return *vmcp.ToolCallResult
- Update GetPrompt to return *vmcp.PromptGetResult
- Add convertContent helper function
- Update ReadResource to return *vmcp.ResourceReadResult instead of []byte
- Extract and include meta field from backend response
- Include MIME type in result
✅ PR size has been reduced below the XL threshold. The size review has been dismissed and this PR can now proceed with normal review. Thank you for splitting this up!
- Construct selfURL from Host, Port, and EndpointPath
- Prevents health checker from checking itself
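The self-URL idea above can be sketched roughly as follows. This is an illustration only: `buildSelfURL` and `isSelf` are hypothetical names, the `http` scheme and the trailing-slash comparison are assumptions, and none of this is taken from the PR's actual code.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strconv"
	"strings"
)

// buildSelfURL is a hypothetical helper that joins host, port, and endpoint
// path into the URL the server itself would be reachable at.
func buildSelfURL(host string, port int, endpointPath string) string {
	u := url.URL{
		Scheme: "http", // assumption: plain HTTP
		Host:   net.JoinHostPort(host, strconv.Itoa(port)),
		Path:   "/" + strings.TrimPrefix(endpointPath, "/"),
	}
	return u.String()
}

// isSelf reports whether a backend URL points at the server itself,
// ignoring a trailing slash on either side.
func isSelf(backendURL, selfURL string) bool {
	return strings.TrimSuffix(backendURL, "/") == strings.TrimSuffix(selfURL, "/")
}

func main() {
	self := buildSelfURL("127.0.0.1", 4483, "mcp")
	fmt.Println(self)                                       // http://127.0.0.1:4483/mcp
	fmt.Println(isSelf("http://127.0.0.1:4483/mcp/", self)) // true
	fmt.Println(isSelf("http://10.0.0.2:8080/mcp", self))   // false
}
```

A health checker holding such a value could skip any backend for which `isSelf` returns true.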
All 10 calls to NewMonitor in monitor_test.go were missing the new selfURL parameter that was added to the function signature. This was causing compilation failures in CI.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3439      +/-   ##
==========================================
+ Coverage   64.95%   64.97%   +0.02%
==========================================
  Files         396      396
  Lines       38492    38579      +87
==========================================
+ Hits        25001    25067      +66
- Misses      11542    11556      +14
- Partials     1949     1956       +7
Fixed import ordering in:
- pkg/vmcp/client/client.go
- pkg/vmcp/health/checker_test.go
- pkg/vmcp/health/monitor_test.go
The version was incorrectly downgraded to 0.0.102. Restore it to 0.0.103 to match main branch.
- Clarify that checkPodsReady waits for at least one pod (not all pods)
- Add context that helpers are used for single replica deployments
- Update comments on WaitForPodsReady and WaitForVirtualMCPServerReady
Addresses code review feedback from PR review.
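The "at least one pod" readiness contract can be shown with a small sketch. The `pod` struct is a hypothetical, trimmed-down stand-in for a Kubernetes Pod, and `anyPodReady` is an illustrative name rather than the helper used in the E2E tests.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for the fields of a Kubernetes Pod
// that matter for a readiness check.
type pod struct {
	Name  string
	Phase string // e.g. "Pending", "Running"
	Ready bool   // assumed to mirror the PodReady condition
}

// anyPodReady returns true as soon as one pod is running and ready.
// It deliberately does not require every pod to be ready, which is
// only a sound contract for single-replica deployments.
func anyPodReady(pods []pod) bool {
	for _, p := range pods {
		if p.Phase == "Running" && p.Ready {
			return true
		}
	}
	return false
}

func main() {
	pods := []pod{
		{Name: "vmcp-0", Phase: "Pending", Ready: false},
		{Name: "vmcp-1", Phase: "Running", Ready: true},
	}
	fmt.Println(anyPodReady(pods))     // true: one ready pod is enough
	fmt.Println(anyPodReady(pods[:1])) // false: the only pod is still pending
}
```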
- Add named return value (retErr error) to MergeCapabilities
- Add error capture in defer statement with span.RecordError and span.SetStatus
- Ensures consistent error handling pattern across all aggregator methods
This completes the implementation of the error capture pattern suggested in code review for all methods with tracing spans.
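The named-return-plus-defer pattern described above looks roughly like this. To keep the sketch runnable without the OpenTelemetry SDK, `span`, `statusCode`, and `recordingSpan` are local stand-ins that only imitate the shape of a tracing span; `mergeCapabilities` and its `fail` parameter are likewise hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// statusCode and span are local stand-ins, not the OpenTelemetry types.
type statusCode int

const (
	statusUnset statusCode = iota
	statusError
)

type span interface {
	RecordError(err error)
	SetStatus(code statusCode, description string)
	End()
}

// recordingSpan remembers what was reported so the effect is observable.
type recordingSpan struct {
	errs   []error
	status statusCode
	ended  bool
}

func (s *recordingSpan) RecordError(err error)               { s.errs = append(s.errs, err) }
func (s *recordingSpan) SetStatus(code statusCode, _ string) { s.status = code }
func (s *recordingSpan) End()                                { s.ended = true }

// mergeCapabilities uses a named return value so the deferred function
// sees whatever error the body returns, on every return path.
func mergeCapabilities(s span, fail bool) (retErr error) {
	defer func() {
		if retErr != nil {
			s.RecordError(retErr)
			s.SetStatus(statusError, retErr.Error())
		}
		s.End()
	}()

	if fail {
		return errors.New("conflicting tool names")
	}
	return nil
}

func main() {
	s := &recordingSpan{}
	err := mergeCapabilities(s, true)
	fmt.Println(err, len(s.errs), s.status == statusError, s.ended)
}
```

The benefit of the named return is that early returns added later are covered automatically, since the deferred closure reads `retErr` after it has been assigned.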
Moving the singleflight deduplication logic to a separate PR as it addresses a different race condition from the one fixed in #3450. The fix prevents duplicate capability aggregation when multiple concurrent requests arrive simultaneously at startup.
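For context, the deduplication technique named here can be sketched with the standard library alone. The `group` type below is a simplified stand-in for `golang.org/x/sync/singleflight`, not the code from this PR; the `waiters` helper and the `"capabilities"` key exist only to make the demonstration deterministic.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// call tracks one in-flight execution and how many callers joined it.
type call struct {
	done chan struct{}
	val  string
	err  error
	dups int
}

// group is a simplified stand-in for singleflight.Group.
type group struct {
	mu    sync.Mutex
	calls map[string]*call
}

// Do runs fn once per key at a time; concurrent callers share the result.
func (g *group) Do(key string, fn func() (string, error)) (string, error) {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = make(map[string]*call)
	}
	if c, ok := g.calls[key]; ok {
		c.dups++
		g.mu.Unlock()
		<-c.done // wait for the leader to finish
		return c.val, c.err
	}
	c := &call{done: make(chan struct{})}
	g.calls[key] = c
	g.mu.Unlock()

	c.val, c.err = fn()

	g.mu.Lock()
	delete(g.calls, key)
	g.mu.Unlock()
	close(c.done)
	return c.val, c.err
}

// waiters reports how many callers are currently sharing the call for key.
func (g *group) waiters(key string) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	if c, ok := g.calls[key]; ok {
		return c.dups
	}
	return 0
}

// aggregateConcurrently fires n simultaneous requests and returns how
// many times the expensive aggregation function actually ran.
func aggregateConcurrently(n int) int32 {
	var (
		g       group
		runs    atomic.Int32
		wg      sync.WaitGroup
		release = make(chan struct{})
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = g.Do("capabilities", func() (string, error) {
				runs.Add(1)
				<-release // hold the call open until everyone has joined
				return "aggregated", nil
			})
		}()
	}
	for g.waiters("capabilities") < n-1 {
		runtime.Gosched()
	}
	close(release)
	wg.Wait()
	return runs.Load()
}

func main() {
	fmt.Println(aggregateConcurrently(5)) // 1
}
```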
* Infrastructure improvements and bugfixes for vMCP
  - Add OpenTelemetry tracing to capability aggregation
  - Add singleflight deduplication for discovery requests
  - Add health checker self-check prevention
  - Add HTTP client timeout fixes
  - Improve E2E test reliability
  - Various build and infrastructure improvements
* fix: Update CallTool and GetPrompt signatures to match BackendClient interface
  - Add conversion import for meta field handling
  - Update CallTool to accept meta parameter and return *vmcp.ToolCallResult
  - Update GetPrompt to return *vmcp.PromptGetResult
  - Add convertContent helper function
* fix: Update ReadResource signature to match BackendClient interface
  - Update ReadResource to return *vmcp.ResourceReadResult instead of []byte
  - Extract and include meta field from backend response
  - Include MIME type in result
* fix: Pass selfURL parameter to health.NewMonitor
  - Construct selfURL from Host, Port, and EndpointPath
  - Prevents health checker from checking itself
* Fix NewHealthChecker calls in checker_test.go to include selfURL parameter
* Fix NewMonitor calls in monitor_test.go to include selfURL parameter
  All 10 calls to NewMonitor in monitor_test.go were missing the new selfURL parameter that was added to the function signature. This was causing compilation failures in CI.
* Fix Go import formatting issues (gci linter)
  Fixed import ordering in:
  - pkg/vmcp/client/client.go
  - pkg/vmcp/health/checker_test.go
  - pkg/vmcp/health/monitor_test.go
* Fix Chart.yaml version - restore to 0.0.103
  The version was incorrectly downgraded to 0.0.102. Restore it to 0.0.103 to match main branch.
* Bump Chart.yaml version to 0.0.104
  The chart-testing tool requires version bumps to be higher than the base branch version (0.0.103).
* Update README.md version badge to 0.0.104
  Match the Chart.yaml version update to satisfy helm-docs pre-commit hook.
* Refactor vMCP tracing and remove health checker self-check
  Move telemetry provider initialization earlier in vmcp serve command to enable distributed tracing in the aggregator. The aggregator now accepts an explicit tracer provider parameter instead of using the global otel tracer, following dependency injection best practices.
  Improve tracing error handling by using named return values and deferred error recording in aggregator methods, ensuring errors are properly captured in traces.
  Remove health checker self-check functionality that prevented the server from checking its own health endpoint. This simplifies the implementation and removes unnecessary URL normalization logic.
  Changes:
  - Add tracerProvider parameter to aggregator.NewDefaultAggregator
  - Use noop tracer when provider is nil
  - Improve span error handling with deferred recording
  - Remove selfURL parameter from health.NewHealthChecker
  - Delete pkg/vmcp/health/checker_selfcheck_test.go
  - Update all tests to match new function signatures
  - Add debug logging for auth strategy application in client
* Add explanatory comment for MCP SDK Meta limitations
  Restores comment explaining why Meta field preservation is important for ReadResource, in anticipation of future SDK improvements. This addresses PR feedback to maintain context about the SDK's current limitations regarding Meta field handling.
* Update test helper comments to clarify pod readiness contract
  - Clarify that checkPodsReady waits for at least one pod (not all pods)
  - Add context that helpers are used for single replica deployments
  - Update comments on WaitForPodsReady and WaitForVirtualMCPServerReady
  Addresses code review feedback from PR review.
* Complete error capture pattern in MergeCapabilities defer
  - Add named return value (retErr error) to MergeCapabilities
  - Add error capture in defer statement with span.RecordError and span.SetStatus
  - Ensures consistent error handling pattern across all aggregator methods
  This completes the implementation of the error capture pattern suggested in code review for all methods with tracing spans.
* Remove singleflight race condition fix
  Moving the singleflight deduplication logic to a separate PR as it addresses a different race condition from the one fixed in #3450. The fix prevents duplicate capability aggregation when multiple concurrent requests arrive simultaneously at startup.
* Add SPDX license headers to manager.go
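The "explicit tracer provider, noop when nil" change can be illustrated with a dependency-injection sketch. Every type here (`tracer`, `tracerProvider`, `noopProvider`, `namingProvider`, `aggregator`) is a local stand-in invented for the example; the real aggregator.NewDefaultAggregator signature is not reproduced.

```go
package main

import "fmt"

// tracer and tracerProvider are stand-ins for the tracing interfaces.
// Start returns the name of the span it opened, or "" when it does nothing.
type tracer interface {
	Start(name string) string
}

type tracerProvider interface {
	Tracer(name string) tracer
}

// noopProvider hands out tracers that record nothing.
type noopTracer struct{}

func (noopTracer) Start(string) string { return "" }

type noopProvider struct{}

func (noopProvider) Tracer(string) tracer { return noopTracer{} }

// namingProvider is a toy "real" provider used to show the injected path.
type namingTracer struct{ scope string }

func (t namingTracer) Start(name string) string { return t.scope + "." + name }

type namingProvider struct{}

func (namingProvider) Tracer(name string) tracer { return namingTracer{scope: name} }

// aggregator receives its tracer through the constructor instead of
// reaching for a process-wide global.
type aggregator struct {
	tracer tracer
}

// newAggregator falls back to a noop provider when none is supplied,
// so callers and tests can pass nil safely.
func newAggregator(tp tracerProvider) *aggregator {
	if tp == nil {
		tp = noopProvider{}
	}
	return &aggregator{tracer: tp.Tracer("aggregator")}
}

func main() {
	fmt.Printf("%q\n", newAggregator(nil).tracer.Start("MergeCapabilities"))
	fmt.Printf("%q\n", newAggregator(namingProvider{}).tracer.Start("MergeCapabilities"))
}
```

Injecting the provider makes tests independent of global tracing state, which is presumably part of what the commit message means by dependency injection best practices.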
* Infrastructure improvements and bugfixes for vMCP (#3439)
* Update E2E tests to reflect new registry error behavior
  This change updates E2E tests to match the new HTTP status codes and error messages introduced in the registry API improvements.
  Changes:
  - Update expected status codes:
    - 502 Bad Gateway: For validation errors (invalid JSON, missing servers)
    - 504 Gateway Timeout: For connectivity errors (unreachable hosts)
  - Update expected error messages:
    - "Will use built-in registry" instead of "reset to default"
  - Update test for api_url validation:
    - api_url now validates reachability (returns 504 for unreachable hosts)
    - Previously it only validated URL format
  Updated tests:
  1. "should reset to default with empty request" - Expected message: "Will use built-in registry"
  2. "should return 502 for invalid JSON file" - Expected status: 502 (was 400)
  3. "should return 502 for file without servers" - Expected status: 502 (was 400)
  4. "should return 504 for URL pointing to unreachable host" - Expected status: 504 (was 400)
  5. "should return 504 for api_url pointing to unreachable host" - Expected status: 504 (was 200) - Updated test name and comment to reflect new behavior
  These changes validate that the registry API now properly distinguishes between validation errors (502) and connectivity errors (504), providing better semantics and user experience.
  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
  ---------
  Co-authored-by: Nigel Brown <nigel@stacklok.com>
  Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
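The 502-versus-504 distinction described in the commit message could be expressed as an error-to-status mapping like the one below. The sentinel errors and `statusFor` are hypothetical names chosen for the sketch; the registry API's real error types are not shown in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Hypothetical sentinel errors for the two failure classes.
var (
	errValidation   = errors.New("registry validation failed")
	errConnectivity = errors.New("registry unreachable")
)

// statusFor maps a registry error to the HTTP status the tests expect:
// 502 for invalid upstream content, 504 for an unreachable upstream.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, errConnectivity):
		return http.StatusGatewayTimeout
	case errors.Is(err, errValidation):
		return http.StatusBadGateway
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	invalidJSON := fmt.Errorf("parse registry file: %w", errValidation)
	unreachable := fmt.Errorf("dial api_url: %w", errConnectivity)

	fmt.Println(statusFor(invalidJSON)) // 502
	fmt.Println(statusFor(unreachable)) // 504
	fmt.Println(statusFor(nil))         // 200
}
```

Using `errors.Is` keeps the mapping stable even when callers wrap the sentinel with extra context.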
Infrastructure Improvements and Bugfixes for vMCP
This PR contains infrastructure improvements and bugfixes that enable better observability, reliability, and performance for vMCP. These changes are independent of the optimizer implementation and can be merged separately.
Changes
Observability & Tracing
OpenTelemetry Tracing in Capability Aggregation (pkg/vmcp/aggregator/default_aggregator.go)
- Adds tracing spans to AggregateCapabilities, QueryAllCapabilities, QueryCapabilities, ResolveConflicts, and MergeCapabilities
Performance Improvements
Singleflight Deduplication for Discovery (pkg/vmcp/discovery/manager.go)
Health Check Improvements
Self-Check Prevention (pkg/vmcp/health/checker.go)
- Adds a selfURL field to prevent the server from health-checking itself
Health Check Context Marker (pkg/vmcp/health/checker.go)
- Adds WithHealthCheckMarker to bypass authentication logging for health checks
Client Reliability Fixes
HTTP Client Timeout (pkg/vmcp/client/client.go)
- Drops WithContinuousListening() in favor of a timeout-based approach
Test Reliability Improvements
E2E Test Pod Readiness Checks (test/e2e/thv-operator/virtualmcp/helpers.go)
E2E Test HTTP Client Timeout (test/e2e/thv-operator/virtualmcp/virtualmcp_auth_discovery_test.go)
Build & Infrastructure
Gitignore Updates (.gitignore)
- Adds cmd/vmcp/__debug_bin* patterns for debug binaries
Linter Configuration (.golangci.yml)
- Excludes the scripts$ directory from linting
Code Coverage Configuration (codecov.yaml)
- Excludes files matching *_test_coverage.go from coverage reports
Chart Version Bump (deploy/charts/operator-crds/Chart.yaml)
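The health check context marker listed above is a common Go pattern: tag the request context so downstream code can recognise health probes. The sketch below uses lower-case, hypothetical names (`withHealthCheckMarker`, `isHealthCheck`, `demoMarker`) and does not claim to match the real WithHealthCheckMarker signature.

```go
package main

import (
	"context"
	"fmt"
)

// healthCheckKey is an unexported key type, so no other package can
// collide with or forge the marker.
type healthCheckKey struct{}

// withHealthCheckMarker returns a context tagged as a health check.
func withHealthCheckMarker(ctx context.Context) context.Context {
	return context.WithValue(ctx, healthCheckKey{}, true)
}

// isHealthCheck reports whether the context carries the marker.
// Middleware could use it to skip authentication logging for probes.
func isHealthCheck(ctx context.Context) bool {
	marked, _ := ctx.Value(healthCheckKey{}).(bool)
	return marked
}

// demoMarker checks an unmarked and a marked context.
func demoMarker() (plain, marked bool) {
	base := context.Background()
	return isHealthCheck(base), isHealthCheck(withHealthCheckMarker(base))
}

func main() {
	plain, marked := demoMarker()
	fmt.Println(plain, marked) // false true
}
```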
All existing tests pass. The changes improve test reliability and don't break any existing functionality.
Related
This PR is split from #3373. The optimizer implementation will follow in a separate PR that depends on these infrastructure improvements.
Large PR Justification