Add MCPAuthzConfig CRD for backend-agnostic authorization#4777
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Codecov Report: ❌ patch coverage flagged. Additional details and impacted files:
@@ Coverage Diff @@
## main #4777 +/- ##
==========================================
- Coverage 68.90% 68.86% -0.05%
==========================================
Files 517 519 +2
Lines 54803 55024 +221
==========================================
+ Hits 37764 37894 +130
- Misses 14129 14197 +68
- Partials 2910 2933 +23

☔ View full report in Codecov by Sentry.
          - message: inline must be set when type is 'inline', and must not
              be set otherwise
            rule: 'self.type == ''inline'' ? has(self.inline) : !has(self.inline)'
        authzConfigRef:
          description: |-
            AuthzConfigRef references a shared MCPAuthzConfig resource for authorization.
            The referenced MCPAuthzConfig must exist in the same namespace as this MCPServer.
            Mutually exclusive with authzConfig.
          properties:
            name:
              description: Name is the name of the MCPAuthzConfig resource in
                the same namespace.
              minLength: 1
              type: string
          required:
          - name
          type: object
        backendReplicas:
          description: |-
            BackendReplicas is the desired number of MCP server backend pod replicas.
🔴 The new authzConfig/authzConfigRef fields are documented as mutually exclusive across MCPServer, MCPRemoteProxy, and VirtualMCPServer's IncomingAuthConfig, but no CEL x-kubernetes-validations rule enforces this at the API level. A user can submit a resource with both fields set and the Kubernetes API server will accept it without error. The fix is to add +kubebuilder:validation:XValidation markers (and the corresponding x-kubernetes-validations entries in the generated CRD YAMLs) mirroring the existing oidcConfig/oidcConfigRef precedent for all three affected types.
Extended reasoning...
What the bug is and how it manifests
The PR introduces authzConfigRef fields to MCPServerSpec, MCPRemoteProxySpec, and IncomingAuthConfig (VirtualMCPServer), explicitly documenting them as mutually exclusive with the existing authzConfig field. However, unlike the analogous oidcConfig/oidcConfigRef pair, no CEL validation rule is added to enforce this constraint at admission time. The Kubernetes API server therefore accepts resources with both fields populated simultaneously.
The specific code path that triggers it
In mcpremoteproxy_types.go, the struct-level kubebuilder marker reads:
// +kubebuilder:validation:XValidation:rule="!(has(self.oidcConfig) && has(self.oidcConfigRef))",message="oidcConfig and oidcConfigRef are mutually exclusive..."
type MCPRemoteProxySpec struct {
No equivalent marker exists for authzConfig/authzConfigRef in MCPRemoteProxySpec, MCPServerSpec, or IncomingAuthConfig. The generated CRD YAML (e.g. toolhive.stacklok.dev_mcpremoteproxies.yaml) reflects this: the spec-level x-kubernetes-validations block only contains the OIDC rule, not an authz rule.
Why existing code doesn't prevent it
The Go field comments say "Mutually exclusive with authzConfig" but comments are not enforced. The kubebuilder XValidation marker is what generates the CEL rule into the CRD schema. That marker was added for oidcConfig/oidcConfigRef but was omitted for the new authzConfig/authzConfigRef pair in all three types.
Impact
Any user who sets both authzConfig and authzConfigRef will get a resource that is silently accepted by the API server. When Part 2 wires these fields into the workload controllers, the controller will face ambiguity: it will need to pick one field, likely silently ignoring the other, leading to unexpected authorization behavior that is hard to diagnose.
How to fix it
Add the following marker to MCPServerSpec, MCPRemoteProxySpec, and IncomingAuthConfig (adjacent to the existing oidcConfig/oidcConfigRef marker):
// +kubebuilder:validation:XValidation:rule="!(has(self.authzConfig) && has(self.authzConfigRef))",message="authzConfig and authzConfigRef are mutually exclusive"

Then regenerate the CRD manifests so that the three affected YAML files each gain a corresponding x-kubernetes-validations entry at the spec level.
Step-by-step proof
- Create an MCPServer manifest with both fields:

  apiVersion: toolhive.stacklok.dev/v1alpha1
  kind: MCPServer
  metadata:
    name: test
  spec:
    image: example/mcp:latest
    authzConfig:
      type: inline
      inline:
        policies: ["permit(principal, action, resource);"]
    authzConfigRef:
      name: my-authzconfig

- Apply it to a cluster with this PR's CRDs installed.
- Observe: kubectl apply succeeds with no validation error.
- Compare with the equivalent test using oidcConfig + oidcConfigRef: the API server rejects it with "oidcConfig and oidcConfigRef are mutually exclusive" because the CEL rule exists for OIDC but not for authz.
if hashChanged {
    logger.Info("MCPAuthzConfig configuration changed",
        "oldHash", authzConfig.Status.ConfigHash,
        "newHash", configHash)

    authzConfig.Status.ConfigHash = configHash
    authzConfig.Status.ObservedGeneration = authzConfig.Generation

    if err := r.Status().Update(ctx, authzConfig); err != nil {
        logger.Error(err, "Failed to update MCPAuthzConfig status")
        return ctrl.Result{}, err
    }
    return ctrl.Result{}, nil
}

// Refresh ReferencingWorkloads list
referencingWorkloads, err := r.findReferencingWorkloads(ctx, authzConfig)
if err != nil {
    logger.Error(err, "Failed to find referencing workloads")
} else if !ctrlutil.WorkloadRefsEqual(authzConfig.Status.ReferencingWorkloads, referencingWorkloads) {
    authzConfig.Status.ReferencingWorkloads = referencingWorkloads
    conditionChanged = true
}
🔴 The new MCPAuthzConfig controller has two related issues that leave ReferencingWorkloads status stale. First, when findReferencingWorkloads returns an error in the normal reconcile path (lines 134-141), the error is only logged and execution continues without requeueing — unlike handleDeletion which correctly returns the error for automatic retry. Second, when hashChanged==true, the controller returns early at line 131 before ever reaching the findReferencingWorkloads call at line 135, so workload reference changes concurrent with spec changes are silently dropped for that reconcile cycle. The fix is to refresh workload refs before the hash-changed early return, and to return the error (with requeue) when findReferencingWorkloads fails, consistent with the deletion path.
Extended reasoning...
Bug 1 — Swallowed error in normal reconcile path (lines 134-141)
In the normal reconcile path, when findReferencingWorkloads returns an error, the code at lines 134-141 logs the error and falls through without returning it:
referencingWorkloads, err := r.findReferencingWorkloads(ctx, authzConfig)
if err != nil {
logger.Error(err, "Failed to find referencing workloads")
} else if ...

The deletion path (handleDeletion) correctly propagates the error, which causes controller-runtime to schedule a retry with exponential backoff:
referencingWorkloads, err := r.findReferencingWorkloads(ctx, authzConfig)
if err != nil {
logger.Error(err, "Failed to check referencing workloads during deletion")
return ctrl.Result{}, err // requeues
}

On a transient Kubernetes API error (e.g., brief etcd unavailability), the normal reconcile path silently leaves ReferencingWorkloads stale with no automatic retry. The status is only corrected when an unrelated event (e.g., a workload change) happens to re-trigger reconciliation. This asymmetry violates controller-runtime best practices and is identical to the pattern in toolconfig_controller.go, meaning the PR copies a known sub-optimal approach rather than improving on it.
Bug 2 — Hash-changed early return skips ReferencingWorkloads refresh (lines 119-132)
When hashChanged==true, the reconciler updates ConfigHash and ObservedGeneration, calls r.Status().Update(), and returns at line 131:
if hashChanged {
authzConfig.Status.ConfigHash = configHash
authzConfig.Status.ObservedGeneration = authzConfig.Generation
if err := r.Status().Update(ctx, authzConfig); err != nil { ... }
return ctrl.Result{}, nil // <-- findReferencingWorkloads is never called
}
// This is unreachable when hashChanged==true:
referencingWorkloads, err := r.findReferencingWorkloads(ctx, authzConfig)

Concrete proof for Bug 2: Consider MCPAuthzConfig my-authz with no referencing workloads. A user simultaneously (a) modifies the spec, and (b) adds authzConfigRef: {name: my-authz} to an MCPServer. Controller-runtime coalesces both watch events into a single reconcile. Inside that reconcile: hashChanged==true → controller updates ConfigHash and returns. ReferencingWorkloads is never refreshed. The watch handler for MCPServer fired to trigger this reconcile, so no second reconcile is enqueued. The status shows referencingWorkloads: [] even though the MCPServer now references the config, until either another spec change or another workload event occurs.
Addressing the refutation for Bug 2: One verifier argues that the watch-based reconciliation mechanism compensates for the early return, because a workload change fires a watch event that enqueues a new reconcile. This is true in the separate event case. However, it fails in the coalesced event case described above: if controller-runtime batches the spec-change event and the workload-change event into a single reconcile call, the controller sees hashChanged==true, returns early, and consumes the single reconcile that would have refreshed the refs. No additional reconcile is triggered because the watch event for the workload has already been processed. This is a real (if narrow) window of inconsistency.
Impact and fix: The actual deletion protection in handleDeletion queries live state and is unaffected. The impact is stale ReferencingWorkloads status visible to users and operators. The fix is to call findReferencingWorkloads before the hash-change early return and include its result in the combined status update (as the MCPTelemetryConfig controller already does correctly), and to return the error rather than swallowing it when the lookup fails.
ChrisJBurns left a comment:
Multi-Agent Consensus Review
Agents consulted: kubernetes-expert, go-security-reviewer, code-reviewer, toolhive-expert
Consensus Summary
| # | Finding | Consensus | Severity | Action |
|---|---|---|---|---|
| 1 | Missing CEL XValidation for authzConfig/authzConfigRef mutual exclusivity | 10/10 | HIGH | Fix |
| 2 | Missing RBAC markers for watched workload resources | 10/10 | MEDIUM | Fix |
| 3 | mustMarshalJSON uses panic in controller | 10/10 | MEDIUM | Fix |
| 4 | No nil-check on spec.Config.Raw before use | 8/10 | MEDIUM | Fix |
| 5 | Test coverage gaps in watch handlers and deletion paths | 7/10 | MEDIUM | Fix |
| 6 | No Validate() method on MCPAuthzConfig type (pattern deviation) | 7/10 | LOW | Discuss |
| 7 | ReferencingWorkloads listMapKey=name insufficient for uniqueness | 7/10 | LOW | Discuss (pre-existing) |
| 8 | Controller blank-imports authorizer backend packages | 7/10 | LOW | Discuss |
Overall
This is a well-structured PR that closely follows the established MCPOIDCConfig controller pattern. The CRD design is clean, the ConfigKey() interface addition is minimal, and the findStaleRefs helper is a nice DRY improvement over the inline stale-ref scanning in existing controllers. The part 1/part 2 split is a sensible approach for incremental delivery.
The main blocker is the missing CEL XValidation rules (Finding #1). The existing OIDC pattern enforces mutual exclusivity between oidcConfig and oidcConfigRef via CEL rules, but no equivalent rules exist for authzConfig/authzConfigRef. Without these, the API server silently accepts resources with both fields set, contradicting the documented contract. The remaining findings are medium-severity improvements: RBAC marker completeness, replacing panic with error return, and adding a nil-check on Config.Raw.
Generated with Claude Code
// The referenced MCPAuthzConfig must exist in the same namespace as this MCPServer.
// Mutually exclusive with authzConfig.
// +optional
AuthzConfigRef *MCPAuthzConfigReference `json:"authzConfigRef,omitempty"`
[HIGH] Missing CEL XValidation for authzConfig/authzConfigRef mutual exclusivity (Consensus: 10/10)
The AuthzConfig and AuthzConfigRef fields are documented as "mutually exclusive" but no CEL validation rule enforces this at the API server level. The existing OIDC pattern has:
// +kubebuilder:validation:XValidation:rule="!(has(self.oidcConfig) && has(self.oidcConfigRef))",message="oidcConfig and oidcConfigRef are mutually exclusive..."

The same rule is needed here on MCPServerSpec, MCPRemoteProxySpec, and IncomingAuthConfig:
// +kubebuilder:validation:XValidation:rule="!(has(self.authzConfig) && has(self.authzConfigRef))",message="authzConfig and authzConfigRef are mutually exclusive; use authzConfigRef to reference a shared MCPAuthzConfig"

Without this, users can set both fields simultaneously and the controller must decide which takes precedence.
Raised by: kubernetes-expert, code-reviewer, toolhive-expert
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/finalizers,verbs=update
[MEDIUM] Missing RBAC markers for watched workload resources (Consensus: 10/10)
The controller watches MCPServer, VirtualMCPServer, and MCPRemoteProxy (via SetupWithManager) and lists them (in findReferencingWorkloads), but no +kubebuilder:rbac markers declare these permissions. The MCPOIDCConfig controller declares them:
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/finalizers,verbs=update
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpservers,verbs=get;list;watch
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=virtualmcpservers,verbs=get;list;watch
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpremoteproxies,verbs=get;list;watch
While these permissions are currently aggregated from other controllers, they should be self-declared for completeness and to prevent breakage if another controller is removed.
Raised by: kubernetes-expert, code-reviewer, toolhive-expert
func mustMarshalJSON(v string) json.RawMessage {
    b, err := json.Marshal(v)
    if err != nil {
        panic(fmt.Sprintf("failed to marshal %q: %v", v, err))
[MEDIUM] mustMarshalJSON uses panic in controller reconciliation path (Consensus: 10/10)
While json.Marshal on a string cannot realistically fail, panicking in a controller crashes the entire operator. The Go style rules reserve panics for top-level boundaries only. Since BuildFullAuthzConfigJSON already returns ([]byte, error), this should propagate errors instead.
Consider replacing this function with one that returns (json.RawMessage, error) and updating BuildFullAuthzConfigJSON to handle the error.
Raised by: kubernetes-expert, go-security-reviewer, code-reviewer, toolhive-expert
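A minimal sketch of the suggested replacement, assuming callers are updated to handle the error; the name `marshalJSON` is the reviewer's suggestion, not existing code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// marshalJSON returns an error instead of panicking, so
// BuildFullAuthzConfigJSON can propagate it up the reconcile path.
func marshalJSON(v string) (json.RawMessage, error) {
	b, err := json.Marshal(v)
	if err != nil {
		return nil, fmt.Errorf("failed to marshal %q: %w", v, err)
	}
	return json.RawMessage(b), nil
}

func main() {
	raw, err := marshalJSON("cedarv1")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw)) // "cedarv1"
}
```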
fullConfig := map[string]json.RawMessage{
    "version": mustMarshalJSON(authzConfigVersion),
    "type":    mustMarshalJSON(spec.Type),
    configKey: spec.Config.Raw,
[MEDIUM] No nil-check on spec.Config.Raw before use (Consensus: 8/10)
If spec.Config.Raw is nil (bypassed validation, manual testing, or future code changes), this produces "cedar": null in the JSON output, leading to confusing downstream errors. Add a defensive check before use:
if len(spec.Config.Raw) == 0 {
return nil, fmt.Errorf("config field is empty")
}

The config field is required in the CRD schema, but defense-in-depth is warranted for RawExtension fields.
Raised by: code-reviewer, kubernetes-expert
@@ -0,0 +1,616 @@
// SPDX-FileCopyrightText: Copyright 2025 Stacklok, Inc.
[MEDIUM] Test coverage gaps in watch handlers and deletion paths (Consensus: 7/10)
The following methods have zero test coverage (Codecov reports 43.5% patch coverage / 122 uncovered lines):
- findReferencingWorkloads (covers MCPServer, VirtualMCPServer, MCPRemoteProxy listing)
- findStaleRefs (stale reference cleanup)
- All three map*ToAuthzConfig watch handlers
- SetupWithManager

Additionally, the handleDeletion test only covers the MCPServer reference path; VirtualMCPServer and MCPRemoteProxy deletion-blocking cases are missing.
At minimum, consider adding tests for findReferencingWorkloads covering all three workload types, and extending the deletion test with VirtualMCPServer and MCPRemoteProxy cases.
Raised by: code-reviewer
// +optional
ReferencingWorkloads []WorkloadReference `json:"referencingWorkloads,omitempty"`
}
[LOW] ReferencingWorkloads listMapKey=name may be insufficient (Consensus: 7/10)
Using only name as the list map key could collide if an MCPServer and a VirtualMCPServer share the same name in the same namespace. The composite key should be kind + name:
// +listMapKey=kind
// +listMapKey=name

This is a pre-existing pattern copied from MCPOIDCConfig, MCPTelemetryConfig, etc.; consider fixing it across all config CRDs in a separate PR.
Raised by: kubernetes-expert
ctrlutil "github.com/stacklok/toolhive/cmd/thv-operator/pkg/controllerutil"
"github.com/stacklok/toolhive/pkg/authz/authorizers"
// Import authorizer backends so they register with the factory registry.
_ "github.com/stacklok/toolhive/pkg/authz/authorizers/cedar"
[LOW] Controller directly imports authorizer backend packages (Consensus: 7/10)
The blank imports of cedar and http authorizer packages create a direct compile-time dependency from the controller to specific backends. No other controller in this package has this pattern. Consider moving these blank imports to cmd/thv-operator/main.go (where scheme registrations already happen), keeping the controller package backend-agnostic.
Raised by: toolhive-expert, kubernetes-expert
// Type identifies the authorizer backend (e.g., "cedarv1", "httpv1").
// Must match a registered authorizer type in the factory registry.
// +kubebuilder:validation:Required
// +kubebuilder:validation:MinLength=1
[LOW] No Validate() method on MCPAuthzConfig type (Consensus: 7/10)
Other shared config CRDs (MCPOIDCConfig, MCPTelemetryConfig, MCPExternalAuthConfig) define a Validate() method on the type itself, called from the controller as config.Validate(). This CRD uses a standalone validateAuthzConfigSpec() function in the controller package instead.
The deviation is understandable (validation requires the factory registry, an external dependency), but a comment in the controller explaining why would help maintainability.
Raised by: toolhive-expert, code-reviewer
The existing AuthzConfigRef inline variant is Cedar-specific, with fields that only make sense for the Cedar authorizer. This adds a new MCPAuthzConfig CRD that accepts any registered authorizer backend configuration via a type discriminator and an opaque config blob. The MCPAuthzConfig CRD follows the established shared-config pattern (MCPOIDCConfig, MCPExternalAuthConfig) with status conditions, hash-based change detection, referencing-workload tracking, and deletion protection. A new ConfigKey() method on the AuthorizerFactory interface enables the controller to reconstruct the full runtime config JSON from the CRD spec. The AuthzConfigRef field is added to MCPServer, MCPRemoteProxy, and VirtualMCPServer as a new option alongside the existing authzConfig field, enabling incremental migration with no breaking changes.

Refs: #3157
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add CEL XValidation rules for authzConfig/authzConfigRef mutual exclusivity on MCPServerSpec, MCPRemoteProxySpec, and IncomingAuthConfig
- Add RBAC markers for watched workload resources (mcpservers, virtualmcpservers, mcpremoteproxies)
- Replace mustMarshalJSON panic with error-returning marshalJSON
- Add nil-check on spec.Config.Raw before use in BuildFullAuthzConfigJSON
- Fix reconcile logic: call findReferencingWorkloads before hash-changed check and return errors instead of swallowing them (matches MCPTelemetryConfig pattern)
- Move authorizer backend blank imports from controller to main.go
- Add comment explaining why validateAuthzConfigSpec is standalone
- Add listMapKey=kind to ReferencingWorkloads for composite uniqueness
- Add tests for findReferencingWorkloads covering all workload types, deletion blocking by VirtualMCPServer and MCPRemoteProxy, and empty Config.Raw validation
- Regenerate CRD manifests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed 734e8db to d571e89.
Summary
- The existing inline authorization configuration (authzConfig) is Cedar-specific and doesn't support alternative authorization backends like HTTP PDP. This adds a new MCPAuthzConfig CRD that decouples authorization configuration from the workload CRDs, enabling any registered authorizer backend to be used.
- Adds ConfigKey() string to the AuthorizerFactory interface so the MCPAuthzConfig controller can reconstruct full config JSON without hardcoding backend-specific keys.
- Adds the MCPAuthzConfig CRD with type + config (RawExtension) spec, validation via factory registry, hash-based change detection, referencing workload tracking, and deletion protection.
- Adds an AuthzConfigRef field to MCPServer, MCPRemoteProxy, and VirtualMCPServer specs for referencing shared MCPAuthzConfig resources.

Relates to #3157
Type of change
Test plan
- Unit tests (task test)
- Lint (task lint-fix)

Changes
- pkg/authz/authorizers/registry.go: add ConfigKey() to the AuthorizerFactory interface
- pkg/authz/authorizers/cedar/core.go: ConfigKey() returning "cedar"
- pkg/authz/authorizers/http/core.go: ConfigKey() returning "http"
- cmd/thv-operator/api/v1alpha1/mcpauthzconfig_types.go: MCPAuthzConfig, MCPAuthzConfigSpec, MCPAuthzConfigStatus, MCPAuthzConfigReference
- cmd/thv-operator/api/v1alpha1/mcpserver_types.go: AuthzConfigRef field on MCPServerSpec
- cmd/thv-operator/api/v1alpha1/mcpremoteproxy_types.go: AuthzConfigRef field on MCPRemoteProxySpec
- cmd/thv-operator/api/v1alpha1/virtualmcpserver_types.go: AuthzConfigRef field on IncomingAuthConfig
- cmd/thv-operator/controllers/mcpauthzconfig_controller.go
- cmd/thv-operator/controllers/mcpauthzconfig_controller_test.go
- cmd/thv-operator/main.go
- deploy/charts/operator-crds/crd-helm-wrapper/main.go: mcpauthzconfigs feature flag

Special notes for reviewers
- BuildFullAuthzConfigJSON() reconstructs the full config envelope ({"version", "type", "<configKey>": ...}) using the factory's ConfigKey(), which avoids hardcoding Cedar-specific structure.
- The controller tracks ReferencingWorkloads status and enforces deletion protection.