feat: route workloads to city locations via distributed scheduling (foundation) #107
Open
scotwells wants to merge 25 commits into
0c0d8df to
134086f
Compare
6e9a268 to
492eb6c
Compare
mattdjenkinson
approved these changes
May 22, 2026
mattdjenkinson
previously approved these changes
May 27, 2026
privateip
previously approved these changes
May 28, 2026
Contributor
Author
Setting to draft while I continue to iterate on getting this working in staging.
This was referenced May 29, 2026
The base branch was changed.
This was referenced Jun 4, 2026
82955e2 to
bf73355
Compare
Contributor
Author
|
📦 The federation e2e chainsaw suites (~900 lines of test YAML) have been split out into a dedicated PR so this foundation reviews without them inline. The shared |
a67b32c to
b45810f
Compare
Bump the toolchain to Go 1.25 and golangci-lint v2.12.2, introduce a Taskfile for the standard build/test/lint targets, and align the CI workflows and Makefile with the new versions. Remove stale RFC and enhancement docs that the federated-scheduling work supersedes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Delete the central scheduler that placed WorkloadDeployments from a single control plane. Placement now happens through the distributed federator and per-cell controllers introduced in the following commits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Introduce the federator that fans a WorkloadDeployment out to the cells selected for its placement, replacing the central scheduler. Add the city-code field indexer it uses to map subnet/location events back to the deployments that depend on them. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the projector that mirrors cell-side Instances back to the management plane, writing their status (readiness, placement, blocking reasons) onto the project-scoped Instance so callers see a single view across cells. Include the shared controller test helpers that build the project/Karmada fake clients and multi-cluster manager used by the federation tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…liation Rework the WorkloadDeployment and Workload controllers to run per cell, resolving networks and Locations locally and driving Instance lifecycle through the stateful instance-control logic rather than a central scheduler. Update the instance-control packages to manage Instances within a cell's control plane. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Update the Instance controller to compute the Ready condition and apply the per-project quota gate within a single reconcile pass, surfacing blocking reasons when quota is unavailable so federated placement reflects real allocatable capacity. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire the manager to run in either cell or management-plane mode, gating the federator, projector, and per-cell controllers behind feature flags. Add the feature-gate registry and extend configuration to carry the downstream kubeconfig and discovery settings each mode needs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Update Workload webhook and Instance validation so the API accepts the fields federated scheduling adds and continues to reject invalid placement and runtime specs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Regenerate the Instance, Workload, and WorkloadDeployment CRDs for the new API fields and add the kustomize structure that deploys the manager in cell or management-plane mode: federation and downstream RBAC bases, cell/management/quota-credentials components, the WorkloadDeployment status interpreter, and the matching overlays. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The WorkloadDeploymentFederator mirrors the downstream Karmada WorkloadDeployment status onto the project (VCP) WorkloadDeployment, but SetupWithManager only watched the project WD via For(). Nothing watched the downstream WD whose status it mirrors, so when Karmada aggregated new status onto the downstream object the federator was not notified — it only caught up on the next informer resync (~10h default) or an incidental project-WD spec write. This is why a freshly created workload's replica counts stayed empty on the VCP long after its projected Instance had already appeared (the InstanceProjector holds the analogous downstream watch and so propagates immediately). Add a downstream watch using the same cross-plane mechanism the InstanceProjector and unikraft-provider use (milosource cluster source + TypedEnqueueRequestsFromMapFunc). The map function correlates a downstream WD event back to its project WD reconcile request: name is stable across planes, namespace comes from the UpstreamOwnerNamespace label the federator stamps, and the project cluster name is recovered by decoding the UpstreamOwnerClusterName label on the downstream namespace (the exact inverse of the encoding applied in ensureDownstreamNamespace). The federation manager already constructed for the InstanceProjector is reused as the watchable source, so there is no additional manager or informer-cache cost beyond the new WD and Namespace informers. Karmada's own status-aggregation interval (edge cell → downstream WD) remains outside this repo; once Karmada writes the aggregated status, the new watch reacts immediately. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The downstream WorkloadDeployment status watch mapped events to a reconcile request whose ClusterName was the full decoded org/project path (decodeUpstreamClusterName turned the "cluster-<org>_<project>" namespace label into "<org>/<project>"). But the Milo multicluster provider keys project clusters by bare project name only. As a result every project except the org-less "datum-cloud" failed to resolve: mcmanager routed the unmatched name (ultimately the empty string) to the local host cluster, which has no compute CRDs, so Reconcile failed with "no matches for kind WorkloadDeployment" in a hot loop (~2 errors/sec observed on staging). Extract the bare project name (final path segment) so it matches the provider key, and guard the mapping with GetCluster: if the project cluster isn't engaged yet, drop the event instead of enqueuing a request that falls back to the host cluster and errors. Dropping is safe — once the provider engages the cluster, the For watch reconciles it and the next downstream status event maps cleanly. Rename decodeUpstreamClusterName to projectClusterNameFromLabel to reflect that it now returns the provider cluster key, and add the not-engaged drop case to the mapping test. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The downstream WorkloadDeployment status watch was a complete no-op and the source of a steady ~130 errors/min on the management plane. Two layered causes: milosource.NewClusterSource binds the raw source to the empty cluster name, and the default mchandler.TypedEnqueueRequestsFromMapFunc wraps the map in TypedInjectCluster, which overwrites each request's ClusterName with that bound empty name. So the project cluster name computed by mapDownstreamDeploymentToRequest (and validated by its GetCluster guard) was discarded at enqueue time; every downstream event reached Reconcile with ClusterName="". mcmanager routes the empty name to the local host management cluster, which has no compute CRDs, so the Get failed with "no matches for kind WorkloadDeployment" and requeued in a hot loop — while the watch's actual purpose (immediate status mirror-back) never ran for any project. Switch the handler to TypedEnqueueRequestsFromMapFuncWithClusterPreservation so the map's project cluster name survives to Reconcile, making the downstream watch functional. Add a defensive guard at the top of Reconcile that drops (returns nil, not an error) any request with an empty cluster name, so a host-cluster fallback can never again spin in a requeue loop. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
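The defensive guard described in this commit can be illustrated with a minimal, self-contained sketch. The `request` type and the `mirrorStatus` callback are hypothetical stand-ins, not the real multicluster request type or the federator's actual code; only the behaviour (return nil, not an error, when the cluster name is empty) comes from the commit message.

```go
package main

import "fmt"

// request is a hypothetical stand-in for a multicluster reconcile request.
type request struct {
	ClusterName string
	Namespace   string
	Name        string
}

// reconcile drops requests that carry no cluster name. Returning nil rather
// than an error means the request is not requeued, so a host-cluster fallback
// cannot spin in a hot loop.
func reconcile(req request, mirrorStatus func(request) error) error {
	if req.ClusterName == "" {
		return nil
	}
	return mirrorStatus(req)
}

func main() {
	calls := 0
	mirror := func(request) error { calls++; return nil }

	// Dropped: no cluster name survived to Reconcile.
	_ = reconcile(request{Namespace: "default", Name: "web"}, mirror)
	// Processed: the project cluster name was preserved.
	_ = reconcile(request{ClusterName: "web-project", Namespace: "default", Name: "web"}, mirror)

	fmt.Println("mirror calls:", calls)
}
```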
…tance name An Instance could wedge Pending forever (QuotaGranted=Unknown/QuotaNoBudget, Quota scheduling gate never removed) even though its Milo ResourceClaim was granted: the Instance reconciled once while the claim was still pending, and nothing re-triggered it when the grant landed a beat later. The ResourceClaim watch mapped a claim to its Spec.ResourceRef — the Project — so the grant enqueued the project name, never the owning Instance. Fix the watch to enqueue the owning Instance: its namespace is carried on a new compute.datumapis.com/instance-namespace label (the claim lives in the project quota namespace, not the Instance's), and its name is the claim name with the resource-kind prefix stripped. Also name the claim after the Instance (unique among Instances in the project control plane) with an "instance-" prefix so it cannot collide with other resource kinds' claims sharing the quota namespace, replacing the previous "<namespace>--<name>" scheme. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
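As a rough sketch of the naming scheme and the reverse mapping described above: the label key and the "instance-" prefix are taken from the commit message, while the `claim` and `key` types and both function names are hypothetical simplifications of the real ResourceClaim and reconcile request types.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	instanceClaimPrefix    = "instance-"
	instanceNamespaceLabel = "compute.datumapis.com/instance-namespace"
)

// claim is a hypothetical, trimmed-down view of a ResourceClaim.
type claim struct {
	Name   string
	Labels map[string]string
}

// key identifies the Instance to enqueue.
type key struct {
	Namespace string
	Name      string
}

// claimNameForInstance names the claim after its Instance, with a kind prefix
// so it cannot collide with other resource kinds in the quota namespace.
func claimNameForInstance(instanceName string) string {
	return instanceClaimPrefix + instanceName
}

// instanceKeyForClaim recovers the owning Instance: the name comes from the
// claim name with the prefix stripped, the namespace from the label.
func instanceKeyForClaim(c claim) (key, bool) {
	name, ok := strings.CutPrefix(c.Name, instanceClaimPrefix)
	ns := c.Labels[instanceNamespaceLabel]
	if !ok || name == "" || ns == "" {
		return key{}, false
	}
	return key{Namespace: ns, Name: name}, true
}

func main() {
	c := claim{
		Name:   claimNameForInstance("web-0"),
		Labels: map[string]string{instanceNamespaceLabel: "default"},
	}
	k, ok := instanceKeyForClaim(c)
	fmt.Println(k.Namespace, k.Name, ok)
}
```

Claims that lack the prefix or the label are skipped rather than guessed at, so claims belonging to other resource kinds never enqueue an Instance.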
… them A template-hash change (an image update, or a restartedAt annotation from `datumctl compute restart`) previously resolved to an in-place Update of the Instance. The unikraft provider bakes the pod at creation time and never recomputes an existing pod's spec, so the in-place update silently failed to roll the running workload — instances kept their old pod. Emit a delete (recreate) for drifted Ready instances instead. The next reconcile refills the slot via the create path with the new template, and the provider's finalizer-gated teardown plus create-on-new-Instance roll the pod with no provider changes. Ordered one-at-a-time pacing is preserved by the existing descending-ordinal sort, skip-all-but-first, and the DeletionTimestamp WaitAction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
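The pacing rules above (descending-ordinal sort, act on the first candidate only, wait while a deletion is in flight) might look roughly like the following. This is an illustrative sketch, not the instance-control code: the `instance` and `action` types and `nextRolloutAction` are hypothetical names.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

// instance is a hypothetical, trimmed-down view of an Instance.
type instance struct {
	Ordinal      int
	TemplateHash string
	Ready        bool
	Deleting     bool // true once a DeletionTimestamp is set
}

// action is what the planner asks the controller to do next.
type action struct {
	Kind    string // "wait", "delete", or "none"
	Ordinal int
}

// nextRolloutAction returns at most one action per pass: wait while any
// instance is terminating, otherwise delete the highest-ordinal Ready
// instance whose template hash has drifted.
func nextRolloutAction(instances []instance, desiredHash string) action {
	sorted := slices.Clone(instances)
	slices.SortFunc(sorted, func(a, b instance) int {
		return cmp.Compare(b.Ordinal, a.Ordinal) // descending ordinal
	})
	for _, in := range sorted {
		if in.Deleting {
			return action{Kind: "wait", Ordinal: in.Ordinal}
		}
	}
	for _, in := range sorted {
		if in.Ready && in.TemplateHash != desiredHash {
			return action{Kind: "delete", Ordinal: in.Ordinal}
		}
	}
	return action{Kind: "none"}
}

func main() {
	instances := []instance{
		{Ordinal: 0, TemplateHash: "old", Ready: true},
		{Ordinal: 1, TemplateHash: "old", Ready: true},
	}
	fmt.Printf("%+v\n", nextRolloutAction(instances, "new"))
}
```

After the delete completes, the slot is refilled through the normal create path with the new template, and the next pass picks the next drifted instance.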
The Instance "Running" status condition is renamed to "Available" (wire
value "Available"). An instance can be available while not actively
running a pod (e.g. scaled to zero), so "Running" was misleading as a
serving/health signal.
Renamed constants:
InstanceRunning -> InstanceAvailable ("Available")
InstanceReadyReasonRunning -> InstanceReadyReasonAvailable ("Available")
InstanceRunningReasonRunning -> InstanceAvailableReasonAvailable ("Available")
InstanceRunningReasonStopped -> InstanceAvailableReasonStopped
InstanceRunningReasonStarting -> InstanceAvailableReasonStarting
InstanceRunningReasonStopping -> InstanceAvailableReasonStopping
BREAKING CHANGE: the on-the-wire Instance condition type changes from
"Running" to "Available". Consumers reading conditions[type=="Running"]
must switch to "Available". Existing Instances self-heal on the next
provider reconcile (the provider re-asserts the condition under its new
name); the stale "Running" entry lingers cosmetically until then and is
no longer read by the Ready derivation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
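For consumers affected by the rename, a lookup keyed on the new condition type could be sketched as follows. The `condition` struct is a hypothetical simplification of a Kubernetes status condition, and `isAvailable` is an illustrative helper rather than part of this API; only the "Available" wire value comes from the commit message.

```go
package main

import "fmt"

// condition is a hypothetical, trimmed-down status condition.
type condition struct {
	Type   string
	Status string
}

// instanceAvailable is the new on-the-wire condition type.
const instanceAvailable = "Available"

// isAvailable reads only the "Available" condition. A stale "Running" entry
// left over from before the rename is ignored.
func isAvailable(conds []condition) bool {
	for _, c := range conds {
		if c.Type == instanceAvailable {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	conds := []condition{
		{Type: "Running", Status: "True"}, // stale entry, not yet cleaned up
		{Type: "Available", Status: "True"},
	}
	fmt.Println(isAvailable(conds))
}
```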
…eals The instance controller is re-queued by a ResourceClaim watch when the claim is granted, but that grant event lives on the project control plane and can be missed (informer engagement races, watch relist gaps), wedging the instance at QuotaGranted!=True indefinitely (observed: claim Granted, instance stuck QuotaNoBudget until a manual reconcile cleared it). The pending-quota path returned no RequeueAfter, so there was no safety net.
Add a backing-off requeue while QuotaGranted is not True, anchored on the condition's last transition:
  <60s   : 1s (catch a grant landing almost immediately)
  60s–5m : 15s
  5m–10m : 60s
  >=10m  : 300s
Folded into the existing referenced-data requeue (soonest wins). The ResourceClaim watch remains the fast path; this only guarantees a missed grant self-heals instead of wedging.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
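The tier table in this commit message can be expressed as a small pure function. The sketch below is illustrative: the function names are hypothetical, the tier boundaries are treated as half-open (exactly 5m falls into the 60s tier, which the message does not specify), and "soonest wins" is read as taking the smaller of two positive intervals, with zero meaning no requeue.

```go
package main

import (
	"fmt"
	"time"
)

// quotaRequeueInterval maps how long quota has been pending to the next
// requeue delay, following the tiers in the commit message.
func quotaRequeueInterval(elapsed time.Duration) time.Duration {
	switch {
	case elapsed < time.Minute:
		return time.Second
	case elapsed < 5*time.Minute:
		return 15 * time.Second
	case elapsed < 10*time.Minute:
		return time.Minute
	default:
		return 5 * time.Minute
	}
}

// soonest folds two requeue intervals together. A non-positive value means
// "no requeue requested" and never wins over a real interval.
func soonest(a, b time.Duration) time.Duration {
	switch {
	case a <= 0:
		return b
	case b <= 0:
		return a
	default:
		return min(a, b)
	}
}

func main() {
	for _, elapsed := range []time.Duration{
		5 * time.Second, 2 * time.Minute, 7 * time.Minute, time.Hour,
	} {
		fmt.Printf("pending %s: requeue in %s\n", elapsed, quotaRequeueInterval(elapsed))
	}
	fmt.Println("folded:", soonest(30*time.Second, quotaRequeueInterval(2*time.Minute)))
}
```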
…roof The pending-quota safety-net requeue was wired only at the tail of Reconcile, so an early return during the pending window (a status-update or upstream-writeback conflict) silently dropped it onto controller-runtime's exponential error-backoff, which can stretch to minutes, leaving an instance wedged at QuotaGranted!=True even though its ResourceClaim was granted (observed: the 2nd instance in a rapid burst consistently wedged).
- Compute the requeue once, up front, so every return path honors it.
- On a Conflict during the pending window, requeue at the bounded quota interval instead of returning the error (which would back off).
- Log the requeue decision (and conflict-driven requeues) so the path is observable: a re-firing requeue prints every pass while pending, a dropped one does not.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
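A minimal sketch of the "compute once, honor on every return path" idea follows. It does not use controller-runtime: `result`, `errConflict`, and `reconcilePending` are hypothetical stand-ins for the reconcile result type, an API conflict error, and the pending-window portion of Reconcile.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errConflict is a hypothetical sentinel standing in for an API conflict error.
var errConflict = errors.New("conflict")

// result is a hypothetical stand-in for a reconcile result.
type result struct {
	RequeueAfter time.Duration
}

// reconcilePending takes the requeue interval computed up front and applies it
// on both the success path and the conflict path. A conflict is swallowed so
// the bounded interval is used instead of exponential error backoff.
func reconcilePending(requeue time.Duration, writeStatus func() error) (result, error) {
	res := result{RequeueAfter: requeue}
	if err := writeStatus(); err != nil {
		if errors.Is(err, errConflict) {
			fmt.Printf("conflict while quota is pending, requeueing in %s\n", requeue)
			return res, nil
		}
		return result{}, err
	}
	fmt.Printf("quota still pending, requeueing in %s\n", requeue)
	return res, nil
}

func main() {
	conflict := func() error { return fmt.Errorf("status update: %w", errConflict) }
	res, err := reconcilePending(15*time.Second, conflict)
	fmt.Println(res.RequeueAfter, err)
}
```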
… LTT Observability revealed the safety-net requeue was firing every reconcile but always at the slowest tier (300s): elapsed was measured from the QuotaGranted condition's LastTransitionTime, which stays at the 1970-01-01 CRD default while quota is pending (PendingEvaluation and NoBudget are both Unknown, so SetStatusCondition never bumps it). Result: a watch-missed instance waited up to 5 minutes for the safety net instead of ~1s, appearing wedged. Anchor elapsed on instance.CreationTimestamp, which reflects actual wait time, so the fast tiers (1s/15s) apply early as intended. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The instance controller emits Warning events on Instances (QuotaNoBudget, ImageUnavailable, InstanceCrashing, ConfigurationError, NetworkFailedToCreate, …) via the event recorder, but no RBAC rule granted it. Every write was rejected — "events is forbidden: ... cannot create resource events in API group \"\" in the namespace ns-<uid>" — so the user-facing signals explaining why an instance is stuck never reached the Instance (kubectl describe / activity timeline). Reconciliation was unaffected; this is an observability gap. Add the kubebuilder marker and regenerate the role. The regen also syncs a pre-existing work.karmada.io/resourcebindings rule (from an existing marker that wasn't reflected in the committed role). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rvedGeneration A restart/rolling update was invisible from the project plane: there was no status field representing how many instances are on the new template revision. Add UpdatedReplicas (instances whose observed template hash matches the desired template, regardless of readiness) and ObservedGeneration to both WorkloadDeployment and Workload (plus placement) status. UpdatedReplicas is computed on the cell WD reconcile alongside CurrentReplicas (which is now its Programmed subset), aggregated up into the Workload, and rides the existing status sync to the project plane. Repoint the "Up-to-date" printcolumn to .status.updatedReplicas to match `kubectl get deployment` semantics, so a roll is visible as the count dips below Replicas and recovers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
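The counting rule described above can be sketched as below. The types are hypothetical simplifications; in particular, treating `Replicas` as the number of observed instances is an assumption made only for this example, not something the commit states.

```go
package main

import "fmt"

// instance is a hypothetical, trimmed-down view of an Instance.
type instance struct {
	TemplateHash string
	Programmed   bool
}

// replicaStatus mirrors the counters discussed in the commit message.
type replicaStatus struct {
	Replicas        int32
	UpdatedReplicas int32
	CurrentReplicas int32
}

// summarize counts instances on the desired template (UpdatedReplicas,
// regardless of readiness) and the Programmed subset of those
// (CurrentReplicas).
func summarize(instances []instance, desiredHash string) replicaStatus {
	st := replicaStatus{Replicas: int32(len(instances))} // assumption: observed count
	for _, in := range instances {
		if in.TemplateHash != desiredHash {
			continue
		}
		st.UpdatedReplicas++
		if in.Programmed {
			st.CurrentReplicas++
		}
	}
	return st
}

func main() {
	mid := []instance{
		{TemplateHash: "new", Programmed: true},
		{TemplateHash: "new", Programmed: false},
		{TemplateHash: "old", Programmed: true},
	}
	fmt.Printf("%+v\n", summarize(mid, "new"))
}
```

During a roll, `UpdatedReplicas` dips below `Replicas` and recovers as recreated instances come back on the new template.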
…emory Two Instance-controller correctness changes:
- Blocking-reason rollup: surface the most specific provider sub-condition (ImageUnavailable, InstanceCrashing, ConfigurationError, Provisioning) and its message onto the Instance Ready condition instead of a generic "Instance has not been programmed", so e.g. an image-pull failure reads as ImageUnavailable with the real message. Adds the reason constants and ranks them in the blocking-reason priority.
- Quota sizing: resolve vCPU/memory for instanceType-sized instances from a new instanceTypeCatalog (datumcloud/d1-standard-2 = 1 vCPU / 2 GiB) so the quota ResourceClaim requests vcpus + memory, not just instance count. Explicit container limits / instance requests still take precedence.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
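The quota-sizing rule might be sketched as a catalog lookup with an explicit override. The single catalog entry is the one named in the commit message; the `size` type, the byte-based memory unit, and `resolveSize` are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// size is a hypothetical representation of the resources a claim requests.
type size struct {
	VCPUs       int64
	MemoryBytes int64
}

// instanceTypeCatalog holds the one entry named in the commit message.
var instanceTypeCatalog = map[string]size{
	"datumcloud/d1-standard-2": {VCPUs: 1, MemoryBytes: 2 << 30}, // 1 vCPU / 2 GiB
}

// resolveSize prefers explicitly requested resources and falls back to the
// catalog entry for the instance type. ok is false when neither is known.
func resolveSize(instanceType string, explicit *size) (s size, ok bool) {
	if explicit != nil {
		return *explicit, true
	}
	s, ok = instanceTypeCatalog[instanceType]
	return s, ok
}

func main() {
	fromCatalog, _ := resolveSize("datumcloud/d1-standard-2", nil)
	fmt.Printf("catalog: %+v\n", fromCatalog)

	override, _ := resolveSize("datumcloud/d1-standard-2", &size{VCPUs: 4, MemoryBytes: 8 << 30})
	fmt.Printf("explicit: %+v\n", override)
}
```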
… tests Make the cherry-picked instanceType-sizing and blocking-reason tests lint-clean: hoist the repeated "datumcloud/d1-standard-2", "app", and "test/image:latest" literals into named constants (goconst) and apply gofmt. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
b45810f to
73177eb
Compare
The "cluster-<name>" label encoding (project path with "/" → "_") was open-coded at five sites with two of them re-deriving the decode by hand. Extract EncodeClusterName/DecodeClusterName so the wire format lives in one place. The federator keeps its distinct semantics — it decodes then trims to the last path segment because the multicluster provider keys clusters by bare project name — now by wrapping the shared decoder rather than re-implementing it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
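A sketch of what the shared encode/decode helpers and the federator's wrapper could look like is given below. The function names come from the commit messages, but the signatures and bodies are assumptions. Note that the decode is only a true inverse when no path segment itself contains "_", since "/" and "_" both end up as "_" on the wire.

```go
package main

import (
	"fmt"
	"strings"
)

const clusterLabelPrefix = "cluster-"

// EncodeClusterName turns a project path such as "org/project" into the
// label-safe form "cluster-org_project".
func EncodeClusterName(path string) string {
	return clusterLabelPrefix + strings.ReplaceAll(path, "/", "_")
}

// DecodeClusterName inverts EncodeClusterName. ok is false when the label
// does not carry the expected prefix.
func DecodeClusterName(label string) (path string, ok bool) {
	rest, found := strings.CutPrefix(label, clusterLabelPrefix)
	if !found || rest == "" {
		return "", false
	}
	return strings.ReplaceAll(rest, "_", "/"), true
}

// projectClusterNameFromLabel wraps the shared decoder and keeps only the
// final path segment, because the provider keys clusters by bare project name.
func projectClusterNameFromLabel(label string) (string, bool) {
	path, ok := DecodeClusterName(label)
	if !ok {
		return "", false
	}
	name := path[strings.LastIndex(path, "/")+1:]
	return name, name != ""
}

func main() {
	label := EncodeClusterName("acme/web")
	path, _ := DecodeClusterName(label)
	name, _ := projectClusterNameFromLabel(label)
	fmt.Println(label, path, name)
}
```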
singleModeProjectID/singleModeProjectNamespace/readEdgeNamespace decoded edge namespace labels into project identity — domain logic that had no business in package main. Move them into the controller package as NewSingleModeProjectID/ NewSingleModeProjectNamespace constructors. main.go keeps only the wiring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
KarmadaClient was never assigned, so writeStatusToKarmada always early-returned nil. WorkloadDeployment status reaches Karmada via the cell-local Status().Update plus the statusAggregation interpreter, not a controller push. Remove the field, the method, and its call site. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Workloads targeting a city location are automatically routed to the correct physical site, with instance health and readiness surfaced back to the platform in real time. This replaces the single central scheduler with per-site distributed scheduling, so each site operates independently. User-facing behavior is unchanged — city-code targeting, instance visibility, and the existing API all work as before.
This is the complete federation foundation. Decomposed from one large PR; the genuinely-independent pieces landed first and are merged:
It now contains the full controller layer and the operational-completeness fixes that make it correct on its own: quota self-heal on a late grant, instance restart that actually rolls, a downstream-WorkloadDeployment status watch (no resync needed), the `Running` → `Available` condition rename, rollout progress, and `instanceType` vCPU/memory sizing. (These were briefly split into a separate PR and folded back in, since the review showed the foundation is incomplete without them.)
Design: #106
Testing
Covered by unit tests here. End-to-end coverage is deferred to #149: the original harness ran the operators locally (`go run`) rather than deploying them to the cells, so it didn't exercise RBAC/manifests/image. It'll be rebuilt as a proper in-cluster harness; the deferred suites are preserved on `archive/e2e-local-deferred`.
Known follow-ups (from review)
Not blockers for review, tracked separately: single-cluster overlay bootability, the status interpreter not being wired into any overlay, management-plane leader-election scoping, and observability (metrics/Events) on the federation paths.
Closes #85