feat: route workloads to city locations via distributed scheduling (foundation) by scotwells · Pull Request #107 · datum-cloud/compute

scotwells · 2026-05-18T22:41:29Z

Summary

Workloads targeting a city location are automatically routed to the correct physical site, with instance health and readiness surfaced back to the platform in real time. This replaces the single central scheduler with per-site distributed scheduling, so each site operates independently. User-facing behavior is unchanged — city-code targeting, instance visibility, and the existing API all work as before.

This is the complete federation foundation. Decomposed from one large PR; the genuinely-independent pieces landed first and are merged:

API types → Add the API types for federated workload delivery #147 ✅
Quota client + metrics → Add the per-project quota client and metrics #148 ✅

It now contains the full controller layer and the operational-completeness fixes that make it correct on its own — quota self-heal on a late grant, instance restart actually rolls, downstream-WorkloadDeployment status watch (no resync needed), the Running → Available condition rename, rollout progress, and instanceType vCPU/memory sizing. (These were briefly split into a separate PR and folded back in, since the review showed the foundation is incomplete without them.)

Design: #106

Testing

Covered by unit tests here. End-to-end coverage is deferred to #149 — the original harness ran the operators locally (go run) rather than deploying them to the cells, so it didn't exercise RBAC/manifests/image. It'll be rebuilt as a proper in-cluster harness; the deferred suites are preserved on archive/e2e-local-deferred.

Known follow-ups (from review)

Not blockers for review, tracked separately: single-cluster overlay bootability, the status interpreter not being wired into any overlay, management-plane leader-election scoping, and observability (metrics/Events) on the federation paths.

Closes #85

scotwells · 2026-05-28T20:54:34Z

Setting to draft while I continue to iterate on getting this working in staging.

The base branch was changed.

scotwells · 2026-06-05T01:42:39Z

📦 The federation e2e chainsaw suites (~900 lines of test YAML) have been split out into a dedicated PR so this foundation reviews without them inline. The shared test/e2e/env harness stays here. See the federation-e2e PR (stacked on this branch).

Bump the toolchain to Go 1.25 and golangci-lint v2.12.2, introduce a Taskfile for the standard build/test/lint targets, and align the CI workflows and Makefile with the new versions. Remove stale RFC and enhancement docs that the federated-scheduling work supersedes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Delete the central scheduler that placed WorkloadDeployments from a single control plane. Placement now happens through the distributed federator and per-cell controllers introduced in the following commits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Introduce the federator that fans a WorkloadDeployment out to the cells selected for its placement, replacing the central scheduler. Add the city-code field indexer it uses to map subnet/location events back to the deployments that depend on them. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add the projector that mirrors cell-side Instances back to the management plane, writing their status (readiness, placement, blocking reasons) onto the project-scoped Instance so callers see a single view across cells. Include the shared controller test helpers that build the project/Karmada fake clients and multi-cluster manager used by the federation tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…liation Rework the WorkloadDeployment and Workload controllers to run per cell, resolving networks and Locations locally and driving Instance lifecycle through the stateful instance-control logic rather than a central scheduler. Update the instance-control packages to manage Instances within a cell's control plane. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Update the Instance controller to compute the Ready condition and apply the per-project quota gate within a single reconcile pass, surfacing blocking reasons when quota is unavailable so federated placement reflects real allocatable capacity. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Wire the manager to run in either cell or management-plane mode, gating the federator, projector, and per-cell controllers behind feature flags. Add the feature-gate registry and extend configuration to carry the downstream kubeconfig and discovery settings each mode needs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Update Workload webhook and Instance validation so the API accepts the fields federated scheduling adds and continues to reject invalid placement and runtime specs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Regenerate the Instance, Workload, and WorkloadDeployment CRDs for the new API fields and add the kustomize structure that deploys the manager in cell or management-plane mode: federation and downstream RBAC bases, cell/management/quota-credentials components, the WorkloadDeployment status interpreter, and the matching overlays. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The WorkloadDeploymentFederator mirrors the downstream Karmada WorkloadDeployment status onto the project (VCP) WorkloadDeployment, but SetupWithManager only watched the project WD via For(). Nothing watched the downstream WD whose status it mirrors, so when Karmada aggregated new status onto the downstream object the federator was not notified; it only caught up on the next informer resync (~10h default) or an incidental project-WD spec write. This is why a freshly created workload's replica counts stayed empty on the VCP long after its projected Instance had already appeared (the InstanceProjector holds the analogous downstream watch and so propagates immediately).

Add a downstream watch using the same cross-plane mechanism the InstanceProjector and unikraft-provider use (milosource cluster source + TypedEnqueueRequestsFromMapFunc). The map function correlates a downstream WD event back to its project WD reconcile request: name is stable across planes, namespace comes from the UpstreamOwnerNamespace label the federator stamps, and the project cluster name is recovered by decoding the UpstreamOwnerClusterName label on the downstream namespace (the exact inverse of the encoding applied in ensureDownstreamNamespace). The federation manager already constructed for the InstanceProjector is reused as the watchable source, so there is no additional manager or informer-cache cost beyond the new WD and Namespace informers.

Karmada's own status-aggregation interval (edge cell → downstream WD) remains outside this repo; once Karmada writes the aggregated status, the new watch reacts immediately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The downstream WorkloadDeployment status watch mapped events to a reconcile request whose ClusterName was the full decoded org/project path (decodeUpstreamClusterName turned the "cluster-<org>_<project>" namespace label into "<org>/<project>"). But the Milo multicluster provider keys project clusters by bare project name only. As a result every project except the org-less "datum-cloud" failed to resolve: mcmanager routed the unmatched name (ultimately the empty string) to the local host cluster, which has no compute CRDs, so Reconcile failed with "no matches for kind WorkloadDeployment" in a hot loop (~2 errors/sec observed on staging).

Extract the bare project name (final path segment) so it matches the provider key, and guard the mapping with GetCluster: if the project cluster isn't engaged yet, drop the event instead of enqueuing a request that falls back to the host cluster and errors. Dropping is safe: once the provider engages the cluster, the For watch reconciles it and the next downstream status event maps cleanly.

Rename decodeUpstreamClusterName to projectClusterNameFromLabel to reflect that it now returns the provider cluster key, and add the not-engaged drop case to the mapping test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The downstream WorkloadDeployment status watch was a complete no-op and the source of a steady ~130 errors/min on the management plane. Two layered causes: milosource.NewClusterSource binds the raw source to the empty cluster name, and the default mchandler.TypedEnqueueRequestsFromMapFunc wraps the map in TypedInjectCluster, which overwrites each request's ClusterName with that bound empty name. So the project cluster name computed by mapDownstreamDeploymentToRequest (and validated by its GetCluster guard) was discarded at enqueue time; every downstream event reached Reconcile with ClusterName="". mcmanager routes the empty name to the local host management cluster, which has no compute CRDs, so the Get failed with "no matches for kind WorkloadDeployment" and requeued in a hot loop, while the watch's actual purpose (immediate status mirror-back) never ran for any project.

Switch the handler to TypedEnqueueRequestsFromMapFuncWithClusterPreservation so the map's project cluster name survives to Reconcile, making the downstream watch functional. Add a defensive guard at the top of Reconcile that drops (returns nil, not an error) any request with an empty cluster name, so a host-cluster fallback can never again spin in a requeue loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tance name An Instance could wedge Pending forever (QuotaGranted=Unknown/QuotaNoBudget, Quota scheduling gate never removed) even though its Milo ResourceClaim was granted: the Instance reconciled once while the claim was still pending, and nothing re-triggered it when the grant landed a beat later. The ResourceClaim watch mapped a claim to its Spec.ResourceRef — the Project — so the grant enqueued the project name, never the owning Instance. Fix the watch to enqueue the owning Instance: its namespace is carried on a new compute.datumapis.com/instance-namespace label (the claim lives in the project quota namespace, not the Instance's), and its name is the claim name with the resource-kind prefix stripped. Also name the claim after the Instance (unique among Instances in the project control plane) with an "instance-" prefix so it cannot collide with other resource kinds' claims sharing the quota namespace, replacing the previous "<namespace>--<name>" scheme. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
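The claim-to-Instance mapping described in this commit can be sketched as below. This is a minimal illustration, not the controller's code: the label key and the "instance-" prefix come from the commit message, while claimNameFor, instanceForClaim, and instanceKey are hypothetical names introduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// claimPrefix is the resource-kind prefix named in the commit message.
const claimPrefix = "instance-"

// instanceNamespaceLabel is the label key named in the commit message.
const instanceNamespaceLabel = "compute.datumapis.com/instance-namespace"

// instanceKey is a hypothetical stand-in for a namespaced name.
type instanceKey struct {
	Namespace, Name string
}

// claimNameFor (hypothetical) derives the claim name from the Instance name.
func claimNameFor(instanceName string) string {
	return claimPrefix + instanceName
}

// instanceForClaim (hypothetical) maps a claim back to its owning Instance.
// It reports false when the claim belongs to another resource kind or
// does not carry the namespace label.
func instanceForClaim(claimName string, labels map[string]string) (instanceKey, bool) {
	name, found := strings.CutPrefix(claimName, claimPrefix)
	if !found || name == "" {
		return instanceKey{}, false
	}
	ns := labels[instanceNamespaceLabel]
	if ns == "" {
		return instanceKey{}, false
	}
	return instanceKey{Namespace: ns, Name: name}, true
}

func main() {
	labels := map[string]string{instanceNamespaceLabel: "default"}
	key, ok := instanceForClaim(claimNameFor("web-0"), labels)
	fmt.Println(key.Namespace, key.Name, ok)
}
```

Returning false for claims without the prefix or label means a watch handler built on this would simply skip claims owned by other resource kinds in the shared quota namespace.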

… them A template-hash change (an image update, or a restartedAt annotation from `datumctl compute restart`) previously resolved to an in-place Update of the Instance. The unikraft provider bakes the pod at creation time and never recomputes an existing pod's spec, so the in-place update silently failed to roll the running workload — instances kept their old pod. Emit a delete (recreate) for drifted Ready instances instead. The next reconcile refills the slot via the create path with the new template, and the provider's finalizer-gated teardown plus create-on-new-Instance roll the pod with no provider changes. Ordered one-at-a-time pacing is preserved by the existing descending-ordinal sort, skip-all-but-first, and the DeletionTimestamp WaitAction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
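One way the pacing rules named in this commit might combine is sketched below. The instance and action types and nextRolloutAction are hypothetical; the real instance-control logic is not shown in this PR page, so treat the ordering of the checks as an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// instance is a hypothetical, simplified view of an Instance.
type instance struct {
	Ordinal      int
	TemplateHash string
	Ready        bool
	Deleting     bool // stands in for a non-nil DeletionTimestamp
}

// action is a hypothetical result type: Kind is "wait", "delete", or "none".
type action struct {
	Kind    string
	Ordinal int
}

// nextRolloutAction picks at most one step per reconcile pass.
func nextRolloutAction(instances []instance, desiredHash string) action {
	// An in-flight deletion blocks any further roll (the wait case).
	for _, in := range instances {
		if in.Deleting {
			return action{Kind: "wait", Ordinal: in.Ordinal}
		}
	}
	// Sort a copy by descending ordinal so the highest slot rolls first.
	sorted := append([]instance(nil), instances...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Ordinal > sorted[j].Ordinal })
	// Delete only the first drifted Ready instance; skip the rest this pass.
	for _, in := range sorted {
		if in.Ready && in.TemplateHash != desiredHash {
			return action{Kind: "delete", Ordinal: in.Ordinal}
		}
	}
	return action{Kind: "none", Ordinal: -1}
}

func main() {
	instances := []instance{
		{Ordinal: 0, TemplateHash: "old", Ready: true},
		{Ordinal: 1, TemplateHash: "old", Ready: true},
		{Ordinal: 2, TemplateHash: "new", Ready: true},
	}
	fmt.Println(nextRolloutAction(instances, "new"))
}
```

In this sketch the deleted slot is expected to be refilled by a separate create path on a later pass, which matches the description of the next reconcile recreating the Instance with the new template.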

The Instance "Running" status condition is renamed to "Available" (wire value "Available"). An instance can be available while not actively running a pod (e.g. scaled to zero), so "Running" was misleading as a serving/health signal.

Renamed constants:

InstanceRunning -> InstanceAvailable ("Available")
InstanceReadyReasonRunning -> InstanceReadyReasonAvailable ("Available")
InstanceRunningReasonRunning -> InstanceAvailableReasonAvailable ("Available")
InstanceRunningReasonStopped -> InstanceAvailableReasonStopped
InstanceRunningReasonStarting -> InstanceAvailableReasonStarting
InstanceRunningReasonStopping -> InstanceAvailableReasonStopping

BREAKING CHANGE: the on-the-wire Instance condition type changes from "Running" to "Available". Consumers reading conditions[type=="Running"] must switch to "Available". Existing Instances self-heal on the next provider reconcile (the provider re-asserts the condition under its new name); the stale "Running" entry lingers cosmetically until then and is no longer read by the Ready derivation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
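For consumers affected by the breaking change, a lookup keyed on the new condition type might look like the sketch below. The condition struct is a simplified, hypothetical stand-in for the Kubernetes condition type, and findCondition and isAvailable are illustrative helpers, not part of this repository's API.

```go
package main

import "fmt"

// condition is a hypothetical, trimmed-down status condition.
type condition struct {
	Type   string
	Status string
}

// conditionAvailable is the new wire value from the commit message.
const conditionAvailable = "Available"

// findCondition returns the first condition of the given type, or nil.
func findCondition(conds []condition, condType string) *condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// isAvailable reads only "Available"; a stale "Running" entry is ignored.
func isAvailable(conds []condition) bool {
	c := findCondition(conds, conditionAvailable)
	return c != nil && c.Status == "True"
}

func main() {
	conds := []condition{
		{Type: "Running", Status: "False"}, // stale entry from before the rename
		{Type: "Available", Status: "True"},
	}
	fmt.Println(isAvailable(conds))
}
```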

…eals The instance controller is re-queued by a ResourceClaim watch when the claim is granted, but that grant event lives on the project control plane and can be missed (informer engagement races, watch relist gaps), wedging the instance at QuotaGranted!=True indefinitely (observed: claim Granted, instance stuck QuotaNoBudget until a manual reconcile cleared it). The pending-quota path returned no RequeueAfter, so there was no safety net.

Add a backing-off requeue while QuotaGranted is not True, anchored on the condition's last transition:

<60s    : 1s (catch a grant landing almost immediately)
60s–5m  : 15s
5m–10m  : 60s
>=10m   : 300s

Folded into the existing referenced-data requeue (soonest wins). The ResourceClaim watch remains the fast path; this only guarantees a missed grant self-heals instead of wedging.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…roof The pending-quota safety-net requeue was wired only at the tail of Reconcile, so an early return during the pending window (a status-update or upstream-writeback conflict) silently dropped it onto controller-runtime's exponential error-backoff, which can stretch to minutes, leaving an instance wedged at QuotaGranted!=True even though its ResourceClaim was granted (observed: the 2nd instance in a rapid burst consistently wedged).

- Compute the requeue once, up front, so every return path honors it.
- On a Conflict during the pending window, requeue at the bounded quota interval instead of returning the error (which would back off).
- Log the requeue decision (and conflict-driven requeues) so the path is observable: a re-firing requeue prints every pass while pending, a dropped one does not.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… LTT Observability revealed the safety-net requeue was firing every reconcile but always at the slowest tier (300s): elapsed was measured from the QuotaGranted condition's LastTransitionTime, which stays at the 1970-01-01 CRD default while quota is pending (PendingEvaluation and NoBudget are both Unknown, so SetStatusCondition never bumps it). Result: a watch-missed instance waited up to 5 minutes for the safety net instead of ~1s, appearing wedged. Anchor elapsed on instance.CreationTimestamp, which reflects actual wait time, so the fast tiers (1s/15s) apply early as intended. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The instance controller emits Warning events on Instances (QuotaNoBudget, ImageUnavailable, InstanceCrashing, ConfigurationError, NetworkFailedToCreate, …) via the event recorder, but no RBAC rule granted it. Every write was rejected — "events is forbidden: ... cannot create resource events in API group \"\" in the namespace ns-<uid>" — so the user-facing signals explaining why an instance is stuck never reached the Instance (kubectl describe / activity timeline). Reconciliation was unaffected; this is an observability gap. Add the kubebuilder marker and regenerate the role. The regen also syncs a pre-existing work.karmada.io/resourcebindings rule (from an existing marker that wasn't reflected in the committed role). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rvedGeneration A restart/rolling update was invisible from the project plane: there was no status field representing how many instances are on the new template revision. Add UpdatedReplicas (instances whose observed template hash matches the desired template, regardless of readiness) and ObservedGeneration to both WorkloadDeployment and Workload (plus placement) status. UpdatedReplicas is computed on the cell WD reconcile alongside CurrentReplicas (which is now its Programmed subset), aggregated up into the Workload, and rides the existing status sync to the project plane. Repoint the "Up-to-date" printcolumn to .status.updatedReplicas to match `kubectl get deployment` semantics, so a roll is visible as the count dips below Replicas and recovers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
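A minimal sketch of the counting described in this commit follows. The instance and replicaCounts types and countReplicas are hypothetical, and the sketch reads "its Programmed subset" as meaning CurrentReplicas counts instances that are both on the desired template and Programmed; that reading is an assumption.

```go
package main

import "fmt"

// instance is a hypothetical, simplified view of a cell-side Instance.
type instance struct {
	ObservedTemplateHash string
	Programmed           bool
}

// replicaCounts is a hypothetical holder for the status counters.
type replicaCounts struct {
	Replicas        int32
	UpdatedReplicas int32
	CurrentReplicas int32
}

// countReplicas tallies instances against the desired template hash.
func countReplicas(instances []instance, desiredHash string) replicaCounts {
	var c replicaCounts
	for _, in := range instances {
		c.Replicas++
		if in.ObservedTemplateHash != desiredHash {
			continue // still on an old template revision
		}
		c.UpdatedReplicas++ // counted regardless of readiness
		if in.Programmed {
			c.CurrentReplicas++ // assumed: Programmed subset of updated
		}
	}
	return c
}

func main() {
	instances := []instance{
		{ObservedTemplateHash: "new", Programmed: true},
		{ObservedTemplateHash: "new", Programmed: false},
		{ObservedTemplateHash: "old", Programmed: true},
	}
	fmt.Printf("%+v\n", countReplicas(instances, "new"))
}
```

During a roll, UpdatedReplicas in this sketch dips below Replicas while old-template instances remain and recovers as they are recreated.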

…emory Two Instance-controller correctness changes:

- Blocking-reason rollup: surface the most specific provider sub-condition (ImageUnavailable, InstanceCrashing, ConfigurationError, Provisioning) and its message onto the Instance Ready condition instead of a generic "Instance has not been programmed", so e.g. an image-pull failure reads as ImageUnavailable with the real message. Adds the reason constants and ranks them in the blocking-reason priority.
- Quota sizing: resolve vCPU/memory for instanceType-sized instances from a new instanceTypeCatalog (datumcloud/d1-standard-2 = 1 vCPU / 2 GiB) so the quota ResourceClaim requests vcpus + memory, not just instance count. Explicit container limits / instance requests still take precedence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
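The quota-sizing lookup could be sketched as below. Only the catalog name and the single datumcloud/d1-standard-2 entry come from the commit message; the resources type, the resolveSizing function, and the use of a nil pointer to mean "no explicit sizing" are assumptions made for illustration.

```go
package main

import "fmt"

// resources is a hypothetical sizing record.
type resources struct {
	VCPUs       int64
	MemoryBytes int64
}

// instanceTypeCatalog holds the one entry stated in the commit message:
// datumcloud/d1-standard-2 = 1 vCPU / 2 GiB.
var instanceTypeCatalog = map[string]resources{
	"datumcloud/d1-standard-2": {VCPUs: 1, MemoryBytes: 2 << 30},
}

// resolveSizing returns explicit sizing when present, otherwise the
// catalog entry for the instance type. It reports false when neither
// source yields a size.
func resolveSizing(instanceType string, explicit *resources) (resources, bool) {
	if explicit != nil {
		return *explicit, true // explicit limits/requests take precedence
	}
	r, ok := instanceTypeCatalog[instanceType]
	return r, ok
}

func main() {
	r, ok := resolveSizing("datumcloud/d1-standard-2", nil)
	fmt.Println(r.VCPUs, r.MemoryBytes, ok)
}
```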

… tests Make the cherry-picked instanceType-sizing and blocking-reason tests lint-clean: hoist the repeated "datumcloud/d1-standard-2", "app", and "test/image:latest" literals into named constants (goconst) and apply gofmt. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The "cluster-<name>" label encoding (project path with "/" → "_") was open-coded at five sites with two of them re-deriving the decode by hand. Extract EncodeClusterName/DecodeClusterName so the wire format lives in one place. The federator keeps its distinct semantics — it decodes then trims to the last path segment because the multicluster provider keys clusters by bare project name — now by wrapping the shared decoder rather than re-implementing it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
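The wire format and the federator's wrapper can be illustrated with the sketch below. The function names appear in the commit messages, but the signatures and the boolean return are assumptions. The decode is only a true inverse when path segments contain no underscore, which this sketch assumes.

```go
package main

import (
	"fmt"
	"strings"
)

const clusterLabelPrefix = "cluster-"

// EncodeClusterName turns a project path such as "org/project" into the
// "cluster-org_project" label value described in the commit message.
func EncodeClusterName(projectPath string) string {
	return clusterLabelPrefix + strings.ReplaceAll(projectPath, "/", "_")
}

// DecodeClusterName reverses EncodeClusterName. It reports false when the
// value does not carry the expected prefix.
func DecodeClusterName(label string) (string, bool) {
	encoded, ok := strings.CutPrefix(label, clusterLabelPrefix)
	if !ok || encoded == "" {
		return "", false
	}
	return strings.ReplaceAll(encoded, "_", "/"), true
}

// projectClusterNameFromLabel wraps the shared decoder and trims to the
// final path segment, the bare project name the provider keys on.
func projectClusterNameFromLabel(label string) (string, bool) {
	path, ok := DecodeClusterName(label)
	if !ok {
		return "", false
	}
	if i := strings.LastIndex(path, "/"); i >= 0 {
		path = path[i+1:]
	}
	return path, path != ""
}

func main() {
	label := EncodeClusterName("acme/web")
	path, _ := DecodeClusterName(label)
	project, _ := projectClusterNameFromLabel(label)
	fmt.Println(label, path, project)
}
```

Keeping the trim in a wrapper rather than in the decoder leaves the full path available to callers that need it, in line with the commit's note that the federator has distinct semantics.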

singleModeProjectID/singleModeProjectNamespace/readEdgeNamespace decoded edge namespace labels into project identity — domain logic that had no business in package main. Move them into the controller package as NewSingleModeProjectID/ NewSingleModeProjectNamespace constructors. main.go keeps only the wiring. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

KarmadaClient was never assigned, so writeStatusToKarmada always early-returned nil. WorkloadDeployment status reaches Karmada via the cell-local Status().Update plus the statusAggregation interpreter, not a controller push. Remove the field, the method, and its call site. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

scotwells requested review from JoseSzycho, kevwilliams, mattdjenkinson, privateip and savme May 19, 2026 19:31

scotwells mentioned this pull request May 19, 2026

Launch Datum Compute Service datum-cloud/enhancements#682

Open

scotwells force-pushed the feat/federated-deployment-scheduling branch from 0c0d8df to 134086f Compare May 19, 2026 21:10

scotwells changed the title ~~feat: federated deployment scheduling across POP cells~~ feat: Route workloads to city locations via distributed scheduling May 20, 2026

scotwells force-pushed the feat/federated-deployment-scheduling branch 3 times, most recently from 6e9a268 to 492eb6c Compare May 20, 2026 22:19

mattdjenkinson approved these changes May 22, 2026


scotwells requested a review from mattdjenkinson May 27, 2026 00:15

mattdjenkinson previously approved these changes May 27, 2026


privateip previously approved these changes May 28, 2026


scotwells closed this May 28, 2026

scotwells reopened this May 28, 2026

scotwells marked this pull request as draft May 28, 2026 20:53

This was referenced May 29, 2026

feat(api): add Command and Args fields to SandboxContainer #125

Merged

feat: federated workload scheduling across POP cells #116

Closed

fix: Report accurate health for federated workloads #127

Open

Base automatically changed from docs/issue-85-karmada-federation-design to main June 1, 2026 22:01

This was referenced Jun 4, 2026

Simpler, more reliable webhook TLS via a cert-manager CSI mount #141

Merged

Instances self-heal, restart, and report status correctly on the federation foundation #142

Merged

scotwells force-pushed the feat/federated-deployment-scheduling branch from 82955e2 to bf73355 Compare June 5, 2026 01:42

scotwells mentioned this pull request Jun 5, 2026

End-to-end coverage for federated workload delivery #146

Closed

scotwells force-pushed the feat/federated-deployment-scheduling branch from a67b32c to b45810f Compare June 5, 2026 17:49

scotwells and others added 22 commits June 5, 2026 13:35

feat(webhook): validation updates for federation

0d49455


scotwells force-pushed the feat/federated-deployment-scheduling branch from b45810f to 73177eb Compare June 5, 2026 18:38

scotwells mentioned this pull request Jun 5, 2026

An Instance is "Available" when it's ready to serve, even when scaled to zero #150

Merged

scotwells changed the base branch from main to split/api-rename June 5, 2026 18:39

Base automatically changed from split/api-rename to main June 5, 2026 19:56

scotwells and others added 3 commits June 8, 2026 13:28

scotwells wants to merge 25 commits into main from feat/federated-deployment-scheduling

3 participants
