
feat: route workloads to city locations via distributed scheduling (foundation) #107

Open
scotwells wants to merge 25 commits into main from feat/federated-deployment-scheduling

Contributor

@scotwells scotwells commented May 18, 2026

Summary

Workloads targeting a city location are automatically routed to the correct physical site, with instance health and readiness surfaced back to the platform in real time. This replaces the single central scheduler with per-site distributed scheduling, so each site operates independently. User-facing behavior is unchanged — city-code targeting, instance visibility, and the existing API all work as before.

This is the complete federation foundation. Decomposed from one large PR; the genuinely-independent pieces landed first and are merged:

It now contains the full controller layer and the operational-completeness fixes that make it correct on its own — quota self-heal on a late grant, instance restart actually rolls, downstream-WorkloadDeployment status watch (no resync needed), the Running → Available condition rename, rollout progress, and instanceType vCPU/memory sizing. (These were briefly split into a separate PR and folded back in, since the review showed the foundation is incomplete without them.)

Design: #106

Testing

Covered by unit tests here. End-to-end coverage is deferred to #149 — the original harness ran the operators locally (go run) rather than deploying them to the cells, so it didn't exercise RBAC/manifests/image. It'll be rebuilt as a proper in-cluster harness; the deferred suites are preserved on archive/e2e-local-deferred.

Known follow-ups (from review)

Not blockers for review, tracked separately: single-cluster overlay bootability, the status interpreter not being wired into any overlay, management-plane leader-election scoping, and observability (metrics/Events) on the federation paths.

Closes #85

@scotwells scotwells force-pushed the feat/federated-deployment-scheduling branch from 0c0d8df to 134086f Compare May 19, 2026 21:10
@scotwells scotwells changed the title from "feat: federated deployment scheduling across POP cells" to "feat: Route workloads to city locations via distributed scheduling" May 20, 2026
@scotwells scotwells force-pushed the feat/federated-deployment-scheduling branch 3 times, most recently from 6e9a268 to 492eb6c Compare May 20, 2026 22:19
@scotwells scotwells requested a review from mattdjenkinson May 27, 2026 00:15
mattdjenkinson previously approved these changes May 27, 2026
privateip previously approved these changes May 28, 2026
@scotwells scotwells closed this May 28, 2026
@scotwells scotwells reopened this May 28, 2026
@scotwells scotwells marked this pull request as draft May 28, 2026 20:53
@scotwells
Contributor Author

Setting to draft while I continue to iterate on getting this working in staging.

@scotwells
Contributor Author

📦 The federation e2e chainsaw suites (~900 lines of test YAML) have been split out into a dedicated PR so this foundation reviews without them inline. The shared test/e2e/env harness stays here. See the federation-e2e PR (stacked on this branch).

@scotwells scotwells force-pushed the feat/federated-deployment-scheduling branch from a67b32c to b45810f Compare June 5, 2026 17:49
scotwells and others added 22 commits June 5, 2026 13:35
Bump the toolchain to Go 1.25 and golangci-lint v2.12.2, introduce a
Taskfile for the standard build/test/lint targets, and align the CI
workflows and Makefile with the new versions. Remove stale RFC and
enhancement docs that the federated-scheduling work supersedes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Delete the central scheduler that placed WorkloadDeployments from a
single control plane. Placement now happens through the distributed
federator and per-cell controllers introduced in the following commits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Introduce the federator that fans a WorkloadDeployment out to the cells
selected for its placement, replacing the central scheduler. Add the
city-code field indexer it uses to map subnet/location events back to the
deployments that depend on them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the projector that mirrors cell-side Instances back to the
management plane, writing their status (readiness, placement, blocking
reasons) onto the project-scoped Instance so callers see a single view
across cells. Include the shared controller test helpers that build the
project/Karmada fake clients and multi-cluster manager used by the
federation tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…liation

Rework the WorkloadDeployment and Workload controllers to run per cell,
resolving networks and Locations locally and driving Instance lifecycle
through the stateful instance-control logic rather than a central
scheduler. Update the instance-control packages to manage Instances
within a cell's control plane.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Update the Instance controller to compute the Ready condition and apply
the per-project quota gate within a single reconcile pass, surfacing
blocking reasons when quota is unavailable so federated placement
reflects real allocatable capacity.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire the manager to run in either cell or management-plane mode, gating
the federator, projector, and per-cell controllers behind feature flags.
Add the feature-gate registry and extend configuration to carry the
downstream kubeconfig and discovery settings each mode needs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Update Workload webhook and Instance validation so the API accepts the
fields federated scheduling adds and continues to reject invalid
placement and runtime specs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Regenerate the Instance, Workload, and WorkloadDeployment CRDs for the
new API fields and add the kustomize structure that deploys the manager
in cell or management-plane mode: federation and downstream RBAC bases,
cell/management/quota-credentials components, the WorkloadDeployment
status interpreter, and the matching overlays.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The WorkloadDeploymentFederator mirrors the downstream Karmada
WorkloadDeployment status onto the project (VCP) WorkloadDeployment, but
SetupWithManager only watched the project WD via For(). Nothing watched
the downstream WD whose status it mirrors, so when Karmada aggregated new
status onto the downstream object the federator was not notified — it
only caught up on the next informer resync (~10h default) or an
incidental project-WD spec write. This is why a freshly created
workload's replica counts stayed empty on the VCP long after its
projected Instance had already appeared (the InstanceProjector holds the
analogous downstream watch and so propagates immediately).

Add a downstream watch using the same cross-plane mechanism the
InstanceProjector and unikraft-provider use (milosource cluster source +
TypedEnqueueRequestsFromMapFunc). The map function correlates a
downstream WD event back to its project WD reconcile request: name is
stable across planes, namespace comes from the UpstreamOwnerNamespace
label the federator stamps, and the project cluster name is recovered by
decoding the UpstreamOwnerClusterName label on the downstream namespace
(the exact inverse of the encoding applied in ensureDownstreamNamespace).

The federation manager already constructed for the InstanceProjector is
reused as the watchable source, so there is no additional manager or
informer-cache cost beyond the new WD and Namespace informers.

Karmada's own status-aggregation interval (edge cell → downstream WD)
remains outside this repo; once Karmada writes the aggregated status, the
new watch reacts immediately.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The downstream WorkloadDeployment status watch mapped events to a
reconcile request whose ClusterName was the full decoded org/project path
(decodeUpstreamClusterName turned the "cluster-<org>_<project>" namespace
label into "<org>/<project>"). But the Milo multicluster provider keys
project clusters by bare project name only. As a result every project
except the org-less "datum-cloud" failed to resolve: mcmanager routed the
unmatched name (ultimately the empty string) to the local host cluster,
which has no compute CRDs, so Reconcile failed with "no matches for kind
WorkloadDeployment" in a hot loop (~2 errors/sec observed on staging).

Extract the bare project name (final path segment) so it matches the
provider key, and guard the mapping with GetCluster: if the project
cluster isn't engaged yet, drop the event instead of enqueuing a request
that falls back to the host cluster and errors. Dropping is safe — once
the provider engages the cluster, the For watch reconciles it and the
next downstream status event maps cleanly.

Rename decodeUpstreamClusterName to projectClusterNameFromLabel to
reflect that it now returns the provider cluster key, and add the
not-engaged drop case to the mapping test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
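
A minimal sketch of the mapping the commit above describes, written against plain Go types rather than the real controller-runtime and Milo APIs. The label keys, the `request` type, and the `engaged` callback (standing in for the GetCluster guard) are assumptions for illustration; only the function name `projectClusterNameFromLabel` and the `cluster-<org>_<project>` format come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical label keys; the repository's real constants may differ.
const (
	upstreamOwnerNamespaceLabel   = "example.test/upstream-owner-namespace"
	upstreamOwnerClusterNameLabel = "example.test/upstream-owner-cluster-name"
)

// request is a simplified stand-in for a cluster-aware reconcile request.
type request struct {
	ClusterName, Namespace, Name string
}

// projectClusterNameFromLabel turns "cluster-<org>_<project>" into the bare
// project name (the final path segment). It returns "" for malformed input.
func projectClusterNameFromLabel(label string) string {
	encoded, ok := strings.CutPrefix(label, "cluster-")
	if !ok || encoded == "" {
		return ""
	}
	path := strings.ReplaceAll(encoded, "_", "/")
	return path[strings.LastIndex(path, "/")+1:]
}

// mapDownstreamDeployment maps a downstream WorkloadDeployment event to the
// project-side request. It drops the event (returns nil) when a label is
// missing or the project cluster is not engaged yet.
func mapDownstreamDeployment(name string, wdLabels, nsLabels map[string]string, engaged func(string) bool) []request {
	cluster := projectClusterNameFromLabel(nsLabels[upstreamOwnerClusterNameLabel])
	namespace := wdLabels[upstreamOwnerNamespaceLabel]
	if cluster == "" || namespace == "" || !engaged(cluster) {
		return nil
	}
	return []request{{ClusterName: cluster, Namespace: namespace, Name: name}}
}

func main() {
	engaged := func(cluster string) bool { return cluster == "web" }
	wdLabels := map[string]string{upstreamOwnerNamespaceLabel: "default"}
	nsLabels := map[string]string{upstreamOwnerClusterNameLabel: "cluster-acme_web"}
	fmt.Println(mapDownstreamDeployment("api", wdLabels, nsLabels, engaged))
}
```

Returning nil instead of a request with an unresolved cluster name is the point of the guard: a dropped event costs nothing, while a request that falls back to the host cluster errors and requeues.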
The downstream WorkloadDeployment status watch was a complete no-op and
the source of a steady ~130 errors/min on the management plane. Two
layered causes:

milosource.NewClusterSource binds the raw source to the empty cluster
name, and the default mchandler.TypedEnqueueRequestsFromMapFunc wraps the
map in TypedInjectCluster, which overwrites each request's ClusterName
with that bound empty name. So the project cluster name computed by
mapDownstreamDeploymentToRequest (and validated by its GetCluster guard)
was discarded at enqueue time; every downstream event reached Reconcile
with ClusterName="". mcmanager routes the empty name to the local host
management cluster, which has no compute CRDs, so the Get failed with
"no matches for kind WorkloadDeployment" and requeued in a hot loop —
while the watch's actual purpose (immediate status mirror-back) never
ran for any project.

Switch the handler to TypedEnqueueRequestsFromMapFuncWithClusterPreservation
so the map's project cluster name survives to Reconcile, making the
downstream watch functional. Add a defensive guard at the top of Reconcile
that drops (returns nil, not an error) any request with an empty cluster
name, so a host-cluster fallback can never again spin in a requeue loop.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tance name

An Instance could wedge Pending forever (QuotaGranted=Unknown/QuotaNoBudget,
Quota scheduling gate never removed) even though its Milo ResourceClaim was
granted: the Instance reconciled once while the claim was still pending, and
nothing re-triggered it when the grant landed a beat later. The ResourceClaim
watch mapped a claim to its Spec.ResourceRef — the Project — so the grant
enqueued the project name, never the owning Instance.

Fix the watch to enqueue the owning Instance: its namespace is carried on a new
compute.datumapis.com/instance-namespace label (the claim lives in the project
quota namespace, not the Instance's), and its name is the claim name with the
resource-kind prefix stripped.

Also name the claim after the Instance (unique among Instances in the project
control plane) with an "instance-" prefix so it cannot collide with other
resource kinds' claims sharing the quota namespace, replacing the previous
"<namespace>--<name>" scheme.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
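
The naming scheme and reverse mapping from the commit above can be sketched as follows. The label key and the "instance-" prefix are taken from the commit message; the `objectKey` type and both function names are hypothetical, and the real watch handler works on ResourceClaim objects rather than bare strings.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	// Label carrying the owning Instance's namespace, as named in the commit.
	instanceNamespaceLabel = "compute.datumapis.com/instance-namespace"
	// Resource-kind prefix that keeps Instance claims apart from other kinds.
	claimNamePrefix = "instance-"
)

// objectKey is a simplified stand-in for a namespaced name.
type objectKey struct {
	Namespace, Name string
}

// claimNameForInstance derives the claim name from the Instance name.
func claimNameForInstance(instanceName string) string {
	return claimNamePrefix + instanceName
}

// instanceKeyForClaim recovers the owning Instance from a claim's name and
// labels. It reports false when the claim does not belong to an Instance.
func instanceKeyForClaim(claimName string, claimLabels map[string]string) (objectKey, bool) {
	name, ok := strings.CutPrefix(claimName, claimNamePrefix)
	namespace := claimLabels[instanceNamespaceLabel]
	if !ok || name == "" || namespace == "" {
		return objectKey{}, false
	}
	return objectKey{Namespace: namespace, Name: name}, true
}

func main() {
	claim := claimNameForInstance("web-0")
	key, ok := instanceKeyForClaim(claim, map[string]string{instanceNamespaceLabel: "default"})
	fmt.Println(claim, key, ok)
}
```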
… them

A template-hash change (an image update, or a restartedAt annotation from
`datumctl compute restart`) previously resolved to an in-place Update of the
Instance. The unikraft provider bakes the pod at creation time and never
recomputes an existing pod's spec, so the in-place update silently failed to
roll the running workload — instances kept their old pod.

Emit a delete (recreate) for drifted Ready instances instead. The next
reconcile refills the slot via the create path with the new template, and the
provider's finalizer-gated teardown plus create-on-new-Instance roll the pod
with no provider changes. Ordered one-at-a-time pacing is preserved by the
existing descending-ordinal sort, skip-all-but-first, and the
DeletionTimestamp WaitAction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
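
As a rough illustration of the pacing described in the commit above, the sketch below picks at most one action per reconcile pass. All types and names here are hypothetical simplifications; the actual instance-control packages are not shown in this PR page, so this models only the stated rules (descending-ordinal order, one deletion at a time, wait while a deletion is in flight).

```go
package main

import (
	"fmt"
	"sort"
)

// instance is a simplified view of an Instance for rollout planning.
type instance struct {
	Ordinal      int
	TemplateHash string
	Ready        bool
	Deleting     bool // true once a DeletionTimestamp is set
}

// action is the single step chosen for this reconcile pass.
type action struct {
	Kind    string // "Wait" or "Delete"
	Ordinal int
}

// nextRolloutAction returns the one action to take now, or false when no
// Ready instance has drifted from the desired template hash.
func nextRolloutAction(instances []instance, desiredHash string) (action, bool) {
	sorted := append([]instance(nil), instances...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Ordinal > sorted[j].Ordinal })

	// A deletion already in flight blocks any further deletes.
	for _, in := range sorted {
		if in.Deleting {
			return action{Kind: "Wait", Ordinal: in.Ordinal}, true
		}
	}
	// Recreate only the highest-ordinal drifted Ready instance; skip the rest.
	for _, in := range sorted {
		if in.Ready && in.TemplateHash != desiredHash {
			return action{Kind: "Delete", Ordinal: in.Ordinal}, true
		}
	}
	return action{}, false
}

func main() {
	instances := []instance{
		{Ordinal: 0, TemplateHash: "old", Ready: true},
		{Ordinal: 1, TemplateHash: "old", Ready: true},
		{Ordinal: 2, TemplateHash: "new", Ready: true},
	}
	fmt.Println(nextRolloutAction(instances, "new"))
}
```

The deleted slot is refilled by the ordinary create path on a later pass, so the planner never needs an in-place update action.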
The Instance "Running" status condition is renamed to "Available" (wire
value "Available"). An instance can be available while not actively
running a pod (e.g. scaled to zero), so "Running" was misleading as a
serving/health signal.

Renamed constants:
  InstanceRunning                -> InstanceAvailable               ("Available")
  InstanceReadyReasonRunning     -> InstanceReadyReasonAvailable    ("Available")
  InstanceRunningReasonRunning   -> InstanceAvailableReasonAvailable ("Available")
  InstanceRunningReasonStopped   -> InstanceAvailableReasonStopped
  InstanceRunningReasonStarting  -> InstanceAvailableReasonStarting
  InstanceRunningReasonStopping  -> InstanceAvailableReasonStopping

BREAKING CHANGE: the on-the-wire Instance condition type changes from
"Running" to "Available". Consumers reading conditions[type=="Running"]
must switch to "Available". Existing Instances self-heal on the next
provider reconcile (the provider re-asserts the condition under its new
name); the stale "Running" entry lingers cosmetically until then and is
no longer read by the Ready derivation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eals

The instance controller is re-queued by a ResourceClaim watch when the
claim is granted, but that grant event lives on the project control plane
and can be missed (informer engagement races, watch relist gaps),
wedging the instance at QuotaGranted!=True indefinitely (observed: claim
Granted, instance stuck QuotaNoBudget until a manual reconcile cleared
it). The pending-quota path returned no RequeueAfter, so there was no
safety net.

Add a backing-off requeue while QuotaGranted is not True, anchored on the
condition's last transition:

  <60s : 1s     (catch a grant landing almost immediately)
  60s–5m : 15s
  5m–10m : 60s
  >=10m : 300s

Folded into the existing referenced-data requeue (soonest wins). The
ResourceClaim watch remains the fast path; this only guarantees a missed
grant self-heals instead of wedging.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
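
The tier table in the commit above translates into a small lookup. This is a sketch, not the controller's code: the function and constant names are hypothetical, each tier is assumed to be inclusive at its lower bound, and `soonest` assumes a zero duration means "no requeue requested".

```go
package main

import (
	"fmt"
	"time"
)

// Requeue intervals for the four tiers listed in the commit message.
const (
	fastInterval = 1 * time.Second
	warmInterval = 15 * time.Second
	slowInterval = 60 * time.Second
	idleInterval = 300 * time.Second
)

// quotaRequeueInterval picks the requeue delay for an instance that has been
// waiting on quota for the given elapsed time.
func quotaRequeueInterval(elapsed time.Duration) time.Duration {
	switch {
	case elapsed < 60*time.Second:
		return fastInterval
	case elapsed < 5*time.Minute:
		return warmInterval
	case elapsed < 10*time.Minute:
		return slowInterval
	default:
		return idleInterval
	}
}

// soonest folds two requeue delays together, treating zero or negative
// values as "no requeue requested".
func soonest(a, b time.Duration) time.Duration {
	switch {
	case a <= 0:
		return b
	case b <= 0:
		return a
	case a < b:
		return a
	default:
		return b
	}
}

func main() {
	for _, elapsed := range []time.Duration{5 * time.Second, 2 * time.Minute, 7 * time.Minute, time.Hour} {
		fmt.Println(elapsed, quotaRequeueInterval(elapsed))
	}
	fmt.Println(soonest(quotaRequeueInterval(2*time.Minute), 30*time.Second))
}
```

Because the function takes an elapsed duration rather than a timestamp, the choice of anchor (condition transition time or creation time) stays with the caller.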
…roof

The pending-quota safety-net requeue was wired only at the tail of
Reconcile, so an early return during the pending window (a status-update
or upstream-writeback conflict) silently dropped it onto controller-
runtime's exponential error-backoff — which can stretch to minutes,
leaving an instance wedged at QuotaGranted!=True even though its
ResourceClaim was granted (observed: the 2nd instance in a rapid burst
consistently wedged).

- Compute the requeue once, up front, so every return path honors it.
- On a Conflict during the pending window, requeue at the bounded quota
  interval instead of returning the error (which would back off).
- Log the requeue decision (and conflict-driven requeues) so the path is
  observable: a re-firing requeue prints every pass while pending, a
  dropped one does not.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
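
A simplified model of the conflict handling described in the commit above. `errConflict` stands in for a Kubernetes optimistic-concurrency conflict (the real controller presumably checks the API error type), and `result` and `reconcileStep` are hypothetical stand-ins for the controller-runtime result and the Reconcile body.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errConflict is an assumed sentinel for a write conflict.
var errConflict = errors.New("conflict")

// result mirrors the one field of a reconcile result that matters here.
type result struct {
	RequeueAfter time.Duration
}

// reconcileStep takes the quota requeue computed up front (zero when quota is
// not pending) and applies it on every return path. A conflict during the
// pending window becomes a bounded requeue instead of an error, so the
// request does not fall onto exponential error backoff.
func reconcileStep(quotaRequeue time.Duration, writeStatus func() error) (result, error) {
	if err := writeStatus(); err != nil {
		if quotaRequeue > 0 && errors.Is(err, errConflict) {
			fmt.Printf("conflict while quota pending, requeue in %s\n", quotaRequeue)
			return result{RequeueAfter: quotaRequeue}, nil
		}
		return result{}, err
	}
	return result{RequeueAfter: quotaRequeue}, nil
}

func main() {
	res, err := reconcileStep(15*time.Second, func() error { return errConflict })
	fmt.Println(res.RequeueAfter, err)
}
```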
… LTT

Observability revealed the safety-net requeue was firing every reconcile
but always at the slowest tier (300s): elapsed was measured from the
QuotaGranted condition's LastTransitionTime, which stays at the
1970-01-01 CRD default while quota is pending (PendingEvaluation and
NoBudget are both Unknown, so SetStatusCondition never bumps it). Result:
a watch-missed instance waited up to 5 minutes for the safety net instead
of ~1s, appearing wedged.

Anchor elapsed on instance.CreationTimestamp, which reflects actual wait
time, so the fast tiers (1s/15s) apply early as intended.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The instance controller emits Warning events on Instances (QuotaNoBudget,
ImageUnavailable, InstanceCrashing, ConfigurationError, NetworkFailedToCreate,
…) via the event recorder, but no RBAC rule granted it. Every write was
rejected — "events is forbidden: ... cannot create resource events in API
group \"\" in the namespace ns-<uid>" — so the user-facing signals explaining
why an instance is stuck never reached the Instance (kubectl describe /
activity timeline). Reconciliation was unaffected; this is an observability gap.

Add the kubebuilder marker and regenerate the role. The regen also syncs a
pre-existing work.karmada.io/resourcebindings rule (from an existing marker
that wasn't reflected in the committed role).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rvedGeneration

A restart/rolling update was invisible from the project plane: there was no
status field representing how many instances are on the new template revision.
Add UpdatedReplicas (instances whose observed template hash matches the desired
template, regardless of readiness) and ObservedGeneration to both
WorkloadDeployment and Workload (plus placement) status.

UpdatedReplicas is computed on the cell WD reconcile alongside CurrentReplicas
(which is now its Programmed subset), aggregated up into the Workload, and rides
the existing status sync to the project plane. Repoint the "Up-to-date"
printcolumn to .status.updatedReplicas to match `kubectl get deployment`
semantics, so a roll is visible as the count dips below Replicas and recovers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
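
To make the UpdatedReplicas definition in the commit above concrete, here is a small sketch of the count and its aggregation into the Workload. The `instanceView` type and both function names are hypothetical; the only rule taken from the commit is that an instance counts as updated when its observed template hash matches the desired one, regardless of readiness.

```go
package main

import "fmt"

// instanceView is a simplified view of an Instance's status.
type instanceView struct {
	ObservedTemplateHash string
	Ready                bool
}

// updatedReplicas counts instances on the desired template revision.
// Readiness is deliberately ignored.
func updatedReplicas(instances []instanceView, desiredHash string) int32 {
	var n int32
	for _, in := range instances {
		if in.ObservedTemplateHash == desiredHash {
			n++
		}
	}
	return n
}

// aggregateUpdated sums per-deployment counts into the Workload-level value.
func aggregateUpdated(perDeployment []int32) int32 {
	var total int32
	for _, n := range perDeployment {
		total += n
	}
	return total
}

func main() {
	cellA := []instanceView{{"v2", true}, {"v2", false}, {"v1", true}}
	cellB := []instanceView{{"v1", true}}
	a, b := updatedReplicas(cellA, "v2"), updatedReplicas(cellB, "v2")
	fmt.Println(a, b, aggregateUpdated([]int32{a, b}))
}
```

During a roll the count drops below the replica total as old instances are deleted and climbs back as their replacements appear on the new hash.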
…emory

Two Instance-controller correctness changes:

- Blocking-reason rollup: surface the most specific provider sub-condition
  (ImageUnavailable, InstanceCrashing, ConfigurationError, Provisioning) and its
  message onto the Instance Ready condition instead of a generic "Instance has
  not been programmed", so e.g. an image-pull failure reads as ImageUnavailable
  with the real message. Adds the reason constants and ranks them in the
  blocking-reason priority.

- Quota sizing: resolve vCPU/memory for instanceType-sized instances from a new
  instanceTypeCatalog (datumcloud/d1-standard-2 = 1 vCPU / 2 GiB) so the quota
  ResourceClaim requests vcpus + memory, not just instance count. Explicit
  container limits / instance requests still take precedence.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
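
The quota-sizing rule in the commit above can be sketched as a catalog lookup with an explicit override. The `sizing` type and `resolveSizing` function are hypothetical; the catalog entry (datumcloud/d1-standard-2 = 1 vCPU / 2 GiB) and the precedence of explicit values come from the commit message.

```go
package main

import "fmt"

// sizing is the vCPU and memory amount a quota claim would request.
type sizing struct {
	VCPUs       int64
	MemoryBytes int64
}

const gib = int64(1) << 30

// instanceTypeCatalog maps an instance type to its resources.
var instanceTypeCatalog = map[string]sizing{
	"datumcloud/d1-standard-2": {VCPUs: 1, MemoryBytes: 2 * gib},
}

// resolveSizing prefers explicit limits or requests when present and falls
// back to the catalog otherwise. It reports false for unknown types.
func resolveSizing(instanceType string, explicit *sizing) (sizing, bool) {
	if explicit != nil {
		return *explicit, true
	}
	s, ok := instanceTypeCatalog[instanceType]
	return s, ok
}

func main() {
	fmt.Println(resolveSizing("datumcloud/d1-standard-2", nil))
	fmt.Println(resolveSizing("datumcloud/d1-standard-2", &sizing{VCPUs: 4, MemoryBytes: 8 * gib}))
}
```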
… tests

Make the cherry-picked instanceType-sizing and blocking-reason tests
lint-clean: hoist the repeated "datumcloud/d1-standard-2", "app", and
"test/image:latest" literals into named constants (goconst) and apply
gofmt. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@scotwells scotwells force-pushed the feat/federated-deployment-scheduling branch from b45810f to 73177eb Compare June 5, 2026 18:38
@scotwells scotwells changed the base branch from main to split/api-rename June 5, 2026 18:39
Base automatically changed from split/api-rename to main June 5, 2026 19:56
scotwells and others added 3 commits June 8, 2026 13:28
The "cluster-<name>" label encoding (project path with "/" → "_") was
open-coded at five sites with two of them re-deriving the decode by hand.
Extract EncodeClusterName/DecodeClusterName so the wire format lives in one
place. The federator keeps its distinct semantics — it decodes then trims to
the last path segment because the multicluster provider keys clusters by bare
project name — now by wrapping the shared decoder rather than re-implementing it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
singleModeProjectID/singleModeProjectNamespace/readEdgeNamespace decoded edge
namespace labels into project identity — domain logic that had no business in
package main. Move them into the controller package as NewSingleModeProjectID/
NewSingleModeProjectNamespace constructors. main.go keeps only the wiring.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
KarmadaClient was never assigned, so writeStatusToKarmada always early-returned
nil. WorkloadDeployment status reaches Karmada via the cell-local Status().Update
plus the statusAggregation interpreter, not a controller push. Remove the field,
the method, and its call site.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Define integration strategy with federated control plane for workload deployment scheduling

3 participants