Skip to content

Add pool/claim types and controllers for numbered resource allocation#258

Open
felix-kaestner wants to merge 7 commits into
mainfrom
numbered-resources
Open

Add pool/claim types and controllers for numbered resource allocation#258
felix-kaestner wants to merge 7 commits into
mainfrom
numbered-resources

Conversation

@felix-kaestner
Copy link
Copy Markdown
Contributor

@felix-kaestner felix-kaestner commented Mar 25, 2026

Introduces a pool-based allocation system for numbered resources (ASNs, VLAN IDs, IP addresses, prefixes). Three pool types — IndexPool, IPAddressPool, IPPrefixPool — define the available value ranges. A Claim references a pool via spec.poolRef and the controller allocates the next available value by creating a dedicated allocation object (Index, IPAddress, or IPPrefix).

Allocation objects are first-class Kubernetes resources with a deterministic name derived from the pool name and value (e.g. asn-pool-64512). This naming scheme uses the API server's name-uniqueness guarantee as the concurrency guard: two controllers racing for the same value both attempt Create, exactly one succeeds, and the other retries with a fresh list. Each allocation object carries a spec.claimRef binding it to the owning claim and a spec.poolRef back to the pool.

Pools support Recycle (default) and Retain reclaim policies. On claim deletion, Recycle deletes the allocation object to free the value. Retain clears the claimRef so the allocation persists as reserved but unbound — it can be rebound later using the allow-binding annotation. Owner references from the pool to both claims and allocation objects enable automatic garbage collection when a pool is deleted.

Pool types implement a Pool interface with Allocate and ListAllocations methods. Allocation types implement an Allocation interface with GetClaimRef, SetClaimRef, and GetValue. The claim controller operates generically through these interfaces with no type-specific logic beyond a single kind switch for pool instantiation.

Concept

numbered-resources

Usage

$ kubectl get idxpool -o wide
NAME               ALLOCATED   TOTAL      AVAILABLE   AGE
indexpool-sample   1           94968318   True        9s

$ kubectl get ippool -o wide
NAME                   ALLOCATED   TOTAL      AVAILABLE   AGE
ipaddresspool-sample   1           17891328   True        12s

$ kubectl get pfxpool -o wide
NAME                  ALLOCATED   TOTAL   AVAILABLE   AGE
ipprefixpool-sample   1           65536   True        16s

$ kubectl get claim -o wide
NAME              VALUE         ALLOCATED   AGE
claim-index       64512         True        19s
claim-ipaddress   10.0.0.0      True        19s
claim-prefix      10.0.0.0/24   True        18s

$ kubectl get idx -o wide
NAME                     POOL               INDEX   CLAIM         VALID   AGE
indexpool-sample-64512   indexpool-sample   64512   claim-index   True    23s

@felix-kaestner felix-kaestner force-pushed the numbered-resources branch 2 times, most recently from fe66c1f to 02d7fbd Compare March 25, 2026 23:54
@hardikdr hardikdr added the area/switch-automation Automation processes for network switch management and operations. label Mar 26, 2026
@hardikdr hardikdr added this to Roadmap Mar 26, 2026
@felix-kaestner felix-kaestner marked this pull request as ready for review March 26, 2026 10:00
@felix-kaestner felix-kaestner requested a review from a team as a code owner March 26, 2026 10:00
@felix-kaestner felix-kaestner force-pushed the numbered-resources branch 5 times, most recently from cc6b628 to f8ff9ec Compare April 29, 2026 13:29
@felix-kaestner felix-kaestner force-pushed the numbered-resources branch 7 times, most recently from 3b85728 to 9d64fe5 Compare May 5, 2026 19:25
The previously used logo had a low resolution and didn't work well in
dark mode and the navigation header. With this change we are replacing
it with the standard IronCore logo which is also inline with other
projects such as the metal-operator.

Signed-off-by: Felix Kästner <felix.kaestner@sap.com>
SVGs are resolution-independent, render natively in browsers,
and avoid Git LFS complications for the docs build.

Signed-off-by: Felix Kästner <felix.kaestner@sap.com>
Signed-off-by: Felix Kästner <felix.kaestner@sap.com>
Introduces three pool types (IndexPool, IPAddressPool, IPPrefixPool) and
a Claim type that references a pool via spec.poolRef. The claim
controller allocates the lowest available resource from the referenced
pool and writes it back to both the pool status and the claim status.
Pools track allocations with ClaimRef+ClaimUID for idempotency, and
support Recycle/Retain reclaim policies on claim deletion. All pool
types expose an Available condition (HasCapacity/Exhausted).

A preferred value can be requested by setting the annotation
pool.networking.metal.ironcore.dev/preferred-value on a Claim. The
controller will attempt to allocate that exact value; if it is outside
the pool's configured ranges or already taken, the claim enters a
terminal error state with reason PreferredValueUnavailable. Removing
the annotation re-triggers reconciliation and falls back to normal
allocation.

Signed-off-by: Felix Kästner <felix.kaestner@sap.com>
Documents the pool-based allocation system introduced alongside the
IndexPool, IPAddressPool, IPPrefixPool, and Claim types. Covers pool
and claim concepts, reclaim policies, the preferred-value annotation,
and allocation result fields. Includes light/dark theme diagram images
and links to sample manifests in the repository.
@felix-kaestner felix-kaestner force-pushed the numbered-resources branch 2 times, most recently from 866f5e6 to 012e6c0 Compare May 11, 2026 12:15
Replace the embedded Status.Allocations lists on pool types with
dedicated Kubernetes objects (Index, IPAddress, IPPrefix) that model
allocations as first-class resources following the PV/PVC pattern.

Previously, the claim controller reserved values by appending entries
to a pool's status and used optimistic locking on the pool status
update to prevent races. This coupled allocation state tightly to the
pool object and made it impossible to pre-provision or inspect
individual allocations.

The new design introduces three allocation types — Index, IPAddress,
and IPPrefix — each with a spec containing the reserved value, a
poolRef back to the owning pool, and an optional claimRef binding it
to a Claim. The claim controller now creates allocation objects with
deterministic names derived from the pool name and value (e.g.
my-pool-64512), using the API server's name-uniqueness guarantee as
the concurrency guard: two controllers racing for the same value
both attempt Create, exactly one succeeds, the other retries via
retry.OnError with a fresh list of used values.

Pool types implement a Pool interface with Allocate and
ListAllocations methods. Allocate receives the existing allocation
objects, builds a used-value set, finds the first free slot, and
returns a ready-to-create allocation object. ListAllocations
encapsulates the typed List call and returns []Allocation, an
interface providing GetClaimRef, SetClaimRef, and GetValue so the
claim controller operates generically without type switches.

Pool controllers count allocations by listing objects via a shared
poolRef field index rather than reading Status.Allocations. The
allocation type controllers (Index, IPAddress, IPPrefix) validate
each object's value against its pool's ranges/prefixes and set a
Valid condition. The claim controller's finalize path applies the
pool's reclaim policy: Recycle deletes the allocation object,
Retain clears its claimRef so the value persists as reserved but
unbound.

The AllowBindingAnnotation enables rebinding of allocation objects
whose claimRef name matches a Claim but whose UID is stale, such
as after a Claim is deleted and recreated with the same name.

Signed-off-by: Felix Kästner <felix.kaestner@sap.com>
Reflect the migration from embedded pool status allocations to
dedicated allocation objects. Document the new allocation types
(Index, IPAddress, IPPrefix), deterministic naming, pre-provisioning
with the allow-binding annotation, and the updated claim status
fields (status.value, status.allocationRef). Update the flow
diagram to match the simplified allocation and reclaim logic.

Signed-off-by: Felix Kästner <felix.kaestner@sap.com>
@github-actions
Copy link
Copy Markdown

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/ironcore-dev/network-operator/api/core/v1alpha1 0.00% (ø)
github.com/ironcore-dev/network-operator/api/pool/v1alpha1 0.00% (ø)
github.com/ironcore-dev/network-operator/cmd 0.00% (ø)
github.com/ironcore-dev/network-operator/internal/controller/pool 74.26% (+74.26%) 🌟

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/ironcore-dev/network-operator/api/core/v1alpha1/addr_types.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/api/core/v1alpha1/index_range.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/api/core/v1alpha1/ref_types.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/api/pool/v1alpha1/claim_types.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/api/pool/v1alpha1/claimref_types.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/api/pool/v1alpha1/doc.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/api/pool/v1alpha1/groupversion_info.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/api/pool/v1alpha1/index_types.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/api/pool/v1alpha1/indexpool_types.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/api/pool/v1alpha1/ipaddress_types.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/api/pool/v1alpha1/ipaddresspool_types.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/api/pool/v1alpha1/ipprefix_types.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/api/pool/v1alpha1/ipprefixpool_types.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/api/pool/v1alpha1/reclaim_policy.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/api/pool/v1alpha1/zz_generated.deepcopy.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/cmd/main.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/internal/controller/pool/claim_controller.go 71.72% (+71.72%) 198 (+198) 142 (+142) 56 (+56) 🌟
github.com/ironcore-dev/network-operator/internal/controller/pool/doc.go 0.00% (ø) 0 0 0
github.com/ironcore-dev/network-operator/internal/controller/pool/index_controller.go 77.19% (+77.19%) 57 (+57) 44 (+44) 13 (+13) 🌟
github.com/ironcore-dev/network-operator/internal/controller/pool/indexpool_controller.go 74.29% (+74.29%) 35 (+35) 26 (+26) 9 (+9) 🌟
github.com/ironcore-dev/network-operator/internal/controller/pool/ipaddress_controller.go 77.19% (+77.19%) 57 (+57) 44 (+44) 13 (+13) 🌟
github.com/ironcore-dev/network-operator/internal/controller/pool/ipaddresspool_controller.go 74.29% (+74.29%) 35 (+35) 26 (+26) 9 (+9) 🌟
github.com/ironcore-dev/network-operator/internal/controller/pool/ipprefix_controller.go 77.19% (+77.19%) 57 (+57) 44 (+44) 13 (+13) 🌟
github.com/ironcore-dev/network-operator/internal/controller/pool/ipprefixpool_controller.go 74.29% (+74.29%) 35 (+35) 26 (+26) 9 (+9) 🌟
github.com/ironcore-dev/network-operator/internal/controller/pool/poolref.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/ironcore-dev/network-operator/api/pool/v1alpha1/indexpool_types_test.go
  • github.com/ironcore-dev/network-operator/api/pool/v1alpha1/ipaddresspool_types_test.go
  • github.com/ironcore-dev/network-operator/api/pool/v1alpha1/ipprefixpool_types_test.go
  • github.com/ironcore-dev/network-operator/internal/controller/pool/claim_controller_test.go
  • github.com/ironcore-dev/network-operator/internal/controller/pool/index_controller_test.go
  • github.com/ironcore-dev/network-operator/internal/controller/pool/indexpool_controller_test.go
  • github.com/ironcore-dev/network-operator/internal/controller/pool/ipaddress_controller_test.go
  • github.com/ironcore-dev/network-operator/internal/controller/pool/ipaddresspool_controller_test.go
  • github.com/ironcore-dev/network-operator/internal/controller/pool/ipprefix_controller_test.go
  • github.com/ironcore-dev/network-operator/internal/controller/pool/ipprefixpool_controller_test.go
  • github.com/ironcore-dev/network-operator/internal/controller/pool/suite_test.go

Copy link
Copy Markdown
Contributor

@nikatza nikatza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic contribution!! 🚀🚀🚀

Since this is a very large PR I came with a larger number of comments than usual, but they are mostly to help me understand the new concepts, so please bear with me.

Most of my comments reduce to one of these 3 points:

  • Pools are quite central to the Fabric controller (or users). Hence, we may want to increase our test coverage to make sure we don't run into regressions by accident in the future.
  • Unless there is a valid use case, I think I would try to make ReclaimPolicy immutable and avoid complex management corner cases for the time being.
  • We should extend our user docs with the expected behavior for a resizing of the pool. In this regard, are we covering it correctly? For example, after downsizing a pool, e.g., an allocated prefix gets removed, the controllers sets anValueOutOfRangeReason, but the Claim remains Allocated and continues to report its value to the controller.

Happy to hear your thoughts on this. Please let me know if any of the comments needs further explanation. But again, amazing work, looking forward merging this PR!

@@ -0,0 +1,92 @@
// SPDX-FileCopyrightText: 2025 SAP SE or an SAP affiliate company and IronCore contributors
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 2025 -> 2026 (other new files are also affected)

Comment on lines +18 to +19
Start uint64 `json:"-"`
End uint64 `json:"-"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API conventions document does not seem to apply here as this is then later serialized as a string in IndexRange. However, this is later used in Index.Spec.Index, which is an int64. In theory we could run into an overflow I think. In fact, the nosec directive indexpool_types.go:155 is (mis?)handling this issue. May be I am missing part of the puzzle: Is my understanding here correct?

Comment thread api/pool/v1alpha1/groupversion_info.go Outdated
const (
// PreferredValueAnnotation is an optional annotation on a Claim that requests a specific
// allocation value. The format depends on the pool type:
// - IndexPool: decimal uint64, e.g. "64512"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the comments from earlier are valid then this should be int64

// Returns ErrPoolExhausted when no free value remains.
Allocate(claim *Claim, existing []Allocation) (Allocation, error)
}
type ClaimSpec struct {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a comment for the function.

Comment on lines +8 to +13
const (
// AllowBindingAnnotation permits an allocation object whose claimRef name matches
// a Claim but whose UID is stale (e.g. after the Claim was deleted and recreated)
// to be rebound by updating the UID to the current Claim.
AllowBindingAnnotation = "pool.networking.metal.ironcore.dev/allow-binding"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a user I would have first looked into groupversion_info.go, seems a better home for this. That said, here is also fine.

UpdateFunc: func(e event.UpdateEvent) bool {
oldPool := e.ObjectOld.(*poolv1alpha1.IPPrefixPool)
newPool := e.ObjectNew.(*poolv1alpha1.IPPrefixPool)
return oldPool.Status.Allocated != newPool.Status.Allocated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider also .Status.Total here? If the pool is exhausted, we change the spec and add new ranges. The controller won't trigger a reconciliation and we are anyhow stucked due to the TerminalError() from the exhaustion.

Suggested change
return oldPool.Status.Allocated != newPool.Status.Allocated
return oldPool.Status.Allocated != newPool.Status.Allocated ||
oldPool.Status.Total != newPool.Status.Total

UpdateFunc: func(e event.UpdateEvent) bool {
oldPool := e.ObjectOld.(*poolv1alpha1.IPAddressPool)
newPool := e.ObjectNew.(*poolv1alpha1.IPAddressPool)
return oldPool.Status.Allocated != newPool.Status.Allocated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as below.

UpdateFunc: func(e event.UpdateEvent) bool {
oldPool := e.ObjectOld.(*poolv1alpha1.IndexPool)
newPool := e.ObjectNew.(*poolv1alpha1.IndexPool)
return oldPool.Status.Allocated != newPool.Status.Allocated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as below.

Reason: poolv1alpha1.PoolExhaustedReason,
Message: "Referenced pool is exhausted",
})
return reconcile.TerminalError(err)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that we can change the pool size in run-time, we might want how we handle this case. If we keep the TerminalError then we probably need to modify the watch above to consider the .Status.Total. Alternatively, we could also check on the Available condition. By this we might ensure a retry if someone frees resources.

"github.com/ironcore-dev/network-operator/internal/conditions"
)

var _ = Describe("Claim Controller", func() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if we should be more exhaustive here and consider the following tests:

  • Pool owns the claim and it owns the allocation
  • Pool Available condition is correctly set
  • Claim conditions are correctly set
  • Allocation claimRef has correct UID
  • Re-reconciliation doesn't double allocate
  • On conflict retry with the next slot
  • Exhausted claims retry once a slot is free
  • Exhausted claims retry after pool is expanded
  • Behavior when pool is reduced?
  • Pool deletion cascades on the claims

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/switch-automation Automation processes for network switch management and operations. size/XXL

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants