
Add burst-proof admission control and fast resource accounting #187

Open

sjmiller609 wants to merge 5 commits into main from codex/disk-admission-burst-proofing

Conversation

@sjmiller609 sjmiller609 (Collaborator) commented Apr 5, 2026

Summary

  • add burst-proof admission reservations for create, start, and restore
  • enforce disk in the same aggregate admission-control path as cpu, memory, network, and disk I/O
  • speed up resource accounting with cached visible allocations plus cached image, volume, and OCI totals
  • share the same fast visible-allocation path with resource status and hypeman resources

The main thing to balance in this PR is keeping admission reservations consistent (i.e. burst-proof) without slowing down concurrent creation.

Existing Issues

  • Admission control could oversubscribe during bursts of concurrent requests
  • Disk was not part of admission control
  • The hypeman resources command was too slow

What This Changes

  • in-flight lifecycle operations now reserve capacity before slow work starts, so concurrent requests see each other
  • disk admission now uses the same logical/provisioned disk model already reported by Hypeman resources
  • admission control and resource status now use the same cached visible allocation path, so accounting is consistent and much faster
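The reserve-before-slow-work idea described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation; names like `Reservations` and `TryReserve` are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Reservations sketches a pending-reservation tracker: in-flight
// lifecycle operations claim capacity up front, before any slow work
// (image pulls, disk setup) starts, so concurrent admissions see each
// other instead of all passing the same stale capacity check.
type Reservations struct {
	mu       sync.Mutex
	capacity int64 // total bytes the host can admit
	reserved int64 // bytes held by committed + in-flight operations
}

var ErrOverCapacity = errors.New("admission: over capacity")

// TryReserve atomically checks and claims capacity for an in-flight
// operation. Callers must Release on failure paths, or once the
// instance becomes visible to normal allocation tracking.
func (r *Reservations) TryReserve(bytes int64) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.reserved+bytes > r.capacity {
		return ErrOverCapacity
	}
	r.reserved += bytes
	return nil
}

// Release returns capacity claimed by TryReserve.
func (r *Reservations) Release(bytes int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.reserved -= bytes
}

func main() {
	r := &Reservations{capacity: 100}
	fmt.Println(r.TryReserve(60) == nil) // true
	fmt.Println(r.TryReserve(60) == nil) // false: would oversubscribe
	r.Release(60)
	fmt.Println(r.TryReserve(60) == nil) // true again
}
```

The key property is that the check and the claim happen under one lock, so two bursty requests can never both pass a check against the same free capacity.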

Testing

  • go test ./lib/resources
  • benchmarked current branch against dev-yul-hypeman-1 with a temporary read-only probe built from this code
  • ReserveAllocation: mean 27.6 us, p95 56.1 us
  • GetFullStatus / hypeman resources path: mean 20.0 ms, p95 26.2 ms
  • validated live resource output against the host and confirmed cpu, memory, network, disk I/O, per-instance allocations, and disk overlay/volume accounting matched; image and OCI cache totals matched direct filesystem totals on host, improving accuracy vs currently deployed code

Note

Medium Risk
Touches core admission/resource accounting paths and introduces new in-memory reservation/caching behavior, which could affect capacity enforcement under concurrency. Changes are internal but impact scheduling decisions across CPU/memory/disk/network/disk I/O.

Overview
Implements burst-proof admission control by adding pending resource reservations for in-flight create/start/restore operations, and releasing them only once the instance is marked visible to allocation tracking.

Extends aggregate admission validation to include disk bytes (overlay + volume overlays + images + OCI cache + volumes), and speeds up accounting by caching a conservative instance-allocation view in the instances manager plus cached totals for image/OCI cache bytes and volume bytes (with startup warm-up and incremental refresh on lifecycle/storage changes).
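The cached-totals approach described above (startup warm-up plus incremental refresh) can be sketched as below. All names are illustrative, not the PR's real identifiers:

```go
package main

import (
	"fmt"
	"sync"
)

// cachedTotal sketches the caching idea: an expensive scan (e.g.
// walking image or OCI cache directories) runs once at startup, after
// which lifecycle/storage events adjust the total incrementally
// instead of re-scanning the filesystem on every query.
type cachedTotal struct {
	mu    sync.RWMutex
	bytes int64
	warm  bool
}

// Warm seeds the cache from a full scan (startup warm-up); later calls
// are no-ops so the scan runs at most once.
func (c *cachedTotal) Warm(scan func() int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.warm {
		c.bytes = scan()
		c.warm = true
	}
}

// Adjust applies an incremental delta on storage changes, e.g. an
// image pull or volume deletion.
func (c *cachedTotal) Adjust(delta int64) {
	c.mu.Lock()
	c.bytes += delta
	c.mu.Unlock()
}

// Bytes is the fast read path shared by admission and status.
func (c *cachedTotal) Bytes() int64 {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.bytes
}

func main() {
	var images cachedTotal
	images.Warm(func() int64 { return 4096 }) // pretend filesystem scan
	images.Adjust(1024)                       // e.g. a new image pulled
	fmt.Println(images.Bytes())               // 5120
}
```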

Reviewed by Cursor Bugbot for commit 9566638. Bugbot is set up for automated code reviews on this repo.


@sjmiller609 sjmiller609 requested a review from hiroTamada April 6, 2026 15:18
@sjmiller609 sjmiller609 marked this pull request as ready for review April 6, 2026 15:18
@sjmiller609 sjmiller609 (Collaborator, Author)

Addressed the review follow-ups in the latest commits:

  • switched the read-only CanAllocate / ValidateAllocation paths to RLock
  • documented the intentional visible-then-finish ordering around reservations
  • documented the warmup vs runtime active-state distinction in the cached allocation path
  • hardened disk usage accounting with conservative filesystem fallbacks plus focused tests
  • fixed the restore stale-cache case after guest-network reconfigure failure by rolling the cached visible allocation back before returning the error
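The RLock change in the first bullet above follows the standard Go read/write-lock split; a minimal sketch (the type and method names here are hypothetical, not the PR's):

```go
package main

import (
	"fmt"
	"sync"
)

// allocations illustrates the split: read-only checks take RLock so
// concurrent validations don't serialize behind each other, while
// mutations take the exclusive lock.
type allocations struct {
	mu   sync.RWMutex
	used map[string]int64 // instance ID -> allocated bytes
}

// CanAllocate is read-only, so a shared RLock suffices and many
// callers can validate concurrently.
func (a *allocations) CanAllocate(limit, want int64) bool {
	a.mu.RLock()
	defer a.mu.RUnlock()
	var total int64
	for _, v := range a.used {
		total += v
	}
	return total+want <= limit
}

// Commit mutates state and therefore needs the exclusive lock.
func (a *allocations) Commit(id string, bytes int64) {
	a.mu.Lock()
	a.used[id] = bytes
	a.mu.Unlock()
}

func main() {
	a := &allocations{used: map[string]int64{}}
	a.Commit("vm-1", 40)
	fmt.Println(a.CanAllocate(100, 50)) // true: 40+50 <= 100
	fmt.Println(a.CanAllocate(100, 70)) // false: 40+70 > 100
}
```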

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.



state := string(StateStopped)
if active {
	state = string(StateCreated)
}
Admission allocation uses stale state for "Created" only

Low Severity

allocationFromStoredMetadata maps all active instances to state "Created" and all inactive to "Stopped". GetFullStatus in resource.go filters per-instance allocation breakdowns with a hardcoded check for "Running" || "Paused" || "Created". Since both paths now share the same ListInstanceAllocations source, instances that were previously in "Initializing" state (and excluded from the allocation breakdown) are now included as "Created". This is a subtle behavioral change to the /resources API response for any consumer relying on the previous filtering semantics.
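The semantic shift described above can be illustrated with a minimal version of the hardcoded filter (the state names follow the comment; the helper function is hypothetical):

```go
package main

import "fmt"

// includedInBreakdown mirrors the hardcoded check described above:
// only these states appear in the per-instance allocation breakdown
// returned by GetFullStatus.
func includedInBreakdown(state string) bool {
	return state == "Running" || state == "Paused" || state == "Created"
}

func main() {
	// Previously an in-flight instance could report "Initializing" and
	// be filtered out of the breakdown. The new mapping folds every
	// active instance into "Created", so it now passes the filter.
	fmt.Println(includedInBreakdown("Initializing")) // false: old reporting
	fmt.Println(includedInBreakdown("Created"))      // true: new mapping
}
```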



	State:       state,
	VolumeBytes: volumeBytes,
}
}

Volume lookups hold admission lock causing contention

Medium Severity

allocationFromStoredMetadata calls m.volumeManager.GetVolume (disk I/O per volume) with context.Background() while every caller holds admissionAllocationsMu.Lock(). This runs on every saveMetadata call via syncAdmissionAllocation and on every setAdmissionAllocationActive call, blocking all concurrent ListInstanceAllocations readers. For instances with many volumes, this lock-holding duration scales linearly with attached volume count.
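One conventional way to avoid the contention described above is to snapshot the volume ID list under the lock and resolve sizes after releasing it, so the lock-hold time no longer scales with volume count. A rough sketch, with all names hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

type manager struct {
	mu      sync.Mutex
	volumes map[string][]string // instance ID -> attached volume IDs
}

// volumeBytes stands in for the per-volume lookup; in the real code
// this is disk I/O (volumeManager.GetVolume), which is exactly why it
// must not run while the admission lock is held.
func volumeBytes(id string) int64 { return 1 << 20 }

// totalVolumeBytes copies the ID slice under the lock, then performs
// the slow lookups after releasing it, so concurrent readers are only
// blocked for the duration of the copy.
func (m *manager) totalVolumeBytes(instance string) int64 {
	m.mu.Lock()
	ids := append([]string(nil), m.volumes[instance]...) // copy under lock
	m.mu.Unlock()

	var total int64
	for _, id := range ids {
		total += volumeBytes(id) // slow I/O, lock not held
	}
	return total
}

func main() {
	m := &manager{volumes: map[string][]string{"vm-1": {"a", "b"}}}
	fmt.Println(m.totalVolumeBytes("vm-1")) // 2097152 (2 MiB)
}
```

The trade-off is that the sizes can be momentarily stale relative to a concurrent detach, which is usually acceptable for conservative admission accounting.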


