[scheduler/cuebot/pycue/rqd/pyoutline] Booking by Slot #2115

Open
DiegoTavares wants to merge 30 commits into AcademySoftwareFoundation:master from DiegoTavares:slot-based-scheduling

@DiegoTavares
Collaborator

@DiegoTavares DiegoTavares commented Dec 17, 2025

Add a new booking mode that doesn't take cores and memory into consideration and instead relies on a predefined limit on how many concurrent frames a host is allowed to run.

Rationale: Booking by slot is useful for pipelines where frames are small and limited not by their
cpu/memory consumption but by other resources like storage bandwidth or network availability. In
these scenarios, limiting the concurrency is more important than the resource consumption.

Attention: This branch is stacked on top of #2002

Tasks:

  • Implement booking logic on Scheduler
  • Add new columns to Host to mark how many slots are available and fill them up on Cuebot
  • Add new column to Layer to define slot limit and implement logic to fill it up on Cuebot
  • Handle new attributes on Host and Layer using pycue
  • Handle new attributes on Host and Layer using pyoutline
  • Allow setting hosts' slot limit on cuegui
  • Change rqd to ignore core count for slot-based booking
  • Change scheduler to mark slot layers as non-threadable

Cuegui Changes

Add a new column to HostMonitorTree to show the number of concurrent slots:
[Image: Screenshot 2025-12-18 at 10 34 19 AM]

Add a new menu option on HostMonitorTree to change the concurrent slots limit value:
[Image: Screenshot 2025-12-18 at 10 34 26 AM]
[Image: Screenshot 2025-12-18 at 10 37 07 AM]

Attention

The new feature is initially being developed on top of the new scheduler module. At this time Cuebot will not take slot booking into consideration. A future PR will handle implementing the Cuebot changes to avoid touching too many files on a single PR.

This field limits the number of concurrent frames allowed to run on a specific host.
This commit is the first step towards a new booking mode that doesn't take
cores and memory into consideration and instead relies on a predefined limit on how many concurrent
frames a host is allowed to run.

Rationale: Booking by slot is useful for pipelines where frames are small and limited not by their
cpu/memory consumption but by other resources like storage bandwidth or network availability. In
these scenarios, limiting the concurrency is more important than the resource consumption.
Add slots_required attribute to layer for slot-based booking
Add a menu action for setting a host's slot limit.
When a limit is defined (0 for no limit, >0 for a specific limit), booking will only allocate layers with slots_required > 0 to this host, which means regular booking by cores/memory/gpu is disabled on it.
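
The rule described in these commit messages can be sketched as a small predicate. This is only an illustration, not the scheduler's actual code: the function name fits_in_slots and its parameters are hypothetical, and it assumes that a limit of 0 or less means the host is not in slot mode.

```python
def fits_in_slots(concurrent_slots_limit: int, running_slots: int, slots_required: int) -> bool:
    """Return True if a slot-limited host can take one more frame (hypothetical helper)."""
    # Assumption: 0 (or a negative value) means "no limit", so the host is
    # booked by cores/memory/gpu instead and this slot check does not apply.
    if concurrent_slots_limit <= 0:
        return False
    # A slot-limited host only accepts layers that declare slots_required > 0.
    if slots_required <= 0:
        return False
    # The frame fits only if it does not push the host past its limit.
    return running_slots + slots_required <= concurrent_slots_limit
```

For example, a host with a limit of 4 and 3 slots in use would accept a frame that needs 1 slot but reject one that needs 2.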

Changes:
 - Add new proto field to Host and NestedHost
 - Change pycue to allow setting concurrent_procs_limit
 - Change cuegui action menu to add an option to update the new field
 - Update Cuebot to receive the request and update the database
@DiegoTavares DiegoTavares marked this pull request as ready for review February 3, 2026 23:10
Collaborator

@ramonfigueiredo ramonfigueiredo left a comment

Hi @DiegoTavares

Some recommended changes are below.

Thanks for your contribution!

@ramonfigueiredo
Collaborator

Missing Slot reporting from RQD to Cuebot

Location: rust/crates/rqd/src/system/machine.rs:932-969

Issue: RQD tracks slots internally but does NOT report slot consumption to Cuebot. The HostReport proto contains CoreDetail and RenderHost but neither has slot fields.

Impact:

  • Cuebot cannot make informed booking decisions without knowing available slots
  • Scheduler's host cache has no slot visibility
  • Multiple dispatchers could over-book hosts

Root Cause: proto/src/report.proto is missing slot fields in CoreDetail (lines 39-49) or RenderHost (lines 70-91)

Required Fix:

  1. Extend CoreDetail message with: int32 total_slots, int32 idle_slots, int32 consumed_slots
  2. Update RQD's collect_host_report() to populate these fields
  3. Update Cuebot's report handler to process slot metrics

@ramonfigueiredo
Collaborator

No Atomic Slot allocation in Java Cuebot

Location: cuebot/src/main/java/com/imageworks/spcue/dao/postgres/HostDaoJdbc.java:569-583

Issue: Slot updates are NOT atomic with respect to slot checking/allocation. No row-level locking or optimistic concurrency control.

Code:

public void updateConcurrentSlotsLimit(HostInterface host, int limit) {
    getJdbcTemplate().update("UPDATE host SET int_concurrent_slots_limit=? WHERE pk_host=?",
        limit, host.getHostId());
}

Missing: Atomic check-and-decrement operation for slot booking:

// Expected but doesn't exist:
UPDATE host_stat
SET int_running_procs = int_running_procs + ?
WHERE pk_host = ?
    AND int_running_procs + ? <= (SELECT int_concurrent_slots_limit FROM host WHERE pk_host = ?)

Impact: Race conditions during concurrent bookings, potential over-booking
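
To make the suggested query concrete, here is a minimal sketch of a conditional update where the check and the increment happen in a single statement. It uses Python's sqlite3 with an in-memory database purely for illustration; Cuebot uses PostgreSQL through JDBC. The table and column names are taken from the query above, while the demo schema and the helper names make_demo_db and try_reserve are assumptions.

```python
import sqlite3


def make_demo_db(limit: int) -> sqlite3.Connection:
    """Create an in-memory database with a single host (hypothetical demo schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE host (pk_host TEXT PRIMARY KEY, int_concurrent_slots_limit INTEGER)")
    conn.execute("CREATE TABLE host_stat (pk_host TEXT PRIMARY KEY, int_running_procs INTEGER)")
    conn.execute("INSERT INTO host VALUES ('host-1', ?)", (limit,))
    conn.execute("INSERT INTO host_stat VALUES ('host-1', 0)")
    conn.commit()
    return conn


def try_reserve(conn: sqlite3.Connection, host_id: str, slots: int) -> bool:
    """Reserve slots only if the host limit is not exceeded."""
    # The limit check lives in the WHERE clause, so no separate read is needed.
    cur = conn.execute(
        "UPDATE host_stat SET int_running_procs = int_running_procs + ? "
        "WHERE pk_host = ? AND int_running_procs + ? <= "
        "(SELECT int_concurrent_slots_limit FROM host WHERE pk_host = ?)",
        (slots, host_id, slots, host_id),
    )
    conn.commit()
    # rowcount is 1 when the row matched, i.e. the reservation fit.
    return cur.rowcount == 1
```

A caller would treat a False result as "host is full" and move on to another host instead of reading the counter first and updating it afterwards.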

@ramonfigueiredo
Collaborator

Spec Version Mismatch in PyOutline

Locations:

  • pyoutline/outline/backend/cue.py:378 - Requires spec_version >= 1.16
  • pyoutline/outline/outline.cfg:8 - Config has spec_version = 1.15

Issue: slots_required will be silently ignored with a warning, causing jobs to fall back to non-slot-based scheduling.

Required fix: Update config to spec_version = 1.16 or update minimum required version

@ramonfigueiredo
Collaborator

ramonfigueiredo commented Feb 10, 2026

No cross-validation between host and layer settings

Locations: Throughout cuebot/cuegui

Issue: No check that layer.slots_required <= host.concurrent_slots_limit

Impact: Could result in layers that can never run

Suggested fix: Add validation in dispatcher or show warnings in GUI
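
One possible shape for such a check, as a hedged sketch: the function find_unrunnable_layers and its inputs are hypothetical and not part of cuebot or cuegui. It only considers slot-enabled hosts (limit > 0) and flags layers whose slots_required exceeds the largest limit among them.

```python
from typing import Dict, Iterable, List


def find_unrunnable_layers(layer_slots: Dict[str, int], host_limits: Iterable[int]) -> List[str]:
    """Return names of layers that no slot-enabled host could ever run."""
    # Largest limit among slot-enabled hosts; 0 if no host has a limit set.
    max_limit = max((limit for limit in host_limits if limit > 0), default=0)
    # A layer is flagged when it asks for more slots than any host offers.
    return sorted(name for name, required in layer_slots.items() if required > max_limit)
```

A GUI could use the returned names to show a warning next to the affected layers, while a dispatcher could use the same check to skip them early.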

@ramonfigueiredo
Collaborator

Missing Slot utilization monitoring

Location: cuegui/cuegui/HostMonitorTree.py

Issue: No column showing "Slots In Use" vs "Slots Available" for real-time monitoring

Suggested Fix: Add columns showing running_procs_count and available slots

@ramonfigueiredo
Collaborator

Missing SQL constraint

Locations: V36 and V38 migrations

Issue: No CHECK constraints to enforce valid ranges:

-- Suggested:
ALTER TABLE layer ADD CONSTRAINT layer_slots_check CHECK (int_slots_required >= 0);
ALTER TABLE host ADD CONSTRAINT host_slots_check CHECK (int_concurrent_slots_limit >= -1);

@ramonfigueiredo
Collaborator

Other minor changes

Inconsistent Terminology in GUI Tooltips

Locations:

  • cuegui/cuegui/HostMonitorTree.py:167 - "Usually: 1 frame = 1 slot"
  • cuegui/cuegui/MenuActions.py:1950 - "usually a frame consumes 1 slot"

Suggested fix: Use consistent phrasing

Column Display Inconsistency

Locations:

  • HostMonitorTree.py:166 - Shows "-" for values < 0
  • LayerMonitorTree.py:167 - Shows "-" for values <= 0

Suggested fix: Use consistent threshold logic
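
One way to keep the two trees consistent is to route both columns through a single formatting helper. The sketch below is an assumption, not code from cuegui: format_slots is a hypothetical name, and the choice of "<= 0" as the shared threshold is only one option the PR could settle on.

```python
def format_slots(value: int) -> str:
    """Render a slot count for a monitor tree column (hypothetical helper)."""
    # Assumption: 0 and negative values both mean "not set", so show a dash.
    return "-" if value <= 0 else str(value)
```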

Missing Docstring Type Annotation

Location: pycue/opencue/wrappers/host.py:131-132

Issue: setConcurrentSlotsLimit() lacks :type: and :param: docstring

Suggested fix: Add type annotations like setSlotsRequired() has

Dummy Cuebot Always Returns -1

Location: rust/crates/dummy-cuebot/src/report_servant.rs:84

Issue: Hardcoded -1 prevents testing slot-based booking with dummy-cuebot

Suggested fix: Make configurable for testing

Deprecated Field Usage

Location: rust/crates/scheduler/src/pipeline/dispatcher/actor.rs:973

Issue: slots_required: 0 hardcoded in deprecated field

Note: May be intentional for backward compatibility

@ramonfigueiredo
Collaborator

ramonfigueiredo commented Feb 10, 2026

Hi @DiegoTavares

The code review is complete. I have added some recommended changes and bug fixes to the PR.

@DiegoTavares
Collaborator Author

Missing Slot reporting from RQD to Cuebot

RQD doesn't need to report the number of allocated slots, as this information is available in the report's list of running procs.

But while reviewing this logic, I found a different problem: Cuebot counts each proc as one slot, without taking each frame's slots_required field into account. I'm implementing a fix for this.

Cuebot now accounts for the sum of the slots consumed by each running frame, rather than 1 slot per frame as before.
- Add runningSlots field to Host protobuf and HostEntity
- Update Whiteboard DAO queries to include runningSlots
- Show available slots (concurrentSlotsLimit - runningSlots) in CueGUI host monitor tree
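
The accounting change in this commit can be sketched as follows. This is an illustration under stated assumptions, not Cuebot's code: RunningProc and both functions are hypothetical names, and negative slots_required values are treated as 0.

```python
from typing import Iterable, NamedTuple


class RunningProc(NamedTuple):
    """Hypothetical stand-in for an entry in a host report's running procs."""
    frame_name: str
    slots_required: int


def running_slots(procs: Iterable[RunningProc]) -> int:
    """Sum the slots consumed by each running frame instead of counting procs."""
    # Assumption: negative values behave like 0.
    return sum(max(proc.slots_required, 0) for proc in procs)


def available_slots(concurrent_slots_limit: int, procs: Iterable[RunningProc]) -> int:
    """Available slots as shown in the host monitor: limit minus running slots."""
    return concurrent_slots_limit - running_slots(procs)
```

With two running frames that need 1 and 3 slots, a proc count would report 2 slots in use, while the sum reports 4.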
@DiegoTavares
Collaborator Author

Missing Slot utilization monitoring

Added on 091a98b

@DiegoTavares
Collaborator Author

Missing SQL constraint

Negative values behave the same as 0 throughout the application. Enforcing a constraint would only introduce a possible exception that has to be handled in awkward places where functions are expected to be total.

@DiegoTavares
Collaborator Author

Sorry about the weird state this PR was in. It was implemented in 2 parts with the new year break in between. Although each part was tested in isolation, I failed to test the integrated solution. I believe all the suggestions have been applied or handled.

Collaborator

@ramonfigueiredo ramonfigueiredo left a comment

LGTM. Approved!

Thanks @DiegoTavares

@ramonfigueiredo
Collaborator

Missing Test Coverage

Component       | Missing
--------------- | -------
Java DAO        | updateConcurrentSlotsLimit, getHostConcurrentSlotsLimit
Java DAO        | setLayerSlotsRequired update path
Java gRPC       | ManageHost.setConcurrentSlotsLimit, ManageLayer.setSlotsRequired, RqdReportStatic.getHostSlotsLimit
Java DAO test   | updateHostStats doesn't assert int_running_slots was stored
pycue           | host.setConcurrentSlotsLimit, host.concurrentSlotsLimit
pycue           | layer.slotsRequired getter
Negative values | slots_required with negative input via JobSpec

@ramonfigueiredo
Collaborator

Thanks for the contribution, @DiegoTavares !

I've reviewed the PR and left comments throughout.

@DiegoTavares
Collaborator Author

No Atomic Slot allocation in Java Cuebot

I understand the concern about the update not being atomic, but in this case it is not a real issue. The updates come from the GUI API, so if two users update the same host at the same second, it's okay for them to race each other. Besides that, I don't quite understand the query in your suggestion.

@DiegoTavares DiegoTavares force-pushed the slot-based-scheduling branch from 52ff8b9 to 6b97a04 Compare March 2, 2026 22:49
@DiegoTavares
Collaborator Author

The last round of fixes related to the current review pass is available at this commit
