[scheduler/cuebot/pycue/rqd/pyoutline] Booking by Slot #2115
[scheduler/cuebot/pycue/rqd/pyoutline] Booking by Slot #2115DiegoTavares wants to merge 30 commits intoAcademySoftwareFoundation:masterfrom
Conversation
This field limits the number of concurrent frames allowed to run on a specific host.
This commit is the first step towards the goal of allowing a new booking mode that doesn't take cores and memory into consideration, but a predefined limit on how many concurrent frames a host is allowed to run. Rationale: Booking by slot is useful for pipelines where frames are small and limited not by their cpu/memory consumption but by other resources like storage bandwith or network availability. In these scenarios, limiting the concurrency is more important than the resource consumption.
Add slots_required attribute to layer for slot-based booking
Add a menu action for setting a host's slot limit. When a limit is defined, booking will only allocate layers with slots_required > 0 to be executed on this host. Which means regular booking by cores/memory/gpu becomes disabled. (0 for no limit, >0 for specific limit) Changes: - Add new proto field to Host and NestedHost - Change pycue to allow setting concurrent_procs_limit - Change cuegui action menu to add an option to update the new field - Update Cuebot to receive the request and update the database
cuebot/src/main/java/com/imageworks/spcue/dao/postgres/LayerDaoJdbc.java
Show resolved
Hide resolved
cuebot/src/main/resources/conf/ddl/postgres/migrations/V38__Add_host_frame_limit.sql
Outdated
Show resolved
Hide resolved
Missing Slot reporting from RQD to CuebotLocation: rust/crates/rqd/src/system/machine.rs:932-969 Issue: RQD tracks slots internally but does NOT report slot consumption to Cuebot. The HostReport proto contains CoreDetail and RenderHost but neither has slot fields. Impact:
Root Cause: proto/src/report.proto missing slot fields in CoreDetail (lines 39-49) or RenderHost (lines 70-91) Required Fix:
|
No Atomic Slot allocation in Java CuebotLocation: cuebot/src/main/java/com/imageworks/spcue/dao/postgres/HostDaoJdbc.java:569-583 Issue: Slot updates are NOT atomic with respect to slot checking/allocation. No row-level locking or optimistic concurrency control. Code: public void updateConcurrentSlotsLimit(HostInterface host, int limit) {
getJdbcTemplate().update("UPDATE host SET int_concurrent_slots_limit=? WHERE pk_host=?",
limit, host.getHostId());
}Missing: Atomic check-and-decrement operation for slot booking: // Expected but doesn't exist:
UPDATE host_stat
SET int_running_procs = int_running_procs + ?
WHERE pk_host = ?
AND int_running_procs + ? <= (SELECT int_concurrent_slots_limit FROM host WHERE pk_host = ?)Impact: Race conditions during concurrent bookings, potential over-booking |
Spec Version Mismatch in PyOutlineLocations:
Issue: Required fix: Update config to |
No cross-validation between host and layer settingsLocations: Throughout cuebot/cuegui Issue: No check that Impact: Could result in layers that can never run Suggested fix: Add validation in dispatcher or show warnings in GUI |
Missing Slot utilization monitoringLocation: cuegui/cuegui/HostMonitorTree.py Issue: No column showing "Slots In Use" vs "Slots Available" for real-time monitoring Suggested Fix: Add columns showing |
Missing SQL constraintLocations: V36 and V38 migrations Issue: No CHECK constraints to enforce valid ranges: -- Suggested:
ALTER TABLE layer ADD CONSTRAINT layer_slots_check CHECK (int_slots_required >= 0);
ALTER TABLE host ADD CONSTRAINT host_slots_check CHECK (int_concurrent_slots_limit >= -1); |
Other minor changesInconsistent Terminology in GUI TooltipsLocations:
Suggested fix: Use consistent phrasing Column Display InconsistencyLocations:
Suggested fix: Use consistent threshold logic Missing Docstring Type AnnotationLocation: pycue/opencue/wrappers/host.py:131-132 Issue: setConcurrentSlotsLimit() lacks :type: and :param: docstring Suggested fix: Add type annotations like setSlotsRequired() has Dummy Cuebot Always Returns -1Location: rust/crates/dummy-cuebot/src/report_servant.rs:84 Issue: Hardcoded -1 prevents testing slot-based booking with dummy-cuebot Suggested fix: Make configurable for testing Deprecated Field UsageLocation: rust/crates/scheduler/src/pipeline/dispatcher/actor.rs:973 Issue: slots_required: 0 hardcoded in deprecated field Note: May be intentional for backward compatibility |
|
The code review is complete. I have added some recommended changes and bug fixes to the PR. |
Rqd doesn't need to report the number of allocated slots as this information is available on the report's list of running procs. But at reviewing this logic, I've found a different problem. Cuebot is counting each proc as a slot, without taking into consideration the slots_required field in each frame. I'm implementing a fix for this |
Cuebot now accounts for a sum of frames' consumed slots, not 1 slot per frame as previously.
- Add runningSlots field to Host protobuf and HostEntity - Update Whiteboard DAO queries to include runningSlots - Show available slots (concurrentSlotsLimit - runningSlots) in CueGUI host monitor tree
Added on 091a98b |
Negative values behave the same as 0 throughout the application. Enforcing a constraint will only push a possible exception that will have to be considered in weird places where functions are expected to be total. |
|
Sorry about the weird state this PR was it. It was implemented in 2 parts with the new year break in between. Although it has been tested in isolation, I failed to test the integrated solution. I believe all the suggestions have been applied or handled. |
ramonfigueiredo
left a comment
There was a problem hiding this comment.
LGFM. Approved!
Thanks @DiegoTavares
cuebot/src/main/java/com/imageworks/spcue/dao/postgres/NestedWhiteboardDaoJdbc.java
Outdated
Show resolved
Hide resolved
cuebot/src/main/resources/conf/ddl/postgres/migrations/V38__Add_host_frame_slot_limit.sql
Show resolved
Hide resolved
cuebot/src/main/resources/conf/spring/applicationContext-grpc.xml
Outdated
Show resolved
Hide resolved
Missing Test Coverage
|
|
Thanks for the contribution, @DiegoTavares ! I've reviewed the PR and left comments throughout. |
I understand the problem of the updated not being atomic, which in this case is not a real issue. The updates are coming from the GUI API, if two users update the same host at the same seconds, it's okay to have them race each other. Besides that I don't quite understand the query on your suggestion. |
52ff8b9 to
6b97a04
Compare
|
The last round of fixes related to the current review pass are available at this commit |
Add a new booking mode that doesn't take cores and memory into consideration, but a predefined limit on how many concurrent frames a host is allowed to run.
Rationale: Booking by slot is useful for pipelines where frames are small and limited not by their
cpu/memory consumption but by other resources like storage bandwidth or network availability. In
these scenarios, limiting the concurrency is more important than the resource consumption.
Attention:* This branch is stacked on top of #2002
Tasks:
Cuegui Changes
Add a new column to HostMonitorTree to show the number of concurrent slots:

Add a new menu option on HostMonitorTree to change the concurrent slots limit value:


Attention
The new feature is initially being developed on top of the new scheduler module. At this time Cuebot will not take slot booking into consideration. A future PR will handle implementing the Cuebot changes to avoid touching too many files on a single PR.