[Pools] Prioritize Idle Workers for Scale Down #8243
base: master
Conversation
Summary of Changes
Hello @lloyd-brown, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the worker pool autoscaling mechanism by introducing a more intelligent selection process for scaling down workers. The primary goal is to minimize disruption and optimize resource utilization by ensuring that workers with active jobs are retained over idle ones when a scale-down event occurs. This change refines the existing sorting criteria to explicitly account for the number of running jobs on each worker, making the autoscaler more efficient and job-aware.
Code Review
The pull request introduces a significant improvement to the autoscaling logic by prioritizing idle workers during scale-down operations. This ensures that active jobs are not prematurely terminated, leading to a more robust and user-friendly service. The changes are well-documented, and the new sorting criteria are clearly explained in the docstrings. Comprehensive unit tests and a smoke test have been added to validate the new behavior, covering various scenarios including job count prioritization, equal job counts, and the preservation of existing status and version priorities.
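To make the new ordering concrete, here is a minimal sketch of a scale-down selection keyed on running-job count first, with the existing status and version priorities kept as tie-breakers. The names used here (`ReplicaInfo`, `status_priority`, `replica_job_counts`) are assumptions for illustration, not the actual autoscaler internals:

```python
from typing import Dict, List


def select_replicas_to_scale_down(
        replicas: List['ReplicaInfo'],
        replica_job_counts: Dict[int, int],
        num_to_remove: int) -> List['ReplicaInfo']:
    """Pick replicas to terminate, preferring idle workers.

    Assumed sort order: fewest running jobs first, then the existing
    status priority, then older versions.
    """

    def sort_key(info: 'ReplicaInfo'):
        return (
            replica_job_counts.get(info.replica_id, 0),  # idle workers first
            info.status_priority,  # existing status-based ordering
            info.version,  # existing version-based ordering
        )

    return sorted(replicas, key=sort_key)[:num_to_remove]
```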
/smoke-test
/smoke-test -k test_pool_scale_down_with_job_count_priority --kubernetes
/smoke-test
Tests are only failing due to the issue fixed in #8256 and resource unavailability.
cg505
left a comment
The autoscaler change looks good.
However, I think a new version that keeps the same config (except for worker count) should re-use the same replicas without needing to add the update_replicas RPC.
skypilot/sky/serve/replica_managers.py, lines 1476 to 1480 in afb7768:

```python
# Reuse all replicas that have the same config as the new version
# (except for the `service` field) by directly setting the version to be
# the latest version. This can significantly improve the speed
# for updating an existing service with only config changes to the
# service specs, e.g. scale down the service.
```
If this isn't working correctly, we should try to understand why instead of just going around it, to avoid having two implementations of the same thing.
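For reference, the reuse path that the quoted comment describes amounts to comparing the old and new task configs with the `service` field stripped, and promoting matching replicas to the latest version in place. A hedged sketch of that idea (function and attribute names here are illustrative, not the real replica_managers API):

```python
import copy


def _config_without_service(task_config: dict) -> dict:
    """Return a copy of the task config with the `service` field removed."""
    config = copy.deepcopy(task_config)
    config.pop('service', None)
    return config


def maybe_reuse_replica(replica, old_config: dict, new_config: dict,
                        latest_version: int) -> bool:
    """Promote the replica to the latest version if only `service` changed."""
    if (_config_without_service(old_config) ==
            _config_without_service(new_config)):
        replica.version = latest_version  # reuse in place, no relaunch
        return True
    return False
```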
```python
) -> None:
    """Updates an existing service or pool."""


def _validate_task(task: 'task_lib.Task') -> 'task_lib.Task':
```
Naming: It's confusing that this doesn't call task.validate(). Maybe just _apply_admin_policy?
```python
# Get the number of running jobs for each replica
replica_job_counts = {}
for info in replicas:
    job_ids = managed_job_state.get_nonterminal_job_ids_by_pool(
        service_name, info.cluster_name)
    replica_job_counts[info.replica_id] = len(job_ids)
```
Is this just a no-op for non-pools?
This PR addresses a problem where changing the number of workers in a pool leads to us downing workers that are actively running jobs. Working example: we have two workers and a job is running only on worker 2. If we scale the number of workers down to 1, ideally we would tear down only worker 1 and leave worker 2 to finish its job. This does not work today: we tear down both workers in order to update to the new service version.
Changes made
For `--workers` updates (which do not allow a new YAML, so the only thing that can change is the number of workers), we introduce a new call in the serve controller to directly change the number of replicas in the autoscaler.
Backwards compatibility:
If we don’t have support for updating replicas directly we will fall back to the old logic of creating a new version with the updated number of workers, losing the ability to scale without cancelling jobs.
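A hedged sketch of that fallback, assuming a controller handle that may or may not expose a direct replica-count update (all names here are illustrative; the actual RPC wiring in the PR may differ):

```python
def update_worker_count(controller, pool_name: str, num_workers: int) -> None:
    """Scale a pool, preferring the direct replica-count update when available."""
    if hasattr(controller, 'update_replicas'):
        # New path: tell the autoscaler the target replica count directly,
        # so idle workers can be removed without cancelling running jobs.
        controller.update_replicas(pool_name, num_workers)
    else:
        # Old path: register a new version with the updated worker count,
        # which may tear down workers that are still running jobs.
        controller.apply_new_version(pool_name, num_workers)
```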
I added several unit tests to make sure the new sort order works as expected, covering job count prioritization, equal job counts, and preservation of the existing status and version priorities.
I added a smoke test that makes sure that when we scale down a pool that has 2 workers and only 1 of them is running a job, we always scale down the worker that is idle.
Tested (run the relevant ones):
- `bash format.sh`
- `/smoke-test` (CI) or `pytest tests/test_smoke.py` (local)
- `/smoke-test -k test_name` (CI) or `pytest tests/test_smoke.py::test_name` (local)
- `/quicktest-core` (CI) or `pytest tests/smoke_tests/test_backward_compat.py` (local)