feat(stealing): add margin-of-improvement bound to prevent work-stealing thrash#9262
prince8273 wants to merge 6 commits into
Unit Test Results
See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
31 files ±0, 31 suites ±0, 11h 2m 58s ⏱️ (-28m 4s)
For more details on these failures, see this check.
Results for commit 2e48bea. ± Comparison against base commit cf508b9.
♻️ This comment has been updated with latest results.
… death When a worker drops off the cluster unexpectedly (e.g., due to an OOM kill), the scheduler tracks the processing_keys but previously did not log them to the console. This change surfaces exactly which tasks were interrupted, significantly improving debugging provenance for cluster hangs and memory crashes.
f052680 to 2e48bea
CI Status Note
The failing checks are pre-existing flaky tests unrelated to this PR. See the dask/distributed test report. The GitHub Actions bot confirms: 3 ❌ (-1 against base commit cf508b9).
Problem
The scheduler would steal a task whenever the thief was even 1ms
faster than the victim. For data-heavy, compute-light tasks this
caused chronic thrashing because transfer costs routinely exceeded savings.
Change
Added a margin constraint to balance(): the thief must now promise a
speedup of at least 50% of the network transfer cost. Marginal steals
that are net-negative under realistic network jitter are suppressed.
Observability
Added reject_count_margin_total (keyed by level) to WorkStealing.metrics
so operators can measure exactly how many thrashing steals are being
prevented. A logger.debug line is emitted on each rejection with full
task and margin details.
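One way a per-level rejection counter and its debug line could fit together is sketched below. The class `MarginMetrics`, the method `record_rejection`, the logger name, and the log message format are all hypothetical. The dict-of-`defaultdict` layout is an assumption about how a metric "keyed by level" might be stored, not a copy of the PR's code.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("stealing_sketch")  # hypothetical logger name


class MarginMetrics:
    """Hypothetical stand-in for the metrics held by the stealing extension."""

    def __init__(self) -> None:
        # Assumed layout: metric name -> {level: count}
        self.metrics = {"reject_count_margin_total": defaultdict(int)}

    def record_rejection(
        self, level: int, key: str, speedup: float, required: float
    ) -> None:
        """Count one margin rejection at `level` and log its details."""
        self.metrics["reject_count_margin_total"][level] += 1
        logger.debug(
            "Rejected steal of %s at level %d: speedup %.4fs < required margin %.4fs",
            key,
            level,
            speedup,
            required,
        )


m = MarginMetrics()
m.record_rejection(level=3, key="task-a", speedup=0.001, required=0.25)
m.record_rejection(level=3, key="task-b", speedup=0.010, required=0.40)
```

Passing the values as logging arguments rather than pre-formatting the string keeps the cost low when debug logging is disabled, which matters for a line emitted on every rejection.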
Tests
Added test_reject_count_margin_metric, which simulates a high
comm_cost/low compute scenario, triggers balance(), and asserts
reject_count_margin_total >= 1.
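The shape of such a test can be illustrated without a running cluster. The sketch below uses a toy `toy_balance` function as a hypothetical stand-in for balance(); the candidate tuple layout and the function names are assumptions for illustration, and the real test in the PR runs against the scheduler instead.

```python
from collections import defaultdict


def toy_balance(candidates, metrics, fraction=0.5):
    """Hypothetical stand-in for balance(): return the keys it would steal.

    Each candidate is (key, level, speedup_seconds, comm_cost_seconds).
    """
    stolen = []
    for key, level, speedup, comm_cost in candidates:
        if speedup >= fraction * comm_cost:
            stolen.append(key)
        else:
            # Steal suppressed by the margin: count it per level.
            metrics["reject_count_margin_total"][level] += 1
    return stolen


def test_reject_count_margin_metric_sketch():
    metrics = {"reject_count_margin_total": defaultdict(int)}
    # High comm_cost, low compute: 1 ms of speedup against a 2 s transfer.
    stolen = toy_balance([("task-a", 0, 0.001, 2.0)], metrics)
    assert stolen == []
    assert sum(metrics["reject_count_margin_total"].values()) >= 1
```

Asserting `>= 1` rather than an exact count mirrors the description above and keeps the check robust if balance() evaluates the same task more than once.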