feat(stealing): add margin-of-improvement bound to prevent work-stealing thrash #9262

Open
prince8273 wants to merge 6 commits into dask:main from prince8273:work-stealing-heuristics

@prince8273

Problem

The scheduler would steal a task whenever the thief was estimated to
finish it even 1 ms sooner than the victim. For data-heavy, compute-light
tasks this caused chronic thrashing: transfer costs routinely exceeded
the savings.

Change

Added a margin constraint to balance():

# Require the steal to win by at least half the thief's transfer cost.
margin = comm_cost_thief * 0.5
would_steal_with_margin = (
    occ_thief + comm_cost_thief + compute + margin
    <= occ_victim - (comm_cost_victim + compute) / 2
)

The thief must now promise a speedup of at least 50% of its estimated
network transfer cost (comm_cost_thief). Marginal steals that are
net-negative under realistic network jitter are suppressed.

Observability

Added reject_count_margin_total (keyed by level) to
WorkStealing.metrics so operators can measure exactly how many
thrashing steals are being prevented. A logger.debug line is
emitted on each rejection with full task and margin details.
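
As a rough illustration of the counter described above, the sketch below keeps a per-level rejection count in a defaultdict and emits a debug line on each rejection. The metrics dictionary here is a standalone stand-in, not the real WorkStealing.metrics, and record_margin_rejection and its arguments are hypothetical names chosen for the example.

```python
import logging
from collections import defaultdict

logger = logging.getLogger("stealing_sketch")

# Stand-in for WorkStealing.metrics; only the new counter is modelled here.
metrics = {"reject_count_margin_total": defaultdict(int)}


def record_margin_rejection(key, level, margin, thief_cost, victim_bound):
    """Count one margin-based rejection at `level` and log its details.

    Hypothetical helper: thief_cost and victim_bound are the left- and
    right-hand sides of the margin inequality, in seconds.
    """
    metrics["reject_count_margin_total"][level] += 1
    logger.debug(
        "Rejected steal of %s at level %d: margin=%.3fs, "
        "thief_cost=%.3fs, victim_bound=%.3fs",
        key, level, margin, thief_cost, victim_bound,
    )


record_margin_rejection("task-a", level=3, margin=0.5,
                        thief_cost=1.6, victim_bound=1.25)
record_margin_rejection("task-b", level=3, margin=0.2,
                        thief_cost=0.9, victim_bound=0.8)
record_margin_rejection("task-c", level=5, margin=0.1,
                        thief_cost=0.4, victim_bound=0.3)

print(dict(metrics["reject_count_margin_total"]))  # {3: 2, 5: 1}
```

Keying by level lets an operator see which cost-ratio buckets produce the most suppressed steals, rather than a single aggregate number.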

Tests

Added test_reject_count_margin_metric, which simulates a high
comm_cost, low compute scenario, triggers balance(), and asserts
reject_count_margin_total >= 1.

@prince8273 prince8273 requested a review from fjetter as a code owner May 14, 2026 09:52
@github-actions
Contributor

github-actions Bot commented May 14, 2026

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    31 files  ± 0      31 suites  ±0   11h 2m 58s ⏱️ - 28m 4s
 4 122 tests + 1   4 014 ✅ +2    105 💤 ±0   3 ❌  -  1 
59 830 runs  +15  57 325 ✅ +4  2 488 💤 ±0  17 ❌ +11 

For more details on these failures, see this check.

Results for commit 2e48bea. ± Comparison against base commit cf508b9.

♻️ This comment has been updated with latest results.

… death

When a worker drops off the cluster unexpectedly (e.g., due to an OOM kill), the scheduler tracks the processing_keys but previously did not log them to the console. This change surfaces exactly which tasks were interrupted, which makes cluster hangs and memory crashes much easier to debug.
@prince8273 prince8273 force-pushed the work-stealing-heuristics branch from f052680 to 2e48bea Compare May 14, 2026 13:48
@prince8273
Author

CI Status Note

The failing checks are pre-existing flaky tests unrelated to this PR.

The dask/distributed test report
shows the 90-day failure history across all platforms — only 2 isolated
failures appear in the entire grid, both predating this PR and matching
the same tests failing here.

The GitHub Actions bot reports 3 ❌ -1 against base commit cf508b9:
this PR introduced zero new failures and improved the suite by fixing one
pre-existing failure.
