[SPARK-57242][CORE] Avoid unbounded page allocation retries after allocator OOM by sunchao · Pull Request #56293 · apache/spark

sunchao · 2026-06-03T05:14:02Z

Why are the changes needed?

Spark page allocation has two distinct steps:

MemoryManager grants the task permission to use execution memory.
The Tungsten MemoryAllocator physically allocates the on-heap or off-heap page.

An execution-memory grant is an accounting decision; it does not guarantee that the physical allocator can create the page. The allocator may still throw OutOfMemoryError, for example because of memory pressure outside Spark's execution-memory accounting or because a sufficiently large allocation cannot be satisfied.
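The distinction between the two steps can be sketched with a toy model. This is a minimal illustration, not Spark's code: `Accounting`, `PhysicalAllocator`, `Outcome`, and the byte sizes are all hypothetical stand-ins for the accounting step and the physical allocator.

```java
// Hypothetical model for illustration; these are not Spark's classes.
final class TwoStepAllocationSketch {
    enum Outcome { ALLOCATED, GRANT_DENIED, GRANTED_BUT_ALLOCATOR_OOM }

    /** Step 1: pure bookkeeping against an execution-memory pool. */
    static final class Accounting {
        private final long poolSize;
        private long granted;
        Accounting(long poolSize) { this.poolSize = poolSize; }
        boolean acquire(long bytes) {
            if (granted + bytes > poolSize) return false;
            granted += bytes;
            return true;
        }
        long granted() { return granted; }
    }

    /** Step 2: stand-in for the physical allocator, which has its own limit. */
    static final class PhysicalAllocator {
        private long available;
        PhysicalAllocator(long available) { this.available = available; }
        void allocate(long bytes) {
            if (bytes > available) {
                throw new OutOfMemoryError("cannot allocate " + bytes + " bytes");
            }
            available -= bytes;
        }
    }

    static Outcome allocatePage(Accounting accounting, PhysicalAllocator allocator, long bytes) {
        if (!accounting.acquire(bytes)) return Outcome.GRANT_DENIED;
        try {
            allocator.allocate(bytes);
            return Outcome.ALLOCATED;
        } catch (OutOfMemoryError e) {
            // The grant is still held here; what to do with it is the caller's policy.
            return Outcome.GRANTED_BUT_ALLOCATOR_OOM;
        }
    }

    public static void main(String[] args) {
        Accounting accounting = new Accounting(1024);
        PhysicalAllocator allocator = new PhysicalAllocator(256);
        System.out.println(allocatePage(accounting, allocator, 512));
        System.out.println("granted bytes: " + accounting.granted());
    }
}
```

In this model the pool happily grants 512 bytes while the allocator can only supply 256, which is the accounting/allocator mismatch the description refers to.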

When that happens, TaskMemoryManager.allocatePage() currently retains the grant as acquired-but-unused memory and recursively calls allocatePage():

acquire grant G1 -> allocator OOM -> retain G1 -> recurse
  acquire grant G2 -> allocator OOM -> retain G2 -> recurse
    ...

Each retry therefore requests another execution-memory grant even though the task never physically allocated the previous one, and nothing checks that retrying made progress. Under a persistent allocator/accounting mismatch, the task pins a growing amount of execution memory and recurses repeatedly. It can eventually block waiting for more execution memory, or fail far from the original allocator OOM, instead of recovering or failing promptly.
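The growth of pinned memory under this pattern can be reproduced in a small simulation. The sketch below is an assumption-laden model, not the actual `TaskMemoryManager` code: the pool size, the always-failing `physicalAllocate`, and the choice to return `false` when the pool is exhausted (where real code could block instead) are all hypothetical.

```java
// Hypothetical simulation of the retain-and-recurse pattern; not Spark's code.
final class RecursiveRetrySketch {
    static final long POOL = 4096; // illustrative execution-memory pool size

    long pinned;   // acquired-but-unused grants retained across retries
    int attempts;  // number of allocatePage() invocations, including recursion

    private boolean acquire(long bytes) {
        if (pinned + bytes > POOL) return false; // real code could block here instead
        pinned += bytes;
        return true;
    }

    private void physicalAllocate(long bytes) {
        // Models a persistent mismatch: the allocator never succeeds.
        throw new OutOfMemoryError("persistent allocator failure for " + bytes + " bytes");
    }

    boolean allocatePage(long bytes) {
        attempts++;
        if (!acquire(bytes)) return false;
        try {
            physicalAllocate(bytes);
            return true;
        } catch (OutOfMemoryError e) {
            // Retain the grant and recurse: each level pins another grant.
            return allocatePage(bytes);
        }
    }

    public static void main(String[] args) {
        RecursiveRetrySketch task = new RecursiveRetrySketch();
        boolean ok = task.allocatePage(1024);
        System.out.println("ok=" + ok + " attempts=" + task.attempts + " pinned=" + task.pinned);
    }
}
```

With a 1024-byte request, the model pins the whole 4096-byte pool before giving up, although no page was ever physically allocated.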

This is the generic TaskMemoryManager failure path underlying the long-running allocation retry described by SPARK-54354. The fix for that issue bounded the temporary memory managers used by hashed relations, and SPARK-54818 improved the allocator OOM diagnostics. The recursive retry behavior remains for other page-allocation consumers.

The intended recovery behavior is instead:

acquire grant G1 -> allocator OOM
  -> spill task-managed consumers
  -> retry the allocator with G1 only if spilling released tracked memory
  -> otherwise return allocation failure promptly

Recovery is best-effort: consumers that support spilling may free enough physical memory for the same allocation to succeed. If the task has no spillable memory, or spilling makes no measurable progress, retrying the same allocation is not useful and the caller should receive SparkOutOfMemoryError.
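The recovery flow above can be sketched as an iterative loop that acquires one grant and retries only while spilling reduces tracked memory. This is a simplified model under stated assumptions, not the PR's implementation: `SpillableConsumer`, `physicalFree`, and the rule that spilling returns all of a consumer's bytes to the allocator are hypothetical, and what happens to the grant after a failure is left unmodeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of bounded allocator-OOM recovery; not Spark's code.
final class BoundedRecoverySketch {
    /** Illustrative consumer: tracks the bytes it holds and may be able to spill them. */
    static final class SpillableConsumer {
        long tracked;
        final boolean canSpill;
        SpillableConsumer(long tracked, boolean canSpill) {
            this.tracked = tracked;
            this.canSpill = canSpill;
        }
    }

    long physicalFree;   // what the stand-in allocator can still hand out
    int grantsAcquired;  // counts accounting grants, to show that retries reuse one
    final List<SpillableConsumer> consumers = new ArrayList<>();

    BoundedRecoverySketch(long physicalFree) { this.physicalFree = physicalFree; }

    private void physicalAllocate(long bytes) {
        if (bytes > physicalFree) throw new OutOfMemoryError("allocator OOM");
        physicalFree -= bytes;
    }

    private long trackedTotal() {
        long total = 0;
        for (SpillableConsumer c : consumers) total += c.tracked;
        return total;
    }

    private void spillAll() {
        for (SpillableConsumer c : consumers) {
            if (c.canSpill) {
                physicalFree += c.tracked; // spilled bytes become allocatable again
                c.tracked = 0;
            }
        }
    }

    /** Returns true on success, false when spilling made no measurable progress. */
    boolean allocatePage(long bytes) {
        grantsAcquired++; // one grant, reused by every retry below
        while (true) {
            try {
                physicalAllocate(bytes);
                return true;
            } catch (OutOfMemoryError e) {
                long before = trackedTotal();
                spillAll();
                if (trackedTotal() >= before) return false; // no progress: fail promptly
            }
        }
    }

    public static void main(String[] args) {
        BoundedRecoverySketch task = new BoundedRecoverySketch(100);
        task.consumers.add(new SpillableConsumer(400, true));
        System.out.println(task.allocatePage(300) + " grants=" + task.grantsAcquired);
    }
}
```

The loop is bounded because every retry requires the tracked total to strictly decrease, and that total cannot go below zero.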

What changes were proposed in this PR?

For TaskMemoryManager:

Replace recursive page allocation with an iterative retry that keeps and reuses the original execution-memory grant.
After allocator OOM, spill registered task-memory consumers directly without acquiring another fair-share execution-memory grant.
Measure progress using the consumer's tracked memory before and after spilling, and retry only while that value decreases.
Prevent page allocations made from inside a spill callback from recursively entering allocator recovery.
Return allocation failure when no consumer can make progress, so the existing caller path raises SparkOutOfMemoryError.
Make cleanup of acquired-but-unused grants idempotent.
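Two of the items above, the guard against nested recovery and the idempotent cleanup, can be illustrated with a small sketch. It shows one possible shape for these ideas and is not taken from the PR's diff: the `inRecovery` flag, `recover`, `retainUnused`, and `releaseUnusedGrant` are hypothetical names.

```java
// Hypothetical sketch of a re-entrancy guard and idempotent grant cleanup; not Spark's code.
final class RecoveryGuardSketch {
    private boolean inRecovery; // true while a spill callback is running
    private long unusedGrant;   // acquired-but-unused bytes
    long poolFree;              // illustrative execution-memory pool balance
    int recoveryEntries;        // how many times recovery actually ran

    RecoveryGuardSketch(long poolFree) { this.poolFree = poolFree; }

    void retainUnused(long bytes) {
        poolFree -= bytes;
        unusedGrant += bytes;
    }

    /** Idempotent: a second call finds nothing left to release and returns 0. */
    long releaseUnusedGrant() {
        long released = unusedGrant;
        unusedGrant = 0;
        poolFree += released;
        return released;
    }

    /** Runs the spill callback unless recovery is already in progress. */
    boolean recover(Runnable spillCallback) {
        if (inRecovery) return false; // nested allocation inside a spill callback
        inRecovery = true;
        recoveryEntries++;
        try {
            spillCallback.run();
            return true;
        } finally {
            inRecovery = false;
        }
    }

    public static void main(String[] args) {
        RecoveryGuardSketch guard = new RecoveryGuardSketch(1024);
        boolean[] nested = new boolean[1];
        boolean outer = guard.recover(() -> nested[0] = guard.recover(() -> { }));
        System.out.println("outer=" + outer + " nested=" + nested[0]);
    }
}
```

Clearing the flag in a `finally` block keeps the guard from sticking if a spill callback throws.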

The direct spill path can reset ShuffleExternalSorter while record insertion is in progress, so this PR also makes its pointer-array lifecycle safe for that recovery path:

ShuffleInMemorySorter.reset() frees its pointer array and allocates the replacement lazily, outside the spill callback.
ShuffleExternalSorter restores the initial pointer array when the next record is inserted.
Record insertion rechecks pointer-array capacity after data-page allocation because that allocation may have triggered a spill and reset the array.
Empty iteration and cleanup remain valid while the replacement pointer array has not yet been allocated.
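The pointer-array lifecycle described above can be sketched roughly as follows. This is a simplified model, not the `ShuffleExternalSorter` or `ShuffleInMemorySorter` code: the field names, the fixed initial capacity, the fake record address, and the `spillOnNextPageAllocation` switch used to force a spill are all hypothetical.

```java
import java.util.Arrays;

// Hypothetical model of a lazily reallocated pointer array; not Spark's code.
final class LazyPointerArraySketch {
    static final int INITIAL_CAPACITY = 4;

    private long[] pointers; // null after reset() until the next insert
    private int size;
    int spills;
    boolean spillOnNextPageAllocation; // test hook that forces a spill

    /** Frees the array now; the replacement is allocated lazily on insert. */
    void reset() {
        pointers = null;
        size = 0;
    }

    private void ensurePointerArray() {
        if (pointers == null) {
            pointers = new long[INITIAL_CAPACITY];
        } else if (size == pointers.length) {
            pointers = Arrays.copyOf(pointers, size * 2);
        }
    }

    private long allocateDataPage() {
        if (spillOnNextPageAllocation) {
            spillOnNextPageAllocation = false;
            spills++;
            reset(); // a spill during page allocation resets the pointer array
        }
        return 42L; // fake record address
    }

    void insertRecord() {
        ensurePointerArray();
        long address = allocateDataPage(); // may spill and reset the pointer array
        ensurePointerArray();              // recheck capacity after page allocation
        pointers[size++] = address;
    }

    /** Iteration stays valid while the array is unallocated: it is simply empty. */
    long[] records() {
        return pointers == null ? new long[0] : Arrays.copyOf(pointers, size);
    }

    boolean hasPointerArray() { return pointers != null; }

    /** Safe to call whether or not the array is currently allocated. */
    void cleanup() { reset(); }

    public static void main(String[] args) {
        LazyPointerArraySketch sorter = new LazyPointerArraySketch();
        sorter.insertRecord();
        sorter.spillOnNextPageAllocation = true;
        sorter.insertRecord();
        System.out.println("records=" + sorter.records().length + " spills=" + sorter.spills);
    }
}
```

Without the second `ensurePointerArray()` call, the insert following a mid-insertion spill would dereference a null array in this model.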

Successful page allocations are unchanged. This introduces no new configuration or public API. Consumers that can spill gain a bounded recovery opportunity; consumers that cannot spill fail promptly rather than entering an unbounded retry loop.

How was this PR tested?

Added deterministic TaskMemoryManagerSuite coverage for:

allocator OOM with no spillable memory
successful spill and retry using the original grant
nested page allocation during allocator recovery
off-heap allocator failure
idempotent failed-grant cleanup

Added shuffle sorter coverage for:

lazy pointer-array allocation after reset
cleanup while the pointer array is lazily unallocated
data-page allocation triggering a spill during record insertion

Validation performed:

TaskMemoryManagerSuite
ShuffleInMemorySorterSuite
ShuffleExternalSorterSuite
UnsafeExternalSorterSuite
Core reactor compilation and Checkstyle
Scalafmt validation
git diff --check

Commit d25673e: [SPARK-57242][CORE] Avoid unbounded page allocation retries after allocator OOM

sunchao wants to merge 1 commit into apache:master from sunchao:dev/chao/codex/spark-57242-allocator-oom-recovery