[SPARK-57242][CORE] Avoid unbounded page allocation retries after allocator OOM #56293

Open
sunchao wants to merge 1 commit into apache:master from sunchao:dev/chao/codex/spark-57242-allocator-oom-recovery

sunchao commented Jun 3, 2026

Why are the changes needed?

Spark page allocation has two distinct steps:

  1. MemoryManager grants the task permission to use execution memory.
  2. The Tungsten MemoryAllocator physically allocates the on-heap or off-heap page.

An execution-memory grant is an accounting decision; it does not guarantee that the physical allocator can create the page. The allocator may still throw OutOfMemoryError, for example because of memory pressure outside Spark's execution-memory accounting or because a sufficiently large allocation cannot be satisfied.

When that happens, TaskMemoryManager.allocatePage() currently retains the grant as acquired-but-unused memory and recursively calls allocatePage():

acquire grant G1 -> allocator OOM -> retain G1 -> recurse
  acquire grant G2 -> allocator OOM -> retain G2 -> recurse
    ...

Each retry therefore asks for another execution-memory grant even though the task never physically allocated the previous one, and nothing checks that a retry made progress. Under a persistent allocator/accounting mismatch, the task pins a growing amount of execution memory and recurses repeatedly. It eventually either blocks waiting for more execution memory or fails far from the original allocator OOM, instead of recovering or failing promptly.
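
To make the failure mode concrete, here is a minimal sketch of the recursive pattern. It is a toy model, not Spark's TaskMemoryManager: the class `RecursiveRetryModel`, its fields, and the always-failing allocator are assumptions made for illustration. Real Spark may block waiting for execution memory where this model simply returns `false`.

```java
// Hypothetical, simplified model; not Spark's actual TaskMemoryManager.
final class RecursiveRetryModel {
    private final long poolBytes;   // execution memory the accounting layer can grant
    private long pinnedBytes = 0;   // grants retained as acquired-but-unused
    private int attempts = 0;       // number of physical allocation attempts

    RecursiveRetryModel(long poolBytes) {
        this.poolBytes = poolBytes;
    }

    /** Returns true if a page was allocated. The allocator in this model always fails. */
    boolean allocatePage(long size) {
        if (pinnedBytes + size > poolBytes) {
            return false;               // accounting finally refuses another grant
        }
        attempts++;
        try {
            physicalAllocate(size);
            return true;
        } catch (OutOfMemoryError e) {
            pinnedBytes += size;        // retain the grant so it is not handed out again
            return allocatePage(size);  // recurse: asks for yet another grant
        }
    }

    private void physicalAllocate(long size) {
        throw new OutOfMemoryError("simulated allocator failure");
    }

    long pinnedBytes() { return pinnedBytes; }
    int attempts() { return attempts; }
}
```

With a 1024-byte pool and 256-byte pages, this model makes four allocator attempts and ends with the whole pool pinned before it gives up, although no page was ever allocated.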

This is the generic TaskMemoryManager failure path underlying the long-running allocation retry described by SPARK-54354. That issue bounded temporary memory managers used by hashed relations, while SPARK-54818 improved the allocator OOM diagnostics. The recursive retry behavior remains for other page-allocation consumers.

The intended recovery behavior is instead:

acquire grant G1 -> allocator OOM
  -> spill task-managed consumers
  -> retry the allocator with G1 only if spilling released tracked memory
  -> otherwise return allocation failure promptly

Recovery is best-effort: consumers that support spilling may free enough physical memory for the same allocation to succeed. If the task has no spillable memory, or spilling makes no measurable progress, retrying the same allocation is not useful and the caller should receive SparkOutOfMemoryError.
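
The intended flow can be sketched as an iterative loop that reuses one grant and stops as soon as spilling releases nothing. This is a toy model under stated assumptions: `BoundedRetryModel`, `Spillable`, and `allocateWithGrant` are hypothetical names, progress is measured by the value returned from `spill()` rather than by reading tracked memory before and after, and spilled bytes are assumed to become physically available immediately.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the intended recovery; names are illustrative, not Spark's API.
final class BoundedRetryModel {
    interface Spillable {
        long spill();   // returns the bytes of tracked memory released
    }

    private final List<Spillable> consumers = new ArrayList<>();
    private long physicalFreeBytes;
    private int allocatorCalls = 0;

    BoundedRetryModel(long physicalFreeBytes) {
        this.physicalFreeBytes = physicalFreeBytes;
    }

    void register(Spillable consumer) {
        consumers.add(consumer);
    }

    /** The grant for size is assumed to be held already and is reused on every retry. */
    boolean allocateWithGrant(long size) {
        while (true) {
            allocatorCalls++;
            if (size <= physicalFreeBytes) {
                physicalFreeBytes -= size;   // physical allocation succeeded
                return true;
            }
            // Allocator OOM: spill consumers directly, without asking for a new grant.
            long released = 0;
            for (Spillable consumer : consumers) {
                released += consumer.spill();
            }
            if (released == 0) {
                return false;                // no progress: fail promptly
            }
            physicalFreeBytes += released;   // model assumption: spilled bytes are reusable
        }
    }

    int allocatorCalls() { return allocatorCalls; }
}
```

Because every retry requires a strictly positive amount of released memory, the loop ends once the consumers have nothing left to spill. In the model, a task with no spillable consumers fails after a single allocator call.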

What changes were proposed in this PR?

For TaskMemoryManager:

  • Replace recursive page allocation with an iterative retry that keeps and reuses the original execution-memory grant.
  • After allocator OOM, spill registered task-memory consumers directly without acquiring another fair-share execution-memory grant.
  • Measure progress using the consumer's tracked memory before and after spilling, and retry only while that value decreases.
  • Prevent page allocations made from inside a spill callback from recursively entering allocator recovery.
  • Return allocation failure when no consumer can make progress, so the existing caller path raises SparkOutOfMemoryError.
  • Make cleanup of acquired-but-unused grants idempotent.
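
Two of the items above, the re-entrancy guard and the idempotent grant cleanup, can be illustrated with a small sketch. Everything here is hypothetical: `RecoveryGuardModel`, `Grant`, and the `recovering` flag are illustrative names, and the allocator in this model never succeeds, so every call reaches the OOM path.

```java
// Hypothetical model; class, field, and method names are illustrative, not Spark's.
final class RecoveryGuardModel {
    static final class Grant {
        private long bytes;

        Grant(long bytes) {
            this.bytes = bytes;
        }

        /** Returns the bytes released by this call; 0 on every call after the first. */
        long release() {
            long released = bytes;
            bytes = 0;
            return released;
        }
    }

    private boolean recovering = false;  // true while a spill callback is running
    private int recoveries = 0;          // how many times recovery was entered
    private long returnedBytes = 0;      // bytes handed back to the accounting pool

    /** The allocator never succeeds in this model, so the result is always false. */
    boolean allocatePage(long size, Runnable spillCallback) {
        Grant grant = new Grant(size);
        try {
            if (!recovering) {
                recovering = true;
                recoveries++;
                try {
                    spillCallback.run();  // may itself call allocatePage(...)
                } finally {
                    recovering = false;
                }
            }
            // A nested call made from inside a spill callback skips recovery entirely.
            return false;
        } finally {
            returnedBytes += grant.release();
            returnedBytes += grant.release();  // harmless second cleanup: adds 0
        }
    }

    int recoveries() { return recoveries; }
    long returnedBytes() { return returnedBytes; }
}
```

If the spill callback of an outer 256-byte request issues a nested 64-byte request, the model enters recovery once, never runs the nested callback, and returns exactly 320 bytes to the pool despite the duplicated cleanup call.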

The direct spill path can reset ShuffleExternalSorter while record insertion is in progress, so this PR also makes its pointer-array lifecycle safe for that recovery path:

  • ShuffleInMemorySorter.reset() frees its pointer array and allocates the replacement lazily, outside the spill callback.
  • ShuffleExternalSorter restores the initial pointer array when the next record is inserted.
  • Record insertion rechecks pointer-array capacity after data-page allocation because that allocation may have triggered a spill and reset the array.
  • Empty iteration and cleanup remain valid while the replacement pointer array has not yet been allocated.
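
The pointer-array lifecycle described above can be sketched as follows. This is a simplified stand-in rather than the ShuffleInMemorySorter or ShuffleExternalSorter code: `LazyPointerArrayModel`, the `pageAllocationSpills` flag, and the doubling growth policy are assumptions chosen to keep the example small.

```java
import java.util.Arrays;

// Hypothetical model of a sorter whose pointer array is freed on reset and restored lazily.
final class LazyPointerArrayModel {
    private static final int INITIAL_CAPACITY = 4;

    private long[] pointers = new long[INITIAL_CAPACITY];  // null between reset() and next insert
    private int size = 0;
    private int spills = 0;

    /** Frees the array immediately; the replacement is allocated by the next insert. */
    void reset() {
        pointers = null;
        size = 0;
    }

    private void ensureCapacity() {
        if (pointers == null) {
            pointers = new long[INITIAL_CAPACITY];          // lazy restore after a reset
        } else if (size == pointers.length) {
            pointers = Arrays.copyOf(pointers, size * 2);   // simplified growth policy
        }
    }

    /** pageAllocationSpills stands in for a data-page allocation that triggers a spill. */
    void insert(long recordPointer, boolean pageAllocationSpills) {
        ensureCapacity();
        if (pageAllocationSpills) {
            spills++;
            reset();            // the spill resets the sorter in the middle of this insert
        }
        ensureCapacity();       // recheck: the array may have been freed just above
        pointers[size++] = recordPointer;
    }

    /** Iteration stays valid while no array is allocated: it yields nothing. */
    long[] snapshot() {
        return pointers == null ? new long[0] : Arrays.copyOf(pointers, size);
    }

    boolean hasPointerArray() { return pointers != null; }
    int size() { return size; }
    int spills() { return spills; }
}
```

Without the second `ensureCapacity()` call, an insert whose page allocation spilled would write through a null array. The recheck is what makes the mid-insert reset safe in this model.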

Successful page allocations are unchanged. This introduces no new configuration or public API. Consumers that can spill gain a bounded recovery opportunity; consumers that cannot spill fail promptly rather than entering an unbounded retry loop.

How was this PR tested?

Added deterministic TaskMemoryManagerSuite coverage for:

  • allocator OOM with no spillable memory
  • successful spill and retry using the original grant
  • nested page allocation during allocator recovery
  • off-heap allocator failure
  • idempotent failed-grant cleanup

Added shuffle sorter coverage for:

  • lazy pointer-array allocation after reset
  • cleanup while the pointer array is lazily unallocated
  • data-page allocation triggering a spill during record insertion

Validation performed:

  • TaskMemoryManagerSuite
  • ShuffleInMemorySorterSuite
  • ShuffleExternalSorterSuite
  • UnsafeExternalSorterSuite
  • Core reactor compilation and Checkstyle
  • Scalafmt validation
  • git diff --check
