[SPARK-57242][CORE] Avoid unbounded page allocation retries after allocator OOM #56293
Open
sunchao wants to merge 1 commit into
Why are the changes needed?
Spark page allocation has two distinct steps:
1. MemoryManager grants the task permission to use execution memory.
2. MemoryAllocator physically allocates the on-heap or off-heap page.
An execution-memory grant is an accounting decision; it does not guarantee that the physical allocator can create the page. The allocator may still throw OutOfMemoryError, for example because of memory pressure outside Spark's execution-memory accounting or because a sufficiently large allocation cannot be satisfied.
When that happens, TaskMemoryManager.allocatePage() currently retains the grant as acquired-but-unused memory and recursively calls allocatePage():
Each retry therefore asks for another execution-memory grant even though the task has not physically allocated the previous one. There is no check that retrying made progress. Under a persistent allocator/accounting mismatch, the task can pin an increasing amount of execution memory, recurse repeatedly, and eventually either block waiting for more execution memory or fail far away from the original allocator OOM, instead of recovering or failing promptly.
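The accounting/allocator mismatch described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyMemoryManager`, `failingAllocate`, and `retryUntilPoolExhausted` are hypothetical names, the allocator is forced to fail every time, and a loop stands in for the recursive call. It is not Spark's TaskMemoryManager code.

```java
/** Toy model only: hypothetical names, not Spark's TaskMemoryManager. */
final class UnboundedRetrySketch {

    /** Accounting side: hands out grants from a fixed-size execution pool. */
    static final class ToyMemoryManager {
        private final long poolSize;
        private long granted;

        ToyMemoryManager(long poolSize) {
            this.poolSize = poolSize;
        }

        /** Returns the number of bytes actually granted (0 once the pool is exhausted). */
        long acquire(long bytes) {
            long got = Math.min(bytes, poolSize - granted);
            granted += got;
            return got;
        }

        long granted() {
            return granted;
        }
    }

    /** Physical side: always fails, standing in for persistent allocator pressure. */
    static byte[] failingAllocate(long bytes) {
        throw new OutOfMemoryError("simulated allocator failure for " + bytes + " bytes");
    }

    /**
     * Retries the allocation while keeping every failed grant pinned.
     * Returns the number of physical attempts made before the pool runs dry.
     */
    static int retryUntilPoolExhausted(ToyMemoryManager mm, long pageSize) {
        int attempts = 0;
        while (true) {
            long got = mm.acquire(pageSize);
            if (got < pageSize) {
                // Assumption: a real memory manager might block here instead of returning.
                return attempts;
            }
            attempts++;
            try {
                failingAllocate(pageSize);
            } catch (OutOfMemoryError e) {
                // The grant is kept as acquired-but-unused; nothing checks for progress.
            }
        }
    }

    public static void main(String[] args) {
        ToyMemoryManager mm = new ToyMemoryManager(10 * 1024L);
        int attempts = retryUntilPoolExhausted(mm, 1024L);
        // prints attempts=10 granted=10240
        System.out.println("attempts=" + attempts + " granted=" + mm.granted());
    }
}
```

In this model every failed attempt pins one more page-sized grant, so the whole pool ends up reserved without a single page having been physically allocated.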
This is the generic TaskMemoryManager failure path underlying the long-running allocation retry described by SPARK-54354. That issue bounded temporary memory managers used by hashed relations, while SPARK-54818 improved the allocator OOM diagnostics. The recursive retry behavior remains for other page-allocation consumers.
The intended recovery behavior is instead:
Recovery is best-effort: consumers that support spilling may free enough physical memory for the same allocation to succeed. If the task has no spillable memory, or spilling makes no measurable progress, retrying the same allocation is not useful and the caller should receive SparkOutOfMemoryError.
What changes were proposed in this PR?
For TaskMemoryManager:
SparkOutOfMemoryError.
The direct spill path can reset ShuffleExternalSorter while record insertion is in progress, so this PR also makes its pointer-array lifecycle safe for that recovery path:
- ShuffleInMemorySorter.reset() frees its pointer array and allocates the replacement lazily, outside the spill callback.
- ShuffleExternalSorter restores the initial pointer array when the next record is inserted.
Successful page allocations are unchanged. This introduces no new configuration or public API. Consumers that can spill gain a bounded recovery opportunity; consumers that cannot spill fail promptly rather than entering an unbounded retry loop.
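A bounded spill-then-retry policy of the kind described above might look roughly like the following. This is a hedged sketch, not the PR's implementation: `PageAllocator`, `SketchOutOfMemoryError`, and `allocateWithBoundedRecovery` are hypothetical names, the "retry at most once, and only if the spill freed something" rule is an assumption about what "bounded" means here, and `SketchOutOfMemoryError` merely stands in for SparkOutOfMemoryError.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch of a bounded recovery policy; not the PR's actual code. */
final class BoundedRecoverySketch {

    /** Stand-in for a physical page allocator that may throw OutOfMemoryError. */
    interface PageAllocator {
        byte[] allocate(int bytes);
    }

    /** Stand-in for SparkOutOfMemoryError, defined locally to keep the sketch self-contained. */
    static final class SketchOutOfMemoryError extends OutOfMemoryError {
        SketchOutOfMemoryError(String message) {
            super(message);
        }
    }

    /**
     * Tries the allocation once. On allocator OOM, asks the consumers to spill
     * (modeled by a supplier returning the bytes freed) and retries exactly once,
     * but only if the spill made measurable progress.
     */
    static byte[] allocateWithBoundedRecovery(PageAllocator allocator, LongSupplier spill, int bytes) {
        try {
            return allocator.allocate(bytes);
        } catch (OutOfMemoryError first) {
            long freed = spill.getAsLong();
            if (freed <= 0) {
                // Nothing spillable, or no progress: fail promptly instead of retrying.
                throw new SketchOutOfMemoryError("allocation of " + bytes + " bytes failed; spill freed nothing");
            }
            try {
                return allocator.allocate(bytes);
            } catch (OutOfMemoryError second) {
                throw new SketchOutOfMemoryError(
                    "allocation of " + bytes + " bytes failed after spilling " + freed + " bytes");
            }
        }
    }
}
```

The point of this shape is that the number of physical allocation attempts is fixed (at most two in this sketch), so a persistent allocator failure surfaces close to where it occurred.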
How was this PR tested?
Added deterministic TaskMemoryManagerSuite coverage for:
Added shuffle sorter coverage for:
Validation performed:
- TaskMemoryManagerSuite
- ShuffleInMemorySorterSuite
- ShuffleExternalSorterSuite
- UnsafeExternalSorterSuite
- git diff --check