Fix ArrowScan materializing entire FileScanTask into memory #3037
WIP: not ready for review.
Closes #3036
Rationale for this change
ArrowScan in PyIceberg does not support true streaming, which leads to OOM failures when processing large files (file size > container memory). While the API returns an iterator, the implementation eagerly materializes all record batches for a FileScanTask before yielding the first row.
Two primary bottlenecks were identified in the pyiceberg.io.pyarrow implementation:
1. The internal scan logic wraps the batch iterator in a list() constructor, forcing the entire file into memory.
2. The batch_size parameter is not forwarded to the underlying PyArrow ds.Scanner, so the scanner falls back to PyArrow's default batch size and callers have no granular control over memory use.
This behavior makes it impossible to process files larger than the available memory in distributed environments (e.g., Ray workers).
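A minimal sketch of the streaming pattern this change targets, written against plain PyArrow rather than the actual PyIceberg internals (the function name and default batch size here are illustrative):

```python
from typing import Iterator

import pyarrow as pa
import pyarrow.dataset as ds


def stream_batches(path: str, batch_size: int = 10_000) -> Iterator[pa.RecordBatch]:
    """Yield record batches lazily instead of materializing the whole file."""
    dataset = ds.dataset(path, format="parquet")
    # Forwarding batch_size bounds the size of each yielded batch.
    scanner = dataset.scanner(batch_size=batch_size)
    # to_batches() returns a lazy iterator of pyarrow.RecordBatch objects;
    # nothing here forces the full file into memory the way
    # list(scanner.to_batches()) would.
    yield from scanner.to_batches()
```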
Are these changes tested?
Yes, tested
Are there any user-facing changes?
Yes, a new API on ArrowScan: to_record_batch_stream.
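Illustrative consumption pattern for the new entry point; the exact signature of ArrowScan.to_record_batch_stream is defined by this PR, so the call site and argument names below are assumptions:

```python
import pyarrow as pa


def process_stream(batch_stream) -> int:
    """Consume a lazy stream of record batches with bounded peak memory."""
    total_rows = 0
    for batch in batch_stream:          # each item is a pyarrow.RecordBatch
        assert isinstance(batch, pa.RecordBatch)
        total_rows += batch.num_rows    # handle one batch, then let it go
    return total_rows


# Hypothetical call site (names are assumptions, not the PR's exact API):
#   row_count = process_stream(arrow_scan.to_record_batch_stream(tasks))
# where `arrow_scan` is an ArrowScan instance and `tasks` are FileScanTasks.
```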