Fix ArrowScan materializing entire FileScanTask into memory #3037
WIP: not ready for review.
Closes #3036
Rationale for this change
ArrowScan in PyIceberg does not support true streaming, which leads to OOM failures when processing large files (file size > container memory). While the API returns an iterator, the implementation eagerly materializes all record batches for a FileScanTask before yielding the first row.
Two primary bottlenecks were identified in the pyiceberg.io.pyarrow implementation:
1. The internal scan logic wraps the batch iterator in a list() constructor, forcing the entire file into memory.
2. The batch_size parameter is not forwarded to the underlying PyArrow ds.Scanner, so the scanner falls back to PyArrow's default batch size and callers have no granular control over memory use.
This behavior makes it impossible to process files larger than the available memory in distributed environments (e.g., Ray workers).
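A minimal sketch of the streaming pattern this change targets, written against plain PyArrow rather than the actual PyIceberg internals (the function name and default batch size here are illustrative):

```python
from typing import Iterator

import pyarrow as pa
import pyarrow.dataset as ds


def stream_batches(path: str, batch_size: int = 10_000) -> Iterator[pa.RecordBatch]:
    """Yield record batches lazily instead of materializing the whole file."""
    dataset = ds.dataset(path, format="parquet")
    # Forwarding batch_size bounds the size of each yielded batch.
    scanner = dataset.scanner(batch_size=batch_size)
    # to_batches() returns a lazy iterator of pyarrow.RecordBatch objects;
    # nothing here forces the full file into memory the way
    # list(scanner.to_batches()) would.
    yield from scanner.to_batches()
```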
Are these changes tested?
Yes, tested
Are there any user-facing changes?
Yes, a new API on ArrowScan: to_record_batch_stream.
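Illustrative consumption pattern for the new entry point; the exact signature of ArrowScan.to_record_batch_stream is defined by this PR, so the call site and argument names below are assumptions:

```python
import pyarrow as pa


def process_stream(batch_stream) -> int:
    """Consume a lazy stream of record batches with bounded peak memory."""
    total_rows = 0
    for batch in batch_stream:          # each item is a pyarrow.RecordBatch
        assert isinstance(batch, pa.RecordBatch)
        total_rows += batch.num_rows    # handle one batch, then let it go
    return total_rows


# Hypothetical call site (names are assumptions, not the PR's exact API):
#   row_count = process_stream(arrow_scan.to_record_batch_stream(tasks))
# where `arrow_scan` is an ArrowScan instance and `tasks` are FileScanTasks.
```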