
feat(scan): implement size-based file scan task planning #2031

Closed
xbattlax wants to merge 1 commit into apache:main from xbattlax:scan_file_planning


@xbattlax
Contributor

Summary

Implement size-based file scan task planning for iceberg-rust, addressing issue #128.

Changes

  • Add crates/iceberg/src/scan/bin_packing.rs with:

    • Greedy bin-packing algorithm with configurable lookback
    • BinPackingStream<S> for async streaming (memory efficient)
    • CombinedScanTask grouping for balanced parallel execution
    • Weight calculation considering data size and file open costs
  • Update crates/iceberg/src/scan/context.rs:

    • File splitting based on split_offsets (Parquet row group boundaries)
    • Fallback to byte-range splitting when offsets unavailable
    • Optimized delete file handling (move ownership for last split)
  • Update crates/iceberg/src/scan/mod.rs:

    • Add with_split_target_size(), with_split_open_file_cost(), with_split_lookback() builder methods
    • Add plan_tasks() method returning CombinedScanTaskStream
  • Update crates/iceberg/src/scan/task.rs:

    • Add CombinedScanTask struct and CombinedScanTaskStream type
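As a rough illustration of the greedy bin-packing with lookback listed above, here is a minimal synchronous sketch. It is not the code from bin_packing.rs: the `Task` and `Bin` types and the `task_weight` and `pack_tasks` functions are hypothetical names, and the weight formula (data size, floored at the open-file cost) is an assumption about how "data size and file open costs" might be combined.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a file scan task; only the byte length matters here.
#[derive(Debug, Clone, PartialEq)]
struct Task {
    path: String,
    length: u64,
}

/// Assumed weight: the data size, but never less than the fixed cost of opening a file.
fn task_weight(task: &Task, open_file_cost: u64) -> u64 {
    task.length.max(open_file_cost)
}

/// A bin that is still open for new tasks, with its accumulated weight.
struct Bin {
    tasks: Vec<Task>,
    weight: u64,
}

/// Greedy first-fit packing that keeps at most `lookback` bins open at once.
fn pack_tasks(
    tasks: impl IntoIterator<Item = Task>,
    target_size: u64,
    open_file_cost: u64,
    lookback: usize,
) -> Vec<Vec<Task>> {
    let lookback = lookback.max(1);
    let mut open: VecDeque<Bin> = VecDeque::new();
    let mut done: Vec<Vec<Task>> = Vec::new();

    for task in tasks {
        let w = task_weight(&task, open_file_cost);
        // Look for the first open bin with enough room left.
        let slot = open.iter().position(|b| b.weight + w <= target_size);
        match slot {
            Some(i) => {
                open[i].tasks.push(task);
                open[i].weight += w;
            }
            None => {
                // Nothing fits: start a new bin. An oversized task ends up alone.
                open.push_back(Bin { tasks: vec![task], weight: w });
                // Emit the oldest bin once more than `lookback` bins are open.
                if open.len() > lookback {
                    if let Some(oldest) = open.pop_front() {
                        done.push(oldest.tasks);
                    }
                }
            }
        }
    }

    // Flush whatever is still open, oldest first.
    done.extend(open.into_iter().map(|b| b.tasks));
    done
}
```

With tasks a, b, c, d of 60, 50, 30, and 5 bytes, a target size of 100, and an open-file cost of 10, a lookback of 1 produces the groups [a] and [b, c, d], while a lookback of 2 produces [a, c, d] and [b]. A larger lookback gives the packer more chances to fill earlier bins at the cost of holding more bins in memory; a streaming variant would yield each emitted bin instead of pushing it onto `done`.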

Notes

This matches the Java Iceberg implementation's TableScanUtil.planTasks() functionality:

  • Large files are split into multiple tasks for parallel processing
  • Small files are combined to reduce file open overhead
  • Streaming implementation avoids collecting all tasks into memory
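The splitting of large files can be sketched as follows. This is one plausible reading of the behavior described in this PR, not the code from context.rs: `ByteRange` and `split_file` are hypothetical names, the offsets are assumed to be sorted in ascending order, and cutting at every offset (rather than grouping offsets up to the target size) is an assumption.

```rust
/// Hypothetical byte range within a single data file.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ByteRange {
    start: u64,
    length: u64,
}

/// Split a file of `file_len` bytes into ranges.
/// With split offsets (assumed sorted ascending), cut at each offset.
/// Without them, fall back to fixed-size ranges of `target_size` bytes.
fn split_file(file_len: u64, split_offsets: Option<&[u64]>, target_size: u64) -> Vec<ByteRange> {
    match split_offsets {
        Some(offsets) if !offsets.is_empty() => offsets
            .iter()
            .enumerate()
            .map(|(i, &start)| {
                // Each range ends where the next one begins; the last ends at EOF.
                let end = offsets.get(i + 1).copied().unwrap_or(file_len);
                ByteRange {
                    start,
                    length: end.saturating_sub(start),
                }
            })
            .collect(),
        _ => {
            // Guard against a zero target size, which would never advance.
            let step = target_size.max(1);
            let mut ranges = Vec::new();
            let mut start = 0;
            while start < file_len {
                let length = step.min(file_len - start);
                ranges.push(ByteRange { start, length });
                start += length;
            }
            ranges
        }
    }
}
```

For example, a 250-byte file with no offsets and a 100-byte target yields ranges of 100, 100, and 50 bytes, while the same file with offsets 4 and 120 yields the ranges (4, 116) and (120, 130). Ranges smaller than the target can then be recombined by the packing step, which is how small pieces end up sharing a task.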

Closes #128

@mbutrovich mbutrovich self-requested a review February 2, 2026 18:11
@github-actions
Contributor

github-actions Bot commented Mar 6, 2026

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Mar 6, 2026
@github-actions
Contributor

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this Mar 15, 2026


Development

Successfully merging this pull request may close these issues.

Plan file scan task according scan file size.
