Flink: Pass FileIO on Flink's read path #15663
Conversation
```java
if (version >= 4) {
  if (fileIO != null) {
    out.writeBoolean(true);
    byte[] fileIOBytes = InstantiationUtil.serializeObject(fileIO);
```
What is the size of the serialized fileIO?
This is a bit concerning for me as we basically send this with every task
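To get a rough feel for the size in question, one can serialize a stand-in object with plain Java serialization and inspect the byte count (Flink's `InstantiationUtil.serializeObject` uses Java serialization under the hood, as far as I know). `FakeFileIO` below is a hypothetical stand-in for illustration, not the real Iceberg class:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

class SerializedSizeCheck {

  // Hypothetical stand-in for a FileIO configured with credentials;
  // not the real Iceberg class.
  static class FakeFileIO implements Serializable {
    private final Map<String, String> properties = new HashMap<>();

    FakeFileIO() {
      properties.put("s3.access-key-id", "AKIA-example");
      properties.put("s3.secret-access-key", "secret-example");
    }
  }

  // Serialize with ObjectOutputStream and return the resulting byte count.
  static int serializedSize(Serializable obj) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
      out.writeObject(obj);
    }
    return buffer.size();
  }
}
```

A real `FileIO` carrying Hadoop configuration or credential maps could be considerably larger than this toy object, which is why sending it with every task is worth measuring.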
```java
if (version >= 4) {
  if (fileIO != null) {
    out.writeBoolean(true);
```
Why not just a specific length instead of an extra boolean?
Maybe -1 length for null?
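The -1-length convention suggested here could look like the following sketch; the class and method names are illustrative, not code from this PR:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class NullableBytesSerde {

  // Write a single int: -1 marks a null payload, otherwise the byte count
  // followed by the bytes. This replaces the separate boolean flag.
  static void writeNullableBytes(DataOutputStream out, byte[] bytes) throws IOException {
    if (bytes == null) {
      out.writeInt(-1);
    } else {
      out.writeInt(bytes.length);
      out.write(bytes);
    }
  }

  static byte[] readNullableBytes(DataInputStream in) throws IOException {
    int length = in.readInt();
    if (length < 0) {
      return null; // -1 length means the payload was null
    }
    byte[] bytes = new byte[length];
    in.readFully(bytes);
    return bytes;
  }

  // Helper for exercising both directions in one call.
  static byte[] roundTrip(byte[] payload) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      writeNullableBytes(out, payload);
    }
    try (DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
      return readNullableBytes(in);
    }
  }
}
```

This saves the extra boolean byte and keeps the null case unambiguous, since a valid payload can never have a negative length.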
```java
case 2:
  return in.readUTF();
case 3:
case 4:
```
We will need to add some unit tests for the serialization
```diff
-    CloseableIterable.transform(tasksIterable, IcebergSourceSplit::fromCombinedScanTask));
+    CloseableIterable.transform(
+        planResult.tasks(),
+        task -> IcebergSourceSplit.fromCombinedScanTask(task, planResult.fileIO().get())));
```
Do we have an inkling how the supplier will be implemented? Maybe calling get() for every split would be overkill?
Yes, I agree, this would definitely be overkill, and that's also why we didn't add FileIO to the ScanTaskGroup for Spark. Right now I'm just experimenting here a bit, but we definitely need to improve this.
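One hedged sketch of how the supplier could stay cheap per split: memoize the first get() so every later split reuses the same FileIO instead of rebuilding it. `MemoizedSupplier` is an illustrative name, not code from this PR; Guava's `Suppliers.memoize` provides the same behavior off the shelf:

```java
import java.util.function.Supplier;

// Thread-safe memoizing Supplier using double-checked locking.
// Assumes the delegate never returns null (a null result would be recomputed).
class MemoizedSupplier<T> implements Supplier<T> {
  private final Supplier<T> delegate;
  private volatile T value;

  MemoizedSupplier(Supplier<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public T get() {
    T result = value;
    if (result == null) {
      synchronized (this) {
        result = value;
        if (result == null) {
          // First caller pays the construction cost; later calls reuse it.
          result = delegate.get();
          value = result;
        }
      }
    }
    return result;
  }
}
```

With this, `planResult.fileIO().get()` per split would amortize to a single FileIO construction for the whole plan result.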
```java
static CloseableIterable<CombinedScanTask> planTasks(
    Table table, ScanContext context, ExecutorService workerPool) {

/** Result of planning that includes the scan's FileIO for use when reading. */
private static class PlanResult implements Closeable {
```
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
When accessing/reading data files, the codebase uses the Table's `FileIO` instance through `table.io()` on Flink's read path. With remote scan planning, the `FileIO` instance is configured with a PlanID + custom storage credentials inside `RESTTableScan`, but that instance is never propagated to the place(s) that actually perform the read, thus leading to errors.

This PR passes the `FileIO` obtained during remote/distributed scan planning next to the `Table` instance on Flink's read path.

This is similar to #15448, where we applied the same approach on Spark's read path.