refactor: Change TableScan.projection from indices to expressions #20091

adriangb · 2026-01-31T13:30:21Z

Motivation

This PR was split from #20061 (which consolidates both projections and filters) and enables future work on #19538 and #19387.

The idea here is that:

By placing projections into the TableScan itself we can have the logical optimizer rule from Logical + Physical optimizer for struct field access pushdown #20036 push get_field(...) expressions into TableScan and then when we do logical -> physical planning we can immediately call try_swapping_with_projection on the ExecutionPlan that TableProvider::scan_with_args returns. Thus ensuring we push down the get_field(...) projections into the physical scan without needing a phyiscal optimizer rule that is also aware of expression placement, is implemented for each execuiton plan, etc.
This is a nice cleanup. Coupled with consolidating filters into the scan (Consolidate filter classification into physical planner #20092) we'll end up with a nice compact representation of a "table scan with projections and filters". This also reflects the end state on the physical side where we generally to push projections and filters into the scan operator instead of having nodes right above them.

Changes

Core Type Change

Changed TableScan.projection from Option<Vec<usize>> to Option<Vec<Expr>>
Each expression is a simple Expr::Column reference

New APIs

Added TableScanBuilder for constructing TableScan nodes with expression-based projections directly
Added projection_indices_from_exprs() helper in utils.rs to convert expressions back to indices when needed (for physical planning and serialization)

Backward Compatibility

TableScan::try_new() still accepts Option<Vec<usize>> and converts indices to expressions internally
LogicalPlanBuilder::scan* methods still accept indices for backward compatibility

Updated Components

optimize_projections: Updated to work with expression-based projections using TableScanBuilder
physical_planner: Converts expressions back to indices for ScanArgs
proto serialization: Extracts column names from expressions for serialization
substrait: Converts between expressions and Substrait field indices
SQL unparser: Extracts column names from projection expressions

Test Plan

cargo check passes for all affected crates
cargo test -p datafusion-expr passes
cargo test -p datafusion-optimizer passes
Proto and substrait crates compile successfully

🤖 Generated with Claude Code

This PR changes `TableScan.projection` from `Option<Vec<usize>>` (column indices) to `Option<Vec<Expr>>` (column expressions). This is the first step toward consolidating projection and filter expressions directly in the TableScan node. ## Motivation Currently, TableScan stores projections as column indices which requires constant conversion between indices and expressions throughout the codebase. By storing expressions directly, we: 1. Simplify the data model - projections are naturally expressions 2. Enable future consolidation of filter expressions into projections 3. Reduce conversion overhead in optimization passes This is the first of two PRs splitting PR apache#20061, which consolidates both projections and filters. This PR focuses solely on the projection type change to make the refactoring easier to review. ## Changes - Changed `TableScan.projection` type from `Option<Vec<usize>>` to `Option<Vec<Expr>>` - Added `TableScanBuilder` for constructing TableScan nodes with expression-based projections - Added `projection_indices_from_exprs()` helper to convert expressions back to indices when needed - Updated `TableScan::try_new()` to accept indices (for backward compatibility) and convert them to expressions internally - Updated optimize_projections, physical_planner, proto serialization, substrait conversion, and SQL unparser to work with the new type ## Related Issues / PRs - Split from apache#20061 - Enables future work on filter consolidation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…dices Change the return type from Option<Vec<usize>> to (Option<Vec<Expr>>, Vec<usize>) where: - First element is the remainder projection to apply on top of the scan output (None if projection is all simple column references) - Second element is the column indices to scan For pure column projections, indices preserve projection order (including duplicates) so no remainder projection is needed. For expression projections, indices are deduplicated/sorted for efficiency, and the original expressions are returned as remainder. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Remove separate is_simple_column_projection check and instead use a single loop that tracks whether any complex expressions are encountered via a bool flag. At the end, use the flag to decide whether to return Some or None for the remainder projection. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

adriangb · 2026-01-31T17:33:33Z

datafusion/expr/src/logical_plan/plan.rs

+/// This builder provides a flexible way to construct `TableScan` nodes,
+/// particularly when working with expression-based projections directly.
+#[derive(Clone)]
+pub struct TableScanBuilder {


This is a bit of refactor that while not strictly necessary for this PR I think will be helpful for future (and makes the code in this PR nicer).

adriangb · 2026-01-31T17:34:33Z

datafusion/expr/src/utils.rs

+/// # Returns
+/// * `Some(Vec<usize>)` - If all expressions are column references found in the schema
+/// * `None` - If any expression is not a column reference or not found in schema
+pub fn projection_indices_from_exprs(


I'm not super sold on this function. I feel there is a footgun where you don't know if None means complex expressions (maybe you forgot to check for them?) or if None means a column was not found in the schema (also something that maybe should never happen?).

adriangb

self review

adriangb · 2026-01-31T23:06:59Z

datafusion/core/src/physical_planner.rs

+                    session_state,
+                )?;
+            } else {
+                // Mixed: push non-async expressions + columns needed by async, keep async on top


Need to make sure we have test coverage that hits this. Generally we should run a full coverage check.

adriangb · 2026-01-31T23:08:10Z

datafusion/core/src/physical_planner.rs

+        // Wrap with ProjectionExec if remainder projection is needed
+        if let Some(ref proj_exprs) = remainder_projection {
+            let scan_output_schema = plan.schema();
+            let scan_df_schema = DFSchema::try_from(scan_output_schema.as_ref().clone())?;
+            let unnormalized_proj_exprs: Vec<Expr> =
+                unnormalize_cols(proj_exprs.iter().cloned());
+
+            // Classify expressions as async or non-async


Not sure if there's a better way to handle this. Async UDFs seem to be a bit of a pain. I wonder if there's some existing helpers, etc. If not I think I can also encapsulate this pattern of "split the projection into a top remainder and inner one that matches a closure" since I see us using it a lot.

adriangb · 2026-01-31T23:09:18Z

datafusion/core/src/physical_planner.rs

+        let num_input_columns = input.schema().fields().len();
+        let input_schema = input.schema();
+
+        match self.try_plan_async_exprs(


Need to investigate how try_plan_async_exprs and how it overlaps with the projection splitting above.

adriangb · 2026-01-31T23:10:04Z

datafusion/expr/src/logical_plan/plan.rs

+                                            col.name.clone()
+                                        } else {
+                                            e.schema_name().to_string()
+                                        }


Note: this is in the Display impl

adriangb · 2026-01-31T23:10:28Z

datafusion/expr/src/logical_plan/plan.rs

@@ -2807,14 +2829,129 @@ impl TableScan {
        Ok(Self {


Wonder if we could ruse TableScanBuilder here?

adriangb · 2026-01-31T23:12:05Z

datafusion/core/src/physical_planner.rs

+                let projection_exec =
+                    ProjectionExec::try_new(proj_exprs, Arc::clone(&plan))?;
+
+                match plan.try_swapping_with_projection(&projection_exec)? {


Might move this out of this PR. The idea is that once we have projections in TableScan then the projection pushdown optimizer can handle pushing down get_field expressions all the way into TableScan (past any filters) and then we can here push those into the scan itself (if the scan accepts them).

adriangb · 2026-01-31T23:13:37Z

@ethan-tyler this is another PR of interest for the filename() function (which should behave the same as get_field() in terms of pushing down into the scan)

github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate substrait Changes to the substrait crate proto Related to proto crate labels Jan 31, 2026

adriangb mentioned this pull request Jan 31, 2026

Consolidate filter classification into physical planner #20092

Draft

adriangb marked this pull request as draft January 31, 2026 13:59

adriangb and others added 6 commits January 31, 2026 11:16

incorporate more changes

170818e

fix: collapse nested if-let statements for clippy

aaae938

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

simplify, comment

d5f8a5f

pushdown projections

3503ab2

adriangb commented Jan 31, 2026

View reviewed changes

adriangb marked this pull request as ready for review January 31, 2026 23:02

adriangb commented Jan 31, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: Change TableScan.projection from indices to expressions #20091

refactor: Change TableScan.projection from indices to expressions #20091

adriangb commented Jan 31, 2026 •

edited

Loading

Uh oh!

adriangb Jan 31, 2026

Uh oh!

adriangb Jan 31, 2026

Uh oh!

adriangb left a comment

Uh oh!

adriangb Jan 31, 2026 •

edited

Loading

Uh oh!

adriangb Jan 31, 2026

Uh oh!

adriangb Jan 31, 2026

Uh oh!

adriangb Jan 31, 2026

Uh oh!

adriangb Jan 31, 2026

Uh oh!

adriangb Jan 31, 2026

Uh oh!

adriangb commented Jan 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

refactor: Change TableScan.projection from indices to expressions #20091

Are you sure you want to change the base?

refactor: Change TableScan.projection from indices to expressions #20091

Conversation

adriangb commented Jan 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Changes

Core Type Change

New APIs

Backward Compatibility

Updated Components

Test Plan

Uh oh!

adriangb Jan 31, 2026

Choose a reason for hiding this comment

Uh oh!

adriangb Jan 31, 2026

Choose a reason for hiding this comment

Uh oh!

adriangb left a comment

Choose a reason for hiding this comment

Uh oh!

adriangb Jan 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adriangb Jan 31, 2026

Choose a reason for hiding this comment

Uh oh!

adriangb Jan 31, 2026

Choose a reason for hiding this comment

Uh oh!

adriangb Jan 31, 2026

Choose a reason for hiding this comment

Uh oh!

adriangb Jan 31, 2026

Choose a reason for hiding this comment

Uh oh!

adriangb Jan 31, 2026

Choose a reason for hiding this comment

Uh oh!

adriangb commented Jan 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

adriangb commented Jan 31, 2026 •

edited

Loading

adriangb Jan 31, 2026 •

edited

Loading