
Conversation

@ethan-tyler
Contributor

Which issue does this PR close?

Rationale for this change

This is the end-to-end plumbing PR to get input_file_name() working. I started with an SLT test to define the expected behavior, then built out the plumbing to make it pass. It is scoped to the SELECT list only (the guaranteed pushdown case) per discussion with @alamb and @adriangb, with broader pushdown support to follow once #19538 lands.

What changes are included in this PR?

Adds an input_file_name() function that returns the file path for each row by injecting the value at the file opener boundary. It is opt-in (applied only when the function is referenced), keeps SELECT * output stable, and errors on unsupported contexts.

Analyzer rewrite

  • Rewrites input_file_name() to the reserved column __datafusion_input_file_name (see the sketch below)
  • Annotates TableScan.projected_schema only when needed
  • Errors on reserved-name collisions
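
For illustration, the core of the rewrite looks roughly like the sketch below. This is not the exact rule in the PR: the helper name is made up, field and constructor names can differ across DataFusion versions, and the real rule also annotates the scan schema and checks for collisions.

```rust
use datafusion_common::tree_node::{Transformed, TreeNode};
use datafusion_common::{Column, Result};
use datafusion_expr::Expr;

const RESERVED: &str = "__datafusion_input_file_name";

/// Rewrite every `input_file_name()` call in `expr` to the reserved column.
/// Sketch only: assumes `Expr::ScalarFunction` exposes its `ScalarUDF` via `func`.
fn rewrite_to_reserved_column(expr: Expr) -> Result<Expr> {
    expr.transform(|e| {
        let is_target = matches!(
            &e,
            Expr::ScalarFunction(f) if f.func.name() == "input_file_name"
        );
        if is_target {
            Ok(Transformed::yes(Expr::Column(Column::from_name(RESERVED))))
        } else {
            Ok(Transformed::no(e))
        }
    })
    .map(|t| t.data)
}
```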

Physical planning + execution

  • Planner enables scan-time injection when the internal field is projected
  • FileScanConfig::open wraps the opener to append a Utf8 column with the file location to each batch (sketched below)
  • Stats, equivalence properties, and schema are updated for the appended field
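
For illustration, the per-batch append at the opener boundary looks roughly like the sketch below, written against plain arrow-rs. The helper name is hypothetical; in the PR the equivalent logic is wired through FileScanConfig::open.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Append a Utf8 column holding `file_path` (repeated once per row) to `batch`.
fn append_input_file_name(
    batch: &RecordBatch,
    file_path: &str,
) -> Result<RecordBatch, ArrowError> {
    // Extend the schema with the reserved field.
    let mut fields: Vec<Field> = batch
        .schema()
        .fields()
        .iter()
        .map(|f| f.as_ref().clone())
        .collect();
    fields.push(Field::new("__datafusion_input_file_name", DataType::Utf8, false));

    // Extend the columns with a constant Utf8 array of the file location.
    let mut columns: Vec<ArrayRef> = batch.columns().to_vec();
    columns.push(Arc::new(StringArray::from(vec![file_path; batch.num_rows()])));

    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}
```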

Optimizer

  • OptimizeProjections handles the internal column safely (prevents an out-of-bounds column index)
  • Regression test: a reserved column already present in the source schema is not treated as injected

Scope (V1)

  • Works in the SELECT list only
  • Plan-time errors for non-file sources (VALUES/MemTable), joins (ambiguous file origin), and non-SELECT-list usage (WHERE/GROUP BY/ORDER BY/HAVING)

Are these changes tested?

Yes.

cargo test -p datafusion-sqllogictest --test sqllogictests -- input_file_name.slt
cargo test -p datafusion-datasource extended_file_columns_inject_input_file_name -q
cargo test -p datafusion-optimizer optimize_projections_keeps_reserved_column_from_source -q

The SLT test uses CSV for deterministic multi-file assertions. Parquet is supported via the same FileScanConfig path, and Parquet-specific SLTs can follow.

Are there any user-facing changes?

Yes. New 0-arg volatile scalar function: input_file_name() -> Utf8

CREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION '...';
SELECT col1, input_file_name() FROM t;

SELECT * output unchanged unless input_file_name() is explicitly referenced.
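
A hypothetical end-to-end use from Rust, assuming the function is registered by default once this lands (table, path, and column names are illustrative):

```rust
use datafusion::error::Result;
use datafusion::prelude::{CsvReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register a directory of CSV files as a single table.
    ctx.register_csv("t", "data/", CsvReadOptions::new()).await?;
    // Each row reports which file it came from.
    ctx.sql("SELECT col1, input_file_name() FROM t")
        .await?
        .show()
        .await?;
    Ok(())
}
```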

@adriangb
Contributor

I think it should be much simpler than this: create a UDF and pass it through ProjectionExprs::transform_exprs(|expr| expr.transform(|expr| // if expr is our ScalarUDF, replace with literal filename))

@ethan-tyler
Contributor Author

> I think it should be much simpler than this: create a UDF and pass it through ProjectionExprs::transform_exprs(|expr| expr.transform(|expr| // if expr is our ScalarUDF, replace with literal filename))

Thanks @adriangb - really appreciate the feedback!

I've been working on implementing your approach where input_file_name() stays as a 0-arg scalar UDF, then ProjectionOpener::open(partitioned_file) does expr.transform(...) to replace the ScalarFunctionExpr with Literal(Utf8(partitioned_file.location)).
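
Roughly, the shape I have in mind is the sketch below (exact crate paths and the Transformed API may differ by DataFusion version; the helper name is just illustrative):

```rust
use std::sync::Arc;

use datafusion_common::tree_node::{Transformed, TreeNode};
use datafusion_common::{Result, ScalarValue};
use datafusion_physical_expr::expressions::Literal;
use datafusion_physical_expr::{PhysicalExpr, ScalarFunctionExpr};

/// Replace every `input_file_name()` call in `expr` with a Utf8 literal of `file_path`.
fn replace_with_file_path(
    expr: Arc<dyn PhysicalExpr>,
    file_path: &str,
) -> Result<Arc<dyn PhysicalExpr>> {
    expr.transform(|e| {
        let is_input_file_name = e
            .as_any()
            .downcast_ref::<ScalarFunctionExpr>()
            .map(|f| f.name() == "input_file_name")
            .unwrap_or(false);
        if is_input_file_name {
            // Swap the UDF call for a constant literal holding the file location.
            let lit = Literal::new(ScalarValue::Utf8(Some(file_path.to_string())));
            Ok(Transformed::yes(Arc::new(lit) as Arc<dyn PhysicalExpr>))
        } else {
            Ok(Transformed::no(e))
        }
    })
    .map(|t| t.data)
}
```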

Does this direction look right to you?

One question on plumbing: in many plans (e.g. SELECT input_file_name() FROM t WHERE a > 1), a top ProjectionExec can't push through FilterExec without dropping columns needed by the filter. This causes the UDF to stay above the scan and hit the runtime error guard.

For this first PR, which is better?

  1. Minimal opener rewrite only: works only when projection naturally pushes into the scan
  2. Include analyzer/optimizer/planner glue: marks scan as needing the reserved column, injects UDF into file source projection so it works with filters/sorts/limits

@adriangb
Contributor

adriangb commented Jan 30, 2026

> > I think it should be much simpler than this: create a UDF and pass it through ProjectionExprs::transform_exprs(|expr| expr.transform(|expr| // if expr is our ScalarUDF, replace with literal filename))
>
> Thanks @adriangb - really appreciate the feedback!
>
> I've been working on implementing your approach where input_file_name() stays as a 0-arg scalar UDF, then ProjectionOpener::open(partitioned_file) does expr.transform(...) to replace the ScalarFunctionExpr with Literal(Utf8(partitioned_file.location)).
>
> Does this direction look right to you?

Yes that's what I'm thinking. Both ProjectionOpener and FileSource implementations that don't use it (ParquetOpener) should do the rewrite.

> One question on plumbing: in many plans (e.g. SELECT input_file_name() FROM t WHERE a > 1), a top ProjectionExec can't push through FilterExec without dropping columns needed by the filter. This causes the UDF to stay above the scan and hit the runtime error guard.
>
> For this first PR, which is better?
>
> 1. Minimal opener rewrite only: works only when projection naturally pushes into the scan
>
> 2. Include analyzer/optimizer/planner glue: marks scan as needing the reserved column, injects UDF into file source projection so it works with filters/sorts/limits

This is precisely what I was referring to in #6051 (comment). Analyzer/optimizer/planner glue to sort this out is very complex and error-prone, so I think let's not do that for now. You should make the default behavior NULL instead of an error, so that if the expression is not rewritten the whole query doesn't fail.

@alamb
Contributor

alamb commented Jan 30, 2026

A new ScalarUDF (that errors by default but is rewritten by the Parquet opener) seems like a good design to me ❤️

@ethan-tyler
Contributor Author

Thanks both - agreed on keeping this simpler. Pivoted to the "minimal opener rewrite + NULL default" direction. Looking forward to your feedback!

@adriangb appreciate the reminder on our previous convo, sometimes I need it right in front of me 😅. Still owe you eyes on #19538 as promised.

ethan-tyler changed the title from "[WIP] feat: add input_file_name() for file-backed scans (plumbing PR)" to "feat: add input_file_name() for file-backed scans (plumbing PR)" on Jan 31, 2026

Labels

  • datasource: Changes to the datasource crate
  • documentation: Improvements or additions to documentation
  • functions: Changes to functions implementation
  • sqllogictest: SQL Logic Tests (.slt)


Development

Successfully merging this pull request may close these issues.

Add input_file_name built-in function
