
Conversation

@ethan-tyler
Contributor

Which issue does this PR close?

Rationale for this change

This is the end-to-end plumbing PR to get input_file_name() working. I started with an SLT test to define the expected behavior, then built out the plumbing to make it pass. It is scoped to the SELECT list only (the guaranteed pushdown case) per discussion with @alamb and @adriangb, with broader pushdown support to follow once #19538 lands.

What changes are included in this PR?

Adds an input_file_name() function that returns the file path for each row by injecting the value at the file opener boundary. It is opt-in (applied only when the function is referenced), keeps SELECT * output stable, and errors on unsupported contexts.

Analyzer rewrite

  • Rewrites input_file_name() to the reserved column __datafusion_input_file_name (see the sketch below)
  • Annotates TableScan.projected_schema only when needed
  • Errors on reserved-name collisions
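
For illustration, the core of the rewrite looks roughly like the sketch below. This is not the exact rule in the PR: the helper name is made up, field and constructor names can differ across DataFusion versions, and the real rule also annotates the scan schema and checks for collisions.

```rust
use datafusion_common::tree_node::{Transformed, TreeNode};
use datafusion_common::{Column, Result};
use datafusion_expr::Expr;

const RESERVED: &str = "__datafusion_input_file_name";

/// Rewrite every `input_file_name()` call in `expr` to the reserved column.
/// Sketch only: assumes `Expr::ScalarFunction` exposes its `ScalarUDF` via `func`.
fn rewrite_to_reserved_column(expr: Expr) -> Result<Expr> {
    expr.transform(|e| {
        let is_target = matches!(
            &e,
            Expr::ScalarFunction(f) if f.func.name() == "input_file_name"
        );
        if is_target {
            Ok(Transformed::yes(Expr::Column(Column::from_name(RESERVED))))
        } else {
            Ok(Transformed::no(e))
        }
    })
    .map(|t| t.data)
}
```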

Physical planning + execution

  • Planner enables scan-time injection when the internal field is projected
  • FileScanConfig::open wraps the opener to append a Utf8 column with the file location to each batch (sketched below)
  • Stats, equivalence properties, and schema are updated for the appended field
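
For illustration, the per-batch append at the opener boundary looks roughly like the sketch below, written against plain arrow-rs. The helper name is hypothetical; in the PR the equivalent logic is wired through FileScanConfig::open.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Append a Utf8 column holding `file_path` (repeated once per row) to `batch`.
fn append_input_file_name(
    batch: &RecordBatch,
    file_path: &str,
) -> Result<RecordBatch, ArrowError> {
    // Extend the schema with the reserved field.
    let mut fields: Vec<Field> = batch
        .schema()
        .fields()
        .iter()
        .map(|f| f.as_ref().clone())
        .collect();
    fields.push(Field::new("__datafusion_input_file_name", DataType::Utf8, false));

    // Extend the columns with a constant Utf8 array of the file location.
    let mut columns: Vec<ArrayRef> = batch.columns().to_vec();
    columns.push(Arc::new(StringArray::from(vec![file_path; batch.num_rows()])));

    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}
```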

Optimizer

  • OptimizeProjections handles the internal column safely (prevents an out-of-bounds column index)
  • Regression test: a reserved column already present in the source schema is not treated as injected

Scope (V1)

  • Works in the SELECT list only
  • Plan-time errors for non-file sources (VALUES/MemTable), joins (ambiguous file origin), and non-SELECT-list usage (WHERE/GROUP BY/ORDER BY/HAVING)

Are these changes tested?

Yes.

cargo test -p datafusion-sqllogictest --test sqllogictests -- input_file_name.slt
cargo test -p datafusion-datasource extended_file_columns_inject_input_file_name -q
cargo test -p datafusion-optimizer optimize_projections_keeps_reserved_column_from_source -q

The SLT test uses CSV for deterministic multi-file assertions. Parquet is supported via the same FileScanConfig path, and Parquet-specific SLTs can follow.

Are there any user-facing changes?

Yes. New 0-arg volatile scalar function: input_file_name() -> Utf8

CREATE EXTERNAL TABLE t STORED AS PARQUET LOCATION '...';
SELECT col1, input_file_name() FROM t;

SELECT * output unchanged unless input_file_name() is explicitly referenced.
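
A hypothetical end-to-end use from Rust, assuming the function is registered by default once this lands (table, path, and column names are illustrative):

```rust
use datafusion::error::Result;
use datafusion::prelude::{CsvReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register a directory of CSV files as a single table.
    ctx.register_csv("t", "data/", CsvReadOptions::new()).await?;
    // Each row reports which file it came from.
    ctx.sql("SELECT col1, input_file_name() FROM t")
        .await?
        .show()
        .await?;
    Ok(())
}
```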

@adriangb
Contributor

I think it should be much simpler than this: create a UDF and pass it through ProjectionExprs::transform_exprs(|expr| expr.transform(|expr| // if expr is our ScalarUDF, replace with literal filename))

@ethan-tyler
Contributor Author

> I think it should be much simpler than this: create a UDF and pass it through ProjectionExprs::transform_exprs(|expr| expr.transform(|expr| // if expr is our ScalarUDF, replace with literal filename))

Thanks @adriangb - really appreciate the feedback!

I've been working on implementing your approach where input_file_name() stays as a 0-arg scalar UDF, then ProjectionOpener::open(partitioned_file) does expr.transform(...) to replace the ScalarFunctionExpr with Literal(Utf8(partitioned_file.location)).
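
Roughly, the shape I have in mind is the sketch below (exact crate paths and the Transformed API may differ by DataFusion version; the helper name is just illustrative):

```rust
use std::sync::Arc;

use datafusion_common::tree_node::{Transformed, TreeNode};
use datafusion_common::{Result, ScalarValue};
use datafusion_physical_expr::expressions::Literal;
use datafusion_physical_expr::{PhysicalExpr, ScalarFunctionExpr};

/// Replace every `input_file_name()` call in `expr` with a Utf8 literal of `file_path`.
fn replace_with_file_path(
    expr: Arc<dyn PhysicalExpr>,
    file_path: &str,
) -> Result<Arc<dyn PhysicalExpr>> {
    expr.transform(|e| {
        let is_input_file_name = e
            .as_any()
            .downcast_ref::<ScalarFunctionExpr>()
            .map(|f| f.name() == "input_file_name")
            .unwrap_or(false);
        if is_input_file_name {
            // Swap the UDF call for a constant literal holding the file location.
            let lit = Literal::new(ScalarValue::Utf8(Some(file_path.to_string())));
            Ok(Transformed::yes(Arc::new(lit) as Arc<dyn PhysicalExpr>))
        } else {
            Ok(Transformed::no(e))
        }
    })
    .map(|t| t.data)
}
```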

Does this direction look right to you?

One question on plumbing: in many plans (e.g. SELECT input_file_name() FROM t WHERE a > 1), a top ProjectionExec can't push through FilterExec without dropping columns needed by the filter. This causes the UDF to stay above the scan and hit the runtime error guard.

For this first PR, which is better?

  1. Minimal opener rewrite only: works only when projection naturally pushes into the scan
  2. Include analyzer/optimizer/planner glue: marks scan as needing the reserved column, injects UDF into file source projection so it works with filters/sorts/limits

@adriangb
Contributor

adriangb commented Jan 30, 2026

> > I think it should be much simpler than this: create a UDF and pass it through ProjectionExprs::transform_exprs(|expr| expr.transform(|expr| // if expr is our ScalarUDF, replace with literal filename))
>
> Thanks @adriangb - really appreciate the feedback!
>
> I've been working on implementing your approach where input_file_name() stays as a 0-arg scalar UDF, then ProjectionOpener::open(partitioned_file) does expr.transform(...) to replace the ScalarFunctionExpr with Literal(Utf8(partitioned_file.location)).
>
> Does this direction look right to you?

Yes that's what I'm thinking. Both ProjectionOpener and FileSource implementations that don't use it (ParquetOpener) should do the rewrite.

> One question on plumbing: in many plans (e.g. SELECT input_file_name() FROM t WHERE a > 1), a top ProjectionExec can't push through FilterExec without dropping columns needed by the filter. This causes the UDF to stay above the scan and hit the runtime error guard.
>
> For this first PR, which is better?
>
> 1. Minimal opener rewrite only: works only when projection naturally pushes into the scan
>
> 2. Include analyzer/optimizer/planner glue: marks scan as needing the reserved column, injects UDF into file source projection so it works with filters/sorts/limits

This is precisely what I was referring to in #6051 (comment). Analyzer/optimizer/planner glue to sort this out is very complex and error-prone, so I think let's not do that for now. You should make the default behavior NULL instead of an error, so that if the expression is not rewritten the whole query doesn't fail.

@alamb
Contributor

alamb commented Jan 30, 2026

A new ScalarUDF (that errors by default but is rewritten by the Parquet opener) seems like a good design to me ❤️

@ethan-tyler
Contributor Author

Thanks both - agreed on keeping this simpler. Pivoted to the "minimal opener rewrite + NULL default" direction. Looking forward to your feedback!

@adriangb appreciate the reminder on our previous convo, sometimes I need it right in front of me 😅. Still owe you eyes on #19538 as promised.

ethan-tyler changed the title from "[WIP] feat: add input_file_name() for file-backed scans (plumbing PR)" to "feat: add input_file_name() for file-backed scans (plumbing PR)" on Jan 31, 2026

Labels

  • datasource: Changes to the datasource crate
  • documentation: Improvements or additions to documentation
  • functions: Changes to functions implementation
  • sqllogictest: SQL Logic Tests (.slt)


Development

Successfully merging this pull request may close these issues.

Add input_file_name built-in function
