feat: add input_file_name() for file-backed scans (plumbing PR) #20071
base: main
I think it should be much simpler than this: create a UDF and pass it through
Thanks @adriangb - really appreciate the feedback! I've been working on implementing your approach where Does this direction look right to you? One question on plumbing: in many plans (e.g. For this first PR, which is better?
Yes that's what I'm thinking. Both
This is precisely what I was referring to in #6051 (comment). analyzer/optimizer/planner to sort this out is very complex and error prone. I think let's not do that for now. You should make the default behavior null instead of an error so that if the expression is not rewritten the whole query doesn't fail.
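To make the null-by-default suggestion concrete, here is a minimal sketch in plain Rust. It is not DataFusion code: the `Expr` enum and `evaluate` function are hypothetical stand-ins, and the only point is that an unrewritten `input_file_name()` evaluates to NULL (modeled as `None`) rather than failing the query.

```rust
/// Hypothetical, simplified expression type (NOT DataFusion's real `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// A string literal; `None` models SQL NULL.
    Literal(Option<String>),
    /// A call to `input_file_name()` that no file scan has rewritten.
    InputFileName,
}

/// Evaluate an expression for a single row.
/// An unrewritten `InputFileName` yields NULL instead of an error,
/// so the rest of the query can still run.
fn evaluate(expr: &Expr) -> Result<Option<String>, String> {
    match expr {
        Expr::Literal(value) => Ok(value.clone()),
        Expr::InputFileName => Ok(None),
    }
}
```

With this shape, a plan where the scan did rewrite the call sees the file path, and any other plan sees NULL in that column.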
A new scalar UDF (that errors by default but is rewritten by the parquet opener) seems like a good design to me ❤️
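As a rough illustration of the "rewritten by the opener" idea, the sketch below walks a toy expression tree and replaces each unresolved `input_file_name()` call with a literal holding the current file's path. The `Expr` enum and `rewrite_for_file` are hypothetical names invented for this example; DataFusion's actual expression types and rewrite hooks differ.

```rust
/// Hypothetical, simplified expression tree (NOT DataFusion's real types).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// Reference to a column by name.
    Column(String),
    /// A string literal.
    Literal(String),
    /// Unresolved call to `input_file_name()`.
    InputFileName,
    /// A function call with arguments, e.g. `upper(input_file_name())`.
    Call { name: String, args: Vec<Expr> },
}

/// Replace every `InputFileName` node with a literal holding `path`.
/// All other nodes are returned unchanged; call arguments are rewritten recursively.
fn rewrite_for_file(expr: Expr, path: &str) -> Expr {
    match expr {
        Expr::InputFileName => Expr::Literal(path.to_string()),
        Expr::Call { name, args } => Expr::Call {
            name,
            args: args
                .into_iter()
                .map(|arg| rewrite_for_file(arg, path))
                .collect(),
        },
        other => other,
    }
}
```

Because the rewrite happens once per opened file, the path is known at that point and no per-row lookup is needed.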
Signed-off-by: Ethan Urbanski <ethan@urbanskitech.com>
Which issue does this PR close?
Rationale for this change
This is the end to end plumbing PR to get input_file_name() working. Started with an SLT test to define the expected behavior, then built out the plumbing to make it pass. Scoped to SELECT-list only (guaranteed pushdown case) per discussion with @alamb and @adriangb, with broader pushdown support to follow once #19538 lands.
What changes are included in this PR?
Add input_file_name() function that returns the file path for each row by injecting the value at the file opener boundary. Opt in (only when referenced), keeps SELECT * stable, errors on unsupported contexts.
Analyzer
- rewrite input_file_name() to reserved column __datafusion_input_file_name
- TableScan.projected_schema only when needed
Physical planning + execution
- FileScanConfig::open wraps opener to append Utf8 column with file location per batch
Optimizer
- OptimizeProjections handles internal column safely (prevents index OOB)
Scope (V1)
Are these changes tested?
Yes.
SLT uses CSV for deterministic multi file assertions. Parquet supported via same FileScanConfig path and Parquet specific SLTs can follow.
Are there any user-facing changes?
Yes. New 0-arg volatile scalar function: input_file_name() -> Utf8
SELECT * output unchanged unless input_file_name() is explicitly referenced.
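
The opener wrapping described in the PR summary can be sketched with plain Rust types. This is a simplified model, not the PR's implementation: `Batch`, `Opener`, and `with_input_file_name` are hypothetical names, batches are modeled as vectors of optional strings rather than Arrow arrays, and the real opener is asynchronous and streaming. Only the reserved column name `__datafusion_input_file_name` comes from the description.

```rust
/// Hypothetical stand-in for a record batch: named columns of nullable strings.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    columns: Vec<(String, Vec<Option<String>>)>,
    num_rows: usize,
}

/// Reserved column name, as given in the PR description.
const INPUT_FILE_NAME_COL: &str = "__datafusion_input_file_name";

/// Hypothetical opener: given a file location, returns that file's batches.
type Opener = Box<dyn Fn(&str) -> Vec<Batch>>;

/// Wrap `inner` so every batch it produces gains a trailing column
/// containing the file location, repeated once per row.
fn with_input_file_name(inner: Opener) -> Opener {
    Box::new(move |location: &str| {
        inner(location)
            .into_iter()
            .map(|mut batch| {
                let values = vec![Some(location.to_string()); batch.num_rows];
                batch
                    .columns
                    .push((INPUT_FILE_NAME_COL.to_string(), values));
                batch
            })
            .collect()
    })
}
```

Wrapping at this boundary keeps the file formats themselves unaware of the extra column, which is consistent with the note that CSV and Parquet share the same path.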