Avoid null-restrict evaluation for predicates that reference non-join columns in PushDownFilter by kosiew · Pull Request #20961 · apache/datafusion

kosiew · 2026-03-16T14:07:22Z

Which issue does this PR close?

Part of perf: push_down_filter is pathologically slow for some plans #20002

Rationale for this change

PushDownFilter has been identified as a major planning-time bottleneck, with profiling showing significant time spent repeatedly evaluating expression types while reasoning about predicate pushdown and null-restriction behavior. This PR addresses a slice of that problem by reducing unnecessary work in null-restriction evaluation and by tightening join-filter conversion behavior to avoid incorrect rewrites in scalar-side join cases.

In particular, the patch focuses on two goals:

Avoid expensive authoritative null-restriction evaluation when a cheaper syntactic determination is sufficient.
Preserve correct post-join filter semantics for joins involving scalar-producing inputs, which regressed when filters were converted into join conditions too aggressively.

Together, these changes both improve optimizer robustness and target one of the hot paths called out in the profiling for PushDownFilter.

What changes are included in this PR?

This PR includes the following changes:

Adds regression SQL tests under datafusion/core/tests/sql/push_down_filter_regressions.rs and wires the new module into the SQL test suite.
Adds end-to-end regression coverage for:
- window + scalar subquery planning and execution
- sqllogictest-style execution with push_down_filter enabled/disabled
- aggregate regression functions returning non-NULL results
- correlated IN subqueries
- NATURAL JOIN combined with UNION ALL
Adds an optimizer regression test to ensure a post-join filter is retained for a cross join where one side is scalar/aggregated.
Changes push_down_filter so filters are only converted into join conditions for inner joins when it is safe to do so, specifically avoiding that conversion for joins with empty ON clauses when either side is known to be scalar (max_rows() == Some(1)).
Updates inferred predicate handling to treat null-restriction evaluation failures conservatively via unwrap_or(false).
Refactors null-restriction evaluation in optimizer::utils by:
- introducing NullRestrictionEvalMode and a test-only guard for controlling evaluation mode
- adding a fast syntactic null-substitution path for common predicate shapes
- short-circuiting predicates that reference columns outside the provided join-column set
- retaining the existing authoritative evaluation path as a fallback
- adding debug assertions to verify the syntactic fast path matches the authoritative evaluator in debug builds
Expands unit test coverage for is_restrict_null_predicate and for agreement between the syntactic fast path and the authoritative evaluator.

Are these changes tested?

Yes.

The PR adds both logical optimizer tests and SQL-level regression coverage:

new SQL regression tests covering the query patterns implicated by this area of the optimizer
a focused optimizer unit test ensuring filters are not incorrectly folded into join conditions for scalar-side cross joins
expanded unit tests for is_restrict_null_predicate, including additional CASE, BETWEEN, IN, and null-related predicate forms
a validation test that checks the syntactic fast path against the authoritative evaluator for supported expressions
plan-shape assertions comparing optimized and physical plans with push_down_filter enabled and disabled

These tests verify both correctness and protection against the regressions addressed by this patch.

Are there any user-facing changes?

There are no intended user-facing API changes.

This PR improves optimizer behavior and planner performance characteristics internally, and fixes incorrect or fragile filter-pushdown behavior for certain query shapes. Users may observe more stable planning behavior and preserved correctness for affected queries, but no documentation or API changes should be required.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

…ting conditions

Introduce a test case to assert non-restricting behavior when evaluating the predicate a > b, focusing on join keys that only include a. This directly tests the new early-return branch in the is_restrict_null_predicate function in utils.rs, enhancing overall code coverage.

Extract the column-membership check into a new helper function called `predicate_uses_only_columns` in utils.rs. Update the current implementation at utils.rs:91 to use this new helper, improving code readability and maintainability.

Add call-site contract comment in push_down_filter.rs to specify that only Ok(true) is treated as null-restricting. State that both Ok(false) and Err(_) are considered non-restricting and will be skipped during processing.

Inline iterator predicate in utils.rs and streamline the null-restrict handling in push_down_filter.rs. This reduces indirections and lines of code while maintaining the same logic and behavior. No public interface or behavior changes intended.

…te_uses_only_columns function

kosiew · 2026-03-16T14:08:56Z

run benchmark sql_planner_extended

adriangbot · 2026-03-16T14:11:46Z

🤖 Criterion benchmark running (GKE) | trigger
Linux bench-c4067930471-314-dhsls 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing push-down-02-20002 (3d3945c) to ab28234 (merge-base) diff
BENCH_NAME=sql_planner_extended
BENCH_COMMAND=cargo bench --features=parquet --bench sql_planner_extended
BENCH_FILTER=
Results will be posted here when complete

kosiew · 2026-03-18T07:51:37Z

show benchmark queue

adriangbot · 2026-03-18T07:51:38Z

Hi @kosiew, you asked to view the benchmark queue (#20961 (comment)).

No pending jobs.

kosiew · 2026-03-18T07:52:02Z

run benchmark sql_planner_extended

kosiew · 2026-03-18T07:52:55Z

show benchmark queue

adriangbot · 2026-03-18T07:52:57Z

Hi @kosiew, you asked to view the benchmark queue (#20961 (comment)).

Comment	Repo	PR	User	Benchmarks	Status
#4080428628	apache/datafusion	#20961	kosiew	["sql_planner_extended"]	running

adriangbot · 2026-03-18T07:54:33Z

🤖 Criterion benchmark running (GKE) | trigger
Linux bench-c4080428628-401-tbflf 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing push-down-02-20002 (3d3945c) to ab28234 (merge-base) diff
BENCH_NAME=sql_planner_extended
BENCH_COMMAND=cargo bench --features=parquet --bench sql_planner_extended
BENCH_FILTER=
Results will be posted here when complete

kosiew · 2026-03-18T08:48:50Z

show benchmark queue

adriangbot · 2026-03-18T08:48:52Z

Hi @kosiew, you asked to view the benchmark queue (#20961 (comment)).

Comment	Repo	PR	User	Benchmarks	Status
#4080428628	apache/datafusion	#20961	kosiew	["sql_planner_extended"]	running

kosiew · 2026-03-18T12:30:39Z

show benchmark queue

adriangbot · 2026-03-18T12:30:40Z

Hi @kosiew, you asked to view the benchmark queue (#20961 (comment)).

Comment	Repo	PR	User	Benchmarks	Status
#4082114714	apache/datafusion	#21026	Dandandan	["clickbench_partitioned"]	running
#4082114714	apache/datafusion	#21026	Dandandan	["tpcds"]	running
#4082114714	apache/datafusion	#21026	Dandandan	["tpch"]	running

Implement fast path for syntactic null-restriction in utils.rs to classify predicates without evaluating physical expressions. Enhance SQL boolean handling with a large supporting evaluator, including CASE management. Retain existing branch helper styles and expand test coverage for constant simple CASE and outside-join-key fast paths. Fix correctness edge case for simple CASE indeterminate comparisons, ensuring proper tracking and fallback to Unknown.

kosiew · 2026-03-18T15:20:19Z

I think the benchmark never completes or gets killed because it's too heavy.
Amending benchmark in #21029

kosiew · 2026-03-19T02:24:24Z

run benchmark sql_planner_extended --sample-size 10

kosiew · 2026-03-19T02:25:48Z

show benchmark queue

adriangbot · 2026-03-19T02:25:50Z

Hi @kosiew, you asked to view the benchmark queue (#20961 (comment)).

Comment	Repo	PR	User	Benchmarks	Status
#4087159258	apache/datafusion	#20961	kosiew	["sql_planner_extended"]	running
#4087159258	apache/datafusion	#20961	kosiew	["--sample-size"]	running
#4087159258	apache/datafusion	#20961	kosiew	["10"]	running

adriangbot · 2026-03-19T02:27:15Z

🤖 Criterion benchmark running (GKE) | trigger
Linux bench-c4087159258-443-hxscx 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing push-down-02-20002 (3d3945c) to ab28234 (merge-base) diff
BENCH_NAME=sql_planner_extended
BENCH_COMMAND=cargo bench --features=parquet --bench sql_planner_extended
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-03-19T02:27:16Z

🤖 Criterion benchmark running (GKE) | trigger
Linux bench-c4087159258-444-4d9n5 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing push-down-02-20002 (3d3945c) to ab28234 (merge-base) diff
BENCH_NAME=--sample-size
BENCH_COMMAND=cargo bench --features=parquet --bench --sample-size
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-03-19T02:27:16Z

Benchmark for this request failed.

Last 20 lines of output:

Click to expand

HEAD is now at ab28234 Support `columns_sorted` in row_filters (#20497)
rustc 1.94.0 (4a4ef493e 2026-03-02)
3d3945ce07b7015c11b0a4f89f3b456d785b7bdf
ab2823475d0c79a749120ae354572ab85c043b78
error: unexpected argument '--sample-size' found

  tip: a similar argument exists: '--examples'
  tip: to pass '--sample-size' as a value, use '-- --sample-size'

Usage: cargo bench --features <FEATURES> --bench [<NAME>] --examples [BENCHNAME] [-- [ARGS]...]

For more information, try '--help'.
error: unexpected argument '--sample-size' found

  tip: a similar argument exists: '--examples'
  tip: to pass '--sample-size' as a value, use '-- --sample-size'

Usage: cargo bench --features <FEATURES> --bench [<NAME>] --examples [BENCHNAME] [-- [ARGS]...]

For more information, try '--help'.

adriangbot · 2026-03-19T02:27:49Z

🤖 Criterion benchmark running (GKE) | trigger
Linux bench-c4087159258-445-b4z9j 6.12.55+ #1 SMP Sun Feb 1 08:59:41 UTC 2026 aarch64 GNU/Linux
Comparing push-down-02-20002 (3d3945c) to ab28234 (merge-base) diff
BENCH_NAME=10
BENCH_COMMAND=cargo bench --features=parquet --bench 10
BENCH_FILTER=
Results will be posted here when complete

adriangbot · 2026-03-19T02:27:54Z

Benchmark for this request failed.

Last 20 lines of output:

Click to expand

Cloning into '/workspace/datafusion-branch'...
push-down-02-20002
From https://github.com/apache/datafusion
 * [new ref]         refs/pull/20961/head -> push-down-02-20002
 * branch            main                 -> FETCH_HEAD
Switched to branch 'push-down-02-20002'
ab2823475d0c79a749120ae354572ab85c043b78
Cloning into '/workspace/datafusion-base'...
HEAD is now at ab28234 Support `columns_sorted` in row_filters (#20497)
rustc 1.94.0 (4a4ef493e 2026-03-02)
3d3945ce07b7015c11b0a4f89f3b456d785b7bdf
ab2823475d0c79a749120ae354572ab85c043b78
    Blocking waiting for file lock on package cache
    Blocking waiting for file lock on package cache
    Blocking waiting for file lock on package cache
error: no bench target named `10` in default-run packages

help: a target with a similar name exists: `chr`

Implement focused regression tests in push_down_filter_regressions.rs. These tests cover the following SQL shapes: window + scalar subquery, regr_* aggregate query, correlated IN subquery, and NATURAL JOIN with UNION ALL cases. This ensures accurate reporting and enhances SQL handling reliability.

Extract authoritative_restrict_null_predicate() to directly access the old physical-evaluation path. Implement a debug check in is_restrict_null_predicate() to ensure fast-path results match the authoritative evaluator. Add a unit test to verify consistency between both paths, enhancing debug capabilities and test coverage.

Remove early-return handling for complex constructs like CASE and scalar functions. Narrow fast path to a limited subset including aliases, literals, direct join-key column substitution, and specific operators. All other cases now utilize evaluate_expr_with_null_column(...) for authoritative evaluation.

Add window_scalar_subquery_optimizer_delta to compare query execution plans with different push_down_filter settings. Implement test hook for easier comparison without manual edits. Confirm that optimizer changes the plan from Filter + Cross Join to Inner Join, indicating a key delta for this case.

Prevent conversion of post-join filters to join filters when the join lacks equijoin keys and one side is scalar. This change maintains the filter + cross join structure for failing scalar-subquery shapes. Added tests for optimizer-level guards and regression coverage.

Refactor null checks in utils.rs by collapsing fast-path gates and removing duplicated branches for various null handling scenarios. Extract reusable helpers for null checks to improve readability. In push_down_filter.rs, streamline the scalar-cross-join guard and eliminate inline boolean expressions from the push_down_all_join function for clearer logic.

kosiew added 6 commits March 16, 2026 21:38

fix: optimize null predicate evaluation by early exit for non-restric…

b9828ca

…ting conditions

Clarify null-restricting behavior in filter check

144cab3

Add call-site contract comment in push_down_filter.rs to specify that only Ok(true) is treated as null-restricting. State that both Ok(false) and Err(_) are considered non-restricting and will be skipped during processing.

refactor: streamline null predicate evaluation by introducing predica…

3d3945c

…te_uses_only_columns function

github-actions bot added the optimizer Optimizer rules label Mar 16, 2026

github-actions bot added the core Core DataFusion crate label Mar 19, 2026

kosiew added 5 commits March 19, 2026 21:42

kosiew force-pushed the push-down-02-20002 branch from 8658e02 to 9665c63 Compare March 19, 2026 14:27

kosiew added 3 commits March 19, 2026 22:29

new

00dcba4

clippy fix

4fdeb7c

Conversation

kosiew commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

LLM-generated code disclosure

Uh oh!

kosiew commented Mar 16, 2026

Uh oh!

adriangbot commented Mar 16, 2026

Uh oh!

kosiew commented Mar 18, 2026

Uh oh!

adriangbot commented Mar 18, 2026

Uh oh!

kosiew commented Mar 18, 2026

Uh oh!

kosiew commented Mar 18, 2026

Uh oh!

adriangbot commented Mar 18, 2026

Uh oh!

adriangbot commented Mar 18, 2026

Uh oh!

kosiew commented Mar 18, 2026

Uh oh!

adriangbot commented Mar 18, 2026

Uh oh!

kosiew commented Mar 18, 2026

Uh oh!

adriangbot commented Mar 18, 2026

Uh oh!

kosiew commented Mar 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kosiew commented Mar 19, 2026

Uh oh!

kosiew commented Mar 19, 2026

Uh oh!

adriangbot commented Mar 19, 2026

Uh oh!

adriangbot commented Mar 19, 2026

Uh oh!

adriangbot commented Mar 19, 2026

Uh oh!

adriangbot commented Mar 19, 2026

Uh oh!

adriangbot commented Mar 19, 2026

Uh oh!

adriangbot commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kosiew commented Mar 16, 2026 •

edited

Loading

kosiew commented Mar 18, 2026 •

edited

Loading