feat(functions-nested): add array_filter higher-order function by ologlogn · Pull Request #21895 · apache/datafusion

ologlogn · 2026-04-28T16:22:46Z

Which issue does this PR close?

Partially addresses #14509 — implements array_filter / list_filter.

Rationale for this change

array_transform (#21679) added the first HigherOrderUDF. array_filter is the natural companion: filter array elements with a boolean lambda, matching Spark filter / DuckDB list_filter semantics.

What changes are included in this PR?

- New HigherOrderUDF ArrayFilter (array_filter / list_filter alias)
  - Boolean lambda per element; true keeps, false/null drops (matches Spark semantics)
  - Handles List, LargeList, sliced arrays, null sublists
  - Scalar predicate short-circuit (x -> true / x -> false)
  - No-copy fast path when nothing is filtered (skips arrow::compute::filter)
- lambda_utils.rs: shared HOF helpers extracted from array_transform (value_lambda_pair, coerce_single_list_arg, single_list_lambda_parameters, extract_list_values)
- test_utils.rs: shared unit test helpers (create_i32_list, eval_hof_on_i32_list)
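
As a rough illustration of the keep/drop rule and the no-copy fast path listed above, here is a minimal std-only sketch. It is not the PR's code: `filter_one_list` is a hypothetical name, the list is modelled as a plain `&[i32]`, and the lambda's per-element output is modelled as `Option<bool>` with `None` standing in for SQL NULL.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for array_filter applied to a single list.
/// `predicate[i]` is the lambda's result for `values[i]`:
/// Some(true) keeps the element, Some(false) and None (NULL) drop it.
fn filter_one_list<'a>(values: &'a [i32], predicate: &[Option<bool>]) -> Cow<'a, [i32]> {
    assert_eq!(values.len(), predicate.len());

    // Fast path: nothing is filtered out, so hand back the input without copying.
    if predicate.iter().all(|p| *p == Some(true)) {
        return Cow::Borrowed(values);
    }

    // Slow path: copy only the elements whose predicate is exactly true.
    let kept: Vec<i32> = values
        .iter()
        .zip(predicate)
        .filter(|(_, p)| **p == Some(true))
        .map(|(v, _)| *v)
        .collect();
    Cow::Owned(kept)
}
```

In this model, a caller can tell the fast path was taken because the result is `Cow::Borrowed`; the real implementation presumably achieves the same effect by reusing the existing Arrow buffers.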

Are these changes tested?

Unit tests: basic filter, multiple sublists, sliced arrays, null sublists, all-filtered-out, nothing-filtered (fast path), scalar true/false predicates
SQL logic tests in array_filter.slt: filter variants, array_filter + array_transform combinations, error cases

Are there any user-facing changes?

Yes — array_filter(array, lambda) and alias list_filter(array, lambda) are now available as SQL functions.

ologlogn · 2026-04-28T18:09:02Z

Hi @gabotechs, could you please trigger CI? Thanks!

benbellick · 2026-04-29T14:10:35Z

Forgive me if I missed something, but how does this function behave if the array argument is null or the lambda itself is null? Could we have some tests for those as well?

A null array returns null.
If the lambda returns null for some elements, those elements are filtered out: null is treated as false.
If the lambda always returns null, the output is an empty list.

I will try to add SQL tests for this.

Added in array_filter.slt: null array input returns null, lambda returning null for some elements drops those elements (null treated as false), lambda always returning null returns empty list.
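
The three cases discussed above can be modelled in a few lines of plain Rust. This is only a sketch under assumptions: `NullableList` and `array_filter_model` are hypothetical names, and `Option` stands in for SQL NULL at both the list level and the element level.

```rust
/// Hypothetical model of a nullable list of nullable i32 elements.
type NullableList = Option<Vec<Option<i32>>>;

/// Models the null handling described in this thread:
/// a NULL list yields NULL, and an element is kept only when
/// the lambda returns exactly Some(true).
fn array_filter_model<F>(list: &NullableList, lambda: F) -> NullableList
where
    F: Fn(Option<i32>) -> Option<bool>,
{
    // NULL array in, NULL out; the lambda is never evaluated.
    let elements = list.as_ref()?;
    Some(
        elements
            .iter()
            .copied()
            // Some(false) and None (NULL predicate) both drop the element.
            .filter(|e| lambda(*e) == Some(true))
            .collect(),
    )
}
```

With a lambda such as `|x| x.map(|v| v > 1)`, a NULL element produces a NULL predicate and is therefore dropped, while a lambda that always returns `None` turns any non-null list into an empty list rather than NULL.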

LiaCastaneda · 2026-04-29T13:22:45Z

+impl ArrayFilter {
+    pub fn new() -> Self {
+        Self {
+            signature: HigherOrderSignature::user_defined(Volatility::Immutable),


I plan to open a PR soon so we can be more specific about the lambda signature we want (e.g. exact types), so that all the validation can be moved into the planner (and value_lambda_pair can potentially be removed).

@LiaCastaneda, do you want to merge yours first, or should we merge it after this PR?

ologlogn · 2026-05-04T15:13:57Z

Hey @comphead, could you review this too? It builds on top of the lambda PR 🙏

LiaCastaneda · 2026-05-18T10:38:43Z

+        match list.data_type() {
+            DataType::List(_) | DataType::LargeList(_) => {}
+            other => return plan_err!("expected list, got {other}"),
+        }


I think this check is redundant -- during planning, coerce_value_types is called before this function, so it should have already errored out.

LiaCastaneda · 2026-05-18T11:00:18Z

+        };
+
+        let values_param = || Ok(Arc::clone(&list_values));
+        let predicate_cv = lambda.evaluate(&[&values_param], |arrays| {


nit: I think a name like predicate_output would be clearer

LiaCastaneda · 2026-05-18T12:16:33Z

+            let keep = predicate.is_valid(j) && predicate.value(j);
+            selection.append(keep);
+            if keep {
+                count += 1;
+            }


Is the convention for array_filter to always drop the nulls? If we have something like array_filter([1, 2, 3], x -> if x = 2 then null else x > 1), is the convention to return [3] or [null, 3]?

Maybe we should just follow what the arrow filter kernel does (https://docs.rs/arrow/latest/arrow/compute/kernels/filter/fn.filter.html); if it skips the nulls, it's probably fine. Also, if the kernel skips the nulls or treats them as false, then I think we can skip manually building a mask here.

edit: looks like arrow already does this https://github.com/apache/arrow-rs/blob/fd1c5b391e169762a0981870c4e94baa3372d7a3/arrow-select/src/filter.rs#L171

Hm, conventionally array_filter should take a lambda that returns a boolean.

Null predicate values are treated as false (element dropped) — this matches arrow::compute::filter exactly, which internally ANDs the predicate values with its validity bitmap (prep_null_mask_filter), so null → false. Added a test for this in array_filter.slt.
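
To make the "AND with the validity bitmap" step concrete, here is a small plain-Rust model. It is not arrow's implementation: `null_mask_to_false` is a hypothetical name, and the nullable boolean array is modelled as two parallel `bool` slices (values and validity) instead of packed bitmaps.

```rust
/// Models a nullable boolean array as two parallel slices:
/// `values[i]` is the stored bit and `validity[i]` is false when slot i is NULL.
/// The returned mask is true only where the slot is both valid and true,
/// so a NULL predicate behaves like false.
fn null_mask_to_false(values: &[bool], validity: &[bool]) -> Vec<bool> {
    assert_eq!(values.len(), validity.len());
    values
        .iter()
        .zip(validity)
        .map(|(value, valid)| *value && *valid)
        .collect()
}
```

Note that the stored bit under a NULL slot can be anything, which is why the AND is needed: a NULL slot whose underlying bit happens to be true must still not select its element.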

LiaCastaneda · 2026-05-18T12:24:55Z

nit: I think the structure we want is one sqllogic test per higher order function #21903 (review) under test_files/array

Done — moved all array_filter tests to a dedicated array_filter.slt file.

gabotechs · 2026-05-18T12:04:14Z


-query error
+query error DataFusion error: Error during planning: array_transform requires 1 value argument, got 0
 select array_transform();
----
-DataFusion error: Error during planning: array_transform function requires 1 value arguments, got 0


 query error DataFusion error: Error during planning: array_transform expected a list as first argument, got Int64


🤔 why did this change? I'd expect this test to remain the same vs main

Those changes (v@0 → v@1) were introduced by the lambda column capture PR (#21323), which this PR depends on. After rebasing onto main (which now includes #21323), our PR no longer touches array_transform.slt at all.

gabotechs · 2026-05-18T12:36:18Z

+        _step: usize,
+        fields: &[ValueOrLambda<FieldRef, Option<FieldRef>>],
+    ) -> Result<LambdaParametersProgress> {
+        crate::lambda_utils::single_list_lambda_parameters(self.name(), fields)


This is typical of how LLMs import other modules. I'd avoid it and stay consistent with the rest of the import statements in the project by placing the appropriate imports at the top of this file.

gabotechs · 2026-05-18T12:37:14Z

+        let (list, _lambda) = value_lambda_pair(self.name(), args.arg_fields)?;
+
+        match list.data_type() {
+            DataType::List(_) | DataType::LargeList(_) => {}


Are ListView and LargeListView not supported? it's fine not to do it in this PR, but I'd probably leave a comment for the future stating that there is no technical limitation preventing View variants to be implemented.

IIRC they get cast to List by type coercion during planning: #18921 (comment)

gabotechs · 2026-05-18T12:39:43Z

+    ))
+}
+
+use crate::lambda_utils::value_lambda_pair;


🤔 there seems to be a random import here in the middle of the file. How about just placing it above like any other import?

gabotechs · 2026-05-18T12:57:08Z

+
+/// Filters flat list values using a boolean predicate, returning filtered values and
+/// recomputed per-sublist offsets. Null predicate values are treated as false.
+fn filter_list_values<O: OffsetSizeTrait>(


Isn't this function essentially re-implementing arrow::compute::filter? Typically, it's recommended to delegate to arrow compute kernels, as they already know how to perform computation efficiently.

But maybe I'm missing something, is there a reason why we cannot use arrow::compute::filter?

arrow::compute::filter operates on flat arrays — it has no concept of list offsets, so we still need to recompute per-sublist offsets manually. That said, we now delegate the actual value filtering entirely to the arrow kernel (which already treats null predicate values as false), removing the manual BooleanBufferBuilder you flagged.
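
The offset bookkeeping that remains after delegating to the kernel can be sketched in plain Rust as follows. This is a hypothetical model, not the PR's `filter_list_values`: values are a flat `&[i32]`, offsets are `usize` instead of an `OffsetSizeTrait` type, and `keep` is the already null-resolved mask over the flat values.

```rust
/// Given flat list values, per-sublist offsets, and a keep mask over the
/// flat values, returns the filtered flat values and the recomputed offsets.
/// Sublist i spans values[offsets[i]..offsets[i + 1]].
fn filter_list_values_model(
    values: &[i32],
    offsets: &[usize],
    keep: &[bool],
) -> (Vec<i32>, Vec<usize>) {
    assert_eq!(values.len(), keep.len());
    assert!(!offsets.is_empty(), "offsets always has at least one entry");

    let mut new_values = Vec::new();
    let mut new_offsets = Vec::with_capacity(offsets.len());
    // Output offsets always start at 0, even if the input was a slice
    // whose first offset is non-zero.
    new_offsets.push(0);

    for window in offsets.windows(2) {
        let (start, end) = (window[0], window[1]);
        for i in start..end {
            if keep[i] {
                new_values.push(values[i]);
            }
        }
        // Each sublist ends where the kept values currently end.
        new_offsets.push(new_values.len());
    }
    (new_values, new_offsets)
}
```

For example, the lists [[1, 2], [], [3, 4, 5]] are stored as values [1, 2, 3, 4, 5] with offsets [0, 2, 2, 5]; applying the mask [false, true, true, false, true] leaves values [2, 3, 5] with offsets [0, 1, 1, 3], i.e. [[2], [], [3, 5]]. In the real implementation the inner copy would be the arrow kernel's job, and only the per-sublist counts would be computed by hand.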

gabotechs · 2026-05-18T12:59:58Z

+query error DataFusion error: Error during planning: array_filter expected a list as first argument, got Int64
+SELECT array_filter(1, v -> v > 0);
+


How about another test case that inverts the order of the arguments? Something like:

SELECT array_filter(v -> v > 0, [1, 2, 3]);

Added — SELECT array_filter(v -> v > 0, [1, 2, 3]) errors with array_filter expects a value followed by a lambda.

gabotechs · 2026-05-18T13:00:39Z

+##############
+## array_filter tests
+##############
+


Really nice tests!

ologlogn · 2026-05-18T17:09:11Z

@LiaCastaneda @gabotechs addressed your comments~ thank you for reviewing!

github-actions Bot added the functions Changes to functions implementation label Apr 28, 2026

ologlogn force-pushed the array-filter-lambda branch from 6ff8773 to 07e4548 Compare April 28, 2026 16:39

ologlogn force-pushed the array-filter-lambda branch from 07e4548 to 44715ac Compare April 28, 2026 18:24

github-actions Bot added sqllogictest SQL Logic Tests (.slt) documentation Improvements or additions to documentation labels Apr 28, 2026

ologlogn force-pushed the array-filter-lambda branch from cbf076a to 36c8f36 Compare April 29, 2026 12:06

ologlogn mentioned this pull request Apr 29, 2026

add any_match higher-order function #21903

Merged

ologlogn closed this Apr 29, 2026

ologlogn force-pushed the array-filter-lambda branch from cb94b16 to ec92925 Compare April 29, 2026 13:00

ologlogn reopened this Apr 29, 2026

ologlogn force-pushed the array-filter-lambda branch from 36c8f36 to 406f85b Compare April 29, 2026 13:03

LiaCastaneda mentioned this pull request Apr 29, 2026

[EPIC] Full lambda support #21172

Open

27 tasks

benbellick reviewed Apr 29, 2026


ologlogn force-pushed the array-filter-lambda branch from 406f85b to 4e0caa6 Compare April 29, 2026 14:19

LiaCastaneda reviewed Apr 29, 2026


davidlghellin mentioned this pull request Apr 30, 2026

feat: add array_filter function lakehq/sail#1320

Draft

ologlogn force-pushed the array-filter-lambda branch from 8190f18 to efd9b30 Compare May 4, 2026 14:46

ologlogn requested a review from benbellick May 7, 2026 15:25

ologlogn force-pushed the array-filter-lambda branch from efd9b30 to c701da7 Compare May 7, 2026 16:06

ologlogn requested a review from LiaCastaneda May 11, 2026 13:59

LiaCastaneda reviewed May 18, 2026


gabotechs reviewed May 18, 2026


ologlogn force-pushed the array-filter-lambda branch from c701da7 to 6003157 Compare May 18, 2026 16:54

feat(functions-nested): add array_filter higher-order function

6208b0b

ologlogn force-pushed the array-filter-lambda branch from 6003157 to 6208b0b Compare May 18, 2026 17:03

