Skip to content

feat(functions-nested): add array_filter higher-order function#21895

Open
ologlogn wants to merge 1 commit into
apache:mainfrom
ologlogn:array-filter-lambda
Open

feat(functions-nested): add array_filter higher-order function#21895
ologlogn wants to merge 1 commit into
apache:mainfrom
ologlogn:array-filter-lambda

Conversation

@ologlogn
Copy link
Copy Markdown

@ologlogn ologlogn commented Apr 28, 2026

Which issue does this PR close?

Partially addresses #14509 — implements array_filter / list_filter.

Rationale for this change

array_transform (#21679) added the first HigherOrderUDF. array_filter is the natural companion: filter array elements with a boolean lambda, matching Spark filter / DuckDB list_filter semantics.

What changes are included in this PR?

  • New HigherOrderUDF ArrayFilter (array_filter / list_filter alias)
    • Boolean lambda per element; true keeps, false/null drops (matches Spark semantics)
    • Handles List, LargeList, sliced arrays, null sublists
    • Scalar predicate short-circuit (x -> true / x -> false)
    • No-copy fast path when nothing is filtered (skips arrow::compute::filter)
  • lambda_utils.rs: shared HOF helpers extracted from array_transform (value_lambda_pair, coerce_single_list_arg, single_list_lambda_parameters, extract_list_values)
  • test_utils.rs: shared unit test helpers (create_i32_list, eval_hof_on_i32_list)

Are these changes tested?

  • Unit tests: basic filter, multiple sublists, sliced arrays, null sublists, all-filtered-out, nothing-filtered (fast path), scalar true/false predicates
  • SQL logic tests in array_filter.slt: filter variants, array_filter + array_transform combinations, error cases

Are there any user-facing changes?

Yes — array_filter(array, lambda) and alias list_filter(array, lambda) are now available as SQL functions.

@github-actions github-actions Bot added the functions Changes to functions implementation label Apr 28, 2026
@ologlogn ologlogn force-pushed the array-filter-lambda branch from 6ff8773 to 07e4548 Compare April 28, 2026 16:39
@ologlogn
Copy link
Copy Markdown
Author

Hi @gabotechs, could you please trigger CI? Thanks!

@ologlogn ologlogn force-pushed the array-filter-lambda branch from 07e4548 to 44715ac Compare April 28, 2026 18:24
@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) documentation Improvements or additions to documentation labels Apr 28, 2026
@ologlogn ologlogn force-pushed the array-filter-lambda branch from cbf076a to 36c8f36 Compare April 29, 2026 12:06
@ologlogn ologlogn closed this Apr 29, 2026
@ologlogn ologlogn force-pushed the array-filter-lambda branch from cb94b16 to ec92925 Compare April 29, 2026 13:00
@ologlogn ologlogn reopened this Apr 29, 2026
@ologlogn ologlogn force-pushed the array-filter-lambda branch from 36c8f36 to 406f85b Compare April 29, 2026 13:03
@LiaCastaneda LiaCastaneda mentioned this pull request Apr 29, 2026
27 tasks
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgive me if I missed something, but how does this function behave if the array argument is null or the lambda itself is null? Could we have some tests for those as well?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null array -> should return null.
if lambda returns null for some elements -> those will be filtered out. Null is treated as false.
if lambda always returns null -> output will be empty list.

i will try to add sql tests for this

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in array_filter.slt: null array input returns null, lambda returning null for some elements drops those elements (null treated as false), lambda always returning null returns empty list.

@ologlogn ologlogn force-pushed the array-filter-lambda branch from 406f85b to 4e0caa6 Compare April 29, 2026 14:19
impl ArrayFilter {
pub fn new() -> Self {
Self {
signature: HigherOrderSignature::user_defined(Volatility::Immutable),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to open a PR soon so we can be more specific about the Lambda signature we want (e.g. exact types) so all the validation can be hidden into the planner (and potentially be able to remove value_lambda_pair)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LiaCastaneda , do you wanna merge your before?
or we can merge it after this PR?

@ologlogn
Copy link
Copy Markdown
Author

ologlogn commented May 4, 2026

hey @comphead is it possible to review this too? It's on the top of lambda PR~ 🙏

@ologlogn ologlogn requested a review from benbellick May 7, 2026 15:25
@ologlogn ologlogn force-pushed the array-filter-lambda branch from efd9b30 to c701da7 Compare May 7, 2026 16:06
@ologlogn ologlogn requested a review from LiaCastaneda May 11, 2026 13:59
Comment on lines +118 to +121
match list.data_type() {
DataType::List(_) | DataType::LargeList(_) => {}
other => return plan_err!("expected list, got {other}"),
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this check is redundant -- during planning, coerce_value_types is called before this function, so it should have already errored out.

};

let values_param = || Ok(Arc::clone(&list_values));
let predicate_cv = lambda.evaluate(&[&values_param], |arrays| {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think a name like predicate_output would be clearer

Comment on lines +287 to +291
let keep = predicate.is_valid(j) && predicate.value(j);
selection.append(keep);
if keep {
count += 1;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the convention of array_filter to always to drop the nulls? if we have something like array_filter([1, 2, 3], x -> if x = 2 then null else x > 1) is the convention to return [3] or [null,3]?

Copy link
Copy Markdown
Contributor

@LiaCastaneda LiaCastaneda May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should just follow what the arrow filter kernel does https://docs.rs/arrow/latest/arrow/compute/kernels/filter/fn.filter.html, if it skips the nulls it's probably fine. Also if the kernel skips/treats as false the nulls then I think we can skip manually builsing a mask here

edit: looks like arrow already does this https://github.com/apache/arrow-rs/blob/fd1c5b391e169762a0981870c4e94baa3372d7a3/arrow-select/src/filter.rs#L171

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm. conventionally array filter should take a lambda which return boolean.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Null predicate values are treated as false (element dropped) — this matches arrow::compute::filter exactly, which internally ANDs the predicate values with its validity bitmap (prep_null_mask_filter), so null → false. Added a test for this in array_filter.slt.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think the structure we want is one sqllogic test per higher order function #21903 (review) under test_files/array

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — moved all array_filter tests to a dedicated array_filter.slt file.

Comment on lines 395 to 400

query error
query error DataFusion error: Error during planning: array_transform requires 1 value argument, got 0
select array_transform();
----
DataFusion error: Error during planning: array_transform function requires 1 value arguments, got 0


query error DataFusion error: Error during planning: array_transform expected a list as first argument, got Int64
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 why did this change? I'd expect this test to remain the same vs main

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those changes (v@0v@1) were introduced by the lambda column capture PR (#21323), which this PR depends on. After rebasing onto main (which now includes #21323), our PR no longer touches array_transform.slt at all.

_step: usize,
fields: &[ValueOrLambda<FieldRef, Option<FieldRef>>],
) -> Result<LambdaParametersProgress> {
crate::lambda_utils::single_list_lambda_parameters(self.name(), fields)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very typical about how LLMs import other modules. I'd try to avoid it in favor of being consistent with the rest of import statements in the project, and instead, place the appropriate imports at the top of this file.

let (list, _lambda) = value_lambda_pair(self.name(), args.arg_fields)?;

match list.data_type() {
DataType::List(_) | DataType::LargeList(_) => {}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are ListView and LargeListView not supported? it's fine not to do it in this PR, but I'd probably leave a comment for the future stating that there is no technical limitation preventing View variants to be implemented.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc they get casted to List during type coercion during planning #18921 (comment)

))
}

use crate::lambda_utils::value_lambda_pair;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 there seems to be a random import here in the middle of the file. How about just placing it above like any other import?


/// Filters flat list values using a boolean predicate, returning filtered values and
/// recomputed per-sublist offsets. Null predicate values are treated as false.
fn filter_list_values<O: OffsetSizeTrait>(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this function essentially re-implementing arrow::compute::filter? Typically, it's recommended to delegate to arrow compute kernels, as they already know how to perform computation efficiently.

But maybe I'm missing something, is there a reason why we cannot use arrow::compute::filter?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arrow::compute::filter operates on flat arrays — it has no concept of list offsets, so we still need to recompute per-sublist offsets manually. That said, we now delegate the actual value filtering entirely to the arrow kernel (which already treats null predicate values as false), removing the manual BooleanBufferBuilder you flagged.

Comment on lines +527 to +529
query error DataFusion error: Error during planning: array_filter expected a list as first argument, got Int64
SELECT array_filter(1, v -> v > 0);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about another test case that inverts the order of the arguments? somethign like:

SELECT array_filter(v -> v > 0, [1, 2, 3]);

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added — SELECT array_filter(v -> v > 0, [1, 2, 3]) errors with array_filter expects a value followed by a lambda.

Comment on lines +424 to +427
##############
## array_filter tests
##############

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice tests!

@ologlogn ologlogn force-pushed the array-filter-lambda branch from c701da7 to 6003157 Compare May 18, 2026 16:54
@ologlogn ologlogn force-pushed the array-filter-lambda branch from 6003157 to 6208b0b Compare May 18, 2026 17:03
@ologlogn
Copy link
Copy Markdown
Author

@LiaCastaneda @gabotechs addressed your comments~ thank you for reviewing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants