feat(functions-nested): add array_filter higher-order function#21895
feat(functions-nested): add array_filter higher-order function#21895ologlogn wants to merge 1 commit into
Conversation
6ff8773 to
07e4548
Compare
|
Hi @gabotechs, could you please trigger CI? Thanks! |
07e4548 to
44715ac
Compare
cbf076a to
36c8f36
Compare
cb94b16 to
ec92925
Compare
36c8f36 to
406f85b
Compare
There was a problem hiding this comment.
Forgive me if I missed something, but how does this function behave if the array argument is null or the lambda itself is null? Could we have some tests for those as well?
There was a problem hiding this comment.
null array -> should return null.
if lambda returns null for some elements -> those will be filtered out. Null is treated as false.
if lambda always returns null -> output will be empty list.
i will try to add sql tests for this
There was a problem hiding this comment.
Added in array_filter.slt: null array input returns null, lambda returning null for some elements drops those elements (null treated as false), lambda always returning null returns empty list.
406f85b to
4e0caa6
Compare
| impl ArrayFilter { | ||
| pub fn new() -> Self { | ||
| Self { | ||
| signature: HigherOrderSignature::user_defined(Volatility::Immutable), |
There was a problem hiding this comment.
I plan to open a PR soon so we can be more specific about the Lambda signature we want (e.g. exact types) so all the validation can be hidden into the planner (and potentially be able to remove value_lambda_pair)
There was a problem hiding this comment.
@LiaCastaneda , do you wanna merge your before?
or we can merge it after this PR?
8190f18 to
efd9b30
Compare
|
hey @comphead is it possible to review this too? It's on the top of lambda PR~ 🙏 |
efd9b30 to
c701da7
Compare
| match list.data_type() { | ||
| DataType::List(_) | DataType::LargeList(_) => {} | ||
| other => return plan_err!("expected list, got {other}"), | ||
| } |
There was a problem hiding this comment.
I think this check is redundant -- during planning, coerce_value_types is called before this function, so it should have already errored out.
| }; | ||
|
|
||
| let values_param = || Ok(Arc::clone(&list_values)); | ||
| let predicate_cv = lambda.evaluate(&[&values_param], |arrays| { |
There was a problem hiding this comment.
nit: I think a name like predicate_output would be clearer
| let keep = predicate.is_valid(j) && predicate.value(j); | ||
| selection.append(keep); | ||
| if keep { | ||
| count += 1; | ||
| } |
There was a problem hiding this comment.
Is the convention of array_filter to always to drop the nulls? if we have something like array_filter([1, 2, 3], x -> if x = 2 then null else x > 1) is the convention to return [3] or [null,3]?
There was a problem hiding this comment.
Maybe we should just follow what the arrow filter kernel does https://docs.rs/arrow/latest/arrow/compute/kernels/filter/fn.filter.html, if it skips the nulls it's probably fine. Also if the kernel skips/treats as false the nulls then I think we can skip manually builsing a mask here
edit: looks like arrow already does this https://github.com/apache/arrow-rs/blob/fd1c5b391e169762a0981870c4e94baa3372d7a3/arrow-select/src/filter.rs#L171
There was a problem hiding this comment.
hm. conventionally array filter should take a lambda which return boolean.
There was a problem hiding this comment.
Null predicate values are treated as false (element dropped) — this matches arrow::compute::filter exactly, which internally ANDs the predicate values with its validity bitmap (prep_null_mask_filter), so null → false. Added a test for this in array_filter.slt.
There was a problem hiding this comment.
nit: I think the structure we want is one sqllogic test per higher order function #21903 (review) under test_files/array
There was a problem hiding this comment.
Done — moved all array_filter tests to a dedicated array_filter.slt file.
|
|
||
| query error | ||
| query error DataFusion error: Error during planning: array_transform requires 1 value argument, got 0 | ||
| select array_transform(); | ||
| ---- | ||
| DataFusion error: Error during planning: array_transform function requires 1 value arguments, got 0 | ||
|
|
||
|
|
||
| query error DataFusion error: Error during planning: array_transform expected a list as first argument, got Int64 |
There was a problem hiding this comment.
🤔 why did this change? I'd expect this test to remain the same vs main
| _step: usize, | ||
| fields: &[ValueOrLambda<FieldRef, Option<FieldRef>>], | ||
| ) -> Result<LambdaParametersProgress> { | ||
| crate::lambda_utils::single_list_lambda_parameters(self.name(), fields) |
There was a problem hiding this comment.
This is very typical about how LLMs import other modules. I'd try to avoid it in favor of being consistent with the rest of import statements in the project, and instead, place the appropriate imports at the top of this file.
| let (list, _lambda) = value_lambda_pair(self.name(), args.arg_fields)?; | ||
|
|
||
| match list.data_type() { | ||
| DataType::List(_) | DataType::LargeList(_) => {} |
There was a problem hiding this comment.
Are ListView and LargeListView not supported? it's fine not to do it in this PR, but I'd probably leave a comment for the future stating that there is no technical limitation preventing View variants to be implemented.
There was a problem hiding this comment.
iirc they get casted to List during type coercion during planning #18921 (comment)
| )) | ||
| } | ||
|
|
||
| use crate::lambda_utils::value_lambda_pair; |
There was a problem hiding this comment.
🤔 there seems to be a random import here in the middle of the file. How about just placing it above like any other import?
|
|
||
| /// Filters flat list values using a boolean predicate, returning filtered values and | ||
| /// recomputed per-sublist offsets. Null predicate values are treated as false. | ||
| fn filter_list_values<O: OffsetSizeTrait>( |
There was a problem hiding this comment.
Isn't this function essentially re-implementing arrow::compute::filter? Typically, it's recommended to delegate to arrow compute kernels, as they already know how to perform computation efficiently.
But maybe I'm missing something, is there a reason why we cannot use arrow::compute::filter?
There was a problem hiding this comment.
arrow::compute::filter operates on flat arrays — it has no concept of list offsets, so we still need to recompute per-sublist offsets manually. That said, we now delegate the actual value filtering entirely to the arrow kernel (which already treats null predicate values as false), removing the manual BooleanBufferBuilder you flagged.
| query error DataFusion error: Error during planning: array_filter expected a list as first argument, got Int64 | ||
| SELECT array_filter(1, v -> v > 0); | ||
|
|
There was a problem hiding this comment.
How about another test case that inverts the order of the arguments? somethign like:
SELECT array_filter(v -> v > 0, [1, 2, 3]);There was a problem hiding this comment.
Added — SELECT array_filter(v -> v > 0, [1, 2, 3]) errors with array_filter expects a value followed by a lambda.
| ############## | ||
| ## array_filter tests | ||
| ############## | ||
|
|
c701da7 to
6003157
Compare
6003157 to
6208b0b
Compare
|
@LiaCastaneda @gabotechs addressed your comments~ thank you for reviewing! |
Which issue does this PR close?
Partially addresses #14509 — implements
array_filter/list_filter.Rationale for this change
array_transform(#21679) added the firstHigherOrderUDF.array_filteris the natural companion: filter array elements with a boolean lambda, matching Sparkfilter/ DuckDBlist_filtersemantics.What changes are included in this PR?
HigherOrderUDFArrayFilter(array_filter/list_filteralias)truekeeps,false/null drops (matches Spark semantics)List,LargeList, sliced arrays, null sublistsx -> true/x -> false)arrow::compute::filter)lambda_utils.rs: shared HOF helpers extracted fromarray_transform(value_lambda_pair,coerce_single_list_arg,single_list_lambda_parameters,extract_list_values)test_utils.rs: shared unit test helpers (create_i32_list,eval_hof_on_i32_list)Are these changes tested?
array_filter.slt: filter variants,array_filter+array_transformcombinations, error casesAre there any user-facing changes?
Yes —
array_filter(array, lambda)and aliaslist_filter(array, lambda)are now available as SQL functions.