perf: Optimize `array_sort` for arrays of non-primitive types#21006

Open

neilconway wants to merge 4 commits intoapache:mainfrom

neilconway:neilc/optimize-array-sort

Contributor

neilconway commented Mar 17, 2026 •

edited

Loading

Which issue does this PR close?

Closes Optimize array_sort #21005.

Rationale for this change

array_sort is fast for arrays of primitive types, because it dispatches to Arrow's sort kernel, which supports in-place sorting for primitive types. However, for arrays of non-primitive types, Arrow's sort kernel does sort_by_indices and then take; we call sort for every row and then concat the results. It would be faster to call sort_by_indices for each row ourselves, accumulate all the sorted indices for the entire batch, and then do a single take to construct the result set.

For arrays of strings, this improves array_sort performance by ~40% for 5-element arrays, ~15% for 20-element arrays, and 5% for 100-element arrays. The improvement diminishes as array length increases because the cost of the sort itself begins to dominate.

What changes are included in this PR?

Add benchmarks for array_sort
Improve unit test coverage for array_sort
Improve docs for array_sort
Implement optimization as described above

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

neilconway added 2 commits

March 17, 2026 11:41


          Add array_sort benchmark

141b474


          Optimize array_sort for short string arrays

f51e313

Contributor Author

neilconway commented Mar 17, 2026

Benchmarks (M4 Max):

  ┌─────────────┬─────────┬─────────┬───────────┐
  │  Benchmark  │ Before  │  After  │  Change   │
  ├─────────────┼─────────┼─────────┼───────────┤
  │ string/5    │ 2.12 ms │ 1.26 ms │ -41%      │
  ├─────────────┼─────────┼─────────┼───────────┤
  │ string/20   │ 5.80 ms │ 4.98 ms │ -14%      │
  ├─────────────┼─────────┼─────────┼───────────┤
  │ string/100  │ 25.8 ms │ 24.5 ms │ -5%       │
  ├─────────────┼─────────┼─────────┼───────────┤
  │ string/1000 │ 400 ms  │ 396 ms  │ -1%       │
  ├─────────────┼─────────┼─────────┼───────────┤
  │ int32/*     │ —       │ —       │ no change │
  └─────────────┴─────────┴─────────┴───────────┘


          Fix clippy

aa49a4d

neilconway changed the title ~~Neilc/optimize array sort~~ perf: Optimize array_sort for arrays of non-primitive types

neilconway commented

View reviewed changes

datafusion/functions-nested/src/sort.rs

-                  let list_arr = if elements.is_empty() {
-                      GenericListArray::<OffsetSize>::new_null(field, row_count)
+              /// Select elements from `values` at the given `indices` using `compute::take`.
+              fn take_by_indices<OffsetSize: OffsetSizeTrait>(

Contributor Author

neilconway Mar 17, 2026

We use the same pattern in a few different files (e.g., array_reverse, array_distinct) -- probably worth moving to a shared helper. I'll look at that as a separate PR though.


          Merge remote-tracking branch 'origin/main' into neilc/optimize-array-…

6c255fb

…sort

github-actions bot added documentation sqllogictest functions labels

Dandandan reviewed

View reviewed changes

datafusion/functions-nested/src/sort.rs

                       if list_array.is_null(i) {
                           array_lengths.push(0);
                       } else {
                           let arr_ref = list_array.value(i);

Contributor

Dandandan Mar 18, 2026

This existing code seems very inefficient to me as it converts each list to an individual array and sorts those and then also concatenates those small arrays.
I think one could make a sort kernel that sorts the lists directly in a target buffer, that would be much faster.

Dandandan reviewed

View reviewed changes

datafusion/functions-nested/src/sort.rs

+                                  values: sliced,
+                                  options: sort_options,
+                              }];
+                              compute::lexsort_to_indices(&sort_columns, None)?

Contributor

Dandandan Mar 18, 2026

Ideally you could run a kernel on an entire array as well instead of slicing.

Dandandan reviewed

View reviewed changes

datafusion/functions-nested/src/sort.rs

-                      .map(|a| a.as_ref())
-                      .collect::<Vec<&dyn Array>>();
+                  let sorted_values = if indices.is_empty() {
+                      values.slice(0, 0)

Contributor

Dandandan Mar 18, 2026

This might keep on to the slice (not sure)? Better would be to create a new empty array.

Dandandan reviewed

View reviewed changes

datafusion/functions-nested/src/sort.rs

+                  indices: &[OffsetSize],
+              ) -> Result<ArrayRef> {
+                  let indices_array: ArrayRef = if OffsetSize::IS_LARGE {
+                      Arc::new(UInt64Array::from(

Contributor

Dandandan Mar 18, 2026

Could save this UInt64Array as indices is already Vec<OffsetSize> and could be converted to UInt64Array / UInt32Array without allocation

Dandandan reviewed

View reviewed changes

datafusion/functions-nested/src/sort.rs

-                          array_lengths.push(sorted_array.len());
-                          arrays.push(sorted_array);
+                          for &local_idx in sorted_indices.values() {
+                              indices.push(start + OffsetSize::usize_as(local_idx as usize));

Contributor

Dandandan Mar 18, 2026

can use indices.extend here as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation functions sqllogictest