Skip to content

perf: Optimize array_sort for arrays of non-primitive types#21006

Open
neilconway wants to merge 4 commits intoapache:mainfrom
neilconway:neilc/optimize-array-sort
Open

perf: Optimize array_sort for arrays of non-primitive types#21006
neilconway wants to merge 4 commits intoapache:mainfrom
neilconway:neilc/optimize-array-sort

Conversation

@neilconway
Copy link
Contributor

@neilconway neilconway commented Mar 17, 2026

Which issue does this PR close?

Rationale for this change

array_sort is fast for arrays of primitive types, because it dispatches to Arrow's sort kernel, which supports in-place sorting for primitive types. However, for arrays of non-primitive types, Arrow's sort kernel does sort_by_indices and then take; we call sort for every row and then concat the results. It would be faster to call sort_by_indices for each row ourselves, accumulate all the sorted indices for the entire batch, and then do a single take to construct the result set.

For arrays of strings, this improves array_sort performance by ~40% for 5-element arrays, ~15% for 20-element arrays, and 5% for 100-element arrays. The improvement diminishes as array length increases because the cost of the sort itself begins to dominate.

What changes are included in this PR?

  • Add benchmarks for array_sort
  • Improve unit test coverage for array_sort
  • Improve docs for array_sort
  • Implement optimization as described above

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@neilconway
Copy link
Contributor Author

Benchmarks (M4 Max):

  ┌─────────────┬─────────┬─────────┬───────────┐
  │  Benchmark  │ Before  │  After  │  Change   │
  ├─────────────┼─────────┼─────────┼───────────┤
  │ string/5    │ 2.12 ms │ 1.26 ms │ -41%      │
  ├─────────────┼─────────┼─────────┼───────────┤
  │ string/20   │ 5.80 ms │ 4.98 ms │ -14%      │
  ├─────────────┼─────────┼─────────┼───────────┤
  │ string/100  │ 25.8 ms │ 24.5 ms │ -5%       │
  ├─────────────┼─────────┼─────────┼───────────┤
  │ string/1000 │ 400 ms  │ 396 ms  │ -1%       │
  ├─────────────┼─────────┼─────────┼───────────┤
  │ int32/*     │ —       │ —       │ no change │
  └─────────────┴─────────┴─────────┴───────────┘

@neilconway neilconway changed the title Neilc/optimize array sort perf: Optimize array_sort for arrays of non-primitive types Mar 17, 2026
let list_arr = if elements.is_empty() {
GenericListArray::<OffsetSize>::new_null(field, row_count)
/// Select elements from `values` at the given `indices` using `compute::take`.
fn take_by_indices<OffsetSize: OffsetSizeTrait>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use the same pattern in a few different files (e.g., array_reverse, array_distinct) -- probably worth moving to a shared helper. I'll look at that as a separate PR though.

@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Mar 18, 2026
if list_array.is_null(i) {
array_lengths.push(0);
} else {
let arr_ref = list_array.value(i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This existing code seems very inefficient to me as it converts each list to an individual array and sorts those and then also concatenates those small arrays.
I think one could make a sort kernel that sorts the lists directly in a target buffer, that would be much faster.

values: sliced,
options: sort_options,
}];
compute::lexsort_to_indices(&sort_columns, None)?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally you could run a kernel on an entire array as well instead of slicing.

.map(|a| a.as_ref())
.collect::<Vec<&dyn Array>>();
let sorted_values = if indices.is_empty() {
values.slice(0, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might keep on to the slice (not sure)? Better would be to create a new empty array.

indices: &[OffsetSize],
) -> Result<ArrayRef> {
let indices_array: ArrayRef = if OffsetSize::IS_LARGE {
Arc::new(UInt64Array::from(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could save this UInt64Array as indices is already Vec<OffsetSize> and could be converted to UInt64Array / UInt32Array without allocation

array_lengths.push(sorted_array.len());
arrays.push(sorted_array);
for &local_idx in sorted_indices.values() {
indices.push(start + OffsetSize::usize_as(local_idx as usize));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can use indices.extend here as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Optimize array_sort

2 participants