feat(arrow/compute): sort support #749
Open
hamilton-earthscope wants to merge 6 commits into apache:main from
Summary
Implements stable `sort_indices` (and `sort` via `take`) for arrays, chunked arrays, record batches, and tables using logical row indices over `Chunked` data without concatenating chunks. The control flow and ordering rules are modeled on Apache Arrow C++ `vector_sort.cc` / `vector_sort_internal.h`, with a few Go- and performance-driven differences called out below.

Parity with Arrow C++ (`vector_sort.cc` / `vector_sort_internal.h`)

Same overall structure
Single sort key, one column
- […] `null_count == 0` and there are no null-likes).
- […] `vector_sort_internal.go`).

Multiple sort keys
- `len(keys) <= kMaxRadixSortKeys` (8): MSD radix path per record-batch range (`radixRecordBatchSortRange` ↔ `ConcreteRecordBatchColumnSorter::SortRange`).
- […] `multipleKeyRecordBatchSortRange`).

Same ordering semantics (intended match to C++)
- […] `slices.SortStableFunc` are used so tie-breaking matches the C++ “left before right” stable merge behavior where documented in code.

Same “column comparator” role
- `columnComparator` interface ↔ C++ `ColumnComparator`: `compareRowsForKey`, null / null-like metadata, `columnHasValidityNulls` (skip `PartitionNullsOnly` when there are no validity nulls).

Physical types
- `vector_sort_physical.go`, analogous to C++ `ConcreteColumnComparator<T>` (concrete `*array.T` + direct `Value` / `Cmp` / special cases for bool and intervals).

Intentional differences and rationale
- `logicalRowMap`: one `rowMapCell{chunk, local}` per logical row when `len(chunks) > 1`; `pair(i,j)` resolves two rows in one shot. Why: random compares during sort/merge need O(1) resolution; a flat table + co-located fields beats repeated resolver work and improves locality vs separate `chunk` / `local` slices.
- `physicalColumnBase` methods `pair` / `isNullAtGlobal` / `cell`. Why: value receivers would copy slice headers (and map state) on every compare.
- `std::stable_sort` ↔ `slices.SortStableFunc` (Go 1.21+). Why: library primitive; semantics aligned with stable weak ordering used elsewhere in the port.
- `columnComparator` interface for “which column” in multi-key and merge loops. Why: idiomatic Go; per-type work stays in concrete `compareRowsForKey` implementations.
- `less` over full row order after per-chunk partitioning/sort. Why: simpler merge while preserving order as long as per-chunk phases match C++; documented in `vector_sort.go` comments.

File Layout
- `arrow/compute/vector_sort.go`: `sort_indices` / `sort` registration and datum dispatch.
- `arrow/compute/vector_sort_test.go`: functional tests.
- `arrow/compute/internal/kernels/vector_sort.go`: orchestration, merge, `SortIndices` kernel.
- `arrow/compute/internal/kernels/vector_sort_internal.go`: null partitions, radix / multi-key batch sort.
- `arrow/compute/internal/kernels/vector_sort_support.go`: `logicalRowMap` and ordering helpers.
- `arrow/compute/internal/kernels/vector_sort_physical.go`: per-type column comparators.
- `arrow/compute/internal/kernels/vector_sort_bench_test.go`: benchmarks.

Testing
- `go test ./arrow/compute -run TestSort -count=1`
- `go test ./arrow/compute/internal/kernels -bench=BenchmarkSortIndices -benchmem`

References
- `cpp/src/arrow/compute/kernels/vector_sort.cc` and `vector_sort_internal.h` (and related comparators).

Related Issues