refactor: unify SQL planning for ORDER BY, HAVING, DISTINCT, etc #19974
Which issue does this PR close?
This PR unifies SQL planning logic for ORDER BY with HAVING and other clauses by adopting the merged schema approach (#10326).
Problem: DataFusion previously had two different code paths for handling ORDER BY. One of them, add_missing_columns, traverses the plan tree to find Projection nodes and injects missing columns. This duality caused complex, hard-to-maintain code, non-intuitive handling of queries like SELECT x FROM foo ORDER BY y, and unnecessarily wrapped execution plans.
Solution: Refactor ORDER BY planning to use the merged schema approach, directly adding missing expressions to the SELECT list instead of traversing the plan tree.
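To make the idea concrete, here is a minimal sketch of the "add missing expressions to the SELECT list" step for a query like SELECT x FROM foo ORDER BY y. It is not DataFusion's code: expressions are modeled as plain strings, and the function name extend_select_for_order_by is hypothetical. The sketch assumes the planner sorts on the extended list and then projects back to the original columns.

```rust
/// Hypothetical, simplified model: expressions are plain strings rather than
/// DataFusion `Expr` values. This is an illustration, not the PR's code.
///
/// Appends every ORDER BY expression that is absent from the SELECT list and
/// returns the extended list together with the number of leading columns
/// that a final projection should keep after the sort.
fn extend_select_for_order_by(select: &[&str], order_by: &[&str]) -> (Vec<String>, usize) {
    let mut extended: Vec<String> = select.iter().map(|s| s.to_string()).collect();
    for expr in order_by {
        // Only add the expression if the list does not already contain it.
        if !extended.iter().any(|e| e == expr) {
            extended.push(expr.to_string());
        }
    }
    // The first `select.len()` columns are the ones the user asked for.
    (extended, select.len())
}
```

For SELECT x FROM foo ORDER BY y, the extended list is [x, y] and the count of columns to keep is 1, so y is available to the sort without walking the plan tree to find a Projection node.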
Key Changes:
• Simplified order_by_to_sort_expr: Removed the additional_schema parameter and the internal schema merging logic; it now uses input_schema directly instead of building a combined schema.
• Added add_missing_order_by_exprs(): A new function that handles missing ORDER BY expressions by adding them directly to the SELECT list before planning. It:
  • Properly handles aggregate/window functions, column references, and aliases
  • Returns an error for SELECT DISTINCT with missing ORDER BY expressions
  • Supports strict mode (when GROUP BY is present) to preserve proper error messages
• Unified planning flow: ORDER BY expressions are now resolved against the SELECT list first, then missing ones are added, eliminating the need for complex plan tree traversal (add_missing_columns).
This makes the code cleaner, more maintainable, and produces simpler execution plans.
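The resolution rules listed under Key Changes can be sketched roughly as follows. This is an assumption-laden illustration, not the signature or body of add_missing_order_by_exprs(): the SelectItem and PlanError types and the function add_missing_order_by are hypothetical, expressions are plain strings, and strict mode and aggregate/window handling are not modeled.

```rust
/// Hypothetical error type for this sketch (not DataFusion's error type).
#[derive(Debug, PartialEq)]
enum PlanError {
    /// An ORDER BY expression is absent from a SELECT DISTINCT list.
    DistinctOrderByNotInSelect(String),
}

/// Hypothetical SELECT list entry: an expression and an optional alias.
struct SelectItem {
    expr: String,
    alias: Option<String>,
}

/// Resolves each ORDER BY expression against the SELECT list first (by
/// expression text or by alias) and appends the ones that are missing.
/// Returns how many items were appended.
fn add_missing_order_by(
    select: &mut Vec<SelectItem>,
    order_by: &[&str],
    distinct: bool,
) -> Result<usize, PlanError> {
    let mut added = 0;
    for expr in order_by {
        let found = select
            .iter()
            .any(|item| item.expr == *expr || item.alias.as_deref() == Some(*expr));
        if found {
            continue;
        }
        if distinct {
            // Appending a column would change which rows count as distinct,
            // so the sketch rejects the query instead.
            return Err(PlanError::DistinctOrderByNotInSelect(expr.to_string()));
        }
        select.push(SelectItem { expr: expr.to_string(), alias: None });
        added += 1;
    }
    Ok(added)
}
```

In this sketch an ORDER BY on an alias resolves to the existing SELECT item, so nothing is appended for it, while the same missing expression that is appended for a plain SELECT is rejected for SELECT DISTINCT.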
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?