HIVE-29368: limiting the NDV fix to the CASE clause handler only#6308
HIVE-29368: limiting the NDV fix to the CASE clause handler only#6308konstantinb wants to merge 3 commits intoapache:masterfrom
Conversation
|
|
CC @okumin |
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFWhen.java
Outdated
Show resolved
Hide resolved
|
zabetak
left a comment
There was a problem hiding this comment.
From my perspective the improvement should land in PessimisticStatCombiner since COALESCE, and IF are also directly affected.
In addition, I feel that PessimisticStatCombiner is not working as expected when it comes to NDV. Taking the max(NDV(branch_i)) is not really a pessimistic estimate.
A better formula for the estimation of NDV in CASE/Branch statements is:
min(rows, Sum NDV(branch_i))
This is formula is used by some other DBMS systems and should also handle this use-case with constant branches. We could also opt for more complex variants of the formula above taking into account the selectivity of the WHEN condition but we can leave that for future work.
All in all, I don't think we need to add additional bookeeping in the function itself but just modify the formula in PessimisticStatCombiner.
@zabetak thank you very much for your comment. Changing PessimisticStatCombiner was my first idea too, however, the "truly pessimistic" estimate logic I used opened quite a large can of worms, broke many existing tests and has quickly ballooned to a rather big PR #6244 While COALESCE and IF are, technically, affected in the same way, their "mis-estimations" are usually limited to 2x. The CASE with multiple constants was the most severe, and the code change of this PR has dramatically improved the situation in a private Hive implementation. The key change is counting CONSTANT branches, and PessimisticStatCombiner simply does not have access to the info. |
|
As it turns out, the fix for using "in(rows, Sum NDV(branch_i))" could require a very small code change; Checking how many tests #6359 will impact... |



What changes were proposed in this pull request?
HIVE-29368: perform more accurate NDV estimations for CASE/WHEN clauses with multiple constant branches
If the resulting NDV off PessimisticStatCombiner is smaller than the number of const branches, use the number of const branches as the expression NDV
Why are the changes needed?
The original code uses PessimisticStatCombiner to combine stats of all branches of a CASE/WHEN clause. Since every const branch has a natural NDV of 1, the resulting NDV estimate of even very large clauses is also 1 (PessimisticStatCombiner calculates MAX NDV of all branches). If such a column is subsequently used in a GROUP BY, especially multiple time, this under-estimation might lead to pretty bad exscution plan decisions. E.g. a 20x under-estimation alone could be not so bad, but when this happens 3x times in one query, then the underestimation of 111 = 1 could quickly grow up to 202020 = 8000x times row count underestimation
Does this PR introduce any user-facing change?
No
How was this patch tested?
See the .q file in the PR; also extensively tested in a private custom installation