Skip to content

HIVE-29368: limiting the NDV fix to the CASE clause handler only#6308

Draft
konstantinb wants to merge 3 commits intoapache:masterfrom
konstantinb:HIVE-29368-case-only
Draft

HIVE-29368: limiting the NDV fix to the CASE clause handler only#6308
konstantinb wants to merge 3 commits intoapache:masterfrom
konstantinb:HIVE-29368-case-only

Conversation

@konstantinb
Copy link
Contributor

@konstantinb konstantinb commented Feb 9, 2026

What changes were proposed in this pull request?

HIVE-29368: perform more accurate NDV estimations for CASE/WHEN clauses with multiple constant branches

If the resulting NDV off PessimisticStatCombiner is smaller than the number of const branches, use the number of const branches as the expression NDV

Why are the changes needed?

The original code uses PessimisticStatCombiner to combine stats of all branches of a CASE/WHEN clause. Since every const branch has a natural NDV of 1, the resulting NDV estimate of even very large clauses is also 1 (PessimisticStatCombiner calculates MAX NDV of all branches). If such a column is subsequently used in a GROUP BY, especially multiple time, this under-estimation might lead to pretty bad exscution plan decisions. E.g. a 20x under-estimation alone could be not so bad, but when this happens 3x times in one query, then the underestimation of 111 = 1 could quickly grow up to 202020 = 8000x times row count underestimation

Does this PR introduce any user-facing change?

No

How was this patch tested?

See the .q file in the PR; also extensively tested in a private custom installation

@sonarqubecloud
Copy link

sonarqubecloud bot commented Feb 9, 2026

@konstantinb
Copy link
Contributor Author

CC @okumin

Copy link
Contributor

@okumin okumin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 with a minor comment

@sonarqubecloud
Copy link

Copy link
Member

@zabetak zabetak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective the improvement should land in PessimisticStatCombiner since COALESCE, and IF are also directly affected.

In addition, I feel that PessimisticStatCombiner is not working as expected when it comes to NDV. Taking the max(NDV(branch_i)) is not really a pessimistic estimate.

A better formula for the estimation of NDV in CASE/Branch statements is:

min(rows, Sum NDV(branch_i))

This is formula is used by some other DBMS systems and should also handle this use-case with constant branches. We could also opt for more complex variants of the formula above taking into account the selectivity of the WHEN condition but we can leave that for future work.

All in all, I don't think we need to add additional bookeeping in the function itself but just modify the formula in PessimisticStatCombiner.

@konstantinb
Copy link
Contributor Author

konstantinb commented Mar 13, 2026

From my perspective the improvement should land in PessimisticStatCombiner since COALESCE, and IF are also directly affected.

In addition, I feel that PessimisticStatCombiner is not working as expected when it comes to NDV. Taking the max(NDV(branch_i)) is not really a pessimistic estimate.

A better formula for the estimation of NDV in CASE/Branch statements is:

min(rows, Sum NDV(branch_i))

This is formula is used by some other DBMS systems and should also handle this use-case with constant branches. We could also opt for more complex variants of the formula above taking into account the selectivity of the WHEN condition but we can leave that for future work.

All in all, I don't think we need to add additional bookeeping in the function itself but just modify the formula in PessimisticStatCombiner.

@zabetak thank you very much for your comment.

Changing PessimisticStatCombiner was my first idea too, however, the "truly pessimistic" estimate logic I used opened quite a large can of worms, broke many existing tests and has quickly ballooned to a rather big PR #6244
(I admit that it did not, however, use your formula "min(rows, Sum NDV(branch_i))")

While COALESCE and IF are, technically, affected in the same way, their "mis-estimations" are usually limited to 2x. The CASE with multiple constants was the most severe, and the code change of this PR has dramatically improved the situation in a private Hive implementation. The key change is counting CONSTANT branches, and PessimisticStatCombiner simply does not have access to the info.

@konstantinb
Copy link
Contributor Author

As it turns out, the fix for using "in(rows, Sum NDV(branch_i))" could require a very small code change; Checking how many tests #6359 will impact...

@konstantinb konstantinb marked this pull request as draft March 13, 2026 06:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants