HIVE-29473: prevent combining stats between SELECT and LV fields by konstantinb · Pull Request #6331 · apache/hive

konstantinb · 2026-02-21T00:13:07Z

What changes were proposed in this pull request?

HIVE-29473: preventing stats override of select columns with 2+ LVs

This PR fixes a namespace collision in LateralViewJoinStatsRule.process() by enforcing strict parent operator boundaries when computing column statistics.

The problem in the existing code:

LateralViewJoinStatsRule passes identical columnExprMap and RowSchema references to StatsUtils.getColStatisticsFromExprMap() for both the SELECT and UDTF branches. Since both branches can have identically-named columns (_col0, _col1, etc.), the utility method incorrectly matches UDTF statistics against SELECT columns.

The fix:

- Split RowSchema into selectSchema and udtfSchema using SELECT_TAG/UDTF_TAG boundaries from the LateralViewJoinOperator's column internal names
- Build separate selectExprMap and udtfExprMap by filtering the parent's columnExprMap to only include columns present in the respective schema
- Pass isolated collections to getColStatisticsFromExprMap() for each branch, ensuring each branch only sees its own columns
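
The three steps above can be sketched in isolation. This is a minimal model under stated assumptions, not the patch itself: plain strings stand in for Hive's column and expression types, the class name and the helper `filterBySchema` are hypothetical, and the expression strings for `_col0` and `_col1` are placeholders. Only the column names, `numSelColumns`, and `Column[col]` come from the discussion in this PR.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model: column names and expressions are plain Strings.
// The real rule works on Hive's RowSchema and columnExprMap; the names
// used here are hypothetical and exist only for this illustration.
class ExprMapSplitSketch {

    // Keep only the entries of exprMap whose key appears in the given schema slice.
    static Map<String, String> filterBySchema(Map<String, String> exprMap, List<String> schema) {
        Map<String, String> filtered = new LinkedHashMap<>();
        for (String col : schema) {
            if (exprMap.containsKey(col)) {
                filtered.put(col, exprMap.get(col));
            }
        }
        return filtered;
    }

    public static void main(String[] args) {
        // LVJ output schema: SELECT columns first, then UDTF columns.
        List<String> fullSchema = List.of("_col0", "_col1", "_col8");
        int numSelColumns = 2; // number of columns contributed by the SELECT branch

        Map<String, String> exprMap = new LinkedHashMap<>();
        exprMap.put("_col0", "select-side expression (placeholder)");
        exprMap.put("_col1", "select-side expression (placeholder)");
        exprMap.put("_col8", "Column[col]");

        // Step 1: split the schema at the SELECT/UDTF boundary.
        List<String> selectSchema = fullSchema.subList(0, numSelColumns);
        List<String> udtfSchema = fullSchema.subList(numSelColumns, fullSchema.size());

        // Step 2: build one expression map per branch.
        Map<String, String> selectExprMap = filterBySchema(exprMap, selectSchema);
        Map<String, String> udtfExprMap = filterBySchema(exprMap, udtfSchema);

        // Step 3: each branch would now be resolved against its own map only.
        System.out.println(selectExprMap.keySet()); // [_col0, _col1]
        System.out.println(udtfExprMap.keySet());   // [_col8]
    }
}
```

With the maps separated this way, a lookup for `_col0` against the UDTF branch has nothing to match, which is the isolation the fix relies on.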

Additional changes:

- Added unit tests in TestLateralViewJoinStatsRule.java to verify namespace isolation
- Added the lvj_stats_isolation.q test file, which demonstrates the bug with a single lateral view
- Updated .q.out files to reflect the corrected statistics estimates

Why are the changes needed?

The bug causes the CBO to combine statistics of completely unrelated columns, leading to incorrect cardinality and data size estimates for downstream operators (Group By, Join, etc.).

When the collision occurs:

The UDTF branch always generates output columns starting from _col0, _col1, etc. The SELECT branch uses original column names in simple cases, but internal names (_col0, _col1) are assigned by:

- ReduceSinkOperator (normalizes output columns)
- GroupByOperator (outputs aggregated columns with internal names)
- genInputSelectForUnion() in SemanticAnalyzer (forces column renaming for UNION queries)

When both branches have identically-named columns (e.g., both have _col0), StatsUtils.getColStatisticsFromExprMap() matches them incorrectly, combining statistics of unrelated columns.
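
The mismatch described above can be reproduced with a toy model. The sketch below is an illustration under assumptions, not Hive code: the class and method names are hypothetical, statistics are reduced to a single avgColLen per column name, and the merge keeps the larger value, following the MAX merge described in the union26.q.out review comment in this PR. The avgColLen values are taken from that same comment.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the name collision; all names here are hypothetical.
class StatsCollisionSketch {

    // Look up avgColLen for each requested column in one parent's stats, by name only.
    static Map<String, Double> lookup(List<String> columns, Map<String, Double> parentStats) {
        Map<String, Double> found = new HashMap<>();
        for (String col : columns) {
            Double len = parentStats.get(col);
            if (len != null) {
                found.put(col, len);
            }
        }
        return found;
    }

    // Merge two lookups, keeping the larger avgColLen when a name appears in both.
    static Map<String, Double> merge(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> out = new HashMap<>(a);
        b.forEach((col, len) -> out.merge(col, len, Double::max));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> selectStats = Map.of("_col0", 2.812, "_col1", 6.812);
        Map<String, Double> udtfStats = Map.of("_col0", 56.0); // array expression input

        // Shared column list: the UDTF lookup wrongly matches SELECT's _col0.
        List<String> all = List.of("_col0", "_col1", "_col8");
        Map<String, Double> shared = merge(lookup(all, selectStats), lookup(all, udtfStats));

        // Isolated lists: each branch is asked only about its own columns.
        Map<String, Double> isolated = merge(
            lookup(List.of("_col0", "_col1"), selectStats),
            lookup(List.of("_col8"), udtfStats));

        System.out.println(shared.get("_col0"));   // 56.0
        System.out.println(isolated.get("_col0")); // 2.812
    }
}
```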

Impact examples:

- A Group By that should estimate 2 rows instead estimates 6, because _col0 resolves to the UDTF's expression (NDV=6) rather than the base table's column
- Data size estimates can be inflated by orders of magnitude when the UDTF's avgColLen overwrites the SELECT's smaller values

These incorrect estimates cause the optimizer to choose suboptimal execution plans.

Does this PR introduce any user-facing change?

No

How was this patch tested?

- Added a new .q file confirming the new logic
- Recalculated by hand the changed estimates in pre-existing .q test outputs to confirm that the new size estimates are more accurate
- Ran extensive volume testing in a private Hive/Hadoop environment

new test ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out
new unit test ql/src/test/org/apache/hadoop/hive/ql/optimizer/stats/annotation/TestLateralViewJoinStatsRule.java

konstantinb · 2026-02-23T18:31:46Z

ql/src/test/results/clientpositive/llap/union26.q.out

                            expressions: _col0 (type: string), _col1 (type: string)
                            outputColumnNames: _col0, _col1
-                            Statistics: Num rows: 500 Data size: 115500 Basic stats: COMPLETE Column stats: COMPLETE
+                            Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE


This is a typical example of LV column stats impacting the data size estimations of SELECT columns:

Column Naming

| Context | Column Name | Represents | avgColLen |
|---|---|---|---|
| LVJ output schema | _col0 | SELECT's key | 2.812 |
| LVJ output schema | _col1 | SELECT's value | 6.812 |
| LVJ output schema | _col8 | UDTF's exploded element | n/a |
| UDTF internal stats | _col0 | array expression input | 56.0 |

The UDTF branch's column generator restarts at 0, so its internal stats use _col0 for the array expression, colliding with SELECT's _col0.

Processing Comparison

| Step | Original Code | Proposed Fix |
|---|---|---|
| Expression Map | Shared: {_col0, _col1, _col8} | Split: SELECT {_col0, _col1}, UDTF {_col8} |
| Schema | Full: [_col0, _col1, _col8] | Split by numSelColumns |
| UDTF lookup for _col0 | Looks up _col0 in udtfStats → finds array's _col0 (56.0) | _col0 not in udtfExprMap → skipped |
| UDTF lookup for _col8 | _col8 → Column[col], not found in udtfStats | _col8 → Column[col], not found in udtfStats |
| Merge _col0 | MAX(2.812, 56.0) = 56.0 | No collision → 2.812 |

Final Column Statistics

| Column | Original Code | Proposed Fix |
|---|---|---|
| _col0 avgColLen | 56.0 ✗ | 2.812 ✓ |
| _col1 avgColLen | 6.812 | 6.812 |
| Per-row total | 62.812 bytes | 9.624 bytes |

Data Size: LVJ Debug Output (500 rows)

| | Original Code | Proposed Fix |
|---|---|---|
| Calculation | 62.812 × 500 | 9.624 × 500 |
| Total | 31,406 bytes | 4,812 bytes |

Data Size: EXPLAIN Output (500 rows)

| Column | Original Code | Proposed Fix |
|---|---|---|
| key avgColLen | 140 ✗ | 87 ✓ |
| value avgColLen | 91 | 91 |
| Per-row total | 231 bytes | 178 bytes |

| | Original Code | Proposed Fix |
|---|---|---|
| Calculation | 231 × 500 | 178 × 500 |
| Total | 115,500 bytes | 89,000 bytes |
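
The arithmetic in the tables above can be checked with a short sketch. It assumes the simplified model those tables use, namely that the data size is the sum of the per-column sizes multiplied by the row count; the class and method names are hypothetical, and the sketch does not attempt to derive the EXPLAIN per-column figures (140, 87, 91) from the raw avgColLen values.

```java
// Reproduces the data-size arithmetic from the union26.q.out analysis.
// Assumption: data size = (sum of per-column sizes) x (number of rows).
class DataSizeSketch {

    static long dataSize(long numRows, double... perColumnSizes) {
        double perRow = 0.0;
        for (double size : perColumnSizes) {
            perRow += size;
        }
        // Round to whole bytes to absorb floating-point noise.
        return Math.round(perRow * numRows);
    }

    public static void main(String[] args) {
        // LVJ debug output, 500 rows
        System.out.println(dataSize(500, 56.0, 6.812));  // 31406 (original code)
        System.out.println(dataSize(500, 2.812, 6.812)); // 4812  (proposed fix)

        // EXPLAIN output, 500 rows
        System.out.println(dataSize(500, 140, 91));      // 115500 (original code)
        System.out.println(dataSize(500, 87, 91));       // 89000  (proposed fix)
    }
}
```

The last two results match the `Data size` values in the union26.q.out diff: 115500 before the change and 89000 after it.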

konstantinb · 2026-02-23T21:06:41Z

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                          minReductionHashAggr: 0.4
+                          mode: hash
+                          outputColumnNames: _col0, _col1, _col2
+                          Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE


original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb · 2026-02-23T21:07:11Z

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                            null sort order: zz
+                            sort order: ++
+                            Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
+                            Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE


original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb · 2026-02-23T21:08:05Z

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                            minReductionHashAggr: 0.4
+                            mode: hash
+                            outputColumnNames: _col0, _col1, _col2
+                            Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE


original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb · 2026-02-23T21:08:31Z

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                              null sort order: zz
+                              sort order: ++
+                              Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
+                              Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE


original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb · 2026-02-23T21:08:55Z

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                keys: KEY._col0 (type: string), KEY._col1 (type: string)
+                mode: mergepartial
+                outputColumnNames: _col0, _col1, _col2
+                Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE


original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb · 2026-02-23T21:09:11Z

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE
+                File Output Operator
+                  compressed: false
+                  Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE


original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb · 2026-02-23T21:10:07Z

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                        Group By Operator
+                          aggregations: count()
+                          keys: _col0 (type: string)
+                          minReductionHashAggr: 0.99


original code output:
minReductionHashAggr: 0.4

konstantinb · 2026-02-23T21:10:29Z

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                          minReductionHashAggr: 0.99
+                          mode: hash
+                          outputColumnNames: _col0, _col1
+                          Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE


original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb · 2026-02-23T21:10:48Z

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                            null sort order: z
+                            sort order: +
+                            Map-reduce partition columns: _col0 (type: string)
+                            Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE


original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb · 2026-02-23T21:11:26Z

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                          Group By Operator
+                            aggregations: count()
+                            keys: _col0 (type: string)
+                            minReductionHashAggr: 0.99


original code output:
minReductionHashAggr: 0.4

konstantinb · 2026-02-23T21:11:44Z

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                            minReductionHashAggr: 0.99
+                            mode: hash
+                            outputColumnNames: _col0, _col1
+                            Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE


original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb · 2026-02-23T21:12:02Z

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                              null sort order: z
+                              sort order: +
+                              Map-reduce partition columns: _col0 (type: string)
+                              Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE


original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb · 2026-02-23T21:12:18Z

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                keys: KEY._col0 (type: string)
+                mode: mergepartial
+                outputColumnNames: _col0, _col1
+                Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE


original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb · 2026-02-23T21:12:38Z

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE
+                File Output Operator
+                  compressed: false
+                  Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE


original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE

konstantinb · 2026-02-23T21:13:55Z

Original output for the new .q file: lvj_stats_isolation.q.out.txt

sonarqubecloud · 2026-02-23T22:25:59Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

okumin · 2026-03-12T13:04:05Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

okumin · 2026-03-13T07:20:50Z

ql/src/test/queries/clientpositive/lvj_stats_isolation.q

@@ -0,0 +1,34 @@
+create table lvj_stats (id string, f1 string);


Can we add set hive.stats.udtf.factor=2.0 to ensure the scaling is working as expected?

okumin · 2026-03-13T08:38:21Z

ql/src/test/results/clientpositive/llap/lvj_stats_isolation.q.out

+                    outputColumnNames: _col0
+                    Statistics: Num rows: 6 Data size: 11520 Basic stats: COMPLETE Column stats: COMPLETE
+                    UDTF Operator
+                      Statistics: Num rows: 6 Data size: 11520 Basic stats: COMPLETE Column stats: COMPLETE


I could be wrong, but I feel the bug is not in LateralViewJoinStatsRule but in UDTFStatsRule. posexplode accepts a single array, _col0, emitted by the SelectOperator, and emits pos1 and val1. However, it retains the original column statistics for the single array column, named _col0, and something accidentally goes wrong.

I guess the current rules construct the statistics tree like this.

I'd say this needs to be like this one?

I think the current logic works as expected when we fully discard the output from the right-hand side, i.e., the UDTF. However, if we pick up values from the right-hand side, something might go wrong because it has no chance to resolve the column stats of pos1 or val1.

select id, f1, pos1, count(*) from (select id, f1 from lvj_stats group by id, f1) sub lateral view posexplode(array(f1, f1)) t1 as pos1, val1 group by id, f1, pos1;

HIVE-29473: preventing stats override of select columns with 2+ LVs

533bce5

asf-ci-hive added tests pending tests unstable and removed tests pending labels Feb 21, 2026

HIVE-29473: better use of existing methods/libraries, unit testing an…

9f854a3

…d corrected.out files

asf-ci-hive added tests pending tests passed and removed tests unstable tests pending labels Feb 22, 2026

konstantinb changed the title ~~HIVE-29473: preventing stats override of select columns with 2+ LVs~~ HIVE-29473: prevent combining stats between SELECT and LV fields Feb 23, 2026

HIVE-29473: further code optimizations + bug-specific test file

7f48c9d

asf-ci-hive added tests pending and removed tests passed labels Feb 23, 2026

asf-ci-hive added tests passed and removed tests pending labels Feb 23, 2026

konstantinb marked this pull request as ready for review February 26, 2026 01:17
