HIVE-29473: prevent combining stats between SELECT and LV fields#6331
HIVE-29473: prevent combining stats between SELECT and LV fields#6331konstantinb wants to merge 3 commits intoapache:masterfrom
Conversation
…d corrected.out files
| expressions: _col0 (type: string), _col1 (type: string) | ||
| outputColumnNames: _col0, _col1 | ||
| Statistics: Num rows: 500 Data size: 115500 Basic stats: COMPLETE Column stats: COMPLETE | ||
| Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
This is a typical example of LV column stats impacting the data size estimations of SELECT columns:
` Column Naming
| Context | Column Name | Represents | avgColLen |
|---|---|---|---|
| LVJ output schema | _col0 | SELECT's key | 2.812 |
| LVJ output schema | _col1 | SELECT's value | 6.812 |
| LVJ output schema | _col8 | UDTF's exploded element | — |
| UDTF internal stats | _col0 | array expression input | 56.0 |
The UDTF branch's column generator restarts at 0, so its internal stats use _col0 for the array expression — colliding with SELECT's _col0.
Processing Comparison
| Step | Original Code | Proposed Fix |
|---|---|---|
| Expression Map | Shared: {_col0, _col1, _col8} | Split: SELECT {_col0, _col1}, UDTF {_col8} |
| Schema | Full: [_col0, _col1, _col8] | Split by numSelColumns |
| UDTF lookup for _col0 | Looks up _col0 in udtfStats → finds array's _col0 (56.0) | _col0 not in udtfExprMap → skipped |
| UDTF lookup for _col8 | _col8 → Column[col], not found in udtfStats | _col8 → Column[col], not found in udtfStats |
| Merge _col0 | MAX(2.812, 56.0) = 56.0 | No collision → 2.812 |
Final Column Statistics
| Column | Original Code | Proposed Fix |
|---|---|---|
| _col0 avgColLen | 56.0 ✗ | 2.812 ✓ |
| _col1 avgColLen | 6.812 | 6.812 |
| Per-row total | 62.812 bytes | 9.624 bytes |
Data Size — LVJ Debug Output (500 rows)
| Original Code | Proposed Fix | |
|---|---|---|
| Calculation | 62.812 × 500 | 9.624 × 500 |
| Total | 31,406 bytes | 4,812 bytes |
Data Size — EXPLAIN Output (500 rows)
| Column | Original Code | Proposed Fix |
|---|---|---|
| key avgColLen | 140 ✗ | 87 ✓ |
| value avgColLen | 91 | 91 |
| Per-row total | 231 bytes | 178 bytes |
| Original Code | Proposed Fix | |
|---|---|---|
| Calculation | 231 × 500 | 178 × 500 |
| Total | 115,500 bytes | 89,000 bytes |
| minReductionHashAggr: 0.4 | ||
| mode: hash | ||
| outputColumnNames: _col0, _col1, _col2 | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| null sort order: zz | ||
| sort order: ++ | ||
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string) | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| minReductionHashAggr: 0.4 | ||
| mode: hash | ||
| outputColumnNames: _col0, _col1, _col2 | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| null sort order: zz | ||
| sort order: ++ | ||
| Map-reduce partition columns: _col0 (type: string), _col1 (type: string) | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| keys: KEY._col0 (type: string), KEY._col1 (type: string) | ||
| mode: mergepartial | ||
| outputColumnNames: _col0, _col1, _col2 | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE | ||
| File Output Operator | ||
| compressed: false | ||
| Statistics: Num rows: 6 Data size: 1074 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
original code output:
Statistics: Num rows: 6 Data size: 12588 Basic stats: COMPLETE Column stats: COMPLETE
| Group By Operator | ||
| aggregations: count() | ||
| keys: _col0 (type: string) | ||
| minReductionHashAggr: 0.99 |
There was a problem hiding this comment.
original code output:
minReductionHashAggr: 0.4
| minReductionHashAggr: 0.99 | ||
| mode: hash | ||
| outputColumnNames: _col0, _col1 | ||
| Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
| null sort order: z | ||
| sort order: + | ||
| Map-reduce partition columns: _col0 (type: string) | ||
| Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
| Group By Operator | ||
| aggregations: count() | ||
| keys: _col0 (type: string) | ||
| minReductionHashAggr: 0.99 |
There was a problem hiding this comment.
original code output:
minReductionHashAggr: 0.4
| minReductionHashAggr: 0.99 | ||
| mode: hash | ||
| outputColumnNames: _col0, _col1 | ||
| Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
| null sort order: z | ||
| sort order: + | ||
| Map-reduce partition columns: _col0 (type: string) | ||
| Statistics: Num rows: 3 Data size: 279 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
| keys: KEY._col0 (type: string) | ||
| mode: mergepartial | ||
| outputColumnNames: _col0, _col1 | ||
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE | ||
| File Output Operator | ||
| compressed: false | ||
| Statistics: Num rows: 1 Data size: 93 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
original code output:
Statistics: Num rows: 6 Data size: 12072 Basic stats: COMPLETE Column stats: COMPLETE
|
Original output for the new .q file: lvj_stats_isolation.q.out.txt |
|
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. 🤖 Generated with Claude Code |
| @@ -0,0 +1,34 @@ | |||
| create table lvj_stats (id string, f1 string); | |||
There was a problem hiding this comment.
Can we add set hive.stats.udtf.factor=2.0 to ensure the scaling is working as expected?
| outputColumnNames: _col0 | ||
| Statistics: Num rows: 6 Data size: 11520 Basic stats: COMPLETE Column stats: COMPLETE | ||
| UDTF Operator | ||
| Statistics: Num rows: 6 Data size: 11520 Basic stats: COMPLETE Column stats: COMPLETE |
There was a problem hiding this comment.
I could be wrong. I am feeling the bug is not in LateralViewJoinStatsRule but in UDTFStatsRule. The posexclude accepts a single array, _col0, emitted by the SelectOperator, and emits pos1 and val1. However, it retains the original column statistics for the single array column, named _col0 and something accidentally gets wrong.
I guess the current rules construct the statistics tree like this.
I'd say this needs to be like this one?
I think the current logic works as expected when we fully discard the output from the right-hand side, i.e., UDTF. However, if we pick up values from the right hand side, something might get wrong because it has no chance to resolve the col stats of pos1 or val1.
select id, f1, pos1, count(*)
from (select id, f1 from lvj_stats group by id, f1) sub
lateral view posexplode(array(f1, f1)) t1 as pos1, val1
group by id, f1, pos1;



What changes were proposed in this pull request?
HIVE-29473: preventing stats override of select columns with 2+ LVs
This PR fixes namespace collision in LateralViewJoinStatsRule.process() by enforcing strict parent operator boundaries when computing column statistics.
The problem in the existing code:
LateralViewJoinStatsRule passes identical columnExprMap and RowSchema references to StatsUtils.getColStatisticsFromExprMap() for both the SELECT and UDTF branches. Since both branches can have identically-named columns (_col0, _col1, etc.), the utility method incorrectly matches UDTF statistics against SELECT columns.
The fix:
Additional changes:
Why are the changes needed?
The bug causes the CBO to combine statistics of completely unrelated columns, leading to incorrect cardinality and data size estimates for downstream operators (Group By, Join, etc.).
When the collision occurs:
The UDTF branch always generates output columns starting from _col0, _col1, etc. The SELECT branch uses original column names in simple cases, but internal names (_col0, _col1) are assigned by:
When both branches have identically-named columns (e.g., both have _col0), StatsUtils.getColStatisticsFromExprMap() matches them incorrectly, combining statistics of unrelated columns.
Impact examples:
These incorrect estimates cause the optimizer to choose suboptimal execution plans.
Does this PR introduce any user-facing change?
No
How was this patch tested?