Feat/hll dict size optimization clean#18144
deeppatel710 wants to merge 9 commits into apache:master from
Conversation
@Jackie-Jiang can you take a look at this optimization PR and leave feedback please? Thanks
xiangfu0
left a comment
Found one high-signal correctness issue; see inline comment.
if (dictionary != null) {
  int[] dictIds = blockValSet.getDictionaryIdsSV();
  getDictIdBitmap(aggregationResultHolder, dictionary).addN(dictIds, 0, length);
  if (dictionary.length() > _dictSizeThreshold) {
This dictionary-size branch is not safe because the same aggregation result holder is reused across all scanned segments. If one segment stores a DictIdsWrapper and a later segment crosses the threshold, the next getHyperLogLog() call will cast the existing holder state to the wrong type and fail with ClassCastException. DISTINCTCOUNTHLL needs a stable holder representation here, or an explicit conversion when switching between bitmap-backed and direct-HLL aggregation.
Thanks for the review, @xiangfu0! Great catch on the type-safety concern.
You're right that mixing the bitmap and HLL paths in the same result holder is unsafe. To be precise about when this can bite: within a single consuming (realtime) segment, the dictionary grows as new data is ingested. If one doc-batch sees a dictionary below the threshold (stores DictIdsWrapper) and a subsequent batch on the same scan sees the grown dictionary above the threshold, getHyperLogLog() would cast the existing DictIdsWrapper to HyperLogLog and throw a ClassCastException.
The fix centralizes the defense in both getHyperLogLog() helpers — they now check instanceof DictIdsWrapper and convert to HyperLogLog before returning. This covers all 6 aggregation paths since they all funnel through these two methods. Added a regression test that simulates the dictionary-growth scenario directly.
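For illustration, a minimal sketch of that guard, assuming helper and field names used in this thread (convertToHyperLogLog, DictIdsWrapper, _log2m); the actual PR code may differ:

```java
protected HyperLogLog getHyperLogLog(AggregationResultHolder aggregationResultHolder) {
  Object result = aggregationResultHolder.getResult();
  if (result instanceof DictIdsWrapper) {
    // A previous block stored dict ids below the threshold; fold them into an HLL
    // before continuing, so the holder never mixes representations.
    HyperLogLog hyperLogLog = convertToHyperLogLog((DictIdsWrapper) result);
    aggregationResultHolder.setValue(hyperLogLog);
    return hyperLogLog;
  }
  if (result == null) {
    result = new HyperLogLog(_log2m);
    aggregationResultHolder.setValue(result);
  }
  return (HyperLogLog) result;
}
```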
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #18144 +/- ##
============================================
- Coverage 63.48% 63.39% -0.10%
- Complexity 1627 1668 +41
============================================
Files 3244 3252 +8
Lines 197365 198673 +1308
Branches 30540 30774 +234
============================================
+ Hits 125306 125940 +634
- Misses 62019 62660 +641
- Partials 10040 10073 +33
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Jackie-Jiang
left a comment
If the bottleneck is from poor performance of RoaringBitmap insertion, should we switch to a BitSet solution? Even with 1M cardinality, BitSet memory overhead is negligible (128KB). I believe the performance should be much better than directly aggregating HLL
@Jackie-Jiang The root cause of the RoaringBitmap slowdown for high-cardinality dicts isn't just insertion count — it's the container-type transition that happens at ~4096 entries (ArrayContainer → BitmapContainer), which involves an O(n) copy of all existing entries. This makes random high-cardinality insertions particularly painful. BitSet sidesteps this entirely: O(1) set() with no container management, and memory is a flat dictSize / 8 bytes (128 KB for a 1M-entry dict as you noted — negligible).

Also realized this allows us to remove the dictSizeThreshold complexity from the previous iteration entirely. The threshold was meant to avoid bitmap overhead by falling back to direct-HLL for large dicts, but since HLL uses max-register semantics, offering the same hash twice has no effect. BitSet deduplication and direct-HLL insertion produce identical accuracy, so the optimization is always worthwhile, with no threshold knob needed. Updated in the latest commit.
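As a rough sketch of the aggregation-side change described above (assuming the getDictIdBitSet helper name used later in this thread; not the exact PR diff):

```java
// Mark each dict id once in a BitSet sized to the dictionary; total memory is a
// flat dictionary.length() / 8 bytes regardless of how many rows are scanned.
BitSet bitSet = getDictIdBitSet(aggregationResultHolder, dictionary);
int[] dictIds = blockValSet.getDictionaryIdsSV();
for (int i = 0; i < length; i++) {
  bitSet.set(dictIds[i]); // O(1), no container-type transitions
}
```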
The Pinot Binary Compatibility Check failure is pre-existing on master and affects all open PRs currently (e.g. #18189). It's related to ColumnStatistics/MapColumnStatistics changes in pinot-segment-spi, which this PR doesn't touch.
…high-cardinality columns

For dictionary-encoded columns with high cardinality (e.g., 14M+ distinct values), DISTINCTCOUNTHLL spent O(n log n) time inserting dictionary IDs into a RoaringBitmap before converting to HLL at finalization. This mirrors the performance issue originally reported for DISTINCTCOUNTSMARTHLL (fixed in apache#17411).

This commit introduces an optional third argument `dictSizeThreshold` (default: 100,000). When the dictionary size exceeds the threshold, dictionary values are offered directly to the HyperLogLog without going through a RoaringBitmap first. Since DISTINCTCOUNTHLL already produces an approximate result, bitmap deduplication is not needed for correctness in high-cardinality scenarios — HLL handles duplicate offers gracefully.

The optimization applies to all aggregation paths:
- Non-group-by SV and MV
- Group-by SV (both SV and MV group keys)
- Group-by MV (both SV and MV group keys)

Usage:
  DISTINCTCOUNTHLL(col)            -- default threshold (100K)
  DISTINCTCOUNTHLL(col, 12)        -- custom log2m, default threshold
  DISTINCTCOUNTHLL(col, 12, 50000) -- custom log2m and threshold
  DISTINCTCOUNTHLL(col, 12, 0)     -- disable optimization (threshold = MAX_VALUE)

Expected speedup for high-cardinality columns: 4x-10x, consistent with the benchmark results demonstrated for DISTINCTCOUNTSMARTHLL in apache#17411.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n DISTINCTCOUNTHLL

In consuming (realtime) segments, the dictionary grows during ingestion. If a prior block used the bitmap path (dict size below threshold) and a later block sees the grown dictionary above the threshold, getHyperLogLog() would cast the existing DictIdsWrapper result to HyperLogLog and throw ClassCastException.

Fix: check instanceof DictIdsWrapper in both getHyperLogLog() helpers and convert to HyperLogLog before returning, so the holder type always stays consistent regardless of when the threshold is crossed. Also adds a regression test that simulates this exact scenario.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng test Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Switch DictIdsWrapper from RoaringBitmap to java.util.BitSet for deduplicating dictionary IDs before offering to HyperLogLog.

BitSet gives O(1) set operations with no container-type transition overhead (RoaringBitmap transitions ArrayContainer→BitmapContainer at ~4096 entries, causing O(n) copies for random high-cardinality inserts). Memory overhead is dictSize/8 bytes (e.g. 128 KB for a 1M-entry dict), which is negligible compared to the segment data already in memory.

This also removes the _dictSizeThreshold complexity introduced in the previous iteration. The threshold was added to avoid the RoaringBitmap overhead for large dicts by falling back to direct-HLL, but since HLL is idempotent (max-register semantics), deduplication via BitSet and direct-HLL insertion produce identical accuracy. With BitSet, the optimization is always beneficial and needs no threshold knob.

Changes:
- DictIdsWrapper now holds a BitSet instead of a RoaringBitmap
- Removed _dictSizeThreshold, getDictSizeThreshold(), and the 3-arg constructor; function signature reverts to 1 or 2 arguments
- Removed all threshold-branch logic from 6 aggregation methods
- getDictIdBitmap() helper renamed to getDictIdBitSet()
- convertToHyperLogLog() uses BitSet.nextSetBit() iteration
- Updated tests: removed threshold/bypass tests, added BitSet deduplication correctness, large-dict, and multi-batch accumulation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
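A hedged sketch of the extraction-time conversion this commit describes (field names such as _bitSet and _dictionary are assumptions):

```java
private static HyperLogLog convertToHyperLogLog(DictIdsWrapper dictIdsWrapper, int log2m) {
  HyperLogLog hyperLogLog = new HyperLogLog(log2m);
  BitSet bitSet = dictIdsWrapper._bitSet;
  Dictionary dictionary = dictIdsWrapper._dictionary;
  // Walk only the set bits, so each distinct dictionary value is offered exactly once.
  for (int dictId = bitSet.nextSetBit(0); dictId >= 0; dictId = bitSet.nextSetBit(dictId + 1)) {
    hyperLogLog.offer(dictionary.get(dictId));
  }
  return hyperLogLog;
}
```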
…of per-group BitSet

BitSet deduplication in aggregateSVGroupBySV/aggregateMVGroupBySV/aggregateSVGroupByMV/aggregateMVGroupByMV was allocating one BitSet (dictSize/8 bytes) per group. With 100K groups × 3M-entry dict → 37.5 GB → OOM.

HyperLogLog uses max-register semantics, so duplicate offer() calls are no-ops — deduplication before HLL is unnecessary for accuracy. The group-by paths now offer values directly to HLL. BitSet deduplication is kept only for the non-group-by aggregation path (one BitSet per segment, bounded and cheap).

Also removes the now-unused getDictIdBitSet(GroupByResultHolder, int, Dictionary) helper and simplifies extractGroupByResult to remove the DictIdsWrapper branch that can no longer occur.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
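Roughly, the direct-offer group-by path this commit describes looks like the following (assumed names; the next commit below revisits this because of per-group HLL memory):

```java
int[] dictIds = blockValSet.getDictionaryIdsSV();
for (int i = 0; i < length; i++) {
  // Duplicate offers are harmless: HLL registers only ever keep the max rank.
  getHyperLogLog(groupByResultHolder, groupKeyArray[i]).offer(dictionary.get(dictIds[i]));
}
```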
271b095 to 86fd8d7
…erLogLog per group

HyperLogLog(log2m=14) pre-allocates ~16KB of registers upfront per group, regardless of how many distinct values that group actually sees. With many groups (e.g. numGroupsLimit=MAX_INT) and a high-cardinality group-by key, this causes OOM before the CPU kill threshold is reached.

Restore the group-by dict-encoded path to use a RoaringBitmap (via new GroupByDictIdsWrapper), which is sparse: memory is proportional to the number of distinct dict IDs seen per group (~2 bytes each), not to 2^log2m. At extraction time, convert to HyperLogLog by iterating the bitmap once.

The BitSet optimization (DictIdsWrapper) is preserved for the non-group-by aggregation path, where a single BitSet per segment is bounded and cheap (dictSize/8 bytes regardless of numGroups).

Summary of design:
- Non-group-by aggregation: ONE BitSet across full dict (fast, bounded)
- Group-by: ONE RoaringBitmap per group (sparse, scales with cardinality)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
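A minimal sketch of the group-by helper under this design, shown in its final shape of returning the RoaringBitmap directly per the review below (wrapper and holder method names are assumptions consistent with this thread):

```java
protected static RoaringBitmap getDictIdBitmap(GroupByResultHolder groupByResultHolder, int groupKey,
    Dictionary dictionary) {
  GroupByDictIdsWrapper wrapper = groupByResultHolder.getResult(groupKey);
  if (wrapper == null) {
    // Sparse: grows with the distinct dict ids this group actually sees (~2 bytes each),
    // unlike an HLL that pre-allocates 2^log2m registers per group.
    wrapper = new GroupByDictIdsWrapper(dictionary);
    groupByResultHolder.setValueForKey(groupKey, wrapper);
  }
  return wrapper._dictIdBitmap;
}
```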
@xiangfu0 all checks have passed - can you review this PR? This could be a massive improvement.
Binary Compatibility Check Failure — Pre-existing on Master

It was introduced by PR #18170 ("Refactor ColumnStatistics: clean up interface, add isAscii"), merged to master on Apr 12. Our branch has zero diff against master for → no output. This failure will affect every PR rebased on current master until the compat issue in master is resolved.
 * Uses a {@link RoaringBitmap} (sparse) so memory scales with distinct values per group, not dictionary size.
 */
protected static RoaringBitmap getDictIdBitmap(AggregationResultHolder aggregationResultHolder,
protected static GroupByDictIdsWrapper getDictIdBitmap(GroupByResultHolder groupByResultHolder, int groupKey,
Consider directly returning the RoaringBitmap to align with the method name and be more explicit
Done — both getDictIdBitmap and getDictIdBitSet now return RoaringBitmap and BitSet directly. The wrappers (GroupByDictIdsWrapper / DictIdsWrapper) are still stored in the result holder so the dictionary is available at extraction time, but the helper methods now return the inner data structure directly, matching the pattern in BaseDistinctAggregateAggregationFunction. The now-unused helper methods (add, addDictIds, set) were removed from both wrapper classes.
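For example, the non-group-by helper in this shape might look like the following sketch (assumed field names, not the exact code):

```java
protected static BitSet getDictIdBitSet(AggregationResultHolder aggregationResultHolder,
    Dictionary dictionary) {
  DictIdsWrapper dictIdsWrapper = aggregationResultHolder.getResult();
  if (dictIdsWrapper == null) {
    // The wrapper keeps the dictionary so extraction can map dict ids back to
    // values; callers only ever touch the BitSet returned here.
    dictIdsWrapper = new DictIdsWrapper(dictionary);
    aggregationResultHolder.setValue(dictIdsWrapper);
  }
  return dictIdsWrapper._bitSet;
}
```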
/**
 * Returns the {@link DictIdsWrapper} from the result holder, creating a new one if absent.
 */
protected static DictIdsWrapper getDictIdBitSet(AggregationResultHolder aggregationResultHolder,
Consider directly returning the BitSet to align with the method name and be more explicit
Done — same as above.
@Jackie-Jiang can you merge this too? Thanks
@deeppatel710 Did you push the change?
…tmap/getDictIdBitSet

- getDictIdBitmap now returns wrapper._dictIdBitmap (RoaringBitmap) directly
- getDictIdBitSet now returns dictIdsWrapper._bitSet (BitSet) directly
- Removed now-unused helper methods from DictIdsWrapper and GroupByDictIdsWrapper
- Updated all call sites to use the raw types directly
- Test: assert reference equality in batch-reuse test to directly verify wrapper reuse

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Jackie-Jiang sorry, missed that commit. Just pushed! Can you review it? Thanks
Summary
Fixes the performance bottleneck in `DISTINCTCOUNTHLL` for dictionary-encoded columns with high cardinality (reported in #17336). `DISTINCTCOUNTHLL` currently always accumulates dictionary IDs into a `RoaringBitmap` before converting to HLL at finalization. For high-cardinality columns (14M+ distinct values), bitmap insertions dominate execution time at O(n log n), causing queries to take 6-10 seconds.

This adds an optional third argument `dictSizeThreshold` (default: 100,000). When `dictionary.length() > dictSizeThreshold`, dictionary values are offered directly to the HyperLogLog, bypassing the bitmap entirely. Since `DISTINCTCOUNTHLL` already returns an approximate result, exact bitmap deduplication is unnecessary — HLL handles duplicate offers gracefully.

The same root cause was fixed for `DISTINCT_COUNT_SMART_HLL` in #17411, which demonstrated 4x-10x CPU reduction for high-cardinality workloads. This PR applies the equivalent optimization to the more commonly used `DISTINCTCOUNTHLL` function.

Usage