@ajroetker ajroetker commented Nov 18, 2025

This PR combines the aggregations framework and additional aggregation types into a unified implementation.

Builds off discussions in #2243 and ports #2242.

Aggregations Framework

Enable powerful analytics and data exploration capabilities that go beyond simple faceting. Users can now compute metrics (sum, avg, min, max, count, sumsquares, stats) across search results and group them by field values or ranges with nested sub-aggregations for multi-dimensional analysis.

Problems addressed:

  • Computing statistics across filtered result sets (e.g., "average price of products matching 'laptop'")
  • Multi-level grouping and metrics (e.g., "total sales per region per category")
  • Complex analytics queries without requiring separate aggregation passes

Notes:

  • Metric aggregations: sum, avg, min, max, count, sumsquares, stats
  • Bucket aggregations: terms (group by values), range (group by ranges)
  • Nested sub-aggregations for multi-dimensional analytics
  • Computed during query execution using the visitor pattern, so no separate pass over the results is needed
  • Fully backward compatible - Facets API unchanged
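
To make the single-pass idea concrete, here is a minimal sketch of how a stats-style metric could be accumulated as field values are visited. It does not use the PR's types: `statsAgg`, `newStatsAgg`, `visit`, and `avg` are hypothetical names chosen for illustration, and the real implementation may differ.

```go
package main

import (
	"fmt"
	"math"
)

// statsAgg accumulates count, sum, min, max, and sum of squares in one pass.
// Hypothetical type for illustration only; not taken from the PR.
type statsAgg struct {
	count      int64
	sum        float64
	sumSquares float64
	min, max   float64
}

// newStatsAgg starts min/max at +Inf/-Inf so the first value always wins.
func newStatsAgg() *statsAgg {
	return &statsAgg{min: math.Inf(1), max: math.Inf(-1)}
}

// visit is called once for each numeric field value of a matching document.
func (s *statsAgg) visit(v float64) {
	s.count++
	s.sum += v
	s.sumSquares += v * v
	s.min = math.Min(s.min, v)
	s.max = math.Max(s.max, v)
}

// avg returns the mean, or 0 when no values were visited.
func (s *statsAgg) avg() float64 {
	if s.count == 0 {
		return 0
	}
	return s.sum / float64(s.count)
}

func main() {
	agg := newStatsAgg()
	for _, price := range []float64{10, 20, 30} {
		agg.visit(price)
	}
	fmt.Println(agg.count, agg.sum, agg.avg(), agg.min, agg.max, agg.sumSquares)
}
```

Keeping count, sum, and sum of squares as separate running totals is what would let partial results from several index partitions be merged by simple addition.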

Prefix and Regex Filtering for Terms Aggregations

(Port of #2242)

Enable search-as-you-type style aggregations where bucket terms dynamically match user input. Users can now aggregate by field values that match what is being typed in a search box, making autosuggestions cleaner and more focused (e.g., as the user types "ste", show matching authors, titles, and categories, all filtered to terms starting with "ste").

Problems addressed:

  • Dynamic faceted autosuggestions that update as users type
  • Filtering high-cardinality fields to relevant matches only
  • Consistent filtering API between facets and aggregations

Notes:

  • Add TermPrefix and TermPattern fields to AggregationRequest
  • Pre-compile regex patterns in NewTermsAggregation (now returns error)
  • Add NewTermsAggregationWithFilter helper
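
The filtering described above can be sketched independently of bleve. The example below is an assumption-laden illustration, not the PR's code: `termFilter`, `newTermFilter`, and `match` are hypothetical names. It compiles the regex once in the constructor (which is why the constructor returns an error) and runs the cheap `bytes.HasPrefix` check before evaluating the regex.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// termFilter holds an optional prefix and an optional pre-compiled pattern.
// Hypothetical helper for illustration only; not the PR's actual type.
type termFilter struct {
	prefix  []byte
	pattern *regexp.Regexp // nil when no pattern was supplied
}

// newTermFilter compiles the pattern up front so invalid regexes fail early.
func newTermFilter(prefix, pattern string) (*termFilter, error) {
	f := &termFilter{prefix: []byte(prefix)}
	if pattern != "" {
		re, err := regexp.Compile(pattern)
		if err != nil {
			return nil, fmt.Errorf("invalid term pattern %q: %w", pattern, err)
		}
		f.pattern = re
	}
	return f, nil
}

// match checks the prefix first and only then evaluates the regex.
// It works on []byte so rejected terms are never converted to string.
func (f *termFilter) match(term []byte) bool {
	if len(f.prefix) > 0 && !bytes.HasPrefix(term, f.prefix) {
		return false
	}
	if f.pattern != nil && !f.pattern.Match(term) {
		return false
	}
	return true
}

func main() {
	f, err := newTermFilter("ste", "")
	if err != nil {
		panic(err)
	}
	counts := map[string]int{}
	terms := [][]byte{
		[]byte("stephen king"), []byte("steinbeck"),
		[]byte("austen"), []byte("stephen king"),
	}
	for _, t := range terms {
		if f.match(t) {
			counts[string(t)]++ // string conversion happens only for matches
		}
	}
	fmt.Println(counts)
}
```

Note that "austen" contains "ste" but does not start with it, so a prefix filter rejects it.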

Additional Aggregation Types

Cardinality aggregation:

  • Unique value counting using HyperLogLog++
  • Configurable precision (10-18) with ~1% standard error at default (14)

Bucket aggregations:

  • histogram: Fixed-interval numeric buckets with minDocCount filtering
  • date_histogram: minute/hour/day/week/month/quarter/year time intervals
  • geohash_grid: Geo point clustering by geohash cells (precision 1-12)
  • geo_distance: Distance range buckets from a center point
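
For the fixed-interval histogram, one common way to assign a value to a bucket is to round it down to the nearest multiple of the interval. The sketch below shows that approach together with minDocCount filtering. It assumes one value per document and a positive interval, and `bucketKey` and `histogram` are hypothetical names rather than the PR's API.

```go
package main

import (
	"fmt"
	"math"
)

// bucketKey rounds value down to the start of its fixed-width bucket.
// Assumes interval > 0. Hypothetical helper for illustration only.
func bucketKey(value, interval float64) float64 {
	return math.Floor(value/interval) * interval
}

// histogram counts values per bucket, then drops buckets whose count
// is below minDocCount. Each value is treated as one document.
func histogram(values []float64, interval float64, minDocCount int) map[float64]int {
	counts := make(map[float64]int)
	for _, v := range values {
		counts[bucketKey(v, interval)]++
	}
	for key, n := range counts {
		if n < minDocCount {
			delete(counts, key) // deleting during range is safe in Go
		}
	}
	return counts
}

func main() {
	prices := []float64{1, 2, 15, 27, 29}
	fmt.Println(histogram(prices, 10, 2))
}
```

Using `math.Floor` rather than integer truncation keeps negative values in the correct bucket: with an interval of 10, the value -3 lands in the bucket starting at -10, not 0.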

Significant terms aggregation:

  • Identifies terms uncommonly common in results vs entire index
  • Four algorithms: JLH, Mutual Information, Chi-Squared, Percentage
  • Two-phase architecture using pre-search infrastructure for background stats
  • Configurable size, minDocCount, and scoring algorithm
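
To illustrate what "uncommonly common" means numerically, here is a sketch of two of the named heuristics, written from their commonly documented definitions: JLH multiplies the absolute and relative change in a term's frequency between the result set (foreground) and the whole index (background), and Percentage divides the foreground count by the background count. The function names and signatures are hypothetical, and the PR's scoring code may differ in details such as edge-case handling.

```go
package main

import "fmt"

// jlhScore follows the commonly documented JLH definition:
// (fgPct - bgPct) * (fgPct / bgPct). Terms that are not more frequent
// in the foreground than in the background score 0.
// Hypothetical signature for illustration only.
func jlhScore(fgCount, fgTotal, bgCount, bgTotal float64) float64 {
	if fgTotal == 0 || bgTotal == 0 || bgCount == 0 {
		return 0
	}
	fgPct := fgCount / fgTotal
	bgPct := bgCount / bgTotal
	if fgPct <= bgPct {
		return 0
	}
	return (fgPct - bgPct) * (fgPct / bgPct)
}

// percentageScore is the share of a term's occurrences in the index
// that fall inside the result set.
func percentageScore(fgCount, bgCount float64) float64 {
	if bgCount == 0 {
		return 0
	}
	return fgCount / bgCount
}

func main() {
	// A term found in 50 of 100 results but only 100 of 10000 indexed docs.
	fmt.Println(jlhScore(50, 100, 100, 10000))
	fmt.Println(percentageScore(50, 100))
}
```

Both scores need the background counts for the whole index, which is consistent with the two-phase design noted above: background statistics have to be gathered before foreground terms can be scored.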

All aggregations support sub-aggregations and distributed queries.

Dependencies: Added github.com/axiomhq/hyperloglog for HLL++

ajroetker and others added 8 commits November 12, 2025 16:52
Enable powerful analytics and data exploration capabilities that go beyond
simple faceting. Users can now compute metrics (sum, avg, min, max, count,
sumsquares, stats) across search results and group them by field values or
ranges with nested sub-aggregations for multi-dimensional analysis.

This addresses the need for:
- Computing statistics across filtered result sets (e.g., "average price of
  products matching 'laptop'")
- Multi-level grouping and metrics (e.g., "total sales per region per category")
- Complex analytics queries without requiring separate aggregation passes

Key features:
- Metric aggregations: sum, avg, min, max, count, sumsquares, stats
- Bucket aggregations: terms (group by values), range (group by ranges)
- Nested sub-aggregations for multi-dimensional analytics
- Computed efficiently during query execution using the visitor pattern
- Fully backward compatible - Facets API unchanged

Example - average price per brand:
  byBrand := bleve.NewTermsAggregation("brand", 10)
  byBrand.AddSubAggregation("avg_price", bleve.NewAggregationRequest("avg", "price"))
  searchRequest.Aggregations = bleve.AggregationsRequest{"by_brand": byBrand}
Enable search-as-you-type style aggregations where bucket terms dynamically
match user input. Users can now aggregate by field values that match what is
being typed in a search box, making autosuggestions cleaner and more focused
(e.g., as the user types "ste", show matching authors, titles, and categories,
all filtered to terms starting with "ste").

This addresses the need for:
- Dynamic faceted autosuggestions that update as users type
- Filtering high-cardinality fields to relevant matches only
- Consistent filtering API between facets and aggregations (ports existing
  facet filtering feature)

Performance benefits:
- Zero-allocation filtering - only matching terms are converted from []byte to string
- Filters apply before bucket creation and sub-aggregation processing
- Fast prefix checks with bytes.HasPrefix before regex evaluation

Key changes:
- Add TermPrefix and TermPattern fields to AggregationRequest
- Pre-compile regex patterns in NewTermsAggregation (now returns error)
- Add NewTermsAggregationWithFilter helper

Example - autocomplete aggregation:
  agg, _ := bleve.NewTermsAggregationWithFilter("brand", 10, userInput, "")
Fixes bug in nested bucket aggregations where metric values were
duplicated due to duplicate field registration in SubAggregationFields().
Also fixes StartDoc/EndDoc lifecycle for bucket sub-aggregations and
min/max comparison logic in optimized aggregations.

Adds Clone() method to AggregationBuilder interface for proper deep
copying of nested aggregation hierarchies. Adopts setter pattern for
aggregation filters (SetPrefixFilter, SetRegexFilter).
- Fix double-counting in bucket aggregations with sawValue guard
- Remove unused count fields from Sum and SumSquares aggregations
- Move StatsResult to search package for cleaner stats merging
- Add field deduplication and validation for term filters
Also adds proper support for merging average aggregations
…distance, and significant_terms aggregations

Cardinality aggregation:
- Unique value counting using HyperLogLog++
- Configurable precision (10-18) with ~1% standard error at default (14)

Bucket aggregations:
- histogram: Fixed-interval numeric buckets with minDocCount filtering
- date_histogram: minute/hour/day/week/month/quarter/year time intervals
- geohash_grid: Geo point clustering by geohash cells (precision 1-12)
- geo_distance: Distance range buckets from a center point

Significant terms aggregation:
- Identifies terms uncommonly common in results vs entire index
- Four algorithms: JLH, Mutual Information, Chi-Squared, Percentage
- Two-phase architecture using pre-search infrastructure for background stats
- Configurable size, minDocCount, and scoring algorithm

All aggregations support sub-aggregations and distributed queries.

Dependencies: Added github.com/axiomhq/hyperloglog for HLL++
Add AddNumericRange, AddDateTimeRange, AddDateTimeRangeString, and
AddDistanceRange methods to AggregationRequest, matching the pattern
used by FacetRequest. This allows external code to add range buckets
without needing access to the unexported range types.
@ajroetker ajroetker changed the title (feat) Add cardinality, histogram, date_histogram, geohash_grid, geo_distance, and significant_terms aggregations (feat) Add aggregations framework with cardinality, histogram, date_histogram, geohash_grid, geo_distance, and significant_terms Jan 1, 2026
@ajroetker ajroetker changed the title (feat) Add aggregations framework with cardinality, histogram, date_histogram, geohash_grid, geo_distance, and significant_terms (feat) Add aggregations framework with analytics and bucket aggregations Jan 1, 2026