opt: improve histogram intersection logic for non-numeric types #159433

@DrewKimball

Description

When a filter on a non-numeric type (string, bytes, uuid, inet) covers only part of a histogram bucket, the optimizer should estimate fewer rows than the bucket's total. Currently, the logic that handles this assumes that data values are uniformly distributed across the first 8 bytes (ignoring any common prefix):

case types.StringFamily, types.BytesFamily, types.UuidFamily, types.INetFamily:
// For non-numeric types, convert the datums to encoded keys to
// approximate the range. We utilize an array to reduce repetitive code.

This likely works well for UUID columns, but can cause catastrophic underestimates for STRING columns, which are often clustered around certain values. A common example is when the STRING column represents a path. We should consider relaxing the uniformity assumption for non-UUID types.

Jira issue: CRDB-57841

Metadata

Assignees

No one assigned

    Labels

    A-sql-optimizer: SQL logical planning and optimizations.
    C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
    O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
    T-sql-queries: SQL Queries Team

    Type

    No type

    Projects

    Status

    Triage

    Milestone

    No milestone
