opt: improve histogram intersection logic for non-numeric types

When a filter on a non-numeric type (string, bytes, uuid, inet) intersects only part of a histogram bucket, the estimated row count should be less than the bucket's total row count. Currently, the logic that handles this assumes that data values are uniformly distributed across the first 8 bytes (ignoring any common prefix): https://github.com/cockroachdb/cockroach/blob/3209e33b9528c21be13110c1cb99471ebd85c5a8/pkg/sql/sem/tree/datumrange/range.go#L180-L182

This likely works well for UUID columns, but can cause catastrophic underestimates for STRING columns, which are often clustered around certain values. A common example is when the STRING column represents a path. We should consider relaxing the uniformity assumption for non-UUID types.
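
To make the failure mode concrete, here is a minimal sketch of an 8-byte uniform estimate. It is not the code in `range.go`: the helper names (`commonPrefixLen`, `keyToUint64`, `uniformFraction`) are hypothetical, plain byte strings stand in for encoded keys, and the bucket bounds `/a` and `/z` are made up for illustration. It assumes the filter range lies within the bucket bounds.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// commonPrefixLen returns the number of leading bytes shared by a and b.
func commonPrefixLen(a, b []byte) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// keyToUint64 reads up to 8 bytes of key starting at offset as a
// big-endian integer, zero-padding keys that are too short.
func keyToUint64(key []byte, offset int) uint64 {
	var buf [8]byte
	if offset < len(key) {
		copy(buf[:], key[offset:])
	}
	return binary.BigEndian.Uint64(buf[:])
}

// clamp restricts v to the closed interval [lo, hi].
func clamp(v, lo, hi uint64) uint64 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// uniformFraction estimates the fraction of the bucket [lo, hi] covered by
// [start, end], assuming values are uniformly distributed over the 8 bytes
// that follow the common prefix of lo and hi (hypothetical helper).
func uniformFraction(lo, hi, start, end []byte) float64 {
	p := commonPrefixLen(lo, hi)
	l, h := keyToUint64(lo, p), keyToUint64(hi, p)
	if h <= l {
		// The bounds are indistinguishable in 8 bytes; fall back to the whole bucket.
		return 1
	}
	s := clamp(keyToUint64(start, p), l, h)
	e := clamp(keyToUint64(end, p), l, h)
	if e <= s {
		return 0
	}
	return float64(e-s) / float64(h-l)
}

func main() {
	// Made-up bucket of 1000 rows spanning paths from "/a" to "/z".
	lo, hi := []byte("/a"), []byte("/z")
	// Filter covering every key that starts with "/home/".
	frac := uniformFraction(lo, hi, []byte("/home/"), []byte("/home0"))
	fmt.Printf("estimated fraction: %g\n", frac)
	fmt.Printf("estimated rows out of 1000: %g\n", frac*1000)
}
```

In this sketch the `/home/` range differs from its upper bound only in the fifth byte after the common prefix, so the estimated fraction is on the order of 1e-11, even if most rows in the bucket actually live under `/home/`.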

Jira issue: CRDB-57841

opt: improve histogram intersection logic for non-numeric types #159433

	case types.StringFamily, types.BytesFamily, types.UuidFamily, types.INetFamily:
		// For non-numeric types, convert the datums to encoded keys to
		// approximate the range. We utilize an array to reduce repetitive code.
