Skip to content

CASSANDRA-20086: Use score ordered iterators for ANN search#4615

Open
michaeljmarshall wants to merge 2 commits intoapache:trunkfrom
michaeljmarshall:cassandra-20086-trunk
Open

CASSANDRA-20086: Use score ordered iterators for ANN search#4615
michaeljmarshall wants to merge 2 commits intoapache:trunkfrom
michaeljmarshall:cassandra-20086-trunk

Conversation

@michaeljmarshall
Copy link
Member

@michaeljmarshall michaeljmarshall commented Feb 13, 2026

(cherry picked from commit f2e3411)

This PR implements the feature proposed in https://issues.apache.org/jira/browse/CASSANDRA-20086.

Motivation

Current ANN search does the following:

  1. Get top k vectors from each index segment (memory or sstable)
  2. Sort them in Primary Key order
  3. Consume and materialize all of of them (LIMIT * numSegments)
  4. Re-query with higher LIMIT if shadow keys are encountered (updates/overwrites)
  5. Re-score to sort by similarity score
  6. Take LIMIT best vectors
  7. Sort by Primary Key

This flow is both inefficient due to multiple sorts/requeries and can produce invalid results in cases where a row has an update to a vector to a worse score.

The new flow:

  1. Get top k vectors from each index segment (memory or sstable)
  2. Build score ordered iterators
  3. Merge those iterators, sorting by score descending
  4. Consume, materialize, and validate rows from the global iterator until we have LIMIT rows
  5. If an iterator is exhausted during step 4, requery the exhausted iterator's graph to get more vectors, using a bitset to ignore already visited nodes, until the graph is exhausted.
  6. Sort LIMIT rows by primary key.

This change removes two sort operations. It also fixes all update cases by asserting that a row's position within the score ordered iterator is only valid if the vector cell's materialized value is from the same sstable.

Other minor changes

  1. Increased test coverage by parameterizing the logic to test brute force, graph search, and "graph's choice".
  2. Implemented custom brute force iterator that utilizes two priority queues to do the PQ based ranking then the full resolution ranking
  3. Fixed a bug in the mapping of limit to topk where we could get topk < limit for limit 1000

I plan on completing some basic bench marks next week to demonstrate the value of this change. For now, I am opening this to start to get some reviews.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant