Ensure extractTrainingVectors return a list of at most MAX_PQ_TRAINING_SET_SIZE #610

Merged
tlwillke merged 3 commits into datastax:main from michaeljmarshall:jvector-590
Feb 12, 2026

@michaeljmarshall
Member

Fixes #590

Uses Floyd's random sampling algorithm to select random training vectors from the RandomAccessVectorValues. The solution has two phases: first it selects MAX_PQ_TRAINING_SET_SIZE random ordinals, then it maps those ordinals to vectors. Here is a reference to the algorithm: https://math.stackexchange.com/questions/178690/whats-the-proof-of-correctness-for-robert-floyds-algorithm-for-selecting-a-sin
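For readers unfamiliar with the algorithm, here is a minimal sketch of the first phase (choosing distinct ordinals), written 0-based. It is not the code from this PR: the class name `FloydSampleSketch`, the method `sampleOrdinals`, and its parameters are made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.SplittableRandom;

// Illustrative only: hypothetical class and method names, not the jvector implementation.
final class FloydSampleSketch {
    /**
     * Picks k distinct ordinals uniformly at random from [0, n) using
     * Floyd's sampling algorithm, adapted to 0-based indexing.
     */
    static Set<Integer> sampleOrdinals(int n, int k, SplittableRandom rng) {
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("need 0 <= k <= n");
        }
        Set<Integer> chosen = new HashSet<>();
        // Exactly k iterations, so exactly k random numbers are drawn.
        for (int j = n - k; j < n; j++) {
            int t = rng.nextInt(j + 1); // uniform in [0, j]
            // If t was already picked, take j instead. j cannot be in the set yet,
            // because every earlier iteration only added values below j.
            if (!chosen.add(t)) {
                chosen.add(j);
            }
        }
        return chosen;
    }
}
```

Each iteration adds exactly one new ordinal, so the result always has size k. For example, `FloydSampleSketch.sampleOrdinals(1000, 50, new SplittableRandom(42))` returns 50 distinct values in [0, 1000).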

The algorithm is essentially constant time with respect to ravv.size(), which is an improvement on what we currently had. We will now generate only MAX_PQ_TRAINING_SET_SIZE random numbers instead of ravv.size() random numbers. The slight added cost is checking a hash set for containment.

This change also handles the boundary case where the vector values object has at most MAX_PQ_TRAINING_SET_SIZE vectors.
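Putting the boundary case and the second phase together, the overall shape might look like the sketch below. This is an assumption-laden illustration, not the merged code: `VectorSource` is a hypothetical stand-in for RandomAccessVectorValues, and `extract`, `maxTrainingSize`, and `seed` are invented names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SplittableRandom;

// Illustrative only: hypothetical names, not the jvector API.
final class TrainingVectorSketch {
    /** Hypothetical stand-in for a random-access vector source. */
    interface VectorSource {
        int size();
        float[] getVector(int ordinal);
    }

    /** Returns at most maxTrainingSize vectors from the source. */
    static List<float[]> extract(VectorSource source, int maxTrainingSize, long seed) {
        int n = source.size();
        List<float[]> out = new ArrayList<>(Math.min(n, maxTrainingSize));

        // Boundary case: nothing to sample, take every vector.
        if (n <= maxTrainingSize) {
            for (int i = 0; i < n; i++) {
                out.add(source.getVector(i));
            }
            return out;
        }

        // Phase 1: Floyd's algorithm picks maxTrainingSize distinct ordinals.
        SplittableRandom rng = new SplittableRandom(seed);
        Set<Integer> chosen = new HashSet<>();
        for (int j = n - maxTrainingSize; j < n; j++) {
            int t = rng.nextInt(j + 1);
            if (!chosen.add(t)) {
                chosen.add(j);
            }
        }

        // Phase 2: map the chosen ordinals to vectors.
        for (int ordinal : chosen) {
            out.add(source.getVector(ordinal));
        }
        return out;
    }
}
```

With a fixed seed, repeated calls on the same source select the same ordinals, which is one way to get reproducible training sets.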

Adding this as its own commit since the canonical implementation is shown as 1-based and I want to make it clear how I've modified it.

Copilot AI left a comment

Pull request overview

This PR optimizes the training vector extraction process in ProductQuantization by implementing Floyd's random sampling algorithm to ensure at most MAX_PQ_TRAINING_SET_SIZE vectors are selected. This replaces the previous approach of filtering all vectors with a random probability check.

Changes:

  • Replaced probabilistic filtering with deterministic sampling using Floyd's algorithm
  • Added special handling for cases where total vectors ≤ MAX_PQ_TRAINING_SET_SIZE
  • Changed from ThreadLocalRandom to SplittableRandom with a fixed seed for reproducibility
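To see why a per-vector probability check cannot guarantee the cap, consider the sketch below. It is not the removed code: it assumes each vector was kept independently with probability k/n, and the names `BernoulliFilterSketch` and `countKept` are hypothetical.

```java
import java.util.SplittableRandom;

// Illustrative only: assumes an independent keep-with-probability-k/n filter.
final class BernoulliFilterSketch {
    /** Counts how many of n items pass an independent probability check with p = k / n. */
    static int countKept(int n, int k, long seed) {
        SplittableRandom rng = new SplittableRandom(seed);
        double p = (double) k / n;
        int kept = 0;
        // One random draw per item, so n draws in total.
        for (int i = 0; i < n; i++) {
            if (rng.nextDouble() < p) {
                kept++;
            }
        }
        return kept;
    }
}
```

Under that assumption the kept count is binomially distributed around k, so it lands above k on a substantial fraction of runs, whereas Floyd's algorithm returns exactly k ordinals every time.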

Collaborator

@tlwillke tlwillke left a comment

Ok with the deterministic sampling. Performance for extracting the vectors checks out. LGTM.

Contributor

@MarkWolters MarkWolters left a comment

LGTM

@tlwillke tlwillke merged commit d9ddce5 into datastax:main Feb 12, 2026
15 of 22 checks passed
@michaeljmarshall michaeljmarshall deleted the jvector-590 branch February 13, 2026 03:43

Development

Successfully merging this pull request may close these issues.

extractTrainingVectors may produce more than MAX_PQ_TRAINING_SET_SIZE vectors

3 participants