Ensure extractTrainingVectors return a list of at most MAX_PQ_TRAINING_SET_SIZE by michaeljmarshall · Pull Request #610 · datastax/jvector

michaeljmarshall · 2026-02-04T05:43:49Z

Fixes #590

Uses Floyd's random sampling algorithm to select random training vectors from the RandomAccessVectorValues. The solution has two phases: first it selects MAX_PQ_TRAINING_SET_SIZE random ordinals, then it maps those ordinals to vectors. Here is a reference to the algorithm: https://math.stackexchange.com/questions/178690/whats-the-proof-of-correctness-for-robert-floyds-algorithm-for-selecting-a-sin.
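For readers unfamiliar with the first phase, here is a minimal sketch of Floyd's sampling over 0-based ordinals. It is not the code from this PR; the class and method names (`FloydSampleSketch`, `sampleOrdinals`) are hypothetical, and `n` and `k` stand in for `ravv.size()` and MAX_PQ_TRAINING_SET_SIZE.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.SplittableRandom;

// Hypothetical illustration, not the jvector implementation.
final class FloydSampleSketch {
    /** Returns k distinct ordinals drawn uniformly at random from [0, n). */
    static Set<Integer> sampleOrdinals(int n, int k, SplittableRandom rng) {
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("need 0 <= k <= n");
        }
        Set<Integer> chosen = new HashSet<>();
        // Exactly k iterations, so exactly k random numbers are generated.
        for (int j = n - k; j < n; j++) {
            int t = rng.nextInt(j + 1); // uniform in [0, j]
            // If t was already picked, take j instead; j cannot be in the
            // set yet because earlier draws were all strictly below j.
            if (!chosen.add(t)) {
                chosen.add(j);
            }
        }
        return chosen;
    }
}
```

Each iteration adds exactly one new ordinal, so the result always has size `k` without any retry loop.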

The sampling cost is bounded by MAX_PQ_TRAINING_SET_SIZE rather than by ravv.size(), so it is essentially constant time, which is an improvement on what we currently had. We now generate only MAX_PQ_TRAINING_SET_SIZE random numbers instead of ravv.size() random numbers. The slight added cost is a hash set containment check per draw.

This change also handles the boundary case where the vector values object has at most MAX_PQ_TRAINING_SET_SIZE vectors.
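The two phases and the boundary case can be sketched together as follows. This is an illustration under stated assumptions, not the PR's code: a `List<float[]>` stands in for RandomAccessVectorValues, `maxSize` stands in for MAX_PQ_TRAINING_SET_SIZE, and the names `TrainingVectorSketch` and `extract` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.SplittableRandom;

// Hypothetical illustration; List<float[]> stands in for RandomAccessVectorValues.
final class TrainingVectorSketch {
    static List<float[]> extract(List<float[]> vectors, int maxSize, long seed) {
        int n = vectors.size();
        // Boundary case: nothing to sample, use every vector.
        if (n <= maxSize) {
            return new ArrayList<>(vectors);
        }
        // Phase 1: pick maxSize distinct ordinals with Floyd's algorithm.
        SplittableRandom rng = new SplittableRandom(seed);
        Set<Integer> ordinals = new HashSet<>();
        for (int j = n - maxSize; j < n; j++) {
            int t = rng.nextInt(j + 1); // uniform in [0, j]
            if (!ordinals.add(t)) {
                ordinals.add(j);
            }
        }
        // Phase 2: map the chosen ordinals to vectors.
        List<float[]> out = new ArrayList<>(maxSize);
        for (int ord : ordinals) {
            out.add(vectors.get(ord));
        }
        return out;
    }
}
```

In either branch the returned list has at most `maxSize` entries, which is the guarantee named in the PR title.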

Copilot

Pull request overview

This PR optimizes the training vector extraction process in ProductQuantization by implementing Floyd's random sampling algorithm to ensure at most MAX_PQ_TRAINING_SET_SIZE vectors are selected. This replaces the previous approach of filtering all vectors with a random probability check.

Changes:

- Replaced probabilistic filtering with deterministic sampling using Floyd's algorithm
- Added special handling for cases where total vectors ≤ MAX_PQ_TRAINING_SET_SIZE
- Changed from ThreadLocalRandom to SplittableRandom with a fixed seed for reproducibility
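To illustrate why a per-vector probability check does not bound the sample size, here is a hedged sketch. The previous jvector code is not shown on this page, so the filter below is an assumed reconstruction (keep each ordinal with probability `cap / n`); the names `ProbabilisticFilterSketch` and `countKept` are hypothetical.

```java
import java.util.SplittableRandom;

// Assumed reconstruction of a probabilistic filter, not the removed jvector code.
final class ProbabilisticFilterSketch {
    /** Counts how many of n ordinals pass an independent keep-with-probability check. */
    static int countKept(int n, int cap, SplittableRandom rng) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        double p = Math.min(1.0, (double) cap / n);
        int kept = 0;
        // One random draw per ordinal: n draws in total.
        for (int i = 0; i < n; i++) {
            if (rng.nextDouble() < p) {
                kept++;
            }
        }
        // kept equals cap only in expectation; a given run may land above or below it.
        return kept;
    }
}
```

Seeding `SplittableRandom` with a fixed value, as in `new SplittableRandom(42L)`, makes repeated runs produce the same draws, which is the reproducibility property the last item refers to.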

jvector-base/src/main/java/io/github/jbellis/jvector/quantization/ProductQuantization.java

tlwillke

Ok with the deterministic sampling. Performance for extracting the vectors checks out. LGTM.

MarkWolters

LGTM

michaeljmarshall added 3 commits February 3, 2026 23:28

Ensure extractTrainingVectors return a list of at most MAX_PQ_TRAININ…

98acb5b

…G_SET_SIZE

Make the sampling alg. 0-based

517af31

Adding this as its own commit since the canonical implementation is shown as 1-based and I want to make it clear how I've modified it.

Add an assertion

5c2214b
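The 0-based adjustment described in the second commit can be illustrated by placing both indexings side by side. This is a sketch, not the committed code; `FloydIndexingSketch`, `oneBased`, and `zeroBased` are hypothetical names. With the same seed, both variants make the same sequence of `nextInt` calls, so the 0-based result is the 1-based result shifted down by one.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.SplittableRandom;

// Hypothetical illustration of the 1-based vs 0-based formulations.
final class FloydIndexingSketch {
    /** Canonical form: samples k distinct values from [1, n]. */
    static Set<Integer> oneBased(int n, int k, long seed) {
        SplittableRandom rng = new SplittableRandom(seed);
        Set<Integer> s = new HashSet<>();
        for (int j = n - k + 1; j <= n; j++) {
            int t = 1 + rng.nextInt(j); // uniform in [1, j]
            if (!s.add(t)) {
                s.add(j);
            }
        }
        return s;
    }

    /** Shifted form: samples k distinct ordinals from [0, n - 1]. */
    static Set<Integer> zeroBased(int n, int k, long seed) {
        SplittableRandom rng = new SplittableRandom(seed);
        Set<Integer> s = new HashSet<>();
        for (int j = n - k; j < n; j++) {
            int t = rng.nextInt(j + 1); // uniform in [0, j]
            if (!s.add(t)) {
                s.add(j);
            }
        }
        return s;
    }
}
```

The 0-based form yields ordinals that can be used directly as indexes, with no subtraction at lookup time.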

michaeljmarshall self-assigned this Feb 4, 2026

michaeljmarshall requested review from MarkWolters, jshook and tlwillke as code owners February 4, 2026 05:43

michaeljmarshall requested a review from Copilot February 4, 2026 05:44

Copilot AI reviewed Feb 4, 2026

tlwillke approved these changes Feb 12, 2026

MarkWolters approved these changes Feb 12, 2026

tlwillke merged commit d9ddce5 into datastax:main Feb 12, 2026
15 of 22 checks passed

michaeljmarshall deleted the jvector-590 branch February 13, 2026 03:43
