
Use bulk scoring in NeighborArray#isWorstNonDiverse #15667

Open
ML-dev-crypto wants to merge 1 commit into apache:branch_10x from ML-dev-crypto:branch_10x

Conversation

@ML-dev-crypto

This change updates NeighborArray#isWorstNonDiverse to use the
RandomVectorScorer#bulkScore API, avoiding repeated per-neighbor
score calls in a hot loop.

The bulkScore method is a default interface method and safely falls
back to per-node scoring when not overridden by a scorer.

Fixes #15606
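The fallback behavior described above can be sketched with a simplified stand-in for `RandomVectorScorer`; the interface below and the max-score return value are assumptions for illustration, not Lucene's exact API.

```java
// Simplified stand-in for Lucene's RandomVectorScorer, showing how a default
// bulkScore method can fall back to per-node scoring when a scorer does not
// override it. Names and the returned max score are illustrative assumptions.
interface Scorer {
  float score(int node);

  // Scores the first numNodes entries of nodes into scores; returns the max.
  default float bulkScore(int[] nodes, float[] scores, int numNodes) {
    float max = Float.NEGATIVE_INFINITY;
    for (int i = 0; i < numNodes; i++) {
      scores[i] = score(nodes[i]);
      max = Math.max(max, scores[i]);
    }
    return max;
  }
}

public class BulkScoreSketch {
  public static void main(String[] args) {
    Scorer s = node -> 1.0f / (1 + node); // toy similarity function
    int[] nodes = {0, 1, 3};
    float[] scores = new float[3];
    float max = s.bulkScore(nodes, scores, 3);
    System.out.println(max); // 1.0, the score of node 0
  }
}
```

An overriding scorer can batch the distance computations (e.g. with the vector API) while callers keep the same single call site.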

@benwtrent
Member

@ML-dev-crypto Could you benchmark with luceneutil to validate if this is actually worth it?

Comment on lines +318 to +325
// Allocate a temporary buffer for scores.
// NeighborArray size is typically small (M=16 or 32), so this allocation is acceptable
// and keeps the change localized to this class.
float[] neighborScores = new float[numNodesToCheck];

// Bulk score all neighbors.
// The default implementation in RandomVectorScorer handles the fallback if needed.
scorer.bulkScore(nodes.buffer, neighborScores, numNodesToCheck);
Member

please, delete all the useless LLM-esque comments. We can see what it's doing. Comments describing "what" are fairly useless.

Author

Understood. I've removed the verbose comments and kept the documentation focused on the 'why' rather than the 'what'.

}

// Bulk score all unchecked neighbors
scorer.bulkScore(nodesToCheck, neighborScores, numNodesToCheck);
Member

Why can't we use the max score returned from bulkScore?

Author

Thanks for your guidance. I've updated the logic to capture the return value from bulkScore and use it for an early exit: if the maximum similarity in the batch is worse than the candidate's, we can return immediately without iterating.
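The early exit can be sketched as below; the `BulkScorer` interface and the helper method are illustrative stand-ins, assuming bulkScore fills the scores array and returns the maximum similarity in the batch.

```java
// Illustrative sketch (not the exact Lucene code) of the early exit: if even
// the most similar neighbor in the batch is below the accepted similarity,
// no neighbor can flag the candidate as non-diverse, so the per-neighbor
// loop is skipped entirely.
interface BulkScorer {
  // Assumed shape: fills scores[0..numNodes) and returns the maximum.
  float bulkScore(int[] nodes, float[] scores, int numNodes);
}

public class EarlyExitSketch {
  // Returns true if any neighbor's similarity meets or exceeds minAccepted.
  static boolean anyNonDiverse(
      BulkScorer scorer, int[] nodes, float[] scratch, int numNodes, float minAccepted) {
    float max = scorer.bulkScore(nodes, scratch, numNodes);
    if (max < minAccepted) {
      return false; // early exit: no need to inspect individual scores
    }
    for (int i = 0; i < numNodes; i++) {
      if (scratch[i] >= minAccepted) {
        return true;
      }
    }
    return false;
  }
}
```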

float minAcceptedSimilarity = scores.get(candidateIndex);
if (candidateIndex == uncheckedIndexes[uncheckedCursor]) {
// the candidate itself is unchecked
for (int i = candidateIndex - 1; i >= 0; i--) {
Member

Could we have the scratch data passed in here?

Member

You should be able to pass in int[] and float[]. I am not sure why you only passed in one.

I suggest maybe using the DocAndFloatFeatureBuffer

Author

Thanks for your guidance and time. Right now this PR removes the dominant allocation by reusing the float[] in the hot path. There is still a small int[] allocation when gathering unchecked nodes, which could be avoided by passing in both int[] and float[] or by using DocAndFloatFeatureBuffer.
I kept this PR minimal and behavior-preserving, but I'm happy to extend it to reuse both buffers if you'd prefer that in this change.
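The both-buffers reuse under discussion could look roughly like the sketch below; the `Scratch` holder is hypothetical and is not Lucene's DocAndFloatFeatureBuffer, which the reviewer suggests as an existing alternative.

```java
// Hypothetical caller-owned scratch holder pairing the int[] of node ids with
// the float[] of scores, so the diversity-check hot path allocates nothing.
// This is an illustrative sketch, not Lucene's DocAndFloatFeatureBuffer API.
final class Scratch {
  int[] nodes = new int[0];
  float[] scores = new float[0];

  // Grow both arrays together; old contents need not survive a regrow,
  // since each diversity check refills the buffers from scratch.
  void ensureCapacity(int size) {
    if (nodes.length < size) {
      nodes = new int[size];
      scores = new float[size];
    }
  }
}
```

The caller would hold one `Scratch` per builder thread and pass it into the diversity check, so repeated checks at M=16 or 32 neighbors reuse the same pair of arrays.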

github-actions bot added this to the 10.5.0 milestone Feb 6, 2026
Reuse a scratch buffer during HNSW diversity checks and use
RandomVectorScorer.bulkScore with early termination to avoid
per-call allocations on the hot path.
@navneet1v
Contributor

@ML-dev-crypto do we know what is the impact in performance because of this change?

@ML-dev-crypto
Author

@ML-dev-crypto do we know what is the impact in performance because of this change?

Thanks for the question. Theoretically, this change eliminates array allocations in the hot path of graph construction and utilizes vector API bulk scoring, which should reduce GC pressure and improve throughput.
I am currently setting up the standard luceneutil benchmark to quantify the exact impact. I will post the results here as soon as they are available.

@github-actions
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

