On-disk index cache for the Grid benchmark harness by tlwillke · Pull Request #612 · datastax/jvector

tlwillke · 2026-02-05T03:55:09Z

This PR adds a deterministic on-disk index cache for the Grid benchmark harness and wires it in end-to-end so repeated runs can reuse previously-built graph indexes to save time.

Key changes

Introduced OnDiskGraphIndexCache (flat directory cache, one file per index) keyed by a stable signature derived from:

dataset base name
feature set (per-index; not dependent on the list of feature sets)
build params (M, efConstruction, neighborOverflow, addHierarchy, refineFinalGraph)
build compressor identity

Cache filenames are derived from the signature (sanitized for filesystem safety). This allows multiple cached indexes to coexist in a single flat cache directory without collisions.
Updated Grid.runOneGraph to treat caching per-index (per feature set):

load cached indexes when present
build only the missing ones and merge results
keep non-cached builds writing into the temp work directory using the original graphN naming

Refactored buildOnDisk so it can write either:

the original graph0..graphN temp files (cache disabled), or
signature-named files in the cache directory (cache enabled), while preserving existing build behavior and minimizing churn.

Updated Bench, BenchYAML dataset files, and HelloVectorWorld so that:

the index cache is disabled by default
the cache can be enabled when desired (useSavedIndexIfExists or enableIndexCache)

Improved logging so it’s obvious when a cached index is used vs built from scratch, and added a cache-enabled startup message pointing to the cache directory and how to reclaim disk.
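
As an illustration of the buildOnDisk change described above, here is a minimal sketch of how the output location could be chosen. It is not the code from this PR: the class name, method name, and parameters are hypothetical, and only the graphN naming and the cache-versus-temp distinction come from the description.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical sketch; not the PR's buildOnDisk implementation. */
final class OutputPathSketch {
    /**
     * Picks where a freshly built index is written.
     * Cache disabled: workDir/graph<N>, matching the original temp naming.
     * Cache enabled: cacheDir/<signature-derived file name>.
     */
    static Path outputPath(boolean cacheEnabled, Path cacheDir, Path workDir,
                           int graphIndex, String cacheFileName) {
        return cacheEnabled
                ? cacheDir.resolve(cacheFileName)
                : workDir.resolve("graph" + graphIndex);
    }

    public static void main(String[] args) {
        Path cache = Paths.get("index-cache"); // assumed directory name
        Path work = Paths.get("work");         // assumed directory name
        System.out.println(outputPath(false, cache, work, 0, "ignored"));
        System.out.println(outputPath(true, cache, work, 0, "some-signature.index"));
    }
}
```

Keeping the decision in one place is one way to leave the rest of the build path untouched, which is consistent with the stated goal of minimizing churn.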

Behavior

Cache enabled: reuse cached indexes when signatures match; build only missing feature sets; artifacts persist in the cache directory.
Cache disabled: no cache reads/writes; builds go into the temp work directory and are cleaned up as before. NOTE: files already in the cache directory persist and must be deleted explicitly.
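
The cache-enabled behavior (reuse what matches, build only what is missing) can be sketched roughly as below. This is an assumption-laden illustration, not the logic in Grid.runOneGraph: LoadOrBuildSketch, resolve, and the Builder interface are hypothetical names, and feature sets are represented as plain strings.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of per-feature-set cache lookup. */
final class LoadOrBuildSketch {
    /** Callback that builds one index and writes it to the target path. */
    @FunctionalInterface
    interface Builder {
        void build(String featureSet, Path target) throws IOException;
    }

    /**
     * Returns one index file per feature set. Files that already exist in
     * cacheDir are reused; only the missing ones are built.
     */
    static Map<String, Path> resolve(Path cacheDir, List<String> featureSets,
                                     Function<String, String> fileNameFor,
                                     Builder builder) throws IOException {
        Map<String, Path> result = new LinkedHashMap<>();
        for (String featureSet : featureSets) {
            Path target = cacheDir.resolve(fileNameFor.apply(featureSet));
            if (Files.exists(target)) {
                System.out.println("Using cached index: " + target);
            } else {
                System.out.println("Building index from scratch: " + target);
                builder.build(featureSet, target);
            }
            result.put(featureSet, target);
        }
        return result;
    }
}
```

A caller would pass a function that maps each feature set to its signature-derived file name, plus a builder that performs the actual graph construction.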

Notes

Signatures are per-index (per feature set) and include all build-defining params to prevent accidental reuse across incompatible builds.
Filenames are sanitized to avoid filesystem issues from compressor IDs or other signature components.
No impact on graph construction or other benchmarking performance.
Cleaned up a few minor bugs in the existing code.
Does not address disk use metrics (disk space or file count); they currently report 0 for cached indexes (a TODO).
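
To make the signature and filename notes concrete, here is a minimal sketch of a deterministic signature and a sanitized file name. The field list follows the description above; the separator characters, the ".index" extension, the sample values, and every identifier in the code are assumptions for illustration rather than what OnDiskGraphIndexCache actually does.

```java
import java.util.Locale;

/** Hypothetical sketch; not the PR's signature or sanitization code. */
final class IndexSignatureSketch {
    /** Joins the build-defining inputs into one deterministic string. */
    static String signature(String datasetBaseName, String featureSet,
                            int m, int efConstruction, float neighborOverflow,
                            boolean addHierarchy, boolean refineFinalGraph,
                            String compressorId) {
        // Locale.ROOT avoids locale-dependent formatting differences.
        return String.format(Locale.ROOT,
                "%s|%s|M=%d|ef=%d|of=%s|hier=%b|refine=%b|comp=%s",
                datasetBaseName, featureSet, m, efConstruction,
                Float.toString(neighborOverflow), addHierarchy,
                refineFinalGraph, compressorId);
    }

    /** Replaces every character outside [A-Za-z0-9._-] with '_'. */
    static String sanitize(String signature) {
        return signature.replaceAll("[^A-Za-z0-9._-]", "_");
    }

    /** File name for a cached index; the extension is an assumption. */
    static String cacheFileName(String signature) {
        return sanitize(signature) + ".index";
    }

    public static void main(String[] args) {
        // Sample values are made up for the demo.
        String sig = signature("sift-128", "INLINE_VECTORS", 32, 100, 1.2f,
                true, false, "PQ(64,256)");
        System.out.println(cacheFileName(sig));
    }
}
```

Because the signature is computed per feature set, two indexes built from the same dataset with different feature sets or build parameters map to different files.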

github-actions · 2026-02-05T03:55:23Z

Before you submit for review:

Does your PR follow guidelines from CONTRIBUTIONS.md?
Did you summarize what this PR does clearly and concisely?
Did you include performance data for changes which may be performance impacting?
Did you include useful docs for any user-facing changes or features?
Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
Did you trigger and review regression testing results against the base branch via Run Bench Main?
Did you adhere to the code formatting guidelines (TBD)?
Did you group your changes for easy review, providing meaningful descriptions for each commit?
Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

jvector-examples/src/main/java/io/github/jbellis/jvector/example/AutoBenchYAML.java

...tor-examples/src/main/java/io/github/jbellis/jvector/example/util/OnDiskGraphIndexCache.java

jvector-examples/src/main/java/io/github/jbellis/jvector/example/Bench.java

jvector-examples/src/main/java/io/github/jbellis/jvector/example/Grid.java

jshook

Just some comments to think about.

MarkWolters

I think the invalidation / versioning would be well handled by maintaining a properties file that gets updated as necessary with the current version from the pom file when a build is run and can then be read in by the Java process. Looks good though, should be a good time saver!

tlwillke · 2026-02-06T00:36:25Z

I think the invalidation / versioning would be well handled by maintaining a properties file that gets updated as necessary with the current version from the pom file when a build is run and can then be read in by the Java process. Looks good though, should be a good time saver!

Thanks. I think including the JV revision in the key is a good idea, but I'm afraid it's insufficient since changes to main happen between releases as well.

I think we're headed toward invalidation based on:

Any change to the base vectors in the dataset
Any change to the index construction algorithm
Any change to the featureSet (quantization math, etc.)

Essentially, we need something like version numbers that reflect logic / math changes to indexing, quantization, etc., and cryptographic hashing to monitor dataset changes.

I'll open an issue.
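
One possible shape for the dataset part of that invalidation scheme is a content hash that is folded into the cache key. The sketch below is only an illustration of the idea in the comment above: DatasetFingerprintSketch, sha256Hex, keySuffix, and INDEX_ALGO_VERSION are hypothetical names, and nothing here is taken from the PR or the follow-up issue.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of dataset fingerprinting for cache invalidation. */
final class DatasetFingerprintSketch {
    // Assumed constant that would be bumped by hand whenever the
    // index construction logic changes.
    static final int INDEX_ALGO_VERSION = 1;

    /** Streams the file through SHA-256 and returns the lowercase hex digest. */
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            byte[] buffer = new byte[1 << 16];
            while (in.read(buffer) != -1) {
                // Reading updates the digest as a side effect.
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    /** Combines the algorithm version and the dataset hash into a key suffix. */
    static String keySuffix(Path baseVectors) throws IOException, NoSuchAlgorithmException {
        return "algo" + INDEX_ALGO_VERSION + "-" + sha256Hex(baseVectors);
    }
}
```

Hashing a large base-vector file has a cost of its own, so a real implementation might cache the digest alongside the dataset; that trade-off is outside what the thread discusses.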

ashkrisk

I've been wanting this feature for a while. Left some suggestions.

...tor-examples/src/main/java/io/github/jbellis/jvector/example/util/OnDiskGraphIndexCache.java

tlwillke · 2026-02-07T00:07:16Z

I have addressed all outstanding feedback and opened two related issues.

jshook

Thanks for taking the feedback and turning around the updates quickly.

* Initial implementation of index cache for Bench / Grid.
* Initial implementation of index cache for Bench / Grid.
* Uncommented datasets that were not working prior to PR613.
* Improved Exception handling. Added alpha to key signature. Lazy cached index deletes. Index cached marked Experimental.
* Integrated index caching into autoBenchYAML and runAllAndCollectResults. useSavedIndexIfExists is No by default. Added refineFinalGraph to construction parameters.

tlwillke added 3 commits February 4, 2026 19:27

Initial implementation of index cache for Bench / Grid.

7f32bb1

Initial implementation of index cache for Bench / Grid.

ee1ceee

Merge branch 'on-disk-index-cache' of github.com:datastax/jvector int…

aba1930

…o on-disk-index-cache

tlwillke requested review from MarkWolters and jshook as code owners February 5, 2026 03:55

tlwillke requested a review from ashkrisk February 5, 2026 03:58

MarkWolters reviewed Feb 5, 2026

jvector-examples/src/main/java/io/github/jbellis/jvector/example/AutoBenchYAML.java

Uncommented datasets that were not working prior to PR613.

48ce5c8

tlwillke requested a review from MarkWolters February 5, 2026 18:28

jshook reviewed Feb 5, 2026

...tor-examples/src/main/java/io/github/jbellis/jvector/example/util/OnDiskGraphIndexCache.java

jshook reviewed Feb 5, 2026

jvector-examples/src/main/java/io/github/jbellis/jvector/example/Bench.java

jshook reviewed Feb 5, 2026

jvector-examples/src/main/java/io/github/jbellis/jvector/example/Grid.java (outdated)

jshook approved these changes Feb 5, 2026

MarkWolters approved these changes Feb 5, 2026

ashkrisk reviewed Feb 6, 2026

tlwillke added 2 commits February 6, 2026 14:38

Improved Exception handling. Added alpha to key signature. Lazy cache…

0655ae0

…d index deletes. Index cached marked Experimental.

Integrated index caching into autoBenchYAML and runAllAndCollectResul…

cb421d3

…ts. useSavedIndexIfExists is No by default. Added refineFinalGraph to construction parameters.

This was referenced Feb 6, 2026

Improve signature used by the index caching feature (PR #612) #615

Open

Make disk usage monitoring work for the index caching feature (PR #612) #616

Open

tlwillke self-assigned this Feb 7, 2026

jshook added this to the Enhanced Resiliency Initiative milestone Feb 7, 2026

jshook approved these changes Feb 7, 2026

jshook merged commit 7e493ee into main Feb 7, 2026
12 checks passed

jshook deleted the on-disk-index-cache branch February 7, 2026 01:22

jshook merged 6 commits into main from on-disk-index-cache

4 participants
