On-disk index cache for the Grid benchmark harness #612

Merged
jshook merged 6 commits into main from on-disk-index-cache
Feb 7, 2026

@tlwillke
Collaborator

@tlwillke tlwillke commented Feb 5, 2026

This PR adds a deterministic on-disk index cache for the Grid benchmark harness and wires it in end-to-end so repeated runs can reuse previously-built graph indexes to save time.

Key changes

  1. Introduced OnDiskGraphIndexCache (flat directory cache, one file per index) keyed by a stable signature derived from:
  • dataset base name
  • feature set (per-index; not dependent on the list of feature sets)
  • build params (M, efConstruction, neighborOverflow, addHierarchy, refineFinalGraph)
  • build compressor identity
  2. Cache filenames are derived from the signature (sanitized for filesystem safety). This allows multiple cached indexes to coexist in a single flat cache directory without collisions.
  3. Updated Grid.runOneGraph to treat caching per-index (per feature set):
  • load cached indexes when present
  • build only the missing ones and merge results
  • keep non-cached builds writing into the temp work directory using the original graphN naming
  4. Refactored buildOnDisk so it can write either:
  • the original graph0..graphN temp files (cache disabled), or
  • signature-named files in the cache directory (cache enabled), while preserving existing build behavior and minimizing churn.
  5. Updated Bench, the BenchYAML dataset files, and HelloVectorWorld:
  • the index cache is disabled by default
  • the cache is easy to enable if desired (useSavedIndexIfExists or enableIndexCache)
  6. Improved logging so it’s obvious when a cached index is used vs built from scratch, and added a cache-enabled startup message pointing to the cache directory and how to reclaim disk.
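
For readers who want a concrete picture of the signature-to-filename step, here is a minimal sketch. It is not the code in this PR: the class name, method names, field order, separator, and the `.index` suffix are all assumptions made for illustration, and the feature set is reduced to a plain string.

```java
import java.util.Locale;

/** Hypothetical sketch of a cache key; not the PR's OnDiskGraphIndexCache. */
final class IndexSignatureSketch {
    /** Joins the build-defining inputs into one string with a fixed field order. */
    static String signature(String datasetBaseName, String featureSet, int m, int efConstruction,
                            float neighborOverflow, boolean addHierarchy, boolean refineFinalGraph,
                            String compressorId) {
        return String.join("_",
                datasetBaseName,
                featureSet,
                "M" + m,
                "ef" + efConstruction,
                // Locale.ROOT keeps the decimal separator identical on every machine.
                "of" + String.format(Locale.ROOT, "%.2f", neighborOverflow),
                "hier" + addHierarchy,
                "refine" + refineFinalGraph,
                compressorId);
    }

    /** Replaces every character outside [A-Za-z0-9._-] so the result is a safe file name. */
    static String toFileName(String signature) {
        return signature.replaceAll("[^A-Za-z0-9._-]", "_") + ".index";
    }
}
```

With this sketch, a compressor ID such as `PQ(192,256)` ends up as `PQ_192_256_` in the file name. One caveat of plain character replacement is that two different signatures could in principle sanitize to the same name, so a real implementation may want to append a short hash of the unsanitized signature.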

Behavior

  • Cache enabled: reuse cached indexes when signatures match; build only missing feature sets; artifacts persist in the cache directory.
  • Cache disabled: no cache reads/writes; build into temp work directory and clean up as before. NOTE: files already in the cache directory persist and must be deleted explicitly.
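
The cache-enabled behavior (reuse what exists, build only what is missing) can be sketched roughly as follows. This is an illustration under assumptions, not the logic in Grid.runOneGraph: `LoadOrBuildSketch`, `Builder`, and `resolve` are hypothetical names, and feature sets are represented as strings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of per-feature-set cache lookup; not the PR's code. */
final class LoadOrBuildSketch {
    /** Builds the index for one feature set and writes it to the given path. */
    interface Builder {
        void buildTo(String featureSet, Path target) throws IOException;
    }

    /** Returns one index file per feature set, building only those not yet cached. */
    static Map<String, Path> resolve(List<String> featureSets, Path cacheDir,
                                     Function<String, String> fileNameFor, Builder builder) {
        try {
            Files.createDirectories(cacheDir);
            Map<String, Path> resolved = new LinkedHashMap<>();
            for (String featureSet : featureSets) {
                Path target = cacheDir.resolve(fileNameFor.apply(featureSet));
                if (Files.isRegularFile(target)) {
                    System.out.println("Using cached index: " + target);
                } else {
                    System.out.println("Building index from scratch: " + target);
                    builder.buildTo(featureSet, target);
                }
                resolved.put(featureSet, target);
            }
            return resolved;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A more defensive variant would write to a temporary file and rename it into place, so that an interrupted build cannot leave a partial file that later looks like a cache hit.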

Notes

  • Signatures are per-index (per feature set) and include all build-defining params to prevent accidental reuse across incompatible builds.
  • Filenames are sanitized to avoid filesystem issues from compressor IDs or other signature components.
  • No impact on graph construction or other benchmarking performance.
  • Cleaned up a few minor bugs in the existing code.
  • Does not address disk-use metrics (disk space or file count); they currently report 0 for cached indexes (a TODO).

@github-actions
Contributor

github-actions bot commented Feb 5, 2026

Before you submit for review:

  • Does your PR follow guidelines from CONTRIBUTIONS.md?
  • Did you summarize what this PR does clearly and concisely?
  • Did you include performance data for changes which may be performance impacting?
  • Did you include useful docs for any user-facing changes or features?
  • Did you include useful javadocs for developer oriented changes, explaining new concepts or key changes?
  • Did you trigger and review regression testing results against the base branch via Run Bench Main?
  • Did you adhere to the code formatting guidelines (TBD)?
  • Did you group your changes for easy review, providing meaningful descriptions for each commit?
  • Did you ensure that all files contain the correct copyright header?

If you did not complete any of these, then please explain below.

@tlwillke tlwillke requested a review from ashkrisk February 5, 2026 03:58
@tlwillke tlwillke requested a review from MarkWolters February 5, 2026 18:28
Contributor

@jshook jshook left a comment

Just some comments to think about.

Contributor

@MarkWolters MarkWolters left a comment

I think the invalidation / versioning would be well handled by maintaining a properties file that gets updated as necessary with the current version from the pom file when a build is run, and that can then be read in by the Java process. Looks good though, should be a good time saver!
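
One way the properties-file idea could look on the Java side is sketched below. It assumes a properties file containing a line such as `version=${project.version}` that Maven resource filtering fills in at build time; the key name `version`, the class name, and the `unknown` fallback are hypothetical choices, not anything defined in this PR.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical sketch of reading a build-stamped version; not part of this PR. */
final class BuildVersionSketch {
    /** Reads the "version" key, falling back to "unknown" when the key is absent. */
    static String readVersion(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("version", "unknown");
    }
}
```

The returned string could then be appended to the cache signature so that indexes built by different releases never share a file name.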

@tlwillke
Collaborator Author

tlwillke commented Feb 6, 2026

> I think the invalidation / versioning would be well handled by maintaining a properties file that gets updated as necessary with the current version from the pom file when a build is run, and that can then be read in by the Java process. Looks good though, should be a good time saver!

Thanks. I think including the JV revision in the key is a good idea, but I'm afraid it's insufficient since changes to main happen between releases as well.

I think we're headed toward invalidation based on:

  1. Any change to the base vectors in the dataset
  2. Any change to the index construction algorithm
  3. Any change to the featureSet (quantization math, etc.)

Essentially, we need something like version numbers that reflect logic / math changes to indexing, quantization, etc., and cryptographic hashing to monitor dataset changes.
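
As a rough illustration of the dataset-hashing half of that idea, the sketch below computes a SHA-256 digest over a stream of bytes using the standard `java.security.MessageDigest` API. The class and method names are hypothetical, and how the digest would be folded into the cache key is left open here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch of dataset fingerprinting; not part of this PR. */
final class DatasetHashSketch {
    /** Streams the input through SHA-256 and returns the digest as lowercase hex. */
    static String sha256Hex(InputStream in) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] buffer = new byte[1 << 16];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                // Mask to 0..255 so negative bytes still print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }
}
```

For a dataset file, the stream would typically come from `Files.newInputStream(path)`. Hashing large base-vector files on every run has a cost of its own, so a real design might store the digest next to the dataset and recompute it only when the file size or modification time changes.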

I'll open an issue.

Contributor

@ashkrisk ashkrisk left a comment

I've been wanting this feature for a while. Left some suggestions.

…d index deletes. Index cached marked Experimental.
…ts. useSavedIndexIfExists is No by default. Added refineFinalGraph to construction parameters.
@tlwillke
Collaborator Author

tlwillke commented Feb 7, 2026

I have addressed all outstanding feedback and opened two related issues.

@tlwillke tlwillke self-assigned this Feb 7, 2026
@jshook jshook added this to the Enhanced Resiliency Initiative milestone Feb 7, 2026
Contributor

@jshook jshook left a comment

Thanks for taking the feedback and turning around the updates quickly.

@jshook jshook merged commit 7e493ee into main Feb 7, 2026
12 checks passed
@jshook jshook deleted the on-disk-index-cache branch February 7, 2026 01:22
jshook pushed a commit that referenced this pull request Feb 12, 2026
* Initial implementation of index cache for Bench / Grid.

* Initial implementation of index cache for Bench / Grid.

* Uncommented datasets that were not working prior to PR613.

* Improved Exception handling.  Added alpha to key signature.  Lazy cached index deletes.  Index cached marked Experimental.

* Integrated index caching into autoBenchYAML and runAllAndCollectResults. useSavedIndexIfExists is No by default.  Added refineFinalGraph to construction parameters.