On-disk index cache for the Grid benchmark harness #612
jshook left a comment:
Just some comments to think about.
MarkWolters left a comment:
I think the invalidation / versioning would be well handled by maintaining a properties file that gets updated as necessary with the current version from the pom file when a build is run, and that can then be read in by the Java process. Looks good though, should be a good time saver!
Thanks. I think including the JV revision in the key is a good idea, but I'm afraid it's insufficient since changes to main happen between releases as well. I think we're headed toward invalidation based on:
Essentially, we need something like version numbers that reflect logic / math changes to indexing, quantization, etc., and cryptographic hashing to monitor dataset changes. I'll open an issue.
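To make the dataset-hashing idea concrete, here is a rough sketch of fingerprinting a dataset file with SHA-256 so that a changed file would produce a different cache key. The class and method names (`DatasetFingerprintSketch`, `sha256Hex`) are hypothetical and not part of this PR; only standard JDK calls are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the PR: computes a content fingerprint for a dataset file.
class DatasetFingerprintSketch {

    // Streams the file through SHA-256 and returns the digest as lowercase hex.
    static String sha256Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Demo on a small temporary file standing in for a dataset.
        Path tmp = Files.createTempFile("dataset", ".bin");
        Files.write(tmp, "abc".getBytes(StandardCharsets.UTF_8));
        System.out.println(sha256Hex(tmp));
        Files.delete(tmp);
    }
}
```

A fingerprint like this could be folded into the cache key alongside the version numbers mentioned above, at the cost of reading each dataset once per run.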
ashkrisk left a comment:
I've been wanting this feature for a while. Left some suggestions.
I have addressed all outstanding feedback and opened two related issues.
jshook left a comment:
Thanks for taking the feedback and turning around the updates quickly.
* Initial implementation of index cache for Bench / Grid.
* Uncommented datasets that were not working prior to PR613.
* Improved Exception handling. Added alpha to key signature. Lazy cached index deletes. Index cache marked Experimental.
* Integrated index caching into autoBenchYAML and runAllAndCollectResults. useSavedIndexIfExists is No by default. Added refineFinalGraph to construction parameters.
This PR adds a deterministic on-disk index cache for the Grid benchmark harness and wires it in end-to-end so repeated runs can reuse previously built graph indexes to save time.
Key changes
OnDiskGraphIndexCache (flat directory cache, one file per index), keyed by a stable signature derived from:
Cache filenames are derived from the signature (sanitized for filesystem safety). This allows multiple cached indexes to coexist in a single flat cache directory without collisions.
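As an illustration of the signature-to-filename idea, here is a minimal sketch. It is not the actual OnDiskGraphIndexCache code: the class name, the parameter set (`dataset`, `m`, `efConstruction`, `alpha`, `refineFinalGraph`), the signature layout, and the `.idx` extension are all assumptions chosen for the example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical sketch, not the PR's implementation.
class CacheKeySketch {

    // Builds a deterministic signature from an assumed set of build parameters.
    // Locale.ROOT keeps the float formatting stable across machines.
    static String signature(String dataset, int m, int efConstruction,
                            float alpha, boolean refineFinalGraph) {
        return String.format(Locale.ROOT, "%s_M%d_ef%d_alpha%.2f_refine%b",
                dataset, m, efConstruction, alpha, refineFinalGraph);
    }

    // Replaces every character outside a conservative safe set with '_'.
    static String sanitize(String signature) {
        return signature.replaceAll("[^A-Za-z0-9._-]", "_");
    }

    // Resolves the cache file for a signature inside a flat cache directory.
    static Path cacheFile(Path cacheDir, String signature) {
        return cacheDir.resolve(sanitize(signature) + ".idx");
    }

    public static void main(String[] args) {
        String sig = signature("glove-100/angular", 32, 100, 1.2f, true);
        System.out.println(cacheFile(Paths.get("index_cache"), sig));
    }
}
```

Note that a character-replacement sanitizer like this one is not injective (for example, `a/b` and `a_b` map to the same name), so a real cache would need either a restricted input alphabet or a hash suffix to guarantee distinct filenames.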
Updated Grid.runOneGraph to treat caching per-index (per feature set):
graphN naming
buildOnDisk so it can write either:
graph0..graphN temp files (cache disabled), or
Behavior
Notes